A user asked for suggestions on what metrics to assess in a project to do better health checks on VMware UAGs using Horizon and UAG APIs. Ideas include alerting when a connection server exceeds a threshold, and when services are not in a running state, as well as low available memory and high CPU. These metrics would also be desirable to display in ControlUp’s Solve and Console. Discussions also included Dennis’ idea to bring the data to Edge DX and a feature request for UAG data.
Read the entire ‘Health Checks for VMware UAGs Using Horizon and UAG APIs’ thread below:
I am working on a project to do better health checks on the VMware UAGs using both Horizon API and UAG API data. In the thread I will add an XML sample of what I can get from the UAG API. While I can make up quite a few things to alert on, what would YOU like to be alerted on?
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<accessPointStatusAndStats>
`<sessionCount>1</sessionCount>`
`<authenticatedSessionCount>1</authenticatedSessionCount>`
`<authenticatedViewSessionCount>1</authenticatedViewSessionCount>`
`<openIncomingConnectionCount>4</openIncomingConnectionCount>`
`<highWaterMark>1</highWaterMark>`
`<timeStamp>1677839931803</timeStamp>`
`<date>Fri Mar 03 10:38:51 UTC 2023</date>`
`<overAllStatus>`
`<status>RUNNING</status>`
`</overAllStatus>`
`<authentication>`
`<authBrokerStatus>`
`<reason>Reachable</reason>`
`<status>RUNNING</status>`
`</authBrokerStatus>`
`<successLogins>0</successLogins>`
`<failedLogins>0</failedLogins>`
`</authentication>`
`<viewEdgeServiceStats>`
`<backendStatus>`
`<reason>Reachable</reason>`
`<status>RUNNING</status>`
`</backendStatus>`
`<edgeServiceSessionStats>`
`<identifier>VIEW</identifier>`
`<totalSessions>1</totalSessions>`
`<highWaterMarkOfSessions>1</highWaterMarkOfSessions>`
`<authenticatedSessions>1</authenticatedSessions>`
`<unauthenticatedSessions>0</unauthenticatedSessions>`
`<failedLoginAttempts>2</failedLoginAttempts>`
`<userCount>1</userCount>`
`</edgeServiceSessionStats>`
`<edgeServiceStatus>`
`<status>RUNNING</status>`
`</edgeServiceStatus>`
`<xmlapiUnrecognizedRequestsCount>0</xmlapiUnrecognizedRequestsCount>`
`<protocol name="pcoip">`
`<status>`
`<reason>Reachable</reason>`
`<status>RUNNING</status>`
`</status>`
`<sessions>0</sessions>`
`<maxSessions>0</maxSessions>`
`<unrecognizedRequestsCount>0</unrecognizedRequestsCount>`
`</protocol>`
`<protocol name="Tunnel,RDP">`
`<status>`
`<reason>Reachable</reason>`
`<status>RUNNING</status>`
`</status>`
`<sessions>0</sessions>`
`<maxSessions>0</maxSessions>`
`<unrecognizedRequestsCount>0</unrecognizedRequestsCount>`
`</protocol>`
`<protocol name="blast">`
`<status>`
`<reason>Reachable</reason>`
`<status>RUNNING</status>`
`</status>`
`<sessions>0</sessions>`
`<maxSessions>0</maxSessions>`
`<unrecognizedRequestsCount>0</unrecognizedRequestsCount>`
`</protocol>`
`<protocol name="utserver">`
`<status>`
`<reason>Reachable</reason>`
`<status>RUNNING</status>`
`</status>`
`<sessions>0</sessions>`
`<maxSessions>0</maxSessions>`
`<unrecognizedRequestsCount>0</unrecognizedRequestsCount>`
`</protocol>`
`</viewEdgeServiceStats>`
`<edgeServiceSessionStats>`
`<identifier>Total</identifier>`
`<totalSessions>1</totalSessions>`
`<highWaterMarkOfSessions>0</highWaterMarkOfSessions>`
`<authenticatedSessions>1</authenticatedSessions>`
`<unauthenticatedSessions>0</unauthenticatedSessions>`
`<failedLoginAttempts>2</failedLoginAttempts>`
`<userCount>1</userCount>`
`</edgeServiceSessionStats>`
`<applianceStats>`
`<cpuCores>2</cpuCores>`
`<totalCpuLoadPercent>0</totalCpuLoadPercent>`
`<totalMemoryMb>3944</totalMemoryMb>`
`<freeMemoryMb>2248</freeMemoryMb>`
`<usedDiskSpacePercentage>23.0</usedDiskSpacePercentage>`
`<cpuDetailedStats>`
`<idle>99.69</idle>`
`<ioWait>0.0</ioWait>`
`<irq>0.0</irq>`
`<nice>0.0</nice>`
`<softIrq>0.0</softIrq>`
`<steal>0.0</steal>`
`<system>0.0</system>`
`<user>0.3</user>`
`</cpuDetailedStats>`
`</applianceStats>`
`<uagVersion>22.09</uagVersion>`
`<uptimeInMins>18</uptimeInMins>`
</accessPointStatusAndStats>
Status right now: I connect to a single connection server, do a complete discovery of all pods in a Cloud Pod setup, get details for all UAGs that have been added to Horizon, and get both Horizon and UAG API data for all of them. I just need to build the logic for alerting.
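As a rough illustration of turning the XML above into flat metrics, here is a minimal Python sketch using only the standard library. It is not the script discussed in this thread: the function name `flatten_uag_stats`, the trimmed `SAMPLE_XML`, and the `UAG-uagVersion` key are assumptions made for the example, and the other key names are only loosely modeled on the `UAG-` names used later in the thread.

```python
import xml.etree.ElementTree as ET

# Trimmed excerpt of the sample response posted above (two protocols kept).
SAMPLE_XML = """
<accessPointStatusAndStats>
  <sessionCount>1</sessionCount>
  <openIncomingConnectionCount>4</openIncomingConnectionCount>
  <overAllStatus><status>RUNNING</status></overAllStatus>
  <viewEdgeServiceStats>
    <backendStatus><reason>Reachable</reason><status>RUNNING</status></backendStatus>
    <edgeServiceStatus><status>RUNNING</status></edgeServiceStatus>
    <protocol name="pcoip">
      <status><reason>Reachable</reason><status>RUNNING</status></status>
      <sessions>0</sessions>
    </protocol>
    <protocol name="Tunnel,RDP">
      <status><reason>Reachable</reason><status>RUNNING</status></status>
      <sessions>0</sessions>
    </protocol>
  </viewEdgeServiceStats>
  <applianceStats>
    <totalCpuLoadPercent>0</totalCpuLoadPercent>
    <totalMemoryMb>3944</totalMemoryMb>
    <freeMemoryMb>2248</freeMemoryMb>
    <usedDiskSpacePercentage>23.0</usedDiskSpacePercentage>
  </applianceStats>
  <uagVersion>22.09</uagVersion>
</accessPointStatusAndStats>
"""


def flatten_uag_stats(xml_text):
    """Flatten a UAG stats XML document into a dict of name -> value."""
    root = ET.fromstring(xml_text.strip())
    view = "viewEdgeServiceStats"
    stats = "applianceStats"
    metrics = {
        "UAG-SessionCount": int(root.findtext("sessionCount")),
        "UAG-openIncomingConnectionCount": int(root.findtext("openIncomingConnectionCount")),
        "UAG-OverallStatus": root.findtext("overAllStatus/status"),
        "UAG-BackendStatus": root.findtext(f"{view}/backendStatus/status"),
        "UAG-EdgeServiceStatus": root.findtext(f"{view}/edgeServiceStatus/status"),
        "UAG-totalCpuLoadPercent": float(root.findtext(f"{stats}/totalCpuLoadPercent")),
        "UAG-totalMemoryMb": int(root.findtext(f"{stats}/totalMemoryMb")),
        "UAG-freeMemoryMb": int(root.findtext(f"{stats}/freeMemoryMb")),
        "UAG-usedDiskSpacePercentage": float(root.findtext(f"{stats}/usedDiskSpacePercentage")),
        "UAG-uagVersion": root.findtext("uagVersion"),  # hypothetical key name
    }
    # One status and one session counter per <protocol name="..."> element.
    for proto in root.findall(f"{view}/protocol"):
        name = proto.get("name").replace(",", "-")  # "Tunnel,RDP" -> "Tunnel-RDP"
        metrics[f"UAG-{name}-Protocol-Status"] = proto.findtext("status/status")
        metrics[f"UAG-{name}-Protocol-Sessions"] = int(proto.findtext("sessions"))
    return metrics


if __name__ == "__main__":
    for key, value in sorted(flatten_uag_stats(SAMPLE_XML).items()):
        print(f"{key} : {value}")
```

Keeping everything in one flat dict makes it easy to merge with Horizon-side values under a different prefix.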
Number or percentage of problematic machines, and a way to reboot them.
That’s the Horizon side; this is for the UAGs only.
We already have a status for problematic machines and various SBAs to reboot or rebuild them.
Ok. I misread you saying “get both horizon and uag api data”. The current problematic status doesn’t meet our needs, and @member has been helping us with a new script that we are still working through.
Dennis probably stole some of my work otherwise he knows where to find me haha
Shots fired. 😉
Didn’t mean it like that, he knows that he can steal whatever code he needs. I wouldn’t have been blogging about the Horizon APIs for ages if it wasn’t for people actually using my code.
😉
Still building it out, but for the entries starting with HZN the data comes from the Horizon APIs, while the UAG ones are proper UAG API data.
Needs sorting obviously, but gathering the data first.
Looks like I have all usable metrics collected. What do we want to have alerts on (@member?)
UAG-CPU-Idle : 99.69
HZN-active_connection_count : 1
UAG-pcoip-Protocol-Sessions : 0
HZN-UAGName : pod2uag1.loft.lab
HZN-UAGStatus : OK
UAG-BackendStatus : RUNNING
PodName : Cluster-POD2CBR1
UAG-authenticatedSessionCount : 1
UAG-usedDiskSpacePercentage : 23.0
UAG-Blast-Protocol-Status : RUNNING
UAG-CPU-User : 0.3
HZN-GateWay-Version : 22.09
UAG-Tunnel-RDP-Protocol-Status : RUNNING
UAG-pcoip-Protocol-MaxSessions : 0
UAG-EdgeServiceStatus : RUNNING
HZN-Gateway-type : UAG
HZN-blast_connection_count : 0
UAG-openIncomingConnectionCount : 0
UAG-Blast-Protocol-MaxSessions : 0
UAG-Tunnel-RDP-Protocol-Sessions : 0
UAG-utserver-Protocol-Status : RUNNING
UAG-utserver-Protocol-Sessions : 0
UAG-utserver-Protocol-MaxSessions : 0
UAG-totalMemoryMb : 3944
UAG-OverallStatus : RUNNING
UAG-CPUCores : 2
UAG-Tunnel-RDP-Protocol-MaxSessions : 0
UAG-SessionCount : 1
UAG-CPU-System : 0.0
UAG-authenticatedViewSessionCount : 1
UAG-Blast-Protocol-Sessions : 0
UAG-Pcoip-Protocol-Status : RUNNING
UAG-freeMemoryMb : 2169
UAG-totalCpuLoadPercent : 0
HZN-GateWay-Address : 10.101.0.43
HZN-pcoip_connection_count : 0
The biggest issue we see right now is that when the connections on the UAG side exceed 2048, it rejects all new ones. We would like to be able to alert if the count exceeds a threshold (maybe 1800 to start?).
Other alerts:
Any of the services not in a Running State
Low available memory
High CPU
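The alert ideas above (connection threshold, services not running, low memory, high CPU) could be expressed roughly as follows. This is only a sketch: the function name `evaluate_uag` and the memory and CPU thresholds are made up for illustration, and treating `UAG-SessionCount` as the counter behind the 2048 limit is an assumption, since the thread does not say which field it maps to. Only the 1800 starting threshold comes from the discussion.

```python
# Values copied from the metric dump earlier in the thread (subset).
SAMPLE_METRICS = {
    "HZN-UAGName": "pod2uag1.loft.lab",
    "UAG-SessionCount": 1,
    "UAG-OverallStatus": "RUNNING",
    "UAG-BackendStatus": "RUNNING",
    "UAG-EdgeServiceStatus": "RUNNING",
    "UAG-Blast-Protocol-Status": "RUNNING",
    "UAG-Pcoip-Protocol-Status": "RUNNING",
    "UAG-totalMemoryMb": 3944,
    "UAG-freeMemoryMb": 2169,
    "UAG-totalCpuLoadPercent": 0,
}


def evaluate_uag(metrics, max_sessions=1800, min_free_mem_pct=10.0, max_cpu_pct=90.0):
    """Return a list of human-readable alert strings for one UAG.

    max_sessions=1800 is the starting threshold suggested in the thread;
    min_free_mem_pct and max_cpu_pct are hypothetical defaults.
    """
    alerts = []

    # Assumption: UAG-SessionCount is the counter that hits the 2048 limit.
    sessions = metrics["UAG-SessionCount"]
    if sessions > max_sessions:
        alerts.append(f"Session count {sessions} exceeds {max_sessions}")

    # Any UAG-side status field that is not RUNNING.
    for key, value in metrics.items():
        if key.startswith("UAG-") and key.endswith("Status") and value != "RUNNING":
            alerts.append(f"{key} is {value}")

    # Low available memory, as a percentage of total.
    free_pct = 100.0 * metrics["UAG-freeMemoryMb"] / metrics["UAG-totalMemoryMb"]
    if free_pct < min_free_mem_pct:
        alerts.append(f"Free memory {free_pct:.1f}% is below {min_free_mem_pct}%")

    # High CPU load.
    cpu = metrics["UAG-totalCpuLoadPercent"]
    if cpu > max_cpu_pct:
        alerts.append(f"CPU load {cpu}% exceeds {max_cpu_pct}%")

    return alerts


if __name__ == "__main__":
    print(evaluate_uag(SAMPLE_METRICS))
```

Returning a list of strings rather than raising on the first problem keeps all findings for one UAG together, which suits a single combined alert.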
Wishlist – Display in Solve:
Version of UAG
Pod Name
CPU, Memory and disk Utilization
Display in Console: number of sessions for each type (Blast, PCoIP, RDP)
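For the per-protocol session numbers, a small aggregation over already-collected metrics might look like the sketch below. It assumes each UAG's metrics are held in a dict with keys of the form `UAG-<name>-Protocol-Sessions`, as in the dump earlier in the thread; the function name `sessions_by_protocol` is hypothetical.

```python
from collections import Counter


def sessions_by_protocol(uag_metrics_list):
    """Sum the per-protocol session counters across several UAGs."""
    prefix, suffix = "UAG-", "-Protocol-Sessions"
    totals = Counter()
    for metrics in uag_metrics_list:
        for key, value in metrics.items():
            if key.startswith(prefix) and key.endswith(suffix):
                # "UAG-Blast-Protocol-Sessions" -> "blast"
                protocol = key[len(prefix):-len(suffix)].lower()
                totals[protocol] += int(value)
    return dict(totals)


if __name__ == "__main__":
    # Illustrative numbers only, not real measurements.
    uag1 = {"UAG-Blast-Protocol-Sessions": 3, "UAG-pcoip-Protocol-Sessions": 1}
    uag2 = {"UAG-Blast-Protocol-Sessions": 2, "UAG-Tunnel-RDP-Protocol-Sessions": 4}
    print(sessions_by_protocol([uag1, uag2]))
```

Lower-casing the protocol name avoids counting `Pcoip` and `pcoip` as two different types, since the dump uses both spellings.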
Obviously this won’t be in the Console/Solve (yet), but from an array we can easily generate other dashboards (I already heard Dennis’ brain spinning to bring this data to Edge DX).
I realize it will take a bit to get to Solve… but I am going to start asking early so it gets a date on the roadmap. Thank you for digging into this… I am optimistic!
@member see @member feature request for UAG data 😄 I have the api call for you if you want it (yes only 1). (/cc @member)
Sure!
Making some progress here. In the final script I am aiming to make this one event alert_type that has the details for all UAGs in the alert, if possible (using some useless numbers to check my coding 🙂).
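One way to picture a single event that carries the details for every UAG is sketched below. It is not the actual script: the function name `build_uag_event`, the payload field names other than `alert_type`, and the `UAG-HealthCheck` value are all hypothetical, and how the event would be handed to ControlUp is left out entirely.

```python
import json


def build_uag_event(findings, alert_type="UAG-HealthCheck"):
    """Combine per-UAG findings into one event payload.

    findings maps a UAG name to its list of alert strings (possibly empty).
    Returns None when no UAG has anything to report, so no event is raised.
    """
    affected = {name: alerts for name, alerts in findings.items() if alerts}
    if not affected:
        return None
    return {
        "alert_type": alert_type,          # hypothetical value
        "uag_count": len(findings),        # UAGs checked
        "affected_count": len(affected),   # UAGs with at least one finding
        "details": affected,
    }


if __name__ == "__main__":
    # Made-up findings, just to exercise the code.
    findings = {
        "pod2uag1.loft.lab": ["Session count 1900 exceeds 1800"],
        "pod2uag2.loft.lab": [],
    }
    print(json.dumps(build_uag_event(findings), indent=2))
```

Grouping by UAG name inside one payload means a single trigger can report on the whole pod federation instead of firing once per appliance.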