Monitors are the central component of Managed Availability. They define what data to collect, what constitutes the health of a feature, and what actions to take to restore a feature to good health. Because there are several different aspects to Monitors, it can be hard to figure out how a specific Monitor works.
All of the properties discussed in this article can be found in the Monitor’s definition event in the Microsoft.Exchange.ActiveMonitoring\MonitorDefinition crimson channel of the Windows event log.
See this article for how these definitions can be easily collected.
What Data is Collected?
Nearly all Monitors collect one of three types of data: direct notifications, Probe results, or performance counters. Monitors that change states based on a direct notification only get data from the notification.
Monitors based on Probe results become unhealthy when some Probes fail. There are two main types of these Monitors: those based on a number of consecutive Probe failures, and those based on a number of Probes failing over an interval.
Monitors based on performance counters simply determine whether a counter has been higher or lower than a built-in threshold for the required time.
The TypeName property of a Monitor definition indicates what data it is collecting and the kind of threshold that must be reached before it is considered Unhealthy. Here are the most common types and what they use:
OverallPercentSuccessMonitor | Looks at the results of all probes matching the SampleMask property and calculates the aggregate percent success over the past MonitoringIntervalSeconds. Becomes Unhealthy if the calculated percent success is less than the MonitoringThreshold. |
OverallConsecutiveProbeFailuresMonitor | Looks at the last X probe results as configured in MonitoringThreshold that match the SampleMask. Becomes Unhealthy if all of those results are failures. |
OverallXFailuresMonitor | Looks at the results of all probes matching the SampleMask property over the past MonitoringIntervalSeconds. Becomes Unhealthy if at least X results as configured in MonitoringThreshold are failures. |
OverallConsecutiveSampleValueAboveThresholdMonitor | Looks at the last X performance counter results as configured in SecondaryMonitoringThreshold matching SampleMask over the past MonitoringIntervalSeconds. Becomes Unhealthy if at least X performance counters are above the threshold configured in MonitoringThreshold. |
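The threshold rules in the table above can be sketched in a few lines of Python. This is only an illustration of the logic as described here, not the Managed Availability implementation. The function names, the representation of Probe results as booleans (True for success), and the handling of empty result sets are assumptions made for the example. The inputs are assumed to be already filtered by SampleMask and, where applicable, by MonitoringIntervalSeconds.

```python
from typing import Sequence


def overall_percent_success_unhealthy(results: Sequence[bool],
                                      monitoring_threshold: float) -> bool:
    """Unhealthy if aggregate percent success is below MonitoringThreshold."""
    if not results:
        return False  # assumption: no samples means no state change
    percent_success = 100.0 * sum(results) / len(results)
    return percent_success < monitoring_threshold


def overall_consecutive_probe_failures_unhealthy(results: Sequence[bool],
                                                 monitoring_threshold: int) -> bool:
    """Unhealthy if the last MonitoringThreshold results are all failures."""
    n = monitoring_threshold
    last = list(results)[-n:] if n > 0 else []
    return n > 0 and len(last) == n and not any(last)


def overall_x_failures_unhealthy(results: Sequence[bool],
                                 monitoring_threshold: int) -> bool:
    """Unhealthy if at least MonitoringThreshold results are failures."""
    failures = sum(1 for ok in results if not ok)
    return failures >= monitoring_threshold


def consecutive_sample_value_above_threshold_unhealthy(
        samples: Sequence[float],
        monitoring_threshold: float,
        secondary_monitoring_threshold: int) -> bool:
    """Unhealthy if the last SecondaryMonitoringThreshold counter samples
    are all above MonitoringThreshold."""
    n = secondary_monitoring_threshold
    last = list(samples)[-n:] if n > 0 else []
    return n > 0 and len(last) == n and all(v > monitoring_threshold for v in last)
```

For example, with results `[True, False, False, False]` and a MonitoringThreshold of 3, the consecutive-failures check reports Unhealthy, because the three most recent results are all failures.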
Healthy or Not
One more thing must happen before the Monitor will become Unhealthy. The code for individual Monitors that checks the threshold only runs every X seconds, where X is specified by the RecurrenceIntervalSeconds property. The threshold is checked only when the Monitor runs.
As soon as the Monitor runs while the threshold is met, the Monitor becomes Unhealthy. Get-ServerHealth will report that the Monitor is Degraded for the first 60 seconds, but the functional behavior of the Monitor does not have a concept of being Degraded; it is either Healthy or Unhealthy.
The Health Set that a Monitor is part of is defined by the Monitor’s ServiceName property. If any Monitor in a Health Set is Unhealthy, the entire Health Set will be marked as Unhealthy as viewed from Get-HealthReport or via System Center Operations Manager (SCOM).
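The rollup from Monitors to Health Sets can be pictured as a simple grouping by ServiceName. The following is a minimal sketch under that reading. The function name, the tuple layout, and the Monitor and Health Set names are hypothetical and chosen only for illustration.

```python
from typing import Dict, Iterable, Tuple


def health_set_states(monitors: Iterable[Tuple[str, str, bool]]) -> Dict[str, str]:
    """Roll Monitor states up to Health Sets.

    Each item is (monitor_name, service_name, is_healthy), where
    service_name stands in for the Monitor's ServiceName property.
    A Health Set is Unhealthy if any of its Monitors is Unhealthy.
    """
    states: Dict[str, str] = {}
    for _monitor_name, service_name, is_healthy in monitors:
        if not is_healthy:
            states[service_name] = "Unhealthy"
        else:
            # Only record Healthy if nothing has marked the set Unhealthy yet.
            states.setdefault(service_name, "Healthy")
    return states


# Hypothetical Monitors; the names are placeholders, not real definitions.
example = [
    ("MonitorA", "HealthSetA", True),
    ("MonitorB", "HealthSetA", False),
    ("MonitorC", "HealthSetB", True),
]
rollup = health_set_states(example)
```

Here `rollup` marks HealthSetA as Unhealthy because one of its two Monitors is Unhealthy, while HealthSetB stays Healthy.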
Responder Timeline
The StateTransitionXML property of a Monitor definition indicates which Responders execute and when, as each Responder is tied to a transition state of the Monitor. Let’s consider a Monitor that has this value for its StateTransitionXML property:
<StateTransitions>
<Transition ToState="Unhealthy" TimeoutInSeconds="0" />
<Transition ToState="Unhealthy1" TimeoutInSeconds="30" />
<Transition ToState="Unhealthy2" TimeoutInSeconds="330" />
<Transition ToState="Unrecoverable" TimeoutInSeconds="1500" />
</StateTransitions>
As soon as the Monitor runs while its defined threshold is met, it transitions to the “Unhealthy” state, and any Responders set to execute in that transition state now execute. These transition states are used only internally: although one of them shares the name “Unhealthy”, from an external perspective the Monitor can only be Healthy or Unhealthy. After 30 more seconds, any Responders set to execute when the Monitor is in the “Unhealthy1” state execute. The next Responders execute 300 seconds later (330 seconds in total), when the Monitor moves to the “Unhealthy2” state. Finally, 1500 seconds after first becoming Unhealthy, the Monitor moves to the “Unrecoverable” state.
The transition state each Responder is tied to is set by the TargetHealthState property on the Responder definition, which is an integer. Here are the transition states that the integers indicate:
0 | None |
1 | Healthy |
2 | Degraded |
3 | Unhealthy |
4 | Unrecoverable |
5 | Degraded1 |
6 | Degraded2 |
7 | Unhealthy1 |
8 | Unhealthy2 |
9 | Unrecoverable1 |
10 | Unrecoverable2 |
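To make the timeline concrete, the sketch below parses the example StateTransitionXML and works out which transition state a Monitor has reached, and which Responders would have executed, after its threshold has been met for a given number of seconds. It is an illustration only. The helper names and the Responder names are hypothetical, and it assumes that each TimeoutInSeconds is measured from the moment the Monitor first becomes Unhealthy, as the 30 and 330 second example suggests.

```python
import xml.etree.ElementTree as ET

# TargetHealthState integers, as listed in the table above.
TARGET_HEALTH_STATES = {
    0: "None", 1: "Healthy", 2: "Degraded", 3: "Unhealthy",
    4: "Unrecoverable", 5: "Degraded1", 6: "Degraded2",
    7: "Unhealthy1", 8: "Unhealthy2",
    9: "Unrecoverable1", 10: "Unrecoverable2",
}

STATE_TRANSITION_XML = """
<StateTransitions>
  <Transition ToState="Unhealthy" TimeoutInSeconds="0" />
  <Transition ToState="Unhealthy1" TimeoutInSeconds="30" />
  <Transition ToState="Unhealthy2" TimeoutInSeconds="330" />
  <Transition ToState="Unrecoverable" TimeoutInSeconds="1500" />
</StateTransitions>
"""


def transition_timeline(xml_text):
    """Return [(state_name, timeout_seconds), ...] in document order."""
    root = ET.fromstring(xml_text.strip())
    return [(t.get("ToState"), int(t.get("TimeoutInSeconds")))
            for t in root.findall("Transition")]


def state_at(timeline, seconds_unhealthy):
    """Latest transition state reached after the given number of seconds."""
    current = None
    for state, timeout in sorted(timeline, key=lambda pair: pair[1]):
        if seconds_unhealthy >= timeout:
            current = state
    return current


def responders_due(responders, timeline, seconds_unhealthy):
    """Names of Responders whose TargetHealthState has been reached.

    responders: list of (responder_name, target_health_state_integer).
    """
    reached = {state for state, timeout in timeline
               if seconds_unhealthy >= timeout}
    return [name for name, target in responders
            if TARGET_HEALTH_STATES[target] in reached]


timeline = transition_timeline(STATE_TRANSITION_XML)
# Hypothetical Responder chain tied to Unhealthy (3), Unhealthy1 (7), Unhealthy2 (8).
chain = [("ResponderA", 3), ("ResponderB", 7), ("ResponderC", 8)]
```

With this chain, `responders_due(chain, timeline, 45)` returns the first two Responders, since 45 seconds is past the “Unhealthy1” timeout but short of the 330 seconds needed for “Unhealthy2”.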
We call the set of Responders tied to a Monitor’s transition states a Responder chain. As a Monitor’s threshold continues to be met, stronger and stronger Responders execute until the Monitor determines it is Healthy or an administrator is notified via event log escalation. If the code for this Monitor runs while it is in the “Unhealthy1” state and the threshold is no longer met, the Monitor will immediately transition to None. No more Responders will execute. Get-ServerHealth would again report this Monitor as Healthy.
Program Manager, Exchange Server