In addition to Workload, Health measures
the anomalies found in the environment. vCenter Operations tries to
understand what is normal and what is abnormal behavior for an object in your
vSphere environment by using dynamic as well as static thresholds (see Figure 14).
Dynamic thresholds are learned over time, so vCOPS knows whether it is
normal for a particular VM to be busier on a Friday, for example, or
whether heavier utilization normally occurs during a backup window. vCOPS
also detects trends in longer cycles, such as monthly and quarterly. This
lets you apply some intelligence to your understanding of what is
going on; for example, if you see high workload activity but a low
number of anomalies, you know that this high workload is typical and
can be expected.
Figure 14. Anomalies.
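To make the idea of a dynamic threshold concrete, here is a minimal Python sketch. It is not how vCenter Operations computes its thresholds; it simply assumes a per-weekday band of the mean plus or minus k standard deviations. The function names, the value of k, and the sample data are all hypothetical.

```python
from collections import defaultdict
from statistics import mean, stdev

def learn_bands(samples, k=2.0):
    """Learn a (low, high) band per weekday from (weekday, value) samples.

    Assumption: "normal" means within k sample standard deviations of the
    mean observed for that weekday. This is an illustration only.
    """
    by_day = defaultdict(list)
    for weekday, value in samples:
        by_day[weekday].append(value)
    bands = {}
    for weekday, values in by_day.items():
        m = mean(values)
        s = stdev(values) if len(values) > 1 else 0.0
        bands[weekday] = (m - k * s, m + k * s)
    return bands

def is_anomaly(bands, weekday, value):
    """Return True when the value falls outside the learned band."""
    low, high = bands[weekday]
    return value < low or value > high

# Hypothetical CPU utilization history (percent): Fridays are normally busy.
history = [("Fri", 80), ("Fri", 85), ("Fri", 90),
           ("Mon", 30), ("Mon", 35), ("Mon", 40)]
bands = learn_bands(history)

friday_is_anomaly = is_anomaly(bands, "Fri", 88)  # high workload, but typical
monday_is_anomaly = is_anomaly(bands, "Mon", 88)  # same workload, unusual day
```

With this sample data, 88 percent utilization is within the learned band on a Friday but outside it on a Monday, which mirrors the point that high workload with few anomalies is expected behavior.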
The last minor badge under Health is Faults (see Figure 15).
Faults are pulled from vCenter Alerts and tell you about component
failures. Examples of faults include a loss of NIC or HBA redundancy
or an HA failure. A high number of faults indicates an
unstable environment.
Figure 15. Faults.
The combination of
Faults, Anomalies, and Workload creates an overall picture of Health.
From the dashboard, you can drill down to see an additional level of
detail on these minor badge categories.
The second major badge is Risk. Risk indicates
how much time remains given the current capacity of the
environment and how much stress the environment is experiencing. It
is made up of the minor badges Time Remaining, Capacity Remaining, and Stress.
Time Remaining answers the question of how
much time is available before you need to add capacity, such
as host servers, storage, and networking (see Figure 16).
It measures CPU, Memory, Disk, and Networking IOPS. Time Remaining is
like an early warning system for the four major resources.
Figure 16. Time Remaining.
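One way to picture a time-remaining estimate is as a trend extrapolation. The sketch below is an assumption for illustration, not the product's forecasting method: it fits a straight line to daily usage with ordinary least squares and reports how many days pass before the line reaches capacity. The function name and the usage figures are hypothetical.

```python
def days_remaining(daily_usage, capacity):
    """Estimate days until usage reaches capacity using a linear trend.

    Returns None when there are too few samples or usage is not growing.
    Assumption: growth is linear, which real workloads often are not.
    """
    n = len(daily_usage)
    if n < 2:
        return None
    xs = range(n)
    x_mean = sum(xs) / n
    y_mean = sum(daily_usage) / n
    sxx = sum((x - x_mean) ** 2 for x in xs)
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, daily_usage))
    slope = sxy / sxx
    if slope <= 0:
        return None  # flat or shrinking usage never reaches capacity
    intercept = y_mean - slope * x_mean
    day_full = (capacity - intercept) / slope
    # Count from the most recent sample (day n - 1), never below zero.
    return max(0.0, day_full - (n - 1))

# Hypothetical daily utilization, as a percentage of total capacity.
usage = {
    "cpu": [50, 52, 54, 56, 58],
    "memory": [70, 71, 72, 73, 74],
}
estimates = {name: days_remaining(series, 100) for name, series in usage.items()}

# The resource that runs out first determines the overall time remaining.
first_to_run_out = min(
    (name for name in estimates if estimates[name] is not None),
    key=lambda name: estimates[name],
)
```

In this sample, CPU grows faster than memory, so CPU is the resource that gives the earliest warning even though memory is currently more heavily used.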
Capacity Remaining takes the average size of a VM and determines how many more can be run before capacity runs out (see Figure 17).
This badge breaks out this estimate against CPU, Memory, Disk, and
Networking IOPS so that you can see which resources are limiting in
your environment.
Figure 17. Capacity Remaining.
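The arithmetic behind this kind of estimate can be sketched in a few lines of Python. This is a simplified illustration under assumed inputs, not the product's calculation: divide the free amount of each resource by the average VM's demand for it, and the smallest result identifies the limiting resource. All names and numbers are hypothetical.

```python
def vms_remaining(free, avg_vm):
    """Return (count, limiting_resource, per_resource_counts).

    free and avg_vm are dicts keyed by resource name. For each resource,
    the number of additional average-sized VMs that fit is the free amount
    divided by the average VM's demand, rounded down.
    """
    per_resource = {name: int(free[name] // avg_vm[name]) for name in avg_vm}
    limiting = min(per_resource, key=per_resource.get)
    return per_resource[limiting], limiting, per_resource

# Hypothetical free capacity in a cluster and hypothetical average VM size.
free = {"cpu_ghz": 40.0, "memory_gb": 256.0, "disk_gb": 4000.0}
avg_vm = {"cpu_ghz": 1.0, "memory_gb": 8.0, "disk_gb": 60.0}

count, limiting, breakdown = vms_remaining(free, avg_vm)
```

Here memory allows the fewest additional VMs (32), so memory is the limiting resource even though CPU and disk have room for more.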
Stress measures how much of the time an
object has historically run under duress and expresses it as a
percentage (see Figure 18).
If an object in your vSphere environment regularly runs over its
threshold, that is likely an indication of
stress.
Figure 18. Stress.
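As a rough illustration of the idea, stress can be thought of as the share of historical samples in which an object ran above a threshold. The sketch below assumes evenly spaced samples and a single fixed threshold; the function name, the threshold, and the data are hypothetical, and the product's actual scoring may differ.

```python
def stress_percent(samples, threshold):
    """Percentage of samples that exceed the threshold.

    Assumption: samples are evenly spaced in time, so the share of samples
    over the threshold approximates the share of time spent under duress.
    """
    if not samples:
        return 0.0
    over = sum(1 for value in samples if value > threshold)
    return 100.0 * over / len(samples)

# Hypothetical hourly CPU utilization for a host, in percent.
cpu_history = [60, 75, 92, 95, 88, 91, 70, 93]
stress = stress_percent(cpu_history, threshold=90)
```

In this sample, four of the eight readings exceed 90 percent, so the host has spent half of the observed period over its threshold.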
The combination of Time Remaining, Capacity Remaining, and Stress
identifies the level of risk in the environment. A lower score
represents an environment running with less risk.
The third major badge is Efficiency. Efficiency combines Reclaimable
Waste, which identifies resources that could be utilized more efficiently, with
Density, a measure of optimal consolidation. Reclaimable Waste tells you whether you are
overprovisioning resources (see Figure 19). It is broken down by CPU, Disk, and Memory.
Figure 19. Reclaimable Waste.
Density provides a picture of what your current consolidation ratio looks like and what is optimal (see Figure 20).
It is measured by the VM-to-host ratio as well as the ratios of
virtual CPUs to physical CPUs and virtual memory to physical memory.
Figure 20. Density.
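The three ratios mentioned above are simple to compute by hand. The following Python sketch shows the arithmetic for a small, hypothetical two-host cluster; the Host type and its field names are assumptions made for this example and do not correspond to any vCenter Operations API.

```python
from dataclasses import dataclass

@dataclass
class Host:
    """Hypothetical inventory record for one host."""
    vms: int          # number of VMs on the host
    vcpus: int        # virtual CPUs allocated to those VMs
    pcpus: int        # physical CPU cores in the host
    vmem_gb: float    # virtual memory allocated to those VMs
    pmem_gb: float    # physical memory installed in the host

def density(hosts):
    """Return the consolidation ratios across a list of hosts."""
    return {
        "vms_per_host": sum(h.vms for h in hosts) / len(hosts),
        "vcpu_to_pcpu": sum(h.vcpus for h in hosts) / sum(h.pcpus for h in hosts),
        "vmem_to_pmem": sum(h.vmem_gb for h in hosts) / sum(h.pmem_gb for h in hosts),
    }

cluster = [
    Host(vms=20, vcpus=48, pcpus=16, vmem_gb=192, pmem_gb=128),
    Host(vms=10, vcpus=24, pcpus=16, vmem_gb=64, pmem_gb=128),
]
ratios = density(cluster)
```

For this sample cluster the result is 15 VMs per host, 2.25 virtual CPUs per physical CPU, and a one-to-one ratio of virtual to physical memory. What counts as optimal depends on the workload, so these numbers are only the current-state half of the picture.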
You can drill down from
any of the dashboard views to understand the details. You can also see the
health and performance of objects in your vSphere environment relative
to other objects. One good way to do this is to use the heat maps in the
Analysis section. Heat maps color-code objects to
allow you to focus quickly on problem areas. In virtual desktop
environments, storage IO contention can be difficult to isolate. Figure 21
shows a heat map of datastore contention in which datastores are sized by IO usage and grouped by datacenter.
You can see that most of the datastores are performing well, but a few
are underperforming relative to the others.
Figure 21. Datastore contention sized by IO usage grouped by datacenter.