Let's all take a minute to talk about co-stop

Scheduling CPU time in a virtual environment is tricky business. Because the guest operating system expects all of its presented CPUs to be responsive, ESXi essentially needs to schedule all of a VM's allocated vCPUs at once. While this scheduling has improved substantially over ESXi's development lifecycle, the essential point has always remained: if the resources aren't available for ALL allocated vCPUs, then NONE will run. The VM will be in a state of "co-stop."

A little more detail on this after the break.


Ok, let's try to understand what's happening from the guest VM's point of view. In any installation that has more than one CPU present, be it physical or virtual, the operating system is constantly looking at running processes and threads to see if they are making progress. These 'sibling CPUs' need to make progress at the same rate, or extremely close to it. If one sibling makes progress and another one doesn't, you get what is called CPU skew. Too much skew, and the OS becomes unstable or crashes.

ESXi (or any hypervisor, really) has the responsibility of keeping this from happening by maintaining the synchronicity of the vCPUs. This is known as "co-scheduling." Co-scheduling goes on without issue so long as all vCPUs are scheduled closely enough together. However, once one vCPU has gone too long without being scheduled, ESXi will not schedule ANY of the vCPUs. We are now in a state of co-stop.
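To make that concrete, here's a toy Python sketch of the strict co-scheduling decision. The skew threshold, the millisecond tick, and the reset-on-run accounting are all invented for illustration; the real ESXi scheduler tracks skew far more precisely than this.

    # Toy model of strict co-scheduling (illustration only, not ESXi code).
    # A vCPU that sits unscheduled while its siblings run accumulates "skew."
    SKEW_LIMIT_MS = 3  # hypothetical threshold before co-stop triggers

    class VCpu:
        def __init__(self, name):
            self.name = name
            self.skew_ms = 0

    def tick(vcpus, scheduled, tick_ms=1):
        """One scheduler tick: scheduled vCPUs catch up, the rest fall behind."""
        for v in vcpus:
            if v in scheduled:
                v.skew_ms = 0           # simplification: running means caught up
            else:
                v.skew_ms += tick_ms    # unscheduled siblings drift further back

    def must_costop(vcpus):
        """The strict rule: if ANY vCPU is too far behind, stop ALL of them."""
        return any(v.skew_ms > SKEW_LIMIT_MS for v in vcpus)

    vcpus = [VCpu(f"vcpu{i}") for i in range(4)]
    for _ in range(5):                  # contended host: only two vCPUs get cores
        tick(vcpus, scheduled=vcpus[:2])
    print(must_costop(vcpus))           # True -> the whole VM is co-stopped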

ESXi has used what is known as relaxed co-scheduling since version 3.0. This allows idle vCPUs to be treated as if they were running, so that the entire VM doesn't have to stop. From VMware's own blog on the topic:

" ...for co-scheduling decisions, idle vCPUs do not accumulate skew and are treated as if they were running. This optimization ensures that idle guest vCPUs don’t waste physical processor resources, which can instead be allocated to other VMs. "

Still, the point remains that it is essential to size VMs properly to keep co-stop at bay. Even with relaxed co-scheduling, all the vCPUs in a guest must be scheduled every so often, and a VM that is leaning on relaxed co-scheduling will still experience performance degradation.
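Continuing the toy model above, the relaxed rule changes only one thing: an idle vCPU is treated the same as a running one, so it never accumulates skew. (Again, this is an illustration of the idea, not VMware's actual accounting.)

    # Relaxed variant of the toy model: idle vCPUs are treated as running.
    def tick_relaxed(vcpus, scheduled, idle, tick_ms=1):
        for v in vcpus:
            if v in scheduled or v in idle:
                v.skew_ms = 0           # idle counts as caught up, too
            else:
                v.skew_ms += tick_ms    # only busy-but-unscheduled vCPUs drift

Under this rule, a 4-vCPU VM whose guest workload only keeps two vCPUs busy behaves much like a 2-vCPU VM for scheduling purposes.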

It is in the VMware administrator's best interest to make the scheduler's job easy. There are two rules of thumb to follow when determining the right size for a guest machine.

1. Use as few vCPUs as possible.

The smaller the number of vCPUs to schedule simultaneously, the easier it will be to schedule them. VMware's CPU Hot-Plug feature allows you to easily add more vCPUs later if the need for them becomes apparent (see the sketch below). Let the workload dictate the allocation of resources.
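If you'd rather script that than click through the UI, here is a hedged pyVmomi sketch of enabling CPU Hot Add. The vCenter address, credentials, and VM name are all placeholders, and the setting can only be changed while the VM is powered off.

    # Hedged sketch: enable CPU Hot Add on a VM with pyVmomi.
    # Hostname, credentials, and VM name below are placeholders.
    import ssl
    from pyVim.connect import SmartConnect, Disconnect
    from pyVmomi import vim

    ctx = ssl._create_unverified_context()   # lab shortcut; validate certs in production
    si = SmartConnect(host="vcenter.example.com",
                      user="administrator@vsphere.local",
                      pwd="changeme", sslContext=ctx)
    try:
        view = si.content.viewManager.CreateContainerView(
            si.content.rootFolder, [vim.VirtualMachine], True)
        vm = next(v for v in view.view if v.name == "app-server-01")
        # cpuHotAddEnabled can only be toggled while the VM is powered off.
        vm.ReconfigVM_Task(vim.vm.ConfigSpec(cpuHotAddEnabled=True))
    finally:
        Disconnect(si)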

2. Take advantage of NUMA-nodes.

NUMA (Non-Uniform Memory Access) nodes are a hardware design that pairs a group of CPU cores with its own bank of local memory, which those cores can access much faster than memory hanging off the general system bus. Sizing your VMs to fit within a single NUMA node for both CPU and RAM can give a significant boost to a VM's performance. Before you get to a VM of that size (and hopefully you won't, if you're following rule #1 above), though, you can get more performance by paying attention to the vCPU counts you use for all VMs in a virtual environment.

The best counts to use when allocating vCPUs are counts that divide evenly into the total number of physical cores. So, for example, on a six-core system (assuming a single CPU per NUMA node), the best vCPU counts are 1, 2, 3, and 6. Four vCPUs is suboptimal, especially if other VMs in the environment follow the same sizing. The goal is to be able to schedule as many VMs simultaneously as possible. If this environment has many 4-vCPU VMs, the math won't add up and you'll start to see a lot of co-stop, as the quick calculation below shows.
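A quick bit of Python arithmetic makes the point; the six-core host is the only assumption.

    # Which vCPU counts tile a 6-core host with no remainder?
    PHYSICAL_CORES = 6

    divisors = [n for n in range(1, PHYSICAL_CORES + 1) if PHYSICAL_CORES % n == 0]
    print(divisors)  # [1, 2, 3, 6]

    # 4-vCPU VMs strand cores: 6 - 4 = 2 cores idle, too few for a second 4-vCPU VM.
    for size in (2, 3, 4):
        fits = PHYSICAL_CORES // size
        idle = PHYSICAL_CORES - fits * size
        print(f"{size}-vCPU VMs: {fits} can run at once, {idle} core(s) sit idle")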
