This article explains the internal process behind the Hyper-V Virtual Machine Resource DLL and the functions used to interact with cluster components to improve the failover process for virtual machines.
The article concentrates on the Hyper-V Resource DLL. It does not show how to cluster Virtual Machines or how to configure Quick Migration or Live Migration in Hyper-V. Instead, it focuses on the Hyper-V Resource DLL for Virtual Machines and the failover process for Virtual Machines running on Hyper-V Server.
For instructions on clustering Virtual Machines, I recommend reading the article written by fellow writer Jaap Wesselius, published at https://www.simple-talk.com/sysadmin/virtualization/hyper-v-r2-live-migration/.
Terms used throughout the article
Before we move ahead, let us define some important terms that we will be using.
- Cluster Service
- The Cluster Service is the main component of the Clustering Software which handles the communication between Resource Host Subsystem and its managers. All the clustering managers run under the Cluster Service.
- Resource Host Subsystem
- The Resource Host Subsystem (sometimes referred to as RHS) is part of the Clustering Software. It runs under the Cluster Service (Clussvc.exe) and handles the communication between the Resource DLL and the Clustering Software.
- Resource DLL
- The Resource DLL ships with cluster-aware applications. The functions executed by the Clustering Software are supported by the Resource DLL. The main function of the Resource DLL is to report the status of the application resources to the Clustering Software and execute the functions from its library as and when needed.
- Cluster Configuration Database
- The Cluster Configuration Database is a registry hive that contains the state of the cluster. It is located at HKLM\Cluster in the registry.
- Resources
- A resource is an entity that can provide a service to a client and can be taken offline and brought online by the Clustering Software. A resource must have its associated Resource DLL so that the Resource Host Subsystem can communicate with the resources using this DLL. The Virtual Machines running on Hyper-V can be configured as a Resource in the Cluster.
Windows Clustering
Microsoft introduced its first version of clustering software in Windows NT 4.0 Enterprise Edition. Microsoft has significantly improved the clustering software in Windows Server 2008, Windows Server 2008 R2 and Windows Server 2012. There are two types of clustering technologies: Server Cluster (formerly known as MSCS) and Network Load Balancing Cluster (NLB). MSCS or Server Cluster is basically used for High Availability. NLB is of course used to load balance the TCP/IP traffic. The MSCS or Server Cluster capability is also known as Failover Cluster. The support for Virtual Machines running on Hyper-V in a cluster is available only with Failover Clustering configured on Windows Server 2008, Windows Server 2008 R2 and Windows Server 2012.
Virtual Machines and High Availability
Support for Clustering Virtual Machines was introduced in Windows Server 2008 RTM running the Hyper-V Role and has been continued in the versions that followed.
The Windows Failover Clustering feature includes many components, such as the Cluster Service, Node Manager, Membership Manager, Event Log Processor, Failover Manager, Resource Control Manager (RCM), Resource Host Subsystem (RHS) and Cluster Database Manager. The whole purpose of Failover Clustering is to provide high availability for application resources. Clustering doesn’t get involved in deciding how much CPU and memory an application should use.
An application running in the clustering environment must be cluster-aware. A cluster-aware application supports the functions executed by the cluster service or its components as shown in Figure 1.0 below. There is no way for Cluster Service to know about the availability of resources of an application in the cluster unless the application is cluster-aware. For example, if a node holding the application resources fails, the Cluster Service running on that node must be notified in order to start the failover process for the application’s resources. Cluster Service does this by receiving the responses from the Resource Host Subsystem (RHS). The RHS tracks the Virtual Machines with the help of Resource DLLs provided by Hyper-V Role.
You cannot cluster Virtual Machines running on Virtual Server, because they do not provide any Resource DLL that the Failover Clustering software could use to make them highly available. Virtual Machines running on Hyper-V, on the other hand, are fully cluster-aware: they support and respond to all the functions executed by the Cluster Service through the Resource Host Subsystem. The Resource DLL for Hyper-V Virtual Machines, which supports all of these functions, is VMCLUSRES.DLL. It is the only Resource DLL that the Hyper-V Role provides for its Virtual Machines in the cluster. We will discuss the Hyper-V Resource DLL in detail in this article.
Tip: A Resource DLL is a separate application component that is specifically written to support cluster functions (for example, Open, Terminate, Online, Offline, Retry and so on) for resources for that application.
The Resource Host Subsystem (RHS) of the Clustering Software tracks the availability of Hyper-V Virtual Machines through VMCLUSRES.DLL by performing two checks: IsAlive and LooksAlive. The implementation of these checks is application-specific, which is why cluster-aware applications are expected to provide their own Resource DLL. The Cluster Service doesn’t need to know about application-specific functions. It just executes the functions provided by the Resource DLLs. Hyper-V implements many other functions in its Resource DLL, as shown in Figure 1.0 below. These functions are specific to Hyper-V Virtual Machines and are not related to clustering in any way.
Tip: The two basic checks (IsAlive and LooksAlive) are supported by every Resource DLL or cluster-aware application.
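To make the idea of a Resource DLL more concrete, here is a minimal Python sketch of the contract described above. It is only an illustration: the class names `ResourceDll` and `VirtualMachineResourceDll` and the method names are my own assumptions, not the real entry points exported by VMCLUSRES.DLL, and the health checks are reduced to a simple set lookup.

```python
from abc import ABC, abstractmethod


class ResourceDll(ABC):
    """Hypothetical model of the functions a Resource DLL offers to RHS."""

    @abstractmethod
    def online(self, resource_id: str) -> None:
        """Bring the resource online."""

    @abstractmethod
    def offline(self, resource_id: str) -> None:
        """Take the resource offline."""

    @abstractmethod
    def is_alive(self, resource_id: str) -> bool:
        """One of the two basic health checks."""

    @abstractmethod
    def looks_alive(self, resource_id: str) -> bool:
        """The other basic health check."""


class VirtualMachineResourceDll(ResourceDll):
    """Toy stand-in for a VM resource DLL (assumption: state is an in-memory set)."""

    def __init__(self) -> None:
        self._running = set()

    def online(self, resource_id: str) -> None:
        self._running.add(resource_id)

    def offline(self, resource_id: str) -> None:
        self._running.discard(resource_id)

    def is_alive(self, resource_id: str) -> bool:
        return resource_id in self._running

    def looks_alive(self, resource_id: str) -> bool:
        return resource_id in self._running
```

The point of the abstract base class is that the caller only relies on the function names. How each check is carried out is left entirely to the application that ships the DLL.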
At the heart of the cluster there are two important components: the Resource Control Manager (RCM) and the Resource Host Subsystem (RHS).
Resource Control Manager (RCM)
Windows Server 2008 and later versions can host a large number of cluster resources with the help of the Resource Control Manager (RCM). RCM provides the following functionality:
- Manages the cluster resources
- Responsible for maintaining the cluster resource dependencies
- Responsible for maintaining the cluster policies.
- Maintains the state of each individual resource in the cluster database (HKLM\Cluster). The states maintained by the cluster service are Online, Offline, Failed, Online Pending, and Offline Pending.
- With the help of Failover Cluster Manager, the RCM allows administrators to execute the cluster actions on a resource or resource group. The actions which can be executed are Move, Failover and Failback.
Resource Host Subsystem (RHS)
In versions of Microsoft Windows Server prior to Windows Server 2008, the Resource Monitor, running as ResrcMon.exe, monitors the health of the resources running in a Failover Cluster. In Windows Server 2008 and later versions, Microsoft redesigned this part of Failover Clustering: the Resource Host Subsystem (RHS), running as RHS.exe or RHS32.exe, now monitors the health of cluster resources. This was a significant improvement over the previous versions. RHS plays an important role in the failover decision for resources.
Resource Host Subsystem provides the following benefits:
- Communicates directly to Cluster Service (Clussvc.exe)
- Communicates directly to the Resource DLLs
- Participates in the Failover process
- Manages the individual or Group Resources state for each resource DLL
- Runs in a separate process as RHS.exe or RHS32.exe
- Conducts periodic health checks for all cluster resources to ensure they are operating normally
Note: By default, all the cluster resources run in RHS.exe which is a 64-bit process. However, the support for 32-bit resource DLLs is also provided with the help of RHS32.exe process as shown in the figure 1.1 below:
As you can see in Figure 1.0, VMCLUSRES.DLL is installed when the Hyper-V Role is first enabled. Before you can cluster Virtual Machines running on Hyper-V, you need to install the Failover Clustering Software on Windows Server 2008 RTM, Windows Server 2008 R2 or Windows Server 2012. After the installation is completed, you click on “Services and Applications” in Failover Cluster Management and then select “Virtual Machines” as the cluster resource.
Tip: If you don’t see “Virtual Machines” then try running the following commands. This DLL must be registered before you can cluster Virtual Machines.
Regsvr32.exe /u VMCLUSRES.DLL
Regsvr32.exe VMCLUSRES.DLL
The first command unregisters the DLL and the second command re-registers VMCLUSRES.DLL with the Failover Clustering Software.
The next DLL is VMCLUSEX.DLL. This DLL works as a proxy between the Cluster Administrator and the Hyper-V Manager. The main function of this DLL is to provide interfaces to configure and control Virtual Machines configuration parameters and screens. If this DLL is missing or corrupted you can’t access Virtual Machines. VMCLUSEX.DLL doesn’t implement any cluster-specific control functions. As an example, when you right click on a Virtual Machine resource using the Failover Cluster Manager, you will get “Bring this Virtual Machine Online” option to start the Virtual Machine. The same will be reflected in Hyper-V Manager. You will see the Virtual Machine starting in the Hyper-V Manager also.
VMMS.EXE, which is the main process of Hyper-V, needs to know the status of the Virtual Machines running on the Hyper-V Server. The Resource DLL is written to report the status of the Virtual Machines in a cluster to VMMS.EXE. VMMS.EXE, in turn, shows the status of each Virtual Machine in Hyper-V Manager.
VMCLUSRES.DLL, which sits between RHS and the Virtual Machines, plays an important role in the failover process. Without this DLL, Hyper-V cannot function as a cluster-aware application.
Tip: Malicious code running on your system may corrupt DLL files. To recover from a corrupted DLL, you can either re-run the Hyper-V Setup (disabling and enabling the role) or copy VMCLUSRES.DLL from a working computer.
Figure 1.0 above also shows the functions defined in VMCLUSRES.DLL. The Hyper-V Virtual Machine-specific functions are mapped to the cluster-specific functions. For example, the cluster’s IsAlive and LooksAlive functions are mapped to VM IsAlive and VM LooksAlive respectively. However, there are no static mappings defined within VMCLUSRES.DLL; it simply knows which function to execute. In the same way, the other Virtual Machine functions are mapped to the related cluster functions, as shown in Figure 1.0.
The VM IsAlive and VM LooksAlive functions are executed by VMCLUSRES.DLL at predefined intervals. Most of the monitoring is done by performing a VM IsAlive query. VM IsAlive is implemented in such a way that it performs all the checks for Hyper-V Virtual Machines. It checks to make sure that:
- All the Virtual Machines in the cluster are online.
- The Virtual Machines are configured with the correct dependencies.
- The registry entries for the Virtual Machine resources are configured correctly.
VM LooksAlive is used to perform a thorough check on the Virtual Machines in the cluster. This check covers the configuration of the Virtual Machine, the location of the Virtual Machine configuration file (XML), the VHD location and so on, so it might take some time for LooksAlive to complete and report the status back to the Resource Host Subsystem. To avoid delays in reporting, the Resource Host Subsystem depends on the results reported by IsAlive, which is configured to execute every 5 seconds by default. IsAlive only checks the status of the Virtual Machine in the cluster (for example, Online or Failed). Based upon that, the action is taken by the Resource Host Subsystem.
Think of a situation where only LooksAlive is used to get the status of Virtual Machines in the cluster. This may result in slightly more downtime for the Virtual Machines, as LooksAlive calls are executed only every 60 seconds. Now, you could ask why not decrease the LooksAlive interval. Well, if you did so, you would see performance issues on the cluster. Please note that the Resource Host Subsystem executes IsAlive and LooksAlive queries against the whole Cluster Group. It is the responsibility of the Resource DLL (VMCLUSRES.DLL) to execute VM IsAlive and VM LooksAlive against its Virtual Machine resources. By default, the IsAlive check is performed every 5 seconds and LooksAlive every 60 seconds, as shown in Figure 1.2 below.
The default intervals can be changed per Virtual Machine to improve failover response time, as shown above in Figure 1.2.
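To get a feel for how the two intervals interleave, the following Python sketch lists which check would fire at a given second. It is a simplification under stated assumptions: the function name `checks_due` is hypothetical, the 5 and 60 second defaults are the ones quoted in this article, and the real scheduler inside the Resource Host Subsystem is certainly more involved than a modulo test.

```python
from typing import List


def checks_due(elapsed_seconds: int,
               is_alive_interval: int = 5,
               looks_alive_interval: int = 60) -> List[str]:
    """Return the checks that fall due at a given second since monitoring started.

    Assumption: both timers start at second 0 and never drift.
    """
    due = []
    if elapsed_seconds % is_alive_interval == 0:
        due.append("IsAlive")
    if elapsed_seconds % looks_alive_interval == 0:
        due.append("LooksAlive")
    return due


def calls_in_window(check: str, window_seconds: int, **intervals: int) -> int:
    """Count how often a check fires in the first window_seconds seconds."""
    return sum(
        1
        for second in range(1, window_seconds + 1)
        if check in checks_due(second, **intervals)
    )
```

With the defaults, `calls_in_window("IsAlive", 60)` counts twelve calls per minute for each resource. Lowering the interval to 2 seconds raises that to thirty, which is the trade-off between faster detection and extra load that you make when you change the values per Virtual Machine.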
Before I explain the IsAlive and LooksAlive calls further, let me say more about the “Run This Resource In A Separate Resource Monitor” check box shown in Figure 1.2 above.
By default, all cluster resources, including Virtual Machine resources, run under a single Resource Host Subsystem (RHS.EXE process). The “Run This Resource In A Separate Resource Monitor” option provides the ability to run a Virtual Machine resource in a separate Resource Host Subsystem. This option is useful in the following scenario:
- You have hundreds of Virtual Machines configured in the Failover Cluster. One of the Virtual Machines crashes every day at 12:00 noon. Since this Virtual Machine runs under the same Resource Host Subsystem (RHS.EXE process) as the others, all the other Virtual Machines stop responding as well.
To minimize the disruption for the other Virtual Machine resources, you can configure the specific Virtual Machine to run under a separate Resource Host Subsystem by selecting the “Run This Resource In A Separate Resource Monitor” option.
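The effect of this option can be illustrated with a small Python model. Everything here is an assumption made for illustration: the functions `assign_monitors` and `affected_by_crash` are hypothetical, and a monitor is represented by nothing more than a label and the list of resources it hosts.

```python
from collections import defaultdict
from typing import Dict, List

SHARED_MONITOR = "shared"


def assign_monitors(resources: Dict[str, bool]) -> Dict[str, List[str]]:
    """Group resources by the monitor process that hosts them.

    resources maps a resource name to its "separate monitor" flag.
    """
    monitors = defaultdict(list)
    for name, separate in resources.items():
        monitor_id = f"dedicated:{name}" if separate else SHARED_MONITOR
        monitors[monitor_id].append(name)
    return dict(monitors)


def affected_by_crash(monitors: Dict[str, List[str]],
                      crashing_resource: str) -> List[str]:
    """Return every resource hosted by the same monitor as the crashing one."""
    for hosted in monitors.values():
        if crashing_resource in hosted:
            return list(hosted)
    return []
```

For example, with `{"VM1": False, "VM2": False, "VM3": True}`, a crash of VM3 affects only VM3, whereas a crash of VM1 would take VM2 down with it because both share one monitor.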
In previous versions of Windows Clustering, it was not possible to define the IsAlive and LooksAlive interval per Resource. Now, starting with Windows Server 2008 cluster, it is possible to define the IsAlive and LooksAlive intervals per resource.
When you set up a cluster for the first time, the Cluster Service running on the node takes a snapshot of the cluster configuration and saves it in the HKLM\Cluster key. This key contains the cluster configuration, such as the resource names, their GUIDs, the node holding the resources and their status. This is generally called the cluster configuration database. As an example, for Virtual Machines it includes the following:
Resource Name | GUID | Node Name | Status | PersistentState
Virtual Machine 1 | {GUID1} | Node1 | Online | 1
Virtual Machine 2 | {GUID2} | Node1 | Online | 1
Virtual Machine 3 | {GUID3} | Node1 | Online | 0
PersistentState keeps the status of the Resources or Virtual Machines in the cluster: 1 means Online and 0 means Offline. The Status column shown above is just for your reference; it is not stored as a registry entry.
This is also shown in the Cluster Registry hive in figure 1.3 below:
As you can see in Figure 1.3, the PersistentState registry entry value of Virtual Machine “Test Cluster VM” is 1 which indicates that the Virtual Machine is Online in the cluster.
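The way PersistentState is interpreted can be sketched in a few lines of Python. This sketch does not read the real registry. The dictionary `cluster_database` is a hypothetical in-memory stand-in for the per-resource entries, and the key names inside it are assumptions chosen to mirror the table above.

```python
# Meaning of the PersistentState value as described in the article.
PERSISTENT_STATE_LABELS = {1: "Online", 0: "Offline"}

# Illustrative stand-in for the resource entries kept under HKLM\Cluster.
# The GUIDs are placeholders, exactly as in the table above.
cluster_database = {
    "{GUID1}": {"Name": "Virtual Machine 1", "Node": "Node1", "PersistentState": 1},
    "{GUID2}": {"Name": "Virtual Machine 2", "Node": "Node1", "PersistentState": 1},
    "{GUID3}": {"Name": "Virtual Machine 3", "Node": "Node1", "PersistentState": 0},
}


def describe_state(entry: dict) -> str:
    """Translate the stored PersistentState number into a readable label."""
    return PERSISTENT_STATE_LABELS[entry["PersistentState"]]


def resources_expected_online(database: dict) -> list:
    """List the names of resources whose PersistentState says they should be online."""
    return [
        entry["Name"]
        for entry in database.values()
        if entry["PersistentState"] == 1
    ]
```

Calling `resources_expected_online(cluster_database)` returns the two Virtual Machines whose stored value is 1, which is the information the cluster uses as its reference point for each resource.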
Before the Resource Host Subsystem executes any cluster function against the Virtual Machines or Cluster Groups, it looks at the cluster configuration database to check the status of all resources and their GUIDs. For example, let’s say we have a cluster group named “Hyper-V VMs”. All the Hyper-V Virtual Machines reside in this group. When the IsAlive interval expires (5 seconds by default), the RHS executes the IsAlive call against the “Hyper-V VMs” Cluster Group. It hands over the Resource GUID and Status to the Hyper-V Virtual Machine Resource DLL (VMCLUSRES.DLL). VMCLUSRES.DLL in turn executes the VM IsAlive call to check the availability of the Virtual Machines.
Note: VMCLUSRES.DLL doesn’t really know about the stored status of the Virtual Machines. It is the Resource Host Subsystem that supplies this information to VMCLUSRES.DLL. In other words, VMCLUSRES.DLL does not have direct access to the Cluster Database (HKLM\Cluster).
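The hand-over described above can be sketched as follows. This is a hedged illustration only: `rhs_is_alive_pass` and `RecordingVmDll` are hypothetical names, the database is a plain dictionary, and the DLL's check simply records what it was given so that the direction of the data flow is visible.

```python
from typing import Callable, Dict, List, Tuple


def rhs_is_alive_pass(database: Dict[str, dict],
                      vm_is_alive: Callable[[str, int], bool]) -> Dict[str, bool]:
    """Model of one polling pass: the caller reads the database, the DLL does not.

    For each resource, the GUID and stored PersistentState are passed to the
    check function supplied by the resource DLL.
    """
    return {
        guid: vm_is_alive(guid, entry["PersistentState"])
        for guid, entry in database.items()
    }


class RecordingVmDll:
    """Toy DLL that has no reference to the database (assumption for illustration)."""

    def __init__(self, running_guids: set) -> None:
        self._running = running_guids
        self.received: List[Tuple[str, int]] = []

    def vm_is_alive(self, guid: str, persistent_state: int) -> bool:
        # Record exactly what the caller handed over, then report availability.
        self.received.append((guid, persistent_state))
        return guid in self._running
```

Note that `RecordingVmDll` never sees the database object itself. It only receives the GUID and state for each call, which mirrors the statement that the Resource DLL has no direct access to HKLM\Cluster.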
Next we look at the VM Open, VM Close, VM Online and VM Offline functions. These functions are called whenever Virtual Machines are moved across Hyper-V Servers, taken offline or brought online, or whenever else they are needed. For example, you might want to take a Virtual Machine offline for maintenance purposes on a Hyper-V node. In that case, the Resource Host Subsystem executes the Offline function and in turn VMCLUSRES.DLL executes the VM Offline function to take the Virtual Machine offline. The VMMS.EXE process is updated in the background so that it is aware of the Virtual Machine status. We will discuss these functions later in this article.
As a whole, these functions are executed by the Cluster Service and supported by the Hyper-V Resource DLL. That’s why Hyper-V Server Virtualization is known as pure cluster-aware Virtualization Software!
The Resource Host Subsystem determines the state of Virtual Machines by checking the PersistentState value in the registry. This value is either 1 (Online) or 0 (Offline). For example, if you stop a Virtual Machine on a cluster node, the value 0 is set for that resource in the registry. If you stop the Virtual Machine from the command line or from Hyper-V Manager, the value is still updated in the Cluster Configuration Database, because the Hyper-V Resource DLL and VMMS.EXE always talk to each other to get the status of Virtual Machines and update the Cluster Configuration Database accordingly. When you stop a Virtual Machine using a command line or WMI script, you are actually interacting with the VMMS.EXE service, which executes the Stop command on your behalf.
This may not work for other applications in the cluster. Take Exchange Server as an example: operations that occur outside the cluster for Exchange Server resources are not reflected in the cluster configuration database. In this case, the IsAlive query may not function correctly. The value supplied by the RHS will indicate that the resources are running, so IsAlive will not take any action against the stopped cluster resources. The value is updated in the Cluster Configuration Database only when LooksAlive is executed, because it performs a thorough check of the resources, including the Exchange services.
How does Hyper-V Virtual Machine Resource DLL help in the failover process?
The status messages shown above Figure 1.1 are generated through IsAlive calls. When the IsAlive interval expires, the Resource Host Subsystem executes the Cluster IsAlive calls. The Hyper-V Cluster Resource DLL in turn executes VM IsAlive against all Virtual Machine Resources. The messages returned by these calls include one of the following:
- Online
- Offline
- Online Pending
- Offline Pending
- Failed
These status messages are passed back to the Resource Host Subsystem, which in turn reports to the Cluster Service whether any action is needed.
Tip: It is the Cluster Service which takes action during the Failover Process (if triggered). Cluster Service, running under Clussvc.exe process, is responsible for managing different cluster managers and calling them as and when needed.
As shown in Figure 1.0, the Resource Host Subsystem sits between the Hyper-V Resource DLL and the Cluster Service. Any calls made to Hyper-V Virtual Machine Resources have to take place at VMCLUSRES.DLL first. For example, if the Cluster Service needs to check the availability of Hyper-V Virtual Machine resources, it will make a call to the Resource Host Subsystem; in turn this will ask VMCLUSRES.DLL to check the status of the Hyper-V Virtual Machine Resources and report back. If the Resource Host Subsystem doesn’t receive any response from VMCLUSRES.DLL or it cannot detect the Virtual Machine availability, it will pass the status back to Cluster Service. The Cluster Service then passes this status message to related Managers as shown in the Figure 1.0. Cluster Managers take the action as per the status passed by lower layer components. The status message could indicate a failure of Virtual Machine resources or could indicate a simple status message. These messages and cluster actions are discussed later in this article with an example.
In addition, if functions executed by the Resource Host Subsystem do not exist in the Resource DLL, the request is simply discarded and no operation is carried out.
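The "discard if not implemented" behaviour can be sketched in Python as a lookup by name. This is an illustration under assumptions: the class `VmResourceDll`, its method names and the `vm_` naming convention are hypothetical, chosen only to show a request being dropped when no matching function exists.

```python
from typing import Optional


class VmResourceDll:
    """Toy resource DLL that implements only three functions (hypothetical)."""

    def vm_online(self, vm: str) -> str:
        return f"{vm}: Online"

    def vm_offline(self, vm: str) -> str:
        return f"{vm}: Offline"

    def vm_isalive(self, vm: str) -> str:
        return f"{vm}: alive"


def dispatch(dll: object, cluster_function: str, vm: str) -> Optional[str]:
    """Forward a cluster function to the DLL, or discard it if unsupported.

    The handler is looked up by name at call time, so no static mapping
    table is needed.
    """
    handler = getattr(dll, "vm_" + cluster_function.lower(), None)
    if handler is None:
        return None  # request discarded, no operation carried out
    return handler(vm)
```

Here `dispatch(VmResourceDll(), "Online", "VM1")` is forwarded to `vm_online`, while a call for a function the toy DLL lacks, such as "Retry", simply returns `None` without raising an error.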
Hyper-V Server doesn’t really use its own mechanism to fail over the Virtual Machines to the surviving node. Instead, Resource DLLs are written to “support” the failover process. The following figure shows a simple failover process:
- After IsAlive interval expires (by default every 5 seconds), Cluster Service asks the Resource Host Subsystem to report the status of Virtual Machines.
- The Resource Host Subsystem checks the status of the Virtual Machine Resources in the Cluster configuration database (HKLM\Cluster). It provides VMCLUSRES.DLL with the GUIDs of the Virtual Machine Resources and their current status (PersistentState).
- VMCLUSRES.DLL executes its own function (VM IsAlive) after it receives a signal from the Resource Host Subsystem to perform a check on the Virtual Machines. It checks and reports back the status to Resource Host Subsystem. VMCLUSRES.DLL will report one of the following status messages:
- Online
- Offline
- Online Pending
- Offline Pending
- Failed
- Stopped
- After the Resource Host Subsystem receives the status, it compares the status messages received from VMCLUSRES.DLL with the one stored in the Cluster configuration database. It then takes the action as per the status reported by the VMCLUSRES.DLL as listed below:
- If comparison is successful, no action is taken. For example, status message received in step 2 is “Online” and VM IsAlive query also reports the same status.
- If comparison is unsuccessful, the following actions are taken:
If status message received in step 2 is “Online” and VM IsAlive query reports “Offline”, the Resource Host Subsystem executes an “Online” function. VMCLUSRES.DLL receives this message and executes VM Online function to bring the Virtual Machine online. This status message is also reported to the VMMS.EXE process.
Tip: The Resource Host Subsystem doesn’t take any action for Online/Offline status messages because an Administrator might have stopped the resource for maintenance purposes, but the same should also be reflected in the Cluster configuration database before IsAlive is called. The Resource Host Subsystem only takes action when the comparison is not successful as stated above.
Furthermore, there shouldn’t be any inconsistencies in the Cluster configuration database. If there were any, they wouldn’t last longer than 5 seconds, since the Resource Control Manager always updates the status in the Cluster configuration database. The Cluster configuration database is then replicated to all the nodes in the cluster with the help of the Database Manager.
The mechanism isn’t entirely straightforward. There could be one more message returned by VMCLUSRES.DLL, which is “Failed”. In this case the Resource Host Subsystem sends a message (Restart) back to VMCLUSRES.DLL to restart the Virtual Machine resource in the cluster. VMCLUSRES.DLL in turn executes the “VM Online” function to bring the failed Virtual Machine online.
Tip: VMCLUSRES.DLL doesn’t actually implement a separate Restart function. Instead it always uses its own implemented VM Online function. If a resource doesn’t come online within the specified interval or after a few attempts, the resource is considered to be failed and then the Failover process starts. The same is notified to the VMMS.EXE as it needs to keep the status of all the Virtual Machines running in the Cluster.
- After the Virtual Machine resource has failed, the message is passed back to the Resource Host Subsystem. The Cluster Service receives this message from the Resource Host Subsystem and starts the failover process with the help of the Failover Manager. The Failover Manager on the failing node communicates with the Failover Manager on the selected target node to carry out the failover. Before the Failover Manager on the node where the Virtual Machine resource has failed can communicate with another Failover Manager, it needs to get the list of nodes available in the cluster. This is where the Node Manager comes into the picture. It supplies the list of nodes available in the cluster, and the first available node at the top of the list is selected for failover.
- Once the list of nodes has been obtained by the source Failover Manager, it talks to the Failover Manager on the target node. The Failover Manager on the target node supplies the list of Virtual Machine Resources, along with their GUIDs and PersistentState values, to the Resource Host Subsystem. Since this is a failover process, the Resource Host Subsystem knows what to do next. It lists all the Virtual Machines with their flags (Online or Offline) and instructs the Hyper-V Resource DLL to execute the VM Online function from its library.
- The Resource DLL, in turn, executes the VM Online function to bring the resources online on the target node. The same is updated to the VMMS.EXE process of Hyper-V.
- If the Virtual Machine is started successfully within a few attempts, the failover process doesn’t occur.
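The steps above can be condensed into a small Python sketch. Please treat it as a simplified model rather than the actual cluster logic: the names `State`, `decide_action` and `restart_or_failover` are hypothetical, only three of the status messages are modelled, and the limit of three restart attempts is an assumption standing in for "a few attempts".

```python
from enum import Enum
from typing import Callable, List, Optional, Tuple


class State(Enum):
    ONLINE = "Online"
    OFFLINE = "Offline"
    FAILED = "Failed"


def decide_action(persisted: State, reported: State) -> str:
    """Compare the stored state with the state reported by the VM check."""
    if reported is State.FAILED:
        return "restart"   # the DLL is asked to run VM Online again
    if reported is persisted:
        return "none"      # comparison successful, nothing to do
    if persisted is State.ONLINE and reported is State.OFFLINE:
        return "online"    # bring the Virtual Machine back online
    # Other mismatches are not described in the article; this sketch
    # assumes no action is taken for them.
    return "none"


def restart_or_failover(try_online: Callable[[], bool],
                        available_nodes: List[str],
                        max_attempts: int = 3) -> Tuple[str, Optional[str]]:
    """Retry VM Online locally, then fall back to the first available node."""
    for _ in range(max_attempts):
        if try_online():
            return ("restarted", None)   # no failover needed
    if not available_nodes:
        return ("failed", None)          # nowhere to fail over to
    return ("failover", available_nodes[0])
```

For instance, `restart_or_failover(lambda: False, ["Node2", "Node3"])` gives up after the local attempts and selects Node2, the node at the top of the list, as the failover target.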
Thus if there is no Resource DLL for Hyper-V Virtual Machines, the failover process could take a longer time to move the resources from one node to another surviving node. Because Hyper-V Resource DLL is competent enough to handle the cluster functions executed by the Clustering Software, it doesn’t need to wait to decide which action to take. As stated above, the cluster-aware functions are mapped with Hyper-V Resource DLL-specific functions, so it is easier for Hyper-V Resource DLL to execute these functions as soon as they are executed from the Resource Host Subsystem.
In Figure 1.4 you see VMMS.EXE and Hyper-V Manager. Every function executed by the VMCLUSRES.DLL is also notified to VMMS.EXE. VMMS.EXE, in turn, refreshes the status of its VMs on the Hyper-V Server. This is required in order to know the exact status of a VM running on the Hyper-V Server. As an example, an Administrator could open the Hyper-V Manager to get the status of all the Virtual Machines on the Hyper-V Server. If a Virtual Machine has failed and this is not communicated to VMMS.EXE, then there could be confusion, since the Failover Cluster Manager would report one status and the Hyper-V Manager would report a different status.
Tip: IsAlive is executed every 5 seconds for a Virtual Machine in the cluster. You could decrease this value to 1 or 2 seconds to speed up the failover process.
Tip: The process defined above is very helpful when you are troubleshooting an issue with the Failed Virtual Machine resource in the cluster. These messages with the clustering component names are reflected in the cluster log files.
Conclusion
To summarize, Virtual Machines running on Virtual Server are not cluster-aware because they do not provide any Resource DLL. Virtual Machines running on Hyper-V are cluster-aware because Hyper-V Role ships along with a cluster Resource DLL.
We saw how the Cluster Service doesn’t talk to VMCLUSRES.DLL directly. In fact, it uses its Resource Host Subsystem (RHS). The status messages passed by the Hyper-V Resource DLL are received by the Resource Host Subsystem (RHS) to perform any appropriate action.
Finally we also saw how the Hyper-V Resource DLL plays an important role for its Virtual Machines in the cluster. Resource DLLs allow Hyper-V Virtual Machines to be fully cluster-aware VMs. The functions executed by the Resource Host Subsystem (RHS) on behalf of the Cluster Service are supported by the Hyper-V Resource DLL. This makes the failover process faster.