Recovering from Semi-Dead ESXi Management
Firstly, I’d like to thank Bence Bertalan for the inspiration for this blog post (you can see his original blog post here: https://www.linkedin.com/pulse/qucik-tip-vmware-bringing-back-semi-dead-esxi-live-bence-bertalan/).
I experienced this issue myself recently with a host that was still running ESXi 6.0. It appears that some activities had taken place on the host which resulted in the storage connection to a LUN becoming unavailable. This resulted in an ‘All Paths Down’ scenario, which could eventually progress to a ‘Permanent Device Loss’ situation.
With the ‘All Paths Down’ scenario, the ESXi kernel will keep trying to reconnect to the lost device. Unfortunately, it continues to retry until all of the kernel resources are focused purely on recovering connectivity to the lost volume.
This matters because you will begin to notice the following symptoms on the host:
- The host will lose connectivity to the vCenter Server and will become unresponsive to vSphere API calls
- If you are able to view the DCUI directly, you will notice that the login process will be extremely slow (and could take hours to complete)
- If you try to ‘Restart Management Agents’ from the DCUI, it is unlikely to complete successfully and will hang at the ‘Stopping Management Agents’ phase
- If you try to ‘Enable SSH’ from within the DCUI, this will also take a long time to complete
- Throughout all of this, the virtual machines will continue to run on the host and will be accessible through their network connection.
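Once you have shell access to the host, one way to confirm that an ‘All Paths Down’ condition is behind these symptoms is to search the kernel log for APD messages. The following is a minimal sketch, not a definitive check: the function name count_apd_lines is hypothetical, /var/log/vmkernel.log is the usual log location on ESXi, and the search pattern is an assumption that may need adjusting for your build.

```shell
#!/bin/sh
# Sketch: count log lines that mention an All Paths Down (APD) condition.
# The default path is the usual vmkernel log on ESXi; the pattern is an
# assumption about how APD events are worded in that log.

count_apd_lines() {
    log="${1:-/var/log/vmkernel.log}"
    # -i: ignore case, -c: print the number of matching lines,
    # -E: extended regex so both spellings can be matched in one pass
    grep -i -c -E 'all paths down|apd' "$log"
}

# Example (run on the host): count_apd_lines /var/log/vmkernel.log
```

A non-zero count only suggests that APD events were logged at some point; reading the matching lines themselves (drop the -c option) shows which device was affected and when.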
There are a number of ways that you could go about correcting this issue.
One option would be to reboot the host. This would require you to RDP onto each of the desktops and power them down, so that they are shut down gracefully before you use a remote management tool (DRAC, IMM, iLO) to reboot the host, if it is remote. As you can tell, this requires downtime for the virtual machines and the host, which is not always possible.
An alternative solution, expanded on below, is to restart the relevant services over SSH. Please note that this may take several hours to complete, depending on the responsiveness of your host:
- You will need to enable SSH from the DCUI, so log on to the DCUI with the root credentials for the host
- Select ‘Troubleshooting Options’ and then ‘Enable SSH’
- At this point you will need to wait until ‘Enable SSH’ changes to ‘Disable SSH’ and ‘SSH is Enabled’ appears in the grey part of the screen
- With SSH enabled, you can now connect to the host with an SSH client such as PuTTY, again logging on with the root credentials
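- Before attempting to restart all of the management agents, restart the hostd and vpxa services individually from the SSH session (vpxa is the vCenter agent that runs on the host; vpxd is the vCenter Server service itself)
- You can then restart all of the management agents from the same session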
- If everything goes to plan, the restart should complete successfully within 1-2 minutes, and the host will become responsive again and reconnect to vCenter. The good thing about this approach is that the VMs will continue to function normally throughout.
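The restart sequence described in the steps above can be sketched as follows. This assumes the standard ESXi init scripts (/etc/init.d/hostd and /etc/init.d/vpxa) and the services.sh script that restarts all management agents. The run helper and the DRY_RUN variable are hypothetical additions, included so that the sequence can be previewed before anything is actually restarted.

```shell
#!/bin/sh
# Sketch: restart the ESXi management agents in the order described above.
# DRY_RUN defaults to 1, which only prints the commands. Set DRY_RUN=0 on
# the host to execute them. The run helper and DRY_RUN are illustrative.

DRY_RUN="${DRY_RUN:-1}"

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "would run: $*"
    else
        "$@"
    fi
}

restart_agents() {
    # Restart the host daemon first, then the vCenter agent
    run /etc/init.d/hostd restart
    run /etc/init.d/vpxa restart
    # Then restart all of the management agents
    run services.sh restart
}

restart_agents
```

Saved to a file on the host, the script prints the three commands when run as is and executes them when run with DRY_RUN=0. On a host in this state each step may take far longer than usual, so it is worth letting one finish before assuming that it has hung.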