Recovering from Semi-Dead ESXi Management

VMware

October 5, 2017

Firstly, I’d like to thank Bence Bertalan for the inspiration for this blog post (you can see his original blog post here: https://www.linkedin.com/pulse/qucik-tip-vmware-bringing-back-semi-dead-esxi-live-bence-bertalan/).

I have actually experienced this issue myself recently with a host that was still running ESXi 6.0. It appears that some activity on the host had left the storage connection to a LUN unavailable. This resulted in an ‘All Paths Down’ (APD) scenario, which would eventually progress to a ‘Permanent Device Loss’ (PDL) situation.

With the ‘All Paths Down’ scenario, the ESXi kernel keeps trying to reconnect to the lost device. Unfortunately, it continues to retry until all of the kernel resources are focused purely on recovering connectivity to the lost volume.
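If you already have shell access to the host, one way to check whether an APD event is behind the symptoms is to search the host logs for APD entries. The sketch below is only that, a sketch: it assumes the events are recorded in /var/log/vobd.log with identifiers beginning `esx.problem.storage.apd`, so treat both the path and the pattern as assumptions to verify on your own build. The function name `count_apd_events` is hypothetical.

```shell
# Count APD-related events in a log file.
# Assumption: events are tagged "esx.problem.storage.apd" and the default
# log location is /var/log/vobd.log; adjust both if your host differs.
count_apd_events() {
    log_file="${1:-/var/log/vobd.log}"
    if [ ! -r "$log_file" ]; then
        echo "cannot read $log_file" >&2
        return 1
    fi
    # grep -c prints the number of matching lines (0 when there are none);
    # "|| true" keeps the function's exit status at 0 in the no-match case.
    grep -c 'esx\.problem\.storage\.apd' "$log_file" || true
}
```

Running `count_apd_events` with no argument checks the assumed default log; a non-zero count suggests the host has logged APD events, although it does not tell you whether the condition is still ongoing.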

This matters because you will begin to notice the following symptoms on the host:

The host will lose connectivity to the vCenter Server and will become unresponsive to vSphere API calls
If you are able to view the DCUI directly, you will notice that the login process will be extremely slow (and could take hours to complete)
If you try to ‘Restart Management Agents’ from the DCUI, it is unlikely to complete successfully and will hang at the ‘Stopping Management Agents’ phase
If you try to ‘Enable SSH’ from within the DCUI, this will also take a long time to complete

Throughout all of this, the virtual machines will continue to run on the host and will be accessible through their network connection.

There are a number of ways that you could go about correcting this issue.

One option would be to reboot the host. This would require you to RDP onto each of the desktops and shut them down gracefully before using a remote management tool (DRAC, IMM, iLO) to reboot the host, if it is remote. As you can tell, this requires downtime for both the virtual machines and the host, which is not always possible.

An alternative solution, expanded on below, is to restart the relevant services over SSH. Please note that this may take several hours to complete, depending on the responsiveness of your host:

You will need to enable SSH from the DCUI, so log on to the DCUI with the root credentials for the host
Select ‘Troubleshooting Options’ and then ‘Enable SSH’
At this point you will need to wait until ‘Enable SSH’ changes to ‘Disable SSH’ and ‘SSH is Enabled’ appears in the grey part of the screen
With SSH enabled, you can now connect to the host with an SSH client such as PuTTY; remember to still use the root credentials to log on
Before attempting to restart all of the management agents, you need to restart the hostd and vpxa services using the following commands:

/etc/init.d/hostd restart

/etc/init.d/vpxa restart

You can now restart the management agents using the following command:

services.sh restart

If everything goes to plan, then this command should complete successfully within 1-2 minutes and the host will become responsive again and connect back to vCenter. The good thing about these commands is that the VMs will continue to function normally throughout.
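For convenience, the three commands above can be wrapped in a small script. This is only a sketch: the `run` and `restart_management` function names and the `DRY_RUN` variable are my own additions, not part of ESXi, and the script defaults to printing the commands rather than executing them so that it can be reviewed safely first.

```shell
#!/bin/sh
# Sketch: restart hostd and vpxa, then all management agents, in that order.
# DRY_RUN is a hypothetical variable; it defaults to 1, so commands are
# only printed unless you explicitly set DRY_RUN=0.

run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "would run: $*"
    else
        echo "running: $*"
        "$@"
    fi
}

restart_management() {
    # Stop at the first failure so a failed step is obvious.
    run /etc/init.d/hostd restart || return 1
    run /etc/init.d/vpxa restart || return 1
    run services.sh restart || return 1
}
```

Calling `restart_management` on its own prints the three commands in order; `DRY_RUN=0 restart_management` actually runs them on the host. Bear in mind that on a badly affected host each step may still take a long time to return.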

cbell76
