Recovering from Semi-Dead ESXi Management

First, I’d like to thank Bence Bertalan for the inspiration for this blog post (you can see his original post here: https://www.linkedin.com/pulse/qucik-tip-vmware-bringing-back-semi-dead-esxi-live-bence-bertalan/).

I experienced this issue myself recently with a host that was still running ESXi 6.0. Some activity on the host had left the storage connection to a LUN unavailable. This resulted in an ‘All Paths Down’ (APD) scenario, which would eventually progress to a ‘Permanent Device Loss’ (PDL) situation.

In an ‘All Paths Down’ scenario, the ESXi kernel keeps trying to reconnect to the lost device. Unfortunately, it continues to retry until all of the kernel resources are focused purely on recovering connectivity to the lost volume.

The reason this matters is that you will begin to notice the following on the host:

  • The host will lose connectivity to the vCenter Server and will become unresponsive to vSphere API calls
  • If you are able to view the DCUI directly, you will notice that the login process will be extremely slow (and could take hours to complete)
  • If you try to ‘Restart Management Agents’ from the DCUI, it is unlikely to complete successfully and will hang at the ‘Stopping Management Agents’ phase
  • If you try to ‘Enable SSH’ from within the DCUI, this will also take a long time to complete
  • Throughout all of this, the virtual machines will continue to run on the host and will be accessible through their network connection.
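
If you do manage to get a shell on the host (the SSH steps further down cover this), one way to confirm that dead storage paths are behind the symptoms is to look for paths reported as dead. The following is only a sketch: the function name dead_path_devices is made up for illustration, and it assumes the usual layout of the esxcli storage core path list output, where each path block contains a "Device:" line followed by a "State:" line. Bear in mind that on a badly affected host esxcli itself may be slow or fail to respond.

```shell
#!/bin/sh
# Sketch only: list storage devices that have at least one dead path.
# Assumes the `esxcli storage core path list` layout, in which each path
# block has a "Device:" line followed later by a "State:" line.

# Hypothetical helper: reads path-list output on stdin and prints each
# device that has a dead path, once.
dead_path_devices() {
    awk '
        $1 == "Device:" { dev = $2 }
        $1 == "State:" && $2 == "dead" { print dev }
    ' | sort -u
}

# Only query the host if esxcli is actually available (i.e. on ESXi).
if command -v esxcli >/dev/null 2>&1; then
    esxcli storage core path list | dead_path_devices
fi
```

Any device identifiers printed here are candidates for the lost LUN. An empty result suggests the slowdown may have a different cause.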

There are a number of ways that you could go about correcting this issue.

One option would be to reboot the host. This would require you to RDP onto each of the desktops and shut them down gracefully before using a remote management tool (DRAC, IMM, iLO) to reboot the host, if it is remote. As you can tell, this requires downtime on the virtual machines and the host, which is not always possible.

An alternative solution, expanded on below, is to restart the relevant services over SSH. Please note that this may take several hours to complete, depending on the responsiveness of your host:

  • You will need to enable SSH from the DCUI, so log on there with the root credentials for the host
  • Select ‘Troubleshooting Options’ and then ‘Enable SSH’
  • At this point you will need to wait until ‘Enable SSH’ changes to ‘Disable SSH’ and ‘SSH is Enabled’ appears in the grey part of the screen
  • With SSH enabled, you can now connect to the host with an SSH client such as PuTTY – remember to still use the root credentials to log on
  • Before attempting to restart all of the management agents, you need to restart the hostd and vpxa services using the following commands:
/etc/init.d/hostd restart
/etc/init.d/vpxa restart
  • You can now restart the management agents using the following command:
services.sh restart
  • If everything goes to plan, this command should complete successfully within 1-2 minutes, and the host will become responsive again and reconnect to vCenter. The good thing about these commands is that the VMs will continue to function normally throughout.
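
The restart sequence above could be wrapped in a small script so that the order is not lost while waiting on a slow session. This is a minimal sketch, not a supported procedure: the DRY_RUN variable and the run and restart_agents helpers are names invented here, and the script defaults to only printing what it would do. The three commands themselves are the ones listed in the steps above.

```shell
#!/bin/sh
# Sketch only: run the restart sequence from the steps above, in order.
# DRY_RUN=1 (the default here) prints the commands instead of running them;
# set DRY_RUN=0 on the ESXi host to actually execute them.
DRY_RUN="${DRY_RUN:-1}"

# Hypothetical helper: print or execute a single command.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "would run: $*"
    else
        echo "running: $*"
        "$@"
    fi
}

# Restart hostd and vpxa first, then all management agents.
# The && chain stops at the first command that fails.
restart_agents() {
    run /etc/init.d/hostd restart &&
    run /etc/init.d/vpxa restart &&
    run services.sh restart
}

restart_agents
```

Running it as-is shows the three commands in order. Running it with DRY_RUN=0 on the host executes them, and each step may still take a long time if the host is struggling.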

About the Author

Dinger

I have been in IT for the past 15 years and have been using virtualisation technologies for around the past 8 years. I started, as quite a lot of people do, working with PCs after playing with such iconic systems as the ZX81 and ZX Spectrum, and then progressing through 386s, 486s, Pentiums etc. After being headhunted at sixth form to work for a small company based around Hertfordshire, UK, I began working with small businesses and gaining a lot of hardware experience. Three years later, after helping to increase the size of the business, I needed to gain exposure to a larger environment to progress my own career. I joined a large manufacturing company in Electronic Test and Measurement, which progressed my skills onto more PC work, hardware work and then onto Server Operating Systems. I progressed again onto a consultancy company based in Reading, UK. Initially working as an engineer performing hardware / software installations for larger companies contracted out to the consultancy company, I moved up into a Consultant position, continuing my travel across the UK assisting and providing solutions to companies. I finally moved on again to my current position, working back in Hertfordshire, UK, again for a large manufacturing company, this time with over 50,000 users worldwide. I am responsible for the datacenter hardware, the storage environment, the VMware environment and also implementing their new Citrix XenApp farm. My days are busy but also productive; it’s a friendly environment, and in my four years of being with the company, I have seen many changes in technology and infrastructure in use within the company.

About the site

I started this site as I had been thinking of having more of a presence on the web for a while. On a daily basis, I perform tasks and use tools that others may not use or may not think of, and therefore I thought that I would share some of these experiences and tips with others to help with their day to day work.
Currently, my main focus of work is around VMware and Veeam Backup & Replication but hopefully as my tasks progress, I’ll be able to share useful bits of information about other areas of IT as well.
