Recently, I had to retire some nodes from a client's Nutanix cluster. In this particular case we were recycling 3 of the 6 nodes down to DR. I've added nodes to a cluster before, but this was the first time I had to remove nodes from one. I reached out to Alan Biren from Nutanix for some quick instructions. As with all my dealings with Nutanix, I knew this would be a simple process. Below is the long and short of it (which includes removing nodes from a live, running system without any user disruption or data interruption). –Nice!
Since all nodes participate in data protection and replication, the process needs to be done one node at a time.
The first step is to migrate all VMs off the node (ESXi) to the other nodes (use vCenter). Since a CVM is running on each node from that node's local storage, it can't be vMotioned away, and the host can't enter maintenance mode while a VM is still running on it. So I instead opted to remove the nodes from the vSphere DRS cluster and reconfigure them as standalone ESXi hosts.
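One quick sanity check before pulling a host out of DRS: confirm the only VM still running locally is the CVM. A minimal sketch, assuming SSH access to the ESXi host and the standard NTNX-…-CVM naming convention for the Controller VM (the parsing is wrapped in a function so it can be exercised against sample output):

```shell
# Filter `esxcli vm process list` output down to running VMs that are not the
# CVM; the list should be empty once evacuation is done.
# Assumes the default NTNX-...-CVM naming convention for the Controller VM.
remaining_non_cvm_vms() {
  grep 'Display Name:' | grep -v 'NTNX-.*-CVM' || true
}

# On the host itself, the real check would be:
#   esxcli vm process list | remaining_non_cvm_vms
# Demonstrated here against captured sample output:
sample='Display Name: NTNX-block1-A-CVM
Display Name: app-server-01'
echo "$sample" | remaining_non_cvm_vms   # app-server-01 still needs migrating
```

If the function prints anything other than the CVM filter's empty result, those VMs still need to be migrated in vCenter before proceeding.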
Once all VMs are evacuated, unmount the Nutanix NFS datastore from the host.
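If you prefer the host's CLI over the vSphere client for the unmount, something like this should do it; the datastore name below is a placeholder, so substitute whatever your Nutanix container is mounted as:

```shell
# Placeholder name; substitute your actual Nutanix container/datastore name.
DS="NTNX-datastore-1"

# On the ESXi host being retired:
#   esxcli storage nfs list               # confirm the datastore is present
#   esxcli storage nfs remove -v "$DS"    # unmount it from this host only
echo "unmounting NFS datastore: $DS"
```

Note that `esxcli storage nfs remove` only unmounts the datastore from this one host; the container itself and the other hosts' mounts are untouched.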
The CVM needs to remain up and running throughout the removal.
From Prism, go to Hardware, select the node in the Diagram view, click Remove Node, and OK the warning message.
This starts the process of migrating all of the data sitting on that node to the remaining nodes in the cluster (re-establishing replication across them). Depending on the amount of data, this can take up to 6 hours to complete. Once it finishes, the GUI will show success and a reduced number of nodes in the cluster. Verify there are no remaining alerts before proceeding to the next node.
You can also check that this has completed by running 'cluster status' from a CVM and checking which nodes are still in the Nutanix cluster.
Another check is 'ncli get-remove-status': if 'MARKED_FOR_REMOVAL_BUT_NOT_DETACHABLE' is displayed, the process is not complete; if no output is returned, the process has completed.
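That check scripts easily if you want to poll for completion. A minimal sketch, meant to be run from a CVM; the grep is wrapped in a function so the logic can be demonstrated against sample output:

```shell
# Returns success (exit 0) when node removal has finished, i.e. when the
# status output no longer contains MARKED_FOR_REMOVAL_BUT_NOT_DETACHABLE.
removal_done() {
  ! grep -q 'MARKED_FOR_REMOVAL_BUT_NOT_DETACHABLE'
}

# On a CVM, the real check would be:
#   ncli get-remove-status | removal_done && echo "removal complete"
# Demonstrated here against sample output:
sample='Host Status : MARKED_FOR_REMOVAL_BUT_NOT_DETACHABLE'
if echo "$sample" | removal_done; then
  echo "removal complete"
else
  echo "still migrating data"   # this branch fires for the sample above
fi
```

Remember no output at all from 'ncli get-remove-status' also means done, which is why the function keys on the absence of the marker rather than on any particular success string.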
Once the nodes have been removed from the Nutanix cluster, you can clean up the ESXi side by shutting down the CVM (triple-verify in Prism that no errors pop up when shutting down the CVM), putting the host into maintenance mode, and removing it from the ESXi cluster.
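The ESXi-side cleanup can be sketched roughly as below, assuming SSH access to the CVM and the host; cvm_shutdown is the Nutanix-aware wrapper that stops services cleanly before powering the CVM off:

```shell
# On the CVM being retired (not a plain `shutdown` -- cvm_shutdown stops the
# Nutanix services gracefully first):
#   cvm_shutdown -P now

# Then on the ESXi host:
#   esxcli system maintenanceMode set --enable true
#   esxcli system maintenanceMode get    # should report "Enabled"

# Small helper to verify the maintenance-mode state from the `get` output:
in_maintenance() { grep -qx 'Enabled'; }

echo 'Enabled' | in_maintenance && echo "host is in maintenance mode"
```

With the host in maintenance mode, removing it from the cluster in vCenter is the usual drag-out/remove operation.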