Before we begin
If you have not checked out what this series is about then please take a look at the previous parts below.
Software Defined DataCentre Manager deployment
The aim of this part of the blog series is to configure the SDDC management cluster and get vSAN running on it for storage. The VMware Cloud Foundation deployment virtual machine will allow us to do this.
Before you go any further, make sure you grab a copy of the VMware Cloud Foundation deployment OVA from here. It’s about 9GB.
I would also recommend making a blueprint of your environment at this point. If the process goes wrong (like mine did) you have an easy way to get back to this point.
The stand-alone ESXi host that was deployed in part 2 of this blog series will host the deployment VM.
Import the OVA and power it on. You should end up with something like this. Note the access URL.
Note: when defining network settings, ensure the NTP server is a local NTP source such as the AD DS server.
Log into the web UI.
Usefully, when you first log in, there is a checklist of action items that need to be completed before continuing. Any item I have not ticked below is one I need to go and rectify on the management hosts. Ensure you do the same.
One other item to check is if the disks are marked as SSD or HDD on the management cluster hosts. They will default to HDD but we want them to appear as SSD drives so we can simulate an all-flash vSAN cluster. This is an easy task if the hosts are connected to a vCenter server already, but as they are not, this needs to be done via the command line. Check out this blog post from William Lam for details.
SSH onto the host and run the following command:
esxcli storage nmp device list
This will list all the disks on the host.
There are quite a few disks to configure. My list of commands ended up looking like this. I also marked the OS drive as an SSD.
esxcli storage nmp satp rule add -s VMW_SATP_LOCAL -d mpx.vmhba2:C0:T0:L0 -o enable_ssd
esxcli storage nmp satp rule add -s VMW_SATP_LOCAL -d mpx.vmhba1:C0:T0:L0 -o enable_ssd
esxcli storage nmp satp rule add -s VMW_SATP_LOCAL -d mpx.vmhba4:C0:T0:L0 -o enable_ssd
esxcli storage nmp satp rule add -s VMW_SATP_LOCAL -d mpx.vmhba3:C0:T0:L0 -o enable_ssd
esxcli storage core claiming reclaim -d mpx.vmhba2:C0:T0:L0
esxcli storage core claiming reclaim -d mpx.vmhba1:C0:T0:L0
esxcli storage core claiming reclaim -d mpx.vmhba4:C0:T0:L0
esxcli storage core claiming reclaim -d mpx.vmhba3:C0:T0:L0
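With more disks than this, typing each pair of commands by hand gets tedious. The commands above follow a fixed pattern, so they can be generated from a list of device IDs. This is a minimal sketch, not something from the original build: `print_ssd_commands` is a hypothetical helper name, and it only prints the commands so you can review them before pasting them into the host.

```shell
# Print (do not run) the SATP rule and reclaim commands for each device ID.
# The rules are printed first and the reclaims second, matching the order above.
print_ssd_commands() {
    for dev in "$@"; do
        echo "esxcli storage nmp satp rule add -s VMW_SATP_LOCAL -d $dev -o enable_ssd"
    done
    for dev in "$@"; do
        echo "esxcli storage core claiming reclaim -d $dev"
    done
}

# Example:
#   print_ssd_commands mpx.vmhba1:C0:T0:L0 mpx.vmhba2:C0:T0:L0
```

Printing rather than executing is deliberate: a wrong device ID in a claim rule is easier to spot on screen than to undo afterwards.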
Then check the disks are marked as SSD on the host.
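One way to check from the command line is to look at the `Is SSD` field that `esxcli storage core device list` reports for each device. The sketch below condenses that output to one line per disk. It assumes the usual layout, where the device identifier sits on an unindented line and its properties are indented beneath it, and that `awk` is available in the host's shell. `ssd_report` is a hypothetical helper name.

```shell
# Read `esxcli storage core device list` output on stdin and print
# "<device> <true|false>" for every device that reports an "Is SSD" field.
ssd_report() {
    awk '
        /^[^ ]/      { dev = $1 }        # unindented line: device identifier
        /^ +Is SSD:/ { print dev, $NF }  # indented property line: true or false
    '
}

# On the host:
#   esxcli storage core device list | ssd_report
```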
Moving on, back to the VCF builder. Tick all the boxes, click next, accept the EULA, click next.
You will end up here, where a configuration file has to be uploaded.
Download the Excel spreadsheet. It looks like this. All fairly self-explanatory. You can also see why I have not deviated too much from the standard VLAN IDs, naming conventions, etc.
Note that you must have license keys for all of the components shown below. The SDDC-Manager key is not included as part of the vExpert license pack, but I was assured the deployment should work without one. (It does.)
Validation failure on the first try!
Summary of failures
- Password policy not correct for SDDC-Manager
  - Updated the password in the spreadsheet
- DNS records not created for all VMs
  - Created DNS records for all components listed in the configuration spreadsheet on VCFDC01
- Network connectivity validation
  - States the hosts are not accessible, although they are. One to check out
- vSAN datastore validation
  - Boot disks are smaller than 16GB and need to grow. I have amended previous parts of this blog series to reflect the need for a minimum 16GB disk
- NTP configuration wrong
  - Can’t use an external NTP provider on the ESXi hosts or VCF builder, so changed to an internal NTP source
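The DNS failure in particular is cheap to catch before running validation again. Here is a minimal sketch, assuming you run it from a machine that uses VCFDC01 as its DNS server and that your `nslookup` exits non-zero when a name does not resolve. `missing_dns` and the `LOOKUP` override are hypothetical names of my own, and the hostnames in the example are placeholders for whatever is in your configuration spreadsheet.

```shell
# Print every hostname from the argument list that fails to resolve.
# LOOKUP can be set to a different resolver command; it defaults to nslookup.
missing_dns() {
    for name in "$@"; do
        ${LOOKUP:-nslookup} "$name" >/dev/null 2>&1 || echo "$name"
    done
}

# Example (placeholder names):
#   missing_dns sddc-manager.lab.local vcenter.lab.local nsx-manager.lab.local
```

An empty result means every name resolved. Anything printed still needs a record creating.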
The test did pass for licenses without having an SDDC-Manager license defined.
After addressing most of the problems above the validation test looks like this.
There are still issues with network connectivity validation, the vSAN check is complaining about cache size on the OS disk, and NTP is still failing on the VCF builder VM. I will accept the errors and move on by clicking acknowledge at the top right of the screen.
This allows me to start the bring up process, which is to configure and deploy everything we previously defined.
This then starts a checklist of tasks to run through.
As the process moves on, we start to see items marked as successful for deployment. Below, two Platform Services Controllers and a vCenter Server Appliance have been deployed on management host 1.
A vSAN datastore is created on host 1 for vCenter to reside on before the full cluster is created. Hopefully, the disk type will show as SSD and not unknown post-deployment.
The deployment proceeds as far as the NSX deployment and then fails.
Logging into the vCenter server reveals that the OVF file could not be deployed.
I tried to create an empty folder on the vSAN datastore and hit the same issue: I could not create any files on the vSAN datastore.
The first hit in Google for the error was this VMware KB, but unfortunately, this did not resolve the issue. A little further testing led me to discover that I could not send a jumbo packet with vmkping to the vSAN IP address on host 01, but this worked for the other 3 hosts. Could it be a connectivity issue?
I powered the whole cluster down and brought it back online. After this, I could get a jumbo packet vmkping to the vSAN IP on host 01 and also noticed that the vSAN datastore was now reporting the correct amount of storage available. I suspect there were communication failures between the vSAN nodes causing issues with OVF deployment.
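For anyone repeating the test, a jumbo frame check with vmkping needs a payload that fills the MTU once the headers are added, plus the don't-fragment flag so an oversized frame fails instead of being split. The sketch below builds that command. It assumes an MTU of 9000 on the vSAN network, which the build above does not state explicitly, and the interface name and IP address in the example are placeholders. `jumbo_ping_cmd` is a hypothetical helper name.

```shell
# Build a vmkping command that tests a given MTU end to end.
# Payload size = MTU - 20 bytes (IPv4 header) - 8 bytes (ICMP header).
# -d sets the "do not fragment" bit; -I picks the VMkernel interface.
jumbo_ping_cmd() {
    mtu=$1      # e.g. 9000 (assumed jumbo MTU)
    vmk=$2      # VMkernel interface carrying vSAN traffic (placeholder)
    target=$3   # vSAN IP of the remote host (placeholder)
    echo "vmkping -I $vmk -s $((mtu - 28)) -d $target"
}

# Example:
#   jumbo_ping_cmd 9000 vmk2 192.168.10.11
```

If the printed command fails while a plain vmkping to the same address works, something in the path is not passing jumbo frames.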
Creating an empty folder following the power cycle worked as expected.
The next step was NSX deployment. Most of this completed, except that one of the hosts purple screened during deployment, bringing down vSAN again. Unfortunately, this led to the vCenter Server Appliance becoming orphaned, which is the one VM you do not want to become orphaned on the vSAN datastore. There are a number of troubleshooting articles on checking vSAN state from the Ruby vSphere Console (RVC), but RVC runs on the vCenter appliance.
The VM in question on host 2.
After much checking of the state of vSAN from ESXCLI, I decided to destroy the hosts and start again.
TL;DR: each node in the vSAN cluster reported itself as the Master node, rather than the cluster having one Master, one Backup and the rest as Agent nodes. Try as I might, I could not remedy this issue. I did have some great help from the guys in the vSAN channel in the vExpert Slack, but my problem was one that had not been seen before :(.
Time to redeploy the 4 vSAN hosts and the VCF builder VM and start again. For the failed attempts above, I had two disk groups defined per host with 6 disks in use. Subsequent host deployments were reduced to a single disk group per host. Earlier blog posts on deploying hosts have been amended to reflect this setup.
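The role each host believes it holds is visible in the output of `esxcli vsan cluster get`, in the `Local Node State` and `Sub-Cluster Member Count` fields. As a rough sketch of how to compare hosts quickly, the helper below reduces that output to a single line. `vsan_role_summary` is a hypothetical name, and it assumes the `Field: value` layout that command normally prints.

```shell
# Read `esxcli vsan cluster get` output on stdin and print the local node's
# role together with how many members it can see in its sub-cluster.
vsan_role_summary() {
    awk -F': ' '
        /Local Node State:/         { state = $2 }
        /Sub-Cluster Member Count:/ { count = $2 }
        END { print "state=" state, "members=" count }
    '
}

# On each host:
#   esxcli vsan cluster get | vsan_role_summary
```

In a healthy 4-node cluster you would expect one MASTER, one BACKUP and two AGENT results, each with a member count of 4. Every host reporting MASTER suggests the nodes cannot see each other.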
Next crack at deploying VCF, yeah boi, success!
And SDDC Manager is accessible. Happy days.
A lot of time went into making this work. vSAN gave me the biggest headache: I deployed the cluster 4 times before deciding to drop back to a single disk group per host.
The next post will focus on making sure all the services are working correctly and addressable from SDDC Manager. Check it out in Part 6: SDDC Manager configuration.