The One Problem with the VCSA
Over the past couple of months I noticed a trend in my top blog daily reporting…the Quick fix post on fixing a 503 Service Unavailable error was constantly in the top 5 and getting significant views. The 503 error in various forms has been around since the early days of the VCSA which usually manifests it’s self with the following.
503 Service Unavailable (Failed to connect to endpoint: [N7Vmacore4Http20NamedPipeServiceSpecE:0x0000559b1531ef80] _serverNamespace = / action = Allow _pipeName =/var/run/vmware/vpxd-webserver-pipe)
Looking at the traffic stats for that post it’s clear to see an upward trend in the page views since about the end of June.
This to me is both a good and bad thing. It tells me that more people are deploying or migrating to the VCSA which is what VMware want…but it also tells me that more people are running into this 503 error and looking for ways to fix it online.
The Very Good:
The vCenter Server Appliance is a brilliant initiative from VMware and there has been a huge effort in developing the platform over the past three to four years to get it to a point where it not only became equal to vCenter’s deployed on Windows (and relying on MSSQL) but surpassed it in a lot of features especially in the vSphere 6.5 release. Most VMware shops are planning to or have migrated from Windows to the VCSA and for VMware labs it’s a no brainer for both corporate or homelab instances.
Personally I’ve been running VCSA’s in my various labs since the 5.5 release, have deployed key management clusters with the VCSA and more recently have proven that even the most mature Windows vCenter can be upgraded with the excellent migration tool. Being free of Windows and more importantly MSSQL is a huge factor in why the VCSA is an important consideration and the fact you get extra goodies like HA and API UI’s adds to it’s value.
The One Bad:
Everyone who has dealt with storage issues knows that it can lead to Guest OS file systems errors. I’ve been involved with shared hosting storage platforms all my career so I know how fickle filesystems can be to storage latency or loss of connectivity. Reading through the many forums and blog posts around the 503 error there seems to be a common denominator of something going wrong with the underlying storage before a reboot triggers the 503 error. Clicking here will show the Google results for VCSA + 503 where you can read the various posts mentioned above.
As you may or may not know the 6.5 VCSA has twelve VMDKs, up from 2 in the initial release and to 11 in the 6.0 release. There a couple of great posts from William Lam and Mohammed Raffic that go through what each disk partition does. The big advantage in having these seperate partitions is that you can manage storage space a lot more granularly.
The problem as mentioned is that the underlying Linux file system is susceptible to storage issue. Not matter what storage platform you are running you are guaranteed to have issues at one point or another. In my experience Linux filesystems don’t deal will with those issues. Windows file systems seem to tolerate storage issue much better than their Linux counterparts and without starting a religious war I do know about the various tweaks that can be done to help make Linux filesystems more resilient to underlying storage issues.
With that in mind, the VCSA is very much susceptible to those same storage issues and I believe a lot of people are running into problems mainly triggered by storage related events. Most of the symptoms of the 503 relate back to key vCenter services unable to start after reboot. This usually requires some intervention to fix or a recovery of the VCSA from backup, but hopefully all that’s needed is to run an e2fsck against the filesystem(s) impacted.
VMware are putting a lot of faith into the VCSA and have done a tremendous job to develop it up to this point. It is the only option moving forward for VMware based platforms however there needs to be a little more work done into the resiliency of the services to protect against external issues that can impact the guest OS. PhotonOS is now the OS of choice from 6.5 onwards but that will not stop the legacy of susceptibility that comes with Linux based filesystems leading to issues such as the 503 error. If VMware can protect key services in the event of storage issues that will go a long way to improving that resiliency.
I believe it will get better and just this week VMware announced a monthly security patch program for the VCSA which shows that they are serious (not to say they where not before) about ensuring the appliance is protected but I’m sure many would agree that it needs to offer reliability as well…this is the one area where the Windows based vCenter has an advantage still.
With all that said, make sure you are doing everything possible to have the VCSA housed on as reliable as possible storage and make sure that you are not only backing up the VCSA and external dependancies correctly but understand how to restore the appliance including understanding of the inbuilt backup mechanisms for backing up the config and the PostGres database.
I love and would certainly recommend the VCSA…I just want to love it a little more without having to deal with possibility of having the 503 server error lurking around every storage event.