April 10, 2014

Good Time for a Server to Crash?

The MAF phone system is a Voice over IP (VoIP) telephony system.  Several of the functions that it relies on run on separate servers.

In June of 2013, the MAF phone system was upgraded to new software, fixing bugs and adding new features.  The following month, we set up an ESXi 5.1 server to virtualize some of the servers that the phone system was using.  We already had a VMware infrastructure (a High Availability cluster of ESX 3.5 servers on Dell PowerEdges), but it was too old to be supported by our phone vendor, Mitel.  We were able to reuse an older, out-of-warranty server for these functions, since they were not critical systems and could tolerate a fair amount of downtime.  The older server was a Dell PowerEdge 1950, which supported the ESXi 5.1 platform (which Mitel did support).  We installed two virtual machines (VMs), a phone conference bridge VM and a VM providing connectivity for particular offsite phones, and they served wonderfully for the next year.

As that High Availability cluster is now quite old (in computer years), it was time to migrate away from it.  Generally, MAF has moved toward Hyper-V for our medium-sized-business needs, where the non-profit pricing is a great advantage over VMware.  However, we still needed some VMware infrastructure, as that is the only platform supported by our phone vendor.  So we budgeted for a new server this year, and ordered it in May.  When it arrived, we had some additional Dell hard drives that we could use with it, but they had come from an MD3000 storage SAN, which uses different sleds, so we ordered matching hard drive sleds/trays.  On May 22nd, when the sleds arrived, I installed them and provisioned the new server's RAID storage.  I installed the latest VMware software, ESXi 5.5, and had the machine ready for service.

Moving the phone system VMs onto the new server would be logistically problematic, as part of the service they provide is for certain offsite phones.  We would need to hunt through all the phones, determine which users were using those particular phones, and contact them to apprise them of the downtime.

That evening at 5:25pm MST, our automated alerting system chimed an alert message.  Some of the phone servers were down.  A quick look at our monitoring dashboard revealed that the VM host for the phone VMs had crashed.  I drove over to MAF and descended upon our air-conditioned server room (in my flip flops).  The front console of the Dell PowerEdge stated:  E1420 CPU Bus PERR.  A CPU bus parity error.  Not a good thing in the life of a server, and a rarity when dealing with enterprise-grade server equipment.  I cold cycled the power on the server, and it booted up fine with no errors.

I immediately began migrating the two phone server VMs to the new VMware server, which had been set up just hours earlier.  The migrations finished within an hour of the crash.  The VMs booted up fine, and have been serving the organization flawlessly.

This was serendipitous timing of the super-ordinary (which might be more commonplace at MAF): it was effectively after business hours for MAF (probably including most of the Pacific Time users).  It also gave us an unplanned window to migrate the VMs to the new platform, and it spared us from combing through our phone list records to find the specific users of the affected phones.  The migration would probably have required an off-hours trip to the server room (in a planned maintenance window) anyway, so the timing worked out quite well.
