The Server From Hell Ate My Weekend
No good deed goes unpunished, they say, and they are right! I recently lost a valued customer, just through changing circumstances and the fact that they will be taking their support in-house. Obviously I want to keep the relationship going so I tried to throw out a few ‘teasers’ to make them think about why they might still need me. The server, a low-end Dell running SBS 2003 Standard, lost half of its RAID-1 mirror a few months back and I replaced the drive, rebuilt the mirror and all was well. So I floated the idea that since the drives were installed as a matched pair, it might be reasonable to expect the other one to fail in the coming months. Well, this backfired on me big time. “You’d better replace it for us now”, they said, “before our support contract expires”. How did I not see that coming? <smacks palm to forehead>
Now, understand that there is no RAID controller in this server. It is a _budget_ Dell, using Windows’ software RAID. It is a few years old and the original documentation and installation media are long lost. It was Friday afternoon and the customer’s workforce all went home half an hour early, leaving me the key and the alarm code. Well, armed with an array of recent backups, I thought “20-minute job – what can possibly go wrong?”. You just know how this is going to end, don’t you?
I confidently broke the mirror, shut down, popped out the ‘suspect’ drive and popped in the new one. One non-booting server. Oh man, I tried everything to get that sucker to boot. This drive in, that drive out, Recovery this, repair that, diskpart, WinPE, nothing doing. I gave up around 9pm and headed home.
Having slept on the problem, I decided the way to go was to start with a clean slate. So Saturday began with a frenzy of downloading Dell utilities, service packs, ISO images for this, that and the other. The plan: wipe the drives, reinstall the base OS and restore from backup. Great plan, followed Microsoft’s instructions to the letter. Didn't work. The base OS installed fine; I interrupted setup before the SBS 2003 phase, rebooted into Directory Services Restore Mode and restored all files and system state from the disk-based backup. The whole process takes about 2 hours and when it is complete, the server gets to the boot loader screen then just reboots (no blue screen, just reboot). So, try again, some slight variation on the process, try a different utility, install a service pack, waste another 2 hours. After a few attempts, it was getting boring really quickly and daylight was fading fast. I posted on ServerFault and got some ideas from the guys there – I had high quality responses literally within minutes. One of the most useful ideas to come out of that was to do a clean install then use ImageX to capture a WIM image of it. This reduced the time for each attempt drastically, as I could get back to square one in a matter of 5 minutes or so. Another suggestion was to restore only “\Windows”, “\Program Files” and System State until we worked out what was going wrong. This allowed much more efficient experimentation.
Into Sunday, no sleep, I’m tired and starting to panic now. This thing has to be back online by 9am Monday morning, it’s Sunday, there’s no-one I can call for support, I’m on my own. Finally, around Sunday noon, bleary-eyed and exhausted, I hit the magic combination and she booted up and everything came back perfect. In the end, the final ingredient was to ignore Microsoft’s recommended restore settings and choose “Overwrite only if existing file is older”. Quick check that the internet is connected and email is flowing, adios, vaya con Dios, amigos!
I learned/reaffirmed a number of home truths from this experience.
- Software RAID is for toy servers. Always use a hardware RAID controller. If the customer will not pay for it, exclude support for disk faults from your support contract.
- However good an idea it might seem, if it ain’t broke, don’t fix it!
- Don’t break a software-RAID mirror before swapping a disk. Shut the server down and pull out one of the disks instead. The mirror will be ‘at risk’ but at least you’ll still have two mirror-halves to boot from.
- Make sure you’ve got all the original installation media, utilities, drivers and service packs that you need, and don’t start a job like this on Friday night.
- It would probably be best to thoroughly understand the Windows boot process before tackling a job like this. In fact, it would be better to migrate to a new server with properly specified storage.
- And finally… Don’t tell the customer what might go wrong with their server in the last week of your support contract! D’oh!