The Great Ides of March VOIP-Info Outage

There were a lot of questions about why voip-info was down for a couple of days in mid-March 2007. The server had been up for almost 6 months before this event, so the outage was unexpected.

First let me say the extended length of the outage was totally due to not having a hot standby backup server. I will be fixing this. Having mirrors available would also have mitigated the effects of the outage. There were many kind offers of mirrors during the outage, and hopefully some of these will be set up. See: Voip-Info Mirrors

I share the details below solely for their amusement and educational value, not as excuses for the outage or its length.

  • As background: the server hardware is owned by voip-info, and coloed at Net Actuate. The server is a dual Opteron 248 with 4GB of RAM and four 300GB SATA drives in a RAID 10 array. At the time of the outage it was about 8 months old.
  • The outage started on the 13th — a day when I was traveling without phone and Internet access for most of the day, so one day lost.
  • On learning of the outage early on the 14th, I immediately checked the RAID controller status and saw that 2 of the 4 drives in the RAID array had ECC errors, and one had totally vanished from the array.
  • I immediately contacted the hardware support vendor, and opened a ticket with Net Actuate.
  • We tried various resets and reboots and managed to get drive 4 back into the array, but the RAID rebuild stalled.
  • We find out that it will be overnight before we can get new warranty replacement drives. Net Actuate offers to drive to Fry's and buy new drives for me. I decline; the box still seems to be alive and I have hopes of reviving it and limping along until the new drives arrive.
  • I try a non-destructive badblock scan, and we immediately lose the first drive out of the array. This means that we have now seen errors of one kind or another on all 4 drives.
  • Decide to wait for new drives in the morning.
  • On the 15th, Mark from Net Actuate drives to the vendor and picks up 4 brand new replacement drives. After installing them, disaster strikes again: they don't work even as well as the old ones.
  • Move to Plan B (which I obviously should have done much sooner): Net Actuate loans me a high-end server and I restore a DB copy pulled from the old server.
  • The copy of the DB from the old server is corrupt. Pull the previous day's copy from the off-site backup site; it's corrupt too...
  • Have to go back to March 11 to get an uncorrupted DB backup.
  • Voip-info is back online!
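
The walk back through the backups in the last few steps is the kind of thing that can be scripted ahead of time. Here is a minimal sketch, assuming the backups are gzip-compressed dumps whose file names sort by date (for example db-2007-03-11.sql.gz). The naming scheme and the function names are hypothetical and are not how the voip-info backups were actually laid out. It only checks that each archive decompresses cleanly, which catches truncated or damaged files but would not catch a dump taken from an already-corrupt database.

```python
import gzip
import zlib
from pathlib import Path
from typing import Iterable, Optional


def gzip_is_intact(path: Path) -> bool:
    """Return True if the whole file decompresses and its checksum matches."""
    try:
        with gzip.open(path, "rb") as fh:
            # Read to the end in chunks; the gzip CRC is verified at end of stream.
            while fh.read(1024 * 1024):
                pass
        return True
    except (OSError, EOFError, zlib.error):
        # OSError covers "not a gzip file" and CRC mismatches,
        # EOFError covers truncated files, zlib.error covers damaged data.
        return False


def newest_intact_backup(paths: Iterable[Path]) -> Optional[Path]:
    """Walk the backups newest-first and return the first one that verifies."""
    # Assumption: file names embed an ISO-style date, so name order is date order.
    for path in sorted(paths, key=lambda p: p.name, reverse=True):
        if gzip_is_intact(path):
            return path
    return None
```

Called as newest_intact_backup(Path("/backups").glob("*.sql.gz")) (the directory is a placeholder), it returns the most recent archive that still reads cleanly, or None if none do. Running a check like this on a schedule would flag bad backups before they are needed.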

The old server remains down; the theory now is that it may have a bad SATA backplane or RAID controller.
Hopefully it will be restored to health soon. I'll update this page when we find out what was wrong.

2007-03-18: Update, the old server is alive again. It took replacing all 4 disk drives and the RAID controller to restore it to life. We are loading the O/S and will transfer the site back to it once everything is set up.

2007-03-23: Update, Voip-Info.org is once again running on the old (original) server.

One of our theories about the disk drives was that they were suffering from the bug described here: Western Digital Firmware Bug.
This may have been part of the problem, as drives vanishing from a RAID array and only reappearing after a power cycle is a common symptom of this bug.

Thanks

I'd like to thank Mark at Net Actuate for his tireless support. He was
competent, helpful, and continually went the extra mile for me.
If you are looking for a colo site, consider them.









Created by: admin, Last modification: Sat 24 of Mar, 2007 (09:44 UTC)

