Tue 02 of Dec, 2008 [00:23 UTC]

voip-info.org

The Great Ides of March VOIP-Info Outage

Created by: admin, Last modification on Sat 24 of Mar, 2007 [09:44 UTC]
There were a lot of questions about why voip-info was down for a couple of days in mid-March 2007. The server had been up for almost 6 months before this event, so the outage was unexpected.

First, let me say that the extended length of the outage was entirely due to not having a hot standby backup server. I will be fixing this. Having mirrors available would also have mitigated the effects of the outage. There were many kind offers of mirrors during the outage, and hopefully some of these will be set up. See: Voip-Info Mirrors

I share the details below solely for their amusement and educational value, not as excuses for the outage or its length.

  • As background: the server hardware is owned by voip-info and coloed at Net Actuate. The server is a dual Opteron 248 with 4GB of RAM and four 300GB SATA drives in a RAID 10 array. At the time of the outage it was about 8 months old.
  • The outage started on the 13th — a day when I was traveling without phone and Internet access for most of the day, so one day lost.
  • On learning of the outage early on the 14th, I immediately checked the RAID controller status and saw that 2 of the 4 drives in the RAID array had ECC errors, and one had totally vanished from the array.
  • I immediately contacted the hardware support vendor and opened a ticket with Net Actuate.
  • We tried various resets and reboots and managed to get drive 4 back into the array, but the RAID rebuild stalled.
  • We find out that it will be overnight before we can get new warranty replacement drives. Net Actuate offers to drive to Fry's and buy new drives for me. I decline: the box still seems to be alive, and I have hopes of reviving it and limping along until the new drives arrive.
  • I try a non-destructive badblock scan, and we immediately lose the first drive out of the array. This means that we have now seen errors of one kind or another on all 4 drives.
  • Decide to wait for new drives in the morning.
  • On the 15th Mark from Net Actuate drives to the vendor and picks up 4 brand new replacement drives. After installing them, disaster strikes again: they don't work even as well as the old ones.
  • Move to Plan B (which I obviously should have done much sooner): Net Actuate loans me a high-end server and I restore a DB copy pulled from the old server.
  • The copy of the DB from the old server is corrupt. I pull the previous day's copy from the off-site backup site, and it's corrupt too.
  • Have to go back to March 11 to get an uncorrupted DB backup.
  • Voip-info is back online!
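
The walk back through corrupt backups described above could be automated along these lines. This is only a minimal sketch, assuming the DB backups are gzip-compressed dumps whose file names sort by date (for example `db-2007-03-11.sql.gz`). The file naming and the `is_intact` and `newest_intact_backup` helpers are hypothetical and were not part of the actual voip-info setup.

```python
import gzip
import zlib
from pathlib import Path
from typing import Iterable, Optional


def is_intact(path: Path) -> bool:
    """Return True if the gzip file decompresses fully without error."""
    try:
        with gzip.open(path, "rb") as fh:
            # Read to the end so truncation and CRC errors are detected.
            while fh.read(1024 * 1024):
                pass
        return True
    except (OSError, EOFError, zlib.error):
        return False


def newest_intact_backup(paths: Iterable[Path]) -> Optional[Path]:
    """Walk backups from newest to oldest (by file name); return the first intact one."""
    for path in sorted(paths, key=lambda p: p.name, reverse=True):
        if is_intact(path):
            return path
    return None
```

A check like this only shows that the archive itself is readable. It does not prove the dump inside is a consistent database, so a periodic test restore is still worthwhile.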

The old server remains down. The theory now is that it may have a bad SATA backplane or RAID controller.
Hopefully it will be restored to health soon. I'll update this page when we find out what was wrong.

2007-03-18: Update, the old server is alive again. It took replacing all 4 disk drives and the RAID controller to restore it to life. We are loading the O/S and will transfer the site back to it once everything is set up.

2007-03-23: Update, Voip-Info.org is once again running on the old (original) server.

One of our theories about the disk drives was that they were suffering from the bug described here: Western Digital Firmware Bug.
This may have been part of the problem, as drives vanishing from a RAID array and only reappearing after a power cycle is a common symptom of this bug.

Thanks

I'd like to thank Mark at Net Actuate for his tireless support. He was
competent, helpful, and continually went the extra mile for me.
If you are looking for a colo site, consider them.

Comments

Mirror

by r3dn3ck, Saturday 17 of March, 2007 [16:46:51 UTC]
If you need one, all you need to do is ask.

Power Supplies . . . Check em

by r3dn3ck, Saturday 17 of March, 2007 [16:44:56 UTC]
RAID 10 failure on all drives is hard to believe. We have only had one drive fail at a time . . . EVER. However, we have lost two, three, and four drives at a time with a bad power supply. I would definitely check the power supply.

Mirror

by psycox, Friday 16 of March, 2007 [20:50:28 UTC]
If you are interested in having a publicly available mirror / load balancer hosted, drop us a line. We are a large private ISP and use the site often, and would hate to see you have any further problems due to having only a single point of failure. Contact info should be on our site, since I'm not sure if email info is allowed in these comments.

We offer colo if wanted, but it sounds like just a hosted site would suffice and be less hassle for you =)

www.isdn.net

Mirror

by evert, Friday 16 of March, 2007 [13:37:58 UTC]
I hereby also offer services as mirror. Located in NH, USA :-)

by pschenkeveld, Friday 16 of March, 2007 [08:19:39 UTC]
Not trying to be a wise guy, but I suggest looking at the power supply as well. Your description of the problem, including the 4 new drives not working either, matches my experience with many cases of failing power supplies.

Being unable to visit voip-info.org for a couple of days underlined its great value, so I'd like to take this opportunity to thank voip-info and all helping hands for bringing us this invaluable service!