Murphy is my hero

I’m a big believer in Murphy’s Law… especially when it comes to computers. If you work with computers long enough, you learn to expect the unexpected.

Case in point:

My company has a server. It’s got a crappy Adaptec RAID card which is smart enough to tell me that one of the four drives has a problem, but not WHICH drive. The only solution was to get four new drives, replace all the existing drives and get rid of the crappy Adaptec card.

Problems:

  • RAID controller doesn’t seem to like mixing the old SATA1 and the new SATA2 drives. It picks a random number of drives (0-4) to mark as failed during boot, and it takes about 5-10 reboots before it decides all the drives are good. Since the system takes forever to initialize its BIOS and the card’s BIOS, that’s about 7-15 minutes.
  • RAID controller won’t allow the OS to see the individual drives unless they’re part of a RAID array. Reboot 10 more times after marking the four drives as “raw volumes”.
  • While copying files from the old RAID array to the new software RAID array, a drive fails in the original array, which slows down the copy.
  • GRUB segfaults when trying to write the MBR to the new drives.
  • My CD copy of Recovery is Possible is apparently too scratched up to be useful. Must re-download/burn an ISO.
  • This whole process was supposed to take about 7hrs max. Heh.
  • A second drive from the original array has failed, so that array is now completely dead and the data on it unrecoverable. I no longer have a fail-safe plan.
  • The Marvell SATA chipset on the motherboard isn’t supported by my kernel, so I have to keep using the crappy Adaptec SATA RAID card (without the RAID).
  • Oh, looks like I’m going to miss a party since this is taking so long.
  • Yep, missing the party. Also, apparently CentOS 4.1’s initrd doesn’t support root on LVM2, which is bad since my root partition is on RAID5/LVM2.
  • Of course, it helps if you install the GRUB boot loader in the MBR, not a partition. Doh. (The GRUB commands are sketched just after this list.)
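
For reference, putting GRUB in the MBR instead of a partition boot sector looks roughly like this from the GRUB legacy shell. The device names are my assumptions about the layout (first disk, with /boot/grub on its first partition), so adjust to match yours:

    grub> root (hd0,0)
    grub> setup (hd0)
    grub> quit

The root command points GRUB at the partition holding /boot/grub, and setup (hd0) writes stage1 into the MBR of the whole disk; setup (hd0,0) would have put it in the partition’s boot sector, which is the mistake I made.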

Final (temporary) solution:

  • Put one of the old drives back in the system. Yes, this causes disk detection errors at boot, but I need another disk.
  • Make this drive the boot/root filesystem disk.
  • This allows me to boot off the hard drives (yea!) and get a root filesystem running so I can start LVM2 and mount /home and /var.
  • At some point, the software RAID got confused and marked one partition offline in a RAID1 array and one in a RAID5 array. Now I get to resync them (should only take about 500 minutes according to /proc/mdstat); the cleanup commands are sketched below.
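
For anyone curious, the cleanup looked something like this. The volume group, md device, and partition names below are placeholders (I’m going from memory), so match them against what /proc/mdstat and vgscan actually report:

    # Bring the LVM2 volume group back online and mount the filesystems
    vgchange -ay
    mount /dev/vg0/home /home
    mount /dev/vg0/var /var

    # Check which arrays are degraded, then hot-add the partitions that
    # were marked offline so md starts resyncing them
    cat /proc/mdstat
    mdadm /dev/md0 --add /dev/sdb1    # the RAID1 array
    mdadm /dev/md1 --add /dev/sdb2    # the RAID5 array

    # Watch the resync progress (this is where the 500-minute estimate comes from)
    watch cat /proc/mdstat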

Total time: 10am-5pm and 7-10pm on Saturday, 9am-6pm on Sunday.
