Tuesday, 6:05pm: Today I took time out from work for a trip out to Kennedy (the eastern terminus of the main Toronto subway line) to visit Canada Computers (whose downtown store is out of commission, apparently because of a fire) and buy some G.SKILL memory to replace the Corsair memory that failed me. I was a little worried about this because the new modules are exactly the same spec as the old - DDR3 1333, CL9-9-9-24, 4x 2G modules, 1.5V. If the others didn't work, these shouldn't either. However, the others said they were "for Intel" and these say "for Intel or AMD."
And they do work - at least to the extent that the computer will POST with two of them inserted. Next, I insert the others and make sure it still does; and then I start plugging in hard drives. My plan (since I have five drive bays, four drives I want to use long-term, and two drives to recover) is to start with three new drives; build a degraded RAID array (technically, several) with those; then attach the two old drives, recover their data to the array, remove the old drives and put in the remaining new one, and resync the array.
But first I pour a drink (sparkling water, I've run out of alcohol) and maybe think about some tacos, because it's going to be a long night. The next several updates will be additions to this entry rather than new entries.
6:32: Works fine with all four DIMMs inserted. BIOS config looks good. It picked up the correct clock settings for the DIMMs automatically. I'm a little concerned that it doesn't seem to do a memory test (and show the size of memory) on bootup, but whatever. Next step: installing three SATA drives.
6:55: Installation of three SATA drives went without a hitch, and it boots fine from a Slackware install USB key. I think at this point, unless I'm forced to in order to get needed utilities, I probably won't even install an OS yet - I'll just do all my RAID building and file copying from the install key.
My plan is to split the big drives into four partitions:
- A small one, RAID1 (mirroring) for booting. This is the only RAID level I can boot from, assuming I want some reliability guarantees on my boot partition, because the boot loader doesn't do other RAID levels.
- Another small one, RAID0 (striping) for swap. The idea is that if a disk fails, I lose this, but I can run comfortably without swap (especially now that I have 8G of RAM!) while I deal with the failed disk.
- A medium-size one of RAID6 (double parity). This will preserve data even with two drives failed - or one failed and one missing, so I can have some reliability guarantee even during my upcoming data recovery process. In the longer term, my home directory and current work will go here - stuff that I can't expect to be backed up yet.
- The rest of the drives (most of their capacity) RAID5, for archive kinds of things.
The RAID5 and RAID6 partitions will be LVM "physical volumes" that I can further subdivide to store a variety of different filesystems, rearrange those, and so on. So the first step is to create this partition structure on all three drives. Then I'll build the degraded-RAID6 array, an LVM volume group over it, hook up the IDE drives from opal, build logical volumes to store the recovered data, recover the data, and then remove the IDE drives, attach the fourth SATA drive, un-degrade the RAID6, and start work on the rest of the system. Whew!
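As a rough illustration of the trade-offs in that plan, here is a minimal Python sketch of how much usable space each RAID level yields and how many failed devices it survives. The function name `raid_summary` is made up for this example, and the partition size is given as 1 unit because the entry does not state real sizes; the results are therefore multiples of one per-drive partition.

```python
def raid_summary(level, n_devices, device_size):
    """Return (usable_capacity, failures_tolerated) for a healthy array.

    device_size is the size of the partition contributed by each drive,
    in whatever unit you like. Hypothetical helper, for illustration only.
    """
    if n_devices < 1:
        raise ValueError("need at least one device")
    if level == "raid0":
        # Striping: all space is usable, nothing is redundant.
        return n_devices * device_size, 0
    if level == "raid1":
        # Mirroring: every device holds a full copy.
        return device_size, n_devices - 1
    if level in ("raid5", "raid6"):
        # Parity: one (RAID5) or two (RAID6) devices' worth of space
        # goes to redundancy.
        parity = 1 if level == "raid5" else 2
        if n_devices <= parity:
            raise ValueError("too few devices for " + level)
        return (n_devices - parity) * device_size, parity
    raise ValueError("unknown RAID level: " + level)


# Four drives, each contributing a partition of size 1 (unit is arbitrary).
for level in ("raid1", "raid0", "raid6", "raid5"):
    capacity, tolerated = raid_summary(level, 4, 1)
    print(level, "capacity:", capacity, "failures tolerated:", tolerated)
```

For the four-drive RAID6 this gives a capacity of 2 units and 2 tolerated failures. Running it with one of the four drives missing uses up one of those two, which is why the degraded array can still survive one failure during the recovery.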
7:23: Okay, I partitioned the drives and the RAID6 array is created (a near-instantaneous process - most of this time was spent choosing the RAID options to use) and now it's automatically rebuilding (to bring the first parity disk online). I think I could, technically, go ahead and start using it immediately, but I'm going to wait for it to finish, both so I benefit from the parity-disk reliability when I start copying data onto it, and so I can use the time to make tacos. The /proc/mdstat file estimates 80 minutes to complete the rebuild.
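The progress line in /proc/mdstat can also be picked apart by a script instead of read by eye. The sketch below is a minimal Python example that assumes the usual layout of that line (an operation name, a percentage, and a `finish=...min` estimate). The function name `rebuild_progress` is invented here, and the `SAMPLE` text is made-up illustrative data, not output from this machine.

```python
import re

# Matches lines such as:
#   [======>....]  resync = 32.0% (335/1048) finish=54.4min speed=21845K/sec
_PROGRESS = re.compile(
    r"(resync|recovery|reshape|check)\s*=\s*([\d.]+)%.*?finish=([\d.]+)min"
)


def rebuild_progress(mdstat_text):
    """Return a list of (operation, percent_done, minutes_left) tuples,
    one per array that is currently being synced."""
    return [
        (m.group(1), float(m.group(2)), float(m.group(3)))
        for m in _PROGRESS.finditer(mdstat_text)
    ]


# Made-up sample in the style of /proc/mdstat (device names are hypothetical).
SAMPLE = """\
md0 : active raid6 sdc3[2] sdb3[1] sda3[0]
      209715200 blocks level 6, 64k chunk, algorithm 2 [4/3] [UUU_]
      [======>..............]  resync = 32.0% (33554432/104857600) finish=54.4min speed=21845K/sec
"""

for operation, percent, minutes in rebuild_progress(SAMPLE):
    print(f"{operation}: {percent:.1f}% done, about {minutes:.0f} minutes left")
```

On a live system the text would come from reading /proc/mdstat itself rather than from `SAMPLE`. Arrays that are only queued (shown as DELAYED) have no percentage and are skipped by this pattern.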
7:50: The RAID rebuild is at 32%, and the taco meat is simmering on the stove. While I was working on that I noticed that the message light on my phone answering machine was flashing, and it turns out there was a message at 10:30 this morning from someone at TigerDirect asking me to call him back. My guess is this was triggered by my having filed for (and gotten) an RMA number for the non-working modules last night. I called back but got voicemail, this being well after the close of business. I'll try again tomorrow. It will be interesting to see what he has to say for himself. I could see it going either way. The fact that I do have some modules that work now makes me feel a lot less generally stressed and angry than I would be if I were computerless the whole time; on the other hand, the fact that these modules that work are exactly the same specification as (and noticeably cheaper than) the TigerDirect modules strengthens my case that the modules from TigerDirect are not all they should be and that that is not entirely my fault.
By the way, Linux is currently reporting 4G of RAM even though I'm supposed to have 8G installed. I sure hope that's because the kernel option for allowing more is turned off in this installer kernel.
8:58: The rebuild is complete. Now to put in the IDE drives from the old machine, and see if they're still readable. This'll be the moment of truth for how much data I lose.
9:06: Rebooting with the IDE drives installed, the POST doesn't mention them, but then the LILO menu from the old computer comes up! That suggests the drives must be in very good shape. I shut it off in a hurry because I don't know what damage might be done to the system image by allowing this new, totally different, hardware to attempt booting those drives, and the kernel I was using on the old Pentium 4 certainly isn't going to work very well on this new Athlon II (though it's tempting to see just how far it would get). I'm going to have to force it to boot from the USB key and not those IDE drives. But this suggests the drives and data are probably in perfect condition. Yay!
9:26: Up and running with the IDE drives. They certainly appear to be in perfect condition. Each one contains two swap partitions (made in an era when Linux's swap partitions could be no bigger than 2G) and six "RAID autodetect" partitions. I don't remember exactly what the RAID configuration was on the old machine, but it doesn't matter: I'm going to set up LVM on my new degraded-RAID6 array, then copy each of the old arrays to a new logical volume, for later perusal and processing once I have the new arrays up and running and the IDE drives safely out of the system. They can serve as backups.
9:33: Oh, my. When I tried to "assemble" the new RAID as /dev/md0, I got a message that the device name was already in use - because the installer kernel had already used those "autodetect" flags to assemble all the old arrays from the old machine! I'm not sure I like that (what if the old arrays really were screwed up in a major way and an automatic "rebuild" damaged them further?), and it supports the documentation I read recently saying that "autodetect" is deprecated because it's dangerous... but it sure is convenient, and it's another good sign for the health of the drives.
10:17: All the LVM nonsense is set up, and now I'm running dd to copy all the stuff from the old RAID arrays to the new logical volumes, which are on the new RAID array. This will take a while, and doesn't produce much in the way of progress reporting, so about all I can do is watch for the "X blocks in/out" messages from dd as it finishes each partition. It'd be nice if it finishes before I have to go to bed, but I'm not counting on that; there is a lot of data to be copied.
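For a copy that reports progress as it goes, one option is a small script that does the same job as dd. The following Python sketch is only an illustration of the idea, not what was run here: `copy_with_progress` is a made-up name, and the demonstration uses in-memory buffers rather than real devices so that it is safe to run anywhere.

```python
import io


def copy_with_progress(src, dst, total_bytes, chunk_size=4 * 1024 * 1024,
                       report=print):
    """Copy everything from src to dst in chunks, reporting after each chunk.

    src and dst are binary file objects; total_bytes is the expected size
    and is used only for the percentage. Returns the number of bytes copied.
    """
    copied = 0
    while True:
        chunk = src.read(chunk_size)
        if not chunk:  # an empty read means end of input
            break
        dst.write(chunk)
        copied += len(chunk)
        percent = 100.0 * copied / total_bytes if total_bytes else 100.0
        report(f"{copied} of {total_bytes} bytes copied ({percent:.1f}%)")
    return copied


# Demonstration on in-memory data instead of real block devices.
source = io.BytesIO(b"x" * 10_000)
target = io.BytesIO()
copy_with_progress(source, target, 10_000, chunk_size=4096)
```

For a real array the two file objects would be the source and destination devices opened in binary mode, and the total size would have to be looked up separately. GNU dd itself will also print its running block counts if it is sent the USR1 signal from another terminal.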
It was touch and go for a little while there because mdadm was claiming not to recognize the RAID superblock on one of the drives in the new RAID. I'm not sure if that was fallout from earlier fumbling around trying to bring the new RAID online when I didn't know the old RAIDs were already online, or what. It seemed to be corrected by a reboot. Something to watch, though.
11:15: It just started copying the last of the old arrays. This is the largest one (accounting for roughly half of each of the old disks) and it's RAID0 instead of being RAID1 like the others - it's where I stored big multimedia files that were handy to have on my computer but could be recovered from other sources if necessary (CD and DVD rips, and so on). I don't know what the RAID0/1 distinction will do for the speed of copying; RAID0 can be read twice as fast (assuming the drives are the bottleneck and not, for instance, the IDE bus), but if the writing to the new RAID6 is the bottleneck, that won't help. But regardless of just how long it takes, I'm hoping it will finish early enough that I can pull the IDE drives, put in the last SATA drive, start the final rebuild of the new RAID arrays, and get at least a couple hours sleep.
I am planning to go to Waterloo tomorrow, which would mean a 5:30 wakeup time under ordinary circumstances, and unfortunately it does not look realistically possible to get tetsu into such a condition that I can work on it remotely over the Net, before I go. But if the initial RAID rebuild is going to take a few hours, it may be just as well to let it run while I'm gone and have it all nice and finished for me when I return.
11:45: While I'm waiting for the copy to finish, I investigated that matter of the memory being reported as 4G instead of 8G. Sure enough, the install kernel is set up to only see up to 4G; and that's reasonable, because the necessary option to make it support more will cause it to fail completely on some older machines that aren't capable of supporting more than 4G anyway. So it makes sense for an installation kernel, which must run anywhere, to enforce the limit. This does mean, though, that I won't be able to get really authoritative confirmation that my memory modules work 100%, until after I compile a new kernel for this machine.
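Once a kernel without the 4G limit is running, the amount of memory it sees can be checked from /proc/meminfo. This is a minimal Python sketch that assumes the usual `MemTotal: <number> kB` line; the function name `mem_total_gib` and the `SAMPLE` figures are invented for the example.

```python
import re


def mem_total_gib(meminfo_text):
    """Return the MemTotal value from /proc/meminfo-style text, in GiB."""
    match = re.search(r"^MemTotal:\s+(\d+)\s+kB", meminfo_text, re.MULTILINE)
    if match is None:
        raise ValueError("no MemTotal line found")
    return int(match.group(1)) / (1024 * 1024)  # kB -> GiB


# Made-up sample; real text would come from reading /proc/meminfo.
SAMPLE = "MemTotal:        8167848 kB\nMemFree:         7012345 kB\n"

print(f"Kernel sees about {mem_total_gib(SAMPLE):.2f} GiB of RAM")
```

The figure is normally a little below the installed amount, because the kernel reserves some memory for itself. Seeing the full amount only shows that the kernel can address all four modules; it is not a substitute for a proper memory test.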
Wednesday, 1:05am: Disk copy finally finished. At this point, all the data from the old machine is safely on the new degraded-RAID6 array, as well as remaining on the old drives. Next steps are to shut down, remove the old drives, add the one remaining new drive, then start up again, un-degrade the RAID6, and build the other RAID arrays. I think once that's running, it'll be a good time for a nap.
2:05: Well, that took longer than planned. The thing is that while I was moving the drives around I took the opportunity to re-do all the cabling that I'd previously left haphazard (since it would be changed later anyway). I also added another fan to blow on the memory, which otherwise would be buried under cables, and that was a trick because there was nowhere convenient to mount it and I didn't have the right hardware. But all the drives that should be in are now in, and the ones that should be out are out, and the system is booting (slowly) again from the Slackware install USB key. So, very soon, I'll be able to start the last battle of tonight's campaign.
2:35: The RAID6 is un-degrading itself now, estimated completion in 80 minutes, and the other arrays are built and queued up for their own builds. I also took the opportunity to screw on the cover of my computer. All the insides should keep without further adjustment, for a while at least. Now I have just under three hours for a nap before starting my day tomorrow, and probably starting a new posting. Goodnight, all.