The purpose of this post is to confirm the confidence I have in RAID technology as expressed in the earlier post “RAID”. It is occasioned by my recent plans to write a very different piece.
Background: the Warning Signs
Summers can get pretty hot here in Auckland. The average temperature for this time of year is 24 degrees Celsius (that’s 75 degrees Fahrenheit to North Americans) with 99% humidity so it’s no simple matter to keep a computer cool.
Your computer’s resources are not infinite. Each time a new program is called, a portion of the total is used, and as we chew through the system’s resources we eventually reach a point where they are no longer sufficient for the proper functioning of the system.
At that point many things can happen, ranging from “nothing” at all to a full-on “Blue Screen of Death” and a forced reboot.
In most cases no error message is produced, or the message that is produced has nothing to do with the actual problem that may be discovered later. What we notice is that the system starts acting “weird”. Commands don’t get executed, we lose icons, a simple file copy takes 3 times as long as it should … and at some point we make the decision to shut everything down and start again fresh with a reboot.
Some time ago (more than a year) I found that I had to re-boot my systems far more often on hot days than on cool days.
Eventually it dawned on me that at least some of these faults might be the result of overheating.
I asked my wife if we could turn her sewing room into an air-conditioned computer room with hardware racks and a raised floor – but she said “No”. Women – go figure.
My fallback plan “B” called for the installation of more internal fans into my boxes, and while this seemed to alleviate the faults, it also made the room a lot noisier.
In an effort to reduce the office noise levels I had a cabinet built to house the four computers I had at the time. I included several ventilation holes that I hoped would be sufficient.
The result was a large desk-shaped oven. The multiple 6-inch vent holes weren’t nearly enough to extract the trapped heat so it just got hotter and hotter. I had to take off the cabinet doors and even the PC covers and direct a large office fan directly into the cavity to bring the operating temperature down to a safe level. In the end, for a time, the office was hotter and noisier than ever. (more about this in our next article)
I mention my heat problems because for a long time I blamed them for all of my “unattributable” faults. In particular, the one that bugged me the most was the loss of my RAID array.
The first time I noticed it was a bit of a shock, given that the drive holds all my data and my “backup” discipline is a bit loose. After some investigation and a few reboots, though, it became clear that the array hadn’t disappeared; it had only failed to mount.
I’ve configured this PC with 2 drives. The boot drive (“C:\”) is a separate physical HDD, so the system boots up fine; it just comes up without the second drive, a virtual drive made up of 4 physical drives combined by RAID into a single “D:\” drive.
No error message was produced and in fact, the report from the HighPoint RAID management system told me that the array was “Normal” apart from the number 2 disk running a bit hot.
To recover the folders I had to power down the box and reboot at least once, and sometimes more, to get the virtual “D:\” drive back online.
I sent off a note to the HighPoint group via their support web page and got an email back saying that their support guy was on vacation and giving me an alternate address to contact. No reply came back from the alternate address.
I used the support page to request an update on the status of my fault report and shortly thereafter I got an email saying that my trouble ticket had been updated. I logged back in to the support site only to find that the “update” was my own query asking them for an update.
Around this time I’d been posting my problem to my various forums, and one kind reader wrote to point out that if I really wanted to back up a 1.5TB RAID array, I’d need a 1.5TB backup disk to do it. He was right of course, but it was a depressing kind of right – there is a good measure of fault tolerance built into the RAID software, but it is fault “tolerance”, not fault “proof”. If you lose a second disk before you can replace the first failed one, you will lose the array.
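For the curious, here is a rough sketch of why one lost disk is survivable and two are not. It assumes RAID 5-style single parity and uses toy 4-byte blocks; a real controller stripes much larger blocks and rotates the parity across all the disks, and the helper name `xor_blocks` is my own invention, not anything from the HighPoint software.

```python
from functools import reduce

def xor_blocks(blocks):
    """XOR equal-length byte blocks together, one byte position at a time."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

# One stripe on a 4-disk array: three data blocks plus one parity block.
# (Toy data; the block contents are made up for illustration.)
data = [b"AAAA", b"BBBB", b"CCCC"]
parity = xor_blocks(data)

# Disk 2 dies: XOR-ing the surviving data blocks with the parity block
# gives back the missing block.
rebuilt = xor_blocks([data[0], data[2], parity])

# Disks 2 and 3 both die: all we can recover is the XOR of the two
# missing blocks, which is not enough to separate them again.
leftover = xor_blocks([data[0], parity])
```

Here `rebuilt` equals the lost block `b"BBBB"`, while `leftover` is only the two missing blocks mashed together. That is the whole story behind “tolerance, not proof”.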
The lack of progress on this issue and a growing frustration with the supplier drove me to consider an article on “the failure of RAID technology and its suppliers”. Fortunately a lucky-unlucky break intervened.
“We had to destroy the village in order to save it”
Have you ever had an intermittent fault on a system that you couldn’t pinpoint, but you knew it was in a particular subsystem, so you just whacked the subsystem with a hammer to get the whole thing replaced?
Fortunately for me, fate held the hammer this time.
As disk failures go, the “head crash” (see Wikipedia) has to be one of the most dramatic. It’s a catastrophic hardware fault that occurs when a read-write head (which works like the needle on a turntable) comes into contact with the surface of the disk’s platter, which is spinning around at 7,200 RPM.
On February 1st, a noise that sounded very much like a high-speed dentist’s drill came screaming from my PC. Checking the RAID Management page, I could see the number 2 drive had indeed failed. (I’ve put a sound file on my web site if you want to hear it: Barracuda Head Crash.wav)
After securing a replacement drive (a Western Digital), I had a go at getting it integrated into my system, but the first attempt failed miserably (“no available drive found”). Once I figured out that the drive had to be formatted first, it took only a minute to install, and then another 8 hours to rebuild the drive back into the array, restoring my system to peak performance.
Since February 1, and with additional system cooling modifications, both servers have run well although I still can’t close the cabinet doors. My confidence in RAID technology is solidified, and I’m very happy to re-recommend a RAID 5 solution for any situation that requires a large logical drive for optimum disk utilization and data protection with a lower cost of ownership profile than simply doubling the number of disks.
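To put rough numbers on the “lower cost than doubling the disks” claim, here is a small back-of-the-envelope sketch. The 500 GB disk size is my assumption (it is what 4 disks and a 1.5TB array imply under RAID 5), and the function names are made up for this example.

```python
import math

def disks_needed_raid5(target_gb, disk_gb):
    """RAID 5 gives up one disk's worth of space to parity (minimum 3 disks)."""
    return max(3, math.ceil(target_gb / disk_gb) + 1)

def disks_needed_mirror(target_gb, disk_gb):
    """Mirroring keeps a full second copy, so every data disk needs a twin."""
    return 2 * math.ceil(target_gb / disk_gb)

TARGET_GB = 1500   # the 1.5TB logical drive
DISK_GB = 500      # assumed size of each physical disk

raid5_count = disks_needed_raid5(TARGET_GB, DISK_GB)
mirror_count = disks_needed_mirror(TARGET_GB, DISK_GB)
print(f"RAID 5: {raid5_count} disks, mirroring: {mirror_count} disks")
```

Under those assumptions RAID 5 gets there with 4 disks where mirroring would need 6, and the gap widens as the array grows.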
One down, One to go
Bolstered by the success with the disk failure, I pushed ahead for a solution to the disappearing drive problem and sent another email off to HighPoint. I got a note back directing me to their Chinese web site to download the latest drivers, BIOS, and web management tools. It sounded like a fob-off to get me off their backs, but those basic steps, even if they never seem to work, must be undertaken in order to move on to the next step.
On March 1, I found the driver, BIOS, and application files on their web site and they were, indeed, different from the ones I’d obtained earlier from their US web site. (Why didn’t they just update the US files themselves?)
I installed the 4 new files and I guess it must have worked — 4 or 5 reboots since the install and not a missing drive in sight!
I won’t claim a final victory, however. As with any currently accepted scientific theory, it’s only true until it’s not.