Wednesday, November 12, 2008

Spent an hour rewiring my main server

I've had tons of raid failures in my main server over the last couple of months (all in all I think 6 disks have failed fairly recently). Yesterday I realized that the problem might be the way I had run the wires inside the box. Once I opened it I saw that I had squeezed the SATA cables together with power cables carrying pretty high current (after all, they were powering 17 disks and a bunch of fans). What basically happened was that, with alarming regularity, the drives would time out and completely lock up until I rebooted (by which time they had been failed out of the RAID). After a reboot they usually worked fine for a couple of days until it happened again.

It was also fairly obvious that certain hotswap slots were more prone to fail in this way than others. So now I have rewired the box so that the power cables go over the top of the fans in the middle of the box and the SATA cables go below them, and so far it seems to work out fine. For the first time in ages I am running with completely synced raids in all my servers.

[pallas]$ cat /proc/mdstat
Personalities : [raid6]
md0 : active raid6 sda1[0] sdg1[15] sdm1[14] sdo1[13] sdp1[12] sdl1[11] sdk1[10] sdj1[9]
                   sdi1[8] sdh1[7] sdn1[6] sdf1[5] sde1[4] sdd1[3] sdc1[2] sdb1[1]
      6837375104 blocks level 6, 64k chunk, algorithm 2 [16/16] [UUUUUUUUUUUUUUUU]

unused devices: <none>
[cadiz]$ cat /proc/mdstat
Personalities : [raid6]
md0 : active raid6 sdp1[6] hdc1[0] sdq1[22] sdr1[21] sdo1[20] sdn1[19] sdm1[18] sdl1[17]
                   sdk1[16] sdj1[15] sdt1[14] sdi1[13] sde1[12] sdf1[11] sdg1[10] sdh1[9]
                   sdd1[8] sdc1[7] sds1[5] sdu1[4] sdb1[3] sda1[2] hda1[1]
      5128113984 blocks level 6, 64k chunk, algorithm 2 [23/23] [UUUUUUUUUUUUUUUUUUUUUUU]

unused devices: <none>
[valdez]$ cat /proc/mdstat
Personalities : [raid6]
md1 : active raid6 hda2[0] sda2[9] sdc2[8] sdb2[7] sdd2[6] sdh2[5] sde2[4] sdf2[3] sdg2[2]
      1789768704 blocks level 6, 64k chunk, algorithm 2 [10/10] [UUUUUUUUUU]

md0 : active raid6 sdg1[0] sda1[6] sdc1[5] sdb1[4] sdd1[3] sdh1[2] sdf1[1]
      102373440 blocks level 6, 64k chunk, algorithm 2 [7/7] [UUUUUUU]

unused devices: <none>

I leave it up to the reader to figure out the total size of those raids from that readout. Running with degraded raids is one of the few things that can really stress me out, because I know that if I lost the stuff I have on those servers it would be impossible to replace.
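For readers who would rather let a script do the adding, here is a minimal sketch in Python. It assumes the "blocks" figure in /proc/mdstat is counted in 1 KiB units and that the bracketed pair reads [expected devices/active devices]; the function name `summarize` and the `READOUT` string are made up for this illustration, with the summary lines copied from the readouts above.

```python
import re

# Matches the summary line of each array, e.g.
# "6837375104 blocks level 6, 64k chunk, algorithm 2 [16/16] [UUUU...]"
SUMMARY = re.compile(r"^\s*(\d+) blocks .*\[(\d+)/(\d+)\]")


def summarize(mdstat_text):
    """Return (total_kib, degraded_count) for a /proc/mdstat dump."""
    total_kib = 0
    degraded = 0
    for line in mdstat_text.splitlines():
        m = SUMMARY.match(line)
        if not m:
            continue  # not an array summary line
        total_kib += int(m.group(1))
        # Assumed meaning: [expected/active]; fewer active means degraded
        if int(m.group(3)) < int(m.group(2)):
            degraded += 1
    return total_kib, degraded


# Summary lines copied from the pallas, cadiz and valdez readouts
READOUT = """\
      6837375104 blocks level 6, 64k chunk, algorithm 2 [16/16] [UUUUUUUUUUUUUUUU]
      5128113984 blocks level 6, 64k chunk, algorithm 2 [23/23] [UUUUUUUUUUUUUUUUUUUUUUU]
      1789768704 blocks level 6, 64k chunk, algorithm 2 [10/10] [UUUUUUUUUU]
      102373440 blocks level 6, 64k chunk, algorithm 2 [7/7] [UUUUUUU]
"""

total_kib, degraded = summarize(READOUT)
print(f"{total_kib} KiB, about {total_kib * 1024 / 1e12:.2f} TB "
      f"({total_kib / 2**30:.2f} TiB), {degraded} degraded")
```

On a live box the same function could be fed the contents of /proc/mdstat directly instead of the pasted lines.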

To end with a quote (this time my own): Peace of mind is a synced raid.
