How I replaced a failed disk in a RAID1 array without downtime

A server I maintain at People & Planet uses Linux software RAID1 to protect us against disk failure. RAID1 means that two (or more) devices are kept in an identical state at all times; if one fails, the OS can continue using the remaining disk(s). The procedure for replacing a disk is documented in several places online, but there are always little things that differ for every flavour of Linux and every set-up, so I document this here in case our situation matches yours. Please note that I cannot take any responsibility for any problems you have based on this blog post.

I was able to replace the drive without any downtime or reboots.

The setup

The server is running Debian stable (Squeeze).
There are just two 1TB SATA drives with all their partitions in a RAID1 (mirrored) configuration.
The drives are known to Debian as sda and sdb. It was most of the partitions on sda that failed.

Identifying the failed disk

I was alerted to the failed disk by an email from logcheck:

Security Events for mdadm
=-=-=-=-=-=-=-=-=-=-=-=-=
Nov 14 10:21:58 maui kernel: [11987426.042865] raid1: Disk failure on sda7, disabling device.

I confirmed the degraded state by looking at mdstat:

$ cat /proc/mdstat
Personalities : [raid1]
md3 : active raid1 sda9[2](F) sdb9[1]
      638864768 blocks [2/1] [_U]
     
md4 : active raid1 sda8[2](F) sdb8[1]
      7815488 blocks [2/1] [_U]
     
md5 : active (auto-read-only) raid1 sda7[0] sdb7[1]
      7815488 blocks [2/2] [UU]
     
md2 : active raid1 sda6[2](F) sdb6[1]
      97659008 blocks [2/1] [_U]
     
md1 : active raid1 sda5[2](F) sdb5[1]
      195310080 blocks [2/1] [_U]
     
md0 : active raid1 sda1[2](F) sdb1[1]
      29294400 blocks [2/1] [_U]

The F means failed, and the underscore means down, as opposed to the adjacent U, which means up. The [2/1] bit after the block count means that the array should have 2 devices in it, but only 1 is in use.
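With several arrays it is easy to misread this output, so here is a minimal sketch that lists only the arrays where the second number in the [n/m] pair is lower than the first. The function name degraded_arrays is my own invention, and it assumes the mdstat layout shown above (an "mdN :" header line followed by a "blocks" line).

```shell
# List md arrays that have fewer active members than expected.
# Reads mdstat-formatted text on stdin, so it can be tried on a saved copy first.
# degraded_arrays is a hypothetical helper name, not part of mdadm.
degraded_arrays() {
    awk '
        /^md[0-9]+ :/ { name = $1 }                  # remember the array named on the header line
        /blocks/ && match($0, /\[[0-9]+\/[0-9]+\]/) {
            # pull "2/1" out of "[2/1]" and split it into expected/active counts
            split(substr($0, RSTART + 1, RLENGTH - 2), n, "/")
            if (n[2] + 0 < n[1] + 0) print name
        }
    '
}

# Usage on a live system:
#   degraded_arrays < /proc/mdstat
```

Reading from stdin rather than opening /proc/mdstat directly means the function can be checked against pasted output before trusting it on a real server.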

So here you can see that of the 6 partitions, 5 had failed. In my experience, when a disk begins to fail, it completely fails very soon after.

Note: the format of /proc/mdstat is odd and a little confusing. In particular, I do not understand the order of the devices. The Linux Kernel Wiki says "The order in which the devices appear in this line means nothing", but in my case the order did correlate with the line below it, i.e. _U meant that the first drive listed on the line above had failed. The number following the device name, e.g. the 2 in sda1[2](F) above, is apparently the number of the device in the RAID array. This is confusing because it would imply that this device was the second (or possibly the third, as devices in some systems are numbered from zero) and that the first device was the second one listed.

So be absolutely sure you are satisfied that you know which device has died.

I was concerned that numbering can change and differ, e.g. the BIOS, the OS and the bootloader (Grub 1 in my case) may each enumerate devices differently. I was concerned that this could create a lot of confusion after rebooting with the new drive in, so it was a big advantage to find that it was possible to do a live swap-out, which meant that the labels and numbers did not change throughout the procedure.

I know that the OS calls the device sda, but there were two other identifications that were useful: the SCSI bus address and the serial number.

$ lshw -c disk
  *-disk:0
       description: SCSI Disk
       physical id: 0
       bus info: scsi@2:0.0.0
       logical name: /dev/sda
       size: 931GiB (1TB)
  *-disk:1
       description: ATA Disk
       product: ST31000528AS
       vendor: Seagate
       physical id: 1
       bus info: scsi@3:0.0.0
       logical name: /dev/sdb
       version: CC38
       serial: 9VP50TYG
       size: 931GiB (1TB)
       capabilities: partitioned partitioned:dos
       configuration: ansiversion=5 signature=00031a61

N.B. lshw was unable to extract the serial number for the failed sda disk.

Removing the disk

First, mark each of the disk's partitions as failed so that they can then be removed from the RAID arrays. To summarise the RAID devices which use sda, I used this command:

$ grep sda /proc/mdstat  | sort
md0 : active raid1 sda1[2](F) sdb1[1]
md1 : active raid1 sda5[2](F) sdb5[1]
md2 : active raid1 sda6[2](F) sdb6[1]
md3 : active raid1 sda9[2](F) sdb9[1]
md4 : active raid1 sda8[2](F) sdb8[1]
md5 : active (auto-read-only) raid1 sda7[0] sdb7[1]

For each partition not already in a failed state, e.g. the last line in the output above, run a command like this:

$ mdadm -f /dev/md5 /dev/sda7
mdadm: set /dev/sda7 faulty in /dev/md5

Now that all the components of each RAID device that use the disk are marked as failed, they can be removed from the RAID arrays with commands like this:

$ mdadm --remove /dev/md0 /dev/sda1
... repeat for each part of sda 
    being sure to match md with appropriate sda numbers ...
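Matching md numbers to sda numbers by hand is where I would expect mistakes to creep in, so here is a rough sketch that builds the commands from mdstat itself. The function name remove_commands is hypothetical, and it only prints the commands for members flagged (F); it does not run anything.

```shell
# Print (not run) an "mdadm --remove" command for every member of the given
# disk that mdstat marks as failed with (F). Reads mdstat text on stdin.
# remove_commands is a hypothetical helper name, not part of mdadm.
remove_commands() {
    awk -v disk="$1" '
        /^md[0-9]+ :/ {
            for (i = 2; i <= NF; i++) {
                # members look like sda1[2](F); keep only failed ones on our disk
                if ($i ~ ("^" disk "[0-9]+\\[[0-9]+\\]\\(F\\)$")) {
                    part = $i
                    sub(/\[.*/, "", part)            # sda1[2](F) -> sda1
                    print "mdadm --remove /dev/" $1 " /dev/" part
                }
            }
        }
    '
}

# Usage on a live system (review the output before running any of it as root):
#   remove_commands sda < /proc/mdstat
```

Printing rather than executing is deliberate: each md/partition pairing can be checked by eye against /proc/mdstat first.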

Now tell the system you're about to whip out the disk. We'll need the numbers output from lshw above: bus info: scsi@2:0.0.0

echo "scsi remove-single-device" 2 0 0 0 >/proc/scsi/scsi
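The four numbers in that command are the host, channel, id and lun, which is how I read the scsi@2:0.0.0 bus info string from lshw. As a small sketch to avoid transcription slips, this converts one form to the other; scsi_address is a name I made up, and it assumes the bus info has exactly the scsi@H:C.I.L shape shown above.

```shell
# Turn an lshw "bus info" value such as scsi@2:0.0.0 into the
# "host channel id lun" form used in the /proc/scsi/scsi commands.
# scsi_address is a hypothetical helper name. Prints nothing if the
# input does not look like scsi@H:C.I.L.
scsi_address() {
    printf '%s\n' "$1" |
        sed -n 's/^scsi@\([0-9][0-9]*\):\([0-9][0-9]*\)\.\([0-9][0-9]*\)\.\([0-9][0-9]*\)$/\1 \2 \3 \4/p'
}

# Usage:
#   scsi_address scsi@2:0.0.0
```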

Next is the physical removal of the dead disk from the running server. Fortunately for me, it was fairly easy to access the disks and confirm the serial number of the dead one without pulling any wires out. I unplugged it and plugged in the new disk, which was the same make and model, to minimise other problems. Make sure you connect the new disk to the same SATA channel.

To tell the system to look at the device again:

echo "scsi add-single-device" 2 0 0 0 >/proc/scsi/scsi

Then a quick tail of syslog assured me that the device was recognised, and registered as sda.

Setting up the new disk

The new disk needs to be partitioned identically to the other. You can do this with an sfdisk one-liner, although I had to use the --force option because at first it complained about a partition that did not lie along cylinder boundaries. It is rather important to get sda and sdb the right way around here! In my case I needed to copy the partition table from sdb (the working, live drive) to sda (the new, blank drive).

$ sfdisk -d /dev/sdb | sfdisk --force /dev/sda

Next we need to add the new sda partitions back into the RAID array.

$ mdadm --add /dev/md0 /dev/sda1
... repeat for each part of sda 
    being sure to match md with appropriate sda numbers ...
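At this point the new disk no longer appears in mdstat, so the pairing has to come from the surviving disk. Here is a sketch that does this, assuming (as in my case, after copying the partition table) that partition numbers are identical on both disks. The name add_commands is hypothetical, and again it only prints the commands.

```shell
# Print (not run) an "mdadm --add" command for each array, pairing every
# partition of the surviving disk with the same-numbered partition on the new one.
# add_commands is a hypothetical helper; it assumes identical partition numbering.
add_commands() {
    awk -v good="$1" -v new="$2" '
        /^md[0-9]+ :/ {
            for (i = 2; i <= NF; i++) {
                if ($i ~ ("^" good "[0-9]+\\[")) {
                    num = $i
                    sub(/\[.*/, "", num)             # sdb1[1] -> sdb1
                    sub("^" good, "", num)           # sdb1    -> 1
                    print "mdadm --add /dev/" $1 " /dev/" new num
                }
            }
        }
    '
}

# Usage on a live system (surviving disk first, new disk second):
#   add_commands sdb sda < /proc/mdstat
```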

Check it's working - hooray! It was.

# cat /proc/mdstat 
Personalities : [raid1] 
md3 : active raid1 sda9[2] sdb9[1]
      638864768 blocks [2/1] [_U]
      [>....................]  recovery =  1.2% (8078976/638864768) finish=143.2min speed=73405K/sec
      
md4 : active raid1 sda8[2] sdb8[1]
      7815488 blocks [2/1] [_U]
        resync=DELAYED
      
md5 : active raid1 sda7[2] sdb7[1]
      7815488 blocks [2/1] [_U]
        resync=DELAYED
      
md2 : active raid1 sda6[2] sdb6[1]
      97659008 blocks [2/1] [_U]
        resync=DELAYED
      
md1 : active raid1 sda5[2] sdb5[1]
      195310080 blocks [2/1] [_U]
        resync=DELAYED
      
md0 : active raid1 sda1[2] sdb1[1]
      29294400 blocks [2/1] [_U]
        resync=DELAYED

The DELAYED bit is because the md driver recognises that these partitions are on the same disk, so doing the resyncs in parallel would not help.
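To keep an eye on the rebuild without reading the whole file each time, something like the following sketch could pull out just the array name and percentage. rebuild_progress is a name I made up, and it assumes the "recovery = N%" (or "resync = N%") wording shown in the output above.

```shell
# Print "array percent" for each array that is currently rebuilding.
# Reads mdstat text on stdin. Arrays waiting as resync=DELAYED are skipped
# because they have no percentage yet.
# rebuild_progress is a hypothetical helper name.
rebuild_progress() {
    awk '
        /^md[0-9]+ :/ { name = $1 }                  # remember the array named on the header line
        /recovery =|resync =/ && match($0, /[0-9.]+%/) {
            print name, substr($0, RSTART, RLENGTH)
        }
    '
}

# Usage on a live system:
#   rebuild_progress < /proc/mdstat
```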

Conclusion

I've always loved the concept of mirrored RAID: so sensible when disk failure can come at any time and be so devastating. This is my first real-world experience of replacing a failed disk in a RAID array, putting the theory into practice. It saved me a huge amount of work and, miraculously, there was not a moment of downtime. Many people I've come across recommend expensive hardware RAID controllers and advise against software RAID, but I have found Linux RAID to be more reliable and to have more community support online than the hardware RAIDed server I also maintain.

I'm grateful to Linux/Debian/GNU for saving my charity People & Planet from having to spend precious supporters' and funders' cash on expensive hardware, and for the wealth of support documentation found online, which I hope this post adds to.

Thanks to TechRepublic (how-to), Anchor.com (how-to, but note that some bits are missing, e.g. mdadm --add, and that grep /dev/sda /proc/mdstat won't work as the output does not include /dev/) and the Linux Kernel Wiki.
