A server I maintain at People & Planet uses Linux software RAID1 to protect us against disk failure. RAID1 means that two (or more) devices are kept in an identical state at all times; if one fails, the OS can continue using the remaining disk(s). The procedure for replacing a disk is documented in several places online, but there are always little things that are different for every flavour of Linux and every set-up, so I document this here in case our situation matches yours. Please note that I cannot take any responsibility for any problems you have based on this blog post.
I was able to replace the drive without any downtime or reboots.
The set-up
- The server is running Debian stable (Squeeze).
- There are just two 1TB SATA drives with all their partitions in a RAID1 (mirrored) configuration.
- The drives are known to Debian as sda and sdb. It was most of the partitions on sda that failed.
Identifying the failed disk
I was alerted to the failed disk by an email from logcheck
Security Events for mdadm
=-=-=-=-=-=-=-=-=-=-=-=-=
Nov 14 10:21:58 maui kernel: [11987426.042865] raid1: Disk failure on sda7, disabling device.
I confirmed the degraded state by looking at mdstat:
$ cat /proc/mdstat
Personalities : [raid1]
md3 : active raid1 sda9[2](F) sdb9[1]
      638864768 blocks [2/1] [_U]
md4 : active raid1 sda8[2](F) sdb8[1]
      7815488 blocks [2/1] [_U]
md5 : active (auto-read-only) raid1 sda7[0] sdb7[1]
      7815488 blocks [2/2] [UU]
md2 : active raid1 sda6[2](F) sdb6[1]
      97659008 blocks [2/1] [_U]
md1 : active raid1 sda5[2](F) sdb5[1]
      195310080 blocks [2/1] [_U]
md0 : active raid1 sda1[2](F) sdb1[1]
      29294400 blocks [2/1] [_U]
The F means failed, and the underscore means down, as opposed to the adjacent U, which means up. The [2/1] after the block count means that the array should have 2 devices in it, but only 1 is in use.
So here you can see that of the 6 partitions, 5 had failed. In my experience, when a disk begins to fail, it completely fails very soon after.
Note: the format of /proc/mdstat is, to say the least, odd and a little confusing. In particular, I do not understand the order of the devices. The Linux Kernel Wiki says "The order in which the devices appear in this line means nothing", but at least this order correlates with the line below it, i.e. _U means the first drive listed on the line above has failed. The number in square brackets following the device name, e.g. the 2 in sda1[2](F) above, is apparently the number of the device in the RAID array. This is confusing because it would imply that this device was the second (or possibly the third, as devices in some systems are numbered from zero) and that the first device was the one listed second.
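Given how easy the format is to misread, it can help to filter the degraded arrays out mechanically. The sketch below is not from the original procedure; it runs against a saved sample in /tmp so nothing on the live system is touched.

```shell
# List only the arrays that contain a failed (F) component.
# /tmp/mdstat.sample is a cut-down stand-in for the live /proc/mdstat.
cat > /tmp/mdstat.sample <<'EOF'
md3 : active raid1 sda9[2](F) sdb9[1]
md5 : active (auto-read-only) raid1 sda7[0] sdb7[1]
md0 : active raid1 sda1[2](F) sdb1[1]
EOF
grep '(F)' /tmp/mdstat.sample | cut -d' ' -f1
```

On a live system, substitute /proc/mdstat for the sample file.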
So be absolutely sure you are satisfied that you know which device has died.
I know that the OS calls the device sda, but there were two other identifiers that were useful: the SCSI bus address and the serial number.
$ lshw -c disk
  *-disk:0
       description: SCSI Disk
       physical id: 0
       bus info: scsi@2:0.0.0
       logical name: /dev/sda
       size: 931GiB (1TB)
  *-disk:1
       description: ATA Disk
       product: ST31000528AS
       vendor: Seagate
       physical id: 1
       bus info: scsi@3:0.0.0
       logical name: /dev/sdb
       version: CC38
       serial: 9VP50TYG
       size: 931GiB (1TB)
       capabilities: partitioned partitioned:dos
       configuration: ansiversion=5 signature=00031a61
Nb. lshw was unable to extract the serial number (or product details) for the failed disk, sda.
Removing the disk
First, remove the disk from the RAID array by marking each of its partitions as failed. To summarise the RAID devices which use sda, I used this command:
$ grep sda /proc/mdstat | sort
md0 : active raid1 sda1[2](F) sdb1[1]
md1 : active raid1 sda5[2](F) sdb5[1]
md2 : active raid1 sda6[2](F) sdb6[1]
md3 : active raid1 sda9[2](F) sdb9[1]
md4 : active raid1 sda8[2](F) sdb8[1]
md5 : active (auto-read-only) raid1 sda7[0] sdb7[1]
For each partition not already in a failed state, e.g. the last line in the output above, run a command like this:
$ mdadm -f /dev/md5 /dev/sda7
mdadm: set /dev/sda7 faulty in /dev/md5
Now that all the components of each RAID device that uses the disk are marked as failed, they can be removed from the RAID arrays with commands like this:
$ mdadm --remove /dev/md0 /dev/sda1
... repeat for each partition of sda, being sure to match each md device with the appropriate sda partition ...
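With several partitions to detach, it can be safer to generate the removal commands than to type them by hand. This sketch is not from the original post: it parses a saved copy of /proc/mdstat (a here-doc sample here) and prints the matching mdadm --remove command for every failed sda component, so you can review them before running anything as root.

```shell
# Build the mdadm --remove command for every failed (F) sda component.
# /tmp/mdstat-failed.sample stands in for the live /proc/mdstat.
cat > /tmp/mdstat-failed.sample <<'EOF'
md0 : active raid1 sda1[2](F) sdb1[1]
md1 : active raid1 sda5[2](F) sdb5[1]
EOF
awk '/\(F\)/ {
  for (i = 1; i <= NF; i++)
    if ($i ~ /^sda[0-9]+/ && $i ~ /\(F\)/) {
      part = $i
      sub(/[[(].*$/, "", part)   # strip "[2](F)" to leave the bare name
      print "mdadm --remove /dev/" $1 " /dev/" part
    }
}' /tmp/mdstat-failed.sample | tee /tmp/remove-cmds.txt
```

Once you have checked the generated lines, they could be piped through sh as root.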
Now tell the system you're about to whip out the disk. We'll need the numbers from the lshw bus info above, scsi@2:0.0.0, which give the host, channel, id and lun:

echo "scsi remove-single-device 2 0 0 0" > /proc/scsi/scsi
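On kernels new enough to have sysfs (2.6 and later) there is an alternative to writing to /proc/scsi/scsi. The snippet below is a sketch, not part of the original procedure: it derives the host:channel:id:lun address from the lshw bus info string, and leaves the actual delete line commented out, since it really does detach a disk.

```shell
# Derive the host:channel:id:lun address from lshw's "bus info" string.
businfo="scsi@2:0.0.0"                        # value from the lshw output above
addr=$(echo "$businfo" | sed 's/^scsi@//; s/\./:/g')
echo "$addr"                                  # prints 2:0:0:0
# As root, and only once you are certain this is the dead disk:
#   echo 1 > "/sys/class/scsi_device/$addr/device/delete"
```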
Next is the physical removal of the dead disk from the running server. Fortunately for me, it was fairly easy to access the disks and confirm the serial number of the dead one without pulling any wires out. I unplugged the dead disk and slotted in the new one, which was the same make and identical model, to minimise other problems. Make sure you connect the new disk to the same SATA channel.
To tell the system to look again at the device:

echo "scsi add-single-device 2 0 0 0" > /proc/scsi/scsi
Then a quick tail of the syslog assured me that the device had been recognised and registered as sda.
Setting up the new disk
The new disk needs to be partitioned identically to the other. You can do this with an sfdisk one-liner, although I had to use the --force option because at first it complained about a partition that did not lie on a cylinder boundary. It is rather important to get sda and sdb the right way around here! In my case I needed to copy the partition table from sdb (the working, live drive) to sda (the new, blank drive).
$ sfdisk -d /dev/sdb | sfdisk --force /dev/sda
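It is worth confirming that the copy took before rebuilding. The sketch below is not from the original post: it masks the device names out of two sfdisk -d dumps and diffs them. The here-docs are illustrative stand-ins; on a live system you would generate the dumps with `sfdisk -d /dev/sda > /tmp/sda.dump` (and likewise for sdb).

```shell
# Compare two `sfdisk -d` dumps with the device names masked out.
# The sample dumps below are illustrative, not real output.
cat > /tmp/sda.dump <<'EOF'
/dev/sda1 : start=       63, size= 58588800, Id=fd
EOF
cat > /tmp/sdb.dump <<'EOF'
/dev/sdb1 : start=       63, size= 58588800, Id=fd
EOF
sed 's,^/dev/sd.,DISK,' /tmp/sda.dump > /tmp/sda.norm
sed 's,^/dev/sd.,DISK,' /tmp/sdb.dump > /tmp/sdb.norm
diff /tmp/sda.norm /tmp/sdb.norm && echo "partition tables match"
```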
Next we need to add the new sda partitions back into the RAID array.
$ mdadm --add /dev/md0 /dev/sda1
... repeat for each partition of sda, being sure to match each md device with the appropriate sda partition ...
Check it's working - hooray! It was.
# cat /proc/mdstat
Personalities : [raid1]
md3 : active raid1 sda9[2] sdb9[1]
      638864768 blocks [2/1] [_U]
      [>....................]  recovery =  1.2% (8078976/638864768) finish=143.2min speed=73405K/sec
md4 : active raid1 sda8[2] sdb8[1]
      7815488 blocks [2/1] [_U]
        resync=DELAYED
md5 : active raid1 sda7[2] sdb7[1]
      7815488 blocks [2/1] [_U]
        resync=DELAYED
md2 : active raid1 sda6[2] sdb6[1]
      97659008 blocks [2/1] [_U]
        resync=DELAYED
md1 : active raid1 sda5[2] sdb5[1]
      195310080 blocks [2/1] [_U]
        resync=DELAYED
md0 : active raid1 sda1[2] sdb1[1]
      29294400 blocks [2/1] [_U]
        resync=DELAYED
The DELAYED entries are because md recognises that these partitions are on the same disks, so doing the resyncs in parallel would not help.
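Rather than re-reading the whole of /proc/mdstat while the rebuild runs, the progress lines can be filtered out. Again this is a sketch against a saved sample, not a command from the original post.

```shell
# Pull just the rebuild-progress lines out of an mdstat snapshot.
# /tmp/mdstat-rebuild.sample stands in for the live /proc/mdstat.
cat > /tmp/mdstat-rebuild.sample <<'EOF'
md3 : active raid1 sda9[2] sdb9[1]
      [>....................]  recovery =  1.2% (8078976/638864768) finish=143.2min speed=73405K/sec
md4 : active raid1 sda8[2] sdb8[1]
        resync=DELAYED
EOF
grep -E 'recovery|resync' /tmp/mdstat-rebuild.sample
```

On a live system, `watch -n 5 "grep -E 'recovery|resync' /proc/mdstat"` refreshes the view every five seconds.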
I've loved the concept of mirrored RAID: so sensible when disk failure can come at any time and be so devastating. This is my first real-world experience of replacing a failed disk in a RAID array, putting the theory into practice. It saved me a huge amount of work, and miraculously, there was not a moment of downtime. Many people I've come across recommend expensive hardware RAID controllers and advise against software RAID, but I have found Linux RAID to be more reliable, and to have more community support online, than the hardware-RAIDed server I also maintain.
Thanks to TechRepublic (how-to), Anchor.com (how-to, though note some bits are missing: the mdadm --add step, and its grep /dev/sda /proc/mdstat won't work, as the output does not include /dev/), and the Linux Kernel Wiki.
I'm grateful to Linux/Debian/GNU for saving my charity People & Planet from having to spend precious supporters' and funders' cash on expensive hardware, and for the wealth of support documentation found online, which I hope this post adds to.