Replace a failed drive in a software RAID

View the kernel log to detect a possibly failing hard drive

root@ubuntu:~# dmesg
[ 886.492585] sdb: Current: sense key: Recovered Error
[ 886.497903] Additional sense: Recovered data with retries
[ 886.504060] Info fld=0xdf82e1
[ 919.421181] sdb: Current: sense key: Recovered Error
[ 919.426474] Additional sense: Recovered data without ECC - recommend rewrite
[ 919.434375] Info fld=0xd66a9a
[ 1728.424643] sdb: Current: sense key: Recovered Error
[ 1728.429945] Additional sense: Recovered data without ECC - data auto-reallocated
[ 1728.438197] Info fld=0xccc0fe
[ 1731.086946] sdb: Current: sense key: Recovered Error
[ 1731.092252] Additional sense: Recovered data without ECC - data auto-reallocated
[ 1731.100514] Info fld=0xccb675
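
A long kernel log is easier to judge when the recovered errors are counted per device. The following is a minimal sketch, assuming the messages have the same layout as the excerpt above; the function name count_recovered_errors is made up for this example.

```shell
#!/bin/sh
# Count "Recovered Error" sense-key messages per device in kernel log text
# read from standard input. Assumes the message layout shown above, where
# the device appears as a separate field such as "sdb:".
count_recovered_errors() {
    awk '/sense key: Recovered Error/ {
             for (i = 1; i <= NF; i++) {
                 if ($i ~ /^sd[a-z]+:$/) {
                     dev = $i
                     sub(/:$/, "", dev)   # strip the trailing colon
                     count[dev]++
                 }
             }
         }
         END { for (d in count) print d, count[d] }' | sort
}

# Typical use on a live system:
#   dmesg | count_recovered_errors
```

A count that keeps growing for one device, as it does for sdb here, is a reason to run the SMART tests in the next step.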

Perform SMART test on drive

Install SMART tools

root@ubuntu:~# aptitude install smartmontools

Run SMART tests

root@ubuntu:~# smartctl --test=long /dev/sdb
root@ubuntu:~# smartctl -a /dev/sdb
smartctl version 5.34 [x86_64-unknown-linux-gnu] Copyright (C) 2002-5 Bruce Allen
Home page is http://smartmontools.sourceforge.net/
Device: FUJITSU MAV2073RCSUN72G Version: 0301
Serial number: 000535S00AUB
Device type: disk
Transport protocol: SAS
Local Time is: Sat Jan 29 14:22:13 2011 CST
Device supports SMART and is Enabled
Temperature Warning Disabled or Not Supported
SMART Health Status: OK
Current Drive Temperature: 27 C
Drive Trip Temperature: 65 C
Manufactured in week 35 of year 2005
Current start stop count: 43 times
Recommended maximum start stop count: 10000 times
Elements in grown defect list: 355
Error counter log:
           Errors Corrected by           Total   Correction     Gigabytes    Total
               ECC          rereads/    errors   algorithm      processed    uncorrected
           fast | delayed   rewrites  corrected  invocations   [10^9 bytes]  errors
read:          0   530114      1342      1342          0      78930.620           0
write:         0        2         0         0          0      38013.435           0
Non-medium error count: 44
SMART Self-test log
Num  Test              Status                 segment  LifeTime  LBA_first_err [SK ASC ASQ]
     Description                              number   (hours)
# 1  Background long   Failed in segment -->       9     42754       13399317 [0x3 0x11 0x1]
# 2  Background long   Failed in segment -->       9     42635       13399317 [0x3 0x11 0x1]
# 3  Background short  Completed                   -     42635              - [-   -    -]
# 4  Background long   Failed in segment -->       9     42634       13398730 [0x3 0x11 0x1]
Long (extended) Self Test duration: 2233 seconds [37.2 minutes]
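
Two lines in the report above matter most here: the grown defect list (355 elements) and the failed long self-tests. The sketch below pulls both out of saved smartctl -a output. It assumes the SCSI/SAS report format shown above (ATA drives print a different report), and the function name smart_summary is hypothetical.

```shell
#!/bin/sh
# Summarise two warning signs from `smartctl -a` text on standard input:
# the size of the grown defect list and the number of failed self-tests.
# Assumes the SCSI/SAS report layout shown above.
smart_summary() {
    awk '/Elements in grown defect list:/ { defects = $NF }
         /Failed in segment/              { failed++ }
         END { printf "grown_defects=%s failed_selftests=%d\n", defects, failed }'
}

# Typical use on a live system:
#   smartctl -a /dev/sdb | smart_summary
```

Note that the drive above still reports "SMART Health Status: OK" despite the failed self-tests, so the overall status line alone is not enough to clear a drive.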
root@ubuntu:~# fdisk -l
Disk /dev/sda: 73.4 GB, 73407865856 bytes
255 heads, 63 sectors/track, 8924 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes
   Device Boot      Start         End      Blocks   Id  System
/dev/sda1   *           1          12       96358+  fd  Linux raid autodetect
/dev/sda2              13        8924    71585640   fd  Linux raid autodetect
Disk /dev/sdb: 73.4 GB, 73407865856 bytes
255 heads, 63 sectors/track, 8924 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes
   Device Boot      Start         End      Blocks   Id  System
/dev/sdb1   *           1          12       96358+  fd  Linux raid autodetect
/dev/sdb2              13        8924    71585640   fd  Linux raid autodetect
Disk /dev/sdc: 73.4 GB, 73407865856 bytes
255 heads, 63 sectors/track, 8924 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes
   Device Boot      Start         End      Blocks   Id  System
/dev/sdc1               1        8924    71681998+  83  Linux
Disk /dev/sdd: 73.4 GB, 73407865856 bytes
255 heads, 63 sectors/track, 8924 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes
   Device Boot      Start         End      Blocks   Id  System
/dev/sdd1               1        8924    71681998+  83  Linux
Disk /dev/md0: 98 MB, 98566144 bytes
2 heads, 4 sectors/track, 24064 cylinders
Units = cylinders of 8 * 512 = 4096 bytes
Disk /dev/md0 doesn't contain a valid partition table
Disk /dev/md1: 73.3 GB, 73303588864 bytes
2 heads, 4 sectors/track, 17896384 cylinders
Units = cylinders of 8 * 512 = 4096 bytes
Disk /dev/md1 doesn't contain a valid partition table
Disk /dev/md2: 73.4 GB, 73402286080 bytes
2 heads, 4 sectors/track, 17920480 cylinders
Units = cylinders of 8 * 512 = 4096 bytes
Disk /dev/md2 doesn't contain a valid partition table
root@ubuntu:~# cat /proc/mdstat
Personalities : [raid1]
md2 : active raid1 sdc1[0] sdd1[1]
      71681920 blocks [2/2] [UU]
md1 : active raid1 sda2[0] sdb2[1]
      71585536 blocks [2/2] [UU]
md0 : active raid1 sda1[0] sdb1[1]
      96256 blocks [2/2] [UU]
unused devices: <none>
root@ubuntu:~# mdadm --query --detail /dev/md0
/dev/md0:
        Version : 00.90.03
  Creation Time : Wed Feb  8 17:29:05 2006
     Raid Level : raid1
     Array Size : 96256 (94.02 MiB 98.57 MB)
    Device Size : 96256 (94.02 MiB 98.57 MB)
   Raid Devices : 2
  Total Devices : 2
Preferred Minor : 0
    Persistence : Superblock is persistent
    Update Time : Mon Jan 31 06:26:13 2011
          State : clean
 Active Devices : 2
Working Devices : 2
 Failed Devices : 0
  Spare Devices : 0
           UUID : 96c88b09:82b06262:679309e4:bbe2fe4f
         Events : 0.20160
    Number   Major   Minor   RaidDevice State
       0       8        1        0      active sync   /dev/sda1
       1       8       17        1      active sync   /dev/sdb1
root@ubuntu:~# mdadm --query --detail /dev/md1
/dev/md1:
        Version : 00.90.03
  Creation Time : Wed Feb  8 17:29:25 2006
     Raid Level : raid1
     Array Size : 71585536 (68.27 GiB 73.30 GB)
    Device Size : 71585536 (68.27 GiB 73.30 GB)
   Raid Devices : 2
  Total Devices : 2
Preferred Minor : 1
    Persistence : Superblock is persistent
    Update Time : Mon Jan 31 17:42:26 2011
          State : active
 Active Devices : 2
Working Devices : 2
 Failed Devices : 0
  Spare Devices : 0
           UUID : 6154cd5a:edf5f628:28d7a268:ad434b95
         Events : 0.59383068
    Number   Major   Minor   RaidDevice State
       0       8        2        0      active sync   /dev/sda2
       1       8       18        1      active sync   /dev/sdb2
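
Before removing anything, it helps to confirm which arrays have a member on the failing disk. The following is a sketch, assuming the /proc/mdstat layout shown above and sdX-style device names; arrays_using_disk is a name invented for this example.

```shell
#!/bin/sh
# Print the name of every md array in mdstat-format text (standard input)
# that has a member partition on the given disk, for example "sdb".
arrays_using_disk() {
    disk=$1
    awk -v disk="$disk" '
        $1 ~ /^md[0-9]+$/ && $2 == ":" {
            for (i = 3; i <= NF; i++) {
                member = $i
                sub(/\[.*/, "", member)   # drop suffixes such as "[1]" or "[2](F)"
                if (member ~ ("^" disk "[0-9]+$")) { print $1; break }
            }
        }'
}

# Typical use on a live system:
#   arrays_using_disk sdb < /proc/mdstat
```

For the system above this lists md1 and md0, the two arrays that have to be handled in the next step; md2 lives on sdc and sdd and is left alone.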

Remove the Failed Drive

root@ubuntu:~# mdadm --manage /dev/md0 --fail /dev/sdb1
mdadm: set /dev/sdb1 faulty in /dev/md0
root@ubuntu:~# cat /proc/mdstat
Personalities : [raid1]
md2 : active raid1 sdc1[0] sdd1[1]
      71681920 blocks [2/2] [UU]
md1 : active raid1 sda2[0] sdb2[1]
      71585536 blocks [2/2] [UU]
md0 : active raid1 sda1[0] sdb1[2](F)
      96256 blocks [2/1] [U_]
unused devices: <none>
root@ubuntu:~# mdadm --manage /dev/md0 --remove /dev/sdb1
mdadm: hot removed /dev/sdb1
root@ubuntu:~# cat /proc/mdstat
Personalities : [raid1]
md2 : active raid1 sdc1[0] sdd1[1]
      71681920 blocks [2/2] [UU]
md1 : active raid1 sda2[0] sdb2[1]
      71585536 blocks [2/2] [UU]
md0 : active raid1 sda1[0]
      96256 blocks [2/1] [U_]
unused devices: <none>
root@ubuntu:~# mdadm --manage /dev/md1 --fail /dev/sdb2
mdadm: set /dev/sdb2 faulty in /dev/md1
root@ubuntu:~# cat /proc/mdstat
Personalities : [raid1]
md2 : active raid1 sdc1[0] sdd1[1]
      71681920 blocks [2/2] [UU]
md1 : active raid1 sda2[0] sdb2[2](F)
      71585536 blocks [2/1] [U_]
md0 : active raid1 sda1[0]
      96256 blocks [2/1] [U_]
unused devices: <none>
root@ubuntu:~# mdadm --manage /dev/md1 --remove /dev/sdb2
mdadm: hot removed /dev/sdb2
root@ubuntu:~# cat /proc/mdstat
Personalities : [raid1]
md2 : active raid1 sdc1[0] sdd1[1]
      71681920 blocks [2/2] [UU]
md1 : active raid1 sda2[0]
      71585536 blocks [2/1] [U_]
md0 : active raid1 sda1[0]
      96256 blocks [2/1] [U_]
unused devices: <none>
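
The same fail-then-remove pair is run once per array. As a sketch, the helper below only prints the commands for review and does not run them; print_fail_remove and its "array:partition" argument format are assumptions made for this example.

```shell
#!/bin/sh
# Print (do not run) the mdadm commands that mark a partition as failed and
# then remove it, for each "mdN:partition" pair given as an argument.
print_fail_remove() {
    for pair in "$@"; do
        array=${pair%%:*}   # text before the colon, e.g. md0
        part=${pair#*:}     # text after the colon, e.g. sdb1
        echo "mdadm --manage /dev/$array --fail /dev/$part"
        echo "mdadm --manage /dev/$array --remove /dev/$part"
    done
}

# Example for the arrays in this guide:
#   print_fail_remove md0:sdb1 md1:sdb2
```

Printing first makes it harder to fail the healthy half of a mirror by mistake; the commands can be pasted into the shell once they have been checked.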

Replace Drive

Power down the server and replace the failed physical drive.

Add new Drive to RAID

Verify current partition information

root@ubuntu:~# sfdisk -d /dev/sda
# partition table of /dev/sda
unit: sectors
/dev/sda1 : start=       63, size=   192717, Id=fd, bootable
/dev/sda2 : start=   192780, size=143171280, Id=fd
/dev/sda3 : start=        0, size=        0, Id= 0
/dev/sda4 : start=        0, size=        0, Id= 0

Copy the partition table from the healthy drive (/dev/sda) to the new drive (/dev/sdb)

root@ubuntu:~# sfdisk -d /dev/sda | sfdisk /dev/sdb
Checking that no-one is using this disk right now ...
OK
Disk /dev/sdb: 8924 cylinders, 255 heads, 63 sectors/track
sfdisk: ERROR: sector 0 does not have an msdos signature
 /dev/sdb: unrecognized partition table type
Old situation:
No partitions found
New situation:
Units = sectors of 512 bytes, counting from 0
   Device Boot    Start       End   #sectors  Id  System
/dev/sdb1   *        63    192779     192717  fd  Linux raid autodetect
/dev/sdb2        192780 143364059  143171280  fd  Linux raid autodetect
/dev/sdb3             0         -          0   0  Empty
/dev/sdb4             0         -          0   0  Empty
Successfully wrote the new partition table
Re-reading the partition table ...
If you created or changed a DOS partition, /dev/foo7, say, then use dd(1)
to zero the first 512 bytes:  dd if=/dev/zero of=/dev/foo7 bs=512 count=1
(See fdisk(8).)

Verify partition information

root@ubuntu:~# fdisk -l /dev/sda /dev/sdb
Disk /dev/sda: 73.4 GB, 73407865856 bytes
255 heads, 63 sectors/track, 8924 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes
   Device Boot      Start         End      Blocks   Id  System
/dev/sda1   *           1          12       96358+  fd  Linux raid autodetect
/dev/sda2              13        8924    71585640   fd  Linux raid autodetect
Disk /dev/sdb: 73.4 GB, 73407865856 bytes
255 heads, 63 sectors/track, 8924 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes
   Device Boot      Start         End      Blocks   Id  System
/dev/sdb1   *           1          12       96358+  fd  Linux raid autodetect
/dev/sdb2              13        8924    71585640   fd  Linux raid autodetect
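
The comparison can also be done mechanically on two saved sfdisk -d dumps. This is a sketch, assuming sdX-style device names and dumps saved to files beforehand; the function names normalize_dump and same_layout are hypothetical.

```shell
#!/bin/sh
# Reduce `sfdisk -d` text on standard input to the layout only: keep the
# partition lines, replace the device name with the partition number, and
# remove padding spaces so that two dumps can be compared as strings.
normalize_dump() {
    sed -n 's|^/dev/[a-z]*\([0-9][0-9]*\) *:|\1:|p' | tr -d ' '
}

# same_layout FILE_A FILE_B succeeds when both dumps describe the same layout.
same_layout() {
    [ "$(normalize_dump < "$1")" = "$(normalize_dump < "$2")" ]
}

# Typical use on a live system:
#   sfdisk -d /dev/sda > sda.dump
#   sfdisk -d /dev/sdb > sdb.dump
#   same_layout sda.dump sdb.dump && echo "layouts match"
```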

Add new drive partitions to software RAID

root@ubuntu:~# mdadm --manage /dev/md0 --add /dev/sdb1
mdadm: hot added /dev/sdb1
root@ubuntu:~# mdadm --manage /dev/md1 --add /dev/sdb2
mdadm: hot added /dev/sdb2
root@ubuntu:~# cat /proc/mdstat
Personalities : [raid1]
md2 : active raid1 sdc1[0] sdd1[1]
      71681920 blocks [2/2] [UU]
md1 : active raid1 sdb2[2] sda2[0]
      71585536 blocks [2/1] [U_]
      [>....................]  recovery =  0.1% (97408/71585536) finish=73.3min speed=16234K/sec
md0 : active raid1 sdb1[1] sda1[0]
      96256 blocks [2/2] [UU]
unused devices: <none>

Verify that the RAID rebuild eventually finishes successfully

root@ubuntu:~# cat /proc/mdstat
Personalities : [raid1]
md2 : active raid1 sdc1[0] sdd1[1]
      71681920 blocks [2/2] [UU]
md1 : active raid1 sdb2[1] sda2[0]
      71585536 blocks [2/2] [UU]
md0 : active raid1 sdb1[1] sda1[0]
      96256 blocks [2/2] [UU]
unused devices: <none>
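
Rather than re-running cat by hand, the check can be scripted. The sketch below treats an array as unhealthy while its status brackets contain "_" or while a recovery or resync line is present, as in the outputs above; the function name raid_healthy is an assumption for this example.

```shell
#!/bin/sh
# Succeed (exit status 0) when the mdstat-format text in the given file shows
# no array that is degraded or still rebuilding.
raid_healthy() {
    # A "_" inside status brackets such as [U_] marks a missing member;
    # "recovery" or "resync" marks a rebuild that is still running.
    ! grep -Eq '\[[U_]*_[U_]*\]|recovery|resync' "$1"
}

# Typical use on a live system: poll once a minute until the rebuild is done.
#   while ! raid_healthy /proc/mdstat; do sleep 60; done
```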

Make Disks Bootable with GRUB

If the drive you replaced held a boot partition, reinstall GRUB on it so that the server can boot from it again. The sessions below run the GRUB setup on both drives of the mirror.

/dev/sda

root@ubuntu:~# grub
Probing devices to guess BIOS drives. This may take a long time.
       [ Minimal BASH-like line editing is supported.   For
         the   first   word,  TAB  lists  possible  command
         completions.  Anywhere else TAB lists the possible
         completions of a device/filename. ]
grub> device (hd0) /dev/sda
grub> root (hd0,0)
grub> setup (hd0)
 Checking if "/boot/grub/stage1" exists... no
 Checking if "/grub/stage1" exists... yes
 Checking if "/grub/stage2" exists... yes
 Checking if "/grub/e2fs_stage1_5" exists... yes
 Running "embed /grub/e2fs_stage1_5 (hd0)"...  16 sectors are embedded.
succeeded
 Running "install /grub/stage1 (hd0) (hd0)1+16 p (hd0,0)/grub/stage2 /grub/menu.lst"... succeeded
Done.
grub> quit

/dev/sdb

root@ubuntu:~# grub
Probing devices to guess BIOS drives. This may take a long time.
       [ Minimal BASH-like line editing is supported.   For
         the   first   word,  TAB  lists  possible  command
         completions.  Anywhere else TAB lists the possible
         completions of a device/filename. ]
grub> device (hd1) /dev/sdb
grub> root (hd1,0)
grub> setup (hd1)
 Checking if "/boot/grub/stage1" exists... no
 Checking if "/grub/stage1" exists... yes
 Checking if "/grub/stage2" exists... yes
 Checking if "/grub/e2fs_stage1_5" exists... yes
 Running "embed /grub/e2fs_stage1_5 (hd1)"...  16 sectors are embedded.
succeeded
 Running "install /grub/stage1 (hd1) (hd1)1+16 p (hd1,0)/grub/stage2 /grub/menu.lst"... succeeded
Done.
grub> quit
root@ubuntu:~#
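
The two interactive sessions differ only in the BIOS drive name and the device. As a sketch, the commands can be generated and piped into the GRUB legacy shell in batch mode (grub --batch); this assumes, as above, that the boot partition is the first partition on each drive, and the function name grub_setup_script is made up for this example.

```shell
#!/bin/sh
# Print the GRUB legacy shell commands used above for one drive.
#   $1 = BIOS drive name to map, e.g. hd1
#   $2 = Linux device, e.g. /dev/sdb
# Assumes the boot partition is the first partition (",0") on that drive.
grub_setup_script() {
    printf 'device (%s) %s\nroot (%s,0)\nsetup (%s)\nquit\n' "$1" "$2" "$1" "$1"
}

# Typical use on a live system (GRUB legacy only):
#   grub_setup_script hd0 /dev/sda | grub --batch
#   grub_setup_script hd1 /dev/sdb | grub --batch
```

Review the printed commands before piping them to grub, since setup writes to the drive's boot sector.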

References

  • http://www.howtoforge.com/replacing_hard_disks_in_a_raid1_array