A drive in my RAID6 has failed

Paul Hartman-7
Hi,

I woke up this morning to see the dreaded email from mdadm telling me
one of my drives failed overnight, while I was happily dreaming about
cute puppies and kittens installing a rainbow-colored roof on my
house. The array is a RAID6 (two parity drives) and this is the
current state:

md0 : active raid6 sdd1[5] sdg1[4] sde1[3](F) sdh1[2] sdf1[1] sdi1[0]
      11720009728 blocks super 1.2 level 6, 512k chunk, algorithm 2 [6/5] [UUU_UU]
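
For anyone reading along, here is a minimal sketch of how the failed member could be picked out of that status text automatically. The function name failed_members is made up for illustration; it only assumes that faulty members carry the "(F)" suffix, as sde1 does above.

```shell
#!/bin/sh
# Print the member devices flagged "(F)" (faulty) in mdstat-format text on stdin.
# failed_members is a hypothetical helper name, not part of mdadm.
failed_members() {
    # One token per line, keep tokens like "sde1[3](F)", strip the "[N](F)" suffix.
    tr ' ' '\n' | sed -n 's/^\([[:alnum:]]*\)\[[0-9]*](F)$/\1/p'
}

# Sample line copied from the status above; on a live system one would run:
#   failed_members < /proc/mdstat
printf '%s\n' 'md0 : active raid6 sdd1[5] sdg1[4] sde1[3](F) sdh1[2] sdf1[1] sdi1[0]' \
    | failed_members
# prints: sde1
```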

I've been using RAID in Linux for years, but this is actually the
first time I've had a disk fail in one.

If I remember correctly, the process should be as simple as:

#remove the failed disk from the array:
mdadm /dev/md0 -r /dev/sde1

#pull the drive, replace with new one, partition it, then add it to the array:
mdadm /dev/md0 -a /dev/sde1

and sit back and eat popcorn while I enjoy the blinkenlights for the
next several hours/days? :) Any advice/suggestions for managing this
process any differently?
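
Spelled out with long options, the sequence might look like the sketch below. It only prints the commands rather than running them, so it can be reviewed first. The helper name replace_plan is hypothetical, and the --fail step is an addition that is only needed when the member is not already marked (F).

```shell
#!/bin/sh
# Print (not run) the mdadm commands for swapping out a failed member.
# Usage: replace_plan ARRAY OLD_MEMBER NEW_MEMBER
# replace_plan is a hypothetical helper name used for illustration.
replace_plan() {
    array=$1 old=$2 new=$3
    # Only needed if the member is not already marked (F) in /proc/mdstat.
    echo "mdadm --manage $array --fail $old"
    # Same as "mdadm ARRAY -r MEMBER".
    echo "mdadm --manage $array --remove $old"
    # ...swap the physical disk and partition the new one at this point...
    # Same as "mdadm ARRAY -a MEMBER".
    echo "mdadm --manage $array --add $new"
    # Watch the rebuild progress.
    echo "cat /proc/mdstat"
}

# The new disk is assumed to come up under the same name as the old one.
replace_plan /dev/md0 /dev/sde1 /dev/sde1
```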

For now I have unmounted the filesystem that sits atop it, to prevent
any more writes from occurring, just in case...

Thanks,
Paul


Re: A drive in my RAID6 has failed

Michael Orlitzky-2
On 09/05/2013 12:49 PM, Paul Hartman wrote:

> Hi,
>
> I woke up this morning to see the dreaded email from mdadm telling me
> one of my drives failed overnight, while I was happily dreaming about
> cute puppies and kittens installing a rainbow-colored roof on my
> house. The array is a RAID6 (two parity drives) and this is the
> current state:
>
> md0 : active raid6 sdd1[5] sdg1[4] sde1[3](F) sdh1[2] sdf1[1] sdi1[0]
>       11720009728 blocks super 1.2 level 6, 512k chunk, algorithm 2 [6/5] [UUU_UU]
>
> I've been using RAID in Linux for years, but this is actually the
> first time I've had a disk fail in one.
>
> If I remember correctly, the process should be as simple as:
>
> #remove the failed disk from the array:
> mdadm /dev/md0 -r /dev/sde1
>
> #pull the drive, replace with new one, partition it, then add it to the array:
> mdadm /dev/md0 -a /dev/sde1
>
> and sit back and eat popcorn while I enjoy the blinkenlights for the
> next several hours/days? :) Any advice/suggestions for managing this
> process any differently?
>

This is the process I always follow:

  http://www.howtoforge.com/replacing_hard_disks_in_a_raid1_array

The sfdisk trick will save you a bit of hassle.
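
A rough sketch of that trick, printing the commands instead of running them. The helper name copy_table_plan and the choice of /dev/sdd as the healthy source disk are assumptions for illustration. Members of roughly 3 TB, as in this array, are probably on GPT-labelled disks, and older sfdisk versions may not handle GPT, so the sketch also shows the sgdisk equivalent.

```shell
#!/bin/sh
# Print (not run) commands that copy a partition table from a healthy
# member disk to the replacement disk.
# Usage: copy_table_plan LABEL_TYPE SOURCE_DISK TARGET_DISK
# copy_table_plan is a hypothetical helper name used for illustration.
copy_table_plan() {
    label=$1 src=$2 dst=$3
    case $label in
        dos)
            # Dump the source layout and feed it to the target.
            echo "sfdisk -d $src | sfdisk $dst"
            ;;
        gpt)
            # sgdisk takes the target as the option argument and the source last.
            echo "sgdisk --replicate=$dst $src"
            # Give the copy its own disk and partition GUIDs.
            echo "sgdisk --randomize-guids $dst"
            ;;
        *)
            echo "unknown label type: $label" >&2
            return 1
            ;;
    esac
}

# Assumption: /dev/sdd is a healthy member and /dev/sde is the new disk.
copy_table_plan gpt /dev/sdd /dev/sde
```

Double-check source and target before running the real commands, since swapping them would overwrite the partition table of a good disk.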



Re: A drive in my RAID6 has failed

Paul Hartman-3
On Thu, Sep 5, 2013 at 11:52 AM, Michael Orlitzky wrote:
> This is the process I always follow:
>
>   http://www.howtoforge.com/replacing_hard_disks_in_a_raid1_array
>
> The sfdisk trick will save you a bit of hassle.

Thanks, it looks like I was on the right path! Crossing my fingers...


Re: A drive in my RAID6 has failed

Paul Hartman-3
On Thu, Sep 5, 2013 at 12:11 PM, Paul Hartman wrote:
> On Thu, Sep 5, 2013 at 11:52 AM, Michael Orlitzky wrote:
>> This is the process I always follow:
>>
>>   http://www.howtoforge.com/replacing_hard_disks_in_a_raid1_array
>>
>> The sfdisk trick will save you a bit of hassle.
>
> Thanks, it looks like I was on the right path! Crossing my fingers...

So, I probably should not have attempted to do this immediately after
eating dinner. My brain was not operating at full speed, and I went
ahead and pulled the drive before removing it from the array. Oops! As
soon as I pulled the latch to release the drive, I had that "oh no!"
moment. Luckily, as it turns out, md (or mdadm? or udev?) was nice
enough to automatically remove it for me when the drive ceased to
exist.

So, I simply inserted and partitioned the new drive, added it to the
array and away we go!

md0 : active raid6 sde1[6] sdd1[5] sdg1[4] sdh1[2] sdf1[1] sdi1[0]
      11720009728 blocks super 1.2 level 6, 512k chunk, algorithm 2 [6/5] [UUU_UU]
      [>....................]  recovery =  2.3% (69513216/2930002432) finish=428.7min speed=111206K/sec

When I wake up in the morning, I hope there won't be any errors.
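
As a sanity check on that finish= figure, dividing the remaining blocks by the current speed gives the same number. This is only a sketch of the arithmetic, not necessarily how the kernel computes it, and eta_minutes is a made-up helper name.

```shell
#!/bin/sh
# Estimate minutes remaining (one decimal place) from the numbers on an
# mdstat recovery line.
# Usage: eta_minutes DONE_BLOCKS TOTAL_BLOCKS SPEED_K_PER_SEC
# eta_minutes is a hypothetical helper name used for illustration.
eta_minutes() {
    done_k=$1 total_k=$2 speed=$3
    # Work in tenths of a minute so plain integer arithmetic is enough.
    tenths=$(( (total_k - done_k) * 10 / (speed * 60) ))
    echo "$(( tenths / 10 )).$(( tenths % 10 ))"
}

# Values from the recovery line above.
eta_minutes 69513216 2930002432 111206
# prints: 428.7
```

The estimate assumes the current speed holds for the rest of the disk.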


BTW -- a couple of tips I found which speed up RAID building/recovery
tremendously (season to taste; a larger stripe cache uses more RAM,
and speed_limit_max is in KB/sec):

echo 32768 > /sys/block/md0/md/stripe_cache_size
echo 200000 > /proc/sys/dev/raid/speed_limit_max
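
Before picking a stripe_cache_size, it may help to estimate what it costs in RAM. The kernel md documentation gives the consumption as page size times number of disks times stripe_cache_size. The sketch below just does that multiplication; stripe_cache_mib is a made-up helper name and the 4096-byte page size is an assumption (check with "getconf PAGESIZE").

```shell
#!/bin/sh
# Approximate RAM used by the md stripe cache, in MiB.
# Usage: stripe_cache_mib ENTRIES NR_DISKS [PAGE_SIZE_BYTES]
# stripe_cache_mib is a hypothetical helper name used for illustration.
stripe_cache_mib() {
    entries=$1 disks=$2 page=${3:-4096}   # assume 4 KiB pages by default
    echo $(( entries * disks * page / 1024 / 1024 ))
}

# The value used above, on this 6-disk array.
stripe_cache_mib 32768 6
# prints: 768
```

So the setting above would take about 768 MiB on this array, which is worth knowing on a machine with little memory.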


Re: A drive in my RAID6 has failed

Paul Hartman-3
On Fri, Sep 6, 2013 at 12:46 AM, Paul Hartman wrote:

> So, I simply inserted and partitioned the new drive, added it to the
> array and away we go!
>
> md0 : active raid6 sde1[6] sdd1[5] sdg1[4] sdh1[2] sdf1[1] sdi1[0]
>       11720009728 blocks super 1.2 level 6, 512k chunk, algorithm 2 [6/5] [UUU_UU]
>       [>....................]  recovery =  2.3% (69513216/2930002432) finish=428.7min speed=111206K/sec
>
> When I wake up in the morning, I hope there won't be any errors.

Success! It took about 10 hours to rebuild the drive, versus the
roughly 7 hours that /proc/mdstat estimated above. Throughput near the
start of the disk is significantly higher than near the end, so the
early estimate was overly optimistic:

[3720270.120695] md: bind<sde1>
[3720270.162933] RAID conf printout:
[3720270.162942]  --- level:6 rd:6 wd:5
[3720270.162949]  disk 0, o:1, dev:sdi1
[3720270.162954]  disk 1, o:1, dev:sdf1
[3720270.162958]  disk 2, o:1, dev:sdh1
[3720270.162962]  disk 3, o:1, dev:sde1
[3720270.162965]  disk 4, o:1, dev:sdg1
[3720270.162969]  disk 5, o:1, dev:sdd1
[3720270.163060] md: recovery of RAID array md0
[3720270.163067] md: minimum _guaranteed_  speed: 1000 KB/sec/disk.
[3720270.163071] md: using maximum available idle IO bandwidth (but not more than 200000 KB/sec) for recovery.
[3720270.163085] md: using 128k window, over a total of 2930002432k.
[3756293.459324] md: md0: recovery done.
[3756294.797961] RAID conf printout:
[3756294.797969]  --- level:6 rd:6 wd:6
[3756294.797974]  disk 0, o:1, dev:sdi1
[3756294.797979]  disk 1, o:1, dev:sdf1
[3756294.797982]  disk 2, o:1, dev:sdh1
[3756294.797986]  disk 3, o:1, dev:sde1
[3756294.797989]  disk 4, o:1, dev:sdg1
[3756294.797992]  disk 5, o:1, dev:sdd1
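
The rebuild time can be read straight off the kernel timestamps (seconds since boot) on the "recovery of RAID array" and "recovery done" lines. A small sketch of the arithmetic follows; elapsed_hms is a made-up helper name.

```shell
#!/bin/sh
# Elapsed time between two dmesg timestamps (seconds since boot).
# Usage: elapsed_hms START_TS END_TS
# elapsed_hms is a hypothetical helper name used for illustration.
elapsed_hms() {
    start=${1%.*} end=${2%.*}      # drop the fractional part
    secs=$(( end - start ))
    echo "$(( secs / 3600 ))h $(( secs % 3600 / 60 ))m $(( secs % 60 ))s"
}

# Timestamps of "recovery of RAID array md0" and "recovery done" above.
elapsed_hms 3720270.163060 3756293.459324
# prints: 10h 0m 23s

# Average rebuild rate in K/sec over the whole 2930002432k member.
echo $(( 2930002432 / (3756293 - 3720270) ))
# prints: 81336
```

That average of about 81 MB/sec, against the 111 MB/sec seen at the 2.3% mark, accounts for the gap between the early estimate and the actual finish time.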