btrfs fails to balance

btrfs fails to balance

William Kenworthy
Can someone suggest what is causing a balance on this raid 1, 3 disk
volume to successfully complete but leave the data unevenly distributed?
 Content is mostly VM images.

sdc and sdd are 2TB WD greens, and sda is a 2TB WD red.


rattus backups # btrfs fi sh
Label: none  uuid: 04d8ff4f-fe19-4530-ab45-d82fcd647515
        Total devices 1 FS bytes used 8.25GiB
        devid    1 size 271.36GiB used 23.04GiB path /dev/sdb3

Label: none  uuid: f5a284b6-442f-4b3d-aa1a-8d6296f517b1
        Total devices 3 FS bytes used 1.90TiB
        devid    1 size 1.82TiB used 1.77TiB path /dev/sdc
        devid    2 size 1.82TiB used 1.77TiB path /dev/sdd
        devid    4 size 1.82TiB used 270.03GiB path /dev/sda

Btrfs v3.18
rattus backups #  btrfs fi df /mnt/btrfs-root/
Data, RAID1: total=1.90TiB, used=1.89TiB
System, RAID1: total=32.00MiB, used=464.00KiB
Metadata, RAID1: total=8.00GiB, used=6.53GiB
GlobalReserve, single: total=512.00MiB, used=0.00B
rattus backups # btrfs fi usage /mnt/btrfs-root/
Overall:
    Device size:                   5.46TiB
    Device allocated:              3.81TiB
    Device unallocated:            1.65TiB
    Used:                          3.80TiB
    Free (estimated):            845.00GiB      (min: 845.00GiB)
    Data ratio:                       2.00
    Metadata ratio:                   2.00
    Global reserve:              512.00MiB      (used: 154.42MiB)

Data,RAID1: Size:1.90TiB, Used:1.89TiB
   /dev/sda      265.00GiB
   /dev/sdc        1.77TiB
   /dev/sdd        1.77TiB

Metadata,RAID1: Size:9.00GiB, Used:6.96GiB
   /dev/sda        6.00GiB
   /dev/sdc        7.00GiB
   /dev/sdd        5.00GiB

System,RAID1: Size:32.00MiB, Used:464.00KiB
   /dev/sda       32.00MiB
   /dev/sdd       32.00MiB

Unallocated:
   /dev/sda        1.55TiB
   /dev/sdc       47.02GiB
   /dev/sdd       47.99GiB
rattus backups #


Re: btrfs fails to balance

Marc Joliet
Am Mon, 19 Jan 2015 16:32:53 +0800
schrieb Bill Kenworthy <[hidden email]>:

> Can someone suggest what is causing a balance on this raid 1, 3 disk
> volume to successfully complete but leave the data unevenly distributed?
>  Content is mostly VM images.
>
> sdc and sdd are 2TB WD greens, and sda is a 2TB WD red.
>
>
> rattus backups # btrfs fi sh
> Label: none  uuid: 04d8ff4f-fe19-4530-ab45-d82fcd647515
>         Total devices 1 FS bytes used 8.25GiB
>         devid    1 size 271.36GiB used 23.04GiB path /dev/sdb3
>
> Label: none  uuid: f5a284b6-442f-4b3d-aa1a-8d6296f517b1
>         Total devices 3 FS bytes used 1.90TiB
>         devid    1 size 1.82TiB used 1.77TiB path /dev/sdc
>         devid    2 size 1.82TiB used 1.77TiB path /dev/sdd
>         devid    4 size 1.82TiB used 270.03GiB path /dev/sda
>
> Btrfs v3.18
> rattus backups #  btrfs fi df /mnt/btrfs-root/
> Data, RAID1: total=1.90TiB, used=1.89TiB
> System, RAID1: total=32.00MiB, used=464.00KiB
> Metadata, RAID1: total=8.00GiB, used=6.53GiB
> GlobalReserve, single: total=512.00MiB, used=0.00B
> rattus backups # btrfs fi usage /mnt/btrfs-root/
> Overall:
>     Device size:                   5.46TiB
>     Device allocated:              3.81TiB
>     Device unallocated:            1.65TiB
>     Used:                          3.80TiB
>     Free (estimated):            845.00GiB      (min: 845.00GiB)
>     Data ratio:                       2.00
>     Metadata ratio:                   2.00
>     Global reserve:              512.00MiB      (used: 154.42MiB)
>
> Data,RAID1: Size:1.90TiB, Used:1.89TiB
>    /dev/sda      265.00GiB
>    /dev/sdc        1.77TiB
>    /dev/sdd        1.77TiB

OK, that is odd.  While my experience with BTRFS RAID1 was that the FS is never
perfectly balanced, this is just... weird.

If I were you, I would ask on the BTRFS ML.

> Metadata,RAID1: Size:9.00GiB, Used:6.96GiB
>    /dev/sda        6.00GiB
>    /dev/sdc        7.00GiB
>    /dev/sdd        5.00GiB

This I would see as perfectly possible with BTRFS RAID1.

> System,RAID1: Size:32.00MiB, Used:464.00KiB
>    /dev/sda       32.00MiB
>    /dev/sdd       32.00MiB
>
> Unallocated:
>    /dev/sda        1.55TiB
>    /dev/sdc       47.02GiB
>    /dev/sdd       47.99GiB
> rattus backups #

HTH
--
Marc Joliet
--
"People who think they know everything really annoy those of us who know we
don't" - Bjarne Stroustrup


Re: btrfs fails to balance

Marc Stuermer
In reply to this post by William Kenworthy

Zitat von Bill Kenworthy <[hidden email]>:

> Can someone suggest what is causing a balance on this raid 1, 3 disk
> volume to successfully complete but leave the data unevenly distributed?
>  Content is mostly VM images.

Which kernel version are you running?


Re: btrfs fails to balance

Marc Stuermer
In reply to this post by William Kenworthy
Am 19.01.2015 um 09:32 schrieb Bill Kenworthy:

> Can someone suggest what is causing a balance on this raid 1, 3 disk
> volume to successfully complete but leave the data unevenly distributed?
>   Content is mostly VM images.
>
> sdc and sdd are 2TB WD greens, and sda is a 2TB WD red.

Question: was /dev/sda a smaller HDD before the 2 TB WD red?

If your sda was around 250 GB before you replaced it with the 2 TB
drive, did you just issue a "btrfs balance" after that?  If so, Btrfs is
still configured for 2x2 TB + 1x250 GB, which is why.

The proper Btrfs way when replacing a smaller HDD with a bigger one in
RAID 1 is to issue "btrfs filesystem resize" so that it uses all of the
available space.

This would be one possible explanation for the behaviour of your array.
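
For reference, a minimal sketch of the usual sequence, assuming the
replaced disk is devid 4 and the filesystem is mounted at
/mnt/btrfs-root (both taken from the output earlier in the thread):

# grow devid 4 to the full size of the underlying device ("max" = use everything)
btrfs filesystem resize 4:max /mnt/btrfs-root
# then rebalance so the existing RAID1 chunks spread across the new capacity
btrfs balance start /mnt/btrfs-root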


Re: btrfs fails to balance

William Kenworthy
In reply to this post by Marc Stuermer
On 19/01/15 17:46, Marc Stürmer wrote:
>
> Zitat von Bill Kenworthy <[hidden email]>:
>
>> Can someone suggest what is causing a balance on this raid 1, 3 disk
>> volume to successfully complete but leave the data unevenly distributed?
>>  Content is mostly VM images.
>
> On which kernel version are you?
>

3.17.7 ... will move to 3.18 tomorrow as there are some btrfs fixes,
but nothing that would relate to this problem, which I think has been
present for quite a while.

BillK



Re: btrfs fails to balance

William Kenworthy
In reply to this post by Marc Stuermer
On 19/01/15 18:45, Marc Stürmer wrote:

> Am 19.01.2015 um 09:32 schrieb Bill Kenworthy:
>
>> Can someone suggest what is causing a balance on this raid 1, 3 disk
>> volume to successfully complete but leave the data unevenly distributed?
>>   Content is mostly VM images.
>>
>> sdc and sdd are 2TB WD greens, and sda is a 2TB WD red.
>
> Question: was /dev/sda a smaller HDD before the 2 TB WD red?
>
> If your sda was around 250 GB before you changed it with 2 TB, did you
> just issue a "btrfs balance" after that? If so, Btrfs just configured
> itself for 2*2 TB + 1*250 GB, that's why.
>
> The proper Btrfs way if replacing a smaller hdd for a bigger one in Raid
> 1 is to issue "btrfs filesystem resize" to make it use all of the
> available space.
>
> This would be one possible explanation for the behaviour of your array.
>

Brilliant, you have hit on the answer!  The ancient 300 GB system disk
was sda at one point and moved to sdb - possibly at the time I changed
to using UUIDs.  I've just resized all the disks and it has now moved
past 300 GB for the first time, with the other two falling into step as
the data moves.

I moved to UUIDs because the machine has a number of SATA ports and a
PCI-e SATA adaptor, and the sd* drive numbering kept moving around when
I added the WD red.

BillK




Re: btrfs fails to balance

Marc Stuermer
Am 19.01.2015 um 12:02 schrieb Bill Kenworthy:

> Brilliant, you have hit on the answer! - The ancient 300GB system disk
> was sda at one point and moved to sdb - possibly at the time I changed
> to using UUID's.  Ive just resized all the disks and its now moved past
> 300G for the first time as well as the other two falling in step with
> the data moving.

Actually this is not from my own experience; I am still not touching
Btrfs with a ten-foot pole at the moment.

Russell Coker wrote about the problem you had, and also mentioned the
solution; just take a look at it here:

http://etbe.coker.com.au/2014/12/05/btrfs-status-dec-2014/

He had the same problem with a bigger drive. His Btrfs status updates
are something I do follow.


Re: btrfs fails to balance

James-2
In reply to this post by William Kenworthy
Bill Kenworthy <billk <at> iinet.net.au> writes:


> > Am 19.01.2015 um 09:32 schrieb Bill Kenworthy:

> >> Can someone suggest what is causing a balance on this raid 1

Interesting.
I am about to test (reboot) a btrfs, raid one installation.

> Brilliant, you have hit on the answer! - The ancient 300GB system disk
> was sda at one point and moved to sdb - possibly at the time I changed
> to using UUID's.  Ive just resized all the disks and its now moved past
> 300G for the first time as well as the other two falling in step with
> the data moving.

I was wondering what my /etc/fstab should look like using UUIDs, RAID 1
and btrfs.

Could you post your /etc/fstab and any other modifications you made to
your installation related to the btrfs RAID 1 UUID setup?

I'm just using (2) identical 2 TB disks for my new gentoo workstation.

> I moved to UUID's as the machine has a number of sata ports and a PCI-e
> sata adaptor and the sd* drive numbering kept moving around when I added
> the WD red.


Eventually, I want to run CephFS on several of these raid one btrfs
systems for some clustering code experiments. I'm not sure how that
will affect, if at all, the raid 1-btrfs-uuid setup.


TIA,
James





Re: btrfs fails to balance

Rich Freeman
On Mon, Jan 19, 2015 at 11:50 AM, James <[hidden email]> wrote:
> Bill Kenworthy <billk <at> iinet.net.au> writes:
>
> I was wondering what my /etc/fstab should look like using uuids, raid 1 and
> btrfs.

From mine:
/dev/disk/by-uuid/7d9f3772-a39c-408b-9be0-5fa26eec8342   /       btrfs   noatime,ssd,compress=none
/dev/disk/by-uuid/cd074207-9bc3-402d-bee8-6a8c77d56959   /data   btrfs   noatime,compress=none

The first is a single disk, the second is 5-drive raid1.

I disabled compression due to some bugs a few kernels ago.  I need to
look into whether those were fixed - normally I'd use lzo.

I use dracut - obviously you need to take some care when running root
on a disk identified by UUID, since this isn't a kernel feature.  With
btrfs, as long as you identify one device in an array it will find the
rest.  They all share the same filesystem UUID though.
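
As a rough sketch of what that amounts to inside the initramfs (the
UUID here is the /data one from the fstab above, used purely for
illustration):

# register all btrfs member devices with the kernel, then mount by the
# shared filesystem UUID; once scanned, naming any one member is enough
# for btrfs to assemble the whole array
btrfs device scan
mount -t btrfs -o noatime UUID=cd074207-9bc3-402d-bee8-6a8c77d56959 /data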

Probably also worth noting that if you try to run btrfs on top of lvm
and then create an lvm snapshot, btrfs can cause spectacular breakage
when it sees two devices whose metadata identify them as being the
same - I don't know where it went, but there was talk of trying to use
a generation id/etc. to keep track of which ones are old vs. recent in
this scenario.

>
> Eventually, I want to run CephFS on several of these raid one btrfs
> systems for some clustering code experiments. I'm not sure how that
> will affect, if at all, the raid 1-btrfs-uuid setup.
>

Btrfs would run below CephFS I imagine, so it wouldn't affect it at all.

The main thing keeping me away from CephFS is that it has no mechanism
for resolving silent corruption.  Btrfs underneath it would obviously
help, though not for failure modes that involve CephFS itself.  I'd
feel a lot better if CephFS had some way of determining which copy was
the right one other than "the master server always wins."
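
For context, a rough sketch of the commands in question (the
placement-group ID is hypothetical); as described above, repair at this
point essentially re-copies the primary's version to the replicas
rather than picking a provably good copy:

# deep scrub reads and compares object data across replicas
ceph pg deep-scrub 2.1f
# repair resolves mismatches, historically in the primary's favour
ceph pg repair 2.1f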

--
Rich


Re: btrfs fails to balance

William Kenworthy
In reply to this post by Marc Stuermer
On 19/01/15 22:08, Marc Stuermer wrote:

> Am 19.01.2015 um 12:02 schrieb Bill Kenworthy:
>
>> Brilliant, you have hit on the answer! - The ancient 300GB system disk
>> was sda at one point and moved to sdb - possibly at the time I changed
>> to using UUID's.  Ive just resized all the disks and its now moved past
>> 300G for the first time as well as the other two falling in step with
>> the data moving.
>
> Actually this is not my doing, I am not touching Btrfs at the moment
> with a ten foot pole yet because yet.
>
> Russel Coker wrote about the problem you had, just take a look at it
> here and also mentioned the solution:
>
> http://etbe.coker.com.au/2014/12/05/btrfs-status-dec-2014/
>
> He had the same problem with a bigger drive. His Btrfs status updates
> are something I do follow indeed.
>

Yes, I read those but failed to "connect the dots" :)

Balance almost done:

rattus backups # btrfs fi usage -T /mnt/btrfs-root/
Overall:
    Device size:                   5.46TiB
    Device allocated:              3.81TiB
    Device unallocated:            1.65TiB
    Used:                          3.80TiB
    Free (estimated):            843.92GiB      (min: 843.92GiB)
    Data ratio:                       2.00
    Metadata ratio:                   2.00
    Global reserve:              512.00MiB      (used: 97.84MiB)

                   Data    Metadata System
         single    RAID1   RAID1    RAID1     Unallocated

/dev/sda         - 1.26TiB  6.00GiB  32.00MiB   562.99GiB
/dev/sdc         - 1.26TiB  8.00GiB         -   562.02GiB
/dev/sdd         - 1.27TiB  4.00GiB  32.00MiB   561.99GiB
         ========= ======= ======== ========= ===========
Total    512.00MiB 1.90TiB  9.00GiB  32.00MiB     1.65TiB
Used      97.84MiB 1.90TiB  6.42GiB 304.00KiB
rattus backups #


Re: btrfs fails to balance

William Kenworthy
In reply to this post by James-2
On 20/01/15 00:50, James wrote:

> Bill Kenworthy <billk <at> iinet.net.au> writes:
>
>
>>> Am 19.01.2015 um 09:32 schrieb Bill Kenworthy:
>
>>>> Can someone suggest what is causing a balance on this raid 1
>
> Interesting.
> I am about to test (reboot) a btrfs, raid one installation.
>
>> Brilliant, you have hit on the answer! - The ancient 300GB system disk
>> was sda at one point and moved to sdb - possibly at the time I changed
>> to using UUID's.  Ive just resized all the disks and its now moved past
>> 300G for the first time as well as the other two falling in step with
>> the data moving.
>
> I was wondering what my /etc/fstab should look like using uuids, raid 1 and
> btrfs.
>
> Could you post your /etc/fstab and any other modifications you made to
> your installation related to the btrfs, raid 1 uuid setup?
>
> I'm just using (2) identical 2T disks for my new gentoo workstation.
>
>> I moved to UUID's as the machine has a number of sata ports and a PCI-e
>> sata adaptor and the sd* drive numbering kept moving around when I added
>> the WD red.
>
>
> Eventually, I want to run CephFS on several of these raid one btrfs
> systems for some clustering code experiments. I'm not sure how that
> will affect, if at all, the raid 1-btrfs-uuid setup.
>
>
> TIA,
> James
>
>
>
>

Sorry about the line wrap:

rattus backups # lsblk
NAME   MAJ:MIN RM   SIZE RO TYPE MOUNTPOINT
sda      8:0    0   1.8T  0 disk
sdb      8:16   0 279.5G  0 disk
├─sdb1   8:17   0   100M  0 part
├─sdb2   8:18   0     8G  0 part [SWAP]
└─sdb3   8:19   0 271.4G  0 part /
sdc      8:32   0   1.8T  0 disk /mnt/vm
sdd      8:48   0   1.8T  0 disk
sde      8:64   0   1.8T  0 disk
rattus backups #

rattus backups # blkid
/dev/sda: UUID="f5a284b6-442f-4b3d-aa1a-8d6296f517b1" UUID_SUB="9003b772-3487-447a-9794-50cf9880a9c0" TYPE="btrfs" PTTYPE="dos"
/dev/sdc: UUID="f5a284b6-442f-4b3d-aa1a-8d6296f517b1" UUID_SUB="20523d9d-3d90-439e-ad68-62def0824198" TYPE="btrfs"
/dev/sdb1: UUID="cc5f4bf7-28fc-4661-9d24-a0c9d0048f40" TYPE="ext2"
/dev/sdb2: UUID="dddb7e60-89a9-40d4-bf6b-ff4644e079e9" TYPE="swap"
/dev/sdb3: UUID="04d8ff4f-fe19-4530-ab45-d82fcd647515" UUID_SUB="72134593-8c9f-436f-98ce-fbb07facbf35" TYPE="btrfs"
/dev/sdd: UUID="f5a284b6-442f-4b3d-aa1a-8d6296f517b1" UUID_SUB="2ca026f7-e5c9-4ece-bba1-809ddb03979b" TYPE="btrfs"
rattus backups #


rattus backups # cat /etc/fstab

UUID=cc5f4bf7-28fc-4661-9d24-a0c9d0048f40  /boot            ext2   noauto,noatime                                              1 2
UUID=04d8ff4f-fe19-4530-ab45-d82fcd647515  /                btrfs  defaults,noatime,compress=lzo,space_cache                   0 0
UUID=dddb7e60-89a9-40d4-bf6b-ff4644e079e9  none             swap   sw                                                          0 0
UUID=f5a284b6-442f-4b3d-aa1a-8d6296f517b1  /mnt/btrfs-root  btrfs  defaults,noatime,compress=lzo,space_cache                   0 0
UUID=f5a284b6-442f-4b3d-aa1a-8d6296f517b1  /home/wdk        btrfs  defaults,noatime,compress=lzo,space_cache,subvolid=258      0 0
UUID=f5a284b6-442f-4b3d-aa1a-8d6296f517b1  /mnt/backups     btrfs  defaults,noatime,compress=lzo,space_cache,subvolid=365      0 0
UUID=f5a284b6-442f-4b3d-aa1a-8d6296f517b1  /mnt/vm          btrfs  defaults,noatime,compress=lzo,space_cache,subvolid=14916    0 0

rattus backups #
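
For anyone reproducing this layout: the subvolume IDs used in the
subvolid= options above can be looked up with something like the
following, run against the top-level mount:

# list subvolumes with their IDs - these are the numbers fstab refers to
btrfs subvolume list /mnt/btrfs-root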



Re: btrfs fails to balance

William Kenworthy
In reply to this post by Rich Freeman
On 20/01/15 05:10, Rich Freeman wrote:

> On Mon, Jan 19, 2015 at 11:50 AM, James <[hidden email]> wrote:
>> Bill Kenworthy <billk <at> iinet.net.au> writes:
>>
>> I was wondering what my /etc/fstab should look like using uuids, raid 1 and
>> btrfs.
>
> From mine:
> /dev/disk/by-uuid/7d9f3772-a39c-408b-9be0-5fa26eec8342          /
>          btrfs           noatime,ssd,compress=none
> /dev/disk/by-uuid/cd074207-9bc3-402d-bee8-6a8c77d56959          /data
>          btrfs           noatime,compress=none
>
> The first is a single disk, the second is 5-drive raid1.
>
> I disabled compression due to some bugs a few kernels ago.  I need to
> look into whether those were fixed - normally I'd use lzo.
>
> I use dracut - obviously you need to use some care when running root
> on a disk identified by uuid since this isn't a kernel feature.  With
> btrfs as long as you identify one device in an array it will find the
> rest.  They all have the same UUID though.
>
> Probably also worth nothing that if you try to run btrfs on top of lvm
> and then create an lvm snapshot btrfs can cause spectacular breakage
> when it sees two devices whose metadata identify them as being the
> same - I don't know where it went but there was talk of trying to use
> a generation id/etc to keep track of which ones are old vs recent in
> this scenario.
>
>>
>> Eventually, I want to run CephFS on several of these raid one btrfs
>> systems for some clustering code experiments. I'm not sure how that
>> will affect, if at all, the raid 1-btrfs-uuid setup.
>>
>
> Btrfs would run below CephFS I imagine, so it wouldn't affect it at all.
>
> The main thing keeping me away from CephFS is that it has no mechanism
> for resolving silent corruption.  Btrfs underneath it would obviously
> help, though not for failure modes that involve CephFS itself.  I'd
> feel a lot better if CephFS had some way of determining which copy was
> the right one other than "the master server always wins."
>

Forget ceph on btrfs for the moment - the COW kills it stone dead after
real use.  When running a small handful of VMs on a raid1 with ceph -
sloooooooooooow :)

You can turn off COW and go single on btrfs to speed it up but bugs in
ceph and btrfs lose data real fast!
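
A minimal sketch of how COW is typically disabled per directory on
btrfs (the OSD data path is purely illustrative; the +C attribute only
applies to files created after it is set, and it also turns off
checksumming for those files):

# mark a hypothetical ceph OSD data directory as no-COW for new files
chattr +C /var/lib/ceph/osd/ceph-0
# verify - the 'C' flag should appear in the listing
lsattr -d /var/lib/ceph/osd/ceph-0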

ceph itself (my last setup trashed itself 6 months ago and I've given
up!) will only work under real use/heavy loads with lots of discrete
systems, ideally a 10G network, and small disks to spread the failure
domain.  Using 3 hosts and 2x2g disks per host wasn't near big enough :(
 Its design means that small-scale trials just won't work.

It's not designed for small-scale/low-end hardware, no matter how
attractive the idea is :(

BillK







Re: btrfs fails to balance

James-2
Bill Kenworthy <billk <at> iinet.net.au> writes:


> > The main thing keeping me away from CephFS is that it has no mechanism
> > for resolving silent corruption.  Btrfs underneath it would obviously
> > help, though not for failure modes that involve CephFS itself.  I'd
> > feel a lot better if CephFS had some way of determining which copy was
> > the right one other than "the master server always wins."


The "Giant" version 0.87 is a major release with many new fixes; it may
have the features you need.  Ongoing releases are currently up to v0.91.
The readings look promising, but I'll agree it needs to be tested with
non-critical data.

http://ceph.com/docs/master/release-notes/#v0-87-giant

http://ceph.com/docs/master/release-notes/#notable-changes


> Forget ceph on btrfs for the moment - the COW kills it stone dead after
> real use.  When running a small handful of VMs on a raid1 with ceph -
> sloooooooooooow :)

I'm staying away from VMs.  It's Spark on top of Mesos I'm after.
Maybe docker or another container solution, down the road.

I read that some are using an SSD with RAID 1 and bcache to speed up
performance and stability a bit.  I do not want to add an SSD to the mix
right now, as the (3) node development systems all have 32 GB of RAM.



> You can turn off COW and go single on btrfs to speed it up but bugs in
> ceph and btrfs lose data real fast!

Interesting idea, since I'll have raid1 underneath each node. I'll need to
dig into this idea a bit more.


> ceph itself (my last setup trashed itself 6 months ago and I've given
> up!) will only work under real use/heavy loads with lots of discrete
> systems, ideally 10G network, and small disks to spread the failure
> domain.  Using 3 hosts and 2x2g disks per host wasn't near big enough :(
>  Its design means that small scale trials just wont work.

Huh.  My systems are FX8350 (8-core) processors running at 4 GHz with
32 GB of RAM.  Water coolers will allow me to crank up the speed
(when/if needed) to 5 or 6 GHz.  Not Intel, but not low end either.


> Its not designed for small scale/low end hardware, no matter how
> attractive the idea is :(

Supposedly there are tools to measure/monitor ceph better now.  That is
one of the things I need to research: how to manage the small cluster
better and back off the throughput/load while monitoring performance on
a variety of different tasks.  Definitely not production usage.

I certainly appreciate your ceph experiences.  I filed a bug with the
version request for Giant v0.87.  Did you run the 9999 version?  What
versions did you experiment with?

I hope to set up Ansible to facilitate rapid installations of a variety
of gentoo systems used for cluster or ceph testing.  That way
configurations should be able to "reboot" after bad failures.  Did the
failures you experienced with Ceph require the gentoo-btrfs based
systems to be completely reinstalled from scratch, or just purging the
disks of Ceph and reconfiguring Ceph?

I'm hoping to "configure ceph" in such a way that failures do not
corrupt the gentoo-btrfs installation and only require repairing ceph;
your comments on that strategy are most welcome.




> BillK


James


>






Re: btrfs fails to balance

Rich Freeman
On Tue, Jan 20, 2015 at 10:07 AM, James <[hidden email]> wrote:
> Bill Kenworthy <billk <at> iinet.net.au> writes:
>
>> You can turn off COW and go single on btrfs to speed it up but bugs in
>> ceph and btrfs lose data real fast!
>
> Interesting idea, since I'll have raid1 underneath each node. I'll need to
> dig into this idea a bit more.
>

So, btrfs and ceph solve an overlapping set of problems in an
overlapping set of ways.  In general adding data security often comes
at the cost of performance, and obviously adding it at multiple layers
can come at the cost of additional performance.  I think the right
solution is going to depend on the circumstances.

If ceph provided that protection against bitrot I'd probably avoid a
COW filesystem entirely.  It isn't going to add any additional value,
and COW filesystems do have a performance cost.  If I had mirroring at
the ceph level I'd probably just run them on ext4 on lvm with no
mdadm/btrfs/whatever below that.  Availability is already ensured by
ceph - if you lose a drive then other nodes will pick up the load.  If
I didn't have robust mirroring at the ceph level then having mirroring
of some kind at the individual node level would improve availability.

On the other hand, ceph currently has some gaps, so having it on top
of zfs/btrfs could provide protection against bitrot.  However, right
now there is no way to turn off COW while leaving checksumming
enabled.  It would be nice if you could leave the checksumming on.
Then if there was bitrot btrfs would just return an error when you
tried to read the file, and then ceph would handle it like any other
disk error and use a mirrored copy on another node.  The problem with
ceph+ext4 is that if there is bitrot neither layer will detect it.
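
To illustrate the checksumming side of that: with COW (and therefore
checksums) left on, btrfs can surface bitrot both on read and via a
periodic scrub, sketched below with the mount point from earlier in the
thread:

# read every block and verify it against its checksum; on RAID1,
# corrupt blocks are repaired from the good mirror where possible
btrfs scrub start /mnt/btrfs-root
btrfs scrub status /mnt/btrfs-root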

Does btrfs+ceph really have a performance hit that is larger than
btrfs without ceph?  I fully expect it to be slower than ext4+ceph.
Btrfs in general performs fairly poorly right now - that is expected
to improve in the future, but I doubt that it will ever outperform
ext4 other than for specific operations that benefit from it (like
reflink copies).  It will always be faster to just overwrite one block
in the middle of a file than to write the block out to unallocated
space and update all the metadata.

--
Rich


Re: btrfs fails to balance

James-2
Rich Freeman <rich0 <at> gentoo.org> writes:


> >> You can turn off COW and go single on btrfs to speed it up but bugs in
> >> ceph and btrfs lose data real fast!

> So, btrfs and ceph solve an overlapping set of problems in an
> overlapping set of ways.  In general adding data security often comes
> at the cost of performance, and obviously adding it at multiple layers
> can come at the cost of additional performance.  I think the right
> solution is going to depend on the circumstances.

RAID 1 with btrfs can protect not only the ceph fs files but the gentoo
node installation itself.  I'm not so worried about performance, because
my main (end result) goal is to throttle codes so they run almost
exclusively in RAM (in memory), as designed by AMPLab.  Spark plus
Tachyon is a work in progress, for sure.  The DFS will be used in lieu
of HDFS for distributed/cluster types of apps, hence ceph.  Btrfs +
RAID 1 is a failsafe for the node installations, but also for all data.
I only intend to write out data once a job/run is finished; granted,
that is very experimental right now and will evolve over time.


>
> if ceph provided that protection against bitrot I'd probably avoid a
> COW filesystem entirely.  It isn't going to add any additional value,
> and they do have a performance cost.  If I had mirroring at the ceph
> level I'd probably just run them on ext4 on lvm with no
> mdadm/btrfs/whatever below that.  Availability is already ensured by
> ceph - if you lose a drive then other nodes will pick up the load.  If
> I didn't have robust mirroring at the ceph level then having mirroring
> of some kind at the individual node level would improve availability.

I've read that btrfs and ceph are a very suitable, yet very immature,
match for local-distributed file system needs.


> On the other hand, ceph currently has some gaps, so having it on top
> of zfs/btrfs could provide protection against bitrot.  However, right
> now there is no way to turn off COW while leaving checksumming
> enabled.  It would be nice if you could leave the checksumming on.
> Then if there was bitrot btrfs would just return an error when you
> tried to read the file, and then ceph would handle it like any other
> disk error and use a mirrored copy on another node.  The problem with
> ceph+ext4 is that if there is bitrot neither layer will detect it.

Good points, hence a flexible configuration where ceph can be reconfigured
and recovered as warranted, for this long term set of experiments.

> Does btrfs+ceph really have a performance hit that is larger than
> btrfs without ceph?  I fully expect it to be slower than ext4+ceph.
> Btrfs in general performs fairly poorly right now - that is expected
> to improve in the future, but I doubt that it will ever outperform
> ext4 other than for specific operations that benefit from it (like
> reflink copies).  It will always be faster to just overwrite one block
> in the middle of a file than to write the block out to unallocated
> space and update all the metadata.

I fully expect the combination of btrfs+ceph to mature and become
competitive.  It's not critical data, but a long-term experiment; any
critical data will be backed up off the 3-node cluster.  I hope to use
Ansible to enable recovery, configuration changes, and bringing on and
managing additional nodes; this is only a concept at the moment, but
googling around, it does seem to be a popular idea.

As always your insight and advice is warmly received.


James


>






Re: btrfs fails to balance

Rich Freeman
On Tue, Jan 20, 2015 at 12:27 PM, James <[hidden email]> wrote:
>
> Raid 1 with btrfs can not only protect the ceph fs files but the gentoo
> node installation itself.

Agree 100%.  Like I said, the right solution depends on your situation.

If you're using the server doing ceph storage only for file serving,
then protecting the OS installation isn't very important.  Heck, you
could just run the OS off of a USB stick.

If you're running nodes that do a combination of application and
storage, then obviously you need to worry about both, which probably
means not relying on ceph as your sole source of protection.  That
applies to a lot of "kitchen sink" setups where hosts don't have a
single role.

--
Rich


Re: btrfs fails to balance

William Kenworthy
In reply to this post by Rich Freeman
On 21/01/15 00:03, Rich Freeman wrote:

> On Tue, Jan 20, 2015 at 10:07 AM, James <[hidden email]> wrote:
>> Bill Kenworthy <billk <at> iinet.net.au> writes:
>>
>>> You can turn off COW and go single on btrfs to speed it up but bugs in
>>> ceph and btrfs lose data real fast!
>>
>> Interesting idea, since I'll have raid1 underneath each node. I'll need to
>> dig into this idea a bit more.
>>
>
> So, btrfs and ceph solve an overlapping set of problems in an
> overlapping set of ways.  In general adding data security often comes
> at the cost of performance, and obviously adding it at multiple layers
> can come at the cost of additional performance.  I think the right
> solution is going to depend on the circumstances.
>
> if ceph provided that protection against bitrot I'd probably avoid a
> COW filesystem entirely.  It isn't going to add any additional value,
> and they do have a performance cost.  If I had mirroring at the ceph
> level I'd probably just run them on ext4 on lvm with no
> mdadm/btrfs/whatever below that.  Availability is already ensured by
> ceph - if you lose a drive then other nodes will pick up the load.  If
> I didn't have robust mirroring at the ceph level then having mirroring
> of some kind at the individual node level would improve availability.
>
> On the other hand, ceph currently has some gaps, so having it on top
> of zfs/btrfs could provide protection against bitrot.  However, right
> now there is no way to turn off COW while leaving checksumming
> enabled.  It would be nice if you could leave the checksumming on.
> Then if there was bitrot btrfs would just return an error when you
> tried to read the file, and then ceph would handle it like any other
> disk error and use a mirrored copy on another node.  The problem with
> ceph+ext4 is that if there is bitrot neither layer will detect it.
>
> Does btrfs+ceph really have a performance hit that is larger than
> btrfs without ceph?  I fully expect it to be slower than ext4+ceph.
> Btrfs in general performs fairly poorly right now - that is expected
> to improve in the future, but I doubt that it will ever outperform
> ext4 other than for specific operations that benefit from it (like
> reflink copies).  It will always be faster to just overwrite one block
> in the middle of a file than to write the block out to unallocated
> space and update all the metadata.
>

answer to both you and James here:

I think it was pre 8.0 when I dropped out.  It's Ceph that suffers from
bitrot - I use the "golden master" approach to generating the VMs, so
corruption was obvious.  I did report one bug in the early days that
turned out to be btrfs, but I think it was largely ceph, which has been
borne out by consolidating the ceph trial hardware and using it with
btrfs and the same storage - rare problems, and I can point to
hardware/power when they happened.

The performance hit was not due to lack of horsepower (CPU, RAM, etc.)
but due to I/O - both network bandwidth and the internal bus on the
hosts.  That is why a small number of systems, no matter how powerful,
won't work well.  For real performance, I saw people using SSDs and
large numbers of hosts in order to distribute the data flows - this
does work, and I saw some insane numbers posted.  It also requires
multiple networks (internal and external) to separate the flows (not
VLANs but dedicated pipes) due to the extreme burstiness of the
traffic.  As well as VM images, I had backups (using dirvish) and
thousands of security camera images.  Deletes of a directory with a lot
of files would take many hours.  Same with using ceph for a mail store
(it came up on the ceph list under "why is it so slow") - as a chunk
server it's just not suitable for lots of small files.

Towards the end of my use I stopped seeing bitrot on systems that held
data but sat idle; it was limited to periods of heavy use.  My overall
conclusion is that lots of small hosts with no more than a couple of
drives each, plus multiple networks with lots of bandwidth, is what it
is designed for.

I had two reasons for looking at ceph - distributed storage where data
in use was held close to the user but could be redistributed easily
with multiple copies (think two small data stores with an intermittent
WAN link, storing high- and low-priority data), and high performance
with high availability on hardware failure.

Ceph was not the answer for me with the scale I have.

BillK