mirror fetch jobs and --checksum

Robin H. Johnson-2
Hi mirrors,

Historically, our CVS-to-rsync process was overly enthusiastic about
updating timestamps on files, even if they hadn't changed.

With the new Git-to-rsync process, we've run into a few cases where the
mtime is not represented with sufficiently high accuracy to catch all of
the changes, and as a result changes are being missed.

Can you please alter your rsync cronjobs to include --checksum in the
commandline? All Portage calls will be including --checksum in future as
well.
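
As a small illustration of the failure mode described above (a sketch only, not the actual Git-to-rsync tooling): rsync's default quick check treats a file as unchanged when its size and modification time match, so a same-size edit that lands on the same timestamp is only visible if the contents are compared. The helper names, file name, and timestamp below are made up for the example, and SHA-256 simply stands in for "some content hash".

```python
import hashlib
import os
import tempfile


def quick_check_key(path):
    """Size and whole-second mtime: the kind of metadata a quick check compares."""
    st = os.stat(path)
    return (st.st_size, int(st.st_mtime))


def content_digest(path):
    """Hash of the file contents, which is what a checksum comparison looks at."""
    with open(path, "rb") as fh:
        return hashlib.sha256(fh.read()).hexdigest()


def demo():
    """Rewrite a file with same-size content and the same mtime, then compare."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "example.ebuild")  # hypothetical file name
        with open(path, "wb") as fh:
            fh.write(b"version=1\n")
        os.utime(path, (1453000000, 1453000000))  # pin atime/mtime to one second
        before = (quick_check_key(path), content_digest(path))

        with open(path, "wb") as fh:
            fh.write(b"version=2\n")  # same length, different content
        os.utime(path, (1453000000, 1453000000))  # same second as before
        after = (quick_check_key(path), content_digest(path))

    metadata_unchanged = before[0] == after[0]
    content_changed = before[1] != after[1]
    return metadata_unchanged, content_changed
```

Calling demo() reports that the metadata is identical while the content differs, which is the case a size-and-mtime comparison cannot catch.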

A decade ago, we didn't include --checksum in the calls, as the CPU
power available was considerably less; however systems have improved
tremendously since that time.

I have updated the official mirroring wiki page, and if it's been a long
time since you reviewed your scripts, I encourage you to review it:
https://wiki.gentoo.org/wiki/Project:Infrastructure/Mirrors/Rsync

--
Robin Hugh Johnson
Gentoo Linux: Developer, Infrastructure Lead, Foundation Trustee
E-Mail     : [hidden email]
GnuPG FP   : 11ACBA4F 4778E3F6 E4EDF38E B27B944E 34884E85


Re: mirror fetch jobs and --checksum

aditsu
Hi, is this for source mirrors or rsync mirrors?

Adrian



Re: mirror fetch jobs and --checksum

Robin H. Johnson-2
My primary concern is rsync mirrors, but source mirrors would benefit as
well (there was a recent incident with the bash upstream changing a
distfile without modifying the mtime).

On Mon, Jan 18, 2016 at 12:30:02AM +0000, Adrian Sandor wrote:

> Hi, is this for source mirrors or rsync mirrors?
>  Adrian

--
Robin Hugh Johnson
Gentoo Linux: Developer, Infrastructure Lead, Foundation Trustee
E-Mail     : [hidden email]
GnuPG FP   : 11ACBA4F 4778E3F6 E4EDF38E B27B944E 34884E85


Re: mirror fetch jobs and --checksum

aditsu
Well, my concern is that for a source mirror, rsync would have to read 300+GB of data every 4 hours. That seems a bit harsh (unless there is some way to save/cache checksums?)
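
To put rough numbers on that concern, here is a back-of-the-envelope sketch. It assumes decimal gigabytes and that every run reads the whole tree exactly once on the mirror; both function names are made up for the example.

```python
def sustained_read_rate_mb_s(total_bytes, interval_hours):
    """Average read rate in MB/s needed to read total_bytes once per interval."""
    return total_bytes / (interval_hours * 3600) / 1e6


def daily_read_tb(total_bytes, interval_hours):
    """Total data read per day, in TB, at one full pass per interval."""
    return total_bytes * (24 / interval_hours) / 1e12
```

For 300 GB every 4 hours this works out to roughly 20.8 MB/s averaged around the clock, or about 1.8 TB of reads per day, before any actual transfers.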

Adrian




Re: mirror fetch jobs and --checksum

Robin H. Johnson-2
On Fri, Jan 22, 2016 at 10:03:04AM +0000, Adrian Sandor wrote:
> Well, my concern is that for a source mirror, rsync would have to read
> 300+GB of data every 4 hours, it seems a bit harsh (unless there is
> some way to save/cache checksums?)
I agree it's painful.

It seems there used to be more than one patch to cache
checksums:
https://mattmccutchen.net/rsync/rsync-patches.git/tree

It was done twice:
checksums-reading*
db.diff

I don't know why upstream didn't accept these in the end (I've submitted
other patches to upstream also with no response).

--
Robin Hugh Johnson
Gentoo Linux: Developer, Infrastructure Lead, Foundation Trustee
E-Mail     : [hidden email]
GnuPG FP   : 11ACBA4F 4778E3F6 E4EDF38E B27B944E 34884E85


Re: mirror fetch jobs and --checksum

nholland
In reply to this post by Robin H. Johnson-2
On Sat, Jan 16, 2016 at 07:08:28AM +0000, Robin H. Johnson wrote:

> All Portage calls will be including --checksum in future as well.

Hmm, wouldn't this require that all "end-user facing" mirrors do
actually support checksums as well? My own tiny little community
mirror does, but I noticed the following:

At least one of the official German rotation mirrors from
rsync.de.gentoo.org, namely rsync15.de.gentoo.org (aka
ftp.halifax.rwth-aachen.de, or 137.226.34.46), doesn't seem to support
checksums. In its MOTD it says...:

| nils@teela ~ $ rsync rsync15.de.gentoo.org::
| --------------------------------------------------------------
| The features compression (-z) and checksums (-c) are disabled.
|
| More information about this server is available via HTTP:
| http://ftp.halifax.rwth-aachen.de/
| --------------------------------------------------------------

And this isn't just an outdated message. As a little test, I tried the
following in practice:

| nils@teela ~ $ rsync --recursive --links --perms --times -D --delete
| --timeout=300 --checksum
| ftp.halifax.rwth-aachen.de::gentoo-portage/metadata/news/ test/
| [...]
| rsync: read error: Connection reset by peer (104)
| rsync error: error in socket IO (code 10) at io.c(785)
| [Receiver=3.1.2]

Without --checksum, it does properly send me the contents of the news
directory.

Now, while I didn't test any other mirrors and thus can't say if they
(all) support --checksum, I can say that this particular primary
German rotation mirror doesn't. And if portage, in the future,
includes --checksum by default, then I fear people using / reaching
such a mirror (from a rotation) will be unable to sync as the server
might just abort the connection like it did in my test above.

So ... do we have a problem here?

Greetings,
Nils



Re: mirror fetch jobs and --checksum

Robin H. Johnson-2
On Mon, Jan 25, 2016 at 11:30:54AM +0100, Nils Holland wrote:
> On Sat, Jan 16, 2016 at 07:08:28AM +0000, Robin H. Johnson wrote:
>
> > All Portage calls will be including --checksum in future as well.
>
> Hmm, wouldn't this require that all "end-user facing" mirrors do
> actually support checksums as well? My own tiny little community
> mirror does, but I noticed the following:
Hmm, so they use 'refuse options = checksum compress'.
That is problematic, and we'll have to get mirrors to turn it off for
the gentoo-portage module.

If you'd like to test any given mirror, please try to fetch the file
gentoo-portage/metadata/.checksum-test-marker

It contains a timestamp and instructions, and I've explicitly configured
the mtime of the file to remain static. If the timestamp inside the file
isn't recent, then you know the mirror isn't using --checksum to
communicate with upstream somewhere [1].

[1] There is no way to detect if there was an intermediate mirror that
was missing checksums, or if it was the user-facing mirror itself.

--
Robin Hugh Johnson
Gentoo Linux: Developer, Infrastructure Lead, Foundation Trustee
E-Mail     : [hidden email]
GnuPG FP   : 11ACBA4F 4778E3F6 E4EDF38E B27B944E 34884E85


Re: mirror fetch jobs and --checksum

Adrian Reber
On Mon, Jan 25, 2016 at 04:55:30PM +0000, Robin H. Johnson wrote:

> On Mon, Jan 25, 2016 at 11:30:54AM +0100, Nils Holland wrote:
> > On Sat, Jan 16, 2016 at 07:08:28AM +0000, Robin H. Johnson wrote:
> >
> > > All Portage calls will be including --checksum in future as well.
> >
> > Hmm, wouldn't this require that all "end-user facing" mirrors do
> > actually support checksums as well? My own tiny little community
> > mirror does, but I noticed the following:
> Hmm, so they use 'refuse options = checksum compress'.
> That is problematic, and we'll have to get mirrors to turn it off for
> the gentoo-portage module.

I am using that on my mirror, and I have seen many other mirrors that
refuse it as well. I would have no problem enabling compress, but as most
mirrored data is already compressed it does not really make sense.
Allowing checksumming, however, is something I would rather not do. If I
understand it correctly, an rsync run that transfers no data would
suddenly have to read 1GB of actual data from the disk instead of only
the metadata. This is exactly the situation I am trying to avoid.

I also see that 'refuse options' can be specified per rsync module, but
I cannot see a way to allow checksums for a single module while keeping
them refused for all other modules.

So, from my point of view, this new requirement sounds undesirable.

                Adrian


Re: mirror fetch jobs and --checksum

Carlos Carvalho-3
In reply to this post by Robin H. Johnson-2
Robin H. Johnson ([hidden email]) wrote on Sat, Jan 16, 2016 at 05:08:28AM BRST:
> Historically, our CVS-to-rsync process was overly enthusiastic about
> updating timestamps on files, even if they hadn't changed.
>
> With the new Git-to-rsync process, we've run into a few cases where the
> mtime is not represented with sufficiently high accuracy to catch all of
> the changes, and as a result changes are being missed.
>
> Can you please alter your rsync cronjobs to include --checksum in the
> commandline?

Huh??

Denied.

> A decade ago, we didn't include --checksum in the calls, as the CPU
> power available was considerably less; however systems have improved
> tremendously since that time.

*Sigh*

The issue is not calculating checksums, it's I/O. The gentoo repository is now
335GB. It's out of the question to read it all at every update.

And you should know it!

Also, we block --checksum from clients. Most big mirrors do it.

Concerning storing the checksums, you're asking mirrors to use a patched rsync
version for gentoo? Forget it.

It's the master's job to update the repository as you need. Asking mirrors to
bear an enormous load because you cannot do your job is silly, to put it
mildly. You'll be ignored by big mirrors, as you've been since your first post.

If you provide a list of checksums, like Debian does, I can use it. However I
know of no other mirror that has such functionality.

BTW, we're archlinux.c3sl.ufpr.br, the largest free software mirror in the
southern hemisphere and one of the largest in the world.


Re: mirror fetch jobs and --checksum

Robin H. Johnson-2
On Thu, Jan 28, 2016 at 08:42:33PM -0200, Carlos Carvalho wrote:
> The issue is not calculating checksums, it's I/O. The gentoo repository is now
> 335GB. It's out of question to read it all at every update.
> And you should know it!
The primary module I care about, gentoo-portage (historically
gentoo-x86), is under 400MiB raw (~207k inodes however, so watch your
inode space waste).  If your mirror DOESN'T have that entire rsync
module sitting in cache already, you probably have fairly low traffic.
If it's already sitting in cache, there should be no IO hit.
Hashing that much _was_ a CPU hit a decade ago, and most people did NOT
have the memory to fit it all into cache either.

The other issues with Manifest mtimes, especially on ebuild removal, have
been resolved by the various series of patches, but part of that fix was
artificially bumping the mtime by a single second in certain cases.

Those cases are STILL going to have a bumpy time in some situations
because Git's commit time resolution is only 1 second (ditto many
filesystems). Those situations either need sub-second resolution in the
entire ecosystem or checksums of some form.

We've had ~27k commits in the last 6 months, of which:
- 4811 colliding (commit timestamp only)
- 45 colliding (author timestamp only)
- 12 colliding (commit timestamp, author timestamp)

We're using author timestamps on the outgoing rsync files, and
eventually we ARE going to have a real collision that hits users.
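
For anyone who wants to look at timestamp collisions in their own clone, here is a rough sketch of one way to count them. The exact method behind the figures above is not stated, so this is an assumption rather than a reproduction. It expects lines of "<commit timestamp> <author timestamp>", such as those printed by `git log --format='%ct %at'`; the function names are made up for the example.

```python
from collections import Counter


def parse_log(lines):
    """Parse '<commit_ts> <author_ts>' lines into two lists of integer timestamps."""
    commit_ts, author_ts = [], []
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        ct, at = line.split()
        commit_ts.append(int(ct))
        author_ts.append(int(at))
    return commit_ts, author_ts


def count_colliding(timestamps):
    """Number of commits whose whole-second timestamp is shared with another commit."""
    per_second = Counter(int(ts) for ts in timestamps)
    return sum(n for n in per_second.values() if n > 1)
```

This only counts repository-wide collisions per timestamp field. Narrowing it down to collisions that touch the same file or package, which is the case that would actually hit users, would need the changed paths as well.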

This didn't happen with CVS because even if you were really fast (read: local
to the CVS server, and did not go via SSH), you could only get it down
to about 2 seconds per commit, and never in the same package. Two
different devs could never commit to the same package in the same
second.

> Also, we block --checksum from clients. Most big mirrors do it.
The rsync cached checksums patch needs to get popular again, because
then the mirrors won't have any huge burden at all:
- update the checksums when syncing from the parent repo
- compare against the checksums when queried by the client
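
The two steps above can be sketched roughly as follows. This is only a toy in-memory model of the idea, not how the referenced rsync patches are implemented; the class and method names are hypothetical, and SHA-256 is an arbitrary choice of hash for the example.

```python
import hashlib


class ChecksumCache:
    """Toy model: remember digests at sync time, answer queries without disk reads."""

    def __init__(self):
        self._digests = {}

    def on_file_synced(self, rel_path, content):
        # Step 1: hash once, while the data is already in hand from the parent sync.
        self._digests[rel_path] = hashlib.sha256(content).hexdigest()

    def on_file_deleted(self, rel_path):
        # Keep the cache in step with deletions from the parent repo.
        self._digests.pop(rel_path, None)

    def matches(self, rel_path, client_digest):
        # Step 2: compare a client's digest against the cache, not the file itself.
        return self._digests.get(rel_path) == client_digest
```

The sketch leaves out the hard part: keeping the cache consistent with what is really on disk when a sync is interrupted or files change by some other path.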

> Concerning storing the checksums, you're asking mirrors to use a patched rsync
> version for gentoo? Forget it.
Actually, I want upstream rsync to accept the rsync cached checksums
patch; it was dropped a few years ago amongst other major changes and
never picked up again due to lack of maintainer.

If the original stats still hold up, the patch will actually improve
performance even without --checksum, as it can save on stat calls on the
server side (it caches a lot more than just the checksum).

> It's the master job to update the repository as you need. Asking mirrors to
> bear an enormous load because you cannot do your job is silly, to put it
> mildly. You'll be ignored by big mirrors, as you've been since your first post.
I've got at least TWO cases of distfiles, upstream from Gentoo, being
screwed up by the relevant authors.

These happen infrequently enough that we'll continue to deal with them
on an exception basis.

There are also known cases where attackers have interfered with mirrors
(upstream & distributions); and you can expect that an attacker could
REASONABLY set the mtime of a file (if the size & mtime are the same,
and you aren't using --checksum in your cronjob to fetch from upstream,
the attack is going to persist). Catching this in all of the cases is
very difficult to do cheaply in the middle tiers. Cappos [1] included
this in his thesis on Attacks on Package Managers.

> If you provide a list of checksums, like Debian does, I can use it. However I
> know of no other mirror that has such functionality.
MirrorBrain, as used by SUSE, uses checksums internally (and provides
them to end users). It can verify checksums from mirrors. Downside is
that you have to run apache (at least on the master, and on some of the
mirrors for best results).

[1] https://www.cs.arizona.edu/stork/packagemanagersecurity/attacks-on-package-managers.html

--
Robin Hugh Johnson
Gentoo Linux: Developer, Infrastructure Lead, Foundation Trustee
E-Mail     : [hidden email]
GnuPG FP   : 11ACBA4F 4778E3F6 E4EDF38E B27B944E 34884E85


Re: mirror fetch jobs and --checksum

Carlos Carvalho-3
Robin H. Johnson ([hidden email]) wrote on Fri, Jan 29, 2016 at 03:45:13AM BRST:
> On Thu, Jan 28, 2016 at 08:42:33PM -0200, Carlos Carvalho wrote:
> > The issue is not calculating checksums, it's I/O. The gentoo repository is now
> > 335GB. It's out of question to read it all at every update.
> > And you should know it!
> The primary module I care about is gentoo-portage (historically
> gentoo-x86) is under 400MiB raw (~207k inodes however, so watch your
> inode space waste).  If your mirror DOESN'T have that entire rsync
> module sitting in cache already, you probably have fairly low traffic.

No, a big mirror like us has tens of repositories and more than 10 million
inodes. Gentoo is a small part of it.

> Hashing that much _was_ a CPU hit a decade ago, and most people did NOT
> have the memory to fit it all into cache either.

For mirroring many repositories the main problem has always been disk I/O,
particularly inodes. We know this because we have been mirroring for about a
decade.

> The other issues of mtimes of Manifests esp on ebuild removal have been
> resolved by the various series of patches, but part of those were
> artificially bumping the mtime by a single second in certain cases.
>
> Those cases are STILL going to have a bumpy time in some situations
> because Git's commit time resolution is only 1 second (ditto many
> filesystems). Those situations either need sub-second resolution in the
> entire ecosystem or checksums of some form.
>
> We've had ~27k commits in the last 6 months, of which:
> - 4811 colliding (commit timestamp only)
> - 45 colliding (author timestamp only)
> - 12 colliding (commit timestamp, author timestamp)
>
> We're using author timestamps on the outgoing rsync files, and
> eventually we ARE going to have a real collision that hits users.
>
> This didn't happen CVS because even if you were really fast (read: local
> to the CVS server, and did not go via SSH), you could only get it down
> to about 2 seconds per commit, and never in the same package. Two
> different devs could never a commit to the same package in the same
> second.

You'll have to deal with it at the repository building stage.

> > Also, we block --checksum from clients. Most big mirrors do it.
> The rsync cached checksums patch needs to get popular again, because
> then the mirrors won't have any huge burden at all:
> - update the checksums when syncing from the parent repo
> - compare against the checksums when queried by the client

It'd be really nice yes, but unfortunately it's much harder. One has to make
sure that the checksums match what's on disk, no matter how the update process
is interrupted and what happens upstream. The bulk of it is not difficult but
the corner cases are :-( The patch is surely a godsend to the origin of content
but not to the destination.

Anyway, I agree checksums are better, so much so that I DO USE checksums when
they're available, like Debian does. I won't use the --checksum option
for Gentoo or any other repository but if you provide a file with a
list of them at your repository the C3SL mirror will use them. The format of
the file should be like the md5/sha* one. These utilities include only regular
files, so you also have to provide another file with a list containing all
objects in the repository. Please use

   cd /root-of-repository && TZ=UTC rsync --no-h --list-only -r > /path/to/filelist

to create it, because it's easy to parse.

md5 is becoming increasingly vulnerable, so the Debian repository maintainers
are thinking about using other hashes. It seems sha512 is faster than sha256 on
64-bit machines, making it a good option. If you use md5sum the mirror job
here is simpler because rsync already does the check; for other hashes it's
harder at the mirror side because we have to calculate it after download but
the cost is small nowadays. I'm willing to do it and modify our script
accordingly.
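
As a rough sketch of what such a checksum list could look like on both ends (an illustration under assumptions, not the C3SL script or any existing Gentoo tooling): the first function writes "<hex digest>  <relative path>" lines in the style of sha512sum output for regular files only, and the second re-hashes a local copy against that list. All function names are made up, and file names containing newlines are not handled.

```python
import hashlib
import os


def file_sha512(path, chunk_size=1 << 20):
    """SHA-512 of a file, read in chunks so large distfiles do not fill memory."""
    digest = hashlib.sha512()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_checksum_lines(root):
    """Return '<hex digest>  <relative path>' lines for regular files under root."""
    lines = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            if os.path.islink(full) or not os.path.isfile(full):
                continue  # regular files only, as in md5sum/sha512sum lists
            rel = os.path.relpath(full, root)
            lines.append(f"{file_sha512(full)}  {rel}")
    # Sort by path so the list is stable from one run to the next.
    return sorted(lines, key=lambda line: line.split("  ", 1)[1])


def find_mismatches(root, lines):
    """Return relative paths that are missing locally or differ from the list."""
    bad = []
    for line in lines:
        expected, rel = line.split("  ", 1)
        full = os.path.join(root, rel)
        if not os.path.isfile(full) or file_sha512(full) != expected:
            bad.append(rel)
    return bad
```

The master would publish the output of build_checksum_lines() next to the file list, and a mirror would run find_mismatches() only over files it has just downloaded, so a full re-read of the tree is not needed on every update.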