New distfile mirror layout


Michał Górny-5
Hi, everybody.

It is my pleasure to announce that yesterday (EU) evening we've switched
to a new distfile mirror layout.  Users will be switching to the new
layout either as they upgrade Portage to 2.3.77 or -- if they upgraded
already -- as their caches expire (24hrs).

The new layout is mostly a bow towards mirror admins, for some of whom
having 60000+ files in a single directory has been a problem.
However, I suppose some of you also found e.g. the directory index
hardly usable due to its size.

Throughout a transitional period (whose exact length hasn't been decided
yet), both layouts will be available.  Afterwards, the old layout will
be removed from mirrors.  This has a few implications:

1. Users who don't upgrade their package managers in time will lose
the ability of fetching from Gentoo mirrors.  This shouldn't be that
much of a problem given that the core software needed to upgrade Portage
should all have reliable upstream SRC_URIs.

2. mirror://gentoo/file URIs will stop working.  While technically you
could use mirror://gentoo/XX/file, I'd rather recommend finally
discarding its usage and moving distfiles to devspace.

3. Directly fetching files from distfiles.gentoo.org will become
a little harder.  To fetch a distfile named 'foo-1.tar.gz', you'd have
to use something like:

$ printf '%s' foo-1.tar.gz | b2sum | cut -c1-2
1b
$ wget http://distfiles.gentoo.org/distfiles/1b/foo-1.tar.gz
...
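
The two steps above can be folded into one helper. This is only a minimal sketch, assuming b2sum from GNU coreutils is installed; the function name distfile_url and the BASE variable are made up for illustration and are not part of Portage.

```shell
#!/bin/sh
# Hypothetical helper (not part of Portage): print the new-layout URL
# for a distfile, given only its file name.
BASE=http://distfiles.gentoo.org/distfiles

distfile_url() {
    # The bucket is the first two hex digits of the BLAKE2B hash of the
    # file *name* (not of the file contents).
    bucket=$(printf '%s' "$1" | b2sum | cut -c1-2)
    printf '%s/%s/%s\n' "$BASE" "$bucket" "$1"
}

distfile_url foo-1.tar.gz
```

It could then be used as: wget "$(distfile_url foo-1.tar.gz)"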


Alternatively, you can:

$ wget http://distfiles.gentoo.org/distfiles/INDEX

and grep for the right path there.  This INDEX is also a more
lightweight alternative to HTML indexes generated by the servers.
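
A rough sketch of the grep step follows. The format of INDEX is an assumption here (one relative path such as "1b/foo-1.tar.gz" per line), so look at the real file before relying on this; find_in_index is a made-up name.

```shell
#!/bin/sh
# Hypothetical helper. ASSUMPTION: each line of INDEX contains the relative
# path of one distfile, e.g. "1b/foo-1.tar.gz". Verify against the real file.
find_in_index() {
    # $1 = local copy of INDEX, $2 = distfile name
    # -F treats the name as a fixed string, so dots are not regex wildcards.
    grep -F -- "/$2" "$1"
}
```

Note that this is a substring match, so a name like foo-1.tar.gz would also match foo-1.tar.gz.sig if such an entry existed.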


If you're interested in more background details and some plots, see [1].

[1] https://dev.gentoo.org/~mgorny/articles/improving-distfile-mirror-structure.html

--
Best regards,
Michał Górny



Re: New distfile mirror layout

Richard Yao-2


> On Oct 18, 2019, at 9:42 AM, Michał Górny <[hidden email]> wrote:
>
> Hi, everybody.
>
> It is my pleasure to announce that yesterday (EU) evening we've switched
> to a new distfile mirror layout.  Users will be switching to the new
> layout either as they upgrade Portage to 2.3.77 or -- if they upgraded
> already -- as their caches expire (24hrs).
>
> The new layout is mostly a bow towards mirror admins, for some of whom
> having 60000+ files in a single directory has been a problem.
> However, I suppose some of you also found e.g. the directory index
> hardly usable due to its size.
This sounds like a filesystem issue. Do we know which filesystems are suffering?

ZFS should be fine. I believe ext2/ext3 have problems with this many files. ext4 is probably okay, but don’t quote me on that.

>
> Throughout a transitional period (whose exact length hasn't been decided
> yet), both layouts will be available.  Afterwards, the old layout will
> be removed from mirrors.  This has a few implications:
>
> 1. Users who don't upgrade their package managers in time will lose
> the ability of fetching from Gentoo mirrors.  This shouldn't be that
> much of a problem given that the core software needed to upgrade Portage
> should all have reliable upstream SRC_URIs.
>
> 2. mirror://gentoo/file URIs will stop working.  While technically you
> could use mirror://gentoo/XX/file, I'd rather recommend finally
> discarding its usage and moving distfiles to devspace.
>
> 3. Directly fetching files from distfiles.gentoo.org will become
> a little harder.  To fetch a distfile named 'foo-1.tar.gz', you'd have
> to use something like:
>
> $ printf '%s' foo-1.tar.gz | b2sum | cut -c1-2
> 1b
> $ wget http://distfiles.gentoo.org/distfiles/1b/foo-1.tar.gz
> ...
>
>
> Alternatively, you can:
>
> $ wget http://distfiles.gentoo.org/distfiles/INDEX
>
> and grep for the right path there.  This INDEX is also a more
> lightweight alternative to HTML indexes generated by the servers.
>
>
> If you're interested in more background details and some plots, see [1].
>
> [1] https://dev.gentoo.org/~mgorny/articles/improving-distfile-mirror-structure.html
>
> --
> Best regards,
> Michał Górny
>



Re: New distfile mirror layout

Michał Górny-5
On Fri, 2019-10-18 at 15:53 -0400, Richard Yao wrote:

> > On Oct 18, 2019, at 9:42 AM, Michał Górny <[hidden email]> wrote:
> >
> > Hi, everybody.
> >
> > It is my pleasure to announce that yesterday (EU) evening we've switched
> > to a new distfile mirror layout.  Users will be switching to the new
> > layout either as they upgrade Portage to 2.3.77 or -- if they upgraded
> > already -- as their caches expire (24hrs).
> >
> > The new layout is mostly a bow towards mirror admins, for some of whom
> > having 60000+ files in a single directory has been a problem.
> > However, I suppose some of you also found e.g. the directory index
> > hardly usable due to its size.
> This sounds like a filesystem issue. Do we know which filesystems are suffering?
>
> ZFS should be fine. I believe ext2/ext3 have problems with this many files. ext4 is probably okay, but don’t quote me on that.
Ext2, VFAT and NTFS were mentioned on the bug [1], though I suppose this
may apply only to older ntfs versions.  NFS has been mentioned too.

However, just because modern filesystems can handle them efficiently, it
doesn't mean having directories that huge comes with zero cost.

[1] https://bugs.gentoo.org/534528

--
Best regards,
Michał Górny



Re: New distfile mirror layout

Richard Yao-2

> On Oct 18, 2019, at 4:49 PM, Michał Górny <[hidden email]> wrote:
>
> On Fri, 2019-10-18 at 15:53 -0400, Richard Yao wrote:
>>>>>> On Oct 18, 2019, at 9:42 AM, Michał Górny <[hidden email]> wrote:
>>>>> Hi, everybody.
>>>>> It is my pleasure to announce that yesterday (EU) evening we've switched
>>>>> to a new distfile mirror layout.  Users will be switching to the new
>>>>> layout either as they upgrade Portage to 2.3.77 or -- if they upgraded
>>>>> already -- as their caches expire (24hrs).
>>>>> The new layout is mostly a bow towards mirror admins, for some of whom
>>>>> having 60000+ files in a single directory has been a problem.
>>>>> However, I suppose some of you also found e.g. the directory index
>>>>> hardly usable due to its size.
>> This sounds like a filesystem issue. Do we know which filesystems are suffering?
>> ZFS should be fine. I believe ext2/ext3 have problems with this many files. ext4 is probably okay, but don’t quote me on that.
>
> Ext2, VFAT and NTFS were mentioned on the bug [1], though I suppose this
> may apply only to older ntfs versions.  NFS has been mentioned too.

ext2 and vfat are not surprises to me (outside of the idea that anyone would use them for a mirror). NTFS and NFS are though.
>
> However, just because modern filesystems can handle them efficiently, it
> doesn't mean having directories that huge comes with zero cost.
While I am okay with the change, what do you mean when you say that having huge directories does not come with zero cost?

Filesystems with O(1) directory lookups like ZFS would probably be hurt by this, but the impact should be negligible. Filesystems with O(log n) directory lookups would see faster directory lookups.

Outside of directory lookups, this could speed up searches and sort operations when listing everything, with just about any filesystem benefiting from the improvement.

Listing directories on such filesystems should not benefit from this unless you are using ls where the default behavior is to sort the directory contents (which is where the improvement when sorting comes into play). The need to sort the directory contents by default keeps ls from displaying anything until it has scanned the entire directory. The asymptotic complexity of a fast comparison based sort improves in this situation from O(nlogn) to O(nlog(n/b)) provided that you sort each subdirectory independently. A further speed up could be obtained by doing multithreading to parallelize the sort operations.

Since I know someone will call me out on that comment, I will explain. Each bucket has roughly n/b items in it where n is the total number and b is the number of buckets. Sorting one bucket is O(n/b * log(n/b)). Loop to sort each of the b buckets. The buckets are pre-sorted by prefix, so the result is now sorted. You therefore get O(nlog(n/b)) time complexity out of an O(nlogn) comparison sort on this very special case where you call it multiple times on data that has been pre-sorted by prefix into buckets.
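
Written out, the bucket argument looks like the following. This is just the arithmetic behind the paragraph above, assuming evenly sized buckets and a comparison sort that costs about c * m * log(m) on m items; the constant c is introduced here for illustration only.

```latex
\begin{align*}
T_{\text{flat}}(n) &= c \, n \log n \\
T_{\text{bucketed}}(n, b) &= b \cdot c \, \frac{n}{b} \log\frac{n}{b}
  = c \, n \log\frac{n}{b}
  = c \, n \, (\log n - \log b)
\end{align*}
```

For n = 60000 files and b = 256 buckets (two hex digits), log2(n) is about 15.9 and log2(b) is 8, so the comparison count roughly halves. For a fixed b this is a constant-factor gain; the bound only drops below O(nlogn) asymptotically if b grows with n.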

Is there any other benefit to this or did I get everything?

By the way, it is offtopic for the thread, but it occurs to me that a hybrid of radix sort and a comparison-based sort could give us a general sorting algorithm that is asymptotically faster than O(nlogn).
>
> [1] https://bugs.gentoo.org/534528
>
> --
> Best regards,
> Michał Górny



Re: New distfile mirror layout

Michał Górny-5
On Fri, 2019-10-18 at 21:09 -0400, Richard Yao wrote:

> > On Oct 18, 2019, at 4:49 PM, Michał Górny <[hidden email]> wrote:
> >
> > On Fri, 2019-10-18 at 15:53 -0400, Richard Yao wrote:
> > > > > > > On Oct 18, 2019, at 9:42 AM, Michał Górny <[hidden email]> wrote:
> > > > > > Hi, everybody.
> > > > > > It is my pleasure to announce that yesterday (EU) evening we've switched
> > > > > > to a new distfile mirror layout.  Users will be switching to the new
> > > > > > layout either as they upgrade Portage to 2.3.77 or -- if they upgraded
> > > > > > already -- as their caches expire (24hrs).
> > > > > > The new layout is mostly a bow towards mirror admins, for some of whom
> > > > > > having 60000+ files in a single directory has been a problem.
> > > > > > However, I suppose some of you also found e.g. the directory index
> > > > > > hardly usable due to its size.
> > > This sounds like a filesystem issue. Do we know which filesystems are suffering?
> > > ZFS should be fine. I believe ext2/ext3 have problems with this many files. ext4 is probably okay, but don’t quote me on that.
> >
> > Ext2, VFAT and NTFS were mentioned on the bug [1], though I suppose this
> > may apply only to older ntfs versions.  NFS has been mentioned too.
>
> ext2 and vfat are not surprises to me (outside of the idea that anyone would use them for a mirror). NTFS and NFS are though.
Are you surprised that people use NTFS on Windows?  Or that they use
local mirrors over NFS?  The latter still needs to be addressed
separately, provided that they mount it on DISTDIR.

> > However, just because modern filesystems can handle them efficiently, it
> > doesn't mean having directories that huge comes with zero cost.
> While I am okay with the change, what do you mean when you say that having huge directories does not come with zero cost?
>
> Filesystems with O(1) directory lookups like ZFS would probably be hurt by this

O(1) or O(n)?

> , but the impact should be negligible. Filesystems with O(log n) directory lookups would see faster directory lookups.
>
> Outside of directory lookups, this could speed up searches and sort operations when listing everything, with just about any filesystem benefiting from the improvement.
>
> Listing directories on such filesystems should not benefit from this unless you are using ls where the default behavior is to sort the directory contents (which is where the improvement when sorting comes into play). The need to sort the directory contents by default keeps ls from displaying anything until it has scanned the entire directory. The asymptotic complexity of a fast comparison based sort improves in this situation from O(nlogn) to O(nlog(n/b)) provided that you sort each subdirectory independently. A further speed up could be obtained by doing multithreading to parallelize the sort operations.
>
> Since I know someone will call me out on that comment, I will explain. Each bucket has roughly n/b items in it where n is the total number and b is the number of buckets. Sorting one bucket is O(n/b * log(n/b)). Loop to sort each of the b buckets. The buckets are pre-sorted by prefix, so the result is now sorted. You therefore get O(nlog(n/b)) time complexity out of an O(nlogn) comparison sort on this very special case where you call it multiple times on data that has been pre-sorted by prefix into buckets.
>
> Is there any other benefit to this or did I get everything?

Listings for individual directories won't cause major pain to browsers
anymore.  Not that there's much reason to do them.

All kinds of per-directory operations will consume less memory
and be potentially faster.

--
Best regards,
Michał Górny



Re: New distfile mirror layout

Richard Yao-2


> On Oct 19, 2019, at 2:17 AM, Michał Górny <[hidden email]> wrote:
>
> On Fri, 2019-10-18 at 21:09 -0400, Richard Yao wrote:
>>>> On Oct 18, 2019, at 4:49 PM, Michał Górny <[hidden email]> wrote:
>>>
>>> On Fri, 2019-10-18 at 15:53 -0400, Richard Yao wrote:
>>>>>>>> On Oct 18, 2019, at 9:42 AM, Michał Górny <[hidden email]> wrote:
>>>>>>> Hi, everybody.
>>>>>>> It is my pleasure to announce that yesterday (EU) evening we've switched
>>>>>>> to a new distfile mirror layout.  Users will be switching to the new
>>>>>>> layout either as they upgrade Portage to 2.3.77 or -- if they upgraded
>>>>>>> already -- as their caches expire (24hrs).
>>>>>>> The new layout is mostly a bow towards mirror admins, for some of whom
>>>>>>> having 60000+ files in a single directory has been a problem.
>>>>>>> However, I suppose some of you also found e.g. the directory index
>>>>>>> hardly usable due to its size.
>>>> This sounds like a filesystem issue. Do we know which filesystems are suffering?
>>>> ZFS should be fine. I believe ext2/ext3 have problems with this many files. ext4 is probably okay, but don’t quote me on that.
>>>
>>> Ext2, VFAT and NTFS were mentioned on the bug [1], though I suppose this
>>> may apply only to older ntfs versions.  NFS has been mentioned too.
>>
>> ext2 and vfat are not surprises to me (outside of the idea that anyone would use them for a mirror). NTFS and NFS are though.
>
> Are you surprised that people use NTFS on Windows?  Or that they use
> local mirrors over NFS?  The latter still needs to be addressed
> separately, provided that they mount it on DISTDIR.
I am surprised that it was an issue on NTFS because it uses B-trees. As for NFS, I had expected that to be more dependent on the local filesystem than on NFS itself. If it has a slowdown when used on a filesystem that had fast directory operations, that might be a bug.
>
>>> However, just because modern filesystems can handle them efficiently, it
>>> doesn't mean having directories that huge comes with zero cost.
>> While I am okay with the change, what do you mean when you say that having huge directories does not come with zero cost?
>>
>> Filesystems with O(1) directory lookups like ZFS would probably be hurt by this
>
> O(1) or O(n)?
ZFS uses extendible hashing for its directories, so the data structure used is amortized O(1). You might consider it O(log n) due to the indirect tree traversal needed to find the direct block containing the hash table entry. With caching of indirect blocks, it should be amortized O(1) to find the direct block in practice as far as read IOs are considered. In addition, the base of the logarithm is 128 or 1024 depending on the pool feature flags.

>
>> , but the impact should be negligible. Filesystems with O(log n) directory lookups would see faster directory lookups.
>>
>> Outside of directory lookups, this could speed up searches and sort operations when listing everything, with just about any filesystem benefiting from the improvement.
>>
>> Listing directories on such filesystems should not benefit from this unless you are using ls where the default behavior is to sort the directory contents (which is where the improvement when sorting comes into play). The need to sort the directory contents by default keeps ls from displaying anything until it has scanned the entire directory. The asymptotic complexity of a fast comparison based sort improves in this situation from O(nlogn) to O(nlog(n/b)) provided that you sort each subdirectory independently. A further speed up could be obtained by doing multithreading to parallelize the sort operations.
>>
>> Since I know someone will call me out on that comment, I will explain. Each bucket has roughly n/b items in it where n is the total number and b is the number of buckets. Sorting one bucket is O(n/b * log(n/b)). Loop to sort each of the b buckets. The buckets are pre-sorted by prefix, so the result is now sorted. You therefore get O(nlog(n/b)) time complexity out of an O(nlogn) comparison sort on this very special case where you call it multiple times on data that has been pre-sorted by prefix into buckets.
>>
>> Is there any other benefit to this or did I get everything?
>
> Listings for individual directories won't cause major pain to browsers
> anymore.  Not that there's much reason to do them.
That makes sense.
>
> All kinds of per-directory operations will consume less memory
> and be potentially faster.
Userland would save memory when sorting or grepping a directory listing by virtue of having to process less data for grep and less data at a time for sorting (if it takes advantage of this). That would have performance benefits in userland.

The kernel would have little memory savings and in some cases might be slightly worse. It is negligible. Performance in the kernel ought to be slightly better on filesystems with O(log n) directory operations, but I would only expect the really bad ones to show much improvement.
> --
> Best regards,
> Michał Górny
>



Re: New distfile mirror layout

Fabian Groffen-2
In reply to this post by Michał Górny-5
Hi,

On 18-10-2019 15:41:32 +0200, Michał Górny wrote:

> 3. Directly fetching files from distfiles.gentoo.org will become
> a little harder.  To fetch a distfile named 'foo-1.tar.gz', you'd have
> to use something like:
>
> $ printf '%s' foo-1.tar.gz | b2sum | cut -c1-2
> 1b
> $ wget http://distfiles.gentoo.org/distfiles/1b/foo-1.tar.gz
> ...
>
>
> Alternatively, you can:
>
> $ wget http://distfiles.gentoo.org/distfiles/INDEX
>
> and grep for the right path there.  This INDEX is also a more
> lightweight alternative to HTML indexes generated by the servers.
Would it be possible to run a service that sends a 302 for the
distfiles/foo-1.tar.gz to the appropriate bucket such that manual
fetching doesn't require calculating the hash?

I prototyped this myself for distfiles.prefix, and it seems like a nice
gesture for at least the transition period?
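
To make the idea concrete, the mapping such a redirect service would need is small. The following is a hypothetical sketch only, not the distfiles.prefix prototype; it assumes b2sum from GNU coreutils and a /distfiles/ URL prefix, and the function name redirect_location is made up.

```shell
#!/bin/sh
# Hypothetical sketch (not the actual prototype): map an old flat request
# path to the Location header that a 302 response could carry.
redirect_location() {
    name=${1##*/}    # drop any leading directories, keep the file name
    bucket=$(printf '%s' "$name" | b2sum | cut -c1-2)
    printf 'Location: /distfiles/%s/%s\n' "$bucket" "$name"
}
```

A request for /distfiles/foo-1.tar.gz would then be answered with a 302 pointing at the two-hex-digit bucket directory for that name.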

Thanks,
Fabian


--
Fabian Groffen
Gentoo on a different level


Re: New distfile mirror layout

Michał Górny-5
On Sat, 2019-10-19 at 15:31 +0200, Fabian Groffen wrote:

> Hi,
>
> On 18-10-2019 15:41:32 +0200, Michał Górny wrote:
> > 3. Directly fetching files from distfiles.gentoo.org will become
> > a little harder.  To fetch a distfile named 'foo-1.tar.gz', you'd have
> > to use something like:
> >
> > $ printf '%s' foo-1.tar.gz | b2sum | cut -c1-2
> > 1b
> > $ wget http://distfiles.gentoo.org/distfiles/1b/foo-1.tar.gz
> > ...
> >
> >
> > Alternatively, you can:
> >
> > $ wget http://distfiles.gentoo.org/distfiles/INDEX
> >
> > and grep for the right path there.  This INDEX is also a more
> > lightweight alternative to HTML indexes generated by the servers.
>
> Would it be possible to run a service that sends a 302 for the
> distfiles/foo-1.tar.gz to the appropriate bucket such that manual
> fetching doesn't require calculating the hash?
>
> I prototyped this myself for distfiles.prefix, and it seems like a nice
> gesture for at least the transition period?
>
That would only work for servers whose admins would explicitly install
the service, i.e. not for anyone using GENTOO_MIRRORS.  If you're
talking purely about distfiles.gentoo.org, we may add something like
that by the end of the transitional period.

--
Best regards,
Michał Górny



Re: New distfile mirror layout

Richard Yao-2
In reply to this post by Richard Yao-2


> On Oct 18, 2019, at 9:10 PM, Richard Yao <[hidden email]> wrote:
>
>
>>> On Oct 18, 2019, at 4:49 PM, Michał Górny <[hidden email]> wrote:
>> On Fri, 2019-10-18 at 15:53 -0400, Richard Yao wrote:
>>>>>>>> On Oct 18, 2019, at 9:42 AM, Michał Górny <[hidden email]> wrote:
>>>>>>> Hi, everybody.
>>>>>>> It is my pleasure to announce that yesterday (EU) evening we've switched
>>>>>>> to a new distfile mirror layout.  Users will be switching to the new
>>>>>>> layout either as they upgrade Portage to 2.3.77 or -- if they upgraded
>>>>>>> already -- as their caches expire (24hrs).
>>>>>>> The new layout is mostly a bow towards mirror admins, for some of whom
>>>>>>> having 60000+ files in a single directory has been a problem.
>>>>>>> However, I suppose some of you also found e.g. the directory index
>>>>>>> hardly usable due to its size.
>>> This sounds like a filesystem issue. Do we know which filesystems are suffering?
>>> ZFS should be fine. I believe ext2/ext3 have problems with this many files. ext4 is probably okay, but don’t quote me on that.
>> Ext2, VFAT and NTFS were mentioned on the bug [1], though I suppose this
>> may apply only to older ntfs versions.  NFS has been mentioned too.
>
> ext2 and vfat are not surprises to me (outside of the idea that anyone would use them for a mirror). NTFS and NFS are though.
>> However, just because modern filesystems can handle them efficiently, it
>> doesn't mean having directories that huge comes with zero cost.
> While I am okay with the change, what do you mean when you say that having huge directories does not come with zero cost?
>
> Filesystems with O(1) directory lookups like ZFS would probably be hurt by this, but the impact should be negligible. Filesystems with O(log n) directory lookups would see faster directory lookups.
>
> Outside of directory lookups, this could speed up searches and sort operations when listing everything, with just about any filesystem benefiting from the improvement.
>
> Listing directories on such filesystems should not benefit from this unless you are using ls where the default behavior is to sort the directory contents (which is where the improvement when sorting comes into play). The need to sort the directory contents by default keeps ls from displaying anything until it has scanned the entire directory. The asymptotic complexity of a fast comparison based sort improves in this situation from O(nlogn) to O(nlog(n/b)) provided that you sort each subdirectory independently. A further speed up could be obtained by doing multithreading to parallelize the sort operations.
I read your original email late at night and I misread the description of how this works.

At an initial glance, I thought we were doing a prefix approach (with the caveat that buckets are unbalanced). In reality, we are doing a cryptographic hash of the filenames.

That would keep all buckets balanced, which gives the best directory lookup times on O(log n) lookup filesystems, but I think there is something to be gained from using the less optimal approach of using filename prefixes:

* some regex searches on distfiles can be accelerated
* generating a sorted list of all distfiles becomes asymptotically faster
* it is easy for a user to find all versions of a given distfile
* no need to calculate a cryptographic hash

I realize that I am late to propose it, but could we consider a switch to this alternative arrangement?

The bulk of the performance gain should be realized with either approach.
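
For anyone who wants to compare how evenly the two schemes spread files, a quick count per bucket can be done along these lines. This is a hypothetical sketch: it assumes b2sum from GNU coreutils, and it assumes "prefix" means the first two characters of the file name, which is only one possible reading of the proposal.

```shell
#!/bin/sh
# Hypothetical comparison helper: count names per bucket under either scheme.
# Reads distfile names from stdin, one per line; $1 is "hash" or "prefix".
bucket_counts() {
    while IFS= read -r name; do
        case $1 in
            # first two hex digits of the BLAKE2B hash of the name
            hash)   printf '%s' "$name" | b2sum | cut -c1-2 ;;
            # ASSUMPTION: prefix bucket = first two characters of the name
            prefix) printf '%s\n' "$name" | cut -c1-2 ;;
        esac
    done | sort | uniq -c | sort -rn
}
```

Feeding it a directory listing, for example ls "$DISTDIR" | bucket_counts prefix | head, shows the largest buckets first, so the two schemes can be compared on real data.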

> Since I know someone will call me out on that comment, I will explain. Each bucket has roughly n/b items in it where n is the total number and b is the number of buckets. Sorting one bucket is O(n/b * log(n/b)). Loop to sort each of the b buckets. The buckets are pre-sorted by prefix, so the result is now sorted. You therefore get O(nlog(n/b)) time complexity out of an O(nlogn) comparison sort on this very special case where you call it multiple times on data that has been pre-sorted by prefix into buckets.
>
> Is there any other benefit to this or did I get everything?
>
> By the way, it is offtopic for the thread, but it occurs to me that a hybrid of radix sort and a comparison-based sort could give us a general sorting algorithm that is asymptotically faster than O(nlogn).
>> [1] https://bugs.gentoo.org/534528
>> --
>> Best regards,
>> Michał Górny



Re: New distfile mirror layout

Michał Górny-5
On Sat, 2019-10-19 at 15:26 -0400, Richard Yao wrote:

> > On Oct 18, 2019, at 9:10 PM, Richard Yao <[hidden email]> wrote:
> >
> >
> > > > On Oct 18, 2019, at 4:49 PM, Michał Górny <[hidden email]> wrote:
> > > On Fri, 2019-10-18 at 15:53 -0400, Richard Yao wrote:
> > > > > > > > > On Oct 18, 2019, at 9:42 AM, Michał Górny <[hidden email]> wrote:
> > > > > > > > Hi, everybody.
> > > > > > > > It is my pleasure to announce that yesterday (EU) evening we've switched
> > > > > > > > to a new distfile mirror layout.  Users will be switching to the new
> > > > > > > > layout either as they upgrade Portage to 2.3.77 or -- if they upgraded
> > > > > > > > already -- as their caches expire (24hrs).
> > > > > > > > The new layout is mostly a bow towards mirror admins, for some of whom
> > > > > > > > having 60000+ files in a single directory has been a problem.
> > > > > > > > However, I suppose some of you also found e.g. the directory index
> > > > > > > > hardly usable due to its size.
> > > > This sounds like a filesystem issue. Do we know which filesystems are suffering?
> > > > ZFS should be fine. I believe ext2/ext3 have problems with this many files. ext4 is probably okay, but don’t quote me on that.
> > > Ext2, VFAT and NTFS were mentioned on the bug [1], though I suppose this
> > > may apply only to older ntfs versions.  NFS has been mentioned too.
> >
> > ext2 and vfat are not surprises to me (outside of the idea that anyone would use them for a mirror). NTFS and NFS are though.
> > > However, just because modern filesystems can handle them efficiently, it
> > > doesn't mean having directories that huge comes with zero cost.
> > While I am okay with the change, what do you mean when you say that having huge directories does not come with zero cost?
> >
> > Filesystems with O(1) directory lookups like ZFS would probably be hurt by this, but the impact should be negligible. Filesystems with O(log n) directory lookups would see faster directory lookups.
> >
> > Outside of directory lookups, this could speed up searches and sort operations when listing everything, with just about any filesystem benefiting from the improvement.
> >
> > Listing directories on such filesystems should not benefit from this unless you are using ls where the default behavior is to sort the directory contents (which is where the improvement when sorting comes into play). The need to sort the directory contents by default keeps ls from displaying anything until it has scanned the entire directory. The asymptotic complexity of a fast comparison based sort improves in this situation from O(nlogn) to O(nlog(n/b)) provided that you sort each subdirectory independently. A further speed up could be obtained by doing multithreading to parallelize the sort operations.
> I read your original email late at night and I misread the description of how this works.
>
> At an initial glance, I thought we were doing a prefix approach (with the caveat that buckets are unbalanced). In reality, we are doing a cryptographic hash of the filenames.
>
> That would keep all buckets balanced, which gives the best directory lookup times on O(log n) lookup filesystems, but I think there is something to be gained from using the less optimal approach of using filename prefixes:
>
> * some regex searches on distfiles can be accelerated
> * generating a sorted list of all distfiles becomes asymptotically faster
> * it is easy for a user to find all versions of a given distfile
> * no need to calculate a cryptographic hash
>
> I realize that I am late to propose it, but could we consider a switch to this alternative arrangement?
No, we can't.  Please read either the original discussion on the bug, or
the linked article.  It's explained in detail why this won't work.

--
Best regards,
Michał Górny



Re: New distfile mirror layout

Richard Yao-2


> On Oct 19, 2019, at 4:03 PM, Michał Górny <[hidden email]> wrote:
>
> On Sat, 2019-10-19 at 15:26 -0400, Richard Yao wrote:
>>>> On Oct 18, 2019, at 9:10 PM, Richard Yao <[hidden email]> wrote:
>>>
>>>
>>>>> On Oct 18, 2019, at 4:49 PM, Michał Górny <[hidden email]> wrote:
>>>> On Fri, 2019-10-18 at 15:53 -0400, Richard Yao wrote:
>>>>>>>>>> On Oct 18, 2019, at 9:42 AM, Michał Górny <[hidden email]> wrote:
>>>>>>>>> Hi, everybody.
>>>>>>>>> It is my pleasure to announce that yesterday (EU) evening we've switched
>>>>>>>>> to a new distfile mirror layout.  Users will be switching to the new
>>>>>>>>> layout either as they upgrade Portage to 2.3.77 or -- if they upgraded
>>>>>>>>> already -- as their caches expire (24hrs).
>>>>>>>>> The new layout is mostly a bow towards mirror admins, for some of whom
>>>>>>>>> having 60000+ files in a single directory has been a problem.
>>>>>>>>> However, I suppose some of you also found e.g. the directory index
>>>>>>>>> hardly usable due to its size.
>>>>> This sounds like a filesystem issue. Do we know which filesystems are suffering?
>>>>> ZFS should be fine. I believe ext2/ext3 have problems with this many files. ext4 is probably okay, but don’t quote me on that.
>>>> Ext2, VFAT and NTFS were mentioned on the bug [1], though I suppose this
>>>> may apply only to older ntfs versions.  NFS has been mentioned too.
>>>
>>> ext2 and vfat are not surprises to me (outside of the idea that anyone would use them for a mirror). NTFS and NFS are though.
>>>> However, just because modern filesystems can handle them efficiently, it
>>>> doesn't mean having directories that huge comes with zero cost.
>>> While I am okay with the change, what do you mean when you say that having huge directories does not come with zero cost?
>>>
>>> Filesystems with O(1) directory lookups like ZFS would probably be hurt by this, but the impact should be negligible. Filesystems with O(log n) directory lookups would see faster directory lookups.
>>>
>>> Outside of directory lookups, this could speed up searches and sort operations when listing everything, with just about any filesystem benefiting from the improvement.
>>>
>>> Listing directories on such filesystems should not benefit from this unless you are using ls where the default behavior is to sort the directory contents (which is where the improvement when sorting comes into play). The need to sort the directory contents by default keeps ls from displaying anything until it has scanned the entire directory. The asymptotic complexity of a fast comparison based sort improves in this situation from O(nlogn) to O(nlog(n/b)) provided that you sort each subdirectory independently. A further speed up could be obtained by doing multithreading to parallelize the sort operations.
>> I read your original email late at night and I misread the description of how this works.
>>
>> At an initial glance, I thought we were doing a prefix approach (with the caveat that buckets are unbalanced). In reality, we are doing a cryptographic hash of the filenames.
>>
>> That would keep all buckets balanced, which gives the best directory lookup times on O(log n) lookup filesystems, but I think there is something to be gained from using the less optimal approach of using filename prefixes:
>>
>> * some regex searches on distfiles can be accelerated
>> * generating a sorted list of all distfiles becomes asymptotically faster
>> * it is easy for a user to find all versions of a given distfile
>> * no need to calculate a cryptographic hash
>>
>> I realize that I am late to propose it, but could we consider a switch to this alternative arrangement?
>
> No, we can't.  Please read either the original discussion on the bug, or
> the linked article.  It's explained in detail why this won't work.
Alright. I am convinced. Thanks.
>
> --
> Best regards,
> Michał Górny
>
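The O(nlogn) to O(nlog(n/b)) estimate quoted above can be checked with a short derivation. It assumes the n files are split evenly across b buckets, which is only approximately true for a hash-based layout, and it yields one sorted list per bucket rather than a single globally sorted list.

```latex
% n = total number of files, b = number of buckets (assumed equal in size)
\begin{align*}
T_{\text{flat}}(n) &= O(n \log n) \\
T_{\text{bucketed}}(n, b) &= b \cdot O\!\left(\frac{n}{b} \log \frac{n}{b}\right)
  = O\!\left(n \log \frac{n}{b}\right)
  = O(n \log n - n \log b)
\end{align*}
```

The saving is the n log b term, so it grows with the number of buckets but never changes the dominant n log n behaviour for fixed b.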



Re: New distfile mirror layout

Joshua Kinard-2
In reply to this post by Michał Górny-5
On 10/18/2019 09:41, Michał Górny wrote:

> Hi, everybody.
>
> It is my pleasure to announce that yesterday (EU) evening we've switched
> to a new distfile mirror layout.  Users will be switching to the new
> layout either as they upgrade Portage to 2.3.77 or -- if they upgraded
> already -- as their caches expire (24hrs).
>
> The new layout is mostly a bow towards mirror admins, for some of whom
> having a 60000+ files in a single directory have been a problem.
> However, I suppose some of you also found e.g. the directory index
> hardly usable due to its size.
>
> Throughout a transitional period (whose exact length hasn't been decided
> yet), both layouts will be available.  Afterwards, the old layout will
> be removed from mirrors.  This has a few implications:
>
> 1. Users who don't upgrade their package managers in time will lose
> the ability of fetching from Gentoo mirrors.  This shouldn't be that
> much of a problem given that the core software needed to upgrade Portage
> should all have reliable upstream SRC_URIs.
>
> 2. mirror://gentoo/file URIs will stop working.  While technically you
> could use mirror://gentoo/XX/file, I'd rather recommend finally
> discarding its usage and moving distfiles to devspace.
>
> 3. Directly fetching files from distfiles.gentoo.org will become
> a little harder.  To fetch a distfile named 'foo-1.tar.gz', you'd have
> to use something like:
>
> $ printf '%s' foo-1.tar.gz | b2sum | cut -c1-2
> 1b
> $ wget http://distfiles.gentoo.org/distfiles/1b/foo-1.tar.gz
> ...
>
>
> Alternatively, you can:
>
> $ wget http://distfiles.gentoo.org/distfiles/INDEX
>
> and grep for the right path there.  This INDEX is also a more
> lightweight alternative to HTML indexes generated by the servers.
>
>
> If you're interested in more background details and some plots, see [1].
>
> [1] https://dev.gentoo.org/~mgorny/articles/improving-distfile-mirror-structure.html
>

So the answer I didn't really see directly stated here is, where do new
distfiles need to go //now//?  E.g., if on woodpecker, I currently cp a
distfile to /space/distfiles-local.  What is the new directory I need to
use?  And if mirror://gentoo/${FOO} is going away, for the new distfiles
target, what would be the applicable prefix to use?

Directly using devspace seems like a bad idea, IMHO.  Once long ago, we all
got chastised for doing exactly that.  Too much possibility of fragmentation
as devs retire or package maintainership changes hands.

I looked at the whitepaper'ish-like writeup, and I kinda don't like using a
hash-based naming scheme on the new distfiles layout.  I really kind of prefer
breaking the directories up based on the first letter of the distfiles in
question, factoring case-sensitivity in (so you'd have 52 top-level
directories for A-Z and a-z, plus 10 more for 0-9).  Under each of those
directories, additional subdirectories for the next few letters (say,
letters 2-3).  Yes, this leads to some orphan cases where a distfile might
live on its own, but from a direct navigation standpoint, it's easy to find
for someone browsing the distfiles server and easy to predict where a
distfile is at.

No math, statistical analysis, or deep-rooted knowledge of filesystems
behind that paragraph.  Just a plain old unfiltered opinion.  Sometimes, I
need to go get a distfile off the Gentoo mirrors, and being able to quickly
find it in the mirror root is great.  Having to do hash calculations to work
out the file path will be *really* annoying.
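To make the prefix idea above concrete, here is a minimal sketch of how such a path could be derived. It is only one possible reading of the proposal: the function name `prefix_path` and the `_` fallback for very short names are assumptions for illustration, and names starting with characters outside A-Z, a-z and 0-9 are not treated specially.

```python
def prefix_path(filename: str) -> str:
    """Return a relative path under a hypothetical prefix layout.

    Level 1 is the first character (case-sensitive), level 2 is
    characters 2-3. The "_" fallback for names shorter than two
    characters is an assumption, not part of the proposal.
    """
    if not filename:
        raise ValueError("empty filename")
    level1 = filename[0]
    level2 = filename[1:3] or "_"
    return f"{level1}/{level2}/{filename}"


print(prefix_path("foo-1.tar.gz"))             # f/oo/foo-1.tar.gz
print(prefix_path("Foo-2.tar.xz"))             # F/oo/Foo-2.tar.xz
print(prefix_path("texlive-module-a.tar.xz"))  # t/ex/texlive-module-a.tar.xz
```

The path can be worked out by eye, which is the navigation benefit argued for here. Whether the resulting directories are reasonably balanced is a separate question.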

--
Joshua Kinard
Gentoo/MIPS
[hidden email]
rsa6144/5C63F4E3F5C6C943 2015-04-27
177C 1972 1FB8 F254 BAD0 3E72 5C63 F4E3 F5C6 C943

"The past tempts us, the present confuses us, the future frightens us.  And
our lives slip away, moment by moment, lost in that vast, terrible in-between."

--Emperor Turhan, Centauri Republic


Re: New distfile mirror layout

Alec Warner-2


On Sat, Oct 19, 2019 at 4:24 PM Joshua Kinard <[hidden email]> wrote:
> On 10/18/2019 09:41, Michał Górny wrote:
>> Hi, everybody.
>>
>> It is my pleasure to announce that yesterday (EU) evening we've switched
>> to a new distfile mirror layout.  Users will be switching to the new
>> layout either as they upgrade Portage to 2.3.77 or -- if they upgraded
>> already -- as their caches expire (24hrs).
>>
>> The new layout is mostly a bow towards mirror admins, for some of whom
>> having a 60000+ files in a single directory have been a problem.
>> However, I suppose some of you also found e.g. the directory index
>> hardly usable due to its size.
>>
>> Throughout a transitional period (whose exact length hasn't been decided
>> yet), both layouts will be available.  Afterwards, the old layout will
>> be removed from mirrors.  This has a few implications:
>>
>> 1. Users who don't upgrade their package managers in time will lose
>> the ability of fetching from Gentoo mirrors.  This shouldn't be that
>> much of a problem given that the core software needed to upgrade Portage
>> should all have reliable upstream SRC_URIs.
>>
>> 2. mirror://gentoo/file URIs will stop working.  While technically you
>> could use mirror://gentoo/XX/file, I'd rather recommend finally
>> discarding its usage and moving distfiles to devspace.
>>
>> 3. Directly fetching files from distfiles.gentoo.org will become
>> a little harder.  To fetch a distfile named 'foo-1.tar.gz', you'd have
>> to use something like:
>>
>> $ printf '%s' foo-1.tar.gz | b2sum | cut -c1-2
>> 1b
>> $ wget http://distfiles.gentoo.org/distfiles/1b/foo-1.tar.gz
>> ...
>>
>>
>> Alternatively, you can:
>>
>> $ wget http://distfiles.gentoo.org/distfiles/INDEX
>>
>> and grep for the right path there.  This INDEX is also a more
>> lightweight alternative to HTML indexes generated by the servers.
>>
>>
>> If you're interested in more background details and some plots, see [1].
>>
>> [1] https://dev.gentoo.org/~mgorny/articles/improving-distfile-mirror-structure.html
>>
>
> So the answer I didn't really see directly stated here is, where do new
> distfiles need to go //now//?  E.g., if on woodpecker, I currently cp a
> distfile to /space/distfiles-local.  What is the new directory I need to
> use?  And if mirror://gentoo/${FOO} is going away, for the new distfiles
> target, what would be the applicable prefix to use?
>
> Directly using devspace seems like a bad idea, IMHO.  Once long ago, we all
> got chastised for doing exactly that.  Too much possibility of fragmentation
> as devs retire or package maintainership changes hands.
>
> I looked at the whitepaper'ish-like writeup, and I kinda don't like using a
> hash-based naming scheme on the new distfiles layout.  I really kind of prefer
> breaking the directories up based on the first letter of the distfiles in
> question, factoring case-sensitivity in (so you'd have 52 top-level
> directories for A-Z and a-z, plus 10 more for 0-9).  Under each of those
> directories, additional subdirectories for the next few letters (say,
> letters 2-3).  Yes, this leads to some orphan cases where a distfile might
> live on its own, but from a direct navigation standpoint, it's easy to find
> for someone browsing the distfiles server and easy to predict where a
> distfile is at.
>
> No math, statistical analysis, or deep-rooted knowledge of filesystems
> behind that paragraph.  Just a plain old unfiltered opinion.  Sometimes, I
> need to go get a distfile off the Gentoo mirrors, and being able to quickly
> find it in the mirror root is great.  Having to do hash calculations to work
> out the file path will be *really* annoying.

So if you want a tool that "downloads a distfile off of the mirrors" we should be able to build such a utility.

I'm not really sure why that tool needs to be:
*copy DISTFILENAME*
wget distfiles.gentoo.org/$PASTE

It could just be `ebuild portageq download $DISTFILENAME` or similar.
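As a rough idea of what the core of such a utility might look like, here is a minimal sketch that builds the new-layout URL for a distfile name. It mirrors the `printf '%s' NAME | b2sum | cut -c1-2` recipe from the announcement. The function name `distfile_url` is hypothetical, filenames are assumed to be UTF-8, and no percent-encoding or actual downloading is attempted.

```python
import hashlib

# Base URL as given in the announcement.
BASE_URL = "http://distfiles.gentoo.org/distfiles"


def distfile_url(filename: str, base_url: str = BASE_URL) -> str:
    """Build the new-layout URL for a distfile name.

    Follows the recipe: printf '%s' NAME | b2sum | cut -c1-2
    b2sum defaults to BLAKE2b with a 512-bit digest, which is also
    the default of hashlib.blake2b.
    """
    digest = hashlib.blake2b(filename.encode("utf-8")).hexdigest()
    # The first two hex characters of the digest name the subdirectory.
    return f"{base_url}/{digest[:2]}/{filename}"


print(distfile_url("foo-1.tar.gz"))
```

Since it only needs the Python standard library, a helper like this would also run on non-Gentoo systems.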

-A




 



Re: New distfile mirror layout

Joshua Kinard-2
On 10/19/2019 19:57, Alec Warner wrote:

> On Sat, Oct 19, 2019 at 4:24 PM Joshua Kinard <[hidden email]> wrote:
>
>> On 10/18/2019 09:41, Michał Górny wrote:
>>> Hi, everybody.
>>>
>>> It is my pleasure to announce that yesterday (EU) evening we've switched
>>> to a new distfile mirror layout.  Users will be switching to the new
>>> layout either as they upgrade Portage to 2.3.77 or -- if they upgraded
>>> already -- as their caches expire (24hrs).
>>>
>>> The new layout is mostly a bow towards mirror admins, for some of whom
>>> having a 60000+ files in a single directory have been a problem.
>>> However, I suppose some of you also found e.g. the directory index
>>> hardly usable due to its size.
>>>
>>> Throughout a transitional period (whose exact length hasn't been decided
>>> yet), both layouts will be available.  Afterwards, the old layout will
>>> be removed from mirrors.  This has a few implications:
>>>
>>> 1. Users who don't upgrade their package managers in time will lose
>>> the ability of fetching from Gentoo mirrors.  This shouldn't be that
>>> much of a problem given that the core software needed to upgrade Portage
>>> should all have reliable upstream SRC_URIs.
>>>
>>> 2. mirror://gentoo/file URIs will stop working.  While technically you
>>> could use mirror://gentoo/XX/file, I'd rather recommend finally
>>> discarding its usage and moving distfiles to devspace.
>>>
>>> 3. Directly fetching files from distfiles.gentoo.org will become
>>> a little harder.  To fetch a distfile named 'foo-1.tar.gz', you'd have
>>> to use something like:
>>>
>>> $ printf '%s' foo-1.tar.gz | b2sum | cut -c1-2
>>> 1b
>>> $ wget http://distfiles.gentoo.org/distfiles/1b/foo-1.tar.gz
>>> ...
>>>
>>>
>>> Alternatively, you can:
>>>
>>> $ wget http://distfiles.gentoo.org/distfiles/INDEX
>>>
>>> and grep for the right path there.  This INDEX is also a more
>>> lightweight alternative to HTML indexes generated by the servers.
>>>
>>>
>>> If you're interested in more background details and some plots, see [1].
>>>
>>> [1]
>> https://dev.gentoo.org/~mgorny/articles/improving-distfile-mirror-structure.html
>>>
>>
>> So the answer I didn't really see directly stated here is, where do new
>> distfiles need to go //now//?  E.g., if on woodpecker, I currently cp a
>> distfile to /space/distfiles-local.  What is the new directory I need to
>> use?  And if mirror://gentoo/${FOO} is going away, for the new distfiles
>> target, what would be the applicable prefix to use?
>>
>
>
>
>
>>
>> Directly using devspace seems like a bad idea, IMHO.  Once long ago, we all
>> got chastised for doing exactly that.  Too much possibility of
>> fragmentation
>> as devs retire or package maintainership changes hands.
>>
>> I looked at the whitepaper'ish-like writeup, and I kinda don't like using a
>> hash-based naming scheme on the new distfiles layout.  I really kind of prefer
>> breaking the directories up based on the first letter of the distfiles in
>> question, factoring case-sensitivity in (so you'd have 52 top-level
>> directories for A-Z and a-z, plus 10 more for 0-9).  Under each of those
>> directories, additional subdirectories for the next few letters (say,
>> letters 2-3).  Yes, this leads to some orphan cases where a distfile might
>> live on its own, but from a direct navigation standpoint, it's easy to find
>> for someone browsing the distfiles server and easy to predict where a
>> distfile is at.
>>
>> No math, statistical analysis, or deep-rooted knowledge of filesystems
>> behind that paragraph.  Just a plain old unfiltered opinion.  Sometimes, I
>> need to go get a distfile off the Gentoo mirrors, and being able to quickly
>> find it in the mirror root is great.  Having to do hash calculations to
>> work
>> out the file path will be *really* annoying.
>>
>
> So if you want a tool that "downloads a distfile off of the mirrors" we
> should be able to build such a utility.
>
> I'm not really sure why that tool needs to be:
> *copy DISTFILENAME*
> wget distfiles.gentoo.org/$PASTE
>
> It could just be `ebuild portageq download $DISTFILENAME` or similar.
>
> -A

Sometimes, I'm not on a Gentoo system, or even a Linux/Unix platform, when I
go to fetch a distfile.  I could fetch (and have fetched) sources from Debian's
mirrors instead, but Gentoo is what I know, and fetching a distfile off of
Gentoo's mirrors manually was generally very straightforward.

Not a common case, and certainly not a blocker.  I was just pointing out
that hash-based naming is decidedly a lot less human-friendly.  But,
that's been the general trend for all-things technology these last few years.

--
Joshua Kinard
Gentoo/MIPS
[hidden email]
rsa6144/5C63F4E3F5C6C943 2015-04-27
177C 1972 1FB8 F254 BAD0 3E72 5C63 F4E3 F5C6 C943

"The past tempts us, the present confuses us, the future frightens us.  And
our lives slip away, moment by moment, lost in that vast, terrible in-between."

--Emperor Turhan, Centauri Republic


Re: New distfile mirror layout

Michał Górny-5
In reply to this post by Joshua Kinard-2
On Sat, 2019-10-19 at 19:24 -0400, Joshua Kinard wrote:

> On 10/18/2019 09:41, Michał Górny wrote:
> > Hi, everybody.
> >
> > It is my pleasure to announce that yesterday (EU) evening we've switched
> > to a new distfile mirror layout.  Users will be switching to the new
> > layout either as they upgrade Portage to 2.3.77 or -- if they upgraded
> > already -- as their caches expire (24hrs).
> >
> > The new layout is mostly a bow towards mirror admins, for some of whom
> > having a 60000+ files in a single directory have been a problem.
> > However, I suppose some of you also found e.g. the directory index
> > hardly usable due to its size.
> >
> > Throughout a transitional period (whose exact length hasn't been decided
> > yet), both layouts will be available.  Afterwards, the old layout will
> > be removed from mirrors.  This has a few implications:
> >
> > 1. Users who don't upgrade their package managers in time will lose
> > the ability of fetching from Gentoo mirrors.  This shouldn't be that
> > much of a problem given that the core software needed to upgrade Portage
> > should all have reliable upstream SRC_URIs.
> >
> > 2. mirror://gentoo/file URIs will stop working.  While technically you
> > could use mirror://gentoo/XX/file, I'd rather recommend finally
> > discarding its usage and moving distfiles to devspace.
> >
> > 3. Directly fetching files from distfiles.gentoo.org will become
> > a little harder.  To fetch a distfile named 'foo-1.tar.gz', you'd have
> > to use something like:
> >
> > $ printf '%s' foo-1.tar.gz | b2sum | cut -c1-2
> > 1b
> > $ wget http://distfiles.gentoo.org/distfiles/1b/foo-1.tar.gz
> > ...
> >
> >
> > Alternatively, you can:
> >
> > $ wget http://distfiles.gentoo.org/distfiles/INDEX
> >
> > and grep for the right path there.  This INDEX is also a more
> > lightweight alternative to HTML indexes generated by the servers.
> >
> >
> > If you're interested in more background details and some plots, see [1].
> >
> > [1] https://dev.gentoo.org/~mgorny/articles/improving-distfile-mirror-structure.html
> >
>
> So the answer I didn't really see directly stated here is, where do new
> distfiles need to go //now//?  E.g., if on woodpecker, I currently cp a
> distfile to /space/distfiles-local.  What is the new directory I need to
> use?  And if mirror://gentoo/${FOO} is going away, for the new distfiles
> target, what would be the applicable prefix to use?
>
> Directly using devspace seems like a bad idea, IMHO.  Once long ago, we all
> got chastised for doing exactly that.  Too much possibility of fragmentation
> as devs retire or package maintainership changes hands.

Today you get chastised for using /space/distfiles-local and not
following policy changes.  The devmanual states that it has been deprecated
since at least 2011, and talks of using d.g.o [1].

> I looked at the whitepaper'ish-like writeup, and I kinda don't like using a
> hash-based naming scheme on the new distfiles layout.  I really kind of prefer
> breaking the directories up based on the first letter of the distfiles in
> question, factoring case-sensitivity in (so you'd have 52 top-level
> directories for A-Z and a-z, plus 10 more for 0-9).  Under each of those
> directories, additional subdirectories for the next few letters (say,
> letters 2-3).  Yes, this leads to some orphan cases where a distfile might
> live on its own, but from a direct navigation standpoint, it's easy to find
> for someone browsing the distfiles server and easy to predict where a
> distfile is at.
>
> No math, statistical analysis, or deep-rooted knowledge of filesystems
> behind that paragraph.  Just a plain old unfiltered opinion.  Sometimes, I
> need to go get a distfile off the Gentoo mirrors, and being able to quickly
> find it in the mirror root is great.  Having to do hash calculations to work
> out the file path will be *really* annoying.

Your solution still doesn't solve the problem of having 8k-24k files
in a single directory, even if you use 7 letters of prefix.  So it just
creates a lot of tiny directory noise for no practical gain.
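The imbalance argument can be illustrated with a small experiment. The sketch below uses synthetic filenames, not real mirror statistics. The `texlive-module-N` and `pkgN` patterns are made up to stand in for "many distfiles sharing a long common prefix" and "everything else", and the helper names are hypothetical.

```python
import hashlib
from collections import Counter


def prefix_bucket(name: str, length: int = 7) -> str:
    """Bucket by the first `length` characters of the filename."""
    return name[:length]


def hash_bucket(name: str) -> str:
    """Bucket by the first two hex digits of the BLAKE2b digest."""
    return hashlib.blake2b(name.encode("utf-8")).hexdigest()[:2]


def largest_bucket(names, bucket_of) -> int:
    """Return the number of files in the fullest bucket."""
    return max(Counter(bucket_of(n) for n in names).values())


# Synthetic data only: 20000 names sharing one long prefix,
# plus 40000 names with varied prefixes.
names = [f"texlive-module-{i}.tar.xz" for i in range(20000)]
names += [f"pkg{i}-1.0.tar.gz" for i in range(40000)]

print("largest 7-letter prefix bucket:", largest_bucket(names, prefix_bucket))
print("largest hash bucket:", largest_bucket(names, hash_bucket))
```

With this input, every `texlive-module-*` name lands in the same 7-letter prefix bucket, while the hash spreads all names over at most 256 buckets of roughly equal size.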

[1] https://devmanual.gentoo.org/general-concepts/mirrors/index.html#suitable-download-hosts

--
Best regards,
Michał Górny



Re: New distfile mirror layout

Joshua Kinard-2
On 10/20/2019 02:51, Michał Górny wrote:

> On Sat, 2019-10-19 at 19:24 -0400, Joshua Kinard wrote:
>> On 10/18/2019 09:41, Michał Górny wrote:
>>> Hi, everybody.
>>>
>>> It is my pleasure to announce that yesterday (EU) evening we've switched
>>> to a new distfile mirror layout.  Users will be switching to the new
>>> layout either as they upgrade Portage to 2.3.77 or -- if they upgraded
>>> already -- as their caches expire (24hrs).
>>>
>>> The new layout is mostly a bow towards mirror admins, for some of whom
>>> having a 60000+ files in a single directory have been a problem.
>>> However, I suppose some of you also found e.g. the directory index
>>> hardly usable due to its size.
>>>
>>> Throughout a transitional period (whose exact length hasn't been decided
>>> yet), both layouts will be available.  Afterwards, the old layout will
>>> be removed from mirrors.  This has a few implications:
>>>
>>> 1. Users who don't upgrade their package managers in time will lose
>>> the ability of fetching from Gentoo mirrors.  This shouldn't be that
>>> much of a problem given that the core software needed to upgrade Portage
>>> should all have reliable upstream SRC_URIs.
>>>
>>> 2. mirror://gentoo/file URIs will stop working.  While technically you
>>> could use mirror://gentoo/XX/file, I'd rather recommend finally
>>> discarding its usage and moving distfiles to devspace.
>>>
>>> 3. Directly fetching files from distfiles.gentoo.org will become
>>> a little harder.  To fetch a distfile named 'foo-1.tar.gz', you'd have
>>> to use something like:
>>>
>>> $ printf '%s' foo-1.tar.gz | b2sum | cut -c1-2
>>> 1b
>>> $ wget http://distfiles.gentoo.org/distfiles/1b/foo-1.tar.gz
>>> ...
>>>
>>>
>>> Alternatively, you can:
>>>
>>> $ wget http://distfiles.gentoo.org/distfiles/INDEX
>>>
>>> and grep for the right path there.  This INDEX is also a more
>>> lightweight alternative to HTML indexes generated by the servers.
>>>
>>>
>>> If you're interested in more background details and some plots, see [1].
>>>
>>> [1] https://dev.gentoo.org/~mgorny/articles/improving-distfile-mirror-structure.html
>>>
>>
>> So the answer I didn't really see directly stated here is, where do new
>> distfiles need to go //now//?  E.g., if on woodpecker, I currently cp a
>> distfile to /space/distfiles-local.  What is the new directory I need to
>> use?  And if mirror://gentoo/${FOO} is going away, for the new distfiles
>> target, what would be the applicable prefix to use?
>>
>> Directly using devspace seems like a bad idea, IMHO.  Once long ago, we all
>> got chastised for doing exactly that.  Too much possibility of fragmentation
>> as devs retire or package maintainership changes hands.
>
> Today you get chastised for using /space/distfiles-local and not
> following policy changes.  The devmanual states that it's deprecated
> since at least 2011, and talks of using d.g.o [1].

I don't recall this change being added as far back as 2011.  Maybe my memory
is bad, but if it was done that long ago, it was done quietly, and it was
not enforced.  I checked my local mailing list archives for gentoo-dev and
don't see any mention of distfiles-local being deprecated back then.  Why
has it taken 8 years for this to get addressed?

In any event, I still think using devspace is a bad idea.  A centralized
distfiles repo is what most other distros use, and it's what we should use.


>> I looked at the whitepaper'ish-like writeup, and I kinda don't like using a
>> hash-based naming scheme on the new distfiles layout.  I really kind of prefer
>> breaking the directories up based on the first letter of the distfiles in
>> question, factoring case-sensitivity in (so you'd have 52 top-level
>> directories for A-Z and a-z, plus 10 more for 0-9).  Under each of those
>> directories, additional subdirectories for the next few letters (say,
>> letters 2-3).  Yes, this leads to some orphan cases where a distfile might
>> live on its own, but from a direct navigation standpoint, it's easy to find
>> for someone browsing the distfiles server and easy to predict where a
>> distfile is at.
>>
>> No math, statistical analysis, or deep-rooted knowledge of filesystems
>> behind that paragraph.  Just a plain old unfiltered opinion.  Sometimes, I
>> need to go get a distfile off the Gentoo mirrors, and being able to quickly
>> find it in the mirror root is great.  Having to do hash calculations to work
>> out the file path will be *really* annoying.
>
> Your solution still doesn't solve the problem of having 8k-24k files
> in a single directory, even if you use 7 letters of prefix.  So it just
> creates a lot of tiny directory noise for no practical gain.

Why is having a max ~24k files in a directory a bad idea?  Modern
filesystems are more than capable of handling that.

  - ext4: unlimited files in a directory
  - xfs: virtually unlimited (hard limit of 2^64-1 total files per volume)
  - ntfs: 4,294,967,295

And 24k is a bit more than 1/3rd of all distfiles that we currently have.
Under which scenario do you wind up with 24k files in a single directory?  I
consider the tex package an outlier in this case (one package should not be
the sole dictator of policy).

--
Joshua Kinard
Gentoo/MIPS
[hidden email]
rsa6144/5C63F4E3F5C6C943 2015-04-27
177C 1972 1FB8 F254 BAD0 3E72 5C63 F4E3 F5C6 C943

"The past tempts us, the present confuses us, the future frightens us.  And
our lives slip away, moment by moment, lost in that vast, terrible in-between."

--Emperor Turhan, Centauri Republic


Re: New distfile mirror layout

Michał Górny-5
On Sun, 2019-10-20 at 04:25 -0400, Joshua Kinard wrote:

> On 10/20/2019 02:51, Michał Górny wrote:
> > On Sat, 2019-10-19 at 19:24 -0400, Joshua Kinard wrote:
> > > On 10/18/2019 09:41, Michał Górny wrote:
> > > > Hi, everybody.
> > > >
> > > > It is my pleasure to announce that yesterday (EU) evening we've switched
> > > > to a new distfile mirror layout.  Users will be switching to the new
> > > > layout either as they upgrade Portage to 2.3.77 or -- if they upgraded
> > > > already -- as their caches expire (24hrs).
> > > >
> > > > The new layout is mostly a bow towards mirror admins, for some of whom
> > > > having a 60000+ files in a single directory have been a problem.
> > > > However, I suppose some of you also found e.g. the directory index
> > > > hardly usable due to its size.
> > > >
> > > > Throughout a transitional period (whose exact length hasn't been decided
> > > > yet), both layouts will be available.  Afterwards, the old layout will
> > > > be removed from mirrors.  This has a few implications:
> > > >
> > > > 1. Users who don't upgrade their package managers in time will lose
> > > > the ability of fetching from Gentoo mirrors.  This shouldn't be that
> > > > much of a problem given that the core software needed to upgrade Portage
> > > > should all have reliable upstream SRC_URIs.
> > > >
> > > > 2. mirror://gentoo/file URIs will stop working.  While technically you
> > > > could use mirror://gentoo/XX/file, I'd rather recommend finally
> > > > discarding its usage and moving distfiles to devspace.
> > > >
> > > > 3. Directly fetching files from distfiles.gentoo.org will become
> > > > a little harder.  To fetch a distfile named 'foo-1.tar.gz', you'd have
> > > > to use something like:
> > > >
> > > > $ printf '%s' foo-1.tar.gz | b2sum | cut -c1-2
> > > > 1b
> > > > $ wget http://distfiles.gentoo.org/distfiles/1b/foo-1.tar.gz
> > > > ...
> > > >
> > > >
> > > > Alternatively, you can:
> > > >
> > > > $ wget http://distfiles.gentoo.org/distfiles/INDEX
> > > >
> > > > and grep for the right path there.  This INDEX is also a more
> > > > lightweight alternative to HTML indexes generated by the servers.
> > > >
> > > >
> > > > If you're interested in more background details and some plots, see [1].
> > > >
> > > > [1] https://dev.gentoo.org/~mgorny/articles/improving-distfile-mirror-structure.html
> > > >
> > >
> > > So the answer I didn't really see directly stated here is, where do new
> > > distfiles need to go //now//?  E.g., if on woodpecker, I currently cp a
> > > distfile to /space/distfiles-local.  What is the new directory I need to
> > > use?  And if mirror://gentoo/${FOO} is going away, for the new distfiles
> > > target, what would be the applicable prefix to use?
> > >
> > > Directly using devspace seems like a bad idea, IMHO.  Once long ago, we all
> > > got chastised for doing exactly that.  Too much possibility of fragmentation
> > > as devs retire or package maintainership changes hands.
> >
> > Today you get chastised for using /space/distfiles-local and not
> > following policy changes.  The devmanual states that it's deprecated
> > since at least 2011, and talks of using d.g.o [1].
>
> I don't recall this change being added as far back as 2011.  Maybe my memory
> is bad, but if it was done that long ago, it was done quietly, and it was
> not enforced.  I checked my local mailing list archives for gentoo-dev and
> don't see any mention of distfiles-local being deprecated back then.  Why
> has it taken 8 years for this to get addressed?

Don't ask me.  I think I was already taught to use d.g.o back when I was
recruited.

> In any event, I still think using devspace is a bad idea.  A centralized
> distfiles repo is what most other distros use, and it's what we should use.

Talking doesn't make things happen.  Coming up with good proposals that
address all the problems (e.g. those listed in devmanual) does.

> > > I looked at the whitepaper'ish-like writeup, and I kinda don't like using a
> > > hash-based naming scheme on the new distfiles layout.  I really kind of prefer
> > > breaking the directories up based on the first letter of the distfiles in
> > > question, factoring case-sensitivity in (so you'd have 52 top-level
> > > directories for A-Z and a-z, plus 10 more for 0-9).  Under each of those
> > > directories, additional subdirectories for the next few letters (say,
> > > letters 2-3).  Yes, this leads to some orphan cases where a distfile might
> > > live on its own, but from a direct navigation standpoint, it's easy to find
> > > for someone browsing the distfiles server and easy to predict where a
> > > distfile is at.
> > >
> > > No math, statistical analysis, or deep-rooted knowledge of filesystems
> > > behind that paragraph.  Just a plain old unfiltered opinion.  Sometimes, I
> > > need to go get a distfile off the Gentoo mirrors, and being able to quickly
> > > find it in the mirror root is great.  Having to do hash calculations to work
> > > out the file path will be *really* annoying.
> >
> > Your solution still doesn't solve the problem of having 8k-24k files
> > in a single directory, even if you use 7 letters of prefix.  So it just
> > creates a lot of tiny directory noise for no practical gain.
>
> Why is having a max ~24k files in a directory a bad idea?  Modern
> filesystems are more than capable of handling that.
>
>   - ext4: unlimited files in a directory
>   - xfs: virtually unlimited (hard limit of 2^64-1 total files per volume)
>   - ntfs: 4,294,967,295
>
> And 24k is a bit more than 1/3rd of all distfiles that we currently have.

For the same reason having ~60k files in a directory was a problem.
There is really no point in changing anything if you change BIG_NUMBER
to SMALLER_BIG_NUMBER.

> Under which scenario do you wind up with 24k files in a single directory?  I
> consider the tex package an outlier in this case (one package should not be
> the sole dictator of policy).

Three versions of TeXLive living simultaneously.  If one package falls
completely out of bounds, no problem is solved by the change, so what's
the point of making it?

--
Best regards,
Michał Górny



Re: New distfile mirror layout

Joshua Kinard-2
On 10/20/2019 04:32, Michał Górny wrote:

> On Sun, 2019-10-20 at 04:25 -0400, Joshua Kinard wrote:
>> On 10/20/2019 02:51, Michał Górny wrote:
>>> On Sat, 2019-10-19 at 19:24 -0400, Joshua Kinard wrote:
>>>> On 10/18/2019 09:41, Michał Górny wrote:
>>>>> Hi, everybody.
>>>>>
>>>>> It is my pleasure to announce that yesterday (EU) evening we've switched
>>>>> to a new distfile mirror layout.  Users will be switching to the new
>>>>> layout either as they upgrade Portage to 2.3.77 or -- if they upgraded
>>>>> already -- as their caches expire (24hrs).
>>>>>
>>>>> The new layout is mostly a bow towards mirror admins, for some of whom
>>>>> having a 60000+ files in a single directory have been a problem.
>>>>> However, I suppose some of you also found e.g. the directory index
>>>>> hardly usable due to its size.
>>>>>
>>>>> Throughout a transitional period (whose exact length hasn't been decided
>>>>> yet), both layouts will be available.  Afterwards, the old layout will
>>>>> be removed from mirrors.  This has a few implications:
>>>>>
>>>>> 1. Users who don't upgrade their package managers in time will lose
>>>>> the ability of fetching from Gentoo mirrors.  This shouldn't be that
>>>>> much of a problem given that the core software needed to upgrade Portage
>>>>> should all have reliable upstream SRC_URIs.
>>>>>
>>>>> 2. mirror://gentoo/file URIs will stop working.  While technically you
>>>>> could use mirror://gentoo/XX/file, I'd rather recommend finally
>>>>> discarding its usage and moving distfiles to devspace.
>>>>>
>>>>> 3. Directly fetching files from distfiles.gentoo.org will become
>>>>> a little harder.  To fetch a distfile named 'foo-1.tar.gz', you'd have
>>>>> to use something like:
>>>>>
>>>>> $ printf '%s' foo-1.tar.gz | b2sum | cut -c1-2
>>>>> 1b
>>>>> $ wget http://distfiles.gentoo.org/distfiles/1b/foo-1.tar.gz
>>>>> ...
>>>>>
>>>>>
>>>>> Alternatively, you can:
>>>>>
>>>>> $ wget http://distfiles.gentoo.org/distfiles/INDEX
>>>>>
>>>>> and grep for the right path there.  This INDEX is also a more
>>>>> lightweight alternative to HTML indexes generated by the servers.
>>>>>
>>>>>
>>>>> If you're interested in more background details and some plots, see [1].
>>>>>
>>>>> [1] https://dev.gentoo.org/~mgorny/articles/improving-distfile-mirror-structure.html
>>>>>
>>>>
>>>> So the answer I didn't really see directly stated here is, where do new
>>>> distfiles need to go //now//?  E.g., if on woodpecker, I currently cp a
>>>> distfile to /space/distfiles-local.  What is the new directory I need to
>>>> use?  And if mirror://gentoo/${FOO} is going away, for the new distfiles
>>>> target, what would be the applicable prefix to use?
>>>>
>>>> Directly using devspace seems like a bad idea, IMHO.  Once long ago, we all
>>>> got chastised for doing exactly that.  Too much possibility of fragmentation
>>>> as devs retire or package maintainership changes hands.
>>>
>>> Today you get chastised for using /space/distfiles-local and not
>>> following policy changes.  The devmanual states that it's deprecated
>>> since at least 2011, and talks of using d.g.o [1].
>>
>> I don't recall this change being added as far back as 2011.  Maybe my memory
>> is bad, but if it was done that long ago, it was done quietly, and it was
>> not enforced.  I checked my local mailing list archives for gentoo-dev and
>> don't see any mention of distfiles-local being deprecated back then.  Why
>> has it taken 8 years for this to get addressed?
>
> Don't ask me.  I think I was already taught to use d.g.o back when I was
> recruited.
>
>> In any event, I still think using devspace is a bad idea.  A centralized
>> distfiles repo is what most other distros use, and it's what we should use.
>
> Talking doesn't make things happen.  Coming up with good proposals that
> address all the problems (e.g. those listed in devmanual) does.

Proposing changes when a direction has already been decided, the rudder
position changed, and engines put to full power is equally pointless.
You're the de facto captain of this ship lately.  I expect you to not rock
the boat too hard.  This change is a pretty hard jolt, IMHO.


>>>> I looked at the whitepaper'ish-like writeup, and I kinda don't like using a
>>>> hash-based naming scheme on the new distfiles layout.  I really kind prefer
>>>> breaking the directories up based on the first letter of the distfiles in
>>>> question, factoring case-sensitivity in (so you'd have 52 top-level
>>>> directories for A-Z and a-z, plus 10 more for 0-9).  Under each of those
>>>> directories, additional subdirectories for the next few letters (say,
>>>> letters 2-3).  Yes, this leads to some orphan cases where a distfile might
>>>> live on its own, but from a direct navigation standpoint, it's easy to find
>>>> for someone browsing the distfiles server and easy to predict where a
>>>> distfile is at.
>>>>
>>>> No math, statistical analysis, or deep-rooted knowledge of filesystems
>>>> behind that paragraph.  Just a plain old unfiltered opinion.  Sometimes, I
>>>> need to go get a distfile off the Gentoo mirrors, and being able to quickly
>>>> find it in the mirror root is great.  Having to do hash calculations to work
>>>> out the file path will be *really* annoying.
>>>
>>> Your solution still doesn't solve the problem of having 8k-24k files
>>> in a single directory, even if you use 7 letters of prefix.  So it just
>>> creates a lot of tiny directory noise for no practical gain.
>>
>> Why is having a max ~24k files in a directory a bad idea?  Modern
>> filesystems are more than capable of handling that.
>>
>>   - ext4: unlimited files in a directory
>>   - xfs: virtually unlimited (hard limit of 2^64-1 total files per volume)
>>   - ntfs: 4,294,967,295
>>
>> And 24k is a bit more than 1/3rd of all distfiles that we currently have.
>
> For the same reason having ~60k files in a directory was a problem.
> There is really no point in changing anything if you change BIG_NUMBER
> to SMALLER_BIG_NUMBER.

That doesn't answer my question.  Why is it a problem?  What criteria are
you using to decide that 24k is a "smaller big number"?  Is there some issue
highlighted by the mirror admins where having 24k files in a single
directory offers no significant relief versus the current 60k files?


>> Under which scenario do you wind up with 24k files in a single directory?  I
>> consider the tex package an outlier in this case (one package should not be
>> the sole dictator of policy).
>
> Three versions of TeXLive living simultaneously.  If one package falls
> completely out of bounds, no problem is solved by the change, so what's
> the point of making it?

The problem in this case is with texlive, not our current, or future,
distfiles methodology.  Has anyone looked at how other distros deal with
texlive?  Has anyone complained or filed a bug to texlive developers
upstream about their excessive amount of distfiles and the burden it places
on distro maintainers?

--
Joshua Kinard
Gentoo/MIPS
[hidden email]
rsa6144/5C63F4E3F5C6C943 2015-04-27
177C 1972 1FB8 F254 BAD0 3E72 5C63 F4E3 F5C6 C943

"The past tempts us, the present confuses us, the future frightens us.  And
our lives slip away, moment by moment, lost in that vast, terrible in-between."

--Emperor Turhan, Centauri Republic


Re: New distfile mirror layout

Michał Górny-5
On Sun, 2019-10-20 at 05:21 -0400, Joshua Kinard wrote:

> On 10/20/2019 04:32, Michał Górny wrote:
> > On Sun, 2019-10-20 at 04:25 -0400, Joshua Kinard wrote:
> > > Why is having a max ~24k files in a directory a bad idea?  Modern
> > > filesystems are more than capable of handling that.
> > >
> > >   - ext4: unlimited files in a directory
> > >   - xfs: virtually unlimited (hard limit of 2^64-1 total files per volume)
> > >   - ntfs: 4,294,967,295
> > >
> > > And 24k is a bit more than 1/3rd of all distfiles that we currently have.
> >
> > For the same reason having ~60k files in a directory was a problem.
> > There is really no point in changing anything if you change BIG_NUMBER
> > to SMALLER_BIG_NUMBER.
>
> That doesn't answer my question.  Why is it a problem?  What criteria are
> you using to decide that 24k is a "smaller big number"?  Is there some issue
> highlighted by the mirror admins where having 24k files in a single
> directory offers no significant relief versus the current 60k files?

IIRC Robin set the goal as:

| the number of files in a single directory should not exceed 1000, [1]

I don't recall how that number was chosen but it's probably pretty
arbitrary.  In any case, I can notice the difference between working
with a listing of 1k files and 24k files, on the hardware running
masterdist.
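
To put the 1000-file goal next to the layout from the announcement, here is a minimal Python sketch. It assumes the directory name is the first two hex digits of the BLAKE2b-512 digest of the file name (which is what the `b2sum | cut -c1-2` pipeline in the announcement computes) and uses the rough 60000-file figure from this thread. The function names are hypothetical, not part of Portage.

```python
import hashlib


def distfile_bucket(filename: str) -> str:
    """First two hex digits of the BLAKE2b-512 digest of the file name.

    Mirrors `printf '%s' NAME | b2sum | cut -c1-2` (b2sum defaults to a
    512-bit digest, as does hashlib.blake2b).
    """
    return hashlib.blake2b(filename.encode("utf-8")).hexdigest()[:2]


def mirror_path(filename: str) -> str:
    """Relative path of a distfile under the new layout (assumed form)."""
    return f"distfiles/{distfile_bucket(filename)}/{filename}"


# Rough arithmetic: ~60000 files spread over 16**2 = 256 directories.
TOTAL_FILES = 60000          # approximate figure quoted in the thread
BUCKETS = 16 ** 2            # two hex digits
average = TOTAL_FILES / BUCKETS

print(mirror_path("foo-1.tar.gz"))
print(f"about {average:.0f} files per directory on average")
```

With a reasonably uniform hash, the average of roughly 234 files per directory stays well under the 1000-file target, regardless of how the file names themselves are distributed.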

> > > Under which scenario do you wind up with 24k files in a single directory?  I
> > > consider the tex package an outlier in this case (one package should not be
> > > the sole dictator of policy).
> >
> > Three versions of TeXLive living simultaneously.  If one package falls
> > completely out of bounds, no problem is solved by the change, so what's
> > the point of making it?
>
> The problem in this case is with texlive, not our current, or future,
> distfiles methodology.

Is it?  Are you suggesting we should ban upstream from using multiple
distfiles with a similar prefix?  What about other potential packages that
may suffer from the same problem in the future?  Go packages have good
potential for that, given that the majority of them start with 'github.com'.
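
The clustering concern can be illustrated with a small Python sketch. The file names below are invented for the example (real Go distfiles are named differently), the 7-character prefix matches the "7 letters of prefix" case mentioned earlier in the thread, and the hash split assumes the two-hex-digit BLAKE2b scheme from the announcement.

```python
import hashlib
from collections import Counter

PREFIX_LEN = 7  # prefix-based split, as in the "7 letters" case above

# Hypothetical distfile names that all share the 'github.com' prefix.
names = [f"github.com-example-module-{i}.zip" for i in range(1000)]


def prefix_bucket(name: str) -> str:
    """Directory chosen by the leading characters of the file name."""
    return name[:PREFIX_LEN]


def hash_bucket(name: str) -> str:
    """Directory chosen by the first two hex digits of BLAKE2b(name)."""
    return hashlib.blake2b(name.encode("utf-8")).hexdigest()[:2]


by_prefix = Counter(prefix_bucket(n) for n in names)
by_hash = Counter(hash_bucket(n) for n in names)

print("prefix: directories =", len(by_prefix),
      "largest =", max(by_prefix.values()))
print("hash:   directories =", len(by_hash),
      "largest =", max(by_hash.values()))
```

Under the prefix split, all 1000 names land in the single directory `github.`, while the hash split spreads them over up to 256 directories with only a handful of files each.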

> Has anyone looked at how other distros deal with texlive?

Other distros don't mirror original distfiles.

>   Has anyone complained or filed a bug to texlive developers
> upstream about their excessive amount of distfiles and the burden it places
> on distro maintainers?

You believe it to be a problem.  Don't expect others to bother upstream
with your preferences.


[1] https://www.gentoo.org/glep/glep-0075.html#algorithm-for-splitting-distfiles

--
Best regards,
Michał Górny



Re: New distfile mirror layout

Matt Turner-5
In reply to this post by Joshua Kinard-2
On Sun, Oct 20, 2019 at 1:25 AM Joshua Kinard <[hidden email]> wrote:
> In any event, I still think using devspace is a bad idea.  A centralized
> distfiles repo is what most other distros use, and it's what we should use.

I agree, but let's discuss that in a separate topic.
