New distfile mirror layout

Re: New distfile mirror layout

Joshua Kinard-2
On 10/20/2019 05:44, Michał Górny wrote:

> On Sun, 2019-10-20 at 05:21 -0400, Joshua Kinard wrote:
>> On 10/20/2019 04:32, Michał Górny wrote:
>>> On Sun, 2019-10-20 at 04:25 -0400, Joshua Kinard wrote:
>>>> Why is having a max ~24k files in a directory a bad idea?  Modern
>>>> filesystems are more than capable of handling that.
>>>>
>>>>   - ext4: unlimited files in a directory
>>>>   - xfs: virtually unlimited (hard limit of 2^64-1 total files per volume)
>>>>   - ntfs: 4,294,967,295
>>>>
>>>> And 24k is a bit more than 1/3rd of all distfiles that we currently have.
>>>
>>> For the same reason having ~60k files in a directory was a problem.
>>> There is really no point in changing anything if you change BIG_NUMBER
>>> to SMALLER_BIG_NUMBER.
>>
>> That doesn't answer my question.  Why is it a problem?  What criteria are
>> you using to decide that 24k is a "smaller big number"?  Is there some issue
>> highlighted by the mirror admins where having 24k files in a single
>> directory offers no significant relief versus the current 60k files?
>
> IIRC Robin set the goal as:
>
> | the number of files in a single directory should not exceed 1000, [1]
>
> I don't recall how that number was chosen but it's probably pretty
> arbitrary.  In any case, I can notice the difference between working
> with a listing of 1k files and 24k files, on the hardware running
> masterdist.

I think it would be prudent then to get some data to help underpin why that
number was chosen and add that to the GLEP, possibly as one of the
references at the bottom.  Your personal observations of a system
(masterdist) that few of us have access to is not good enough, especially
for future developers who may revisit this topic long after you or I are gone.


>
>>>> Under which scenario do you wind up with 24k files in a single directory?  I
>>>> consider the tex package an outlier in this case (one package should not be
>>>> the sole dictator of policy).
>>>
>>> Three versions of TeXLive living simultaneously.  If one package falls
>>> completely out of bounds, no problem is solved by the change, so what's
>>> the point of making it?
>>
>> The problem in this case is with texlive, not our current, or future,
>> distfiles methodology.
>
> Is it?  Are you suggesting we should ban upstream from using multiple
> distfiles with similar prefix?  What about other potential packages that
> may suffer from the same problem in the future?  Go packages have a good
> potential, given that the majority of them start with 'github.com'.

Please highlight which of my words imply in any way that I want to ban
something.  I simply said texlive's significant number of distfiles is a
problem.  That doesn't mean that I want to resolve the problem by banning
it, or future packages that employ that method.

My concern is that out of the tens of thousands of packages we have, we're
allowing ONE package to dictate how we shape a major piece of Gentoo
infrastructure, and I don't feel that the proposed solution seeks to address
it.  Rather, it seeks to band-aid it by wrapping the entire distro up like a
mummy.


>> Has anyone looked at how other distros deal with texlive?
>
> Other distros don't mirror original distfiles.

Has thought been given to doing the same?  This is arguably a better approach
than mirroring original distfiles in devspace.  This would significantly
reduce the infrastructure burden on the project.


>>   Has anyone complained or filed a bug to texlive developers
>> upstream about their excessive amount of distfiles and the burden it places
>> on distro maintainers?
>
> You believe it to be a problem.  Don't expect others to bother upstream
> with your preferences.

Hah.  So you consider texlive having 16k+ distfiles to be completely within
operating norms then?

I did a quick look, and it looks like the TeX project has a fairly
comprehensive mirroring system distributed around the world.  In fact, it
looks like they emulate Perl's CPAN system with "CTAN":

https://ctan.org/

I don't know the history of the texlive and other associated tex packages in
Gentoo, but my guess is instead of doing what our Perl packages do, someone
just decided to mirror the CTAN archive directly on the Gentoo distfiles
system.  It seems to me that what should actually happen is that we leverage
CTAN itself, much like CPAN, and use their mirroring system instead of
burdening our infrastructure as an unofficial CTAN archive.

I know we've got a ton of Perl packages for the core set of Perl modules,
but doesn't the CPAN eclass also have the capability to auto-generate an
ebuild package for virtually any Perl package distributed via CPAN?  Can
that logic be used with the CTAN system in its own eclass and then we remove
the 16k+ texlive modules off of our mirrors completely?  Or at the worst, we
might just have to generate ebuilds for texlive modules and treat them as
discrete, installed packages.

--
Joshua Kinard
Gentoo/MIPS
[hidden email]
rsa6144/5C63F4E3F5C6C943 2015-04-27
177C 1972 1FB8 F254 BAD0 3E72 5C63 F4E3 F5C6 C943

"The past tempts us, the present confuses us, the future frightens us.  And
our lives slip away, moment by moment, lost in that vast, terrible in-between."

--Emperor Turhan, Centauri Republic

Re: New distfile mirror layout

Joshua Kinard-2
On 10/20/2019 16:57, Joshua Kinard wrote:
> On 10/20/2019 05:44, Michał Górny wrote:
>> On Sun, 2019-10-20 at 05:21 -0400, Joshua Kinard wrote:
>>> On 10/20/2019 04:32, Michał Górny wrote:
[snip]

>> You believe it to be a problem.  Don't expect others to bother upstream
>> with your preferences.
>
> Hah.  So you consider texlive having 16k+ distfiles to be completely within
> operating norms then?
>
> I did a quick look, and it looks like the TeX project has a fairly
> comprehensive mirroring system distributed around the world.  In fact, it
> looks like they emulate Perl's CPAN system with "CTAN":
>
> https://ctan.org/
>
> I don't know the history of the texlive and other associated tex packages in
> Gentoo, but my guess is instead of doing what our Perl packages do, someone
> just decided to mirror the CTAN archive directly on the Gentoo distfiles
> system.  It seems to me that what should actually happen is that we leverage
> CTAN itself, much like CPAN, and use their mirroring system instead of
> burdening our infrastructure as an unofficial CTAN archive.
>
> I know we've got a ton of Perl packages for the core set of Perl modules,
> but doesn't the CPAN eclass also have the capability to auto-generate an
> ebuild package for virtually any Perl package distributed via CPAN?  Can
> that logic be used with the CTAN system in its own eclass and then we remove
> the 16k+ texlive modules off of our mirrors completely?  Or at the worst, we
> might just have to generate ebuilds for texlive modules and treat them as
> discrete, installed packages.

So looking at texlive-latexextra-2019-r2.ebuild, it defines three variables:

  - TEXLIVE_MODULE_CONTENTS, with 1,241 space-delimited module names
  - TEXLIVE_MODULE_DOC_CONTENTS, with 1,227 space-delimited doc names
  - TEXLIVE_MODULE_SRC_CONTENTS, with 745 space-delimited src names

Then, in texlive-module.eclass, there's these loops:

for i in ${TEXLIVE_MODULE_CONTENTS}; do
        SRC_URI="${SRC_URI} mirror://gentoo/texlive-module-${i}-${PV}.${PKGEXT}"
done

# Forge doc SRC_URI
[ -n "${TEXLIVE_MODULE_DOC_CONTENTS}" ] && SRC_URI="${SRC_URI} doc? ("
for i in ${TEXLIVE_MODULE_DOC_CONTENTS}; do
        SRC_URI="${SRC_URI} mirror://gentoo/texlive-module-${i}-${PV}.${PKGEXT}"
done
[ -n "${TEXLIVE_MODULE_DOC_CONTENTS}" ] && SRC_URI="${SRC_URI} )"

# Forge source SRC_URI
if [ -n "${TEXLIVE_MODULE_SRC_CONTENTS}" ] ; then
        SRC_URI="${SRC_URI} source? ("
        for i in ${TEXLIVE_MODULE_SRC_CONTENTS}; do
                SRC_URI="${SRC_URI} mirror://gentoo/texlive-module-${i}-${PV}.${PKGEXT}"
        done
        SRC_URI="${SRC_URI} )"
fi

I think this is definitely an issue with how this package is laying out its
needed distfiles.  It really should be leveraging the CTAN system at a minimum
to fetch all of the needed distfiles so we can get them off of our
distfiles mirror.  Then it would be interesting to re-run the math on
the distfiles distribution using the different schemes highlighted in the
GLEP-75 paper.

Longer-term, I think this entire approach should be revisited by the TeX
team to make it behave more like Perl or Python packages by having discrete
ebuilds for these modules.  That's not exactly a small undertaking, but
this current approach feels very kludgy in its design and is probably
asking for trouble.  I looked at several of the modules on CTAN, and they
each have their own version and even have different licenses.

E.g.,

  - altfont is licensed under "GNU General Public License" (version ??)
  - achemso is licensed under "The LaTeX Project Public License 1.3c"
  - arraysort is licensed under "The LaTeX Project Public License 1.2"
  - amsfonts is licensed under "The SIL Open Font License"
  - a0poster is licensed under "The LaTeX Project Public License" (ver ??)
  - arydshln is licensed under "The LaTeX Project Public License 1"
  - aurl is licensed under "Public Domain Software"

That's just a random selection from the 'a' category.  Do we have copies
of those licenses in the tree?  Do they allow redistribution of the
distfiles?  For the users that want "free" software, do any of the licenses
in any of the TeX modules put up any disagreeable restrictions?

Etc...

Re: New distfile mirror layout

Ulrich Mueller-2
>>>>> On Mon, 21 Oct 2019, Joshua Kinard wrote:

>   - altfont is licensed under "GNU General Public License" (version ??)
>   - achemso is licensed under "The LaTeX Project Public License 1.3c"
>   - arraysort is licensed under "The LaTeX Project Public License 1.2"
>   - amsfonts is licensed under "The SIL Open Font License"
>   - a0poster is licensed under "The LaTeX Project Public License" (ver ??)
>   - arydshln is licensed under "The LaTeX Project Public License 1"
>   - aurl is licensed under "Public Domain Software"

> That's just a random selection from the 'a' category. Do we have
> copies of those licenses in the tree?

Yes.

> Do they allow redistribution of the distfiles?

Yes.

> For the users that want "free" software, do any of the licenses in any
> of the TeX modules put up any disagreeable restrictions?

All of TeXLive should be free software. Upstream doesn't accept anything
that is non-free. (Mistakes can happen, though. There was one non-free
module in texlive-latexextra-2019, which was sorted out in bug 687328.)

Ulrich

Re: New distfile mirror layout

Kent Fredric-2
In reply to this post by Joshua Kinard-2
On Sun, 20 Oct 2019 16:57:54 -0400
Joshua Kinard <[hidden email]> wrote:

> I know we've got a ton of Perl packages for the core set of Perl modules,
> but doesn't the CPAN eclass also have the capability to auto-generate an
> ebuild package for virtually any Perl package distributed via CPAN?  Can
> that logic be used with the CTAN system in its own eclass and then we remove
> the 16k+ texlive modules off of our mirrors completely?  Or at the worst, we
> might just have to generate ebuilds for texlive modules and treat them as
> discrete, installed packages.

- Perl packages never have more than 1:1 source archives per ebuild
- Perl upstream naming doesn't habitually use "perl-" as an archive prefix
- Everything that is packaged for Perl in Gentoo is mirrored to the
  Gentoo distfiles mirror, and this causes no issues.

So I don't think any comparison here makes sense.

Re: New distfile mirror layout

Kent Fredric-2
In reply to this post by Joshua Kinard-2
On Sun, 20 Oct 2019 20:05:40 -0400
Joshua Kinard <[hidden email]> wrote:

> Longer-term, I think this entire approach should be revisited by the TeX
> team to make it behave more like Perl or Python packages by having discrete
> ebuilds for these modules.  That's not exactly a small undertaking, but
> this current approach feels very kludgy in its design and is probably
> asking for trouble.  I looked at several of the modules on CTAN, and they
> each have their own version and even have different licenses.

With the current state of the portage dependency resolver, and with
regards to the constant problems end users face with it, I really can't
advise this unless you need to.

Currently working on vendoring rust in an overlay, and 128 ebuilds just
to satisfy the dependencies enough to test *one* package is a bit of a
piss-take.

I'd suggest waiting a few years for portage to see some improvements
here before taking on something that ambitious when the current
approach works well enough.

Re: New distfile mirror layout

Richard Yao-2
In reply to this post by Michał Górny-5


> On Oct 20, 2019, at 2:51 AM, Michał Górny <[hidden email]> wrote:
>
> On Sat, 2019-10-19 at 19:24 -0400, Joshua Kinard wrote:
>>> On 10/18/2019 09:41, Michał Górny wrote:
>>> Hi, everybody.
>>>
>>> It is my pleasure to announce that yesterday (EU) evening we've switched
>>> to a new distfile mirror layout.  Users will be switching to the new
>>> layout either as they upgrade Portage to 2.3.77 or -- if they upgraded
>>> already -- as their caches expire (24hrs).
>>>
>>> The new layout is mostly a bow towards mirror admins, for some of whom
>>> having 60000+ files in a single directory has been a problem.
>>> However, I suppose some of you also found e.g. the directory index
>>> hardly usable due to its size.
>>>
>>> Throughout a transitional period (whose exact length hasn't been decided
>>> yet), both layouts will be available.  Afterwards, the old layout will
>>> be removed from mirrors.  This has a few implications:
>>>
>>> 1. Users who don't upgrade their package managers in time will lose
>>> the ability of fetching from Gentoo mirrors.  This shouldn't be that
>>> much of a problem given that the core software needed to upgrade Portage
>>> should all have reliable upstream SRC_URIs.
>>>
>>> 2. mirror://gentoo/file URIs will stop working.  While technically you
>>> could use mirror://gentoo/XX/file, I'd rather recommend finally
>>> discarding its usage and moving distfiles to devspace.
>>>
>>> 3. Directly fetching files from distfiles.gentoo.org will become
>>> a little harder.  To fetch a distfile named 'foo-1.tar.gz', you'd have
>>> to use something like:
>>>
>>> $ printf '%s' foo-1.tar.gz | b2sum | cut -c1-2
>>> 1b
>>> $ wget http://distfiles.gentoo.org/distfiles/1b/foo-1.tar.gz
>>> ...
>>>
>>>
>>> Alternatively, you can:
>>>
>>> $ wget http://distfiles.gentoo.org/distfiles/INDEX
>>>
>>> and grep for the right path there.  This INDEX is also a more
>>> lightweight alternative to HTML indexes generated by the servers.
>>>
>>>
>>> If you're interested in more background details and some plots, see [1].
>>>
>>> [1] https://dev.gentoo.org/~mgorny/articles/improving-distfile-mirror-structure.html
>>>
>>
>> So the answer I didn't really see directly stated here is, where do new
>> distfiles need to go //now//?  E.g., if on woodpecker, I currently cp a
>> distfile to /space/distfiles-local.  What is the new directory I need to
>> use?  And if mirror://gentoo/${FOO} is going away, for the new distfiles
>> target, what would be the applicable prefix to use?
>>
>> Directly using devspace seems like a bad idea, IMHO.  Once long ago, we all
>> got chastised for doing exactly that.  Too much possibility of fragmentation
>> as devs retire or package maintainership changes hands.
>
> Today you get chastised for using /space/distfiles-local and not
> following policy changes.  The devmanual states that it's deprecated
> since at least 2011, and talks of using d.g.o [1].
>
>> I looked at the whitepaper'ish-like writeup, and I kinda don't like using a
>> hash-based naming scheme on the new distfiles layout.  I really kind of prefer
>> breaking the directories up based on the first letter of the distfiles in
>> question, factoring case-sensitivity in (so you'd have 52 top-level
>> directories for A-Z and a-z, plus 10 more for 0-9).  Under each of those
>> directories, additional subdirectories for the next few letters (say,
>> letters 2-3).  Yes, this leads to some orphan cases where a distfile might
>> live on its own, but from a direct navigation standpoint, it's easy to find
>> for someone browsing the distfiles server and easy to predict where a
>> distfile is at.
>>
>> No math, statistical analysis, or deep-rooted knowledge of filesystems
>> behind that paragraph.  Just a plain old unfiltered opinion.  Sometimes, I
>> need to go get a distfile off the Gentoo mirrors, and being able to quickly
>> find it in the mirror root is great.  Having to do hash calculations to work
>> out the file path will be *really* annoying.
>
> Your solution still doesn't solve the problem of having 8k-24k files
> in a single directory, even if you use 7 letters of prefix.  So it just
> creates a lot of tiny directory noise for no practical gain.
>
> [1] https://devmanual.gentoo.org/general-concepts/mirrors/index.html#suitable-download-hosts

If we consider the access frequency, it might actually not be that bad. Consider a simple example with 500 files and two directory buckets. If we put 250 in each, then every lookup sees a directory of 250 files. However, if 50 of the files account for 90% of accesses, then putting those 50 into one directory and the other 450 into another means the O(n) directory lookup performs, on average, as if each directory held only 90 files.
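
As a rough check of the arithmetic in that example, the sketch below computes the access-weighted directory size for both splits. The function name and the percentage-based input are illustrative assumptions, and it models lookup cost as simply proportional to directory size, which only approximates an O(n) directory lookup.

```python
def expected_dir_size(buckets):
    """Access-weighted average directory size.

    buckets: list of (files_in_directory, percent_of_accesses) pairs;
    the percentages are expected to sum to 100.
    """
    return sum(size * percent for size, percent in buckets) / 100


# Even split: 250 files in each directory, half of the accesses each.
even_split = expected_dir_size([(250, 50), (250, 50)])

# Hot/cold split: 50 hot files get 90% of accesses, 450 cold files get 10%.
hot_cold_split = expected_dir_size([(50, 90), (450, 10)])

print(even_split, hot_cold_split)
```

With these inputs the even split comes out at 250 and the hot/cold split at 90 (0.9 x 50 + 0.1 x 450), which is where the figure of 90 comes from.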

I am not sure if we should be discarding all other considerations to make changes to benefit O(n) directory lookup filesystems, but if we are, then the hashing approach is not necessarily the best one. It is only the best when all files are accessed with equal frequency, which would be an incorrect assumption. A more human friendly approach might still be better. I doubt that we have the data to determine that though.

Also, another idea is to use a cheap hash function (e.g. fletcher) and just have the mirrors do the hashing behind the scenes. Then we would have the best of both worlds.
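
To make the "cheap hash" idea concrete, here is a minimal sketch of Fletcher-16 used to pick a directory bucket. The helper name fletcher_bucket and the choice of the checksum's high byte as the bucket are assumptions made for illustration; nothing in the thread specifies them, and how evenly Fletcher spreads real distfile names is untested here.

```python
def fletcher16(data: bytes) -> int:
    """Fletcher-16 checksum: two running sums, each reduced modulo 255."""
    sum1 = 0
    sum2 = 0
    for byte in data:
        sum1 = (sum1 + byte) % 255
        sum2 = (sum2 + sum1) % 255
    return (sum2 << 8) | sum1


def fletcher_bucket(filename: str) -> str:
    """Hypothetical helper: bucket name from the high byte of the checksum.

    The high byte is a value modulo 255, so this yields at most 255 buckets
    (00 through fe), not the full 256.
    """
    high_byte = fletcher16(filename.encode("utf-8")) >> 8
    return format(high_byte, "02x")


print(hex(fletcher16(b"abcde")))  # well-known test vector: 0xc8f0
print(fletcher_bucket("abcde"))   # c8
```

A mirror could apply such a function in a rewrite layer so that users keep requesting plain file names, which is the "behind the scenes" part of the suggestion.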
>
> --
> Best regards,
> Michał Górny
>


Re: New distfile mirror layout

Mikle Kolyada-2
In reply to this post by Joshua Kinard-2

On 21.10.2019 3:05, Joshua Kinard wrote:

> So looking at texlive-latexextra-2019-r2.ebuild, it defines three variables:
>
>   - TEXLIVE_MODULE_CONTENTS, with 1,241 space-delimited module names
>   - TEXLIVE_MODULE_DOC_CONTENTS, with 1,227 space-delimited doc names
>   - TEXLIVE_MODULE_SRC_CONTENTS, with 745 space-delimited src names
>
> Then, in texlive-module.eclass, there's these loops:
>
> for i in ${TEXLIVE_MODULE_CONTENTS}; do
>         SRC_URI="${SRC_URI} mirror://gentoo/texlive-module-${i}-${PV}.${PKGEXT}"
> done
>
> # Forge doc SRC_URI
> [ -n "${TEXLIVE_MODULE_DOC_CONTENTS}" ] && SRC_URI="${SRC_URI} doc? ("
> for i in ${TEXLIVE_MODULE_DOC_CONTENTS}; do
>         SRC_URI="${SRC_URI} mirror://gentoo/texlive-module-${i}-${PV}.${PKGEXT}"
> done
> [ -n "${TEXLIVE_MODULE_DOC_CONTENTS}" ] && SRC_URI="${SRC_URI} )"
>
> # Forge source SRC_URI
> if [ -n "${TEXLIVE_MODULE_SRC_CONTENTS}" ] ; then
>         SRC_URI="${SRC_URI} source? ("
>         for i in ${TEXLIVE_MODULE_SRC_CONTENTS}; do
>                 SRC_URI="${SRC_URI} mirror://gentoo/texlive-module-${i}-${PV}.${PKGEXT}"
>         done
>         SRC_URI="${SRC_URI} )"
> fi
>
> I think this is definitely an issue with how this package is laying out its
> needed distfiles.  It really should be leveraging the CTAN system at a minimum
> to fetch all of the needed distfiles so we can get them off of our
> distfiles mirror.  Then it would be interesting to re-run the math on
> the distfiles distribution using the different schemes highlighted in the
> GLEP-75 paper.

TeXLive distributes collections of macros rather than separate packages,
and builds its packaging on top of CTAN. Meanwhile, CTAN packages are
not versioned: they only have an internal release number, with no tags,
releases and so on; see [1].

I also fail to see what problem you are trying to solve by suggesting we
fetch macros from CTAN; you would end up with the same amount of data
mirrored as a result.

[1] - https://ctan.org/tex-archive/systems/texlive/tlnet/archive


Re: New distfile mirror layout

Matt Turner-5
In reply to this post by Richard Yao-2
On Mon, Oct 21, 2019 at 9:42 AM Richard Yao <[hidden email]> wrote:
> Also, another idea is to use a cheap hash function (e.g. fletcher) and just have the mirrors do the hashing behind the scenes. Then we would have the best of both worlds.

It probably would have been better to make these suggestions when the
GLEP was discussed close to two years ago.

I'm glad that we have ideas for improvements but I worry that we're
just backseat driving at this point given that the GLEP's now
implemented.

Re: New distfile mirror layout

James Cloos-9
In reply to this post by Richard Yao-2
>>>>> "RY" == Richard Yao <[hidden email]> writes:

RY> ext4 is probably okay, but don’t quote me on that.

Ext4 works fine here for a local distfiles mirror.

-JimC
--
James Cloos <[hidden email]>         OpenPGP: 0x997A9F17ED7DAEA6

Re: New distfile mirror layout

Jaco Kroon
In reply to this post by Richard Yao-2

Hi All,


On 2019/10/21 18:42, Richard Yao wrote:

> If we consider the access frequency, it might actually not be that bad. Consider a simple example with 500 files and two directory buckets. If we have 250 in each, then the size of the directory is always 250. However, if 50 files are accessed 90% of the time, then putting 450 into one directory and that 50 into another directory, we end up with the performance of the O(n) directory lookup being consistent with there being only 90 files in each directory.
>
> I am not sure if we should be discarding all other considerations to make changes to benefit O(n) directory lookup filesystems, but if we are, then the hashing approach is not necessarily the best one. It is only the best when all files are accessed with equal frequency, which would be an incorrect assumption. A more human friendly approach might still be better. I doubt that we have the data to determine that though.
>
> Also, another idea is to use a cheap hash function (e.g. fletcher) and just have the mirrors do the hashing behind the scenes. Then we would have the best of both worlds.


Experience:

ext4 sucks at targeted name lookups without the dir_index feature (O(n) lookups, since it scans all entries in the folder).  With dir_index, readdir performance is crap.  Pick your poison, I guess.  On most of our larger filesystems (2TB+, but especially the 80TB+ ones) we've reverted to disabling dir_index, as the benefit is outweighed by the crappy readdir() and glob() performance.

There doesn't seem to be a specific tip-over point, and it seems to depend a lot on RAM availability and hard drive speed (obviously).  If dentries get cached, disk speed becomes less of an issue.  However, on large folders (where I typically use 10k as the threshold for "large", based on "gut feeling" and "unquantifiable experience" and "nothing scientific at all") I find that even with lots of RAM two consecutive ls commands remain terribly slow.  Switch off dir_index and that becomes an order of magnitude faster.

I don't have a great deal of experience with XFS, but on the systems where we do use it, it's generally on a VM, and our experience has been that it feels slower.  Again, not scientific, just perception.

I'm in support of the change.  This will bucket to 256 folders and should have a reasonably even split between folders.  If required, a second layer could be introduced using the 3rd and 4th digits of the hash.  Any hash should be fine; it really doesn't need to be cryptographically strong, it just needs to provide a good spread and be really fast.  Generally a hash table should have a prime number of buckets to reduce hash bias, but frankly, that's overcomplicating the situation here.
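
The bucketing described here can be sketched as follows. This assumes the layout takes the leading hex digits of the BLAKE2b digest of the file name, as in the b2sum example from the announcement quoted earlier in the thread; the function name hashed_path and the levels parameter are hypothetical, added only to show how a second layer built from the 3rd and 4th digits would look.

```python
import hashlib


def hashed_path(filename: str, levels: int = 1) -> str:
    """Return 'xx/filename' (one level) or 'xx/yy/filename' (two levels).

    Each level consumes two hex digits of the BLAKE2b-512 digest of the
    file name, giving 256 possible directories per level.
    """
    digest = hashlib.blake2b(filename.encode("utf-8")).hexdigest()
    dirs = [digest[2 * i:2 * i + 2] for i in range(levels)]
    return "/".join(dirs + [filename])


print(hashed_path("foo-1.tar.gz"))            # one layer: 256 buckets
print(hashed_path("foo-1.tar.gz", levels=2))  # two layers: 256 * 256 buckets
```

hashlib.blake2b defaults to a 64-byte digest, the same size b2sum uses by default, so the first directory should correspond to the output of the "b2sum | cut -c1-2" pipeline from the announcement.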

I also agree with others that it used to be easy to get distfiles as and when needed, so an alternative structure could mirror that of the portage tree itself, in other words "cat/pkg/distfile".  This perhaps just shifts the issue:

jkroon@plastiekpoot /usr/portage $ find . -maxdepth 1 -type d -name "*-*" | wc -l
167
jkroon@plastiekpoot /usr/portage $ find *-* -maxdepth 1 -type d | wc -l
19412
jkroon@plastiekpoot /usr/portage $ for i in *-*; do echo $(find $i -maxdepth 1 -type d | wc -l) $i; done | sort -g | tail -n10
347 net-misc
373 media-sound
395 media-libs
399 dev-util
505 dev-libs
528 dev-java
684 dev-haskell
690 dev-ruby
1601 dev-perl
1889 dev-python

So that's an average of about 116 subfolders per category (only two over 1000), and then presumably fewer than 100 distfiles per package at most?  Probably overkill but would (should) solve both the too many files per folder as well as the easy lookup by hand issue.
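
For reference, the average can be recomputed from the two totals printed above. One caveat worth stating as an assumption about how the numbers were produced: find lists its starting points too, so the 19412 figure includes the 167 category directories themselves, and each per-category count in the listing is likewise one higher than the number of packages.

```python
categories = 167      # first find | wc -l
dirs_listed = 19412   # second find | wc -l, category directories included

# Dividing the raw totals gives the roughly-116 figure.
raw_average = dirs_listed / categories

# Subtracting the category directories themselves gives packages only.
packages = dirs_listed - categories
package_average = packages / categories

print(round(raw_average, 1), round(package_average, 1))
```

Either way the average sits around 115 to 116 packages per category, so the conclusion is unaffected.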

I don't have a preference for either solution, though I do agree that "easy finding of distfiles" is handy.  The INDEX mechanism is fine for me.

Kind Regards,

Jaco

Re: New distfile mirror layout

Ulrich Mueller-2
>>>>> On Tue, 22 Oct 2019, Jaco Kroon wrote:

> I also agree with others that it used to be easy to get distfiles as
> and when needed, so an alternative structure could mirror that of the
> portage tree itself, in other words "cat/pkg/distfile".

Not a good idea, because some distfiles are shared between packages.
For example, sys-kernel/*-sources use the same distfiles. (It won't
work with categories either, e.g., there are dev-lang/ruby and
app-emacs/ruby-mode.)
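
The shared-distfile objection can be illustrated with a small sketch. The package-to-distfile table below is made-up sample data in the spirit of the sys-kernel/*-sources example, not taken from the tree; it only shows that a "cat/pkg/distfile" layout has no single correct directory for a file that several packages use.

```python
from collections import defaultdict

# Illustrative data only: two kernel source packages sharing one tarball.
src_files = {
    "sys-kernel/gentoo-sources": ["linux-5.3.tar.xz", "genpatches-5.3-1.base.tar.xz"],
    "sys-kernel/vanilla-sources": ["linux-5.3.tar.xz"],
}

# Invert the table: which packages reference each distfile?
owners = defaultdict(set)
for package, files in src_files.items():
    for name in files:
        owners[name].add(package)

# Any distfile with more than one owner would need to be duplicated
# (or arbitrarily assigned) under a cat/pkg/distfile layout.
shared = {name: sorted(pkgs) for name, pkgs in owners.items() if len(pkgs) > 1}
print(shared)
```

A layout keyed on the file name alone, hashed or not, avoids the question because each distfile has exactly one location regardless of how many ebuilds reference it.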

Ulrich

Re: New distfile mirror layout

Jaco Kroon
Hi,

On 2019/10/22 10:43, Ulrich Mueller wrote:

>>>>>> On Tue, 22 Oct 2019, Jaco Kroon wrote:
>> I also agree with others that it used to be easy to get distfiles as
>> and when needed, so an alternative structure could mirror that of the
>> portage tree itself, in other words "cat/pkg/distfile".
> Not a good idea, because some distfiles are shared between packages.
> For example, sys-kernel/*-sources use the same distfiles. (It won't
> work with categories either, e.g., there are dev-lang/ruby and
> app-emacs/ruby-mode.)
>
> Ulrich

You are absolutely correct.  I then fully agree with the current implementation.

Kind Regards,
Jaco


Re: New distfile mirror layout

Rich Freeman
In reply to this post by Richard Yao-2
On Mon, Oct 21, 2019 at 12:42 PM Richard Yao <[hidden email]> wrote:
>
> Also, another idea is to use a cheap hash function (e.g. fletcher) and just have the mirrors do the hashing behind the scenes. Then we would have the best of both worlds.

I think something that is getting missed in this discussion is that we
don't control all of our mirrors, and they're generally donated
resources.  Somebody has some webserver, and they stick a Debian
mirror in one directory tree, and an Arch one in another, and they're
kind enough to give us one too.

That is why we're seeing odder situations like ntfs and so on being
mentioned.  They're not necessarily even running Linux, let alone zfs
or some other optimized filesystem.  And their webserver might be set
up to do browsable directory indexes which could perform terribly even
if the filesystem itself is fine with direct filename lookups.  It
doesn't matter if you have hashed b-trees or whatever for filename
lookups if you're going to ask the filesystem to give you a list of
every file in a large directory - it is going to have to traverse
whatever data structure it uses entirely to do so.

If we want to start putting requirements on hosting a mirror, then
we'll end up with fewer mirrors, and with mirrors more is usually
better.  Ideally a mirror should just be a black box to us - we don't
really care what they're running because we don't depend on any mirror
individually.  Likewise if we negatively impact mirror hosts we'll end
up with fewer mirrors.  Sure, maybe those hosts have odd
configurations, but we're still better off with them than without.
That said we do seem to have a lot of mirrors so it probably isn't the
end of the world if we lose a limited number.

And there is nothing to say that we can't have some infra mirror set
up more for interactive browsing that we don't have people fetch from
but which dispenses with all the hashing or which bins by the first
letter of the filename/etc.  It seems like most of the use cases where
hashing is inconvenient are for more casual use.

To avoid another reply, people are talking about having utilities that
can fetch distfiles using the new scheme.  I'd think that "ebuild
foo.ebuild fetch" is probably the simplest solution for this.  Chances
are that you're dealing with SRC_URI strings that have variable
substitution in them anyway, so just letting ebuild do the fetching
means you're not substituting ${PV} and so on, let alone all the stuff
versionator and its ilk do.  And of course you can always just fetch
from upstream anyway if you do have a clean URI.

--
Rich

Re: New distfile mirror layout

Joshua Kinard-2
In reply to this post by Kent Fredric-2
On 10/21/2019 06:13, Kent Fredric wrote:

> On Sun, 20 Oct 2019 16:57:54 -0400
> Joshua Kinard <[hidden email]> wrote:
>
>> I know we've got a ton of Perl packages for the core set of Perl modules,
>> but doesn't the CPAN eclass also have the capability to auto-generate an
>> ebuild package for virtually any Perl package distributed via CPAN?  Can
>> that logic be used with the CTAN system in its own eclass and then we remove
>> the 16k+ texlive modules off of our mirrors completely?  Or at the worst, we
>> might just have to generate ebuilds for texlive modules and treat them as
>> discrete, installed packages.
>
> - Perl packages never have more than 1:1 source archives per ebuild
> - Perl upstream naming doesn't habitually use "perl-" as an archive prefix
> - Everything that is packaged for Perl in Gentoo is mirrored to the
>   Gentoo distfiles mirror, and this causes no issues.
>
> So I don't think any comparison here makes sense.

I have to disagree on the "doesn't make sense" bit.  Regardless of what it
is that TexLive is packaging, the problem I see is that storing these macro
packages on our mirrors is responsible for 20% of *all* distfiles that we
store.  It's lopsided that a small collection of ebuilds, due to the way
their build logic is architected, has that many distfiles on the mirrors.

It's scenarios like that which led to Michał developing the GLEP the way he
did.  His approach is more broad, seeking to future-proof the mirroring
issue regardless of package mirroring decisions, whereas I'm more curious
why texlive needs all of those packages on our mirrors when it appears to
have a fairly comprehensive mirroring system of its own.  Why reinvent the
wheel?

Since CTAN exists as a worldwide mirroring system, I think at a minimum, we
should try to fetch from that directly instead of mirroring them on our own
systems and partner mirrors.  Or we could go the other way and become an
official CTAN mirror ourselves.  After all, if we're going to reinvent the
wheel, do all four instead of just one.

And for Perl or Python, I think we should be making an effort to leverage
their respective mirroring systems first before putting their distfiles onto
our mirrors.  Perl's got CPAN, and Python has PyPI.  For things that don't
exist on those systems, we use our mirrors.

--
Joshua Kinard
Gentoo/MIPS
[hidden email]
rsa6144/5C63F4E3F5C6C943 2015-04-27
177C 1972 1FB8 F254 BAD0 3E72 5C63 F4E3 F5C6 C943

"The past tempts us, the present confuses us, the future frightens us.  And
our lives slip away, moment by moment, lost in that vast, terrible in-between."

--Emperor Turhan, Centauri Republic


Re: New distfile mirror layout

Joshua Kinard-2
In reply to this post by Matt Turner-5
On 10/21/2019 19:36, Matt Turner wrote:
> On Mon, Oct 21, 2019 at 9:42 AM Richard Yao <[hidden email]> wrote:
>> Also, another idea is to use a cheap hash function (e.g. fletcher) and just have the mirrors do the hashing behind the scenes. Then we would have the best of both worlds.
>
> It probably would have been better to make these suggestions when the
> GLEP was discussed close to two years ago.
>
> I'm glad that we have ideas for improvements but I worry that we're
> just backseat driving at this point given that the GLEP's now
> implemented.

Agreed, although, I don't even remember this coming up two years ago.  But,
I was tied up with a lot of work-related stress and tasks, so probably just
my memory storage backend not having enough cycles to commit it to...neurons.

IMHO, perhaps future GLEPs should have a defined window to implement them
following discussion.  Having the discussion, then waiting a few years
before implementing them leads to discussions like this where we're arguing
about the color of the boat after the boat has sailed off into the distance.

--
Joshua Kinard
Gentoo/MIPS
[hidden email]
rsa6144/5C63F4E3F5C6C943 2015-04-27
177C 1972 1FB8 F254 BAD0 3E72 5C63 F4E3 F5C6 C943

"The past tempts us, the present confuses us, the future frightens us.  And
our lives slip away, moment by moment, lost in that vast, terrible in-between."

--Emperor Turhan, Centauri Republic


Re: New distfile mirror layout

William Hubbs
On Wed, Oct 23, 2019 at 01:18:02AM -0400, Joshua Kinard wrote:

> On 10/21/2019 19:36, Matt Turner wrote:
> > On Mon, Oct 21, 2019 at 9:42 AM Richard Yao <[hidden email]> wrote:
> >> Also, another idea is to use a cheap hash function (e.g. fletcher) and just have the mirrors do the hashing behind the scenes. Then we would have the best of both worlds.
> >
> > It probably would have been better to make these suggestions when the
> > GLEP was discussed close to two years ago.
> >
> > I'm glad that we have ideas for improvements but I worry that we're
> > just backseat driving at this point given that the GLEP's now
> > implemented.
 
 Nothing is really etched in stone, so we could change it again if we
 see fit.

*snip*

> IMHO, perhaps future GLEPs should have a defined window to implement them
> following discussion.  Having the discussion, then waiting a few years
> before implementing them leads to discussions like this where we're arguing
> about the color of the boat after the boat has sailed off into the distance.

Agreed. I will work on a proposal for the next council meeting.

Thanks,

William



Re: New distfile mirror layout

William Hubbs
On Wed, Oct 23, 2019 at 12:06:24PM -0500, William Hubbs wrote:

> On Wed, Oct 23, 2019 at 01:18:02AM -0400, Joshua Kinard wrote:
> > On 10/21/2019 19:36, Matt Turner wrote:
> > > On Mon, Oct 21, 2019 at 9:42 AM Richard Yao <[hidden email]> wrote:
> > >> Also, another idea is to use a cheap hash function (e.g. fletcher) and just have the mirrors do the hashing behind the scenes. Then we would have the best of both worlds.
> > >
> > > It probably would have been better to make these suggestions when the
> > > GLEP was discussed close to two years ago.
> > >
> > > I'm glad that we have ideas for improvements but I worry that we're
> > > just backseat driving at this point given that the GLEP's now
> > > implemented.
>  
>  Nothing is really etched in stone, so we could change it again if we
>  see fit.
 
 Actually, which glep are we talking about? If we are talking about glep
 75, I don't see where the council approved it [1], so it definitely
 should be discussed/approved before any implementation changes are
 made, or we should see where it was approved.

 William

 [1] https://www.gentoo.org/glep/glep-0075.html


Re: New distfile mirror layout

William Hubbs
In reply to this post by Joshua Kinard-2
On Wed, Oct 23, 2019 at 01:18:02AM -0400, Joshua Kinard wrote:

> On 10/21/2019 19:36, Matt Turner wrote:
> > On Mon, Oct 21, 2019 at 9:42 AM Richard Yao <[hidden email]> wrote:
> >> Also, another idea is to use a cheap hash function (e.g. fletcher) and just have the mirrors do the hashing behind the scenes. Then we would have the best of both worlds.
> >
> > It probably would have been better to make these suggestions when the
> > GLEP was discussed close to two years ago.
> >
> > I'm glad that we have ideas for improvements but I worry that we're
> > just backseat driving at this point given that the GLEP's now
> > implemented.
>
> Agreed, although, I don't even remember this coming up two years ago.  But,
> I was tied up with a lot of work-related stress and tasks, so probably just
> my memory storage backend not having enough cycles to commit it to...neurons.
 
 After looking at this further, I found that the glep was presented to
 us in Jan 2018 on the dev ml [1].

I checked all council meeting logs and discovered that this was never
brought to us formally for approval.

It looks like the developers decided to do this as an
infrastructure/portage project and because of that they felt like they
didn't need a glep.

Thanks,

William

 [1] https://archives.gentoo.org/gentoo-dev/message/cfc4f8595df2edf9a25ba9ecae2463ba

> IMHO, perhaps future GLEPs should have a defined window to implement them
> following discussion.  Having the discussion, then waiting a few years
> before implementing them leads to discussions like this where we're arguing
> about the color of the boat after the boat has sailed off into the distance.
>
> --
> Joshua Kinard
> Gentoo/MIPS
> [hidden email]
> rsa6144/5C63F4E3F5C6C943 2015-04-27
> 177C 1972 1FB8 F254 BAD0 3E72 5C63 F4E3 F5C6 C943
>
> "The past tempts us, the present confuses us, the future frightens us.  And
> our lives slip away, moment by moment, lost in that vast, terrible in-between."
>
> --Emperor Turhan, Centauri Republic
>


ext4 readdir performance - was Re: [gentoo-dev] New distfile mirror layout

Richard Yao-2
In reply to this post by Jaco Kroon
On 10/22/19 2:51 AM, Jaco Kroon wrote:

> Hi All,
>
>
> On 2019/10/21 18:42, Richard Yao wrote:
>>
>> If we consider the access frequency, it might actually not be that
>> bad. Consider a simple example with 500 files and two directory
>> buckets. If we have 250 in each, then the size of the directory is
>> always 250. However, if 50 files are accessed 90% of the time, then
>> putting 450 into one directory and that 50 into another directory, we
>> end up with the performance of the O(n) directory lookup being
>> consistent with there being only 90 files in each directory.
>>
>> I am not sure if we should be discarding all other considerations to
>> make changes to benefit O(n) directory lookup filesystems, but if we
>> are, then the hashing approach is not necessarily the best one. It is
>> only the best when all files are accessed with equal frequency, which
>> would be an incorrect assumption. A more human friendly approach might
>> still be better. I doubt that we have the data to determine that though.
>>
>> Also, another idea is to use a cheap hash function (e.g. fletcher) and
>> just have the mirrors do the hashing behind the scenes. Then we would
>> have the best of both worlds.
>
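To make the arithmetic in that example explicit, here is a small sketch.
It assumes the cost of an O(n) lookup is simply proportional to the number
of entries in the directory being searched; the function name and the
percentage-based input are only illustrative.

```python
def expected_scan_size(buckets):
    """Access-weighted average directory size.

    buckets: list of (files_in_directory, percent_of_accesses) pairs.
    The directory size stands in for the cost of an O(n) lookup.
    """
    if sum(percent for _, percent in buckets) != 100:
        raise ValueError("access percentages must add up to 100")
    return sum(size * percent for size, percent in buckets) / 100


# Two equal buckets of 250 files, accesses spread evenly between them.
even_split = [(250, 50), (250, 50)]
# 50 hot files take 90% of accesses; the other 450 files take 10%.
hot_cold_split = [(50, 90), (450, 10)]

print(expected_scan_size(even_split))      # 250.0
print(expected_scan_size(hot_cold_split))  # 90.0
```

The second figure is 50 * 0.9 + 450 * 0.1 = 45 + 45 = 90, which is where
the "only 90 files in each directory" comparison comes from.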
>
> Experience:
>
> ext4 sucks at targeting name lookups without dir_index feature (O(n)
> lookups - scans all entries in the folder).  With dir_index readdir
> performance is crap.  Pick your poison I guess.  Most of our larger
> filesystems (2TB+, but especially the 80TB+ ones) we've reverted to
> disabling dir_index as the benefit is outweighed by the crappy readdir()
> and glob() performance.
My read of the ext4 disk layout documentation is that the read operation
should work mostly the same way, except with a penalty from reading
larger directories caused by the addition of the tree's metadata and
from having more partially filled blocks:

https://ext4.wiki.kernel.org/index.php/Ext4_Disk_Layout#Directory_Entries

The code itself is the same traversal code:

https://github.com/torvalds/linux/blob/v5.3/fs/ext4/dir.c#L106

However, a couple of things stand out to me here at a glance:

1. `cond_resched()` adds scheduler delay for no apparent reason.
`cond_resched()` is meant to be used in places where we could block
excessively on non-PREEMPT kernels, but this doesn't strike me as one of
those places. The fact that we block on disk on uncached reads naturally
serves the same purpose, so an explicit rescheduling point here is
redundant. PREEMPT kernels should perform better in readdir() on ext4 by
virtue of making `cond_resched()` a no-op.

2. read-ahead is implemented in a way that appears to be over-reading
the directory whenever the needed information is not cached. This is
technically read-ahead, although it is not a great way of doing it. A
much better way to do this would be to pipeline `readdir()` by
initiating asynchronous read operations in anticipation of future reads.

Both of these should affect both variants of ext4's directories, but the
penalties I mentioned earlier mean that the dir_index variant would be
affected more.

If you have a way to benchmark things, a simple idea to evaluate would
be deleting the `cond_resched()` line. If we had data showing an
improvement, I would be happy to send a small one-line patch deleting
the line to Ted to get the change into mainline.
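
For anyone who wants to try, a rough timing harness could look like the
sketch below. It is only an illustration: the helper names and the file
name pattern are made up, it measures a userspace directory listing via
`os.scandir()` rather than anything inside the kernel, and repeated runs
will mostly hit the cache. The temporary directory may also live on tmpfs
rather than ext4, so for a meaningful comparison the directory should be
created on the filesystem under test.

```python
import os
import tempfile
import time


def populate(directory, n_files):
    """Create n_files empty files with made-up distfile-like names."""
    for i in range(n_files):
        open(os.path.join(directory, f"pkg-{i:06d}.tar.gz"), "w").close()


def time_listing(directory, repeats=3):
    """Return (entry_count, best_seconds) for listing `directory`."""
    best = float("inf")
    count = 0
    for _ in range(repeats):
        start = time.perf_counter()
        with os.scandir(directory) as entries:
            count = sum(1 for _ in entries)
        best = min(best, time.perf_counter() - start)
    return count, best


if __name__ == "__main__":
    # 1k and 24k are the directory sizes compared earlier in the thread.
    for n_files in (1_000, 24_000):
        with tempfile.TemporaryDirectory() as tmp:
            populate(tmp, n_files)
            count, seconds = time_listing(tmp)
            print(f"{count} entries: {seconds * 1000:.2f} ms")
```

Running it on the same directory with and without dir_index, and on
kernels with and without the `cond_resched()` line, would give the kind
of data I am asking for.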

> There doesn't seem to be a real specific tip-over point, and it seems to
> depend a lot on RAM availability and harddrive speed (obviously).  So if
> dentries get cached, disk speed becomes less of an issue.  However, on
> large folders (where I typically use 10k as a value for large based on
> "gut feeling" and "unquantifiable experience" and "nothing scientific at
> all") I find that even with lots of RAM two consecutive ls commands
> remain terribly slow. Switch off dir_index and that becomes an order of
> magnitude faster.
>
> I don't have a great deal of experience with XFS, but on those systems
> where we do it's generally on a VM, and perceivably (again, not
> scientific) our experience has been that it feels slower.  Again, not
> scientific, just perception.
>
> I'm in support for the change.  This will bucket to 256 folders and
> should have a reasonably even split between folders.  If required a
> second layer could be introduced by using the 3rd and 4th digits of the
> hash for a second layer.  Any hash should be fine, it really doesn't
> need to be cryptographically strong, it just needs to provide a good
> spread and be really fast.  Generally a hash table should have a prime
> number of buckets to assist with hash bias, but frankly, that's over
> complicating the situation here.
>
> I also agree with others that it used to be easy to get distfiles as and
> when needed, so an alternative structure could mirror that of the
> portage tree itself, in other words "cat/pkg/distfile". This perhaps
> just shifts the issue:
>
> jkroon@plastiekpoot /usr/portage $ find . -maxdepth 1 -type d -name "*-*" | wc -l
> 167
> jkroon@plastiekpoot /usr/portage $ find *-* -maxdepth 1 -type d | wc -l
> 19412
> jkroon@plastiekpoot /usr/portage $ for i in *-*; do echo $(find $i -maxdepth 1 -type d | wc -l) $i; done | sort -g | tail -n10
> 347 net-misc
> 373 media-sound
> 395 media-libs
> 399 dev-util
> 505 dev-libs
> 528 dev-java
> 684 dev-haskell
> 690 dev-ruby
> 1601 dev-perl
> 1889 dev-python
>
> So that's average 116 sub folders under the top layer (only two over
> 1000), and then presumably less than 100 distfiles maximum per package? 
> Probably overkill but would (should) solve both the too many files per
> folder as well as the easy lookup by hand issue.
>
> I don't have a preference on either solution though but do agree that
> "easy finding of distfiles" are handy.  The INDEX mechanism is fine for me.
>
> Kind Regards,
>
> Jaco
>
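As a rough illustration of the bucketing described above, the sketch
below maps a file name to a directory named after the leading hex digits
of a hash of that name. BLAKE2b is used only because it ships in Python's
hashlib; any fast hash with a good spread would do, as noted. The
function name and the sample file names are made up.

```python
import hashlib
from collections import Counter


def bucket_for(filename, levels=1):
    """Map a distfile name to a relative directory made of hash digits.

    levels=1 uses hex digits 1-2 of the hash (256 buckets); levels=2
    adds a second layer taken from digits 3-4.
    """
    digest = hashlib.blake2b(filename.encode("utf-8")).hexdigest()
    parts = [digest[2 * i:2 * i + 2] for i in range(levels)]
    return "/".join(parts)


if __name__ == "__main__":
    # Hypothetical names, roughly the ~60k distfiles mentioned earlier.
    names = [f"example-{i}.tar.gz" for i in range(60_000)]
    sizes = Counter(bucket_for(name) for name in names)
    print(len(sizes), "buckets; largest holds", max(sizes.values()), "files")
```

With about 60k files and 256 buckets the average works out to roughly
234 files per directory, so a single layer already lands well under the
1000-file goal as long as the spread is reasonably even.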



Re: ext4 readdir performance - was Re: [gentoo-dev] New distfile mirror layout

Richard Yao-2

> On Oct 23, 2019, at 7:48 PM, Richard Yao <[hidden email]> wrote:
>
> On 10/22/19 2:51 AM, Jaco Kroon wrote:
>> Hi All,
>>
>>
>>> On 2019/10/21 18:42, Richard Yao wrote:
>>>
>>> If we consider the access frequency, it might actually not be that
>>> bad. Consider a simple example with 500 files and two directory
>>> buckets. If we have 250 in each, then the size of the directory is
>>> always 250. However, if 50 files are accessed 90% of the time, then
>>> putting 450 into one directory and that 50 into another directory, we
>>> end up with the performance of the O(n) directory lookup being
>>> consistent with there being only 90 files in each directory.
>>>
>>> I am not sure if we should be discarding all other considerations to
>>> make changes to benefit O(n) directory lookup filesystems, but if we
>>> are, then the hashing approach is not necessarily the best one. It is
>>> only the best when all files are accessed with equal frequency, which
>>> would be an incorrect assumption. A more human friendly approach might
>>> still be better. I doubt that we have the data to determine that though.
>>>
>>> Also, another idea is to use a cheap hash function (e.g. fletcher) and
>>> just have the mirrors do the hashing behind the scenes. Then we would
>>> have the best of both worlds.
>>
>>
>> Experience:
>>
>> ext4 sucks at targeting name lookups without dir_index feature (O(n)
>> lookups - scans all entries in the folder).  With dir_index readdir
>> performance is crap.  Pick your poison I guess.  Most of our larger
>> filesystems (2TB+, but especially the 80TB+ ones) we've reverted to
>> disabling dir_index as the benefit is outweighed by the crappy readdir()
>> and glob() performance.
> My read of the ext4 disk layout documentation is that the read operation
> should work mostly the same way, except with a penalty from reading
> larger directories caused by the addition of the tree's metadata and
> from having more partially filled blocks:
>
> https://ext4.wiki.kernel.org/index.php/Ext4_Disk_Layout#Directory_Entries
>
> The code itself is the same traversal code:
>
> https://github.com/torvalds/linux/blob/v5.3/fs/ext4/dir.c#L106
>
> However, a couple of things stand out to me here at a glance:
>
> 1. `cond_resched()` adds scheduler delay for no apparent reason.
> `cond_resched()` is meant to be used in places where we could block
> excessively on non-PREEMPT kernels, but this doesn't strike me as one of
> those places. The fact that we block on disk on uncached reads naturally
> serves the same purpose, so an explicit rescheduling point here is
> redundant. PREEMPT kernels should perform better in readdir() on ext4 by
> virtue of making `cond_resched()` a no-op.
I just realized that the way that I worded this could be confusing, so
please allow me to clarify what I meant. `cond_resched()` is meant for
when a kernel thread will tie up a CPU for a long period of time.
Blocking on disk will cause the CPU to be released to another thread.
When we do not block on disk, this operation is quick. There is no good
reason for putting `cond_resched()` here as far as I can tell.

> 2. read-ahead is implemented in a way that appears to be over-reading
> the directory whenever the needed information is not cached. This is
> technically read-ahead, although it is not a great way of doing it. A
> much better way to do this would be to pipeline `readdir()` by
> initiating asynchronous read operations in anticipation of future reads.
>
> Both of these should affect both variants of ext4's directories, but the
> penalties I mentioned earlier mean that the dir_index variant would be
> affected more.
>
> If you have a way to benchmark things, a simple idea to evaluate would
> be deleting the `cond_resched()` line. If we had data showing an
> improvement, I would be happy to send a small one-line patch deleting
> the line to Ted to get the change into mainline.
The more I think about this, the more absurd having `cond_resched()` here
seems to me. I think I will sit on it for a few days. If it still seems
absurd to me after sitting on it, I'll send Ted a patch to delete that
line with the remark that the use of `cond_resched()` is redundant with
blocking on disk.

>> There doesn't seem to be a real specific tip-over point, and it seems to
>> depend a lot on RAM availability and harddrive speed (obviously).  So if
>> dentries get cached, disk speed becomes less of an issue.  However, on
>> large folders (where I typically use 10k as a value for large based on
>> "gut feeling" and "unquantifiable experience" and "nothing scientific at
>> all") I find that even with lots of RAM two consecutive ls commands
>> remain terribly slow. Switch off dir_index and that becomes an order of
>> magnitude faster.
>>
>> I don't have a great deal of experience with XFS, but on those systems
>> where we do it's generally on a VM, and perceivably (again, not
>> scientific) our experience has been that it feels slower.  Again, not
>> scientific, just perception.
>>
>> I'm in support for the change.  This will bucket to 256 folders and
>> should have a reasonably even split between folders.  If required a
>> second layer could be introduced by using the 3rd and 4th digits of the
>> hash for a second layer.  Any hash should be fine, it really doesn't
>> need to be cryptographically strong, it just needs to provide a good
>> spread and be really fast.  Generally a hash table should have a prime
>> number of buckets to assist with hash bias, but frankly, that's over
>> complicating the situation here.
>>
>> I also agree with others that it used to be easy to get distfiles as and
>> when needed, so an alternative structure could mirror that of the
>> portage tree itself, in other words "cat/pkg/distfile". This perhaps
>> just shifts the issue:
>>
>> jkroon@plastiekpoot /usr/portage $ find . -maxdepth 1 -type d -name "*-*" | wc -l
>> 167
>> jkroon@plastiekpoot /usr/portage $ find *-* -maxdepth 1 -type d | wc -l
>> 19412
>> jkroon@plastiekpoot /usr/portage $ for i in *-*; do echo $(find $i -maxdepth 1 -type d | wc -l) $i; done | sort -g | tail -n10
>> 347 net-misc
>> 373 media-sound
>> 395 media-libs
>> 399 dev-util
>> 505 dev-libs
>> 528 dev-java
>> 684 dev-haskell
>> 690 dev-ruby
>> 1601 dev-perl
>> 1889 dev-python
>>
>> So that's average 116 sub folders under the top layer (only two over
>> 1000), and then presumably less than 100 distfiles maximum per package?
>> Probably overkill but would (should) solve both the too many files per
>> folder as well as the easy lookup by hand issue.
>>
>> I don't have a preference on either solution though but do agree that
>> "easy finding of distfiles" are handy.  The INDEX mechanism is fine for me.
>>
>> Kind Regards,
>>
>> Jaco
>>
>
>

