[RFC] Improving Gentoo package format

Michał Górny-5
Hi, everyone.

Gentoo's tbz2/xpak package format is quite old.  We've made a few
incompatible changes in the past (most notably, allowing non-bzip2
compression and multi-instance naming) but the core design has stayed
the same.  I think we should consider changing it, for the reasons
outlined below.

A rough description of the format can be found in xpak(5).  Basically,
it's a regular compressed tarball with a binary metadata blob appended
to the end.  As such, it looks like a regular compressed tarball
to the compression tools (with some ignored junk at the end).
The metadata is in an entirely custom format and needs dedicated tools
to manipulate.
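
To make the trailer concrete, here is a minimal Python sketch of the layout as I read it from xpak(5): the xpak segment starts with `XPAKPACK`, ends with `XPAKSTOP`, and is followed by its own length and the literal `STOP`, with every integer stored as a 32-bit big-endian value. The function names are hypothetical and this is not Portage's own code; treat the layout details as an assumption to check against the man page.

```python
import struct

def encode_int(n):
    # Assumption from xpak(5): lengths and offsets are 32-bit big-endian.
    return struct.pack(">I", n)

def build_xpak(entries):
    """Serialize {name: bytes} into an xpak segment (index followed by data)."""
    index = b""
    data = b""
    for name, value in entries.items():
        raw = name.encode()
        # Index record: name length, name, offset into data, value length.
        index += encode_int(len(raw)) + raw
        index += encode_int(len(data)) + encode_int(len(value))
        data += value
    return (b"XPAKPACK" + encode_int(len(index)) + encode_int(len(data))
            + index + data + b"XPAKSTOP")

def parse_binpkg_metadata(blob):
    """Return {name: bytes} from a whole binary package held in memory."""
    if blob[-4:] != b"STOP":
        raise ValueError("no xpak trailer")
    (xpak_len,) = struct.unpack(">I", blob[-8:-4])
    xpak = blob[-8 - xpak_len:-8]
    if xpak[:8] != b"XPAKPACK" or xpak[-8:] != b"XPAKSTOP":
        raise ValueError("corrupt xpak segment")
    index_len, data_len = struct.unpack(">II", xpak[8:16])
    index = xpak[16:16 + index_len]
    data = xpak[16 + index_len:16 + index_len + data_len]
    result = {}
    pos = 0
    while pos < len(index):
        (name_len,) = struct.unpack(">I", index[pos:pos + 4])
        pos += 4
        name = index[pos:pos + name_len].decode()
        pos += name_len
        offset, length = struct.unpack(">II", index[pos:pos + 8])
        pos += 8
        result[name] = data[offset:offset + length]
    return result
```

Note that the parser never looks at the compressed tarball in front of the trailer, which is the "found without touching the compressed data" property listed below.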


The current format has a few advantages that would probably be worth
preserving:

+ The binary package is a single flat file.

+ It is reasonably compatible with a regular compressed tarball,
so users can unpack it using standard tools (except for the metadata).

+ The metadata is uncompressed and can be quickly found without touching
the compressed data.

+ The metadata can be updated (e.g. as a result of pkgmove) without
touching the compressed data.


However, it has a few disadvantages as well:

- The metadata is in an entirely custom binary format, requiring
dedicated tools to read or edit.

- The metadata format relies on the customary behavior of compression
tools, which ignore junk following the compressed data.

- By placing the metadata at the end of the file, we make it rather hard
to read the metadata from a remote location (via FTP, HTTP) without
fetching the whole file.  [NB: it's technically possible but probably
not worth the effort]

- By requiring the custom format to be at the end of the file, we make
it impossible to trivially cover it with an OpenPGP signature without
introducing another custom format.

- While the format might allow for some extensibility, it's rather
an evolutionary dead end.


I think the key points of the new format should be:

1. It should reuse common file formats as much as possible, inventing
as little custom code as possible.

2. It should allow for easy introspection and editing by users without
dedicated tools.

3. The metadata should allow for lookup without fetching the whole
binary package.

4. The format should allow for some extensions without having to
reinvent the wheel every time.

5. It would be nice to preserve the existing advantages.


My proposal
===========

Basic format
------------
The base of the format is a regular compressed tarball.  There's no junk
appended to it but the metadata is stored inside it as
/var/db/pkg/${PF}.  The contents are as compatible with the actual vdb
format as possible.

This has the following advantages:

+ Binary package is still stored as a single file.

+ It uses a standard compressed .tar format, with minimal customization.

+ The user can easily inspect and modify the packages with standard
tools (tar and the compressor).

+ If we can maintain a reasonable level of vdb compatibility, the user
can even emergency-install a package without causing too much hassle (as
it will be recorded in vdb); ideally Portage would detect this vdb entry
and support fixing the install afterwards.


Optimizing for easy recognition
-------------------------------
In order to make it possible for magic-based tools such as file(1) to
easily distinguish Gentoo binary packages from regular tarballs, we
could (ab)use the volume label field, e.g. use:

  $ tar -V 'gpkg: app-foo/bar-1' -c ...

This will add a volume label as the first file entry inside the tarball,
which does not affect extracting but can be trivially matched via magic
rules.
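
As a rough illustration of what such a magic rule would match, the Python sketch below checks the first 512-byte tar header for the GNU volume-header type flag (`V` at offset 156) and a name starting with `gpkg: `. It works on the uncompressed tar stream, so a real check would have to decompress the first block first. The function names are hypothetical, and the writer relies on Python's `tarfile` emitting whatever type flag is set on the `TarInfo`, which is an assumption about library behavior rather than a documented feature for volume labels.

```python
import io
import tarfile

LABEL_PREFIX = b"gpkg: "  # prefix taken from the example command above

def write_labelled_tar(label, members):
    """Build an uncompressed tar whose first header is a GNU volume label."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w", format=tarfile.GNU_FORMAT) as tf:
        vol = tarfile.TarInfo(label)
        vol.type = b"V"  # GNU volume header type flag
        tf.addfile(vol)
        for name, data in members:
            info = tarfile.TarInfo(name)
            info.size = len(data)
            tf.addfile(info, io.BytesIO(data))
    return buf.getvalue()

def read_gpkg_label(tar_bytes):
    """Return the label if the first header is a 'gpkg: ' volume label, else None."""
    header = tar_bytes[:512]
    if len(header) < 512 or header[156:157] != b"V":
        return None
    # The name field occupies the first 100 bytes and is NUL-terminated.
    name = header[:100].split(b"\0", 1)[0]
    if not name.startswith(LABEL_PREFIX):
        return None
    return name.decode()
```

A package without the label simply yields `None`, which fits the note below that unlabelled packages should not be rejected.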

Note: this is meant to be used as a method for fast binary package
recognition; I don't think we should reject (hand-modified) binary
packages that lack this label.


Optimizing for metadata reading/manipulation performance
--------------------------------------------------------
The main problem with using a single tarball for both metadata and data
is that normally you'd have to decompress everything to reliably unpack
metadata, and recompress everything to update it.  This problem can be
addressed by a few optimization tricks.

Firstly, all metadata files are packed into the archive before data
files.  With a slightly customized unpacker, we can stop decompressing
as soon as we're past the metadata and avoid decompressing the whole
archive.  This will also make it possible to read metadata from remote
files without fetching far past the compressed metadata block.
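
A minimal sketch of the early-stop idea in Python, assuming the metadata members sit at the front of the archive under `var/db/pkg/` (the path proposed above, stored without the leading slash). It uses `tarfile`'s forward-only stream mode, so decompression only proceeds as far as the reader asks; the function name is hypothetical.

```python
import io
import tarfile

METADATA_PREFIX = "var/db/pkg/"  # assumed member prefix for metadata

def read_leading_metadata(fileobj):
    """Read metadata members from the front of a compressed tarball, then stop."""
    metadata = {}
    # "r|*" is tarfile's stream mode with transparent decompression.
    with tarfile.open(fileobj=fileobj, mode="r|*") as tf:
        for member in tf:
            if not member.name.startswith(METADATA_PREFIX):
                break  # first data file reached; leave the rest compressed
            if member.isfile():
                metadata[member.name] = tf.extractfile(member).read()
    return metadata
```

Because the reader pulls input in small chunks, the same loop could sit on top of an HTTP response object instead of a local file, which is what makes the remote-lookup case plausible.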

Secondly, if we're up for some more tricks, we could technically split
the tarball into metadata and data blocks compressed separately.  This
will need a bit of archiver customization but it will make it possible
to decompress the metadata part without even touching compressed data,
and to replace it without recompressing data.

What's important is that both tricks proposed maintain backwards
compatibility with regular compressed tarballs.  That is, the user will
still be able to extract it with regular archiving tools.
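
The split-block trick can be sketched in Python as follows, assuming gzip as the compressor: the metadata entries are written without the tar end-of-archive marker and compressed as one gzip member, the data entries plus the marker as a second member, and the two are concatenated. Since concatenated gzip members decompress to the concatenation of their contents, an ordinary reader still sees a single tarball. All function names are hypothetical.

```python
import gzip
import io
import tarfile
import zlib

END_OF_ARCHIVE = b"\0" * 1024  # two zero blocks terminate a tar stream

def tar_entry(name, data):
    """One tar header plus padded file data, without an end-of-archive marker."""
    info = tarfile.TarInfo(name)
    info.size = len(data)
    header = info.tobuf(format=tarfile.GNU_FORMAT)
    return header + data + b"\0" * (-len(data) % 512)

def build_split_package(metadata, payload):
    """Compress metadata and payload entries as two concatenated gzip members."""
    meta_raw = b"".join(tar_entry(n, d) for n, d in metadata)
    data_raw = b"".join(tar_entry(n, d) for n, d in payload) + END_OF_ARCHIVE
    return gzip.compress(meta_raw) + gzip.compress(data_raw)

def split_members(package):
    """Decompress only the first gzip member; return (metadata tar bytes, rest)."""
    d = zlib.decompressobj(wbits=zlib.MAX_WBITS | 16)  # 16 selects gzip framing
    meta_raw = d.decompress(package)
    return meta_raw, d.unused_data  # rest is the still-compressed payload

def replace_metadata(package, new_metadata):
    """Swap the metadata member while copying the compressed payload verbatim."""
    _, rest = split_members(package)
    meta_raw = b"".join(tar_entry(n, d) for n, d in new_metadata)
    return gzip.compress(meta_raw) + rest
```

For brevity `split_members` takes the whole package in memory; a real reader would feed the decompressor chunk by chunk and stop once `eof` is set, so the payload bytes would not even need to be read.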


Adding OpenPGP signatures
-------------------------
This is the main XXX here.

Technically, the most obvious solution is to cover the entire tarball
with an OpenPGP signature.  However, this has the disadvantage that
verification requires fetching the whole file.

I will look into the possibility of having partial signatures.


--
Best regards,
Michał Górny


Re: [RFC] Improving Gentoo package format

Alec Warner-2

On Sat, Nov 10, 2018 at 8:09 AM Michał Górny <[hidden email]> wrote:
> Hi, everyone.
>
> The Gentoo's tbz2/xpak package format is quite old.  We've made a few
> incompatible changes in the past (most notably, allowing non-bzip2
> compression and multi-instance naming) but the core design stayed
> the same.  I think we should consider changing it, for the reasons
> outlined below.
>
> The rough format description can be found in xpak(5).  Basically, it's
> a regular compressed tarball with binary metadata blob appended
> to the end.  As such, it looks like a regular compressed tarball
> to the compression tools (with some ignored junk at the end).
> The metadata is entirely custom format and needs dedicated tools
> to manipulate.
>
>
> The current format has a few advantages whose preserving would probably
> be worthwhile:
>
> + The binary package is a single flat file.
>
> + It is reasonably compatible with regular compressed tarball,
> so the users can unpack it using standard tools (except for metadata).
>
> + The metadata is uncompressed and can be quickly found without touching
> the compressed data.
>
> + The metadata can be updated (e.g. as result of pkgmove) without
> touching the compressed data.
>
>
> However, it has a few disadvantages as well:
>
> - The metadata is entirely custom binary format, requiring dedicated
> tools to read or edit.
>
> - The metadata format is relying on customary behavior of compression
> tools that ignore junk following the compressed data.

I agree this is a problem in theory, but I haven't seen it as a problem in practice. Have you observed any problems around this setup?
 

> - By placing the metadata at the end of file, we make it rather hard to
> read the metadata from remote location (via FTP, HTTP) without fetching
> the whole file.  [NB: it's technically possible but probably not worth
> the effort]
>
> - By requiring the custom format to be at the end of file, we make it
> impossible to trivially cover it with a OpenPGP signature without
> introducing another custom format.

It's trivial to cover with a detached sig, no?
 

> - While the format might allow for some extensibility, it's rather
> evolutionary dead end.

I'm not even sure how to quantify this; it just sounds like your subjective opinion (which is fine, but it's not factual).
 


> I think the key points of the new format should be:
>
> 1. It should reuse common file formats as much as possible, with
> inventing as little custom code as possible.
>
> 2. It should allow for easy introspection and editing by users without
> dedicated tools.

So I'm less confident in the editing use cases; do users edit their binpkgs on a regular basis?
 

> 3. The metadata should allow for lookup without fetching the whole
> binary package.
>
> 4. The format should allow for some extensions without having to
> reinvent the wheel every time.
>
> 5. It would be nice to preserve the existing advantages.
>
>
> My proposal
> ===========
>
> Basic format
> ------------
> The base of the format is a regular compressed tarball.  There's no junk
> appended to it but the metadata is stored inside it as
> /var/db/pkg/${PF}.  The contents are as compatible with the actual vdb
> format as possible.

Just to clarify, you are suggesting we store the metadata inside the contents of the binary package itself (e.g. where the other files that get merged to the liveFS are?) What about collisions?

E.g. I install 'machine-images/gentoo-disk-image-1.2.3' on a machine that already has 'machine-images/gentoo-disk-image-1.2.3' installed, won't it overwrite files in the VDB at qmerge time?
 

> This has the following advantages:
>
> + Binary package is still stored as a single file.
>
> + It uses a standard compressed .tar format, with minimal customization.
>
> + The user can easily inspect and modify the packages with standard
> tools (tar and the compressor).
>
> + If we can maintain reasonable level of vdb compatibility, the user can
> even emergency-install a package without causing too much hassle (as it
> will be recorded in vdb); ideally Portage would detect this vdb entry
> and support fixing the install afterwards.

I'm not certain this is really desired.
 


> Optimizing for easy recognition
> -------------------------------
> In order to make it possible for magic-based tools such as file(1) to
> easily distinguish Gentoo binary packages from regular tarballs, we
> could (ab)use the volume label field, e.g. use:
>
>   $ tar -V 'gpkg: app-foo/bar-1' -c ...
>
> This will add a volume label as the first file entry inside the tarball,
> which does not affect extracting but can be trivially matched via magic
> rules.
>
> Note: this is meant to be used as a method for fast binary package
> recognition; I don't think we should reject (hand-modified) binary
> packages that lack this label.
>
>
> Optimizing for metadata reading/manipulation performance
> --------------------------------------------------------
> The main problem with using a single tarball for both metadata and data
> is that normally you'd have to decompress everything to reliably unpack
> metadata, and recompress everything to update it.  This problem can be
> addressed by a few optimization tricks.

These performance goals seem a little ill-defined.

1) Where are users reporting slowness in binpkg operations?
2) What is the cause of the slowness?

I could easily see a potential user with many large binpkgs, and the current implementation causing them issues because
they have to decompress and seek a bunch to read the metadata out of their 1.2GB binpkg. But I'm pretty sure this isn't most users.
 

> Firstly, all metadata files are packed to the archive before data files.
>  With a slightly customized unpacker, we can stop decompressing as soon
> as we're past metadata and avoid decompressing the whole archive.  This
> will also make it possible to read metadata from remote files without
> fetching far past the compressed metadata block.

So this seems to basically go against your goals of simple common tooling?
 

> Secondly, if we're up for some more tricks, we could technically split
> the tarball into metadata and data blocks compressed separately.  This
> will need a bit of archiver customization but it will make it possible
> to decompress the metadata part without even touching compressed data,
> and to replace it without recompressing data.
>
> What's important is that both tricks proposed maintain backwards
> compatibility with regular compressed tarballs.  That is, the user will
> still be able to extract it with regular archiving tools.

So my recollection is that Debian uses common-format ar files for the main deb.
Inside there are two compressed tarballs, one for metadata and one for data.

This format seems to jibe with many of your requirements:

 - 'ar' can retrieve individual files from the archive.
 - The deb file itself is not compressed, but the tarballs inside *are* compressed.
 - The metadata and data are compressed separately.
 - Anyone can edit this with normal tooling (ar, tar).

In short: why should we invent a new format?
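
To show how little is involved in the container itself, here is a small Python sketch of the common ar layout: an 8-byte `!<arch>` magic line, then for each member a 60-byte text header followed by the data, padded to an even offset. The member names in the usage note are the ones a .deb normally carries; the function names are hypothetical and only short (16 characters or fewer) names are handled.

```python
def build_ar(members):
    """Serialize [(name, bytes)] as an ar archive (short names only)."""
    out = b"!<arch>\n"
    for name, data in members:
        # Fields: name(16) mtime(12) uid(6) gid(6) mode(8) size(10) magic(2)
        header = "%-16s%-12d%-6d%-6d%-8s%-10d`\n" % (
            name, 0, 0, 0, "100644", len(data))
        out += header.encode("ascii") + data
        if len(data) % 2:
            out += b"\n"  # members are aligned to even offsets
    return out

def read_ar(blob):
    """Return {name: bytes} for every member of an ar archive."""
    if blob[:8] != b"!<arch>\n":
        raise ValueError("not an ar archive")
    members = {}
    pos = 8
    while pos + 60 <= len(blob):
        header = blob[pos:pos + 60]
        if header[58:60] != b"`\n":
            raise ValueError("bad member header")
        # GNU ar terminates names with "/", so strip that as well as padding.
        name = header[:16].decode("ascii").rstrip().rstrip("/")
        size = int(header[48:58].decode("ascii"))
        pos += 60
        members[name] = blob[pos:pos + size]
        pos += size + (size % 2)
    return members
```

With this, `read_ar(blob)["control.tar.gz"]` would give the compressed metadata tarball of a .deb without touching `data.tar.gz`, since each header states the member size and the reader can skip ahead.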
 


> Adding OpenPGP signatures
> -------------------------
> This is the main XXX here.
>
> Technically, the most obvious solution is to cover the entire tarball
> with OpenPGP signature.  However, this has the disadvantage that
> the verification requires fetching the whole file.
>
> I will look into possibility of having partial signatures.
>
>
> --
> Best regards,
> Michał Górny

Re: [RFC] Improving Gentoo package format

Zac Medico-2
On 11/10/2018 06:37 AM, Alec Warner wrote:

>
> On Sat, Nov 10, 2018 at 8:09 AM Michał Górny <[hidden email]
> <mailto:[hidden email]>> wrote:
>
>     Hi, everyone.
>
>     The Gentoo's tbz2/xpak package format is quite old.  We've made a few
>     incompatible changes in the past (most notably, allowing non-bzip2
>     compression and multi-instance naming) but the core design stayed
>     the same.  I think we should consider changing it, for the reasons
>     outlined below.
>
>     The rough format description can be found in xpak(5).  Basically, it's
>     a regular compressed tarball with binary metadata blob appended
>     to the end.  As such, it looks like a regular compressed tarball
>     to the compression tools (with some ignored junk at the end).
>     The metadata is entirely custom format and needs dedicated tools
>     to manipulate.
>
>
>     The current format has a few advantages whose preserving would probably
>     be worthwhile:
>
>     + The binary package is a single flat file.
>
>     + It is reasonably compatible with regular compressed tarball,
>     so the users can unpack it using standard tools (except for metadata).
>
>     + The metadata is uncompressed and can be quickly found without touching
>     the compressed data.
>
>     + The metadata can be updated (e.g. as result of pkgmove) without
>     touching the compressed data.
>
>
>     However, it has a few disadvantages as well:
>
>     - The metadata is entirely custom binary format, requiring dedicated
>     tools to read or edit.
>
>     - The metadata format is relying on customary behavior of compression
>     tools that ignore junk following the compressed data.
>
>
> I agree this is a problem in theory, but I haven't seen it as a problem
> in practice. Have you observed any problems around this setup?
In portage we use head -c to select the compressed data, since zstd
doesn't handle the xpak trailer well.
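
For illustration, the byte count to hand to head -c can be derived from the last eight bytes of the file, assuming the trailer layout from xpak(5): a 32-bit big-endian length of the xpak segment followed by the literal `STOP`. This Python sketch is not the code Portage uses, and the function name is hypothetical.

```python
import io
import struct

def compressed_payload_length(fileobj):
    """Return how many leading bytes of a tbz2/xpak file are compressed tarball."""
    fileobj.seek(0, io.SEEK_END)
    total = fileobj.tell()
    if total < 8:
        raise ValueError("file too short to carry an xpak trailer")
    fileobj.seek(total - 8)
    tail = fileobj.read(8)
    if tail[4:] != b"STOP":
        raise ValueError("no xpak trailer found")
    (xpak_len,) = struct.unpack(">I", tail[:4])
    payload_len = total - 8 - xpak_len
    if payload_len < 0:
        raise ValueError("trailer length exceeds file size")
    return payload_len
```

The returned number is the count of leading bytes that belong to the compressed stream, so a strict decompressor only ever sees valid input.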

>
>     - By placing the metadata at the end of file, we make it rather hard to
>     read the metadata from remote location (via FTP, HTTP) without fetching
>     the whole file.  [NB: it's technically possible but probably not worth
>     the effort] 
>
>
>     - By requiring the custom format to be at the end of file, we make it
>     impossible to trivially cover it with a OpenPGP signature without
>     introducing another custom format.
>
>
> Its trivial to cover with a detached sig, no?
>  
>
>
>     - While the format might allow for some extensibility, it's rather
>     evolutionary dead end.
>
>
> I'm not even sure how to quantify this, it just sounds like your
> subjective opinion (which is fine, but its not factual.)
Yeah the xpak trailer is flexible enough, but I'm not opposed to
supporting a different format.

>
>     I think the key points of the new format should be:
>
>     1. It should reuse common file formats as much as possible, with
>     inventing as little custom code as possible.
>
>     2. It should allow for easy introspection and editing by users without
>     dedicated tools.
>
>
> So I'm less confident in the editing use cases; do users edit their
> binpkgs on a regular basis?
Yes, gentoo/profiles/updates package renames and slot moves are a form
of this.

>
>     3. The metadata should allow for lookup without fetching the whole
>     binary package.
>
>     4. The format should allow for some extensions without having to
>     reinvent the wheel every time.
>
>     5. It would be nice to preserve the existing advantages.
>
>
>     My proposal
>     ===========
>
>     Basic format
>     ------------
>     The base of the format is a regular compressed tarball.  There's no junk
>     appended to it but the metadata is stored inside it as
>     /var/db/pkg/${PF}.  The contents are as compatible with the actual vdb
>     format as possible.
>
>
> Just to clarify, you are suggesting we store the metadata inside the
> contents of the binary package itself (e.g. where the other files that
> get merged to the liveFS are?) What about collisions?
>
> E.g. I install 'machine-images/gentoo-disk-image-1.2.3' on a machine
> that already has 'machine-images/gentoo-disk-image-1.2.3' installed,
> won't it overwrite files in the VDB at qmerge time?
I haven't looked into it but maybe we can use "nil control directory
names" to embed things, like http://savannah.gnu.org/projects/swbis
claims to use.

>     This has the following advantages:
>
>     + Binary package is still stored as a single file.
>
>     + It uses a standard compressed .tar format, with minimal customization.
>
>     + The user can easily inspect and modify the packages with standard
>     tools (tar and the compressor).
>
>     + If we can maintain reasonable level of vdb compatibility, the user can
>     even emergency-install a package without causing too much hassle (as it
>     will be recorded in vdb); ideally Portage would detect this vdb entry
>     and support fixing the install afterwards.
>
>
> I'm not certain this is really desired.
Yeah I don't like it either, I'd prefer to keep the metadata someplace
where it can't overwrite files in the installed package database.

>
>     Optimizing for easy recognition
>     -------------------------------
>     In order to make it possible for magic-based tools such as file(1) to
>     easily distinguish Gentoo binary packages from regular tarballs, we
>     could (ab)use the volume label field, e.g. use:
>
>       $ tar -V 'gpkg: app-foo/bar-1' -c ...
>
>     This will add a volume label as the first file entry inside the tarball,
>     which does not affect extracting but can be trivially matched via magic
>     rules.
>
>     Note: this is meant to be used as a method for fast binary package
>     recognition; I don't think we should reject (hand-modified) binary
>     packages that lack this label.
>
>
>     Optimizing for metadata reading/manipulation performance
>     --------------------------------------------------------
>     The main problem with using a single tarball for both metadata and data
>     is that normally you'd have to decompress everything to reliably unpack
>     metadata, and recompress everything to update it.  This problem can be
>     addressed by a few optimization tricks.
>
>
> These performance goals seem a little bit ill defined.
>
> 1) Where are users reporting slowness in binpkg operations?
> 2) What is the cause of the slowness?
Yeah I'd like more information here too.

> Like I could easily see a potential user with many large binpkgs, and
> the current implementation causing them issues because
> they have to decompress and seek a bunch to read the metadata out of
> their 1.2GB binpkg. But i'm pretty sure this isn't most users.
>  
>
>
>     Firstly, all metadata files are packed to the archive before data files.
>      With a slightly customized unpacker, we can stop decompressing as soon
>     as we're past metadata and avoid decompressing the whole archive.  This
>     will also make it possible to read metadata from remote files without
>     fetching far past the compressed metadata block.
>
>
> So this seems to basically go against your goals of simple common tooling?
>  
>
>
>     Secondly, if we're up for some more tricks, we could technically split
>     the tarball into metadata and data blocks compressed separately.  This
>     will need a bit of archiver customization but it will make it possible
>     to decompress the metadata part without even touching compressed data,
>     and to replace it without recompressing data.
>
>     What's important is that both tricks proposed maintain backwards
>     compatibility with regular compressed tarballs.  That is, the user will
>     still be able to extract it with regular archiving tools.
>
>
> So my recollection is that debian uses common format AR files for the
> main deb.
> Then they have 2 compressed tarballs, one for metadata, and one for data.
>
> This format seems to jive with many of your requirements:
>
>  - 'ar' can retrieve individual files from the archive.
>  - The deb file itself is not compressed, but the tarballs inside *are*
> compressed.
>  - The metadata and data are compressed separately.
>  - Anyone can edit this with normal tooling (ar, tar)
>
> In short; why should we invent a new format?
Maybe we can borrow some ideas from
http://savannah.gnu.org/projects/swbis which claims to be capable of
creating and verifying a tarball with GPG signatures embedded in the
tarball.

>
>     Adding OpenPGP signatures
>     -------------------------
>     This is the main XXX here.
>
>     Technically, the most obvious solution is to cover the entire tarball
>     with OpenPGP signature.  However, this has the disadvantage that
>     the verification requires fetching the whole file.
>
>     I will look into possibility of having partial signatures.
>
>
>     --
>     Best regards,
>     Michał Górny
>

--
Thanks,
Zac



Re: [RFC] Improving Gentoo package format

Michał Górny-5
In reply to this post by Alec Warner-2
On Sat, 2018-11-10 at 09:37 -0500, Alec Warner wrote:

> On Sat, Nov 10, 2018 at 8:09 AM Michał Górny <[hidden email]> wrote:
>
> > Hi, everyone.
> >
> > The Gentoo's tbz2/xpak package format is quite old.  We've made a few
> > incompatible changes in the past (most notably, allowing non-bzip2
> > compression and multi-instance naming) but the core design stayed
> > the same.  I think we should consider changing it, for the reasons
> > outlined below.
> >
> > The rough format description can be found in xpak(5).  Basically, it's
> > a regular compressed tarball with binary metadata blob appended
> > to the end.  As such, it looks like a regular compressed tarball
> > to the compression tools (with some ignored junk at the end).
> > The metadata is entirely custom format and needs dedicated tools
> > to manipulate.
> >
> >
> > The current format has a few advantages whose preserving would probably
> > be worthwhile:
> >
> > + The binary package is a single flat file.
> >
> > + It is reasonably compatible with regular compressed tarball,
> > so the users can unpack it using standard tools (except for metadata).
> >
> > + The metadata is uncompressed and can be quickly found without touching
> > the compressed data.
> >
> > + The metadata can be updated (e.g. as result of pkgmove) without
> > touching the compressed data.
> >
> >
> > However, it has a few disadvantages as well:
> >
> > - The metadata is entirely custom binary format, requiring dedicated
> > tools to read or edit.
> >
> > - The metadata format is relying on customary behavior of compression
> > tools that ignore junk following the compressed data.
> >
>
> I agree this is a problem in theory, but I haven't seen it as a problem in
> practice. Have you observed any problems around this setup?
Historically one of the parallel compressor variants did not support
this.

> >
> > - By placing the metadata at the end of file, we make it rather hard to
> > read the metadata from remote location (via FTP, HTTP) without fetching
> > the whole file.  [NB: it's technically possible but probably not worth
> > the effort]
>
>
> > - By requiring the custom format to be at the end of file, we make it
> > impossible to trivially cover it with a OpenPGP signature without
> > introducing another custom format.
> >
>
> Its trivial to cover with a detached sig, no?
>
>
> >
> > - While the format might allow for some extensibility, it's rather
> > evolutionary dead end.
> >
>
> I'm not even sure how to quantify this, it just sounds like your subjective
> opinion (which is fine, but its not factual.)
>
>
> >
> >
> > I think the key points of the new format should be:
> >
> > 1. It should reuse common file formats as much as possible, with
> > inventing as little custom code as possible.
> >
> > 2. It should allow for easy introspection and editing by users without
> > dedicated tools.
> >
>
> So I'm less confident in the editing use cases; do users edit their binpkgs
> on a regular basis?
It's useful for debugging stuff.  I had to use hexedit on xpak
in the past.  Believe me, it's nowhere close to pleasant.

> > 3. The metadata should allow for lookup without fetching the whole
> > binary package.
> >
> > 4. The format should allow for some extensions without having to
> > reinvent the wheel every time.
> >
> > 5. It would be nice to preserve the existing advantages.
> >
> >
> > My proposal
> > ===========
> >
> > Basic format
> > ------------
> > The base of the format is a regular compressed tarball.  There's no junk
> > appended to it but the metadata is stored inside it as
> > /var/db/pkg/${PF}.  The contents are as compatible with the actual vdb
> > format as possible.
> >
>
> Just to clarify, you are suggesting we store the metadata inside the
> contents of the binary package itself (e.g. where the other files that get
> merged to the liveFS are?) What about collisions?
>
> E.g. I install 'machine-images/gentoo-disk-image-1.2.3' on a machine that
> already has 'machine-images/gentoo-disk-image-1.2.3' installed, won't it
> overwrite files in the VDB at qmerge time?
Portage will obviously move the files out, and process them as metadata.
The idea is precisely to use a directory that can't normally be part
of binary packages, so it can't cause collisions with real files (even
if they're very unlikely to ever happen).

> > This has the following advantages:
> >
> > + Binary package is still stored as a single file.
> >
> > + It uses a standard compressed .tar format, with minimal customization.
> >
> > + The user can easily inspect and modify the packages with standard
> > tools (tar and the compressor).
> >
> > + If we can maintain reasonable level of vdb compatibility, the user can
> > even emergency-install a package without causing too much hassle (as it
> > will be recorded in vdb); ideally Portage would detect this vdb entry
> > and support fixing the install afterwards.
> >
>
> I'm not certain this is really desired.
Are you saying it's better that a user emergency-installs a package
without recording it in vdb, and ends up with a mess of collisions
and untracked files?

Just because you don't like some use case doesn't mean it's not gonna
happen.  Either you prepare for it and make the best of it, or you
pretend it's not gonna happen and cause extra pain to users.

> > Optimizing for easy recognition
> > -------------------------------
> > In order to make it possible for magic-based tools such as file(1) to
> > easily distinguish Gentoo binary packages from regular tarballs, we
> > could (ab)use the volume label field, e.g. use:
> >
> >   $ tar -V 'gpkg: app-foo/bar-1' -c ...
> >
> > This will add a volume label as the first file entry inside the tarball,
> > which does not affect extracting but can be trivially matched via magic
> > rules.
> >
> > Note: this is meant to be used as a method for fast binary package
> > recognition; I don't think we should reject (hand-modified) binary
> > packages that lack this label.
> >
> >
> > Optimizing for metadata reading/manipulation performance
> > --------------------------------------------------------
> > The main problem with using a single tarball for both metadata and data
> > is that normally you'd have to decompress everything to reliably unpack
> > metadata, and recompress everything to update it.  This problem can be
> > addressed by a few optimization tricks.
> >
>
> These performance goals seem a little bit ill defined.
>
> 1) Where are users reporting slowness in binpkg operations?
> 2) What is the cause of the slowness?
Those are optimizations to not cause slowness compared to the current
format.  The main use case is recreating the package index, which would
require rereading the metadata of all binary packages.

> Like I could easily see a potential user with many large binpkgs, and the
> current implementation causing them issues because
> they have to decompress and seek a bunch to read the metadata out of their
> 1.2GB binpkg. But i'm pretty sure this isn't most users.
>
>
> >
> > Firstly, all metadata files are packed to the archive before data files.
> >  With a slightly customized unpacker, we can stop decompressing as soon
> > as we're past metadata and avoid decompressing the whole archive.  This
> > will also make it possible to read metadata from remote files without
> > fetching far past the compressed metadata block.
> >
>
> So this seems to basically go against your goals of simple common tooling?
No.  My goal is to make it compatible with simple common tooling.  You
can still use the simple tooling to read/write them.  The optimized
tools are only needed to efficiently handle special use cases.

> > Secondly, if we're up for some more tricks, we could technically split
> > the tarball into metadata and data blocks compressed separately.  This
> > will need a bit of archiver customization but it will make it possible
> > to decompress the metadata part without even touching compressed data,
> > and to replace it without recompressing data.
> >
> > What's important is that both tricks proposed maintain backwards
> > compatibility with regular compressed tarballs.  That is, the user will
> > still be able to extract it with regular archiving tools.
>
>
> So my recollection is that debian uses common format AR files for the main
> deb.
> Then they have 2 compressed tarballs, one for metadata, and one for data.
>
> This format seems to jive with many of your requirements:
>
>  - 'ar' can retrieve individual files from the archive.
>  - The deb file itself is not compressed, but the tarballs inside *are*
> compressed.
>  - The metadata and data are compressed separately.
>  - Anyone can edit this with normal tooling (ar, tar)
>
> In short; why should we invent a new format?
Because nobody knows how to use 'ar', compared to how almost every
Gentoo user can use 'tar' immediately?  Of course we could alternatively
just use a nested tarball but I wanted to keep the possibility
of actually being able to 'tar -xf' it without having to extract nested
archives.

--
Best regards,
Michał Górny


Re: [RFC] Improving Gentoo package format

Rich Freeman
On Sun, Nov 11, 2018 at 3:29 AM Michał Górny <[hidden email]> wrote:

>
> On Sat, 2018-11-10 at 09:37 -0500, Alec Warner wrote:
> > On Sat, Nov 10, 2018 at 8:09 AM Michał Górny <[hidden email]> wrote:
> >
> > >
> > > + If we can maintain reasonable level of vdb compatibility, the user can
> > > even emergency-install a package without causing too much hassle (as it
> > > will be recorded in vdb); ideally Portage would detect this vdb entry
> > > and support fixing the install afterwards.
> > >

IMO just overwriting vdb doesn't seem like a great idea.  If somebody
does do a plain untar then the package manager has no ability to
sanitize anything going in, which means dealing with a potential mess
after-the-fact.  Users wouldn't think to go messing with /var/db/pkg
on their own, but they certainly will be tempted to just untar a file.

Perhaps a package with the same name/version was already installed,
but the new files aren't the same as the old files.  Now we have
orphans because the package manager never had a chance to clean up and
lost all its state regarding what was already there.

Plus, this is basically in-band signaling and that is the sort of
thing that usually ends up being regretted sooner or later.

I'm not sure if vdb is entirely optimal, but if we wanted to stick
with that metadata format, why not just stick it in a separate
tarball?

>
> Are you saying it's better that user emergency-installs a package
> without recording it in vdb, and ends up with mess of collisions
> and untracked files?
>

IMO this is no different than a user unpacking any other random
tarball that goes and creates orphans.  I think the better solution is
some kind of tool to cleanly install a tarball, assuming it doesn't
already exist.  Short of turning /usr into a squashfs or whatever I'm
not sure any distro has a great solution for keeping users from
bypassing the package manager entirely.

> > In short; why should we invent a new format?
>
> Because nobody knows how to use 'ar', compared to how almost every
> Gentoo user can use 'tar' immediately?  Of course we could alternatively
> just use a nested tarball but I wanted to keep the possibility
> of actually being able to 'tar -xf' it without having to extract nested
> archives.
>

IMO a nested tarball would be a better solution.  I agree that ar is
obscure, and I don't see how it adds any value.  If we were talking
about something going into a bootloader then optimizing for unpacking
efficiency might be a concern, but there is no reason not to use more
standard tools, unless there was something about the .deb format
itself we wanted to completely preserve (seems unlikely).

Overall though, I definitely think that a better binary format makes a
lot of sense, and I think you're on the right track.

One thing you didn't touch on is file naming.  Right now two binary
packages with different USE/etc configurations are going to collide.
Would it make sense to toss in some kind of content hash of certain
metadata in the filename or something so that it would be much simpler
to host and auto-fetch binary packages?  I realize this is going
beyond your initial scope, but if we wanted to do this it would be a
good time to do so.

--
Rich


Re: [RFC] Improving Gentoo package format

Alec Warner-2
In reply to this post by Michał Górny-5


On Sun, Nov 11, 2018 at 3:29 AM Michał Górny <[hidden email]> wrote:
On Sat, 2018-11-10 at 09:37 -0500, Alec Warner wrote:
> On Sat, Nov 10, 2018 at 8:09 AM Michał Górny <[hidden email]> wrote:
>
> > Hi, everyone.
> >
> > The Gentoo's tbz2/xpak package format is quite old.  We've made a few
> > incompatible changes in the past (most notably, allowing non-bzip2
> > compression and multi-instance naming) but the core design stayed
> > the same.  I think we should consider changing it, for the reasons
> > outlined below.
> >
> > The rough format description can be found in xpak(5).  Basically, it's
> > a regular compressed tarball with binary metadata blob appended
> > to the end.  As such, it looks like a regular compressed tarball
> > to the compression tools (with some ignored junk at the end).
> > The metadata is entirely custom format and needs dedicated tools
> > to manipulate.
> >
> >
> > The current format has a few advantages whose preserving would probably
> > be worthwhile:
> >
> > + The binary package is a single flat file.
> >
> > + It is reasonably compatible with regular compressed tarball,
> > so the users can unpack it using standard tools (except for metadata).
> >
> > + The metadata is uncompressed and can be quickly found without touching
> > the compressed data.
> >
> > + The metadata can be updated (e.g. as result of pkgmove) without
> > touching the compressed data.
> >
> >
> > However, it has a few disadvantages as well:
> >
> > - The metadata is entirely custom binary format, requiring dedicated
> > tools to read or edit.
> >
> > - The metadata format is relying on customary behavior of compression
> > tools that ignore junk following the compressed data.
> >
>
> I agree this is a problem in theory, but I haven't seen it as a problem in
> practice. Have you observed any problems around this setup?

Historically one of the parallel compressor variants did not support
this.

> >
> > - By placing the metadata at the end of file, we make it rather hard to
> > read the metadata from remote location (via FTP, HTTP) without fetching
> > the whole file.  [NB: it's technically possible but probably not worth
> > the effort]
>
>
> > - By requiring the custom format to be at the end of file, we make it
> > impossible to trivially cover it with a OpenPGP signature without
> > introducing another custom format.
> >
>
> It's trivial to cover with a detached sig, no?
>
>
> >
> > - While the format might allow for some extensibility, it's rather
> > evolutionary dead end.
> >
>
> I'm not even sure how to quantify this, it just sounds like your subjective
> opinion (which is fine, but it's not factual.)
>
>
> >
> >
> > I think the key points of the new format should be:
> >
> > 1. It should reuse common file formats as much as possible, with
> > inventing as little custom code as possible.
> >
> > 2. It should allow for easy introspection and editing by users without
> > dedicated tools.
> >
>
> So I'm less confident in the editing use cases; do users edit their binpkgs
> on a regular basis?

It's useful for debugging stuff.  I had to use hexedit on xpak
in the past.  Believe me, it's nowhere close to pleasant.

> > 3. The metadata should allow for lookup without fetching the whole
> > binary package.
> >
> > 4. The format should allow for some extensions without having to
> > reinvent the wheel every time.
> >
> > 5. It would be nice to preserve the existing advantages.
> >
> >
> > My proposal
> > ===========
> >
> > Basic format
> > ------------
> > The base of the format is a regular compressed tarball.  There's no junk
> > appended to it but the metadata is stored inside it as
> > /var/db/pkg/${PF}.  The contents are as compatible with the actual vdb
> > format as possible.
> >
>
> Just to clarify, you are suggesting we store the metadata inside the
> contents of the binary package itself (e.g. where the other files that get
> merged to the liveFS are?) What about collisions?
>
> E.g. I install 'machine-images/gentoo-disk-image-1.2.3' on a machine that
> already has 'machine-images/gentoo-disk-image-1.2.3' installed, won't it
> overwrite files in the VDB at qmerge time?

Portage will obviously move the files out, and process them as metadata.
 The idea is to precisely use a directory that can't be normally part
of binary packages, so can't cause collisions with real files (even if
they're very unlikely to ever happen).
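
A minimal sketch of that separation step, assuming the metadata sits under a `var/db/pkg/<PF>/` prefix inside the archive. The helper `split_members` and the sample paths are hypothetical and are not taken from Portage:

```python
import io
import tarfile


def split_members(tar, vdb_prefix):
    """Partition member names into (metadata, data) by path prefix."""
    prefix = vdb_prefix.strip("/") + "/"
    metadata, data = [], []
    for member in tar.getmembers():
        name = member.name
        # Normalise "./var/..." and "/var/..." spellings before comparing.
        while name.startswith("./"):
            name = name[2:]
        name = name.lstrip("/")
        if name.startswith(prefix):
            metadata.append(name)
        else:
            data.append(name)
    return metadata, data


def make_sample():
    """Build a small in-memory .tar.gz standing in for a binary package."""
    files = {
        "./var/db/pkg/bar-1/USE": b"ssl\n",
        "./usr/bin/bar": b"#!/bin/sh\n",
    }
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w:gz") as tar:
        for name, payload in files.items():
            info = tarfile.TarInfo(name)
            info.size = len(payload)
            tar.addfile(info, io.BytesIO(payload))
    buf.seek(0)
    return buf


with tarfile.open(fileobj=make_sample(), mode="r:gz") as tar:
    metadata, data = split_members(tar, "/var/db/pkg/bar-1")

print("metadata:", metadata)
print("data:", data)
```

Only the `data` list would be merged to the live filesystem; the `metadata` list would be handed to whatever code records the package in the vdb.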

> > This has the following advantages:
> >
> > + Binary package is still stored as a single file.
> >
> > + It uses a standard compressed .tar format, with minimal customization.
> >
> > + The user can easily inspect and modify the packages with standard
> > tools (tar and the compressor).
> >
> > + If we can maintain reasonable level of vdb compatibility, the user can
> > even emergency-install a package without causing too much hassle (as it
> > will be recorded in vdb); ideally Portage would detect this vdb entry
> > and support fixing the install afterwards.
> >
>
> I'm not certain this is really desired.

Are you saying it's better that user emergency-installs a package
without recording it in vdb, and ends up with mess of collisions
and untracked files?

Just because you don't like some use case doesn't mean it's not gonna
happen.  Either you prepare for it and make the best of it, or you
pretend it's not gonna happen and cause extra pain to users.

I would argue that I would split the requirements into 3 bands.

1) Must do
2) Should do
3) Nice to have

To me, manually unpacking a tarball and having it recorded in the VDB is a 'nice to have' feature. If we can make it work, great.
I tend to agree with Rich here that recording the data in-band is risky. There is also the underlying premise that binpkgs can 'maintain VDB compatibility': I could make a binpkg, wait 2 years, then install it, and we would have to make sure that everything still works.

IMHO it's a pretty high cost to pay (tight coupling) for what, to me, is a nice-to-have feature.
 

> > Optimizing for easy recognition
> > -------------------------------
> > In order to make it possible for magic-based tools such as file(1) to
> > easily distinguish Gentoo binary packages from regular tarballs, we
> > could (ab)use the volume label field, e.g. use:
> >
> >   $ tar -V 'gpkg: app-foo/bar-1' -c ...
> >
> > This will add a volume label as the first file entry inside the tarball,
> > which does not affect extracting but can be trivially matched via magic
> > rules.
> >
> > Note: this is meant to be used as a method for fast binary package
> > recognition; I don't think we should reject (hand-modified) binary
> > packages that lack this label.
> >
> >
> > Optimizing for metadata reading/manipulation performance
> > --------------------------------------------------------
> > The main problem with using a single tarball for both metadata and data
> > is that normally you'd have to decompress everything to reliably unpack
> > metadata, and recompress everything to update it.  This problem can be
> > addressed by a few optimization tricks.
> >
>
> These performance goals seem a little bit ill defined.
>
> 1) Where are users reporting slowness in binpkg operations?
> 2) What is the cause of the slowness?

Those are optimizations to not cause slowness compared to the current
format.  Main use case is recreating package index which would require
rereading the metadata of all binary packages.

> Like I could easily see a potential user with many large binpkgs, and the
> current implementation causing them issues because
> they have to decompress and seek a bunch to read the metadata out of their
> 1.2GB binpkg. But I'm pretty sure this isn't most users.
>
>
> >
> > Firstly, all metadata files are packed to the archive before data files.
> >  With a slightly customized unpacker, we can stop decompressing as soon
> > as we're past metadata and avoid decompressing the whole archive.  This
> > will also make it possible to read metadata from remote files without
> > fetching far past the compressed metadata block.
> >
>
> So this seems to basically go against your goals of simple common tooling?

No.  My goal is to make it compatible with simple common tooling.  You
can still use the simple tooling to read/write them.  The optimized
tools are only needed to efficiently handle special use cases.
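
As a rough illustration of what such an optimized reader might look like: the sketch below streams a compressed tarball and stops at the first member outside the metadata prefix. It assumes the metadata members are packed first and share a known prefix; `read_metadata` and the sample layout are hypothetical.

```python
import io
import tarfile


def read_metadata(fileobj, prefix):
    """Return {name: bytes} for the leading members under prefix, then stop."""
    found = {}
    # "r|*" reads the archive as a forward-only stream. Once we break out of
    # the loop, the bulk of the remaining data is never decompressed (only
    # what the stream had already buffered).
    with tarfile.open(fileobj=fileobj, mode="r|*") as tar:
        for member in tar:
            if not member.name.startswith(prefix):
                break
            if member.isfile():
                found[member.name] = tar.extractfile(member).read()
    return found


def make_sample():
    """Build an in-memory .tar.gz with metadata first, then a large data file."""
    files = {
        "var/db/pkg/bar-1/USE": b"ssl\n",
        "usr/lib/big.bin": bytes(1 << 20),
    }
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w:gz") as tar:
        for name, payload in files.items():
            info = tarfile.TarInfo(name)
            info.size = len(payload)
            tar.addfile(info, io.BytesIO(payload))
    buf.seek(0)
    return buf


result = read_metadata(make_sample(), "var/db/pkg/bar-1/")
print(result)
```

The same archive remains readable with plain `tar -xf`; the streaming reader is only a shortcut for tools that need nothing but the metadata.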

> > Secondly, if we're up for some more tricks, we could technically split
> > the tarball into metadata and data blocks compressed separately.  This
> > will need a bit of archiver customization but it will make it possible
> > to decompress the metadata part without even touching compressed data,
> > and to replace it without recompressing data.
> >
> > What's important is that both tricks proposed maintain backwards
> > compatibility with regular compressed tarballs.  That is, the user will
> > still be able to extract it with regular archiving tools.
>
>
> So my recollection is that debian uses common format AR files for the main
> deb.
> Then they have 2 compressed tarballs, one for metadata, and one for data.
>
> This format seems to jibe with many of your requirements:
>
>  - 'ar' can retrieve individual files from the archive.
>  - The deb file itself is not compressed, but the tarballs inside *are*
> compressed.
>  - The metadata and data are compressed separately.
>  - Anyone can edit this with normal tooling (ar, tar)
>
> In short; why should we invent a new format?

Because nobody knows how to use 'ar', compared to how almost every
Gentoo user can use 'tar' immediately?  Of course we could alternatively
just use a nested tarball but I wanted to keep the possibility
of actually being able to 'tar -xf' it without having to extract nested
archives.

I think 'man ar' could help them pretty easily. That said, I'm not wedded to 'ar'; I'm just trying to show how this problem was solved in a similar domain.
 

--
Best regards,
Michał Górny

Re: [RFC] Improving Gentoo package format

Francesco Riosa-3
In reply to this post by Michał Górny-5


On Sun, 11 Nov 2018 at 09:29, Michał Górny <[hidden email]> wrote:
On Sat, 2018-11-10 at 09:37 -0500, Alec Warner wrote:
> On Sat, Nov 10, 2018 at 8:09 AM Michał Górny <[hidden email]> wrote:
[...]
> > My proposal
> > ===========
> >
> > Basic format
> > ------------
> > The base of the format is a regular compressed tarball.  There's no junk
> > appended to it but the metadata is stored inside it as
> > /var/db/pkg/${PF}.  The contents are as compatible with the actual vdb
> > format as possible.
> >
>
> Just to clarify, you are suggesting we store the metadata inside the
> contents of the binary package itself (e.g. where the other files that get
> merged to the liveFS are?) What about collisions?
>
> E.g. I install 'machine-images/gentoo-disk-image-1.2.3' on a machine that
> already has 'machine-images/gentoo-disk-image-1.2.3' installed, won't it
> overwrite files in the VDB at qmerge time?

Portage will obviously move the files out, and process them as metadata.
 The idea is to precisely use a directory that can't be normally part
of binary packages, so can't cause collisions with real files (even if
they're very unlikely to ever happen).

> > This has the following advantages:
> >
> > + Binary package is still stored as a single file.
> >
> > + It uses a standard compressed .tar format, with minimal customization.
> >
> > + The user can easily inspect and modify the packages with standard
> > tools (tar and the compressor).
> >
> > + If we can maintain reasonable level of vdb compatibility, the user can
> > even emergency-install a package without causing too much hassle (as it
> > will be recorded in vdb); ideally Portage would detect this vdb entry
> > and support fixing the install afterwards.
> >
>
> I'm not certain this is really desired.

Are you saying it's better that user emergency-installs a package
without recording it in vdb, and ends up with mess of collisions
and untracked files?

Just because you don't like some use case doesn't mean it's not gonna
happen.  Either you prepare for it and make the best of it, or you
pretend it's not gonna happen and cause extra pain to users.


Another option would be to install into a nearby but non-overlapping directory, for example:
/var/db/pkg/${PF}-binpkg

This way a user who knows what to do with that data can play with it, and portage could be instructed to stat() that directory and take action (halt, maybe?) if it is present.
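
A small sketch of the detection side of this idea, assuming the `-binpkg` suffix convention above and a flat vdb layout. The function name `find_pending_binpkg_dirs` is hypothetical, and a temporary directory stands in for /var/db/pkg:

```python
import os
import tempfile


def find_pending_binpkg_dirs(vdb_root):
    """Return the sorted names of vdb entries that end in '-binpkg'."""
    pending = []
    with os.scandir(vdb_root) as entries:
        for entry in entries:
            if entry.is_dir() and entry.name.endswith("-binpkg"):
                pending.append(entry.name)
    return sorted(pending)


with tempfile.TemporaryDirectory() as vdb_root:
    # One normally installed package and one manually untarred binpkg.
    os.mkdir(os.path.join(vdb_root, "bar-1"))
    os.mkdir(os.path.join(vdb_root, "foo-2-binpkg"))
    pending = find_pending_binpkg_dirs(vdb_root)

if pending:
    print("manually unpacked packages need attention:", ", ".join(pending))
```

Whether the package manager halts, warns, or tries to adopt such entries is a policy decision left open here.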
 

Re: [RFC] Improving Gentoo package format

Duncan-42
Francesco Riosa posted on Sun, 11 Nov 2018 17:05:37 +0100 as excerpted:

> On Sun, 11 Nov 2018 at 09:29, Michał Górny <[hidden email]> wrote:
>
>> On Sat, 2018-11-10 at 09:37 -0500, Alec Warner wrote:
>>> On Sat, Nov 10, 2018 at 8:09 AM Michał Górny <[hidden email]>
>>> wrote:
>> [...]
>>>> My proposal
>>>> ===========
>>>>
>>>> Basic format
>>>> ------------
>>>> The base of the format is a regular compressed tarball.
>>>> There's no junk appended to it but the metadata is stored
>>>> inside it as /var/db/pkg/${PF}.  The contents are as compatible
>>>> with the actual vdb format as possible.
>>>>
>>>>
>>> Just to clarify, you are suggesting we store the metadata inside
>>> the contents of the binary package itself (e.g. where the other
>>> files that get merged to the liveFS are?) What about collisions?
>>>
>>> E.g. I install 'machine-images/gentoo-disk-image-1.2.3' on a machine
>>> that already has 'machine-images/gentoo-disk-image-1.2.3' installed,
>>> won't it overwrite files in the VDB at qmerge time?
>>
>> Portage will obviously move the files out, and process them as
>> metadata.

>> The idea is to precisely use a directory that can't be normally part
>> of binary packages, so can't cause collisions with real files (even if
>> they're very unlikely to ever happen).
>>
>>>> This has the following advantages:
>>>>
>>>> + Binary package is still stored as a single file.

Breaking these down into RFC style MUST/SHOULD/MAY levels (as already
suggested elsewhere), for me, this is...

SHOULD/MAY

(Would be a MAY, nice to have, but the existing solution has it, thus
arguably raising the priority to SHOULD.)

>>>> + It uses a standard compressed .tar format, with minimal
>>>> customization.

MUST

(Losing the existing functionality here would be horrible.  FWIW I
routinely use binpkgs as a reference, for "clean" config files, comparing
install trees of old and new versions, etc.  Having tools that allow
browsing standard compressed tar archives as virtual extensions to the
filesystem makes that a breeze. =:^)

>>>> + The user can easily inspect and modify the packages with standard
>>>> tools (tar and the compressor).

MUST

(As pointed out, portage itself already does this when doing binpkg
moves, etc.  Losing that would be horrible!)

>>>> + If we can maintain reasonable level of vdb compatibility, the
>>>> user can even emergency-install a package without causing too much
>>>> hassle (as it will be recorded in vdb); ideally Portage would
>>>> detect this vdb entry and support fixing the install afterwards.
>>>>
>>>>
>>> I'm not certain this is really desired.

SHOULD/MAY

(I'd say SHOULD, but while possible to emergency-install via untarring
now, portage doesn't do anything with it at all, so the detect and fix
functionality is a bonus, thus arguably lowering it to a MAY.)

>> Are you saying it's better that user emergency-installs a package
>> without recording it in vdb, and ends up with mess of collisions and
>> untracked files?
>>
>> Just because you don't like some use case doesn't mean it's not gonna
>> happen.  Either you prepare for it and make the best of it, or you
>> pretend it's not gonna happen and cause extra pain to users.

I think I've had to do this twice in ~1.5 decades, plus once reaching
into the tarball to extract a single file that was broken in a newly
installed glibc, breaking portage (and much of the system, but bunzip
still worked!) so I couldn't undo it using portage.

The first time I didn't know enough to clean up manually, but the second
time (and the reach-in time) I did.  It'd *definitely* be nice to have
portage be able to clean up automatically.

> Another option would be to install in a near but not overlapping
> directory,
> example:
> /var/db/pkg/${PF}-binpkg
>
> this way the user that knows what to do with that data can play with it,
> also portage could be instructed to stat() that directory and take
> action (halt maybe?) if present.

Idea ++

Detect and fix has already been proposed, but detect and halt with an
error and a pointer to manual fix instructions is arguably already better
than current.

Which suggests an easy implementation split, delaying the "fix" step
until later, if it would complicate the initial implementation too much.

[Bikeshed]  I was thinking binpkg-${PF} to emphasize the binpkg part and
group any emergency-installed packages together in an alphabetical
listing.  But whichever's easiest for portage to work with, which
probably makes the -binpkg suffix version a better choice, requiring less
modification to existing code.


Is there any interest at all in binpkgs, perhaps when improved, from the
other PMs?  Or are they effectively dead now or not interested in binpkgs
even if the format were to be improved, or simply too hard to work with?  
Because "it'd be nice" (aka MAY level) to have this formally standardized
to PMS... if there's any interest from the other PMs.

--
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman



Re: [RFC] Improving Gentoo package format

M. J. Everitt
On 11/11/18 18:15, Duncan wrote [excerpted]:
> Is there any interest at all in binpkgs, perhaps when improved, from the
> other PMs?  Or are they effectively dead now or not interested in binpkgs
> even if the format were to be improved, or simply too hard to work with?  
> Because "it'd be nice" (aka MAY level) to have this formally standardized
> to PMS... if there's any interest from the other PMs.
>
Binpkgs are an important part of catalyst/releng stage-building runs, as it
allows portage to 'cache' a lot of the packages needed/used.

Binpkgs are also a popular component of a few downstream distros based on
Gentoo (thinking pentoo right now as an easy example).

So we don't want to break existing users of this format without considering
the ramifications for these scenarios, as you'll have some very grumpy devs...



Re: [RFC] Improving Gentoo package format

Rich Freeman
On Sun, Nov 11, 2018 at 12:31 PM M. J. Everitt <[hidden email]> wrote:
>
> Binpkgs are also a popular component of a few downstream distros based on
> Gentoo (thinking pentoo right now as an easy example).
>
> So we don't want to break existing users of this format without considering
> the ramifications for these scenarios, as you'll have some very grumpy devs...
>

I'd argue that they'd be more important for Gentoo if they were more
useful.  IMO the main limitation with them is the inability to
auto-download them from a repository, detecting the binpkg USE flags
BEFORE downloading.  This is why I suggested hashing the USE flags or
similar and sticking that in the filename.

Obviously you can't host a repository with all the USE combinations.
However, you could have a reference repo and the package manager could
check it before doing a build.  If you get a hit then you can install
the binpkg.  If you don't then you can do a source build.

Portage already checks the USE flags inside the binpkg before merging
it and by default doesn't use a non-matching binpkg.  The problem with
the current approach is:
1.  You have to download the package to check this (could be a big file).
2.  You can't host multiple versions of a binpkg with different USE
flags since the filenames collide.

I suggested a content hash because you can use it for an arbitrary
amount of metadata, vs having to cram arch/USE/multilib and I'm sure
something I'm missing into a filename.  Make the hash as short as is
economical - it isn't like we have THAT many permutations, the PM can
still check the internal metadata, and this isn't a security feature.
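
A minimal sketch of how such a filename could be derived, assuming the hashed metadata is a mapping of keys (USE, ARCH, and so on) to lists of values. The function `binpkg_filename`, the `.tbz2` suffix, the 8-character digest length, and the choice of SHA-256 are all illustrative assumptions; any stable hash would do, since this is not a security feature:

```python
import hashlib


def binpkg_filename(pf, metadata, digest_chars=8, suffix=".tbz2"):
    """Build '<PF>-<short hash><suffix>' from the selected metadata."""
    # Sort keys and values so the same configuration always hashes the same,
    # regardless of the order in which flags were listed.
    canonical = "\n".join(
        f"{key}={' '.join(sorted(metadata[key]))}" for key in sorted(metadata)
    )
    digest = hashlib.sha256(canonical.encode("utf-8")).hexdigest()
    return f"{pf}-{digest[:digest_chars]}{suffix}"


a = binpkg_filename("bar-1", {"ARCH": ["amd64"], "USE": ["ssl", "ipv6"]})
b = binpkg_filename("bar-1", {"USE": ["ipv6", "ssl"], "ARCH": ["amd64"]})
c = binpkg_filename("bar-1", {"ARCH": ["amd64"], "USE": ["ssl"]})
print(a)
print(c)
```

Here `a` and `b` describe the same configuration and get the same name, while `c` differs in USE and lands in a different file, so both builds can be hosted side by side.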

--
Rich


Re: [RFC] Improving Gentoo package format

M. J. Everitt
On 11/11/18 18:41, Rich Freeman wrote:

> On Sun, Nov 11, 2018 at 12:31 PM M. J. Everitt <[hidden email]> wrote:
>> Binpkgs are also a popular component of a few downstream distros based on
>> Gentoo (thinking pentoo right now as an easy example).
>>
>> So we don't want to break existing users of this format without considering
>> the ramifications for these scenarios, as you'll have some very grumpy devs...
>>
> I'd argue that they'd be more important for Gentoo if they were more
> useful.  IMO the main limitation with them is the inability to
> auto-download them from a repository, detecting the binpkg USE flags
> BEFORE downloading.  This is why I suggested hashing the USE flags or
> similar and sticking that in the filename.
>
> Obviously you can't host a repository with all the USE combinations.
> However, you could have a reference repo and the package manager could
> check it before doing a build.  If you get a hit then you can install
> the binpkg.  If you don't then you can do a source build.
>
> Portage already checks the USE flags inside the binpkg before merging
> it and by default doesn't use a non-matching binpkg.  The problem with
> the current approach is:
> 1.  You have to download the package to check this (could be a big file).
> 2.  You can't host multiple versions of a binpkg with different USE
> flags since the filenames collide.
>
> I suggested a content hash because you can use it for an arbitrary
> amount of metadata, vs having to cram arch/USE/multilib and I'm sure
> something I'm missing into a filename.  Make the hash as short as is
> economical - it isn't like we have THAT many permutations, the PM can
> still check the internal metadata, and this isn't a security feature.
>
If you can really present a decent argument for replicating the
functionality of other distros like Debian, Arch, Ubuntu etc. then let's
hear it. For now, the strength of Gentoo is being able to fully customise a
system to your own requirements, not being trapped by some distro
maintainer's arbitrary choices. Play to your USPs and strengths rather
than chasing rainbows...



Re: [RFC] Improving Gentoo package format

Rich Freeman
On Sun, Nov 11, 2018 at 1:02 PM M. J. Everitt <[hidden email]> wrote:
>
> If you can really present a decent argument for replicating the
> functionality of other distros like Debian, Arch, Ubuntu etc. then let's
> hear it. For now, the strength of Gentoo is being able to fully customise a
> system to your own requirements, not being trapped by some distro
> maintainer's arbitrary choices. Play to your USPs and strengths rather
> than chasing rainbows...
>

Why do we support binary packages at all?  Simple: compiling packages
is expensive, and if you happen to already have them compiled, fully
customized to your own requirements, then there is no point in
recompiling them.  You're just spending a ton of resources to build
the exact same files you already have.

The only change I'm suggesting is that portage could take all the
configuration you're already supplying, and then optionally go see if
somebody you trust has already built the package that meets your
requirements.  If so, then it would be downloaded and installed,
otherwise it would just compile from source.

You get the exact same files installed on your system either way.

--
Rich


Re: [RFC] Improving Gentoo package format

M. J. Everitt
On 11/11/18 19:20, Rich Freeman wrote:

> On Sun, Nov 11, 2018 at 1:02 PM M. J. Everitt <[hidden email]> wrote:
>> If you can really present a decent argument for replicating the
>> functionality of other distros like Debian, Arch, Ubuntu etc. then let's
>> hear it. For now, the strength of Gentoo is being able to fully customise a
>> system to your own requirements, not being trapped by some distro
>> maintainer's arbitrary choices. Play to your USPs and strengths rather
>> than chasing rainbows...
>>
> Why do we support binary packages at all?  Simple: compiling packages
> is expensive, and if you happen to already have them compiled, fully
> customized to your own requirements, then there is no point in
> recompiling them.  You're just spending a ton of resources to build
> the exact same files you already have.
>
> The only change I'm suggesting is that portage could take all the
> configuration you're already supplying, and then optionally go see if
> somebody you trust has already built the package that meets your
> requirements.  If so, then it would be downloaded and installed,
> otherwise it would just compile from source.
>
> You get the exact same files installed on your system either way.
>
Ok so I get the principle, but who's gonna provide the tools to make this
feasible, and perhaps more interestingly, who's going to curate, provide,
host and maintain the binpkg repos you propose? We barely have enough
developers to maintain a working source package repository, let alone
adding new distro "features" .. unless perhaps you have a few hours every
week to spare?

I see no sense in reinventing the wheel here, besides #thegentooway....



Re: [RFC] Improving Gentoo package format

Alec Warner-2
In reply to this post by Rich Freeman
On Sun, Nov 11, 2018 at 2:21 PM Rich Freeman <[hidden email]> wrote:
On Sun, Nov 11, 2018 at 1:02 PM M. J. Everitt <[hidden email]> wrote:
>
> If you can really present a decent argument for replicating the
> functionality of other distros like Debian, Arch, Ubuntu etc. then let's
> hear it. For now, the strength of Gentoo is being able to fully customise a
> system to your own requirements, not being trapped by some distro
> maintainer's arbitrary choices. Play to your USPs and strengths rather
> than chasing rainbows...
>

Why do we support binary packages at all?  Simple: compiling packages
is expensive, and if you happen to already have them compiled, fully
customized to your own requirements, then there is no point in
recompiling them.  You're just spending a ton of resources to build
the exact same files you already have.

The only change I'm suggesting is that portage could take all the
configuration you're already supplying, and then optionally go see if
somebody you trust has already built the package that meets your
requirements.  If so, then it would be downloaded and installed,
otherwise it would just compile from source.

You get the exact same files installed on your system either way.

I think this conversation is a bit off track. I'm not saying this isn't a great idea, but I think it's very orthogonal to the binpkg format itself.

For example, the binhost pkg index file can contain this metadata and portage can be designed to fetch the binpkg index metadata and do matching (afaik it already does this; it just needs extending with more metadata.) The binpkg format itself seems not too relevant to this.

-A
 

--
Rich


[RFC] gpkg format proposal v2 (was: Re: [gentoo-portage-dev] [RFC] Improving Gentoo package format)

Michał Górny-5
In reply to this post by Michał Górny-5
Hi,

Ok, here's the second version integrating the feedback received.
The format is much simpler, based on nested tarballs inspired by Debian.

The outer tarball is uncompressed and uses '.gpkg.tar' suffix.  It
contains (preferably in order but PM should also handle packages with
mismatched order):

1. Optional (but recommended) "gpkg: ${PF}" package label that can be
used to quickly distinguish Gentoo binpkgs from regular tarballs
(for file(1)).

2. "metadata.tar${comp}" tarball containing binary package metadata
as files.

3. Optional "metadata.tar${comp}.sig" containing detached signature
for the metadata archive.

4. "contents.tar${comp}" tarball containing files to be installed.

5. Optional "contents.tar${comp}.sig" containing detached signature for
the contents archive.

Notes:

a. ${comp} can be any compression format supported by binary packages.
Technically, metadata and content archives may use different
compression.  Either or both may be uncompressed as well.

b. While signatures are optional, the PM should have a switch
controlling whether to expect them, and fail hard if they're not present
when expected.
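
The layout above can be sketched with Python's standard `tarfile` module. This is only an illustration under assumptions: the helper names `make_tar` and `build_gpkg` are made up, the metadata file names (USE, SLOT) are examples, signatures are left out, and the metadata archive is gzip-compressed while the contents archive is left uncompressed to show that the two may differ (note a).

```python
import io
import tarfile


def make_tar(files, mode):
    """Pack {name: bytes} into an in-memory tarball and return its bytes."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode=mode) as tar:
        for name, payload in files.items():
            info = tarfile.TarInfo(name)
            info.size = len(payload)
            tar.addfile(info, io.BytesIO(payload))
    return buf.getvalue()


def build_gpkg(pf, metadata, contents):
    """Assemble the outer, uncompressed tarball in the recommended order."""
    inner = [
        ("metadata.tar.gz", make_tar(metadata, "w:gz")),  # compressed
        ("contents.tar", make_tar(contents, "w")),        # left uncompressed
    ]
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w", format=tarfile.GNU_FORMAT) as outer:
        # Volume-label entry (type "V"), the kind GNU tar writes for -V.
        label = tarfile.TarInfo(f"gpkg: {pf}")
        label.type = b"V"
        outer.addfile(label)
        for name, blob in inner:
            info = tarfile.TarInfo(name)
            info.size = len(blob)
            outer.addfile(info, io.BytesIO(blob))
    return buf.getvalue()


gpkg = build_gpkg(
    "bar-1",
    metadata={"USE": b"ssl\n", "SLOT": b"0\n"},
    contents={"usr/bin/bar": b"#!/bin/sh\n"},
)

# Read it back: list the outer members, then open only the metadata archive.
with tarfile.open(fileobj=io.BytesIO(gpkg), mode="r:") as outer:
    names = outer.getnames()
    first_type = outer.getmembers()[0].type
    meta_blob = outer.extractfile("metadata.tar.gz").read()

with tarfile.open(fileobj=io.BytesIO(meta_blob), mode="r:gz") as meta:
    use_flags = meta.extractfile("USE").read()

print(names)
```

Because the metadata and contents are separate members of the outer tarball, the metadata archive can be replaced without recompressing the contents.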


Advantages
----------
Guaranteed:

+ The binary package is still one file, so can be fetched easily.

+ File format is trivial and can be extracted using tar(1) + compressor.

+ The metadata and contents are compressed independently, and so can be
easily extracted or modified independently.

+ The package format provides for separate metadata and content
signatures, so they can be verified independently.

+ Metadata can be compressed now.

Achieved by regular archives (but might be broken if modified by user):

+ Easy recognition by magic(1).

+ The metadata archive (and its signature) is packed first, so it may be
read without fetching the whole binpkg.
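
To illustrate the last point, here is a small Python sketch that reads the metadata member from a non-seekable stream and stops there. CountingReader and read_metadata are hypothetical names; the sketch assumes the metadata archive really is packed before the contents, and a real client would read from an HTTP or FTP response instead of a local buffer.

```python
import tarfile


class CountingReader:
    """Wrap a file-like object and count how many bytes were read from it."""

    def __init__(self, raw):
        self.raw = raw
        self.bytes_read = 0

    def read(self, size=-1):
        data = self.raw.read(size)
        self.bytes_read += len(data)
        return data


def read_metadata(stream):
    """Return (member name, raw bytes) of the metadata archive.

    The outer tarball is opened in streaming mode ("r|"), so nothing
    past the metadata member (plus some read-ahead) is consumed.
    """
    with tarfile.open(fileobj=stream, mode="r|") as outer:
        for member in outer:
            name = member.name
            if name.startswith("metadata.tar") and not name.endswith(".sig"):
                return name, outer.extractfile(member).read()
    raise ValueError("no metadata archive found")
```

After read_metadata() returns, the bytes_read counter of the wrapper shows how much of the package had to be transferred; tarfile reads ahead in fixed-size chunks, so the figure is somewhat larger than the metadata archive itself.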


Why not .ar format?
-------------------
The use of .ar format has been proposed, akin to Debian.  While
the option is mostly feasible, and the simplicity of .ar format would
reduce the outer size of binary packages, I think the format is simply
too obscure.  It lives mostly as static library format, and the tooling
for it is part of binutils.  LSB considers it deprecated.  While I don't
see it going away anytime soon, I'd rather not rely on it in order to
save a few KiB.


Is there anything left to address?

--
Best regards,
Michał Górny


Re: [RFC] gpkg format proposal v2 (was: Re: [gentoo-portage-dev] [RFC] Improving Gentoo package format)

Michał Górny-5
On Sun, 2018-11-11 at 21:53 +0100, Michał Górny wrote:

> Hi,
>
> Ok, here's the second version integrating the feedback received.
> The format is much simpler, based on nested tarballs inspired by Debian.
>
> The outer tarball is uncompressed and uses '.gpkg.tar' suffix.  It
> contains (preferably in order but PM should also handle packages with
> mismatched order):
>
> 1. Optional (but recommended) "gpkg: ${PF}" package label that can be
> used to quickly distinguish Gentoo binpkgs from regular tarballs
> (for file(1)).
>
> 2. "metadata.tar${comp}" tarball containing binary package metadata
> as files.
>
> 3. Optional "metadata.tar${comp}.sig" containing detached signature
> for the metadata archive.
>
> 4. "contents.tar${comp}" tarball containing files to be installed.
>
> 5. Optional "contents.tar${comp}.sig" containing detached signature for
> the contents archive.
>
> Notes:
>
> a. ${comp} can be any compression format supported by binary packages.
> Technically, metadata and content archives may use different
> compression.  Either or both may be uncompressed as well.
>
> b. While signatures are optional, the PM should have a switch
> controlling whether to expect them, and fail hard if they're not present
> when expected.
>
>
> Advantages
> ----------
> Guaranteed:
>
> + The binary package is still one file, so can be fetched easily.
>
> + File format is trivial and can be extracted using tar(1) + compressor.
>
> + The metadata and contents are compressed independently, and so can be
> easily extracted or modified independently.
>
> + The package format provides for separate metadata and content
> signatures, so they can be verified independently.
>
> + Metadata can be compressed now.
>
> Achieved by regular archives (but might be broken if modified by user):
>
> + Easy recognition by magic(1).
>
> + The metadata archive (and its signature) is packed first, so it may be
> read without fetching the whole binpkg.
>
>
> Why not .ar format?
> -------------------
> The use of .ar format has been proposed, akin to Debian.  While
> the option is mostly feasible, and the simplicity of .ar format would
> reduce the outer size of binary packages, I think the format is simply
> too obscure.  It lives mostly as static library format, and the tooling
> for it is part of binutils.  LSB considers it deprecated.  While I don't
> see it going away anytime soon, I'd rather not rely on it in order to
> save a few KiB.
>
>
> Is there anything left to address?
>
Hmm, I've missed one disadvantage compared to xpak and v1: at least with
the standard tools, we can't build the binary package on the fly without
creating temporary archives (and therefore duplicating disk space use).

In other words, xpak and v1 formats made it possible to tar
the installation image straight to the new package.

The v2 format requires creating "contents.tar${comp}" first, and then
creating the actual binary package with it.  I don't think we can avoid
this without creating a custom .tar writing tool that supports adding
data on-the-fly (e.g. by writing the file data, then seeking back to
update the size record).
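
For what it's worth, the seek-back idea can be sketched in a few lines of Python. This is only an illustration under some assumptions: the output must be seekable, the member name must fit in a plain ustar header (so the header is exactly one 512-byte block), and the function names are made up.

```python
import tarfile

BLOCK = 512  # tar block size


def append_streamed_member(out, name, chunks):
    """Write one tar member to seekable `out` without knowing its size up front.

    A placeholder header is written first, the data is streamed after it,
    and the real header is written over the placeholder once the size is known.
    """
    header_pos = out.tell()
    out.write(b"\0" * BLOCK)            # placeholder for the header
    size = 0
    for chunk in chunks:
        out.write(chunk)
        size += len(chunk)
    out.write(b"\0" * (-size % BLOCK))  # pad data to a block boundary
    end_pos = out.tell()

    info = tarfile.TarInfo(name)
    info.size = size
    out.seek(header_pos)
    out.write(info.tobuf(format=tarfile.USTAR_FORMAT))
    out.seek(end_pos)


def finish_archive(out):
    """Write the two zero blocks that mark the end of a tar archive."""
    out.write(b"\0" * (2 * BLOCK))
```

The ustar size field limits a member written this way to less than 8 GiB; anything larger would need pax or GNU extensions, whose extra header size is not known in advance either.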

Of course, one option would be to use ZIP ;-).

--
Best regards,
Michał Górny


Re: [RFC] gpkg format proposal v2 (was: Re: [gentoo-portage-dev] [RFC] Improving Gentoo package format)

Francesco Riosa-3


On Sun, 11 Nov 2018 at 22:17, Michał Górny <[hidden email]> wrote:
> On Sun, 2018-11-11 at 21:53 +0100, Michał Górny wrote:
> [...-]
> Of course, one option would be to use ZIP ;-).

Zip archives have another big advantage: there is an index of files, so listing the archive contents and extracting a single file is very fast and does not depend on its position in the archive.
The big disadvantage is that only the "desktop" profile has unzip by default.

Best regards,
-Francesco 

Re: [RFC] gpkg format proposal v2 (was: Re: [gentoo-portage-dev] [RFC] Improving Gentoo package format)

Michał Górny-5
On Mon, 2018-11-12 at 01:21 +0100, Francesco Riosa wrote:
> On Sun, 11 Nov 2018 at 22:17, Michał Górny <[hidden email]>
> wrote:
>
> > On Sun, 2018-11-11 at 21:53 +0100, Michał Górny wrote:
> > [...-]
> > Of course, one option would be to use ZIP ;-).

I wasn't serious there.

> Zip archives have another big advantage: there is an index of files, so
> listing the archive contents and extracting a single file is very fast and
> does not depend on its position in the archive.
> The big disadvantage is that only the "desktop" profile has unzip by default.
>

The two main problems with ZIP are:

1. As you noted, it's not present in core system packages.

2. It uses a trailer format, which means that you need to fetch the whole
file before being able to process it.
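
The trailer is easy to see with Python's zipfile module. The snippet below is just a sketch: the member names are made up, and it assumes a small archive with no archive comment and no ZIP64 records, in which case the end-of-central-directory record is the last 22 bytes of the file.

```python
import io
import zipfile

# Signature of the ZIP end-of-central-directory (EOCD) record.
EOCD_SIGNATURE = b"PK\x05\x06"


def make_sample_zip():
    """Build a small ZIP archive in memory and return its bytes."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
        zf.writestr("metadata/CATEGORY", "app-misc\n")
        zf.writestr("image/usr/bin/hello", b"\0" * 4096)
    return buf.getvalue()


def eocd_offset(data):
    """Return the offset of the last EOCD signature in the archive bytes."""
    return data.rfind(EOCD_SIGNATURE)


data = make_sample_zip()
# The index that readers need first sits at the very end of the file.
print(eocd_offset(data), len(data))
```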

There was also some patent hassle back in the day but I think it's no
longer applicable today.

--
Best regards,
Michał Górny


Re: [RFC] gpkg format proposal v2 (was: Re: [gentoo-portage-dev] [RFC] Improving Gentoo package format)

Fabian Groffen-2
In reply to this post by Michał Górny-5
On 11-11-2018 21:53:33 +0100, Michał Górny wrote:

> Hi,
>
> Ok, here's the second version integrating the feedback received.
> The format is much simpler, based on nested tarballs inspired by Debian.
>
> The outer tarball is uncompressed and uses '.gpkg.tar' suffix.  It
> contains (preferably in order but PM should also handle packages with
> mismatched order):
>
> 1. Optional (but recommended) "gpkg: ${PF}" package label that can be
> used to quickly distinguish Gentoo binpkgs from regular tarballs
> (for file(1)).
>
> 2. "metadata.tar${comp}" tarball containing binary package metadata
> as files.
>
> 3. Optional "metadata.tar${comp}.sig" containing detached signature
> for the metadata archive.
>
> 4. "contents.tar${comp}" tarball containing files to be installed.
>
> 5. Optional "contents.tar${comp}.sig" containing detached signature for
> the contents archive.
>
> Notes:
>
> a. ${comp} can be any compression format supported by binary packages.
> Technically, metadata and content archives may use different
> compression.  Either or both may be uncompressed as well.

I'm wondering here, how much sense does it make to compress 2., 3.
and/or 4. if you compress the whole gpkg?  I have the impression
compression on compression isn't beneficial here.  Shouldn't
compressing just the gpkg tar be sufficient?

As to allowing different compressors for a single gpkg, I think it would
be better to require all compressors to be the same, such that a PM or
tool can quickly see if it can "read" the file from the gpkg filename,
instead of having to fetch and open it first.  Obviously, if you drop
compression of the inner tars, this point goes away.

Thanks,
Fabian



--
Fabian Groffen
Gentoo on a different level


Re: [RFC] gpkg format proposal v2 (was: Re: [gentoo-portage-dev] [RFC] Improving Gentoo package format)

Michał Górny-5
On Mon, 2018-11-12 at 17:51 +0100, Fabian Groffen wrote:

> On 11-11-2018 21:53:33 +0100, Michał Górny wrote:
> > Hi,
> >
> > Ok, here's the second version integrating the feedback received.
> > The format is much simpler, based on nested tarballs inspired by Debian.
> >
> > The outer tarball is uncompressed and uses '.gpkg.tar' suffix.  It
> > contains (preferably in order but PM should also handle packages with
> > mismatched order):
> >
> > 1. Optional (but recommended) "gpkg: ${PF}" package label that can be
> > used to quickly distinguish Gentoo binpkgs from regular tarballs
> > (for file(1)).
> >
> > 2. "metadata.tar${comp}" tarball containing binary package metadata
> > as files.
> >
> > 3. Optional "metadata.tar${comp}.sig" containing detached signature
> > for the metadata archive.
> >
> > 4. "contents.tar${comp}" tarball containing files to be installed.
> >
> > 5. Optional "contents.tar${comp}.sig" containing detached signature for
> > the contents archive.
> >
> > Notes:
> >
> > a. ${comp} can be any compression format supported by binary packages.
> > Technically, metadata and content archives may use different
> > compression.  Either or both may be uncompressed as well.
>
> I'm wondering here, how much sense does it make to compress 2., 3.
> and/or 4. if you compress the whole gpkg?  I have the impression
> compression on compression isn't beneficial here.  Shouldn't just
> compressing of the gpkg tar be sufficient?
>
Please read the spec again.  It explicitly says it's not compressed.

--
Best regards,
Michał Górny
