Speeding up Tree Verification


Speeding up Tree Verification

Sid Spry
Hello,

I have some runnable pseudocode outlining a faster tree verification algorithm.
Before I create patches I'd like to see if there is any guidance on making the
changes as unobtrusive as possible. If such a radical change in algorithm is
acceptable, I can work on turning it into patches.

Instead of composing structured data out of the portage tree, my algorithm
just lists all files and then optionally batches them out to threads. Eliding
the tree traversal operations gives a noticeable speedup, which can be seen by
running the algorithm with a single thread and comparing it to the current
algorithm in gemato (which I assume is still on topic here?).

Some simple tests, like counting all objects traversed and verified, return
the same(ish) results. Once it is put into portage it could be tested in
detail.

There is also my partial attempt at removing the brittle interface to GnuPG
(it's not that the current code is badly designed, just that parsing the
output of GnuPG directly is likely not the best idea).

This needs gemato, dnspython, and requests. It is slightly better than random
code because I took inspiration from the existing gemato classes.
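
As a rough illustration of the batching step (not in the code below; the
function name and worker count are hypothetical), the flat record list could
be handed to a thread pool like this:

```python
from concurrent.futures import ThreadPoolExecutor

def verify_all(records, workers=4):
    """Hash-check a flat list of ManifestRecord-like objects in parallel.

    A sketch only: assumes each record has a verify() method returning
    True/False/None, as in veriftree.py below.
    """
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() preserves input order, so results line up with records.
        return list(pool.map(lambda r: r.verify(), records))
```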

```python (veriftree.py)
#!/usr/bin/env python3
import os, sys, gzip, hashlib, timeit
from typing import Optional
from pprint import pprint

from gemato.manifest import (
    ManifestFile,
    ManifestFileEntry,
)
from wkd import (
    check_domain_signature,
    hash_localpart,
    build_web_key_uri,
    stream_to_file
)
from fetchmedia import (
    OpenPGPEnvironment,
    setup_verification_environment
)

# 0. Top level directory (repository) contains Manifest, a PGP signature of
#    blake2b and sha512 hashes of Manifest.files.gz.
# 1. Manifest.files contains hashes of each category Manifest.gz.
# 2. The category Manifest contains hashes of each package Manifest.
# 3. The package Manifest contains hashes of each package file.
#    Must be aware of PMS, e.g. aux tag specifies a file in files/.
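#
# For reference, a typical Manifest2 entry looks like this (hash values
# truncated; the filename is a made-up example):
#   DIST foo-1.0.tar.gz 12345 BLAKE2B b2e985... SHA512 31bca0...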

# 0. Check signature of repo Manifest.
# 1. Merge items in Manifest.files, each category Manifest, and each package
#    Manifest into one big list. The path must be made absolute.
# 2. Distribute items to threads.

# To check operation compare directory tree to files appearing in all
# ManifestRecords.

class ManifestTree(object):
    __slots__ = ['_directory', '_manifest_list', '_manifest_records',
        '_manifest_results']

    def __init__(self, directory: str):
        self._directory = directory
        # Tuples of (base_path, full_path).
        self._manifest_list = []
        self._manifest_records = []
        self._manifest_results = []

    def build_manifest_list(self):
        for path, dirs, files in os.walk(self._directory):
            # TODO: optionally skip metadata/ (and glsa/, news/). The
            # metadata directory contains a nested repository whose Manifest
            # holds only a signature, which the parsing below chokes on.

            if 'Manifest.files.gz' in files:
                self._manifest_list += [(path, path + '/Manifest.files.gz')]
            if 'Manifest.gz' in files:
                self._manifest_list += [(path, path + '/Manifest.gz')]

            if path == self._directory:
                continue # Skip the repo manifest. Order matters, fix eventually.
            if 'Manifest' in files:
                self._manifest_list += [(path, path + '/Manifest')]

    def parse_manifests(self):
        for base, full in self._manifest_list:
            # Manifest.files.gz and Manifest.gz decompress the same way;
            # gzip.open avoids shelling out to gunzip via a temp directory.
            opener = gzip.open if full.endswith('.gz') else open
            with opener(full, 'rt') as fh:
                for line in fh:
                    if line.startswith('-'):
                        break # Skip the PGP-armored repo Manifest.
                    mr = ManifestRecord(line)
                    mr.make_absolute(base)
                    self._manifest_records += [mr]

    def verify_manifests(self):
        for record in self._manifest_records:
            self._manifest_results += [record.verify()]


class ManifestRecord(object):
    __slots__ = ['_tag', '_abs_path', '_path', '_size', '_hashes']

    def __init__(self, line: str = None):
        self._tag = None
        self._abs_path = None
        self._path = None
        self._size = None
        self._hashes = {}
        if line:
            self.from_string(line)

    def from_string(self, line: str) -> None:
        parts = line.split()
        if len(parts) == 2:
            self._tag = 'ignore' # e.g. TIMESTAMP lines.
            return
        self._tag = parts[0]
        self._path = parts[1]
        self._size = parts[2]
        # Hashes arrive as alternating name/value pairs (e.g. BLAKE2B <hex>
        # SHA512 <hex>); keying by name avoids relying on their order.
        self._hashes = dict(zip(parts[3::2], parts[4::2]))

    def make_absolute(self, abs_path: str) -> None:
        self._abs_path = abs_path

    def verify(self) -> Optional[bool]:
        if self._tag == 'ignore':
            return None

        # Resolve the absolute path locally so verify() stays idempotent.
        # Per PMS, the AUX tag refers to a file under files/.
        if self._tag.lower() == 'aux':
            path = self._abs_path + '/files/' + self._path
        elif self._abs_path:
            path = self._abs_path + '/' + self._path
        else:
            path = self._path

        # Distfiles will not be present.
        if self._tag.lower() == 'dist':
            return None

        if not os.path.exists(path):
            return False

        sha512 = hashlib.sha512()
        blake2b = hashlib.blake2b()
        with open(path, 'rb') as fd:
            while True:
                d = fd.read(8192)
                if not d:
                    break
                sha512.update(d)
                blake2b.update(d)

        if blake2b.hexdigest() != self._hashes.get('BLAKE2B'):
            return False
        if sha512.hexdigest() != self._hashes.get('SHA512'):
            return False
        return True

    def __repr__(self) -> str:
        #return repr(self._hashes)
        return '\t'.join([self._tag, self._size, self._path])

def main() -> int:
    # Step 0: verify the repo manifest.
    #publishers = ['[hidden email]']
    #ev = setup_verification_environment(publishers)
    #mf = ManifestFile()
    #mf.load(open('/var/db/repos/gentoo/Manifest'),
    #    verify_openpgp=True, openpgp_env=ev)
    #pprint(mf)
    #pprint(mf.openpgp_signed)
    #pprint(mf.openpgp_signature)

    # Step 1: merge manifests.
    #mt = ManifestTree('/var/db/repos/gentoo')
    #mt.build_manifest_list()
    #mt.parse_manifests()
    #mt.verify_manifests()

    glsa = ManifestTree('/var/db/repos/gentoo')
    glsa.build_manifest_list()
    glsa.parse_manifests()

    start = timeit.default_timer()
    glsa.verify_manifests()
    end = timeit.default_timer()
    pprint(end - start)

    # 'ignore' and DIST records are handled by checking for None.
    #no_ignore = filter(lambda x: x._tag != 'ignore', glsa._manifest_results)

    #pprint(glsa._manifest_results)
    real_files = [x for x in glsa._manifest_results if x is not None]
    #pprint(real_files)
    pprint(len(glsa._manifest_results))
    pprint(len(real_files))

    # Placeholder: walk the tree and diff against the manifest records to
    # detect untracked files.
    all_files = []
    for path, dirs, files in os.walk('/var/db/repos/gentoo'):
        pass

    return 0

if __name__ == '__main__':
    sys.exit(main())
```

```python (wkd.py, likely unneeded but I didn't want to redo these files yet)
#!/usr/bin/env python3
import hashlib
import dns
from dns import (
    name, query, dnssec,
    message, resolver, rdatatype
)
import shutil, requests
from pprint import pprint

def check_domain_signature(domain: str) -> bool:
    response = dns.resolver.query(domain, dns.rdatatype.NS)
    nsname = response.rrset[0]
    response = dns.resolver.query(str(nsname), dns.rdatatype.A)
    nsaddr = response.rrset[0].to_text()
   
    # DNSKEY
    request = dns.message.make_query(domain,
        dns.rdatatype.DNSKEY, want_dnssec=True)
    response = dns.query.udp(request, nsaddr)
    if response.rcode() != 0:
        raise Exception('Unable to request dnskey.')

    answer = response.answer
    if len(answer) != 2:
        raise Exception('Malformed answer to dnskey query.')

    name = dns.name.from_text(domain)
    try:
        dns.dnssec.validate(answer[0], answer[1], {name: answer[0]})
    except dns.dnssec.ValidationFailure as exc:
        # Validation failed. The raise causes python to abort with status 1.
        #raise exc
        return False
    except AttributeError as exc:
        # Validation may have failed; DNSKEY missing signer attribute. dig may report
        # domain as valid.
        #
        # TODO: Additional state where subdomain of valid domain may fail with 3 nested
        # KeyErrors. Avoid temptation to wildcard catch. Safer to put in process?
        #raise exc
        return False
    else:
        return True

def hash_localpart(incoming: bytes) -> str:
    '''Z-base32 the localpart of an e-mail address

    https://tools.ietf.org/html/draft-koch-openpgp-webkey-service-08#section-3.1
    describes why this is needed.

    See https://tools.ietf.org/html/rfc6189#section-5.1.6 for a
    description of the z-base32 scheme.
    '''
    zb32 = "ybndrfg8ejkmcpqxot1uwisza345h769"

    b = hashlib.sha1(incoming).digest()
    ret = ""
    assert(len(b) * 8 == 160)
    for i in range(0, 160, 5):
        byte = i // 8
        offset = i - byte * 8
        # offset | bits remaining in k+1 | right-shift k+1
        # 3 | 0 | x
        # 4 | 1 | 7
        # 5 | 2 | 6
        # 6 | 3 | 5
        # 7 | 4 | 4
        if offset < 4:
            n = (b[byte] >> (3 - offset))
        else:
            n = (b[byte] << (offset - 3)) + (b[byte + 1] >> (11 - offset))

        ret += zb32[n & 0b11111]
    return ret

def build_web_key_uri(address: str) -> str:
    local, remote = address.split('@')
    local = hash_localpart(local.encode('utf-8'))
    return 'https://' + remote + '/.well-known/openpgpkey/hu/' + \
        local

def stream_to_file(uri: str, fname: str) -> None:
    with requests.get(uri, verify=True, stream=True) as r:
        pprint(r.headers) # Debug output; drop once stable.
        with open(fname, 'wb') as f:
            shutil.copyfileobj(r.raw, f)
```
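
As a quick sanity check of the URI builder (the address is a made-up example;
the hash in the output is elided):

```python
from wkd import build_web_key_uri

print(build_web_key_uri('alice@example.org'))
# -> https://example.org/.well-known/openpgpkey/hu/<32 z-base32 chars>
```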


Re: Speeding up Tree Verification

Sid Spry
On Mon, Jun 29, 2020, at 9:13 PM, Sid Spry wrote:
> Hello,
>
> I have some runnable pseudocode outlining a faster tree verification algorithm.

Ah, right. It's worth noting that even faster than this algorithm is simply
verifying a .tar.xz snapshot. Is that totally off the table? I realize it
doesn't fit every use case, but it seems to be faster in both sync and
verification time.
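
For example, a minimal sketch of that approach (the file names and GnuPG home
directory are hypothetical):

```python
import subprocess

def verify_snapshot(tarball: str, signature: str, gnupg_home: str) -> bool:
    """Check a detached OpenPGP signature over a tree snapshot."""
    result = subprocess.run(
        ['gpg', '--homedir', gnupg_home, '--verify', signature, tarball],
        capture_output=True)
    return result.returncode == 0

# verify_snapshot('portage-latest.tar.xz',
#                 'portage-latest.tar.xz.gpgsig',
#                 '/tmp/verify-home')
```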


Re: Speeding up Tree Verification

Zac Medico-2
On 6/29/20 7:15 PM, Sid Spry wrote:
> On Mon, Jun 29, 2020, at 9:13 PM, Sid Spry wrote:
>> Hello,
>>
>> I have some runnable pseudocode outlining a faster tree verification algorithm.
>
> Ah, right. It's worth noting that even faster than this algorithm is simply verifying
> a .tar.xz. Is that totally off the table? I realize it doesn't fit every usecase, but it
> seems to be faster in both sync and verification time.

We've already got support for that with sync-type = webrsync. However, I
imagine sync-type = git is even better. All of the types are covered here:

https://wiki.gentoo.org/wiki/Portage_Security
--
Thanks,
Zac


Re: Speeding up Tree Verification

Fabian Groffen-2
In reply to this post by Sid Spry
Hi,

On 29-06-2020 21:13:43 -0500, Sid Spry wrote:

> Hello,
>
> I have some runnable pseudocode outlining a faster tree verification algorithm.
> Before I create patches I'd like to see if there is any guidance on making the
> changes as unobtrusive as possible. If the radical change in algorithm is
> acceptable I can work on adding the changes.
>
> Instead of composing any kind of structured data out of the portage tree my
> algorithm just lists all files and then optionally batches them out to threads.
> There is a noticeable speedup by eliding the tree traversal operations which
> can be seen when running the algorithm with a single thread and comparing it to
> the current algorithm in gemato (which should still be discussed here?).

I remember something about gemato using multiple threads at one point, but
because it totally saturated disk I/O, it was brought back to a single
thread.  People were complaining about unusable systems.

In any case, can you share your performance results?  What speedup did
you see, on warm and hot FS caches?  Which type of disk do you use?

You could compare against qmanifest, which uses OpenMP-based
parallelism while verifying the tree.  On SSDs this does help.

Thanks,
Fabian

> [...]
--
Fabian Groffen
Gentoo on a different level

Reply | Threaded

Re: Speeding up Tree Verification

Michał Górny-5
In reply to this post by Sid Spry
On June 30, 2020 2:13:43 AM UTC, Sid Spry <[hidden email]> wrote:

> Hello,
>
> I have some runnable pseudocode outlining a faster tree verification
> algorithm. Before I create patches I'd like to see if there is any
> guidance on making the changes as unobtrusive as possible. If the radical
> change in algorithm is acceptable I can work on adding the changes.
>
> Instead of composing any kind of structured data out of the portage tree
> my algorithm just lists all files and then optionally batches them out to
> threads. There is a noticeable speedup by eliding the tree traversal
> operations which can be seen when running the algorithm with a single
> thread and comparing it to the current algorithm in gemato (which should
> still be discussed here?).

Without reading the code: does your algorithm correctly detect extraneous files?

>
> Some simple tests like counting all objects traversed and verified
> returns the same(ish). Once it is put into portage it could be tested in
> detail.
>
> There is also my partial attempt at removing the brittle interface to
> GnuPG (it's not as if the current code is badly designed, just that
> parsing the output of GnuPG directly is likely not the best idea).

The 'brittle interface' is well-defined machine-readable output.
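
For instance, a minimal sketch of consuming that output (the status prefixes
are the documented ones from GnuPG's doc/DETAILS; error handling omitted):

```python
import subprocess

def signature_ok(signature: str, data: str) -> bool:
    # --status-fd 1 makes gpg print parseable status lines on stdout,
    # each prefixed with '[GNUPG:] '.
    proc = subprocess.run(
        ['gpg', '--status-fd', '1', '--verify', signature, data],
        capture_output=True, text=True)
    statuses = [line.split()[1]
                for line in proc.stdout.splitlines()
                if line.startswith('[GNUPG:] ')]
    # A real consumer would also check the TRUST_* lines for key validity.
    return 'GOODSIG' in statuses and 'VALIDSIG' in statuses
```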

>
> Needs gemato, dnspython, and requests. Slightly better than random code
> because I took inspiration from the existing gemato classes.

The code makes a lot of brittle assumptions about the structure. The GLEP was specifically designed to avoid that and let us adjust the structure in the future to meet our needs.

> [code snipped]


--
Best regards,
Michał Górny


Re: Speeding up Tree Verification

Michał Górny-5
In reply to this post by Fabian Groffen-2
On June 30, 2020 6:20:37 AM UTC, Fabian Groffen <[hidden email]> wrote:

> Hi,
>
> On 29-06-2020 21:13:43 -0500, Sid Spry wrote:
>> Hello,
>>
>> I have some runnable pseudocode outlining a faster tree verification
>> algorithm. Before I create patches I'd like to see if there is any
>> guidance on making the changes as unobtrusive as possible. If the
>> radical change in algorithm is acceptable I can work on adding the
>> changes.
>>
>> Instead of composing any kind of structured data out of the portage
>> tree my algorithm just lists all files and then optionally batches
>> them out to threads. There is a noticeable speedup by eliding the
>> tree traversal operations which can be seen when running the
>> algorithm with a single thread and comparing it to the current
>> algorithm in gemato (which should still be discussed here?).
>
> I remember something that gemato used to use multiple threads, but
> because it totally saturated disk-IO, it was brought back to a single
> thread.  People were complaining about unusable systems.

No, it gave a significant speedup even on spinning HDDs. However, it hung for some people due to a bug that I couldn't reproduce.

>
> [...]

--
Best regards,
Michał Górny


Re: Speeding up Tree Verification

Pacho Ramos
In reply to this post by Fabian Groffen-2
On Tue, 30-06-2020 at 08:20 +0200, Fabian Groffen wrote:

> Hi,
>
> On 29-06-2020 21:13:43 -0500, Sid Spry wrote:
> > [...]
>
> I remember something that gemato used to use multiple threads, but
> because it totally saturated disk-IO, it was brought back to a single
> thread.  People were complaining about unusable systems.
>
> In any case, can you share your performance results?  What speedup did
> you see, on warm and hot FS caches?  Which type of disk do you use?
>
> You could compare against qmanifest, which uses OpenMP-based
> parallelism while verifying the tree.  On SSDs this does help.
>
> Thanks,
> Fabian
I am talking only from my own experience but, at least in my case, there are
noticeable differences between I/O schedulers. That is something that could
affect these tests... with the latest kernels we only have noop, deadline
(the default) and bfq... in my case bfq works far better for keeping the
system responsive (on both rotational disks and SSDs). But you can test
different combinations to see which one fits best for you.
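
For reference, a quick way to see which scheduler each disk is using (the
active one is shown in brackets):

```python
import glob

# Print the I/O scheduler line for every block device, e.g.
# "sda mq-deadline kyber [bfq] none".
for path in sorted(glob.glob('/sys/block/*/queue/scheduler')):
    device = path.split('/')[3]
    with open(path) as f:
        print(device, f.read().strip())
```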


Reply | Threaded

Re: Speeding up Tree Verification

Sid Spry
In reply to this post by Zac Medico-2
On Mon, Jun 29, 2020, at 9:34 PM, Zac Medico wrote:

> On 6/29/20 7:15 PM, Sid Spry wrote:
> > On Mon, Jun 29, 2020, at 9:13 PM, Sid Spry wrote:
> >> Hello,
> >>
> >> I have some runnable pseudocode outlining a faster tree verification algorithm.
> >
> > Ah, right. It's worth noting that even faster than this algorithm is simply verifying
> > a .tar.xz. Is that totally off the table? I realize it doesn't fit every usecase, but it
> > seems to be faster in both sync and verification time.
>
> We've already got support for that with sync-type = webrsync. However, I
> imagine sync-type = git is even better. All of the types are covered here:
>
> https://wiki.gentoo.org/wiki/Portage_Security

I'm being warned right now that webrsync-gpg is being deprecated; I've been
using it. It is, amazingly, faster than a typical rsync, and may even be
faster than a git pull.

The issue with git is that there are some analyses indicating you shouldn't
rely on git for integrity, so you are back to verifying the tree on disk,
which is slower than verifying the .tar.xz.

(To clarify: even with signed commits the commit hashes could be attacked,
and this is considered somewhat feasible.)


Re: Speeding up Tree Verification

Sid Spry
In reply to this post by Michał Górny-5
On Tue, Jun 30, 2020, at 2:28 AM, Michał Górny wrote:

> On June 30, 2020 2:13:43 AM UTC, Sid Spry <[hidden email]> wrote:
> > Hello,
> >
> > I have some runnable pseudocode outlining a faster tree verification
> > algorithm. Before I create patches I'd like to see if there is any
> > guidance on making the changes as unobtrusive as possible. If the
> > radical change in algorithm is acceptable I can work on adding the
> > changes.
> >
> > Instead of composing any kind of structured data out of the portage
> > tree my algorithm just lists all files and then optionally batches
> > them out to threads. There is a noticeable speedup by eliding the
> > tree traversal operations which can be seen when running the
> > algorithm with a single thread and comparing it to the current
> > algorithm in gemato (which should still be discussed here?).
>
> Without reading the code: does your algorithm correctly detect
> extraneous files?
>

Yes and no.

I am not sure why this is necessary. If a file does not appear in a manifest
it is ignored. It makes the most sense to me to put the burden of not
including untracked files on the publisher. If the user puts an untracked
file into the tree it will be ignored to no consequence; the authored files
don't refer to it, after all.

But it would be easy enough to build a second list of all files and compare
it to the list of files built from the manifests; if there are extras, an
error can be generated (see the sketch below). This is actually the first
test I did on my manifest-parsing code: I checked whether my tracked files
roughly matched the total files in the tree. That can be repurposed for this
check.
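
Something along these lines (a sketch; it assumes the caller has already
flattened the manifest records into a set of absolute paths):

```python
import os

def find_extraneous(tree_root, tracked_paths):
    """Return files on disk that no manifest entry accounts for."""
    on_disk = set()
    for path, dirs, files in os.walk(tree_root):
        for f in files:
            # Manifests themselves are covered by the signed chain,
            # not by their own entries.
            if not f.startswith('Manifest'):
                on_disk.add(os.path.join(path, f))
    return sorted(on_disk - set(tracked_paths))
```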

> > Some simple tests like counting all objects traversed and verified
> > returns the same(ish). Once it is put into portage it could be tested
> > in detail.
> >
> > There is also my partial attempt at removing the brittle interface to
> > GnuPG (it's not as if the current code is badly designed, just that
> > parsing the output of GnuPG directly is likely not the best idea).
>
> The 'brittle interface' is well-defined machine-readable output.
>

Ok. I was aware there was a machine interface, but the classes that
manipulate a temporary GPG home seemed like not the best solution. I guess
that is all down to GPG assuming everything lives in ~/.gnupg and keeping
its state as a directory structure.
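
For what it's worth, the isolation itself is simple enough (a sketch; the
key file name matches the one from my tests below):

```python
import subprocess, tempfile

# Import the signing key into a throwaway GnuPG home so verification never
# touches ~/.gnupg; the directory is removed when the context exits.
with tempfile.TemporaryDirectory() as home:
    subprocess.run(['gpg', '--homedir', home, '--import', 'signkey.asc'],
                   check=True)
    subprocess.run(['gpg', '--homedir', home, '--verify',
                    '/var/db/repos/gentoo/Manifest'],
                   check=True)
```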

> >
> > Needs gemato, dnspython, and requests. Slightly better than random
> > code because I took inspiration from the existing gemato classes.
>
> The code makes a lot of brittle assumptions about the structure. The
> GLEP was specifically designed to avoid that and let us adjust the
> structure in the future to meet our needs.
>

These same assumptions are built into the code that operates on the tree
structure. If the GLEP were changed, the existing code would also
potentially need changing. This code just uses the structure in a different
way.

I will admit to only a partial understanding of the entire GLEP. I made some
simplifications just to get something demonstrable done. However, please
consider removing some of the checks or moving them elsewhere. I don't have
full suggestions right now, but there is the possibility of saving an
appreciable amount of time.


Re: Speeding up Tree Verification

Sid Spry
In reply to this post by Fabian Groffen-2
On Tue, Jun 30, 2020, at 1:20 AM, Fabian Groffen wrote:

> Hi,
>
> On 29-06-2020 21:13:43 -0500, Sid Spry wrote:
> > Hello,
> >
> > I have some runnable pseudocode outlining a faster tree verification algorithm.
> > Before I create patches I'd like to see if there is any guidance on making the
> > changes as unobtrusive as possible. If the radical change in algorithm is
> > acceptable I can work on adding the changes.
> >
> > Instead of composing any kind of structured data out of the portage tree my
> > algorithm just lists all files and then optionally batches them out to threads.
> > There is a noticeable speedup by eliding the tree traversal operations which
> > can be seen when running the algorithm with a single thread and comparing it to
> > the current algorithm in gemato (which should still be discussed here?).
>
> I remember something that gemato used to use multiple threads, but
> because it totally saturated disk-IO, it was brought back to a single
> thread.  People were complaining about unusable systems.
>

I think this is an argument for cgroup limits support on the portage process
or account, as opposed to an argument against picking a better algorithm.
That is something I have been working towards, but I am only one man.

> In any case, can you share your performance results?  What speedup did
> you see, on warm and hot FS caches?  Which type of disk do you use?
>

I ran all tests multiple times off a Samsung SSD so the caches were warm,
but I have nothing very precise yet.

% gemato verify --openpgp-key signkey.asc /var/db/repos/gentoo
[...]
INFO:root:Verifying /var/db/repos/gentoo...
INFO:root:/var/db/repos/gentoo verified in 16.45 seconds

sometimes going higher, closer to 18s, vs.

% ./veriftree.py
4.763171965983929

So roughly a 3.5x speedup without batching to threads.

> You could compare against qmanifest, which uses OpenMP-based
> paralllelism while verifying the tree.  On SSDs this does help.
>

I lost my notes -- how do I point either gemato or qmanifest at a specific
GnuPG directory? My code is partially structured the way it is because I had
problems doing this. I rediscovered -K/--openpgp-key in gemato but am unsure
for qmanifest.


Re: Speeding up Tree Verification

Zac Medico-2
In reply to this post by Sid Spry
On 6/30/20 10:29 AM, Sid Spry wrote:

> On Mon, Jun 29, 2020, at 9:34 PM, Zac Medico wrote:
>> On 6/29/20 7:15 PM, Sid Spry wrote:
>>> On Mon, Jun 29, 2020, at 9:13 PM, Sid Spry wrote:
>>>> Hello,
>>>>
>>>> I have some runnable pseudocode outlining a faster tree verification algorithm.
>>>
>>> Ah, right. It's worth noting that even faster than this algorithm is simply verifying
>>> a .tar.xz. Is that totally off the table? I realize it doesn't fit every usecase, but it
>>> seems to be faster in both sync and verification time.
>>
>> We've already got support for that with sync-type = webrsync. However, I
>> imagine sync-type = git is even better. All of the types are covered here:
>>
>> https://wiki.gentoo.org/wiki/Portage_Security
>
> I'm being warned right now that webrsync-gpg is being deprecated; I've been using
> it. It is, amazingly, faster than a typical rsync and may be faster than a git pull though.

Yeah, webrsync-gpg is deprecated, but the replacement is sync-type =
webrsync, and verification is enabled by default for that sync-type.
--
Thanks,
Zac



Re: Speeding up Tree Verification

Michał Górny-5
In reply to this post by Sid Spry
On Tue, 2020-06-30 at 12:50 -0500, Sid Spry wrote:

> On Tue, Jun 30, 2020, at 2:28 AM, Michał Górny wrote:
> > Dnia June 30, 2020 2:13:43 AM UTC, Sid Spry <[hidden email]> napisał(a):
> > > Hello,
> > >
> > > I have some runnable pseudocode outlining a faster tree verification
> > > algorithm.
> > > Before I create patches I'd like to see if there is any guidance on
> > > making the
> > > changes as unobtrusive as possible. If the radical change in algorithm
> > > is
> > > acceptable I can work on adding the changes.
> > >
> > > Instead of composing any kind of structured data out of the portage
> > > tree my
> > > algorithm just lists all files and then optionally batches them out to
> > > threads.
> > > There is a noticeable speedup by eliding the tree traversal operations
> > > which
> > > can be seen when running the algorithm with a single thread and
> > > comparing it to
> > > the current algorithm in gemato (which should still be discussed
> > > here?).
> >
> > Without reading the code: does your algorithm correctly detect extraneous files?
> >
>
> Yes and no.
>
> I am not sure why this is necessary. If the file does not appear in a manifest it is
> ignored. It makes the most sense to me to put the burden of not including
> untracked files on the publisher. If the user puts an untracked file into the tree it
> will be ignored to no consequence; the authored files don't refer to it, after all.

This is necessary because a malicious third party can MITM you an rsync
tree with extraneous files (say, a -r1 baselayout ebuild) that do horrible
things on your system.  If you don't reject files not in the Manifest, you
open a huge security hole.

> But it would be easy enough to build a second list of all files and compare it to
> the list of files built from the manifests. If there are extras an error can be
> generated. This is actually the first test I did on my manifest parsing code. I tried
> to see if my tracked files roughly matched the total files in tree. That can be
> repurposed for this check.
>
> > > Some simple tests like counting all objects traversed and verified
> > > returns the
> > > same(ish). Once it is put into portage it could be tested in detail.
> > >
> > > There is also my partial attempt at removing the brittle interface to
> > > GnuPG
> > > (it's not as if the current code is badly designed, just that parsing
> > > the
> > > output of GnuPG directly is likely not the best idea).
> >
> > The 'brittle interface' is well-defined machine-readable output.
> >
>
> Ok. I was aware there was a machine interface, but the classes that manipulate
> a temporary GPG home seemed like not the best solution. I guess that is all
> due to GPG assuming everything is in ~/.gnupg and keeping its state as a
> directory structure.

A temporary home directory guarantees that the user's configuration does not
affect the verification result.

>
> > > Needs gemato, dnspython, and requests. Slightly better than random code
> > > because
> > > I took inspiration from the existing gemato classes.
> >
> > The code makes a lot of brittle assumptions about the structure. The
> > GLEP was specifically designed to avoid that and let us adjust the
> > structure in the future to meet our needs.
> >
>
> These same assumptions are built into the code that operates on the
> tree structure. If the GLEP were changed the existing code would also
> potentially need changing. This code just uses the structure in a different
> way.
>

The code that predates the GLEP, yes.  It will eventually be changed to
be more flexible, especially once we can start removing backwards
compatibility.

--
Best regards,
Michał Górny



Re: Speeding up Tree Verification

Sid Spry
On Tue, Jun 30, 2020, at 2:29 PM, Michał Górny wrote:

> On Tue, 2020-06-30 at 12:50 -0500, Sid Spry wrote:
> > On Tue, Jun 30, 2020, at 2:28 AM, Michał Górny wrote:
> > > Dnia June 30, 2020 2:13:43 AM UTC, Sid Spry <[hidden email]> napisał(a):
> > > > Hello,
> > > >
> > > > I have some runnable pseudocode outlining a faster tree verification
> > > > algorithm.
> > > > Before I create patches I'd like to see if there is any guidance on
> > > > making the
> > > > changes as unobtrusive as possible. If the radical change in algorithm
> > > > is
> > > > acceptable I can work on adding the changes.
> > > >
> > > > Instead of composing any kind of structured data out of the portage
> > > > tree my
> > > > algorithm just lists all files and then optionally batches them out to
> > > > threads.
> > > > There is a noticeable speedup by eliding the tree traversal operations
> > > > which
> > > > can be seen when running the algorithm with a single thread and
> > > > comparing it to
> > > > the current algorithm in gemato (which should still be discussed
> > > > here?).
> > >
> > > Without reading the code: does your algorithm correctly detect extraneous files?
> > >
> >
> > Yes and no.
> >
> > I am not sure why this is necessary. If the file does not appear in a manifest it is
> > ignored. It makes the most sense to me to put the burden of not including
> > untracked files on the publisher. If the user puts an untracked file into the tree it
> > will be ignored to no consequence; the authored files don't refer to it, after all.
>
> This is necessary because a malicious third party can MITM you an rsync
> tree with extraneous files (say, -r1 baselayout ebuild) that do horrible
> things on your system.  If you don't reject files not in Manifest, you
> open a huge security hole.
>

Ok, I will refer to https://www.gentoo.org/glep/glep-0074.html and implement the
checks in detail, but will still need to spend some time looking for the best place
to insert the code.

I think it best to address this from two fronts. Rejecting extra files has an
immediate benefit, but the larger issue is portage exposing untracked,
potentially malicious files to the user.
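
A rough cut at what the rejection could look like, loosely following the entry
tags GLEP 74 names (the entry attributes and the directory matching here are
simplified and hypothetical):

```python
def reject_untracked(on_disk: set, entries) -> list:
    """entries: objects with .tag ('DATA', 'MISC', 'IGNORE', ...) and
    .path (repo-relative). Files neither tracked nor IGNOREd are errors."""
    tracked = {e.path for e in entries if e.tag != 'IGNORE'}
    ignored = [e.path for e in entries if e.tag == 'IGNORE']
    errors = []
    for f in sorted(on_disk):
        if f in tracked:
            continue
        if any(f == i or f.startswith(i + '/') for i in ignored):
            continue  # covered by an IGNORE entry (file or directory)
        errors.append('untracked file: %s' % f)
    return errors
```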

Has anything like a verity loopback filesystem been explored? It might reduce
duplication of work.

> > But it would be easy enough to build a second list of all files and compare it to
> > the list of files built from the manifests. If there are extras an error can be
> > generated. This is actually the first test I did on my manifest parsing code. I tried
> > to see if my tracked files roughly matched the total files in tree. That can be
> > repurposed for this check.
> >
> > > > Some simple tests like counting all objects traversed and verified
> > > > returns the
> > > > same(ish). Once it is put into portage it could be tested in detail.
> > > >
> > > > There is also my partial attempt at removing the brittle interface to
> > > > GnuPG
> > > > (it's not as if the current code is badly designed, just that parsing
> > > > the
> > > > output of GnuPG directly is likely not the best idea).
> > >
> > > The 'brittle interface' is well-defined machine-readable output.
> > >
> >
> > Ok. I was aware there was a machine interface, but the classes that manipulate
> > a temporary GPG home seemed like not the best solution. I guess that is all
> > due to GPG assuming everything is in ~/.gnupg and keeping its state as a
> > directory structure.
>
> A temporary home directory guarantees that user configuration does not
> affect the verification result.
>

Yes, I know why it is there. The temporary construction of the directory is
what stood out to me as messy, but I guess there is no way around it.

> >
> > > > Needs gemato, dnspython, and requests. Slightly better than random code
> > > > because
> > > > I took inspiration from the existing gemato classes.
> > >
> > > The code makes a lot of brittle assumptions about the structure. The
> > > GLEP was specifically designed to avoid that and let us adjust the
> > > structure in the future to meet our needs.
> > >
> >
> > These same assumptions are built into the code that operates on the
> > tree structure. If the GLEP were changed the existing code would also
> > potentially need changing. This code just uses the structure in a different
> > way.
> >
>
> The code that predates the GLEP, yes.  It will eventually be changed to
> be more flexible, especially when we can assume that we start removing
> backwards compatibility.
>

My comment was more on the general nature of the code having to track
what is intended. Even if the parsing code (including the filesystem
handling?) were generated from a specification, the specification itself
becomes "code" and may still not reflect what is intended.

Taking liberties with the programmatic representation of the manifests
seems appropriate. I also don't see anything in the GLEP that dictates an
in-memory representation (and such a mandate would probably be
inappropriate).
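
For instance, one possible in-memory shape (purely illustrative; the field
set is my assumption, not anything the GLEP requires):

```python
from dataclasses import dataclass, field
from typing import Dict

@dataclass
class ManifestRecord:
    tag: str                  # e.g. 'DATA', 'MANIFEST', 'IGNORE'
    path: str                 # repo-relative path
    size: int = 0
    hashes: Dict[str, str] = field(default_factory=dict)  # algo -> hex digest
```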


Re: Speeding up Tree Verification

Fabian Groffen-2
In reply to this post by Sid Spry
On 30-06-2020 13:13:29 -0500, Sid Spry wrote:

> On Tue, Jun 30, 2020, at 1:20 AM, Fabian Groffen wrote:
> > Hi,
> >
> > On 29-06-2020 21:13:43 -0500, Sid Spry wrote:
> > > Hello,
> > >
> > > I have some runnable pseudocode outlining a faster tree verification algorithm.
> > > Before I create patches I'd like to see if there is any guidance on making the
> > > changes as unobtrusive as possible. If the radical change in algorithm is
> > > acceptable I can work on adding the changes.
> > >
> > > Instead of composing any kind of structured data out of the portage tree my
> > > algorithm just lists all files and then optionally batches them out to threads.
> > > There is a noticeable speedup by eliding the tree traversal operations which
> > > can be seen when running the algorithm with a single thread and comparing it to
> > > the current algorithm in gemato (which should still be discussed here?).
> >
> > I remember something that gemato used to use multiple threads, but
> > because it totally saturated disk-IO, it was brought back to a single
> > thread.  People were complaining about unusable systems.
> >
>
> I think this is an argument for cgroups limits support on the portage process or
> account as opposed to an argument against picking a better algorithm. That is
> something I have been working towards, but I am only one man.

But this requires a) cgroups support, and b) the privileges to use it.
Shouldn't be a problem in the normal case, but just saying.

> > In any case, can you share your performance results?  What speedup did
> > you see, on warm and hot FS caches?  Which type of disk do you use?
> >
>
> I ran all tests multiple times to make them warm off of a Samsung SSD, but
> nothing very precise yet.
>
> % gemato verify --openpgp-key signkey.asc /var/db/repos/gentoo
> [...]
> INFO:root:Verifying /var/db/repos/gentoo...
> INFO:root:/var/db/repos/gentoo verified in 16.45 seconds
>
> sometimes going higher, closer to 18s, vs.
>
> % ./veriftree.py
> 4.763171965983929
>
> So roughly an order of magnitude speedup without batching to threads.

That is kind of a change.  It makes one wonder if you really did the same
work.

> > You could compare against qmanifest, which uses OpenMP-based
> > paralllelism while verifying the tree.  On SSDs this does help.
> >
>
> I lost my notes -- how do I specify to either gemato or qmanifest the GnuPG
> directory? My code is partially structured as it is because I had problems doing
> this. I rediscovered -K/--openpgp-key in gemato but am unsure for qmanifest.

qmanifest doesn't do much magic outside of the standard GnuPG practices
(it uses gpgme).  If you want it to use a different GnuPG dir, you may
change HOME or GNUPGHOME.
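
E.g. something like (adjust the paths to taste):

% GNUPGHOME=/path/to/gnupghome qmanifest /var/db/repos/gentoo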

Thanks,
Fabian

--
Fabian Groffen
Gentoo on a different level


Re: Speeding up Tree Verification

Michał Górny-5
In reply to this post by Sid Spry
On Tue, 2020-06-30 at 16:51 -0500, Sid Spry wrote:

> On Tue, Jun 30, 2020, at 2:29 PM, Michał Górny wrote:
> > On Tue, 2020-06-30 at 12:50 -0500, Sid Spry wrote:
> > > On Tue, Jun 30, 2020, at 2:28 AM, Michał Górny wrote:
> > > > Dnia June 30, 2020 2:13:43 AM UTC, Sid Spry <[hidden email]> napisał(a):
> > > > > Hello,
> > > > >
> > > > > I have some runnable pseudocode outlining a faster tree verification
> > > > > algorithm.
> > > > > Before I create patches I'd like to see if there is any guidance on
> > > > > making the
> > > > > changes as unobtrusive as possible. If the radical change in algorithm
> > > > > is
> > > > > acceptable I can work on adding the changes.
> > > > >
> > > > > Instead of composing any kind of structured data out of the portage
> > > > > tree my
> > > > > algorithm just lists all files and then optionally batches them out to
> > > > > threads.
> > > > > There is a noticeable speedup by eliding the tree traversal operations
> > > > > which
> > > > > can be seen when running the algorithm with a single thread and
> > > > > comparing it to
> > > > > the current algorithm in gemato (which should still be discussed
> > > > > here?).
> > > >
> > > > Without reading the code: does your algorithm correctly detect extraneous files?
> > > >
> > >
> > > Yes and no.
> > >
> > > I am not sure why this is necessary. If the file does not appear in a manifest it is
> > > ignored. It makes the most sense to me to put the burden of not including
> > > untracked files on the publisher. If the user puts an untracked file into the tree it
> > > will be ignored to no consequence; the authored files don't refer to it, after all.
> >
> > This is necessary because a malicious third party can MITM you an rsync
> > tree with extraneous files (say, -r1 baselayout ebuild) that do horrible
> > things on your system.  If you don't reject files not in Manifest, you
> > open a huge security hole.
> >
>
> Ok, I will refer to https://www.gentoo.org/glep/glep-0074.html and implement the
> checks in detail, but will still need to spend some time looking for the best place
> to insert the code.
>
> I think it best to address this from two fronts. On one hand rejecting extra files
> seems to have immediate benefit but the larger issue is portage exposing
> untracked potentially malicious files to the user.

I've pushed two branches with two exploits I could think of.  Please
test your tools against them.  In both cases, the directory should be
rejected without any ill effects.

https://github.com/mgorny/glep74-test-cases

>
> Has anything like a verity loopback filesystem been explored? It might reduce
> duplication of work.


I don't understand what a 'verity loopback filesystem' is.  Could you elaborate?

--
Best regards,
Michał Górny



Re: Speeding up Tree Verification

Sid Spry
In reply to this post by Fabian Groffen-2
On Wed, Jul 1, 2020, at 1:40 AM, Fabian Groffen wrote:

> On 30-06-2020 13:13:29 -0500, Sid Spry wrote:
> > On Tue, Jun 30, 2020, at 1:20 AM, Fabian Groffen wrote:
> > > Hi,
> > >
> > > On 29-06-2020 21:13:43 -0500, Sid Spry wrote:
> > > > Hello,
> > > >
> > > > I have some runnable pseudocode outlining a faster tree verification algorithm.
> > > > Before I create patches I'd like to see if there is any guidance on making the
> > > > changes as unobtrusive as possible. If the radical change in algorithm is
> > > > acceptable I can work on adding the changes.
> > > >
> > > > Instead of composing any kind of structured data out of the portage tree my
> > > > algorithm just lists all files and then optionally batches them out to threads.
> > > > There is a noticeable speedup by eliding the tree traversal operations which
> > > > can be seen when running the algorithm with a single thread and comparing it to
> > > > the current algorithm in gemato (which should still be discussed here?).
> > >
> > > I remember something that gemato used to use multiple threads, but
> > > because it totally saturated disk-IO, it was brought back to a single
> > > thread.  People were complaining about unusable systems.
> > >
> >
> > I think this is an argument for cgroups limits support on the portage process or
> > account as opposed to an argument against picking a better algorithm. That is
> > something I have been working towards, but I am only one man.
>
> But this requires a) cgroups support, and b) the privileges to use it.
> Shouldn't be a problem in the normal case, but just saying.
>
> > > In any case, can you share your performance results?  What speedup did
> > > you see, on warm and hot FS caches?  Which type of disk do you use?
> > >
> >
> > I ran all tests multiple times to make them warm off of a Samsung SSD, but
> > nothing very precise yet.
> >
> > % gemato verify --openpgp-key signkey.asc /var/db/repos/gentoo
> > [...]
> > INFO:root:Verifying /var/db/repos/gentoo...
> > INFO:root:/var/db/repos/gentoo verified in 16.45 seconds
> >
> > sometimes going higher, closer to 18s, vs.
> >
> > % ./veriftree.py
> > 4.763171965983929
> >
> > So roughly an order of magnitude speedup without batching to threads.
>
> That is kind of a change.  Makes one wonder if you really did the same
> work.
>

That was my initial reaction. I attempted to ensure I was processing all of
the files that gemato processed. The full output of my script is something
closer to:

% ./veriftree.py
x.xxxxxxxxxx
192157
126237

The first number is the time, the second the total number of manifest
directives, and the third the number of real files in the tree. If you prune
the directives that correspond to no file, the counts match exactly, IIRC.

However, you are right, and I think the code I posted is old: gemato times
the manifest parsing as well as the verification, and that is not reflected
in the code I provided. If I time both phases, I get:

% ./veriftree.py
11.708862617029808
192157
126237

The corresponding times for gemato (at the same system state, etc.) are
~20s. So it is a halving at worst, with an assured n-core speedup for half
of that time, and I am fairly confident the manifest parsing can be sped up
even more as well.
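
For reference, the timing harness is just the standard timer around the two
phases (parse_manifests and verify are stand-in names here; my script's real
entry points differ slightly):

```python
import timeit

start = timeit.default_timer()
tree = ManifestTree('/var/db/repos/gentoo')
tree.build_manifest_list()   # walk the tree, collect Manifest paths
tree.parse_manifests()       # stand-in: load every manifest entry
parsed = timeit.default_timer()
tree.verify()                # stand-in: hash files, compare to entries
done = timeit.default_timer()
print('parse: %.3fs  verify: %.3fs' % (parsed - start, done - parsed))
```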

> > > You could compare against qmanifest, which uses OpenMP-based
> > > paralllelism while verifying the tree.  On SSDs this does help.
> > >
> >
> > I lost my notes -- how do I specify to either gemato or qmanifest the GnuPG
> > directory? My code is partially structured as it is because I had problems doing
> > this. I rediscovered -K/--openpgp-key in gemato but am unsure for qmanifest.
>
> qmanifest doesn't do much magic out of the standard gnupg practices.
> (It is using gpgme.)  If you want it to use a different gnupg dir, you
> may change HOME, or GNUPGHOME.
>

Alright, I will attempt to set that. I think I like the interface of gemato a
little more, but I will look at qmanifest and see how it performs.


Re: Speeding up Tree Verification

Sid Spry
In reply to this post by Fabian Groffen-2
On Wed, Jul 1, 2020, at 1:40 AM, Fabian Groffen wrote:

> On 30-06-2020 13:13:29 -0500, Sid Spry wrote:
> > On Tue, Jun 30, 2020, at 1:20 AM, Fabian Groffen wrote:
> > > Hi,
> > >
> > > On 29-06-2020 21:13:43 -0500, Sid Spry wrote:
> > > > Hello,
> > > >
> > > > I have some runnable pseudocode outlining a faster tree verification algorithm.
> > > > Before I create patches I'd like to see if there is any guidance on making the
> > > > changes as unobtrusive as possible. If the radical change in algorithm is
> > > > acceptable I can work on adding the changes.
> > > >
> > > > Instead of composing any kind of structured data out of the portage tree my
> > > > algorithm just lists all files and then optionally batches them out to threads.
> > > > There is a noticeable speedup by eliding the tree traversal operations which
> > > > can be seen when running the algorithm with a single thread and comparing it to
> > > > the current algorithm in gemato (which should still be discussed here?).
> > >
> > > I remember something that gemato used to use multiple threads, but
> > > because it totally saturated disk-IO, it was brought back to a single
> > > thread.  People were complaining about unusable systems.
> > >
> >
> > I think this is an argument for cgroups limits support on the portage process or
> > account as opposed to an argument against picking a better algorithm. That is
> > something I have been working towards, but I am only one man.
>
> But this requires a) cgroups support, and b) the privileges to use it.
> Shouldn't be a problem in the normal case, but just saying.
>

cgroups kernel support is a fairly common dependency. It can obviously be
optional and supported best-effort; I am thinking of something tied to
MAKEOPTS or EMERGE_DEFAULT_OPTS (see rustc/cargo not respecting, or not being
passed, -j/-l as another use case for cgroups). Is there any reason to expect
it not to be enabled?

If the user isn't root or portage, I think it is reasonable to leave resource
management to the machine's administrator.
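
To make the idea concrete, a best-effort sketch against the cgroup v2
interface (the group name and quota values are examples; it needs a v2 mount
and write privileges under /sys/fs/cgroup):

```python
import os
from typing import Optional

def confine_self(name: str, cpu_max: str = '50000 100000',
                 io_max: Optional[str] = None) -> None:
    """Move the current process into a new cgroup v2 group.

    cpu.max takes '<quota> <period>' in microseconds (here, 50% of one
    CPU); io.max lines are per block device, e.g. '8:16 rbps=2097152'.
    """
    cg = os.path.join('/sys/fs/cgroup', name)
    os.makedirs(cg, exist_ok=True)
    with open(os.path.join(cg, 'cpu.max'), 'w') as f:
        f.write(cpu_max)
    if io_max:
        with open(os.path.join(cg, 'io.max'), 'w') as f:
            f.write(io_max)
    # Finally, place this process (and its children) into the group.
    with open(os.path.join(cg, 'cgroup.procs'), 'w') as f:
        f.write(str(os.getpid()))
```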