Ubuntu archive on top of IPFS

Yes, you are right. All the packages that are the same will get the same hash. That is what I was worried about :slight_smile: Is the mirror also taking care of the GPG signatures?

Well, not GPG. We sign the packages when we are uploading them to the archive. Then the main archive generates hashes, which is what you will see here and in some other dirs:

http://archive.ubuntu.com/ubuntu/dists/xenial/

The mirrors of the archive just copy what the main archive publishes. apt on the client machine downloads the files and verifies their hashes.
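To make the hash check concrete, here is a minimal shell sketch of the kind of comparison involved. It is not apt's actual implementation: the function name `verify_sha256` is made up for illustration, and it assumes `sha256sum` from coreutils is installed.

```shell
# verify_sha256 FILE EXPECTED_HEX
# Succeeds (exit 0) only when the SHA-256 of FILE equals EXPECTED_HEX.
# Hypothetical helper for illustration; apt does the equivalent internally.
verify_sha256() {
    local file="$1" expected="$2" actual
    # Read the file on stdin so the output is always "<hash>  -".
    actual="$(sha256sum < "$file" | cut -d ' ' -f 1)"
    # An empty result means hashing failed, so never treat it as a match.
    [ -n "$actual" ] && [ "$actual" = "$expected" ]
}
```

In the real archive the expected value would come from the published metadata: the Release file lists hashes of the Packages indexes, and each Packages entry carries the hash of its .deb.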

A quick update here.

After leaving my machine synchronizing the mirror throughout my holidays, I finally have all the files locally.

I’m adding them to IPFS, which has turned up two unexpected problems. First, the recursive add will take a little more than 12 hours :frowning: The Ubuntu mirrors are supposed to sync every 6 hours, which will not be possible with IPFS, so we will have to sync only once a day.
Second, ipfs stores its own local blocks for each file, which doubles the space required for the mirror: the main mirror will need ~4TB instead of ~2TB. As suggested before, the other mirrors can sync just the IPFS blocks, so they will still need only ~2TB.

I’m now going to package the transport into a PPA so it will be easy to install, do more tests with my local mirror and continue trying to find a server with better bandwidth, and experiment with bitswap to sync multiple mirrors. :smiley:

It might be worth looking into using IPFS’ filestore functionality to prevent space requirements from being doubled.

Hey, that sounds great! That way we also help test that experimental feature. /me gives it a try.

As mentioned, you can use the filestore.
If somebody ends up here, here is the shortcut:

ipfs config --json Experimental.FilestoreEnabled true
ipfs add --nocopy $LOVE

It should work

No luck with --nocopy, I got:

panic: interface conversion: interface {} is cmdkit.Error, not *coreunix.AddedObject

goroutine 37 [running]:
github.com/ipfs/go-ipfs/core/commands.glob..func7.1(0xc4202ec060)
        /cwd/parts/ipfs/go/src/github.com/ipfs/go-ipfs/core/commands/add.go:390 +0x9fc
created by github.com/ipfs/go-ipfs/core/commands.glob..func7.2
        /cwd/parts/ipfs/go/src/github.com/ipfs/go-ipfs/core/commands/add.go:449 +0xc7

I will report the bug and try to dig into it later. For now, I will use the command with copy.

Update: the problem is not caused by --nocopy. I reported the bug: https://github.com/ipfs/go-ipfs/issues/4555

Do you know if there’s a way to resume the ipfs add --recursive? When it fails and I have to re-run it, it seems to start from scratch.

I don’t think it stores the progress anywhere to be able to pick up where it left off. Subsequent runs of ipfs add should be significantly faster (probably not much of a difference with --nocopy though) since the blocks up to the point where it stopped already exist in the datastore and don’t need to be written.

In terms of add time, a large portion of that is:

  1. Our current datastore is slow. If you want something faster, try badgerdb. Unfortunately, that datastore backend is experimental for a reason.
  2. One of our huge bottlenecks is telling everyone on the network that you have a file. We’re working on making this better but it’s a bit of a fundamental problem.

One partial solution is to:

  1. not announce to the network that you have the files (by adding them with ipfs add --local).
  2. Have the APT backend connect to known IPFS mirror peers.

Unfortunately, that’s not very decentralized… (although the Ubuntu installations that use this backend will still announce the files they have to the network).
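For anyone who wants to try the second step of the partial solution above, here is a rough sketch of connecting a node to a fixed set of mirror peers. The only ipfs command it relies on is `ipfs swarm connect`; the function name, the peer-list file format and the `IPFS` override variable are assumptions of mine, and any address you put in the list has to be a real mirror's multiaddr (none is given in this thread).

```shell
# connect_mirror_peers PEER_FILE
# PEER_FILE holds one multiaddr per line (hypothetical format).
# Blank lines and lines starting with '#' are skipped.
# Set IPFS=echo for a dry run that only prints the commands.
connect_mirror_peers() {
    local peer_file="$1" addr
    while IFS= read -r addr || [ -n "$addr" ]; do
        case "$addr" in
            ''|'#'*) continue ;;
        esac
        # </dev/null keeps the command from consuming the peer list on stdin.
        "${IPFS:-ipfs}" swarm connect "$addr" </dev/null ||
            echo "could not connect: $addr" >&2
    done < "$peer_file"
}
```

Running it as `IPFS=echo connect_mirror_peers peers.txt` prints what would be executed, which is a cheap way to check the list before touching the daemon.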

Ideally, we’d announce the root nodes of all pinned files to the network but we won’t be able to do that for a while.

Thanks for the suggestions @stebalien.

We are trying now with --local, and then we can experiment with badgerdb.
It says it will be done in 3.5 hours.

Edit: This progress bar is the worst liar: it has been running for a long time and it now says 58.85% 4h44m58s :sleeping:

We were left in a weird situation. This morning it was reporting less than 30 minutes to complete the add. But for some reason, the server got stuck, our byobu session was killed and all the IPFS processes stopped.

So, now how do we know if the add was completed? It would be sad to have to run the full recursive add again, because the files in the directory have not changed. But we have no clue if that’s the only way.

I don’t know how many things you have pinned to your node, but if it’s not too many you could look through the results of ipfs pin ls --type=recursive to see if any of the pins are for the content you were adding. By default ipfs add will recursively pin the content you added.

If you know what you’re looking for you can also search through the folders/files in the top-level hash using something like

ipfs pin ls --type=recursive -q | xargs -L 1 ipfs ls | grep "fubar"

Now we are running the add without the daemon running, and we get a lot of processes running; with the daemon we get 2 or 3, tops.

Unfortunately, I believe that’s the only way. We have to re-read and re-hash the files to verify that they exist in the repo (we assume that they may have changed although we could probably relax this constraint for the filestore). Note, we won’t (or shouldn’t at least) actually write them to the repo again.

One way to avoid this would be to use MFS and add the files one-by-one. That is,

#!/bin/bash

set -e

FROM="$1"  # local directory
TO="$2"    # directory in MFS

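# -print0 is required because the loop reads NUL-delimited names (read -d '');
# the parentheses make -print0 apply to both the file and the directory test.
find "$FROM" \( -type f -readable -o -type d -readable -executable \) -print0 | while IFS= read -r -d '' fname; do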
    if [[ -d "$fname" ]]; then
        ipfs files mkdir -p -- "$TO/$fname"
    elif [[ -f "$fname" ]]; then
        # skip files that already exist in MFS; the listing itself is not needed
        if ! ipfs files ls "$TO/$fname" >/dev/null 2>&1; then
            # will be pinned in the next command (you should probably disable GC)
            cid="$(ipfs add --pin=false --local -q "$fname")"
            ipfs files cp -- "/ipfs/$cid" "$TO/$fname"
        fi
    else
        echo "not a file or directory: $fname" >&2
        exit 1
    fi
done

Note, that script is rather slow… a better one would list the directory you want to import, the target MFS directory, find the diff, and then add the files in batches. However, writing that script is a bit of an endeavor.
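To give an idea of the "find the diff, then add in batches" approach described above, here is a sketch of just those two steps. It assumes you already have two plain-text listings, one relative path per line: one for the local directory and one for the MFS directory (producing the MFS listing is left out). Both function names are hypothetical, `xargs -d` assumes GNU xargs, and paths containing newlines are not handled.

```shell
# missing_in_mfs LOCAL_LIST MFS_LIST
# Prints the paths that appear in LOCAL_LIST but not in MFS_LIST.
missing_in_mfs() {
    local tmpdir status
    tmpdir="$(mktemp -d)" || return 1
    # comm needs both inputs sorted identically; LC_ALL=C keeps that stable.
    LC_ALL=C sort -u "$1" > "$tmpdir/local" &&
        LC_ALL=C sort -u "$2" > "$tmpdir/mfs" &&
        LC_ALL=C comm -23 "$tmpdir/local" "$tmpdir/mfs"
    status=$?
    rm -rf "$tmpdir"
    return "$status"
}

# in_batches SIZE COMMAND [ARGS...]
# Reads paths on stdin and runs COMMAND once per batch of up to SIZE paths.
in_batches() {
    local size="$1"
    shift
    xargs -d '\n' -n "$size" "$@"
}
```

Something like `missing_in_mfs local.txt mfs.txt | in_batches 500 echo` shows the batches without adding anything; wiring a batch into `ipfs add` and `ipfs files cp` is the part that still needs real work.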

I’ve opened an issue to discuss adding a command to do this to ipfs:


FYI, the next release should make this a bit better. We figured out why adding large datasets causes problems with go-ipfs (we had a leak that has been fixed).

Hey @elopio – can I get access to the archive (or instructions on how to make it)? I’d like to run some tests with rabin fingerprinting + badger ds, + see if it would make sense to write a custom importer to import the archive smartly, deduping all the internal files that are the same (look into ipfs tar for a preview of what i mean).

Hello @jbenet,
We haven’t been able to ipfs add the full archive; it always fails at 99%. We have reported most of the errors we get. We were thinking of adding just a subsection, to see if that works and lets us test further.

This is the script we are using to sync the repo and add it to ipfs: https://github.com/JaquerEspeis/apt-transport-ipfs/blob/master/scripts/sync_mirror.sh

Please let us know if there’s something we can do to help.

If I may throw in my two cents: if part of the issue you’re having is uploading a very large tarball to IPFS, perhaps try fragmenting the archive and storing the pieces separately? You can then use client-side logic to reassemble the fragments into a single archive.
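A minimal sketch of that idea with standard tools, assuming a single large file and a fixed piece size. The function names and the `part.` prefix are made up for illustration; nothing here is specific to IPFS, and each piece would still have to be added and fetched separately.

```shell
# split_archive FILE SIZE OUTDIR
# Cuts FILE into pieces of SIZE bytes named OUTDIR/part.aa, part.ab, ...
split_archive() {
    split -b "$2" "$1" "$3/part."
}

# join_archive PARTDIR DEST
# Concatenates the pieces back together; the glob expands in the same
# alphabetical order in which split named them.
join_archive() {
    cat "$1"/part.* > "$2"
}
```

After reassembly, comparing a checksum of the result against the original is a cheap sanity check on the client side.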

it would make sense to write a custom importer to import the archive smartly, deduping all the internal files that are the same (look into ipfs tar for a preview of what i mean).

Just an alternative importer? It should be used in IPFS itself. More than that, IPFS could be compatible with git/pijul by using common objects instead of whole files. I said as much when a user suggested pip, rpm and deb: Software Repository Mirrors may be a good start for IPFS

I have an improved concept for deduplication: (draft) Common Bytes - standard for data deduplication