Ubuntu archive on top of IPFS

Hello!

Last week at the Costa Rican hackerspace we started a project to serve the Ubuntu archive on top of IPFS. The idea is that it should be faster to download deb packages from an IPFS node that is closer than the closest mirror of the archive.
For example, in Costa Rica there is only one mirror, hosted at the university in the middle of the country. On the jaquerespeis there are at least 50 Ubuntu users spread across different geographic areas. If we get all of them to use the IPFS mirror, we should start seeing faster downloads. Or at least, that’s the theory; the experiment is to see how it goes in real life.

We started writing an apt transport for IPFS that knows how to download IPFS URIs:

Now we are trying to find resources to get a server and start seeding the full archive. This requires 2TB of storage, so it might take us some time to find somebody to donate a server :slight_smile:

We also need to put that transport on a PPA to make it easier to install, and do more tests with more people.

So far, the only problem we have is that downloading stuff from the published IPNS is slower than using the hash directly. I don’t yet know if it will be slower than hitting the HTTP mirror. We will need to make measurements and more controlled experiments.

If you want to join the project, any kind of help will be appreciated, especially from Ubuntu users who would like to try it and share their bandwidth serving the debs on IPFS.

This should also work for Debian, but that’s another 2TB for the Debian archives. So maybe we wait to get the first mirror online before trying that :smiley:

pura vida

Awesome! This is a perfect usecase for IPFS.

> So far, the only problem we have is that downloading stuff from the published IPNS is slower than using the hash directly.

IPNS is currently known to be rather slow. It has to do a DHT lookup and our DHT isn’t the fastest in the world (and DHTs tend to be slow in general). We’re working on making it better but that will take some time.

For now, you might want to consider using HTTPS to get the repo’s root hash and then fetch the actual data over IPFS.

You can also use something called dnslink to make /ipns/domain.name/ point to an IPFS hash but that’s only as secure as DNS is (although apt repos are signed). To use dnslink, add a TXT record in the form of dnslink=/ipfs/HASH to _dnslink.domain.name. /ipns/domain.name will now resolve to /ipfs/HASH using DNS.
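As a rough illustration of the dnslink record format described above, here is a minimal Python sketch that picks the IPFS path out of a list of TXT record values. The function name `resolve_dnslink` and the sample records are hypothetical, and the actual DNS query for `_dnslink.domain.name` is left out (it would need a resolver library or a tool like `dig`).

```python
from typing import Iterable, Optional


def resolve_dnslink(txt_values: Iterable[str]) -> Optional[str]:
    """Return the first /ipfs/ or /ipns/ path found in dnslink TXT values."""
    prefix = "dnslink="
    for value in txt_values:
        # TXT values are often shown wrapped in double quotes; drop them.
        value = value.strip().strip('"')
        if value.startswith(prefix):
            path = value[len(prefix):]
            if path.startswith(("/ipfs/", "/ipns/")):
                return path
    return None


# Hypothetical TXT values as they might be returned for _dnslink.domain.name
records = ["v=spf1 -all", "dnslink=/ipfs/HASH"]
print(resolve_dnslink(records))
```

Because the record is only as trustworthy as DNS, a client built this way would still rely on the apt repository signatures to verify what it downloads.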

Hey @stebalien, thanks for the ideas.

How can I get the hash that ipns links to with https?

If I understood your proposal correctly, the DNS option would add a little more complication; because when the mirror finishes the rsync, it would have to update the TXT record.

> How can I get the hash that ipns links to with https?

You can’t. I was suggesting that you host the hash on a server somewhere as a static file. The apt adapter would:

  1. Fetch http://your.domain/repo-root-hash.txt (which would contain /ipfs/HASH). This should be a low latency, low bandwidth operation.
  2. Fetch packages/repos via /ipfs/HASH/....
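The two steps above could be sketched roughly like this in Python. This is only an illustration under stated assumptions, not the actual apt transport: the helper names (`parse_root`, `package_path`, `fetch_root`) are hypothetical, the URL is the placeholder from the post, and the step that retrieves the resulting path from an IPFS node is not shown.

```python
import urllib.request

# Placeholder URL taken from the post; replace with a real location.
ROOT_HASH_URL = "http://your.domain/repo-root-hash.txt"


def parse_root(text: str) -> str:
    """Validate the contents of the static file, expected to be /ipfs/HASH."""
    root = text.strip()
    if not root.startswith("/ipfs/") or len(root) <= len("/ipfs/"):
        raise ValueError(f"unexpected root hash file contents: {root!r}")
    return root.rstrip("/")


def package_path(root: str, relative: str) -> str:
    """Build the IPFS path of a file relative to the repository root."""
    return f"{root}/{relative.lstrip('/')}"


def fetch_root(url: str = ROOT_HASH_URL, timeout: float = 5.0) -> str:
    """Step 1: small, low-latency request for the current root hash."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return parse_root(response.read().decode("utf-8"))


# Step 2, without touching the network: build a path under a known root.
# The relative path is only an illustrative example.
print(package_path(parse_root("/ipfs/HASH\n"), "dists/xenial/Release"))
```

Keeping the parsing separate from the network call makes it easy to test the path handling without a server.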

> If I understood your proposal correctly, the DNS option would add a little more complication; because when the mirror finishes the rsync, it would have to update the TXT record.

Unfortunately, yes.

(note: if you’re using IPFS, you can just use bitswap instead of rsync).

This sounds like a great project.

What are the requirements for the server?

I can help with that :slight_smile:
@elopio how much space do you need?

I understand your HTTP suggestion now. That would require a new apt transport, and the mirror would also have to run an HTTP server in addition to IPFS. Both things are simple, so we can implement them if the first measurements with IPNS are bad.

I will investigate more about bitswap, because once we have the first mirror rsync’ed, it sounds awesome to sync the rest through IPFS :slight_smile:

@koalalorenzo @leerspace the Ubuntu archive is currently 1.1TB. I’m budgeting 2TB to leave room for growth when the new LTS release comes out in April. Other than that, the server needs to have IPFS installed and give it part of its bandwidth, ideally 24/7.

That is for a full mirror. I’m thinking that before we start making measurements in real life, it would be nice to have at least 3 full stable mirrors. But even smaller servers that host only some of the directories will be very useful once we set up our first full mirror.

Here is more information about how to sync a local mirror: https://wiki.ubuntu.com/Mirrors/Scripts
I have one in progress in my house, but with my slow Costa Rican connection, it will take at least a month to finish. Any help to speed that up would be amazing :slight_smile:

I can certainly help with that!
I can provide 1-2 nodes in South America if you want.

I think this is easily doable, except for the apt transport!

If someone posts an ipns address or ipfs hash here for the full archive (in IPFS) I can pin it to at least one node in the midwest US. It’s admittedly not very close to Costa Rica, but it should hopefully be better than nothing.

It’s even better that it will be far from here. That way we can make a call for testing with the Ubuntu community, not just my group here. Thanks! I’ll post the hash as soon as I have the first one.

If you haven’t seen ipfs-cluster you may be interested. Unfortunately, it’s still alpha quality (but actively being worked on).

I would be interested in pinning the Ubuntu archive as well. Just tell me the hash and I can pin it. I am located in the US South.

Ah, I am just seeing this. I may be able to finish the sync faster, so I will start this process myself as well.

Awesome! Let me know once you are done, so we can sync with your hash.

Sure thing. I’m currently traveling; I will start the process in a few days.

@elopio: How do we manage deduplication? Do the packages always have the same hash, or do we get a new hash each time we mirror? I will try it out as soon as I am ready, but I want to be sure that it works correctly so that multiple repositories can share the same packages (as they have the same hashes).

What do you guys think?

What kind of duplication are you worried about?

Every time we rsync the mirror, some of the files will change, so the hash for the directory will change. The current solution is to point the node’s IPNS name at the hash of that directory. Like this, but with a full mirror, not just one package:

The files inside the mirror are resolved by path relative to the hash of the directory. If the rsync didn’t change them, their hash will remain the same.
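To make that property concrete, here is a simplified Python sketch of content addressing. It is not IPFS’s actual hashing or directory format; it only assumes the general idea that a file’s identifier is derived from its bytes, and a directory’s identifier from its entries. The names `file_hash` and `dir_hash` and the sample data are hypothetical.

```python
import hashlib
from typing import Dict


def file_hash(data: bytes) -> str:
    """Identifier of a file, derived only from its content."""
    return hashlib.sha256(data).hexdigest()


def dir_hash(files: Dict[str, bytes]) -> str:
    """Identifier of a directory, derived from its sorted (name, hash) entries."""
    listing = "".join(
        f"{name}\0{file_hash(data)}\n" for name, data in sorted(files.items())
    )
    return hashlib.sha256(listing.encode("utf-8")).hexdigest()


# Two snapshots of a tiny mirror; only b.deb changed between syncs.
before = {"a.deb": b"package a v1", "b.deb": b"package b v1"}
after = {"a.deb": b"package a v1", "b.deb": b"package b v2"}

print(file_hash(before["a.deb"]) == file_hash(after["a.deb"]))  # True: unchanged file
print(dir_hash(before) == dir_hash(after))  # False: the directory hash changed
```

The unchanged file keeps its identifier across syncs, so nodes that already hold it do not need to fetch it again, while the directory gets a new identifier that the IPNS name has to be updated to.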

As I mentioned before, resolving IPNS is currently slow. I don’t yet know if this delay is enough to make it painful and worse than HTTP on average, so I think we should explore this option first because it’s the easiest and most transparent. Maybe we can even give a hand to the devs to make IPNS faster :slight_smile:

After we test and measure this, we can explore other solutions, like an IPFS client that works like an HTTP cache, a central index of deb names->IPFS hashes, or saving the mirror hash in a faster protocol than IPNS as @stebalien suggested. I think it will be useful for future projects to have tools and experience with all of these solutions.

But feel free to disagree; we are open to trying the possible solutions in a different order. And maybe there are other solutions that we haven’t thought of yet.

We had a little chat on the #ipfs-cluster IRC channel. I’m not yet sure what the best role for the cluster is here. They told me there are a few features on the backlog that could help spread the load while bootstrapping a mirror. Also, once we have a full mirror, we could turn it into a cluster master so the rest can sign up as cluster slaves instead of being independent servers. I’m very happy to play with these options too, but again, it seems to me that the first step is to have one full mirror.

Yes, you are right. All the packages that are the same will get the same hash. That is what I was worried about :slight_smile: Is the mirror also taking care of the GPG signatures?