Git on IPFS - Links and References

The topic of putting git commits onto IPFS comes up often. Let’s use this thread to help people find related projects and tools.

For example:

The exciting thing about git-remote-ipld is that it uses IPLD, which is a data model (and library) that lets you parse hash-linked data structures like git repositories, traverse them, and address items within them using their native data format. It lets you treat git repositories as just another data format on the decentralized web. In the same way that a modern OS lets you read the many different data formats for Images, Movies, Spreadsheets, etc, IPLD makes it possible to do the same thing with git repositories, blockchain ledgers (bitcoin, ethereum, zcash, etc), files & directories on IPFS, dat repositories, etc.

Some more projects for reference:

should replace most of the other git helpers, but one of its problems is that if a git repo contains (or ever contained in its history) files bigger than ~2MiB, it won’t be able to transfer them (IPFS normally shards files, but we can’t do that with Git as it would change the hash). This is a security limitation of bitswap that we are still thinking about how to solve.

For anyone wondering, the security problem is that bitswap needs to download entire blocks (in this case, git objects) before it can verify their hashes. Unfortunately, git doesn’t break large files into multiple smaller objects. One of the goals here is to be compatible with git hash-for-hash (commit-id for commit-id) so we can’t just change the underlying git object format to keep them small.
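To make the constraint concrete, here is a minimal Python sketch of how git derives a blob’s object ID: the SHA1 covers a short header plus the entire file content, so the hash cannot be checked until every byte has arrived. The function name `git_blob_id` is just an illustrative choice.

```python
import hashlib


def git_blob_id(content: bytes) -> str:
    """Return the git object ID of a blob holding `content`."""
    # Git hashes "blob <size>\0" followed by the raw content as one unit.
    header = b"blob %d\x00" % len(content)
    return hashlib.sha1(header + content).hexdigest()


# The empty blob has a well-known ID.
print(git_blob_id(b""))  # e69de29bb2d1d6434b8b29ae775ad8c2e48c5391
```

Because the size is part of the header and the content is hashed as a single stream, splitting the file into smaller objects would necessarily produce different IDs.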

Possible Solutions

In case anyone wants to help solve this problem, here are a few possible solutions to get the discussions started:

Fail

Given that one shouldn’t be storing large objects in git, we could just fail to check out such repositories. Unfortunately, GitHub has set its max-object size to 100MiB so, if we want to support all GitHub repos, we’d have to support downloading 100MiB blocks :slight_frown:.

Trusted peers

An alternative is to only download large blocks from trusted peers (e.g., github). This is, unfortunately, not very decentralized.

Take a vote

One possible (but not great) solution is to trust that N independently selected peers won’t collude. If we do, we could have peers split large blocks up into a Merkle tree of smaller blocks and then poll a randomly selected set of N peers for the hash of the root of the Merkle tree that corresponds to the block we want. If our randomly selected peers agree, we would then download the entire Merkle tree, put it back together, and verify that the reconstructed block matches its hash.

While this would make fetching large blocks slow, this shouldn’t happen that often.
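A rough Python sketch of the “take a vote” flow, under several assumptions that are not in the thread: a 256 KiB chunk size, SHA-256 for the tree nodes, a simple binary tree that duplicates the last node on odd levels, and a rule that all polled peers must agree. Every function name here is hypothetical.

```python
import hashlib
from typing import List, Optional

CHUNK_SIZE = 256 * 1024  # assumed; the thread does not specify a chunk size


def split_chunks(block: bytes, size: int = CHUNK_SIZE) -> List[bytes]:
    """Split a large block into fixed-size chunks (at least one chunk)."""
    return [block[i:i + size] for i in range(0, len(block), size)] or [b""]


def merkle_root(chunks: List[bytes]) -> bytes:
    """Root of a binary Merkle tree over the chunks (assumed layout)."""
    level = [hashlib.sha256(b"\x00" + c).digest() for c in chunks]
    while len(level) > 1:
        if len(level) % 2:
            level.append(level[-1])  # duplicate the last node on odd levels
        level = [hashlib.sha256(b"\x01" + level[i] + level[i + 1]).digest()
                 for i in range(0, len(level), 2)]
    return level[0]


def agreed_root(answers: List[bytes]) -> Optional[bytes]:
    """Return the root only if every polled peer reported the same one."""
    return answers[0] if answers and len(set(answers)) == 1 else None


def reassemble_and_verify(chunks: List[bytes], root: bytes, object_id: str) -> bytes:
    """Check the chunks against the voted root, then against the git hash."""
    if merkle_root(chunks) != root:
        raise ValueError("chunks do not match the agreed Merkle root")
    block = b"".join(chunks)
    # `block` is assumed to be the full git object, header included.
    if hashlib.sha1(block).hexdigest() != object_id:
        raise ValueError("reassembled block does not match its git hash")
    return block
```

The final SHA1 check is what keeps the scheme honest: the vote only decides which tree to download, and a colluding set of peers can waste bandwidth but cannot make a bad block pass.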

Extra metadata

We could also require that repositories with large objects also store some extra metadata that allows us to validate these large objects piecewise. That is, we can take the metadata from the “take a vote” solution and check it into the repository. Then, when downloading the repository, we could download the small blocks first, pull out this (small) metadata, and use it to validate the large blocks piece-wise.

IMO, this would be more trouble than it’s worth.
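For reference, the “extra metadata” idea could look roughly like this in Python. It assumes the metadata is a plain list of SHA-256 chunk hashes checked into the repository; that format, the names `build_manifest` and `receive_with_manifest`, and the choice of SHA-256 are illustrative, not something the thread specifies.

```python
import hashlib
from typing import Iterable, List


def build_manifest(obj: bytes, chunk_size: int) -> List[str]:
    """Hash each fixed-size chunk of a large git object."""
    return [hashlib.sha256(obj[i:i + chunk_size]).hexdigest()
            for i in range(0, len(obj), chunk_size)]


def receive_with_manifest(chunks: Iterable[bytes], manifest: List[str],
                          object_id: str) -> bytes:
    """Validate chunks one by one as they arrive, then check the git hash."""
    received = []
    for index, chunk in enumerate(chunks):
        if index >= len(manifest) or hashlib.sha256(chunk).hexdigest() != manifest[index]:
            raise ValueError(f"chunk {index} does not match the manifest")
        received.append(chunk)
    if len(received) != len(manifest):
        raise ValueError("transfer ended before all chunks arrived")
    obj = b"".join(received)
    # `obj` is assumed to be the full git object, header included.
    if hashlib.sha1(obj).hexdigest() != object_id:
        raise ValueError("reassembled object does not match its git hash")
    return obj
```

A bad chunk is rejected as soon as it arrives, so the receiver never has to buffer more than one unverified chunk.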

Crypto Magic

Ideally, we’d be able to progressively validate SHA1 hashes of large objects as we download them by exploiting the fact that SHA1 is a streaming hash function (i.e., somehow verify it in reverse). Unfortunately, I’m pretty sure there’s no way to do this securely.

Looks like we need to find a crypto magician

Actually, this may be doable with zksnarks if we can use them to prove SHA1 hashes “valid”. Of course, I know next to nothing about zksnarks so this may not be possible…

Backing up a bit, the naive way to do this would be to send chunks in reverse order as follows:

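chunks := split(block)
// Walk from the last chunk down to the first, so the final pair carries chunks[0].
for i := len(chunks) - 1; i >= 0; i-- {
    // SHA1(...) here means the hash state after the prefix chunks[:i], i.e. before chunk i.
    send(SHA1(concat(chunks[:i])), chunks[i])
}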

Given that SHA1 is a streaming hash algorithm, the receiver can validate each chunk as follows:

hash := hashOfEntireBlock
for {
    hashOfPrefix, chunk := receive()
    assert(Sha1Extend(hashOfPrefix, chunk) == hash)
    hash = hashOfPrefix
}
assert(hash == sha1StartingHash) // all SHA1 hashes start at the same value.
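The pseudocode above can be made runnable with a small pure-Python SHA1 that exposes its internal state. This is only a sketch of the honest case: it treats each 64-byte SHA1 block of the padded message as one “chunk”, and the “hash of the prefix” is really the internal state before that block, not a finalized digest. The names `compress`, `pad`, `sender` and `receiver` are made up for this example.

```python
import hashlib
import struct

SHA1_IV = (0x67452301, 0xEFCDAB89, 0x98BADCFE, 0x10325476, 0xC3D2E1F0)


def _rotl(x, n):
    return ((x << n) | (x >> (32 - n))) & 0xFFFFFFFF


def compress(state, block):
    """One SHA1 compression step: fold a 64-byte block into a 5-word state."""
    w = list(struct.unpack(">16I", block))
    for t in range(16, 80):
        w.append(_rotl(w[t - 3] ^ w[t - 8] ^ w[t - 14] ^ w[t - 16], 1))
    a, b, c, d, e = state
    for t in range(80):
        if t < 20:
            f, k = (b & c) | (~b & d), 0x5A827999
        elif t < 40:
            f, k = b ^ c ^ d, 0x6ED9EBA1
        elif t < 60:
            f, k = (b & c) | (b & d) | (c & d), 0x8F1BBCDC
        else:
            f, k = b ^ c ^ d, 0xCA62C1D6
        temp = (_rotl(a, 5) + f + e + k + w[t]) & 0xFFFFFFFF
        a, b, c, d, e = temp, a, _rotl(b, 30), c, d
    return tuple((s + v) & 0xFFFFFFFF for s, v in zip(state, (a, b, c, d, e)))


def pad(data):
    """Standard SHA1 padding: 0x80, zeros, then the 64-bit bit length."""
    padded = data + b"\x80"
    padded += b"\x00" * ((56 - len(padded)) % 64)
    return padded + struct.pack(">Q", len(data) * 8)


def sender(data):
    """Yield (state_before_block, block) pairs, last block first."""
    padded = pad(data)
    blocks = [padded[i:i + 64] for i in range(0, len(padded), 64)]
    states = [SHA1_IV]
    for block in blocks:
        states.append(compress(states[-1], block))
    for i in range(len(blocks) - 1, -1, -1):
        yield states[i], blocks[i]


def receiver(digest, pairs):
    """Check each block against the state it must produce, walking backwards."""
    expected = struct.unpack(">5I", digest)
    blocks = []
    for prefix_state, block in pairs:
        if compress(prefix_state, block) != expected:
            raise ValueError("block does not extend to the expected state")
        expected = prefix_state
        blocks.append(block)
    if expected != SHA1_IV:
        raise ValueError("transfer ended before reaching the SHA1 starting state")
    return b"".join(reversed(blocks))  # the padded message
```

With an honest sender, `receiver(hashlib.sha1(data).digest(), sender(data))` returns the padded message, whose first `len(data)` bytes are the original data. Note that every per-block check only ties a block to a state the sender chose, and only the last check ties the chain to the fixed starting state.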

Unfortunately, I’m almost positive that it’s entirely possible for an attacker to cook up chunks for some targetHash as follows:

hash := targetHash
for {
    hash, chunk := solveForHashChunkPair(hash)
    send(hash, chunk)
}

Basically, as long as the attacker never tries to “finish” sending the file, they can keep sending us chunks.

However, we may be able to use zksnarks to prove that a sender knows how to finish sending a file (without actually sending it). To do this, we’d need to be able to use zksnarks to prove that, in order to produce a proof P (zksnark) for some SHA1 hash H, the author of P must have known some D such that SHA1(D) == H. This would allow a sender to prove to the receiver that the sender knows of a finite set of steps to finish sending the file.

Note: These proofs only need to be produced the first time an object is added to the network, not every time a peer sends an object to another peer. However, they would have to be produced at least once so adding large objects could take over an hour. The moral of the story is: don’t store 100MiB files in git!

I was thinking about something similar as you point out with SHA extension but I think it would be vulnerable to an attack.

The internal state is initially defined (in the spec), and after hashing the state is the output. A preimage attack tries to take the output (which is the final, after-hashing state) and find inputs that would turn it back into the initial one from the spec.

In this case we have no idea about the “along the way” states. That also means the attacker has an additional 160 bits of freedom (the “initial” state, which for SHA1 is 160 bits wide), so I would expect the attack to be much easier than a preimage attack.


However, we may be able to use zksnarks to prove that a sender knows how to finish sending a file.

Having a proof that the sender knows the steps doesn’t mean that they are willing to share the steps with us. A malicious actor might send us the proof but not the steps required later.

Just knowing that someone (anyone) knows the steps should actually be sufficient (i.e., no worse than the current system). It doesn’t prove that the sender is willing to complete the process but does allow one to verify chunks as they are received. If the sender stops sending us blocks, we can just complete the process where we left off with another peer.

Basically, one would:

  1. Receive a chunk.
  2. Receive a proof that someone knows the set of remaining chunks necessary to make the hash work out. Note: By "proof that someone knows the set of remaining chunks", I mean a proof that someone knows some remaining_chunks such that SHA1(remaining_chunks || this_chunk || previously_received_chunks) == targetHash.
  3. Verify the proof. If it fails, ban the peer.
  4. Accept the chunk and go to step 1.

This should allow us to verify chunks as we receive them.
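The four steps above can be sketched as a receive loop in Python. The proof system itself is left out entirely: `verify_proof` is a hypothetical callback standing in for a zksnark verifier that, as far as this thread knows, does not exist yet, and `PeerBanned` is an illustrative exception name.

```python
from typing import Callable, Iterable, List, Tuple


class PeerBanned(Exception):
    """Raised when a peer sends a chunk whose proof does not verify."""


def receive_with_proofs(
    target_hash: str,
    messages: Iterable[Tuple[bytes, bytes]],
    verify_proof: Callable[[str, bytes, List[bytes], bytes], bool],
) -> List[bytes]:
    """Accept (chunk, proof) pairs, newest chunk first, verifying each one."""
    accepted: List[bytes] = []  # chunks received so far, in arrival order
    for chunk, proof in messages:          # step 1 and step 2
        # Step 3: the proof must show that someone knows remaining_chunks
        # such that SHA1(remaining || chunk || accepted) == target_hash.
        if not verify_proof(target_hash, chunk, list(accepted), proof):
            raise PeerBanned("proof failed to verify")
        accepted.append(chunk)             # step 4
    return accepted
```

If a peer stops sending, the accepted chunks stay valid, so the loop can be resumed with messages from a different peer.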

Hmm, I didn’t think about it as a continuation proof. This in theory could work but I think we both agree that using zkSNARKs or zkSTARKs right now is infeasible.

You’ll need someone to mirror it to IPFS anyway or if the original upload point is IPFS, to have commit access. Can’t you just have them sign mappings of IPFS hashes <–> git blob hash with their key? You have to trust them anyway.
If that’s not decentralized enough, you can use something like Freenet’s WebOfTrust. This would be useful for many other applications as well.
Here’s how it could be done for this particular use case:

  • Start with git blob hash B
  • Get the IPFS hash mapping from the most trusted node
  • If it matches, assign positive trust to all nodes which asserted this was a valid mapping
  • If it doesn’t match, assign negative trust to all nodes which asserted this was a valid mapping, return to step 2.
  • If you’ve checked all hash mapping assertions from all nodes and none of them are valid, only then you can download it from a centralized server (and of course upload it to IPFS)

Since nobody would waste trusted keys on causing a minor slowdown for someone, you’ll in practice just have to perform steps 1 and 2.
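Here is a small Python sketch of that lookup, assuming the signed assertions have already been collected and their signatures checked. The data shapes (`assertions` as node to claimed IPFS hash, `trust` as node to integer score), the `fetch` callback, and the +1/-1 trust adjustments are all assumptions made for illustration.

```python
import hashlib
from typing import Callable, Dict, Optional


def git_blob_id(content: bytes) -> str:
    """Git object ID of a blob: SHA1 over the header plus the content."""
    return hashlib.sha1(b"blob %d\x00" % len(content) + content).hexdigest()


def resolve(blob_id: str,
            assertions: Dict[str, str],
            trust: Dict[str, int],
            fetch: Callable[[str], Optional[bytes]]) -> Optional[bytes]:
    """Try claimed mappings from the most trusted node down; update trust."""
    tried = set()
    ranked = sorted(assertions, key=lambda n: trust.get(n, 0), reverse=True)
    for node in ranked:
        claimed = assertions[node]
        if claimed in tried:
            continue
        tried.add(claimed)
        content = fetch(claimed)
        ok = content is not None and git_blob_id(content) == blob_id
        # Reward or penalise every node that asserted this same mapping.
        for other, other_claim in assertions.items():
            if other_claim == claimed:
                trust[other] = trust.get(other, 0) + (1 if ok else -1)
        if ok:
            return content
    return None  # caller falls back to a centralized server
```

Because the downloaded content is always re-hashed with the git algorithm, a wrong mapping costs one wasted download but can never deliver a wrong blob.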

Thoughts?

Totally. Personally, I’d just leave things as-is indefinitely. You shouldn’t be storing large files in git.

The idea is to make it possible to use existing git objects as IPLD objects (note, this is IPLD, not IPFS) without converting anything. This way, one can, e.g., call git show, send the commit hash to a friend, and let them fetch the git repo using the IPFS daemon.

Sure. But someone needs to upload it to IPFS anyway. How are they going to fetch the git repo with their IPFS daemon if it doesn’t exist on IPFS? Clearly, someone has uploaded it to IPFS at some point, and then they could have published the mapping between the git SHA-1 and the IPFS SHA-2.

Not necessarily. For example, using the up-and-coming plugin interface, I’d like to add a GitHub plugin that fetches unknown git objects directly from GitHub.

If you’re fetching them directly from GitHub, where does IPFS come into play? Does it upload them to IPFS too? In that case, you’ll hash them with the IPFS content hashing and then cryptographically sign the mapping between git and IPFS hash.

You’re right, you could have a signed mapping from git hashes to IPLD hashes (which you can fully validate once you download the appropriate blocks). When fetching a large IPLD object, the peer could present a signed mapping to a chunked version. However, is there really much benefit to doing this over the trusted peers solution above? Storing each mapping along with a bunch of signatures could get very expensive.

Much more decentralized. And for some cases, you can’t always use the trusted peers approach. For example, say you want to do this for BitTorrent as well: map info_hash:start_offset:end_offset to IPFS CID. Where’s your trusted peer now?

Storing each mapping along with a bunch of signatures could get very expensive.

How do you mean? A signature is 256 bytes for a 2048-bit key. You just need one signature per node: each node could sign a Merkle root of all its asserted mappings. This is not a huge amount of metadata in any case, even if they sign each mapping individually.

Keybase’s Git solution is based on Keybase FS which is based on IPFS:

https://keybase.io/blog/encrypted-git-for-everyone

Thanks for the link @balupton

Correction: Keybase FS (KBFS) is not based on IPFS but KBFS and IPFS could potentially be used together. I really like the idea of using IPFS as a storage layer for KBFS and perennially look into how hard it would be to do that integration, but thus far nobody has done that work.

Thanks. Looking into it, it seems you are right:

And that this post is wrong:

Hi, just wanted to revive this thread now that GitHub has officially been taken over. Not to make it political or anything, but something as important as our social coding infrastructure should not be owned and operated by a commercial entity.

Looked into Git on Keybase and I have two first impressions:

  1. The backing storage for Git on Keybase is owned and operated by Keybase the company and might disappear if Keybase were ever to run out of money or be sold. Is this even remotely true?
    https://news.ycombinator.com/item?id=15401211

  2. Git on Keybase really only works for private repos.

  3. Is it possible to extend the Keybase experience? To make a GitHub killer, we would need more functionality to support software development.

Really interested to hear if anyone has made any progress on this front. Anyone else interested in starting a self-funding Distributed Autonomous Community as an alternative to GitHub, please send me a message.

-cogwire