Git on IPFS - Links and References

Actually, this may be doable with zksnarks if we can use them to prove SHA1 hashes “valid”. Of course, I know next to nothing about zksnarks so this may not be possible…

Backing up a bit, the naive way to do this would be to send chunks in reverse order as follows:

chunks := split(block)
for i := len(chunks) - 1; i >= 0; i-- {
    send(SHA1(concat(chunks[:i])), chunks[i])
}

Given that SHA1 is a streaming hash algorithm, the receiver can validate each chunk as follows:

hash := hashOfEntireBlock
for hash != sha1StartingHash { // all SHA1 hashes start at the same value
    hashOfPrefix, chunk := receive()
    assert(Sha1Extend(hashOfPrefix, chunk) == hash)
    hash = hashOfPrefix
}
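
As a sanity check on the Sha1Extend idea, here’s a minimal Go sketch of the streaming property it relies on: the SHA-1 state after hashing a prefix can be exported and later extended with more data, producing the same digest as hashing everything at once. (Go’s crypto/sha1 happens to expose that intermediate state via encoding.BinaryMarshaler / BinaryUnmarshaler; note that the hashOfPrefix values in the pseudocode above would really be these intermediate states rather than finished digests.)

package main

import (
    "bytes"
    "crypto/sha1"
    "encoding"
    "fmt"
)

func main() {
    prefix := []byte("chunk0chunk1") // chunks[:i] in the pseudocode
    suffix := []byte("chunk2")       // chunks[i]

    // Hash the prefix and export the intermediate SHA-1 state.
    h := sha1.New()
    h.Write(prefix)
    state, err := h.(encoding.BinaryMarshaler).MarshalBinary()
    if err != nil {
        panic(err)
    }

    // "Sha1Extend": resume from the saved state and feed the next chunk.
    ext := sha1.New()
    if err := ext.(encoding.BinaryUnmarshaler).UnmarshalBinary(state); err != nil {
        panic(err)
    }
    ext.Write(suffix)

    // Hashing everything in one pass yields the same digest.
    whole := sha1.Sum(append(append([]byte{}, prefix...), suffix...))
    fmt.Println("match:", bytes.Equal(ext.Sum(nil), whole[:])) // true
}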

Unfortunately, I’m almost positive that it’s entirely possible for an attacker to cook up chunks for some targetHash as follows:

hash := targetHash
for {
    hash, chunk := solveForHashChunkPair(hash) // find any pair such that Sha1Extend(hash, chunk) == the previous hash
    send(hash, chunk)
}

Basically, as long as the attacker never tries to “finish” sending the file, they can keep sending us chunks.

However, we may be able to use zksnarks to prove that a sender knows how to finish sending a file (without actually sending it). To do this, we’d need to be able to use zksnarks to prove that, in order to produce a proof P (zksnark) for some SHA1 hash H, the author of P must have known some D such that SHA1(D) == H. This would allow a sender to prove to the receiver that the sender knows of a finite set of steps to finish sending the file.

Note: These proofs only need to be produced the first time an object is added to the network, not every time a peer sends an object to another peer. However, they would have to be produced at least once so adding large objects could take over an hour. The moral of the story is: don’t store 100MiB files in git!


I was thinking about something similar to what you point out with SHA extension, but I think it would be vulnerable to an attack.

The internal state starts at a value defined in the spec, and after hashing, the state is the output. A preimage attack takes the output (which is the final, post-hashing state) and tries to find inputs that would turn it back into the initial state from the spec.

In this case we have no idea about the “along the way” states, which also means the attacker has an additional 160 bits of freedom (the “initial” state), so I would expect this attack to be much easier than a preimage attack.


However, we may be able to use zksnarks to prove that a sender knows how to finish sending a file.

Having a proof that the sender knows the steps doesn’t mean they are willing to share the steps with us. A malicious actor might send us the proof but not the steps required later.

Just knowing that someone (anyone) knows the steps should actually be sufficient (i.e., no worse than the current system). It doesn’t prove that the sender is willing to complete the process but does allow one to verify chunks as they are received. If the sender stops sending us blocks, we can just complete the process where we left off with another peer.

Basically, one would:

  1. Receive a chunk.
  2. Receive a proof that someone knows the set of remaining chunks necessary to make the hash work out. Note: By "proof that someone knows the set of remaining chunks", I mean that someone knows some remaining_chunks such that SHA1(remaining_chunks || this_chunk || previously_received_chunks) == targetHash.
  3. Verify the proof. If it fails, ban the peer.
  4. Accept the chunk and go to step 1.

This should allow us to verify chunks as we receive them.
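
For concreteness, here’s a rough Go sketch of that loop. Everything in it is hypothetical: Message, Proof, and verifyContinuationProof are made-up placeholders for a wire format and a zkSNARK verifier that don’t exist today; the sketch only shows where the proof check would sit.

package receiver

import (
    "bytes"
    "crypto/sha1"
    "errors"
)

// Proof is an opaque continuation proof (hypothetical stand-in for a zkSNARK).
type Proof []byte

// Message is a hypothetical wire format: one chunk plus a proof that someone
// knows the remaining chunks needed to complete targetHash.
type Message struct {
    Chunk []byte
    Proof Proof
    Final bool // true when Chunk is the first chunk of the file (nothing remains)
}

// verifyContinuationProof is a placeholder for a real zkSNARK verifier. It
// should check that the prover knows remaining_chunks such that
// SHA1(remaining_chunks || chunk || received) == targetHash.
func verifyContinuationProof(p Proof, targetHash [20]byte, chunk, received []byte) bool {
    return false // placeholder only
}

// receiveAll runs steps 1-4 above until the file is complete.
func receiveAll(targetHash [20]byte, receive func() (Message, error)) ([]byte, error) {
    var received []byte // suffix of the file so far; chunks arrive in reverse order
    for {
        msg, err := receive() // step 1: receive a chunk (and its proof)
        if err != nil {
            return nil, err
        }
        if msg.Final {
            // Last message: nothing remains, so just check the full hash.
            full := append(append([]byte{}, msg.Chunk...), received...)
            if sum := sha1.Sum(full); !bytes.Equal(sum[:], targetHash[:]) {
                return nil, errors.New("final hash mismatch: ban peer")
            }
            return full, nil
        }
        // Steps 2-3: verify the continuation proof; if it fails, ban the peer.
        if !verifyContinuationProof(msg.Proof, targetHash, msg.Chunk, received) {
            return nil, errors.New("bad continuation proof: ban peer")
        }
        // Step 4: accept the chunk and go back to step 1.
        received = append(append([]byte{}, msg.Chunk...), received...)
    }
}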

Hmm, I didn’t think about it as a continuation proof. That could work in theory, but I think we both agree that using zkSNARKs or zkSTARKs for this is infeasible right now.


You’ll need someone to mirror it to IPFS anyway, or, if the original upload point is IPFS, someone with commit access. Can’t you just have them sign mappings of IPFS hashes <–> git blob hashes with their key? You have to trust them anyway.
If that’s not decentralized enough, you can use something like Freenet’s WebOfTrust. This would be useful for many other applications as well.
Here’s how it could be done for this particular use case:

  1. Start with git blob hash B.
  2. Get the IPFS hash mapping from the most trusted node.
  3. If it matches, assign positive trust to all nodes which asserted this was a valid mapping.
  4. If it doesn’t match, assign negative trust to all nodes which asserted this was a valid mapping, then return to step 2.
  5. If you’ve checked all hash mapping assertions from all nodes and none of them are valid, only then do you download it from a centralized server (and of course upload it to IPFS).

Since nobody would waste trusted keys on causing a minor slowdown for someone, you’ll in practice just have to perform steps 1 and 2.
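
Something like the following Go sketch, perhaps. All the types here are hypothetical (Freenet’s WebOfTrust has its own API); verify stands in for fetching the IPFS object and re-hashing it as a git blob.

package wot

// Node and Assertion are hypothetical; a real WebOfTrust client would have its own types.
type Node struct {
    ID    string
    Trust int // higher = more trusted
}

type Assertion struct {
    GitBlobHash string // e.g. "B"
    IPFSHash    string
    Signers     []*Node // nodes asserting this mapping is valid
}

// resolve walks assertions from most- to least-trusted and returns the first
// IPFS hash whose content actually matches the git blob hash, adjusting trust
// as it goes. verify would fetch the IPFS object and re-hash it with git's SHA-1.
func resolve(gitHash string, assertions []Assertion, verify func(ipfsHash string) bool) (string, bool) {
    // Assume assertions are already ordered by the trust of their most trusted signer.
    for _, a := range assertions {
        if a.GitBlobHash != gitHash {
            continue
        }
        if verify(a.IPFSHash) {
            for _, n := range a.Signers {
                n.Trust++ // positive trust for a valid mapping
            }
            return a.IPFSHash, true
        }
        for _, n := range a.Signers {
            n.Trust-- // negative trust for an invalid mapping
        }
    }
    return "", false // fall back to a centralized server, then publish to IPFS
}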

Thoughts?

Totally. Personally, I’d just leave things as-is indefinitely. You shouldn’t be storing large files in Git.

The idea is to make it possible to use existing Git objects as IPLD objects (note, this is IPLD, not IPFS) without converting anything. This way, one can, e.g., call git show, send the commit hash to a friend, and let them fetch the git repo using the IPFS daemon.

Sure. But someone needs to upload it to IPFS anyway. How are they going to fetch the git repo with their IPFS daemon if it doesn’t exist on IPFS? Clearly, someone has uploaded it to IPFS at some point, and then they could have published the mapping between the git SHA-1 and the IPFS SHA-2.

Not necessarily. For example, using the up-and-coming plugin interface, I’d like to add a GitHub plugin that fetches unknown git objects directly from GitHub.


If you’re fetching them directly from GitHub, where does IPFS come into play? Does it upload them to IPFS too? In that case, you’ll hash them with the IPFS content hashing and then cryptographically sign the mapping between git and IPFS hash.

You’re right, you could have a signed mapping from git hashes to IPLD hashes (which you can fully validate once you download the appropriate blocks). When fetching a large IPLD object, the peer could present a signed mapping to a chunked version. However, is there really much benefit to doing this over the trusted-peers solution above? Storing each mapping along with a bunch of signatures could get very expensive.

It’s much more decentralized. And in some cases you can’t use the trusted-peers approach at all. For example, say you want to do this for BitTorrent as well: map info_hash:start_offset:end_offset to an IPFS CID. Where’s your trusted peer now?

Storing each mapping along with a bunch of signatures could get very expensive.

How do you mean? A signature is 256 bytes for a 2048-bit key. You just need one signature for each node; each node could sign a merkle root of all their asserted mappings. That’s not a huge amount of metadata in any case, even if they sign each mapping individually.
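
A quick sketch of that idea, assuming RSA-2048 and a plain binary merkle tree over the mapping strings (none of this is an existing IPFS or WebOfTrust API):

package mappings

import (
    "crypto"
    "crypto/rand"
    "crypto/rsa"
    "crypto/sha256"
)

// merkleRoot builds a simple binary merkle tree over the leaves (e.g.
// "gitSHA1=>IPFS CID" mapping strings) and returns its root.
func merkleRoot(leaves [][]byte) [32]byte {
    if len(leaves) == 0 {
        return sha256.Sum256(nil)
    }
    level := make([][32]byte, len(leaves))
    for i, l := range leaves {
        level[i] = sha256.Sum256(l)
    }
    for len(level) > 1 {
        var next [][32]byte
        for i := 0; i < len(level); i += 2 {
            if i+1 == len(level) { // odd node is carried up unchanged
                next = append(next, level[i])
                continue
            }
            pair := append(level[i][:], level[i+1][:]...)
            next = append(next, sha256.Sum256(pair))
        }
        level = next
    }
    return level[0]
}

// signMappings returns one 256-byte signature (for a 2048-bit key) covering
// every mapping the node asserts.
func signMappings(key *rsa.PrivateKey, mappings [][]byte) ([]byte, error) {
    root := merkleRoot(mappings)
    return rsa.SignPKCS1v15(rand.Reader, key, crypto.SHA256, root[:])
}

A node asserting thousands of mappings would still publish a single 256-byte signature; a verifier only needs the merkle path for the one mapping it cares about.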

Keybase’s Git solution is based on Keybase FS which is based on IPFS:

https://keybase.io/blog/encrypted-git-for-everyone

Thanks for the link @balupton

Correction: Keybase FS (KBFS) is not based on IPFS but KBFS and IPFS could potentially be used together. I really like the idea of using IPFS as a storage layer for KBFS and perennially look into how hard it would be to do that integration, but thus far nobody has done that work.


Thanks. Looking into it, it seems you are right:

And that this post is wrong:


Hi, just wanted to revive this thread now that GitHub has officially been taken over. Not to make it political or anything, but something as important as our social coding infrastructure should not be owned and operated by a commercial entity.

Looked into Git on Keybase and I have a few first impressions:

  1. the backing storage for Git on Keybase is owned and operated by Keybase the company and might disappear if Keybase were ever to run out of money, or sell. Is this even remotely true?
    https://news.ycombinator.com/item?id=15401211

  2. Git on Keybase really only works for private repos.

  3. Is it possible to extend the Keybase experience? To make a GitHub killer, we would need more functionality to support software development.

Really interested to hear if anyone has made any progress on this front. Anyone else interested in starting a self-funding Distributed Autonomous Community as an alternative to GitHub, please send me a message.

-cogwire


@cogwire Very much! I would send you a message but apparently new users can’t do that… We are just getting started on something called Git DAC. Found this thread while looking at the best way to store git repos on IPFS. If you can message me, go for it. Otherwise find me on twitter, or let me know how you would prefer to connect! We might even start offering some bounties to help bootstrap the initial phases of the project, so if you or anyone you know is handy with IPFS and other decentralized architectures please point them my way!

Another tool for publishing Git repositories to IPFS with webhooks, similar to “github-ipfs” but with a wider feature set: https://github.com/AuHau/ipfs-publish

I just announced a new remote that indirects all the objects to per-type folders.

As well as avoiding the sharding problem, the file contents are readable and the root of the repo is actually the contents of the pushed branch.

So perhaps the naive way isn’t so bad in relative terms? To elaborate on my reply in Handle blob objects larger than MessageSizeMax · Issue #18 · ipfs/go-ipld-git · GitHub:

The attack above is the solveForHashChunkPair loop, where the attacker manufactures a fresh (hash, chunk) pair each round.

But for any merkelized data, we have:

hash := targetHash
for {
    chunkWithHash := solveForChunk(hash) // forge a chunk whose body embeds a plausible-looking SHA-1 CID
    send(chunkWithHash)
    hash = extractSha1Cid(chunkWithHash) // continue the attack from that embedded CID
}

i.e., the fraudulent chunk can embed “plausible CIDs” which keep the attack alive for future rounds.

Also, thanks to the Merkle–Damgård construction (see Wikipedia), SHA-1 puts the message length in the padding at the end, so the attacker still needs to commit to a length up front. This means no attack can waste resources indefinitely. Furthermore, users are free to set policies specifying, as a function of how much they trust a peer, the largest blob they’ll try to receive from it, to further mitigate spam.
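
To make the length-commitment point concrete, here’s a tiny Go sketch of SHA-1’s padding as I read FIPS 180-4 (not code from any IPFS repo): the final bytes always encode the total message length, so any forged stream has to declare how long it will eventually be.

package padding

import "encoding/binary"

// sha1Padding returns the bytes SHA-1 appends to a message of msgLen bytes:
// a 0x80 byte, zeros until the padded length is 56 mod 64, and then the
// message length in bits as a 64-bit big-endian integer. That trailing
// length is what forces an attacker to commit to a total size up front.
func sha1Padding(msgLen uint64) []byte {
    pad := []byte{0x80}
    for (uint64(len(pad))+msgLen)%64 != 56 {
        pad = append(pad, 0x00)
    }
    var lenBits [8]byte
    binary.BigEndian.PutUint64(lenBits[:], msgLen*8)
    return append(pad, lenBits[:]...)
}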

Now it could well be that solveForChunk is substantially harder than solveForHashChunkPair, but given that SHA-1 is kinda “hosed anyways”, I’d consider cautiously banning it for most CIDs, while making an exception for git-raw + sha1 given its ubiquity.

Ultimately, I think it’s in IPFS’s best interest to lobby hard for git + sha256 to chunk blobs, but IPFS will have more clout if there’s a stop-gap solution for git + sha1 to attract git users.

Yet another Git remote bridge is now available (there is also a PR on awesome.ipfs). It is a set of Python programs that use the HTTP API to let users clone, maintain, and share Git repositories over IPFS/IPNS IDs using the native Git CLI and a simple scripting language.