CID concept is broken

@dbaarda I didn’t know enough about the direct DHT API to tell whether the DHT can be used almost like a generalized key/value store, which is what you seem to be implying. Are you saying it can be done with the existing version, or is this a “would be nice to have”? I have a feeling it’s impossible to store arbitrary key types (like a SHA-256 string) in the DHT like this, but I’d love to find out I’m wrong.

@ldeffenb “Trustless” in the decentralized web always means “checkable by hashes and/or signatures”, so I’m totally fine with doing a hash check AFTER I get the data. I still call that “trustless”.


The DHT API docs have put and get commands that appear to take arbitrary buffers of data. Other docs talk about the three record types in the DHT (peerid, multiaddress, and ipns), but I don’t think any validation against those record types is done at the DHT API layer, only at higher layers.

If you wanted to integrate it into ipfs you’d need to add the new record type and how it is handled into the higher layers (kind of like an alternative lookup method to ipns).

Without integrating it into ipfs, and assuming the DHT can put/get arbitrary values, you could write an application on top of ipfs that can upload/download files keyed by the whole file hash.


I’m definitely not knowledgeable enough about the DHT to say anything with certainty here, but I did try a couple of things. The underlying DHT might have the API options to do what you said, but IPFS itself doesn’t seem to be that permissive.

```
ipfs dht put - Write a key/value pair to the routing system.

ipfs dht put [--verbose | -v] [--] <key> <value>

Given a key of the form /foo/bar and a valid value for that key, this will write
that value to the routing system with that key.

Keys have two parts: a keytype (foo) and the key name (bar). IPNS uses the
/ipns keytype, and expects the key name to be a Peer ID. IPNS entries are
specifically formatted (protocol buffer).

You may only use keytypes that are supported in your ipfs binary: currently
this is only /ipns. Unless you have a relatively deep understanding of the
go-ipfs routing internals, you likely want to be using 'ipfs name publish' instead
of this.

The value must be a valid value for the given key type. For example, if the key
is /ipns/QmFoo, the value must be IPNS record (protobuf) signed with the key
identified by QmFoo.

For more information about each command, use:
'ipfs dht put --help'
```
So for example a hypothetical ipfs dht put /checksum/sha2-256/<some checksum> <CIDv1 of file in IPFS> won’t work because only /ipns is currently allowed.

Anyone up for making a proposal to add that /checksum/<algorithm> to IPFS? :slight_smile:

Even if this landed in IPFS, it would probably still be an opt-in feature. Ideally, IPFS itself would eventually just add this to the logic of adding files. Then by default you’d get a CID plus a DHT entry to look up the sha2-256 -> CIDv1 mapping. That would be just perfect!

Wouldn’t this also fix your initial issue, @ngortheone?
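To make the proposed mapping concrete, here’s a minimal Python sketch of what publishing and resolving such a record could look like. Everything here is hypothetical: the `/checksum/sha2-256` keytype doesn’t exist yet, the helper names are made up, the CID value is a placeholder, and a plain dict stands in for the distributed DHT.

```python
import hashlib

# Toy stand-in for the DHT; the real one is a distributed key/value store.
dht = {}

def publish_checksum_mapping(data: bytes, cid: str) -> None:
    """Hypothetical: on 'ipfs add', also publish sha2-256(file) -> CID."""
    key = "/checksum/sha2-256/" + hashlib.sha256(data).hexdigest()
    dht[key] = cid

def lookup_cid_by_sha256(data: bytes) -> str:
    """Resolve a plain file hash to the CID it was published under."""
    key = "/checksum/sha2-256/" + hashlib.sha256(data).hexdigest()
    return dht[key]

publish_checksum_mapping(b"hello world", "bafy-placeholder")  # hypothetical CIDv1
assert lookup_cid_by_sha256(b"hello world") == "bafy-placeholder"
```

Note that the record itself is still unverifiable on its own; anyone could publish a false mapping, which is exactly the trust problem discussed further down the thread.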

@markg85 Thanks for the great write-up on this. What you’re saying is exactly what I also thought could be the solution. It seems like easy, low-hanging fruit for the DHT developers to add.

I think IPFS is growing and the industry is already accepting it, but a new “look up any file by its hash” feature would be a huge selling point for IPFS, and a huge win almost for free.


@everyone in this thread.

Now that we seem to have a rough understanding of what we want (feature-wise), and potentially even how it can be done in IPFS, how do we proceed from here? This thread will slowly sink as new threads arrive and will eventually get lost in time.

I have two possible ideas/solutions here.

  1. (we should probably do this one regardless) Open an issue on go-ipfs describing this feature clearly, so anyone can pick it up from there.
  2. Set up a bounty to let someone develop this. I’m willing to donate for this feature, as I too would now like to have it! But on which platform? A crowdfunding approach would make sense to me. Is IPFS active in that area somewhere?

But I’m not alone in this thread; there are a bunch more of us. How do we proceed with the highest chance of it getting somewhere? Also, if it were to be built by someone, would it even be accepted as a feature in IPFS?

Can someone give me a quick rundown of what the consensus of this thread is?


@ShadowJonathan Some people (including me) realized there would be significant value in being able to look up files based on their actual hash, something not currently possible with IPFS.


I have to ask: Is the CID Concept broken?


Are you actually saying that the very definition of IPFS (on Wikipedia) is wrong?

The InterPlanetary File System (IPFS) is a protocol and peer-to-peer network for storing and sharing data in a distributed file system. IPFS uses content-addressing to uniquely identify each file in a global namespace connecting all computing devices.

Or just that IPFS by itself cannot do content addressing?

Yeah, most definitions of “content-addressing” strongly imply a single hash over the file, whereas an IPFS CID is more of a Merkle-tree root hash: essentially a proprietary record ID that can only be generated by running a specific, complicated tree algorithm over the entire file with specific settings. Change the settings, and you get a totally different CID.


So here’s an unexpected twist of events!
As I usually do, I watched the IPFS Core Implementations Weekly Sync. In it, @adin was talking about the very concept of this issue (as initially posted, not the conclusion we came to).

You can find the part where he talks about it here.
You can find his quite elaborate thoughts (basically a design document) on the matter here. And a patch for the protocol change (the documentation of it, not the code) is here.

Nice work, @adin!!! :+1:


Thanks for that, @markg85. That’s a really interesting and very important topic in IPFS, I think. I appreciate you posting the links.

Thanks for calling this out! Is @adin’s proposal written up yet? I wonder how similar it is to what I wrote in Git on IPFS - Links and References - #24 by Ericson2314. Very glad to see something happen here!

It turns out there’s a technical challenge: if you map a SHA-256 to a CID, you end up having to trust that the peer who created the mapping is honest and not an attacker, because the cost of each false entry from a hostile peer is that everyone has to download the entire file before discovering the hash is wrong.

I’m no expert at “logic puzzles” myself, but I don’t see a solution to this other than some kind of vote counting, where lots of peers basically “vote” on what the correct mapping is, in order to have an attack-proof system; adding that kind of complexity may end up just not being worth it.

Yes, the idea isn’t just a black-box mapping from SHA-whatever to an unrelated CID. It’s instead to exploit the streaming structure of the hash, relaying blocks in reverse so that each block can be verified with just the hash of the to-be-sent block being trusted. That’s not perfect, but it’s a lot better.
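To illustrate the reverse-streaming idea, here’s a toy Python sketch. SHA-256 is a Merkle–Damgård chain, but `hashlib` doesn’t expose its internal compression function, so a full SHA-256 over (state || block) stands in for it here; this is an illustration of the verification pattern, not the actual proposal. The point is that, given only a trusted final hash, blocks received newest-first can each be verified the moment they arrive.

```python
import hashlib
import os

def compress(state: bytes, block: bytes) -> bytes:
    # Stand-in for SHA-256's internal compression function (not exposed
    # by hashlib); the real proposal would use the actual MD internals.
    return hashlib.sha256(state + block).digest()

def verify_in_reverse(target: bytes, reversed_stream):
    """reversed_stream yields (state_before_block, block), newest first.
    Only `target` (the final hash) needs to be trusted up front."""
    verified = []
    for prev_state, block in reversed_stream:
        if compress(prev_state, block) != target:
            raise ValueError("block failed verification")
        verified.append(block)
        target = prev_state  # the predecessor state is now trusted
    return list(reversed(verified))

# Sender hashes forward, recording the intermediate state before each block.
blocks = [os.urandom(16) for _ in range(4)]
states = [b"\x00" * 32]  # stand-in for the fixed IV
for b in blocks:
    states.append(compress(states[-1], b))

stream = [(states[i], blocks[i]) for i in reversed(range(len(blocks)))]
assert verify_in_reverse(states[-1], stream) == blocks
```

A tampered block fails immediately, because its compression output no longer matches the already-trusted successor state.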

I couldn’t be sure, but it sounded like @adin had similar thoughts, both mentioning the Merkle–Damgård structure of the SHA functions and pointing out that the tree structure of the BLAKE ones could allow things that are even better.

Thanks @markg85 this issue has been circulating in various forms for a long time. Some of the issues and ideas linked in the PR are from 2017.

Overall, the difficulty is whether there is a way to do this verifiably using the existing major hash functions (e.g. SHA-2-256). I’m hoping the proposal is sound, but I want more eyes to verify it. If it’s determined to be sound, we can see what it’ll take to implement the changes and where it falls on our roadmap; but having a sound proposal (assuming it is sound) is a step forward.

@Ericson2314 yep, my proposal (which I assume you found, but it’s in @markg85’s post above) is based on that discussion on Git on IPFS (I linked to it in the proposal).

I’d like more eyes to verify that the proposed approach is sound, but as I’ve written in that PR, it’s looking pretty good: there is a formal definition of the problem we were concerned about (i.e. freestart collisions), and it looks like SHA-2 is currently safe from it. I’d still like to flesh out more deeply what goes wrong if freestart collisions exist but full collisions are still infeasible. I suspect the attack is both not that bad and, without the existence of freestart preimages, expensive for the attacker to pull off. Understanding some of the details here would allow us to evaluate whether enabling this behavior for SHA-1 is a possibility, or whether Git would have to wait for SHA-2 support.

One thing that IMO is a pretty important extension/addition to the prior versions of this proposal is taking the approach referenced in RFC|BB|L2 - Speed up Bitswap with GraphSync CID only query that basically allows us to efficiently download a linear graph across multiple peers by being a bit more trusting and scaling our trust over time. Without this downloading a large block (e.g. a 10GiB file referenced by the SHA2 of the file) would take a ridiculously long time and be restricted to downloading data from a single peer.

Yep, if you check out the Blake paper they basically mention our use case.
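For intuition on why a tree-mode hash helps, here’s a toy binary hash tree in Python, with SHA-256 standing in for a BLAKE3-style tree mode. This is an illustration of the general idea, not BLAKE3’s actual construction: the root commits to every leaf, and any single block can be verified out of order with just a logarithmic number of sibling hashes.

```python
import hashlib

def h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

def merkle_root(leaves):
    """Root of a toy binary hash tree over the given leaf blocks."""
    level = [h(b"leaf:" + b) for b in leaves]
    while len(level) > 1:
        if len(level) % 2:
            level.append(level[-1])  # duplicate last node on odd levels
        level = [h(b"node:" + level[i] + level[i + 1])
                 for i in range(0, len(level), 2)]
    return level[0]

def merkle_proof(leaves, index):
    """Sibling hashes needed to verify leaves[index] against the root."""
    level = [h(b"leaf:" + b) for b in leaves]
    proof = []
    while len(level) > 1:
        if len(level) % 2:
            level.append(level[-1])
        proof.append((index % 2, level[index ^ 1]))  # (am-I-right-child, sibling)
        level = [h(b"node:" + level[i] + level[i + 1])
                 for i in range(0, len(level), 2)]
        index //= 2
    return proof

def verify_leaf(leaf, proof, root):
    node = h(b"leaf:" + leaf)
    for is_right, sibling in proof:
        node = h(b"node:" + sibling + node) if is_right else h(b"node:" + node + sibling)
    return node == root

blocks = [bytes([i]) * 4 for i in range(5)]
root = merkle_root(blocks)
assert verify_leaf(blocks[2], merkle_proof(blocks, 2), root)
assert not verify_leaf(b"bogus", merkle_proof(blocks, 2), root)
```

Compared with the linear Merkle–Damgård chain, this removes the need to receive blocks strictly in reverse: any subtree can be fetched from any peer and checked independently.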

Using the DHT to store unverified data

The approaches mentioned here are understandable as well, given that they unleash a lot of flexibility (e.g. I claim A is a gzipped version of B), but as many of you have noticed, there are difficulties associated with needing to download a lot of data over an untrusted p2p network before being able to verify that you actually received the data you expected. In some use cases this might not be a big problem, but in the general case things can become quite difficult.

It’s also worth noting that these unverified or Web of trust style solutions can unlock other functionality too. For example, I claim that A and B are the same book but translated into different languages. This is even less verifiable than the gzip example above, but has its use case.

I wouldn’t want to discourage folks from experimenting with implementing some of these ideas, but IMO if we can solve some of the more pressing problems in a verifiable way, that approach will be easier and will keep the mental models people have when working with IPFS intact.

That being said, if you’re interested in ways you can build new record types on top of the existing DHT without proposing a system wide protocol upgrade my comment might be helpful (and if you’d like to go into this more I’d recommend creating a new discuss topic and tagging me :smiley: )


Oh, I somehow missed the hackmd and PR being linked above! Since this thread has gone through a bunch of topics, I’ll take the conversation there.

@ngortheone First of all, you’re correct that a CID is not the hash of the original file’s content, and that this can be confusing and sometimes inconvenient, particularly if multiple people independently pin the same file but with different hashing/chunking algorithms. This can indeed be wasteful, but I don’t see it as a major problem, since I’d imagine that in most cases people pin an existing version of the file already on IPFS (thereby using the same hash/chunking settings), and the case where different people instead add it from the original file is less common. Worst case, you have 2-3 versions of the same file available.

While not a perfect analogy, you could also complain that saving an image in different file formats is going to result in different files. That’s just how data storage works, and different image formats each have their own merits, as do different hash algorithms and chunking strategies.

There are aspects of the CID design I consider overkill, in particular multibase, i.e. the ability to express a given binary CID (an array of bytes) as text in different ways. I think they should have just picked one encoding and gone with it: a lowest common denominator that works everywhere, like base16 or base32. Multibase can also cause confusion, because the same hash can be represented in different ways, such that it’s not obvious that multiple text encodings are actually the same hash.
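As a concrete illustration of the multibase point, the same raw bytes render as completely unrelated-looking strings depending on the base. The byte string below is arbitrary, not a real CID; the ‘b’ (base32, lowercase, unpadded) and ‘f’ (base16) multibase prefixes are the real ones, though.

```python
import base64

raw = bytes(range(8))  # arbitrary illustrative bytes, not a real CID

b32 = "b" + base64.b32encode(raw).decode().lower().rstrip("=")  # multibase 'b'
b16 = "f" + raw.hex()                                           # multibase 'f'

# Same identifier, two text forms that share no obvious resemblance:
print(b32)  # baaaqeayeaudao
print(b16)  # f0001020304050607
```

Nothing about the two strings suggests they name the same bytes, which is exactly the confusion described above.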

Multihash, on the other hand, is more useful. Many protocols/tools over the years were designed around a single hash function, e.g. SHA-1 for BitTorrent and Git. Those hash algorithms have since been found to be insecure. Git is in the process of transitioning away from SHA-1, and the transition is way, way more complicated than if Git had been designed from the beginning to support multiple hash algorithms. IPFS is designed to cater for extensibility and evolution of the protocol, and while this does make things a bit more complex to deal with right now (e.g. choosing which hash algorithm to use), in the long run it’s likely to pay off.

Finally, the DAG serves an important purpose. The hash of each block/node is computed independently, and when downloading blocks a recipient can verify whether each block has the correct content based on its hash, without having to wait for the whole file to download. And if a block is corrupt, it can re-download just that block, either from the same peer or from another. Suppose you had an 8 GB file: if hashing were only done at the file level, you’d have to grab the whole 8 GB again, because you wouldn’t know which block(s) were corrupt.
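As a minimal illustration of per-block verification (this is not IPFS’s actual block format or transfer protocol, and the helper names are made up), the sketch below splits data into blocks, records each block’s SHA-256, and verifies each downloaded block on arrival, so only a corrupt block needs re-fetching:

```python
import hashlib

def make_manifest(data: bytes, block_size: int = 256 * 1024):
    """Split data into fixed-size blocks and record each block's SHA-256."""
    blocks = [data[i:i + block_size] for i in range(0, len(data), block_size)]
    return [hashlib.sha256(b).hexdigest() for b in blocks], blocks

def fetch_and_verify(manifest, fetch_block):
    """Fetch each block (possibly from different peers), verifying it on
    arrival; a corrupt block is re-fetched on its own, not the whole file."""
    out = []
    for i, expected in enumerate(manifest):
        block = fetch_block(i)
        if hashlib.sha256(block).hexdigest() != expected:
            block = fetch_block(i)  # retry just this block (or ask another peer)
            if hashlib.sha256(block).hexdigest() != expected:
                raise ValueError(f"block {i} is corrupt")
        out.append(block)
    return b"".join(out)
```

With file-level hashing only, the single retry above would instead mean re-downloading everything.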

Note that most filesystems already use a DAG (or rather a tree); see for example the single/double/triple indirection nodes in a variety of Unix filesystems. A “file” is an abstraction over the block store that uses a DAG to store its layout, and this is true whether you’re using BSD Unix from the 1980s or IPFS in 2021.

I think the design rationale that led to some of the above properties of IPFS isn’t really well explained and can definitely be confusing to newcomers. I spent a lot of time thinking “what the hell is this mess” when I started out trying to understand how IPFS works, but over time I came to better understand some of the decisions (not that I agree with them all). I think improved documentation would be really valuable in this regard.


I wonder if Probabilistically Checkable Proofs (PCPs) could be used to publish verifiable mappings of SHA256(file) => CID(file)