CID concept is broken

I don’t know of any project or person that has solved this one.

Conceptually it doesn’t seem that complicated to me. In essence it could be a multihash where one key maps to any number of values, and the opposite is possible too.

Implementing something like this probably requires a massive restructuring of how IPFS works with the DHT. On top of that, IPFS’s chunking very much gets in the way of even making this possible.

Before I knew all this, I thought it would be as “simple” as creating a DHT listener, catching all CIDs, and deriving the checksum from each of them; that would give you a starting point. But that isn’t the case at all: to get the checksum, you really have to download the entire file.
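A toy sketch of why the listener idea fails: a file’s checksum can be computed incrementally, but every byte still has to pass through the hasher, so there is no shortcut around downloading the full content. (The chunk values below are made up for illustration.)

```python
import hashlib

# Hypothetical chunks of a file as they would arrive from peers.
chunks = [b"chunk-one ", b"chunk-two ", b"chunk-three"]

# SHA-256 can be fed chunk by chunk, but every byte must still pass
# through the hasher -- streaming doesn't save you from downloading.
h = hashlib.sha256()
for chunk in chunks:
    h.update(chunk)

full = hashlib.sha256(b"".join(chunks)).hexdigest()
print(h.hexdigest() == full)  # True: streaming or not, all bytes are needed
```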

A new CID format could also be designed to make this easier, could it not? The format was made with freedom of change in mind, so surely that could help. I tried to think of a CID that would help here, but I find it quite difficult to come up with one that keeps the current hash logic, where the hash is composed of the child nodes (a DAG-supported hash). You can just bluntly add a digest to it, but then the CID becomes really long (you basically append a whole extra digest, e.g. 32 bytes for SHA-256)!


This sounds like the most realistic way to achieve something like this. IPFS is fundamentally a content-addressed block storage model. By itself it’s too low-level for such a feature, but you could probably build the feature on top of it.

The CID concept is perfectly fine in context, but it seems reasonable that the author misconstrued it the way they did.


Damn, I was hoping IPFS had solved this issue. The BitTorrent protocol has the same problem.
It’s obvious why both went this route, but it’s unfortunate because it’s not truly content-addressable: what is being addressed is the checksum index, not the content itself.

The downside is that it would become opt-in: an upload wouldn’t automatically get that extra metadata. Either something would have to tell it (the opt-in), or something would have to download every new piece of IPFS data to add it “automatically”.

That’s very far from ideal and likely unsustainable.

It does seem oddly ironic that IPFS has this big DHT capable of looking up lots of info, but the one thing it cannot do is find data based on its hash. “A hash”, yes, but “its hash”, no.

I wonder if this is just an oversight or by design?
Based on the Merkle trees it seems “by design”, but based on logic it seems like an “oversight”.

Perhaps the big inventor of this great project can shed some light on this? @jbenet !

What I also keep thinking about with this subject is that a blockchain might be of value here (pun intended).
If you had a key→value blockchain where the key is, for instance, a wallet, then the transactions in that wallet would be the values (the checksum under each algorithm) known for that file. The key would be the current CID. This would map neatly onto the immutability of CIDs and guarantee that a file with checksum X really belongs to a certain CID.

You’d still need a lot of magic to do the checksum -> cid lookup though.

@markg85 It’s a little of both. The design piece: by hashing the Merkle tree of block hashes, you cryptographically secure the metadata, ensuring that the block-checksum set you look up actually describes the content. This prevents attacks where a node returns an invalid digest, causing the download to fail validation only at the very end.

The downside is that you then can’t use the hash of the actual content without incurring a second DHT discovery process: first to map the content hash to the Merkle tree hash, and then to get the Merkle tree and download the actual data.
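The extra round trip can be sketched with a toy in-memory stand-in for the DHT (plain Python dicts; the function names and the “bafy…” strings are hypothetical, not real IPFS APIs):

```python
import hashlib

# Toy stand-ins for two DHT namespaces (hypothetical, not real IPFS APIs).
content_hash_index = {}  # sha256(file) -> list of Merkle root hashes ("CIDs")
provider_index = {}      # Merkle root hash -> list of peers holding it

def publish(data: bytes, root_hash: str, peer: str):
    """Register both mappings when a peer uploads a file."""
    file_hash = hashlib.sha256(data).hexdigest()
    content_hash_index.setdefault(file_hash, []).append(root_hash)
    provider_index.setdefault(root_hash, []).append(peer)

def lookup_by_content_hash(file_hash: str):
    """Two lookups instead of one: hash -> root, then root -> peers."""
    roots = content_hash_index.get(file_hash, [])            # 1st round trip
    return [(r, provider_index.get(r, [])) for r in roots]   # 2nd round trip

publish(b"hello world", "bafy...root1", "peer-A")
h = hashlib.sha256(b"hello world").hexdigest()
print(lookup_by_content_hash(h))  # [('bafy...root1', ['peer-A'])]
```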

And that downside is pretty big, given how easy it would be to create peers that poison a swarm for a particular piece of content.

You could actually put it in the DHT.

When a file is added, it is first broken up into chunks and turned into a Merkle tree DAG. Every node in the Merkle tree is added to IPFS keyed by its hash, and the CID returned is actually the hash of the Merkle tree’s root node. Fetching a file involves fetching the Merkle tree node by node, starting at the root, and reassembling the original file from it. Note that the file’s data chunks are stored in the Merkle tree’s leaf nodes.

This approach means that files are broken down into manageable sized chunks that can be efficiently stored/transmitted/validated independently, possibly from different peers in parallel. It also means identical chunks from different files (usually different versions of the same file) are automatically de-duplicated, saving storage and transmission overheads.

Unfortunately, different upload settings, like a different chunker or hash algorithm, mean the file is chunked and hashed differently, resulting in a completely different Merkle tree. So the same identical file content uploaded with different settings ends up with a completely different CID, and no content is de-duplicated. There can thus be multiple different CIDs for a single file uploaded with different settings.
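A minimal sketch of this effect, using a simplified Merkle tree (pairwise SHA-256, nothing like the real UnixFS DAG layout): identical content chunked with two different chunk sizes yields two different roots.

```python
import hashlib

def sha256(b: bytes) -> str:
    return hashlib.sha256(b).hexdigest()

def merkle_root(data: bytes, chunk_size: int) -> str:
    """Chunk the data, hash each chunk (leaves), then hash pairs upward.

    Simplified illustration only; real IPFS builds a UnixFS DAG instead.
    """
    level = [sha256(data[i:i + chunk_size])
             for i in range(0, len(data), chunk_size)] or [sha256(b"")]
    while len(level) > 1:
        # Concatenate pairs of child hashes; odd node is promoted alone.
        level = [sha256((level[i] + (level[i + 1] if i + 1 < len(level) else ""))
                        .encode())
                 for i in range(0, len(level), 2)]
    return level[0]

data = b"the same file content" * 100
root_a = merkle_root(data, chunk_size=256)
root_b = merkle_root(data, chunk_size=512)
print(root_a != root_b)  # True: identical content, different "CID"
```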

The DHT is used to find nodes given their hash, and returns a list of peers that have a copy of that node. After uploading a file into IPFS you could push another entry into the DHT keyed by the whole-file hash that returns a list of CIDs for that file. You could include a bit of metadata like the chunker/hash settings used by each CID and the number of peers that have a copy of the merkle tree root node.

When a client doing a lookup from a hash gets a list of CIDs from the DHT instead of a list of peers, it can pick a CID with its preferred chunker/hash and/or the most peers, and then start fetching that CID normally.

The main problem with this is bad clients could post bad CID lists to the DHT, and the only way to validate them is to do a full fetch and validate the checksum. You could mitigate this similar to IPNS with peers signing entries, so you could pick a CID signed by a trusted peer or something.
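The proposed record type and its weakness can be sketched like this (dict-based toy, hypothetical names, not a real IPFS API): the whole-file hash maps to candidate CIDs with metadata, the client picks one, and the only trustless check is hashing after a full fetch.

```python
import hashlib

# Hypothetical record type: whole-file hash -> candidate CIDs with metadata.
checksum_index = {}  # sha256 hex -> list of {"cid", "chunker", "peers"}

def register(file_hash, cid, chunker, peers):
    """A peer publishes a candidate CID for a given whole-file hash."""
    checksum_index.setdefault(file_hash, []).append(
        {"cid": cid, "chunker": chunker, "peers": peers})

def pick_candidate(file_hash, preferred_chunker="size-262144"):
    """Prefer our own chunker settings, then fall back to the most peers."""
    candidates = checksum_index.get(file_hash, [])
    return max(candidates,
               key=lambda c: (c["chunker"] == preferred_chunker, c["peers"]),
               default=None)

def validate(file_hash, fetched_bytes):
    """Only defense against poisoned records: hash check after a full fetch."""
    return hashlib.sha256(fetched_bytes).hexdigest() == file_hash

data = b"example file"
h = hashlib.sha256(data).hexdigest()
register(h, "bafy...A", "size-262144", peers=3)
register(h, "bafy...B", "rabin", peers=10)
best = pick_candidate(h)
print(best["cid"], validate(h, data))  # bafy...A True
```

Note that a bad peer could just as easily `register` a bogus CID; nothing short of the final `validate` call catches it, which is exactly the poisoning problem described above.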

So much for the “trustless” goal of IPFS!

@dbaarda I didn’t know enough about the direct DHT API to know whether the DHT can be used almost like a generalized key/value store, which you seem to be implying can be done. Are you saying it can be done with the existing version, or is it a “would be nice to have”? I have a feeling it’s impossible to store arbitrary key types (like a SHA-256 string) in the DHT like this, but I’d love to find out I’m wrong.

@ldeffenb “Trustless” in the decentralized web always means “checkable by hashes and/or signatures”, so I’m totally fine with doing a hash check AFTER I get the data. I still call that “trustless”.


The DHT API docs have put and get commands that appear to take arbitrary buffers of data. Other docs talk about the three record types in the DHT (peerid, multiaddress, and ipns), but I don’t think any validation against those record types is done at the DHT API layer, only in higher layers.

If you wanted to integrate it into ipfs you’d need to add the new record type and how it is handled into the higher layers (kind of like an alternative lookup method to ipns).

Without integrating it into ipfs, and assuming the DHT can put/get arbitrary values, you could write an application on top of ipfs that can upload/download files keyed by the whole file hash.


I’m definitely not knowledgeable enough about the DHT to say anything with certainty here, but I did try a couple of things. The DHT docs might have the API options to do what you said, but IPFS doesn’t seem to be that permissive:

USAGE
  ipfs dht put <key> <value> - Write a key/value pair to the routing system.

  ipfs dht put [--verbose | -v] [--] <key> <value>

Given a key of the form /foo/bar and a valid value for that key, this will write
that value to the routing system with that key.

Keys have two parts: a keytype (foo) and the key name (bar). IPNS uses the
/ipns keytype, and expects the key name to be a Peer ID. IPNS entries are
specifically formatted (protocol buffer).

You may only use keytypes that are supported in your ipfs binary: currently
this is only /ipns. Unless you have a relatively deep understanding of the
go-ipfs routing internals, you likely want to be using 'ipfs name publish' instead
of this.

The value must be a valid value for the given key type. For example, if the key
is /ipns/QmFoo, the value must be an IPNS record (protobuf) signed with the key
identified by QmFoo.

For more information about each command, use:
'ipfs dht put --help'

So for example a hypothetical ipfs dht put /checksum/sha2-256/<some checksum> <CIDv1 of file in IPFS> won’t work because only /ipns is currently allowed.

Anyone up for making a proposal to add that /checksum/<algorithm> to IPFS? :slight_smile:

Even if this were in IPFS, it would probably still be an opt-in feature. It would be best if IPFS itself would eventually just add this to the logic of adding files. Then, by default, you’d have a CID plus a DHT entry for the sha2-256 → CIDv1 lookup. That would be just perfect!

Wouldn’t this also fix your initial issue, @ngortheone?

@markg85 Thanks for the great write-up on this. What you’re saying is exactly what I also thought could be the solution. It seems like very easy, low-hanging fruit for the DHT developers to add.

I think IPFS is growing and the industry is already accepting it, but if there were a new feature of “look up any file by its hash”, I think that would be a huge selling point for IPFS, and a huge win almost for free.


@everyone in this thread.

Now that we seem to have a rough understanding of what we want (feature-wise), and potentially even how it can be done in IPFS, how do we proceed from here? This thread will slowly sink as new threads arrive and eventually get lost in time.

I have two possible ideas/solutions here.

  1. (We should probably do this one regardless.) File an issue on go-ipfs describing this feature clearly, so anyone can pick it up from there.
  2. Set up a bounty to let someone develop this. I’m willing to donate for this feature, as I too would now like to have it! But on which platform? A crowdfunding approach would make sense to me. Is IPFS active in that area somewhere?

But I’m not alone in this thread; there are a bunch more people here. How do we proceed with the highest possibility of it getting somewhere? Also, if it were made by someone, would it even be accepted as a feature in IPFS?

Can someone give me a quick rundown of what the consensus of this thread is?


@ShadowJonathan Some people (including me) realized there would be significant value in being able to look up files based on their actual hash, something not currently possible with IPFS.


I have to ask: Is the CID Concept broken?


Are you actually saying that the very definition of IPFS (on Wikipedia) is wrong?

The InterPlanetary File System (IPFS) is a protocol and peer-to-peer network for storing and sharing data in a distributed file system. IPFS uses content-addressing to uniquely identify each file in a global namespace connecting all computing devices.

Or just that IPFS by itself cannot do content addressing?

Yeah, most definitions of “content addressing” strongly imply a single hash is taken over the content, whereas an IPFS CID is more of a Merkle tree root hash: essentially a proprietary record ID that can only be generated by running a specific, complicated tree algorithm over the entire file with specific settings. Even changing those settings gives you a totally different CID.
