CID concept is broken

Hello everyone!

I really like the ideas behind IPFS and I want to share some feedback about its design.

The core of the problem is that the CID concept is wrong in its current implementation. I know that sounds blunt and harsh, so let me clarify:

IPFS at its core claims to be a content-addressable file system.

Content address

A content’s address is determined by its digest. Simply put: the stream of bytes that represents a file on the disk, fed into a hashing function, gives us the digest.

content == the stream of bytes that represents a file on the disk

This is important, because only the end result (a file on the disk) matters to the end user.

CID != content address

But if we follow the documentation or any blog post about IPFS, we learn that IPFS actually does not use the file’s digest as its address. There is a thing called a CID, and it mixes a bunch of concepts together; some of those concepts are correct, and some of them are wrong.
And users are supposed to fetch content based on the CID.

CID is

  1. multibase-prefix
  2. multicodec-cidv1
  3. multicodec-content-type
  4. multihash-content-address

    So the same file can have x1 * x2 * x3 * … * xN CIDs, where N is the number of elements that form a CID and xi is the number of possible values for element i (see the sketch below).
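
For instance, the multibase part alone already multiplies the textual forms: the very same multihash can appear under several different-looking CIDs. A sketch, assuming a reasonably recent go-ipfs, with <CID> standing for any CIDv1 you have:

# the same CIDv1 re-encoded in different multibases:
# same multihash underneath, different-looking identifiers
ipfs cid format -b base32 <CID>
ipfs cid format -b base58btc <CID>
ipfs cid format -b base64url <CID>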

Normally this shouldn’t be a problem, but there is no normalization between the different encoding schemes because … (see below)

Multihash != content address!

IPFS stores the content by the multihash, which is not the content’s address either.
The multihash is a hash of the DAG that this particular file was sliced into. Not only can you use different hashing functions, you can also chunk the file into a different DAG, which produces different hashes.

ipfs add --cid-version 1 --chunker=size-1 cat.jpg
bafybeigmitjgwhpx2vgrzp7knbqdu2ju5ytyibfybll7tfb7eqjqujtd3y cat.jpg


ipfs add --cid-version 1 --chunker=size-2 cat.jpg
bafkreicdkwsgwgotjdoc6v6ai34o6y6ukohlxe3aadz4t3uvjitumdoymu cat.jpg

So the same content can be represented by M * N DAGs, where M is the number of hashing functions available and N is the number of chunking options available.
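
To see that these really are two different DAGs over the same bytes, one can list the child blocks each root links to (a sketch; reusing the two CIDs printed by the ipfs add commands above):

# lists the direct child blocks of each root; the lists will generally
# differ because the chunking (and so the DAG shape) differs
ipfs refs bafybeigmitjgwhpx2vgrzp7knbqdu2ju5ytyibfybll7tfb7eqjqujtd3y
ipfs refs bafkreicdkwsgwgotjdoc6v6ai34o6y6ukohlxe3aadz4t3uvjitumdoymu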

So DAG encoding != content’s address

The problems

  • IPFS claims to have automatic content de-duplication, but it only works inside one hashing/chunking scheme. It is in fact possible to have M * N duplicates of the same file stored alongside each other.
  • Converting CIDs is not possible without having the file on disk, so different peers can have the same content under different CIDs, which fragments the network into incompatible segments or forces everyone to keep M * N copies of the content to allow interoperability.
  • A given CID forces a specific file chunk layout onto everyone who downloads the content, which will cause storage and network performance problems (different OSes, different storage devices, different FS cache sizes, different network connectivity, not to mention duplication of OS kernel functionality in user space).

Proposed solution

Do not address by DAG encoding/layout; address by the file’s content.

The Content ID (CID) should contain the content hash instead of the DAG hash. With this it becomes possible to normalize CIDs. Once a peer gets a file, it can calculate multiple hashes of the content and advertise them all.
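
For example, a peer that already has the whole file could compute several plain digests of it with standard tools (only an illustration, reusing the cat.jpg example from above; the part that does not exist today is advertising these digests on the network):

# several digests of the same bytes; each one could in principle be advertised
sha256sum cat.jpg
sha512sum cat.jpg
b2sum cat.jpg        # BLAKE2b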

A particular DAG layout can be requested by the peer at transfer time.

It is also up to the peer how to store the content (in chunks or as a single file, let the OS handle it or deal with it in user space…).

2 Likes

Hi,
FYI, I am not an IPFS expert.

One issue with your proposition is that if we use the file content id (let’s say sha-512), like you said, the blocks can be generated in multiple ways (with different encodings and different chunk sizes). So, let’s say I want to PIN the sha-512 of Ubuntu’s iso file: instead of pinning only one set of blocks, I would need to pin all the sets of blocks. If there are 5 different sets, then it takes up 5 times the space.

Of course, it would have been possible, if they had designed it with a single hash (sha-512) and a single default for blocks, to just use the content’s hash, but that would inevitably need to be changed in the future.

I think their design decision to have defaults was well chosen, and if everyone uses the same version of the software without touching any of the defaults, then yes, the dedup works fine and it is still backward compatible with the previous software versions’ defaults.
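
A small illustration of that point (a sketch; ubuntu.iso is just a placeholder file name): adding the same bytes twice with untouched defaults yields the same CID, so the blocks are stored only once.

cp ubuntu.iso ubuntu-copy.iso
ipfs add ubuntu.iso         # prints some CID
ipfs add ubuntu-copy.iso    # prints the exact same CID -> fully deduplicated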

Regards

I recently posted a proposal for something that would address some of the issues you’ve mentioned: Merkle tree root hash with lthash?

1 Like

@simon

I think their design decision to have defaults was well chosen, and if everyone uses the same version of the software without touching any of the defaults, then yes, the dedup works fine and it is still backward compatible with the previous software versions’ defaults.

I think that hoping everybody will use the default settings is wrong. Since this is an Interplanetary File System: can you imagine that people on the other side of the planet, or even on another planet, have the same hardware as you? The same disks, the same OS configuration, the same performance requirements, the same workloads? Does one expect that the chosen hash/chunking scheme is as good for sending to another planet over an expensive connection as it is for sending next door over fast fiber?

I don’t think that this expectation is realistic. A Merkle DAG is a good idea if you don’t hardcode it but let peers negotiate it. You don’t need to duplicate the data: you can calculate multiple DAGs for the same file. The only difference is that you store file offsets in the DAG nodes instead of the data itself.

I am not hoping. For any software I use, I always use the default values unless there is really a big issue with them.

Maybe it could be possible to add another layer like:

  • there is “/ipfs/” for the objects
  • there is “/ipns/” for pointers to changing objects
  • there could be a new “/ipXs/” where you give the hash of the content

For sure, that would be nice: for example, when we search on the net and find a big file plus a small file containing the hash of that file, we could easily check on IPFS whether it is available by that hash.

On the other hand, that would add more traffic and metadata on the DHT. Also, there could be attacks on that. E.g.: someone claims that hashX is provided by blocksX when it isn’t true. The only way to validate is to download the full file and then hash it to see if it is the right one. In the case of blocks, since the blocks are “small”, the same check is done faster.

Disclaimer: I’m not an IPFS dev! Merely an “enthusiastic user”

You do have an interesting point there. If I hash a file with “Algorithm X” just because that’s faster for my machine and you hash the same file with “Algorithm Y”, then another user who might be searching for the very same file but with “Algorithm Z” will never be able to find it. Even though it’s very much there and perhaps even well seeded. Just under a different “name”, if you will.

But…:

How is that magically going to make it work?

For example, the pin explorer does tell me the digest. So what exactly do you mean by “address by file’s content”? To me that means making a digest (or checksum) with some algorithm that is agreed upon by all nodes.

You need to identify content by “something” and that digest is that something. In IPFS there’s some magic sauce on top of it (CIDv1/multihash)

This has a few issues.

  1. You will need to hash your file multiple times. That on its own can be a very lengthy, resource-intensive task; for example, think about hosting Wikipedia. A lighter version would be to merely hash that “file content hash”, which would make this super fast (see the sketch after this list).
  2. You’ll cause a boatload of noise on the network, as each file would be advertised under, say, the top 10 hashes or so.
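
A rough idea of what point 1 means in practice (a sketch; bigfile.iso is a placeholder, and --only-hash just computes the CID without storing anything):

# one full pass over the file per hash function
for algo in sha2-256 sha2-512 blake2b-256; do
  ipfs add --only-hash --quiet --hash=$algo bigfile.iso
done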

I don’t see how you can do this efficiently. But I might be missing something too, so please do elaborate :slight_smile:

1 Like

Once a peer gets a file, it can calculate multiple hashes of the content and advertise them all.

That indeed is more computationally expensive and will create more network traffic. I was just pointing out that computing N hashes for a file is much better than computing N * M hashes.

Do not address by DAG encoding/layout; address by the file’s content.

How is that magically going to make it work?

I don’t have a good answer. I wanted to point out a few problems, but I am afraid that I failed to get them across properly.
Let me take a step back and set up the stage.

I want to download a big file. A multi-hour cat video in 4k that has a size of 10GiB.
I happen to know that the sha512 of the file is sha512-aaaabbbbcccc (I prefix it with sha512- for clarity); a friend ran shasum on his copy of the file and emailed it to me.

How do I download it from IPFS? I can’t use the file’s shasum. I need a CID.
content hash != CID

So a CID like Qmxxxxyyyyzzzz is a hash of the root node of the content-addressable DAG of the sliced cat video, built with some hashing algorithm and some slicing pattern. The hash of the DAG’s root node != the content hash. One peer on the network sliced and hashed this cat video one way; another peer used a different hash function and slicing pattern and ended up with a different CID.

As a user I have the following questions:

  • given a content hash of a wanted file - how do I discover all CIDs that correspond to that file?
  • given a list of CIDs that correspond to a desired file, how do I use them all to download a single copy of the file? (I don’t want to have X copies of my 10GiB cat video, all sliced slightly differently)
  • given one CID that corresponds to a desired file, can I deduce all other possible CIDs without having the file locally?

Now it gets more interesting. Let’s assume there is only one possible CID for a given file, just for argument’s sake.

A peer advertises a certain hash. It is a hash of the root of the Merkle DAG; I can traverse from the root down to all of the nodes and get their hashes too. But without the file on disk I don’t know if the DAG is correct. I won’t be able to verify that the root’s hash is correct until I download every bit of the file and calculate the root’s hash myself.

Even more interesting: even if I download every bit of the file and calculate the DAG, and it happens to be correct, there is still no guarantee that this is my cat video. DAG hash != the file’s content hash. I need to download every bit of the file to make sure that this is indeed my sha512-aaaabbbbcccc cat video.

  • Having X hashing algorithms, there can be X content hashes of my cat video. There is already a problem here, because as a user I can’t convert one hash into another hash.

  • Having X hashing algorithms, there can be X * X DAGs for the file: I can have the DAG nodes hashed with algorithm X1 while they refer to file content hashed with X2.

  • Having Y chunking schemes, there can be X * Y DAGs for each of the X content hashes of the same file.

As a user, all I care about is my cat video that has the hash sha512-aaaabbbbcccc, so I can make sure that this is the same video my friend recommended to me. How do I find the video, given that there are Y * X^2 possibilities to represent it? Or X * Y possibilities to represent just the sha512-hashed content?

How do I know that downloading a given CID will indeed result in the sha512-aaaabbbbcccc file on disk?
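
As far as I can tell, today the only way is to fetch everything first and only then check (a sketch; <CID> stands for whatever CID I was given, and the output file name is arbitrary):

ipfs get <CID> -o cat-video.mp4    # downloads all 10GiB
shasum -a 512 cat-video.mp4        # only now can I compare against sha512-aaaabbbbcccc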

To sum up:

  • axiom: hashes are not convertible or reversible; that is their core property.
    Therefore:

  • content hash X != content hash Y even if this is the same file. (duh)

  • CID != content hash

  • you can’t prove that one CID == another CID without having the content locally (because a CID contains a multihash)

  • you can’t verify the CID until you download everything (a property of a Merkle DAG: the root node is calculated recursively based on its children)

  • you can’t verify the content hash until you download everything

  • there are X^2 * Y ways to represent the same content. So while achieving some deduplication inside a particular hashing/chunking scheme, the overall IPFS network will suffer up to X^2 * Y duplication

I will be happy to be proven wrong, I am not a mathematician or IPFS expert after all!

There is another problem here that I’ve noticed, also related to the goal of a “Worldwide Standard” for how IPFS content needs to work: there needs to be a content type (MIME type) given at the front of the data, and hashed right along with the data. Storing files whose MIME type can’t be determined is horrible.

That being said, the only solution I see will be for the world to agree on “how” to use IPFS, rather than expecting the IPFS developers to go back and address these limitations.

So we could just create a standard that says the first 100 bytes of data are reserved for a left-justified, lowercase MIME type, followed by the data, and that the entire thing is chunked a specific way and uses SHA-256, give that some kind of name like “IPFS Open File Format” or something, and then everybody gets what they need. So what I’m saying is that IPFS works just fine, but perhaps needs this “standards” meta-layer which lets disparate platforms interoperate better. Just saying “whatever the IPFS defaults are” would of course be a ‘silly’ standard. It needs to be a real standard, but it can be built on top of IPFS.
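
Roughly, the convention would look something like this (a hypothetical sketch only; photo.jpg is a made-up file name, and nothing in IPFS itself enforces or understands this layout):

# prepend a 100-byte, left-justified, lowercase MIME type, then the data,
# then add the whole stream with agreed-upon, fixed settings
{ printf '%-100s' 'image/jpeg'; cat photo.jpg; } | ipfs add --hash=sha2-256 --chunker=size-262144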

That is wrong.
It’s more than just the content hash; it also includes the hash you’re looking for. As in my previous post, here’s the CID explorer link again: CID Inspector | IPFS. That magic can quite likely be done through commands too.

I’m very confused about your post.
You seem to be arguing for 2 quite different ideas here.

  1. You want to get content by basically any hashing algorithm.
  2. You want to verify that the content you request is your 10GB cat movie without downloading it first.

For point 1: that’s a DHT change, I assume. If you feel strongly about it, go help the team figure out a sane way of doing it. I personally would “like” to have that too, as it enables using the most optimal hashing algorithm for each platform. In reality this is really just future-proofing. At this very moment SHA-256 is the de facto standard accessible on all sizable platforms that could sanely run IPFS (x86/64 and ARM, possibly RISC-V in the near-term future), so you really should just keep that as the default.

For point 2: that’s just not IPFS. Filecoin does have this, I think. You might also be interested in “PDP”, which is probably what Filecoin uses behind the scenes (or something like it). But for IPFS, you’re likely never going to get this; it sounds like having a proof mechanism in it would be quite a fundamental change.

That is purely speculative reasoning. Sure, it’s possible. But you’re shooting yourself in the foot if you change from the default settings when it comes to hashing.

I don’t see that as a wrong thing. It needs to have strong, good defaults, which, I think, they were working on. But the hashing is good as-is, I’d say.

I would not do that first part (the 100 bytes). You would just be adding that to every single file. That’s just wasting space.

Just thinking out loud here, but a thing that might be interesting is a “metadata layer / DHT” on top of IPFS that “adds value” to what would otherwise be just plain CIDs. That would make it possible to say that content A is hashed by algorithms X and Y. Bonus points if that were “collaborative”, where I, for instance, could download the file, hash it with “Algorithm Z” and tell the “metadata network” the hash. Again, there would need to be some verification here, so this would probably need some form of incentive framework. This network would then allow the lookup from your hash -> metadata layer -> CID.

Interesting ideas, folks :slight_smile: Who’s going to make this? :stuck_out_tongue: (not me)

1 Like

Two things I want to clarify about what I said:

  1. I do realize MFS (files in general) already has at least a file name, so the extension can be used to encode (or determine) the MIME type, at least the way operating systems do it: without true MIME info, just a file extension. But that’s not for DAGs in general, just files.
  2. If we want to get the whole world to agree on hashes (so we can compare any file to any other file), we just have to convince everyone to use the “Qm” multihash, which is guaranteed to be sha256 in base58.

If the IPFS devs had just said “we reserve the first two bytes of every hash and they are required to be Qm, and don’t ask us why”, they could have hidden it all until needed and just said “the world uses sha256, don’t worry about it”. They just wanted to ensure a path for growth for the day when sha256 is perhaps deemed too weak as computers get faster, so they allowed so much flexibility that the enforcement of a common standard was left optional… and that made things a little harder and more confusing than they needed to be.

@markg85

That is wrong.
It’s more than just the content hash; it also includes the hash you’re looking for.

I am a bit confused:

ipfs add ~/files/iso/FreeBSD-12.1-RELEASE-amd64-dvd1.iso
added QmShzonja4XdJsWTwDtfA6so1LgEqxjBkffZg3LgcNEvf1 FreeBSD-12.1-RELEASE-amd64-dvd1.iso

shasum -b -a 256 ~/files/iso/FreeBSD-12.1-RELEASE-amd64-dvd1.iso
00d65d47deceabec56440dea3f5c5dfe2dc915da4dda0a56911c8c2d20231b2d  

openssl dgst -sha256 ~/files/iso/FreeBSD-12.1-RELEASE-amd64-dvd1.iso
SHA256(FreeBSD-12.1-RELEASE-amd64-dvd1.iso)= 00d65d47deceabec56440dea3f5c5dfe2dc915da4dda0a56911c8c2d20231b2d

The CID Inspector for QmShzonja4XdJsWTwDtfA6so1LgEqxjBkffZg3LgcNEvf1 shows (sha2-256 : 256 : 40E6AB67BAF4D041B32F906CA43DE63E1CD0CA2C9C019262DCEDF16A687D7F88)

40E6AB67BAF4D041B32F906CA43DE63E1CD0CA2C9C019262DCEDF16A687D7F88 != 00d65d47deceabec56440dea3f5c5dfe2dc915da4dda0a56911c8c2d20231b2d

I haven’t verified that this info is correct, and I’m not saying you weren’t aware of the protobuf thing, but here’s some more context for anyone:

1 Like

So the CID Inspector for QmShzonja4XdJsWTwDtfA6so1LgEqxjBkffZg3LgcNEvf1 shows

sha2-256: 40E6AB67BAF4D041B32F906CA43DE63E1CD0CA2C9C019262DCEDF16A687D7F88

which is the same as

$ ipfs block get QmShzonja4XdJsWTwDtfA6so1LgEqxjBkffZg3LgcNEvf1 | sha256sum -b
40e6ab67baf4d041b32f906ca43de63e1cd0ca2c9c019262dcedf16a687d7f88

So the CID does not contain the content hash. It contains the hash of the root node of the DAG.
So my question stands: how do I get the original sha256-aaaabbbbcccc (or any other hash of the content from before it was added to IPFS) of my cat video without downloading it? Is it stored somewhere or not?

Thank you @wclayf for that link!

:open_mouth: Looks like I just learned quite a bit of new stuff about IPFS. I knew each block was being hashed and that the root node is a composition of the direct child nodes. But I was expecting, apparently naively, that the digest in that root node would be of the whole content, as the IPFS CID Inspector made me believe. In fairness to that tool, it doesn’t “say” that the digest is from the root hash. But I think it’s safe to assume that one would think that.

Yeah, I think I now see why you have a problem with this.
You apparently cannot (as far as I know) get the sha2-256 from a CID without getting the whole content first.

And you also cannot create a CID if you only know the sha2-256 checksum, as there’s no way to go from that checksum to the CID without actually having the data (e.g. if you got the checksum from someone else).

Now I’m stuck too :slight_smile: I have no clue. But it does smell like you’re on to something that would be nice to have in IPFS.

Hello, this thread has a lot of information which is correct mixed with a lot of information that is not fully correct.

First there is the concept of CID. A CID is just a way to represent a hash, giving the user some extra information:

  • What type of hash it is (sha256 etc). This is the multihash part.
  • What type of IPLD Merkle DAG it is referencing (which can be a “raw” type to reference content directly). This is the multicodec part.
  • How the hash is encoded for human representation (base32, base64 etc). Some CIDs are equivalent if the only thing that changes is the encoding (see how IPFS supports both Qmxxx (base58) and bafyxxx (base32) and switches interchangeably between them). This is the multibase part.
  • Qmxxx CIDs are called “V0”, by the way. They are actually just multihashes without any base or type information, which is assumed to be base58/protobuf to all effects.

The whole CID concept works independently from IPFS. A CID can be used to represent a normal sha256 hash in the format you are used to seeing it (hex) if you want. https://cid.ipfs.io can help with conversions etc., as can the ipfs cid subcommands.
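
For example, something like this converts between representations of the same hash locally (reusing the QmShz… CID from earlier in the thread; exact flags may vary between go-ipfs versions):

# re-encode the same multihash as a CIDv1 in base32
ipfs cid base32 QmShzonja4XdJsWTwDtfA6so1LgEqxjBkffZg3LgcNEvf1
# or choose the version and base explicitly
ipfs cid format -v 1 -b base32 QmShzonja4XdJsWTwDtfA6so1LgEqxjBkffZg3LgcNEvf1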

IPFS uses CIDs because they are future proof and allow working with any type of hash/format/encoding configuration, regardless of the default hashing, DAG type, and encoding.

We could imagine IPFS using CIDs that just encode the “regular” sha256 sum of a file. However, as mentioned, IPFS is not content-addressing files themselves, but rather IPLD Merkle DAGs. It is not that the same content can be represented by different CIDs, but rather that different DAGs are represented by different CIDs.

One of the main reasons that IPFS chunks and DAG-ifies large files is that you want to verify content as you move it around the network. If you did not chunk a 1GB file, you would need to make a full download before verifying that it corresponds to what was requested. This would enable misbehaving peers to consume too many resources from others. For this reason, IPFS nodes refuse to move blocks larger than 1 or 2 MB on the public network. Of course, private IPFS networks can be adapted to whatever you want, and you could make IPFS not chunk at all.

Also, with smaller chunks, a large 1GB file which is similar to another 1GB file can be deduplicated. If each were made of a single chunk, they would not be able to share pieces.
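
A rough way to see that effect (a sketch; the file names are made up): two files that share a long prefix also share their leaf blocks when chunked with the default fixed-size chunker.

head -c 1048576 /dev/urandom > a.bin    # 1 MiB of random data
cp a.bin b.bin
echo "one extra line" >> b.bin          # b.bin = a.bin plus a small suffix
ipfs add a.bin
ipfs add b.bin
# ipfs refs <cid-of-a> and ipfs refs <cid-of-b> should list the same
# first four 256KB leaf blocks, which are therefore stored only once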

There are other smaller reasons, like the ability to better distribute downloads and request different chunks from different people, the ability to support DHT lookups of bytes located in the middle of the content (i.e. seeking video without downloading from the start or relying on a provider that has the whole thing), etc., all while ensuring the content can be easily verified.

With all the above, a default which does not do any chunking seems less reasonable than any other default. Selecting the chunking/DAG algorithm per request would be disastrous for performance and security reasons.

The question of “how the dag is stored by the OS” is not very relevant as that is a lower-layer issue and can be solved there regardless. The OS/storage devices are as good/bad suited to store a DAG as they are to store different types of non-chunked content. Different datastore backends will optimize for different things as well (i.e. badger vs. fs).

Then, the question of “I have a sha256 digest and I want to find that on IPFS” can only be solved with a “search engine” (be it the DHT, or something else). But I find this similar to saying “I have a filename and I want to find that on the web” and complaining that the HTTP protocol does not give you that. Just like you browse the web with full URLs (and use a search engine otherwise), you will normally browse IPFS using CIDs that get you to the content you want, and normally you will be referencing DAGs directly.

In the end the question is not how to translate from a sha256 digest to a CID, but how to translate between “thing that a human can read/remember” and a CID. The only reason sha256 digests are provided next to human-readable filenames now is to be able to verify the content after download. However, IPFS embeds this functionality directly, which makes additional digests rather redundant.

So, taking the above into account, the choice of a 256KB block size with a balanced DAG layout as the default wrapper for content in the public IPFS network was deemed the safest when balancing a bunch of practical, security and performance concerns. Of course, optimizing just for deduplication, or just for discovery, results in other choices, and the good thing is that IPFS is architecturally designed to deal with different choices, even if the public network sets some limits.
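
Put differently, the current defaults are just one point in that design space; spelled out explicitly, they should produce the same CID as not passing any flags at all (a sketch, assuming go-ipfs defaults at the time of writing; somefile.bin is a placeholder):

ipfs add somefile.bin
ipfs add --cid-version 0 --hash=sha2-256 --chunker=size-262144 somefile.bin
# both commands should print the same CID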

6 Likes

But how?
How can you get that sha256 without downloading the file?
Is there some dag-tree-traversal magic possible to get that original hash out of it?

Someone has to tell you, and then you need to trust the source of that info, and then you need to verify it is correct once you have the full file. And by that time, you don’t need it anymore because if you got the full file from IPFS it is already verified. This is why I don’t see too many upsides in worrying about full sha256 digests when operating on IPFS-land.

As long as you’re only in IPFS-land it’s a non-issue, but it seems to come up if, say, you want to download an ISO and they publish a sha256 but sadly not a CID. You think, “Hey, I bet someone else has already put this on IPFS, and if they did I’d like to download it from there.”

I’m not sure how you’d do that. I guess you could have some sort of search engine that published file hashes (which I think we’re referring to as content hashes, but I find that term ambiguous), but it would have to download the entire file to compute the hash, so it would be nice if there were a way for the original publisher to include the hash when they publish it. It would also be a nice way for clients that use gateways to verify the file, since they can’t do it without the Merkle DAG and must trust the gateway.

I also wanted to add that the post’s title is probably more confrontational and antagonistic than intended.

EDIT: Even if a publisher somehow included the hash of the entire file, you’d still have to trust that they were telling the truth, so there is the possibility of someone publishing a number of files with incorrect hashes, forcing people to download them to verify.

1 Like

Sure, fair point.

Please do enlighten me a little further with regard to what exactly is hashed.
Right now, I’m still living under the assumption that somewhere down the IPFS chain the file in its entirety is hashed using sha256, as if you’d called sha256sum somefile on Linux.

Or, and this is entirely possible too, is the file as a whole never hashed, and is it only hashed in chunks? That would be in those 256KB blocks.

If it’s the former, the IPFS network somehow, somewhere, must know the sha256 hash of the file in its entirety.
If it’s the latter, then I’ve learned something new yet again :slight_smile:
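
A quick experiment I guess I could run to check (a sketch; <CID> being any file I have added locally):

ipfs cat <CID> | sha256sum       # hash of the file as a whole, like sha256sum on disk
ipfs block get <CID> | sha256sum # hash of just the root block, i.e. the digest inside the CID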