Do a scanned document and a photo of the same document result in the same content identifier?

Given 2 users have the same document

When user1 uploads a scanned copy and user2 uploads a photographed copy of the same document

Will the resulting content identifier be the same?

For all practical purposes, no. In IPFS, identical means identical: identical down to the last one and zero. Not kind of or sort of alike, but completely indistinguishable. A single flipped bit in a multi-terabyte file would make a difference.
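To make the "single bit" point concrete, here is a minimal sketch using plain SHA-256 from Python's standard library. It is not how IPFS builds a full CID (that also involves chunking and encoding), and the sample data and the `flip_bit` helper are made up for illustration, but the sensitivity to one changed bit is the same idea.

```python
import hashlib

def flip_bit(data: bytes, bit_index: int) -> bytes:
    """Return a copy of data with exactly one bit inverted."""
    out = bytearray(data)
    out[bit_index // 8] ^= 1 << (bit_index % 8)
    return bytes(out)

# Made-up sample data standing in for a document file.
original = b"This is a scanned page." * 1000
altered = flip_bit(original, 12345)

digest_original = hashlib.sha256(original).hexdigest()
digest_altered = hashlib.sha256(altered).hexdigest()

# Count how many of the 64 hex characters differ between the two digests.
differing = sum(a != b for a, b in zip(digest_original, digest_altered))
print(digest_original)
print(digest_altered)
print(f"{differing} of 64 hex characters differ")
```

The two inputs have the same length and differ in one bit, yet the digests are unrelated. A scan and a photo differ in far more than one bit.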

You’re using the idea of sameness too casually. First, you’re talking about physical documents in the real world, and it’s hard to map real-world things onto the abstract world of digital files. But even physically, the two users don’t have the same document. They have two copies of a document that contain the same information. If you were to look closely, they would show slight variations from the printing process. Those variations are magnified when one copy is photographed and the other scanned, resulting in very different files.

IPFS aside, comparing two images can be extremely difficult. You could perhaps run optical character recognition (OCR) on both documents and compare the text. There would always be slight differences caused by errors in the OCR process, so even there you start getting a fuzzy idea of sameness. Perhaps one document had a company logo and the other didn’t. Are they the same? Maybe one logo is just obscured because of a poor scan.
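A rough sketch of what that fuzzy comparison might look like, assuming the OCR step has already produced two strings. The sample strings, the `text_similarity` helper, and the 0.9 threshold are all hypothetical; `difflib.SequenceMatcher` is from the Python standard library.

```python
from difflib import SequenceMatcher

def text_similarity(a: str, b: str) -> float:
    """Return a similarity ratio in [0, 1] after normalising case and whitespace."""
    norm_a = " ".join(a.lower().split())
    norm_b = " ".join(b.lower().split())
    return SequenceMatcher(None, norm_a, norm_b).ratio()

# Hypothetical OCR output for the same page captured two different ways.
ocr_from_scan = "Invoice No. 1042  Total due: 310.00 EUR"
ocr_from_photo = "Invoice No. 1O42 Total due: 31O.00 EUR"  # letter 'O' misread for digit '0'

THRESHOLD = 0.9  # arbitrary cut-off, chosen only for this illustration

score = text_similarity(ocr_from_scan, ocr_from_photo)
print(f"similarity = {score:.3f}, treated as same document: {score >= THRESHOLD}")
```

Note that the answer depends entirely on the threshold you pick, which is exactly the kind of fuzziness a content identifier does not have.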

This would imply that metadata doesn’t impact the hash, right? For example, the timestamp of the file wouldn’t be relevant. Isn’t this a vulnerability?

Yes, the metadata would change the hash. I’m not sure what vulnerability you’re suggesting that might introduce.

I just want to understand. Let’s say we have file A and on 1/1/23 we upload it to IPFS with a Hash. Then on 1/1/24, we upload the exact same file to IPFS. Will the new Hash be equal to the previous one?

Under which conditions will both Hashes be the same? And different?

I get what you mean when you say “uploaded”, but there seems to be a lot of confusion about this with IPFS. So, for anyone coming across this in the future: you don’t really “upload” anything to IPFS. The closest thing to uploading would be using a pinning service or maybe a writable gateway.

Again, sort of. If you added the exact same file with the exact same options (chunking, layout, hash, etc.), you would get the same CID. You just have to be careful, because there are some subtle ways things can change, and any change, no matter how small or trivial, will result in a different CID.
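Here is a toy sketch of why the options matter even when the bytes are identical. It hashes fixed-size chunks and then hashes the list of chunk digests. This is NOT the real UnixFS/Merkle DAG algorithm, and `chunked_root` is a made-up helper, but it shows the general effect of changing the chunk size.

```python
import hashlib

def chunked_root(data: bytes, chunk_size: int) -> str:
    """Hash each fixed-size chunk, then hash the concatenated chunk digests.

    A toy stand-in for a Merkle DAG root; it is NOT the UnixFS algorithm.
    """
    leaf_digests = [
        hashlib.sha256(data[i:i + chunk_size]).digest()
        for i in range(0, len(data), chunk_size)
    ]
    return hashlib.sha256(b"".join(leaf_digests)).hexdigest()

data = bytes(range(256)) * 4096  # 1 MiB of deterministic sample data

root_a = chunked_root(data, 256 * 1024)  # first "add"
root_b = chunked_root(data, 256 * 1024)  # same bytes, same options
root_c = chunked_root(data, 128 * 1024)  # same bytes, different chunk size

print(root_a == root_b)  # True
print(root_a == root_c)  # False
```

The date you add the file plays no role here. Only the bytes and the way they are split and hashed do.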

We don’t upload anything to IPFS.

You just announce to the network, “hey, I have a file with hash bafe…”
If someone is looking for the file with hash bafe…, they will connect to your node and fetch the file. If they cache the file, they will also serve it to the network.

In other words, IPFS is just a better version of BitTorrent.

Let’s say you have a file A.txt whose content is

This is a file called A

Whenever you add a file to IPFS, it reads the contents of the file and then forms a hash. Let’s say for A.txt you get bafe001…

Now, on some other day in the future, you create B.txt, which has the same content as A.txt. IPFS will read the contents of the file and then compute its hash. If the contents of the two files are exactly the same, you will obtain the same hash for both files.

In our example
A.txt → bafe001…
B.txt → bafe001…

So when you create a file, its file name doesn’t matter. As long as the content of the file (whatever data is written on disk from the beginning to the end of the file) is the same, you will get the same hash.
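If you want to try the A.txt / B.txt example yourself, here is a small sketch that builds a CIDv1 string by hand for a single raw block (raw codec, sha2-256, base32). The `raw_cid_v1` helper and the C.txt entry are my own additions. I would expect the output to match what an `ipfs add` with CIDv1 and raw leaves gives for a small file that fits in one chunk, but treat that as an assumption and check it against your own node. Real identifiers produced this way start with bafkrei…, not the shortened bafe… used above.

```python
import base64
import hashlib

def raw_cid_v1(content: bytes) -> str:
    """Build a CIDv1 string (raw codec, sha2-256, base32) for a single block."""
    digest = hashlib.sha256(content).digest()
    # 0x01 = CID version 1, 0x55 = raw codec,
    # 0x12 = sha2-256 multihash code, 0x20 = digest length (32 bytes)
    cid_bytes = bytes([0x01, 0x55, 0x12, 0x20]) + digest
    encoded = base64.b32encode(cid_bytes).decode("ascii").lower().rstrip("=")
    return "b" + encoded  # 'b' is the multibase prefix for lowercase base32

# File names mapped to file contents; the names never enter the hash.
files = {
    "A.txt": b"This is a file called A\n",
    "B.txt": b"This is a file called A\n",  # same bytes as A.txt
    "C.txt": b"This is a file called C\n",  # one character changed
}

for name, content in files.items():
    print(name, raw_cid_v1(content))
```

A.txt and B.txt print the same identifier, and C.txt prints a different one.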

This doesn’t introduce any vulnerability, because the hash is produced from the content of the file. If any modification is made to the content, you will obtain a new hash.

For a better understanding, I would recommend learning how BitTorrent and hashing work. Here’s a good video for understanding hashing.

Kubo does not really care whether adding the same files multiple times gives the same CID.

What Kubo really cares about is that when you download a file (from the gateway, using ipfs get, ipfs cat, …), the content is absolutely identical to the file that was originally added with ipfs add. That is what the cryptographic hashes are used for: verifying the content that is downloaded.
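A minimal sketch of that verification idea, using a bare SHA-256 digest in place of a full CID. The `verify_download` helper and the sample bytes are hypothetical; this is not Kubo's actual code path, which verifies block by block.

```python
import hashlib

def verify_download(expected_hex: str, received: bytes) -> bool:
    """Recompute the SHA-256 of the received bytes and compare it with the expected digest."""
    return hashlib.sha256(received).hexdigest() == expected_hex

published = b"This is a file called A\n"
expected = hashlib.sha256(published).hexdigest()  # stands in for the hash carried by the CID

honest_copy = published
tampered_copy = published.replace(b"A", b"B")  # a peer altered one character

print(verify_download(expected, honest_copy))    # True
print(verify_download(expected, tampered_copy))  # False
```

The downloader does not need to trust the peer that served the bytes, only the identifier it asked for.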

Whether or not ipfs add gives the same CID for the same input is not a very high priority. It is nice if this happens, but it is not a huge issue if it does not.
In practice:

  • ipfs add with Kubo will give you back the same CID if the files are identical to the last bit, as long as:
    • all of the add options are the same: --raw-leaves, --cid-version, --chunker, …
    • and you use the same version of Kubo for both operations
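One way to avoid depending on defaults is to always spell the options out. Below is a small sketch that only builds the command line and does not run it. The flag names come from the list above, plus `--only-hash` and `--quieter`, which I believe Kubo's `ipfs add` supports. The `build_add_command` helper and the specific values are assumptions for illustration, not recommendations.

```python
import shlex

def build_add_command(path: str, cid_version: int = 1, raw_leaves: bool = True,
                      chunker: str = "size-262144") -> list[str]:
    """Return an `ipfs add` argument list with every CID-affecting option pinned explicitly."""
    return [
        "ipfs", "add",
        "--only-hash",   # compute the CID without storing anything
        "--quieter",     # print only the final CID
        f"--cid-version={cid_version}",
        f"--raw-leaves={'true' if raw_leaves else 'false'}",
        f"--chunker={chunker}",
        path,
    ]

command = build_add_command("A.txt")
print(shlex.join(command))
# To actually run it on a machine with Kubo installed, something like:
#   subprocess.run(command, capture_output=True, text=True, check=True)
```

Recording the exact command (and the Kubo version) next to the CID makes it possible to reproduce the same CID later.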

Sometimes it works across versions too, but this might change in the future.
If you use other software to build the UnixFS representation of your files, it can give you a different CID, because it may use different options that optimize for a different use case, or use different algorithms. I have a list of examples here: IPFS hash feature use non-specified algorithm which is not widely compatible in the ecosystem · Issue #14389 · ethereum/solidity · GitHub
