How to calculate file/directory hash?

I’m interested in calculating IPFS hashes for various files and directories, somewhat similar to ipfs add -n. I’ve searched around, but wasn’t able to find any solid documentation on how this is calculated exactly.

I’ve had a quick look through the code and tried to figure it out from experimentation, and have some basic idea: files are broken into 256K chunks, each wrapped in some protobuf format and hashed using SHA-256, with index pieces which join these small pieces together in a tree-like fashion, from what I can tell. However, is there some documentation/specification which explains all of this a bit better?
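To make my current understanding concrete, here is a rough Python sketch of the shape I described. This is only what I pieced together from experimentation, not the real algorithm: it does fixed 256 KiB chunking with SHA-256 per chunk, but skips the protobuf wrapping and the index/tree layer, so these digests will not match what ipfs add prints:

```python
import hashlib

CHUNK_SIZE = 256 * 1024  # 256 KiB, the chunk size I observed as the default

def chunk_hashes(data: bytes) -> list[str]:
    # Hash each fixed-size chunk.  NOTE: real ipfs wraps every chunk in a
    # unixfs/merkledag protobuf node before hashing, and builds a tree of
    # index nodes on top, so these digests will NOT match ipfs's output.
    return [
        hashlib.sha256(data[i:i + CHUNK_SIZE]).hexdigest()
        for i in range(0, len(data), CHUNK_SIZE)
    ]

digests = chunk_hashes(bytes(300 * 1024))  # 300 KiB -> 2 chunks
print(len(digests))  # 2
```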

Also, is there a way to add a file via stdin?

3 Likes

Could you clarify the question? ‘ipfs add -n’ does exactly what you’re asking – it calculates the hashes of whatever you’re adding. The ‘-n’ flag tells ipfs to just calculate the hashes without actually adding the files.

Are you asking how to do this in code rather than from the command line?

is there some documentation/specification which explains all of this a bit better?

Try reading the tutorials at https://dweb-primer.ipfs.io

3 Likes

Yes, sorry for not making that clear. To put it another way, is there some specification/guide for writing a program that does basically the same thing as ipfs add -n?

Thank you for the response by the way.

3 Likes

I don’t know the specifics of how the current default behavior of ipfs add -n is implemented, but depending on your use case it might be worth noting that the hashes generated by go-ipfs’s add command for a given file/directory depend on a variety of factors whose defaults might change. For example, the output from ipfs add -n depends on:

  • chunking algorithm (--chunker option)
  • DAG format (--trickle option)
  • CID version (--cid-version option)
  • hashing algorithm (--hash option)
    • SHA-256 was/is the default but I think the default might be moving to BLAKE2b
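To illustrate the chunking point concretely, here is a small Python sketch (my own illustration, not go-ipfs code). Even for identical bytes, changing the chunk size changes every per-block digest, which is why the resulting root hash changes too:

```python
import hashlib

def chunk_digests(data: bytes, chunk_size: int) -> list[str]:
    # SHA-256 digest of each fixed-size chunk of data.
    return [hashlib.sha256(data[i:i + chunk_size]).hexdigest()
            for i in range(0, len(data), chunk_size)]

data = bytes(range(256)) * 2048               # 512 KiB of repeating bytes
digests_256k = chunk_digests(data, 256 * 1024)  # default-style 256 KiB chunks
digests_128k = chunk_digests(data, 128 * 1024)  # half-size chunks
print(digests_256k == digests_128k)  # False: same bytes, different block hashes
```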

2 Likes

Oh, so a file could have many possible hashes? And how hashes are calculated isn’t yet standardised/stable? That was something I was unaware of, so thank you very much for the information!

I realised that the multi-hash system could allow the same file to have different hashes, but I was under the impression that there’d be a ‘standard’ hash (say SHA256) everyone would adhere to that’d rarely change.

But if hashes can wildly vary for the same thing, doesn’t this somewhat reduce the effectiveness of content based addressing?

1 Like

As I understand it, the IPFS hash is a self-describing hash, i.e. a node will know from the format which algorithm was used, at what digest length, etc. As long as you have the latest version installed, you should be fine. The latter was a problem for me when I tried accessing the Turkish Wikipedia mirror, which had been created using v0.4.9, with go-ipfs v0.4.8; that didn’t work until I had v0.4.9 installed. But under normal circumstances, a node should recognize how a hash was calculated.

1 Like

Yeah… as I understand it, the first two letters (“Qm” being the most common prefix in IPFS hashes) state which hash function was used. https://github.com/ipfs/faq/issues/22

So if you used a different hash function the first two letters would be different which should be recognized automatically…
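You can see that self-describing prefix for yourself by base58-decoding a hash. The sketch below uses a minimal hand-rolled base58 decoder (not a library; it ignores leading-zero handling, which doesn’t matter here) and one of the hashes that appears in this thread. The first two decoded bytes are the multihash header, which is exactly what “Qm” encodes:

```python
# Minimal base58btc decoder, just enough to inspect a multihash prefix.
ALPHABET = "123456789ABCDEFGHJKLMNPQRSTUVWXYZabcdefghijkmnopqrstuvwxyz"

def b58decode(s: str) -> bytes:
    n = 0
    for ch in s:
        n = n * 58 + ALPHABET.index(ch)
    return n.to_bytes((n.bit_length() + 7) // 8, "big")

# A sha2-256 hash produced by `ipfs add` (taken from this thread).
raw = b58decode("QmeHy1gq8QHVchad7ndEsdAnaBWGu1CAVmYCb4aTJW2Pwa")
print(hex(raw[0]))  # 0x12: the multihash code for sha2-256
print(raw[1])       # 32: the digest length in bytes
```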

1 Like

Thanks for the responses, but from what leerspace is saying, this isn’t the only issue.

The multihash format only provides a wrapper around the hash algorithm, so if you change from SHA-256 to BLAKE2b, the prefix will differ. However, the hash is also affected by the chunking algorithm, DAG format, and CID version, so you can get completely different hashes even when the hash function is marked the same.

This means that the same file can be duplicated across the network (goes against the claim of “zero duplication” on the IPFS home page), and this doesn’t even seem to be a rare case. I mean, someone could add a file, remove it, upgrade the IPFS client (or use a different one), add it again, and get a completely different hash.

Or have I misunderstood something?

2 Likes

Yes… what you are saying could happen. However, if it does, the user is most likely forcing the duplication, as changing the hash algorithm, chunking algorithm, DAG format, and CID version is more complicated and time-consuming than most users would bother with, without substantial incentive.

So assuming that IPFS doesn’t change the default settings frequently, the probability of multiple hashes existing for the same file is low.

Further, I don’t think each file will only exist once in the network (zero duplication). For the same hash, multiple peers could get the same file and pin it to their local drives, duplicating it. Duplication of files leads to redundancy, which I think is necessary and something users would want.

Maybe somebody could correct me, if I am incorrect…

1 Like

Sidenote: I’ve installed the prebuilt multihash binary here, and it’s giving me an IPFS hash different from running ipfs add -n /path/to/FileOrDir, both with default settings (i.e. sha2-256). I assume this has to do with the changes made to hashing in one of the earlier ipfs updates (I believe it was 0.4.8 to 0.4.9), and the standalone multihash tool hasn’t been changed yet.

1 Like

Multihash just hashes the file and prints out its hash in multihash format.
ipfs add creates a merkledag and wraps the file in the needed format; that is why the hash is different.
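Schematically, the difference looks like this. The b"<node>" wrapper below is a stand-in for the real unixfs/merkledag protobuf framing, which I have not reproduced here; the point is just that hashing raw bytes and hashing a wrapped dag node are hashes of different preimages, so they can never be expected to agree:

```python
import hashlib

data = b"hello ipfs"

# What a standalone multihash tool does: hash the raw file bytes.
raw_digest = hashlib.sha256(data).hexdigest()

# What `ipfs add` does, schematically: serialize a dag node containing the
# bytes, then hash that.  b"<node>" is a stand-in for the real unixfs
# protobuf framing, not the actual on-the-wire format.
wrapped_digest = hashlib.sha256(b"<node>" + data).hexdigest()

print(raw_digest == wrapped_digest)  # False: different preimages
```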

3 Likes

Thank you for your responses.

without substantial incentive.

If there’s little incentive to change those, what’s the rationale for having them as switches? Are they perhaps mostly for experimental purposes and no-one is actually expected to use them?

So assuming that IPFS doesn’t change the default settings frequently

This is a concern that I have. How often can we expect them to change? For one, the hash algorithm already seems expected to change (SHA-256 to BLAKE2b). I get that IPFS is likely still in an experimental phase at the moment, but such a change could really impact a production environment.
And this is ignoring potential other implementations.

If the current defaults are not expected to change, shouldn’t they be documented and/or written up as a specification? This would allow others to implement alternative applications which are compatible and don’t end up partitioning the network. This documented hash should be exact, that is, contain no ambiguities/tunables such as a customizable chunking algorithm.
Unfortunately, flexibility and hashing don’t really mix, since hashing has to be exact: the same input must always give the same output hash (and not 10 different hashes depending on the settings used). I get that some flexibility may be desired, for example if a cryptographic hash has been broken, and it seems that the multihash format is intended to deal with this, but such changes should be performed rarely and considered somewhat breaking.

Is there such documentation available, or is it still being worked out, or something else?

Further, I don’t think each file will only exist once in the network (zero duplication)

I interpret “zero duplication” as referring to not having wasteful duplicates (or zero duplication in named resources). That is, if you and I both have a copy of the same file and person C wishes to obtain a copy, he can obtain it from either or both of us.
However, if the hash generated by you differs from mine, there is now a duplicate resource on the network. That is, person C can no longer download the file from both of us, and if he tries to download from both hashes (since there’s no way to determine whether the underlying file is identical), he will end up with two copies of the same file on his disk.

2 Likes

The way I always understood it (and please correct me if I’m wrong) is that it doesn’t matter what hash settings you use for a file that you add to IPFS. If you add a file to your node using the current default sha2-256, it’s still the same file as when another user adds it to his node using the upcoming default blake2b-256. Doesn’t the network treat both as the same object, independent of the hash settings? Which means, if you get/cat/pin an object by inputting a sha2-256 hash, the network will also grab parts of the object from the node that used the blake2b-256 hash?

If not, then, indeed, I second your reservations.

1 Like

The way I understand it from here: https://github.com/jbenet/random-ideas/issues/1

is that Qm encodes the hashing algorithm. However, the other factors that feed into the hash could result in duplication (Ref: https://github.com/jbenet/random-ideas/issues/1#issuecomment-58437241). Perhaps these parameters could be added to the code to reduce data duplication?

Also @jbenet seems to be aware of this issue of changing default settings too often, but perhaps a spec would help (Ref: https://github.com/jbenet/random-ideas/issues/1#issuecomment-43361213)

1 Like

Maybe my test isn’t valid for some reason, but trying to retrieve a file added under one hash using a different hash doesn’t work. I’m not sure how this could possibly work anyway, but I wanted to double check.

# Create random 1MiB file I'm pretty sure nobody else has
dd if=/dev/urandom of=rand_file bs=1M count=1

# add the file to ipfs with defaults
ipfs add rand_file  # gives me QmeHy1gq8QHVchad7ndEsdAnaBWGu1CAVmYCb4aTJW2Pwa

# generate a multihash for the same file using a different hashing algorithm, but don't make it available on ipfs
ipfs add -n --hash=blake2b-256 rand_file  # gives me zDMZof1kx7N1VLQa3bxVSk53tJLUSXzW4mUMyc57HqBqri94rava

# try to retrieve the file added to ipfs with sha2-256 using the blake2b-256 version of the hash
ipfs get zDMZof1kx7N1VLQa3bxVSk53tJLUSXzW4mUMyc57HqBqri94rava  # this command hangs (doesn't work)

# try to retrieve the file added to ipfs with sha2-256 using the sha2-256 version of the hash
ipfs get QmeHy1gq8QHVchad7ndEsdAnaBWGu1CAVmYCb4aTJW2Pwa  # this obviously works

2 Likes

So there will be no backward compatibility once blake2b-256 is introduced as the default, i.e. files might wind up on IPFS that are already there under a different hash. So @Nyan does have a point here.

PS: to reproduce on macOS use dd if=/dev/urandom of=rand_file bs=1m count=1, i.e. with 1m instead of 1M.

1 Like

My understanding is that there will be backward compatibility in the sense that the old hashes will continue to work.

However, it will introduce duplicate data if people re-add files to IPFS that already exist under older/different multihashes; I’m having trouble thinking of a common scenario for this in the theoretical future where content is addressed and distributed using its IPFS multihash. I don’t know what OP’s use case is, but I thought one of the primary use cases for IPFS was that one person (maybe sometimes a few) adds a file to IPFS and many retrieve it using its multihash (which will continue to work even if the defaults for adding new files change).

1 Like

If the world already used IPFS, this would seem like a stronger point, but with so many other distribution methods being used these days, a network which inherently allows such fragmentation doesn’t really seem like a solution to much, I would think.

Having only one person add a file to IPFS and then distribute the hash seems counter-intuitive to the notion of decentralization?

Is there much of a reason to allow such flexibility in the hash anyway?

1 Like

How so? This is similar to how BitTorrent works, and as I see it IPFS is at least as decentralized.

An example use case for this flexibility is the trickle DAG format (--trickle) which is generally more suited to video streaming than the default DAG format. I’d imagine there are other DAG formats suitable for other types of content as well. Depending on the content, different chunking algorithms may also be better at reducing duplicated blocks.
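As a toy illustration of why the chunker matters for deduplication (tiny 4-byte chunks instead of 256 KiB, purely for readability): with fixed-size chunking, inserting a single byte at the front of a file shifts every chunk boundary, so the two versions share no blocks at all. Content-defined chunkers such as rabin are meant to avoid exactly this:

```python
def fixed_chunks(data: bytes, size: int = 4) -> list[bytes]:
    # Split data into fixed-size chunks (tiny size, for illustration only).
    return [data[i:i + size] for i in range(0, len(data), size)]

a  = b"abcdefghijklmnop"
b_ = b"X" + a  # one byte inserted at the front shifts every boundary

shared = set(fixed_chunks(a)) & set(fixed_chunks(b_))
print(len(shared))  # 0: the two versions share no blocks at all
```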

As for why use a multihash instead of hard-coding a specific cryptographic hash into the protocol, I think this GitHub issue addresses that somewhat: https://github.com/jbenet/random-ideas/issues/1

As time passes, software that uses a particular hash function will often need to upgrade to a better, faster, stronger, … one. This introduces large costs: systems may assume a particular hash size, or call sha1 all over the place.

1 Like

I was under the impression that IPFS was aiming to be more than just a BitTorrent alternative (correct me if I’m wrong); the home page gives the impression that decentralization is a key goal. However, if the intention is to have centralized “IPFS tracker”-type websites, and hence basically be a copy of BitTorrent with a few niceties on top, then it seems perfectly acceptable.

Is there a good place which documents all of these? The help for ipfs add is fairly bare and doesn’t describe exactly what a number of the options do.
I tried searching for information about trickle and came across this. The last comment opines that there really isn’t that much difference between the two. I don’t know how accepted that opinion is, but my intuitive understanding of a tree-based hash is that it doesn’t really impede streaming/sequential access in any way?

I get the rationale for customizable algorithms and see the benefits. My main concern is rather how easily it is expected to change. For a stable network, it should ideally only be changed if it is absolutely necessary.

Thank you for the response!

2 Likes