Understanding hashes

Are there any resources (articles or blogs) that can help me understand how IPFS creates hashes?

Specifically, I’m wondering how IPFS computes hashes, and whether I can compute these hashes before “adding” files to the IPFS network.

For example with shasum, I can run shasum mytext.txt and get 17e5eac2ee13951603f513989b9e85b2e17e1341.

Is there something similar for ipfs?

This would let me get an IPFS fingerprint of a file that I have not yet added to the network, and be confident that when the time comes to add it, I’ll be able to retrieve it from the network using that fingerprint.

Thanks.
jw


Hi Jeff,

IPFS uses this project to create hashes: https://github.com/multiformats/multihash

Regards


To compute what the hash would be without actually adding the file(s) or data, use the -n flag on the ipfs add command.

ipfs add -n path-to-files
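
If you only want the hash itself, combine it with the -Q flag, which prints nothing but the final hash:

ipfs add -n -Q path-to-files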

I plan to write a section of the dweb primer that explains the internals but haven’t gotten around to it yet. (which makes me sad :frowning_face:)


Awesome.

So, in my Ruby app for example, if I needed to compute an IPFS hash for a given record that a user creates, could I use the https://github.com/multiformats/ruby-multihash gem, and would the hash output be identical to the hash that gets created when I run ipfs add?

If so, cool!

Ok, I think I’m getting the hang of it, but I’m having trouble understanding why my hex and base58 hashes are not lining up.

Consider this file:

https://gateway.ipfs.io/ipfs/QmXodYnsHUp8vbeksExaVZiEGWjG5UNth1ZM2qnuMvSYig

Running shasum -a 256 (in bash on a Mac) gives me:
d5365a7cc627dd6c1d27e914ef79ae3def4544d514c3848ee5437ffd68fa52b5

Using the multihash gem, this gets prefixed with 1220, so I get:

1220d5365a7cc627dd6c1d27e914ef79ae3def4544d514c3848ee5437ffd68fa52b5

But when I try to convert this hex to base58 using this tool: http://lenschulwitz.com/base58 I get:

QmcgwctjGy47aV8pM2NjMBWZdqjiG5yfq5VWCPNKiPZP88

NOT what ipfs add -n gives me, which is:

QmXodYnsHUp8vbeksExaVZiEGWjG5UNth1ZM2qnuMvSYig

I was expecting the final two hashes to be identical.
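
For reference, here’s a minimal Python sketch of the manual computation I’m doing (assuming the base58 package from PyPI; the file name is just illustrative):

import hashlib
import base58  # assuming the base58 package from PyPI

data = open("mytext.txt", "rb").read()  # file name is illustrative
digest = hashlib.sha256(data).digest()  # raw 32-byte SHA-256 digest
mh = b"\x12\x20" + digest               # multihash prefix: 0x12 = sha2-256, 0x20 = 32-byte length
print(base58.b58encode(mh).decode())    # prints a Qm... string, but NOT the ipfs add hash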

Am I missing something?

Any assistance is much appreciated.

I had asked a similar thing some time ago, and if I remember correctly, the reason is that in the former case you’re running sha256/multihash/base58 on the file itself, whereas in the latter case it’s not the file that gets hashed, but the IPFS object created from that file by your IPFS node. Therefore the hashes will not match.


Is it because IPFS breaks the file into small pieces?

I’m not 100% sure how IPFS works, but if you compare hashes using a small file, they still don’t match, so it has to be a different reason. AFAIK, IPFS creates a kind of wrapper for any file that’s added to a node, whether it’s a small file or a chunk of a bigger file, and that wrapper is what constitutes an “IPFS object”.
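
If you have a node running, you can check this wrapper yourself. For a file small enough to fit in a single block, like the one above, fetching the raw block and hashing it should reproduce the digest inside the multihash:

ipfs block get QmXodYnsHUp8vbeksExaVZiEGWjG5UNth1ZM2qnuMvSYig | shasum -a 256

The digest printed here, prefixed with 1220 and base58-encoded, should give back the same Qm… hash, which suggests the hash covers the wrapped block, not the raw file bytes.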

Is anyone a little concerned about this difference?

One of the things I love most about IPFS is that each file is content addressable.

But doesn’t this difference introduce a small distortion, changing what gets hashed and producing a fingerprint that makes the file identifiable only within the IPFS system?

I believe Git does something similar, by placing a small header at the beginning of a given file and then computing the hash of that new content:

Git constructs a header that starts with the type of the object, in this case a blob. Then, it adds a space followed by the size of the content and finally a null byte:

header = "blob #{content.length}\0"
=> "blob 16\u0000"
Git concatenates the header and the original content and then calculates the SHA-1 checksum of that new content.

https://git-scm.com/book/en/v2/Git-Internals-Git-Objects
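
As an aside, Git’s blob hash is easy to reproduce, e.g. in Python (the file name here is just illustrative):

import hashlib

content = open("mytext.txt", "rb").read()          # any file; name is illustrative
header = b"blob %d\x00" % len(content)             # Git's object header
print(hashlib.sha1(header + content).hexdigest())  # same as `git hash-object mytext.txt`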

I’m sure that there are important reasons Git does this, just as I’m sure that there are important reasons IPFS hashes differ from the file’s pure hash.

But I would like an automated way to move from a file’s pure hash to a file’s IPFS hash and vice versa.

This would help ensure that someone who didn’t have any IPFS software on their computer, could still locate their file on the IPFS network, simply by converting the pure hash of the file.

Any thoughts? Am I missing something?

I assume that someone would have to create a program, e.g. ipfh (for “interplanetary file hash”), that does only what ipfs add -n -Q does, i.e. multihash plus the IPFS “wrapper”. But people would still not be able to compute it out of the box, because they’d need to install ipfh first, and in that case they could just as well install the default go-ipfs and run ipfs add -n -Q. I assume that developers could add the relevant code from ipfh to their checksum/hash libraries and to programs like rhash, but people would still need to install those.

If I understand the whole thing correctly, you can’t take an already calculated SHA-256 and derive the IPFS multihash from it, because the SHA-256 is based on the original file, not on the IPFS object. So we’d need to know what a node does with a file before hashing it. I believe I once read that there is some tar action involved. (But I might be wrong.)


Thanks @JayBrown. Interesting food for thought :slight_smile:

I’ve written an article about the Multihash protocol, its motivation and use cases here.

Hope this helps.


For reference, there is https://github.com/ipfs/faq/issues/22, where I’m trying to understand how the hashes are created, with the end goal of having ipfh, just like @JayBrown mentioned (or ipfsmh, as I called it). However, for the use case I have in mind it will not be an option to use Go (too many dependencies), so my goal right now is to at least create a proof-of-concept implementation in Python. Hopefully that could also serve as documentation and an example of how the hashes are generated.
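
For the simplest case, a rough sketch of what such a PoC might look like in Python is below. It assumes a single small file under the default 256 KiB chunk size, the default chunker, and no raw leaves, so the whole file becomes one DAG-PB node wrapping a UnixFS message; anything chunked or added with non-default settings builds a different DAG and therefore a different hash. The base58 package and the file name are assumptions for illustration:

import hashlib
import base58  # assuming the base58 package from PyPI

def varint(n):
    # protobuf-style unsigned varint
    out = b""
    while n > 0x7F:
        out += bytes([(n & 0x7F) | 0x80])
        n >>= 7
    return out + bytes([n])

def ipfs_hash(content):
    # UnixFS "Data" message: Type = 2 (File), Data = content, filesize = len(content)
    unixfs = (b"\x08\x02"
              + b"\x12" + varint(len(content)) + content
              + b"\x18" + varint(len(content)))
    # DAG-PB PBNode with no links, just the Data field carrying the UnixFS message
    pbnode = b"\x0a" + varint(len(unixfs)) + unixfs
    # multihash the serialized node and base58-encode it, as discussed above
    digest = hashlib.sha256(pbnode).digest()
    return base58.b58encode(b"\x12\x20" + digest).decode()

print(ipfs_hash(open("mytext.txt", "rb").read()))  # file name is illustrative

Under those assumptions, ipfs_hash should agree with ipfs add -n -Q for a small file, but it’s a sketch of the single-block case, not a general implementation.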

@NiKiZe This discussion has a pretty extensive high-level treatment of calculating hashes that you might find relevant. The main caveat, which seems to be a sticking point for some, is that multiple valid DAG structures can be constructed for a given file. There’s not a “definitive” hash for any given file.
