Are there any resources (articles or blogs) that can help me understand how IPFS creates hashes?
Specifically, I’m wondering how IPFS computes hashes and whether I can compute these hashes before “adding” files to the IPFS network.
For example with shasum, I can run shasum mytext.txt and get 17e5eac2ee13951603f513989b9e85b2e17e1341.
Is there something similar for ipfs?
This would give me an IPFS fingerprint for a file that I have not yet added to the network, so that when the time comes to add it, I can be confident I’ll be able to retrieve it using that fingerprint.
So, in my Ruby app for example, if I needed to compute an IPFS hash for a record that a user creates, could I use the https://github.com/multiformats/ruby-multihash gem, and would the output be identical to the hash that gets created when I run ipfs add?
I had asked a similar thing some time ago, and if I remember correctly, the reason is that in the former case you’re running sha256/multihash/base58 on a file, whereas in the latter case it’s not just the file that’s relevant here, but the IPFS object created from that file by your IPFS node. Therefore the hashes will not match.
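To make the "former case" concrete, here is a minimal sketch of sha256/multihash/base58 applied directly to a file's bytes. The function names are made up for illustration; the 0x12 and 0x20 prefix bytes are the multihash code for sha2-256 and the digest length (32). The result looks like an IPFS hash (it starts with "Qm") but, as explained above, it identifies the raw bytes, not the IPFS object built from them.

```python
import hashlib

B58_ALPHABET = "123456789ABCDEFGHJKLMNPQRSTUVWXYZabcdefghijkmnopqrstuvwxyz"

def base58btc(data: bytes) -> str:
    """Encode bytes with the Bitcoin base58 alphabet."""
    n = int.from_bytes(data, "big")
    out = ""
    while n > 0:
        n, rem = divmod(n, 58)
        out = B58_ALPHABET[rem] + out
    # Each leading zero byte is represented by a leading '1'.
    pad = len(data) - len(data.lstrip(b"\x00"))
    return "1" * pad + out

def raw_sha256_multihash(content: bytes) -> str:
    """Multihash (sha2-256) of the raw bytes, base58-encoded.

    Hypothetical helper: this hashes the file itself, not the
    IPFS object that a node would create from it.
    """
    digest = hashlib.sha256(content).digest()
    return base58btc(bytes([0x12, 0x20]) + digest)

if __name__ == "__main__":
    print(raw_sha256_multihash(b"hello world\n"))
```

Running this on a file's contents and comparing with ipfs add -n -Q for the same file should show two different values.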
I’m not 100% sure of how ipfs works, but if you compare hashes using a small file, then they still don’t match, so it has to be a different reason. Afaik ipfs creates a kind of wrapper for any file that’s added to a node, whether it’s a small file or a chunk of a bigger file, and that’s what constitutes an “IPFS object”.
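As a rough sketch of what that wrapper looks like, the following assumes go-ipfs defaults (CIDv0, sha2-256, a UnixFS "file" message inside a dag-pb node, 256 KiB chunks, no raw leaves) and only handles content that fits in a single chunk. The function names are hypothetical and the protobuf bytes are written by hand rather than with a protobuf library; other settings or larger files will produce different hashes.

```python
import hashlib

B58_ALPHABET = "123456789ABCDEFGHJKLMNPQRSTUVWXYZabcdefghijkmnopqrstuvwxyz"
CHUNK_SIZE = 262144  # assumed default chunk size (256 KiB)

def base58btc(data: bytes) -> str:
    n = int.from_bytes(data, "big")
    out = ""
    while n > 0:
        n, rem = divmod(n, 58)
        out = B58_ALPHABET[rem] + out
    pad = len(data) - len(data.lstrip(b"\x00"))
    return "1" * pad + out

def varint(n: int) -> bytes:
    """Protobuf-style unsigned varint."""
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)
        else:
            out.append(byte)
            return bytes(out)

def ipfs_hash_single_chunk(content: bytes) -> str:
    """Sketch of a CIDv0 hash for a file that fits in one chunk."""
    if len(content) > CHUNK_SIZE:
        raise ValueError("multi-chunk files need a DAG; not handled here")
    # UnixFS message: Type = File (2), Data = content, filesize = length.
    unixfs = b"\x08\x02"
    if content:
        unixfs += b"\x12" + varint(len(content)) + content
    unixfs += b"\x18" + varint(len(content))
    # dag-pb node with no links: field 1 (Data) holds the UnixFS bytes.
    node = b"\x0a" + varint(len(unixfs)) + unixfs
    digest = hashlib.sha256(node).digest()
    return base58btc(b"\x12\x20" + digest)

if __name__ == "__main__":
    print(ipfs_hash_single_chunk(b"hello world\n"))
```

Under those assumptions the printed value is expected to match what ipfs add -n -Q reports for a file containing the same bytes, which is why even a small file's IPFS hash differs from the multihash of its raw content.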
Is anyone a little concerned about this difference?
One of the things I love most about IPFS is that each file is content addressable.
But does this difference create a small distorting noise, changing the file, and creating a hash that makes the file only identifiable within the IPFS system?
I believe Git does something similar: it places a small header at the beginning of the file’s content and then computes the hash of that new content:
Git constructs a header that starts with the type of the object, in this case a blob. Then, it adds a space followed by the size of the content and finally a null byte:
header = "blob #{content.length}\0"
=> "blob 16\u0000"
Git concatenates the header and the original content and then calculates the SHA-1 checksum of that new content.
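The same calculation can be sketched in Python (the function name is made up for illustration). It takes the content as bytes, so the length in the header is the byte length.

```python
import hashlib

def git_blob_sha1(content: bytes) -> str:
    """SHA-1 that Git assigns to a blob with this content."""
    # Header: object type, a space, the content size in bytes, a null byte.
    header = b"blob %d\x00" % len(content)
    return hashlib.sha1(header + content).hexdigest()

if __name__ == "__main__":
    # 16-byte content, so the header is "blob 16\0".
    print(git_blob_sha1(b"what is up, doc?"))
```

The result should agree with git hash-object run on a file with the same bytes, and it differs from what shasum prints for that file because of the prepended header.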
I’m sure that there are important reasons Git does this, just as I’m sure that there are important reasons IPFS hashes differ from the file’s pure hash.
But I would like an automated way to move from a file’s pure hash to the file’s IPFS hash and vice versa.
This would help ensure that someone who didn’t have any IPFS software on their computer could still locate their file on the IPFS network, simply by converting the pure hash of the file.
I assume that someone would have to create a program, e.g. ipfh (for “interplanetary file hash”) that only does what ipfs add -n -Q does, i.e. multihash plus the ipfs “wrapper”. But people would still not be able to compute it out-of-the-box, because they’d need to install ipfh first, and in that case they can also install default go-ipfs and run ipfs add -n -Q. I assume that developers could add the relevant code from ipfh to their checksum/hash libraries and programs like rhash, but people would still need to install that.
If I understand the whole thing correctly, you can’t use an already calculated sha-256 and derive the IPFS multihash from it, because the sha-256 is based on the original file, not on the IPFS object. So we’d need to know what a node does with a file before hashing it. I believe I once read that there is some tar action involved. (But I might be wrong.)
For reference, there is https://github.com/ipfs/faq/issues/22, where I’m trying to understand how the hashes are created. The end goal is to have ipfh (or ipfsmh, as I called it), just like @JayBrown mentioned. For the use case I have in mind, however, using Go will not be an option (too many dependencies), so my goal right now is to at least create a proof-of-concept implementation in Python, which could hopefully also serve as documentation and as an example of how hashes are generated.
@NiKiZe This discussion has a fairly extended, high-level treatment of calculating hashes that you might find relevant. The main caveat, which seems to be a sticking point for some, is that multiple valid DAG structures can be constructed for a given file, so there’s no single “definitive” hash for any given file.