Bootstrapping IPFS with data from Dropship / Dropbox

Dropbox, just like IPFS, allows users to access its content via its SHA-256 hash, regardless of which account uploaded the original file.

Imagine an IPFS node that, upon receiving a request for a block of data by its SHA-256 hash, would query Dropbox for that hash.

This would effectively put potentially petabytes of data, already addressable by hash via the Dropbox API, on IPFS as well.

I believe it may be limited to files smaller than 262144 bytes (because of how IPFS chunks data), and the IPFS hash may need to be hashed again to arrive at the Dropbox content hash for the corresponding content (https://www.dropbox.com/developers/reference/content-hash).

This could be as simple as collecting the wantlists of all its peers, running every hash through Dropship or something similar, and running `ipfs add` on any successful Dropbox downloads.
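The loop described above can be sketched as follows. All the helper names here (`peer_wantlists`, `dropbox_fetch`, `ipfs_add`) are hypothetical stand-ins for real Bitswap, Dropship-style, and IPFS calls:

```python
# Hypothetical helpers, stubbed for illustration:
def peer_wantlists():
    # in reality: query each connected peer's Bitswap wantlist
    yield ["hash-a", "hash-b"]

def dropbox_fetch(block_hash):
    # in reality: a Dropship-style lookup by SHA-256; None on a miss
    return b"data-for-a" if block_hash == "hash-a" else None

added = []
def ipfs_add(data):
    # in reality: `ipfs add` (or a block put) on the fetched bytes
    added.append(data)

def bridge_wantlists(wantlists, fetch, add):
    # For every block hash a peer wants, try Dropbox; add any hit to IPFS.
    for wantlist in wantlists():
        for block_hash in wantlist:
            data = fetch(block_hash)
            if data is not None:
                add(data)

bridge_wantlists(peer_wantlists, dropbox_fetch, ipfs_add)
assert added == [b"data-for-a"]
```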

Info on getting data from Dropbox by SHA-256 hash only:

- https://github.com/driverdan/dropship
- http://blog.fosketts.net/2011/07/11/dropbox-data-format-deduplication/
- https://news.ycombinator.com/item?id=2478757
- https://news.ycombinator.com/item?id=2478567
- http://razorfast.com/2011/04/25/dropbox-attempts-to-kill-open-source-project/

Thoughts on this idea?


Wow, I can’t believe I never heard of that vulnerability. Fortunately, it doesn’t work anymore; I assume they now do something a bit less naive. Unfortunately, that means this won’t work. Neat idea, though (although we wouldn’t have been able to do this anyway without violating Dropbox’s terms of service).

In general, we would like to support remote blockstores for similar use-cases. For example, now that we have experimental git support, we could fall back to asking GitHub for git objects we can’t find via bitswap (their API allows one to search for git objects by hash).


FYI, in case you ever find yourself implementing a deduplication system like that, a better way to do it is to force the client to prove that they know the file in question. You can do this by, e.g., having them send both HASH(file) and HASH(nonce || file), where nonce is some session-specific value chosen by the server. Users could still use this as an oracle to determine whether someone has uploaded the file (a bit concerning), but it won’t let them fetch the file given only the hash.
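A minimal sketch of that challenge-response scheme (the variable names and the server/client split are illustrative, not any particular provider’s protocol):

```python
import hashlib
import os

def content_hash(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()

def proof(data: bytes, nonce: bytes) -> str:
    # HASH(nonce || file): only someone holding the full file can compute this,
    # since the nonce is fresh and the hash can't be derived from HASH(file)
    return hashlib.sha256(nonce + data).hexdigest()

# Server side: issue a fresh nonce for this upload session
nonce = os.urandom(16)

# Client side: claims to already have `data`, sends both values
data = b"example file contents"
claimed_hash = content_hash(data)
claimed_proof = proof(data, nonce)

# Server side: recompute both against its stored copy before skipping the upload
stored = b"example file contents"
assert claimed_hash == content_hash(stored)
assert claimed_proof == proof(stored, nonce)
```

A client that only knows `content_hash(data)` has no way to produce `claimed_proof`, which is exactly what kills the Dropship-style fetch-by-hash trick.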

The code for Dropship no longer works because Dropbox updated their API. They still use SHA-256 deduplication, though, so the API would just need to be reverse-engineered again.

You can see this is still the case in their API documentation for the content hash: https://www.dropbox.com/developers/reference/content-hash

It’s the SHA-256 of the concatenated SHA-256s of each 4 MB (or smaller) block. So a file smaller than 4 MB has a Dropbox content hash of SHA256(SHA256(file)).
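That scheme is short enough to sketch directly (a simplified version of what the linked Dropbox page describes; it assumes a non-empty file):

```python
import hashlib

BLOCK = 4 * 1024 * 1024  # 4 MB blocks, per Dropbox's content-hash docs

def dropbox_content_hash(data: bytes) -> str:
    # SHA-256 over the concatenated SHA-256 digests of each 4 MB block
    digests = b"".join(
        hashlib.sha256(data[i:i + BLOCK]).digest()
        for i in range(0, len(data), BLOCK)
    )
    return hashlib.sha256(digests).hexdigest()

# A file under 4 MB has exactly one block, so its content hash
# reduces to SHA256(SHA256(file)):
small = b"hello"
assert dropbox_content_hash(small) == \
    hashlib.sha256(hashlib.sha256(small).digest()).hexdigest()
```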

With regards to the Terms of Service, that is true. Such code wouldn’t need to be in, for example, go-ipfs officially.

In fact, just a single modified node that used Dropbox as a backend for its blockstore would give the IPFS network access to all of Dropbox’s data.

That’s a different API for verifying uploads and comparing remote files to local files (in one’s own Dropbox). They probably still use SHA-256 for deduplication, but I doubt that they still allow clients to abuse this API. I assume they’ve implemented a fix like the one I suggested.

What you do in your own free time and what code you write is your own concern, but we’re not going to discuss that here 🙂.

All right.
Do you know where I can learn more about doing the same thing (but aboveboard) for GitHub?

For now, IPFS doesn’t have nice, pluggable, blockstores. You may be able to plug this in at the datastore level but I’m not really the expert on that (@Magik6k?).

As for the GitHub API, you can search commits using the search API and then use one of the git protocols to download the commit in question.
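For example, GitHub’s REST search API supports a `hash:` qualifier for commit search. A small sketch of building such a request (URL construction only; the exact response shape and any required headers are left to GitHub’s API docs):

```python
from urllib.parse import urlencode

def commit_search_url(sha1: str) -> str:
    # GitHub commit search supports a `hash:` qualifier, e.g.
    #   GET https://api.github.com/search/commits?q=hash:<sha1>
    return "https://api.github.com/search/commits?" + urlencode({"q": f"hash:{sha1}"})
```

Once the search result identifies a repository containing that commit, one of the ordinary git protocols can be used to actually fetch the object.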

I see how to get a commit with a certain sha1 hash. Would it be possible to get a blob by its SHA-256 though? That seems like the useful application. Almost the equivalent of Dropship but for GitHub.

Now I want to write a bot that googles every sha256 as hexadecimal and checks the results for any files with the desired hash…

> I see how to get a commit with a certain sha1 hash. Would it be possible to get a blob by its SHA-256 though? That seems like the useful application. Almost the equivalent of Dropship but for GitHub.

No, git, and therefore GitHub, use SHA-1. However, we also use SHA-1 for git objects in IPLD (if you enable the git plugin when you compile go-ipfs; we don’t plan on enabling it by default until we work out some kinks). So, you wouldn’t be able to use GitHub for arbitrary objects, but you could use it for git commit objects.

Also, it’s unlikely that many whole-file hashes in the wild will match up with existing hashes in IPFS. This is because:

  1. By default, we break files over 256KiB into chunks of ~256KiB and create a tree structure on top of those chunks. The hash is actually the hash of the root object in the tree (merkledag). We do this to better facilitate data transfer over the network (you can validate pieces of a file as you download it instead of having to download the entire file up front) and to allow deduplication between files with identical pieces.
  2. Currently, we wrap file chunks in a bit of extra metadata before hashing them. We actually have an experimental feature that we call “raw leaves” that doesn’t do this but we haven’t enabled that by default.
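A toy illustration of point 1 (heavily simplified: real go-ipfs builds a UnixFS merkledag with protobuf wrapping and multihash-encoded CIDs, so `toy_root_hash` is a stand-in, not the actual algorithm):

```python
import hashlib

CHUNK = 256 * 1024  # IPFS's default chunk size (256 KiB)

def toy_root_hash(data: bytes) -> str:
    # Hash each fixed-size chunk, then hash the list of chunk hashes.
    # (Real IPFS also wraps chunks in metadata unless "raw leaves" is on.)
    leaves = [hashlib.sha256(data[i:i + CHUNK]).hexdigest()
              for i in range(0, len(data), CHUNK)]
    if len(leaves) == 1:
        return leaves[0]  # a small file is just its own (raw) chunk
    return hashlib.sha256("".join(leaves).encode()).hexdigest()

# A 600 KiB file spans three chunks, so its root hash is NOT the
# plain SHA-256 of the file's bytes:
big = bytes(600 * 1024)
assert toy_root_hash(big) != hashlib.sha256(big).hexdigest()

# A file under 256 KiB is a single raw chunk, so (with raw leaves)
# it matches the plain whole-file SHA-256:
small = b"hello"
assert toy_root_hash(small) == hashlib.sha256(small).hexdigest()
```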

TL;DR: Once we enable “raw leaves” by default, you’ll be able to do this for files <256KiB (generally) but that’s really the only case that will work out of the box.

However, if we know how some third-party service stores a file, we can import it into IPFS such that we can later fetch it from that service (e.g., git/GitHub).