Questions About IPFS Scaling Limits

Hi all!

I’m currently trying to evaluate IPFS for a large-scale content distribution network.

From my understanding of both IPFS and BitTorrent, I’m confident that IPFS should have no problem serving the datasets and distribution patterns that BitTorrent currently handles. In other words, IPFS should be able, now or after some optimization and tweaking, to act as a replacement for BitTorrent.

However, there are some hypothetical workloads that we don’t see in the wild, and I’m wondering about IPFS’s ability to serve these.

In a BitTorrent swarm, all peers in a swarm are interested in roughly the same data. Thus, the swarm is in a sense “topical”, and the protocol is able to take advantage of this in a few ways. For example, peers are all requesting and sending pieces from the same list, and thus are able to send each other have lists and want lists that consist only of indices. Also, although peers may sometimes not be interested in the same data, the common case is that peers are able to upload to and download from most peers that they connect to.

In IPFS this might not be the case. The IPFS network might be storing extremely large datasets where many peers are only interested in a small subset of the data. For example, imagine a ~1PiB dataset and many tens of thousands of peers, where each peer is only interested in some more-or-less-random ~10-100GiB subset. This would mean that each peer is only interested in roughly 0.001-0.01% of the dataset.
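To put rough numbers on this scenario, here is a back-of-the-envelope sketch. The 256 KiB block size is an assumption (it matches the default chunk size as far as I know, but real datasets may be chunked differently), and the function names are just for illustration.

```python
GIB = 2**30
PIB = 2**50
BLOCK_SIZE = 256 * 2**10  # assumption: 256 KiB blocks


def fraction_percent(subset_bytes, dataset_bytes=PIB):
    """Share of the dataset a peer is interested in, as a percentage."""
    return 100 * subset_bytes / dataset_bytes


def block_count(nbytes, block_size=BLOCK_SIZE):
    """Number of blocks needed to hold nbytes (rounded up)."""
    return -(-nbytes // block_size)


if __name__ == "__main__":
    print("blocks in 1 PiB:", block_count(PIB))
    for size_gib in (10, 100):
        subset = size_gib * GIB
        print(f"{size_gib} GiB: {fraction_percent(subset):.5f}% of the dataset, "
              f"{block_count(subset)} blocks")
```

Under these assumptions the whole dataset is about 4.3 billion blocks, and a single peer's subset is tens to hundreds of thousands of blocks.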

My fears in this scenario are:

  1. Peers might have great difficulty finding other peers to share data with, since another peer might not be interested in the same data.

  2. Peers would have very large have and want lists, and there would be huge overhead sending these lists of hashes back and forth.

  3. The sheer number of blocks involved would mean that peers might not be able to insert themselves into the DHT for every block they have or are interested in. Thus, there might be subtrees of the dataset which peers are interested in, but for which the DHT does not contain any entries.

  4. BitSwap might break down, since most of the peers a node trades with might never be seen again, or might never be interested in the same data after the initial trade.
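To make concern 2 a bit more concrete, here is a rough sketch comparing an index-based list with a hash-based one. It assumes 34 bytes per entry (a sha2-256 multihash: 2 bytes of prefix plus a 32-byte digest) and compares against a BitTorrent-style bitfield of one bit per piece. Real want list entries carry extra fields, so treat the hash figure as a lower bound, and treat the item count as an assumption carried over from a 10 GiB subset in 256 KiB blocks.

```python
import math

MULTIHASH_BYTES = 34  # sha2-256 multihash: 2-byte prefix + 32-byte digest


def hash_list_bytes(n_items):
    """Size of a list that names every item by its hash."""
    return n_items * MULTIHASH_BYTES


def bitfield_bytes(n_items):
    """Size of a bitfield with one bit per item in a shared, ordered list."""
    return math.ceil(n_items / 8)


if __name__ == "__main__":
    n = 40960  # assumption: 10 GiB of 256 KiB blocks
    print("hash list:", hash_list_bytes(n), "bytes")
    print("bitfield: ", bitfield_bytes(n), "bytes")
    print("ratio:    ", hash_list_bytes(n) / bitfield_bytes(n))
```

The bitfield only works because every peer in a swarm agrees on the same ordered piece list, which is exactly the "topical" property that a large, sparsely shared dataset lacks.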

Does anyone have any thoughts about these concerns? Has the IPFS network seen instances of this kind of usage?

Best,
Casey

My (probably incomplete) understanding is this:

  1. You say that BitTorrent uses only indices into a limited data space to exchange information. IPFS uses hashes instead. While this is more expensive, it’s not that much more.
  2. Data in IPFS is put together in a graph, which leads to a new set of possible optimisations:
    1. If a node has a block, it likely also has the child blocks, so you can prioritize communication with that node for the other blocks.
    2. You can say “I want all the children of $hash” instead of listing all the child hashes.
    3. In the future, with IPLD selectors, you can ask for a subset of a graph (for example, ask node A for one half of the children and node B for the other half).
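As a toy illustration of the “I want all the children of $hash” idea: if blocks link to their children by hash, the requester only needs to send the root hash and the provider can expand it locally. This is a minimal sketch using an in-memory dict and hex SHA-256 keys as a stand-in for real CIDs; `put` and `subtree` are hypothetical helpers, not the BitSwap wire protocol or the IPLD selector API.

```python
import hashlib


def put(store, data, links=()):
    """Store a node and return its key (hex SHA-256 over data plus child keys)."""
    payload = data + b"".join(bytes.fromhex(link) for link in links)
    key = hashlib.sha256(payload).hexdigest()
    store[key] = (data, tuple(links))
    return key


def subtree(store, root):
    """Return the keys of root and every descendant present in the store."""
    seen, stack, order = set(), [root], []
    while stack:
        key = stack.pop()
        if key in seen or key not in store:
            continue  # skip duplicates and blocks this node does not hold
        seen.add(key)
        order.append(key)
        stack.extend(store[key][1])
    return order


if __name__ == "__main__":
    store = {}
    leaf_a = put(store, b"chunk a")
    leaf_b = put(store, b"chunk b")
    root = put(store, b"directory", [leaf_a, leaf_b])
    # The request carries one hash; the provider resolves three blocks.
    print(len(subtree(store, root)))
```

The request stays constant in size no matter how many blocks sit under the root, which is where the saving over an explicit list of child hashes would come from.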

More details here: https://github.com/ipfs/go-ipfs/issues/3786

Thanks for the response!

  1. I agree that hashes are only a little more expensive than indices, but in IPFS the have lists and want lists themselves can potentially be much larger. So on top of the per-item cost of hashes versus indices, there are also many more items in total to pass back and forth.

  2. I think these are possible optimizations; however, I would be more comfortable if there were, for example, simulations or approximate calculations based on distributions of data.
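As a starting point for that kind of approximate calculation, here is a very crude model. It assumes each peer's subset is an independent, uniformly random fraction of the dataset, and it picks 50,000 peers as an arbitrary stand-in for "tens of thousands". Both are assumptions, and real access patterns are likely to be more clustered than this.

```python
GIB = 2**30
PIB = 2**50


def expected_other_peers(n_peers, fraction):
    """Expected number of other peers whose subset includes a given block."""
    return (n_peers - 1) * fraction


def prob_at_least_one(n_peers, fraction):
    """Probability that at least one other peer's subset includes the block."""
    return 1 - (1 - fraction) ** (n_peers - 1)


if __name__ == "__main__":
    n_peers = 50_000  # assumption
    for size_gib in (10, 100):
        f = size_gib * GIB / PIB
        print(f"{size_gib} GiB subsets: "
              f"expected overlap {expected_other_peers(n_peers, f):.2f} peers, "
              f"P(at least one) {prob_at_least_one(n_peers, f):.2f}")
```

Under this model, with 10 GiB subsets a given block is wanted by fewer than one other peer on average, so overlap between any two peers would be rare unless interest is far less random than assumed.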