(I put this in ‘Coding’ because the other categories didn’t seem to fit, but this one doesn’t really fit either)
One part of the IPFS architecture has always bothered me: how it scales to huge amounts of data.
Storing a collection of data creates many objects
When you add a file to IPFS it creates at least two new content-addressed objects: a ‘block’ holding the data and an updated ‘tree’ object that points to the new block. A larger file is built from many more objects: the data is split into chunks, each chunk becomes its own ‘block’, and a ‘list’ object provides the indirection that ties the series of chunks together. A whole directory tree can require thousands or millions of objects to represent. This data lives only on your local node and on any nodes that request the same data. The storage itself is fine; my concern is the indexing.
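To make that concrete, here is a minimal sketch of the bookkeeping, assuming plain SHA-256 and IPFS’s default 256 KiB chunk size. The function names are made up, and real IPFS uses multihashes/CIDs and a richer DAG format rather than raw concatenated digests:

```go
package main

import (
	"crypto/sha256"
	"fmt"
)

const chunkSize = 256 * 1024 // the default IPFS chunker uses 256 KiB pieces

// addFile splits data into fixed-size chunks, hashes each chunk as its
// own content-addressed block, then builds a 'list' object whose payload
// is the concatenation of the chunk hashes. The hash of that list object
// addresses the whole file.
func addFile(data []byte) (root [32]byte, objects int) {
	var listPayload []byte
	for off := 0; off < len(data); off += chunkSize {
		end := off + chunkSize
		if end > len(data) {
			end = len(data)
		}
		chunkHash := sha256.Sum256(data[off:end]) // one new block object per chunk
		listPayload = append(listPayload, chunkHash[:]...)
		objects++
	}
	root = sha256.Sum256(listPayload) // the list object itself is one more object
	return root, objects + 1
}

func main() {
	data := make([]byte, 10*1024*1024) // a 10 MiB file
	root, n := addFile(data)
	fmt.Printf("root %x..., %d objects created\n", root[:4], n)
	// => 41 objects: 40 chunk blocks plus 1 list object
}
```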
Storing new data requires many DHT updates
Consider adding a whole directory tree to IPFS. Every directory and every chunk of every file becomes its own object, and the IPFS DHT needs to be updated for each one of these objects. That is going to take a long time and move a lot of data. To add a large collection your node will end up talking to a large percentage of all the IPFS nodes on the network just to record the metadata.
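A back-of-the-envelope count shows the scale. The collection sizes below are made-up round numbers; the k=20 replication per record is what libp2p’s Kademlia DHT uses:

```go
package main

import "fmt"

func main() {
	const (
		files       = 100_000
		avgFileSize = 1 << 20   // 1 MiB per file
		chunkSize   = 256 << 10 // 256 KiB default chunker
		directories = 10_000
		k           = 20 // peers that store each provider record (Kademlia k)
	)
	chunksPerFile := avgFileSize / chunkSize // 4 chunks per file
	// chunks + one list object per file + one object per directory
	objects := files*chunksPerFile + files + directories
	fmt.Printf("objects to announce: %d\n", objects)         // 510,000
	fmt.Printf("provider records to place: %d\n", objects*k) // 10,200,000
	// Each record lands on a different set of peers determined by the
	// hash, so the publisher contacts a large slice of the whole network.
}
```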
Retrieval has many possible shortcuts
When looking for a collection of data we don’t need to query the DHT for every object. Once we find a node that holds one of the objects of interest we can ask that node directly for anything else it has, and that node might offer a peer-discovery shortcut and tell us about other nodes storing the same data. But when we are creating the data in the first place these shortcuts don’t work: the network requires that every object be addressable through the DHT, so a record for each one has to be verified and stored.
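The read path looks roughly like this (a sketch of the idea, not any real Bitswap API; all names are invented): ask peers we already know before paying for a DHT lookup.

```go
package main

import "fmt"

type CID string

type Peer struct{ blocks map[CID][]byte }

// fetch asks peers we already learned about before falling back to a
// DHT lookup. Once one provider of the collection is known, most
// objects can be pulled from it directly with no further DHT traffic.
func fetch(c CID, known []*Peer, dhtLookup func(CID) *Peer) ([]byte, []*Peer) {
	for _, p := range known { // shortcut: a peer we already talk to
		if b, ok := p.blocks[c]; ok {
			return b, known
		}
	}
	p := dhtLookup(c)        // slow path: one DHT query finds a provider...
	known = append(known, p) // ...which then serves later requests too
	return p.blocks[c], known
}

func main() {
	provider := &Peer{blocks: map[CID][]byte{"root": {1}, "child": {2}}}
	dhtLookup := func(CID) *Peer { return provider }

	var known []*Peer
	_, known = fetch("root", known, dhtLookup)  // needs a DHT query
	_, known = fetch("child", known, dhtLookup) // served directly
	fmt.Println("peers discovered via the DHT:", len(known)) // => 1
}
```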
Consider a subset of nodes sharing a set of data
Consider a handful of nodes that want to operate on a set of data that gets updated regularly. These nodes can discover each other by finding other nodes with the root object, so when reading data they can mostly limit traffic to the nodes that hold the data of interest. However, any new data added to the collection needs to be sent to the broader set of all IPFS nodes. Well, at least the metadata for that new data does.
Musing about other solutions
So right now it seems to me IPFS can scale OK when broadly replicating mostly static data, and web content, which does seem to be the initial target, fits that pattern. But it will still impose a fairly large indexing footprint on each node.
So the question is: what could be changed to improve this situation? Clearly, we need to remove the requirement that every single hash be indexed in the global DHT. What if only some objects were indexed? For instance, perhaps only a top-level tree object is indexed in the DHT, and the unindexed data it links to is only found on nodes that hold that top-level object. This is more like the way BitTorrent works: the ‘peers’ are found using the DHT and then the data is requested from those peers directly. A large directory tree would be split into manageably sized sections by indexing the roots of different sections. This change can reduce global deduplication, because we may not notice that a chunk of data already exists elsewhere in the network, but any given node will still store only a single copy.
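Here is a sketch of the announcement policy I have in mind, with an invented size threshold for deciding which subtrees count as ‘sections’ (none of this is real IPFS behavior):

```go
package main

import "fmt"

type CID string

type Node struct {
	cid      CID
	children []*Node
}

// announceRoots publishes provider records only for 'section roots':
// any node whose subtree reaches the size threshold gets a DHT entry,
// while smaller subtrees stay unindexed and are fetched from whoever
// provides the enclosing root. Returns the size of the unannounced
// remainder so the parent can absorb it.
func announceRoots(n *Node, threshold int, announce func(CID)) int {
	size := 1
	for _, c := range n.children {
		size += announceRoots(c, threshold, announce)
	}
	if size >= threshold {
		announce(n.cid) // this section is indexed in the global DHT
		return 0
	}
	return size // too small: roll into the parent's section
}

func publish(root *Node, threshold int, announce func(CID)) {
	if announceRoots(root, threshold, announce) > 0 {
		announce(root.cid) // the top-level object is always indexed
	}
}

func main() {
	tree := &Node{cid: "root", children: []*Node{
		{cid: "dir-a", children: []*Node{{cid: "f1"}, {cid: "f2"}, {cid: "f3"}}},
		{cid: "dir-b", children: []*Node{{cid: "f4"}}},
	}}
	publish(tree, 4, func(c CID) { fmt.Println("announce", c) })
	// 7 objects in the tree, but only 2 DHT announcements: 'dir-a'
	// (a subtree of 4 objects) and 'root'. Everything else is found by
	// asking the providers of those two roots.
}
```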
Anyway, I am sure this has been discussed in the past, but I didn’t notice the discussions.