IPFS scaling question

(put this in 'Coding' because the other categories didn't seem to fit, but this one doesn't either)

One part of the IPFS architecture has always bothered me: the question of how it scales to huge amounts of data.

Storing a collection of data creates many objects

When you add a file to IPFS it will create at least 2 new content-addressed objects: a new 'block' with the data and an updated 'tree' object that points to the new block. Larger files may be built from many more objects, since a 'list' object adds a level of indirection to a series of 'blocks' holding the different chunks. A whole directory tree can require thousands or millions of objects to represent. This data will only live on your local node and on any nodes that request the same data. The storage of the data is fine; my concern is with the indexing.
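To make the object count concrete, here is a minimal sketch (not the real go-ipfs chunker or DAG format; the 256 KiB chunk size and plain SHA-256 hashing are just assumptions for illustration) of how one large file turns into many content-addressed objects:

```go
// Sketch: cut a file into fixed-size chunks, hash each chunk into its own
// "block", and hash the list of chunk hashes into a root "list" object.
// This is an illustration only, not the real go-ipfs chunking or encoding.
package main

import (
	"crypto/sha256"
	"fmt"
	"io"
	"os"
)

const chunkSize = 256 * 1024 // assumed chunk size; go-ipfs defaults to 256 KiB

func main() {
	if len(os.Args) < 2 {
		fmt.Println("usage: chunks <file>")
		return
	}
	f, err := os.Open(os.Args[1])
	if err != nil {
		panic(err)
	}
	defer f.Close()

	var chunkHashes [][]byte
	buf := make([]byte, chunkSize)
	for {
		n, err := io.ReadFull(f, buf)
		if n > 0 {
			h := sha256.Sum256(buf[:n]) // each chunk becomes its own block object
			chunkHashes = append(chunkHashes, h[:])
		}
		if err == io.EOF || err == io.ErrUnexpectedEOF {
			break
		} else if err != nil {
			panic(err)
		}
	}

	// The root object here is just the concatenation of the child hashes;
	// real IPFS encodes a structured DAG node and hashes that instead.
	root := sha256.New()
	for _, h := range chunkHashes {
		root.Write(h)
	}
	fmt.Printf("%d chunk objects + 1 root object, root hash %x\n",
		len(chunkHashes), root.Sum(nil))
}
```

Every one of those hashes is an object the rest of the network has to be able to look up.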

Storing new data requires many DHT updates

Consider adding a whole directory tree to IPFS. Every directory and every chunk of every file becomes its own object, and the IPFS DHT needs to be updated for each one of these objects. This is going to take a long time and move a lot of data. For adding a large collection of data your node will end up talking to a large percentage of all of the IPFS nodes on the network just to record the metadata.
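As a back-of-envelope illustration (assuming one object per directory, one block per 256 KiB chunk, and one extra 'list' object for each multi-chunk file, which is a simplification of the real layout), the number of DHT provide operations for a directory tree can be estimated like this:

```go
// Sketch: estimate how many objects, and therefore roughly how many DHT
// provide operations, adding a directory tree to IPFS would create.
// The per-directory and per-chunk accounting is an assumption, not the
// exact go-ipfs behaviour.
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

const chunkSize = 256 * 1024

func main() {
	if len(os.Args) < 2 {
		fmt.Println("usage: estimate <dir>")
		return
	}
	var objects int64
	err := filepath.Walk(os.Args[1], func(path string, info os.FileInfo, err error) error {
		if err != nil {
			return err
		}
		if info.IsDir() {
			objects++ // one tree object per directory
			return nil
		}
		chunks := (info.Size() + chunkSize - 1) / chunkSize
		if chunks == 0 {
			chunks = 1 // an empty file still produces one block
		}
		objects += chunks
		if chunks > 1 {
			objects++ // indirection object listing the chunks
		}
		return nil
	})
	if err != nil {
		panic(err)
	}
	fmt.Printf("roughly %d objects, i.e. roughly %d DHT provide operations\n", objects, objects)
}
```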

Retrieval has many possible shortcuts

When looking for a collection of data we don't need to query the DHT for every object. Once we find a node that contains one of the objects of interest, we can ask that node directly for anything else it has. And that node might have a peer-discovery shortcut to tell us about other nodes that might also be storing the same data. But when we are creating the data in the first place these shortcuts don't work: the network requires that every object be addressable via the DHT, so each one needs to get verified and stored.
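A toy sketch of that read-path shortcut (the interfaces here are hypothetical stand-ins, not the real Bitswap or DHT APIs) looks something like this; note that nothing comparable exists on the write path, where every hash has to be announced:

```go
// Sketch of the read-path shortcut: ask already-connected peers for a block
// first, and only fall back to an expensive global DHT lookup when none of
// them have it. All types here are hypothetical stand-ins.
package sketch

type Block []byte

// Peer is a hypothetical stand-in for a connected IPFS node.
type Peer interface {
	Get(hash string) (Block, bool) // ask the peer directly for a block
}

// DHT is a hypothetical stand-in for the global provider index.
type DHT interface {
	FindProviders(hash string) []Peer // expensive: global lookup
}

// fetch tries cheap direct requests first, then the DHT, and remembers any
// providers it discovers so the next block in the same collection can skip
// the DHT entirely.
func fetch(hash string, connected []Peer, dht DHT) (Block, []Peer) {
	for _, p := range connected {
		if b, ok := p.Get(hash); ok {
			return b, connected // shortcut: no DHT traffic at all
		}
	}
	for _, p := range dht.FindProviders(hash) {
		if b, ok := p.Get(hash); ok {
			return b, append(connected, p)
		}
	}
	return nil, connected
}
```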

Consider a subset of nodes sharing a set of data

Consider a handful of nodes that want to operate on a set of data that gets updated regularly. These nodes can discover each other by finding other nodes that hold the root object, so when reading data they can mostly limit traffic to the nodes that contain the data of interest. However, new data that gets added to this collection needs to get sent to the broader set of all IPFS nodes. Well, at least the metadata for that new data does.

Musing about other solutions

So right now it seems to me that IPFS can scale OK when broadly replicating mostly static data, and when used for web content, which does seem to be the initial target. But it will still have a fairly large indexing footprint on each node.

So the question is: what could be changed to improve this situation? Clearly, we need to remove the requirement that every single hash be indexed in the global DHT. What if only some objects were indexed? For instance, perhaps only a top-level tree object is indexed in the DHT, and the unindexed data it links to is only found on nodes containing the top-level object. This is more like the way BitTorrent works: the 'peers' are found using the DHT and then the data is requested from those peers directly. A large directory tree could be split into manageably sized sections by making the root of each section indexed. This change could reduce global deduplication, because we may not notice that a chunk of data already exists in the network, but any given node will still only store a single copy.
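Here is a rough sketch of what that selective indexing could look like (entirely hypothetical, not an existing go-ipfs feature): walk the DAG and only announce the roots of manageably sized sections, leaving everything beneath them to be fetched directly from whichever peers provide those roots.

```go
// Sketch of selective providing: announce only section roots to the DHT
// instead of every hash in the DAG. The Node and Provider types are
// hypothetical; this is not how go-ipfs currently works.
package sketch

// Node is a hypothetical DAG node: its own hash plus links to children.
type Node struct {
	Hash     string
	Children []*Node
}

// Provider is a hypothetical stand-in for the DHT announce operation.
type Provider interface {
	Provide(hash string)
}

// size returns the number of objects in the subtree rooted at n.
func size(n *Node) int {
	total := 1
	for _, c := range n.Children {
		total += size(c)
	}
	return total
}

// provideSectionRoots announces a node and stops descending once the subtree
// below it is small enough to be treated as one section; descendants inside a
// section are never announced individually.
func provideSectionRoots(n *Node, maxSection int, dht Provider) {
	dht.Provide(n.Hash)
	if size(n) <= maxSection {
		return // one announcement covers the whole section
	}
	for _, c := range n.Children {
		provideSectionRoots(c, maxSection, dht)
	}
}
```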

Anyway, I am sure this has been discussed in the past, but I haven't come across those discussions.


You're right @wscott. These things have been discussed in the past, but those were usually very dense discussions of the technical details. You've done a good job of spelling out some of the general patterns. Thank you for writing it up.

Related Optimizations

At the start of the data.gov sprint in January we generated a laundry list of optimizations that would impact large datasets. See "big optimizations" in the notes from the sprint planning meeting. Note that some of those optimizations have been handled now. We were also aiming at accommodating datasets that are hundreds of terabytes. That turned out to be irrelevant for the sprint (the target corpus of datasets ended up being less than 10TB).

Testing & Metrics

For most of these optimizations, we have a strong sense of what needs to be done, but rather than optimizing based on hunches we want to implement proper testing infrastructure so that we can do things like add 100TB to a node, replicate the data across 1000 nodes, and use fine-grained metrics to watch where the bottlenecks appear. This drove us to add more metrics features to ipfs, and it showed some of the demand for the Interplanetary Test Lab (a work in progress).
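As a rough illustration of the kind of instrumentation involved, something like the following (using the Prometheus Go client; the metric names here are invented for illustration and are not the ones ipfs actually exports) lets you count and time provide operations and scrape them into a dashboard:

```go
// Sketch: count and time DHT provide operations with the Prometheus Go
// client and expose them for scraping. Metric names are placeholders.
package main

import (
	"log"
	"net/http"
	"time"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

var (
	provides = prometheus.NewCounter(prometheus.CounterOpts{
		Name: "dht_provides_total",
		Help: "Number of DHT provide operations issued by this node.",
	})
	provideLatency = prometheus.NewHistogram(prometheus.HistogramOpts{
		Name: "dht_provide_duration_seconds",
		Help: "Latency of DHT provide operations.",
	})
)

// provide is a placeholder for the real DHT announce call; it records how
// often it is called and how long each call takes.
func provide(hash string) {
	provides.Inc()
	start := time.Now()
	// ... the actual DHT provide would happen here ...
	provideLatency.Observe(time.Since(start).Seconds())
}

func main() {
	prometheus.MustRegister(provides, provideLatency)
	// Expose the metrics so a dashboard can watch where bottlenecks appear.
	http.Handle("/metrics", promhttp.Handler())
	log.Fatal(http.ListenAndServe(":9090", nil))
}
```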

Limiting the Number of Nodes for Fast Replication

You're right about limiting the number of nodes in your network in order to reduce network overhead. That is definitely one strategy. In January I wrote these instructions for efficiently replicating large datasets. That was before @kubuxu implemented private networks in go-ipfs, so we can actually simplify the process profoundly: just use that feature to form a private network. (However, this might get tricky if you want to publish the data on the public network later. We haven't figured out the right UX for dealing with public nodes and private nodes on one machine.)
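For reference, the pre-shared key file that private networks rely on can be generated with a few lines of Go. This mirrors what the ipfs-swarm-key-gen tool does; double-check the format against your go-ipfs version, and place the output in each node's IPFS repo as swarm.key.

```go
// Sketch: generate a swarm.key file for a go-ipfs private network.
// Every node in the private network needs the same file in its repo
// (typically ~/.ipfs/swarm.key).
package main

import (
	"crypto/rand"
	"encoding/hex"
	"fmt"
)

func main() {
	key := make([]byte, 32) // 256-bit pre-shared key
	if _, err := rand.Read(key); err != nil {
		panic(err)
	}
	fmt.Println("/key/swarm/psk/1.0.0/")
	fmt.Println("/base16/")
	fmt.Println(hex.EncodeToString(key))
}
```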

ipfs-cluster: Coordinated Networks of Nodes

Your questions also relate, indirectly, to ipfs-cluster.


Thanks for the reply, and yes, it is very cool what we can accomplish with IPFS currently.

Yes, I see that. But most of those feel like optimizing the existing data structures to be more efficient. The current DHT inserts are slow, and we can fix the code and the network operations to make those operations faster. But that is only going to lead to linear performance improvements. My point is that the current architecture fundamentally uses a lot of network resources and cannot scale to replace something like BitTorrent. And it is the write path that has the problem.

Ideally, a collection of people could share data without involving the entire network, but the current indexing requirements mean that any data that is publicly accessible must be published at very fine granularity.

Always a good thing. But again that is focused on the efficiency of the current architecture. Still, it will help to expose the network footprint of local operations.

Yes, you could splinter the network and make a private cloud for your data to avoid all the metadata updates. But that is a hack and is a lot less usable than just publishing on the public net and allowing interested parties to contact each other.

Disconnected networks

Semi-related to this idea of having a group of people communicate without needing to touch the entire network is the idea of disconnected networks.

I work on networking issues for some private networks that are weakly connected to the global internet. For example, let's say I have a collection of users, all in close proximity, who want to interact with each other but share an unreliable/slower connection to the outside internet. Consider building a software service on top of IPFS. It would be nice if communication between nodes on the campus were fast without saturating the shared outside link.

Again, for reads, we can build a system to make this fast and still follow the current IPFS architecture. But writes will bog down. It would be really useful to be able to limit the number of objects that need to be globally indexable and introduce the ability to have locally indexable objects that are linked to a globally indexed top-level object.

Also, it would be useful to automatically (or manually) form some mesh topologies that allow outside requests to filter through a local proxy node to prevent duplicate requests.
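A sketch of that proxy idea (hypothetical, nothing in IPFS does this for you today): a gateway node caches blocks so the shared uplink carries each request at most once, however many campus nodes ask for it.

```go
// Sketch: a local caching proxy for block requests. Campus nodes ask the
// proxy; only cache misses cross the slow shared uplink. The fetchOutside
// type is a hypothetical stand-in for retrieval over the external link.
package sketch

import "sync"

type fetchOutside func(hash string) ([]byte, error)

type proxy struct {
	mu    sync.Mutex
	cache map[string][]byte
	fetch fetchOutside
}

func newProxy(fetch fetchOutside) *proxy {
	return &proxy{cache: map[string][]byte{}, fetch: fetch}
}

// Get serves repeated requests for the same hash from the local cache, so
// the external link is used at most once per block.
func (p *proxy) Get(hash string) ([]byte, error) {
	p.mu.Lock()
	if b, ok := p.cache[hash]; ok {
		p.mu.Unlock()
		return b, nil
	}
	p.mu.Unlock()

	b, err := p.fetch(hash) // only cache misses cross the shared uplink
	if err != nil {
		return nil, err
	}

	p.mu.Lock()
	p.cache[hash] = b
	p.mu.Unlock()
	return b, nil
}
```

(A real version would also coalesce concurrent requests for the same hash so two simultaneous misses don't both hit the uplink.)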

These are interesting points. If you want to propose an alternative approach/architecture, I recommend writing it up and posting on https://github.com/ipfs/notes, which is basically an open forum for RFC-type discussions.

(since discuss.ipfs.io is so new we haven't figured out the relationship between it and ipfs/notes. The one thing that's clear is ipfs/notes tends to get technical proposals that look more like RFCs. By contrast, discourse is designed for broader community discussion, mutual advice & support, etc. I'm personally partial to letting discuss.ipfs.io be the main forum for all discussion and have things jump over to ipfs/notes if/when there's a more structured proposal.)


At the moment I don't have an alternative proposal. I have a vague idea about a change that seems like it would work. But I don't know enough about the IPFS implementation to describe the updated protocol or understand the impact of the proposal.

Here I was just voicing my concerns and trying to spell out where I think they will occur. I am also trying to see whether I am misunderstanding something.

I also prefer using discuss.ipfs.io for discussions like this. Partly because it is so unstructured that people don't need to feel intimidated about contributing. Also, the moderation tools are such that if an unrelated gem appears in the middle of a discussion it can be split out into its own topic.


That works for me!

In terms of the architecture and the data structures involved, I think some of the scaling issues you point to are less costly than you suspect, but I have to defer to someone like @whyrusleeping @daviddias @lgierth or @jbenet for a more definitive answer.