The problem is likely not on your end; I've been seeing a lot of these reports over the past few months.
TLDR: For the past few months content routing (Who can I acquire this data from?) has been having issues; fixes are in the works.
Long Version Below:
The method that IPFS uses to locate a peer that can supply us with desired data is called content routing. If the content routing system cannot find a node that can supply a piece of desired data then the data is effectively unreachable.
IPFS currently has 2 methods of content routing.
- Bitswap Sessions. (BSS)
- Distributed Hash Tables. (DHT)
BSS is an unstructured, opportunistic routing strategy that first checks whether any already-connected peer has the data we want. When it succeeds, it bypasses a large amount of work that our node would otherwise have to do. For popular content (e.g. WebUI static files, IPFS blog posts) this strategy is surprisingly effective, because most of the time one of the ~700 connected peers has the data.
The DHT (based on Kademlia) is a structured routing strategy in which each node is responsible for a small part of a dataset containing records of who can provide what data. It is effectively one big key-value table split into many pieces. How DHTs go about this is a bit too in-depth for a forum post; however, properly implemented DHTs have proven quite effective at this task, even scaling to millions of nodes (e.g. BitTorrent's Mainline DHT swarm).
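To give a rough feel for the "one big table split into many pieces" idea, here is a minimal sketch of how a Kademlia-style DHT decides which nodes are responsible for a key: peers and keys are hashed into the same ID space, and the peers whose IDs are closest to the key by XOR distance hold its records. The peer names, the helper names (`key_id`, `xor_distance`, `closest_peers`), and `k=3` are all made up for illustration; this is not the actual go-libp2p-kad-dht code.

```python
import hashlib

def key_id(data: bytes) -> int:
    """Hash a peer name or content key into a shared 256-bit ID space."""
    return int.from_bytes(hashlib.sha256(data).digest(), "big")

def xor_distance(a: int, b: int) -> int:
    """Kademlia's distance metric: the bitwise XOR of two IDs, read as an integer."""
    return a ^ b

def closest_peers(peers, content_key, k=3):
    """Return the k peers whose IDs are closest to the key (k=3 is illustrative)."""
    target = key_id(content_key)
    return sorted(peers, key=lambda p: xor_distance(key_id(p), target))[:k]

# Hypothetical swarm of 20 peers; real swarms are far larger.
peers = [f"peer-{i}".encode() for i in range(20)]

# These peers would store the "who can provide this?" records for the key.
responsible = closest_peers(peers, b"some-content-key")
print(responsible)
```

Every node can compute the same answer independently, which is why a lookup only needs to walk towards the key instead of asking the whole swarm.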
In short, when desired data is not available from any connected peer, our node must use the DHT to find and connect to a peer that can supply it.
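The order of operations described above can be sketched roughly like this. It is a toy in-memory model, not IPFS code: `find_provider`, the dictionaries, and the peer and CID names are all hypothetical, and a real DHT lookup is a multi-hop network query rather than a dictionary read.

```python
def find_provider(cid, connected_peers, dht_records):
    """Return (peer, how_found) for a peer that can supply cid, or (None, None).

    connected_peers: dict of peer name -> set of CIDs that peer holds
    dht_records:     dict of CID -> list of peers advertising that CID
    """
    # Step 1: opportunistic check of peers we are already connected to.
    for peer, blocks in connected_peers.items():
        if cid in blocks:
            return peer, "connected peer"

    # Step 2: nobody nearby has it, so fall back to the DHT.
    providers = dht_records.get(cid, [])
    if providers:
        return providers[0], "dht"

    # No route found: the data is effectively unreachable.
    return None, None

connected = {"peer-A": {"cid-popular"}, "peer-B": set()}
dht = {"cid-rare": ["peer-Z"]}

print(find_provider("cid-popular", connected, dht))
print(find_provider("cid-rare", connected, dht))
print(find_provider("cid-missing", connected, dht))
```

When step 2 is slow or fails, anything that is not already held by a connected peer becomes hard or impossible to fetch, which matches the symptoms people have been reporting.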
The current DHT implementation is unstable.
Because many IPFS subsystems need to work together, the development team was unable to implement a traditional Kademlia-based DHT 1:1.
Two examples that come to mind are:
- IPFS-DHT is required to communicate through existing IPFS-based connections and subsystems, whereas vanilla Kademlia exclusively utilizes connectionless UDP messages.
Since our node has already gone through the work of establishing a stable communication channel with NAT traversal, encryption, etc., it would be silly not to utilize it. At the same time, Kademlia operations assume one-off connectionless messages, and some inefficiencies arise when forcing these messages down a serialized tube.
- IPFS-DHT currently requires other DHT nodes to stay connected at all times in order to remember them.
This is mostly due to how the DHT subsystem currently interacts with the Peerstore and connection manager subsystems, and changes are planned to fix this.
At face value, the accumulation of these discrepancies seems to have kept the DHT swarm from scaling correctly to a large number of nodes: a single DHT lookup often takes minutes to complete, if it ever completes at all.
One possible cause of this instability is that in March a large number of nodes joined the network, and by a large number I mean a ~21x increase in ~24 hours (~14k nodes to ~300k nodes). This “unusual” increase in users brought with it a lot of instability.
First off, this GitHub issue.
With these proposed changes, the IPFS-DHT swarm should, in theory…
- Become far more resistant to “useless” / adversarial nodes.
- Properly balance load across the swarm.
- Become far more performant and stable.
Finally, many other GitHub issues have been raised with other potential improvements to content routing, with a large concentration located in the go-libp2p-kad-dht project: