Slow IPFS data retrieval after several performance improvement attempts

I have been trying to speed up the IPFS data retrieval process and have tried all of the approaches below:

So far, what has been completed (sketched below):
a. added the badgerds datastore profile
b. enabled the accelerated DHT client
c. increased the LowWater/HighWater connection manager parameters
d. added a list of popular providers to Peering.Peers
e. moved to a more powerful server (10 GB RAM, 8 cores, on GCP)
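Concretely, those changes were along these lines (the watermark values are illustrative, and the accelerated DHT client flag has lived under Experimental in some go-ipfs versions, so check the docs for your version):

```
# (a) initialize with the badger datastore profile
ipfs init --profile=badgerds

# (b) enable the accelerated DHT client (flag location varies by go-ipfs version)
ipfs config --json Experimental.AcceleratedDHTClient true

# (c) raise the connection manager watermarks (values here are illustrative)
ipfs config --json Swarm.ConnMgr.LowWater 600
ipfs config --json Swarm.ConnMgr.HighWater 900
```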

However, when running ipfs get [hash] with all of the above in place, the command still takes about 5 minutes. Is there any other reason that may be causing this slowdown, or a configuration parameter that I might be missing?

Main goal: retrieve data from IPFS faster.

Did you do all of these things on the machine hosting the content, the machine retrieving the content, or both?

What is the size and structure of the data you are trying to download (e.g. is it a single X byte file, a directory with lots of nested subdirectories and files, etc.)?

I did all of these things on the machine retrieving the content.

The data is a directory of non-nested small JSON files. The total size is usually around 5-10 MiB.

It’s possible it’s not the fault of the machine retrieving the content, but instead of the machine that is serving it. For example, that machine may not have advertised its data in the DHT (e.g. too many records and not using the accelerated client).

If you know the data provider try entering the information into https://ipfs-check.on.fleek.co/ and you should get some basic validation that it’s set up correctly. Do you know any of the people who could/should be hosting this data?

Side note: If you’re wondering how it’s possible you’re finding the content at all if it’s not advertised, the answer is that if your node happens to connect to another one that has the content (e.g. as a result of doing DHT queries or receiving queries), it’ll end up downloading from them even if it’s not advertised.


Thanks so much! Pinata is a data provider that should be hosting this data. I followed the IPFS docs guide on adding a content provider as a peer ([Peering with content providers | IPFS Docs]) and added Pinata’s addresses. So I thought that my machine must be connected to a peer that has the content and should be able to locate it fast. It is also having trouble getting the files quickly even after locating the data.
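For reference, the peering setup from that guide was roughly the following, with placeholders standing in for Pinata’s published peer ID and multiaddrs from the docs page:

```
# Add a content provider under Peering.Peers (placeholders; use the
# IDs/addresses published in the docs guide linked above)
ipfs config --json Peering.Peers '[{"ID": "<pinata-peer-id>", "Addrs": ["<pinata-multiaddr>"]}]'
# Then restart the daemon for the peering config to take effect
```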

If you’re connected to a node that has the content then it should be discovered pretty quickly. You should be able to enter your CID and each of the Pinata peerIDs/multiaddrs that you’re peered with into ipfs-check (linked above) to do a basic check that it has the data.

The screenshots above show what I am trying to retrieve and the response of the test; it says that the peer does have the CID.
As an aside, is there a way to initialize/use multiple IPFS nodes on my end to help speed up locating and fetching IPFS data? Or would increasing the server’s memory help more?

(I ran ipfs swarm peers and verified that the above peer was indeed connected.)

I tried downloading the data after manually connecting to the peerID you listed (ipfs swarm connect /p2p/QmcfgsJsMtx6qJb74akCw1M24X1zFwgGo11h1cuhwQjtJP). It took about 23s to get via ipfs pin add. Unfortunately, that peer is not advertising your data in the DHT, so it has to be connected to manually (e.g. peering for longer-term connections and ipfs swarm connect for shorter-term ones) or sort of bumbled into, as mentioned above.
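Concretely, the manual route is something like this, where <cid> stands in for your root CID:

```
# Short-term: open a direct connection to the provider, then fetch the DAG
ipfs swarm connect /p2p/QmcfgsJsMtx6qJb74akCw1M24X1zFwgGo11h1cuhwQjtJP
ipfs pin add <cid>

# Longer-term: add the peer under Peering.Peers instead (see the peering guide above)
```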

Unfortunately, writing the 10k tiny files out to disk, even once the data was cached, via ipfs get QmXUUXRSAJeb4u8p4yKHmXN1iAKtAV7jwLHjw35TNm5j took about 40 seconds when I tried it. Untarring data and optimally writing lots of tiny files to disk can be tricky. When I passed the -a flag, the data was exported as a tar archive almost instantly; I then manually untarred it with my system utilities, which took around 10 seconds.

I’m not sure how you’re trying to get the data out of go-ipfs, but if you’re using ipfs get I’d recommend using the archive flag and whatever the fastest untarring utility your OS provides. At that point it’s no longer an IPFS problem and is between you, your untarring utility, OS, and filesystem :grin:. If tar utilities are too slow, you might need to take a step back and look at your use case and approach.
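i.e. something along these lines (the output filename is just an example):

```
# Export the directory as a single tar archive instead of writing files one by one
ipfs get -a <cid> -o data.tar

# Untar with the system utility, which is typically much faster at
# writing lots of tiny files
tar -xf data.tar
```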

Having a single server that has and advertises the data in the DHT (i.e. no red X in ipfs-check) should help you out on discovery times and not require your users to manually peer with a single server. I would highly recommend having your data advertised in the DHT if you can; otherwise, the ability for other nodes to find your data is compromised (for example, it may require manual configuration).
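On the hosting side, a quick sketch of what that looks like (assuming a go-ipfs version where the accelerated DHT client still lives under Experimental):

```
# On the node hosting the content: manually announce the root CID to the DHT
ipfs dht provide <cid>

# With many records to advertise, the accelerated DHT client helps keep
# announcements fresh (flag location varies by go-ipfs version)
ipfs config --json Experimental.AcceleratedDHTClient true
```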

The more nodes that store and advertise your data the more resiliency you have to a single node being out of service, busy, censored, etc.


@gadget Update:

Actually, I tried downloading the data from /ip4/172.65.0.13/tcp/4009/p2p/QmcfgsJsMtx6qJb74akCw1M24X1zFwgGo11h1cuhwQjtJP (the peer you indicated), and after downloading a little bit it stalled out. That node might not have the full directory.

I didn’t try every node advertising that it had the data (found via ipfs dht findprovs <cid>), but this one /ip4/34.69.135.159/tcp/4001/p2p/12D3KooWSYykymnRKNTg4XF4M3XjuyxKcU6RKXWJT11tHH23UTdy had the data and gave it to me in a few seconds.
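If you want to reproduce that, the flow was roughly (<cid> being the root CID):

```
# List the peers advertising the CID in the DHT
ipfs dht findprovs <cid>

# Connect to one of the returned providers, then fetch
ipfs swarm connect /ip4/34.69.135.159/tcp/4001/p2p/12D3KooWSYykymnRKNTg4XF4M3XjuyxKcU6RKXWJT11tHH23UTdy
ipfs pin add <cid>
```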


There are also some implementation details around the various go-ipfs commands for fetching DAGs that may help/hurt performance in particular use cases. None of the performance profiles are guaranteed to stay the same over time as optimizations are made, but at the moment commands like ipfs pin add (if you want to keep the data safe from GC) or ipfs refs -r (if you don’t really care about that) are much more parallel than commands like ipfs get, and will likely help you get your content onto your node faster so you can then run ipfs get -a afterwards.
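Put together, a two-step fetch along these lines (with <cid> as a placeholder) should beat a single ipfs get:

```
# Step 1: pull the whole DAG in parallel into the local blockstore
ipfs pin add <cid>        # or: ipfs refs -r <cid> > /dev/null  (no GC protection)

# Step 2: export from the local cache as a tar archive, then untar
ipfs get -a <cid> -o data.tar && tar -xf data.tar
```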

Very briefly, the different downloading models for ipfs pin add and ipfs get stem from ipfs pin add just grabbing the whole DAG in whatever traversal order it wants, whereas ipfs get tries to stream results out as quickly as it can in a predetermined order with some amount of pre-fetching. It doesn’t have to be this way indefinitely; however, it does need someone with time to work on that area of the code :grin:.