What is the fastest way to access IPFS JSONs that are not mine?

I am building a Python bot for which I need to access thousands of JSON files on the ipfs.io server in a matter of seconds; at the moment my bot takes a few minutes to load all the files. I would like to know if there is a way to access an IPFS node that is not mine directly from my Python code with an IPFS library, or, if that’s not possible, whether there is a way to speed up the data gathering apart from multithreading and async, which I already use. I was thinking of using multiple proxies and different IPs; I don’t know if a faster internet connection would help.
The ipfs links I have to get access to and gather information from are something like this:
https://ipfs.io/ipfs/QmfWuoR2Cxezerynjjknhmus1qgGdwLRcBLQVFSrdHAwnJ/1

I just need to change the last number up to 10000 and read from every file.
Thanks to anyone who helps me

Don’t hammer the poor thing. That’s not its purpose (it’s meant for testing only), and it has become very slow because everyone is thinking like you.

Run your own node locally where you intend to run your script and access its gateway at 127.0.0.1:8080. So, your URLs will look like this, using your example:
http://127.0.0.1:8080/ipfs/QmfWuoR2Cxezerynjjknhmus1qgGdwLRcBLQVFSrdHAwnJ/1

That will give you the highest possible speed, since you are talking to the node locally (of course, the node will have to fetch the content if it doesn’t have it in its cache already).

I am really sorry, I had no idea I was causing slowness on the servers. I will now try to run my script locally like you suggested; I hope this works. Can I ask whether internet speed changes anything, or whether, again, requesting from different proxies could change anything?

Your node becomes your proxy (in a way); it will pick which other nodes to download the content from (you don’t get to pick, it chooses the most efficient ones).

One thing that will speed things up dramatically is to configure your node to use the AcceleratedDHTClient (it will need more resources from your machine to do that, though; it’s really meant for servers).

Another thing you can do is ask your node to pin the folder you are targeting first and let it finish, then run your script (it will then pull from the local cache and run insanely fast). Here is how you would do that: type this command at the CLI:
ipfs pin add --progress QmfWuoR2Cxezerynjjknhmus1qgGdwLRcBLQVFSrdHAwnJ

Do those two things, and your speed will be at its best (and yes, during the pin command, having higher bandwidth certainly helps).

Thank you so much, I will keep you as a reference for any future issues. You helped me a lot with this. Thanks again, and sorry for having slowed the servers up until now.

You’re welcome. Just post to this thread if you have more issues while trying to implement this.

Hey, sorry to disturb you again. I would like to ask if you know Python, and whether there is a way to use my localhost in my Python script. I ask because currently my bot opens all the links like the one I showed you before, changing the last number. When I change the “ipfs.io” part, though, I get asked to open IPFS Desktop (when I open the local link manually), while when the bot opens the link it doesn’t get asked, and so it can’t find the JSONs since it can’t open them. I even saw that there is a library called “ipfshttpclient” for Python-to-IPFS communication, and that there is a client.get_json('hash') command, but I am having trouble using it, since I don’t know if and how I can connect to IPFS through my localhost.
By reading a bit of the documentation, I tried this base command for initializing my client: client = ipfshttpclient.connect(addr='/dns/localhost/tcp/5001/http')

When I run the program, though, it gives me the error "Method not allowed for url: "

Hope you can help, thanks in advance!

You don’t need any of those things; just use http and connect to http://127.0.0.1:8080 as if you were connecting to ipfs.io, and you should get the exact same behavior. Note that it’s http, not https like ipfs.io. And, unfortunately, the last time I wrote anything in Python was 2012, so I really don’t remember it much. Here is an example:

> echo "Hello nocivo" | ipfs add
added QmVL8dcmHwv682KmGSVyKxa5KQ64Y3d685sx5gk5pTpJfK QmVL8dcmHwv682KmGSVyKxa5KQ64Y3d685sx5gk5pTpJfK
 13 B / 13 B [===========================================================================================================================================] 100.00%
> curl "http://127.0.0.1:8080/ipfs/QmVL8dcmHwv682KmGSVyKxa5KQ64Y3d685sx5gk5pTpJfK"
Hello nocivo
> curl "http://127.0.0.1:8080/ipfs/QmfWuoR2Cxezerynjjknhmus1qgGdwLRcBLQVFSrdHAwnJ/1"
{
  "name": "Wolf #2",
  "description": "The genesis collection of Wilder Beasts. 3,333 Wolves who arrived through an interdimensional portal to be the first lifeforms to roam Wilder World.",
  "image": "ipfs://Qmc5BiRj6h3o7uoQjPiH4kiX1Ed5aQu8pkxeieJC9s7CJJ",
  "animation_url": "ipfs://QmSixUHqZ6oFE9pW2KJRz75B8SxEpoEFiUS9Am9EGyRucw",
  "stakingRequests": "disabled"
}%
>

As you can see, I can easily access your URL using my own node in the way I explained. It should be easy for you to do the same in Python.
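
For what it’s worth, here is a minimal Python sketch of the same idea, assuming the aiohttp library and the default local gateway port (the CID is the one from your example; the names and concurrency limit are just illustrative):

# Sketch: fetch the JSON files through the local IPFS gateway, concurrently.
# Assumes `pip install aiohttp` and a node whose gateway listens on 127.0.0.1:8080.
import asyncio
import aiohttp

CID = "QmfWuoR2Cxezerynjjknhmus1qgGdwLRcBLQVFSrdHAwnJ"
GATEWAY = "http://127.0.0.1:8080/ipfs"  # http, not https

async def fetch(session, sem, n):
    async with sem:  # cap concurrent requests so the local node isn't flooded
        async with session.get(f"{GATEWAY}/{CID}/{n}") as resp:
            resp.raise_for_status()
            # content_type=None: the gateway may not label the files application/json
            return await resp.json(content_type=None)

async def main(count):
    sem = asyncio.Semaphore(64)
    async with aiohttp.ClientSession() as session:
        return await asyncio.gather(*(fetch(session, sem, n) for n in range(1, count + 1)))

items = asyncio.run(main(100))  # try a small range first, then go up to 10000
print(items[0]["name"])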

Alright, first of all, thank you so much for the help. I now understand, and I have implemented it in my script.
I noticed that the speed is insanely fast, but only from the second time I load the files. The first time is very slow, even slower than ipfs.io, while it gets all the files; the second time it seems to remember which node is better and downloads from there, so it goes very fast. The only problem is that I don’t really need the second try: my objective is to get the information as fast as possible on the first try.
I am kind of lost now. My bot’s first run is faster on the ipfs.io server, but successive runs are way faster on my local node, and I really only need the first run to be fast.

I would like to ask if there is a way of somehow getting the speed of the local second runs on the first run, even though I guess that’s not possible.
And if that is not possible with code, would it be possible with money?
So, my second question: is there a guide or a special setting I need in order to use that AcceleratedDHTClient you mentioned earlier? Could using it, plus buying lots of personal servers, speed up the loading process that much?

For reference, I’m getting 1-minute loads on ipfs.io (all tries), a 6-minute load on my local node (1st try), and 9-second loads on local (2nd try onward).

I will leave my Discord tag down here in case you want to respond in chat, or in case you want to look at my particular case more closely: Lello#2604

Thanks in advance and have a great day!

K, a few things to address :stuck_out_tongue:

Nodes: Your personal node isn’t really different from the node running at ipfs.io, except, of course, that talking to your own node on the same machine is vastly more efficient than talking to the ipfs.io node across the internet. Also, you are sharing the ipfs.io node with many others, while you are the sole user of your own node.

Caches: once content has been downloaded, it survives in the node’s cache for a while (which can be a long while if you don’t download much and the cache is large; it’s 10 GB by default). This is why the second run is so much faster: you are receiving the content from the local cache on your own machine, not from the internet. The reason the first run on ipfs.io is faster than the first run on your own node is that ipfs.io already has part or all of the content in its own cache and is simply returning it to you (this, of course, is only true if someone else requested that content before you and it hasn’t been flushed from the cache yet).

AcceleratedDHTClient: the normal DHT client has to walk the DHT to find which node holds the block you want, open a connection to that node, and download the block, and this happens for every block. What takes a long time is that walk: your node connects to a set of DHT servers and asks the question, they each return a new set of DHT servers, and it has to ask those again, until one or more DHT servers return the addresses of the nodes that have your content. Your node then opens a connection to one or more of them and downloads the content from the fastest one. It works that way because your node only knows about a small fraction of the DHT servers on the network. The AcceleratedDHTClient, in contrast, scans the network every hour and keeps a list of all the DHT servers on the network (usually between 10,000 and 15,000), so when it needs to walk the DHT to find your block, it can usually do it in one hit, vastly speeding up the process.

Availability: your node can only find out where a block lives if the node that has it has published it to the DHT. This normally happens every 12 hours, and a DHT record survives on a DHT server for 24 hours. In theory, all available blocks should be discoverable. In practice, it’s a mess. Every node tries to “reprovide” (that’s what it’s called) every block every 12 hours, but if a node has many blocks, one run can take vastly longer than 12 hours, which causes some blocks to disappear from the DHT for long periods of time and makes finding them extremely difficult. That’s another reason to run the AcceleratedDHTClient. I have around 30,000 blocks pinned on my node. Using the normal DHT client, it would take over 100 days to do a reprovide run, leaving my blocks undiscoverable for 99 of those 100 days in each cycle. Running the AcceleratedDHTClient allows my node to do a reprovide run in 12 minutes. I suspect many servers serving a lot of blocks are still using the default DHT client.

So, the speed at which your node can find the blocks you are looking for depends on whether they are easy to find. If only one node has a block and its reprovide cycle is poor, that block is going to be really hard to find. However, once you have downloaded it, it is now in your cache and your node has immediately done a “provide” on it, so the next person looking for it has two options; if they download it, the next person has three. The more popular your content is, the easier (and faster) it is to find and retrieve. If it’s obscure, the node serving it had better be running the AcceleratedDHTClient, or it’ll take you a while to find it. However, once a connection is established to that node, if it happens to also have all the other blocks you are looking for, your node will download the whole set over that connection without using the DHT. In fact, if a block can’t seem to be found and you let your node search long enough, it randomly walks the network and can run into a node that has it by chance, and if that node has everything else, you get the whole thing. I sometimes let my node look for something for a whole day, and it often finds it after many hours of looking. Of course, if the owner had done a proper job of reproviding, it could have been found instantly.

Anyway, I still think that using your own node to retrieve the content you want is the best way to solve your problem, but you are a prisoner of the quality of the owner’s reprovide and the popularity of the content. It could be fast, or it could take forever.

I can’t thank you enough for the amazing information you are giving me. As a 17-year-old who knows very little about this whole programming world, people like you who take the time to rephrase hard concepts in such a basic and understandable way are essential.

From what I understand, I definitely need to find a way to run this AcceleratedDHTClient. I will now try to find some info online on how to switch my node over to it.

Having established that there is no particularly fast method to scroll through all these IPFS links, especially ones that have just been created, I would like to ask if there is a way of speeding up this search process manually. I mean, which component of my computer determines the speed of this node search? If I wanted to invest money to increase that speed, would it be better to invest in a very good internet connection, in more CPU power, or in more machines running simultaneously? In short, what investment would definitely speed up this search process? Thank you again.

oh, oops, run this on your node, and restart it :stuck_out_tongue:

ipfs config --json Experimental.AcceleratedDHTClient true

After a restart, you have to wait till it’s done doing its first scan before you can use it; that takes about 10 minutes.

Remember in a previous post I told you to do a “pin” on the folder before you run your script? The reason I said to do that is that it’s the fastest possible way to get the “first run” done. Once the pin is finished, you just run your script and you get “2nd run” speed.

The node plays many tricks when doing a pin, so you’ll get a much shorter time that way. Beyond that, as I said before, you are a prisoner of the quality of the reprovide and the popularity of what you are looking for. There’s really not much more you can do to speed things up.

I just ran that example pin I gave you on my node, and it took 105 seconds. That’s probably what you’ll see too. I’m not going to flush my cache, so you’ll benefit from it when you do your run. Let me know how fast it goes.

Hey, so, I think I have activated the AcceleratedDHTClient on my node. I waited a couple of minutes, and then the command “ipfs stats dht” finally listed a bunch of addresses. I then tried to start my bot searching the IPFS links through http://127.0.0.1:8080.
Unfortunately it still goes really slow, just like the last time I ran it with my local address in front of the link.
I am really not sure I did it right. I checked some stats like “ipfs stats provide” and “ipfs stats bw” and they seemed to work, showing information received and sent.

When I started my IPFS node with the command “ipfs daemon” and looked at the activity monitor on my PC, I saw that the ipfs process immediately started sending lots of data but receiving a lot less.
Even when I started my bot (with the DHT-accelerated node running), the packets sent far outnumbered the packets received.

Not being very informed about this, I am pretty sure I am reading the graphs wrong; maybe the packets sent are a sign of my bot searching, but I don’t know, so I am just explaining what I see.

That said, do you think this is all normal or did I make some evident mistake in the process?

Regarding the folder pinning, I still have to learn how to do it in Python, since my bot never touches the terminal, where I could have used the command you sent me.

To help you understand, do you need some screenshots? I’ll also remind you that I posted my Discord name a few messages ago, so, in case you need to see some processes on my PC or have me try some particular commands, we could quickly do that in a call.

Thanks again for the time you are dedicating to me and have a great day!

Run these two commands, they should give you an idea if it’s running or not:

> ipfs stats dht wan | grep Bucket
  Bucket  0 (14072 peers) - refreshed never:                                          
> ipfs stats provide
TotalProvides:          67k (67,167)
AvgProvideDuration:     20.818ms
LastReprovideDuration:  12m1.433677s
LastReprovideBatchSize: 33k (33,583)

As you can see, my node sees 14,072 DHT servers and took 12 minutes to reprovide 33,583 blocks. If the AcceleratedDHTClient isn’t running, the second command will return an error instead.

Which version of ipfs are you using?

P.S. you have to use the API to do a pin from Python: Pin objects to local storage
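
In case it helps, a rough sketch of that call with the requests library, assuming the node’s RPC API listens on the default port 5001 (note that the API accepts POST requests only, which is likely why you got “Method not allowed” earlier):

# Sketch: pin the folder through the local node's HTTP RPC API,
# the same operation as `ipfs pin add --progress <CID>` at the CLI.
import requests

CID = "QmfWuoR2Cxezerynjjknhmus1qgGdwLRcBLQVFSrdHAwnJ"  # the folder from your example

resp = requests.post(
    "http://127.0.0.1:5001/api/v0/pin/add",
    params={"arg": CID, "progress": "true"},
    timeout=None,  # pinning a large folder can take a while
)
resp.raise_for_status()
print(resp.text)  # progress updates stream back, one JSON object per line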

P.P.S. There is one more thing you can do to speed things up, possibly by a lot, but it’s a little bit tricky. I’ll explain it after your next post.

Bucket 0 (10353 peers) - refreshed never:

TotalProvides: 7k (7,691)
AvgProvideDuration: 67.335ms
LastReprovideDuration: 8m37.87892s
LastReprovideBatchSize: 7k (7,691)

These are the results after almost 25 minutes of the AcceleratedDHTClient running.

I am using version 0.12.0 of ipfs

(my internet connection currently averages 40-50 Mb/s)

K, it all looks reasonable to me.

Now, here is the last thing you could do. As I’ve explained, what takes a long time is the DHT walk to find each block and connect to the server that has it. Once connected, downloading the block (or blocks) that server has is limited only by bandwidth. What if you knew ahead of time which server (or servers) had everything you want, and opened a permanent connection to it? Then requesting anything would just be a quick download, with no searching. Well, you can; it’s called “peering”. The tricky part is finding the server(s).

The first thing you need to do is get a list of candidates. Use the command ipfs dht findprovs <key>... on a bunch of blocks you want and see which peers tend to come up a lot.

Next, use IPFS Check to test a specific node and check a good sample of the blocks you want and see if it says it has them all.

Once you have one server (or a couple) that seems to have everything you are looking for, it’s time to set up peering. Edit the following section of your config file and add the peer ID of the node you found to it, then restart your node (I redacted the ID):

"Peering": {
	"Peers": [
		{
			"Addrs": [],
			"ID": "12D3KooW..."
		}
	]
},

No need to supply the address, your node will find it using the DHT.

Now that you have a permanent connection to the server(s), running the pin command should only be limited by the available bandwidth.

Let me know how it went.

Alright, before trying it I would like to explain the situation better, so you can tell me if this is really possible.

I don’t know if you had already figured it out, but I am trying to get the metadata of NFTs; that’s what those links hold in their JSON files.

So for example this link: “https://ipfs.io/ipfs/QmettED54g37LWexjAex9MXvEMjEzaEDnVBrA71H18Vo73/1”

has the metadata (the information I need) for item number 1 of the collection.

By changing only that last number in the link, without changing the ID (I am guessing it’s not changing the ID because the code stays entirely the same, but correct me if I’m wrong), I can access the whole collection, usually made up of 5k to 10k pieces.

The collections are not revealed (don’t show metadata) until a set time, when the creators upload all the metadata and change the IPFS ID to the one with the metadata.
At that moment, I have to instantly load the new IPFS ID and fetch all the just-uploaded metadata with my bot.

I now have two questions, since I don’t know this as well as you do:

  1. Is it plausible that the blocks for the unrevealed IPFS ID are the same as the blocks for the revealed ID? I am guessing so because it is the same person that makes these two different links, usually from the same PC and from the same house. If that were the case, I could load a list of the blocks I need to check from the unrevealed link and then just download the new metadata from the revealed link, which maybe uses the same blocks.

  2. Do the same links with a different number at the end, for example “https://ipfs.io/ipfs/QmettED54g37LWexjAex9MXvEMjEzaEDnVBrA71H18Vo73/1” and “https://ipfs.io/ipfs/QmettED54g37LWexjAex9MXvEMjEzaEDnVBrA71H18Vo73/421”, have the same blocks? If yes, I could find the blocks needed to download just number 1 every time and then reuse them to search all the other numbers.

P.S. I know I am probably using entirely wrong terminology here, so to be clear: when I say “blocks” I mean the servers, or whatever they are called, that my PC connects to in order to download a specific JSON / IPFS URL. Correct me if that is not the case and I will say it better next time. And correct me if the concept itself is wrong; I wouldn’t want to misunderstand how this complex system is defined.

Thanks a lot, again.

Use the following command to see the list of blocks (from your example URL):

> ipfs object links QmettED54g37LWexjAex9MXvEMjEzaEDnVBrA71H18Vo73
QmUhu7rXdSQEwtWevUTkFxSLxgagqrmZnA4AGqQFaMevwg 863 1
QmaqaLPgb115UYmZHWCusmhCuCyxmShBfgjkdWyASYGGp7 860 10
QmQhiadFBJP6UwL5aQEn6n4jhR1KmaF2UPAQNpCubdhjtb 865 100
QmNr1wonM8o5kxaKD4qcMqVboVAeCKtjmCjVMLKjF9gzUh 851 1000
Qmd6HSRdQ5NCLifd8Mj5Bq7vPumrewmx1y86PxVVbYHquN 862 1001
QmfXPk39vdjTqwZqQBMH1FaSsXhXy6iHcSkmeoKbYoKrGa 881 1002
...

Each line has a block CID, a size and a name (the number you add to the end). This is similar to a directory listing, and shows you all the valid names and their CID.
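
If you’d rather get that listing from your Python bot than from the terminal, the same data should be available over the node’s RPC API. Here is a sketch, assuming the /api/v0/object/links endpoint on the default API port and a response carrying a top-level “Links” array:

# Sketch: fetch the name -> CID listing of the folder from the local RPC API,
# mirroring `ipfs object links <CID>`. Assumes the API on 127.0.0.1:5001.
import requests

FOLDER_CID = "QmettED54g37LWexjAex9MXvEMjEzaEDnVBrA71H18Vo73"

resp = requests.post(
    "http://127.0.0.1:5001/api/v0/object/links",
    params={"arg": FOLDER_CID},
    timeout=60,
)
resp.raise_for_status()
links = {link["Name"]: link["Hash"] for link in resp.json()["Links"]}
print(links["1"])  # block CID of the item named "1"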

Pick a few, and use the commands I showed you to try and determine the server(s) that has them.
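
To avoid doing that by hand for dozens of blocks, something like this sketch could automate the tallying; it assumes the ipfs binary is on your PATH, and the CIDs are a few from the listing above:

# Sketch: run `ipfs dht findprovs` over several block CIDs and count
# which provider peer IDs come up most often; those are peering candidates.
import subprocess
from collections import Counter

cids = [
    "QmUhu7rXdSQEwtWevUTkFxSLxgagqrmZnA4AGqQFaMevwg",  # "1"
    "QmaqaLPgb115UYmZHWCusmhCuCyxmShBfgjkdWyASYGGp7",  # "10"
    "QmQhiadFBJP6UwL5aQEn6n4jhR1KmaF2UPAQNpCubdhjtb",  # "100"
]

tally = Counter()
for cid in cids:
    try:
        out = subprocess.run(
            ["ipfs", "dht", "findprovs", cid],
            capture_output=True, text=True, timeout=120,
        )
    except subprocess.TimeoutExpired:
        continue  # nothing found in time; move on to the next block
    # findprovs prints one provider peer ID per line
    tally.update(line.strip() for line in out.stdout.splitlines() if line.strip())

for peer_id, count in tally.most_common(5):
    print(count, peer_id)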

I tried a few from both URLs you provided in this thread, and hardly anything comes up at all, and what comes up doesn’t respond.

Bottom line is, the quality of their reprovide is dismal, which explains why your script runs so slowly.

It’s going to take some work trying to find their server(s).

Are the servers I would need to find different for every CID, i.e. for every URL?

And could I make a script that does this server search on its own, or is it different every time, so it has to be done manually?