Cluster Configuration issues, or wrong use case?

I want to host files on IPFS and I have thousands of servers that will host the content. I have been attempting to use IPFS Cluster to make a server of mine act as cluster leader (the only trusted peer within the cluster) and define the pins. Then, I have all of my nodes join as cluster followers.
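For context, each follower is configured to trust only the leader. Roughly, the relevant part of a follower's service.json looks like this (a sketch; the cluster name and the leader's peer ID are placeholders):

```json
{
  "consensus": {
    "crdt": {
      "cluster_name": "my-cluster",
      "trusted_peers": ["<leader-peer-id>"]
    }
  }
}
```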

In testing, it works fine for the first 20-30 nodes: they all join the cluster, pick up the pinset, and pin the files locally. However, once more than ~30 nodes have joined, the cluster leader becomes unresponsive even though there is no excessive CPU/RAM/disk/network usage that I can see. The main thing I do see is hundreds of open network connections, many of them ESTABLISHED, which leads me to believe it is trying to reach cluster followers that may have gone offline.

My best guess is that the cluster leader becomes saturated with open network connections, possibly because the default settings result in very frequent pings, check-ins, etc. I also suspect this is made worse by some of the nodes regularly going offline and coming back online, with connections from previous sessions potentially still being retried.

Once it gets into this “bad state”, I am unable to list peers or pins: both commands simply hang for 5-10 minutes and never return. My guess is that the leader is trying to iterate through the peers to establish the state, and the peers are changing too frequently for it to ever complete.

So there are a few questions in here:

- Does what I’m describing above sound like a misconfiguration issue? Or is this simply not how Cluster is supposed to be used?
- Are there limits to how many members can be in a cluster?
- Should I bother using Cluster at all, since I will have a single source of truth for the pins anyway? Would it be better for me to make my own tool that tells my thousands of nodes which pins to keep?

Any help would be sincerely appreciated!


You might want to check your server’s OS networking configuration and logs…

Perhaps your server has too many TCP connections. If you are running a Linux-based distro, you might try raising the maximum connection limit per this serverfault post.
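Something along these lines (illustrative values only; the exact knobs and numbers depend on your distro and on which limit the cluster process actually hits first):

```sh
# Raise the open-file / socket limit for the shell running the cluster service
ulimit -n 65536

# Allow more concurrent TCP connections at the kernel level (illustrative values)
sudo sysctl -w net.core.somaxconn=4096
sudo sysctl -w net.ipv4.ip_local_port_range="1024 65535"
sudo sysctl -w fs.file-max=2097152

# Persist the changes via /etc/sysctl.conf and /etc/security/limits.conf
```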

Thanks for your response! I have already increased some of these, as well as the number of files that may be open on disk.

Are there any numbers available on how many connections are required based on the number of cluster leaders or cluster followers? Or perhaps a suggested maximum ratio of cluster leaders to cluster followers?

Some commands (peers ls or status) are “broadcast commands”. The response is the result of querying other peers.

> some of the nodes regularly going offline and coming back online

This might be an issue. The monitor_ping_interval configuration value controls how long a peer remains part of the peerset after its last metric. By default it is set to 15s, which means a peer will be considered part of the peerset for 30 seconds after its last “ping”. If your peers are coming and going, it might help to reduce this (though the network will be more chatty, since they will have to check in more often).
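For reference, that value lives in the top-level cluster section of service.json (sketch showing only the relevant key, with the default mentioned above):

```json
{
  "cluster": {
    "monitor_ping_interval": "15s"
  }
}
```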

That said, we are doing a bad job timing out these broadcast requests (I think in practice they don’t time out automatically), and they should not make everything hang forever. I am about to tag a new release and will try to remediate this.

Other than that, the cluster is functional and scalable. If you are going to work with 1000 nodes, even if they are all online, peers ls or status are going to be slow and heavy. Depending on what you need, you can avoid them, e.g. by using ipfs-cluster-ctl --enc=json id (which shows the peer IDs of other peers), or pin ls to list the items in the state without querying everyone for status, etc.
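As a quick illustration of the difference (a sketch of how I would use these commands):

```sh
# Cheap: answered by the local peer from its own view / the shared state
ipfs-cluster-ctl --enc=json id   # this peer's info, including the cluster peers it knows about
ipfs-cluster-ctl pin ls          # the pinset from the shared state, no status broadcast

# Heavy on a 1000-node cluster: broadcasts to every peer and waits for answers
ipfs-cluster-ctl peers ls
ipfs-cluster-ctl status
```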

There is a default dial timeout of 60 seconds btw.

Thank you very much for the insight on this. I’m a bit confused about “how long a peer is considered a peer”. In a way, I don’t really care about all of the follower peers being active; I’m just looking for “eventual consistency” of pins from my cluster leaders to my followers. So it seems like a shorter time-out with a longer ping interval would help me limit the active network connections (and thus the load on the server). Am I understanding this correctly?

Is it possible to change the default dial timeout on the cluster side? I don’t see the option listed here: Configuration - Pinset orchestration for IPFS

It is not possible to change the timeout, but I’m working on it.


Thanks! Glad to hear that. In my testing it seems that a few poorly performing followers can wreak havoc on a cluster. If the followers start to have connectivity issues (perhaps because their router can’t handle the number of connections), they seem to start connecting and disconnecting somewhat regularly. This then seems to cause lots of duplicated and hanging connections on the cluster leader. Does this sound like a realistic explanation of the issue, one that would potentially be mitigated by the longer polling interval and quicker dial timeout that you’re working on?

It is important to understand that this can cause some user-triggered broadcast operations to hang, and (I guess) stuck connections until they are cleaned up by the OS, but it should not affect the actual operation of the cluster, which consists of letting peers know what to pin and pinning what needs to be pinned. This is also why you don’t see increased CPU/memory usage: a couple of commands hang, but the peer is otherwise operative.

Note that there is also a connection manager which by default starts ripping out connections when they reach 400 (see the connection_manager settings).
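Those settings also live under the cluster section of service.json. A sketch with what I believe are the defaults (double-check against your generated config):

```json
{
  "cluster": {
    "connection_manager": {
      "high_water": 400,
      "low_water": 100,
      "grace_period": "2m0s"
    }
  }
}
```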


Hey @origin, there is a new cluster-v0.13.1-rc1 release (https://dist.ipfs.io/ipfs-cluster-service). Can you try it out?

Failed dials should timeout at 3 seconds.
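In case it helps, this is roughly how I would grab the RC on a linux-amd64 box (assuming the usual dist.ipfs.io naming scheme; adjust the platform suffix for your systems):

```sh
VERSION=v0.13.1-rc1
wget "https://dist.ipfs.io/ipfs-cluster-service/${VERSION}/ipfs-cluster-service_${VERSION}_linux-amd64.tar.gz"
tar -xzf "ipfs-cluster-service_${VERSION}_linux-amd64.tar.gz"
./ipfs-cluster-service/ipfs-cluster-service --version
```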

Thank you very much for the prompt update on this. We will give this a try!

Your suggestion to use ipfs-cluster-ctl --enc=json id was key; it has made a huge difference in the performance of the network. We had a job that would run peers ls on an interval to get the status of the network. I am guessing that as our network got bigger and more spread out, this job started taking longer and began overlapping with previous calls. I think this, combined with the cluster leaders having relay hop set to true while most of the followers have it disabled, is what caused our network to get overloaded.
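The replacement job is basically a one-liner now (a sketch; the `cluster_peers` field name and the `jq` parsing are assumptions about the JSON output, so verify against your version):

```sh
# Count how many cluster peers this node currently knows about, without a broadcast
ipfs-cluster-ctl --enc=json id | jq '.cluster_peers | length'
```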

I am curious - is there anything other than a private network planned to prevent the cluster from being “attacked” by a bad actor who spams these broadcast commands to intentionally cripple the network?

As far as I understand, the network is not crippled. The only issue is that the broadcast needed to complete the command takes a long time on the node issuing it.

Cluster peers are libp2p peers and, as such, part of a network. They are generally diallable by everyone. Libp2p, I think, does not yet offer filtering solutions (or it was added only recently). That said, in the case of follower cluster peers, most interactions will be rejected right away unless the requester is a “trusted peer”.