IPFS Cluster Resilience


We have a geographically distributed IPFS cluster. We publish content from our Lab with CI/CD, and it is delivered to the IPFS cluster via a local node.

Another member of the cluster sits in our datacenter; it serves content to the internet through the IPFS gateway.

Today the Lab fiber is down because we are changing connectivity providers.

The IPFS content on the public gateway is down. We are getting 404s.

How is that possible? The content was pinned in the cluster.

What happens to pinned content in this scenario?

Something’s off here. The gateway does not return a 404 when it cannot find content on IPFS; it keeps trying to find it and eventually times out, with a 5xx I think.

In any case, you should check that the content was actually successfully pinned in both places (ipfs-cluster-ctl status), and that the request on the gateway is actually for content that was pinned on that local node.
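One way to script the "pinned in both places" check is to look at the per-peer status the cluster reports for a CID. A minimal sketch, assuming you have already fetched and decoded the JSON that the Cluster REST API returns for a CID at /pins/<cid>, and that it has a `peer_map` of peer IDs to entries with a `status` field; verify that shape against your Cluster version. The peer names and CID in the usage line are placeholders.

```python
from typing import Dict


def unpinned_peers(global_pin_info: Dict) -> Dict[str, str]:
    """Map peer ID -> status for every peer that does not report "pinned".

    Assumes the shape {"cid": ..., "peer_map": {peer_id: {"status": ...}}}.
    A peer reporting "remote" was not allocated the CID, so a gateway peer in
    that state would have to fetch the content from another peer to serve it.
    """
    peer_map = global_pin_info.get("peer_map", {})
    return {
        peer_id: info.get("status", "unknown")
        for peer_id, info in peer_map.items()
        if info.get("status") != "pinned"
    }


if __name__ == "__main__":
    # Hypothetical example: peer IDs and CID are placeholders.
    example = {
        "cid": "QmExampleCid",
        "peer_map": {
            "lab-peer": {"status": "pinned"},
            "datacenter-peer": {"status": "remote"},
        },
    }
    print(unpinned_peers(example))  # {'datacenter-peer': 'remote'}
```

An empty result means every peer in the map holds the content locally; anything else names the peers to investigate.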

The content has been in production for some days…
I am trying to figure out what is up.

A 404 is returned when the link is invalid, not when it is inaccessible.
For example, you request the file Qmfoo/abc, but the Qmfoo directory doesn’t contain a file named abc.

In other words, you see a 404 when IPFS has “proved” that this path is, and always will be, invalid.
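The rule of thumb above (a 404 means the path is provably wrong, while unreachable content shows up as a timeout or a 5xx) can be written down as a small lookup. This is a sketch of the reasoning in this thread, not the gateway's documented contract; the function name and messages are illustrative.

```python
from typing import Optional


def interpret_gateway_result(status: Optional[int]) -> str:
    """Rough reading of an IPFS gateway outcome, following the thread's rule of thumb.

    status is the HTTP status code, or None if the request timed out or the
    connection failed before any response arrived.
    """
    if status is None:
        return "no response: content or a dependency may be unreachable"
    if 200 <= status < 300:
        return "served"
    if status == 404:
        return "path resolved as invalid: check the CID/path for typos"
    if 500 <= status < 600:
        return "gateway error: content could not be fetched in time"
    return f"other status {status}"
```

Under this reading, the 404s in the original report point at the path or name resolution step rather than at missing pins.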

I think you have a typo in your path.

This has all been released through CI/CD for months, with weekly content updates.
Nothing is manual: we pin to the cluster from GitLab pipelines and then update the DNS _dnslink entries.
The content setup is correct.
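Because the sites are published with DNSLink, a gateway request for a domain starts with a TXT lookup on _dnslink.<domain>, whose value has the form dnslink=/ipfs/<cid>. A minimal sketch of mapping such TXT values to a content path, assuming the TXT strings have already been fetched with whatever resolver you use; it simply takes the first DNSLink value it finds.

```python
from typing import Iterable, Optional


def dnslink_path(txt_records: Iterable[str]) -> Optional[str]:
    """Return the content path from the first DNSLink TXT value, if any.

    A DNSLink TXT record on _dnslink.<domain> looks like
    "dnslink=/ipfs/<cid>" (or "dnslink=/ipns/<name>").
    """
    prefix = "dnslink="
    for value in txt_records:
        # Some tools print TXT values wrapped in double quotes.
        value = value.strip().strip('"')
        if value.startswith(prefix):
            return value[len(prefix):]
    return None
```

If this lookup fails, the gateway never gets as far as the pinned CID, which is why pins can look healthy while the public site is broken.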

Something broke content delivery once the connectivity went down.

Something has upset either IPFS or the HAProxy in front of it.

Actually, the HAProxy in front is serving non-IPFS sites just fine…

… still investigating…

Yes, I managed to get into the ipfs-cluster member, and the content is displayed as pinned.

Then you should be able to curl the local IPFS gateway at localhost:8080 directly for the content and see whether it returns it, or what kind of error it gives.
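If you prefer a scripted version of that curl check, the sketch below fetches a URL with a timeout and reports either the HTTP status or the reason no response arrived. The function name is hypothetical and the usage URL is a placeholder; only the standard library is used.

```python
import socket
import urllib.error
import urllib.request
from typing import Optional, Tuple


def probe_gateway(url: str, timeout: float = 30.0) -> Tuple[Optional[int], str]:
    """Fetch url and return (HTTP status or None, short description)."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status, "ok"
    except urllib.error.HTTPError as err:
        # The gateway answered, but with an error status (e.g. 404 or 504).
        return err.code, str(err.reason)
    except (urllib.error.URLError, socket.timeout, TimeoutError) as err:
        # No HTTP response: connection refused, DNS failure, or timeout.
        return None, str(err)
```

Running it once against the local gateway (for example `probe_gateway("http://localhost:8080/ipfs/<cid>")`) and once through the public HAProxy address shows which hop introduces the error.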

I went to the IPFS server, ran ipfs get on the CIDs, and the content was delivered.

So the issue is between HAProxy and the IPFS content gateway, considering that HAProxy serves other content OK…

… getting closer…

@hector, the usual suspect was the guilty one: DNS

There was a DNS entry in our cluster pointing back to the Lab, which is down.
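Since the root cause was a name that stopped resolving, a quick check over every hostname the gateway and cluster configuration depend on can surface this kind of failure early. A minimal sketch, assuming you supply that list of hostnames yourself (the thread does not say which entry was involved); the resolver is injectable so the check can be tested without network access.

```python
import socket
from typing import Callable, Dict, Iterable


def failing_lookups(
    hostnames: Iterable[str],
    resolve: Callable[..., object] = socket.getaddrinfo,
) -> Dict[str, str]:
    """Return hostname -> error message for every name that fails to resolve."""
    failures = {}
    for host in hostnames:
        try:
            resolve(host, None)
        except OSError as err:  # socket.gaierror is a subclass of OSError
            failures[host] = str(err)
    return failures
```

Running this from the datacenter node, rather than from the Lab, matters: the point is to see which names that node can no longer resolve once the Lab link is gone.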

The IPFS content gateway depends so heavily on DNS that it should return a dedicated 5xx for failed DNS lookups.
