Go-ipfs HTTP gateway connection resets

We’ve been having an issue that’s been making go-ipfs just about unusable as an HTTP gateway. We’re seeing connection resets that seem to occur intermittently between OpenResty and go-ipfs. We don’t know what triggers it, as it doesn’t align with anything we can see (load, resource usage, etc.). The setup looks kind of like this:

[Kubernetes LoadBalancer] -> OpenResty (+ lua-resty-auto-ssl) -> [Kubernetes NodePort Service] -> go-ipfs HTTP gateway port (8080)
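
A quick way to narrow down which hop is resetting is to request the same asset both directly from the gateway upstream and through the proxy, then compare the byte counts. A rough sketch, just reusing the upstream address and hostname from the error logs below (adjust to taste):

# Fetch directly from the go-ipfs gateway upstream, bypassing OpenResty,
# and print how many bytes actually came back along with the status code.
curl -s -o /dev/null -w 'direct:    %{size_download} bytes, HTTP %{http_code}\n' http://10.11.250.149:8080/app.ace7f8e8.css

# Same request through the load balancer / OpenResty for comparison.
curl -s -o /dev/null -w 'via proxy: %{size_download} bytes, HTTP %{http_code}\n' https://example.my.app/app.ace7f8e8.css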

It looks like the connection reset happens between OpenResty and go-ipfs, and we see logs like these from the OpenResty instance:

2020/07/31 17:35:58 [error] 20#20: *56 recv() failed (104: Connection reset by peer) while sending to client, client: 10.138.0.12, server: , request: "GET /app.ace7f8e8.css HTTP/1.1", upstream: "http://10.11.250.149:8080/app.ace7f8e8.css", host: "example.my.app"
2020/07/31 17:36:44 [error] 20#20: *74 recv() failed (104: Connection reset by peer) while sending to client, client: 10.8.0.1, server: , request: "GET /app.ace7f8e8.css HTTP/1.1", upstream: "http://10.11.250.149:8080/app.ace7f8e8.css", host: "example.my.app"
2020/07/31 17:36:45 [error] 20#20: *77 recv() failed (104: Connection reset by peer) while sending to client, client: 10.138.0.12, server: , request: "GET /app.ace7f8e8.css HTTP/1.1", upstream: "http://10.11.250.149:8080/app.ace7f8e8.css", host: "example.my.app"
2020/07/31 17:36:52 [error] 20#20: *92 recv() failed (104: Connection reset by peer) while sending to client, client: 10.138.0.17, server: , request: "GET /app.ace7f8e8.css HTTP/1.1", upstream: "http://10.11.250.149:8080/app.ace7f8e8.css", host: "example.my.app"

They seem to happen randomly, and the files get cut off at random points during the transfer. Here’s a sampling of the sizes returned for the truncated file (actual size: 214465 bytes):

52571
56667
56584
52488
157824
162048

The go-ipfs daemon debug logs so far don’t seem to show anything relevant. Which modules should we pay attention to in go-ipfs debug logging? Has anyone seen this before, or have any ideas for further debugging to help get to the bottom of this?

cmds/http perhaps?
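
If it helps, you can raise the log level for a single subsystem at runtime and watch the output; a minimal sketch (check ipfs log ls first, since subsystem names can differ between versions):

# List the available logging subsystems for this go-ipfs build.
ipfs log ls

# Raise cmds/http to debug and stream the daemon’s log output.
ipfs log level cmds/http debug
ipfs log tail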

If it does not align with things like CPU usage or FD usage, and the logs don’t already show errors, I doubt that this is a go-ipfs issue. If it comes from go-ipfs, it probably means some resource is scarce, usually file descriptors (but that should be logged).
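
A rough way to keep an eye on that from inside the pod, assuming pidof and procfs are available in the container:

# Count the daemon’s open file descriptors and compare against its limit.
IPFS_PID=$(pidof ipfs)
ls /proc/$IPFS_PID/fd | wc -l
grep 'Max open files' /proc/$IPFS_PID/limits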

Hey Hector, thanks for the response. I don’t see anything useful from cmds/http, but I’ll be on the lookout next time this happens with debug logging enabled.

I’m not really sure if it’s go-ipfs or another component; I’m just looking for troubleshooting ideas. From the PoV of those reset log messages above, it looks like go-ipfs is sending the resets, but at this point I don’t have any definitive answers.

I was able to get a tcpdump capture of the reset occurring from the PoV of the reverse proxy. It appears the IPFS gateway pod is the one sending the TCP reset. I don’t know why and I wasn’t able to find anything relevant from the go-ipfs debug logs.
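
For anyone following along, nothing fancy is needed to catch it from the proxy side; something like this (interface and file names are placeholders):

# Capture traffic between the proxy and the gateway upstream so the
# FIN/RST sequence around a truncated transfer can be inspected later.
tcpdump -i eth0 -s 0 -w gateway-resets.pcap host 10.11.250.149 and port 8080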

And here’s one of a successful request, just for comparison:

I’ll need to build a container I can install other packages/tools on before I can get a capture from the PoV of the IPFS gateway. I’m also going to try upgrading to 0.6.0.

I still can’t reliably reproduce this issue, but I was able to trigger it yesterday after throwing a bunch of traffic at a gateway. After this occurs, it stays in the broken state, where roughly 1 out of every 5 requests ends up with a partial transfer that is aborted midway through.
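
In case it’s useful, a crude loop like this is enough to show the partial transfers once a gateway is in the broken state (the URL and expected size are just the ones from the example above):

# Fetch the same asset repeatedly and flag any response that comes back
# smaller than the known full size (214465 bytes for app.ace7f8e8.css).
for i in $(seq 1 100); do
  size=$(curl -s -o /dev/null -w '%{size_download}' https://example.my.app/app.ace7f8e8.css)
  [ "$size" -lt 214465 ] && echo "request $i truncated: $size bytes"
done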

Restarting/recreating the go-ipfs pods brings them back to a working state. Restarting the nginx/OpenResty reverse proxy pods does not have any effect. That’s what leads me to believe the problem lies on the go-ipfs side of things.

Open to any ideas to try and narrow this down.

You should probably check your network settings in Kubernetes. This looks like a typical case of a TCP connection being treated as hung. As described here, a FIN flag should be sent when the transfer completes successfully.

Examining your capture files, I found an HTTP 200 message (#28) sent over the same connection just before it was closed at the end of the successful transfer. In the failed case, there is no such message. Maybe the system considers the TCP connection hung and closes it automatically.
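
A quick way to see the difference between the two captures is to filter for just the teardown packets and check which side sends the FIN or RST first (file name is a placeholder):

# List only the FIN/RST packets in the capture.
tcpdump -nn -r gateway-resets.pcap 'tcp[tcpflags] & (tcp-rst|tcp-fin) != 0'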

I get that, but there has to be a reason for the hangup/reset. It’s a symptom of an underlying problem, which is what I’m trying to figure out. So far it looks like the IPFS pod or go-ipfs gateway is sending the reset, but that’s all the information I have.

What Kubernetes network settings do you think would be relevant to these connections being reset mid-transfer?

For large amounts of data, I read an interesting blog post here: https://kubernetes.io/blog/2019/03/29/kube-proxy-subtleties-debugging-an-intermittent-connection-reset/

I am not an experienced Kubernetes user, but I think it could be a similar issue to yours.
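
If I’m reading that post right, the workarounds it describes boil down to either dropping the packets that conntrack marks INVALID (so they never reach the pod and trigger an RST) or making conntrack more liberal on the nodes. Roughly, as root on each node (please double-check against the post before applying anything):

# Option 1: drop packets conntrack has marked INVALID in the FORWARD chain.
iptables -I FORWARD -m conntrack --ctstate INVALID -j DROP

# Option 2: tell conntrack to accept out-of-window packets instead of
# marking them INVALID.
echo 1 > /proc/sys/net/netfilter/nf_conntrack_tcp_be_liberal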

I did an upgrade to Kubernetes 1.15 in the hope that it was the culprit, but after about a week and some change the issue appears to have come back.

The SNR in the go-ipfs debug logs is kind of low, and nothing stands out as useful: mostly swarm connection errors with some relay stream notices. Probably not relevant, but entries like this show up around the same time:

dht/RtRefreshManager rtrefresh/rt_refresh_manager.go:265 starting refreshing cpl 2 with key  ������������������������������� (routing table size was 168)

I might create a cluster outside of Kubernetes entirely to rule it out. I’m open to any other troubleshooting ideas. I still don’t have a good repro, since this only seems to occur with a week or two in between. I’m about out of options.