[go-libp2p] [go-libp2p-kad-dht] Unable to connect to peer nodes (providers) using Kad-DHT

DHT is unable to fetch multi-address for certain peers when private network is scaled up (both in terms of number of nodes, and number of transactions wrt DHT). Due to this we face 2 issues:

  1. it is unable to connect to certain nodes giving the error “dial backoff” and/or “i/o timeout”.
  2. it is unable to fetch the providers for a given key/contentID for certain nodes, i.e. it gives empty response on dht.FindProvidersAsync(key) call even when dht.Provide(key) has been called for these nodes.

For (1) we tried to fetch the multiaddress from the DHT using dht.FindPeer API, but it gave the error “Routing: Not Found”
In both these cases all peers are online but still this issue persists when scaling the network up . On a smaller network it works fine for any number of calls to the DHT.

Can you share the total number of nodes and how many connections they usually maintain? Are they dial-able from the others as needed?

We are running a network of 2000 nodes, with around 350-450 connections (as per host.network().Peers() api call). We did some local testing and found out that the nodes were dial-able from others as well.

This sounds like those nodes are not reachable or not running.

Any chance the records have expired (they need to be republished every 12h).

Anything relevant in logs?

After checking the most obvious explanations, you might need to look closely at debug logs from the node that cannot find routing and the node that it is trying to find routing to, make sure that it is still listening etc.

We tried to print the o/p of “GetClosestPeers” which returned us the NodeIds & then we tried to use"FindPeer" to get the connection details of that node but Address Info was missing. How come NodeId is being retrieved but Address Info is not being retrieved?

It happens for records provided for around 30m-1h as well (not all but some)

The main thing I would check is what the routing tables of your peers look like and checking that peers have routing tables that are about the right amount of filled (i.e. you should see around Min(1/2^bucketNum, bucketSize) peers per bucket.

Some code in go-ipfs that prints out stats like this is go-ipfs/stat_dht.go at master · ipfs/go-ipfs · GitHub (I highlighted the areas that deal with getting the peers out of the routing table and figuring out which buckets they are from).

1 Like

Found following from the debug logs (Please copy it on nodepad++ for best view):

DEBUG   swarm2  [limiter] clearing all peer dials: QmcE8JqxGpq95ev4kENpEchMYeCBgPTVBRXzAc2qRj2fG2
2021-04-16T15:20:04.156Z        DEBUG   swarm2  network for QmP3g6JP2NrERuDWkwMuMQgcwahSeQo6iw5VR68WPvLtJU finished dialing QmcE8JqxGpq95ev4kENpEchMYeCBgPTVBRXzAc2qRj2fG2
2021-04-16T15:20:04.156Z        DEBUG   dht     request failed to open message sender   {"error": "failed to dial QmcE8JqxGpq95ev4kENpEchMYeCBgPTVBRXzAc2qRj2fG2: all dials failed\n  * [/ip4/192.168.183.109/tcp/5000] dial tcp4 0.0.0.0:5000->192.168.183.109:5000: i/o timeout", "to": "QmcE8JqxGpq95ev4kENpEchMYeCBgPTVBRXzAc2qRj2fG2"}
2021-04-16T15:20:04.157Z        DEBUG   swarm2  network for QmP3g6JP2NrERuDWkwMuMQgcwahSeQo6iw5VR68WPvLtJU finished dialing QmXJdbkHf7zyNPHV7Vw1JgvZ6YcwSrzuTyDcPhprM6hcdS
2021-04-16T15:20:04.157Z        DEBUG   dht     request failed to open message sender   {"error": "failed to dial QmXJdbkHf7zyNPHV7Vw1JgvZ6YcwSrzuTyDcPhprM6hcdS: all dials failed\n  * [/ip4/192.168.117.240/tcp/5000] dial tcp4 0.0.0.0:5000->192.168.117.240:5000: i/o timeout", "to": "QmXJdbkHf7zyNPHV7Vw1JgvZ6YcwSrzuTyDcPhprM6hcdS"}
2021-04-16T15:20:04.157Z        INFO    swarm2  got error on dial: dial tcp4 0.0.0.0:5000->192.168.104.151:5000: i/o timeout
2021-04-16T15:20:04.157Z        DEBUG   swarm2  [limiter] clearing all peer dials: QmXKVcRi4fE6LvXGQsJK1xqyE5NbeAUhgJpJucxeoWQtz3
2021-04-16T15:20:04.157Z        DEBUG   swarm2  network for QmP3g6JP2NrERuDWkwMuMQgcwahSeQo6iw5VR68WPvLtJU finished dialing QmXKVcRi4fE6LvXGQsJK1xqyE5NbeAUhgJpJucxeoWQtz3
2021-04-16T15:20:04.157Z        DEBUG   dht     request failed to open message sender   {"error": "failed to dial QmXKVcRi4fE6LvXGQsJK1xqyE5NbeAUhgJpJucxeoWQtz3: all dials failed\n  * [/ip4/192.168.104.151/tcp/5000] dial tcp4 0.0.0.0:5000->192.168.104.151:5000: i/o timeout", "to": "QmXKVcRi4fE6LvXGQsJK1xqyE5NbeAUhgJpJucxeoWQtz3"}
2021-04-16T15:20:04.156Z        INFO    swarm2  got error on dial: dial tcp4 0.0.0.0:5000->192.168.111.30:5000: i/o timeout
2021-04-16T15:20:04.158Z        DEBUG   swarm2  [limiter] clearing all peer dials: QmPdPg2LHadd3DpE1QVJ4xm2t9Svf261HDtw4dXK5utK1h
2021-04-16T15:20:04.158Z        DEBUG   swarm2  network for QmP3g6JP2NrERuDWkwMuMQgcwahSeQo6iw5VR68WPvLtJU finished dialing QmPdPg2LHadd3DpE1QVJ4xm2t9Svf261HDtw4dXK5utK1h
......
......
......
......
2021-04-16T15:20:04.432Z        DEBUG   pubsub  PEERDOWN: Remove disconnected peer QmcK8ScGfi1bptmRyiyjhSgNsG1bvQEYV4q5XRXoLicHbM
2021-04-16T15:20:04.432Z        DEBUG   pubsub  error reading rpc from QmcK8ScGfi1bptmRyiyjhSgNsG1bvQEYV4q5XRXoLicHbM: stream reset
2021-04-16T15:20:04.751Z        DEBUG   basichost       protocol negotiation took 109.162µs
2021-04-16T15:20:04.751Z        DEBUG   dht     peer found      {"peer": "QmVuTG8C51UkfjUwV6jiviH9kDfHBzNPnjaJEbL3DbmZ2M"}
2021-04-16T15:20:04.751Z        DEBUG   dht     handling message        {"from": "QmVuTG8C51UkfjUwV6jiviH9kDfHBzNPnjaJEbL3DbmZ2M", "type": 4, "key": "EiAAAQMYAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA=="}
2021-04-16T15:20:04.752Z        DEBUG   dht     handled message {"from": "QmVuTG8C51UkfjUwV6jiviH9kDfHBzNPnjaJEbL3DbmZ2M", "type": 4, "key": "EiAAAQMYAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA==", "time": 0.000772582}
2021-04-16T15:20:04.752Z        DEBUG   dht     responded to message    {"from": "QmVuTG8C51UkfjUwV6jiviH9kDfHBzNPnjaJEbL3DbmZ2M", "type": 4, "key": "EiAAAQMYAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA==", "time": 0.000883386}
2021-04-16T15:20:04.826Z        DEBUG   pubsub  PEERDOWN: Remove disconnected peer QmPyEojyrWbbpLmZpztvBwdn46hGFM4Sv2LTTmaU4KjY6e
2021-04-16T15:20:04.826Z        DEBUG   pubsub  error reading rpc from QmPyEojyrWbbpLmZpztvBwdn46hGFM4Sv2LTTmaU4KjY6e: stream reset
2021-04-16T15:20:05.238Z        DEBUG   dht     peer found      {"peer": "QmQFF15eRAGYpaXfabTRJHG4PxrJSU6cECr3XFPab6HVN4"}
2021-04-16T15:20:05.238Z        DEBUG   dht     handling message        {"from": "QmQFF15eRAGYpaXfabTRJHG4PxrJSU6cECr3XFPab6HVN4", "type": 2, "key": "EiB9IQgQkNC8Yi1Lisk3EmuMfGqgN+nL+Cjo9lAyVXOtBw=="}
2021-04-16T15:20:05.238Z        DEBUG   dht     adding provider%!(EXTRA string=from, peer.ID=QmQFF15eRAGYpaXfabTRJHG4PxrJSU6cECr3XFPab6HVN4, string=key, dht.loggableProviderRecordBytes=bciqh2iiiccinbpdcfvfyvsjxcjvyy7dkua36ts7yfdupmubskvz22by)
2021-04-16T15:20:05.238Z        DEBUG   dht     handled message {"from": "QmQFF15eRAGYpaXfabTRJHG4PxrJSU6cECr3XFPab6HVN4", "type": 2, "key": "EiB9IQgQkNC8Yi1Lisk3EmuMfGqgN+nL+Cjo9lAyVXOtBw==", "time": 0.000176133}
2021-04-16T15:20:05.411Z        DEBUG   basichost       protocol negotiation took 106.054µs
2021-04-16T15:20:05.411Z        DEBUG   dht     peer found      {"peer": "QmQFF15eRAGYpaXfabTRJHG4PxrJSU6cECr3XFPab6HVN4"}
2021-04-16T15:20:05.411Z        DEBUG   dht     handling message        {"from": "QmQFF15eRAGYpaXfabTRJHG4PxrJSU6cECr3XFPab6HVN4", "type": 4, "key": "EiB/Hk5B6g0idkDqg9pxQ/erIripA31ec5S1y8saqq6Rdw=="}
2021-04-16T15:20:05.412Z        DEBUG   dht     handled message {"from": "QmQFF15eRAGYpaXfabTRJHG4PxrJSU6cECr3XFPab6HVN4", "type": 4, "key": "EiB/Hk5B6g0idkDqg9pxQ/erIripA31ec5S1y8saqq6Rdw==", "time": 0.000248879}
2021-04-16T15:20:05.412Z        DEBUG   dht     responded to message    {"from": "QmQFF15eRAGYpaXfabTRJHG4PxrJSU6cECr3XFPab6HVN4", "type": 4, "key": "EiB/Hk5B6g0idkDqg9pxQ/erIripA31ec5S1y8saqq6Rdw==", "time": 0.00029735}
2021-04-16T15:20:05.670Z        DEBUG   pubsub  error reading rpc from QmZYe3jmMDkGm7JbSpW2ushEBbcZ8NGPCo86r9kAew6mt9: stream reset
2021-04-16T15:20:05.670Z        DEBUG   pubsub  PEERDOWN: Remove disconnected peer QmZYe3jmMDkGm7JbSpW2ushEBbcZ8NGPCo86r9kAew6mt9
2021-04-16T15:20:06.664Z        DEBUG   pubsub  PEERDOWN: Remove disconnected peer QmeGrr36KmgxoDVgUjkq9YJgHGDjgeCYAuqFKcvN6dPrhm
2021-04-16T15:20:06.664Z        DEBUG   pubsub  error reading rpc from QmeGrr36KmgxoDVgUjkq9YJgHGDjgeCYAuqFKcvN6dPrhm: stream reset
2021-04-16T15:20:07.039Z        DEBUG   pubsub  error reading rpc from QmYEoMVWU9JHuxLD71RdvWn6TAFVampSuHpUC1mSHcjocN: stream reset
2021-04-16T15:20:07.039Z        DEBUG   pubsub  PEERDOWN: Remove disconnected peer QmYEoMVWU9JHuxLD71RdvWn6TAFVampSuHpUC1mSHcjocN
2021-04-16T15:20:07.215Z        DEBUG   basichost       protocol negotiation took 418.879µs
2021-04-16T15:20:07.215Z        DEBUG   basichost       protocol negotiation took 105.297µs
2021-04-16T15:20:07.215Z        DEBUG   dht     peer found      {"peer": "QmbE6PuYGdgUXR1ZRTjwrPqxS9PNV9TQbSnUmJG1zSQn5F"}
2021-04-16T15:20:07.215Z        DEBUG   dht     handling message        {"from": "QmbE6PuYGdgUXR1ZRTjwrPqxS9PNV9TQbSnUmJG1zSQn5F", "type": 4, "key": "EiC/eehp6/X1TEI+187gMLcyvZakKyMob88CJrFSrf8rkg=="}

any update on this one. Seeing somewhat similar when running a private network on version v0.24.1. Help would be appreciated