IPFS Cluster stops working when one of the nodes goes offline

Hello)

I’m working on deploying a private IPFS network with data replication among all participating nodes.

What I want to achieve is a private network (just 3 nodes for now, but in the future it could be thousands of nodes) with data replication. If for any reason one of the nodes goes offline, the rest of the network should continue working.

I have 3 VMs on the DigitalOcean cloud.

These are the specifications of the machines:

Ubuntu: 16.04
Go version: go1.10.3 linux/amd64
ipfs version 0.4.14
ipfs-cluster-service version 0.5.0
ipfs-cluster-ctl version 0.5.0

I created a private network and organized it into a cluster.
To create the cluster I used the method “Starting a single peer and bootstrapping the rest to it”.

This is what I get after running the ipfs-cluster-ctl peers ls command:

QmTyELZo8uYrRbVNdDSbMfDdQqmYVJLV2hDFUkx44gjzrR | ubuntu-s-1vcpu-2gb-fra1-01 | Sees 2 other peers

Addresses:
- /ip4/10.19.0.5/tcp/9096/ipfs/QmTyELZo8uYrRbVNdDSbMfDdQqmYVJLV2hDFUkx44gjzrR
- /ip4/104.248.38.67/tcp/9096/ipfs/QmTyELZo8uYrRbVNdDSbMfDdQqmYVJLV2hDFUkx44gjzrR
- /ip4/127.0.0.1/tcp/9096/ipfs/QmTyELZo8uYrRbVNdDSbMfDdQqmYVJLV2hDFUkx44gjzrR
IPFS: QmezCHeRwU9n9a8BJFhCxd9ayaNon49Rr5QJfF7zWePYWg
- /ip4/10.19.0.5/tcp/4001/ipfs/QmezCHeRwU9n9a8BJFhCxd9ayaNon49Rr5QJfF7zWePYWg
- /ip4/104.248.38.67/tcp/4001/ipfs/QmezCHeRwU9n9a8BJFhCxd9ayaNon49Rr5QJfF7zWePYWg
- /ip4/127.0.0.1/tcp/4001/ipfs/QmezCHeRwU9n9a8BJFhCxd9ayaNon49Rr5QJfF7zWePYWg
- /ip6/::1/tcp/4001/ipfs/QmezCHeRwU9n9a8BJFhCxd9ayaNon49Rr5QJfF7zWePYWg
QmYwZziUukgsZQtmG4BdVgVLzMRNEWDXJAm1gk6Wkksefz | ubuntu-s-1vcpu-2gb-ams3-02 | Sees 2 other peers
Addresses:
- /ip4/10.18.0.5/tcp/9096/ipfs/QmYwZziUukgsZQtmG4BdVgVLzMRNEWDXJAm1gk6Wkksefz
- /ip4/127.0.0.1/tcp/9096/ipfs/QmYwZziUukgsZQtmG4BdVgVLzMRNEWDXJAm1gk6Wkksefz
- /ip4/159.65.196.76/tcp/9096/ipfs/QmYwZziUukgsZQtmG4BdVgVLzMRNEWDXJAm1gk6Wkksefz
IPFS: QmRcQRsMnv7cpoeJXiZdr5bEf1A3ztTnAWGtaFgzQK5pEV
- /ip4/10.18.0.5/tcp/4001/ipfs/QmRcQRsMnv7cpoeJXiZdr5bEf1A3ztTnAWGtaFgzQK5pEV
- /ip4/127.0.0.1/tcp/4001/ipfs/QmRcQRsMnv7cpoeJXiZdr5bEf1A3ztTnAWGtaFgzQK5pEV
- /ip4/159.65.196.76/tcp/4001/ipfs/QmRcQRsMnv7cpoeJXiZdr5bEf1A3ztTnAWGtaFgzQK5pEV
- /ip6/::1/tcp/4001/ipfs/QmRcQRsMnv7cpoeJXiZdr5bEf1A3ztTnAWGtaFgzQK5pEV
QmZn1tUBPExALzSZqX7J5RKnn5TUcUxLaUjhG9FiD3JVn4 | ubuntu-s-1vcpu-2gb-lon1-03 | Sees 2 other peers
Addresses:
- /ip4/10.16.0.5/tcp/9096/ipfs/QmZn1tUBPExALzSZqX7J5RKnn5TUcUxLaUjhG9FiD3JVn4
- /ip4/127.0.0.1/tcp/9096/ipfs/QmZn1tUBPExALzSZqX7J5RKnn5TUcUxLaUjhG9FiD3JVn4
- /ip4/138.68.157.219/tcp/9096/ipfs/QmZn1tUBPExALzSZqX7J5RKnn5TUcUxLaUjhG9FiD3JVn4
IPFS: QmavdBknF5ReHAoABWW5ygpoFBJrYjkQ6kVKCiajSxMz4p
- /ip4/10.16.0.5/tcp/4001/ipfs/QmavdBknF5ReHAoABWW5ygpoFBJrYjkQ6kVKCiajSxMz4p
- /ip4/127.0.0.1/tcp/4001/ipfs/QmavdBknF5ReHAoABWW5ygpoFBJrYjkQ6kVKCiajSxMz4p
- /ip4/138.68.157.219/tcp/4001/ipfs/QmavdBknF5ReHAoABWW5ygpoFBJrYjkQ6kVKCiajSxMz4p
- /ip6/::1/tcp/4001/ipfs/QmavdBknF5ReHAoABWW5ygpoFBJrYjkQ6kVKCiajSxMz4p

I’m able to add a file from one of the nodes with the command “ipfs-cluster-ctl add file.txt”, and it is replicated to the rest of the nodes. Great!

After that I continued my tests. I shut down one node and expected the network to continue working, but it didn’t…

I added a file from one of the nodes (ipfs-cluster-ctl add file.txt), then I killed all ipfs processes on that machine with “sudo killall -9 ipfs-cluster-service ipfs”.

I was able to cat this file from my other two nodes (ipfs cat filehash), so the pin worked fine before I shut down this node.

But when I tried to add a new file from my remaining two nodes, it failed and I got errors.

ipfs-cluster-ctl add file.txt
routing: not found
 (500)

And this is what I can see in my IPFS Cluster daemon log.

This is from first node:

sudo systemctl status ipfs-cluster

● ipfs-cluster.service - IPFS-Cluster Daemon
   Loaded: loaded (/etc/systemd/system/ipfs-cluster.service; enabled; vendor preset: enabled)
   Active: active (running) since Thu 2018-09-20 10:42:02 UTC; 5h 21min ago
 Main PID: 1434 (ipfs-cluster-se)
    Tasks: 9
   Memory: 27.2M
      CPU: 6min 18.604s
   CGroup: /system.slice/ipfs-cluster.service
           └─1434 /root/gopath/bin/ipfs-cluster-service daemon

Sep 20 15:14:03 ubuntu-s-1vcpu-2gb-fra1-01 ipfs-cluster-service[1434]: 15:14:03.833 WARNI    cluster: Peer <peer.ID TyELZo> received alert for ping in QmZn1tUBPExALzSZqX7J5RKnn5TUcUxLaUjh
Sep 20 15:14:05 ubuntu-s-1vcpu-2gb-fra1-01 ipfs-cluster-service[1434]: 15:14:05.824 ERROR       raft: peer {Voter QmZn1tUBPExALzSZqX7J5RKnn5TUcUxLaUjhG9FiD3JVn4 QmZn1tUBPExALzSZqX7J5RKnn5
Sep 20 15:18:32 ubuntu-s-1vcpu-2gb-fra1-01 ipfs-cluster-service[1434]: 15:18:32.134  INFO   ipfshttp: IPFS Pin request succeeded:  QmcRxcQRHRu7may9vf7Mhzeki188mbfPGL4USv4JdxXnrt ipfshttp.
Sep 20 15:19:21 ubuntu-s-1vcpu-2gb-fra1-01 ipfs-cluster-service[1434]: 15:19:21.311 ERROR  p2p-gorpc: dial attempt failed: <peer.ID TyELZo> --> <peer.ID YwZziU> dial attempt failed: conte
Sep 20 15:25:31 ubuntu-s-1vcpu-2gb-fra1-01 ipfs-cluster-service[1434]: 15:25:31.738 ERROR  p2p-gorpc: dial attempt failed: <peer.ID TyELZo> --> <peer.ID YwZziU> dial attempt failed: conte
Sep 20 15:25:31 ubuntu-s-1vcpu-2gb-fra1-01 ipfs-cluster-service[1434]: 15:25:31.738 ERROR      adder: error adding to cluster:  dial attempt failed: <peer.ID TyELZo> --> <peer.ID YwZziU> 
Sep 20 15:25:31 ubuntu-s-1vcpu-2gb-fra1-01 ipfs-cluster-service[1434]:  adder.go:146
Sep 20 15:56:51 ubuntu-s-1vcpu-2gb-fra1-01 ipfs-cluster-service[1434]: 15:56:51.996 ERROR  p2p-gorpc: routing: not found call.go:63
Sep 20 15:56:51 ubuntu-s-1vcpu-2gb-fra1-01 ipfs-cluster-service[1434]: 15:56:51.996 ERROR      adder: error adding to cluster:  routing: not found
Sep 20 15:56:51 ubuntu-s-1vcpu-2gb-fra1-01 ipfs-cluster-service[1434]:  adder.go:146

And this is from second node:

sudo systemctl status ipfs-cluster

● ipfs-cluster.service - IPFS-Cluster Daemon
   Loaded: loaded (/etc/systemd/system/ipfs-cluster.service; enabled; vendor preset: enabled)
   Active: active (running) since Thu 2018-09-20 15:13:28 UTC; 51min ago
 Main PID: 1442 (ipfs-cluster-se)
    Tasks: 9
   Memory: 26.3M
      CPU: 44.607s
   CGroup: /system.slice/ipfs-cluster.service
           └─1442 /root/gopath/bin/ipfs-cluster-service daemon

Sep 20 16:03:55 ubuntu-s-1vcpu-2gb-lon1-03 ipfs-cluster-service[1442]: 16:03:55.914 WARNI    cluster: Peer <peer.ID Zn1tUB> received alert for ping in QmYwZziUukgsZQtmG4BdVgVLzMRNEWDXJAm1
Sep 20 16:03:55 ubuntu-s-1vcpu-2gb-lon1-03 ipfs-cluster-service[1442]: 16:03:55.915 WARNI    cluster: Peer <peer.ID Zn1tUB> received alert for freespace in QmYwZziUukgsZQtmG4BdVgVLzMRNEWD
Sep 20 16:04:10 ubuntu-s-1vcpu-2gb-lon1-03 ipfs-cluster-service[1442]: 16:04:10.914 WARNI    cluster: Peer <peer.ID Zn1tUB> received alert for ping in QmYwZziUukgsZQtmG4BdVgVLzMRNEWDXJAm1
Sep 20 16:04:10 ubuntu-s-1vcpu-2gb-lon1-03 ipfs-cluster-service[1442]: 16:04:10.915 WARNI    cluster: Peer <peer.ID Zn1tUB> received alert for freespace in QmYwZziUukgsZQtmG4BdVgVLzMRNEWD
Sep 20 16:04:25 ubuntu-s-1vcpu-2gb-lon1-03 ipfs-cluster-service[1442]: 16:04:25.914 WARNI    cluster: Peer <peer.ID Zn1tUB> received alert for ping in QmYwZziUukgsZQtmG4BdVgVLzMRNEWDXJAm1
Sep 20 16:04:25 ubuntu-s-1vcpu-2gb-lon1-03 ipfs-cluster-service[1442]: 16:04:25.915 WARNI    cluster: Peer <peer.ID Zn1tUB> received alert for freespace in QmYwZziUukgsZQtmG4BdVgVLzMRNEWD
Sep 20 16:04:40 ubuntu-s-1vcpu-2gb-lon1-03 ipfs-cluster-service[1442]: 16:04:40.381 ERROR       raft: NOTICE: Some RAFT log messages repeat and will only be logged once logging.go:105
Sep 20 16:04:40 ubuntu-s-1vcpu-2gb-lon1-03 ipfs-cluster-service[1442]: 16:04:40.382 ERROR       raft: Failed to AppendEntries to {Voter QmYwZziUukgsZQtmG4BdVgVLzMRNEWDXJAm1gk6Wkksefz QmYw
Sep 20 16:04:40 ubuntu-s-1vcpu-2gb-lon1-03 ipfs-cluster-service[1442]: 16:04:40.914 WARNI    cluster: Peer <peer.ID Zn1tUB> received alert for ping in QmYwZziUukgsZQtmG4BdVgVLzMRNEWDXJAm1
Sep 20 16:04:40 ubuntu-s-1vcpu-2gb-lon1-03 ipfs-cluster-service[1442]: 16:04:40.915 WARNI    cluster: Peer <peer.ID Zn1tUB> received alert for freespace in QmYwZziUukgsZQtmG4BdVgVLzMRNEWD

And this is what I see when I run “ipfs-cluster-ctl peers ls” on both remaining nodes:

QmTyELZo8uYrRbVNdDSbMfDdQqmYVJLV2hDFUkx44gjzrR | ubuntu-s-1vcpu-2gb-fra1-01 | Sees 2 other peers
  > Addresses:
    - /ip4/10.19.0.5/tcp/9096/ipfs/QmTyELZo8uYrRbVNdDSbMfDdQqmYVJLV2hDFUkx44gjzrR
    - /ip4/104.248.38.67/tcp/9096/ipfs/QmTyELZo8uYrRbVNdDSbMfDdQqmYVJLV2hDFUkx44gjzrR
    - /ip4/127.0.0.1/tcp/9096/ipfs/QmTyELZo8uYrRbVNdDSbMfDdQqmYVJLV2hDFUkx44gjzrR
  > IPFS: QmezCHeRwU9n9a8BJFhCxd9ayaNon49Rr5QJfF7zWePYWg
    - /ip4/10.19.0.5/tcp/4001/ipfs/QmezCHeRwU9n9a8BJFhCxd9ayaNon49Rr5QJfF7zWePYWg
    - /ip4/104.248.38.67/tcp/4001/ipfs/QmezCHeRwU9n9a8BJFhCxd9ayaNon49Rr5QJfF7zWePYWg
    - /ip4/127.0.0.1/tcp/4001/ipfs/QmezCHeRwU9n9a8BJFhCxd9ayaNon49Rr5QJfF7zWePYWg
    - /ip6/::1/tcp/4001/ipfs/QmezCHeRwU9n9a8BJFhCxd9ayaNon49Rr5QJfF7zWePYWg
QmYwZziUukgsZQtmG4BdVgVLzMRNEWDXJAm1gk6Wkksefz | ERROR: routing: not found
QmZn1tUBPExALzSZqX7J5RKnn5TUcUxLaUjhG9FiD3JVn4 | ubuntu-s-1vcpu-2gb-lon1-03 | Sees 2 other peers
  > Addresses:
    - /ip4/10.16.0.5/tcp/9096/ipfs/QmZn1tUBPExALzSZqX7J5RKnn5TUcUxLaUjhG9FiD3JVn4
    - /ip4/127.0.0.1/tcp/9096/ipfs/QmZn1tUBPExALzSZqX7J5RKnn5TUcUxLaUjhG9FiD3JVn4
    - /ip4/138.68.157.219/tcp/9096/ipfs/QmZn1tUBPExALzSZqX7J5RKnn5TUcUxLaUjhG9FiD3JVn4
  > IPFS: QmavdBknF5ReHAoABWW5ygpoFBJrYjkQ6kVKCiajSxMz4p
    - /ip4/10.16.0.5/tcp/4001/ipfs/QmavdBknF5ReHAoABWW5ygpoFBJrYjkQ6kVKCiajSxMz4p
    - /ip4/127.0.0.1/tcp/4001/ipfs/QmavdBknF5ReHAoABWW5ygpoFBJrYjkQ6kVKCiajSxMz4p
    - /ip4/138.68.157.219/tcp/4001/ipfs/QmavdBknF5ReHAoABWW5ygpoFBJrYjkQ6kVKCiajSxMz4p
    - /ip6/::1/tcp/4001/ipfs/QmavdBknF5ReHAoABWW5ygpoFBJrYjkQ6kVKCiajSxMz4p

So my question is: what am I doing wrong, and why doesn’t it work when I disable one node of the private network/cluster?


Hello,

First of all, thank you for the detailed report. I am really happy that you took the time to provide this much information.

I’m going to need more time to look at this in detail:

  • It’s a new one I hadn’t seen before
  • Error is related to DHT-peer not finding a route to the turned-off peer (makes sense because it’s offline)

First questions that come to mind:

  • What’s your replication factor?
  • Is there a chance that, after you turned off the peer, you did not give cluster enough time for the disk metric to expire? This can be controlled with metric_ttl (https://cluster.ipfs.io/documentation/configuration/#disk). If cluster thinks the last disk space metric is still valid, it may still try to send stuff to that peer. Since ipfs 0.4.17 it should be ok to set a low ttl.

I see you’re using an older version of ipfs. In that case, setting the ttl low only causes problems when you have a huge ipfs repository with the leveldb datastore, because the repo stat call is very slow.
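To make the metric_ttl point concrete, here is a minimal sketch of the idea. It is not the actual ipfs-cluster code: `Metric`, `valid_peers` and `snapshot` are hypothetical names, and the timestamps are made up. It only shows how a peer that was stopped can still look like a valid target until its last metric expires.

```python
from dataclasses import dataclass


@dataclass
class Metric:
    peer: str           # peer that reported the metric
    received_at: float  # when the metric arrived, in seconds
    ttl: float          # validity window, in seconds


def valid_peers(metrics, now):
    """Peers whose most recent metric has not yet expired at `now`."""
    return sorted(m.peer for m in metrics if now < m.received_at + m.ttl)


def snapshot(last_seen, ttl=30.0):
    """Build one metric per peer from a {peer: received_at} mapping."""
    return [Metric(peer, seen, ttl) for peer, seen in last_seen.items()]


# Hypothetical timeline: peer C is stopped right after reporting at t=0,
# while A and B keep reporting.
early = snapshot({"A": 15.0, "B": 15.0, "C": 0.0})
late = snapshot({"A": 40.0, "B": 40.0, "C": 0.0})

print(valid_peers(early, now=20.0))  # C's metric is still within its TTL
print(valid_peers(late, now=45.0))   # C's metric has expired
```

With a 30s TTL, anything added in the first 30 seconds after the peer went down could, under this simplified model, still be directed at it.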

Hi @hector,

Thank you for reply!

Error is related to DHT-peer not finding a route to the turned-off peer (makes sense because it's offline)

Yes, it makes sense that the disabled node is unreachable and the heartbeat fails. But I assumed the network should continue working if one or several of the nodes go offline.
I also want to add that my IPFS daemon is still working fine on the remaining nodes, and I’m able to add files to and get files from the network.

What's your replication factor?

If I understand your question correctly: having a private network with data replication is just a business requirement of my project.
Please let me know if you meant something technical here.

Regarding metric_ttl

I checked my metric_ttl settings in the /.ipfs-cluster/service.json and it has 30ms value.
So I tested again, and after disabling one node I waited about 90 seconds before trying to add a file from another node.
In this test I got a somewhat different error on the first attempt to add a file, but on the second attempt I got the same error as before:

ipfs-cluster-ctl add file1.txt
dial attempt failed: <peer.ID TyELZo> --> <peer.ID Zn1tUB> dial attempt failed: context deadline exceeded (500)

ipfs-cluster-ctl add file1.txt
routing: not found (500)

I just want to figure out whether this is really possible with IPFS Cluster from a technical perspective. Or do I need to monitor all my nodes and manually remove dead ones from the cluster so that the network continues to work?

I mean, replication_factor_max and replication_factor_min in your config. I’m guessing it’s -1. Yeah I think we have a bug here. Thanks for finding it! I’ve opened an issue: https://github.com/ipfs/ipfs-cluster/issues/543 (let’s follow up there).

I checked my metric_ttl settings in the /.ipfs-cluster/service.json and it has 30ms value.

30ms? I hope you mean seconds. Milliseconds would be way too low.

Yes, replication_factor is -1

"cluster": {
    "id": "QmTyELZo6uYrRcVNdDSbMfDdQqmYVJLV2hDFUkx73gjzrR",
    "peername": "ubuntu-s-1vcpu-2gb-fra1-01",
    "private_key": "some key",
    "secret": "somesecret",
    "leave_on_shutdown": false,
    "listen_multiaddress": "/ip4/0.0.0.0/tcp/9096",
    "state_sync_interval": "10m0s",
    "ipfs_sync_interval": "2m10s",
    "replication_factor_min": -1,
    "replication_factor_max": -1,
    "monitor_ping_interval": "15s",
    "peer_watch_interval": "5s",
    "disable_repinning": false
  },

Does this mean that my approach will work once the bug is fixed?

I apologize for this typo))
yes, it’s 30s not 30ms

"informer": {
    "disk": {
      "metric_ttl": "30s",
      "metric_type": "freespace"
    },
    "numpin": {
      "metric_ttl": "10s"
    }
  }
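Since the 30ms/30s mix-up is easy to make, here is a small sketch for sanity-checking the metric_ttl values. It assumes the values are Go-style duration strings such as "30s" or "2m10s", as in the config above. The `duration_seconds` helper and the 1-second threshold are my own assumptions for illustration, not something ipfs-cluster provides.

```python
import json
import re

# Seconds per unit for Go-style duration strings.
_UNITS = {"ns": 1e-9, "us": 1e-6, "µs": 1e-6, "ms": 1e-3,
          "s": 1.0, "m": 60.0, "h": 3600.0}
_PART = re.compile(r"(\d+(?:\.\d+)?)(ns|us|µs|ms|s|m|h)")


def duration_seconds(text):
    """Convert a duration such as '30s' or '2m10s' to seconds."""
    pos, total = 0, 0.0
    for match in _PART.finditer(text):
        if match.start() != pos:  # reject gaps or unknown characters
            raise ValueError(f"invalid duration: {text!r}")
        total += float(match.group(1)) * _UNITS[match.group(2)]
        pos = match.end()
    if pos == 0 or pos != len(text):
        raise ValueError(f"invalid duration: {text!r}")
    return total


# The informer section from above, wrapped in braces to make it valid JSON.
config = json.loads("""
{"informer": {"disk": {"metric_ttl": "30s", "metric_type": "freespace"},
              "numpin": {"metric_ttl": "10s"}}}
""")

MIN_TTL = 1.0  # assumed threshold: anything below one second looks like a typo
for name, section in config["informer"].items():
    ttl = duration_seconds(section["metric_ttl"])
    print(name, ttl, "ok" if ttl >= MIN_TTL else "suspiciously low")
```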

Yes! Once this is fixed you’ll be able to add even when a peer is down.

Awesome! I will follow this issue.

I assume I’ll need to reinstall or update my ipfs-cluster package to get the latest changes?

Thanks!

Yes, you’ll have to install the new release or build the version with the bug fix.

Hi @hector!

On the opened issue I asked when this bug is expected to be fixed, but no one answered.

So I’ll address the question directly to you :) Sorry for the duplication.

Could you please tell me when you are planning to fix this bug and what I need to do to apply the update?

If I’m not mistaken, the current version of IPFS is v0.4.17 and ipfs-cluster-service is at version 0.5.0.
Does this mean I have to wait for the next release to get this update in my IPFS package?

Hello. We will fix it for the next release. Hopefully this will come out in about two weeks.


ipfs: 0.4.17
ipfs-cluster: 0.5.0

With a two-node cluster, if one node is down and the other is still up and you attempt an “ipfs-cluster-ctl pin add”, you receive an error:

An error occurred:
  Code: 500
  Message: timed out waiting for leader: context deadline exceeded

We run into this sometimes with cluster nodes that lose power for parts of the day. The whole cluster fails and no one can add to the part of the cluster that is still up. We were hoping to get this fixed so that when the cluster nodes get power again they will rejoin the cluster and download the added content.
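For context on the “timed out waiting for leader” message: the logs earlier in this thread show the cluster using Raft, and Raft in general needs a majority of voters online to elect a leader. The sketch below is only that general arithmetic, with hypothetical function names. It is not taken from the ipfs-cluster code and is not a diagnosis of this particular error.

```python
def quorum(voters):
    """Smallest majority among `voters` Raft voters."""
    return voters // 2 + 1


def can_elect_leader(voters, online):
    """True if the online voters form a majority."""
    return online >= quorum(voters)


for voters, online in [(2, 1), (3, 2), (3, 1), (5, 3)]:
    print(f"{voters} voters, {online} online: "
          f"leader possible = {can_elect_leader(voters, online)}")
```

Under this arithmetic a two-node cluster with one node down has no majority, while a three-node cluster with one node down still does.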

Hello, it is fixed already, and will be part of the next release.

Hi @hector,

I have a question but I’m not sure if I should ask it in this topic or create a new one.

One of my IPFS nodes is deployed on an Alibaba Cloud server in the China region, and the other two nodes are on AWS.

It looks like the Great Firewall of China is blocking your domain/IP https://dist.ipfs.io

git clone https://github.com/ipfs/ipfs-cluster.git $GOPATH/src/github.com/ipfs/ipfs-cluster
cd $GOPATH/src/github.com/ipfs/ipfs-cluster
make install

The make install command can’t finish during the ipfs-cluster installation because it can’t download all the dependencies/packages.

Do you have a mirror server, or any idea how I can overcome this issue?

P.S.
I installed ipfs-cluster version 0.6 on the AWS VMs without any problem.

Thanks

Hey @michael-maverick, start your IPFS peer and edit the Makefile to replace any dist.ipfs.io strings with localhost:8080/ipns/dist.ipfs.io before you run make install. This should probably get around the block.

I think we should do better here and not depend on dist.ipfs.io directly, so if you can open an issue in the repository about it, that would be great.
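A rough sketch of that replacement, in case doing it by hand is error-prone. Assumptions on my side: the local gateway listens on localhost:8080, any https:// scheme in front of dist.ipfs.io should become http:// (a local gateway normally does not serve TLS), and the sample Makefile line is made up for illustration. `use_local_gateway` is a hypothetical helper, not part of any IPFS tooling.

```python
import re

GATEWAY = "localhost:8080"  # local IPFS gateway, as suggested above

# Match dist.ipfs.io with an optional scheme, unless it is already
# behind an /ipns/ path (so running the rewrite twice is harmless).
_DIST = re.compile(r"(https?://)?(?<!/ipns/)dist\.ipfs\.io")


def _swap(match):
    # Assumption: downgrade any scheme to plain http for the local gateway.
    scheme = "http://" if match.group(1) else ""
    return f"{scheme}{GATEWAY}/ipns/dist.ipfs.io"


def use_local_gateway(makefile_text):
    """Point every dist.ipfs.io reference at the local gateway."""
    return _DIST.sub(_swap, makefile_text)


# Hypothetical Makefile line, for illustration only.
sample = "TOOL_URL = https://dist.ipfs.io/some-tool/some-version/archive.tar.gz"
print(use_local_gateway(sample))
```

To apply it to the real file you would read the Makefile, pass its text through `use_local_gateway`, and write the result back, keeping a backup of the original.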

Alternatively you can also build and install following these steps (which don’t use the Makefile):

https://cluster.ipfs.io/documentation/download/#windows-and-manual-installation

Thank you @hector for your advice on how to get around the Chinese firewall; unfortunately both methods didn’t work for me, but I was able to install ipfs-cluster on the Alibaba server through snap. Anyway, I created an issue on GitHub regarding this problem.

IPFS Cluster is installed, but I have a problem running the ipfs-cluster-service daemon… you can see the related errors in the screenshot.

I’ve updated/reinstalled my ipfs-cluster-service.

Now I have version: 0.6.0+gitf65349e9c86f2e8aeb34067df7ca3a204d1a0d9b

ERROR ipfshttp: error posting to IPFS: Post http://127.0.0.1:5001/api/v0/repo/stat?size-only=true: dial tcp 127.0.0.1:5001: connect: connection refused ipfshttp.go:708

I have three VMs organized in a private IPFS network, with the ipfs daemon running on all nodes.

This is what I have in my ipfs config:

  "Addresses": {
    "API": "/ip4/139.59.162.83/tcp/5001",
    "Announce": [],
    "Gateway": "/ip4/139.59.162.83/tcp/8080",
    "NoAnnounce": [],
    "Swarm": [
      "/ip4/0.0.0.0/tcp/4001",
      "/ip6/::/tcp/4001"
    ]
  }

I would be grateful for any advice on solving this problem)

unfortunately both methods didn’t work for me,

Why is that? Is the ipfs network blocked too?

The error message says that 127.0.0.1:5001 refuses the connection. That means ipfs is not running there. I am not sure why you have "API": "/ip4/139.59.162.83/tcp/5001" instead of /ip4/127.0.0.1/tcp/5001. That is the local listening address for the daemon; normally it’s either 127.0.0.1 or 0.0.0.0.
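To spell out the mismatch, here is a small sketch that compares the API address from the ipfs config with the address cluster is trying to reach in the error message. The helper names are hypothetical, and the parser only handles the simple /ip4/<host>/tcp/<port> form used in this thread.

```python
def api_endpoint(multiaddr):
    """Split a simple /ip4/<host>/tcp/<port> multiaddr into (host, port)."""
    parts = multiaddr.strip("/").split("/")
    if len(parts) != 4 or parts[0] != "ip4" or parts[2] != "tcp":
        raise ValueError(f"unsupported multiaddr: {multiaddr}")
    return parts[1], int(parts[3])


def reachable_via_loopback(host):
    """True if a listener bound to `host` accepts connections to 127.0.0.1."""
    return host in ("127.0.0.1", "0.0.0.0")


# Address from the ipfs config, and the one taken from the error message.
configured_host, configured_port = api_endpoint("/ip4/139.59.162.83/tcp/5001")
wanted_host, wanted_port = "127.0.0.1", 5001

print("same port:", configured_port == wanted_port)
print("reachable on loopback:", reachable_via_loopback(configured_host))
```

The ports match, but an API bound only to the public address is not reachable at 127.0.0.1, which is consistent with the “connection refused” error.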