V0.4.0 - Unable to pin large directories

From @sdockray on Tue Feb 02 2016 10:13:33 GMT+0000 (UTC)

I have been experimenting with adding a large directory on one node (origin) and then attempting to pin that directory on another node (destination). When I simply run ipfs pin add HASH on the destination, few, if any, hashes get pinned; CPU and memory load are extremely high, and ipfs eventually crashes.

As an alternative, I captured all of the hashes that came from the original add operation (one per line) and wrote this simple script to pin hashes one by one: https://github.com/sdockray/ipfs_pin_remote_hashes/blob/master/ipfs_pin_remote_hashes.sh
It succeeds for a few hundred hashes (about 4-7 GB). Along the way, a single hash will take several minutes, sometimes close to an hour; eventually I start seeing Error: pin: merkledag: not found and the daemon has silently gone away.
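
The linked script's one-hash-at-a-time approach can be sketched as follows. This is a minimal illustration, not the linked script itself: the function name pin_all and the hashes-file argument are my own, and it assumes a local ipfs daemon is running.

```shell
#!/bin/sh
# Sketch: pin hashes one at a time from a file with one hash per line.
# (Illustrative only; the real script is linked above.)
pin_all() {
    hashfile="$1"
    while IFS= read -r hash; do
        [ -z "$hash" ] && continue              # skip blank lines
        if ipfs pin add "$hash" >/dev/null 2>&1; then
            echo "pinned: $hash"
        else
            echo "failed: $hash" >&2            # log and keep going on failure
        fi
    done < "$hashfile"
}

# Run only when given a file argument, so the function can also be sourced.
if [ -n "${1:-}" ]; then
    pin_all "$1"
fi
```

Pinning serially like this avoids queueing one huge recursive pin, at the cost of many round-trips to the daemon.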

The origin is Debian Jessie. I am still collecting data about the destinations, but they are typically Debian as well, and all are running 0.4.0-dev. I don’t see a whole lot in the logs that is very helpful.


Copied from original issue: https://github.com/ipfs/support/issues/19

From @jbenet on Wed Feb 03 2016 00:13:34 GMT+0000 (UTC)

Sorry you’re having this problem, @sdockray. What are the specs of the machines? (The go runtime and our use of it is way more expensive than it should be.)

cc @whyrusleeping @diasdavid

From @sdockray on Wed Feb 03 2016 00:25:05 GMT+0000 (UTC)

I’m not sure what to put exactly, but the basics are below.

Origin:
Intel® Xeon® CPU E5-2609 0 @ 2.40GHz
MemTotal: 65953248 kB
MemFree: 30195284 kB
Debian GNU/Linux jessie/sid

Destination:
Intel® Xeon® CPU E5-2680 v3 @ 2.50GHz
MemTotal: 264049788 kB
MemFree: 36578076 kB
Debian GNU/Linux 8 (jessie)

From @jbenet on Wed Feb 03 2016 02:06:40 GMT+0000 (UTC)

The 65GB and 264GB of memory seem like virtual memory to me. I’m wondering how much RAM the machines actually have.

From @sdockray on Wed Feb 03 2016 02:24:51 GMT+0000 (UTC)

On the origin, dmidecode shows 4 slots like this one:

Handle 0x004B, DMI type 17, 34 bytes
Memory Device
Total Width: 72 bits
Data Width: 64 bits
Size: 16384 MB
Speed: 1066 MHz

also

sudo cat /var/log/dmesg | grep Memory
[ 0.000000] Memory: 65933388K/67079912K available (5693K kernel code, 1119K rwdata, 2812K rodata, 1304K init, 844K bss, 1146524K reserved, 0K cma-reserved)

On the destination I unfortunately don’t have root access, so the best I can do right now is:

free
             total       used       free     shared    buffers     cached
Mem:     264049788  227437292   36612496    4091940     834408  137124540
-/+ buffers/cache:   89478344  174571444
Swap:            0          0          0
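
As a sanity check on the free output above (my arithmetic, not from the thread): the "-/+ buffers/cache" free figure is simply free + buffers + cached, so roughly 174 GB of the 264 GB is actually available to applications.

```shell
# Recompute the "-/+ buffers/cache" free column from the raw kB values above.
free_kb=36612496
buffers_kb=834408
cached_kb=137124540
echo $((free_kb + buffers_kb + cached_kb))   # 174571444 kB, ~174 GB available
```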

From @lgierth on Wed Feb 24 2016 00:53:55 GMT+0000 (UTC)

There’s a PR under way which should improve pinning quite a bit: https://github.com/ipfs/go-ipfs/pull/2384

From @RichardLitt on Thu Mar 10 2016 18:39:52 GMT+0000 (UTC)

That PR has been merged. @sdockray, can you try with the new version and see if this is still a problem?

From @sdockray on Sat Mar 26 2016 10:22:18 GMT+0000 (UTC)

I have been using the distributed ipfs versions, so I did the go install (with gx, etc.) just now and tried again. Within a couple of minutes, before actually pinning anything, the pin crashes ipfs:

[x ~] ipfs pin add -r Qmaw9UT8gsgYGZVnFXgqD82x1moha4EPpngrq6vHi1tSq4
Error: Post http://127.0.0.1:5001/api/v0/pin/add?arg=Qmaw9UT8gsgYGZVnFXgqD82x1moha4EPpngrq6vHi1tSq4&encoding=json&r=true&stream-channels=true: EOF
[1]+ Exit 2 nohup ipfs daemon

I will keep trying with smaller (but still large) directories and see if there is some limit to the size or something.

From @diasdavid on Sat Mar 26 2016 15:43:08 GMT+0000 (UTC)

@sdockray does go-ipfs give any interesting error log? What is the size of the file we are talking about? And before the ipfs pin add -r, where did you add it to IPFS?

From @sdockray on Sun Mar 27 2016 03:14:29 GMT+0000 (UTC)

The error log from the crash is 56M, and I can’t see anything particularly noteworthy in it, but I’m also not sure what I’m looking for. At the beginning and end I didn’t see anything eye-catching. Here is a clip of the beginning and the end.

runtime/cgo: pthread_create failed: Resource temporarily unavailable
SIGABRT: abort
PC=0x7f557aee6067 m=2

goroutine 0 [idle]:

goroutine 1 [chan receive, 7 minutes]:
main.daemonFunc(0x7f55785ec0a8, 0xc8203760c0, 0x7f55785ec260, 0xc820388070)
        /pathToHome/work/src/github.com/ipfs/go-ipfs/cmd/ipfs/daemon.go:308 +0x1121
github.com/ipfs/go-ipfs/commands.(*Command).Call(0x152e420, 0x7f55785ec0a8, 0xc8203760c0, 0x0, 0x0)
        /pathToHome/work/src/github.com/ipfs/go-ipfs/commands/command.go:110 +0x471
main.callCommand(0x7f55785ec180, 0xc820144800, 0x7f55785ec0a8, 0xc8203760c0, 0x152e420, 0x15106a0, 0x0, 0x0, 0x0, 0x0)
        /pathToHome/work/src/github.com/ipfs/go-ipfs/cmd/ipfs/main.go:344 +0xa84
main.(*cmdInvocation).Run(0xc8201447c0, 0x7f55785ec180, 0xc820144800, 0x0, 0x0, 0x0, 0x0)
        /pathToHome/work/src/github.com/ipfs/go-ipfs/cmd/ipfs/main.go:187 +0x1e5
main.main()
        /pathToHome/work/src/github.com/ipfs/go-ipfs/cmd/ipfs/main.go:152 +0x65b

goroutine 17 [syscall, 7 minutes, locked to thread]:
runtime.goexit()
        /usr/local/go/src/runtime/asm_amd64.s:1721 +0x1

goroutine 6 [chan receive]:
github.com/ipfs/go-ipfs/Godeps/_workspace/src/github.com/codahale/metrics.init.1.func2()
        /pathToHome/work/src/github.com/ipfs/go-ipfs/Godeps/_workspace/src/github.com/codahale/metrics/metrics.go:321 +0x73
created by github.com/ipfs/go-ipfs/Godeps/_workspace/src/github.com/codahale/metrics.init.1
        /pathToHome/work/src/github.com/ipfs/go-ipfs/Godeps/_workspace/src/github.com/codahale/metrics/metrics.go:328 +0x7b

goroutine 7 [syscall, 7 minutes]:
os/signal.loop()
        /usr/local/go/src/os/signal/signal_unix.go:22 +0x18
created by os/signal.init.1
        /usr/local/go/src/os/signal/signal_unix.go:28 +0x37

goroutine 34 [select, 7 minutes]:
...

goroutine 2194930 [semacquire]:
sync.runtime_Semacquire(0xc82027933c)
        /usr/local/go/src/runtime/sema.go:43 +0x26
sync.(*Mutex).Lock(0xc820279338)
        /usr/local/go/src/sync/mutex.go:82 +0x1c4
github.com/ipfs/go-ipfs/exchange/bitswap.(*Bitswap).updateReceiveCounters(0xc8202792c0, 0xc8ac2dd200, 0x0, 0x0)
        /pathToHome/work/src/github.com/ipfs/go-ipfs/exchange/bitswap/bitswap.go:354 +0x52
github.com/ipfs/go-ipfs/exchange/bitswap.(*Bitswap).ReceiveMessage.func1(0xc8247ceb60, 0xc8202792c0, 0x7f55785ec000, 0xc82016a0d0, 0xc82307bf80, 0x22, 0xc8ac2dd200)
        /pathToHome/work/src/github.com/ipfs/go-ipfs/exchange/bitswap/bitswap.go:335 +0x6b
created by github.com/ipfs/go-ipfs/exchange/bitswap.(*Bitswap).ReceiveMessage
        /pathToHome/work/src/github.com/ipfs/go-ipfs/exchange/bitswap/bitswap.go:346 +0x485

rax    0x0
rbx    0x7f557b257708
rcx    0xffffffffffffffff
rdx    0x6
rdi    0x8409
rsi    0x840a
rbp    0x10d86a8
rsp    0x7f557ae6f948
r8     0xa
r9     0x7f557ae70700
r10    0x8
r11    0x206
r12    0x2c
r13    0x10cd860
r14    0x0
r15    0x8
rip    0x7f557aee6067
rflags 0x206
cs     0x33
fs     0x0
gs     0x0
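
The pthread_create failure at the top of this trace points at thread exhaustion rather than plain out-of-memory. A quick diagnostic sketch (my addition, not something run in this thread) for inspecting the relevant Linux limits:

```shell
# pthread_create returning "Resource temporarily unavailable" (EAGAIN) usually
# means a per-user or system-wide thread limit was hit, not that RAM ran out.
max_user_procs=$(ulimit -u)                          # per-user process/thread cap
threads_max=$(cat /proc/sys/kernel/threads-max 2>/dev/null || echo "n/a")
echo "ulimit -u:          $max_user_procs"
echo "kernel threads-max: $threads_max"
```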

and as for the size…

Size of the large directory that was added to ipfs (its hash is the one in the pin command above):

x% du -sh processed
385G processed

Count of files and directories:

x% find processed -type f | wc -l
52593
x% find processed -type d | wc -l
82933

I tried to pin some of the large subdirectories (about 1G) inside the big target directory, and they are recursively pinned just fine.

From @diasdavid on Mon Apr 25 2016 19:14:48 GMT+0000 (UTC)

Hi @sdockray, we’ve recently released go-ipfs 0.4.0 and with that a lot of perf and mem improvements came along. Could you check if you still have this issue with the latest version?

From @sdockray on Tue Apr 26 2016 06:14:02 GMT+0000 (UTC)

Update: I was able to finish adding the directory; the first pass got 99% of the contents added. Then I kept getting “blockservice is closed” until I stopped the ipfs daemon and tried again, after which it completed the whole directory. Moving on to testing the pinning now.

Hi @diasdavid - all of my comments on and after March 26 were with 0.4.0 (before that, they were with 0.4.0-dev). I tried (a week ago and again just now) to start over (re-adding the large directory before trying to re-pin it), and unfortunately under 0.4.0 I can no longer add the directory, as I consistently get the following error (at completely different stages of the process: 1%, 3%, 99%):

ERROR commands/h: blockservice is closed client.go:247

So unfortunately, I am stuck trying to recursively add this large directory (which never gave me problems in 0.4.0-dev) before I can test pinning it.

From @sdockray on Tue Apr 26 2016 13:04:24 GMT+0000 (UTC)

OK, the pin failed again on a freshly added directory (407G), running 0.4.0 everywhere.

ipfs pin add -r QmQr8LVFKWDMAR186oi6sEKhfVzoQPig3zcfLFWWkx9yrZ

outputs (after a few minutes):

Error: Post http://127.0.0.1:5001/api/v0/pin/add?arg=QmQr8LVFKWDMAR186oi6sEKhfVzoQPig3zcfLFWWkx9yrZ&encoding=json&r=true&stream-channels=true: EOF

The pin process and the daemon both exit. The rest of the error log is like the one above.

From @whyrusleeping on Tue Apr 26 2016 16:07:24 GMT+0000 (UTC)

@sdockray could I get the full output log from when it crashes? That would be very helpful.

Also, a few things I’d like to have you try:

  • Run the daemon with the env var IPFS_REUSEPORT set to false.
    • This is generally a source of excess thread resource consumption; it may help.
  • Also try: ipfs config --json Datastore.NoSync true
    • This disables a few fsync calls that our storage engine makes. If your system crashes during an add, you will have fewer guarantees that data entered into ipfs is still there (the hashes printed to your screen might not be fully synced to disk). On the other hand, it greatly speeds up the add process.
  • After starting the add, in another terminal, run: curl localhost:5001/debug/pprof/profile > ipfs.cpuprof
    • This will hang for 30 seconds collecting profiling data. If you can provide me with that file and a copy of the running ipfs binary, it will let me more easily investigate what it’s working on.
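
Taken together, the suggestions above amount to a session like the following sketch. The env var, config key, and pprof URL are taken from the comment itself; only the ordering of the terminal commands is my assumption.

```shell
# 1. Restart the daemon with SO_REUSEPORT disabled (reduces excess thread use).
IPFS_REUSEPORT=false ipfs daemon

# 2. (In another terminal) disable the storage engine's fsync calls: faster
#    adds, but fewer durability guarantees if the machine crashes mid-add.
ipfs config --json Datastore.NoSync true

# 3. While the add/pin is running, capture ~30 seconds of CPU profile data
#    from the daemon's debug endpoint.
curl localhost:5001/debug/pprof/profile > ipfs.cpuprof
```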

From @sdockray on Wed Apr 27 2016 13:44:53 GMT+0000 (UTC)

@whyrusleeping

the full log is here: https://hostb.org/AW
ipfs.cpuprof is here: https://hostb.org/E5
the ipfs binary: https://hostb.org/N1

I haven’t tried changing the Datastore.NoSync config item yet. I have set the IPFS_REUSEPORT env var.

From @whyrusleeping on Wed Apr 27 2016 16:10:33 GMT+0000 (UTC)

I’m not sure what the cause of your specific failure is, but it looks like there’s an issue in the bitswap provider connector. It’s basically not working fast enough, and that’s causing a buildup of extra goroutines all the way back up the line. I have an open PR ( https://github.com/ipfs/go-ipfs/issues/2437 ) that improves that codepath, so I will put effort into landing it sooner and report back.

From @spinda on Thu Jul 28 2016 23:10:55 GMT+0000 (UTC)

@sdockray

It looks like @whyrusleeping managed to land ipfs/go-ipfs#2437. Does it help with your issue? I’m also interested in pinning large directories.

From @sdockray on Thu Jul 28 2016 23:38:44 GMT+0000 (UTC)

@spinda - I tried with version 0.4.2 and got the same behavior, but maybe it is fixed in the 0.4.3-rc? I can’t see the change specifically cited in the changelog, but it does mention that it “improves the performance of Storage, Bitswap, as well as Content and Peer Routing”, so maybe?

In the next few days, I’ll try that version and report back

From @Kubuxu on Fri Jul 29 2016 00:02:32 GMT+0000 (UTC)

The major improvements are in 0.4.3-rc1.

From @sdockray on Fri Jul 29 2016 14:02:13 GMT+0000 (UTC)

Thanks for the info… will try to get it updated soon.

% ipfs update --verbose install v0.4.3-rc1
fetching go-ipfs version v0.4.3-rc1
  - using GOOS=linux and GOARCH=amd64
  - fetching "/ipns/dist.ipfs.io/go-ipfs/v0.4.3-rc1/go-ipfs_v0.4.3-rc1_linux-amd64.tar.gz"
  - using local ipfs daemon for transfer
  - writing to /tmp/ipfs-update776635258/go-ipfs_v0.4.3-rc1_linux-amd64.tar.gz
  - extracting binary to tempdir:  /tmp/ipfs-update032086215/ipfs-new
binary downloaded, verifying...
  - running init in '/dir/ipfs/update-staging/test495205521' with new binary
  - checking new binary outputs correct version
install failed, reverting changes...
ERROR: version didnt match