Testground: my journey

Hi,

I will be documenting my journey into Testground. I plan to write tests in Rust for my app.

The first step was to install Docker, which I did in rootless mode (more on that later). Then I installed Go.

Everything worked fine until I tried testground daemon: the command was not recognised. I asked on Discord and @Jorropo helped me.

When installing Go, they don’t tell you to add its bin directory to your PATH in .bashrc:

export PATH="$HOME/go/bin:$PATH"

Easy fix!

The next step was to run a test plan.

Can’t connect to Docker: the Docker host endpoint is wrong. Oh no!
Not being very familiar with Docker, I thought re-installing it without rootless mode would work, but it didn’t. You need rootless mode to use the Docker extension in VS Code anyway.

Back to .bashrc!

export DOCKER_HOST=unix:///run/user/1000/docker.sock

That did the trick, since the rootless socket isn’t the default endpoint.

I was finally able to run the example, but some containers errored:

Jan 10 22:14:28.645340  INFO    container isn't running; starting       {"container_name": "testground-redis"}
Jan 10 22:14:29.149892  ERROR   starting container failed       {"container_name": "testground-redis", "container_name": "testground-redis", "error": "Error response from daemon: OCI runtime create failed: container_linux.go:380: starting container process caused: process_linux.go:545: container init caused: write sysctl key net.core.somaxconn: open /proc/sys/net/core/somaxconn: no such file or directory: unknown"}
Jan 10 22:14:29.153360  INFO    container isn't running; starting       {"container_name": "testground-sync-service"}
Jan 10 22:14:29.656777  ERROR   starting container failed       {"container_name": "testground-sync-service", "container_name": "testground-sync-service", "error": "Error response from daemon: OCI runtime create failed: container_linux.go:380: starting container process caused: process_linux.go:545: container init caused: write sysctl key net.core.somaxconn: open /proc/sys/net/core/somaxconn: no such file or directory: unknown"}
Jan 10 22:14:29.659622  INFO    container isn't running; starting       {"container_name": "testground-influxdb"}
Jan 10 22:14:30.002163  INFO    container isn't running; starting       {"container_name": "testground-sidecar"}
Jan 10 22:14:30.387401  ERROR   starting container failed       {"container_name": "testground-sidecar", "container_name": "testground-sidecar", "error": "Error response from daemon: OCI runtime create failed: container_linux.go:380: starting container process caused: process_linux.go:545: container init caused: rootfs_linux.go:76: mounting \"proc\" to rootfs at \"/proc\" caused: mount through procfd: operation not permitted: unknown"}
Jan 10 22:14:30.387479  ERROR   doRun returned err      {"err": "healthcheck fixes failed; aborting:\nChecks:\n- local-outputs-dir: ok; directory exists.\n- control-network: ok; network exists.\n- local-grafana: failed; container state: exited\n- local-redis: failed; container state: created\n- local-sync-service: failed; container state: created\n- local-influxdb: failed; container state: exited\n- sidecar-container: failed; container state: created\nFixes:\n- local-outputs-dir: unnecessary; \n- control-network: unnecessary; \n- local-grafana: ok; container created.\n- local-redis: failed; failed to start container.\n- local-sync-service: failed; failed to start container.\n- local-influxdb: ok; container created.\n- sidecar-container: failed; failed to start container.\n"}

Tomorrow I’ll see whether I start writing my own test plans or try to fix this mess.

I investigated the errors and they were related to unprivileged containers. So I decided to re-install Docker again… I did my best to remove any trace of it beforehand (removing DOCKER_HOST from my .bashrc too).

Afterwards, I tried to run the network example test plan and got new errors: I had forgotten to run make install again after reinstalling Docker and was missing some containers.

Then it WORKED! All containers were online and the test ran to completion.

But why?

The containers must run under the default Docker context, not the rootless one; otherwise they don’t have the necessary privileges.

Yesterday I ran the first test that I wrote myself!

I had to build my own synchronisation and figure out which environment variables are used.

The docs are useful but they do not list the variables, so I had to log them. Then I built a client for the Redis DB. It took me way too long to figure out how to build a “barrier”; I was overthinking it. A timer checking a number to see if it’s high enough, that’s it.
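To make the “timer checking a number” idea concrete, here is a minimal sketch of a polling barrier. It is not the code from my test plan: the shared counter is an in-memory stand-in for the Redis key (the real key name and client calls are not shown here), and `Counter` and `wait_for_barrier` are hypothetical names chosen for illustration.

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::thread;
use std::time::{Duration, Instant};

/// In-memory stand-in for the shared counter.
/// Assumption: in a real test plan this would be a key in the Redis DB.
pub struct Counter(AtomicU64);

impl Counter {
    pub fn new() -> Self {
        Counter(AtomicU64::new(0))
    }

    /// Signal that this instance reached the barrier; returns the new count.
    pub fn signal(&self) -> u64 {
        self.0.fetch_add(1, Ordering::SeqCst) + 1
    }

    /// Read the current count.
    pub fn get(&self) -> u64 {
        self.0.load(Ordering::SeqCst)
    }
}

/// Poll the counter every `poll` until it reaches `target`.
/// Returns true if the target was reached, false if `timeout` elapsed first.
pub fn wait_for_barrier(
    counter: &Counter,
    target: u64,
    poll: Duration,
    timeout: Duration,
) -> bool {
    let deadline = Instant::now() + timeout;
    loop {
        if counter.get() >= target {
            return true;
        }
        if Instant::now() >= deadline {
            return false;
        }
        thread::sleep(poll);
    }
}
```

Each instance calls `signal()` once and then `wait_for_barrier` with the number of instances as the target. The timeout is there so that a crashed instance does not hang everyone else forever.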

The docs are not made for languages other than Go; I wish they were more agnostic.

I’m now trying to find out what kind of syncing is done by the sync service itself. It’s not documented… They assume you’re using the go-sdk.

Yesterday I was able to list the IP addresses of all the containers and sync them using pubsub and barriers.

50 containers all syncing up is a beautiful thing!

I still can’t figure out how to wait for the “network-initialized” state. The topic and the event probably need to be formatted exactly right for it to work. Again, the docs assume you’re using Go and they don’t even mention how it works.

I had problems compiling static binaries (for tiny containers), so I used debian:buster-slim (only 8x bigger :stuck_out_tongue: ). Works fine, 80 MB is manageable.

I still can’t figure out that “network-initialized” state, so I just YOLOed and I got libp2p to ping other containers!

You need far fewer things than buster-slim.
The only things you need are:

$ ldd $(which ipfs) | awk '{$1=$1};1'
linux-vdso.so.1 (0x00007fff837fd000)
libpthread.so.0 => /lib/x86_64-linux-gnu/libpthread.so.0 (0x00007f00f3234000)
libdl.so.2 => /lib/x86_64-linux-gnu/libdl.so.2 (0x00007f00f322e000)
libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007f00f303c000)
/lib64/ld-linux-x86-64.so.2 (0x00007f00f3288000)

If you are feeling adventurous, you can make your own container with only those things. It’s going to be 4 MiB or so and it will work (you don’t actually need a valid Linux install; that’s all Go uses, if you don’t use plugins or cgo).

You can also use a reflink-capable FS, like btrfs or ZFS (I don’t actually know if Docker supports btrfs; I guess so, but I haven’t tested it).
Then Docker will use reflinks to create your containers, meaning they will be copy-on-write backed, and all files they share will be stored only once and shared between all of them.

Thanks, I’ll keep this in mind when I want to scale up. BTW, I’m not using Go; it’s all in Rust.

ldd /path/to/your/bin

That will tell you what you really need :slight_smile: (in most languages / things that don’t use after-link-time relocations)

I don’t know if Rust supports it, but you can also build a fully static binary,
which should just work without installing anything else (it basically only requires a kernel, which your host provides). (Now that I think about it, you might still need the linker, not sure.)

That’s what I was doing, but I had problems compiling some Rust crates (libs). I also read that the compile target had multi-threaded performance issues, so I switched to buster-slim to focus on the simulation.

If I end up having problems with container size I’ll try what you said. Thank you.

I’m having problems again… Below you can see the Kademlia routing update, then the dial.

Jan 24 23:53:05.394241  INFO    2.3925s      OTHER << single[001] (3e34ea) >> Behaviour(Kademlia(RoutingUpdated { peer: PeerId("12D3KooWA6jWHDmDn8yRtdKZrKwqQhS5Xp7CwkEmSicmGQhTe35y"), is_new_peer: true, addresses: ["/ip4/16.0.0.4/tcp/8080"], bucket_range: (Distance(28948022309329048855892746252171976963317496166410141009864396001978282409984), Distance(57896044618658097711785492504343953926634992332820282019728792003956564819967)), old_peer: None }))
Jan 24 23:53:05.394859  INFO    2.3932s      OTHER << single[001] (3e34ea) >> Dialing(PeerId("12D3KooWA6jWHDmDn8yRtdKZrKwqQhS5Xp7CwkEmSicmGQhTe35y"))
Jan 24 23:53:05.395235  INFO    2.3936s      OTHER << single[001] (3e34ea) >> Behaviour(Kademlia(RoutingUpdated { peer: PeerId("12D3KooWC2YdgQr1L2dbKo9tTKsrNTYDspbHauxvadKFsyiEH4jM"), is_new_peer: true, addresses: ["/ip4/16.0.0.5/tcp/8080"], bucket_range: (Distance(14474011154664524427946373126085988481658748083205070504932198000989141204992), Distance(28948022309329048855892746252171976963317496166410141009864396001978282409983)), old_peer: None }))
Jan 24 23:53:05.395404  INFO    2.3938s      OTHER << single[001] (3e34ea) >> Dialing(PeerId("12D3KooWC2YdgQr1L2dbKo9tTKsrNTYDspbHauxvadKFsyiEH4jM"))

But then:

Jan 24 23:53:05.396327  INFO    2.3937s      ERROR << single[001] (3e34ea) >> thread 'libp2p-swarm-task-0' panicked at 'there is no reactor running, must be called from the context of a Tokio 1.x runtime', /usr/local/cargo/registry/src/github.com-1ecc6299db9ec823/tokio-1.15.0/src/runtime/context.rs:29:26

Could it be a GossipSub error? Because Kad keeps trucking along.

Jan 24 23:53:28.376025  INFO    25.3744s      OTHER << single[001] (3e34ea) >> Behaviour(Kademlia(OutboundQueryCompleted { id: QueryId(0), result: Bootstrap(Ok(BootstrapOk { peer: PeerId("12D3KooWGqR7SUexTM8utANUAaQaWu4R7AVX4uG9UMHm6n6px8B7"), num_remaining: 3 })), stats: QueryStats { requests: 4, success: 0, failure: 0, start: Some(Instant { tv_sec: 37471, tv_nsec: 91839935 }), end: Some(Instant { tv_sec: 37493, tv_nsec: 71778801 }) } }))
Jan 24 23:53:48.379885  INFO    45.3782s      OTHER << single[001] (3e34ea) >> Behaviour(Kademlia(OutboundQueryCompleted { id: QueryId(0), result: Bootstrap(Ok(BootstrapOk { peer: PeerId("1AX8Kr7uxshK1iDqwnB3j6nR6TaEGZeUtoNUAkEUpvb6Tj"), num_remaining: 2 })), stats: QueryStats { requests: 4, success: 0, failure: 0, start: Some(Instant { tv_sec: 37493, tv_nsec: 72011763 }), end: Some(Instant { tv_sec: 37513, tv_nsec: 75516712 }) } }))
Jan 24 23:54:09.384235  INFO    66.3826s      OTHER << single[001] (3e34ea) >> Behaviour(Kademlia(OutboundQueryCompleted { id: QueryId(0), result: Bootstrap(Ok(BootstrapOk { peer: PeerId("1AjPcL2Wbrj3MnUwcpzc4nxAponDG7toHJb1eFNrWBdiee"), num_remaining: 1 })), stats: QueryStats { requests: 4, success: 0, failure: 0, start: Some(Instant { tv_sec: 37513, tv_nsec: 75630414 }), end: Some(Instant { tv_sec: 37534, tv_nsec: 79939442 }) } }))
Jan 24 23:54:30.388457  INFO    87.3868s      OTHER << single[001] (3e34ea) >> Behaviour(Kademlia(OutboundQueryCompleted { id: QueryId(0), result: Bootstrap(Ok(BootstrapOk { peer: PeerId("1AguzNTx7ZtDv93JWscChCvw9dRyhQxV6Ft6WTofKu9N3T"), num_remaining: 0 })), stats: QueryStats { requests: 4, success: 0, failure: 0, start: Some(Instant { tv_sec: 37534, tv_nsec: 80035713 }), end: Some(Instant { tv_sec: 37555, tv_nsec: 84259818 }) } }))

And finally, when I try to use GossipSub to publish:

Jan 24 23:53:40.865840  INFO    37.8642s      ERROR << single[003] (04b65c) >> InsufficientPeers

Not sure what to do. I’ll look into it tomorrow.

I was able to fix the problem. When building the swarm you have to pass it an executor, in this case Tokio’s.

See here → tokio-based tcp transports panic when dialing due to threadpools with unset runtimes. · Issue #2230 · libp2p/rust-libp2p · GitHub
And here → Re-design feature sets · Issue #2173 · libp2p/rust-libp2p · GitHub

I also noticed that I had forgotten to tell the swarm to listen on the local IP :stuck_out_tongue:

My simulation then worked! Bootstrapping a DHT then sending messages via GossipSub!

Today I made a block store and hooked Bitswap to it.

I did a test: DHT bootstrapping, one GossipSub publisher with the others as subscribers, messages containing block CIDs sent, then the blocks fetched via Bitswap.

I’m amazed at how well it works, although there are no difficult network conditions set yet.

Still trying to think about scenarios for my test.

I added “roles”: one streamer, 25% neutral nodes, and the rest are viewers who join later. I had to re-organize the code a bit but it works well.
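For illustration, the role split could be sketched like this. It is a minimal sketch, not my actual code: it assumes each instance has a 1-based sequence number and knows the total instance count, and `Role` and `assign_role` are hypothetical names. The 25% is rounded down.

```rust
/// The three roles used in the simulation.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Role {
    Streamer,
    Neutral,
    Viewer,
}

/// Pick a role from a 1-based sequence number (assumption: every instance
/// gets a unique number between 1 and `total`).
/// Instance 1 streams, the next 25% (rounded down) are neutral nodes,
/// and everyone else is a viewer.
pub fn assign_role(seq: u64, total: u64) -> Role {
    let neutral = total / 4;
    if seq == 1 {
        Role::Streamer
    } else if seq <= 1 + neutral {
        Role::Neutral
    } else {
        Role::Viewer
    }
}
```

With 50 instances this gives 1 streamer, 12 neutral nodes and 37 viewers. Making the viewers “join later” is a separate step (for example, a delay before they start), which this sketch does not cover.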

Also, I discovered the rust-sdk for Testground. I can’t use it yet, but that’s a good sign of things to come!

I don’t think I understand what’s happening under the hood. When a viewer node has connected to the DHT and subscribed to a GossipSub topic, how does it find other nodes on that topic? Using the DHT probably, otherwise how would they find each other? Then, if you’re listening on a topic with multiple other nodes and ask for a block via Bitswap, are you already connected to those nodes? I would think so; a connection is a connection, no matter which behaviour uses it.

If I’m correct, then the longest a viewer would have to wait is the time it takes to find another viewer via the DHT. As soon as that connection happens, they would have someone to relay the pubsub messages and also the blocks.

Now, at scale I could see some problems. If one node is overloaded and your node can’t get the data from it fast enough, will your node “load-balance” and try to find other nodes? If you’re connected to others on the same topic, maybe your node would have multiple peers to get the blocks from.

The good news is that I don’t yet see a case where live streams on IPFS don’t scale.

I added pubsub and network shaping (untested) to rust-sdk.

It took a long time; I had to learn how to handle WebSocket stuff.

Tomorrow: simulating difficult network conditions!

Trying to guess what the JSON requests should look like is hard. Just looking at the Go code isn’t cutting it.

I fixed some bugs but it still doesn’t work. I’m frustrated! If you’re going to work exclusively with JSON, give us examples of well-formed messages at least!

Next week, I’ll learn and modify the go-sdk to log raw requests and responses.

Dear journal, I’m sorry I neglected you. :rofl:

rust-sdk is now good enough for my use case. I added metrics too. I’ve been running simulations for some time.

This will probably be the last entry, but for posterity I’ll list my wishes.

  • Language-agnostic documentation.
    • Including examples of all the SDK messages in JSON.
  • Make the group index available as an env var.
  • Make the network bandwidth limiter configurable bidirectionally.
    • Specify that it’s in bits per second, not bytes.
  • Finalise and polish rust-sdk.
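On the bits vs bytes point: if you think of your bandwidth in bytes per second, you have to multiply by 8 before handing it to the limiter. A tiny helper sketch follows; the function name is hypothetical and not part of any SDK.

```rust
/// Convert a bandwidth in bytes per second to bits per second.
/// Returns None if the multiplication would overflow a u64.
pub fn bytes_per_sec_to_bits_per_sec(bytes_per_sec: u64) -> Option<u64> {
    bytes_per_sec.checked_mul(8)
}
```

For example, 1 MB/s (1,000,000 bytes per second) becomes 8,000,000 bits per second.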

See here → GitHub - SionoiS/sdk-rust at metrics

The end.
