Any work on compression of IPLD data?

We have a use case where we want to store a lot of data on very modest devices, for various reasons. We make sure to create the data in such a way that it is sufficiently chunky. Our data is JSON stored as DAG objects, with a lot of potential for compression (your typical JSON data, I guess).

I am tempted to just compress the data and store it as a compressed blob, but that would lose many of the advantages of IPLD DAG objects, namely the addressability of their content.

I think a much better solution would be compression at the file store and transport layers. E.g. compress individual blocks in the file store once they exceed a certain size. Communication between nodes via Bitswap etc. could then use the compressed data as well, provided both sides understand the compression, similar to content-encoding negotiation in HTTP.

The hash of the data would be computed before compression, so that the hash remains stable when switching compression formats and options.
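
To make the idea concrete, here is a minimal sketch of what such a storage-layer hook could look like. The wrapper type, the one-byte prefix scheme, and the size threshold are all made up for illustration (this is not how the existing datastore works); the key point is that keys remain the hashes of the uncompressed bytes, so switching compression settings never changes a CID:

```go
package compressedstore

import (
	"bytes"
	"compress/gzip"
	"io"
)

// blockStore stands in for whatever key-value store holds the blocks.
// Keys are derived from the hash of the *uncompressed* bytes, so CIDs
// stay stable regardless of compression format or options.
type blockStore interface {
	Put(key string, value []byte) error
	Get(key string) ([]byte, error)
}

// compressedStore wraps a blockStore and transparently gzips values
// above a size threshold. A one-byte prefix records whether the stored
// value is compressed (1) or left as-is (0).
type compressedStore struct {
	inner     blockStore
	threshold int // e.g. 4096 bytes
}

func (s *compressedStore) Put(key string, value []byte) error {
	if len(value) < s.threshold {
		return s.inner.Put(key, append([]byte{0}, value...))
	}
	var buf bytes.Buffer
	buf.WriteByte(1)
	zw := gzip.NewWriter(&buf)
	if _, err := zw.Write(value); err != nil {
		return err
	}
	if err := zw.Close(); err != nil {
		return err
	}
	// Fall back to raw storage if compression did not actually help.
	if buf.Len() >= len(value)+1 {
		return s.inner.Put(key, append([]byte{0}, value...))
	}
	return s.inner.Put(key, buf.Bytes())
}

func (s *compressedStore) Get(key string) ([]byte, error) {
	stored, err := s.inner.Get(key)
	if err != nil || len(stored) == 0 {
		return nil, err
	}
	if stored[0] == 0 {
		return stored[1:], nil
	}
	zr, err := gzip.NewReader(bytes.NewReader(stored[1:]))
	if err != nil {
		return nil, err
	}
	defer zr.Close()
	return io.ReadAll(zr)
}
```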

I think for the data that is typically stored in IPFS, this could reduce both storage and network bandwidth usage by a large factor. There would probably have to be a no-op compression option for data that is already compressed, such as images or videos. But it should be pretty easy and cheap to detect whether compression provides a benefit when adding DAG objects.
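
By way of illustration, that cheap check could be as simple as compressing a leading sample of the block and comparing sizes. A sketch (sample size and ratio cut-off are arbitrary, not part of any existing API):

```go
package compressedstore

import (
	"bytes"
	"compress/gzip"
)

// worthCompressing gzips a leading sample of the block and checks
// whether the output is meaningfully smaller. Already-compressed
// content (images, video, archives) comes out near its original size
// and would get the no-op treatment. The 64 KiB sample and the 0.9
// ratio are chosen purely for illustration.
func worthCompressing(block []byte) bool {
	sample := block
	if len(sample) > 64*1024 {
		sample = sample[:64*1024]
	}
	var buf bytes.Buffer
	zw := gzip.NewWriter(&buf)
	zw.Write(sample)
	zw.Close()
	return float64(buf.Len()) < 0.9*float64(len(sample))
}
```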

Is there any ongoing work on this, or do I have to resort to compressing individual blocks myself and lose the benefits of InterPlanetary Linked Data?


It isn’t really an answer, but have you thought about using YAML instead of JSON? Just the bytes you save from dropping brackets and commas add up to a lot in big files…

If they’re feeding the JSON to ipfs dag put in the normal way, it’s actually stored as CBOR – no commas or brackets or anything like that 🙂
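
For what it’s worth, encoding the same record both ways shows the difference. This sketch uses the fxamacker/cbor Go library and a made-up telemetry record as a stand-in for the dag-cbor codec and data that IPFS would actually handle:

```go
package main

import (
	"encoding/json"
	"fmt"

	"github.com/fxamacker/cbor/v2" // plain CBOR as a stand-in for dag-cbor
)

func main() {
	// Made-up telemetry record, just to compare encoded sizes.
	event := map[string]interface{}{
		"device": "sensor-1",
		"temp":   21.5,
		"ts":     1650000000,
		"ok":     true,
	}
	j, _ := json.Marshal(event)
	c, _ := cbor.Marshal(event)
	fmt.Printf("json: %d bytes, cbor: %d bytes\n", len(j), len(c))
}
```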

Yes, we are storing JSON as DAG objects. However, I think for our kind of data (basically pretty regular telemetry / events) there is a factor of 10 or more to be gained from intelligent compression. And I don’t think this is very uncommon.

So I guess the best short-term solution would be to encode the data using CBOR / IPLD and then compress it before storing/hashing it. Longer term, it would be great if IPFS dealt with that last step itself, but I fully understand that there are other more pressing issues (like making pubsub and IPNS fast and production-ready 🙂).
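
In case anyone wants to try the same workaround, here is a rough sketch of that short-term approach. It uses plain CBOR (via the fxamacker/cbor library) as a stand-in for dag-cbor, gzip for the compression, and a made-up record; the resulting blob would then be added as an ordinary opaque block:

```go
package main

import (
	"bytes"
	"compress/gzip"
	"fmt"

	"github.com/fxamacker/cbor/v2" // plain CBOR as a stand-in for dag-cbor
)

func main() {
	// Made-up telemetry event; in practice this would be whatever
	// record currently gets fed to `ipfs dag put`.
	event := map[string]interface{}{"device": "sensor-1", "temp": 21.5, "ts": 1650000000}
	encoded, err := cbor.Marshal(event)
	if err != nil {
		panic(err)
	}

	// Compress the encoded block before handing it to IPFS. The hash
	// (and therefore the CID) then covers the compressed bytes, so the
	// contents are no longer traversable as IPLD links.
	var buf bytes.Buffer
	zw := gzip.NewWriter(&buf)
	zw.Write(encoded)
	zw.Close()

	fmt.Printf("encoded: %d bytes, compressed: %d bytes\n", len(encoded), buf.Len())
	// buf.Bytes() would then be added as an opaque blob, e.g. with
	// `ipfs add` or `ipfs block put`.
}
```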