Data Structure Optimization Research For Optimal Deduplication

TL;DR at the end.

Due to the lack of concrete research into data deduplication savings I’ve started experimenting with some research myself. To my surprise, the deduplication savings weren’t as easily attainable as I assumed they would be, perhaps due to misreading or misinterpreting existing documentation. This is in part related to UnixFS object overhead: when I was experimenting with different “on-IPFS formats” I discovered there were noticeable differences in total on-IPFS data sizes.

This also led me to the conclusion that the deduplication gains will not be the same across different data types (emails, movies, etc.), so I think that in order to calculate how much you will save by storing data this way, it’s important to tackle this from the perspective of individual data types.

For example, when creating UnixFS objects you have different chunking methods at your disposal (rabin, buzzhash, size-based, etc.), and you also have different options for the type of data you put into UnixFS (is it just a serialized representation of the original data, is it a protocol buffer object, etc.).
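To make those options concrete, here’s a minimal sketch (Go, standard library only; chunkFixed is a toy helper of mine, not the actual go-ipfs splitter) of the simplest strategy, fixed-size chunking, where every chunk becomes its own content-addressed block; rabin and buzzhash differ only in how the boundaries are chosen:

package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// chunkFixed is a toy fixed-size splitter: every chunk except possibly the
// last is exactly `size` bytes. Rabin/buzzhash instead derive boundaries from
// the content itself, which is what makes them resilient to shifted data.
func chunkFixed(data []byte, size int) [][]byte {
	var chunks [][]byte
	for len(data) > 0 {
		n := size
		if len(data) < n {
			n = len(data)
		}
		chunks = append(chunks, data[:n])
		data = data[n:]
	}
	return chunks
}

func main() {
	email := []byte("Subject: hello\r\n\r\nthe same body text and the same attachment bytes")
	for i, c := range chunkFixed(email, 16) {
		// Each chunk becomes its own content-addressed block; identical
		// chunks in two different emails hash to the same digest and are
		// only stored once.
		sum := sha256.Sum256(c)
		fmt.Printf("chunk %d: %s\n", i, hex.EncodeToString(sum[:])[:16])
	}
}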

Given this, it’s quite clear that there’s a variety of factors which influence your deduplication gains, but one thing I haven’t seen discussed is the structure of your data, whether it’s a protocol buffer object, JSON, etc.

In an attempt to experiment with this and get some concrete numbers, I’m taking emails and putting them on IPFS, mainly because the format of emails is standardized and it’s a data type I’m familiar with.

For example, I have the following protocol buffer definition to describe RFC 5322 emails:

syntax = "proto3";
package pb;
import "github.com/gogo/protobuf/gogoproto/gogo.proto";
import "google/protobuf/timestamp.proto";

// ChunkedEmail is like Email but chunked into parts
message ChunkedEmail {
	// maps the chunk part to its hash
	map<int32, string> parts = 1;
}

// Email is an RFC 5322-compatible protocol buffer intended to be used
// as an IPLD object type, allowing long-term space-efficient archiving of data
// taken from https://github.com/DusanKasan/parsemail/blob/master/parsemail.go
message Email {
    Header headers = 1 [(gogoproto.nullable) = false];
    string subject = 2;
	Addresses addresses = 3 [(gogoproto.nullable) = false];
	google.protobuf.Timestamp date = 4 [(gogoproto.stdtime) = true, (gogoproto.nullable) = false];
	string messageID = 5;
	repeated string inReplyTo = 6;
	repeated string references = 7;
	Resent resent = 8;
	string htmlBody = 10;
	string textBody = 11;
	// a slice is nil by default
	repeated Attachment attachments = 12 [(gogoproto.nullable) = false];
	// a slice is nil by default
	repeated EmbeddedFile embeddedFiles = 13 [(gogoproto.nullable) = false];
}

message Attachment { 
	string fileName = 1;
	string contentType = 2;
	// hash of the unixfs object for the file
	string dataHash = 3; 
}

message EmbeddedFile {
	string contentId = 1;
	string contentType = 2;
	// hash of the unixfs object for the file
	string dataHash = 3; 
}

message Addresses {
	Address sender = 1;
    repeated Address from = 2 [(gogoproto.nullable) = false];
    repeated Address replyTo = 3 [(gogoproto.nullable) = false];
    repeated Address to = 4 [(gogoproto.nullable) = false];
    repeated Address cc = 5 [(gogoproto.nullable) = false];
    repeated Address bcc = 6 [(gogoproto.nullable) = false];
}

message Resent {
	Addresses addresses = 1 [(gogoproto.nullable) = false];
	google.protobuf.Timestamp resentDate = 2 [(gogoproto.stdtime) = true, (gogoproto.nullable) = false];
	string resentMessageId = 3;
}
message Header {
    map<string, Headers> values = 1 [(gogoproto.nullable) = false];
}

message Headers {
	repeated string values = 1;
}

// Values is basically an embedded slice in an email header
message Values {
    repeated string v = 1;
}

message Address {
    string name = 1; // proper name, may be empty
    string address = 2;
}

And to test this I’ve created some “sample sets”: a few emails I sent myself with the same pictures, as well as a boatload of randomly generated emails here if you are interested, though there are 5k of them.

So I went about storing this on IPFS and discovered some interesting things. The initial sample set consists largely of duplicated data (the same set of phrases and pictures); maybe 75% of the data was duplicated. When storing this on IPFS, the space savings were absolutely mind-boggling: 572%!!!

So as not to get ahead of myself and proclaim victory, I generated a large sample set of entirely random emails. When this was added to IPFS in the same manner, deduplication savings were 8%. Now 8% isn’t slim pickings; when you scale this up to petabytes of data on regular disks, and then consider petabytes of data with RAID, the savings are looking pretty cool.

But I refuse to believe that 8% is as much as you can save when dealing with “random emails”, which leads me to believe that the bulk of the deduplication savings will come largely from how your data structure is optimized to work with IPFS.

For example, simply chunking your data into extremely small pieces doesn’t always work. I did some experimentation with chunking emails into 100-byte chunks, and that led to a massively larger on-IPFS storage size than using the default chunker, which makes sense: if your chunk size is small enough, you create more data linking these objects together than the actual size of the data itself.

In terms of further investigation, I’m a bit lost as to the various ways to R&D this, and I’m looking for some discussion around possible plans of attack so I can publish some research on this. All research published will be open and freely available.

TL;DR

  • I’m looking to do research into how you can optimize data structures for IPFS to get optimal deduplication
  • Deduplication savings aren’t straightforward
  • Savings can be influenced by a variety of factors including:
    • Chunk size
    • Data structure
    • How much of your data is genuinely duplicated, vs. how much gets deduplicated simply due to coincidences with content addressing
  • Not too sure about the best path forward to do this research

Well, I’m not really sure what exactly you’re trying to say here … when you save random emails, regardless of the chunker and the chunk size, the deduplication rate will be 0.

For movies that’s exactly the same.

With the exception of storing the same files twice, while using the same chunker/chunk-size parameters.

Using something like rabin/buzzhash is only interesting for large files with small changes which shift the content forwards or backwards. For example, if you add a file in the middle of a large (uncompressed) tar file. Another example would certainly be large VM images.

Deduplication in general is, on the other hand, very efficient if you add files at the end of an (uncompressed) tar archive or a similar storage format, or concatenate multiple files (when using buzzhash).
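To illustrate why that works, here’s a toy content-defined chunker (Go, standard library; a crude rolling sum standing in for the real rabin/buzzhash hashes): a boundary is cut wherever the rolling hash of the last few bytes matches a mask, so an insertion near the start only disturbs the chunks around the edit instead of shifting every later boundary the way fixed-size chunking does:

package main

import (
	"bytes"
	"crypto/sha256"
	"fmt"
)

// cdcSplit is a toy content-defined chunker: it keeps a rolling sum of the
// last `window` bytes and cuts a chunk boundary whenever that sum matches a
// mask. Real rabin/buzzhash chunkers use much better rolling hashes, but the
// boundary behaviour is the same: boundaries depend only on nearby bytes.
func cdcSplit(data []byte, window int, mask uint32) [][]byte {
	var chunks [][]byte
	var sum uint32
	start := 0
	for i, b := range data {
		sum += uint32(b)
		if i >= window {
			sum -= uint32(data[i-window])
		}
		if i-start >= window && sum&mask == 0 {
			chunks = append(chunks, data[start:i+1])
			start = i + 1
		}
	}
	if start < len(data) {
		chunks = append(chunks, data[start:])
	}
	return chunks
}

// digests collects the set of chunk hashes so shared chunks can be counted.
func digests(chunks [][]byte) map[[32]byte]bool {
	seen := map[[32]byte]bool{}
	for _, c := range chunks {
		seen[sha256.Sum256(c)] = true
	}
	return seen
}

func main() {
	base := bytes.Repeat([]byte("please find the quarterly report attached. "), 8)
	// Prepend data: with fixed-size chunks every later boundary would shift,
	// while content-defined boundaries resynchronise shortly after the edit.
	shifted := append([]byte("INSERTED HEADER "), base...)

	a, b := digests(cdcSplit(base, 8, 0x1f)), digests(cdcSplit(shifted, 8, 0x1f))
	shared := 0
	for d := range a {
		if b[d] {
			shared++
		}
	}
	fmt.Printf("chunks shared after the insertion: %d of %d\n", shared, len(a))
}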


Compression, on the other hand, is extremely efficient on emails and other text files. The small file size is somewhat of an issue; that’s why I’ve been investigating the dictionary option of zstd, selected by MIME type, for IPFS over the last month.
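Not the actual work in progress, just a rough sketch of what dictionary-assisted zstd looks like from Go, assuming the github.com/klauspost/compress/zstd package and a hypothetical pre-trained email.dict file:

package main

import (
	"fmt"
	"os"

	"github.com/klauspost/compress/zstd"
)

func main() {
	// Hypothetical dictionary trained on a corpus of emails; small files
	// compress far better with a dictionary than without one.
	dict, err := os.ReadFile("email.dict")
	if err != nil {
		panic(err)
	}

	enc, err := zstd.NewWriter(nil, zstd.WithEncoderDict(dict))
	if err != nil {
		panic(err)
	}
	defer enc.Close()

	msg, err := os.ReadFile("message.eml")
	if err != nil {
		panic(err)
	}

	compressed := enc.EncodeAll(msg, nil)
	fmt.Printf("raw %d bytes -> compressed %d bytes\n", len(msg), len(compressed))
}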

I don’t think this is entirely true for all cases?? Objects on IPFS are saved as chunks (well, it depends on how you store them I guess, but in this case with the emails and UnixFS there are chunks). Given this, there will inevitably be some chunks that are shared across emails, and you will get deduplication savings, since chunk XYZ shared between objects A+B+C only needs to be stored once, not once each for A, B, and C. So there might not be “out of IPFS deduplication”, but once the data is chunked and stored on IPFS, due to content addressing there can, and most likely will, be some shared chunks.

So what I’m trying to say is that there has to be some optimal way of structuring and chunking your data that leads to a higher chance of chunks being shared, without incurring insane overheads like what happened with the 100-byte chunks, because even though the chunks were shared, having tens of thousands of links to those chunks adds up.

For example, let’s say in the case of emails there’s some form of chunking where chunks are created based on parts of the email (header, body, etc.); you might increase the likelihood of duplicated content.
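A rough sketch of that idea (Go, standard library only; a toy splitter, not a real chunker): hash the header block and the body as separate chunks, so two emails with identical bodies but different headers still share the body chunk:

package main

import (
	"crypto/sha256"
	"fmt"
	"io"
	"net/mail"
	"strings"
)

// structuralChunks is a toy "email-aware" splitter: it returns the raw header
// block and the body as two separate chunks, so messages with identical
// bodies but different headers can still share the body chunk.
func structuralChunks(raw string) ([][]byte, error) {
	msg, err := mail.ReadMessage(strings.NewReader(raw))
	if err != nil {
		return nil, err
	}
	body, err := io.ReadAll(msg.Body)
	if err != nil {
		return nil, err
	}
	// Everything before the body is the header block (including the blank line).
	header := raw[:len(raw)-len(body)]
	return [][]byte{[]byte(header), body}, nil
}

func main() {
	a := "From: alice@example.com\r\nSubject: hi\r\n\r\nsame body text\r\n"
	b := "From: bob@example.com\r\nSubject: re: hi\r\n\r\nsame body text\r\n"

	for _, raw := range []string{a, b} {
		chunks, err := structuralChunks(raw)
		if err != nil {
			panic(err)
		}
		for _, c := range chunks {
			// The two body chunks hash identically, so the body would only
			// need to be stored once.
			sum := sha256.Sum256(c)
			fmt.Printf("%x\n", sum[:8])
		}
	}
}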

Using something like rabin/buzzhash is only interesting for large files with small changes which shift the content forwards or backwards. For example, if you add a file in the middle of a large (uncompressed) tar file. Another example would certainly be large VM images.

Interesting, I never knew this. Good to know.

Deduplication in general is, on the other hand, very efficient if you add files at the end of an (uncompressed) tar archive or a similar storage format, or concatenate multiple files (when using buzzhash).

Interesting, I wonder if this could be applied to groups of data. For example, emails for one person or one company going into the same archive, etc.

Compression, on the other hand, is extremely efficient on emails and other text files. The small file size is somewhat of an issue; that’s why I’ve been investigating the dictionary option of zstd, selected by MIME type, for IPFS over the last month.

Good point, I hadn’t considered compression at all.

You will find that a deduplication-focused chunker shares a lot of theory with compression, as both are attempting roughly the same thing: finding repeating sections in unstructured/unknown-structure data and using symbols to represent them.

A few considerations though. First, as you ran into with small chunk sizes, the overhead was often larger than the benefit of having a better deduplication rate. Most chunks use 32 bytes for the CID alone, so that 100-byte chunk will need to be used at least 5 times elsewhere to break even on the overhead.

Another consideration for an IPFS-based chunker is that it should be deterministic for individual files across nodes, i.e. if two nodes both add the same file they should both chunk it the same way, since a file chunked differently has different CIDs, and this would make finding content on the network difficult.

Last I heard, because of this the planned method going forward was to use a generic chunker by default (buzzhash) and slowly add in additional chunkers that are specialized to the input file/byte streams and do a better job (e.g. a video-aware chunker that aligns to key frames for fast seeking, or an email-focused chunker that can identify headers, attachments, signatures, etc.). While this does temporarily create a split between old and new CIDs, everything should converge to the more efficient/specialized chunkers as they are implemented.

Thanks for the response, it definitely cleared up some gray areas. Just one thing I’m confused about:

A few considerations though. First, as you ran into with small chunk sizes, the overhead was often larger than the benefit of having a better deduplication rate. Most chunks use 32 bytes for the CID alone, so that 100-byte chunk will need to be used at least 5 times elsewhere to break even on the overhead.

Apologies if this comes across as dense, but how did you arrive at 5 times elsewhere? My math is absolute trash.

Apologies if this comes across as dense, but how did you arrive at 5 times elsewhere? My math is absolute trash.

My apologies, apparently it is my math that is trash, and it may still be (I think I reversed the saving and overhead values).
Anyway, it looks more like the break-even point is 2, not 5.

132 bytes for the chunk itself (100 bytes for the payload + 32 bytes for the CID).
We then need to use the 32-byte CID wherever the 100 bytes would have been used:

1 instance: 132 + (32 * 1) = 164 vs 100 (64% worse)
2 instances: 132 + (32 * 2) = 196 vs 200 (2% better)
3 instances: 132 + (32 * 3) = 228 vs 300 (31% better)

This of course assumes that the 32 bytes for the CID are the only overhead.
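For completeness, a tiny snippet (Go) that reproduces that arithmetic under the same assumptions, i.e. a 100-byte payload, 32 bytes of CID overhead per stored chunk and per reference, and nothing else:

package main

import "fmt"

func main() {
	const payload, cid = 100, 32
	for uses := 1; uses <= 5; uses++ {
		deduped := (payload + cid) + cid*uses // store the chunk once, reference it everywhere
		plain := payload * uses               // store the payload verbatim every time
		fmt.Printf("%d use(s): deduped=%d vs plain=%d\n", uses, deduped, plain)
	}
	// Break-even lands at 2 uses (196 vs 200), matching the numbers above.
}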

Well at least we’re both bad at math :joy:

Thanks for the clarification, I understand now.

Well, deduplicating mails is only possible with specialized chunkers, which separate each body (if it’s an HTML mail there are two bodies), the attachments, and the header from each other. Then the attachments need to be converted to 8-bit from the two (IIRC) other possible encodings to make sure that the same file always ends up with the same hash, regardless of the sending client’s settings.

Then it MIGHT be possible to dedup a large company archive a little, for example. But the whole body probably ends up as one chunk, the header definitely does, and the attachments might be the most dedupable thing, if there are a lot of large files being sent around by mail.
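A minimal sketch of that normalization step (Go, standard library; assuming the two other transfer encodings meant here are base64 and quoted-printable): decode the attachment back to raw 8-bit bytes before hashing, so the digest no longer depends on how the sending client encoded it:

package main

import (
	"crypto/sha256"
	"encoding/base64"
	"fmt"
	"io"
	"mime/quotedprintable"
	"strings"
)

// normalize decodes an attachment body back to its raw bytes so that the same
// file always hashes the same, regardless of the Content-Transfer-Encoding
// the sending client happened to pick.
func normalize(body, transferEncoding string) ([]byte, error) {
	switch strings.ToLower(transferEncoding) {
	case "base64":
		return base64.StdEncoding.DecodeString(strings.ReplaceAll(body, "\r\n", ""))
	case "quoted-printable":
		return io.ReadAll(quotedprintable.NewReader(strings.NewReader(body)))
	default: // 7bit, 8bit, binary: already raw
		return []byte(body), nil
	}
}

func main() {
	raw := []byte("the same attachment bytes")

	asBase64 := base64.StdEncoding.EncodeToString(raw)
	decoded, err := normalize(asBase64, "base64")
	if err != nil {
		panic(err)
	}

	// After normalization the encoded and raw versions hash identically.
	a, b := sha256.Sum256(raw), sha256.Sum256(decoded)
	fmt.Println("same hash after normalization:", a == b)
}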

But some corporate mail servers do exactly this already, before running a compression algorithm over their backups. So we’re not reinventing the wheel here :stuck_out_tongue:

Compression rates of up to 6x achieved on the Linux kernel git, while maintaining fast and full random access, are on the other hand a game changer. The zstd developers did a great job :slight_smile:

If you’ve got a large email archive which you can share, I would appreciate that - I’m currently looking into generating the necessary dictionaries. :slight_smile:

Well, deduplicating mails is only possible with specialized chunkers, which separate each body (if it’s an HTML mail there are two bodies), the attachments, and the header from each other. Then the attachments need to be converted to 8-bit from the two (IIRC) other possible encodings to make sure that the same file always ends up with the same hash, regardless of the sending client’s settings.

Good to know, I wasn’t aware that separating the bodies was crucial.

But some corporate mail servers do exactly this already, before running a compression algorithm over their backups.

Idk man, I worked at a company which had 6 PB of emails archived, and they were literally either saved straight onto disk, into a database, or onto WORM tape drives, with zero deduplication effort. Big yikes there.

Compression rates of up to 6x achieved on the Linux kernel git, while maintaining fast and full random access, are on the other hand a game changer. The zstd developers did a great job

Will take a look at zstd, it’s being used in badger v2 and it’s pretty dope.

If you’ve got a large email archive which you can share, I would appreciate that - I’m currently looking into generating the necessary dictionaries

No real archives other than my personal email accounts, and I’m definitely not sharing those lol. I have a CLI tool that I wrote to generate random emails, but I don’t think that’s sufficient.

@postables

Due to the lack of concrete research into data deduplication …

There is such concerted research taking place in fact, but the setup took about twice as long as I originally anticipated. You happen to have opened this topic right around the time the tooling for this is getting ready to ship. Please keep an eye on Research/quantify performance envelopes of multiple CDC algorighms · Issue #227 · ipfs/specs · GitHub for the general progress on that. In about 2 weeks there should be a better idea of how others can join and help out with this effort.

Several notes on the discussion in this thread:

important to tackle this from the perspective of individual data types

This is a very tempting approach to take, but it simply does not scale. The current focus is to determine what is possible without data-type evaluation. I will elaborate soon on what approaches have already been tried, and what could be tried without resorting to heuristics.

there has to be some optimal way of structuring and chunking your data that leads to a higher chance of chunks being shared, without incurring insane overheads like what happened with the 100-byte chunks

This is precisely what is being sought, yes :wink:

that 100-byte chunk will need to be used at least …

Small chunks, whether due to “narrow boundaries” between larger chunks or for other reasons, can be represented verbatim using the identity CID. For instance the string “This is a valid CID” can be represented legally (and entirely offline) as bafkqafcunbuxgidjomqgcidwmfwgszbaineuicq (try to ipfs cat it). TL;DR: the various size balances are even trickier than they first appear :frowning: Luckily the tooling I mentioned exposes knobs for all of this.
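For anyone curious how such inline blocks are built, a small sketch (assuming the github.com/ipfs/go-cid and github.com/multiformats/go-multihash modules) that wraps a short byte string in an identity-hash CID, i.e. the bytes are carried inside the CID itself rather than in a separate block:

package main

import (
	"fmt"

	cid "github.com/ipfs/go-cid"
	mh "github.com/multiformats/go-multihash"
)

func main() {
	data := []byte("This is a valid CID")

	// An identity multihash embeds the bytes themselves instead of a digest,
	// so tiny chunks need no separate block at all.
	hash, err := mh.Sum(data, mh.IDENTITY, -1)
	if err != nil {
		panic(err)
	}

	c := cid.NewCidV1(cid.Raw, hash)
	fmt.Println(c) // a CIDv1 whose payload is the string itself
}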

There is such concerted research taking place in fact, but the setup took about twice as long as I originally anticipated. You happen to have opened this topic right around the time the tooling for this is getting ready to ship. Please keep an eye on Research/quantify performance envelopes of multiple CDC algorighms · Issue #227 · ipfs/specs · GitHub for the general progress on that. In about 2 weeks there should be a better idea of how others can join and help out with this effort.

Good timing then lol, thanks for the link, I will follow that issue.

This is a very tempting approach to take, but it simply does not scale. The current focus is to determine what is possible without data-type evaluation. I will elaborate soon on what approaches have already been tried, and what could be tried without resorting to heuristics.

Not scaling is a good point; it would be unrealistic to have a particular chunking method for each and every data type, so a good baseline suitable for a variety of data types makes sense. But depending on the use case, I guess it might make sense in some areas to have chunking methods optimized for that particular data type?

For example, let’s say you’re putting a large email archive on IPFS; it might make sense for that use case to investigate a specialized chunker. Granted, this is not necessarily something applicable or scalable to IPFS as a whole, but I think there are definitely use cases which would warrant specific chunkers.

Small chunks, whether due to “narrow boundaries” between larger chunks or for other reasons, can be represented verbatim using the identity CID. For instance the string “This is a valid CID” can be represented legally (and entirely offline) as bafkqafcunbuxgidjomqgcidwmfwgszbaineuicq (try to ipfs cat it). TL;DR: the various size balances are even trickier than they first appear :frowning: Luckily the tooling I mentioned exposes knobs for all of this.

Good points, it definitely seems a lot more involved than I originally imagined.

Do you know if anyone has looked into the use of homomorphic hashing functions like lthash with IPFS? With something like that you could conceivably rechunk and still maintain the same hash for the file. Even if you just did it locally you might be able to sub-split a block to a more optimal split while maintaining the root hash.

As far as I know there hasn’t been any investigation into or integration of homomorphic hashing in go-multihash, although it certainly could be extended to support it, IIRC.

See also here for my summary of everything I could think of related to chunking, compression, and data deduplication.