TL;DR at the end.
Due to the lack of concrete research into data deduplication savings, I've started experimenting with some research myself. To my surprise, the deduplication savings weren't as easily attainable as I assumed they would be, perhaps due to misreading or misinterpreting existing documentation. This is in part related to UnixFS object overhead: when I was experimenting with different "on-IPFS formats" I discovered noticeable differences in total on-IPFS data sizes.
This also led me to the conclusion that deduplication gains will not be the same across different data types (emails, movies, etc.), so I think that in order to calculate how much you will save by storing data, it's important to tackle this from the perspective of individual data types.
For example, when creating UnixFS objects you have different chunking methods at your disposal (rabin, buzhash, size based, etc.), and you also have different options for the type of data you put into UnixFS (is it just a serialized representation of the original data, is it a protocol buffer object, etc.).
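To make the chunker difference concrete, here's a toy sketch, assuming nothing about go-ipfs's actual rabin/buzhash implementations: a fixed-size chunker versus a simple content-defined chunker that cuts wherever a hash of the trailing window of bytes hits a boundary condition. Inserting a single byte at the front shifts every fixed-size boundary, but the content-defined boundaries re-synchronize, so most chunks still deduplicate:

```python
import hashlib
import zlib

def fixed_size_chunks(data: bytes, size: int) -> list:
    # size-based chunking: cut every `size` bytes, regardless of content
    return [data[i:i + size] for i in range(0, len(data), size)]

def content_defined_chunks(data: bytes, window: int = 16, mask: int = 0x3F) -> list:
    # toy content-defined chunker (a stand-in for rabin/buzhash): cut wherever
    # the hash of the trailing `window` bytes satisfies a boundary condition,
    # so boundaries depend on content rather than absolute offsets
    chunks, start = [], 0
    for i in range(window, len(data) + 1):
        if zlib.crc32(data[i - window:i]) & mask == 0:
            chunks.append(data[start:i])
            start = i
    if start < len(data):
        chunks.append(data[start:])
    return chunks

def unique(chunks) -> set:
    # deduplicate by content hash, the way a content-addressed store would
    return {hashlib.sha256(c).hexdigest() for c in chunks}

# deterministic pseudo-random payload, plus the same payload with one byte
# inserted at the front
original = b"".join(hashlib.sha256(bytes([i])).digest() for i in range(128))
shifted = b"X" + original

fixed = unique(fixed_size_chunks(original, 64) + fixed_size_chunks(shifted, 64))
cdc = unique(content_defined_chunks(original) + content_defined_chunks(shifted))
print(len(fixed), len(cdc))  # far fewer unique chunks with content-defined cuts
```

This is only meant to illustrate the mechanism; the real chunkers use proper rolling hashes and tuned window/boundary parameters.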
Given this, it's quite clear that a variety of factors influence your deduplication gains, but one thing I haven't seen discussed is the structure of your data itself, whether it's a protocol buffer object, JSON, etc.
In an attempt to experiment with this and get some concrete numbers, I'm taking emails and putting them on IPFS, namely because the format of emails is standardized, and it's a data type I'm familiar with.
For example, I have the following protocol buffer object to describe RFC 5322 emails:
syntax = "proto3";
package pb;

import "github.com/gogo/protobuf/gogoproto/gogo.proto";
import "google/protobuf/timestamp.proto";

// ChunkedEmail is like Email but chunked into parts
message ChunkedEmail {
  // maps the chunk part to its hash
  map<int32, string> parts = 1;
}

// Email is an RFC 5322 compatible protocol buffer intended to be used
// as an IPLD object type, allowing long-term space-efficient archiving of data.
// Taken from https://github.com/DusanKasan/parsemail/blob/master/parsemail.go
message Email {
  Header headers = 1 [(gogoproto.nullable) = false];
  string subject = 2;
  Addresses addresses = 3 [(gogoproto.nullable) = false];
  google.protobuf.Timestamp date = 4 [(gogoproto.stdtime) = true, (gogoproto.nullable) = false];
  string messageID = 5;
  repeated string inReplyTo = 6;
  repeated string references = 7;
  Resent resent = 8;
  string htmlBody = 10;
  string textBody = 11;
  // a slice is nil by default
  repeated Attachment attachments = 12 [(gogoproto.nullable) = false];
  // a slice is nil by default
  repeated EmbeddedFile embeddedFiles = 13 [(gogoproto.nullable) = false];
}

message Attachment {
  string fileName = 1;
  string contentType = 2;
  // hash of the unixfs object for the file
  string dataHash = 3;
}

message EmbeddedFile {
  string contentId = 1;
  string contentType = 2;
  // hash of the unixfs object for the file
  string dataHash = 3;
}

message Addresses {
  Address sender = 1;
  repeated Address from = 2 [(gogoproto.nullable) = false];
  repeated Address replyTo = 3 [(gogoproto.nullable) = false];
  repeated Address to = 4 [(gogoproto.nullable) = false];
  repeated Address cc = 5 [(gogoproto.nullable) = false];
  repeated Address bcc = 6 [(gogoproto.nullable) = false];
}

message Resent {
  Addresses addresses = 1 [(gogoproto.nullable) = false];
  google.protobuf.Timestamp resentDate = 2 [(gogoproto.stdtime) = true, (gogoproto.nullable) = false];
  string resentMessageId = 3;
}

message Header {
  map<string, Headers> values = 1 [(gogoproto.nullable) = false];
}

message Headers {
  repeated string values = 1;
}

// Values is basically an embedded slice in an email header
message Values {
  repeated string v = 1;
}

message Address {
  string name = 1; // proper name, may be empty
  string address = 2;
}
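One design consequence of the schema above: attachments and embedded files are referenced by `dataHash` rather than embedded in the email object, so two emails carrying the same attachment should share a single copy of the bytes. A minimal sketch of that idea, using a plain dict as a stand-in for a content-addressed block store (not real IPFS API calls):

```python
import hashlib

# toy content-addressed store standing in for IPFS: blocks are keyed by the
# hash of their bytes, so storing identical bytes twice costs nothing extra
store = {}

def put(data: bytes) -> str:
    key = hashlib.sha256(data).hexdigest()
    store[key] = data
    return key

photo = b"pretend-image-bytes" * 500

# two different emails referencing the same attachment: like the Attachment
# message above, each email holds only the dataHash, not the bytes
email_a = {"subject": "hi", "attachments": [{"fileName": "cat.png", "dataHash": put(photo)}]}
email_b = {"subject": "fwd: hi", "attachments": [{"fileName": "cat.png", "dataHash": put(photo)}]}

print(len(store))  # the photo bytes are stored exactly once
```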
And to test this I've created some "sample sets": a few emails I sent myself containing the same pictures, as well as a boatload of randomly generated emails, here if you are interested, though there are 5,000 of them.
So I went about storing this on IPFS and discovered some interesting things. The initial sample set consists largely of duplicated data (the same set of phrases and pictures); maybe 75% of the data was duplicated. When storing this on IPFS, the space savings were absolutely mind-boggling: 572%!!!
So as not to get ahead of myself and proclaim victory, I generated a large sample set of entirely random emails. When this was added to IPFS in the same manner, deduplication savings were 8%. Now, 8% isn't slim pickings: when you scale this up to petabytes of data on regular disks, and then consider petabytes of data with RAID, the savings are looking pretty cool.
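For clarity on how I'm reporting these numbers, a quick sanity check with purely illustrative figures (not the actual sample-set sizes), including how a figure above 100%, like the one from the first sample set, reads if you interpret it as a raw-to-stored ratio:

```python
# purely illustrative figures, not the actual sample-set sizes
raw_size = 1_000_000  # total bytes before dedup (sum of all inputs)
stored = 920_000      # bytes actually written after dedup

savings = (raw_size - stored) / raw_size * 100
print(f"{savings:.0f}% saved")  # 8% with these made-up numbers

# a figure above 100% only makes sense as a raw-to-stored ratio:
# raw/stored = 5.72 would mean ~82.5% of the raw bytes were deduplicated away
ratio = 5.72
print(f"{(1 - 1 / ratio) * 100:.1f}% of raw bytes eliminated")
```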
But I refuse to believe that 8% is as much as you can save when dealing with "random emails", which leads me to think that the bulk of deduplication savings will come largely from how your data structure is optimized to work with IPFS.
For example, simply chunking your data into extremely small pieces doesn't always work. I did some experimentation with chunking emails into 100-byte chunks, and that led to a massively larger on-IPFS storage size than using the default chunker. This makes sense: if your chunk size is small enough, the links tying these objects together take up more space than the actual data.
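A back-of-the-envelope sketch of why tiny chunks backfire. The per-link cost below is my own assumption (roughly a CID plus protobuf framing for each child referenced by the parent DAG node), not an exact go-ipfs figure:

```python
# assumed bytes of per-child link overhead in the parent DAG node
# (roughly a CID plus protobuf framing); not an exact go-ipfs number
LINK_OVERHEAD = 44

def overhead_ratio(file_size: int, chunk_size: int) -> float:
    # link bytes added per byte of actual file data
    n_chunks = -(-file_size // chunk_size)  # ceiling division
    return (n_chunks * LINK_OVERHEAD) / file_size

print(f"{overhead_ratio(1_000_000, 100):.0%}")      # 100-byte chunks: 44% overhead
print(f"{overhead_ratio(1_000_000, 262_144):.2%}")  # default 256 KiB chunks: 0.02%
```

So with 100-byte chunks you pay nearly half the file size again in links before any deduplication gains kick in, while the default chunk size makes the link overhead negligible.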
In terms of doing further investigation, I'm a bit lost as to the various ways to R&D this kind of research, and I'm looking for some discussion around possible plans of attack so I can publish some research on this. All research published will be open + freely available.
TL;DR
- I'm looking to do research into how you can optimize data structures for IPFS to maximize deduplication
- Data deduplication savings aren't straightforward
- Savings can be influenced by a variety of factors including:
- Chunk size
- Data structure
- How much of your data is actually duplicated, vs. how much will be deduplicated simply due to coincidences with content addressing
- I'm not too sure about the best path forward to do this research