(draft) Common Bytes - standard for data deduplication

This is a proposal for a standard for deduplicating common bytes across different versions of the same kind of file.
It takes inspiration from git objects, but adds more approaches to make sure the right content parts are identified and organized. The goal is not only to deduplicate data, but also to link the same data when it is represented in different kinds of files.

  • store each line of a text and track the text format of each part (no duplication when comparing the text of a 2read saved page with its full HTML site mirror), and the same for PDF/ePub/MOBI
  • use references, for example to record which lines of a single-page HTML file carry the same content as a JS/CSS file; the same works for SVG and base64 images
  • Windows/other screenshots, keeping identical bytes in shared objects, for example parts of the taskbar and window frame
  • different qualities of the same video; version the content inside video files, detect when frames are similar, and store only the diff
  • all kinds of compressed files (and partition/disk images), including those supported by 7-Zip and Linux archivers
  • deb, AUR, RPM and other Linux/BSD packages
  • MIDI and 8-bit sounds
  • other waveform-based audio/music
  • .exe, .msi, .appimage and other executables
  • git packs: treat their content the same as git objects; http://web.archive.org/web/20191205203745/https://github.com/radicle-dev/radicle/issues/689
    new for dedup/plugz/download.json:
    instead of downloading lots of duplicated AppImages, DEBs and RPMs, extract their internal files and deduplicate them, and make the packages internally symlink to the common files (a sketch of this follows the list). This should also support browser downloads, with an API to get the hashes of the internal files and check whether the local device already has them.
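
A minimal sketch of that last idea, assuming the packages have already been extracted into a directory tree; the hashing and hard-linking strategy here is only illustrative, not part of any existing tool:

```python
import hashlib
import os
import sys

def file_sha256(path, bufsize=1 << 20):
    """Hash a file's content in streaming fashion."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        while True:
            block = f.read(bufsize)
            if not block:
                break
            h.update(block)
    return h.hexdigest()

def dedup_tree(root):
    """Replace duplicate files under `root` with hard links to the first copy seen."""
    seen = {}      # content hash -> canonical path
    saved = 0
    for dirpath, _, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if os.path.islink(path):
                continue                      # leave symlinks alone
            digest = file_sha256(path)
            if digest in seen:
                saved += os.path.getsize(path)
                os.remove(path)               # the same bytes are already stored once
                os.link(seen[digest], path)   # link this name to the canonical copy
            else:
                seen[digest] = path
    return saved

if __name__ == "__main__":
    print(f"saved {dedup_tree(sys.argv[1])} bytes")
```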

GitHub issue (a more up-to-date version of this proposal is kept there): https://github.com/ipfs/ipfs/issues/444

@stebalien, you said on GitHub (https://github.com/ipfs/go-ipfs/issues/6815#issuecomment-573562865) that IPFS already has deduplication like git.
Then is Pinata not up to date? Uploading pages from pagesaver made it quickly reach its limit without deduplicating common bytes, even though the pages are from the same site, just in separate .html files.

IPFS does block-based deduplication (where blocks are 256KiB by default). It will deduplicate whole files or common chunks of large files but it won’t deduplicate small parts of files.

Git does something in addition to deduplication: it stores delta objects in a pack. IPFS doesn’t do that. Unfortunately, that technique only really works on versioned files where (a) the user ends up downloading the entire history and (b) there’s a clear history/relationship between the files.
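
To illustrate the block-based behaviour described above, here is a rough sketch with 256 KiB fixed-size chunks and SHA-256 hashes standing in for block CIDs (this is not IPFS code, just the idea): appended data keeps the old blocks aligned, while a one-byte insertion at the front shifts every block.

```python
import hashlib
import os

BLOCK = 256 * 1024                      # IPFS default chunk size

def block_hashes(data, size=BLOCK):
    """Split data into fixed-size blocks and hash each one (a stand-in for block CIDs)."""
    return [hashlib.sha256(data[i:i + size]).hexdigest()
            for i in range(0, len(data), size)]

original = os.urandom(4 * BLOCK)        # four distinct 256 KiB blocks
appended = original + b"new data"       # change at the end: old blocks keep their alignment
shifted = b"!" + original               # one byte inserted at the front: every block shifts

print(len(set(block_hashes(original)) & set(block_hashes(appended))))  # 4 blocks reused
print(len(set(block_hashes(original)) & set(block_hashes(shifted))))   # 0 blocks reused
```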

@stebalien, and what about the Pinata case?

What do you think about Common Bytes, which would deduplicate parts of known file types, including compressed files and binaries?

For example, it could take the content of <script> and <style> tags, which will be the same as their .js and .css counterparts.
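
A rough sketch of what that could look like (purely illustrative, not an existing IPFS feature): hash the body of each inline <script>/<style> tag and compare it with the hash of the standalone .js/.css file, so matching content could point at one shared object.

```python
import hashlib
from html.parser import HTMLParser

class InlineAssetExtractor(HTMLParser):
    """Collect the text inside <script> and <style> tags of an HTML page."""
    def __init__(self):
        super().__init__()
        self._current = None
        self.assets = []                    # list of (tag, content) pairs

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._current = (tag, [])

    def handle_data(self, data):
        if self._current:
            self._current[1].append(data)

    def handle_endtag(self, tag):
        if self._current and tag == self._current[0]:
            self.assets.append((tag, "".join(self._current[1])))
            self._current = None

def content_id(text):
    return hashlib.sha256(text.encode()).hexdigest()

html = ("<html><head><style>body{margin:0}</style></head>"
        "<body><script>console.log('hi')</script></body></html>")
external_js = "console.log('hi')"           # the same code shipped as a .js file

parser = InlineAssetExtractor()
parser.feed(html)
for tag, body in parser.assets:
    # A matching hash means the inline block and the external file are the same bytes.
    print(tag, content_id(body) == content_id(external_js))
```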

This has probably already been answered elsewhere - noob here.

I’m pretty new to IPFS. What is the relationship between de-duplication and replication? With something like BitTorrent, the more people who have a file, or potentially just a part of a file, the more seeders there are and the better things perform for a new person who wants that file. I know a lot of effort has gone into de-duplication, but I wonder if there is an over-emphasis in that regard. If I’m paying for folks to keep copies of my data, for example, it is in my interest to compress the data before storing it in IPFS (less space --> less money). If, on the other hand, people are interested in my content and have full (or partial) copies of the data, then I would think the network performance would improve for new users.

We have considered file-type specific chunkers/deduplication logic. We’ve also considered “transformations” (including compression/decompression) but that quickly gets out of hand and would likely require shipping around content-addressed decoding algorithms.

Deduplicating at the tag/selector level would result in tiny, inefficient blocks so I don’t really see any benefit there.

Then, reconsider it.

IPLD can handle that.

Well, it’s best to add the data to IPFS decompressed; this way others who already have the same file will be able to send the data to other peers as well.

So if you have the ISO for a Linux distribution, you can just add it and it will get the same content ID on your computer as on someone else’s computer, so both can offer the same file to the network.

If you want to conserve disk space, you can use a filesystem which supports compression, like ZFS or BTRFS and store the IPFS data there.

Deduplication, on the other hand, makes sense if you have many similar files, like many different versions of a text document. So when you, for example, save a lot of different versions of the same Wikipedia article, most of the data stays the same between the versions and can be referenced instead of stored twice.

I’ve been thinking of posting a summary of everything related to this. I’m typing on my phone with clumsy big old fingers, so please excuse typos. Correction: thank the lord for discuss.ipfs.io’s magic poster-aware auto-draft saving; I was able to resume editing on my laptop.

IPFS currently supports chunk/block based de-duplication with several different chunkers. This has the following features/limitations (numbered for easy reference in follow-up discussions):

  1. The same file added with different chunkers or chunker parameters will not de-duplicate at all. This could be solved to some degree if merkle-dag entries were indexed by the full hash of the content they pointed to instead of the hash of the merkle-dag node. However, this would be a very invasive change likely to introduce more problems than it solves.

  2. Chunk de-duplication can only de-duplicate identical data in chunk-sized pieces. This means smaller chunks can do more de-duplication by finding smaller duplications, but at some point the per-block overheads will outweigh the de-duplication benefits. It would be good to know what the minimum practical block size is, but it’s definitely larger than a single line of text. Note that this means something as simple as converting a text file from \r\n DOS lines to Unix \n lines will break block de-duping.

  3. The default chunker uses fixed 256K blocks. This will only de-duplicate content within files that exactly aligns with the 256K block boundaries. This pretty much limits it to things like large database files with fixed record sizes or files that are only appended to.

  4. The rabin chunker should be able to produce more de-duplicate-able blocks at arbitrary offsets provided the same parameters are used. The default rabin chunker parameters are min/avg/max 128k/256k/378k, which is very large and thus won’t find any duplication smaller than that. I would guess 16k/32k/64k would be better, depending on per-block overheads (a toy rolling-hash chunker illustrating this follows the list).

  5. Content aware chunkers could do even better at deciding on chunk boundaries than rabin, but at a high complexity cost (content-specific chunkers for each different file type will need to be written/maintained) for probably marginal benefit over rabin with decent parameters. Note that they will still not be able to address 2 and could make 1 even worse (yet more different ways of chunking). It also can’t solve problems like 6 below where the content doesn’t actually have any duplicated chunks.

  6. Compressed files will typically have no duplication for even tiny changes in the compressed content. This can be solved to some extent using things like gzip --rsyncable which minimizes changes in the compressed output for a small compression cost, but that needs to be done when the file is compressed before it is added to IPFS.

  7. Identical content compressed with different compressors or compression arguments will have no de-duplication. Sometimes different compressor versions or even just compression dates can produce different output due to format changes or included timestamps.

  8. The decompressed content of originally identical files can be different for lossy compression like mpeg, jpeg, mp3 etc. This can be addressed for things like music storage by storing only the highest quality versions, fuzzy matching content to de-duplicate identical songs on upload, and transcoding down to the desired quality/format on demand for downloads. Does this sound easy to you?

  9. Content-aware transcoding could be used to automatically convert formats with poor block-de-duplication into different formats with good block-de-duplication, eg files could be uncompressed, tarballs expanded into directories, etc. A top level node could include details necessary to convert the content back to the originally uploaded format, so the same content uploaded in different formats would fully de-duplicate all the content, with only different top-level “format” nodes. Alternatively canonical formats could always be used for contents, and an API provided to request the content transcoded into different formats… eg a directory could be fetched as a tar-file, or tar.gz file. Recursion could handle arbitrarily nested formats. Again, the complexity cost of this is high, with lots of format-specific transcoding support, but the return might be high for some file formats.

  10. Compression is effectively fine-grained, smaller-than-block-size de-duplication, which gets past 2. Note that using 9 to uncompress content to get better block-de-duplication removes the compression’s de-duplication, which might be worse overall. Normally compression only finds duplication within a file, but using a shared compression dictionary is how you can de-duplicate across files. Some advanced compressors have built-in dictionaries for different content types, but in the most general case things like zstd have rich support for building and re-using custom compression dictionaries (see the shared-dictionary sketch after this list). This could be used to do something like add compressed block support, where the merkle-dag could indicate that the blocks have been compressed using a dictionary stored in another file/blocks. It’s better if dictionaries are widely re-used for many files, and exactly how best to build/share/use dictionaries for different content is a broad and complex topic. You could end up making things worse, forcing people to download large dictionaries to uncompress small files.

  11. Compression of blocks can be already implemented on nodes by using a compressing filesystem under IPFS. However, this can only do generic compression and cannot take advantage of content-type information for better compression. It also doesn’t help network transfer of blocks, which is probably more important. Network transfer can benefit from on-the-fly network compression, but it’s pretty ugly when combined with filesystem compression; you uncompress it to read it, only to compress it again to send it, and you do this every time you send it. Implementing block compression within IPFS may be nice to get decent storage, network, and cpu wins. If I was to add this I’d make it another option when adding the file, similar to how you can select the chunker, so that users could make content-type based decisions on the best compression options/dictionaries/etc.

  12. Different file formats and compression are effectively content encoding, and sometimes the particular encoding used matters. Sometimes you really don’t want the file messed with by magic transcoding etc. For example, the entire point of an *.mkv file of a movie might be the finely tuned encoding parameters to give the best compromise between file size and quality. A *.tar.gz file might be an important software snapshot that has been check-summed and signed; see https://github.com/librsync/librsync/issues/146 for an example where github’s re-generating of release archives broke this for many people.
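
To make 4 and 5 concrete, here is a toy content-defined chunker: a simplified buzhash-style rolling hash with illustrative parameters, not the actual rabin/buzhash implementation used by IPFS. Because chunk boundaries depend only on a small sliding window of content, they resynchronise shortly after an insertion, which fixed-size chunking cannot do.

```python
import hashlib
import os
import random

WINDOW = 32                                    # rolling-hash window in bytes
MASK32 = 0xFFFFFFFF
_rng = random.Random(0)                        # deterministic per-byte hash table
TABLE = [_rng.getrandbits(32) for _ in range(256)]

def _rotl(x, n):
    n %= 32
    return ((x << n) | (x >> (32 - n))) & MASK32

def chunk(data, min_size=2048, target_bits=12, max_size=16384):
    """Cut a chunk where the low `target_bits` bits of the rolling hash are zero,
    giving an average chunk of roughly min_size + 2**target_bits bytes."""
    mask = (1 << target_bits) - 1
    chunks, start, h = [], 0, 0
    for i, b in enumerate(data):
        h = _rotl(h, 1) ^ TABLE[b]                        # add the incoming byte
        if i - start >= WINDOW:
            h ^= _rotl(TABLE[data[i - WINDOW]], WINDOW)   # drop the byte leaving the window
        length = i - start + 1
        if (length >= min_size and (h & mask) == 0) or length >= max_size:
            chunks.append(data[start:i + 1])
            start, h = i + 1, 0
    if start < len(data):
        chunks.append(data[start:])
    return chunks

def chunk_ids(data):
    return {hashlib.sha256(c).hexdigest() for c in chunk(data)}

original = os.urandom(1 << 20)                            # 1 MiB of random data
edited = original[:500_000] + b"INSERTED" + original[500_000:]

ids_a, ids_b = chunk_ids(original), chunk_ids(edited)
# Most chunks survive the 8-byte insertion because boundaries resynchronise after it;
# with fixed-size blocks, every block past the insertion point would change.
print(f"{len(ids_a & ids_b)} of {len(ids_a)} chunks unchanged after the insertion")
```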
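
And a sketch of the shared-dictionary idea from 10, assuming the third-party python-zstandard bindings and using a raw-content dictionary for simplicity (nothing here is an existing IPFS feature):

```python
import os
import zstandard as zstd

# One reference version of a file acts as a shared raw-content dictionary.
reference = os.urandom(8192)                    # stand-in for "version 1" of a file
dict_data = zstd.ZstdCompressionDict(reference)

# "Version 2" shares almost all of its bytes with version 1, but nothing is block-aligned.
new_version = reference[:4000] + b"a small edit" + reference[4000:]

plain = zstd.ZstdCompressor().compress(new_version)
with_dict = zstd.ZstdCompressor(dict_data=dict_data).compress(new_version)
assert zstd.ZstdDecompressor(dict_data=dict_data).decompress(with_dict) == new_version

# Without the dictionary the random content barely compresses; with it, the cross-file
# redundancy becomes references into the shared reference data. The catch is that every
# consumer needs that dictionary before it can decompress anything.
print(len(new_version), len(plain), len(with_dict))
```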

Note that almost all of the ideas above to give better de-duplication don’t have to be built into IPFS at all. They could be implemented by adding a content-type aware layer on top of IPFS to do things like transcode to/from canonical formats with good block-de-duplication, and select the optimal chunking settings to use. Given the complexity of many of these, keeping them OUT of IPFS and in their own optional layer where people can experiment with and tune them without breaking IPFS itself is a good idea.

IMHO the most important (and easy) things to focus on within IPFS would be better generic chunkers and default parameters (see 1->4 above) and maybe (harder) block compression (see 11).


Note another way to address 1->4 would be to completely remove the option of specifying a chunker when adding a file, and instead always use a rabin (or buzhash?) chunker with parameters optimized for the sweet spot between de-duplication and per-block overheads. This might be better than letting users break de-duplication by playing with the chunker settings.

You could take a similar approach with adding compression for 11 and just hide a decent fast generic compressor in the implementation.

Of course this removes the ability for a content-type aware layer to optimize the chunking and/or compression, though it could still add its own higher-granularity chunking and better compression on top of them, and it doesn’t prevent content-type aware transcoding, which is probably where the most wins are for that layer anyway.


Note that after reading the FastCDC paper, I want to update recommendations for rabin in 4.

The rabin chunker (and most other chunkers) has an average chunk size of min+avg, not just avg, and the distribution of chunk sizes is an exponential distribution with most chunks smaller than that average. The chunk boundary resynchronization after changes, and thus duplicate detection, works better with smaller min and larger max sizes. For this reason 8K/32K/128K is probably what I’d recommend for better deduping, but it really depends on what the sweet-spot size is for IPFS performance. So 16K/64K/256K might be better for IPFS performance.
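
A quick toy simulation of that claim, with no real chunker involved: every byte past min is treated as a boundary with probability 1/target, which is roughly what a well-behaved rolling hash gives you.

```python
import random

def simulated_chunk_sizes(n, min_size, target, max_size):
    """Model a chunker where each byte past min_size is a boundary with probability 1/target."""
    sizes = []
    for _ in range(n):
        size = min_size
        while size < max_size and random.random() >= 1.0 / target:
            size += 1
        sizes.append(size)
    return sizes

random.seed(0)
# Small numbers for speed; the same shape holds at 128K/256K/384K scale.
sizes = simulated_chunk_sizes(20_000, min_size=128, target=256, max_size=2048)
mean = sum(sizes) / len(sizes)
below = sum(s < mean for s in sizes) / len(sizes)
print(f"mean chunk ~= {mean:.0f} bytes (min + target = {128 + 256}); "
      f"{below:.0%} of chunks are smaller than the mean")
```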

I did some additional testing and analysis of chunker algorithms with surprising results that have completely changed my recommendation on settings for the existing rabin and buzhash chunkers:

Again, it depends on the sweet spot for IPFS chunk sizes, but min/tgt/max of 16K/16K/128K for an average chunk size of 32K should give optimal deduplication and faster chunking. For even faster chunking with some deduplication cost 32K/16K/128K with an average chunk size of 48K will work well. Yes, that is min size 2x as large as the “average size” setting! For larger chunks that might be better for IPFS use 32K/32K/256K (average 64K) or 64K/32K/256K (average 96K) for more speed.

FTR, I’ve updated my rollsum-chunking results to include “Regression Chunking” and the effects of a small max_len setting with lots of truncations.

The conclusion is “RC4” is the best algorithm to use for deduplication if you require max_len <= 2.0x your average length, but it has a complexity/performance cost, and a simple exponential chunker with max_len=4.0x your average length will deduplicate just as well and be faster.

I don’t know if IPFS has a strong preference for tightly constrained chunk sizes, but I suspect not.

Now that I’m venturing into my own VCS project, I’m more interested than ever in file/shared-bytes deduplication.

I have a lot to read here.