(draft) Common Bytes - standard for data deduplication

This is a standard proposal for deduplicating common bytes across different versions of the same kind of file.
It takes inspiration from git objects, but goes further to make sure the right parts of the content are identified and organized. The goal is not only to deduplicate data, but also to link the same data when it is represented in different kinds of files.

  • store each line of a text, and know the text format of each part (no duplication when comparing the text of a 2read saved page with its full HTML site mirror); the same applies to PDF/ePub/MOBI (a minimal sketch of this line-level idea follows the list)
  • use references, for example to record which lines of a single-page HTML file carry the same content as a JS/CSS file; this also works for SVG and base64 images
  • Windows and other screenshots, keeping identical bytes in shared objects, for example parts of the taskbar and window frame
  • different qualities of the same video; version the content inside video files, detect when frames are similar, and store diffs between them
  • all kinds of compressed files (and partition/disk images), including the formats supported by 7-Zip and Linux archivers
  • deb, AUR, RPM and other Linux/BSD packages
  • MIDI and 8-bit sounds
  • other wave-based audio/music
  • .exe, .msi, .AppImage and other executables
  • git packs: treat their content the same as git objects; http://web.archive.org/web/20191205203745/https://github.com/radicle-dev/radicle/issues/689
    new for dedup/plugz/download.json:
    instead of downloading lots of duplicated AppImages, DEBs and RPMs, extract their internal files and deduplicate them. Make those packages internally symlink the common files. This should also support browser downloads, with an API to get the hashes of the internal files and check whether the local device already has them.
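
To make the first bullet concrete, here is a minimal sketch of line-level deduplication. This is not an existing tool or format; the ObjectStore and Manifest names are purely illustrative. Each line is stored once under its content hash, and a file becomes an ordered list of hashes, so two pages from the same site share the stored objects for every identical line:

```go
// Sketch only: store each unique line once under its SHA-256 hash and
// describe a file as an ordered list of those hashes.
package main

import (
	"bufio"
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"strings"
)

// ObjectStore maps content hash -> bytes, so identical parts are stored once.
type ObjectStore map[string][]byte

// Manifest is the ordered list of part hashes that reconstructs one file.
type Manifest []string

// AddText splits text into lines, stores each unique line, and returns the
// manifest that describes the original text.
func (s ObjectStore) AddText(text string) Manifest {
	var m Manifest
	sc := bufio.NewScanner(strings.NewReader(text))
	for sc.Scan() {
		line := sc.Bytes()
		sum := sha256.Sum256(line)
		key := hex.EncodeToString(sum[:])
		if _, ok := s[key]; !ok {
			s[key] = append([]byte(nil), line...) // copy: Scanner reuses its buffer
		}
		m = append(m, key)
	}
	return m
}

func main() {
	store := ObjectStore{}
	a := store.AddText("<header>Site</header>\n<p>shared nav</p>\n<p>page one</p>")
	b := store.AddText("<header>Site</header>\n<p>shared nav</p>\n<p>page two</p>")
	// Six manifest entries, but only four unique objects are stored.
	fmt.Println("manifest sizes:", len(a), len(b), "unique objects:", len(store))
}
```

The same shape would apply to the package bullet above: replace “line of text” with “internal file of a .deb/.rpm/.AppImage”.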

GitHub issue (the more up-to-date version of this proposal is maintained there): https://github.com/ipfs/ipfs/issues/444

@stebalien, you said on GitHub (https://github.com/ipfs/go-ipfs/issues/6815#issuecomment-573562865) that IPFS already has deduplication like git.
Then why isn’t Pinata up to date? Uploading pages saved with pagesaver made it reach the storage limit quickly, without deduplicating common bytes, even though the pages come from the same site and only differ as separate .html files.

IPFS does block-based deduplication (where blocks are 256KiB by default). It will deduplicate whole files or common chunks of large files but it won’t deduplicate small parts of files.
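
To see why, here is a small self-contained sketch of fixed-size chunking (an illustration only, with SHA-256 per chunk; it is not the actual go-ipfs code): an identical copy of a file shares every chunk, but inserting a single byte near the start shifts every later chunk boundary, so the chunk hashes stop matching:

```go
// Sketch only: fixed-size chunking with one SHA-256 hash per chunk.
package main

import (
	"crypto/sha256"
	"fmt"
	"math/rand"
)

// chunkHashes splits data into fixed-size chunks and returns the set of chunk hashes.
func chunkHashes(data []byte, size int) map[[32]byte]bool {
	hashes := make(map[[32]byte]bool)
	for off := 0; off < len(data); off += size {
		end := off + size
		if end > len(data) {
			end = len(data)
		}
		hashes[sha256.Sum256(data[off:end])] = true
	}
	return hashes
}

// shared counts chunk hashes present in both sets.
func shared(a, b map[[32]byte]bool) int {
	n := 0
	for h := range a {
		if b[h] {
			n++
		}
	}
	return n
}

func main() {
	const chunk = 256 << 10 // 256 KiB, the go-ipfs default chunk size
	original := make([]byte, 4<<20)
	rand.New(rand.NewSource(1)).Read(original) // deterministic sample data

	identical := append([]byte(nil), original...) // exact copy of the file
	edited := append([]byte{0x42}, original...)   // one byte inserted at the front

	a := chunkHashes(original, chunk)
	fmt.Println("chunks in original:", len(a))
	fmt.Println("shared with identical copy:", shared(a, chunkHashes(identical, chunk))) // all of them
	fmt.Println("shared after 1-byte insert:", shared(a, chunkHashes(edited, chunk)))    // none
}
```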

GIT does something in addition to deduplication: it stores delta objects in a pack. IPFS doesn’t do that. Unfortunately, that technique only really works on versioned files where (a) the user ends up downloading the entire history and (b) there’s a clear history/relationship between these files.
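
As a rough illustration of what a delta object buys, here is a toy prefix/suffix delta; this is not git’s actual pack/delta format, just the general shape of the technique. Version B is stored as a reference to version A plus instructions to rebuild B from it:

```go
// Sketch only: a trivial prefix/suffix delta, not git's pack format.
package main

import "fmt"

// Delta rebuilds a new version from a base: keep Prefix bytes of the base,
// then insert Middle, then keep the last Suffix bytes of the base.
type Delta struct {
	Prefix, Suffix int
	Middle         []byte
}

// MakeDelta computes the trivial prefix/suffix delta from base to target.
func MakeDelta(base, target []byte) Delta {
	p := 0
	for p < len(base) && p < len(target) && base[p] == target[p] {
		p++
	}
	s := 0
	for s < len(base)-p && s < len(target)-p && base[len(base)-1-s] == target[len(target)-1-s] {
		s++
	}
	return Delta{Prefix: p, Suffix: s, Middle: append([]byte(nil), target[p:len(target)-s]...)}
}

// Apply rebuilds the target from the base and the delta.
func (d Delta) Apply(base []byte) []byte {
	out := append([]byte(nil), base[:d.Prefix]...)
	out = append(out, d.Middle...)
	return append(out, base[len(base)-d.Suffix:]...)
}

func main() {
	v1 := []byte("<html><body><p>hello world</p></body></html>")
	v2 := []byte("<html><body><p>hello there, world</p></body></html>")
	d := MakeDelta(v1, v2)
	// The delta stores a few changed bytes plus two small integers instead of all of v2.
	fmt.Printf("delta payload: %d bytes, full copy would be %d bytes\n", len(d.Middle), len(v2))
	fmt.Println("rebuilt correctly:", string(d.Apply(v1)) == string(v2))
}
```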

@stebalien, and what about the Pinata case?

What do you think about Common Bytes, which deduplicates parts of known file types, including compressed and binary formats?

For example, taking the content of <script> and <style> tags, which will be the same as their .js and .css counterparts.

This has probably already been answered elsewhere - noob here.

I’m pretty new to ipfs. What is the relationship between de-duplication and replication? With something like BitTorrent, the more people who have a file, or even just a part of a file, the more seeders there are and the better things perform for a new person who wants that file. I know a lot of effort has gone into de-duplication, but I wonder if there is an over-emphasis in that regard. If I’m paying for folks to keep copies of my data, for example, it is in my interest to compress the data before storing it in ipfs (less space --> less money). If, on the other hand, people are interested in my content and have full (or partial) copies of the data, then I would think the network performance would improve for new users.

We have considered file-type specific chunkers/deduplication logic. We’ve also considered “transformations” (including compression/decompression) but that quickly gets out of hand and would likely require shipping around content-addressed decoding algorithms.
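
For concreteness, here is a rough sketch of what a registry of file-type-specific chunkers could look like. The interface and names are illustrative assumptions, not the actual go-ipfs chunker API; the idea is that a chunker registered for a format cuts the stream at format-aware boundaries (CSS rules, tar entries, PNG chunks) instead of fixed 256 KiB offsets, so blocks line up across files that embed the same sub-content:

```go
// Sketch only: illustrative names, not the actual go-ipfs chunker API.
package main

import (
	"bufio"
	"fmt"
	"io"
	"strings"
)

// Chunker returns the next format-aware chunk, or io.EOF when the stream ends.
type Chunker interface {
	NextChunk() ([]byte, error)
}

// Factory builds a chunker for one concrete input stream.
type Factory func(r io.Reader) Chunker

// registry maps a file-type hint (extension or MIME type) to a chunker factory.
var registry = map[string]Factory{}

// Register installs a format-aware chunker for a file-type hint.
func Register(hint string, f Factory) { registry[hint] = f }

// ChunkerFor picks a registered format-aware chunker; a real system would fall
// back to the default fixed-size (or Rabin) chunker when the hint is unknown.
func ChunkerFor(hint string, r io.Reader) Chunker {
	if f, ok := registry[hint]; ok {
		return f(r)
	}
	return nil
}

// cssChunker cuts a CSS stream after each '}' so that every rule becomes its
// own chunk and identical rules line up across stylesheets.
type cssChunker struct{ r *bufio.Reader }

func (c *cssChunker) NextChunk() ([]byte, error) { return c.r.ReadBytes('}') }

func main() {
	Register("text/css", func(r io.Reader) Chunker {
		return &cssChunker{r: bufio.NewReader(r)}
	})

	css := "body{margin:0}nav{color:red}footer{color:red}"
	ch := ChunkerFor("text/css", strings.NewReader(css))
	for {
		part, err := ch.NextChunk()
		if len(part) > 0 {
			fmt.Printf("chunk: %q\n", part)
		}
		if err != nil { // io.EOF once the stream is exhausted
			break
		}
	}
}
```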

Deduplicating at the tag/selector level would result in tiny, inefficient blocks so I don’t really see any benefit there.
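
A back-of-envelope sketch of that overhead argument (the sizes below are approximate assumptions, not measured values): a raw binary CIDv1 with a sha2-256 multihash is about 36 bytes, and each dag-pb link adds some framing on top, so referencing a deduplicated block that only holds a few dozen bytes of CSS or HTML costs roughly as much in metadata as it saves in content:

```go
// Sketch only: approximate sizes for the metadata needed to reference a block.
package main

import "fmt"

func main() {
	const cidBytes = 36.0   // binary CIDv1 with sha2-256 multihash (1+1+1+1+32 bytes)
	const linkFraming = 8.0 // rough assumption for per-link protobuf framing
	for _, partSize := range []float64{30, 100, 1000, 256 * 1024} {
		overhead := (cidBytes + linkFraming) / partSize
		fmt.Printf("part of %7.0f bytes -> reference overhead ~%.1f%%\n", partSize, overhead*100)
	}
}
```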