This is a rough proposal for deduplicating common bytes across different versions of the same kind of file.
It takes inspiration from git objects, but adds further approaches to ensure the right content parts are organized. The goal is not only to deduplicate data, but also to link the same data when it is represented in different kinds of files.
- store each line of a text, and record the text format of each part (no duplication when comparing text from a 2read saved page against its full HTML site mirror); the same applies to PDF/ePub/MOBI
- use references, for example to record which lines of a single-page HTML contain the same content as a JS/CSS file; this also works for SVG and base64 images
- screenshots from Windows and other systems, keeping identical bytes in shared objects, for example parts of the taskbar and window frame
- different qualities of the same video: version the content inside video files, detect when frames are similar, then store only the diff
- all kinds of compressed files (and partition/disk images), including the formats supported by 7-Zip and Linux archivers
- deb, AUR, RPM and other Linux/BSD packages
- MIDI and 8-bit sounds
- other waveform-based audio/music
- .exe, .msi, .AppImage and other executables
- git packs: treat their content the same as git objects; http://web.archive.org/web/20191205203745/https://github.com/radicle-dev/radicle/issues/689
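The common mechanism behind the ideas above is a content-addressed object store in the style of git: every chunk (here, every line of text) is keyed by its hash, so identical parts shared between files are stored only once. A minimal sketch, where `ObjectStore` and its methods are hypothetical names for illustration:

```python
import hashlib

class ObjectStore:
    """Content-addressed store: each unique chunk is kept once, keyed by its hash."""

    def __init__(self):
        self.objects = {}  # sha256 hex -> chunk bytes

    def put(self, chunk: bytes) -> str:
        digest = hashlib.sha256(chunk).hexdigest()
        self.objects.setdefault(digest, chunk)  # store only if new
        return digest

    def add_text(self, data: bytes) -> list[str]:
        # Store each line as its own object, as the per-line idea above suggests;
        # the file itself is then just an ordered list of hashes.
        return [self.put(line) for line in data.splitlines(keepends=True)]

store = ObjectStore()
a = store.add_text(b"shared header\nunique to A\n")
b = store.add_text(b"shared header\nunique to B\n")
print(len(store.objects))  # 3 objects for 4 lines: the shared line is stored once
```

Real git uses SHA-1 over a typed header plus the full blob, and packs add delta compression on top; this sketch only shows the dedup-by-hash core.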
New idea for dedup/plugz/download.json:
Instead of downloading lots of duplicated AppImages, DEBs and RPMs, extract their internal files and deduplicate them, making each package internally symlink to the shared files. This should also support browser downloads, with an API to get the hashes of a download's internal files and check whether the local device already has them.
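The package idea above can be sketched as hash-then-symlink over an unpacked tree. This is a minimal illustration under the assumption that packages have already been extracted into plain directories (the extractor itself, and the download API, are out of scope); `dedupe_tree` is a hypothetical helper name:

```python
import hashlib
import os
import tempfile

def dedupe_tree(root: str) -> int:
    """Replace duplicate files under `root` with symlinks to the first copy seen.

    Returns the number of files replaced. Sketch only: a real tool would also
    handle permissions, hardlink vs symlink policy, and hash collisions.
    """
    seen = {}  # sha256 hex -> canonical path
    replaced = 0
    for dirpath, _, filenames in os.walk(root):
        for name in sorted(filenames):
            path = os.path.join(dirpath, name)
            if os.path.islink(path):
                continue
            with open(path, "rb") as f:
                digest = hashlib.sha256(f.read()).hexdigest()
            if digest in seen:
                os.remove(path)
                os.symlink(seen[digest], path)  # point duplicate at the shared copy
                replaced += 1
            else:
                seen[digest] = path
    return replaced

# Demo: two "packages" sharing one identical library file.
root = tempfile.mkdtemp()
for pkg in ("app-1.0", "app-1.1"):
    os.makedirs(os.path.join(root, pkg))
    with open(os.path.join(root, pkg, "libshared.so"), "wb") as f:
        f.write(b"identical bytes")
n = dedupe_tree(root)
print(n)  # 1 duplicate replaced with a symlink
```

The same hash table doubles as the "does the local device already have this file?" check for browser downloads: look up the hash before fetching the bytes.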
GitHub issue: https://github.com/ipfs/ipfs/issues/444