Does IPFS provide a block-level file copying feature?

I have a file.tar.gz (~1 GB) that contains a single data.txt file. It is stored in my IPFS repo and was pulled from another node, node-a.

I opened data.txt, added a single character at random locations in the file (the beginning, the middle, and the end), compressed it again as file.tar.gz, and stored it in my IPFS repo.

When node-a wants to get the updated tar.gz file again, a re-sync will take place.

I want to know whether IPFS will sync the entire 1 GB file, or whether there is a way to sync only the parts of the file that changed (the delta).


A similar question was asked about Google Drive: Does Google Drive sync entire file again on making small changes?

The feature you are asking is called “Block-Level File Copying”. With
this feature, when you make a change to a file, rather than copying
the entire file from your hard drive to the cloud server again, only
the parts of the file that changed (called the delta) get sent.

As far as I know, Dropbox, pCloud, and OneDrive (which, however, only supports it for Microsoft Office documents) offer block-level sync.

No, IPFS doesn’t provide this feature.

(Exception, for uncompressed text file where the modification is at the end, only the last block will be modified.)

It has to be solved at the chunker level in IPLD, probably by chunking differently depending on the format (for example chunk each file separately for tarballs, each paragraph for long articles, images and text separately for PDFs, etc.)

This is tricky because there are many cases. You might want to play with the chunker parameters to optimize for your use case, but you might lose out on deduplication if you don’t use the standard settings.

The IPFS team is considering adding that eventually, but it’s not a priority.

@Akita: Some developers say yes and some say no, so I am really confused at this moment.


But as I understand from your comment, if I change a character anywhere except at the end of the 1 GB file that is already stored in the ipfs-repo, the complete 1 GB has to be downloaded all over again.

If you have only 1 file, what the chunker does is split it into 256 KB blocks (the default chunk size), hash each block, put these hashes into a list, and hash that list; this becomes the hash of the file.
When you request a file, you first request the hash list using the “hash of the file” (the hash of the list, really), check which blocks you are missing, and request only these missing blocks.

In a text file, if you change a character in place, only the block containing it changes, because the chunker still splits the file at the same positions. The client will request the new hash of the list, get sent the hash list, and say “Oh, I already have most of them! I will only request the one I lack”. You are lucky and only request 256 KB \O/.
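The hash-list scheme described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: plain SHA-256 stands in for IPFS's real CIDs and Merkle DAG nodes, the 256 KiB block size is assumed, and the function names (`hash_list`, `file_hash`, `blocks_to_request`) are made up for this example.

```python
import hashlib

BLOCK_SIZE = 256 * 1024  # assumption: fixed-size chunks of 256 KiB


def split_blocks(data, size=BLOCK_SIZE):
    """Split data into fixed-size blocks (the last one may be shorter)."""
    return [data[i:i + size] for i in range(0, len(data), size)]


def hash_list(data, size=BLOCK_SIZE):
    """Hash every block; the resulting list describes the whole file."""
    return [hashlib.sha256(b).hexdigest() for b in split_blocks(data, size)]


def file_hash(hashes):
    """The 'hash of the file' is really the hash of the hash list."""
    return hashlib.sha256("\n".join(hashes).encode()).hexdigest()


def blocks_to_request(local_hashes, remote_hashes):
    """Blocks named in the remote list that we do not already hold."""
    have = set(local_hashes)
    return [h for h in remote_hashes if h not in have]


# A 1 MiB stand-in file, then the same file with one byte changed in place.
old = bytes(i % 251 for i in range(4 * BLOCK_SIZE))
new = bytearray(old)
new[BLOCK_SIZE + 10] ^= 0xFF  # overwrite one byte inside the second block
new = bytes(new)

old_list, new_list = hash_list(old), hash_list(new)
wanted = blocks_to_request(old_list, new_list)

print("file hash changed:", file_hash(old_list) != file_hash(new_list))
print("blocks to request:", len(wanted), "of", len(new_list))
```

The file hash changes because the list changed, but only the one block that was touched has to be fetched.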

But. If :

  • you add or remove a character somewhere in the text file, the chunker will split the file at the same positions… up to that point, so this block and all subsequent blocks will have different hashes. It’s better than downloading everything, but worse than downloading only one block
  • you compress the file, any change will likely change bits in every block of the compressed file, so you will have to retrieve the whole file :confused: because the chunking happens on the compressed file (for now). To date, IPFS sees any file as a string of bytes; it doesn’t understand that it would be smarter to decompress the file and chunk wisely. If you compress and make any modification, it’s game over: you will have to download everything.
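To make the difference between overwriting and inserting concrete, here is a small Python sketch using fixed-size blocks. It is an assumption-laden toy: the block size is shrunk to 1 KiB so the output is easy to read, SHA-256 stands in for IPFS block hashes, and `demo_bytes` and `changed_blocks` are hypothetical helpers for this example only.

```python
import hashlib

SIZE = 1024  # toy block size; assumed, much smaller than a real chunk


def demo_bytes(n):
    """Deterministic, non-repeating filler data (stand-in for a real file)."""
    out = bytearray()
    counter = 0
    while len(out) < n:
        out += hashlib.sha256(counter.to_bytes(8, "big")).digest()
        counter += 1
    return bytes(out[:n])


def block_hashes(data, size=SIZE):
    """Hash each fixed-size block of data."""
    return [hashlib.sha256(data[i:i + size]).hexdigest()
            for i in range(0, len(data), size)]


def changed_blocks(old, new, size=SIZE):
    """Indices of blocks in `new` whose hash is not present in `old`."""
    have = set(block_hashes(old, size))
    return [i for i, h in enumerate(block_hashes(new, size)) if h not in have]


old = demo_bytes(8 * SIZE)
pos = 3 * SIZE + 100                   # a position inside block 3
other = bytes([old[pos] ^ 0xFF])       # a byte guaranteed to differ

overwritten = old[:pos] + other + old[pos + 1:]  # same length
inserted = old[:pos] + other + old[pos:]         # one byte longer

print("overwrite:", changed_blocks(old, overwritten))
print("insert:   ", changed_blocks(old, inserted))
```

Overwriting touches a single block, while the insertion shifts every later byte by one, so every block from the insertion point to the end gets a new hash.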

So my bad, the above sentence should be:
(Exception, for uncompressed text file where the modification is at the end, only the last block will be modified; or uncompressed text file where you just change characters (not deleting, not adding). But these are very specific cases.)

@Akita: As I understand it, if I delete a single character at the beginning of a file, the hashes of all the 256 KB blocks will be altered, so the updated file has to be downloaded in full.

But if I change a single character in place, only the 256 KB block that contains it has to be downloaded.

Theoretically, on an uncompressed, unencrypted .txt or .csv file, yes.

But I wouldn’t count on it since many things can defeat that: some characters (anything outside ASCII in UTF-8, for example) are encoded over several bytes, so replacing one character can change the file length. Maybe the chunker is not exactly the same from one version of IPFS to another (it should have the same output, but who knows), etc.

But keep in mind that changing a single character is an edge case and I wouldn’t build anything on top of that. The takeaway is that IPFS does not provide block-level copy yet.

And you mentioned compression: this will defeat the “hack” we’re discussing.

Thanks for the clear explanation @Akita:

To clarify: since IPFS does not provide block-level copy yet, I should assume that when a file is updated (a character added), IPFS will re-download every chunk from the first updated location onward, because the update affects the hashes of all subsequent chunks.

If I submit a folder and only one file is updated while the other files are untouched, IPFS will pull only the updated chunks of the updated file, and there will be no sync operation for the untouched files. But if I compress that folder, the updated file will change the hashes of the subsequent chunks, from that file’s location inside the tar to the end of the tar file, which will lead to downloading some of the untouched files’ data/chunks as well.


I was considering compression because I observed that, with IPFS, compressed files are pulled much faster than folders.

If there are hundreds of files under a directory, for example under .git, it takes longer to pull the complete folder.

Afaik, this will eventually be solved by IPLD graphsync.

Currently when retrieving directories, IPFS will (at a minimum) first ask the remote peer for the directory description block (1-RTT), then ask for the description blocks of each file (2-RTT), then ask for the content blocks of each file (3-RTT). Each step of course sends a copy of the just-received hash list (as a want-list) back to the peer we just received it from, which greatly increases latency for no good reason.

(The above is slightly simplified for the case of large file/directory description blocks that involve sharding and of course does not include nested directories.)

The idea of graphsync is of course to instead ask the remote peer to send the contents of this directory description block and everything below it. That should then give you speeds comparable to downloading a single large archive file.

Right now the only thing you could do to work around this behaviour in good faith would be to upload the data as an uncompressed (or per-file compressed) archive and implement a custom chunker that splits the file at archive entries.

However…


I also just had a really easy idea that should produce acceptable results at the cost of some extra size: Archive all data as plain .tar (no compression!) and upload it with

ipfs add --cid-version=1 --raw-leaves --chunker=size-512 [FILENAME]

The trick here is that entries in .tar archives are always aligned at 512-byte boundaries (the classic I/O operation size); for entries whose size is not a multiple of 512 bytes, the remaining space is filled up with NUL bytes. This way all archive blocks for unchanged files will remain the same and only the modified files will have to be synced.

I have not tested this and have not estimated the size overhead incurred, but it might just be what you need. Let me know in any case about anything that comes out of this. :slightly_smiling_face:
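The alignment argument can be checked offline with Python's `tarfile` module before involving IPFS at all. The sketch below is a rough simulation, not a test of `ipfs add` itself: it builds two in-memory tar archives, splits them into 512-byte blocks, and counts how many blocks of the new archive are absent from the old one. The file names, the filler data, and the helpers `demo_bytes` and `make_tar` are invented for the example, and SHA-256 stands in for IPFS block hashes.

```python
import hashlib
import io
import tarfile


def demo_bytes(seed, n):
    """Deterministic, non-repeating filler data (stand-in for file contents)."""
    out = bytearray()
    counter = 0
    while len(out) < n:
        out += hashlib.sha256(seed + counter.to_bytes(8, "big")).digest()
        counter += 1
    return bytes(out[:n])


def make_tar(files):
    """Build an uncompressed tar in memory with fixed metadata (mtime=0)."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w", format=tarfile.USTAR_FORMAT) as tar:
        for name, data in files.items():
            info = tarfile.TarInfo(name)
            info.size = len(data)
            info.mtime = 0  # keep headers identical between runs
            tar.addfile(info, io.BytesIO(data))
    return buf.getvalue()


def blocks(data, size=512):
    """Split data into 512-byte blocks, mirroring --chunker=size-512."""
    return [data[i:i + size] for i in range(0, len(data), size)]


files = {
    "a.txt": demo_bytes(b"a", 2000),
    "b.txt": demo_bytes(b"b", 2000),
    "c.txt": demo_bytes(b"c", 2000),
}
old_tar = make_tar(files)

# Insert one byte at the start of b.txt; a.txt and c.txt stay untouched.
updated = dict(files)
updated["b.txt"] = b"!" + files["b.txt"]
new_tar = make_tar(updated)

have = {hashlib.sha256(b).digest() for b in blocks(old_tar)}
new_blocks = blocks(new_tar)
to_fetch = [b for b in new_blocks if hashlib.sha256(b).digest() not in have]

print("blocks to fetch:", len(to_fetch), "of", len(new_blocks))
```

Only the header and data blocks of the modified entry should need fetching. Note that real archives also carry timestamps and other metadata in each header, so a file whose mtime changes gets a new header block even if its contents do not.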

Use IPFS rabin chunking and gzip --rsyncable.

This should result in most blocks being unchanged for small changes in the files inside the tar.gz. You may need to do some experimenting and tuning to get the best rabin chunking settings to work well with gzip --rsyncable.

It’s arguable if this is better than using rabin chunking with a raw .tar file (uncompressed), or even just the raw files. You sacrifice some compression at the start, but all the deltas after that will be much smaller.
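For readers unfamiliar with why rabin (content-defined) chunking survives insertions, here is a toy Python sketch. It is explicitly not IPFS's Rabin fingerprint implementation: a CRC32 over a small sliding window is swapped in as the boundary test, there are no min/max chunk limits, and the window and average sizes are arbitrary assumptions chosen to keep the demo fast.

```python
import hashlib
import zlib

WINDOW = 16   # bytes of context that decide a boundary (assumed)
AVG = 512     # target average chunk size, must be a power of two (assumed)


def demo_bytes(n):
    """Deterministic, non-repeating filler data (stand-in for a real file)."""
    out = bytearray()
    counter = 0
    while len(out) < n:
        out += hashlib.sha256(counter.to_bytes(8, "big")).digest()
        counter += 1
    return bytes(out[:n])


def cdc_chunks(data, window=WINDOW, avg=AVG):
    """Cut wherever the hash of the last `window` bytes hits a fixed pattern.

    Boundaries depend only on nearby content, not on absolute offsets,
    so they re-align shortly after an insertion or deletion.
    """
    chunks, start = [], 0
    for end in range(window, len(data) + 1):
        if zlib.crc32(data[end - window:end]) & (avg - 1) == 0:
            chunks.append(data[start:end])
            start = end
    if start < len(data):
        chunks.append(data[start:])
    return chunks


old = demo_bytes(128 * 1024)
mid = len(old) // 2
new = old[:mid] + b"X" + old[mid:]  # insert a single byte in the middle

old_hashes = {hashlib.sha256(c).digest() for c in cdc_chunks(old)}
new_chunks = cdc_chunks(new)
changed = [c for c in new_chunks if hashlib.sha256(c).digest() not in old_hashes]

print("chunks to fetch:", len(changed), "of", len(new_chunks))
```

With fixed-size blocks the same insertion would invalidate every block after the midpoint; here only the chunks near the inserted byte differ. Whether this carries over to a tar.gz depends on the compressor resetting regularly, which is what gzip --rsyncable is for.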

@dbaarda

I have not yet tested creating the tar.gz with gzip --rsyncable and adding it with ipfs add -s rabin to get any deduplication. How small are the changes we are talking about? In my case the changes could be 1 MB, 100 MB, or even 1 GB, so the size varies. Would this approach work for changes of, let’s say, > 1 GB?

The first thing to check is that your gzip has a working --rsyncable option. Some details here:

https://superuser.com/questions/653057/caveats-with-gzip-rsyncable-no-speedup

The next thing to check is that the tar files actually have any common data in the first place. You could do this by using rdiff or xdelta to calculate a patch between the uncompressed tar files. The size difference between the patch and the new file is how much duplicate data there is. Using xdelta will tell you the maximum amount of duplication that could be found (it finds pretty much optimal deltas). Playing with the block size for rdiff can show how much the duplicate detection depends on the block size; smaller blocks will find more/smaller identical sections, larger blocks will be cheaper to calculate. In particular, the largest block size that still finds some duplication (patch file smaller than new file) will tell you the size of the largest contiguous block of duplicate data. If there is no identical data in the raw tar files, then gzip --rsyncable will not help.
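If rdiff or xdelta are not at hand, the effect of block size on duplicate detection can be approximated with a few lines of Python. This is a deliberately naive stand-in and not how rdiff works internally (rdiff uses a rolling checksum plus a strong hash; this sketch simply re-hashes at every offset, which is only viable for small inputs). The function name `matched_bytes` and the sample data are assumptions for illustration.

```python
import hashlib


def demo_bytes(n):
    """Deterministic, non-repeating filler data (stand-in for a tar file)."""
    out = bytearray()
    counter = 0
    while len(out) < n:
        out += hashlib.sha256(counter.to_bytes(8, "big")).digest()
        counter += 1
    return bytes(out[:n])


def matched_bytes(old, new, block_size):
    """Bytes of `new` covered by blocks that also exist, aligned, in `old`.

    `old` is indexed in aligned blocks; `new` is scanned at every offset,
    jumping a whole block ahead after each match.
    """
    index = {
        hashlib.sha256(old[i:i + block_size]).digest()
        for i in range(0, len(old) - block_size + 1, block_size)
    }
    matched, i = 0, 0
    while i + block_size <= len(new):
        if hashlib.sha256(new[i:i + block_size]).digest() in index:
            matched += block_size
            i += block_size
        else:
            i += 1
    return matched


old = demo_bytes(16 * 1024)
new = old[:5000] + b"X" + old[5000:]  # one inserted byte

for size in (256, 1024):
    found = matched_bytes(old, new, size)
    print(f"block size {size}: {found} of {len(new)} bytes matched")
```

Running it with several block sizes shows the trade-off described above: smaller blocks recover more of the shared data around the edit, at the cost of more hashing.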

After that, repeat the experiment on the compressed tar files using gzip --rsyncable. I checked the gzip-rsyncable patch and it appears to reset the compressor every input “chunksize” of min 4KB, avg 8KB, and apparently no maximum. Assuming a compression ratio of 50%, this means the output should have identical sections at least 2KB, on average 4KB, long. This suggests that a librsync block size of 2KB (which is the default) should be the best compromise between finding duplicate data and computation cost. Larger librsync block sizes will not find as many small duplicates, but will be cheaper to calculate.

Finally, assuming above experiments showed the tar.gz files have any duplication at all, you need to tune your rabin chunk size to find it. The largest librsync block size that still finds duplicate data gives you the largest rabin chunk size that could ever find any duplication at all, and you want a chunk size less than half that size to have any chance of finding it. The smaller you set it, the more duplication it can find, but the more blocks you end up with. At some point the overheads of having more small blocks will override the benefits of finding duplicates of those small blocks.

The ipfs default chunker is a static 256KB block size, and this is also the average rabin chunk size if not explicitly specified. The default is usually indicative of a sweet-spot in terms of block size vs number of blocks, though for static blocks this sweet spot is usually the largest practical block size, so for rabin the sweet spot is almost certainly smaller, probably at least half that. If I had to guess I’d say find the block size in librsync that gives you diminishing returns in terms of finding duplicates and use an average rabin chunk size half that to find the most duplication. If that size is significantly less than say 16KB you may find the number of blocks you end up with makes it not worth it. If the largest block size librsync can still find duplicates with is <16KB then rabin chunking is probably not worth it.

TLDR, check if the tar.gz files have any duplication worth finding. Then use --chunker=rabin-min-avg-max where avg is half the best librsync block size, min is 1/4 avg, and max is 4*avg, or just try --chunker=rabin-4096-16384-65536 and see if it helps.

Actually, thinking about this a little more, the best rabin chunk avg size should be the same as the best librsync block size, not half the librsync block size. Using the same size should find similar amounts of duplicate data.

Both librsync and rabin chunking can only guarantee finding duplicate blocks if the block/chunk size is less than half the size of the contiguous duplicate data (due to alignment of the block/chunk within that duplicate data).
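As a small convenience, the ratio rule from the TLDR (min is 1/4 of avg, max is 4 times avg), combined with the later correction that avg should equal the best librsync block size, can be turned into a chunker string. The helper name `rabin_chunker_arg` is hypothetical; the only thing assumed about IPFS is the `rabin-min-avg-max` format accepted by the `--chunker` option of `ipfs add`.

```python
def rabin_chunker_arg(best_block_size):
    """Build a rabin chunker string from the best librsync block size.

    Follows the rule of thumb in this thread: avg equals the block size,
    min is a quarter of avg, and max is four times avg.
    """
    avg = int(best_block_size)
    if avg < 4:
        raise ValueError("block size too small to derive min/avg/max")
    return f"rabin-{avg // 4}-{avg}-{avg * 4}"


# Example: a best block size of 16 KiB gives the values suggested above.
print("ipfs add --chunker=" + rabin_chunker_arg(16384) + " file.tar.gz")
```

Treat the result as a starting point for experiments rather than a tuned setting; the right average depends on how much duplication your own files actually contain.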