Does IPFS provide a block-level file copying feature?

I have a file.tar.gz (~1 GB), which contains a single data.txt file, stored in my IPFS repo. It was pulled from another node, node-a.

I opened data.txt, added a single character at random locations in the file (the beginning, the middle, and the end), compressed it again as file.tar.gz, and stored it in my IPFS repo.

When node-a wants to fetch the updated tar.gz file again, a re-sync will take place.

I want to know whether IPFS will sync the entire 1 GB file, or whether there is a way to sync only the parts of the file that changed (the delta).

A similar question was asked about Google Drive: Does Google Drive sync entire file again on making small changes?

The feature you are asking is called “Block-Level File Copying”. With
this feature, when you make a change to a file, rather than copying
the entire file from your hard drive to the cloud server again, only
the parts of the file that changed (called the delta) get sent.

As far as I know, Dropbox, pCloud, and OneDrive (which, however, only supports it for Microsoft Office documents) offer block-level sync.

No, IPFS doesn’t provide this feature.

(Exception: for an uncompressed text file where the modification is at the end, only the last block will be modified.)

It has to be solved at the chunker level in IPLD, probably by chunking differently depending on the format (for example, chunking each file separately for tarballs, each paragraph for long articles, images and text separately for PDFs, etc.).

This is tricky because there are many cases. You might want to play with the parameters to optimize for your use case, but you might lose out on deduplication if you don’t use the standard settings.

The IPFS team is considering adding that eventually, but it’s not a priority.

@Akita: Some developers say yes and some say no, so I am really confused at this moment.

But as I understand from your comment, if I change a character anywhere except at the end of the 1 GB file that is already stored in the ipfs-repo, the complete 1 GB has to be re-downloaded all over again.

If you have only 1 file, what the chunker does is split it into 256 KiB blocks (the default chunk size), hash each block, put these hashes into a list, and hash that list; the result becomes the hash of the file.
When you request a file, you first request the hash list using the “hash of the file” (the hash of the list, really), look at which blocks you are missing, and request those missing blocks.
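
A toy sketch of that scheme, for illustration only. It assumes plain SHA-256 hex digests and a flat hash list, whereas real IPFS uses CIDs and a Merkle DAG; `chunk_hashes`, `root_hash`, and `missing_blocks` are hypothetical names, not IPFS APIs, and the chunk size is an assumed default.

```python
import hashlib

CHUNK_SIZE = 256 * 1024  # assumed default fixed chunk size; pass another size if needed


def chunk_hashes(data, size=CHUNK_SIZE):
    """Split data into fixed-size blocks and return the hash of each block, in order."""
    return [
        hashlib.sha256(data[i:i + size]).hexdigest()
        for i in range(0, len(data), size)
    ]


def root_hash(hashes):
    """Hash the list of block hashes; this stands in for the 'hash of the file'."""
    return hashlib.sha256("".join(hashes).encode()).hexdigest()


def missing_blocks(wanted, have):
    """Return the block hashes from `wanted` that are not already stored locally."""
    have = set(have)
    return [h for h in wanted if h not in have]
```

A client that already holds the old version would call `missing_blocks(new_hashes, old_hashes)` and fetch only what comes back.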

In a text file, if you change a character, only that block changes, because the chunker still splits the file at the same places. The client will request the new hash for the list, get sent the hash list and say “Oh, I already have most of them! I will only request the one I lack”. You are lucky and only request 256 KiB \O/.

But if:

  • you add or remove a character somewhere in the text file, the chunker will split the file at the same positions… up to that point, so that block and all subsequent blocks will have different hashes. It’s better than downloading everything, but worse than downloading only one block.
  • you compress the file, any change will likely change the bytes of the compressed output from the point of the change onward, so a change near the beginning means you will have to retrieve practically the whole file, because the chunking happens on the compressed file (for now). To date, IPFS sees any file as a string of bytes; it doesn’t understand that it would be smarter to decompress the file and chunk wisely. If you compress and then modify anything, it’s game over: expect to download everything.
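
To see the three cases side by side, here is a rough experiment, not a measurement of IPFS itself. It assumes fixed 1024-byte blocks so the numbers stay small; `changed_blocks` is a hypothetical helper, and `gzip.compress(..., mtime=0)` needs Python 3.8 or later.

```python
import gzip
import hashlib


def changed_blocks(old, new, size=1024):
    """Count fixed-size blocks of `new` whose hash is not among the blocks of `old`."""
    old_set = {
        hashlib.sha256(old[i:i + size]).digest()
        for i in range(0, len(old), size)
    }
    return sum(
        1
        for i in range(0, len(new), size)
        if hashlib.sha256(new[i:i + size]).digest() not in old_set
    )


# Sample text with non-repeating lines, so shifted blocks cannot match by accident.
text = "".join(f"record {i}\n" for i in range(20000)).encode()
mid = len(text) // 2

overwritten = text[:mid] + b"X" + text[mid + 1:]  # same length, one byte replaced
inserted = text[:mid] + b"X" + text[mid:]         # one byte longer, tail shifted

results = {
    "overwrite": changed_blocks(text, overwritten),
    "insert": changed_blocks(text, inserted),
    # mtime=0 keeps the gzip header identical between the two runs.
    "gzip": changed_blocks(gzip.compress(text, mtime=0),
                           gzip.compress(inserted, mtime=0)),
}
print(results)
```

The overwrite touches a single block, while the insert and the compressed variant invalidate many more, which is the behaviour described in the two bullets.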

So my bad, the above sentence should be:
(Exception: for an uncompressed text file where the modification is at the end, only the last block will be modified; the same goes for an uncompressed text file where you only change characters in place (no deleting, no adding). But these are very specific cases.)

@Akita: As I understand, if I delete a single character at the beginning of a file, the hashes of all 256 KiB blocks will change, so the updated file has to be downloaded in full.

But if I change a single character in place, only the 256 KiB block containing it has to be downloaded.

Theoretically, on an uncompressed, unencrypted .txt or .csv file, yes.

But I wouldn’t count on it, since many things can defeat that: some non-ASCII characters are encoded over several bytes (in UTF-8, for example), so replacing one character can change the file length. Maybe the chunker is not exactly the same from one version of IPFS to another (it should have the same output, but who knows), etc.

But keep in mind that changing a single character is an edge case and I wouldn’t build anything on top of that. The takeaway is that IPFS does not provide block-level copy yet.

And you mentioned compression: this will defeat the “hack” we’re discussing.

Thanks for the clear explanation @Akita:

To clarify: since IPFS does not provide block-level copy yet, I should assume that when a file is updated (a character added), IPFS will re-download every chunk from the first modified location onward, because the change affects the hashes of all subsequent chunks.

If I submit a folder and only one file is updated while the other files are untouched, IPFS will only pull the updated chunks of the updated file, and there will be no sync operation for the untouched files. But if I compress that folder, the updated file will change the hashes of all subsequent chunks from that file’s location inside the tar until the end of the tar file, which means downloading some of the untouched files’ chunks as well.

I was considering compression because I observe that, with IPFS, compressed files are pulled much faster than folders.

If there are hundreds of files under a directory, for example under .git, it takes much longer to pull the complete folder.

Afaik, this will eventually be solved by IPLD graphsync.

Currently when retrieving directories, IPFS will (at a minimum) first ask the remote peer for the directory description block (1-RTT), then ask for the description blocks of each file (2-RTT), then ask for the content blocks of each file (3-RTT). This will of course send a copy of the just-received hash list (as want-list) back to the peer we just received it from and greatly increases latency for no good reason.

(The above is slightly simplified for the case of large file/directory description blocks that involve sharding and of course does not include nested directories.)

The idea of graphsync is of course to instead ask the remote peer to send the contents of this directory description block and everything below it. That should then give you speeds comparable to downloading a single large archive file.

Right now the only thing you could do to work around this behaviour in good faith would be to upload the data as an uncompressed (or per-file compressed) archive and implement a custom chunker that splits the file at archive entries.


I also just had a really easy idea that should produce acceptable results at the cost of some extra size: Archive all data as plain .tar (no compression!) and upload it with

ipfs add --cid-version=1 --raw-leaves --chunker=size-512 [FILENAME]

The trick here is that entries in .tar archives are always aligned at 512-byte boundaries (classic I/O operation size); entries that do not fill their last 512-byte block are padded with NUL bytes. This way all archive blocks for unchanged files will remain the same and only the modified files will have to be synced.

I have not tested this and did not estimate the size overhead incurred, but it might just be what you need. In any case, let me know if anything comes out of this.
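
A rough way to check the alignment idea offline, without IPFS, could look like the sketch below. It assumes SHA-256 over raw 512-byte blocks as a stand-in for IPFS block hashes, and it pins the tar metadata (mtime, ustar format) so headers of untouched files stay byte-identical; `make_tar` and `block_hashes` are hypothetical helper names.

```python
import hashlib
import io
import tarfile


def make_tar(files):
    """Build an uncompressed tar in memory from a {name: bytes} mapping."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w", format=tarfile.USTAR_FORMAT) as tar:
        for name, data in files.items():
            info = tarfile.TarInfo(name=name)
            info.size = len(data)
            info.mtime = 0  # fixed, so unchanged entries keep identical headers
            tar.addfile(info, io.BytesIO(data))
    return buf.getvalue()


def block_hashes(data, size=512):
    """Return the set of hashes of all fixed-size blocks in data."""
    return {
        hashlib.sha256(data[i:i + size]).hexdigest()
        for i in range(0, len(data), size)
    }


def lines(start, stop):
    """Generate non-repeating sample content."""
    return "".join(f"line {i}\n" for i in range(start, stop)).encode()


old_files = {"a.txt": lines(0, 300), "b.txt": lines(300, 400), "c.txt": lines(400, 800)}
new_files = dict(old_files)
new_files["b.txt"] = b"X" + old_files["b.txt"]  # insert one byte at the start of b.txt

old_blocks = block_hashes(make_tar(old_files))
new_blocks = block_hashes(make_tar(new_files))
missing = new_blocks - old_blocks  # blocks a peer with the old archive would still need
print(len(missing), "of", len(new_blocks), "distinct blocks need to be fetched")
```

Only the header and data blocks of b.txt should show up as missing; the blocks of a.txt and c.txt are reused even though b.txt grew.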

Use IPFS rabin chunking and gzip --rsyncable.

This should result in most blocks being unchanged for small changes in the files inside the tar.gz. You may need to do some experimenting and tuning to get the best rabin chunking settings to work well with gzip --rsyncable.

It’s arguable if this is better than using rabin chunking with a raw .tar file (uncompressed), or even just the raw files. You sacrifice some compression at the start, but all the deltas after that will be much smaller.
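
To illustrate why content-defined chunking helps with insertions, here is a minimal sketch. Note that it uses a simple gear-style rolling hash rather than Rabin fingerprints, so it is not what IPFS's rabin chunker does internally; `cdc_chunks`, `new_chunk_count`, the mask, and the minimum size are all assumptions chosen for the demo.

```python
import hashlib

# Deterministic pseudo-random 32-bit value for every possible byte.
GEAR = [
    int.from_bytes(hashlib.sha256(bytes([b])).digest()[:4], "big")
    for b in range(256)
]


def cdc_chunks(data, mask=0xFFC00000, min_size=64):
    """Cut where the top bits of a rolling hash over the last 32 bytes are zero."""
    chunks, start, h = [], 0, 0
    for i, byte in enumerate(data):
        # Older bytes shift out of the 32-bit window, so h depends on recent content only.
        h = ((h << 1) + GEAR[byte]) & 0xFFFFFFFF
        if i + 1 - start >= min_size and (h & mask) == 0:
            chunks.append(data[start:i + 1])
            start, h = i + 1, 0
    if start < len(data):
        chunks.append(data[start:])
    return chunks


def new_chunk_count(old, new):
    """Count chunks of `new` whose hash does not occur among the chunks of `old`."""
    known = {hashlib.sha256(c).digest() for c in cdc_chunks(old)}
    return sum(1 for c in cdc_chunks(new) if hashlib.sha256(c).digest() not in known)
```

Because cut points follow the content instead of fixed offsets, the boundaries after an inserted byte line up with the old ones again, so `new_chunk_count(old, new)` stays small for a single insertion.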