Efficiency of IPFS for sharing an updated file

For example, userA adds a 1 GB file with ipfs add file.txt, and userB fetches that file into his storage through IPFS. Later userA realizes there is a mistake, changes a single character in the file, and wants to share the updated version with userB. At this point, what is the best way to share the updated file via IPFS without re-sharing the whole file?

So userA adds the same file to IPFS again via ipfs add file, and userB has to fetch the whole 1 GB file instead of just the single changed character. Is there a better approach, where userB pulls only the changed part, the way git pull works?

Git has a much better approach; please see (https://stackoverflow.com/a/8198276/2402577). Does IPFS use delta compression for storage (https://gist.github.com/matthewmccullough/2695758) like Git, or a similar approach?

They will not fetch the full 1GB, only the block that contains the changed character

I think you are wrong!

Further investigation:

I did a small experiment.
First I added a 1 GB file to IPFS. Later, I updated a single line in the file, which had already been shared via IPFS. I observed that userA pushes the complete 1 GB file all over again instead of pushing only the block that contains the changed data. That is very expensive and time-consuming in my opinion. I then shared the hash of the updated file, and again the complete file was downloaded via IPFS on userB instead of only the block that contains the changed character.

  • Step 1:

userA

$ fallocate -l 1G gentoo_root.img
$ ipfs add gentoo_root.img
 920.75 MB / 1024.00 MB [========================================>----]  89.92%
added QmdiETTY5fiwTkJeERbWAbPKtzcyjzMEJTJJosrqo2qKNm gentoo_root.img

userB

$ ipfs get QmdiETTY5fiwTkJeERbWAbPKtzcyjzMEJTJJosrqo2qKNm
Saving file(s) to QmdiETTY5fiwTkJeERbWAbPKtzcyjzMEJTJJosrqo2qKNm
 1.00 GB / 1.00 GB [==================================] 100.00% 49s

  • Step 2:

userA

$ echo 'hello' >> gentoo_root.img
$ ipfs add gentoo_root.img   # HERE the node pushes the 1 GB file into IPFS again. It took me 1 hour, instead of pushing only the changed block.
32.75 MB / 1.00 GB [=>---------------------------------------]   3.20% 1h3m34s
added Qmew8yVjNzs2r54Ti6R64W9psxYFd16X3yNY28gZS4YeM3 gentoo_root.img

userB

# HERE the complete 1 GB file is downloaded all over again.
$ ipfs get Qmew8yVjNzs2r54Ti6R64W9psxYFd16X3yNY28gZS4YeM3
[sudo] password for alper:
Saving file(s) to Qmew8yVjNzs2r54Ti6R64W9psxYFd16X3yNY28gZS4YeM3
 1.00 GB / 1.00 GB [=========================] 100.00% 45s

[Q] At this point, what is the best way to share the updated file via IPFS so that only the updated blocks are transferred, rather than the whole file?


In addition, on the same node, whenever I run ipfs cat <hash> it keeps downloading the same hash all over again.

$ ipfs cat Qmew8yVjNzs2r54Ti6R64W9psxYFd16X3yNY28gZS4YeM3
 212.46 MB / 1.00 GB [===========>---------------------------------------------]  20.75% 1m48s

$ ipfs cat Qmew8yVjNzs2r54Ti6R64W9psxYFd16X3yNY28gZS4YeM3
 212.46 MB / 1.00 GB [===========>---------------------------------------------]  20.75% 1m48s

ipfs add of course has to chunk-and-hash the whole file again, since it cannot know only one chunk is changed until after it has done that.
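To illustrate the point above, here is a minimal Python sketch of fixed-size chunk-and-hash. It is not IPFS's actual implementation: the function names are hypothetical, SHA-256 over raw chunks stands in for real block hashing, and the 256 KiB chunk size is an assumption taken from the default mentioned later in this thread. It shows that every chunk of the new version has to be hashed before anyone can tell that only one chunk is new.

```python
import hashlib

# Assumed fixed chunk size (256 KiB), as mentioned later in the thread.
CHUNK_SIZE = 256 * 1024


def chunk_hashes(data: bytes, chunk_size: int = CHUNK_SIZE) -> list[str]:
    """Split data into fixed-size chunks and return each chunk's SHA-256 hex digest."""
    return [
        hashlib.sha256(data[i:i + chunk_size]).hexdigest()
        for i in range(0, len(data), chunk_size)
    ]


def changed_chunks(old: bytes, new: bytes) -> list[int]:
    """Indices of chunks in `new` that differ from the chunk at the same index in `old`."""
    old_hashes = chunk_hashes(old)
    new_hashes = chunk_hashes(new)  # every chunk of the new version is hashed here
    return [
        i for i, h in enumerate(new_hashes)
        if i >= len(old_hashes) or h != old_hashes[i]
    ]


# Toy stand-in for the file: 4 chunks with distinct content (not the real 1 GB image).
original = b"".join(bytes([n]) * CHUNK_SIZE for n in range(4))
updated = original + b"hello\n"  # same kind of edit as `echo 'hello' >> file`

print(len(chunk_hashes(updated)))         # 5 chunks were hashed
print(changed_chunks(original, updated))  # [4]: only the appended chunk is new
```

The hashing cost is paid for the whole file, but only the chunks reported as new would need to be stored or sent.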

You never pin on userB. This is probably not an issue (since you probably did not run GC during your test), but in theory it could be.

Running ipfs get / cat always has to extract the data from your local datastore, at least, which may take some time. That doesn’t mean it’s “downloading” again. To download without extracting, I suggest pinning the first and then the second hash as your test.

Sorry, I got lost: where should I do the pinning? Could you give an example?

Should userB pin the object after he gets the first hash?

Something like:

$ ipfs get QmdiETTY5fiwTkJeERbWAbPKtzcyjzMEJTJJosrqo2qKNm
$ sudo ipfs pin add QmdiETTY5fiwTkJeERbWAbPKtzcyjzMEJTJJosrqo2qKNm
pinned QmdiETTY5fiwTkJeERbWAbPKtzcyjzMEJTJJosrqo2qKNm recursively

After I pinned the first hash and then tried to download the second version, it again downloaded the complete file.

$ ipfs get Qmew8yVjNzs2r54Ti6R64W9psxYFd16X3yNY28gZS4YeM3
Saving file(s) to Qmew8yVjNzs2r54Ti6R64W9psxYFd16X3yNY28gZS4YeM3
 227.00 MB / 1.00 GB [=============>---------------------------------------------]  22.16% 21s

Just because ipfs get writes out a 1GB file, that doesn’t mean it needed to redownload all of it. Some of it could be cached. Running ipfs get or ipfs cat necessarily outputs the entire contents of the file, to the local filesystem or to stdout respectively.

Looking at the two hashes you listed (Qmew8yVjNzs2r54Ti6R64W9psxYFd16X3yNY28gZS4YeM3 and QmdiETTY5fiwTkJeERbWAbPKtzcyjzMEJTJJosrqo2qKNm), most of the file is the same and wouldn’t need to be retransmitted between nodes (instead it would be read from cache). Because so much of the file is duplicated, it should also take much less than 1GB to store in the IPFS repo due to the built-in deduplication.

The main goal of using ipfs pin is to keep the files in the cache, right? So the reason I need ipfs pin is to prevent the IPFS garbage collector from removing that file?

But on userB, whenever I run ipfs get <hash>, it keeps doing this:

Saving file(s) to Qmew8yVjNzs2r54Ti6R64W9psxYFd16X3yNY28gZS4YeM3
 302.49 MB / 1.00 GB [=================>-----------------------------------------]  29.53% 33s

As I understand it, it just uncompresses the file from the local IPFS repository rather than downloading it from the network, since the hash’s blocks are stored in the cache?

=> To clarify, the duplicated file takes much less space since some of its sections are already stored in the IPFS repo and do not need to be downloaded again, right? @leerspace

Yes, but garbage collection doesn’t run automatically by default. And I don’t see where you’re running it in your test.

Unless the datastore implementation uses compression, it’s not even decompressing the files. It’s reconstructing the entire file from the blocks in the datastore and retrieving from other nodes any blocks that aren’t cached locally.

Within IPFS’ repository it should only store the parts of the file that changed. However, ipfs get doesn’t only write out the parts of the file that changed; it writes out the whole file even if it didn’t necessarily need to retrieve it.
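The distinction between what is stored and what is written out can be sketched with a toy content-addressed block store. Everything here is an assumption for illustration: `BlockStore` is a hypothetical in-memory class, not the real IPFS datastore, and the 4-byte chunk size only keeps the example readable.

```python
import hashlib


class BlockStore:
    """Hypothetical in-memory content-addressed store (not the real IPFS datastore)."""

    def __init__(self, chunk_size: int = 4):
        self.chunk_size = chunk_size
        self.blocks: dict[str, bytes] = {}

    def add(self, data: bytes) -> list[str]:
        """Store data as chunks keyed by their hash; return the ordered chunk keys."""
        keys = []
        for i in range(0, len(data), self.chunk_size):
            chunk = data[i:i + self.chunk_size]
            key = hashlib.sha256(chunk).hexdigest()
            self.blocks.setdefault(key, chunk)  # identical chunks are stored only once
            keys.append(key)
        return keys

    def get(self, keys: list[str]) -> bytes:
        """Reassemble the whole file from its chunks, changed or not."""
        return b"".join(self.blocks[k] for k in keys)

    def stored_bytes(self) -> int:
        """Total size of the unique chunks held by the store."""
        return sum(len(b) for b in self.blocks.values())


store = BlockStore(chunk_size=4)
v1 = store.add(b"AAAABBBBCCCC")
size_after_v1 = store.stored_bytes()
v2 = store.add(b"AAAABBBBCCCD")  # one character changed in the last chunk
size_after_v2 = store.stored_bytes()

print(size_after_v1, size_after_v2)  # 12 16: the store grew by one chunk only
print(store.get(v2))                 # b'AAAABBBBCCCD': the full file is still written out
```

In this sketch the second version costs one extra chunk of storage, while `get` still produces the complete 12 bytes, which mirrors why a full-size progress bar does not by itself indicate a full download.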

If you want to test how efficient IPFS is at sharing updated files, I would suggest instead looking at the ipfs repo stat output after ipfs pin add-ing incrementally different files and observing how the repository size changes. You could also look at the network traffic, but there’s some background traffic not directly related to transferring the file that will likely inflate those numbers.

I was personally under the impression that you have to redownload the whole file, which it seems like your tests show to be the case. This is something that the Bluzelle project is trying to get around (database file storage so you don’t need to upload/download entire datasets written in one file just to make a change).

If there is a way to update only a part of a file that would be very interesting.

@dennisonb I was under the same impression, but based on the others’ comments I believe only the changed block is downloaded (I have not tested it).

@leerspace I tried your suggestion and observed the following; both steps show the same increase in repo size:

First I created a 100 MB file (file.txt).

NumObjects: 5303
RepoSize:   181351841
StorageMax: 10000000000
RepoPath:   /home/alper/.ipfs
Version:    fs-repo@6

   $ ipfs add file.txt
   added QmZ33LSByGsKQS8YRW4yKjXLUam2cPP2V2g4PVPVwymY16 file.txt
   $ ipfs pin add QmZ33LSByGsKQS8YRW4yKjXLUam2cPP2V2g4PVPVwymY16

Here the number of objects increased by 4, and the repo size changed by 37983 bytes.

NumObjects: 5307
RepoSize:   181389824
StorageMax: 10000000000
RepoPath:   /home/alper/.ipfs
Version:    fs-repo@6

Then I ran echo 'a' >> file.txt followed by ipfs add file.txt.

Here I observe that the number of objects increased by 4 more, so it added the complete file; the repo size changed by 38823 bytes.

NumObjects: 5311
RepoSize:   181428647
StorageMax: 10000000000
RepoPath:   /home/alper/.ipfs
Version:    fs-repo@6

I’ve found the NumObjects counter to be more variable since I’ve seen that counter change just from having the daemon running – which leads me to think there are some other things getting stored in the repo (maybe related to DHT or peer lists?).

Unless I’m missing something, it seems like your test confirms what I said. Note that RepoSize is in bytes, so in your test you started with a 181 MB repo.

You added a 100MB file and your repo size increased 37.9KB, which seems pretty great to me. I’m guessing that before you even added the file, most of it was already in your repo.

You then modify the file and add that to your repo and it increases again about 38KB (still <<100MB). This also seems pretty great. Your test seems to confirm that your node only needs to transfer a very small portion of the updated file.
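The deltas discussed above can be recomputed from the three ipfs repo stat snapshots quoted earlier in the thread. The sketch below assumes the `key: value` text layout shown in those snapshots; `parse_repo_stat` is a hypothetical helper, not part of any IPFS tooling.

```python
def parse_repo_stat(text: str) -> dict[str, str]:
    """Parse `key: value` lines laid out like the repo stat output quoted in this thread."""
    fields = {}
    for line in text.strip().splitlines():
        key, _, value = line.partition(":")
        fields[key.strip()] = value.strip()
    return fields


# The three snapshots quoted in the thread (only the fields used here).
snapshots = [
    "NumObjects: 5303\nRepoSize:   181351841",  # before adding file.txt
    "NumObjects: 5307\nRepoSize:   181389824",  # after the first add
    "NumObjects: 5311\nRepoSize:   181428647",  # after appending and re-adding
]

stats = [parse_repo_stat(s) for s in snapshots]
sizes = [int(s["RepoSize"]) for s in stats]
objects = [int(s["NumObjects"]) for s in stats]

size_deltas = [after - before for before, after in zip(sizes, sizes[1:])]
object_deltas = [after - before for before, after in zip(objects, objects[1:])]

print(size_deltas)    # [37983, 38823] bytes
print(object_deltas)  # [4, 4]
```

Both steps grow the repo by roughly 38 KB, which is the observation the rest of the thread tries to explain.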

If you started doing inserts into the file instead of appending updates to the end it might throw off the chunking algorithm, leading to blocks not being deduplicated like they apparently are in your test. However, I’d expect simply appending data like what you’re doing to result in great deduplication and more efficient file updates.
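A small Python sketch of why inserts and appends behave differently under a fixed-size chunker. This is an illustration under assumptions, not IPFS code: the 64-byte chunk size is a toy value, the test data is synthetic, and `chunk_hash_set` is a hypothetical helper.

```python
import hashlib

# Toy chunk size for illustration; far smaller than a real chunker would use.
CHUNK_SIZE = 64


def chunk_hash_set(data: bytes, chunk_size: int = CHUNK_SIZE) -> set[str]:
    """Hashes of all fixed-size chunks of data."""
    return {
        hashlib.sha256(data[i:i + chunk_size]).hexdigest()
        for i in range(0, len(data), chunk_size)
    }


# Deterministic pseudo-random data: 32 digests of 32 bytes each = 1024 bytes = 16 chunks.
base = b"".join(hashlib.sha256(str(i).encode()).digest() for i in range(32))

appended = base + b"a\n"  # edit at the end: existing chunk boundaries stay put
inserted = b"a" + base    # edit at the start: every boundary shifts by one byte

base_chunks = chunk_hash_set(base)
reused_after_append = len(base_chunks & chunk_hash_set(appended))
reused_after_insert = len(base_chunks & chunk_hash_set(inserted))

print(len(base_chunks), reused_after_append, reused_after_insert)
```

After the append, every original chunk is still present and can be deduplicated. After the one-byte insert at the front, the chunk contents are all shifted, so few or none of the original chunks are expected to match. A content-defined chunker such as rabin is designed to limit that effect.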

I forgot to mention how I created the 100 MB file. Since I created it from random data, I don’t think I had added the same file before, but maybe dd’s output is not that random.

dd if=/dev/urandom of=file.txt bs=2048 count=10

I do not understand why the repo size increased by only 37.9 KB instead of 100 MB when I added a new file.

But the point I don’t understand is this: when a new file is added and then its updated version is added, the repo size increases by the same amount both times. Shouldn’t the increase for the updated version be much smaller?

One observation: when I run ipfs repo stat multiple times, the repo size decreases and increases. It fluctuates around the same values.

I don’t understand that either. I’m guessing the command in your last post is not the exact one you used, because the dd command in your last post will produce a file that is only 20KiB in size (instead of 100MB).
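The size arithmetic behind that remark, as a short Python check. It assumes dd reads every block in full, so the output size is simply bs times count; treating "100 MB" as 100 MiB for the second calculation is also an assumption, and `dd_output_bytes` is a hypothetical helper.

```python
def dd_output_bytes(bs: int, count: int) -> int:
    """Bytes written by dd when every block is read in full: bs * count."""
    return bs * count


size = dd_output_bytes(bs=2048, count=10)
print(size, size / 1024)  # 20480 bytes, 20.0 KiB

# Assumption: "100 MB" is taken as 100 MiB. Block count needed with the same bs:
count_for_100_mib = (100 * 1024 * 1024) // 2048
print(count_for_100_mib)  # 51200
```

So the quoted command yields a 20 KiB file, and a count of about 51200 would be needed at bs=2048 to reach 100 MiB.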

The chunking algorithm’s output block size is larger than the single character changes in your examples. Apparently the default is a fixed-size chunker (though rabin chunking is an option, which is what I had been assuming was the default) with a fixed block size of 256 KiB. In which case the ~38KB differences you’re seeing between updates don’t really make sense either. I’m inclined to think that there’s an issue with the testing methodology.
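As a rough aid for reasoning about the numbers, the sketch below computes what a fixed-size chunker with 256 KiB blocks would be expected to produce for a given file size, and how many bytes of leaf data would be new after an append. It is a simplified model under stated assumptions: it ignores the parent nodes that link the chunks and any datastore overhead, and both function names are hypothetical.

```python
CHUNK_SIZE = 256 * 1024  # the fixed block size mentioned above


def chunk_layout(file_size: int, chunk_size: int = CHUNK_SIZE) -> tuple[int, int]:
    """Return (number of chunks, size of the last chunk) for a fixed-size chunker."""
    if file_size == 0:
        return (0, 0)
    full, tail = divmod(file_size, chunk_size)
    if tail == 0:
        return (full, chunk_size)
    return (full + 1, tail)


def new_leaf_bytes_after_append(old_size: int, appended: int,
                                chunk_size: int = CHUNK_SIZE) -> int:
    """Bytes of leaf data in chunks that did not exist before the append.

    Chunks that were already full are untouched; a partial last chunk is
    rewritten together with the appended bytes.
    """
    unchanged_full_chunks = old_size // chunk_size
    return old_size + appended - unchanged_full_chunks * chunk_size


print(chunk_layout(100 * 1024 * 1024))  # (400, 262144) for a 100 MiB file
print(chunk_layout(100_000_000))        # (382, 123136) for a 100 * 10**6 byte file
print(chunk_layout(20480))              # (1, 20480) for the 20 KiB dd output

# Appending the 2 bytes written by `echo 'a' >> file.txt`:
print(new_leaf_bytes_after_append(100 * 1024 * 1024, 2))  # 2
print(new_leaf_bytes_after_append(100_000_000, 2))        # 123138
print(new_leaf_bytes_after_append(20480, 2))              # 20482
```

Under this model the growth after an append depends on how full the last chunk was, so comparing a measured repo size delta against these figures is one way to sanity-check a test setup.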

If you do this when the daemon isn’t running it will open the repo which can create some metadata that can accumulate and eventually be cleaned up. When the daemon is running there seem to be some background objects that get stored in the repo that can cause the stats to fluctuate.