Disk space consumption in IPFS

Preface

I’ve been thinking about what disk space consumption means in IPFS and how it differs from disk space consumption on a normal file system. I decided to share my thoughts here in the hopes that someone will help fill in gaps in my current knowledge, but also to explore whether or not…

  • We can make disk space consumption easier for users such as myself to understand (given the extra data consumed by IPLD, UnixFS, de-duplication or not, general differences from what a user might expect on their local filesystem, etc.)
  • We can compute disk space consumption more efficiently at runtime where necessary
  • We can precompute disk space consumption and save it somewhere in IPLD (maybe we already do this in part and just need to handle deduplicated size?)

If it makes sense I will gladly move this to GitHub, but I wasn’t positive I should start there with something that feels more open-ended.

Thank you in advance for your help. :pray:

Details

If you visit https://ipfs.io/ipfs/bafybeidv23lhcox2mcys63nhhd5g7fyitotfb4wikcpfsuszpaeczayjfm you can see that the disk space displayed in the upper right is 1.4 PB. I believe this size ultimately comes from the sum of BlockStats.Size() for each block without deduplication.

For convenience, here is what I believe to be the related code.

This large size appears in other places in the code, such as CumulativeSize, which is included in ipfs files stat bafybeidv23lhcox2mcys63nhhd5g7fyitotfb4wikcpfsuszpaeczayjfm output.

Related code…

If you run the same CID through dag stat (ipfs dag stat bafybeidv23lhcox2mcys63nhhd5g7fyitotfb4wikcpfsuszpaeczayjfm) you will (eventually) see a much smaller size of about 4.6 GB.

Related code…
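If I understand this correctly, the gap between the two numbers mostly comes down to whether a block that is referenced from many places is counted once per reference or only once. Here is a minimal sketch of the two tallies (the `node` type and the example CIDs are hypothetical stand-ins, not the real go-ipfs types):

```go
package main

import "fmt"

// Hypothetical, simplified model of a DAG node: a CID, the size of the node's
// own block, and the nodes it links to. The real commands walk IPLD nodes
// fetched from the blockstore instead.
type node struct {
	cid   string
	size  uint64
	links []*node
}

// cumulativeSize counts every block once per reference, which (as far as I can
// tell) is the spirit of the CumulativeSize shown by `ipfs files stat` and the
// gateway header: a block shared by many files is counted every time.
func cumulativeSize(n *node) uint64 {
	total := n.size
	for _, child := range n.links {
		total += cumulativeSize(child)
	}
	return total
}

// dedupedSize counts each unique CID exactly once, which is closer to what
// `ipfs dag stat` reports after walking the whole DAG.
func dedupedSize(n *node, seen map[string]bool) uint64 {
	if seen[n.cid] {
		return 0
	}
	seen[n.cid] = true
	total := n.size
	for _, child := range n.links {
		total += dedupedSize(child, seen)
	}
	return total
}

func main() {
	// A tiny DAG in which the same leaf block is referenced twice.
	leaf := &node{cid: "bafy-leaf", size: 256 * 1024}
	root := &node{cid: "bafy-root", size: 1024, links: []*node{leaf, leaf}}

	fmt.Println("cumulative:", cumulativeSize(root))                 // leaf counted twice
	fmt.Println("deduped:   ", dedupedSize(root, map[string]bool{})) // leaf counted once
}
```

(The real numbers presumably also include per-block framing and protobuf overhead, which is part of why “which size?” gets murky.)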

Other places that return different and/or confusing sizes can be seen, for example, in Archive size wrong · Issue #5690 · ipfs/go-ipfs · GitHub, where progress numbers aren’t in sync with each other, leading to progress higher than 100%.

Questions

Noticing the difference in the sizes that were returned above, I’ve been asking myself what I think should be returned instead, and I feel like I’m not very clear on any answer. My confusion is related to how IPFS and IPLD change my assumptions about the world.

  • How many different answers are there to the question “How much disk space is consumed by this CID in IPFS?” and who would want to know each answer?
    • With or without dupes?
      • Dupes at the file level or block level?
    • Raw data before or after encoding in IPLD?
      • How many raw bytes would be needed for a directory? Isn’t this completely dependent on how you represent a directory, which by definition is tied to UnixFS?
    • Any other details related to how the data is encoded, CBOR vs DAG-PB, …?
  • What does disk space consumption mean in IPFS? Consumed on disk on the computer where the IPFS implementation is running? Disk consumption in the blockstore (is that different)? The amount of data that is meaningful to the user outside the context of IPFS and IPLD (is that the raw data portion of the IPLD data)?
  • Is there a meaningful way to compare any of these disk space numbers with what I see locally on my OS’s filesystem? That seems like something users would naturally try in order to compare and understand.
  • Are any disk space metrics stored directly in IPLD and are therefore part of the data itself?
    • If yes, is that where the cumulative size numbers are stored currently, or is that something that is saved separately somewhere in the blockstore as blocks are downloaded to a node?
  • Whether or not this kind of metadata is part of the IPLD data hashed as a CID, is it possible to add additional metadata, like perhaps a cumulative size with duplicates skipped, to make something like ipfs dag stat more performant?
  • Given how confusing disk space feels to me currently, is this an area where there is opportunity to both educate IPFS users on what disk space might mean and also to come up with possible standards for how to present disk space to users in consistent and more meaningful ways, perhaps changing based on context?

That’s a rather messy brain dump. I’m going to share as is to see if anyone has thoughts on the topic. Again, thank you. :pray:


I was thinking exactly this before reaching the end of your post.

I think a first cheap and effective step to better reason about all this is for PL to standardize a name for each size: file size, deduplicated file size, rich size (with the metadata), rich deduped size, pinset size, pinset deduped size, retrievable file size relative to your pinset (the space you would gain by deleting the CID, taking into account that some blocks may also belong to another CID that you still want pinned), etc., or some similar names.

Having a name for each concept can help everybody wrap their head around them and realise sooner that the first question to ask is: “What metric am I actually interested in for my use case?”

(Also, having API calls for each size could simplify code, I guess? I’m not familiar with the details of the implementations’ codebases, though.)
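To make that concrete, here is a sketch of what such a shared vocabulary could look like in code. All names below are hypothetical, taken from the list above rather than from any existing IPFS API:

```go
package sizes

// SizeKind is a hypothetical enumeration of the size metrics named above.
// None of these identifiers exist in any IPFS implementation today; the point
// is only that each concept gets exactly one agreed-upon name.
type SizeKind int

const (
	FileSize            SizeKind = iota // raw file bytes, duplicate blocks counted each time
	DedupedFileSize                     // raw file bytes, each block counted once
	RichSize                            // file bytes plus IPLD/UnixFS metadata
	RichDedupedSize                     // metadata included, blocks deduplicated
	PinsetSize                          // everything reachable from the pinset
	PinsetDedupedSize                   // pinset, shared blocks counted once
	RetrievableFileSize                 // relative to the pinset: space gained by deleting this CID
)

// A size-reporting API could then read, for example:
//   Size(ctx context.Context, c cid.Cid, kind SizeKind) (uint64, error)
```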


I’ll add to the messy brain dump. I find working with deduplication in IPFS difficult beyond the very straightforward case: add the same file (with the same hasher and chunker) and you get the same blocks. Using the standard chunker you could get some deduplication, although it’s not that likely. It’s difficult to tell how much deduplication has occurred after adding content. If you want better deduplication by using a different chunker, you basically have to add the content a second time and compare just to see whether it was helpful.

I’ve also found there are almost two notions of deduplication, maybe even three. You can have deduplication within the content that’s being added. I make the distinction here because at this point you can still control the hasher/chunker, since you can always add the content again with new ones. Then there’s deduplication against data that might already be on your node. I make this distinction because you can’t really tell apart deduplication against data that was already on your node, possibly from other content, and deduplication that occurred within the content that was just added. The third one would be deduplication that can take place because the content is already available on the network. It’s a weird one because it’s like duplicate deduplicated data: it’s on the network multiple times but has a single reference, the CID.

I kind of wonder what the value of a content-based chunker is if the same content can so easily be added a second time in a way that results in two completely separate copies. I can measure the first type of deduplication (the internal type), but I can’t really know the kind of deduplication that may occur when more content is added.
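For what it’s worth, the comparison you describe (did re-adding with a different chunker actually share anything?) can at least be sketched: collect CID → block size for each add (e.g. from a recursive DAG walk, or from `ipfs refs -r` plus per-block stats) and intersect the two sets. The helper below is a hypothetical illustration, not an existing API:

```go
package main

import "fmt"

// blockSizes maps block CID -> block size for one version of the added
// content. How the map is obtained is left out here: conceptually it is the
// result of a recursive DAG walk (e.g. the CIDs listed by `ipfs refs -r`
// together with their block sizes).
type blockSizes map[string]uint64

// sharedBytes returns how many bytes worth of blocks the two DAGs have in
// common, i.e. how much the second add actually deduplicated against the first.
func sharedBytes(a, b blockSizes) uint64 {
	var shared uint64
	for cid, size := range a {
		if _, ok := b[cid]; ok {
			shared += size
		}
	}
	return shared
}

func main() {
	// Hypothetical block maps for the same file added with two different
	// chunkers; only "bafyB" happens to be produced by both.
	defaultChunker := blockSizes{"bafyA": 262144, "bafyB": 262144, "bafyC": 1024}
	otherChunker := blockSizes{"bafyD": 131072, "bafyB": 262144, "bafyE": 2048}

	fmt.Println("bytes shared between the two adds:", sharedBytes(defaultChunker, otherChunker))
}
```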

I’ve also been wondering if deduplication in the datastore is always a good idea. Sure, there is the concept of deduplication across IPFS by CID, but that doesn’t mean the data needs to be deduplicated in the datastore. I can think of some cases where you might say, “I’ve got plenty of storage that I’d be happy to trade for better access times.”


Hello,

You slipped here, unfortunately.

The size is summed from directory entries. These directories and nodes come from ipfs/go-ipfs-files (an old files library; please migrate to `github.com/ipfs/go-libipfs/files` instead). Directory and Node are interfaces: the actual type behind them is unixfs (called from here, which arrives at https://github.com/ipfs/go-unixfs/blob/v0.4.0/file/unixfile.go#L153).

In terms of directories, the underlying type would be ufsDirectory (https://github.com/ipfs/go-unixfs/blob/59752aec6306c2ca2d9a020a2a9556d5f8bce956/file/unixfile.go#L19). This has a Size() which is the sum of the sizes of the links (at least for basicDirectories). That is the size of the data of the unixfs directory node itself.

The gateway wants to show the sum of the sizes of the files and folders in the directory. The size of a unixfs folder would be counted as I mentioned just above. The rest would be unixfs files, which are of type ufsFile, which implements the Size() method by calling FileSize().

FileSize() eventually reads the FileSize field of the unixfs node, a field that does not require any computation, as it comes already hardcoded in the unixfs file protobuf.

As you know, a big file will be made of many blocks glued together in a MerkleDAG. The FileSize is set during the construction of the DAG and contains the sum of the sizes of the blocks (https://github.com/ipfs/go-unixfs/blob/59752aec6306c2ca2d9a020a2a9556d5f8bce956/importer/balanced/builder.go#L157-L170).
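A rough sketch of that construction step, with hypothetical types standing in for the real builder and protobuf (the real unixfs node carries a filesize field and a list of child block sizes): each parent accumulates its children’s sizes as they are linked, so by the time the root is written its total is already stored and FileSize() is just a field read.

```go
package main

import "fmt"

// Hypothetical, heavily simplified stand-in for a unixfs file node. The real
// protobuf carries (among other fields) a filesize and a list of blocksizes;
// this sketch only models how filesize gets filled in while the DAG is built.
type fileNode struct {
	data       []byte      // leaf payload; empty for internal nodes
	children   []*fileNode // links to the layer below
	blocksizes []uint64    // filesize of each child, recorded on the parent
	filesize   uint64      // total file bytes reachable from this node
}

// newLeaf wraps one chunk of the original file.
func newLeaf(chunk []byte) *fileNode {
	return &fileNode{data: chunk, filesize: uint64(len(chunk))}
}

// addChild mimics what the builder does when linking a child under a parent:
// remember the child's size and grow the parent's cumulative filesize.
func (n *fileNode) addChild(child *fileNode) {
	n.children = append(n.children, child)
	n.blocksizes = append(n.blocksizes, child.filesize)
	n.filesize += child.filesize
}

func main() {
	root := &fileNode{}
	root.addChild(newLeaf(make([]byte, 256*1024)))
	root.addChild(newLeaf(make([]byte, 256*1024)))
	root.addChild(newLeaf(make([]byte, 100)))

	// Reading the total afterwards is just a field access, no DAG traversal.
	fmt.Println("filesize stored on the root:", root.filesize)
}
```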

And this is why the directory view in the gateway can report the total size of the files in the directory. Note that this requires fetching the root node of every item in the directory, which is why a recent optimization avoids it for directories with many entries; in that case I think it won’t show the size.


Now for the topic of what size is what: there are several layers wrapping each other (merkledag: “ipfs object stat” for merkledag-pb DAG nodes; unixfs: “ipfs files stat” for unixfs nodes; and blocks: “ipfs block stat”, which is what the BlockStat you found gives), each of them with its own concept of what “size” is. See Finding the DataType of a dag-pb block via the HTTP API - #6 by hector. “ipfs dag stat” would operate at the same level as “ipfs object stat”, except for arbitrary IPLD DAG nodes, not necessarily merkledag-pb.

Hope that helps.


Thank you @hector! That is quite helpful.

Thanks for responding @Akita.

Just thinking aloud here… :thought_balloon:

Yeah, it makes sense to me to have names for the various sizes, and I know names exist in the code. Thinking about how this contrasts with OS sizes… I feel like OS users can for the most part be somewhat oblivious to how their OS’s file system encodes things. They mostly care about the bytes of the file, which they can get easily from their graphical file navigator or command line or whatever. With IPFS it feels like there are a bunch of extra layers they have to understand first. So yeah, exposing users gently to these extra layers and then having consistent and meaningful names for things makes sense.

And then maybe the main extra issue I’ve been running into is just making sure there is always a way to get the deduplicated numbers, and that feels like another opportunity to make it easy for users to understand what blocks are shared, etc.

Interesting points @zacharywhitley.

To summarize what I’m taking from this, the additional deduplication-related layers of complexity that users may or may not need to be exposed to are:

  • Chunkers
    • What a chunker is
    • Chunkers available and their impact on deduplication
  • Deduplication within the fresh content the user is adding
  • Deduplication between added content and content already on my node
  • Deduplication between added content, content on my node, and content elsewhere on the network
  • Impact of deduplication choices on access time

These feel like developer-facing concerns to me and likely not user-facing concerns in most cases, and there are certainly opportunities to make this kind of information easier to understand and the tradeoffs easier to reason about when making design decisions.

I’ve been thinking about tools like https://eth.build and what something like ipfs.build might look like for making some of these layers easier to understand.

Thanks. I agree that it should be a developer concern, but the problem is it’s not. Everything kinda works out OK as long as you say, “F it, just use the defaults unless you know what you’re doing”, but even then I think it has problems. Let’s just go through a scenario. I have a file and I’d like to add it to IPFS. You can go the “just use the defaults” route. But I’m pretty sure there’s some duplication in there, and for some reason I’d like to save some space, or I’d like to stream it and use the trickle-dag, so I go and use a different hasher. Now I have to add it twice, once with each hasher. Not a problem for small files, but if they’re large it would be a non-trivial amount of compute, time and storage. Assuming I remember not to pin it, I can reclaim the space for the one I don’t use later. But I have to remember to juggle the pins so I don’t accidentally pin them both. If I mess that up in any way, I’ve doubled my storage space to store the exact same file twice, completely negating any possible savings from using a different hasher. Now I somehow need to compare the actual blocks to see if it even did any good. This entire process is based on my intuition that there might be some savings. There is the possibility that there are no significant savings and it’s a waste of time and resources. Even if there is little to no saving in space, I need to intuitively decide if there might be further savings when files are added in the future, which I can’t possibly know.

Say the non-default hasher performs pretty well, so I decide to use it. Now suppose someone else adds the same file but with the defaults. Now there are two exact copies of the file on the network. OK, that’s not really my problem until I go to use something that references that CID and I end up pinning it. Now it is my problem, and I’m storing two copies, again negating any possible space savings.

So I have files on the network that have identical content, where it’s difficult or impossible to even communicate that the content is the same, and that have different properties and will perform differently. I’m not even sure if there is a way to communicate the different layouts in advance so that, even if I did know they were the same content, I could choose one or the other depending on how I’d like to use them. E.g. CID1 and CID2 are the same content, but one used a trickle-dag, so I’ll go with CID2.