Will Filecoin cost be based on reported size or actual size?

I’m working on a git remote that pushes to IPFS. One of the options is to output the commit chain with each associated tree as an IPFS filesystem.

This means there are hundreds of links to the same files, and this inflates the reported sizes. The repo itself is ~20MiB with ~120 commits, and the reported size in the webui file browser is 5TiB.

Pinata and Temporal both refused to process it as too large. Eternum said they’d pin it for $728/mo.

Will I have this same issue with Filecoin, where the cost to store is based on the reported size rather than the actual space on disk?

Lead developer of Temporal here. The webui reports it as 5TiB because none of the methods go-ipfs provides for reporting size account for deduplication. If it really is ~20MiB I can pin it on our systems for you as a workaround, but the problem you’re experiencing across all three services comes down to the way go-ipfs nodes calculate overall size. Until go-ipfs provides a reliable way to get the deduplicated size, it’s unlikely any pinning service will bill based on deduplicated size.
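To illustrate: the figure the webui shows is a cumulative size, along the lines of the CumulativeSize reported by ipfs object stat, which counts a shared subtree once for every link that points at it, whereas ipfs repo stat reflects what the node actually stores on disk. The hash and numbers below are placeholders, just to show the kind of mismatch:

    $ ipfs object stat QmExampleRoot | grep CumulativeSize
    CumulativeSize: 5497558138880    # ~5 TiB "reported": shared subtrees counted once per link
    $ ipfs repo stat | grep RepoSize
    RepoSize: 22020096               # ~21 MiB actually on disk for the whole repo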

As for the question the thread is actually about (Filecoin), you’re probably best off asking that on a Filecoin discussion forum, Reddit, or Slack.

That is what I figured.

It seems like you have to walk the tree to recursively pin things. Could you just spider every submitted hash, keeping track of the CIDs you’ve seen, and quit when the size gets too large or the number of nodes too great? At the end you could compute the size from the seen CIDs.

It does sound kinda costly relative to the current system.
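For what it’s worth, here’s roughly the walk I have in mind, as a shell sketch against a local go-ipfs daemon; the root hash and the cutoff limits are placeholders:

    # rough sketch: ipfs refs -r -u lists each reachable CID exactly once,
    # so the "seen CIDs" bookkeeping is handled for us
    root=QmExampleRoot
    total=$(ipfs block stat "$root" | awk '/^Size/ {print $2}')
    count=1
    for cid in $(ipfs refs -r -u "$root"); do
      size=$(ipfs block stat "$cid" | awk '/^Size/ {print $2}')
      total=$((total + size))
      count=$((count + 1))
      # give up once the walk crosses some threshold (placeholder limits: 100MiB / 100k blocks)
      if [ "$total" -gt 104857600 ] || [ "$count" -gt 100000 ]; then
        echo "too big, giving up after $count blocks" >&2
        exit 1
      fi
    done
    echo "deduplicated size: $total bytes across $count blocks"

A real implementation would presumably batch these lookups over the HTTP API rather than shelling out once per block, but the idea doesn’t seem to need much extra machinery.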

It’s frustrating because I’m working on a distributed proofreading platform that leverages filesystem deduplication quite a bit.

The test corpus I’m working with is the Hugo Award nominees. I pulled down around 500k book covers and created a file system of the form book/by/#{author}/#{title}/covers/#{isbn}.#{type}. About a quarter of those are duplicates because the same image could often be used for multiple ISBNs.

Then I put the covers (and sometimes something from irc://irc.irchighway.net/#ebooks) into git repositories, so all those duplicates are duplicated again.

I then create a “context forest”: a bunch of different paths encoding the award information and leading to the books, duplicating everything yet again.

Hopefully it’s popular enough I can get people to join a cluster. Do you happen to know if a cluster follower has to pin the entire cluster? Can they be configured to only pull down a limited number of pins?

Yea, you have to walk the tree. Part of the problem with calculating deduplicated storage costs is needing to walk that tree. If you have a 20MiB dataset with a ton of deduplicated data but a non-deduplicated size of 5TiB, that’s going to be an insane amount of walking to get the deduplicated cost.

Project sounds cool, best of luck.

Hopefully it’s popular enough I can get people to join a cluster. Do you happen to know if a cluster follower has to pin the entire cluster? Can they be configured to only pull down a limited number of pins?

I’m not super familiar with this part of ipfs cluster, but I believe followers will pin whatever is advertised. If you are advertising the entire DAG for this dataset, they will pin it.

@dysbulic if you’re interested, I should have a patch in our production environment tonight or by Sunday at the latest that enables deduplicated size estimation :rocket: Do you happen to have the hash of your dataset so I can run some tests against it?

@postables, that’s really surprising, but very cool.

QmaPsQNVSCXGSpB1bU1G1tbiAk58uccBiaQHYdB2AVHCLs is one of the hashes produced by my git remote.

I was a little off on the disk usage. I forgot that all the past versions of files are saved in .git/blobs, but the deduplicated size should still be under 100MiB.

After your thread I did some research and discovered that I had already written a wrapper tool that includes the ability to do deduplicated size calculation; I just forgot about it, I guess :man_shrugging: https://github.com/RTradeLtd/rtfs/blob/master/ext.go#L7

Is the node hosting that behind NAT or something? My dev ipfs node is stuck trying to find it, and I can’t seem to find any hosts providing the content with ipfs dht findprovs. When you see this, could you issue an

ipfs swarm connect /ip4/206.116.153.42/tcp/4004/p2p/QmePr8gxUswSsD7anQCm8P1F599CrmK2Wze1DjoN8LaLAx

on the node hosting the content? It will connect to my dev ipfs node and hopefully make pinning the content faster.

When you see this could you issue an ipfs swarm connect …?

@postables, Done.

Thanks, looks like it worked :slight_smile:


Just a note here for completeness. IPFS Cluster Service has the configuration options replication_factor_min and replication_factor_max to specify the number of peers on which to pin items.

Both are set by default to -1 which means “pin on all”.
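In service.json these options live under the “cluster” section; a trimmed excerpt with example values (remember both default to -1):

    {
      "cluster": {
        "replication_factor_min": 1,
        "replication_factor_max": 3
      }
    }

With values like those, each pin is allocated to a few peers rather than to every peer in the cluster.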

So it is possible for cluster followers to only pin part of a collection. I was concerned with the size of the repo and saving disk space. I’m uncertain, though, how smart the cluster service is about evenly distributing the load.

Currently it’s not smart: all means all. But we have some ideas: Probabilistic pinning for "replication everywhere" mode · Issue #1058 · ipfs-cluster/ipfs-cluster · GitHub