Is it possible to map the blocks to existing files?

From @githubber314159 on Sun Apr 16 2017 18:45:27 GMT+0000 (UTC)

Hey everyone! I have an idea regarding the data blocks. Blocks are definitely a good idea, but I have concerns about the way the data is actually stored.

It would be cool if .ipfs/ contained an index which links the blocks to files (file path + offset), since files are still advantageous compared to blocks. For example, you can open them in a system where no ipfs is running.
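To make the idea concrete, here is a minimal sketch of what such an index could record. It is not how IPFS stores anything: the function name `index_file`, the line format, and the use of a plain sha256 in place of the real block hash are all assumptions for illustration, and the 262144-byte default chunk size is assumed as well.

```shell
#!/bin/bash
# Sketch only: print one index line per fixed-size chunk of a file.
# Line format (hypothetical): <sha256 of chunk> <byte offset> <chunk length> <path>
# The sha256 is a stand-in; it is NOT the hash IPFS would compute for the block.
index_file() {
	local path="$1"
	local chunk="${2:-262144}"   # assumed chunk size in bytes
	local size offset=0 len hash
	size=$(stat -c %s "$path") || return 1
	while [ "$offset" -lt "$size" ]; do
		# The last chunk may be shorter than the chunk size.
		len=$(( size - offset < chunk ? size - offset : chunk ))
		# tail -c +N starts at byte N (1-based), head -c cuts the chunk length.
		hash=$(tail -c "+$((offset + 1))" "$path" | head -c "$len" | sha256sum | cut -d' ' -f1)
		printf '%s %s %s %s\n' "$hash" "$offset" "$len" "$path"
		offset=$(( offset + len ))
	done
}
```

Calling `index_file /path/to/file.ext` would print one line per chunk, and a block could then be served by reading `<chunk length>` bytes at `<byte offset>` from the original file.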

When the original files are still available locally, they remain usable for web publishing on your own server, tracking changes with git, sharing over NFS, mirroring with rsync, deduplication in a modern file system, etc. After all, I can open them with normal applications using the quite strong hierarchical index of filesystems. Dealing with hashes is complicated compared to dealing with files, folders and sub-folders. I don’t know Sia/dat very well atm, but it seems that I could use the same set of files for another network. In the long term, keeping files is far more sustainable, and it consumes less disk space because the data is not copied into blocks a second time.

Of course, the mentioned server use cases are possible with blocks too, but I would be happy if I’m not dependent on a certain program (IPFS in this case) and a certain storage format. I would like to manage the way I store the data on my own. If the files are kept in original state, the IPFS storage format could change dramatically and one could reinject the very same data again, resulting in an entirely newly organized index.

When the files are stored as blocks in .ipfs/, they are somewhat protected from being accidentally deleted, altered or renamed by the user. This may not be the case when the files are only referenced and still live somewhere in the file system. I offer five possible solutions to deal with the problem:

  1. Rename the files in a way no other application does. For example, have you seen Filename_ID_.extension anywhere? ".ext" is ugly and not used. If my file_hash.ext is marked as such, I know it is referenced. The hash in the filename is only for fast scanning with a script, see later.
  2. Change the ownership to group ipfs. I add myself to this group, too, but see that the file is referenced by the index.
  3. Enable this approach as an opt-in option. The users who enable it know what they are doing. We can expect that they store the files in a directory intended for publication and thereby know that they have to pay attention.
  4. Track/check files like IA.BAK does (git-annex).
  5. (Read-)Lock the files being referenced by the index (flock -s file.ext).
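As a rough illustration of option 5, the sketch below wraps `flock` from util-linux. The helper names `with_read_lock` and `can_modify` are made up for this example. Keep in mind that these locks are advisory: they only help against programs that also use flock, and they do not stop a plain `rm` or an editor.

```shell
#!/bin/bash
# Sketch: run a command while holding a shared (read) lock on a referenced file.
# flock(1) locks are advisory; programs that do not use flock ignore them.
with_read_lock() {
	local file="$1"; shift
	flock -s "$file" "$@"
}

# Returns success only if an exclusive lock can be taken right now,
# i.e. nobody is currently holding a shared or exclusive lock on the file.
can_modify() {
	flock -x -n "$1" true
}
```

A daemon serving the file could run inside `with_read_lock file.ext ...`, and a cooperating script would call `can_modify file.ext` before renaming or rewriting it.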

Now what if human error still occurs? Well, one could argue that a missing file means that it cannot be hosted by the peer and the index should be updated accordingly. It is simple if the user then re-adds the renamed file: the index just points to the new path. One could imagine a periodic “scan.sh false” which produces list.txt. list.txt then informs the index of renamed files.

Otherwise, we may keep track of the files by their contents, e.g. by using the output from the script attached.
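Tracking by contents could look roughly like the sketch below. It assumes two listings that contain only sha512sum-style lines ("<hash>  <path>"), for example the checksum part of an older and a newer scan; the function name `find_renames` is hypothetical. Content that merely got copied to a second path is also reported, so this is a hint for the index, not a proof of a rename.

```shell
#!/bin/bash
# Sketch: compare two sha512sum-style listings ("<hash>  <path>" per line)
# and print "old path -> new path" for known content found under a new path.
find_renames() {
	local old="$1" new="$2"
	awk '
		# The path starts after the hash plus the two separator characters.
		{ hash = $1; path = substr($0, length($1) + 3) }
		# First file: remember which path each hash had, and all old paths.
		NR == FNR { old_path[hash] = path; seen_path[path] = 1; next }
		# Second file: same hash, different path that did not exist before.
		(hash in old_path) && old_path[hash] != path && !(path in seen_path) {
			print old_path[hash] " -> " path
		}
	' "$old" "$new"
}
```

Usage would be along the lines of `find_renames old_sums.txt new_sums.txt`, where both file names are placeholders.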

The advantage of this approach may vanish if the FUSE module is integrated better, e.g. showing the files that are in the local repo without having to call ipfs pin ls --type=recursive.

So what do you think? Let me know!

Edit: I’ve seen ipfs filestore, but I don’t know where I can look up more / who I have to contact.

Attachment:

#!/bin/bash
#License: GPLv3
#usage: scan.sh [/absolute/path/to/rootDir] [false]
#rootDir defaults to the current directory; any second argument skips the sha512sum pass

if [ -z "$1" ]; then
	pwd="$(pwd)"
else
	pwd="$1"
fi

if [ -z "$2" ]; then
	full="1"
else
	full="0"
fi

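# Skip *.txt files so the scan_*.txt output of this script is not scanned itself.
# (The former regex ".*[^txt]$" dropped every path ending in t or x.)
find "$pwd" -type f ! -name "*.txt" -print > scan_list.txt
sort scan_list.txt > scan_list_sorted.txt
xargs -d '\n' ls -alU --time-style +%s < scan_list_sorted.txt > scan_ls.txt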

if [ $full -eq "1" ]; then
	echo "full scan"
	xargs -d '\n' sha512sum < scan_list_sorted.txt > scan_sha512sum.txt
	cat scan_ls.txt scan_sha512sum.txt > scan.txt
	rm scan_list.txt scan_list_sorted.txt scan_ls.txt scan_sha512sum.txt
else
	echo "file listing"
	mv scan_ls.txt scan.txt
	rm scan_list.txt scan_list_sorted.txt
fi


Copied from original issue: https://github.com/ipfs/faq/issues/253

From @githubber314159 on Tue Apr 18 2017 15:40:14 GMT+0000 (UTC)

Thank you.
So if I loaded a foreign file into .ipfs/ I can
$ cat /ipfs/Qm… > file.ext
and then unpin it, garbage collect, and re-add it to ipfs with --nocopy and my .ipfs folder remains small? Perfect!
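Written out as a script, that sequence might look like the sketch below. The function name `externalize` and the `IPFS_BIN` variable are hypothetical (the latter only exists so the steps can be dry-run with `IPFS_BIN="echo ipfs"`). It uses `ipfs cat` instead of reading from the /ipfs mount, and it assumes the experimental filestore has been enabled beforehand, since --nocopy depends on it.

```shell
#!/bin/bash
# Sketch: move already-fetched content out of .ipfs/ and reference it in place.
# Assumes the filestore was enabled first, e.g.:
#   ipfs config --json Experimental.FilestoreEnabled true
# IPFS_BIN is a hypothetical override; set it to "echo ipfs" for a dry run.
IPFS_BIN="${IPFS_BIN:-ipfs}"

externalize() {
	local hash="$1" target="$2"
	$IPFS_BIN cat "$hash" > "$target" || return 1   # write the content to a plain file
	$IPFS_BIN pin rm "$hash" || return 1            # drop the pin on the block copy
	$IPFS_BIN repo gc || return 1                   # garbage-collect unpinned blocks
	$IPFS_BIN add --nocopy "$target"                # re-add, referencing the file in place
}
```

A call would be `externalize <hash> file.ext`, with the hash taken from wherever the content was originally pinned.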

I don’t know where to report tests with nocopy (see https://github.com/ipfs/go-ipfs/issues/875#issuecomment-218060294), but here are some results:

SSD:
starting with empty .ipfs/
adding 41830 files / 37.7 GiB: .ipfs contains 27027 files, 89.4 MiB (a lot of symlinks, which aren’t supported yet)
additionally adding 17665 files / 17.1 GiB: .ipfs now contains 30766 files, 114.7 MiB (time needed: 5m22s, also a lot of symlinks)
Normal work with ipfs files is possible.

HDD:
starting with empty .ipfs/
adding 54248 files / 8.5 GiB: no symlinks. (time needed >1h)
additionally adding 7632 files / 45.1 MiB concurrently: insertion worked fine, finished before the other and was browsable/downloadable from localhost:8080. (time needed: 5m22s)
Resulting .ipfs/ contains 12200 files, 30.2 MiB

Interestingly, sometimes the hashes began with Qm, and sometimes with zb2rh. This is not correlated with symlinks.
The same file inserted normally, then unpinned, garbage-collected and reinserted with --nocopy resulted in a different hash :frowning: → This should not happen.

Now what happens if the file vanishes:
$ ipfs add --nocopy <file>
$ rm <file>
$ ipfs cat <hash>
The last command behaves as expected: it searches the contents from other peers.

In FUSE, however, cat did not work as expected. Error log:
hh:mm:ss.527 ERROR fuse/ipfs: fuse lookup got non-protobuf node readonly_unix.go:145
Not sure whether this is symlink related, though. I think there is a correlation with the hash prefix (Qm… works, zb2rh… does not).