@flyingzumwalt Awesome, ipwb is pretty much what I want. I've been looking into OrbitDB (https://github.com/orbitdb/orbit-db) too, to use IPFS as a KV store.
@es_00788224 @flyingzumwalt Thanks a lot!
You are right. ipwb doesn't make URIs searchable in IPFS. What I take from ipwb is:
While skimming the code, I thought ipwb used IPFS directories for the CDXJ URLs, but obviously it doesn't. It seems ipwb has very limited functionality.
I'm going back to OrbitDB.
Why do you need orbit-db? A directory is a directory, just treat it as one:
IPWB is just a way to put WARC files on IPFS.
While IPWB is not just a way to put WARC files into IPFS, for now that statement holds true to some extent: it covers only the indexer part, not the replay side of it. It is not difficult to build a transactional, on-demand, or client-side archiving system using IPWB. However, IPFS is not directly usable for the purpose described in this thread, though some techniques can be borrowed from it to implement such a system. Here are a few assorted thoughts around it; some are utilized in IPWB and some are not.
The Vary header describes all the request headers that can be used for content negotiation. For example, the exact same URL can be used to serve content in many different languages based on the Accept-Language header sent by the client (if the server advertises that it allows content negotiation on that header). If a proxy/cache does not take the Vary header into account, it would overwrite the cache with content in a different language because the URL is the same. As far as I know, the VA Tech research mentioned above does not consider the Vary header, as it is still only a research project and not a usable product. Additionally, in the case of IPWB, archival records are often looked up just by a URI and a datetime, so other possible content-negotiation dimensions are not part of the index. Hope this helps you come up with something that you can share with the community.
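To make the Vary point above concrete, here is a minimal sketch (not from the thread; all names are illustrative) of how a cache key could fold in the request headers named by the response's Vary header, so that e.g. English and German variants of the same URL do not overwrite each other in the cache.

import hashlib

def cache_key(url, request_headers, vary_header):
    """Build a cache key from the URL plus the request headers listed in Vary."""
    parts = [url]
    for name in (h.strip().lower() for h in vary_header.split(',') if h.strip()):
        parts.append('%s=%s' % (name, request_headers.get(name, '')))
    return hashlib.md5('\n'.join(parts).encode('utf-8')).hexdigest()

# Two requests for the same URL with different Accept-Language values
# produce different keys, so neither variant evicts the other.
key_en = cache_key('http://example.com/', {'accept-language': 'en'}, 'Accept-Language')
key_de = cache_key('http://example.com/', {'accept-language': 'de'}, 'Accept-Language')
assert key_en != key_de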
Add pointers to each CDXJ record at /ipns/YOUR_HASH/THE_URL.
Unless you're planning to build something extremely large, a simple reverse word index (also distributed over IPFS) searched by some quick JS code with boolean operators is enough for most purposes.
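As a rough illustration of that suggestion, here is a minimal reverse (inverted) word index sketch, written in Python to match the rest of the thread's snippets rather than JS. The document IDs below are made up; in practice they would be IPFS content hashes or URI-Ms.

import re
from collections import defaultdict

def build_index(docs):
    """docs: {doc_id: text}. Returns {word: set(doc_ids)}."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in re.findall(r'\w+', text.lower()):
            index[word].add(doc_id)
    return index

def search_and(index, *words):
    sets = [index.get(w.lower(), set()) for w in words]
    return set.intersection(*sets) if sets else set()

def search_or(index, *words):
    return set().union(*(index.get(w.lower(), set()) for w in words))

# Hypothetical documents keyed by (made-up) content hashes.
docs = {'QmAAA...': 'the quick brown fox', 'QmBBB...': 'the lazy brown dog'}
index = build_index(docs)
print(search_and(index, 'brown', 'fox'))   # {'QmAAA...'}
print(search_or(index, 'fox', 'dog'))      # {'QmAAA...', 'QmBBB...'}

The index itself can be serialized and added to IPFS like any other object, which is what "also distributed over IPFS" amounts to.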
We have thought about a few approaches around the full-text search implementation. Here are some assorted points in that regard:
Content Hash => Array of URI-Ms
(where a URI-M is a pointer to a memento, i.e., an entry in the CDXJ) can be maintained independently for final presentation purposes, which would require some clever ways to rank or bundle competing entries with the same hash (of the same URI-R at different times, or of different URI-Rs).

@mfan if you're looking at orbit-db you should also look at @pgte's work with CRDTs on IPFS. It does the same things as orbit-db but uses yjs to handle the CRDTs instead of relying on a homegrown implementation, which makes it easier to use.
@flyingzumwalt @ibnesayeed @Kubuxu @es_00788224 @daviddias Thanks for all the information. It is very helpful for getting me started with some experiments.
For the web proxying, I did a simple experiment with Proxy2 (https://github.com/inaz2/proxy2) on popular websites, e.g. cnn.com. The HTTP headers and page content are fetched and cached using their URLs, and replayed at a later time in the browser. It seems relatively easy, with no blocking issues:
Caching on IPFS
In my experiment, I used IPFS unixfs-dir and IPNS. It's very slow.
....
....
# Received an HTTP request; check for a page cache in IPFS.
uid = hashlib.md5(req.path).hexdigest()
ipfs_path = '/ipns/%s/%s' % (IPFS_BASE, uid)
try:
    # Return the cached page (HTTP header + page content) to the browser.
    page_dict = self.api.get_pyobj(ipfs_path)
    print 'URL %s found in ipfs: %s' % (req.path, ipfs_path)
    print 'UID %s found in ipfs: "%s"' % (uid, page_dict['body'][:60])
    print 'UID %s IPFS path: %s' % (uid, ipfs_path)
    self.wfile.write(page_dict['header'])
    self.end_headers()
    self.wfile.write(page_dict['body'])
    self.wfile.flush()
    return
except Exception as ex:
    # No cache hit; go fetch the page.
    print 'URL %s not found in ipfs: %s' % (req.path, str(ex))
    print 'UID %s not found in ipfs.' % (uid,)
    print 'UID %s IPFS path: %s' % (uid, ipfs_path)
    print 'Fetching the page... for %s' % (uid,)
...
...
# Fetch page content.
...
...
# Downloaded the page content; store it into IPFS.
uid = hashlib.md5(req.path).hexdigest()
with self.lock:
    ipfs_res = self.api.name_resolve('/ipns/%s' % (IPFS_BASE,))
    root_hash = ipfs_res['Path']
    page_hash = self.api.add_pyobj({
        'header': header_file.getvalue(),
        'body': page_file.getvalue()
    })
    ipfs_res = self.api.object_patch_add_link(root_hash, uid, page_hash)
    ipfs_hash = ipfs_res['Hash']
    self.api.name_publish(ipfs_hash, key=IPFS_BASE_KEY)
I'm afraid this design is not going to work in a production environment, due to the contention when publishing the IPNS name (IPFS_BASE) with a shared key (IPFS_BASE_KEY) from many nodes at the same time. The performance is also not acceptable; it's extremely slow even with a single flow in my experiment. Any suggestions to fix the contention and arrive at a working design using an IPFS directory?
For the project, I'm thinking about using IPFS to help people gain access to blocked internet content.
In a country with national-level Internet censorship, people usually use a VPN to work around the national firewall and access blocked content or websites. We could use the decentralized IPFS to circumvent censorship and make that content available to more people:
The Wikipedia snapshot project is great and inspiring. However, much of the Wikipedia content in a snapshot could be long-tail content. I am now thinking about using IPFS as a web proxy to provide the Google or Wikipedia search pages to users. We could fetch the up-to-date (or almost up-to-date) information those users are seeking, and hot/popular content would be guaranteed to be cached in the swarm.
Since an IPFS directory is too slow and not safe as the web cache store, I also looked into IPFS key-value stores.
Orbit-db and Tevere are amazing. However, both of them maintain separate local storage outside the IPFS swarm, and both require IPFS pubsub to trigger log sync among peers.
A quick skim of the go-floodsub code told me that it doesn't use the IPFS store for its messaging and peer management. I guess it's going to have a hard time scaling and probably won't work well with nodes that have intermittent availability most of the time (for better censorship resistance).
It's surprising to me that it's not easy to build a key-value store on top of IPFS. Has nobody ever suggested supporting "content-tagging" or "tag-addressed", in addition to "content-addressed", on IPFS?
I spent some time thinking about content-tagging while reading about IPFS. I started feeling the intense stare of "content-addressed" from the title of Benet's IPFS paper (DRAFT 3).
As I understand it, if we make content addressable/searchable by its tag: 1) many of the benefits provided by the system will be gone, e.g. tamper resistance and dedup; 2) the conflicts caused by arbitrarily named tags seem unmanageable, especially once the data has propagated through the network, and it is unclear how nodes would decide which value is the latest one; 3) other issues...
However, I have a strong feeling that tagging is very useful and could be a great feature for IPFS, even though there is no guarantee that tagged content stays consistent across the system.
Are there discussions about similar features in the past? Thanks!
You should use an existing web scraper such as httrack instead of rolling your own.
How are you going to share this key?
No, there are a couple of big issues with this. A censor could run ipfs dht findprovs QmOFFENDING_CONTENT to put someone in jail or worse.

No, it's not. Tor can evade GFW/Golden Shield with ease. What countries are you talking about where a VPN ban is enforced?
It's trivial. You can use traffic patterns, or just block the DHT seed nodes. You could also create some custom implementation of the IPFS DHT that connects to as many nodes as possible and silently adds all the IPs it can find to the IP block list.
That just makes it fault-tolerant.
As I said earlier, they can just block all the IPFS nodes. The ones that really manage to piss them off might get an <img src="1.2.3.4/veryveryverylongstring.png"> tag inserted somewhere, where 1.2.3.4 is their IP. This has been done in the past, and overwhelming a residential connection isn't very hard.
Why? With Wikipedia, you have the entire database. Build a simple reverse word index instead; this is good enough for most uses. Google doesn't like people scraping their searches.
You can just use the IPFS directories for that. The key is the file name, the value is what's referenced by the link in the DAG.
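If it helps, here is a rough sketch of that directory-as-key-value-store idea, reusing the same ipfsapi calls (name_resolve, object_patch_add_link, name_publish) as the proxy snippet earlier in the thread, plus add_str and cat. IPFS_BASE and IPFS_BASE_KEY are assumed to be set up as before, and the contention caveats discussed elsewhere in the thread still apply.

import ipfsapi

api = ipfsapi.connect('127.0.0.1', 5001)
IPFS_BASE = 'Qm...'          # hypothetical IPNS name of the directory
IPFS_BASE_KEY = 'mykey'      # hypothetical key used to publish it

def kv_put(key, value):
    # Resolve the current directory root, link a file named `key` to it,
    # and re-publish the new root over IPNS.
    root_hash = api.name_resolve('/ipns/%s' % IPFS_BASE)['Path']
    value_hash = api.add_str(value)   # the value is just file content
    new_root = api.object_patch_add_link(root_hash, key, value_hash)['Hash']
    api.name_publish(new_root, key=IPFS_BASE_KEY)

def kv_get(key):
    # The key is the file name under the published directory; the daemon
    # is assumed to resolve the /ipns/ path for us.
    return api.cat('/ipns/%s/%s' % (IPFS_BASE, key))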
Because it's a downright horrible idea. You can build a search engine on top, and there are decentralized implementations available in various forms. A simple reverse word index is good enough for most purposes and easy to decentralize.
No, the system will be gone. What you describe is susceptible to DoS attacks.
There was something called ipfs-search, but they shut down. Code available on GitHub.
I copied the IPFS_BASE_KEY into another node. So far it's the only way (that I figured out) to let many nodes update the same directory. You mentioned "You can just use the IPFS directories for that. The key is the file name, ..." How exactly do you do that in a way that allows many nodes to update the same file/directory name?
Distributed web cache
In my experiment, the IPFS web cache works fine for dynamic web sites. It cached all HTTP requests, including web service calls.
About Censorship
Searchable tag in IPFS
Now I realize that IPFS knows only the content hash and not the content value. I see why "the system will be gone." However, I don't think a searchable tag is a horrible idea in IPFS.
In order to support tags, we need to support tag queries as a new message type, distinct from normal content queries in IPFS. We need to make some changes in routing. To support "tag" as a new storage object, we need to extend IPFSLink and IPFSObject and the corresponding protobufs. That's relatively easy.
I guess we can live with a tag having many different values in the system; it's up to users to find good use cases for tags. Hopefully people will find creative ways to use tags in their apps, e.g. tag prefixing or tag namespaces. Unused tags will eventually be removed from the system. In order to help publish tags in the network, we might need to augment the routing key to be a pair of keys (hash(tag), hash(content)), or even better (hash(tag), hash(hash(tag), hash(content))), to help track hot tag/value pairs and keep them in the system.
I guess one important use case for searchable tags is a native key-value store in IPFS.
Same way, share the IPNS key.
Russia does not currently have a VPN ban.
The Tor Browser Bundle includes obfsproxy which does the same thing.
It's not a bad idea per se, but it will be much harder to implement than you're claiming. How will you solve flooding? This is a non-trivial problem, since you can't check the correctness of a tag programmatically (you need a human to do it).
So then you need some mechanism for trust, and you're suddenly looking at a lot more complexity than "add a new IPFS object type".
You already have a key-value store. The database is a folder, the key is a file name, the value is the content.
This supports multiple clients. But data loss could happen when many clients update the same IPNS endpoint at the same time.
Let me explain my idea in more detail. The proposed Tag object has two fields: "Name" is the name of the tag, and "MultiHash" is the multihash of the IPFS object the tag points to. Let's define H1 = multihash(Tag.Name) and H2 = multihash(Tag).
H2 is the traditional IPFS content-addressed key. It is used as usual to find the Tag object and then the content object the tag points to.
H1 is used for the tag query, which is different from a normal content query. A tag query could return a list of (H2, H2', H2'', ...), and it's up to the application to decide how to deal with the data.
To support tag queries, the routing key for a tag is (H1, H2), which has one more dimension than the usual content key. Content propagation is driven by H2, and the flooding pattern should be the same as usual. We might have (H1, H2) and (H1, H2') cached in the same node, and that's OK.
The IPFS tag query operates on a different plane to support IPFS content annotation. The above design should serve our goal and make IPFS content searchable/addressable by tag name. Note that when the same tag Name points to different IPFS objects, there will be multiple different IPFS Tag objects. A tag query for Name can therefore return multiple results.
For the implementation, we need to augment the IPFS routing API, e.g. add new APIs such as PutTag() and GetTag(). Tag objects are small and can be stored within the DHT.
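To illustrate the proposed scheme, here is a small sketch that computes the two keys for a Tag object. hashlib.sha256 merely stands in for a real multihash, and the Tag layout and (H1, H2) pair follow the description above; none of this is part of the actual IPFS API.

import hashlib
import json

def fake_multihash(data):
    """Stand-in for a real multihash over raw bytes."""
    return hashlib.sha256(data).hexdigest()

def make_tag(name, content_multihash):
    # The proposed Tag object: a name plus the multihash it points to.
    tag = {'Name': name, 'MultiHash': content_multihash}
    h1 = fake_multihash(name.encode('utf-8'))                      # key for tag queries
    h2 = fake_multihash(json.dumps(tag, sort_keys=True).encode())  # content-addressed key
    return tag, h1, h2

# Two tags with the same Name but different targets share H1, so a tag
# query for that Name can return both H2 values.
_, h1_a, h2_a = make_tag('news', 'QmAAA...')
_, h1_b, h2_b = make_tag('news', 'QmBBB...')
assert h1_a == h1_b and h2_a != h2_b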
I'm afraid this kind of feature should be supported by applications. We don't need to worry about it here.
It's far from a key-value store. I believe an IPFS directory (folder) is no different from a blob object. Any change within the subtree creates a new IPFS object, which requires publishing the directory again over IPNS. With shared keys, it supports multiple clients, but it's not safe and could have data loss.
So, a race condition. It's trivial to implement locking.
This is where you have the problem. What happens when someone inserts a few trillion tags with random keys and values? Is the application supposed to download terabytes of data and "decide how to deal with the data" later? So you need trust BEFORE downloading the tags, and then you have no need for direct indexing anymore.
Here's a simpler idea: ask web.archive.org to publish their collection to IPFS. They already index most sites, and they're considered trustworthy. You could then navigate to /ipns/web.archive.org/*/webpage and get redirected to /ipns/web.archive.org/YYYYMMDD/webpage.
Yep, makes sense; in essence this is what cyber is doing: building a user-defined semantics core using IPFS CIDs, a blockchain, and a decentralized knowledge graph.