Would there be interest in an IPFS Search Engine?

Looks like a good model. I would organize the pool as follows (mostly summing up your design).

Running crawlers:

  • Run a first few crawlers for bootstrapping purposes.
  • They connect to each other and subscribe to a common topic: generalNewsCrawlers/v1.
  • They run a common OrbitDB in CRDT mode.
  • They add “records” (let’s call them that) to the DB (a.k.a. the pool INDEX) along with their signature and metadata (maybe the date, a provider (to shorten lookups for searchers), possibly the IPNS record or DNSLink, the size, a guess at the file type, etc.).
  • To verify each other’s work, they can take another crawler’s records, reindex the content, and compare the results (see the sketch after this list). If the results match, they add their signature to the record in the INDEX and increase their confidence in that fellow crawler. If not, they broadcast a message on PubSub: “Crawler X made a mistake on CID Y. It did Z when we expected W. See their signature of the faulty work: S. Here is my own signature for this alert broadcast.”
  • Upon receiving the alert message, the other crawlers check who is actually at fault (the accused indexer, or the whistle-blower?) and either relay the message or drop it and bring shame on the faulty whistle-blower instead. The faulty node is penalized, and an honest one is rewarded (increased confidence).
  • Below a certain confidence threshold, honest nodes drop the bad crawler from their local routing tables.
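To make the record/cross-check loop concrete, here is a minimal Python sketch. The `Record` shape, the hash-based stand-in for a real signature, and the ±1 confidence steps are all illustrative assumptions, not a spec:

```python
import hashlib
from dataclasses import dataclass, field

@dataclass
class Record:
    cid: str                  # CID of the indexed content
    keywords: list            # extracted index terms
    crawler_id: str           # who produced the record
    signature: str            # producer's signature over (cid, keywords)
    cosigners: list = field(default_factory=list)  # peers that re-verified it

def sign(peer_id: str, cid: str, keywords: list) -> str:
    # Stand-in for a real cryptographic signature (e.g. with the peer's key).
    payload = f"{peer_id}:{cid}:{','.join(sorted(keywords))}"
    return hashlib.sha256(payload.encode()).hexdigest()

def verify_record(record: Record, my_id: str, reindex, confidence: dict) -> bool:
    """Reindex the CID ourselves, compare, and update local confidence."""
    if sorted(reindex(record.cid)) == sorted(record.keywords):
        record.cosigners.append(sign(my_id, record.cid, record.keywords))
        confidence[record.crawler_id] = confidence.get(record.crawler_id, 0) + 1
        return True
    # Mismatch: penalize locally; broadcasting the PubSub alert is not shown.
    confidence[record.crawler_id] = confidence.get(record.crawler_id, 0) - 1
    return False
```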

Searchers:

  • Searchers connect to one or several crawlers. They do not subscribe to generalNewsCrawlers/v1, which is for crawlers only (if they do, they won’t be penalized, but they will quickly be dropped, as crawlers prefer keeping connections with honest crawlers rather than cute but non-indexing searchers).
  • They query one or several crawlers and get results back. They should get almost the same results from all crawlers (not exactly the same, as the crawlers are always indexing). These results carry different signatures from different crawlers. Searchers compute the union of the results.
  • The searcher then filters and orders the union as it sees fit. Possible criteria to combine: the order proposed by the crawler, the number of signatures (more signatures means a more trusted result), the date a result was first seen, different weights for different crawlers’ signatures, etc. (see the sketch below).
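For instance, a searcher’s merge step could look like this sketch; the weighting and the position bonus are just illustrative knobs:

```python
from collections import defaultdict

def merge_and_rank(responses, crawler_weight):
    """Union the result lists of several crawlers and rank the CIDs.

    responses: {crawler_id: [cid, ...]};
    crawler_weight: the searcher's local trust weight per crawler.
    """
    scores = defaultdict(float)
    for crawler_id, cids in responses.items():
        weight = crawler_weight.get(crawler_id, 1.0)
        for position, cid in enumerate(cids):
            # Each crawler vouching for a CID adds weight; earlier
            # positions in that crawler's own ordering add a small bonus.
            scores[cid] += weight * (1.0 + 1.0 / (1 + position))
    return sorted(scores, key=scores.get, reverse=True)

# Example: two crawlers mostly agree; QmB ends up ranked first.
ranked = merge_and_rank(
    {"crawlerA": ["QmB", "QmA"], "crawlerB": ["QmB", "QmC"]},
    {"crawlerA": 1.0, "crawlerB": 0.5},
)
```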

Crawler joining the pool:

  • Jonny (the joining crawler) connects to some crawlers of the pool. He enters a trial period and seeks the approval of a “senior” crawler, one who has already passed this trial and is in the pool.
  • Jonny asks a senior crawler for CIDs to index.
  • Jonny indexes them but doesn’t publish the records to the INDEX. He sends them to the senior.
  • The senior crawler checks (some of) the results. If they are good, it signs them, puts the records with both signatures on the INDEX, and finally gossips about Good Jonny wanting to join on the PubSub channel. If Jonny tried to publish to the INDEX himself, or if his results don’t follow the pool standard, they are rejected, Jonny is gossiped about badly, and he has to start from scratch after a backoff period. This backoff is global, as all seniors know when Jonny was bitten by his senior. If Jonny tries to make his senior check a file that is too big, he is punished too, since he tried to DoS a senior: the senior informed him of the maximum size, or it’s written in the pool’s rules.
  • The senior challenges Jonny with bigger and bigger files.
  • The more Jonny delivers, the more he is trusted by his peers. If he fails once, he starts from scratch.
  • After a lot of successful runs, he becomes a regular crawler and publishes to the INDEX himself. The senior sends Jonny his diploma: a message saying that Jonny has now graduated, thanks to the senior, plus the diploma of the senior. This forms a chain of trust that can be traced back to the original bootstrappers (see the sketch after this list).
  • Jonny can become a senior for a new Jonny, too.
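The diploma chain could be checked with something like this minimal sketch; all names are invented, and `signature_ok` stands in for real signature verification:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Diploma:
    graduate: str                        # peer that graduated
    senior: str                          # peer that vouched for the graduate
    signature: str                       # senior's signature over the graduate's id
    parent: Optional["Diploma"] = None   # the senior's own diploma

def traces_to_bootstrap(diploma, bootstrappers, signature_ok):
    """Accept a diploma only if every link is validly signed and the
    chain ends at a known bootstrapper."""
    while diploma is not None:
        if not signature_ok(diploma):
            return False                 # forged link somewhere in the chain
        if diploma.senior in bootstrappers:
            return True                  # reached a trusted root
        diploma = diploma.parent
    return False                         # chain never reached a root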

Bootstrapping the pool:

  • Each peer wants to be a Jonny.
  • After some time, they notice they haven’t found any senior crawler.
  • They say to N other Jonnys: “I didn’t find a senior crawler to send me tasks. Do you want to be my senior, and I’ll be yours?”
  • The other Jonny either says “yes”, or “I found a senior at this address. You should contact them, or try again with me in X time (I hope to have graduated by then).”
  • There is a possibility that several groups consolidate on their own, with a risk of netsplit. To avoid that, seniors in each group can vet each other following the same process: each is a Jonny in the other’s network. After successful vetting on both sides, their two INDEXes and their two networks should eventually merge. A sketch of one handshake round follows.
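One handshake round might look like this; the reply format (`("yes",)`, `("senior_at", address)`, `("retry_in", seconds)`) is purely illustrative:

```python
def bootstrap_step(known_peers, ask):
    """One round of the mutual-seniority handshake described above.

    ask(peer) is assumed to return ("yes",), ("senior_at", address),
    or ("retry_in", seconds).
    """
    for peer in known_peers:
        answer = ask(peer)
        if answer[0] == "yes":
            return peer        # we become each other's seniors
        if answer[0] == "senior_at":
            return answer[1]   # contact the senior they already found
    return None                # everyone said "retry later": back off
```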

Open problems:

  • Depending on available resources, trust level within the pool, and pool size, crawlers may want to check only a fraction of the records.
  • We may want to check newer crawlers with a low score more often, to evict bad ones faster, and not spend resources checking old, honest, reliable nodes. However, this is a DoS vector (Sybil nodes spawning rapidly, joining, making honest nodes check their junk, getting evicted, rinse, repeat).
  • Bad crawlers have signed some records on the INDEX. How do we make the searchers distrust these results?
    – Revoke access to the INDEX and delete their records. How? And we lose information about bad peers.
    – Make crawlers not send the records that were rejected.
    – Make crawlers remember bad crawlers and not send the records that were produced by rejected crawlers (unless verified by someone else).
      – They could keep a parallel OrbitDB which is a list of bad peers, along with the proof(s) of their faultiness.
  • How do we prevent Jonny the Joiner from indexing useless data he generated himself and presenting that as results to the crawlers? He would quickly earn the trust of the pool and be able to DoS it.
    – Maybe make the senior send him some work to do first.
      – How do crawlers know that a message saying a distant honest senior trusted a distant Jonny with a result is legit? This unknown distant senior may be a Sybil.
        – Should all crawlers check Jonny’s results before trusting him? That widens the DoS vector and doesn’t scale well under high churn.
        – Alternatively, should they check that Jonny’s senior(s) was vetted by another crawler that was vetted by themselves (i.e., find a web-of-trust path going from Jonny to the skeptical crawler)? Then the web of trust would have to be stored on yet another OrbitDB, or be redundant enough to compute on the fly by jumping from node to node. But long-range attacks could infiltrate good nodes that then introduce bad peers and claim to trust them.
        – Alternatively, we trust Jonny by default, BUT we test our fellow crawlers regularly. If Jonny fails, we decrease our confidence in Jonny by 1, and in his senior by 0.5 (to be tuned); a sketch follows.
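Combining the last two bullets might look like this sketch; all the numbers are placeholders to be tuned, and `graduated_by` (mapping a peer to its senior) is an illustrative structure:

```python
import random

def penalize(confidence, graduated_by, peer, base=1.0, senior_factor=0.5):
    """Drop confidence in a failing crawler and, more weakly, in its senior.
    The 1 / 0.5 split is the tunable ratio from the bullet above."""
    confidence[peer] = confidence.get(peer, 0.0) - base
    senior = graduated_by.get(peer)
    if senior is not None:
        confidence[senior] = confidence.get(senior, 0.0) - base * senior_factor

def should_spot_check(confidence, peer, rng=random):
    # Audit low-confidence (new) peers often and trusted peers rarely, but
    # keep a floor so old peers still get checked occasionally.
    probability = max(0.05, min(1.0, 1.0 - confidence.get(peer, 0.0) / 10.0))
    return rng.random() < probability
```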

The rules of the pool will determine what is a good contribution and what is not. I guess the pool will provide an implementation to run.

Organizing a vote is tricky because of Sybil nodes’ voting power. Weighting by node “reputation” is tricky because reputation is local.


Okay, I will stop now, this is getting out of hand.

I have started trying to develop a crawler MVP.
I am not going to implement pools yet, because they increase the complexity.

But for this, I still need a way to observe the network for files being transferred.

The conditions:

  1. The crawler has to be given the CIDs before adding them to the index. It doesn’t observe the network.
  2. It only adds two file types to the index: MIME types text/plain and text/html.
  3. It publishes the results onto IPNS. The result is a JSON dict mapping keywords to CIDs (sketched below).
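Under those three conditions, the MVP’s core could be as small as this sketch; the word tokenizer and all names are assumptions on my part:

```python
import json
import re

MIME_WHITELIST = {"text/plain", "text/html"}   # condition 2

def index_document(index, cid, mime_type, text):
    """Add one already-known CID to the keyword -> [CIDs] reverse index
    (condition 1: the CID is handed to us, nothing is observed)."""
    if mime_type not in MIME_WHITELIST:
        return
    for keyword in set(re.findall(r"[a-z0-9]+", text.lower())):
        index.setdefault(keyword, [])
        if cid not in index[keyword]:
            index[keyword].append(cid)

index = {}
index_document(index, "QmSomeCid", "text/plain", "hello IPFS search")
# Condition 3: this JSON dict is what would be published under an IPNS name.
print(json.dumps(index, indent=2))
```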

Great! Best of luck :muscle:!

Also: this :kissing_smiling_eyes: :notes:

There are a few already. There is cyber.page, which is a search engine / protocol that works with IPFS. And there are a few crawlers.

Here are the docs for cyber.

PS: I am affiliated.

In short, the way cyber works is something like this or here. There is of course the WP (whitepaper).

@CSDUMMI would love to chat more =)


This is true.

An IPFS-based search engine already works, and everyone can take part in creating a knowledge graph. I would be very happy if the IPFS community took an active part in this.


I’m greeted with total darkness. And can it be accessed from IPFS as well?
If it can’t, then it is no different than using DuckDuckGo with site:ipfs.io in the search line.

This is more like a cyberlink knowledge graph. You must upload your files or IPFS hashes with keywords to the knowledge graph. After that, you will be able to find them in the search engine. The knowledge graph is now in the process of being filled. It already has over 100k cyberlinks.

https://cyber.page/

My problem is:
Is it really more than just a search query on DuckDuckGo with the site: attribute set to an IPFS gateway?
Is it decentralized, interplanetary, independent, censorship-resistant, or modular?
Can you download the knowledge graph yourself and work on it independently? Can you create your own search algorithms?

Yes, yes, yes and yes. I sent you links to some docs above. Check it out =)

It is decentralized. It is interplanetary. It is independent. There is no censorship. It is modular. You can fork the client or the chain. You can build your own graph, etc.

You can create your own search algorithms. You can govern the whole system via on-chain governance, etc.

What do you mean by that? Cyber uses IPFS as a DB for storing content, and it uses IPFS CIDs to create cyberlinks. However, cyber can work with any stateful or stateless protocol as long as you can have a pair of CIDs and prove their source.


I mean, suppose that cyber.page were blocked in my country. Could I still access it, for example through IPFS,
by downloading some software or accessing some file on IPFS?

cyber.page is just a PoC reference gateway, no more. You can access the protocol via any possible client. For example, there is a TG bot, @cyberdBot, and there are Firefox and Chrome alpha extensions. We have started to work on a browser, called cyb, which is actually a personal blockchain application on top of the protocol. So there is no way to block it, unless you shut down the network.

Anyone is free to fork the client and build whatever gateway they want on top of it; they could even make it private or semi-private by filtering the front end.

It’s still early days, and it is not on mainnet yet. If you fancy, I’d be happy to chat and tell you in more detail how it works. Or you can check out the code on GH: https://github.com/cybercongress/go-cyber (that’s the protocol repo).

The short answer is yes, you can access it


I wonder if any blockchain-based system can be “interplanetary”. Up to 24 minutes of latency between Earth and Mars easily kills it (miners/validators cannot be distributed, at least). Also, one of the super fancy goals of IPFS is to be partitionable, that is:

Info and apps function equally well in local area networks and offline. The Web is a partitionable fabric, like the internet.

This is also a reasonable requirement for an interplanetary system, and it seems to exclude current blockchain technology, which assumes the existence of a global, unpartitioned internet.


I have to agree with you, @sinkuu. That’s why it is not such a bad idea to have no global index, but rather small crawler communities
that eventually exchange their findings, but don’t stay in constant contact with everything else.


I think that my MVP crawler kind of works.
It creates a reverse index mapping keywords to CIDs, in JSON format.
I haven’t yet got it to publish to IPNS automatically, because that seems to take very long.

Hi! I created Cyber and would love to address your concerns.

I wonder if any blockchain-based system can be “interplanetary”. Up to 24 minutes of the latency between Earth and Mars easily kills it (miners/validators cannot be distributed at least)

You are correct that it’s likely we will not be able to sync one chain between Mars and Earth. But we should not need to: I am pretty sure semantics on Mars will be very different from Earth semantics.

That is why we defined a mechanism which allows proving the rank of any given CID from the Earth chain to any knowledge graph run by Martians. So you will only need to sync the ranks of anchor CIDs back and forth using some relay.


I am pretty sure that solution 5 will not be able to work without solution 4.

  1. You can learn from YaCy that you can’t build a search engine that will be useful following a complete bottom-up utopia. The reason for this is quite straightforward: relevance has to be somehow protected from trivial Sybil attacks. You cannot achieve this without some economics. And yep, you can’t add an economic layer without a DLT, due to double spends.

  2. Another problem with the bottom-up utopia is that, due to the inability to have a full index, such a search engine will never be able to answer questions better than a top-down solution.

  3. The top-down approach must not be complex and centralized around one blockchain to rule them all. Check the WP.

  4. I am pretty sure that it is a good idea to develop the bottom-up utopia on top of the top-down one, so you can get the best of both worlds.


You can run your own node and get the full raw index from the peer-to-peer network by yourself.


I don’t think we should use either:

  • a central data structure that has to be synced with all nodes, or
  • a distinction between peers.

This means that I don’t think we should talk about crawlers and searchers any longer, because a search engine on IPFS should embrace the design philosophy of IPFS, such that wherever IPFS can be used, the IPFS Search Engine can be used as well.

And IPFS has no distinction between peers, so Search Engine peers shouldn’t have one either.
They should be an add-on to an IPFS peer, any IPFS peer.

To talk about relevance: this should be a decision made by every peer.
So, please, what is the concrete definition of a Sybil attack? And is it a real threat to a decentralized, unsynced data structure?

Another part of the design philosophy of IPFS that I would implement in an IPFS Search Engine is the common format around multiple formats.
Meaning that there should be a format that leaves almost nothing about a search index implied: the format of the data, the data structure, the encoding, and even the content and focus of the index. A sketch of such a self-describing manifest follows.
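For illustration only, such a manifest might look like this; every field name here is invented to show the idea of leaving nothing implied, not a proposed standard:

```python
# Hypothetical self-describing index manifest; all fields are invented.
manifest = {
    "index_format": "reverse-index/v0",              # data structure used
    "encoding": "json",                              # how the index bytes are encoded
    "entry_schema": "keyword -> [CID]",              # shape of each entry
    "scope": "text/plain and text/html documents",   # content and focus of the index
    "root": "Qm...",                                 # CID of the index data itself
}
```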