Looks like a good model. I would organize the pool as follows (mostly summing up your design).
- Run a few initial crawlers for bootstrapping purposes.
- They connect to each other and subscribe to a common topic: generalNewsCrawlers/v1.
- They run a common OrbitDB in CRDT mode
- They add “records” (let’s call them that) to the DB (aka the pool INDEX) along with their signature and metadata (maybe the date, a provider (to shorten lookups for searchers), possibly the IPNS record or DNSLink, the size, a guess at the file type, etc.)
- To verify others’ work, they could take the records of others, reindex them and compare the results. If they are the same, they add their signature to the record in the index and increase the confidence they have in that fellow crawler. If not, they broadcast a message on PubSub: "Crawler X made a mistake on CID Y. It did Z when we expected W. See their signature of the faulty work: S. Here is my own signature for this alert broadcast."
- Upon receiving the alert message, the other crawlers check who is faulty (the bad indexer, or the whistle-blower?) and either relay the message or drop it and bring shame on the faulty whistle-blower instead. The faulty node is penalized, and the honest one is rewarded (increased confidence).
- Below a certain confidence threshold, honest nodes drop the bad crawler from their local routing tables.
- Searchers connect to one or several crawlers. They do not subscribe to generalNewsCrawlers/v1 which is only for crawlers (if they do, they won’t be penalized but will quickly be dropped as crawlers prefer keeping connections with honest crawlers, not cute but non-indexing searchers).
- They query one or several crawlers and get results back. They should get almost the same results from all crawlers (not exactly, as the crawlers are always indexing). These results carry different signatures from different crawlers. Searchers compute the union of the results.
- The searcher then filters and orders the results as it sees fit. Possible criteria to combine: the order proposed by the crawler, the number of signatures (more signatures means a more trusted result), the date it was first seen, different weights for the signatures of different crawlers, etc.
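To make the searcher side concrete, here is a rough Python sketch of the union-and-rank step. Everything in it is an assumption for illustration: the `Record` fields, the `merge_results` and `rank` helpers, and the default weight of 1.0 are made up, and signatures are reduced to plain crawler IDs rather than verified cryptographic signatures.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Record:
    """One index entry as returned by a crawler (field names are hypothetical)."""
    cid: str
    first_seen: int                                  # Unix timestamp of first sighting
    signers: set[str] = field(default_factory=set)   # IDs of crawlers that signed it


def merge_results(responses: list[list[Record]]) -> dict[str, Record]:
    """Union of the records from all crawlers, keyed by CID.

    Signers for the same CID are pooled and the earliest date is kept.
    """
    merged: dict[str, Record] = {}
    for response in responses:
        for rec in response:
            if rec.cid not in merged:
                merged[rec.cid] = Record(rec.cid, rec.first_seen, set(rec.signers))
            else:
                known = merged[rec.cid]
                known.signers |= rec.signers
                known.first_seen = min(known.first_seen, rec.first_seen)
    return merged


def rank(merged: dict[str, Record], weights: dict[str, float]) -> list[Record]:
    """Order records by the summed weight of their signers, then oldest first.

    Crawlers missing from `weights` get a default weight of 1.0.
    """
    def score(rec: Record) -> float:
        return sum(weights.get(signer, 1.0) for signer in rec.signers)

    return sorted(merged.values(), key=lambda r: (-score(r), r.first_seen))


# Two crawlers answer the same query with overlapping results.
from_a = [Record("cid1", 100, {"A"}), Record("cid2", 120, {"A", "B"})]
from_b = [Record("cid2", 110, {"B", "C"}), Record("cid3", 90, {"B"})]

merged = merge_results([from_a, from_b])
ranked = rank(merged, weights={"C": 0.5})   # this searcher trusts crawler C less
print([r.cid for r in ranked])              # ['cid2', 'cid3', 'cid1']
```

The weights live on the searcher, so two searchers can rank the same union differently. That fits the idea that each searcher filters and orders as it sees fit.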
Crawler joining the pool:
- Jonny (joining crawler) connects to some Crawlers of the pool. He will enter a trial period and will seek the approval of a “senior” crawler who has successfully passed this trial and is in the pool.
- Jonny asks a senior crawler for CIDs to index.
- Jonny indexes them but doesn’t publish them to the Index. He sends them to the senior.
- The senior crawler checks (some of) the results. If they are good, it signs them, puts the record with both signatures on the INDEX, and finally gossips on the PubSub channel about Good Jonny wanting to join. If Jonny tried to publish to the INDEX himself, or if the results do not follow the pool standard, they are rejected, Jonny is badly gossiped about, and he has to start from scratch after a backoff period. This backoff is global, as all seniors know when Jonny was rejected by his senior. If Jonny tries to make his senior check a file that is too big, he is punished too, as he tried to DoS a senior. The senior informed him of the maximum size beforehand, or it is written in the pool’s rules.
- Senior challenges Jonny with bigger and bigger files.
- The more Jonny delivers, the more he is trusted by his peers. If he fails once, he starts from scratch.
- After a lot of successful runs, he becomes a regular crawler and publishes to the INDEX himself. The senior sends Jonny his diploma: a message saying that Jonny has now graduated, thanks to the senior, plus the diploma of the senior. This is a chain of trust that can be traced back to the original bootstrappers.
- Jonny can become a senior for a new Jonny, too.
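The diploma chain could be checked roughly like this. It is only a sketch under assumptions: the `Diploma` structure, the `trust_chain` helper, the `banned` set and the `max_depth` limit are all hypothetical, and signature verification is left out entirely (a real pool would verify each diploma against the issuer's public key at every hop).

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Diploma:
    """Says `issuer` graduated `graduate`; carries the issuer's own diploma."""
    graduate: str
    issuer: str
    issuer_diploma: Optional["Diploma"]   # None when the issuer is a bootstrapper


def trust_chain(diploma: Diploma, bootstrappers: set[str],
                banned: frozenset[str] = frozenset(),
                max_depth: int = 32) -> list[str]:
    """Return the chain [graduate, ..., bootstrapper], or [] if it is invalid.

    Signature checks are deliberately omitted in this sketch.
    """
    chain = [diploma.graduate]
    current = diploma
    for _ in range(max_depth):
        if current.issuer in banned or current.issuer in chain:
            return []                     # banned senior, or a loop in the chain
        chain.append(current.issuer)
        if current.issuer_diploma is None:
            # The chain ends here: the last issuer must be a known bootstrapper.
            return chain if current.issuer in bootstrappers else []
        if current.issuer_diploma.graduate != current.issuer:
            return []                     # attached diploma belongs to someone else
        current = current.issuer_diploma
    return []                             # chain longer than max_depth


alice = Diploma("alice", "boot1", None)   # graduated by a bootstrapper
jonny = Diploma("jonny", "alice", alice)  # graduated by alice

print(trust_chain(jonny, bootstrappers={"boot1"}))   # ['jonny', 'alice', 'boot1']
print(trust_chain(jonny, bootstrappers={"other"}))   # []
```

Each diploma embeds its issuer's diploma, so the chain grows by one hop per generation of crawlers. The depth limit is there so that a malicious peer cannot make others walk an arbitrarily long chain.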
Bootstrapping the pool:
- Each peer wants to be a Jonny.
- After some time, they see they didn’t find any senior crawler.
- They say to N other Jonnys: "I didn’t find a senior crawler to send me tasks. Do you want to be my Senior, and I’ll be yours?"
- The other Jonny either says “yes”, or “I found a Senior at this address. You should contact them, or try again with me in X time (I hope to have graduated by then).”
- There is a possibility that several groups consolidate on their own, with a risk of netsplit. To avoid that, seniors in each group can vet each other following the same process. They are Jonnys in each other’s network. After successful vetting on both sides, their two INDEXes and their two networks should eventually merge.
- Depending on available resources, the trust level among the pool, and the pool size, crawlers may want to check only a fraction of the records.
- We may want to check more for newer crawlers with a low score, to evict bad ones faster, and not spend resources checking old, honest, reliable nodes. However, this is a DoS vector (Sybil nodes spawn rapidly, join, make honest nodes check their crap, get evicted, rinse, repeat).
- Bad crawlers have signed some records on the INDEX. How do we make the Searchers not trust these results?
  - Revoke access to the INDEX and delete their record. How? And we lose information about bad peers.
  - Make crawlers not send the records that were rejected.
  - Make crawlers remember bad crawlers and not send the records that were sent by rejected crawlers (unless verified by someone else).
    - They can have a parallel OrbitDB which is a list of bad peers, along with the proof(s) of their faultiness.
- How do we prevent Jonny the Joiner from indexing useless data he generated himself and presenting it as results to the Crawlers? He would quickly earn the trust of the pool and be able to DoS it.
  - Maybe make the Senior send him some work to do first.
    - How do Crawlers know that a message they just received, saying a distant honest senior node trusted a distant Jonny with a result, is legit? This unknown distant senior may be a Sybil.
      - Should all Crawlers check Jonny’s results before trusting him? That widens the DoS vector and doesn’t scale well under high churn.
      - Alternatively, should they check that Jonny’s Senior(s) was vetted by another crawler that was vetted by themselves (find a web-of-trust path going from Jonny to the skeptical crawler)? Then the web of trust should be stored on yet another OrbitDB, or be redundant enough to compute it on the fly by jumping from node to node. But long-range attacks could infiltrate good nodes that then introduce bad peers and say they trust them.
      - Alternatively, we trust Jonny by default, BUT we test our fellow crawlers regularly. If Jonny fails, we decrease our confidence in Jonny by 1, and in his senior by 0.5 (to be tuned).
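The last option, together with the idea of checking only a fraction of the records, could look something like the Python sketch below. All names and constants here are assumptions to be tuned (`LocalReputation`, the drop threshold of -3, the 0.5 senior share, the `1 / (1 + score)` sampling curve), and the table is purely local to one crawler.

```python
from __future__ import annotations

import random
from dataclasses import dataclass, field


@dataclass
class LocalReputation:
    """One crawler's private view of its peers (all constants are placeholders)."""
    scores: dict[str, float] = field(default_factory=dict)
    seniors: dict[str, str] = field(default_factory=dict)   # junior -> senior
    drop_threshold: float = -3.0
    senior_share: float = 0.5    # fraction of a penalty passed on to the senior

    def reward(self, peer: str, amount: float = 1.0) -> None:
        self.scores[peer] = self.scores.get(peer, 0.0) + amount

    def penalize(self, peer: str, amount: float = 1.0) -> None:
        """Decrease the peer's score, and its senior's by a smaller amount."""
        self.scores[peer] = self.scores.get(peer, 0.0) - amount
        senior = self.seniors.get(peer)
        if senior is not None:
            share = amount * self.senior_share
            self.scores[senior] = self.scores.get(senior, 0.0) - share

    def should_drop(self, peer: str) -> bool:
        """True once the peer falls below the local confidence threshold."""
        return self.scores.get(peer, 0.0) < self.drop_threshold

    def check_probability(self, peer: str, floor: float = 0.05) -> float:
        """Fraction of a peer's records to re-index: 1.0 for new or
        low-scored peers, decaying towards `floor` as the score grows."""
        score = max(self.scores.get(peer, 0.0), 0.0)
        return max(floor, 1.0 / (1.0 + score))

    def should_check(self, peer: str, rng: random.Random) -> bool:
        """Randomly decide whether to re-index one record from this peer."""
        return rng.random() < self.check_probability(peer)


rep = LocalReputation(seniors={"jonny": "alice"})
rep.reward("alice", 9.0)          # alice has a long honest history
rep.penalize("jonny")             # jonny fails one spot check

print(rep.scores["jonny"])        # -1.0
print(rep.scores["alice"])        # 8.5 (9.0 minus half of jonny's penalty)
print(rep.check_probability("jonny"))   # 1.0: check everything from jonny
print(rep.should_drop("jonny"))   # False: still above the -3.0 threshold
```

Note that this does nothing about the Sybil problem above: fresh peers still get checked at the highest rate, so a rate limit or a cost of entry would be needed on top.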
The rules of the pool will determine what is a good contribution and what is not. I guess the pool will provide an implementation to run.
Organizing a vote is tricky because of the voting power of Sybil nodes. Weighting by node “reputation” is tricky because the reputation is local.
Okay, I will stop now, this is getting out of hand.