You should use an existing web scraper such as httrack instead of rolling your own.
How are you going to share this key?
No, there are a couple of big issues with this.
1. Most national firewalls operate at the DNS level. You already have CacheBrowser for those who try to be a bit more sophisticated.
2. IPFS isn't anonymous. If they live under an actual repressive regime, all it takes is a simple ipfs dht findprovs QmOFFENDING_CONTENT to put someone in jail or worse (see the sketch after this list).
3. It only works for static pages.
4. You need nodes you trust to do it.
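To make point 2 concrete, here's a rough sketch of what any observer could script against a stock ipfs daemon: ask the DHT who provides a CID, then look up each provider's advertised addresses. It assumes the ipfs binary is on PATH with a daemon running; QmOFFENDING_CONTENT is just the placeholder from above.

    # Sketch only: enumerate peers advertising a given CID, then resolve their addresses.
    # Assumes a running ipfs daemon and the ipfs binary on PATH.
    import subprocess

    def find_providers(cid: str) -> list[str]:
        # `ipfs dht findprovs <cid>` prints one provider peer ID per line.
        out = subprocess.run(["ipfs", "dht", "findprovs", cid],
                             capture_output=True, text=True, check=True)
        return [line for line in out.stdout.splitlines() if line.strip()]

    def peer_addresses(peer_id: str) -> list[str]:
        # `ipfs dht findpeer <peer-id>` prints the peer's multiaddrs, which
        # usually include its public IPs; failed lookups just yield nothing.
        out = subprocess.run(["ipfs", "dht", "findpeer", peer_id],
                             capture_output=True, text=True)
        return [line for line in out.stdout.splitlines() if line.strip()]

    if __name__ == "__main__":
        for pid in find_providers("QmOFFENDING_CONTENT"):
            print(pid, peer_addresses(pid))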
No, it's not. Tor can evade GFW/Golden Shield with ease. What countries are you talking about where a VPN ban is enforced?
It's trivial. You can use traffic-pattern analysis, or just block the DHT bootstrap nodes. You could also run a custom implementation of the IPFS DHT that connects to as many nodes as possible and silently adds every IP it finds to the block list.
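A rough sketch of that harvesting idea, and it doesn't even need a custom DHT client: just run an ordinary local ipfs daemon, keep asking it which peers it's connected to, and pull the IPv4 addresses out of the multiaddrs. The polling interval is an arbitrary choice for the example.

    # Sketch only: collect IPs of every peer the local daemon connects to.
    # Assumes a running ipfs daemon and the ipfs binary on PATH.
    import re
    import subprocess
    import time

    IP_RE = re.compile(r"/ip4/(\d{1,3}(?:\.\d{1,3}){3})/")

    def connected_peer_ips() -> set[str]:
        # `ipfs swarm peers` prints one multiaddr per connected peer,
        # e.g. /ip4/203.0.113.7/tcp/4001/p2p/Qm...
        out = subprocess.run(["ipfs", "swarm", "peers"],
                             capture_output=True, text=True, check=True)
        return {m.group(1) for m in IP_RE.finditer(out.stdout)}

    def harvest(interval_s: int = 60) -> None:
        seen: set[str] = set()
        while True:
            new = connected_peer_ips() - seen
            seen |= new
            for ip in sorted(new):
                print(ip)  # feed these into the block list
            time.sleep(interval_s)

    if __name__ == "__main__":
        harvest()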
That just makes it fault-tolerant.
As I said earlier, they can just block all the IPFS nodes. The ones that really manage to piss them off might get an <img src="18.104.22.168/veryveryverylongstring.png"> tag inserted somewhere, where 18.104.22.168 is their IP. This has been done in the past, and overwhelming a residential connection isn't very hard.
Why? With Wikipedia, you have the entire database. Build a simple reverse word index instead; that's good enough for most uses. Google doesn't like people scraping their searches.
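Something like this is all I mean by a reverse word index. The sample pages are made up; a real build would iterate over the Wikipedia dump instead.

    # Sketch only: reverse word index mapping word -> set of page titles containing it.
    import re
    from collections import defaultdict

    def build_index(pages: dict[str, str]) -> dict[str, set[str]]:
        index: dict[str, set[str]] = defaultdict(set)
        for title, text in pages.items():
            for word in set(re.findall(r"[a-z0-9]+", text.lower())):
                index[word].add(title)
        return index

    def search(index: dict[str, set[str]], query: str) -> set[str]:
        # Pages containing every word of the query (simple AND semantics).
        words = re.findall(r"[a-z0-9]+", query.lower())
        if not words:
            return set()
        result = set(index.get(words[0], set()))
        for w in words[1:]:
            result &= index.get(w, set())
        return result

    if __name__ == "__main__":
        pages = {
            "Onion routing": "Tor is an implementation of onion routing.",
            "IPFS": "IPFS is a content-addressed peer-to-peer file system.",
        }
        idx = build_index(pages)
        print(search(idx, "onion routing"))  # {'Onion routing'}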
You can just use the IPFS directories for that. The key is the file name, the value is what's referenced by the link in the DAG.
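Roughly like this, assuming the ipfs CLI is installed: each key becomes a file name, each value that file's contents, and <directory CID>/<key> resolves straight to the value. The hello/world pair is just an example.

    # Sketch only: use an IPFS directory as a key/value store via the ipfs CLI.
    import pathlib
    import subprocess
    import tempfile

    def publish(pairs: dict[str, str]) -> str:
        with tempfile.TemporaryDirectory() as d:
            for key, value in pairs.items():
                (pathlib.Path(d) / key).write_text(value)
            # -r adds the directory recursively, -Q prints only the root CID.
            out = subprocess.run(["ipfs", "add", "-r", "-Q", d],
                                 capture_output=True, text=True, check=True)
            return out.stdout.strip()

    def lookup(dir_cid: str, key: str) -> str:
        out = subprocess.run(["ipfs", "cat", f"{dir_cid}/{key}"],
                             capture_output=True, text=True, check=True)
        return out.stdout

    if __name__ == "__main__":
        cid = publish({"hello": "world"})
        print(lookup(cid, "hello"))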
Because it's a downright horrible idea. You can build a search engine on top, and there are decentralized implementations available in various forms. A simple reverse word index is good enough for most purposes and easy to decentralize.
No, the system will be gone. What you describe is susceptible to DoS attacks.
There was something called ipfs-search, but it shut down. The code is still available on GitHub.