Decentralized Search Engines for the Decentralized Web

We need decentralized tools for indexing the content on IPFS and searching through that content in a decentralized manner. One option (of many) is to modify YaCy to crawl IPFS content as well as classical websites.

Related: the ipfs-search project is looking for a maintainer.

This may be a noob question, but how is a search engine able to label/name an object if it’s a single file that is just represented by an IPFS hash? It could be anything, from a docx to a zip to a PDF to a DMG, with no discernible filename.

It may be impossible to find the filename for some content just from the file data, but a lot of files on IPFS are contained in directories (e.g. if added recursively, or with ipfs add -w) that hold a named link to the file.
For example:

QmDIR... - Directory, containing:
  cat.jpg: QmAAA...
  secrets.txt: QmBBB...
  movie.mp4: QmCCC...

After indexing the QmDIR… directory (for example, finding it in the DHT), the search engine will see the filenames that the hashes QmAAA…, QmBBB… and QmCCC… were given, saving them in some index.
Now, when the file QmBBB… comes up in a search, the system can look at the filename index, and see that the file was named secrets.txt.
I’m not sure if this is how ipfs-search works, but it’s one possible way to bind filenames to indexed hashes. Another is to search for the hash of the file you want the filename for: the directory, having a reference to that file, will come up in the search.
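
Very roughly, that could look like the following sketch in Python, assuming a local daemon and that ipfs ls prints one "<hash> <size> <name>" line per link; the "index" here is just an in-memory dict:

import subprocess

def index_directory(dir_hash, name_index):
    # `ipfs ls <hash>` prints one link per line: "<child-hash> <size> <name>"
    # (output format assumed from the go-ipfs CLI; directory names may carry a trailing "/")
    output = subprocess.run(["ipfs", "ls", dir_hash],
                            capture_output=True, text=True, check=True).stdout
    for line in output.splitlines():
        parts = line.split(None, 2)
        if len(parts) == 3:
            child_hash, _size, name = parts
            name_index.setdefault(child_hash, set()).add(name.rstrip("/"))

# After indexing QmDIR..., a lookup for QmBBB... would return {"secrets.txt"}
name_index = {}
index_directory("QmDIR...", name_index)
print(name_index.get("QmBBB...", set()))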

For these “anonymous IPFS objects” (added without the -w option) you might at least be able to get some information on the file type, e.g. with

ipfs cat Qm... | file -
/dev/stdin: Zip archive data, at least v2.0 to extract

But the search engine would have to ipfs cat (i.e. actually download) the object first. So maybe file type detection should eventually be built directly into ipfs. (Possible?)

However, for some file formats that information seems to get lost in IPFS; if I cat and file-examine a local DMG, it tells me bzip2 compressed data, block size = 100k, but when I ipfs cat and file-examine the DMG directly from IPFS, it only says data.
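
One possible shortcut, sketched below: instead of downloading the whole object, fetch only a small prefix (assuming ipfs cat's --length option is available, as in go-ipfs) and check it against a few well-known magic bytes:

import subprocess

# A few well-known magic-byte signatures; extend as needed.
MAGIC = {
    b"PK\x03\x04": "zip archive",
    b"%PDF": "pdf document",
    b"\x89PNG\r\n\x1a\n": "png image",
    b"\xff\xd8\xff": "jpeg image",
    b"BZh": "bzip2 data",
}

def sniff_type(cid, prefix_len=32):
    # --length (assumed available) limits how much the daemon streams to us;
    # otherwise fall back to plain `ipfs cat` and truncate locally.
    data = subprocess.run(["ipfs", "cat", "--length", str(prefix_len), cid],
                          capture_output=True, check=True).stdout
    for magic, label in MAGIC.items():
        if data.startswith(magic):
            return label
    return "unknown"

print(sniff_type("Qm..."))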

There is a need to index content as well as filenames. Think of a service like Google Drive or Dropbox, where one can search one’s own PDFs or docs by name or content. The next question is what happens if a node tries to index encrypted storage (with the owner’s permission). Both indexing and encryption should be baked into IPFS.

I also want to add metadata to a hash, for example to identify it or tag it for easier searching.

If it’s the song Sunshower by RZA, I want to add it to a decentralized Wikipedia.

I don’t understand how everything is encrypted if any node can visit any IPFS site, look at the code, and download it. Isn’t everything delivered encrypted between peers?

At that time (and probably still today) there is no encryption by default for files or content.
Anyone can always encrypt a file manually before adding it to IPFS.
Encrypted content isn’t meant to be shown publicly; I’m referring to the owner (or anyone permitted) being able to search through his/her content.
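
As a rough sketch of that encrypt-before-adding workflow (assuming the Python cryptography package and a local ipfs CLI; Fernet is just one convenient symmetric scheme):

import subprocess
from cryptography.fernet import Fernet

def add_encrypted(path):
    # Encrypt locally, write the ciphertext, then add only the ciphertext to IPFS.
    key = Fernet.generate_key()
    with open(path, "rb") as f:
        ciphertext = Fernet(key).encrypt(f.read())
    enc_path = path + ".enc"
    with open(enc_path, "wb") as f:
        f.write(ciphertext)
    # -Q (--quieter) makes `ipfs add` print only the resulting hash.
    result = subprocess.run(["ipfs", "add", "-Q", enc_path],
                            capture_output=True, text=True, check=True)
    return result.stdout.strip(), key

# The owner keeps the key; anyone given hash + key can fetch and decrypt.
cid, key = add_encrypted("secrets.txt")
print(cid, key.decode())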

I could see in ipfs-search that they are sniffing the DHT gossip and indexing file and directory hashes. It takes IPFS hashes and extracts content and metadata through Apache Tika. The indexing and searching are then done using Elasticsearch 5.
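
Just to illustrate the last step of such a pipeline, here is a rough sketch of pushing extracted metadata into Elasticsearch over its HTTP API; the index name, type and fields are made up for the example, not ipfs-search’s actual schema:

import requests

ES = "http://localhost:9200"   # local Elasticsearch node (assumed)
INDEX = "ipfs-files"           # illustrative index name

def index_document(cid, metadata):
    # Store extracted metadata under the object's hash as the document ID.
    requests.put(f"{ES}/{INDEX}/file/{cid}", json=metadata).raise_for_status()

def search(query):
    # Full-text search over indexed filenames and extracted content.
    body = {"query": {"multi_match": {"query": query,
                                      "fields": ["filename", "content"]}}}
    resp = requests.get(f"{ES}/{INDEX}/_search", json=body)
    resp.raise_for_status()
    return [hit["_id"] for hit in resp.json()["hits"]["hits"]]

index_document("QmBBB...", {"filename": "secrets.txt", "content": "extracted text..."})
print(search("secrets"))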

I also found another tool, IPFS-Store by ConsenSys, which is an API on top of IPFS with search capabilities and much more.

I also found another search tool built using searx: https://ipfs.io/ipfs/QmYo5ZWqNW4ib1Ck4zdm6EKteX3zZWw1j4CVfKtnAzNdvu/

For me, the problem with searching the DHT, or the vast ocean of hashes people will be generating in the future, is the amount of useless hashes that need to be resolved to discover content. We should focus more on searching IPNS.

I do it manually sometimes for peer IDs, when I should be sleeping: ipfs swarm peers, then ipfs resolve, ls and get on the hash that peer ID has published. Not very efficient, though I’m sure someone could write a JavaScript app to make it easier. And this is just per peer ID, not to mention the files that were added but never published.
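
A rough sketch of automating that loop (assuming a local daemon, that ipfs swarm peers prints multiaddrs ending in the peer ID, and that ipfs resolve simply returns nothing for peers that have published nothing):

import subprocess

def run(*args):
    return subprocess.run(["ipfs", *args], capture_output=True, text=True).stdout.strip()

# Each line of `ipfs swarm peers` is a multiaddr whose last segment is the peer ID.
peer_ids = {line.rsplit("/", 1)[-1] for line in run("swarm", "peers").splitlines()}

for pid in peer_ids:
    target = run("resolve", f"/ipns/{pid}")   # empty if this peer has published nothing
    if target:
        print(pid, "->", target)
        print(run("ls", target))              # list what the peer has published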

As for searching IPFS objects directly, this will also be inefficient. I add my website every time I change something, creating a new hash. Does anyone really want to search through 50 copies of nearly the same content, or through every hash created each time you re-add the same folder with new content?

My suggestion:
Add a distributed database to an IPFS client like Siderus Orion or IPFS Manager. Alternatively, files could be added/indexed/verified with a service like Stamp.io, but I prefer native interactions.
When adding files, the user could add tags such as title/description/username, and the client adds the tags file size/type/date added (very important).
The local DB is updated with the new hash, which could be broadcast to and synchronized with connected peers. In this way, when a user searches the DB they will be directed to the latest version of the content, with the choice to also view older versions (if the files still exist somewhere).
Websites or companies with large data-storage capabilities and a lot of content, such as Google, Instagram, Wiki, LinkedIn, would be huge contributors to the trusted hash list. Having synced with large “trusted” entities, users would get the latest search results.
Other options, such as rating content as “safe, explicit, offensive, harmful”, could help remove negative content from the web.
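
To make that concrete, here is a sketch of what one locally stored, sync-able tag record could look like, using SQLite; the field names are illustrative, not taken from any existing client:

import sqlite3, time

db = sqlite3.connect("ipfs_tags.db")
db.execute("""CREATE TABLE IF NOT EXISTS tags (
    hash TEXT PRIMARY KEY,                        -- the IPFS hash being described
    title TEXT, description TEXT, username TEXT,  -- user-supplied tags
    size INTEGER, type TEXT, date_added REAL,     -- client-supplied metadata
    supersedes TEXT                               -- previous hash of the same content, if any
)""")

def tag(hash_, title, description, username, size, type_, supersedes=None):
    db.execute("INSERT OR REPLACE INTO tags VALUES (?,?,?,?,?,?,?,?)",
               (hash_, title, description, username, size, type_, time.time(), supersedes))
    db.commit()

tag("QmCCC...", "movie.mp4", "home video", "alice", 734003200, "video/mp4")
# A search hits the local DB first; records could then be broadcast to peers for syncing.
print(db.execute("SELECT hash, title FROM tags WHERE title LIKE ?", ("%movie%",)).fetchall())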

This relates to publicly shared info.
As for private info, I think that should be handled before adding to IPFS. For simple things, just create a password-protected .rar, then add it to IPFS and share the link plus the password with the recipient. It’s not military grade, but it doesn’t require any special software.

I’m hoping this functionality will also be implemented in clients.