[deprecated] IPFS ❤️ human-readable URLs - Use a DHT to upgrade URLs to IPFS CIDs

Hey everyone, hope you’re all doing well!

We have the great features DNSLink and IPNS, which allow us to upgrade domains (and, to a degree, URLs) in the browser to IPFS CIDs, which the browser can then access via the IPFS network.

But there are certain limitations that currently hinder a smooth transition from a regular web server to a site running fully on IPFS. The main limitations are the query part after the path in the URL, as well as the missing support for any non-HTTP(S) scheme.

The idea to fix this is simple:

  • Hash the URL
  • Store the CID information in a DHT under the URL hash
  • Sign the information with the IPNS key for the domain
  • Publish the IPNS key for the domain in DNS

To prevent false information from flooding the DHT, every record should be individually verified by the nodes storing it: they should ask the DNS system for the IPNS key and check the signature before storing the record and offering it to the network. The timestamps should also be checked against the host’s clock, so that no entries dated in the future or the past can be published.
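The checks above could be sketched roughly like this (a minimal sketch, not a spec: the record shape mirrors the examples further down, and `dnsKeyLookup`, `verifySignature`, and the clock-skew tolerance are my assumptions standing in for real DNS/crypto plumbing):

```javascript
const MAX_CLOCK_SKEW_MS = 5 * 60 * 1000; // assumed tolerance for clock drift

// Check that the record's validity window is plausible against our own clock:
// entries dated entirely in the future or the past are rejected.
function validityWindowOk(record, now = Date.now()) {
  const since = Date.parse(record["valid-since"]);
  const until = Date.parse(record["valid-until"]);
  if (Number.isNaN(since) || Number.isNaN(until) || since >= until) return false;
  return since <= now + MAX_CLOCK_SKEW_MS && until >= now - MAX_CLOCK_SKEW_MS;
}

// Sketch of the full check a DHT node would run before storing a record.
function shouldStore(record, dnsKeyLookup, verifySignature) {
  const key = dnsKeyLookup(record.from.authority.host); // IPNS key from DNS
  if (!key || key !== record.pubkey) return false;      // key must match DNS
  if (!verifySignature(record, key)) return false;      // signature must hold
  return validityWindowOk(record);                      // clock window check
}
```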

Rationale

We currently assume that the web consists only of HTTP(S) URLs without a query part, and we can’t support the rest.

Using hashes to inform a client about the availability of the information in the IPFS network reduces the need for workarounds and helps reduce the complexity of a transition between web servers and IPFS libraries.

Since this approach isn’t limited to the http/https schemes, we can extend it to other URIs in the future.

Additionally, we can use the URIs to resolve to p2p services inside the IPFS network. This would allow us to extend clients in the future to route something like IRC, SIP, or SNMP traffic to a p2p service inside of IPFS instead of natively over the internet. This approach allows for interesting failover, mobility, and encryption possibilities while also extending the usability of IPFS beyond storing data.

Technical specification

I haven’t given the technical details that much thought yet, so sorry for all the rough edges here. I just want to outline how it might work, not how it should work!

Redirects

Redirects are often used in web servers to move clients from old URLs to new ones or to move certain links to other locations.

IPFS could support this feature natively, so that no web server has to be contacted to perform the redirect before the client can upgrade to an IPFS path.

An example of what the data stored in the DHT could look like:

<ipns-pubkey>
---
type: "redirect"
from:
  scheme: "http"
  authority:
    host: "example.com"
  path: "/old-link/"
  query: ""
  fragment: ""
to:
  scheme: "ipns"
  authority: "example.com"
  path: "/home/"
  query: ""
  fragment: ""
valid-since: "2020-12-12T00:00:00Z"
valid-until: "2021-01-15T00:00:00Z"
...
<signature>

Wildcard URLs

If the DHT doesn’t contain a valid result for the full URL, the client might drop certain parts of the URL to find a matching entry - for example, the fragment part might not be necessary to fetch the data from IPFS. As a last resort, the client can ask the DHT for entries for just the scheme and authority parts of the URL.

This not only opens the opportunity to specify the same information for multiple URLs, but also to specify a 404 page for URLs that aren’t valid.

An example entry for a redirect with URL wildcard:

<ipns-pubkey>
---
type: "redirect"
settings:
  from:
    wildcard-path: true
    wildcard-query: true
    wildcard-fragment: true
from:
  scheme: "http"
  authority:
    host: "example.com"
  path: "*"
  query: "*"
  fragment: "*"
to:
  scheme: "ipns"
  authority: "example.com"
  path: "/404.html"
  query: ""
  fragment: ""
valid-since: "2020-12-12T00:00:00Z"
valid-until: "2021-01-15T00:00:00Z"
...
<signature>

This entry would be published in the DHT under the hash of the scheme and authority, to avoid having to publish it under every possible URL hash :wink:

Here’s an example entry that identifies the source while ignoring the fragment part:

<ipns-pubkey>
---
type: "cid"
settings:
  from:
    wildcard-fragment: true
from:
  scheme: "http"
  authority:
    host: "example.com"
  path: "/welcome-page/"
  query: "moreinfo=false"
  fragment: "*"
content:
  id: "QmPZ9gcCEpqKTo6aq61g2nXGUhM4iCL3ewB6LDXZCtioEB"
  address-hint: [
    "/ip4/6.7.8.9/tcp/46147/p2p/QmZHrtsCdrkfTkq56Q96vCbN16rEkzWogN7P58w9ytgWAj",
    "/ip4/6.7.8.9/udp/47187/quic/p2p/QmZHrtsCdrkfTkq56Q96vCbN16rEkzWogN7P58w9ytgWAj",
  ]
valid-since: "2020-12-12T00:00:00Z"
valid-until: "2021-01-15T00:00:00Z"
...
<signature>

CID entries for URLs

As seen above, the content of a URL can be linked to a content ID, optionally adding address hints to accelerate further network operations - if those nodes are online.

The simplest entry for a file stored on an FTP server would look like this:

<ipns-pubkey>
---
type: "cid"
from:
  scheme: "ftp"
  authority:
    host: "ftp.example.com"
  path: "/demo-file.txt"
  query: ""
  fragment: ""
content:
  id: "QmPZ9gcCEpqKTo6aq61g2nXGUhM4iCL3ewB6LDXZCtioEB"
valid-since: "2020-12-12T00:00:00Z"
valid-until: "2021-01-15T00:00:00Z"
...
<signature>

Note that specifying all parts of the URL is mandatory.
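A storing node could enforce that rule with a simple structural check - a sketch only, with the record shape assumed from the examples above:

```javascript
// Every part of the URL must be specified in a non-wildcard entry.
// Empty strings are fine ("no query"); missing keys are not.
const REQUIRED_PARTS = ["scheme", "authority", "path", "query", "fragment"];

function hasAllUrlParts(record) {
  const from = record.from;
  if (!from || typeof from !== "object") return false;
  return REQUIRED_PARTS.every((part) => part in from);
}
```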

IPNS entries for URLs

Besides permanently static files, the user might want to specify a dedicated IPNS key to publish new versions of a file under the same URL without having to update the URL-DHT entries every time.

This type of entry allows just that:

<ipns-pubkey>
---
type: "ipns"
from:
  scheme: "ftp"
  authority:
    host: "ftp.example.com"
  path: "/demo-file.txt"
  query: ""
  fragment: ""
ipns:
  pubkey: "QmSrPmbaUKA3ZodhzPWZnpFgcPMFWF4QsxXbkWfEptTBJd"
valid-since: "2020-12-12T00:00:00Z"
valid-until: "2021-01-15T00:00:00Z"
...
<signature>

Storing the information in the DHT

I think it might be best to create CIDs from this data, with something like a folder/file structure, to make updates space-efficient even with many, many elements stored under one hash and clients having to update the data to fetch the next URL from the network.

This way the DHT could be asked either for just the current CID, or for the CID and the data in one request if the node has no information yet. This reduces the number of round trips necessary to fetch the first byte of content, while updates with many items would remain very efficient, since the CID could be fetched via the regular network - with the DHT nodes holding the data temporarily as if it were pinned.

If I understand your ideas correctly, you want to create a separate DHT for the web where:

  • opaque URIs (foo://bar/buz?query=val) are mapped to content-addressed paths (CIDs or path under IPNS key)
  • (and optionally) content-addressed paths are mapped to some URIs (as fallback if no providers)

And for this web-specific DHT:

  • have protocols for publishing/querying this DHT
  • sign records with libp2p-key
  • verify that libp2p key is present in DNS TXT record at the time of storing record (DHT node) and lookup (DHT client)

@RubenKelevra Is that the correct description?

I feel I’m missing something: what is the value added on top of the existing DNSLink (required in both cases), apart from publishing a libp2p pubkey and signing DNSLink with it (which we could add without the need to invent a new DHT)?

@lidel that’s correct. The idea is to add more flexibility and smooth out the bumps for a better transition between Web 2.0 and Web 3.0.

  • Say you’ve got a page that uses links with a query part - you’d have a hard time switching to IPFS without running it under a dedicated subdomain, since links like “https://www.domain.tld/videos?id=332” would just break if you add a DNSLink to your domain.

  • With a URL database, you could either create a redirect to a URL without the query part, like “https://www.domain.tld/videos/id/332/”, or you could keep the same URL and just attach the content ID to it.

  • Another possibility is being able to add content IDs to URLs with schemes that aren’t currently supported, like an RTSP stream of a static file.

  • It also avoids having to maintain a folder with all the data for a domain, since you can just put links to specific CIDs in the DHT instead.

  • With long paths like “https://www.domain.tld/videos/building-6/camera-4/2020/04/10/time/22/10/” you also get a nice speed benefit, because there’s no need to do request, parse, request, parse… for each level of the URL’s path.

  • It enables users to put metadata behind URLs for identity purposes, like GPG keys for email addresses: mailto://domain.tld/user@domain.tld/gpg → CID. Which looks, granted, a bit clumsy. But nothing holds application developers back from using gpg://user@domain.tld instead, which would work fine, since we just use the host part of the authority for checking authenticity.

  • Addressing books via their ISBN URN would also be possible - with a domain as the authority, of course: urn://archive.org/isbn:379200027X

  • In the future we might want to extend this to connect URLs to dynamic content through IPFS via the currently experimental libp2p stream mounting option.

I’ve thought a bit more about the current solution, and I think the DHT nodes should NOT pin the data behind the CID, even after it’s been checked.

There’s a chance of misuse: someone might use the redirect-URL function to store data distributed over all nodes in the DHT, as there’s basically no limit on how many items can be stored, and you can also enumerate blocks of a file, like http://domain.tld/1, http://domain.tld/2, etc.

This kind of misuse might be possible for all kinds of DHTs, but in this case the amount of storage that could be abused is significantly higher.

Therefore I think we should just fetch the CID, parse it, validate it, and not pin it. This way the garbage collector can clean the space up when needed.

You can hold your CID as long as you like, but the DHT nodes might not.

Since a cleanup by the DHT nodes would result in you having to provide the data again, you basically can’t use this function to store any data reliably. :slight_smile:

Once again, this would work only if you are publishing URL2CID records for a domain which has a dnslink=/ipns/{key} and sign those URL2CID records with the mentioned key.

I struggle to see the value added by this complexity:

  • I remain skeptical that you would get any meaningful performance boost from this, as the additional DHT lookup will most likely be more expensive than simply traversing the DAG to resolve a path to a file while you are already connected to a peer that has the root CID.
  • Redirecting query params on an HTTP gateway can be handled by simpler means, like a flat manifest file: #6214
  • Speeding up provider discovery can be implemented by publishing a dnsaddr TXT record and making go-ipfs preconnect to those multiaddrs as one of the discovery methods for DNSLink names

You could experiment with this URL2CID DHT idea in a separate project that acts as an HTTP proxy in front of go-ipfs, and see if you can produce some benchmarks to prove me wrong.

I suspect we can do most of the things you mentioned via less complex means and existing DNSLink+dnsaddr :thinking:

I don’t think @RubenKelevra made any claims about performance improvements for his proposal.

I don’t think there was anything in the proposal about speeding up provider discovery either.

Hey guys,

just wanted to let you know that I want to deprecate this proposal. I’m currently working on a second version that focuses more on URIs than URLs.

@lidel I see your points about decreased performance, but on the other hand, you haven’t shown any way to convert a URL like https://www.domain.tld/videos?id=332 into a CID.

Is there one?

@zacharywhitley Well, you’re right that @lidel is focusing pretty much on the performance aspect, but he’s right about that: performance is lower for simple URLs - probably enough to cause issues for regular users.

But on the other hand, I like the flexibility of this approach, so I’ll try to focus on URIs in the next proposal, which might be more interesting than URLs - while URLs are still included.

If I understood it correctly, the person doing this is trying to replace dynamically generated responses with static content that was put on IPFS by crawling the preexisting dynamic site and putting the static output for each query on IPFS. And each update requires re-crawling.

In my mind this means they did not make their website independent from the backend: they are faking decentralization, because the source of truth is still the old app that generates output based on queries.

Personally, I’d rather not see people wasting time on partial solutions like this, and instead move to model where source of truth is not dependent on some backend service.

Is there one?

Right now, if someone wants to put their website on IPFS and keep the old URLs working, they need to make sure the static HTML+JS at /videos is capable of acting on the ?id= query or fragment parameter in the URL. This is trivial to do in JS by inspecting the window.location object and does not require the wasteful creation of id-based variants of the /videos file.
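That JS route could look roughly like this (a sketch; in a browser you would pass `window.location.href`, and the function name and fragment fallback are my own illustration):

```javascript
// Static page at /videos deciding what to render based on the legacy ?id=
// query parameter, so old links keep working after the move to IPFS.
function videoIdFromLocation(href) {
  const u = new URL(href);
  // Prefer the query parameter; fall back to a #id=... fragment if present.
  const fromQuery = u.searchParams.get("id");
  if (fromQuery !== null) return fromQuery;
  return new URLSearchParams(u.hash.replace(/^#/, "")).get("id"); // null if absent
}
```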

In addition to the JS route, I hope to have an alternative in the form of a manifest file where you can define redirects from legacy URLs to new paths, but we don’t have that yet.