Back in November 2017 I started a thread on the topic of addressing large scale (genomic) data with IPFS: Addressing petabytes of genetic data with IPFS
The goal was to provide access to this data using IPFS content addresses. This was partially achieved using Filestore and a mounted FTP directory. However, this solution had an obvious network latency bottleneck.
In this thread I would like to discuss a more general solution. Many institutions provide open access to very large scale static datasets. In the vast majority of cases, this data is only accessible via location addresses. Convincing each institute to manage IPFS instances with Filestore extensions would be a costly endeavor. Similarly, incentivizing and organising peers to mirror the data for a significant duration is a difficult challenge.
As an alternative, peers could work together to curate a bidirectional mapping of IPFS content addresses to location addresses. This mapping could be consulted to find alternative means for retrieving data. In the event that data becomes absent from the IPFS network, a location address could be queried.
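To make the idea concrete, here is a minimal in-memory sketch of such a bidirectional map. All names here are hypothetical (this is not the API of any existing project): content addresses map to one or more location addresses, and each location address maps back to a content address, so the map can be consulted in either direction when data is absent from the IPFS network.

```python
class LocationMap:
    """A toy bidirectional map between IPFS content addresses
    and location addresses (URLs). One CID may be mirrored at
    several locations; each location resolves to one CID."""

    def __init__(self):
        self._by_cid = {}   # content address -> set of location addresses
        self._by_loc = {}   # location address -> content address

    def add(self, cid, location):
        """Record that the data identified by `cid` is retrievable at `location`."""
        self._by_cid.setdefault(cid, set()).add(location)
        self._by_loc[location] = cid

    def locations_for(self, cid):
        """Fallback sources to query when the data cannot be found on IPFS."""
        return sorted(self._by_cid.get(cid, ()))

    def cid_for(self, location):
        """The content address expected for the data at `location`, if known."""
        return self._by_loc.get(location)
```

A resolver built on top of this would first try the IPFS network for a CID, and only fall back to fetching from one of `locations_for(cid)` when no peer provides the block.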
The mapping would require ongoing maintenance, which could be shared among a group of cooperating peers. For instance, a peer could validate a mapping entry by retrieving the data from its location address and checking that it hashes to the recorded content address.
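The validation step might look something like the sketch below. Note that a real IPFS CID is produced by chunking the data and encoding a multihash, not by hashing the raw bytes directly; a plain SHA-256 digest stands in for it here, and the `fetch` callable and function name are assumptions for illustration.

```python
import hashlib

def validate_entry(expected_digest, location, fetch):
    """Re-fetch the data at `location` and check that it still hashes
    to the digest recorded in the mapping entry.

    `fetch` is any callable returning the raw bytes for a location
    address (e.g. an HTTP or FTP client). A plain SHA-256 digest is
    used here as a stand-in for a proper IPFS CID, which would also
    involve chunking and multihash encoding.
    """
    data = fetch(location)
    return hashlib.sha256(data).hexdigest() == expected_digest
```

Peers could periodically run such checks over random entries and drop or flag mappings whose location addresses no longer serve the expected bytes.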
I have begun working on a project for managing and building a bidirectional map:
I believe that the ability to access and address data using IPFS content addresses would be of great benefit to the scientific community, irrespective of the backend mechanisms that actually retrieve the data.
I hope that this community can cast a critical eye over my proposal. I’m quite happy to abandon it completely or change it significantly if I’m not on the right track. I would welcome all feedback.
Thank you for your time.
(The IPFS URL store might be related: https://github.com/ipfs/go-ipfs/pull/4896)