Addressing petabytes of genetic data with IPFS

Hello,

Tl;dr: I would like to run an IPFS gateway which maps IPFS addresses to files which are hosted by a third party.

I’m writing to ask for help and advice.

I’m a research software engineer at the European Bioinformatics Institute (EMBL-EBI). One of the services offered at the EBI is the European Nucleotide Archive (ENA). This archive contains petabytes of public open-access genetic data. The data is currently served over FTP and REST, with around 6 FTP mirrors worldwide.

There is obviously great value in addressing this data with the IPFS protocol. For example, computational results could be reliably replicated because the input data would be content-addressed.

I have “local” (a mounted drive, as local as possible) access to the data. However, I don’t believe that I can simply run an IPFS node on the computational infrastructure available at the EBI and start serving files. I think that this strategy will encounter a lot of resistance. Which grant/budget covers the infrastructure cost (particularly network IO and file system IO)?

I could try convincing the ENA devops to deploy IPFS, but I’ve heard that they’re very busy and I don’t believe that they will be very keen to support a third protocol when there is no apparent (for them at this time) motivation to do so.

I have thought of hosting some of the data myself as a proof of concept. However, it’s occurred to me that I can “serve” all of the data from the ENA using IPFS if I can redirect IPFS addresses to the corresponding ENA endpoints. In theory, all that I would need to store locally is the mapping from IPFS addresses to ENA endpoints.

Does this sound like a reasonable solution and does it currently exist?
If it does not exist, I am happy to work towards implementing it within IPFS or as a hack.

Thank you for your attention.

Best regards,
Robyn

If I understand correctly, it sounds like you might be describing the IPFS Filestore. Note that it’s currently an experimental feature.

In order to address the ENA content using IPFS multihashes, I don’t think you can get around ingesting the data into IPFS initially to come up with the “mapping” – which I’d expect to be doing the network IO and filesystem IO that you have grant concerns over.
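To illustrate why every byte has to be read at least once, here is a minimal sketch of a sha2-256 multihash computed over a stream. It is an illustration only: the address that `ipfs add` prints is the hash of the root of a chunked Merkle DAG, not the hash of the raw file bytes, so this will not reproduce real IPFS addresses. The function names are made up for the example.

```python
import hashlib
import io

B58_ALPHABET = "123456789ABCDEFGHJKLMNPQRSTUVWXYZabcdefghijkmnopqrstuvwxyz"


def base58_encode(data: bytes) -> str:
    """Encode bytes using the Bitcoin base58 alphabet."""
    number = int.from_bytes(data, "big")
    encoded = ""
    while number > 0:
        number, remainder = divmod(number, 58)
        encoded = B58_ALPHABET[remainder] + encoded
    # Each leading zero byte is represented by a leading '1'.
    leading_zeros = len(data) - len(data.lstrip(b"\x00"))
    return "1" * leading_zeros + encoded


def sha256_multihash(stream, chunk_size=1 << 20) -> str:
    """Hash a file-like object chunk by chunk; every byte has to be read."""
    digest = hashlib.sha256()
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        digest.update(chunk)
    # Multihash layout: 0x12 = sha2-256 code, 0x20 = digest length (32 bytes).
    return base58_encode(bytes([0x12, 0x20]) + digest.digest())


# Stand-in for a sequence file; a real run would pass an open file instead.
print(sha256_multihash(io.BytesIO(b"ACGT" * 1000)))
```

Changing a single byte of the input changes the resulting hash, which is what makes the address usable as an integrity check.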

This sounds like an exciting use case for IPFS!

Thank you for your reply @leerspace.

I agree with you that IPFS must ingest the data before the IPFS multihashes, and the mapping I mentioned above, can be produced. I’m not very concerned about the network IO and file system IO cost of generating this mapping once. But I think that the ongoing IO cost of running an IPFS node on the ENA’s computing infrastructure (loading files from the file systems and serving data over the network) will be too high and will require political work which I’m hoping to sidestep.

I’ll take a look at Filestore; I don’t know enough about it right now. Ideally, I was hoping that a request to my IPFS endpoint could be silently redirected to the correct ENA endpoint.

Do you think that the following might work:

  1. download subset of ENA data
  2. produce IPFS multihashes
  3. save mapping (IPFS multihash -> ENA REST endpoint)
  4. delete ENA data subset
  5. spoof IPFS HTTP endpoints with the correct IPFS multihash addresses using a web framework (e.g. in Python; IPFS itself is not used for this step)
  6. have the web framework redirect GET requests to the correct ENA endpoints.
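
Steps 5 and 6 could look roughly like the sketch below, which uses only the Python standard library. The mapping entry is a placeholder (both the hash and the URL are invented for illustration, not real ENA endpoints), and the `/ipfs/<multihash>` path layout is an assumption modelled on how IPFS HTTP gateways expose content.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical mapping produced in step 3; hash and URL are placeholders.
MAPPING = {
    "QmExampleHashForIllustrationOnly": "https://ena.example.org/files/ERR000001.fastq.gz",
}


def resolve(path, mapping):
    """Return (status, location) for a request path such as /ipfs/<multihash>."""
    prefix = "/ipfs/"
    path = path.split("?", 1)[0]  # ignore any query string
    if not path.startswith(prefix):
        return 404, None
    multihash = path[len(prefix):].split("/", 1)[0]
    target = mapping.get(multihash)
    if target is None:
        return 404, None
    return 302, target


class RedirectHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Answer with a redirect to the mapped endpoint, or 404 if unknown.
        status, location = resolve(self.path, MAPPING)
        self.send_response(status)
        if location is not None:
            self.send_header("Location", location)
        self.end_headers()


def serve(port=8080):
    """Serve redirects on localhost until interrupted."""
    HTTPServer(("127.0.0.1", port), RedirectHandler).serve_forever()
```

Calling `serve()` starts the redirector locally; a GET for `/ipfs/QmExampleHashForIllustrationOnly` would then receive a 302 pointing at the mapped URL.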

Looking forward to hearing your thoughts!

What I had in mind with Filestore is that if you can mount the dataset locally, then an IPFS node running on the machine could just retrieve the files through the mount point. It sounds like you’re trying not to use a mounted filesystem though.

If I understand it correctly, this isn’t really using IPFS for transferring data anymore. If it’s just using a web application to convert IPFS-style multihashes to ENA REST endpoints, would there actually be an IPFS node anywhere? If not, then I think that might defeat the purpose of using IPFS hashes, unless someone with a regular IPFS node can request the multihash (and for that to work, there needs to be an IPFS node somewhere with access to the ENA data that can handle requests for the multihashes from other IPFS nodes).

This is my understanding of what you’re describing (where there’s no IPFS node involved in the process):

User with multihash <-http(s)-> web application that maps multihashes to ENA URIs <-http(s)-> ENA REST endpoint

I don’t know what you mean by “IPFS HTTP endpoint”, so it’s possible I’m missing something.

In either case, there would also need to be a way for users requesting data to know which multihash to request in the first place for the data they want. For this I think there would need to be some kind of user-facing website that allows them to look up IPFS multihashes that correspond to the appropriate ENA identifiers, Taxon records, or Project records.

Maybe others on here have some better ideas about how this could work with the stated constraints.

Sounds like what he would like is to have the IPFS equivalent of a torrent file with a webseed. So he publishes the hashes to the DHT, but when people are retrieving the hashes they can fall back to HTTP (or FTP??) to actually transfer the data.

I don’t see that fitting in the protocol very well.

Hello @leerspace and @wscott,

Thank you both for your replies.

As promised, I’ve investigated IPFS Filestore and it’s worked out brilliantly! I can now mount an FTP directory and add directories/files with --nocopy to IPFS. The IPFS init directory takes up a trivial amount of memory on my laptop. However, the files that I’m seeding with IPFS are of course provided through my laptop and not directly from the ENA.
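
For reference, the Filestore setup described above comes down to two `ipfs` commands: enabling the experimental feature and adding by reference. Below is a minimal sketch that wraps them in Python, assuming `ipfs` is on the PATH and the repository has already been initialised. The helper names are made up for the example, and the mount path in the usage note is hypothetical.

```python
import subprocess


def enable_filestore_cmd():
    """Command that switches on the experimental Filestore feature."""
    return ["ipfs", "config", "--json", "Experimental.FilestoreEnabled", "true"]


def add_nocopy_cmd(path):
    """Command that adds a directory by reference instead of copying its blocks."""
    return ["ipfs", "add", "--nocopy", "-r", path]


def run(cmd):
    """Run a command and return its standard output; raises if it fails."""
    return subprocess.run(cmd, check=True, capture_output=True, text=True).stdout
```

Usage would be `run(enable_filestore_cmd())` once, followed by `run(add_nocopy_cmd("/mnt/ena/some-study"))` for each mounted directory. Because the blocks are referenced rather than copied, the original files have to stay in place and unchanged for the node to keep serving them.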

@wscott, you’ve hit the nail on the head! That’s exactly what I was looking for. However, I believe that my current setup should be enough to convince my colleagues that IPFS will prove to be a very valuable solution.

Thank you for the help!
Robyn

Hi.

I might have a similar use case to this. Let me explain the scenario a bit.

I work with various organizations who have very large sets of data. A normal set might be something like 50TB or so, but some could be more or less. The data are by and large big binary blobs, but some of them could be text(ish).

In order to avoid moving all this data around, indices are built so that one person can pull out a subset of the data, but this subset could still end up being some hundred(s) of megabytes.

I would like to put this data on a distributed platform so that each entity who needs the data can pull it from a peer or peers nearby instead of the central store. So basically a BitTorrent type of system, but I would like to be able to set up different “containers” or fenced-in areas where access to data is more restricted.

Could IPFS do this?

It looks like Filestore is a pretty good fit for this, right?

If you’re trying to restrict access to data within IPFS, I don’t think that there are currently good ways to do that. There are ways to set up private IPFS networks which only allow your node to connect to others with a shared key, but other than restricting who you’re connecting to there’s nothing to prevent someone you’re connected to from requesting something your node is providing.
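
As a rough sketch of the shared-key part: to my knowledge, a private network is set up by placing the same `swarm.key` file in the IPFS repository directory of every participating node. The function below generates the contents of such a file with a random 32-byte key. The function name is made up for the example, and the experimental-features documentation should be checked for the current format before relying on this.

```python
import secrets


def make_swarm_key() -> str:
    """Return the text of a swarm.key file holding a random 32-byte pre-shared key."""
    # Three lines: protocol header, key encoding, then the hex-encoded key.
    return "/key/swarm/psk/1.0.0/\n/base16/\n" + secrets.token_hex(32) + "\n"
```

The key only controls which peers can connect. As noted above, any peer that holds the key can still request anything another node on that network provides.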

Other than wanting to try to restrict access to data, this sounds like a good fit to me!

@rffrancon you may be interested in work that @flyingzumwalt is doing on motivating factors for using IPFS with large datasets: What motivates people to use IPFS for large volumes of data?

This subject can be quite interesting for all the biochem industries that do heavy research.

There could even be some specialized implementations and features to meet their needs.

I do happen to have a background in molecular biology and I know people in biomedical research who are desperately in need of improvements in computing. The idea of an operating system specialized for biochem research use cases has already been talked about.

Part of that OS could be a decentralized filesystem that could

1: store data safely

2: keep the data free from corruption

3: allow data to be accessed on an ad-hoc basis

IPFS might be an interesting idea for those industries… it could relieve some of the resources needed by the machines, hence allowing them to get “more bang for their buck”, while at the same time being able to store their data much more safely…