IPFS for community-led research

Hi everyone, I have an application question and am looking for general feedback or to initiate a discussion.

I cofounded our-sci.net and photosynq.org, two projects focused on empowering communities to build research capacity. That means large groups (100s or 1000s of people) should be able to design and implement experiments, collect comparable, validated data, and analyze the results to address research questions of interest to the community. The Gathering for Open Science Hardware (openhardware.science) community has overlapping needs as well.

Data is collected both as sensor data (from USB- or Bluetooth-connected devices attached to your phone/computer) and as survey data entered by the user. In PhotosynQ we built a survey app and backend from scratch (4 years ago), but today using Open Data Kit (opendatakit.org) is a much smarter way to go for data collection. The data is stored as XML or JSON and contains metadata such as user, location, time/date, etc.
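To make the record shape concrete, here is a minimal Python sketch of one submission as JSON. The field names (`meta`, `survey`, `sensor`, `collected_at`) and the sample values are assumptions for illustration only; they are not the PhotosynQ or Open Data Kit schema.

```python
import json
from datetime import datetime, timezone


def make_record(user, lat, lon, survey, sensor):
    """Bundle survey answers and sensor readings with basic metadata.

    All field names here are hypothetical, chosen only for this sketch.
    """
    return {
        "meta": {
            "user": user,
            "location": {"lat": lat, "lon": lon},
            "collected_at": datetime.now(timezone.utc).isoformat(),
        },
        "survey": survey,
        "sensor": sensor,
    }


record = make_record(
    user="greg",
    lat=50.45,
    lon=30.52,
    survey={"species": "example-plant", "transplant_worked": True},
    sensor={"leaf_temperature_c": 21.4},
)

# Serialize with sorted keys so the same record always yields the same text.
text = json.dumps(record, sort_keys=True)
```

Keeping the metadata inside the file, rather than in the file name, means any tool that can read the JSON can also filter on it.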

So far, our approach has been the standard SaaS approach: we create a server (RoR + PostgreSQL) and let users collect/submit/analyze/discuss on it. Server + development costs money, so we must either charge users for server access (which reduces access) or let users fork and set up their own server (beyond the capacity of most communities, and it requires lots of documentation effort + support to make reasonable).

I hate this “pay” or “host” model. It seems IPFS could really change the game by eliminating server costs and dramatically expanding access…

Am I right? I can see some issues right off the bat:

  1. It’s unclear from reading through the documentation whether IPFS automatically backs up data in other locations in the network. So if I collect data (stored as data1.json on my phone), does it automatically back up parts or all of that on other devices on the IPFS network? Of course, when I go home to look at the data, I would download it to my local machine, but is that file replicated on other machines by default as soon as it’s created?

  2. Let’s imagine a community wants to use IPFS for community-led research projects. And let’s suppose we built a JS/HTML program, located on IPFS, which they can use to collect data, collaborate, analyze data, post results, etc. We’d probably also need an Android app for data collection. At that point, is it as easy as saying “OK - everyone in the community who wants to take part, install these programs and off you go! Data is automagically distributed to those in the network and scales up to mega / giga / terabytes”? Besides adding features / fixing code bugs, are there other issues/costs that I’m missing on the developer’s side, or other downsides for the users?

… I’m sure there are more.

Just to elucidate things further, here are some actual use cases.

Someone in Ukraine wants to research the best ways to repopulate native plants in the forest, currently grown only in universities. It’s surprisingly hard to transfer the plants without killing them, so the research involves identifying the best transplant methods for each plant, and identifying the hardiest varieties which can survive the transplant process. That means lots of individual tests where people collect plant health data (mostly survey questions, maybe some handheld sensor data) and outcomes (transplant worked / not) on many species using many transplant methods. Data is collected on the phone, and analyzed on the computer.

Someone in the US wants to develop new open source seed varieties by growing new seed varieties (crosses) in gardens across the US. Gardeners would collect data (survey, maybe some sensor) about the plant as it grew, collect the finished seed, and send it on to someone else to plant and measure. This data would be analyzed to identify which breeds performed best in which soil/climate/management conditions, to create a matrix of optimized seed varieties. Data is collected on the phone, and analyzed on the computer.

I’m intentionally simplifying things so don’t point out every flaw in these projects plz :slight_smile: but hopefully this gives you an idea.

I’d love any thoughts or feedback or interest –

Greg

So, first off, no, IPFS does not automatically replicate your data to the entire network (or even any part of it). The way it works is, it hashes any content you “add” to it to create a unique “forever” address (that’s the “permanent” part of “permanent web”), which you can then make known to other people in some way (e.g. share a link, post it on a website, etc).
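As a rough illustration of the hashing idea, here is a minimal Python sketch. It uses a plain SHA-256 hex digest as a stand-in for an address; that is an assumption made to keep the example short, since real IPFS addresses are built differently (content is chunked and the hash is wrapped in extra encoding).

```python
import hashlib
import json


def content_address(data: bytes) -> str:
    """Toy address: the SHA-256 hex digest of the bytes (not a real IPFS hash)."""
    return hashlib.sha256(data).hexdigest()


def encode(record: dict) -> bytes:
    # Sorted keys give identical bytes for identical content.
    return json.dumps(record, sort_keys=True).encode("utf-8")


original = encode({"plant": "oak", "survived": True})
same_content = encode({"survived": True, "plant": "oak"})
edited = encode({"plant": "oak", "survived": False})

addr_original = content_address(original)
addr_same = content_address(same_content)
addr_edited = content_address(edited)
```

The same bytes always give the same address no matter who computes it or where the file lives, and any edit gives a different address.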

The part where you’re getting confused, I think, is this next crucial bit. If someone accesses the file(s) you shared, that data will temporarily get replicated on their node. That cached copy will eventually expire and be garbage-collected (automatically deleted) from their system. However, the person accessing the file(s) can optionally “pin” them, which just means they are telling IPFS to never garbage-collect them.
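The cache-then-pin behaviour can be modelled in a few lines of Python. This is a toy model under stated assumptions, not how an IPFS node is implemented: `ToyNode`, its methods, and the `addr-...` strings are all made up for illustration.

```python
class ToyNode:
    """Toy model of one node's local store (not the real IPFS implementation)."""

    def __init__(self):
        self.blocks = {}     # address -> bytes currently held locally
        self.pinned = set()  # addresses that garbage collection must keep

    def fetch(self, address, data):
        """Accessing content leaves a cached copy on this node."""
        self.blocks[address] = data

    def pin(self, address):
        """Mark locally held content as never to be garbage-collected."""
        if address not in self.blocks:
            # Toy simplification: only content already held can be pinned.
            raise KeyError(f"{address} is not held locally")
        self.pinned.add(address)

    def garbage_collect(self):
        """Drop every cached copy that is not pinned; return what was removed."""
        removed = sorted(a for a in self.blocks if a not in self.pinned)
        for address in removed:
            del self.blocks[address]
        return removed


node = ToyNode()
node.fetch("addr-popular", b"dataset A")
node.fetch("addr-ignored", b"dataset B")
node.pin("addr-popular")
removed = node.garbage_collect()
```

After garbage collection only the pinned dataset is still on the node, which is why unpinned data can vanish once nobody is interested in it any more.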

To your second question, no, it’s not as easy as all of that (because you expect data to be automatically replicated, which doesn’t happen).

What you could do is provide a website (or just an API that websites and Android apps could be built on top of) where researchers could share links to research data they have stored locally. The more interesting any particular data is, the more “available” it will become (the more people access it, the more it is replicated around the network, the more people are able to send the data to others asking for it, and so on). But on the downside, if the OP (let’s call it) goes off the network for any reason and nobody else has the data available, it will not be available at all. Or, if some data was popular for a while and was safely replicated around many nodes, but nobody pinned it, eventually it will vanish from the network if interest dies off.

What you may want to do is look at other solutions if you need the data replicated in the way you were thinking, or look into a hybrid system, such as Ethereum+IPFS.

Anyway, I hope that helps. This is definitely a problem I’ve been interested in trying to solve myself. Are your projects open source?

Thanks, that clears a lot of things up. In watching the videos, there’s a pile of functionality and assumptions in the libraries which are hard to see without digging pretty deep into the code… so thanks for sharing your knowledge.

PhotosynQ was open source - I left in part because winds were shifting. What is still open is available here: https://github.com/Photosynq/ . I’m working on taking lessons from that and building something functionally similar, more generalized, and fully open.

OK - did more homework based on your response. (PS - this is an amazing resource/explanation for the existing landscape: https://github.com/UbermenschProject/ubermensch-docs/wiki/Where-do-decentralized-applications-store-their-data%3F.) Swarm seems more appropriate as it guarantees the availability of a file, acting essentially as cloud storage. However, it is less well developed than IPFS.

Filecoin, MaidSafe, Swarm, Storj, Sia, etc. all incentivize storage using some cryptocurrency and more or less guarantee that files won’t just disappear when the original creator does.

So basically - if you want guaranteed persistent storage, you have to solve the cheating problem. Solving the cheating problem means making a cryptocurrency and all that, which is complicated.

If you don’t mind things disappearing (no guaranteed storage), then IPFS is pretty easy to implement.

Both suffer from poor searchability (as they are not relational databases, all you get is the filename).

This project - ubermensch.store - addresses searchability and could be a good fit as well. It is a distributed database, so you can query the components of the information rather than just file names, and it acts as persistent storage. It’s at the concept stage, so not helpful for me right now.

Now I get it… so your suggestion is a reasonable one.

I think there is a reasonable hack / solution in here. IPFS seems very easy to implement and oh so close to what we need. I think it’s ok to give those in the community some reasonable level of responsibility for contributing storage capacity… Like maybe this:

  1. Large files (audio, video, images) are stored as links. The files themselves are hosted on Dropbox, Google Photos, or some other similar public site.

  2. Data is then just text files (JSON, XML, etc.) so it’s not so massive.

  3. Users, when they first sign in, are required to ‘pin’ the 2 least-pinned people, and then they choose at least 3 other users to ‘pin’ (this is useful actually because it also forces them to see who else is in the network). This also adds value to the user by providing them with real-time updates about those other 3 users that they couldn’t otherwise get (when data is added, results are posted, etc.)… this is like following someone on Facebook, but in a nerdy sciency way :slight_smile: It adds value to the network by creating backups of everyone’s projects. If they exceed X storage for their own projects (maybe 50 MB), they must then pin additional people to make up for it.
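The sign-up rule in step 3 could be sketched roughly like this in Python. It is only one reading of the rule: the function name `choose_pins`, the tie-breaking by name, and the sample users are assumptions, and the storage-quota part is left out.

```python
def choose_pins(pin_counts, new_user, chosen, required_least=2, min_chosen=3):
    """Return the users a newcomer must pin under the proposed rule.

    pin_counts: dict mapping each existing user to how many peers pin them.
    chosen: users the newcomer picked; at least `min_chosen` of them must be
    different from the forced least-pinned users.
    """
    others = {u: c for u, c in pin_counts.items() if u != new_user}
    # Forced picks: the least-pinned users (ties broken alphabetically).
    least = sorted(others, key=lambda u: (others[u], u))[:required_least]
    # Free picks: de-duplicated, known users that are not already forced.
    extra = [u for u in dict.fromkeys(chosen) if u in others and u not in least]
    if len(extra) < min_chosen:
        raise ValueError(f"choose at least {min_chosen} users besides {least}")
    return least + extra


pin_counts = {"ana": 0, "bo": 5, "cy": 1, "di": 3, "ed": 4, "flo": 2}
to_pin = choose_pins(pin_counts, new_user="gus", chosen=["bo", "di", "ed"])
```

Here the newcomer ends up pinning the two least-backed-up users (`ana` and `cy`) plus the three they picked themselves.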

… but let’s say I wanted to find disparate projects which all use the same method (interesting for large-scale analysis of data)… where and how can I add metadata to files? This is the searchability issue… I’m just wondering if there are workarounds… using the file name with % or something like URLs…

This is a super non-elegant solution from a math perspective, but I actually think that from a community perspective it creates value for users while leaving decision-making largely in the hands of the community.

I think there is a great opportunity for libraries to play a role here – libraries already exist as a place where communities hold the data that they care about. In the same way that I can ask my library to add a new book to their collection, we should all be able to ask our libraries to pin content on the library’s IPFS nodes. It’s basically the same model with books and IPFS content – patrons nominate content (a book, a dataset), the library considers the request and decides whether to proceed. If the library decides that the content is appropriate for its collections and if it fits within their budget, the library accessions a copy of the content into their collection. Once the content is in the library’s collection, the library is able to support access (make sure it’s available), discovery (make ways to find the content & provide ways to learn more about the content) and preservation (make sure it doesn’t get destroyed) of the content.

On a technical level, IPFS lets you reduce the problem of preserving data to this:

  • groups of people decide on a list of hashes corresponding to the content they want to pin – a pinset.
  • those people allocate storage to hold that pinset and pin it on ipfs nodes.

For storing the content, you have an abundance of options, such as:

  1. run your own IPFS nodes on your own hardware
  2. use Filecoin
  3. put it on a cloud service or colocated servers
  4. form a reciprocal arrangement to trade backups with other groups (like LOCKSS, DPN, etc.)
  5. mix and match
  6. etc., etc.

You can also optionally use ipfs-cluster to coordinate a network of participating peers who share the burden of storing an evolving set of pinned content.
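To give a feel for what "sharing the burden" of a pinset might look like, here is a small Python sketch that spreads a pinset across peers with a fixed number of copies per item. This is an assumed, simplified allocation strategy for illustration only, not ipfs-cluster's actual algorithm; the peer names and `hash-...` strings are placeholders.

```python
def allocate_pinset(pinset, peers, replication=2):
    """Assign each hash to `replication` peers, always picking the least loaded.

    Returns (allocation, load): hash -> list of peers, and peer -> pin count.
    """
    if replication > len(peers):
        raise ValueError("not enough peers for the requested replication")
    load = {peer: 0 for peer in peers}
    allocation = {}
    for content_hash in sorted(pinset):
        # Least-loaded peers first; ties broken by name for repeatability.
        targets = sorted(peers, key=lambda p: (load[p], p))[:replication]
        allocation[content_hash] = targets
        for peer in targets:
            load[peer] += 1
    return allocation, load


pinset = {"hash-a", "hash-b", "hash-c"}
peers = ["laptop", "library", "server"]
allocation, load = allocate_pinset(pinset, peers, replication=2)
```

With three items and two copies each, every one of the three peers ends up holding two items, so no single participant carries the whole collection.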

A key benefit of IPFS is that you can move your data around, rebalancing and changing storage strategies as you see fit, without changing the links that point to the content – whether a researcher is serving the data directly from her laptop or from a big beefy server in some data center, the link stays the same. This means the location of the data is only a detail that impacts things like availability and protection from data loss. It doesn’t impact the way people link to the data, cite it, etc.

That benefit also extends to the fact that I can pull data onto my own machine if I want to – I don’t have to rely on a faraway server to give me access to the content and I don’t have to rely on someone else to keep the data around if I don’t trust them to keep it safe. Again, in that context the links don’t change. If the data stays the same, the link will stay the same regardless of whether it’s on a server run by the EPA or on an external hard drive in someone’s home.

One more thing: @gbathree you’re right that decentralization radically reduces the cost to run a collaborative system like yours, because it’s possible to go completely serverless: no need to run any more web servers. No more database servers, no more Rails applications to deploy. You just design the data model, figure out how you’re going to propagate updates across the network, and build client-side applications that allow people to interact with the system. Even the client-side application itself can be stored on IPFS.

You do still have to get someone, somewhere to store the data on an IPFS node for you (or a bunch of nodes) and you do have to decide how to get updates to propagate across the network (that’s getting easier every week), but other than that there’s no infrastructure to run.

That’s an amazing idea! I have some library buddies involved in the maker movement; I’ll share this with them, see what they think, and point them to the GLAM community (PS I love Discourse, it’s awesome). Anything that can make open science more uniquely competitive with closed science is critical to making open science ubiquitous - here’s my two cents on that topic - http://blog.our-sci.net/2017/04/10/people-led-research-strange-sleeping-giant/ .

Adding libraries also adds a really interesting and unique partner when assembling a research community. Sometimes having diverse partners helps get the ball rolling.

One possibly dumb technical question I still haven’t figured out:

If there are 5 copies of a file in IPFS (pinned, let’s say), and I access the file (download it on my computer)… is it pulled from only 1 location, or all 5 (BitTorrent style), or something else? Is the originating node the only one with the rights to update the file? Can the originating node transfer or share those rights? I am creeping towards Dropbox here; I’m sure it’s wrong, I just want to know where the edge is.

Finally… When you say propagate updates… do you mean just propagate file changes (or files which run an app, in the case of a client-side app which is located on IPFS)? I guess I assumed that happened automagically (I’m used to GCM, now Firebase, keeping things all up to date). I know… nothing is automagic in reality :slight_smile: .

I think ipfs-cluster makes the most sense. In the case of most small communities, I would say everyone by default would be part of the same cluster, and just share the burden completely. As specific parts of the community grew too large, that could be adjusted.

Sorry for the questions but I want to decide if I (we) want to jump in… it seems compelling but devil’s in the details.

A file can be downloaded in pieces from every peer that has that file (so, BitTorrent style). This only really matters once the files get very large.
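Here is a toy Python sketch of that idea: the file is split into blocks, each block is addressed by its hash, and the downloader pulls blocks from whichever peers hold them, checking each one against its address. The block size, the round-robin peer choice, and all names are assumptions for illustration; the real exchange protocol is considerably more involved.

```python
import hashlib


def address(block: bytes) -> str:
    """Toy block address: SHA-256 hex digest (not a real IPFS hash)."""
    return hashlib.sha256(block).hexdigest()


def split_blocks(data: bytes, size: int = 4) -> list:
    return [data[i:i + size] for i in range(0, len(data), size)]


def download(block_addresses, peers):
    """Fetch each block from some peer that holds it, rotating between peers.

    peers: dict mapping peer name -> {block address: block bytes}.
    """
    names = sorted(peers)
    blocks, sources = [], []
    for i, addr in enumerate(block_addresses):
        for offset in range(len(names)):
            name = names[(i + offset) % len(names)]
            block = peers[name].get(addr)
            # Accept a block only if its hash matches the requested address.
            if block is not None and address(block) == addr:
                blocks.append(block)
                sources.append(name)
                break
        else:
            raise LookupError(f"no peer holds block {addr}")
    return b"".join(blocks), sources


data = b"plant health data"
parts = split_blocks(data)
wanted = [address(p) for p in parts]
store = {address(p): p for p in parts}
peers = {"peer-a": dict(store), "peer-b": dict(store)}
result, sources = download(wanted, peers)
```

Because every block is verified against its own hash, it does not matter which peer supplied it, which is what makes pulling from several peers at once safe.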

No node is able to update any file: /ipfs/ hashes are permanent. A hash always points to the same content.
You might be confusing this with /ipns/ names, which are dynamic links to /ipfs/ hashes.
An /ipns/ name is owned by a single peer, and can be updated by them at any time.

A node may give its /ipns/ private key to another person, giving them total control over the matching /ipns/ name, but this is not really an expected use case. Note that the original node might keep the private key, which would cause conflicts (automatically resolved by IPFS, but still messy) whenever both nodes update the IPNS name at the same time.

Updating files/directories creates entirely new hashes, which have to be somehow synced up between IPFS nodes.
ipfs-cluster is an example of a tool that syncs the set of hashes to be pinned across several IPFS nodes.
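The difference between a permanent hash and an updatable name can be sketched in Python as follows. This is a toy model: `ToyNames` and the owner strings are made up, and ownership is checked with a plain string, whereas real /ipns/ names are tied to key pairs.

```python
import hashlib


def content_address(data: bytes) -> str:
    """Toy stand-in for an /ipfs/ hash: the SHA-256 hex digest of the content."""
    return hashlib.sha256(data).hexdigest()


class ToyNames:
    """Toy stand-in for /ipns/: a name that its owner can repoint at will."""

    def __init__(self):
        self._owner = {}   # name -> owner who first published it
        self._target = {}  # name -> content address it currently points to

    def publish(self, name, owner, target):
        # The first publisher becomes the owner; nobody else may update it.
        if self._owner.setdefault(name, owner) != owner:
            raise PermissionError(f"{owner} does not own {name}")
        self._target[name] = target

    def resolve(self, name):
        return self._target[name]


version_1 = b'{"readings": [1, 2]}'
version_2 = b'{"readings": [1, 2, 3]}'

names = ToyNames()
names.publish("project-data", "greg", content_address(version_1))
before = names.resolve("project-data")

# Editing the data produces a new hash; the name is then moved to point at it.
names.publish("project-data", "greg", content_address(version_2))
after = names.resolve("project-data")
```

The old hash still refers to the old content; only the name has moved. Anyone who pinned the old hash keeps the old version until they learn about and pin the new one, which is the syncing problem described above.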