Architecture Question - Storing Constantly Updated Branched Data

Hey everyone, I have a long-form (re)architecture question and would love some feedback on whether it's even the right approach.

I'm from the dClimate team, where we're currently working on making climate data available and easily accessible through IPFS. We're an offshoot of Arbol (we were incubated by the Arbol team and share some of the same cofounders), which actually has a case study on the IPFS docs page here: https://docs.ipfs.io/concepts/case-study-arbol/. Our goal is to increase transparency in the climate data space so that everyone from construction companies, logistics organizations, and parametric insurance providers to local and national governments has as much information as possible to make the best decisions.

As we have continued to scale the datasets on the Arbol side (into the dozens of terabytes), we came to the realization that a climate-specific DAO stewarded by domain experts would be the best way to produce and maintain high-quality data sources. With this expansion we also began to investigate what scaling our infrastructure would look like compared to our initial versions, and I have some thoughts for a re-architecture that I would love your feedback on.

Government climate datasets are not only quite fragmented across dozens of FTP servers, but data points are also sometimes updated after they are posted, so older data points often need to be revised, which makes parametric insurance a nightmare to deal with if you don't have a data trail. We're leveraging IPFS to create that data trail so that anyone can publicly verify and confirm for themselves what the data previously was and what it is now, while also making these (spatiotemporal) datasets easier to query.

Problem:
The way we currently do this is by uploading the initial dataset of already-available historical data onto IPFS. Then, upon each new update on the government servers, we add a new item to IPFS containing the new and/or revised data, with a reference to the previous CID, which at first is the "genesis" CID for that particular dataset. On the next update from the government sources, we repeat this process, this time referencing the last CID, and so on. This creates a linked list of all new data along with all the previous data updates. To make it easier to query this IPFS "data structure" we've created a Python client that anyone can run locally or on a server. The issue, however, is that this client must traverse the entire linked list and reconstitute the entire dataset in memory (via a reduce), even for a small slice of data (in time). As we intend to create a climate data infrastructure that anyone can use (even hobbyists), the increasing space and memory demands of this approach (as new data is added) create constraints that will only grow over time, crowding out the very people we want to empower.
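
To make the cost concrete, here's a rough sketch of what the client effectively has to do (hypothetical node layout with a `previous` link and a `data` payload, fetched through a public gateway; our real schema differs, but the shape of the problem is the same):

```python
import requests

GATEWAY = "https://ipfs.io/ipfs/"  # any public gateway works here

def fetch_node(cid: str) -> dict:
    """Fetch one update node; assumes each node was stored as plain JSON."""
    resp = requests.get(GATEWAY + cid, timeout=30)
    resp.raise_for_status()
    return resp.json()

def reconstitute(head_cid: str) -> dict:
    """Walk the chain from the newest update back to genesis, then fold
    every update into one in-memory dataset (later updates win)."""
    chain, cid = [], head_cid
    while cid is not None:            # O(n) sequential fetches, even for a 1-day slice
        node = fetch_node(cid)
        chain.append(node)
        cid = node.get("previous")    # CID of the prior update, None at genesis
    dataset = {}
    for node in reversed(chain):      # genesis first, so revisions overwrite
        dataset.update(node["data"])  # the "reduce": whole history held in memory
    return dataset
```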

Proposed Solution:
As a result of these constraints, I looked a bit more into IPFS and what data structures/models would work well within the IPFS ecosystem. I wondered if combining the concepts of Merkle DAGs with IPLD Schemas would be a more scalable approach (e.g. the git model). Instead of the current linked list, datasets would be preprocessed to fit into predetermined "buckets" (hourly, daily, weekly, monthly) depending on the specific dataset, and a root node of a Merkle DAG would reference these bucketed datasets (each Merkle DAG root would conform to an IPLD Schema). If any particular day were to change, the new root node would reference all the other days unchanged, except for the day that changed, where it would point to the new version, and so on. The days themselves could also carry some metadata (if so desired) so that a user could traverse the versions of a day. So you get a data trail not only in the root nodes of the Merkle DAGs (which reference each other back to genesis) but also on the data-bucket side (option B on the diagram). I've attached a diagram for comprehensibility, where the top structure is the current implementation and the bottom is the new proposed approach.
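
Here's a minimal sketch of the update step I have in mind, with hashed JSON and a plain dict standing in for real CIDs and the IPFS blockstore (in practice these would be IPLD nodes written via `ipfs dag put` and typed with an IPLD Schema):

```python
import hashlib, json

store = {}  # toy content-addressed store: cid -> node

def put(node: dict) -> str:
    """Stand-in for adding an IPLD node: 'CID' = hash of canonical JSON."""
    cid = hashlib.sha256(json.dumps(node, sort_keys=True).encode()).hexdigest()[:16]
    store[cid] = node
    return cid

def revise_day(root_cid: str, day: str, new_day_data: dict) -> str:
    """Copy-on-write update, git style: only the changed day gets a new
    node; links to every other day are carried over untouched."""
    root = store[root_cid]
    new_day_cid = put({
        "data": new_day_data,
        "prev_version": root["days"][day],  # per-day version trail (option B)
    })
    return put({
        "days": {**root["days"], day: new_day_cid},  # all other days reused
        "prev_root": root_cid,                       # root trail back to genesis
    })
```

Every historical root stays resolvable, so a derivative dataset is just someone else publishing their own root that reuses our day CIDs.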

I feel like this branching model gives the community the ability to easily create forked (derivative) datasets while also making it trivial to query by time slice, since you no longer need to reconstitute the entire linked list and can fetch the days you care about in parallel.
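
For example, a time-slice query then only touches the buckets it needs (this sketch assumes the root resolves to a JSON map of day labels to CIDs, as in the sketch above, fetched through a gateway):

```python
from concurrent.futures import ThreadPoolExecutor
import requests

GATEWAY = "https://ipfs.io/ipfs/"

def fetch_day(cid: str) -> bytes:
    """Pull one day bucket's raw bytes from a public gateway."""
    resp = requests.get(GATEWAY + cid, timeout=60)
    resp.raise_for_status()
    return resp.content

def fetch_slice(root: dict, wanted_days: list) -> dict:
    """Resolve only the requested day CIDs from the root node and pull
    them concurrently; days outside the slice are never fetched."""
    cids = [root["days"][d] for d in wanted_days]
    with ThreadPoolExecutor(max_workers=8) as pool:
        return dict(zip(wanted_days, pool.map(fetch_day, cids)))
```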

With that said, I was wondering if there are any constraints at the protocol level, such as folder size (whether total size or number of files), or any particular resources that can help us avoid pitfalls if this approach is deemed worth exploring. Our ultimate goal is to also replicate this data onto Filecoin to create a canonical, permanent climate data "ledger". If anything above does not make sense please let me know :slight_smile:


Hi!

I'm no expert but I'll have a go at it. Weather data would have to be location-based and timestamped, I assume. There is also the question of authority (your DAO idea might be good). My (limited) experience designing decentralized systems taught me that you will have to sacrifice something to get scaling.

First, let's have some location objects (name, coords, etc…) and some time objects (days, weeks, months, etc…). Location and time are kinda universal, and you can then link to these objects when new weather data is collected. Anyone can add weather data and link it (even bad data), so you need some kind of anchoring/indexing so that people can agree on the set (curation) or have multiple ways to search the same data.
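
Something like this maybe (hashed JSON standing in for real CIDs, all names made up):

```python
import hashlib, json

def cid_of(obj) -> str:
    """Stand-in for a real CID: hash of the object's canonical JSON."""
    return hashlib.sha256(json.dumps(obj, sort_keys=True).encode()).hexdigest()[:16]

# Universal, immutable anchor objects that everything else links to.
location = {"name": "Reykjavik", "lat": 64.15, "lon": -21.94}
day = {"kind": "day", "date": "2021-06-01"}

# Anyone can publish an observation that links to those anchors;
# curation/indexing later decides which observations to trust.
observation = {
    "location": cid_of(location),  # link by CID, not by a mutable name
    "time": cid_of(day),
    "temp_c": 11.2,
}
```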

This design has one main pitfall: you can't change the location/time objects, since all the weather data would link to them. You might be able to link new versions to old ones, though. Also, curation might take too much effort.

The benefits are: crowd-sourcing is possible, adding new data is trivial, and you can always add new indexes and subsets.

My $0.02

My hot take would be: you're looking to do a lot of things at once. I see you have a section labeled "Problem", but what exactly are you looking to do? What kind of data is this? Imagery? Sensor readings? What format are these things in? What kind of access patterns do you want to support? Do you want to allow bulk downloads that can subsequently be loaded into a database, or do you want this to be queryable? Somewhere in between? It sounds like you want to support a bi-temporal data structure. That will be challenging on its own, even before you distribute it.

Why are you even making it distributed? Is it to provide availability? Not rely on a single corporation or service? Share costs? etc?

I'm a big fan of IPFS, but I think most uses of it should start with: "Why am I not just putting this in S3?" Again, not putting it down, but it's a question you should have asked yourself and have a good answer for.

Hey @SionoiS, really appreciate your input! Indeed, crowd-sourcing is the key driving force behind this approach, as we want to make it as easy as possible for the community to create derivative sets and use these same datasets as inputs (interoperability is paramount).

Curation is definitely very important for us; at the moment we're using The Graph for discoverability, among some other mechanics :slight_smile: We will maintain our own canonical Merkle roots (if we go ahead with this re-architecture) which the wider community can use for their own endeavors, while giving everyone else the flexibility to maintain their own.


Hey Zachary, thanks for the reply! Love hot takes. Apologies for the formatting (this was originally sent in Slack and reformatted; I may edit to make it clearer). The problem statement is primarily the feasibility (specifically any major cons or issues that can surface in IPFS w.r.t. protocol limits) of converting the current linked-list approach (which is limiting) to a Merkle DAG for bucketing/storing data. The data is primarily sensor readings, which are time series, currently gzipped, but we also have image data; nevertheless, both are bucketed according to time (hence the Merkle DAG time-bucketing approach outlined).

Re: access patterns, we currently have a client, as mentioned, for our linked-list approach which allows for querying based on time/space (it reconstitutes all of this data in memory for the entire chain of history, which, as mentioned, isn't scalable, especially as datasets get longer and gain resolution). Ideally the goal is to be as queryable as possible without too many "middleware" layers. I know, of course, that when it comes to file storage one is somewhat limited in this regard, but leaving the base layer as flexible as possible allows for performant middleware solutions while letting the user "get as close to the data" as possible. With the Merkle DAG approach, a user can bulk download multiple days in parallel and then use their preferred processing methods to build, locally or in memory, the analytics they desire; we just want to ensure what we provide is flexible enough to allow for this.

Re: the bi-temporal data structure, although we currently support this in our linked-list approach, I do understand that querying something like this becomes quite challenging regardless. Nevertheless, by chaining both the Merkle DAG root node versions and the days they contain (each day can reference a previous version of that day), we can get somewhere close to this without sacrificing too much on the parsing side.
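
Walking that per-day trail is then just a short hop backwards through each day node's previous-version link (same toy node shape as the sketch in my original post):

```python
def day_versions(store: dict, day_cid: str):
    """Yield every historical version of a day, newest first, by
    following each day node's prev_version link back in time."""
    cid = day_cid
    while cid is not None:
        node = store[cid]
        yield cid, node["data"]
        cid = node.get("prev_version")
```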

As for why to make it distributed, it's a great question, and you nailed a few of the answers. Availability is important so that the wider community doesn't have to rely on one corporation which could go out of business. Increasing the redundancy of this very important data (there were concerns during the previous US administration of climate data being wiped) means there is no single point of failure. Beyond that, the goal is also to increase interoperability with the wider decentralized space, leveraging tools such as Chainlink and payment/organizational rails such as Ethereum. I could go on at length about the benefits :sweat_smile: but I hope this suffices.

Ultimately, to summarize the problem statement listed above: we feel that the linked-list data structure is limiting and want to re-architect toward a Merkle DAG solution, but we want to know if there are any limitations w.r.t. the IPFS protocol or things to watch out for during implementation. (I will add this as a TL;DR above.)