I have been thinking a lot lately about the value IPFS could provide to Extract, Transform, Load (ETL) operations. My thinking is that, paired with a metadata database, IPFS could handle both the “extract” and “load” steps in simple cases, where “extract” means “pull some files to the local processing server” and “load” means “push the resulting data product into a data lake”.
To me this seems huge because it would let data analysis developers focus entirely on the “transform” stage. I have been seriously considering a test implementation, but I am hesitant to commit a lot of time to such an experimental idea.
Here is a rough outline of my plan:
- install IPFS on all my machines
- implement a database that maps product metadata to IPFS content hashes (CIDs, SHA-256 based by default) for “extracts”
- replace the E & L operations in my Airflow pipeline with reads from an IPFS FUSE mount and ipfs add, respectively
- set up an IPFS Cluster to keep my data pinned across nodes
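To make the outline a bit more concrete, here is a minimal Python sketch of the metadata database and the E & L replacements. Everything in it is an assumption for illustration rather than a tested design: the `products` table layout, the function names, SQLite as the catalog, a daemon mounted at the default `/ipfs` path via `ipfs mount`, and the `ipfs` CLI being on the PATH.

```python
import sqlite3
import subprocess
from pathlib import Path

# Assumption: `ipfs mount` is running with its default mount point.
IPFS_MOUNT = Path("/ipfs")


def open_catalog(db_path=":memory:"):
    """Open (or create) the hypothetical catalog mapping products to CIDs."""
    conn = sqlite3.connect(db_path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS products (
               name    TEXT NOT NULL,
               version TEXT NOT NULL,
               cid     TEXT NOT NULL,
               PRIMARY KEY (name, version)
           )"""
    )
    return conn


def register(conn, name, version, cid):
    """Record (or overwrite) the CID for a named, versioned data product."""
    conn.execute(
        "INSERT OR REPLACE INTO products (name, version, cid) VALUES (?, ?, ?)",
        (name, version, cid),
    )
    conn.commit()


def lookup(conn, name, version):
    """Return the CID for a product, or raise KeyError if it is unknown."""
    row = conn.execute(
        "SELECT cid FROM products WHERE name = ? AND version = ?",
        (name, version),
    ).fetchone()
    if row is None:
        raise KeyError(f"no CID registered for {name} {version}")
    return row[0]


def extract_path(conn, name, version):
    """'Extract': resolve a product to a readable path under the FUSE mount."""
    return IPFS_MOUNT / lookup(conn, name, version)


def load(conn, name, version, local_path):
    """'Load': add a file or directory to IPFS and record the resulting CID."""
    # -Q prints only the final root hash; -r allows directories.
    result = subprocess.run(
        ["ipfs", "add", "-Q", "-r", str(local_path)],
        check=True,
        capture_output=True,
        text=True,
    )
    cid = result.stdout.strip()
    register(conn, name, version, cid)
    return cid
```

In an Airflow task, the transform step would then read its inputs from `extract_path(...)` and call `load(...)` on its output directory. Pinning across nodes would still be a separate step, for example running `ipfs-cluster-ctl pin add <cid>` on the CID that `load` returns.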
Has anyone tried something like this?
Are there challenges I am overlooking?