Data lineage visualisation

How do you produce interactive data lineage flowcharts automatically from your cleaning/transformation/derivation code? I’m a PhD student trying to properly document the procedure for cleaning and analysing population-level linked data across a dozen datasets, around a million unique IDs and hundreds of variables. I can’t host the data anywhere except university Nextcloud servers, so the purpose-built tools I’ve googled (dbt, Databricks, Fabric etc.) aren’t applicable. I know Stata well, but I’m happy to learn Python, R, SQL etc. if that gets me to my ideal outcome.

I want something similar to a DAG but more complex, where I can interact with all the data variables. It can be a table, a flow chart, or something like the web-like structure of Connected Papers or the Obsidian data view. How it looks isn’t important as long as the information is there. Clicking on a variable name should open, lead to, or expand more information, such as the source data file name and the parent/children variables (also clickable). Variables would have flow arrows or click-through links to related variables and data files. If the variable is derived, the code that derives it would also be wonderfully useful.

I feel like I’m going around in circles. Yes, I’ve asked AI, and I’ve tried multiple methods.

Things I’ve tried:
First, Claude suggested a purpose-built Node.js + npm interactive webpage hosted on GitHub Pages, based on a subsection of my thoroughly annotated Stata syntax. This is exactly what I wanted, but it didn’t scale. After I fed a second small portion of syntax into Claude to add to the diagram, the JavaScript file was so big that Claude Pro (or ChatGPT Pro, or GitHub Copilot) couldn’t even tell me where to add the variables/nodes manually.

Attempt 2 was a sort of ‘start from scratch cleaning with Python’: getting AI to translate my Stata code while I learned from line-by-line explanations (it used SQLite and pandas). ChatGPT told me what folders and files to create, and gave me working scripts that together output a markdown file, which I rendered with Mermaid in VS Code. It worked, but it’s nowhere near the end result I’d like, and the learning curve was steep. No clickable links, just variable names with relationship arrows.
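For anyone curious, the Mermaid step amounts to roughly the following. This is a minimal sketch rather than the actual scripts; `to_mermaid`, `to_markdown` and the example variable names are hypothetical, and it assumes every variable name is a valid Mermaid node ID (a variable literally called `end`, for example, would need quoting or renaming).

```python
def to_mermaid(edges):
    """Build a Mermaid flowchart definition from (parent, child) variable pairs."""
    lines = ["flowchart LR"]
    for parent, child in edges:
        # One arrow per derivation relationship.
        lines.append(f"    {parent} --> {child}")
    return "\n".join(lines)


def to_markdown(edges, title="Data lineage"):
    """Wrap the flowchart in a fenced mermaid block inside a markdown document."""
    fence = "`" * 3  # built at runtime so this example stays easy to paste
    return f"# {title}\n\n{fence}mermaid\n{to_mermaid(edges)}\n{fence}\n"


if __name__ == "__main__":
    # Hypothetical example: bmi is derived from weight and height.
    example_edges = [("weight", "bmi"), ("height", "bmi"), ("bmi", "obese")]
    print(to_markdown(example_edges))
```

The output is just arrows between names, which is the limitation described above: the diagram carries no source file, no derivation code, and nothing to click through to.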

Finally, I thought maybe I could just use the data view in Obsidian. So I started adding individual notes for variables and data files, with links and the information I wanted to include. I think this will work, but it will be exceedingly tedious. I don’t know if there’s a way to create markdown files or log files out of Stata syntax. ChatGPT seems to suggest there is a way to call Stata through Python, and that with scripts the markdown files could be automated. I haven’t tried this yet.
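To make the automation idea concrete, here is a rough, untested-at-scale sketch of what I imagine: read the text of a .do file, pick out simple `gen`/`generate`/`egen` lines, and write one Obsidian note per variable with `[[wikilinks]]` to its parents and children plus the deriving line. It does not call Stata at all. Every name in it (`parse_do_file`, `render_note`, `write_notes`, the example variables) is hypothetical, and it assumes one command per line with no macros, loops, `replace`, `rename` or `merge`, so real syntax would need much more than this.

```python
import re
from pathlib import Path

# Matches simple lines such as "gen bmi = weight / (height^2)"
# or "gen byte obese = bmi >= 30", with an optional storage type.
GEN_RE = re.compile(
    r"^\s*(?:gen|generate|egen)\s+"
    r"(?:(?:byte|int|long|float|double|str\w*)\s+)?"
    r"(\w+)\s*=\s*(.+)$"
)
NAME_RE = re.compile(r"[A-Za-z_]\w*")
FENCE = "`" * 3


def parse_do_file(text, known_vars):
    """Return {derived_var: {"parents": [...], "code": line}} for simple gen/egen lines."""
    known = set(known_vars)
    lineage = {}
    for line in text.splitlines():
        match = GEN_RE.match(line)
        if not match:
            continue
        target, expr = match.group(1), match.group(2)
        # A parent is any already-known variable named in the expression
        # (including variables that only appear in an "if" condition).
        parents = sorted({n for n in NAME_RE.findall(expr) if n in known and n != target})
        lineage[target] = {"parents": parents, "code": line.strip()}
        known.add(target)
    return lineage


def render_note(var, parents, children, code):
    """Build the markdown text of one variable note with Obsidian wikilinks."""
    lines = [f"# {var}", "", "## Parents"]
    lines += [f"- [[{p}]]" for p in parents] or ["- (none: source variable)"]
    lines += ["", "## Children"]
    lines += [f"- [[{c}]]" for c in children] or ["- (none)"]
    if code:
        lines += ["", "## Derivation", FENCE + "stata", code, FENCE]
    return "\n".join(lines) + "\n"


def write_notes(lineage, out_dir):
    """Write one .md file per variable (derived or source) into out_dir."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    children = {}
    for var, info in lineage.items():
        for parent in info["parents"]:
            children.setdefault(parent, []).append(var)
    for var in sorted(set(lineage) | set(children)):
        info = lineage.get(var, {"parents": [], "code": ""})
        note = render_note(var, info["parents"], children.get(var, []), info["code"])
        (out / f"{var}.md").write_text(note, encoding="utf-8")
```

Usage would be something like `write_notes(parse_do_file(Path("clean.do").read_text(), ["weight", "height"]), "vault/variables")`, where the file name, the starting variable list and the vault folder are placeholders. The source data file name per variable isn’t captured here; that would need the `use`/`merge` lines parsed as well.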

I’m probably missing something. I’m sure this issue has been solved before. Any help would be appreciated and thank you for reading ☺️