Show HN: I built an open-source data pipeline tool in Go
(github.com)198 points by karakanb 8 days ago | 48 comments
Every data pipeline job I had to tackle required quite a few components to set up:
- One tool to ingest data
- Another one to transform it
- If you wanted to run Python, set up an orchestrator
- If you need to check the data, a data quality tool
Let alone this being hard to set up and taking time, it is also pretty high-maintenance. I had to do a lot of infra work, and while this being billable hours for me I didn’t enjoy the work at all. For some parts of it, there were nice solutions like dbt, but in the end for an end-to-end workflow, it didn’t work. That’s why I decided to build an end-to-end solution that could take care of data ingestion, transformation, and Python stuff. Initially, it was just for our own usage, but in the end, we thought this could be a useful tool for everyone.
In its core, Bruin is a data framework that consists of a CLI application written in Golang, and a VS Code extension that supports it with a local UI.
Bruin supports quite a few stuff:
- Data ingestion using ingestr (https://github.com/bruin-data/ingestr)
- Data transformation in SQL & Python, similar to dbt
- Python env management using uv
- Built-in data quality checks
- Secrets management
- Query validation & SQL parsing
- Built-in templates for common scenarios, e.g. Shopify, Notion, Gorgias, BigQuery, etc
This means that you can write end-to-end pipelines within the same framework and get it running with a single command. You can run it on your own computer, on GitHub Actions, or in an EC2 instance somewhere. Using the templates, you can also have ready-to-go pipelines with modeled data for your data warehouse in seconds.
It includes an open-source VS Code extension as well, which allows working with the data pipelines locally, in a more visual way. The resulting changes are all in code, which means everything is version-controlled regardless, it just adds a nice layer.
Bruin can run SQL, Python, and data ingestion workflows, as well as quality checks. For Python stuff, we use the awesome (and it really is awesome!) uv under the hood, install dependencies in an isolated environment, and install and manage the Python versions locally, all in a cross-platform way. Then in order to manage data uploads to the data warehouse, it uses dlt under the hood to upload the data to the destination. It also uses Arrow’s memory-mapped files to easily access the data between the processes before uploading them to the destination.
We went with Golang because of its speed and strong concurrency primitives, but more importantly, I knew Go better than the other languages available to me and I enjoy writing Go, so there’s also that.
We had a small pool of beta testers for quite some time and I am really excited to launch Bruin CLI to the rest of the world and get feedback from you all. I know it is not often to build data tooling in Go but I believe we found ourselves in a nice spot in terms of features, speed, and stability.
https://github.com/bruin-data/bruin
I’d love to hear your feedback and learn more about how we can make data pipelines easier and better to work with, looking forward to your thoughts!
Best, Burak
NortySpock 7 days ago | next |
Interesting, I've been looking for a system / tool that acknowledges that a dbt transformation pipeline tends to be joined-at-the-hip with the data ingestion mode....
As I read through the documentation, Do you have a mode in ingstr that lets you specify the maximum lateness of a file? (For late-arriving rows or files or backfills) I didn't see it in my brief read through.
https://bruin-data.github.io/bruin/assets/ingestr.html
Reminds me a bit of Benthos / Bento / RedPanda Connect (in a good way)
Interested to kick the tires on this (compared to, say, Python dlt)