It's already Friday and we have huge announcements for open-source tools this week! In recent weeks we've seen a lot of money raised around tooling, and it's a huge step forward.
Data fundraising 💰
- Fishtown Analytics, founded in 2016, raised a $150m Series C. If the name doesn't mean anything to you, now that they've renamed themselves dbt Labs it should ring a bell. With dbt they aim to bring analytics engineering to everyone. The idea behind dbt is to apply software engineering principles to SQL data transformations.
- Meltano, ELT for the DataOps era, raised a $4.2m Seed round. Meltano started as an internal tool at GitLab, and a year after open-sourcing it, the team spun out. What I find awesome is that Meltano builds on dbt, and they've developed an amazing tool to launch your ELT platform in minutes. I can't wait to see where they go next. As a reminder, you can read this blog post explaining Why Meltano.
- Last but not least, Edge Delta raised a $15m Series A to compete with Splunk. They claim to be far faster than log-processing tools like Splunk, Filebeat, Fluentd, or Logstash, and also far cheaper.
Airflow Summit 2021
Airflow Summit starts next week. As an appetizer, I've written a small selection below of talks I'm looking forward to (times are Paris time). Don't forget to register on Crowdcast so you don't miss the event. In two weeks I'll be sure to include a debrief of the summit in the newsletter.
Avalanche: Streaming Postgres to Snowflake
Last week I was so happy to see CDC articles, and this week we get two more. The first is about Avalanche — the name is so perfect 👌. The Devoted Health team developed a streaming service in Go that syncs the PostgreSQL Write-Ahead Log (WAL) to Snowflake in real time.
Meanwhile Vimeo, the video hosting platform, detailed how they replicate MySQL databases into Snowflake in real time with Debezium. I'm thrilled to see Snowflake in these use cases.
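Conceptually, both pipelines do the same thing: read a stream of row-level change events out of the database log and replay them onto the warehouse. Here's a minimal sketch of that "apply" step in pure Python — all names are mine for illustration, not taken from either article, and a real Debezium consumer would of course read from Kafka rather than a list:

```python
# Illustrative CDC "apply" step: replay row-level change events
# (as a Debezium-style stream would emit them) onto a replica table,
# modeled here as a dict keyed by primary key.

def apply_change(replica: dict, event: dict) -> None:
    """Apply one insert/update/delete event keyed by primary key."""
    op, key = event["op"], event["key"]
    if op in ("insert", "update"):
        replica[key] = event["row"]   # upsert keeps replays idempotent
    elif op == "delete":
        replica.pop(key, None)        # tolerate already-deleted rows
    else:
        raise ValueError(f"unknown op: {op}")

events = [
    {"op": "insert", "key": 1, "row": {"id": 1, "title": "intro.mp4"}},
    {"op": "update", "key": 1, "row": {"id": 1, "title": "intro-v2.mp4"}},
    {"op": "insert", "key": 2, "row": {"id": 2, "title": "demo.mp4"}},
    {"op": "delete", "key": 2},
]

replica: dict = {}
for event in events:
    apply_change(replica, event)

print(replica)  # {1: {'id': 1, 'title': 'intro-v2.mp4'}}
```

Treating updates as upserts and deletes as no-ops when the row is already gone is what makes replaying a slice of the log safe after a crash.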
But don't forget batch extraction
For people who are still building batch extractions, Felipe from Alice Tech wrote a well-detailed article, with Airflow operator examples, on how they transferred data from Oracle to an RDS warehouse (Postgres). You'll find a GitHub repo with their Airflow plugins — it includes a snippet to get an Oracle database schema 👴.
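The heart of any batch transfer like this is the same chunked extract/load loop, whatever operator wraps it. A hedged sketch of that pattern, with plain lists standing in for the Oracle cursor and the Postgres target (names are illustrative, not from the article or their repo):

```python
# Illustrative batch extract/load loop: read source rows in fixed-size
# chunks, write each chunk to the target. `fetch_chunks` stands in for
# calling cursor.fetchmany(size) in a loop on an Oracle connection.

from typing import Iterable, Iterator, List, Tuple

Row = Tuple[int, str]

def fetch_chunks(rows: Iterable[Row], size: int) -> Iterator[List[Row]]:
    """Yield rows in chunks of at most `size`, like repeated fetchmany()."""
    chunk: List[Row] = []
    for row in rows:
        chunk.append(row)
        if len(chunk) == size:
            yield chunk
            chunk = []
    if chunk:
        yield chunk  # final partial chunk

# Fake "Oracle" source and "Postgres" target for the sketch.
source = [(i, f"row-{i}") for i in range(7)]
target: List[Row] = []

for chunk in fetch_chunks(source, size=3):
    target.extend(chunk)  # stand-in for an executemany() INSERT

print(len(target))  # 7
```

Chunking keeps memory bounded on big tables and gives you a natural unit to retry or checkpoint on.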
10 short rules for a Data Engineer
People who know me IRL know that I'd have put the idempotency rule 1st, not 2nd! Still, the list is quite accurate, and I think all data engineers should read it at least once a year.
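Since I keep banging on about idempotency: the rule simply means that rerunning a load must leave the warehouse in the same state as running it once. A tiny sketch of the usual fix — overwrite the partition instead of appending to it (the names here are mine, purely for illustration):

```python
# Illustrative idempotent load: reloading the same partition twice
# leaves the warehouse unchanged, because the load replaces the
# partition's contents rather than appending to them.

warehouse: dict = {}  # partition key (day) -> list of rows

def load_partition(day: str, rows: list) -> None:
    """Idempotent load: overwrite the day's partition, never append."""
    warehouse[day] = list(rows)

load_partition("2021-07-09", [{"id": 1}, {"id": 2}])
load_partition("2021-07-09", [{"id": 1}, {"id": 2}])  # a retry is safe

print(len(warehouse["2021-07-09"]))  # 2, not 4
```

With an append-only load, the same retry would silently double the day's rows — which is exactly why I'd rank this rule first.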
Once you've set your data engineering standards, you can get inspired by Lior Gavish. He wrote on Towards Data Science about levers we can use to make data engineering teams happy. IMHO "You measure customer impact" is the most important one.
...and 5 rules for Data Analysts
Now that we have 10 rules for data engineers, we also get 5 rules for data analysts (or should I say analytics engineers?). I particularly like the fourth one, because it's always important to fix the root cause!
Greykite, a Python library to provide interpretable forecasts
LinkedIn open-sourced Greykite, a Python library for interpretable and automated forecasting at scale. In the post they say that LinkedIn's SRE team uses this algorithm to help ensure site availability.
Send Signals from CLI
This is not related to data, but I've seen the signal-cli project, which lets you send Signal messages directly from the CLI. I think it could be a good free alternative when you want to build a messaging or alerting system for your data app, for instance.
Thank you all for reading this new digest. I hope you enjoy it and that it helps you in your weekly curation. See you next week 🤗.
Join the newsletter to receive the latest updates in your inbox.