Hi data folks, two weeks ago we reached the 500 mark and we didn't celebrate. So it's time for me to say thank you to all of you for all the support you've shown over the last months! I'll keep improving the Data News in the months to come.
We also published on the blog this week a write-up of our Astronomer trial: we used it for two weeks, and the post covers the basic tutorial and some limitations we ran into while using it.
This week is rather light in terms of articles I found relevant to add, but I hope you'll still enjoy it.
Data fundraising 💰
- Mixpanel raised $200m in Series C. Mixpanel is a product analytics tool that mixes your tracking system with your warehouse in a single app. Over the last two years they have seen huge growth, driven by the online traffic explosion.
- ThoughtSpot reached a $4b valuation after a $100m Series F. After 10 minutes on their website I still don't know what it is about (I'm probably not the target). It seems they provide a cloud platform that answers search questions using AI, with visualizations.
- Hightouch, a reverse-ETL tool, raised $40m in Series B. That's their third funding round in less than a year. In the coming months they'll strategically focus on adapting the experience to business verticals like marketing and sales.
Build a data quality architecture
Data quality is probably one of the biggest pain points in today's data teams. Every analyst struggles with it at some point, and every engineer is unsure about the definition of "data quality". At HomeToGo they split data quality into three pillars — accuracy, observability, and monitoring, each with a different owner — in order to build a data quality system. I also discovered Anomalo, the tool they decided to use.
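To make the "accuracy" pillar concrete, here is a minimal sketch of what a batch accuracy check could look like in plain Python. This is my own illustration, not HomeToGo's setup or Anomalo's API; the column names and rules are hypothetical.

```python
# Hypothetical accuracy checks on a batch of rows (column names are made up).
def check_accuracy(rows):
    """Return a list of human-readable failures for a batch of records."""
    failures = []
    for i, row in enumerate(rows):
        # Rule 1: every record must have an identifier.
        if row.get("booking_id") is None:
            failures.append(f"row {i}: missing booking_id")
        # Rule 2: prices must be present and non-negative.
        if row.get("price", -1) < 0:
            failures.append(f"row {i}: negative or missing price")
    return failures

rows = [
    {"booking_id": 1, "price": 120.0},
    {"booking_id": None, "price": 80.0},
    {"booking_id": 3, "price": -5.0},
]
print(check_accuracy(rows))
```

In a real setup these checks would run on a schedule (the monitoring pillar) and emit metrics somewhere visible (the observability pillar), which is the part tools like Anomalo take off your plate.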
Chit-chat about the data analyst role
Over the last two weeks a lot of people wrote about the analyst role — or the so-called "translators", as McKinsey put it in 2018. First we got people debating how to measure analytical work, to which Robert answered that the analyst isn't your SQL monkey.
Then Taylor tried to define what the analyst of the future will look like and how the discipline is evolving. I liked the charts she drew; they're a good way to summarize the state of the analytics role.
Warning — the illustration in this one can be hard to look at 👇
Following last week's post about how analytical work has never been harder, this week Mikkel wrote a definition of "moldy data", arguing that dashboards should self-destruct and that spreadsheets are still not a bad alternative today. One conclusion stands out: we need new tools to help everyone's work.
Understand SQL joins in Python
I really like it when people explain basic concepts we use every day through another technology. This time we get SQL joins re-implemented in Python, which helps you understand what's going on under the hood when we speak of Postgres joins. Kelvin deconstructs the nested loop join, the merge join, and the hash join — go read the post to understand the difference with left and right joins 😉.
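To give you a taste of the idea — my own minimal sketch, not Kelvin's code — here are the nested loop join and the hash join for an inner join on a single key, using lists of dicts as "tables":

```python
def nested_loop_join(left, right, key):
    """O(n*m): compare every pair of rows. Simple, but slow on big tables."""
    return [(l, r) for l in left for r in right if l[key] == r[key]]

def hash_join(left, right, key):
    """O(n+m): build a hash table on one side, then probe it with the other."""
    table = {}
    for r in right:
        table.setdefault(r[key], []).append(r)
    return [(l, r) for l in left for r in table.get(l[key], [])]

users = [{"id": 1, "name": "Ada"}, {"id": 2, "name": "Linus"}]
orders = [{"id": 1, "total": 10}, {"id": 1, "total": 25}, {"id": 3, "total": 7}]

# Both strategies produce the same inner-join result.
assert nested_loop_join(users, orders, "id") == hash_join(users, orders, "id")
print(hash_join(users, orders, "id"))
```

The merge join (the third strategy Kelvin covers) sorts both sides on the key first and then walks them in lockstep, which is why Postgres picks it when the inputs are already sorted.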
Kubernetes for data engineering
If you're looking for an introduction to Kubernetes concepts, this post is not for you; it's a small glimpse of what you can achieve faster as a data engineer when it comes to deploying apps with Kubernetes.
PS: This is also a reminder for me to write an entry-level detailed post about kube.
6 lessons I learnt early as a Data Engineer
All 6 points are common sense — things we discover quite early as data engineers — but they are all true, and I think this is a good reminder for all data people. A small heads-up.
My personal favourite is obviously the second: every data employee should be trained to say no.
- Airflow 2.2.2 got released — changelog
- Snowflake released Python support in Snowpark — demo video on YouTube (a 10-minute overview, but with code)
- Metaplane launched this week on Hacker News — they aim to become the Datadog for data, promising a setup up and running in only 30 minutes.
Fast News ⚡
- Gradient Flow and Jesse Anderson published the results of the State of Data Engineering survey they launched a few months ago — look at the results here.
- CinnaMon (GitHub repo) — a Python library offering a number of tools to detect, explain, and correct data drift in a machine learning system.
- PySpark cheatsheet — if you do PySpark, this huge sheet may help you
- SelectStar is a data discovery tool that unveiled a partnership with Snowflake this week
- Versatile Data Kit — VMware (yes!) open-sourced an abstraction layer around data jobs to help with monitoring and debugging. This Medium post gives a quick overview.
- Felipe Hoffa's tutorial on migrating your Google Analytics data to Snowflake — a SQL cookbook