Data News — Week 22.35
Data News #22.35 — Startree (Pinot) fundraising, do you really need data engineers?, Snowflake bill, Velox, dbt lessons, dbt Python models.
Hey dear readers, I hope this edition finds you well. I'm changing the format a bit this week: only data fundraising and a long Fast News section, plus a featured article I decided to develop. I hope you'll like it.
Next week I'm gonna launch the Explorer, a hub with all the data news links. If you want access please tell me.
Data fundraising 💰
- Startree raised a $47m Series B. Startree provides a cloud version of Apache Pinot. Pinot is one of the real-time OLAP databases. It has been designed to support real-time ingestion while being queried by downstream analytical apps. We are in 2022 and real time versus batch is still a thing, and I bet it'll continue to be.
Do you really need data engineers?
Yesterday SeattleDataGuy wrote Let's move fast and get rid of data engineers. The title is a bit provocative, but the content is still relevant. In the article Benjamin explains very well that companies want to remove all the data engineering burden by putting low-code data software directly in the hands of analysts and scientists.
When we think about it, this is true: it is the whole promise of many data tools, in all domains. When it comes to extraction, Airbyte, Fivetran and co. are trying to put the simplest UI on the most boring task of data platforms: copying data from sources to analytical storage. Regarding transformation, warehouses + SQL have achieved the biggest complexity reduction. On the serving layer, whether it's reverse-ETL or visualization, the space has already been transformed with "friendly" tools.
Everyone can write SQL and configure web tools, so don't bother hiring rare data engineers.
This is a discussion I also had a lot over the past two years during my freelancing journey. If I hire a data engineer what work should I give him/her? Do I really need to hire data engineers when I have dbt?
Often my answer to these questions is: it depends. If you're satisfied with your processes and you don't have any scale, stability or engineering issues, you might be ok with the status quo. Still, you'll need an analyst who is geekier than the others for some trickier topics. But if you feel that your data team lacks processes, does not have the time to self-improve, or always has long-term engineering tasks in the backlog that no one can take care of, you might need to do some engineering work.
I don't want to be a gatekeeper. But I feel that to be a data engineer you need something different. Data engineering is very often a boring field. Not everyone is interested in spending their morning investigating why this pipeline fails or why a query is costing more money than planned. But I personally think every data team needs to have someone who does it. You can call it data engineer, DevOps, SRE, data geek, idc. In the end you need someone to whom you can delegate the boring stuff, so that the others can deliver value to stakeholders.
Even if you have Airbyte, dbt, or the other fancy SaaS data platform launcher, I bet you'll find a lot of value in hiring data engineers. Data engineers, in general, love to solve problems and to help other teams optimize what they are doing. I don't mind waking up at 6AM to fix a pipeline if it'll simplify my day later. I even like it. This is also a behaviour I've seen in fellow DEs, but rarely in analysts.
So, yes, in the end you really need data engineers. Obviously I'm biased when I say this because I'm a data engineer. As I just said, I don't think the reason is technical. You just need someone, or a team, backstage that you can trust and who magically handles stuff while the team shines on the outside. But draw clear responsibilities and don't give too much power to data engineers.
This is a large topic to cover and I just scratched the surface.
Fast News ⚡️
- 3 main elements in your Snowflake bill — A few days ago the Snowflake stock gained 40% in two days. Snowflake over-delivered in terms of revenue for the last quarter, which means companies have spent more credits, which means more computing time, which probably means a few surprised customers. So, luckily for us, Sthiven from Wise explains 3 concepts to help you master your bill.
- Introducing Velox: An open source unified execution engine — Meta developed a new "universal" execution engine. The idea is to replace traditional computing workers with Velox in order to benefit from shared optimisations and a common library. For instance you'll be able to replace the Spark engine and Presto workers with Velox while your code stays the same; only the execution will be different. I see the interest for big companies. Not really for others.
- Lessons learned after 1 year with dbt — 3 classic software engineering lessons. This is a good reminder of the nature of dbt: a framework to apply SE practices to SQL modeling. I couldn't agree more: scale fast without sacrificing data quality or documentation, and pay down debt regularly. On this topic Madison also wrote about GitHub best practices for Analytics Engineering.
- Kubernetes was never designed for batch jobs — If you need arguments to convince your SRE team that you need a data orchestration tool better than Kubernetes, this article is written for you.
- A first look at the dbt Python models with Snowpark — This is a great first look at what can be done with dbt Python models. The author compares the SQL and the Python way to compute the same model. Sadly this is not really comparable, as the Python models can only be materialized as tables whereas she used views for the SQL. Under the hood Python models generate Snowflake stored procedures in Snowpark. Let's welcome back the good ol' stored procedures.
- Fixing slow PostgreSQL queries — Why can a simple LIMIT destroy query performance? Awesome technical deep-dive that explains how to read ANALYZE output to fix query performance.
- How to measure cohort retention — This is a topic every analyst working in marketing has to tackle at least once. This post shows how to do it, from defining the problem to the SQL implementation, with charts at the end.
- ❤️ The many layers of data lineage — This is probably the best article about lineage I have read in a long time. The author proposes to see lineage as different layers, like in a map visualisation. We should consider lineage as a graph and add layers on top of it to understand what's going on. For instance we can use a quality layer, a usage layer or a performance layer. And with each layer you'll be able to highlight different issues.
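To make the layered lineage idea a bit more concrete, here is a minimal sketch in Python. It is only my own illustration, not the article's implementation: the dataset names, the layer values and the helper functions (`downstream`, `highlight`, `impacted_by_quality_issues`) are all hypothetical. The lineage is a plain directed graph, and each layer is just another dictionary keyed by the same nodes.

```python
from collections import deque

# Lineage as a directed graph: each dataset maps to the datasets built from it.
# All dataset names and layer values are made up for illustration.
LINEAGE = {
    "raw_orders": ["stg_orders"],
    "raw_customers": ["stg_customers"],
    "stg_orders": ["fct_revenue"],
    "stg_customers": ["fct_revenue"],
    "fct_revenue": ["revenue_dashboard"],
    "revenue_dashboard": [],
}

# Each layer annotates the same nodes with a different kind of metadata.
LAYERS = {
    # Latest data test status per dataset.
    "quality": {
        "raw_orders": "ok",
        "raw_customers": "ok",
        "stg_orders": "failing",
        "stg_customers": "ok",
        "fct_revenue": "ok",
        "revenue_dashboard": "ok",
    },
    # Number of queries per week hitting each dataset.
    "usage": {"stg_orders": 3, "fct_revenue": 120, "revenue_dashboard": 45},
    # Build time in seconds.
    "performance": {"stg_orders": 12, "stg_customers": 8, "fct_revenue": 340},
}


def downstream(graph, start):
    """Return every node reachable from `start`, excluding `start` itself."""
    seen, queue = set(), deque(graph[start])
    while queue:
        node = queue.popleft()
        if node not in seen:
            seen.add(node)
            queue.extend(graph[node])
    return seen


def highlight(layer, predicate):
    """Return the sorted nodes whose value in `layer` satisfies `predicate`."""
    return sorted(n for n, value in LAYERS[layer].items() if predicate(value))


def impacted_by_quality_issues(graph):
    """Map each failing dataset to everything that depends on it."""
    failing = highlight("quality", lambda status: status == "failing")
    return {node: sorted(downstream(graph, node)) for node in failing}


slow_models = highlight("performance", lambda seconds: seconds > 60)
impacted = impacted_by_quality_issues(LINEAGE)
```

With this toy data, the performance layer points at `fct_revenue` as the slow model, while the quality layer combined with the graph shows that the failing `stg_orders` affects `fct_revenue` and `revenue_dashboard`. Same graph, different layer, different issue.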