Data News — Week 23.07
Data News #23.07 — What's DataOps, decrease ETL costs with Arrow, the case for being biased, data validation framework...
In last week's newsletter I also shared what a metrics store is, which led to a longer edition than usual, and I saw that a few people did not like it that way. It was an experiment, and I'll see how I can do it better in the future. Still wondering what a metrics store is? You can check out the post extracted from the newsletter.
On the same topic, this week Pierre shared how to create a semantic layer in Preset (i.e. managed Apache Superset). To do so, he first defines metrics within dbt and then, thanks to CI/CD, pushes the metric definitions to Preset. This is a great example of a simple way to push metrics down to visualisation tools.
Is DataOps really a thing?
Over the last year, DataOps has been used in many different ways to describe many different data-related tasks. When you look closely, some companies put just about any data work behind the word DataOps. This is a bit misleading when you read that DataOps is "DevOps for data", because everything wrapped up in DevOps is something different from software engineering.
I personally share this perspective. Data engineering is mainly software engineering applied to data, or at least we try. If we see it this way, it is logical to say that DataOps is the movement to smooth the operations side, which technically means the infrastructure side (the IT, as previous generations used to say; I don't like "IT", it makes me feel old). Data engineering is also an infrastructure-heavy field, with a lot of technologies to put together to create something that works. This is why DataOps is important. This is why Infrastructure as Code is mandatory.
To me it stops here. All the derivations that claim "we build data products using the DataOps methodology" are just marketing. In reality you are just writing code applied to data and using Docker containers to deploy it in the cloud. I think we should stick to software engineering vocabulary.
It also means that the data engineer role is constantly evolving, especially with the recent appearance of the analytics engineer role. Analytics engineers are taking tasks away from data engineers, which is for the better tbh. Data engineers will have to focus more on software and on infrastructure, so the expertise is shifting. Analytics engineers will become the data modeling experts. Data engineers will own the infrastructure side and the software related to the data team, which is already too broad a field with different ownerships (DS, MLE, etc.).
In the end, when I deploy data apps I end up writing Dockerfiles with CI/CD processes and looking for cloud services to host my containers. If this is not DevOps, what is it?
Fast News ⚡️
- Unveiling the three faces of documentation — Practical advice about data documentation and how you can leverage it through 3 main axes: asset knowledge, business knowledge and team onboarding.
- Databricks announced a VS Code extension — This is small news, but it is still interesting to see an all-in-one platform like Databricks going in this direction, providing an extension that supports the way end users write code rather than the vendor's way.
- 📺 Understanding the business as a data analyst — A podcast about the privileged position data analysts have in the business, but also their responsibility to understand and model it correctly in order to provide the best value to data users.
- Decrease ETL costs with Apache Arrow — I've often written data extraction with pandas using `pd.read_sql` because it's super handy and you quickly get something that works, but the memory cost can be high. This article shows how you can do it with Polars, which leverages Arrow, while using less memory.
- Deploying data pipelines using the Saga pattern — When you start the real-time journey, your way of thinking about data pipelines is a bit different, and it can be overwhelming when you come from the batch world. The Saga pattern is meant to put consistency first in the system. Here Picnic showcases the usage of dead letter queues.
- The case for being biased — It's been a long time since I last featured one of Benn's posts, and they are still awesomely written. This one responds well to "Analytics is not about data. It's about truth", which I shared last week. Benn reflects on the role of a data team in the business decision-making journey.
- Balancing quality and coverage with our data validation framework — The Dropbox tech team developed a data validation framework in SQL. The validation runs as an Airflow operator every time new data has been ingested. In terms of design, only one query runs (for performance reasons), and if the query returns anything other than zeros, it means something is going wrong. This validation process is also a staging step before sending a table to production.
- Pedram developed a NeoVim extension for dbt users. If you're not familiar with Vim or NeoVim, Simon explained what Vim is and why it is more than an editor.
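To make the "one query that should return only zeros" idea from the Dropbox item more concrete, here is a minimal sketch of how I picture it. This is not Dropbox's actual framework: the table name, the check names and the helper functions are all hypothetical, and I use an in-memory SQLite database instead of a warehouse behind an Airflow operator so the example stays self-contained.

```python
import sqlite3

# Hypothetical checks: each SQL expression counts the rows violating one rule.
# None of these names come from the Dropbox article.
CHECKS = {
    "null_user_id": "SUM(CASE WHEN user_id IS NULL THEN 1 ELSE 0 END)",
    "negative_amount": "SUM(CASE WHEN amount < 0 THEN 1 ELSE 0 END)",
}


def build_validation_query(table, checks):
    """Combine every check into ONE query, with one column per check."""
    # COALESCE turns the NULL that SUM returns on an empty table into 0.
    columns = ", ".join(
        f"COALESCE({expr}, 0) AS {name}" for name, expr in checks.items()
    )
    return f"SELECT {columns} FROM {table}"


def validate(conn, table, checks):
    """Run the single query and return only the checks that are not zero."""
    row = conn.execute(build_validation_query(table, checks)).fetchone()
    return {name: count for name, count in zip(checks, row) if count != 0}


def demo():
    """Load a tiny staging table with two bad rows and validate it."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE staging_orders (user_id INTEGER, amount REAL)")
    conn.executemany(
        "INSERT INTO staging_orders VALUES (?, ?)",
        [(1, 10.0), (None, 5.0), (2, -3.0), (3, 7.5)],
    )
    return validate(conn, "staging_orders", CHECKS)


if __name__ == "__main__":
    failures = demo()
    # An empty dict would mean every check returned zero, so the staging
    # table could be promoted to production.
    print(failures or "all checks passed")
```

In a real pipeline the `validate` call would probably live inside a scheduler task that fails when the returned dict is not empty, which blocks the promotion of the staging table.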
Data Economy 💰
- Europe data salary benchmark 2023 — Mikkel has become one of the best in Europe at accurately picturing the data field by running benchmarks and studies across the whole market. This time he is looking at salaries. To me, as a French person, the craziest number is that senior positions (5+ years) in Europe are paid six figures.
- Side note: this week I realised that DuckDB Labs is the team behind DuckDB, and not MotherDuck, who partnered with them to offer the duck technology to everyone.
See you next week.