It's already the last week of August. Back to school. I'm celebrating my first year of freelancing, and to be honest it's been a wonderful year. I've tried and started a lot of projects: the weekly newsletter, streams on Twitch and videos on YouTube. Growing an audience and a community is hard but so satisfying. I also want to thank the clients who trusted me throughout this year and helped me achieve my goals.
Let's go for a new weekly newsletter!
Data fundraising 💰
- Grafana Labs raised $220m in a funding round. With this round, Grafana will keep competing with Splunk and Datadog, taking a vendor-neutral approach to cloud monitoring.
- Monad, a vertical data platform dedicated to the security domain, raised $17m to create the first data cloud for security. Monad comes with connectors that load data from security tools into any warehouse, and with Monad Core Tools, a suite of reports.
- Echoing last week's fundraising news, Cribl raised $200m in a Series C to fund an aggressive expansion plan for 2022. Cribl is an observability platform focused on micro-services data.
- Bodo.ai raised a $14m Series A led by Dell Technologies. Bodo.ai is a company that aims to make Python a first-class, high-performance and production-ready platform. They want to spare us from rewriting Python code to run ETL, feature engineering or AI on all hardware.
Erratum: I did not mention it last week, but Preset also raised $35.9m in a Series B.
Rebranding data or finding the next bubble
The last ten years have been quite a ride. The data ecosystem went through a lot of trendy concepts and bubbles. We got (the list is not exhaustive) BI, data lakes, big data, AI, data warehouses, and more recently the modern data stack and data mesh. O'Reilly Radar published Rebranding Data, which helps us understand why we jump from one bubble to the next, asking all data engineers out there to adapt every couple of years.
Tip: if you want to find the next hype, just go to the big vendors' blogs and look at the SEO articles they write.
How Data Shapes the Uber Rider App
The Uber team wrote a super nice article about how they use all the data they collect to improve their Rider App. They share some diagrams that help us understand the logical blocks behind this kind of architecture. Once again it illustrates the data/AI hierarchy of needs.
Data Lineage at Slack
I really like when big companies write technical articles because, even if we don't have the human resources or the time, we can get inspired and find great ideas. Here the Slack team explains their whole data lineage system. From lineage ingestion to SQL parsing, everything is covered.
Maximizing Productivity of Analytics Teams
Great Expectations wrote a nice series of 3 articles on how to maximize the productivity of analytics teams. Here is part 1. I want to emphasize the last part of the article. It's really important to make root cause analysis easier because issues will happen for sure. So do yourself a favor.
Airflow decorators for better readability
With Airflow 2, the new TaskFlow API was released with super useful decorators that improve the readability of your DAGs. This article gives you a quick overview of what you can achieve. Time to rewrite all your DAGs 🤓.
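To give a feel for why decorators help readability, here is a toy sketch in plain Python. It does not import Airflow and it is not the TaskFlow API: `task`, `Node` and `run` are names I made up for the illustration. The only idea it borrows is that decorated functions, when called, describe a pipeline instead of executing right away, so dependencies can be read straight from the function calls.

```python
from functools import wraps
from typing import Any, Callable, Tuple


class Node:
    """A deferred call: a function plus the arguments it was called with."""

    def __init__(self, func: Callable[..., Any], args: Tuple[Any, ...]):
        self.func = func
        self.args = args

    def upstream(self):
        # Arguments that are themselves Nodes are the upstream dependencies.
        return [a for a in self.args if isinstance(a, Node)]


def task(func):
    """Hypothetical decorator: calling the function records a Node instead of running it."""
    @wraps(func)
    def wrapper(*args):
        return Node(func, args)
    return wrapper


def run(node, cache=None):
    """Run upstream nodes first, then the node itself, feeding results downstream."""
    cache = {} if cache is None else cache
    if id(node) not in cache:
        resolved = [run(a, cache) if isinstance(a, Node) else a for a in node.args]
        cache[id(node)] = node.func(*resolved)
    return cache[id(node)]


@task
def extract():
    return [1, 2, 3]


@task
def transform(values):
    return sum(values)


@task
def load(total):
    return f"loaded total={total}"


# The pipeline reads like ordinary function composition.
pipeline = load(transform(extract()))
result = run(pipeline)
print(result)
```

In this sketch nothing runs until `run` is called, which is the part that loosely resembles how a scheduler would take over. For the real decorators and their options, the article and the Airflow documentation are the references.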
What is Data Quality really?
The 10-million-dollar question.
This is one of the most difficult questions in the data field. Every enterprise has its own answer to, or understanding of, the question because it depends on too many factors. Servian tries to answer it in part 4 of their Data Engineering testing series.
Data professionals, you suck at Interviews
Leo Godin came back with a well-written article about data interviews. It says that "if no laughs during an interview, you probably failed". And I agree. Also, do not forget that interviews are a way for a company to test you, but also a way for you to test the company. This is a 2-way process.
JSON woes in Apache Spark — Null fields that are not really null
If you have already had Spark JSON parsing issues, this article by Ruben Berenguel is for you. It describes how Spark behaves when you send numbers as strings instead of ints. Is it a bug or a feature? Go check the article even if you don't use Spark, it's well written.
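The general failure mode behind the title can be sketched without Spark. The snippet below is plain Python with the standard `json` module. It does not claim to reproduce Spark's exact behavior, and `SCHEMA`, `read_with_schema` and `fake_nulls` are hypothetical names. It only shows how a reader that enforces a declared schema can turn a value that is present in the source into a null, and how comparing raw and parsed rows can flag it.

```python
import json

# Hypothetical declared schema: field name -> expected Python type.
SCHEMA = {"id": int, "name": str}


def read_with_schema(line, schema):
    """Parse one JSON line; a field whose type does not match the schema becomes None."""
    raw = json.loads(line)
    row = {}
    for name, expected in schema.items():
        value = raw.get(name)
        row[name] = value if isinstance(value, expected) else None
    return raw, row


def fake_nulls(raw, row):
    """Fields that are None after parsing although the source had a value."""
    return [k for k, v in row.items() if v is None and raw.get(k) is not None]


lines = [
    '{"id": 1, "name": "a"}',    # id is a number, as the schema expects
    '{"id": "2", "name": "b"}',  # id is a number sent as a string
]

for line in lines:
    raw, row = read_with_schema(line, SCHEMA)
    print(row, "fake nulls:", fake_nulls(raw, row))
```

A check like `fake_nulls` is cheap to run on a sample of the input and makes this kind of silent data loss visible early.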
- Cloudera announced DataFlow for the Public Cloud: all that marketing means they run Apache NiFi on top of Kubernetes for you, with a control plane.
- Oracle announced increased performance with MySQL HeatWave: they claim that their MPP database is faster than Redshift at a third of the cost.
- Open-source project sqlmodel: a way to create SQLAlchemy models from Python type annotations. The project was created by the creator of FastAPI (so it integrates well with it).
- Homomorphic encryption: if you are interested in homomorphic encryption, I recommend this blog post. Look at the example with the image filter. I hope it will help us be privacy-first in the future.
- Airflow + MLOps Virtual Meetup: next week, on Sept. 2, the Tel Aviv Apache Airflow meetup group will host a virtual meetup on Airflow + MLOps use cases. If you are interested, do not hesitate to attend!
- Snowflake quarterly performance: Snowflake shared their quarterly results. They announced almost 5000 clients and 212 Fortune 500 companies (+34% YoY).
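On the homomorphic encryption item above: if the idea of computing on encrypted data sounds abstract, here is a minimal sketch of one classic additively homomorphic scheme, Paillier. This is my own illustration, not code from the linked blog post. The primes are tiny and fixed and the randomness comes from `random`, so it is a toy and in no way secure. It shows that multiplying two ciphertexts yields a ciphertext of the sum of the plaintexts.

```python
import math
import random

# Toy key material: tiny fixed primes, for illustration only (NOT secure).
P, Q = 293, 433
N = P * Q
N_SQ = N * N
G = N + 1                                            # standard simple choice of generator
LAM = (P - 1) * (Q - 1) // math.gcd(P - 1, Q - 1)    # lcm(P-1, Q-1), private
MU = pow(LAM, -1, N)                                 # modular inverse (Python 3.8+), private


def encrypt(m):
    """Encrypt an integer 0 <= m < N with the public key (N, G)."""
    if not 0 <= m < N:
        raise ValueError("message out of range")
    while True:
        r = random.randrange(1, N)
        if math.gcd(r, N) == 1:
            break
    return (pow(G, m, N_SQ) * pow(r, N, N_SQ)) % N_SQ


def decrypt(c):
    """Decrypt with the private values LAM and MU."""
    x = pow(c, LAM, N_SQ)
    return ((x - 1) // N * MU) % N


def add_encrypted(c1, c2):
    """Homomorphic addition: needs no key, only the two ciphertexts."""
    return (c1 * c2) % N_SQ


c_sum = add_encrypted(encrypt(20), encrypt(22))
print(decrypt(c_sum))
```

Whoever holds only the ciphertexts can call `add_encrypted` without ever seeing 20 or 22, which is the privacy-first property I am hoping for. Sums wrap around modulo `N`, so real deployments use much larger keys and a vetted library.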
I'm also announcing that I'll restart streaming next Wednesday (in French). See you next week ❤️!