Hello, another edition of Data News. This week, we're going to take a step back and look at the current state of data platforms. What are the current trends, and why is everyone arguing about the concept of the modern data stack?
Early September is usually conference season. All over the world, people gather in huge venues to attend conferences. Last week it was Big Data London; this week, Big Data & AI Paris. I wasn't able to go. But every time I went to a conference in the past, I came back wanting to change everything because someone had introduced me to some fancy new tool.
That feeling is natural, but you should temper your excitement. Let's go through the current state of data to understand what you should do next.
Big Data is really dead
Although the term Big Data is no longer very popular, London probably counted over 10,000 visitors and more than 160 vendors (2022 figures). Big Data London has existed since 2016, and its sponsor list reads like a history book. Over the years, the Cloudera logo has been replaced by Snowflake and Databricks ones, while the Microsoft logo still stands. When everybody is digging for gold, it's good to be in the pick and shovel business.
The era of Big Data was characterised by Hadoop, HDFS and distributed computing (Spark), all on top of the JVM. This era was necessary and opened doors to the future, fostering innovation. But there was a big problem: it was hard to manage.
That's why big data technologies got swooshed by the modern data stack when it arrived on the market, except for Spark. We jumped from HDFS to cloud storage (S3, GCS) for storage, and from Hadoop and Spark to cloud warehouses (Redshift, BigQuery, Snowflake) for processing.
In fact, we're still doing the same thing we did 10 or 20 years ago. We need to store, process and visualise data; everything else is just marketing. I often say that data engineering is boring, insanely boring. When you are a data engineer, you're getting paid to build systems that people can rely on. By nature it should be simple to maintain and develop; it should be stable, it should be proven. Something boring.
Big data technologies are dead—bye Zookeeper 👋—but the data generated by our systems is still massive. Is the modern data stack fit to answer this need for storage and processing?
Is the modern data stack dying?
The modern data stack has always been a nice phrase bundling a philosophy for building data platforms. Cloud-first, with a handy warehouse at the centre and multiple SaaS tools revolving around it to answer useful (sometimes not so useful) use cases, following an E(T)LT approach.
Historically, data pipelines were designed with an ETL approach: storage was expensive, so we had to transform the data before storing it. With the cloud, we got the (false) impression that resources were infinite and cheap, so we switched to ELT, pushing everything into a central data store first.
If we summarise the initial modern data stack vision, it looks something like this:
- move data with Fivetran
- store data in Snowflake
- transform data with dbt
- visualise with Looker
- document with a catalog, prevent incidents with data observability, and orchestrate it all
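Stripped of vendor names, that workflow is just EL, then T in SQL, then serving the result. A minimal sketch of the pattern, using Python's built-in sqlite3 as a stand-in for the cloud warehouse (the table and column names are illustrative, not from any specific tool):

```python
import sqlite3

# "Extract & load": push raw data into central storage untouched (the EL of ELT).
conn = sqlite3.connect(":memory:")  # stand-in for Snowflake/BigQuery
conn.execute("CREATE TABLE raw_orders (id INTEGER, amount REAL, country TEXT)")
conn.executemany(
    "INSERT INTO raw_orders VALUES (?, ?, ?)",
    [(1, 10.0, "FR"), (2, 25.5, "FR"), (3, 7.0, "UK")],
)

# "Transform": model the data in SQL, inside the warehouse (the part dbt orchestrates).
conn.execute("""
    CREATE VIEW orders_by_country AS
    SELECT country, COUNT(*) AS orders, SUM(amount) AS revenue
    FROM raw_orders
    GROUP BY country
""")

# "Visualise": downstream tools (Looker, Metabase) just read the modelled view.
for row in conn.execute("SELECT * FROM orders_by_country ORDER BY country"):
    print(row)
```

Swap sqlite3 for a warehouse and the `CREATE VIEW` for a dbt model, and you have the whole vision in miniature: raw data lands first, modelling happens in SQL at the centre, and everything downstream only reads modelled tables.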
So what's left of the original vision of the modern data stack that can be applied in 2023 and beyond? An easy-to-manage central storage, plus a layer to query and transform it in SQL. Framed like this, it opens doors and does not limit the modern data stack to four vendors.
The central storage can be cloud storage, a warehouse or a real-time system, while the SQL engine can be a data warehouse or a dedicated processing engine. It can go further than that: you can (in fact, you should) compose storages and engines, because there are too many use cases for any one solution to address. More importantly, the modern 4-vendor data stack is too expensive to scale.
The modern data stack is not about to disappear: it's simple to get started with, and it's at the core of too many data stacks and practices today. But it needs to adapt to today's needs, hence its incremental evolution.
I believe in incremental evolution
What do you need to do? Well, it all depends on whether you're a newcomer who wants to start building a data platform, or whether you already have a stack and are wondering what to do next. If you're starting your data stack in 2023, simply choose the solution that will be quickest to implement so you can discover your business use cases; you'll rebuild something later. A lot of companies started with Postgres + dbt + Metabase, don't be ashamed.
When it comes to incrementally changing a data platform, things are a bit different: you need to find what is going wrong and what could be improved. For example:
- data workflows are always failing, always late—Identify why workflows fail. Data contracts can help bring consensus as code when failures come from upstream producers. Create metrics about failure rate and latency, and aim for a 30-day streak with no issues. Define SLAs, criticality and ownership. For downstream data quality there are also a lot of tools.
- the data stack is too expensive—With the current economic situation, a lot of data teams have had to stop spending crazy amounts on compute and to introspect their storage to remove useless data archives. DuckDB can help save tons of money.
- poor developer experience when adding new workflows—This is something data engineers often neglect: you need to build the best dev experience for other data people, because not everyone is fluent with the CLI.
- data debt—You might have too many dashboards or tables, or spaghetti workflows. For this you need recurrent data cleaning: find, tag and remove what is useless, and factorise what can be. Only healthy routines can prevent this.
- poor data modeling—This topic might be too large for one bullet. Data modeling is the part of data stacks that really doesn't scale: as you grow, your stock of SQL queries will inflate, and only data modeling will prevent data from becoming unusable, repetitive or false. Good data layers are a good start.
- there is no data documentation—Rare are the people who enjoy documenting what they do. The best approach is to define what good documentation looks like, then enforce those requirements before going to production. Write the documentation for your readers.
- data is not easily accessible to humans or AI—We build data platforms to be used. You should create usage metrics over your platform: about business-user adoption in downstream tools, about SQL query writers, but also about how AI is using the data. How does the AI platform combine with the analytics platform?
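On the first bullet: "consensus as code" can be as small as a schema check that runs on the producer side before a batch enters the pipeline. A hypothetical sketch (the field names and rules are illustrative, not from any specific data-contract tool):

```python
# A tiny data contract: producer and consumer agree on this schema,
# and every batch is validated against it before being published.
CONTRACT = {
    "order_id": int,
    "amount": float,
    "country": str,
}

def validate(batch: list[dict]) -> list[str]:
    """Return a list of violations; an empty list means the batch honours the contract."""
    errors = []
    for i, row in enumerate(batch):
        for field, expected in CONTRACT.items():
            if field not in row:
                errors.append(f"row {i}: missing field {field!r}")
            elif not isinstance(row[field], expected):
                errors.append(f"row {i}: {field!r} should be {expected.__name__}")
    return errors

good = [{"order_id": 1, "amount": 9.9, "country": "FR"}]
bad = [{"order_id": "1", "amount": 9.9}]  # wrong type, missing field
print(validate(good))  # no violations
print(validate(bad))   # two violations
```

Real tools add versioning, ownership and CI enforcement on top, but the core idea is exactly this: a machine-checkable agreement that fails loudly on the producer side instead of silently breaking workflows downstream.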
This list is probably not exhaustive, but it's a good start. If you think you're good on all counts, you've probably finished the game: your data team has built something that works. Don't forget the stakeholders though; it's probably more useful to have a platform that barely works but serves users perfectly than the other way around.
This post is a reflection on the changes in the data ecosystem. Marketing would have you believe that your data infrastructure is obsolete, but you shouldn't worry about it: if you're still using a crontab to run your jobs, that's fine. Just use the right tool for the right job and identify your data needs. Tip: data needs are rarely a technology name.
I hope you liked this different Data News edition; I'm curious to know what you think about it. I wanted to keep it short while giving a few practical links and ideas.
Your data stack won't explode if you don't use dbt.
PS: I also wanted to write about the interoperability of data storage and file formats, but that's for another time.
Fast News ⚡️
- MotherDuck has announced their pricing — The model's simplicity reminds me a lot of BigQuery in its early days. You pay for cold and hot storage: respectively $0.04 per GB per month and $0.02 per GB per hour. But it looks way more expensive than BigQuery.
- Announcing BigQuery Omni cross-cloud joins — Join datasets located in BigQuery with datasets located in AWS or Azure. This is part of the BigQuery Omni offering, which is 37% more expensive (in the EU).
- 3 lessons to learn before creating your own data team — Christelle wrote up 3 lessons learned from a survey run in a private French data community. Mainly, it shows that the first hires in a data team have to be picked carefully.
- How to prioritise projects and scale your Data Science team efficiently — A nice article about how to understand an OKR and make it your own to lead data science projects.
- Mistral 7B, the best 7B model so far, and open source — Mistral AI is the French company that wants to compete with OpenAI; they released a first 7B model under the Apache license.
- A selection of SQL tutorials — a long list.
Data Economy 💰
- Rollstack raises a $1.8m Seed. This YC company proposes a product that automates slide decks with data coming from your data stack, without engineering or manual work. This is an awesome idea my younger self would have loved 8 years ago when I was generating PowerPoint files in Python.
- Kolena raises $15m Series A. Kolena proposes an end-to-end framework to test and debug ML models to identify failures and regressions.
See you next week ❤️.
Join the newsletter to receive the latest updates in your inbox.