Dear readers, I hope this second edition of the year finds you well. We often say that stating things in public helps you achieve goals. So here are some personal goals I'd like to achieve this year:
- Move from Paris to Berlin
- Run every week (starting next week 😇)
- Map the European data landscape by interviewing people and launch a podcast
- Lose weight (−7 kg)
- Publish 12 videos on YouTube
- Do a 30-day datavis challenge
If you want to use me as your sparring goal partner, hit reply and send me your goals; it'll be a pleasure to remind you of them all year long. Each newsletter will be like a secret alarm.
Data fundraising 💰
- The French open-data sharing company Opendatasoft, founded in 2011, raised $25m. Opendatasoft is a platform built for public and private organizations to open their data. Access can be paid or free; as an example, see the city of Vancouver data portal.
- Snowflake publicly announced they invested in Collibra's latest funding round. Did I already tell you that Collibra is a contraction of collaboration and library 🤷♂️?
If you want to go to space, contribute to Astropy
And don't wait for US billionaires to open tourism there. Apart from finding the Great Bear and Cassiopeia in the sky, I've never been that good at or interested in astronomy. But this read about how the astronomy community shaped Astropy and got NASA grants gave me feels.
BigQuery now supports semi-structured data
Yeah 🥳 that means you will be able to create a column with the JSON type. This is both good and bad news. We will now be able to load a JSON column easily, but that also means we'll drop the schema validation we were all forced to have while parsing the data. Have a look at the BigQuery JSON data documentation page. Zach also detailed it in a Medium post.
On the other side, I don't know which words to use, but Snowflake's communication always leaves me cold. I know this is probably the strategy, but they seem so corporate. This week on their Medium publication they explained how you can break the 16MB JSON limit — a coincidence with the Google news? — but tbh the post is so hard to read.
Out of curiosity I had a look at the BigQuery limits and did not find one for JSON columns, but I saw that a row can't exceed 100MB.
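To make the tradeoff concrete, here is a minimal Python sketch (the validator and field names are hypothetical, not BigQuery code): with a strict schema, bad records fail fast at ingestion; with a raw JSON column, anything that parses gets loaded and the problem only surfaces when you query the field.

```python
import json

# Hypothetical minimal schema, standing in for the validation a typed
# pipeline forces on you before loading.
EXPECTED = {"user_id": int, "event": str}

def parse_strict(raw: str) -> dict:
    """Schema-on-write: reject malformed records at load time."""
    record = json.loads(raw)
    for field, ftype in EXPECTED.items():
        if not isinstance(record.get(field), ftype):
            raise ValueError(f"bad or missing field: {field}")
    return record

def parse_raw(raw: str) -> dict:
    """JSON-column style: accept any valid JSON, defer schema checks."""
    return json.loads(raw)

good = '{"user_id": 1, "event": "click"}'
bad = '{"user_id": "oops"}'

parse_strict(good)   # passes validation
parse_raw(bad)       # loads fine; the schema problem is deferred
try:
    parse_strict(bad)  # fails fast at ingestion
except ValueError as err:
    print(err)
```

Both behaviours have their place; the point of the BigQuery change is that the choice is now yours instead of being forced on you.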
The metadata money corporation
But the landscape is getting crowded, and its unincorporated territories are becoming too small to represent new categories in the eyes of the customer. Buyers don’t spend nearly as much time studying the distinctions between vendors as the vendors themselves do, and what can seem like category-defining differences from the inside are minor details to everyone else.
This is exactly what I tried to say last week, but in better English. Benn also argues in the direction of one unified data experience that could be powered with the metadata kerosene — data is the new oil, as they say.
The Mythical Modern Data Stack
Is it simpler to find the best pancake recipe or the best Modern Data Stack? Doug Foo tried to answer this simple question. I really enjoyed the way he demystified the tools and the stack. This is a nice reminder of our common concepts and a nice entry-level post for newcomers.
PS: this is also a good follow-up to Benn's article.
Airbyte or Meltano — and why they did not choose one of them
Robert Offner detailed how his team at Kolibri Games tried Airbyte and Meltano in order to decide the next steps of their data integration. Finally! Here is a first piece of feedback on those tools. For Robert, both tools are not yet totally mature, but I bet recent valuations will help fix that soon — I hope.
Don't forget Airflow
And Airflow is still here in the king's seat. Astronomer wrote about their astro packages for ETL that can help you bootstrap DAGs. I really like the shortcuts around SQL; this is a good start.
Voodoo detailed why they decided to go with Airflow; the post brings new ideas about Airflow monitoring that I like.
ML Friday 🦾
The Doctrine team shared how they implement A/B tests, with an example based on an actual alert they use in their product. The way they did it is a genius combination of APIs with HTTP header forwarding. They have an allocation service using a deterministic hash, plus back-end services.
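The deterministic-hash idea can be sketched in a few lines of Python (function names, the header name, and the 50% split are my assumptions, not Doctrine's actual code): hashing a stable user id means the same user always lands in the same variant, without storing any assignment.

```python
import hashlib

def allocate(user_id: str, experiment: str, treatment_share: float = 0.5) -> str:
    """Deterministically assign a user to a variant (hypothetical sketch).

    Hashing experiment + user id into a 0-99 bucket gives a stable,
    storage-free assignment; salting with the experiment name keeps
    different experiments independent.
    """
    key = f"{experiment}:{user_id}".encode()
    bucket = int(hashlib.sha256(key).hexdigest(), 16) % 100
    return "treatment" if bucket < treatment_share * 100 else "control"

# The allocation service would then forward the result to back-end
# services, e.g. in an HTTP header such as X-Experiment-Variant (assumed name).
variant = allocate("user-42", "alert-redesign")
assert allocate("user-42", "alert-redesign") == variant  # same user, same variant
```

Because the hash is deterministic, any service that knows the user id and the experiment name can recompute the variant, which is what makes the header-forwarding pattern work.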
Also this week, Aurelien from GitGuardian talked about targeting zero false positives when doing predictions.
Raw News 🥦
Because I'm too late, it's raw stuff for you.
- Data to engineers ratio: US vs Europe — part 2 after last week's post
- How to collect and visualize data lineage in an AWS-based data lake — good ideas that could apply to all kinds of platforms
- Zingg entity resolution for deduplication in Svenn Thoughts
- Why I chose data engineering over data science
- Introducing Credmark's Senior Data Engineer — even DeFi is doing data engineering, we are getting hyped
- Yes, you can learn SQL in two hours — 😂 :troll:
See you next week after a jogging session. I've been late again this week, but I had a class once again.