Hello, a lot of content was published this week with Coalesce, and I had also kept a lot of articles from last week, so I had to navigate through this quantity to produce this edition. I'm not that proud of the format, but it's OK.
As a side note, I'm going to do the 30-day map challenge in November. If you are doing it or want to do it, say hi.
Women in Data — part 2 👩‍💻
This is the second part of the summary of the Women in Data meetup we organized 2 weeks ago. The second round table discussed parity in the data ecosystem.
What can we collectively do to achieve parity in data ecosystems? 💪
Several answers and ideas were proposed by the speakers. Let's dive in, topic by topic.
- Culture. Company culture plays a big part in parity topics. Every manager should be trained and encouraged to address equality topics. Every inappropriate behaviour should also be called out and addressed, although there was some debate on whether this should be done with humour or with firmness. Gabrielle also described an internal collective she chairs to help women find their place. Along the way they identified 5 important points for these collectives to work: define a clear vision, find a sponsor, understand the issues through interviews, plan actions that fit into what already exists in the company, then develop content to infuse the culture.
- Also on the culture topic (yes, I moved to another bullet because the first one is too big), there are initiatives at Deezer to help women by providing material or days off during periods. Last but not least, there is everything related to the words we use. We should use inclusive writing, which is more prevalent in French than in English. For instance, "hey guys" should be banned.
- Hiring. Everyone says it is hard to find women in the data field. This is probably true. But if you don't force yourself to search for diversity, it will never change. One solution is to set a ratio when searching for people: for instance, you can ask your hiring agency to propose at least one woman per 3 candidates, and tell them you will not look at the profiles otherwise, no matter what. Then you have to care about the whole hiring funnel.
- Other hiring issues were discussed: the gender salary gap, and the fact that studies have shown that women tend to apply less often when they don't tick all the requirements.
- What else to change. All these differences can be fixed locally, within a company, but this is something that needs a deeper change in society. At the meetup, speakers shared initiatives to promote tech and data jobs in schools, for instance. The idea is to show role models to inspire younger generations. The tech industry is not a men's world.
That's all for this Women in Data meetup. I hope I've transcribed the discussion with the right words and intention. I might have misinterpreted some exchanges, and if that's the case, I'm sorry.
One last point on this topic: let's not forget we are talking about diversity, so this is not only about men and women. There is more to being diverse and inclusive.
dbt Coalesce 2022
dbt Coalesce took place this week. It is the annual 4-day conference organised by dbt Labs. While all the data influencers were there to meet and chat about the next trends of the analytics industry, a few announcements were made.
dbt Labs took the time to announce the Semantic Layer, which others call the metrics layer, or the feature store in the data science space. We'll see a lot of buzz in 2023 around this single layer to access metrics. dbt Labs will push this architecture forward in search of revenue growth. They will add it as a product in their cloud offering, with a Proxy SQL and a Metadata API.
If you want to see how the semantic layer can be used, Hex demoed it. You can also see the semantic layer rising from the BI perspective with Semantic BI. In this new world everyone wants to see the issues from their own perspective, which is annoying for users but fun as an outsider 🙃.
I'll dedicate a full post to my highlights of the conference early next week, after watching all the replays.
Data contracts 👻
Even if I try not to fall for hype, so that I can give a higher-level view on trends, when I see data contracts everywhere I still have to share it. In a nutshell, data contracts are formalized interfaces between data producers and consumers. The most common pattern seems to be an API (HTTP, file, event, table, etc.) between software engineers and the data team, with a way to enforce the contract. We have been calling this a schema for ages.
I have long been convinced that data contracts are not a data problem but an IT problem. If the whole tech team is not aligned on how data changes should be managed, you'll fix only a small part of the problem. Petr wrote a great piece about the way we draw lines. What belongs where?
Data contracts aligned around business areas (domains) rather than technology layers. Contracts are technology-agnostic and can live anywhere inside the Data Platform.
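To make the "schema with a way to enforce it" idea concrete, here is a minimal sketch in Python. Everything in it is an assumption for illustration: the `Field` class, the `ORDERS_CONTRACT` definition and the `violations` helper are hypothetical names, not taken from any of the linked articles or from a real library.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Field:
    """One field of a hypothetical data contract."""
    name: str
    dtype: type
    required: bool = True


# Hypothetical contract for an "orders" event, agreed between
# the producing software team and the consuming data team.
ORDERS_CONTRACT = (
    Field("order_id", str),
    Field("amount", float),
    Field("coupon", str, required=False),
)


def violations(record: dict, contract: tuple) -> list:
    """Return human-readable contract violations for one record."""
    problems = []
    for field in contract:
        if field.name not in record:
            # Optional fields may be absent; required ones may not.
            if field.required:
                problems.append(f"missing required field: {field.name}")
            continue
        value = record[field.name]
        if not isinstance(value, field.dtype):
            problems.append(
                f"{field.name}: expected {field.dtype.__name__}, "
                f"got {type(value).__name__}"
            )
    # Fields the contract does not know about are reported too.
    known = {field.name for field in contract}
    for extra in sorted(set(record) - known):
        problems.append(f"unexpected field: {extra}")
    return problems
```

A producer could run such a check in CI before shipping a change, and a consumer could run it at ingestion time. In practice you would more likely lean on an existing schema format than on hand-written checks like these.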
Fast News ⚡️
- Apache Kafka SSL Security — A simple explanation of how SSL handshakes work and why you should add SSL to your Kafka cluster.
- How Can Artists Influence Recommendation Algorithms? — Second part of the MusicTomorrow series about their tool that helps music artists become more viral on music platforms.
- Load Github API data with Python model in dbt — A new way to see data ingestion. In this article the author gets GitHub data with a dbt Python model running in Snowpark, demoing an extract-load orchestrated directly in your dbt project. This is a good example, but I'm not sure it should be reproduced at scale.
- Is Druid still a thing? — Druid is a distributed OLAP database that can be used for real-time workloads. In the past, the main issue with Druid was the lack of SQL, but that has changed. This post is an introduction to the Druid architecture.
- Airbnb’s key-value store for derived data — Giants can't stop inventing new databases to solve problems at their scale. This time Airbnb created Mussel, a combination of other open-source software, to get a scalable key-value store.
- Data Engineering Excellency at Netflix — How Netflix empowers the data engineering team to reach excellence. They even compare data engineers to X-Men: they all have different superpowers to work on different villains, for instance Maestro, the data/ML orchestrator.
- End-to-end data pipeline tests on Databricks — I like all the testing topics even if it's in Spark (😬). Sicara detailed here how they did it for data quality and unit tests.
Data Fundraising 💰
This week a few data satellite companies raised money. By satellite I mean companies that are not strictly in the data field but that put data at the centre of their product.
- RisingWave raised a $36m Series A. It is a cloud-native streaming database that uses SQL. You can either deploy your own Docker instance or use their new cloud offering. It works with materialized views that are refreshed in real time on top of tables connected to real-time sources like Kafka, Redpanda or CDC.
- Tellius raised $16m in Series B. Tellius offers an augmented analytics platform. A one-stop platform with insights discovery that does anomaly detection on your metrics.
- Keebo got $15m in a Series A. Keebo provides a way to lower your warehouse costs by rewriting your SQL queries on the fly. With their solution, rather than connecting to Snowflake, you connect to Keebo and let them do the magical optimisation. Even if I like the promise, I don't think it is a good idea to rely on a third party to do optimisations. You are better off teaching people performance tips, backed by CI/CD checks for instance.
- The "security" space got some traction this week with 3 companies raising money. Anonos raised $50m in debt and provides a compliant pseudonymization engine. OutThink raised a $10m seed to automatically tackle data breaches by highlighting company risks. Velotix raised a $10m seed to automate data access across the whole platform.
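On the Keebo point above, a CI/CD check for query performance can start very small. The sketch below is only an illustration under my own assumptions: the two rules and the `lint_sql` helper are hypothetical, regex-based checks, not a real SQL parser and not what any vendor does.

```python
import re

# Hypothetical rule set: patterns that often make warehouse queries
# more expensive than they need to be. Regexes are a rough heuristic.
RULES = {
    "select-star": re.compile(r"\bselect\s+\*", re.IGNORECASE),
    "cross-join": re.compile(r"\bcross\s+join\b", re.IGNORECASE),
}


def lint_sql(sql: str) -> list:
    """Return the names of the rules that the given SQL text triggers."""
    return [name for name, pattern in RULES.items() if pattern.search(sql)]
```

A CI job could call `lint_sql` on every changed SQL file and fail the build when the returned list is not empty, which turns a performance tip into something people learn from at review time.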
See you next week ❤️.