Hello Data News readers. I'm still struggling to get back into my usual work rhythm. On top of that, last week I gathered fewer articles than expected, which left me facing another blank page. Anyway, after 2 years of writing, I have to accept it and let go when necessary. But don't worry, I haven't forgotten you.
Let's quickly jump to the news, because it's rather busy.
(Gen) AI News 🤖
- Reinforcement Learning: an easy introduction to value iteration — The title says easy, but the article contains maths formulas. RL always feels a bit magical, and this article explains it well through golf analogies.
- Falcon 180B has been released on HF — It's interesting to note that Falcon was developed at the Technology Innovation Institute (TII) in Abu Dhabi. It brings diversity to foundation models, which usually come from the US. But given the number of parameters (180B), can it run on your computer? Spoiler: according to Benjamin it needs 100GB of RAM to run, and good GPUs to be able to fine-tune.
- If you're late to the party and need fresh views on LLMs, Daniel wrote an introduction demystifying Large Language Models and Jesse wrote about LLMs' impact from a Data Engineering perspective.
- At the same time GitHub Research quantified GitHub Copilot's impact on developer productivity and happiness — Developer productivity is a difficult measure to compute. Also productivity ≠ speed, but speed is important. The research also showed that people using GitHub Copilot feel 88% more productive, and are more efficient and less frustrated.
- HuggingFace CEO and co-founder opening statement at AI insight forum — This week US AI giants went to a 6-hour private meeting with 60 US senators to explore AI regulation. Clement Delangue transparently shared his speech on Twitter. Mainly he talks about openness, risk measurement (misinformation, election manipulation, increased carbon emissions) and finally safeguard implementation.
- Meta developed an end-to-end AI system performance simulator called Arcadia. From what I understand, this performance simulator unlocks the ability to find the best parameters for training.
Fast News ⚡️
- Birmingham City Council has to pay 5x the initial price of its new Oracle ERP project: from £20 million to around £100 million. Crazy amounts.
- I just discovered this week that in June BigQuery introduced primary keys and foreign keys.
- How to reduce warehouse costs? — Hugo proposes 7 hacks to optimise data warehouse costs. And if you read French (🇫🇷) there is a great post by a French data collective about how to reduce your Google BigQuery costs.
- * * * * * schedule Snowflake queries — If you want to live dangerously you can use Snowflake table schedules to recompute tables periodically. I don't recommend it; it's a Pandora's box we don't want to open.
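If the `* * * * *` in the title looks cryptic: those are the five cron fields, and all-wildcards means "every minute". A tiny Python sketch (the field names and helper are mine, purely illustrative) of how a cron expression breaks down:

```python
# The five positions of a standard cron expression, in order.
FIELDS = ["minute", "hour", "day_of_month", "month", "day_of_week"]

def parse_cron(expr: str) -> dict:
    """Split a cron expression into named fields (no range/step validation)."""
    parts = expr.split()
    if len(parts) != 5:
        raise ValueError("expected 5 cron fields")
    return dict(zip(FIELDS, parts))

print(parse_cron("* * * * *"))
# {'minute': '*', 'hour': '*', 'day_of_month': '*', 'month': '*', 'day_of_week': '*'}
```

So `0 6 * * 1` would mean "06:00 every Monday", while the title's `* * * * *` recomputes a table every single minute, which is exactly the dangerous part.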
- Dimensional data modeling with dbt — A great 6-step process to create a simple dim-fact model with dbt. It also uses the dbt_utils macro to generate a surrogate key.
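For intuition, the surrogate-key idea boils down to hashing the concatenation of the business-key columns. A hedged Python sketch of that concept (this mimics what dbt_utils' `generate_surrogate_key` macro does in SQL; the exact null handling and separator here are my simplification):

```python
import hashlib

def surrogate_key(*values) -> str:
    """Deterministic key: MD5 of the stringified values joined with '-'.
    None is coerced to an empty string, as the dbt macro coalesces nulls."""
    joined = "-".join("" if v is None else str(v) for v in values)
    return hashlib.md5(joined.encode("utf-8")).hexdigest()

# Same inputs always yield the same key, which is the whole point
# for joining facts to dimensions.
print(surrogate_key("customer_42", "2023-09-01"))
```

The key property is determinism: re-running the model on the same business keys regenerates identical surrogate keys, so joins stay stable across rebuilds.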
- Head-to-head comparison of 3 dbt SQL engines — A comparison between DuckDB, Spark and Trino where DuckDB wins almost every fight. Obviously biased by the fact that the comparison is done on a single node, and DuckDB is built for exactly that.
- Scrape & analyse football data — Benoit nicely shows how to use Kestra, Malloy and DuckDB together to analyse data.
- Factory Patterns in Python — It reminds me of the Java design pattern classes at engineering school. A bittersweet feeling. Still, I think the Factory pattern is probably the one I've used the most since the beginning of my career, and this post explains it well.
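If you've never met the pattern, the gist fits in a few lines: a factory function picks the concrete class for you, so calling code depends only on the interface. A minimal sketch (the reader classes here are made up for illustration, not from the linked post):

```python
class CsvReader:
    def read(self, path: str) -> str:
        return f"reading {path} as CSV"

class JsonReader:
    def read(self, path: str) -> str:
        return f"reading {path} as JSON"

def reader_factory(fmt: str):
    """Factory: map a format name to the right reader class,
    so callers never instantiate concrete classes directly."""
    readers = {"csv": CsvReader, "json": JsonReader}
    try:
        return readers[fmt]()
    except KeyError:
        raise ValueError(f"unsupported format: {fmt}")

reader = reader_factory("csv")
print(reader.read("data.csv"))  # reading data.csv as CSV
```

The payoff in data pipelines is that adding a new format means registering one class in the factory, without touching any of the call sites.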
- When charts look like spaghetti, try these saucy solutions — Great tips to enhance your dashboards.
- ❤️ The key to building a high-performing data team is structured onboarding — The title says it all. Still, the article mentions 2 key pieces: first you need a great onboarding doc, then you need to successfully pass the "bootcamp" phase, which covers the first 2 weeks.
Of course, great onboarding isn't the only thing necessary to build a high-performing team, but it's almost impossible to build one without great onboarding.
Github gems 💎
- nike-inc/brickflow — Nike's engineering team released a Python framework to orchestrate jobs in Databricks workflows. Mainly it maps Airflow concepts onto a declarative interface over Databricks objects like Clusters, Workflows or Notebooks in order to orchestrate them.
- sourcegraph/cody — Cody is a free, open-source AI coding assistant that can write and fix code, provide AI-generated autocomplete, and answer your coding questions. Under the hood it uses either Anthropic or OpenAI LLMs to work and requires a free cody.dev account.
- teej/titan — Titan is a Python library to manage data warehouse infrastructure. Titan allows you to create Snowflake Databases, Warehouses, Role and RoleGrant in a programmatic manner.
Data Economy 💰
- SQream raises $45m Series C. SQream is a GPU-based SQL database that can act as a data warehouse, promising peak performance at PB scale thanks to its GPU architecture. It also works well for machine learning use-cases.
- Gable raises $7m in seed funding. Chad Sanderson launched his data contracts product / platform with 2 other co-founders. Chad has produced a lot of content around contracts in the last 2 years. It seems Gable is here to fix upstream data quality with contracts: alerts will be sent in GitHub to notify owners when something breaks the enforced rules.
- Databricks raises, another, $500m in Series I. Soon there will be no letter left in the alphabet for Databricks fundraising rounds. Since the beginning they have raised $4b and are today valued at $43b. Nothing to say except that they love to burn cash. Be ready for a downturn in 2025 if you have picked Databricks.
- Treefera raises $2.2m in pre-seed to develop a data platform that monitors forests, built for carbon offsetting and reforestation. I really like their "data products" approach and the geo visuals of forest risks.
- Collibra acquires SQL data notebook Husprey. Husprey is a Notion-like notebook directly in the warehouse, used to write stories on top of each interesting table or fact. It will become a nice product in the Collibra data governance ecosystem.
See you next week ❤️.
Join the newsletter to receive the latest updates in your inbox.