ML & Reproducibility: How to Experiment Like a Pro
Being able to replicate experimental results in ML is more than a nice-to-have: reproducibility is a prerequisite for moving a model from experimentation to production-scale deployment.
The trouble with ML is that its workflows are anything but linear. We experiment with different algorithms and parameters incrementally and iteratively, and making that work reproducible is genuinely hard.
⚡ Challenges with ML reproducibility
- End-to-end ML pipelines involve complex, multi-step workflows, from pre-processing training data to monitoring models for performance degradation.
- Copying massive training datasets each time we want to experiment isn’t scalable.
- Without dedicated tooling, there’s no way to version multiple model artifacts and their associated training data atomically.
- Managing versions of structured, semi-structured, and unstructured training data adds further complexity.
- It’s hard to enforce data privacy best practices and data access controls when ML teams create duplicate copies of the same data for collaboration.
💡 Solution? Data versioning
You can use a data versioning tool that works on data in place (in object stores) to version training data, ML code, and models together, without copying the underlying objects.
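The core idea is that a single commit atomically pins the state of everything an experiment depends on, so if any input changes, the version changes. As a toy illustration of that principle (this is not the lakeFS API, just the concept), here is a minimal sketch that fingerprints a dataset, training code, and hyperparameters into one reproducible snapshot ID:

```python
import hashlib
import json

def snapshot_id(dataset_bytes: bytes, code: str, params: dict) -> str:
    """Derive one deterministic ID that atomically pins the dataset,
    the training code, and the hyperparameters together.
    If any one of them changes, the ID changes."""
    manifest = {
        "data": hashlib.sha256(dataset_bytes).hexdigest(),
        "code": hashlib.sha256(code.encode()).hexdigest(),
        # sort_keys makes the hash independent of dict insertion order
        "params": hashlib.sha256(
            json.dumps(params, sort_keys=True).encode()
        ).hexdigest(),
    }
    return hashlib.sha256(
        json.dumps(manifest, sort_keys=True).encode()
    ).hexdigest()

# Identical inputs always reproduce the same snapshot ID...
a = snapshot_id(b"train.csv contents", "def train(): ...", {"lr": 0.01, "epochs": 10})
b = snapshot_id(b"train.csv contents", "def train(): ...", {"epochs": 10, "lr": 0.01})
# ...while changing any single input (here, the learning rate) yields a new one.
c = snapshot_id(b"train.csv contents", "def train(): ...", {"lr": 0.02, "epochs": 10})
assert a == b and a != c
```

A tool like lakeFS applies the same principle at object-store scale: branches and commits pin the training data without duplicating it, so checking out a commit reproduces the exact inputs of any past experiment.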
📖 How to get started
Check out this guide showing how you can achieve reproducibility via data version control: Building an ML Experimentation Platform for Easy Reproducibility Using lakeFS.
You can also watch this webinar showing how to use lakeFS to version your ML experiments and reproduce any specific iteration of an experiment as needed.
🛠️ What other people are saying about it
Here’s a primer on data versioning in machine learning: Intro to MLOps: Data and Model Versioning, and a handy tooling guide: How to Version Control Data in ML for Various Data Sources.