Learnings from a Machine Learning Engineer — Part 3: The Evaluation

In this third part of my series, I will explore the evaluation process, a critical piece that leads to a cleaner data set and better model performance. We will see the difference between evaluating a trained model (one not yet in production) and evaluating a deployed model (one making real-world predictions). In Part 1, […]

Building a Data Engineering Center of Excellence

As data continues to grow in importance and become more complex, the need for skilled data engineers has never been greater. But what is data engineering, and why is it so important? In this blog post, we will discuss the essential components of a functioning data engineering practice and why data engineering is becoming increasingly […]

Pandas Can’t Handle This: How ArcticDB Powers Massive Datasets

Python has grown to dominate data science, and its package Pandas has become the go-to tool for data analysis. It is great for tabular data and supports data files of up to 1GB if you have enough RAM. Within these size limits, it is also good with time-series data because it comes with some […]

How to Measure the Reliability of a Large Language Model’s Response

The basic principle of Large Language Models (LLMs) is very simple: to predict the next word (or token) in a sequence of words based on statistical patterns in their training data. However, this seemingly simple capability turns out to be incredibly sophisticated when it can do a number of amazing tasks such as text summarization, […]

Should Data Scientists Care About Quantum Computing?

I am sure the quantum hype has reached every person in tech (and outside it, most probably), with some over-the-top claims like “some company has proved quantum supremacy,” “the quantum revolution is here,” or my favorite, “quantum computers are here, and it will make classical computers obsolete.” I am going to be honest with you; […]

Manage Environment Variables with Pydantic

Developers work on applications that are meant to be deployed on a server so that anyone can use them. Typically, on the machine where these apps live, developers set up environment variables that allow the app to run. These variables can be API keys of external services, URL of your database and […]

Understanding Model Calibration: A Gentle Introduction & Visual Exploration

How Reliable Are Your Predictions? To be considered reliable, a model must be calibrated so that its confidence in each decision closely reflects its true outcome. In this blog post, we’ll take a look at the most commonly used definition for calibration and then dive into a frequently used evaluation measure for model calibration. […]

Build a Decision Tree in Polars from Scratch

Decision tree algorithms have always fascinated me. They are easy to implement and achieve good results on various classification and regression tasks. Combined with boosting, decision trees are still state-of-the-art in many applications. Frameworks such as sklearn, LightGBM, XGBoost and CatBoost have done a very good job so far. However, in the past few months, […]

4-Dimensional Data Visualization: Time in Bubble Charts

Bubble Charts elegantly compress large amounts of information into a single visualization, with bubble size adding a third dimension. However, comparing “before” and “after” states is often crucial. To address this, we propose adding a transition between these states, creating an intuitive user experience. Since we couldn’t find a ready-made solution, we developed our own. […]