Your Guide to Avoiding Critical Errors with Machine Learning in Production

Towards Data Science

I’ll never forget the first time I got a PagerDuty alert telling me that model scores weren’t being returned properly in production.

Panic set in — I had just done a deploy, and my mind started racing with questions:

  • Did my code cause a bug?
  • Is the error causing an outage downstream?
  • What part of the code could be throwing errors?

Debugging live systems is stressful, and I learned a critical lesson: writing production-ready code is a completely different beast from writing code that works in a Jupyter Notebook.

In 2020, I made the leap from data analyst to machine learning engineer (MLE). While I was already proficient in SQL and Python, working with production systems forced me to level up my skills.

As an analyst, I mostly cared that my code ran and produced the correct output. That mindset didn't carry over to engineering work.

As an MLE, I quickly realized I had to write efficient, clean, and maintainable code that worked in a shared codebase.
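To make the contrast concrete, here is a minimal sketch of the same scoring logic written notebook-style versus production-style. Everything here is hypothetical (the `toy_model`, function names, and fallback behavior are illustrative, not from any real system): the production version validates inputs, handles failures, and logs context instead of assuming the happy path.

```python
import logging
import math

logger = logging.getLogger(__name__)

def predict_notebook(model, features):
    # Notebook style: assumes inputs are always valid and the model never fails.
    return model(features)

def predict_production(model, features, default=0.0):
    """Production style: validate inputs, catch failures, log context, fall back."""
    # Guard against missing or malformed inputs instead of crashing downstream.
    if not features or any(f is None or math.isnan(f) for f in features):
        logger.warning("Invalid features %r; returning default score", features)
        return default
    try:
        score = model(features)
    except Exception:
        logger.exception("Model scoring failed; returning default score")
        return default
    # Keep outputs within the expected range for consumers of this score.
    if not 0.0 <= score <= 1.0:
        logger.warning("Score %s out of range; clamping", score)
        score = min(max(score, 0.0), 1.0)
    return score

# Example with a toy "model" that averages its features.
toy_model = lambda fs: sum(fs) / len(fs)
print(predict_production(toy_model, [0.2, 0.4]))
print(predict_production(toy_model, [], default=0.5))  # falls back to 0.5
```

The notebook version is shorter, but the production version is the one that survives a bad payload at 3 a.m. without paging anyone.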