Machine Learning Operations (MLOps) is a relatively new concept in the world of ML and AI, but it is quickly becoming a must-have practice for organizations that want to deploy data science projects to production successfully. It is a set of guiding principles for structuring your processes so that each project has the best chance of reaching production and delivering maximum value.
This is the first blog post of a series of theoretical and practical MLOps articles. As such it is suitable for people looking to understand what MLOps is, why they need it, and how they can benefit from it.
modern machine learning from the standpoint of a dinosaur
When I started my career back in 2006, there was no Pandas, no Scikit-Learn, and none of the now-ubiquitous Python packages we have become used to. The buzzwords of the day were ‘Advanced Analytics’ and ‘Data Mining’. We trained statistical models, decision trees, and neural networks to solve problems, something we nowadays call Machine Learning.
Back in the day, you had two choices: build it yourself from the ground up or buy into a big commercial off-the-shelf software system. The former required very good software engineers, mathematicians, and statisticians. The latter meant being locked into the vendor’s ecosystem for a very long time, with some notable exceptions. To this day I know organizations sticking with the solution they acquired 15-20 years ago precisely because of that vendor lock-in.
By 2015, Open-Source software solutions for ML had matured and started disrupting the long-standing vendors of analytical systems. Around that time, Google ML practitioners published their paper on Hidden Technical Debt in Machine Learning Systems, which described some of the problems modern MLOps practices try to solve. At the same time, enterprises were starting to notice Open-Source software, Python, R, and their ecosystems in particular, and people started talking about how those solutions could be combined and extended into solid internal platforms. Their vision was in step with what the Googlers had found.
what is mlops and why do we need it?
MLOps is a set of practices trying to solve common problems around the building, deployment, and monitoring of Machine Learning models in production. It is best described as the intersection between Machine Learning, DevOps, and Data Engineering practices.
On the data side, an MLOps framework focuses on data collection, pre-processing, and feature engineering, ensuring that only quality data is fed to the next steps, because the “garbage in – garbage out” principle holds very strongly when building machine learning models. This is achieved through tests for data quality and reproducible data preparation and feature engineering. Once the features are ready, they are stored in so-called feature stores, where each feature is versioned and remains available throughout the life cycle of the models built on it.
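As a minimal sketch of what such a data-quality gate might look like (the column names and rules here are hypothetical, purely for illustration), a batch of data can be validated in code before it reaches feature engineering:

```python
import pandas as pd

def check_quality(df: pd.DataFrame) -> list:
    """Return a list of data-quality violations; an empty list means the batch passes."""
    issues = []
    if df["age"].isna().any():
        issues.append("missing values in 'age'")
    if (df["age"] < 0).any():
        issues.append("negative values in 'age'")
    if df.duplicated(subset="customer_id").any():
        issues.append("duplicate 'customer_id' rows")
    return issues

# A deliberately broken batch: one negative age and one duplicated id.
batch = pd.DataFrame({
    "customer_id": [1, 2, 2],
    "age": [34, -1, 28],
})
print(check_quality(batch))  # flags the negative age and the duplicate id
```

Dedicated tools cover this ground far more thoroughly; the point is that the checks themselves are code, versioned alongside the pipeline and run on every incoming batch.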
machine learning operations
Design, (re-)train, optimize, and evaluate models for the task at hand. Again, the focus is on having reproducible results and on logging the different training artifacts, such as models and metrics. Within this category of practices we can go even further and automate model explainability, producing reports on why a particular model was chosen. This practice is often called experiment tracking: it lets us collect and organize training results across runs with different configurations (hyperparameters, features, data samples, and data splits). It also means we:
- Get an audit trail of our work, so we can stay regulatory compliant if needed.
- Have reproducible training sessions.
- Can package the models and be sure they will work as expected.
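The core of experiment tracking can be sketched with nothing more than the standard library. Real projects would typically reach for a dedicated tool such as MLflow; the function and file names below are made up for illustration:

```python
import hashlib
import json
import time
from pathlib import Path

def log_run(run_dir, params, metrics):
    """Persist one training run's configuration and results as a JSON record."""
    run_dir = Path(run_dir)
    run_dir.mkdir(parents=True, exist_ok=True)
    # Derive a stable id from the hyperparameters, so re-running the same
    # configuration overwrites its previous record instead of duplicating it.
    run_id = hashlib.sha1(json.dumps(params, sort_keys=True).encode()).hexdigest()[:8]
    record = {
        "timestamp": time.strftime("%Y-%m-%dT%H:%M:%S"),
        "params": params,
        "metrics": metrics,
    }
    path = run_dir / f"run_{run_id}.json"
    path.write_text(json.dumps(record, indent=2))
    return path

# One hypothetical run: hyperparameters in, evaluation metrics out.
record_path = log_run("runs", {"max_depth": 5, "n_estimators": 100}, {"auc": 0.91})
```

Comparing runs then becomes a matter of loading these records and sorting them by a metric, and the audit trail comes for free.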
the devops part of mlops
This part refers to the practice of having a reproducible, end-to-end process for building, testing, and deploying the software and/or machine learning project. Monitoring the models, a step more specific to ML than to DevOps, is also crucial to the success of an ML project. Not knowing that our input data has drifted, or that our model is underperforming after some time, can result in a massive loss for the organization. And, be sure, over time the data will change, so you need safeguards in place to notify you of the change.
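One simple, widely used way to quantify such input drift is the Population Stability Index (PSI): bin a reference sample of a feature, compare the bin frequencies of live data against it, and alert when the score crosses a threshold (around 0.25 is a common rule of thumb). A self-contained sketch, with simulated data standing in for a real feature:

```python
import math
import random

def psi(expected, actual, bins=10):
    """Population Stability Index between a reference sample and a live sample.

    Assumes the reference sample is not constant (its min and max differ).
    """
    lo, hi = min(expected), max(expected)

    def bin_fractions(sample):
        counts = [0] * bins
        for x in sample:
            # Clamp into [0, bins - 1] so out-of-range live values land in the edge bins.
            i = min(max(int((x - lo) / (hi - lo) * bins), 0), bins - 1)
            counts[i] += 1
        # Floor at a tiny fraction to avoid log(0) for empty bins.
        return [max(c / len(sample), 1e-6) for c in counts]

    e, a = bin_fractions(expected), bin_fractions(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))

random.seed(0)
reference = [random.gauss(0, 1) for _ in range(1000)]  # training-time feature values
live = [random.gauss(1, 1) for _ in range(1000)]       # production values, mean shifted
print(round(psi(reference, live), 2))  # well above the 0.25 alert threshold
```

In a real monitoring setup this would run on a schedule for every important input feature, with the score feeding an alerting system.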
From my personal experience, I can say that good monitoring has helped me identify problems in production systems. In one case, the monitoring system showed us that our model was no longer relevant: the users interacting with the system had changed their behavior, so the model needed to be re-evaluated and re-trained.
In other, more trivial cases, monitoring revealed technical problems. For example, a dip in the number of requests to the models turned out to be caused by an outage in a branch system. We had default behaviors in place for that scenario, but they were sub-optimal and cost us some revenue, though far less than if the problem had gone undiscovered for days.
build vs. buy
An MLOps framework is fundamental for people and organizations trying to leverage their data sources to deploy solutions that make decision-making easier, increase revenue, and free up time to focus on more value-adding topics.
A good platform has many moving parts, so building a complete end-to-end MLOps practice takes time and requires a lot of expertise upfront. An organization starting out with MLOps should consider a hybrid approach: working with a solutions partner, like Accedia, that has experience in software development, DevOps, and Machine Learning to assess the current situation, create an action plan, and deploy a well-established MLOps platform customized to the client’s needs. This reduces the time to deploy and operationalize the new ML platform, and it further enables the fast creation of internal ML projects and, at some point, their graduation into complete ML programs within the company.
Back in the day, we had monolithic systems that were (and still are) very capable Machine Learning platforms, and you could apply some of the best MLOps practices in them as well. Heck, some of those best practices came from those systems.
Nowadays, we as practitioners and organizations have many rapidly changing needs that no single system could ever fill. I enjoyed working for a vendor of such a system as much as the next guy, but the disruption and innovation still happening in the Open-Source world of Machine Learning gives me opportunities to learn new things and to solve new and exciting problems. And I have the MLOps framework to guide me!
In a future article in this series, I intend to go into more technical examples of how we can leverage some of the ML, MLOps, and Data Engineering practices in our daily work as Machine Learning/Data Science practitioners.
In the meantime, if you want to learn more on the topic of MLOps and how it can be used in your organization, please don’t hesitate to reach out!