In the continually advancing world of machine learning (ML), one concept has quickly become a keystone of effective and efficient model deployment: Machine Learning Operations, or MLOps for short. This discipline combines ML, DevOps, and data engineering, providing a methodical approach to deploying, monitoring, and maintaining ML models in production.
Nevertheless, not every ML application fits neatly into the conventional MLOps framework. Occasionally, the model or its application diverges from the standard, presenting unique challenges that demand inventive solutions. Today, I will share an intriguing case study in which we navigated some atypical circumstances to successfully industrialise an ML model for one of our clients.
In this particular instance, we ran into several challenges: a monolithic codebase written without the MLOps workflow in mind; a model that did not serve users directly, instead producing a fixed set of predictions as a data product; the need to reproduce the model's probability distribution on every run, making conventional model serving inapplicable; lineage that stretched back to continuously updated data sources outside the Azure ML boundary; and the risk of model drift rather than conventional data drift.
This experience highlights the significance of taking into account MLOps principles right from the inception of any ML project, to assure a smoother transition into operation and maintenance.
In this blog post, I'll guide you through this captivating journey and show you how we managed to metamorphose a non-standard ML model into an operationally proficient solution using Azure's ML platform. So, welcome to a deep dive into the world of MLOps under non-standard conditions…
Before we plunge into our case study, let's briefly refresh our understanding of MLOps and how Azure caters to it. MLOps is an engineering culture and practice that aims to unify ML system development (Dev) and ML system operations (Ops). The principal objective is to create a seamless integration and workflow from the development, to the deployment and maintenance, of ML models.
A comprehensive workflow which includes a build pipeline (CI), a continuous training pipeline (CT), and a delivery pipeline (CD) is paramount to successful implementation. We found Azure DevOps to be a suitable tool for overseeing the orchestration of CI/CD within this structure. It is robust and easy to configure, and enables us to assemble and deploy our solution across different environments effectively.
That said, the main thrust of this blog post is to spotlight the heart of the ML operation: the continuous training pipeline. This pipeline typically comprises several stages: data ingestion, preprocessing, feature engineering, modelling, model testing and validation, model registering, and serving. Each stage is important and is typically carried out in the sequence outlined.
However, do bear in mind that the specific structure, components, and sequence of these stages can be tailored to meet the unique demands and maturity of a given project. In this manner, MLOps retains its inherent flexibility and adaptability, accommodating the diverse requirements of any ML venture.
Azure Machine Learning, a prime offering from Microsoft's cloud services suite, is a collection of tools and services that facilitate MLOps. Azure ML offers services such as automated ML, hyperparameter tuning, and a multitude of pre-built ML algorithms. It also boasts powerful MLOps capabilities.
Azure ML further enables MLOps development through the azureml Python SDK. This SDK permits developers to engage with the Azure ML service APIs, and to create and deploy pipelines, monitor experiments, register models and artefacts, manage metadata, and much more.
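To give a flavour of what that looks like in practice, here is a minimal sketch of a continuous training pipeline defined and submitted with the azureml SDK (v1). The script names, compute target, and experiment name are illustrative placeholders rather than details from our project.

```python
from azureml.core import Experiment, Workspace
from azureml.pipeline.core import Pipeline
from azureml.pipeline.steps import PythonScriptStep

# Connect to the workspace described in a local config.json.
ws = Workspace.from_config()

# One step per MLOps phase; script names and the compute target are placeholders.
preprocess = PythonScriptStep(
    name="preprocess",
    script_name="preprocess.py",
    source_directory="src",
    compute_target="cpu-cluster",
)
train = PythonScriptStep(
    name="train",
    script_name="train.py",
    source_directory="src",
    compute_target="cpu-cluster",
)
register = PythonScriptStep(
    name="register_model",
    script_name="register.py",
    source_directory="src",
    compute_target="cpu-cluster",
)

# Enforce the stage ordering: preprocess -> train -> register.
train.run_after(preprocess)
register.run_after(train)

pipeline = Pipeline(workspace=ws, steps=[preprocess, train, register])
Experiment(ws, "continuous-training").submit(pipeline)
```

Each step maps onto one of the MLOps phases described above, which is precisely why the decomposition work discussed later in this post mattered so much.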
For our case study, Azure's ML platform was selected due to its extensive services and strong MLOps support. But what happens when your ML model doesn't quite slot into the typical MLOps mould? Let's delve into the challenges and solutions we unearthed during the course of our project.
The first hurdle we came across in our project was the monolithic character of the initial codebase. Devised without the MLOps workflow in mind, this large, tightly interconnected codebase posed a considerable obstacle for us to overcome.
A monolithic codebase can be compared to a single, intricate knot where numerous threads representing various processes are interwoven and inseparable. In our case, those threads represented different stages of the MLOps workflow, such as data ingestion, preprocessing, feature engineering, and more. With everything bundled up, it was a tough task to work on individual phases independently, making the code awkward to manage and modify.
Our first critical task was to meticulously untangle this monolithic codebase into logical MLOps phases. This necessitated disentangling the diverse processes from each other, converting them into separate, manageable units of code, or devising new processes with clear interfaces where required. We then put them together in a more structured MLOps manner, ensuring they could interact seamlessly whilst remaining distinct.
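In code terms, the shape we worked towards resembled the sketch below: each phase becomes a small function with an explicit input/output contract, rather than a section of one long script sharing state. The function names and bodies are illustrative placeholders, not extracts from the client's codebase.

```python
import pandas as pd

# Each MLOps phase becomes a small, independently testable unit with an
# explicit input/output contract instead of shared state inside one script.
# The function names and bodies below are illustrative placeholders.

def ingest(source_uri: str) -> pd.DataFrame:
    """Data ingestion: read raw input data from the upstream source."""
    return pd.read_parquet(source_uri)

def preprocess(raw: pd.DataFrame) -> pd.DataFrame:
    """Preprocessing: clean and normalise the raw data."""
    return raw.dropna().reset_index(drop=True)

def engineer_features(clean: pd.DataFrame) -> pd.DataFrame:
    """Feature engineering: derive the model's input features."""
    return clean.select_dtypes("number")

def generate_predictions(features: pd.DataFrame) -> pd.DataFrame:
    """Modelling: fit the probability distribution and sample predictions."""
    return features.sample(n=min(len(features), 1000), replace=True, random_state=0)
```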
However, in this process of deconstructing the monolithic codebase, we stumbled upon a particular challenge with the feature engineering component. Due to its deep integration within the codebase, completely extricating the feature engineering functionality would have required a significant investment of developer time and meticulous reverse engineering.
We decided to strike a fine balance, recognising the constraints posed by the original code structure whilst optimising the process to fit into an MLOps framework as effectively as we could. We earmarked the further separation of the feature engineering component as a potential task for future enhancements, and thus put it on the backlog.
To ensure the integrity of our work during this development phase, we devised a comparison test that checked the results of the original monolithic codebase against those of the newly developed, MLOps-friendly version. At every stage of development, this comparison test served as a vital checkpoint, enabling us to confirm the accuracy and consistency of our results, ensuring that the new version faithfully mirrored the behaviour of the old one.
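A stripped-down version of such a comparison test might look like the following, assuming both the legacy entry point and the refactored one can be run against the same fixed input; the module and function names are placeholders.

```python
import pandas as pd
import pandas.testing as pdt

# Placeholder imports: the legacy monolith and the refactored pipeline are
# assumed to expose a callable that returns a predictions DataFrame.
from legacy_pipeline import run_legacy
from mlops_pipeline import run_refactored

def test_refactored_matches_legacy():
    """The refactored pipeline must faithfully reproduce the monolith's results."""
    fixed_input = pd.read_parquet("tests/data/reference_input.parquet")

    expected = run_legacy(fixed_input)
    actual = run_refactored(fixed_input)

    # Allow a small relative tolerance for floating-point differences.
    pdt.assert_frame_equal(
        actual.sort_index(axis=1),
        expected.sort_index(axis=1),
        check_exact=False,
        rtol=1e-6,
    )
```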
This process reinforced a crucial lesson: when developing an ML model, it's imperative to consider the MLOps phases from the outset. Without this foresight, different phases can become deeply entangled within the codebase, leading to unnecessary complexities and challenges further down the line.
Untangling the monolithic codebase was a significant task, but it was a necessary step in transforming the model into an MLOps-friendly format. By doing so, we managed to enhance the maintainability of the model, making it considerably easier to operate in the long run.
One of the more prominent characteristics of the model was its non-user-facing role. Unlike most models that are trained and deployed to offer services to users, the primary function of this model was to produce a fixed set of predictions each time it was run. These predictions, rather than a served model, formed the user-facing element, delivered in the form of a data product. As a result, the standard model-serving stage wasn't applicable in the same way here.
However, even though the model did not directly serve users, it was still crucial to register the model artefacts. While model artefacts are typically used for deployment, in this instance, they served a diagnostic purpose. The registered model artefacts provided a lineage of any results produced, offering the data scientist invaluable insights into the modelling process and facilitating the inspection of artefacts should any anomalies arise.
Another distinctive aspect of this model was the requirement to reproduce the probability distribution each time results were needed. This ensured that the model factored in any changes in the input data, maintaining its relevance and precision. It also negated the value of serving the model on an ongoing basis: the model's usefulness was tied to the specific time frame in which it was sampled to generate predictions, and that sampling was an automated part of the pipeline stage.
Moreover, we had to take into account that the lion's share of processing time was spent in this phase while sampling from the probability distribution. By contrast, the other stages, such as preprocessing, feature engineering, or modelling, were relatively brief. This prompted us to concentrate on load testing to better understand the resource requirements under different types of load. As suspected, the sampling process turned out to be quite resource-intensive, and the tests unveiled opportunities for optimisation and performance improvement within the codebase.
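Even a lightweight measurement helps here. Before any formal load testing, simply timing each stage in the run logs makes the bottleneck visible; the sketch below uses a synthetic sampling step purely for illustration.

```python
import time
from contextlib import contextmanager

import numpy as np

@contextmanager
def timed(stage: str):
    """Print wall-clock time per stage so the bottleneck shows up in run logs."""
    start = time.perf_counter()
    try:
        yield
    finally:
        print(f"{stage}: {time.perf_counter() - start:.2f}s")

# Illustrative stand-in for the expensive step: sampling from a fitted distribution.
rng = np.random.default_rng(42)
with timed("sampling"):
    samples = rng.normal(loc=0.0, scale=1.0, size=10_000_000)
    predictions_summary = samples.mean()
```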
By gaining a profound understanding of these unique aspects of the model, we were able to adapt our approach, moulding our MLOps strategy to effectively cater to the model's specific needs and operational context.
Thinking about lineage, i.e. the historical sequence of models and data, from the get-go is crucial, and we turned to Azure ML to facilitate clearer lineage and diagnostics of the ML model. The platform allowed us to assign specific tags during artefact registration, and these tags helped us trace the lineage of each model artefact and run effective diagnostics when required. It's worth mentioning that the Azure ML Studio search interface is particularly useful in this regard, as you can search on any of the available attributes, including any tags that have been assigned.
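As a rough illustration, tagging at registration time with the azureml SDK might look like the following; the model name, output path, and tag keys are placeholders rather than the values we actually used.

```python
from azureml.core import Workspace
from azureml.core.model import Model

ws = Workspace.from_config()

# Tags recorded at registration time make the artefact searchable in Azure ML
# Studio and preserve lineage back to the run and the input data snapshot.
# The model name, path, and tag values below are illustrative placeholders.
model = Model.register(
    workspace=ws,
    model_path="outputs/model.pkl",
    model_name="demand-distribution",
    tags={
        "pipeline_run_id": "<run-id>",
        "input_data_snapshot": "input-snapshot-2023-06-01",
        "git_commit": "<commit-sha>",
    },
)
print(model.name, model.version, model.tags)
```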
Pinpointing specific versions of data is a key part of maintaining clear lineage, understanding the state and inputs to the process, and ensuring the reproducibility of results. In our case, considerable attention was given to tracing back to the upstream input data. As the input data sources were continuously updated and there was no means to retrieve the state of the data at a particular point in time, we decided to preserve a copy of the data used as input by the MLOps process. Fortunately, the volume of data was manageable; however, this may not be the ideal solution in cases with significantly larger data volumes.
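One way to take such a snapshot with the azureml SDK is to upload a point-in-time copy of the input data to a datastore and register it as a versioned dataset, as sketched below; the local path and dataset name are illustrative.

```python
from azureml.core import Dataset, Workspace
from azureml.data.datapath import DataPath

ws = Workspace.from_config()
datastore = ws.get_default_datastore()

# Upload a point-in-time copy of the input data so results can be reproduced later.
# The local directory and target path are illustrative placeholders.
Dataset.File.upload_directory(
    src_dir="data/input_snapshot_2023_06_01",
    target=DataPath(datastore, "input-snapshots/2023-06-01"),
)

# Register the snapshot as a versioned dataset so each run can reference it.
snapshot = Dataset.File.from_files(
    path=(datastore, "input-snapshots/2023-06-01")
).register(
    workspace=ws,
    name="model-input-snapshot",
    create_new_version=True,
)
print(snapshot.name, snapshot.version)
```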
While Azure ML made the above fairly straightforward, tracing lineage back beyond the Azure ML boundary required a different approach, underscoring the importance of considering lineage upfront when designing the overall MLOps architecture.
Data drift is a common worry in ML. As the foundational data shifts over time, models trained on outdated data can lose their precision and efficacy. However, in our specific case, conventional data drift wasn't really a concern as the probability distribution was remodelled each time to mirror any alterations in the input data.
Nevertheless, model drift posed a potential hazard. The model's assumptions about features and the modelling process were grounded in the state of the data when the model was initially developed. As fresh data was introduced over time, it could invalidate some of these assumptions and thus impact the model's performance. This is sometimes called concept drift, and it calls for a proactive approach to catch it early.
Our solution entailed creating an initial, verified baseline summary of results. Whenever the model generated predictions, a summary of these predictions was compared with the baseline. If the outcomes deviated beyond a certain threshold, it signalled possible drift and alerts would be sent to the developers. These alerts urged further inspection to determine whether the model's assumptions remained valid or required amendment. This approach allows for timely alterations and revisions to keep the model pertinent.
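A simplified sketch of this check is shown below: summary statistics of the latest predictions are compared against the verified baseline, and any deviation beyond a relative threshold produces a warning. The statistic names and the threshold value are illustrative.

```python
import json

import pandas as pd

def check_against_baseline(predictions: pd.Series,
                           baseline_path: str,
                           threshold: float = 0.1) -> list[str]:
    """Compare summary statistics of new predictions against the verified baseline.

    Returns a human-readable warning for any statistic that deviates from the
    baseline by more than `threshold` (relative). The statistics and threshold
    here are illustrative placeholders.
    """
    with open(baseline_path) as f:
        baseline = json.load(f)  # e.g. {"mean": 10.2, "std": 3.1, "p95": 15.8}

    current = {
        "mean": predictions.mean(),
        "std": predictions.std(),
        "p95": predictions.quantile(0.95),
    }

    warnings = []
    for stat, expected in baseline.items():
        deviation = abs(current[stat] - expected) / abs(expected)
        if deviation > threshold:
            warnings.append(f"{stat} deviated by {deviation:.1%} from baseline")
    return warnings

# In the pipeline, any returned warnings would trigger an alert to the developers.
```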
As we draw to a close on our MLOps journey, we reflect on the value gained from leveraging MLOps in our specific scenario.
Our primary success indicator lies in the industrialisation and operationalisation of a process that was formerly almost entirely manual. The key to this achievement was the effective transformation of a monolithic codebase into a structured format, in alignment with the MLOps process. This overhaul significantly enhanced the model's operability and manageability.
The key lesson from this project is the importance of considering the MLOps workflow from the very onset of an ML project. Designing ML models with an operational mindset from the outset can avert substantial complexities and challenges down the line. It can also deliver considerable advantages in terms of scalability, manageability, and overall efficiency of the ML model.
Throughout the development process, Azure ML proved its worth as an ML platform. However, it's crucial to remember that while a tool set like Azure ML can streamline the journey, MLOps is fundamentally an engineering practice, not merely a technology. Each project carries its own unique demands and requirements. Hence, our approaches must be tailored to cater to these project-specific needs.
To delve deeper into the concepts discussed in this blog post, here are a few resources for further reading:
By exploring these resources, you'll gain a more comprehensive understanding of MLOps, its benefits, and how to implement it effectively to meet the unique demands of your project.
Ready to navigate the MLOps landscape with confidence? Reach out to our team at Mesh-AI. We're here to guide you through your MLOps challenges and propel your projects towards success.