AWS SageMaker Pipelines
Why do we need MLOps?
The history of statistics and machine learning (and the subsequent rebranding of old methods into new ones) is incomplete without the mention of AI. AI – Artificial Intelligence – is the next big step forward, and those wondering about its practicality and longevity only have to look at products such as Apple’s Siri and Amazon’s Alexa.
But from a Data Scientist's point of view, what does it take to develop such a model? Or even a simpler one, say a binary classifier? The amount of work is quite daunting, and it's only the tip of the iceberg. How much more work is needed to put that model into a continuous development and delivery cycle?
For a Data Scientist, it can be hard to visualize what kinds of systems you need to build and automate so that your model can perform its task: data ETL, feature engineering, model training, inference, hyperparameter optimization, performance monitoring, and so on. That is a lot to automate.
This is where MLOps comes into the picture. MLOps brings the CI/CD practices of DevOps to the data science world.
Building an MLOps Infrastructure
Building an MLOps infrastructure is one thing, but learning how to use it fluently also takes time and work. For a Data Scientist at the beginning of their career, it could seem like too much to learn how to use cloud infrastructure on top of learning to develop production-ready Python code. A Jupyter notebook outputting predictions to a CSV file simply isn't enough at this stage of the machine learning revolution.
Usually, companies with a long track record of Data Science projects have a few DevOps and Data/Machine Learning Engineers who work closely with their Data Scientist teams to share the different tasks of deploying machine learning to production. They may even have built the tooling and infrastructure needed to deploy models into production more easily. But there are still quite a few Data Science teams and data-driven companies figuring out how to do this whole MLOps thing.
Why should you try SageMaker Pipelines?
The problem with building out your own MLOps infrastructure is that there are perhaps too many ways to approach the build and deployment. Luckily, AWS is the biggest cloud provider at the moment, so it has all the tooling you could imagine needing. It is also heavily invested in Data Science through its SageMaker product, where new features pop up constantly.
AWS tries to tackle some of the problems with the technical debt involved in production machine learning. I've recently been involved in a project building and deploying an MLOps pipeline for edge devices using SageMaker Pipelines, so I'll try to provide some insight into what it does well and what it lacks compared to a completely custom-built MLOps pipeline.
The SageMaker Pipelines approach is an ambitious one. What if Data Scientists, instead of having to learn to use this complex cloud infrastructure, could deploy to production just by learning how to use a single Python SDK? You won't even need the AWS cloud to get started, as it runs locally (to a point).
SageMaker Pipelines makes MLOps easy for Data Scientists. You can define your whole MLOps pipeline in (for example) a Jupyter notebook and automate the whole process. There are many prebuilt containers for data engineering, model training, and model monitoring that have been custom-built for AWS. If these are not enough, you can bring your own containers, enabling you to do anything that is not supported out of the box. There are also a couple of very niche features, like out-of-network training, where your model is trained in an environment with no access to the internet, mitigating the risk of somebody on the outside trying to influence your model training with, for instance, altered training data.
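To give a feel for what "defining the whole pipeline in Python" looks like, here is a minimal sketch of a two-step pipeline (preprocessing plus training) using the SageMaker Python SDK. The role ARN, bucket names, image URI, and script name are placeholders, the exact arguments can vary between SDK versions, and actually creating the pipeline requires an AWS session, so treat this as a configuration sketch rather than a runnable script:

```python
from sagemaker.estimator import Estimator
from sagemaker.inputs import TrainingInput
from sagemaker.processing import ProcessingInput, ProcessingOutput
from sagemaker.sklearn.processing import SKLearnProcessor
from sagemaker.workflow.parameters import ParameterString
from sagemaker.workflow.pipeline import Pipeline
from sagemaker.workflow.steps import ProcessingStep, TrainingStep

# Hypothetical IAM role and S3 locations -- replace with your own.
role = "arn:aws:iam::123456789012:role/SageMakerExecutionRole"
input_data = ParameterString(name="InputData", default_value="s3://my-bucket/raw/")

# Preprocessing step running your script in a prebuilt scikit-learn container.
processor = SKLearnProcessor(
    framework_version="0.23-1",
    role=role,
    instance_type="ml.m5.xlarge",
    instance_count=1,
)
preprocess = ProcessingStep(
    name="Preprocess",
    processor=processor,
    code="preprocess.py",  # your own preprocessing script
    inputs=[ProcessingInput(source=input_data, destination="/opt/ml/processing/input")],
    outputs=[ProcessingOutput(output_name="train", source="/opt/ml/processing/train")],
)

# Training step consuming the output of the preprocessing step.
estimator = Estimator(
    image_uri="<your-training-image>",  # prebuilt or custom container
    role=role,
    instance_type="ml.m5.xlarge",
    instance_count=1,
    output_path="s3://my-bucket/models/",
)
train = TrainingStep(
    name="Train",
    estimator=estimator,
    inputs={
        "train": TrainingInput(
            preprocess.properties.ProcessingOutputConfig.Outputs["train"].S3Output.S3Uri
        )
    },
)

pipeline = Pipeline(name="MyPipeline", parameters=[input_data], steps=[preprocess, train])
# pipeline.upsert(role_arn=role); pipeline.start()  # requires an AWS session
```

The point is that the whole workflow, including how steps feed into each other, lives in ordinary Python that you can keep in version control next to your model code.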
You can version your models via the model registry. If you have multiple use cases for the same model architecture, differing only in the datasets used for training, it's easy to select the suitable version from the SageMaker UI or the Python SDK and refactor the pipeline to suit your needs. The aim of this approach is that each MLOps pipeline has many components that are reusable in the next project, enabling faster development cycles and shorter time to production.
SageMaker Pipelines automatically logs every step of the workflow, from training instance sizes to model hyperparameters. You can seamlessly deploy your model to a SageMaker Endpoint (a separate service), and after deployment you can automatically monitor your model for concept drift in the data or latency in your API. You can even deploy multiple versions of your model and do A/B testing to select which one proves to be the best.
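As an illustration of the deployment-plus-monitoring part, the sketch below deploys a trained model to an endpoint with data capture enabled, which is what Model Monitor later uses to detect drift. The image URI, model artifact path, role, and endpoint name are placeholders, and this requires an AWS session, so it is a hedged sketch of the SDK calls rather than something to run as-is:

```python
from sagemaker.model import Model
from sagemaker.model_monitor import DataCaptureConfig

# Hypothetical artifact location and role -- replace with your own.
model = Model(
    image_uri="<your-inference-image>",
    model_data="s3://my-bucket/models/model.tar.gz",
    role="arn:aws:iam::123456789012:role/SageMakerExecutionRole",
)

# Capture a sample of live requests/responses to S3 so Model Monitor
# can later compare production traffic against the training baseline.
capture = DataCaptureConfig(
    enable_capture=True,
    sampling_percentage=20,
    destination_s3_uri="s3://my-bucket/datacapture/",
)

predictor = model.deploy(
    initial_instance_count=2,      # note: the fleet size is fixed up front
    instance_type="ml.m5.large",
    endpoint_name="my-model-endpoint",
    data_capture_config=capture,
)
```

Note the `initial_instance_count` argument: this is the pre-decided fleet size that becomes a limitation later in this article.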
And if you want to deploy your model to the edge, be it a fleet of Raspberry Pi 4s or something else, SageMaker provides the tooling and integrates seamlessly with Pipelines.
You can recompile your models for a specific device type using SageMaker Neo compilation jobs (basically, if you're deploying to an ARM or similar device, certain conversions are needed for everything to work as it should) and deploy to your fleet using SageMaker fleet management.
Considerations before choosing SageMaker Pipelines
By combining all of these features into a single service usable through an SDK and a UI, Amazon has managed to automate a lot of the CI/CD work needed to deploy machine learning models into production at scale with agile project development methodologies. Other related AWS products (for example, SageMaker Feature Store or Amazon Forecast) can be leveraged if you happen to need them.
Though it is a great product to get started with, SageMaker Pipelines isn't without its flaws. It is quite capable in batch learning settings, but there is no support as of yet for streaming/online learning tasks.
And for the so-called Citizen Data Scientist, this won’t be the right product, as you’ll need to be somewhat fluent in Python. Citizen Data Scientists are better off with BI products like Tableau or Qlik (which use SageMaker Autopilot as their backend for ML) or perhaps with products like DataRobot.
And in a time when software products are expected to be highly available under heavy usage, the SageMaker Endpoints deployment scenario, where you have to decide the number of machines serving your model in advance, won't quite be enough.
In e-commerce applications, you could run into situations where your API receives so much traffic that it can't handle all the requests, because you didn't select a big enough cluster to serve the model. The only way to increase the cluster size in SageMaker Pipelines is to redeploy a new revision with a bigger cluster. It is pretty much a no-brainer to use a Kubernetes cluster with horizontal scaling if you want to keep serving your model as traffic to the API increases.
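Because the fleet size is fixed at deploy time, you end up doing capacity planning by hand before each deployment. A hypothetical back-of-envelope calculation (the function name and all numbers are made up for illustration) might look like this:

```python
import math


def instances_needed(peak_rps: float, rps_per_instance: float, headroom: float = 0.2) -> int:
    """Estimate a fixed endpoint fleet size for an expected peak load.

    peak_rps: expected peak requests per second hitting the endpoint.
    rps_per_instance: measured throughput of a single instance.
    headroom: extra capacity fraction kept in reserve for traffic spikes.
    """
    if peak_rps <= 0:
        return 1  # an endpoint always needs at least one instance
    required = peak_rps * (1 + headroom) / rps_per_instance
    return max(1, math.ceil(required))


# E.g. an expected peak of 300 req/s, 40 req/s per instance, 20% headroom.
print(instances_needed(300, 40))  # -> 9
```

If real traffic exceeds the estimate, the endpoint simply starts rejecting requests until you redeploy with a larger count, which is exactly the weakness described above.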
Overall, it is a very nicely packaged product with a lot of good features. The problem with MLOps in AWS has been that there are too many ways of doing the same thing; SageMaker Pipelines is an effort to streamline and package all of those different methodologies together for machine learning pipeline creation.
AWS MLOps is a great fit if you work with batch learning models and want to create machine learning pipelines quickly and efficiently. But if you're working with online learning or reinforcement learning models, you'll need a custom solution. And if you are adamant that you need autoscaling, then you need to do the API deployments yourself; SageMaker Endpoints aren't quite there yet.
For a reference to a "complete" architecture, see this AWS blog post: https://aws.amazon.com/blogs/machine-learning/automate-model-retraining-with-amazon-sagemaker-pipelines-when-drift-is-detected/