Machine Learning System Design

Before we even start building an ML system, we need to understand its objective and its expected impact. The system should be driven by business objectives, which are then translated into ML objectives that guide and shape its development.

Once everyone on the business and tech teams is aligned on the objectives of the ML system, we can start working on requirements. Apart from the key business requirements, the other important considerations for any ML system are reliability, scalability, maintainability, and adaptability.

Let me walk you through a step-by-step approach to designing ML systems.

Business and ML objective:

Let’s assume you have a list of business objectives and have derived ML objectives from them. It’s equally essential to talk with the business to understand which business metrics they want the ML model to improve. In a data science/ML project, data scientists often get too excited about improving the model’s accuracy, recall, F1 score, and so on, and spend a ton of resources (data, time, compute hardware) to push model accuracy from 92% to 92.6%. Many short-lived ML projects falter because data scientists become overly fixated on optimizing ML metrics and neglect business metrics. Meanwhile, their managers prioritize business metrics and, when they don’t see how an ML project can improve them, terminate the project prematurely.

For an ML project to be successful in a business setting, it’s essential to link the ML system’s performance to the company’s overall performance.

The impact of an ML project on business objectives can be challenging to assess. For instance, an ML model that provides customers with more personalized solutions can increase their satisfaction, leading them to spend more on your services, but that chain of effects is indirect and hard to measure. To get a definite answer to the question of how ML metrics influence business metrics, experiments such as A/B tests are needed.

Realistic expectations are vital when evaluating ML solutions in a business context. The media and enthusiastic advocates often create a lot of hype, making some companies think ML can immediately transform their operations. While magic can happen, it won’t happen overnight.
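As a rough illustration of the A/B testing idea, here is a minimal sketch. It assumes the business metric is a conversion rate and that users were randomly split between the existing system (control) and the ML-driven one (treatment). The function name and the example counts are hypothetical; the statistics are a standard two-proportion z-test built on the Python standard library only.

```python
import math

def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Two-sided z-test for the difference between two conversion rates.

    conv_a / n_a: conversions and users in the control group.
    conv_b / n_b: conversions and users in the treatment (ML) group.
    Returns (absolute lift, z statistic, p-value).
    """
    p_a = conv_a / n_a
    p_b = conv_b / n_b
    # Pooled rate under the null hypothesis that both groups convert equally.
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    if se == 0:
        # Degenerate case: nobody (or everybody) converted in both groups.
        return p_b - p_a, 0.0, 1.0
    z = (p_b - p_a) / se
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return p_b - p_a, z, p_value

# Hypothetical counts: 10,000 users per group.
lift, z, p_value = two_proportion_z_test(500, 10_000, 560, 10_000)
print(f"lift={lift:.4f}, z={z:.2f}, p={p_value:.3f}")
```

A small p-value suggests the change in the business metric is unlikely to be noise. Whether the lift justifies the cost of the model is still a business decision.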

Requirements for ML Systems:

No project can succeed without a clear list of requirements. It is important to discuss the requirements with the business team and agree on the final list. Requirements vary widely with the objective and the type of ML problem, but most systems should have these four characteristics:

  1. Reliability,
  2. Scalability,
  3. Maintainability,
  4. Adaptability.

Reliability:

The system must continue to operate correctly and perform as expected despite encountering hardware or software faults or human mistakes.

Determining “correctness” for ML systems can be tricky. For instance, your system might correctly call the predict function, but still produce incorrect predictions. Without ground truth labels, how can we identify if a prediction is wrong?

In traditional software systems, warnings such as system crashes, 404 errors, or runtime errors often alert us to these issues. However, ML systems can fail without any visible signs. End users might continue using the system, unaware that it has failed.
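Because ground truth labels are often unavailable at prediction time, one common proxy for spotting silent failures is to compare the distribution of recent prediction scores against a reference window. The sketch below uses the Population Stability Index (PSI) for that comparison. It assumes scores lie in [0, 1]. The function name, the bin count, and the 0.2 alert threshold (a widely quoted rule of thumb, not a guarantee) are all assumptions for illustration.

```python
import math

def population_stability_index(reference, current, bins=10, eps=1e-6):
    """PSI between two samples of prediction scores assumed to lie in [0, 1]."""
    def bin_fractions(scores):
        counts = [0] * bins
        for s in scores:
            # Clamp so that out-of-range scores and 1.0 fall into the edge bins.
            idx = min(max(int(s * bins), 0), bins - 1)
            counts[idx] += 1
        total = len(scores)
        # eps avoids log(0) and division by zero for empty bins.
        return [max(c / total, eps) for c in counts]

    ref = bin_fractions(reference)
    cur = bin_fractions(current)
    return sum((c - r) * math.log(c / r) for r, c in zip(ref, cur))

# Hypothetical data: scores used to be spread out, now cluster near 1.0.
reference_scores = [i / 100 for i in range(100)]
recent_scores = [0.9 + i / 1000 for i in range(100)]

psi = population_stability_index(reference_scores, recent_scores)
if psi > 0.2:  # assumed alert threshold
    print(f"Prediction distribution shifted (PSI={psi:.2f}); investigate the model.")
```

A high PSI does not prove the predictions are wrong. It only signals that the model is behaving differently from the reference period, which is a prompt to investigate.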

Scalability:

There are multiple ways an ML system can grow over time.

It can grow in complexity. Let’s assume you used a simple decision tree model for a classification task, so it fit into an AWS Lambda + EC2 container setup that used few resources. Now the data has grown and the decision tree no longer gives good accuracy, so you want to move to deep learning models, which will require an EC2 instance with at least 16 GB of RAM to generate predictions.

Your ML system can grow in traffic volume. Let’s assume you initially handled 10,000 prediction requests per day. However, as your company’s user base expands, the daily number of prediction requests now fluctuates between 1 million and 3 million.

An ML system might also grow in model count. Initially, you might have only one model for one use case. Over time, however, the business wants to add more features to this use case, so you end up building more models to support them.

To handle system growth effectively, you need scalable solutions. Scalability involves both scaling up (adding resources during high demand) and scaling down (reducing resources during low demand). For example, while peak usage might require 100 GPUs, most of the time only 10 GPUs are needed. Keeping 100 GPUs always active is costly, so the system should scale down to 10 GPUs when possible.

Autoscaling, a feature in many cloud services, automatically adjusts the number of machines based on usage.
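To make the idea concrete, here is a minimal sketch of a proportional scaling rule. It is not the algorithm of any particular cloud service. The function name, the bounds, and the assumption that one GPU worker can serve 50 requests per second are all hypothetical, chosen so that the numbers match the 100-GPU peak and 10-GPU baseline above.

```python
import math

def desired_workers(current_load, capacity_per_worker,
                    min_workers=1, max_workers=100):
    """Return how many workers are needed to serve current_load.

    current_load: observed demand, e.g. requests per second.
    capacity_per_worker: load one worker can handle (assumed constant).
    The result is clamped to [min_workers, max_workers].
    """
    if capacity_per_worker <= 0:
        raise ValueError("capacity_per_worker must be positive")
    needed = math.ceil(current_load / capacity_per_worker)
    return max(min_workers, min(max_workers, needed))

# Assumed capacity: one GPU worker serves 50 requests per second.
print(desired_workers(5000, 50))  # peak traffic
print(desired_workers(500, 50))   # typical traffic
```

Real autoscalers usually add safeguards that this sketch leaves out, such as cooldown periods and smoothing, so the fleet does not oscillate when load fluctuates around a threshold.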

Scalability isn’t just about scaling resources; it is also about managing artifacts. Managing 100 models is vastly different from managing one. While you might manually monitor and update a single model, handling 100 models requires automated monitoring and retraining. Additionally, you need a system for managing the code that generates each model, so that models can be reproduced as needed.
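One way to keep many models reproducible is to record, for every trained model, exactly what produced it. The sketch below is a toy in-memory registry, not a replacement for a real model registry tool. The class, the field names, and the example values are hypothetical, and tracking the training data and hyperparameters alongside the code version is an assumption about what reproduction typically needs.

```python
import hashlib
import json
from dataclasses import dataclass, asdict

@dataclass
class ModelRecord:
    name: str
    code_version: str       # e.g. a version-control commit identifier
    data_fingerprint: str   # hash of the training data snapshot
    hyperparameters: dict

    def record_id(self) -> str:
        """Deterministic ID: identical inputs always yield the same ID."""
        payload = json.dumps(asdict(self), sort_keys=True).encode("utf-8")
        return hashlib.sha256(payload).hexdigest()[:12]

def fingerprint(data: bytes) -> str:
    """Hash raw training data so later changes to it are detectable."""
    return hashlib.sha256(data).hexdigest()

def register(registry: dict, record: ModelRecord) -> str:
    """Store the record under its ID and return that ID."""
    rid = record.record_id()
    registry[rid] = record
    return rid

registry = {}
record = ModelRecord(
    name="churn-classifier",                      # hypothetical model name
    code_version="abc1234",                       # hypothetical commit id
    data_fingerprint=fingerprint(b"training data snapshot"),
    hyperparameters={"max_depth": 5},
)
print(register(registry, record))
```

Because the ID is derived from the code version, data fingerprint, and hyperparameters, any change to one of them yields a new record rather than silently overwriting an old one.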
