Atlas, Noodle.ai’s Machine Learning (ML) Framework Part 3: Using Recipes to Build a ML Pipeline
This blog is co-authored with Prateek Mishra
Part 2 in this series introduced Recipes and their potential benefits. In this blog, we take a deep dive into a Recipe implementation and contrast it with a typical Machine Learning (ML) pipeline implementation. We recommend reading Atlas Part 1 (Noodle | Medium) and Part 2 (Noodle | Medium) before proceeding further.
Let’s consider the problem of anomaly detection in multivariate time series data. Let’s say we have one month of time series data available at a 10-millisecond granularity. The figure below shows the data fields available from some of the sensors of equipment ‘X’.
An anomalous section of the time series is highlighted below in blue, overlaid on the data from a single sensor.
Solving the Problem
Here, we’ll start with a naïve solution: k-means clustering based anomaly detection, in which the normal state of the multivariate sensors is encoded by three clusters. A new multivariate data sample can then be assessed against these learned clusters, and its deviation from the three cluster means forms the basis for assigning an anomaly score to that data point. A typical Python implementation of the above, with some data preparation and k-means clustering, is provided below; it will form the basis for explaining the concept and potential benefits of a recipe-based implementation.
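A minimal sketch of such an implementation, assuming scikit-learn and NumPy, is given here. The synthetic data, the function names, and the 99th-percentile threshold are our own illustrative assumptions and stand in for the real sensor data and tuning choices.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler


def make_sensor_data(n_samples=600, n_sensors=4, seed=0):
    """Synthetic stand-in for the sensor readings (hypothetical data)."""
    rng = np.random.default_rng(seed)
    # Three "normal" operating modes, each centered at a different level.
    centers = np.array([[0.0] * n_sensors, [5.0] * n_sensors, [10.0] * n_sensors])
    modes = rng.integers(0, 3, size=n_samples)
    return centers[modes] + rng.normal(scale=0.5, size=(n_samples, n_sensors))


def fit_normal_state(X, n_clusters=3, seed=0):
    """Data preparation (scaling) followed by k-means on normal data."""
    scaler = StandardScaler().fit(X)
    model = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed)
    model.fit(scaler.transform(X))
    return scaler, model


def anomaly_score(scaler, model, X_new):
    """Score = distance from each sample to its nearest cluster center."""
    # KMeans.transform returns the distance to every cluster center.
    distances = model.transform(scaler.transform(X_new))
    return distances.min(axis=1)


X_train = make_sensor_data()
scaler, model = fit_normal_state(X_train)

# Assumed rule: flag anything beyond the 99th percentile of training scores.
threshold = np.percentile(anomaly_score(scaler, model, X_train), 99)

normal_point = np.array([[5.0, 5.0, 5.0, 5.0]])
odd_point = np.array([[5.0, -8.0, 20.0, 5.0]])
print(anomaly_score(scaler, model, normal_point)[0] > threshold)
print(anomaly_score(scaler, model, odd_point)[0] > threshold)
```

Everything here lives in one script, which is convenient for a first attempt but is exactly the shape that becomes hard to vary once several experiments are in play.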
During the data science (DS) development cycle, a data scientist evaluates multiple techniques and ML pipeline workflows. In the context of the above example, the following variations can be tried out to improve the anomaly detection:
- Evaluation of various clustering techniques such as k-means, DBSCAN, and Gaussian mixture models.
- Different data pre-processors based on the underlying data volume and velocity, such as Pandas Data Preparator, Spark Data Preparator, etc.
- Use of different model evaluation metrics such as Silhouette Score, Homogeneity Index, etc.
- Implementation of different workflows for model training and model inference.
These variations need the implementation and execution of various experiment workflows as shown in the above figure. Let’s consider two experiments:
a) Experiment 1 highlighted by the blue arrow
b) Experiment 2 highlighted by the orange arrows
In the above code structure, when switching between Experiment 1 and Experiment 2, a developer would need to delete and rewrite chunks of code in order to execute each experiment in a cohesive way. Writing modular code is the first step to avoid this kind of deletion and to enable reusability during experimentation.
Writing Modular Code
We start by identifying the high-level blocks that provide enough abstraction for us to plug in the variations and build multiple experiment workflows. Here we use four modules to build an abstract ML pipeline, as shown below.
The code that we have written in the notebook should now be refactored into these four modules. The modules could be classes or functions. Here we will model the code into classes. These classes will have predefined interfaces, like so:
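As a rough sketch of what such interfaces could look like, here are four abstract base classes. The class and method names (`DataPreparator.prepare`, `Trainer.train`, `Evaluator.evaluate`, `Inferencer.infer`) are our illustrative assumptions and not necessarily the names used in Atlas.

```python
from abc import ABC, abstractmethod


class DataPreparator(ABC):
    """Turns raw sensor data into model-ready features."""

    @abstractmethod
    def prepare(self, raw_data):
        ...


class Trainer(ABC):
    """Fits a model (for example, a clustering model) on prepared features."""

    @abstractmethod
    def train(self, features):
        ...


class Evaluator(ABC):
    """Computes an evaluation metric for a trained model."""

    @abstractmethod
    def evaluate(self, model, features):
        ...


class Inferencer(ABC):
    """Scores new data with a trained model."""

    @abstractmethod
    def infer(self, model, features):
        ...
```

Because the methods are abstract, Python refuses to instantiate an interface directly; only a subclass that implements the method can be used in a pipeline.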
A concrete implementation of the interface Data Preparator might look like this:
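A minimal sketch of one such implementation, assuming pandas. The interface is repeated so the example runs on its own; the column names, the constructor arguments, and the 100 ms resampling interval are hypothetical choices made for illustration.

```python
from abc import ABC, abstractmethod

import pandas as pd


class DataPreparator(ABC):
    """Interface: turn raw data into model-ready features."""

    @abstractmethod
    def prepare(self, raw_data: pd.DataFrame) -> pd.DataFrame:
        ...


class PandasDataPreparator(DataPreparator):
    """In-memory preparator for data volumes that fit on one machine."""

    def __init__(self, feature_columns, resample_rule="100ms"):
        self.feature_columns = list(feature_columns)
        self.resample_rule = resample_rule

    def prepare(self, raw_data: pd.DataFrame) -> pd.DataFrame:
        # Keep only the selected sensors.
        frame = raw_data[self.feature_columns]
        # Downsample the 10 ms readings by averaging within each interval.
        frame = frame.resample(self.resample_rule).mean()
        # Fill any gaps left by missing readings.
        return frame.interpolate(limit_direction="both")


# Three seconds of hypothetical readings at 10 ms granularity.
index = pd.date_range("2021-01-01", periods=300, freq="10ms")
raw = pd.DataFrame(
    {"temperature": range(300), "pressure": [1.0] * 300, "unused": 0},
    index=index,
)
prepared = PandasDataPreparator(["temperature", "pressure"]).prepare(raw)
print(prepared.shape)
```

A Spark or Dask flavor would implement the same `prepare` method against a different dataframe type, which is what makes the flavors interchangeable.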
Once the code is written in a modular fashion, we’re able to swap concrete implementations in behind each interface, such as Dask Data Preparator, Spark Data Preparator, Pandas Data Preparator, etc. Similarly, each module may have one or more flavors, and we can mix and match any flavor of data preparator with any flavor of trainer (clustering), and so on.
However, when we break the code into modules, it comes at a cost. Let’s look at an example:
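The following sketch, assuming pandas and scikit-learn, shows a single experiment assembled from modules. The three classes are simplified stand-ins written for this illustration, and the data is synthetic; the lines marked as glue are the ones that exist only to make one module's output fit the next module's input.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score


class PandasDataPreparator:
    def prepare(self, raw: pd.DataFrame) -> pd.DataFrame:
        return raw.dropna()


class KMeansTrainer:
    def train(self, features: np.ndarray) -> KMeans:
        return KMeans(n_clusters=3, n_init=10, random_state=0).fit(features)


class SilhouetteEvaluator:
    def evaluate(self, features: np.ndarray, labels: np.ndarray) -> float:
        return float(silhouette_score(features, labels))


# Hypothetical data: three well-separated operating modes, one missing value.
rng = np.random.default_rng(0)
raw = pd.DataFrame(
    np.vstack([rng.normal(loc=c, scale=0.3, size=(50, 2)) for c in (0.0, 5.0, 10.0)]),
    columns=["sensor_a", "sensor_b"],
)
raw.loc[0, "sensor_a"] = np.nan

# Experiment 1: Pandas preparator -> k-means trainer -> silhouette evaluator.
prepared = PandasDataPreparator().prepare(raw)
features = prepared.to_numpy(dtype=float)   # glue: DataFrame -> ndarray for the trainer
model = KMeansTrainer().train(features)
labels = model.predict(features)            # glue: trainer output -> evaluator input
score = SilhouetteEvaluator().evaluate(features, labels)
print(round(score, 2))
```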
As you can see in the code above, we call the respective modules for a given experiment. However, there is some amount of “glue code”, which stitches the pipeline together by making the data ready for the next step. Writing the glue code is an added effort each time we want to create a working pipeline from modules. In the simplistic example above this doesn’t seem like much effort, but for a complex real-world pipeline it soon gets out of hand. It’s also very difficult to trace the most recent version of such code, as where it lives in the code base depends on each developer’s style of organizing. We also tend to delete such code once we have the final version in order to keep the code base clean, regardless of whether we might need it later.
A lot of these issues could get addressed if there was a way to declaratively define a pipeline in such a way that no further stitching was required. That is what the recipe will give us.
Writing a Recipe
In the previous section, you saw two flows (Experiment 1 and Experiment 2). It’s the Recipe that will define these flows.
A Recipe acts as an abstraction for the pipeline. When we write a Recipe, we define which classes will be executed and in which order. A Recipe can be saved with a name and shared among data scientists.
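To make the idea concrete, here is a sketch of what a Recipe could contain, written as a Python dictionary and serialized to JSON. The schema is an assumption made for illustration: the keys (`environment`, `steps`, `class`, `stages`, `depends_on`, `resources`), the class names, and the Docker image name are hypothetical and do not reproduce the actual Atlas Recipe format.

```python
import json

# Hypothetical Recipe structure: which class runs in each step, in which
# order (via depends_on), in which stage, and with which resources.
recipe = {
    "recipe_id": "equipment_X_AD_Recipe",
    "environment": {
        "engine": "local",                          # assumed values: local / airflow / kubeflow
        "docker_image": "example/atlas-base:latest",  # placeholder image name
    },
    "steps": [
        {
            "name": "prepare_data",
            "class": "PandasDataPreparator",
            "stages": ["training", "inference"],
            "depends_on": [],
            "resources": {"cpu": 1, "gpu": 0},
        },
        {
            "name": "train_model",
            "class": "KMeansTrainer",
            "stages": ["training"],
            "depends_on": ["prepare_data"],
            "resources": {"cpu": 2, "gpu": 0},
        },
        {
            "name": "evaluate_model",
            "class": "SilhouetteEvaluator",
            "stages": ["training"],
            "depends_on": ["train_model"],
            "resources": {"cpu": 1, "gpu": 0},
        },
    ],
}

recipe_json = json.dumps(recipe, indent=2)
print(recipe_json)
```

Note that nothing in this structure says which features or hyperparameters to use; those belong to the run, not to the shape of the pipeline.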
The above Recipe (saved as equipment_X_AD_Recipe.json) describes part of an ML pipeline. It defines the strategy that we are using in each stage as well as the environment details. The Recipe therefore defines how the ML pipeline should be built. However, we still need a way to control the behavior of the instantiated pipeline, that is, to define the configurations needed to run it. For this we use a Configuration file. An ML pipeline is formed by the combination of a Recipe and a Configuration file.
Example of a configuration file equipment_X_config.json:
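As an illustration only, a Configuration could be structured along the following lines, again written as a Python dictionary and serialized to JSON. The keys, step names, feature names, and metric name are hypothetical; only the Recipe file name comes from the text above.

```python
import json

# Hypothetical Configuration: points at a Recipe and supplies run-time
# parameters for each step of the pipeline that Recipe defines.
config = {
    "recipe_id": "equipment_X_AD_Recipe",
    "parameters": {
        "prepare_data": {
            "features": ["temperature", "pressure", "vibration"],
        },
        "train_model": {
            "model": "kmeans",
            "hyperparameters": {"n_clusters": 3},
        },
        "evaluate_model": {
            "metric": "silhouette_score",
        },
    },
}

config_json = json.dumps(config, indent=2)
print(config_json)
```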
A Configuration JSON file starts by defining the Recipe Id (the file name of the Recipe) that will be used to instantiate the ML pipeline. This is followed by run-time parameters for each stage of the pipeline. Named parameters such as Models, Hyperparameters, and Features are treated as first-class entities to enable tracking of changes. One Recipe can be used in multiple Configurations, which makes it easy to create multiple pipelines with varying configs.
Let’s say a new piece of equipment, Equipment Y, is added and we want to create a model for it. If building this pipeline only requires selecting new features while keeping the modelling technique the same, you do not need to code the pipeline again. You can create a new Configuration file, equipment_Y_AD_Config.json, by selecting the existing Recipe and changing the values of the Feature parameter. This is where we start realizing the true power of this design and the amount of rework it saves by encouraging developers to codify existing knowledge.
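The Equipment Y case can be sketched as a small helper that copies an existing Configuration and changes only the feature list. The helper name `derive_config`, the dictionary keys, and the feature names are assumptions made for this example.

```python
import copy


def derive_config(base_config, step_name, features):
    """Return a new Configuration that reuses the Recipe but swaps the features."""
    new_config = copy.deepcopy(base_config)  # leave the original untouched
    new_config["parameters"][step_name]["features"] = list(features)
    return new_config


equipment_x_config = {
    "recipe_id": "equipment_X_AD_Recipe",
    "parameters": {
        "prepare_data": {"features": ["temperature", "pressure", "vibration"]},
        "train_model": {"model": "kmeans", "hyperparameters": {"n_clusters": 3}},
    },
}

# Same Recipe and modelling technique, different (hypothetical) sensors.
equipment_y_config = derive_config(
    equipment_x_config, "prepare_data", ["motor_current", "flow_rate"]
)
print(equipment_y_config["parameters"]["prepare_data"]["features"])
```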
One of the main motivations behind creating an abstract, declarative representation of an ML pipeline was to manage the scale of execution. To run a pipeline in production, you need a scalable, distributed orchestration engine that manages scheduling and execution for you, while data scientists and ML engineers run the very same pipeline in local development environments. We observed that integrating the code required developers to build a good understanding of the underlying orchestration engine and the deployment architecture. Most orchestration engines use a Directed Acyclic Graph (DAG) to represent a pipeline, and this representation can easily be abstracted away from the developers.
When developers create Recipes as shown above, they get a major advantage of not worrying about execution architecture or frameworks. Atlas provides support for 3 types of execution:
- Local development environment
- Kubeflow on K8s
- Airflow on K8s
The Recipe acts as an abstraction that enables execution and scaling on all of the above environments, helping developers migrate from Jupyter Notebooks to training clusters to production clusters. Atlas has been battle-tested running hundreds of concurrent ML pipelines in production.
Leveraging Stages of the Recipe
It’s a common observation that there is a large code overlap between Training and Inference Pipelines. We defined stages in the Recipe earlier; these can be used to mark the steps in the pipeline that should run only for inference or only for training, thus reducing duplication even at the Recipe level. When instantiated, the Recipe takes a runtime parameter for the stage and produces the following 2 Pipelines:
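As a rough sketch of how a stage parameter could select steps from a single Recipe, consider the following. The step names and the `stages` field are assumptions for illustration, not the Atlas schema.

```python
def build_pipeline(steps, stage):
    """Keep only the steps tagged for the requested stage, preserving order."""
    return [step["name"] for step in steps if stage in step["stages"]]


# Hypothetical steps of one Recipe; shared steps are tagged with both stages.
steps = [
    {"name": "prepare_data", "stages": ["training", "inference"]},
    {"name": "train_model", "stages": ["training"]},
    {"name": "evaluate_model", "stages": ["training"]},
    {"name": "load_model", "stages": ["inference"]},
    {"name": "score_anomalies", "stages": ["inference"]},
]

training_pipeline = build_pipeline(steps, "training")
inference_pipeline = build_pipeline(steps, "inference")
print(training_pipeline)
print(inference_pipeline)
```

The shared `prepare_data` step is declared once and appears in both pipelines, which is where the reduction in duplication comes from.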
By creating a declarative structure for the development of an ML pipeline, we get a good way to reuse existing modeling techniques and experiments, but a need arises to validate the JSON to check:
- Does the Recipe JSON have valid syntax?
- Is the Recipe creating the same pipeline that the developer had visualized?
- Are there any accidental cycles in the Pipeline graph?
- Can we execute and test a Recipe + Configuration combination?
We created an SDK to be able to help developers validate the Recipe and Configuration JSON against the above criteria. The sample invocation of these commands is shown in the steps below.
1. Visualization: The Recipe JSON can be visualized as a graph that depicts the execution flow of the Pipeline. The developer specifies the Recipe name to load the JSON and visualize it.
2. Validation: The validation-util provides an ability to scan the Recipe JSON and detect errors with the JSON file syntax and logical errors that could have been introduced due to incorrect linkage of steps in the pipeline. We could also validate the entire recipe folder.
If the Recipe is valid:
If the Recipe is invalid, an error is thrown:
3. Execution: Instantiation of the Pipeline is done by passing the Configuration JSON to the SDK. The benefit of this step is that you don’t need any orchestration tools such as Airflow and Kubeflow in your local development environment to test out the end-to-end pipeline execution.
When the same Pipeline needs to run at scale, we can change the execution engine to Local/Airflow/Kubeflow declaratively. This gives us an ability to separate Pipeline logic from the underlying execution engine.
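One of the checks listed earlier, detecting accidental cycles or incorrect linkage between steps, can be sketched with Kahn's algorithm for topological sorting. This is our own illustration and not the Atlas SDK; the function name `execution_order` and the `depends_on` field are hypothetical. The same ordering is what a simple local runner would need in order to execute steps without an orchestrator.

```python
from collections import deque


def execution_order(steps):
    """Return step names in dependency order; raise ValueError on bad links or cycles."""
    names = {step["name"] for step in steps}
    remaining = {}                          # unmet dependency count per step
    dependents = {name: [] for name in names}

    for step in steps:
        parents = set(step["depends_on"])
        unknown = parents - names
        if unknown:
            raise ValueError(f"{step['name']} depends on unknown steps: {sorted(unknown)}")
        remaining[step["name"]] = len(parents)
        for parent in parents:
            dependents[parent].append(step["name"])

    # Start with steps that have no dependencies, then release their children.
    ready = deque(step["name"] for step in steps if remaining[step["name"]] == 0)
    order = []
    while ready:
        current = ready.popleft()
        order.append(current)
        for child in dependents[current]:
            remaining[child] -= 1
            if remaining[child] == 0:
                ready.append(child)

    # Any step never released is part of, or downstream of, a cycle.
    if len(order) != len(steps):
        stuck = sorted(names - set(order))
        raise ValueError(f"cycle detected involving steps: {stuck}")
    return order


valid_steps = [
    {"name": "prepare_data", "depends_on": []},
    {"name": "train_model", "depends_on": ["prepare_data"]},
    {"name": "evaluate_model", "depends_on": ["train_model"]},
]
print(execution_order(valid_steps))
```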
What does the Recipe give us?
Not only does the Recipe give us one declarative language that the data scientists will be comfortable with, but it gives us multiple other benefits:
- Modularity: The code will now be written in modules. Modules written by one data scientist can be easily imported by another. Writing modular code allows rapid experimentation and helps in scaling to several pipelines.
- Reusability: As you have seen, any flavor of a particular layer can be used. No code is wasted; rather, it can be reused (e.g., Inference 2). Reusability is no longer just a construct of inference; it also applies to training and experimentation. This benefit can be visualized by the blue and orange flows of the two experiments.
- Versioning: This workflow encourages the developer to check in even the experimental pipelines since all modules can be retained till the end. Developers need not wait to have perfect results before committing their code.
- Testability: We can perform integration testing without depending on any orchestration framework like Airflow or Kubeflow, bringing developers one step closer to production-like execution.
- Dead Code Paths Traceability: In a Data Science Workflow, multiple experiments are run with different modelling and feature generation techniques, but finally only the best performing set of techniques is promoted to production. This leads to a lot of Dead Code in the repository over time. With Recipes, these code blocks can be identified easily by finding the classes that are not called by any Recipe. Inactive classes and Recipes can be removed while Git maintains the version history. [Refer: https://papers.nips.cc/paper/5656-hidden-technical-debt-in-machine-learning-systems.pdf]
- Platform Independence: The same Recipe can be used to spin up a DAG in Airflow, as well as a Pipeline in Kubeflow. We are no longer writing a Pipeline in the syntax of any one platform but have instead abstracted out the essentials into the Recipe.
- Inference and Training Support: As shown above, one Recipe supports both training and inference. Based on a run time variable, the Pipeline graph that is generated will vary.
- Jupyter Support: We leverage the Kale framework to convert an annotated Jupyter Notebook to a Recipe JSON, creating classes dynamically as needed. Creating DAGs in Airflow and Pipelines in Kubeflow dynamically from this notebook is supported.
- Scaling: By providing support for multiple execution engines, we can scale our pipeline on a distributed cluster setup using K8s and leverage any scheduler/orchestrator engine.
- CPU/GPU Config: As seen in the sample Recipe, the CPU and GPU details can be configured for each task specified in the Recipe.
- Support for multiple Docker images: Each Recipe may work on a separate base Docker image. There is also support for running each step of a Recipe on a separate Docker image, providing inter-task dependency isolation.
As we have shown above, using the Recipe comes with a host of benefits. While it almost seems obvious that we should be writing modular code, we have seen the challenges that come with modularity. These challenges need their own solution, and that’s where the Recipe comes in. Writing reusable code may seem tedious at first, but as we write more code there are great benefits to be reaped. Developers can experiment rapidly without rewriting or losing experiments, while also gaining the ability to move the code to production easily.
We would like to thank the following Noodlers for their contribution to this series:
- Ravi Bala for setting up the requirements for a framework to support modular, extensible pipeline development.
- Purushottam K. for building out the Recipe Management Utility SDK with Airflow Support.
- Madhukar Gokhale for extending Recipe to work with Kubeflow.
- Abhinav Garg, Pooja Balusani and Ravikant Chahar for testing out the Atlas framework and providing invaluable inputs which helped mature the framework.
- Gopal Joshi for providing feedback on the blogs in this series.