Noodle.ai builds AI products that address a diverse set of problems, helping our Enterprise customers improve visibility, make better decisions and reduce waste. This two-part blog covers the journey of building our in-house ML framework, Atlas, explaining the factors that were considered, the choices that were made and the benefits that we derived from it. The first part focuses on four high-level factors that contribute to the complexity of building an AI application, and we will dive into the details of each of these:

  1. Enterprise AI Applications: What does it take to build an Enterprise AI product
  2. AI Application Development Lifecycle: Where is the choke point in the product development process and why
  3. ML Pipeline Development: What are the velocity challenges with iterative experiments
  4. Tech Stack Selection: What are the tools available, overview of one such combination we use at Noodle

1. Enterprise AI Applications

There are numerous blogs which describe the challenges of developing an AI application. These challenges stem largely from the iterative nature of the development, in contrast to more deterministic rule-based software.

Building Enterprise-grade software is always treated differently from building other applications. A testament to this is that many open source products offer both a community and an enterprise version of the same product. A Forbes article covers this topic in detail.

What do we know so far?

  • Building AI applications is tough
  • Building Enterprise grade software is tough

So, what would be the complexity in building an Enterprise AI application?

To understand this, consider an AI-driven recommendation engine applied to two different categories of end users.

  1. Recommending songs or videos to the end user on an audio/video streaming application
  2. Recommending optimal parameters to an operations engineer in a manufacturing plant.

In both examples above, the approach taken to solve the problem could be similar, but the impact of the decisions made is quite different. In the second scenario, errors made by the application could cause significant losses to the end users.

When you buy Enterprise software such as Confluence, you are paying for core features such as documentation and collaboration. It might use an AI/ML-backed search algorithm to help you navigate.

Another example could be Zoom, where you are paying for the meeting, screen sharing and recording features. It could use AI-based systems to flag content on video calls.

Both are examples of Enterprise products where AI is an add-on feature. At Noodle, we build products where AI is the core feature of the application; these belong to the green path shown in Fig. 1. AI is not an added layer of intelligence; it is the feature for which we get paid. This adds layers of complexity to the development and deployment lifecycle. A failure in this component results in failure of the product and will most likely cause losses. The criticality of performance is higher, and so are the expectations.

Fig. 1: Types of AI applications based on project goals

Some of the challenges that we have experienced when building Enterprise AI products are described below:

  • Data Variations – Enterprise tenants vary in how they store and name their data sets, and the data is often spread across multiple vendors' software. This makes retrieving and processing the data effort-intensive.
  • Customization – Most customers need data isolation. Hence, even when the domain is the same, customized models must be created for each tenant by analyzing their individual historical data and understanding the context.
  • Performance – Accuracy of the models should be high to drive impactful decision making and, as seen in the examples above, prevent losses from happening.
  • Freshness – Inference results should be available on the latest data. Inferences which were run earlier become stale and need to be discarded or refreshed.
Fig. 2: Challenges with Enterprise AI applications
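
The freshness requirement can be made concrete with a small check. The sketch below is only an illustration, not Noodle's implementation: the `Inference` record, the per-tenant "latest data" timestamps and the function names are all hypothetical. It flags an inference as stale whenever newer data has arrived for that tenant after the inference was computed.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Dict, List


@dataclass(frozen=True)
class Inference:
    """Hypothetical record of one inference run for one tenant."""
    tenant: str
    computed_at: datetime  # when the inference was produced


def is_stale(inference: Inference, latest_data_at: datetime) -> bool:
    """An inference is stale if newer data arrived after it was computed."""
    return latest_data_at > inference.computed_at


def stale_inferences(
    inferences: List[Inference],
    latest_data_by_tenant: Dict[str, datetime],
) -> List[Inference]:
    """Return the inferences that should be discarded or refreshed."""
    return [
        inf for inf in inferences
        if is_stale(inf, latest_data_by_tenant[inf.tenant])
    ]


# Illustrative data only: tenant names and timestamps are made up.
runs = [
    Inference("tenant-a", datetime(2021, 3, 1, 8, 0)),
    Inference("tenant-b", datetime(2021, 3, 1, 8, 0)),
]
latest = {
    "tenant-a": datetime(2021, 3, 1, 9, 30),  # new data after the run
    "tenant-b": datetime(2021, 3, 1, 7, 0),   # no new data since the run
}
to_refresh = stale_inferences(runs, latest)
```

Here only the `tenant-a` inference would be refreshed, because its data changed after the run. Keeping the check per tenant also reflects the data isolation described under Customization.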

2. AI Application Development Lifecycle

“What you have learned is that the capacity of the plant is equal to the capacity of its bottlenecks,” says Jonah. – Eliyahu M. Goldratt, The Goal: A Process of Ongoing Improvement

According to the Theory of Constraints, every system has one or more constraints that prevent it from achieving optimum efficiency. Any improvement made before or after the constraint is ineffective until the constraint itself is addressed. Going from raw data to an AI application involves multiple stages, as highlighted in the diagram below. A common constraint is the turnaround time for putting new ML pipelines into production. The choke point is usually the switchover from the Model Development stage to the Production Readiness stage, as shown below. Enabling faster experiments, adding more members to the team or providing more specialized hardware for ML/DL will have no impact until this is addressed.

Fig. 3: AI Application Development Lifecycle
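
The bottleneck argument can be checked with a few lines of arithmetic. The sketch below models the lifecycle as a chain of stages, where the end-to-end throughput is the capacity of the slowest stage. The stage names other than Model Development and Production Readiness, and all of the capacity numbers (pipelines per month), are assumptions made up for illustration.

```python
from typing import Dict, Tuple


def pipeline_throughput(stage_capacity: Dict[str, float]) -> Tuple[str, float]:
    """Return the bottleneck stage and the resulting end-to-end throughput.

    In a serial chain of stages, overall throughput equals the capacity
    of the slowest stage.
    """
    bottleneck = min(stage_capacity, key=stage_capacity.get)
    return bottleneck, stage_capacity[bottleneck]


# Hypothetical capacities, in ML pipelines handled per month.
capacity = {
    "Data Preparation": 8.0,
    "Model Development": 6.0,
    "Production Readiness": 2.0,
    "Deployment": 10.0,
}
baseline = pipeline_throughput(capacity)

# Doubling experiment speed does not help: the constraint is elsewhere.
faster_experiments = dict(capacity, **{"Model Development": 12.0})
after_faster_experiments = pipeline_throughput(faster_experiments)

# Relieving the constraint raises throughput until the next bottleneck.
relieved = dict(capacity, **{"Production Readiness": 7.0})
after_relief = pipeline_throughput(relieved)
```

With these numbers the baseline throughput is 2 pipelines per month, and it stays at 2 after doubling Model Development capacity. Only raising Production Readiness capacity changes the result, at which point Model Development becomes the next constraint.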

If we examine the two stages leading to the constraint, namely Model Development and Production Readiness, we can see that the delay arises because the developers in the two stages are trying to meet different goals.

Model Development

  • Domain understanding / Data Labelling
  • Modelling technique selection
  • Feature selection
  • Experimentation
  • Evaluation
  • Visualization

Production Readiness

  • Code Modularization
  • Reduce duplication
  • Interface definition
  • Test case identification and automation
  • Workflow orchestration
  • Scalable execution
Fig. 4: Optimization goals for Model Development and Production Readiness stages

Balancing the different goals of these two stages while maintaining the right velocity is tough. The amount of effort to invest in this process depends on whether you are trying to build one product or a suite of products across different domains. In the latter scenario, optimization of this process becomes a necessity.

3. ML Pipeline Development Model

Development is an iterative process which involves experimentation and evaluation cycles that account for both code and data changes. It is tempting to continue this cycle in pursuit of perfection. However, if we apply the principles of DevOps to ML, a practice often referred to as MLOps, we would want to promote reasonably performing models and their corresponding ML pipelines to production quickly. There are a few articles that cover this topic in detail and are a recommended read.

  1. MLOps: Continuous delivery and automation pipelines in machine learning
  2. ML Ops: Machine Learning as an Engineering Discipline
Fig. 5: Iterative Model Development
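
One way to express "promote reasonably performing models quickly" is a simple promotion gate: stop at the first experiment iteration whose evaluation score clears an agreed threshold, instead of iterating towards an ideal score. This is a minimal sketch under stated assumptions; the function name, the scores and the 0.80 threshold are hypothetical and not taken from Noodle's process.

```python
from typing import Iterable, Optional, Tuple


def first_promotable(
    scores: Iterable[float], threshold: float
) -> Optional[Tuple[int, float]]:
    """Return (iteration, score) for the first experiment that clears the
    threshold, or None if no iteration qualifies. Iterations count from 1."""
    for iteration, score in enumerate(scores, start=1):
        if score >= threshold:
            return iteration, score
    return None


# Made-up evaluation scores from five successive experiment iterations.
experiment_scores = [0.71, 0.78, 0.84, 0.86, 0.87]

promoted = first_promotable(experiment_scores, threshold=0.80)
never_good_enough = first_promotable(experiment_scores, threshold=0.95)
```

In this example the third iteration is promoted, and the later, marginally better iterations can follow as regular updates once the pipeline is already in production.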

All AI applications must embrace an iterative, exploratory development process. This is a difficult premise for software development, where most known practices have been designed around clarity in requirements. The constant experimentation creates two more problems, which become tougher to solve as time passes.

  • There is a tendency to keep improving the models and push for an ideal state. This makes the development follow a waterfall model where most of the integration and testing gets delayed.
  • The boundary between a successful experiment prototype and an application deployed in production becomes unclear and this leads to massive technical debt [Prototype Smell]

4. Tech Stack Selection

There is an added complexity of selecting the right tech stack amongst the numerous packages and tools that are available.

Fig. 6: Tech stack across different layers.

At Noodle, we have a hybrid infrastructure. A private data centre houses the heavy compute clusters, and public cloud (AWS and Azure) is used to host the application. Each tenant can opt in for a dedicated VPC or VNet based on their data and compute isolation needs. Development and training phases are mostly carried out in the private data centre, and the trained models are deployed to the cloud. To ensure seamless operations across the three environments, we have leveraged Kubernetes (K8s) as the layer of abstraction for our operations and deployments.

The initial focus was on hardening inference scalability and deployments. For this we selected Airflow, which has been battle-tested and proven in our operations. We deployed Airflow on K8s, using the Kubernetes Executor to achieve parallelism across hundreds of pipelines spawning thousands of containers.

Development of ML pipelines is usually done in Jupyter, then converted to Python scripts and tested in Docker. To scale the training workflow and enable rapid experimentation we chose Kubeflow, since it blends well with our K8s stack. MLflow was selected as the model registry and experiment tracker since it gives a good REST-based interface to abstract model versions.

Our workflow is shown in the diagram below.

Fig. 7: Workflow & Tech stack at Noodle

Context Switch: Training to Production – Training differs from inference in pipeline steps, infrastructure (GPU -> CPU) and tech stack (see Fig. 7), which forces developers to account for all of these in their code. For example, a developer needs to do the following.

  • Develop experiment in Jupyter
  • Create a Training pipeline in Kubeflow
  • Test the training pipeline using Docker container
  • Deploy pipeline to Kubeflow + GPU instances
  • Write an Airflow DAG for the inference pipelines
  • Test the inference within Docker
  • Deploy inference pipeline to Airflow + CPU instances
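
To make the cost of these context switches visible, the seven steps above can be written down as data and the tool changes counted. The sketch below is only a way of tallying the list; the `Step` type and the helper function are hypothetical, while the tools and phases are the ones named in the steps.

```python
from typing import List, NamedTuple


class Step(NamedTuple):
    action: str
    tool: str
    phase: str  # "training" or "inference"


# The developer workflow from the list above, in order.
STEPS: List[Step] = [
    Step("Develop experiment", "Jupyter", "training"),
    Step("Create training pipeline", "Kubeflow", "training"),
    Step("Test training pipeline", "Docker", "training"),
    Step("Deploy training pipeline (GPU)", "Kubeflow", "training"),
    Step("Write inference DAG", "Airflow", "inference"),
    Step("Test inference", "Docker", "inference"),
    Step("Deploy inference pipeline (CPU)", "Airflow", "inference"),
]


def count_tool_switches(steps: List[Step]) -> int:
    """Count how often consecutive steps use a different tool."""
    return sum(
        1 for prev, cur in zip(steps, steps[1:]) if prev.tool != cur.tool
    )


distinct_tools = sorted({step.tool for step in STEPS})
tool_switches = count_tool_switches(STEPS)
```

For this workflow the developer touches four distinct tools and changes tool between every pair of consecutive steps, six times in total, on top of the switch from GPU-based training to CPU-based inference.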

Summary

So far, we have seen the challenges in building AI Applications due to the following factors.

  • Enterprise AI applications have more layers of complexity because of the end users they serve and the cost involved in decision making.
  • The iterative nature of ML pipeline development and deployment requires us to factor in exploratory development with room for experimentation.
  • Tech stack and infrastructure variations pose a selection problem. They also cause syntax variations, which lead to context switches.
  • Promoting an ML pipeline from the Model Development stage to the Production Readiness stage is a constraint. To ensure that the code being promoted to production is maintainable, it needs to be modular and reusable.

When designing a system there are always trade-offs involved, and being aware of these helps in making an educated choice. In Part 2 of this blog we start by exploring some of these trade-offs and analyze the impact of pursuing one set over the other. This will pave the way to define the Design Goals which were most relevant to Noodle and which form the foundation of a novel in-house ML framework that we call Atlas. We will also do a brief walkthrough of developing and deploying an ML pipeline using Atlas.

References