Will All The Good Data Scientists Please Stand Up?
These are interesting times for a data science-driven company like Noodle.ai, where our Slack channels frequently pulse with discussion and debate about modeling and prediction approaches for our customers. So you can imagine that during these past months of the coronavirus crisis, our #datascience Slack channel has been abuzz. What’s grabbed our attention now is the controversy over how best to predict the shape of the pandemic curve with epidemiological models: specifically, when infections will peak and the health crisis will begin to ease globally, nationally, regionally, and locally.
An article published by Slate highlighted the holes in the models the U.S. government is using to determine when the country can reopen, arguing that officials are essentially using “high school-level math” to justify reopening the economy. In late March, epidemiological models showed that the U.S. could face between 100,000 and 240,000 deaths. It is startling, then, that the Council of Economic Advisers (CEA) resorted to simple curve fitting for its forecasts.
…the Washington Post’s revelation was met with slack-jawed disbelief from economists, policy analysts, and pretty much anybody who spends time with spreadsheets. A cubic model is not a sophisticated prediction tool. It takes data and uses it to draw a curve, based on an equation you probably learned in high school algebra (y = ax³ + bx² + cx + d). You can literally do it with a button on Excel. It’s not a good way to make forecasts in general, since you’re just stuffing numbers in a rigged formula that’s prone to extrapolate small recent trends forward into implausibly dramatic predictions about the future.
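To make the fragility concrete, here is a small sketch with made-up numbers: fit a cubic to a daily-death series that is flattening out, then extrapolate it forward. The data and the 20-day horizon are illustrative assumptions, not real counts.

```python
import numpy as np

# Hypothetical daily death counts for 10 days (illustrative numbers only):
# growth accelerates, then flattens near the end.
days = np.arange(10)
deaths = np.array([5, 8, 14, 22, 35, 50, 70, 88, 100, 105], dtype=float)

# Fit the cubic y = a*x^3 + b*x^2 + c*x + d, as an Excel trendline would.
coeffs = np.polyfit(days, deaths, deg=3)
cubic = np.poly1d(coeffs)

# Extrapolate 20 days ahead. The fitted leading coefficient is negative
# (the curve bends down to match the recent flattening), so the forecast
# soon plunges below zero deaths, which is plainly implausible.
future = np.arange(10, 30)
forecast = cubic(future)
```

The model rewards a good in-sample fit, but nothing constrains its behavior outside the fitted range; the x³ term dominates and drags the forecast to absurd values.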
Simple curve fitting to the past reported numbers of deaths is likely to yield a badly incorrect prediction. It carries a high risk of model misspecification bias, since we must assume the functional form of the curve in advance. This kind of univariate forecasting, in which we fit a curve to the historical time series of the target variable and extrapolate it into the future, is useful only when there are clearly evident trends and seasonal patterns. Moreover, it makes the primitive assumption that future actions and conditions will be the same as the current ones. For Covid-19 predictions, this assumption is not valid: there have been, and will be, notable changes in the number of diagnostic tests done per day, the supply of PPE, medical and hospital capacity, the government responses influencing social distancing (e.g., stay-at-home orders, state reopenings), the availability of effective treatments, and the mutations of the coronavirus.
Advanced Data Science to the Forefront, Please
One thing we know for sure is that there are many uncertainties arising from the future actions and conditions that influence the critical factors dominating the spread of the coronavirus. We should note that, from the perspective of predictive modeling, these uncertainties operate at multiple levels.
First, we do not know which actions (e.g., levels of government response, diagnostic testing and PPE availability) and conditions (e.g., possible mutations of the coronavirus, availability of effective treatments) to assume for the future, since they are not determined or known at the time of prediction. Let’s call this the future action and condition uncertainty.
Second, we are not sure how specific actions and conditions change parameters such as the detection rate, infection rate, hospitalization rate among the infected, recovery rate among the hospitalized, and death rate. Let’s call this the parameter estimation uncertainty.
According to the CDC Covid-19 forecast website and the Reich Lab Covid-19 forecast hub, many of the selected prediction models follow sound data science and machine learning practice. They combine data-driven parameter estimation with simulation based on mathematical equations. Instead of simply fitting a curve to the historical values of the target variable itself, they use a set of mathematical functions to describe the epidemic spread and generate the target variable from a set of defined epidemiological parameters.
The SEIR (Susceptible, Exposed, Infectious, Recovered) model uses a well-known set of equations to simulate how individuals transition between these stages over each time period. It simulates the generative process with a few key parameters: the regional population, the latent period, the infectious period, the mortality rate, and the reproduction number R, defined as the average number of secondary infections an infected individual causes over the entire infectious period. Importantly, one should not assume fixed parameter values but infer how they vary over time using direct inputs (e.g., government policies) or indirect inputs (e.g., anonymized mobile data). If parameter values were not updated to reflect situational changes such as social distancing and state reopenings, the model’s predictions would quickly lose accuracy and realism.
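As a rough illustration of this generative process, here is a minimal discrete-time SEIR sketch in Python. All parameter values (population, latent and infectious periods, R0, initial exposed count) are illustrative assumptions, not calibrated estimates.

```python
import numpy as np

def seir_simulate(N, R0, latent_days, infectious_days, days, E0=10):
    """Discrete-time SEIR simulation with daily Euler steps.

    sigma = 1/latent period, gamma = 1/infectious period, and
    beta = R0 * gamma is the transmission rate implied by R0.
    """
    sigma, gamma = 1.0 / latent_days, 1.0 / infectious_days
    beta = R0 * gamma
    S, E, I, R = N - E0, float(E0), 0.0, 0.0
    history = []
    for _ in range(days):
        new_exposed = beta * S * I / N   # S -> E: new infections
        new_infectious = sigma * E       # E -> I: end of latent period
        new_recovered = gamma * I        # I -> R: end of infectious period
        S -= new_exposed
        E += new_exposed - new_infectious
        I += new_infectious - new_recovered
        R += new_recovered
        history.append((S, E, I, R))
    return np.array(history)

# Illustrative run: 1M people, R0 = 2.5, 5-day latent, 7-day infectious period.
traj = seir_simulate(N=1_000_000, R0=2.5, latent_days=5, infectious_days=7, days=365)
peak_day = traj[:, 2].argmax()  # day on which the infectious compartment peaks
```

The four compartments always sum to the population, and the epidemic peaks when susceptible depletion pushes the effective reproduction number below one.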
Model validation schemes popular in data science (e.g., estimating model parameters on training data from earlier periods and predicting out-of-sample values for later periods) allow us to check how reasonable the estimated parameter values are. Monte Carlo and Bayesian probabilistic estimation help us quantify the uncertainty in those parameters and produce prediction intervals spanning worst-case and best-case outcomes.
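A minimal sketch of the Monte Carlo idea, assuming a toy exponential-growth model and a made-up distribution for the uncertain growth parameter:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: daily cases grow by an uncertain multiplicative factor.
# The factor is only estimated, so we draw it from an assumed distribution
# representing the parameter estimation uncertainty (values are made up).
current_cases = 1_000
growth_samples = rng.normal(loc=1.05, scale=0.03, size=10_000)

# Simulate 30 days ahead for each sampled parameter value.
horizon = 30
trajectories = current_cases * growth_samples[:, None] ** np.arange(1, horizon + 1)

# 90% prediction interval for day 30: best-case vs. worst-case outcomes.
lo, hi = np.percentile(trajectories[:, -1], [5, 95])
```

Because the 30-day outcome compounds the uncertain parameter, the interval is wide and asymmetric, which is exactly the honest picture a point forecast hides.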
Yet, we should note that these parameter values are estimated from past data, which was itself shaped by past actions and conditions! Thus, predictions based on past data alone cannot incorporate the future action and condition uncertainty. They implicitly assume that future actions and conditions will be the same as those of the past (e.g., that current stay-at-home orders and social distancing will continue), which is unlikely, and they go wrong when the future situation changes. This is also why long-term predictions are much harder than short-term ones.
Looking at the CDC Covid-19 forecast website and the Reich Lab Covid-19 forecast hub, we find that 9 of the 13 listed models either assume that current interventions continue or are not conditional on future interventions. When there are great uncertainties about future actions and conditions, one trick is to use an ensemble weighted over many models. Such an ensemble can represent, in its prediction interval, the uncertainty integrated across models. One example is the UMass ensemble model shown in the Reich Lab hub.
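One simple way to build such a weighted ensemble is to pool Monte Carlo samples from each member model in proportion to its weight, then read the interval off the pooled distribution. The models, samples, and weights below are entirely hypothetical:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical forecast distributions (deaths in the week ahead) from three
# models, each represented by Monte Carlo samples; all numbers are made up.
model_samples = {
    "model_A": rng.normal(1200, 150, 5000),
    "model_B": rng.normal(1500, 300, 5000),
    "model_C": rng.normal(1000, 100, 5000),
}
weights = {"model_A": 0.5, "model_B": 0.3, "model_C": 0.2}  # assumed skill weights

# Mixture ensemble: resample each model in proportion to its weight,
# then pool the samples into one ensemble forecast distribution.
pooled = np.concatenate([
    rng.choice(samples, size=int(10_000 * weights[name]))
    for name, samples in model_samples.items()
])
lo, hi = np.percentile(pooled, [2.5, 97.5])
```

The pooled interval is wider than any single model's, reflecting disagreement between models as well as each model's own uncertainty.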
Now, how can we incorporate future action and condition uncertainty into predictions? There is no magic here: we should explicitly factor planned or known future actions and conditions into our predictions. For example, at least one model incorporates individual state-by-state reopenings, and their effects on infections and deaths, into its parameter estimation. To model the effects of mitigation strategies, it uses three values of the reproduction number R: the initial R0 before mitigation, the post-mitigation R1, and the post-reopening R2 after mitigation has been relaxed.
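A piecewise reproduction-number schedule of this kind can be sketched with a simple discrete-time SIR model. The R0/R1/R2 values, switch dates, and population below are illustrative assumptions, not estimates from any real model:

```python
def piecewise_R(day, mitigation_day=30, reopen_day=90):
    """Illustrative schedule: R0 before mitigation, R1 under mitigation,
    R2 after reopening. All three values are assumptions."""
    if day < mitigation_day:
        return 2.5   # R0: unmitigated spread
    if day < reopen_day:
        return 0.9   # R1: stay-at-home orders push R below 1
    return 1.3       # R2: partial reopening lets R rise again

# Simple SIR with daily steps; gamma = 1/7 for a 7-day infectious period.
N, gamma, days = 1_000_000, 1.0 / 7.0, 180
S, I, R = N - 10.0, 10.0, 0.0
infectious = []
for day in range(days):
    beta = piecewise_R(day) * gamma   # transmission rate tracks the schedule
    new_inf = beta * S * I / N
    new_rec = gamma * I
    S -= new_inf
    I += new_inf - new_rec
    R += new_rec
    infectious.append(I)
```

With this schedule the simulated curve declines while mitigation holds R below 1 and climbs again after reopening, the kind of scenario a fixed-parameter model cannot express.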
The framework of incorporating future inputs (planned actions, exogenous conditions) into future predictions sounds straightforward. However, data science practice tends to focus on making predictions from historical inputs alone. That is not enough when the goal is optimal decision making or control, where the prediction model must be used to plan future actions under various conditions. We need causal predictions for what-if plan simulations, not mere predictions. As Judea Pearl argued in The Book of Why, machine learning prediction should go beyond curve fitting.
We need Causal AI. Our future should not be predicted simply from our past experience, but by factoring in the controllable actions we will take amid uncertain and uncontrollable future conditions.
Noodle.ai’s Approach to Making Predictions and Decisions
Enterprise AI® models used by Noodle.ai implement Causal AI. Most of the key problems in enterprise AI involve a dynamic process with causal interactions among action inputs, condition inputs, sensor outputs, and target (KPI) outputs. Thus, our proprietary Deep Probabilistic Decision Machines (DPDMs) infer the latent state of the dynamic process from data and knowledge. The deep probabilistic (DP) component builds a predictive model that enables generative simulations of the likely future observation sequence for any given future (or counterfactual) condition and action sequence at a process state. The decision machine (DM) component then optimizes the action policy or decision-making controller on those predictive simulations, using model-based deep reinforcement learning (RL) or model-predictive control (MPC). The optimal policy or controller is designed to maximize the given KPIs, relying on predicted experiences of sensor and target observations for different actions over the future time horizon. In short, DPDMs implement causal, model-based decision making and control.
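For intuition only, here is a generic random-shooting MPC sketch, not Noodle.ai's DPDM implementation: sample candidate action sequences, roll a dynamics model forward, and execute the first action of the best-scoring sequence. The toy inventory dynamics, cost function, and all numbers are made-up illustrations:

```python
import numpy as np

rng = np.random.default_rng(0)

def dynamics(state, action):
    """Toy inventory model (illustrative): state = stock level,
    action = order quantity, daily demand is stochastic."""
    demand = rng.poisson(20)
    return max(state + action - demand, 0)

def cost(state, action):
    # Penalize deviation from a target stock level of 50, plus ordering cost.
    return abs(state - 50) + 0.1 * action

def mpc_action(state, horizon=5, n_candidates=200):
    """Random-shooting MPC: simulate candidate action sequences through the
    model and return the first action of the lowest-cost sequence."""
    best_action, best_cost = 0, float("inf")
    for _ in range(n_candidates):
        actions = rng.integers(0, 40, size=horizon)  # candidate order plan
        s, total = state, 0.0
        for a in actions:
            total += cost(s, int(a))
            s = dynamics(s, int(a))   # roll the model forward one step
        if total < best_cost:
            best_cost, best_action = total, int(actions[0])
    return best_action

action = mpc_action(state=30)  # order quantity recommended for today
```

In a real deployment the hand-written `dynamics` would be replaced by a learned probabilistic model, and the loop would re-plan at every step as new observations arrive.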
The Noodle.ai Athena Supply Chain AI Suite, powered by DPDMs, can generate actionable simulations of future inventory positions, demand, lost sales, and value at risk, given the production and shipments planned under different conditions. The DPDMs try out different action strategies in the simulated environment and reward those that optimize the key performance indicators tied to the desired objectives. For an enterprise with complex dynamic interactions and process variability, our DPDM model is unique in its ability to recommend optimal actions in any scenario and to keep learning from the actions currently being taken.