Supernovae to Customer Success: A Data Science Journey
Why would you leave your job researching stars? Won’t you get bored?
As I left my academic career studying exploding stars, I received this question several times. How could I leave a job where I was paid to look out into space and tackle fundamental questions about our universe? Sometimes the driving force in your work is not the philosophical questions themselves but the way you get to answer them.
I have spent the last 10 years doing data science for astrophysics research, but I recently made the switch from an academic position to one at a startup company. I will share how I prepared for this career change and how it relates to skills for a data scientist.
I decided at a young age that I wanted to pursue a STEM career. I finally settled on Physics with a focus in Astrophysics going into my undergraduate program at the University of Georgia. I then continued my education to earn a Ph.D. in Cosmology, the study of the universe on the largest scales, including how it started and how it will end. Honestly, who would be able to resist the urge to be addressed as “Dr. Ponder”?
During graduate school, I determined that I wanted to either be a research scientist at a national laboratory or leave academia and become a data scientist. I started with a postdoctoral appointment to further explore which direction was the right choice for me. When I searched for postdoctoral programs, I applied only to those that would give me the flexibility to work on projects in the Data Science/Artificial Intelligence domain.
I earned my Ph.D. from the University of Pittsburgh and joined the University of California, Berkeley as a Computational Data Science fellow in the Berkeley Center for Cosmological Physics. For every project I took on, I made sure that I was still working towards both possible career choices: using data science in academia or using data science at a tech company. It was also during this time that I forced myself to stop relying on tools made explicitly for astronomy and focus on general tools. For example, the Python package AstroPy has a nice interface for working with data frames that was built for astronomers, but Pandas is more widely accepted.
After 3 years, I decided that I wanted to focus even more on artificial intelligence and machine learning, so I took a position at the SLAC National Accelerator Laboratory at Stanford University as a postdoctoral researcher in their new Machine Learning Initiative. In this job, I furthered my knowledge of general tools by implementing a model using TensorFlow. Since I was interacting with the machine learning researchers on Stanford’s campus, I wanted to ensure that I could use the same jargon as them by taking Coursera classes on Machine Learning and Deep Learning.
Making the Decision to Leave Academia
After spending so much time jointly building my academic career and preparing for a possible career in the tech industry, I decided it was time to commit to a single direction and become a data scientist. Though I had a clear path to continue my academic research, I made the hard decision to leave academia. I wanted to find a company that had a clear goal to better the world we live in. One of my friends from graduate school suggested their current employer, Noodle AI. Noodle optimizes the flow from raw materials to finished goods, across manufacturing and the supply chain, to minimize the waste its customers accrue.
I met all the technical requirements for the Senior Data Science position offered at Noodle due to my diligent planning, so my friend recommended me for the job. My friend assured Noodle that though I had no industry experience, I was a good fit for the company. I received an offer and was able to deliver data science work for a customer within my first two weeks at the company!
Connecting Academic Data Science Research to Real World Data Science Experience
Though I took specific steps to ensure that I was getting data science experience, many of the projects an astrophysicist works on already have connections to data science.
I worked on many machine learning projects and not all of them were published. An interesting option is using machine learning for a class project. For example, for my observational techniques class, I used scikit-learn for population clustering with Fast Radio Bursts. For a collaboration I was working with, I created a logistic regression to emulate expensive simulations to plan for a satellite survey. For both projects, I used several different implementations to probe the best methods. Neither of these resulted in a paper, but I learned how to use scikit-learn.
Machine learning algorithms are often created with an express purpose in mind. A large part of either academic or technical work is creating extensions of that work to use with your own data. One project I worked on was adapting a K-Means clustering tool that was developed for clustering curves as a function of time (timeseries), and instead I clustered curves as a function of cosmological distance. I also used a deep learning architecture called a Transformer that was originally built for language translation to classify supernova light curves.
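As a rough illustration of clustering curves over a coordinate other than time, here is a minimal sketch. It is not the tool described above: it assumes plain scikit-learn KMeans applied to curves resampled onto a shared grid, and the function names, the grid, and the synthetic curves are all hypothetical.

```python
import numpy as np
from sklearn.cluster import KMeans


def resample_curve(x, y, grid):
    """Interpolate one irregularly sampled curve onto a shared grid."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    order = np.argsort(x)  # np.interp expects increasing x values
    return np.interp(grid, x[order], y[order])


def cluster_curves(curves, grid, n_clusters=2, seed=0):
    """Cluster curves with K-Means after putting them on the same grid.

    The grid can be any shared coordinate (time, distance, ...);
    K-Means only sees one fixed-length vector per curve.
    """
    features = np.vstack([resample_curve(x, y, grid) for x, y in curves])
    model = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed)
    labels = model.fit_predict(features)
    return labels, model.cluster_centers_


# Synthetic example (made-up data): ten rising and ten falling curves,
# each sampled at random positions along a generic "distance" axis.
rng = np.random.default_rng(42)
grid = np.linspace(0.1, 0.9, 20)
curves = []
for slope in [1.0] * 10 + [-1.0] * 10:
    x = rng.uniform(0.0, 1.0, size=30)
    y = slope * x + 0.05 * rng.normal(size=30)
    curves.append((x, y))

labels, centers = cluster_curves(curves, grid, n_clusters=2)
print(labels)
```

The point of the sketch is that the clustering step does not care what the horizontal axis means, which is why a tool written for timeseries can often be adapted to another coordinate once the curves share a common sampling.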
Developing machine learning models is a good second step (after developing features for the models), but a particularly important last step is analyzing the performance of the model. One project that I worked on was solely based on analyzing the results from a modeling competition. Kaggle hosted a challenge to classify supernova light curves with a limited amount of data based on what would be expected from the upcoming Vera C. Rubin Observatory Legacy Survey of Space and Time (LSST). There were over 1000 participants, and we needed to understand what made the winning models unique so that we could leverage those aspects for the survey. See https://www.kaggle.com/c/PLAsTiCC-2018 for the challenge and https://arxiv.org/abs/2012.12392 for the analysis of the results.
None of the models I worked with were built from scratch; they were all built using readily available packages such as scikit-learn and TensorFlow. The two most time-consuming parts of building machine learning models are data cleaning and analyzing the results. Data cleaning is an imperative step as it reduces noise and possible false signals in the data set. It can be nontrivial to understand what the results are really saying about your data and your question.
Noodling on New Problems
Noodle is an Enterprise AI company, meaning that it deals directly with other businesses. We typically receive timeseries data from customers. I work primarily with our Demand Flow application, which examines timeseries of items sold and forecasts how much will be sold in the future. In this way, we can recommend how much to produce and ship out to distribution centers so that companies are not under- or over-producing goods.
When I joined, Noodle had already built a Python package called Core Inference that made it easy to interface with different prediction models such as statistical models, machine learning-based models, and deep learning-based models. It also has capabilities to ensemble multiple models for improved performance and reconcilers to standardize over different hierarchies. My machine learning training made it easy for me to understand which models would be good for the data and my experience with Python made it easy for me to quickly master the Core Inference package. In academia, you need to publish eye-catching figures, so I had a lot of practice creating plots to explain the results of my work to different people within the company and customers.
After working for two customers, I started adding new functionality to the Core Inference package based on custom work I had developed. In academia I had spent time learning how to code properly, write unit tests, and work with Git, which helped me rapidly add new and flexible modules to the codebase. One example of this work is Prediction Intervals. Though this was not strictly a machine learning issue, it is a significant problem both in academia and at Noodle: how to quantify errors on your machine learning results. Some models come with their own functionality to determine prediction intervals, often by calculating the desired quantiles. However, some of our models lacked those capabilities and did not produce posterior probability distributions either. Our solution was to calculate the Root Mean Squared Error as
$$\mathrm{RMSE} = \sqrt{\frac{1}{N} \sum_{t=1}^{N} \left(Y_t - \mathrm{forecast}_t\right)^2}$$
where $N$ is the number of timesteps that have both a forecast and an actual value, and use that as a tracer for the standard deviation. We found that this assumption produced a distribution that was too wide. I then implemented a scale factor to reduce the size of the prediction intervals based on minimizing the Mean Interval Score that was used for the M4 Competition (https://robjhyndman.com/hyndsight/m4competition/):
$$\mathrm{MIS} = \frac{1}{h} \sum_{t=1}^{h} \left[ (U_t - L_t) + \frac{2}{\alpha} (L_t - Y_t)\, \mathbf{1}\{Y_t < L_t\} + \frac{2}{\alpha} (Y_t - U_t)\, \mathbf{1}\{Y_t > U_t\} \right]$$
where $h$ is the number of timesteps forecasted, $\alpha$ is the significance level (one minus the nominal coverage, so 0.32 for a 68% interval), $U_t$ is the upper prediction interval, $L_t$ is the lower prediction interval, $Y_t$ is the actual value, and $\mathbf{1}\{\cdot\}$ equals 1 when the condition holds and 0 otherwise.
I used the BFGS minimizer in Python’s SciPy to calculate a scale factor $k$ assuming that
$$L_t = \mathrm{forecast}_t - k \cdot \mathrm{RMSE}$$
$$U_t = \mathrm{forecast}_t + k \cdot \mathrm{RMSE}$$
This calibration is meant to ensure that the 68% prediction interval encompasses about 68% of the actuals. In this way, the prediction intervals that we calculate are more in line with what we expect the results to mean and are easy to explain to our customers.
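The scale-factor fit can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the Core Inference implementation: the function names are hypothetical, the data are synthetic, the Mean Interval Score is written as the interval width plus a 2/alpha penalty for each actual that falls outside the interval, and alpha is taken to be 0.32 for a 68% interval.

```python
import numpy as np
from scipy.optimize import minimize


def mean_interval_score(actual, lower, upper, alpha):
    """Mean Interval Score: interval width plus 2/alpha penalties for misses."""
    actual = np.asarray(actual, dtype=float)
    lower = np.asarray(lower, dtype=float)
    upper = np.asarray(upper, dtype=float)
    width = upper - lower
    below = (2.0 / alpha) * (lower - actual) * (actual < lower)
    above = (2.0 / alpha) * (actual - upper) * (actual > upper)
    return float(np.mean(width + below + above))


def fit_scale_factor(actual, forecast, rmse, alpha=0.32, k0=1.0):
    """Find k so that forecast +/- k * rmse minimizes the Mean Interval Score."""
    actual = np.asarray(actual, dtype=float)
    forecast = np.asarray(forecast, dtype=float)

    def objective(params):
        k = abs(params[0])  # keep the interval half-width non-negative
        return mean_interval_score(
            actual, forecast - k * rmse, forecast + k * rmse, alpha
        )

    # BFGS with a numerically estimated gradient; the score is only
    # piecewise smooth, so the optimizer may stop near the minimum
    # rather than exactly at it.
    result = minimize(objective, x0=[k0], method="BFGS")
    return abs(float(result.x[0]))


# Synthetic example (made-up data): unit-variance errors around a flat
# forecast, with a deliberately inflated error estimate standing in for
# an RMSE that is "too wide".
rng = np.random.default_rng(0)
forecast = np.zeros(5000)
actual = rng.normal(0.0, 1.0, size=5000)
rmse_too_wide = 2.0

k = fit_scale_factor(actual, forecast, rmse_too_wide, alpha=0.32)
coverage = np.mean(np.abs(actual - forecast) <= k * rmse_too_wide)
print(f"k = {k:.3f}, empirical coverage = {coverage:.3f}")
```

Because the score penalizes misses more heavily as alpha shrinks, minimizing it pulls the interval edges toward the quantiles that match the nominal coverage, which is why a single multiplicative factor on the RMSE can tighten intervals that start out too wide.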
The biggest difference I have seen between Noodle and academia is the time it takes to tackle “big” projects. It took me 4 years to publish a paper from my thesis, and the conclusion of that paper was that we needed to gather more data and write another paper. At Noodle, by contrast, I worked with a customer for 3-4 months, spent a few weeks operationalizing the code, and then moved on to other projects like the prediction intervals, which took 1-2 months including extensive testing. If you want to work on new projects in a fast-paced environment, then a move to a tech company might be what you need!
Steps to Take Toward Your Own Data Science Transition
I was able to use my academic projects on my resume and talk about them from a data science viewpoint so that prospective employers could draw parallels to work done at their company.
There are concrete steps that graduate students and postdoctoral scholars can take to prepare for a future in data science:
1. Acknowledge and allow yourself to consider career paths outside of academia!
- Choosing to leave an academic career does not make me a failure. I decided that a data science job and a company doing good for the world was a better fit for my goals in life. I did not fail to get an academic job; I chose not to.
2. Position yourself to do more AI/ML projects
- Artificial Intelligence/Machine Learning/Data Science have many applications to research science. You can easily pick a problem and try different techniques to solve it.
3. Learn tools being used in industry
- Take the time to learn the programming tools being implemented in industry such as Python, NumPy, Pandas, scikit-learn, TensorFlow/PyTorch, and Git.
- Learn the jargon used in the tech industry and write your resume using that language so that recruiters and interviewers can connect with your work.
4. Build up a network of colleagues and ask for help!
- Create a LinkedIn profile and add your friends from graduate school and your research colleagues.
- At the winter American Astronomical Society meeting, there is an Astronomers Turned Data Scientists splinter meeting that brings together former astronomers and current astronomers considering the switch! Find them and ask for advice!
- Asking for recommendations from people in your network can help get you in for an interview, but you must stand on your own during the long interview process.
I do not regret getting a Ph.D. or either of my postdoctoral appointments. I made these positions work for me and built the skills that would be valuable outside of academia. The work I now do at Noodle AI implements the same techniques, but I can see the results of my hard work almost immediately and know that they have a positive impact on the world.