You ask, Noodle answers: What makes a great data science problem?
Articles about what makes a great data scientist are everywhere. Articles about what makes a great problem for a data scientist to solve… well, they are harder to find. That’s OK – Noodle has a great team of data scientists and we sat down with one of them (thanks, Siva!) to help us break it down for you.
Ask the right questions
For Siva, a great data science problem is one that allows her to help a business solve challenges both now and in the future. She helps her clients transition from reactive problem-solving to a proactive or predictive approach. Getting there means asking the right questions so that both she and the client understand the real problem that needs solving. Some of her favorites help her give clients solid footing so they won’t get caught off guard by problems in the future:
- What is something that caught you off guard last year?
- What were three things you learned last year?
- What were three things you wish you knew last year?
These questions help Siva focus on the most impactful areas first and start to make a difference for the client right away.
You guessed it: ask more questions
We can’t say enough about how important it is to have an inquisitive mind if you want to be a great data scientist. Siva stressed that questions can reveal hidden solutions and help combat hidden biases. Some of the most important questions help you size and sense your business and its market or supply chain to reveal opportunities. Don’t be afraid to ask the hard questions about issues like demographics, product offerings, links in the supply chain, or hiccups in the manufacturing process.
Once you have begun to get human answers to these questions, you can begin to see patterns and insights within your data sources, both internal and external. One of the key challenges for any business is managing the external impact, then sensing the internal effects, and finding ways to connect the two.
“In theory it sounds simple but connecting external sensing and internal operations is one of the hardest things we do in data science. Connecting something that is within the enterprise to something that is outside the enterprise boundary… companies invest millions of dollars just trying to join those two universes together. If you want to solve this effectively, unless you establish a linkage, nothing you do will be actionable and meaningful, so that becomes a very important piece of this puzzle.” Siva Devarakonda
Test, experiment, test again
A data science motto is “correlation does not equal causation” – experimentation is how you reveal true causation. Great data science problem solving uses A/B testing to understand everything from behaviors to market impact. If you have enough data, deep learning comes into play for this, but if you have less data, you have to look for more traditional tests and experiments. When testing, remember that all data has a hierarchy. This hidden hierarchical structure takes shape in the form of categories and labels when dealing with data points (“this kind of person from that kind of place with this type of education and that level of income” is one example of hidden hierarchies in real-world data).
Testing helps reveal which categories are relevant and helps create better labels for the data you’re using to solve your data science problem. Understanding and labeling these implicit and explicit categories helps data scientists navigate across data that seems homogeneous to find patterns they can use to solve problems.
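To make the A/B testing idea a little more concrete, here is a minimal sketch of one of the more traditional tests: a two-proportion z-test comparing conversion rates between two variants. The function name and the counts are hypothetical and chosen only for illustration; this is not a description of how Noodle runs its experiments.

```python
import math

def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Compare the conversion rates of variants A and B.

    Returns the z statistic and a two-sided p-value under the
    normal approximation. Inputs are hypothetical counts.
    """
    p_a = conv_a / n_a
    p_b = conv_b / n_b
    # Pooled rate under the null hypothesis that A and B convert equally.
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Two-sided p-value: 2 * (1 - Phi(|z|)) == erfc(|z| / sqrt(2)).
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value

# Made-up example: 100 of 1,000 users convert on A, 150 of 1,000 on B.
z, p_value = two_proportion_z_test(100, 1000, 150, 1000)
print(f"z = {z:.2f}, p = {p_value:.4f}")
```

A small p-value suggests the difference between the variants is unlikely to be chance alone. Running the same comparison within each category of the data is one way to see which categories actually matter.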
Dealing with bias
The need to categorize and label can open a door for bias, especially when dealing with data sets you receive from an outside source. Let’s say you make engines, for example. An engine is (hopefully!) low failure by design. This means the data is skewed toward a low failure rate by design. But the data science problem you are trying to solve is the failure itself and why it happened, however rarely, so the data you have is not biased in your favor. It becomes vital to separate the categories and search for patterns in the categories using questions, testing, and labeling to understand the truth behind the biased input.
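As a rough sketch of what separating the categories can look like for the engine example, the snippet below compares an overall failure rate with per-category rates. The category names (`plant_a`, `plant_b`) and all of the counts are invented for illustration and are not from any real data set.

```python
from collections import defaultdict

def failure_rate_by_category(records):
    """Return the failure rate for each category.

    `records` is an iterable of (category, failed) pairs,
    where `failed` is True for a failure.
    """
    totals = defaultdict(int)
    failures = defaultdict(int)
    for category, failed in records:
        totals[category] += 1
        if failed:
            failures[category] += 1
    return {c: failures[c] / totals[c] for c in totals}

# Hypothetical engine records: failures are rare overall.
records = (
    [("plant_a", False)] * 990 + [("plant_a", True)] * 10
    + [("plant_b", False)] * 95 + [("plant_b", True)] * 5
)

overall = sum(failed for _, failed in records) / len(records)
rates = failure_rate_by_category(records)

print(f"overall: {overall:.3f}")
for category, rate in sorted(rates.items()):
    print(f"{category}: {rate:.3f}")
```

In this made-up data the overall failure rate looks low, but the smaller category fails several times as often as the larger one, which is the kind of pattern an aggregate number can hide.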
“I think it is very important to control for bias. Hidden bias in algorithms is not only a problem for businesses but is becoming a societal problem at this point, and it’s something we take into account with every project at Noodle.” Siva Devarakonda
One essential way to remove hidden bias from data is to talk to the people behind the data and learn how the external sensing connects with the internal operations of the business. Connecting with the people using the systems daily can reveal when a system has incentivized a certain metric, allowing you to better label and compare data across categories and see clearer patterns.
Another type of bias a good data science problem will control for is measurement bias. A business may not yet be measuring certain things that are important to it. Data science is only as good as the data inputs, so in this case you may need to educate the client on how to measure more to get better insights, then revisit the problem they want to solve once they have more information.
Too much data is always a good thing
Too much data may occasionally lead to a situation where you have more data than time to solve the problem, but overall more data is always the best foundation. One of the great things about Noodle is that we have created modules to help clients solve big problems using large data sets, and when you have a great toolbox like that, large amounts of data become a strength. When this data is joined with a solid exchange of human domain knowledge, solving these data science problems becomes exciting.
“I’m a strong proponent of data science as a medium of insights and good data is truly reflective of the underlying process that’s at play.” Siva Devarakonda