The Data Science Myth (for Startups)

It all probably started around 2012, a year after I moved to New York. At the time, no one was talking about data science as a thing: Big Data was Big Data, machine learning was machine learning, and AI was AI. These are very different areas of expertise: Big Data is concerned with the implementation techniques for processing large amounts of data; machine learning is concerned with the design of models that classify and predict based on data; AI is a much broader study of intelligence that seeks to design and model human-like decision-making. These skill sets are connected and often overlap with one another, but they are certainly not one thing.

After all, every science is data science. Is there a science that does not rely on data?

In the couple of years after that, the term data science was plastered all over. Now, according to Wikipedia, data science is “the extraction of knowledge from large volumes of data that are structured or unstructured, which is a continuation of the field data mining and predictive analytics, also known as knowledge discovery and data mining (KDD). “Unstructured data” can include emails, videos, photos, social media, and other user-generated content. Data science often requires sorting through a great amount of information and writing algorithms to extract insights from this data.” That sounds like a textbook definition of machine learning (or statistical learning at large). The terms “knowledge discovery” and “data mining” have been used in academia for decades.

But if you take a look at popular data science classes like the one offered at General Assembly or the one offered on Coursera, the curriculum is a blend of “Big Data” processing and machine learning. And as far as machine learning goes, the coverage is restricted to basic regressions, decision trees and naive Bayes classifiers, which constitute only an introductory machine learning skill set.

There is no doubt that both data processing and machine learning are immensely valuable and have very immediate applications, but with the new blurry designation of data science, some myths are also being perpetuated at a massive scale. It is helpful to debunk these myths about data science as an ultimate weapon, and understand that data processing, machine learning and AI have very different aims, purposes and required training in order to be effective.

So, here goes.

Data science is a new field of study

Data science is a buzzword. A new field of study introduces new concepts, new implementations or new applications. Much like “cloud computing”, “Internet of Things” or “User Experience”, a buzzword like data science does not do that; it is a redesignation of a subset of existing skill sets and techniques without furthering them.

Buzzwords are created to hype up certain products and technologies, and they are not inherently bad. But treating buzzwords as if they were real fields of study is dangerous, because it leads us to believe in certain forms of problem-solving without understanding the sciences that justify these forms under particular circumstances. For example, blindly believing in cloud computing is erroneously believing that anything that sits on a remote server is better than something run locally; blindly believing in IoT is believing that solutions created using weaker microcontrollers are inherently better than ones created using more powerful computer and mobile processors; and blindly believing in UX is being lazy about studying design methodology, psychology and prior art before showing products to users.

The drawback of blindly believing in data science is not knowing when to stop using Big Data and when to not use machine learning. These are disparate skill sets that do not need to go together, nor do they need to be used in every software system.

Data science will always give better results

One interesting phenomenon that is becoming increasingly common, especially in early-stage startups, is to cite machine learning as a solution.

It is not a solution; it is an approach, and it is the wrong approach if you don’t have data.

Have you ever tried speaking to Siri, Google Now or Cortana? The speech recognition engines that power these services are trained on terabytes upon terabytes of human voice data, and even then they are hardly perfect.

This gives you a sense of the type of accuracy (or lack thereof) you should expect from machine learning. Unless you have a large data set to train your system on, the results will be highly disappointing.

There are two takeaways from this. The first is philosophical: machine learning trains a program to imitate classification and prediction tasks that humans typically perform, so if you don’t know how to perform the task yourself, don’t expect a program to figure it out for you. The second is practical: don’t treat data science as a substitute for product development.

Data science will always improve products

Relating back to the previous point, data science is not an oracle that will teach you something completely novel. As in any science or art, your space of discovery is limited by your apparatus. If you’re running around with a monochrome lens, you won’t see blue, ever.

The same philosophical limitation exists in data science in the form of supervision. In layman’s terms, supervised learning means inferring solutions of a specific form from data. Similar limitations arise from variable selection or simply from data collection, which, in layman’s terms, is the choice of what to observe, record and study. For example, think of trying to predict someone’s height from the brand of cell phone they own, in the form of a linear equation. Regardless of what machine learning method you use, you probably won’t end up with useful predictions, because the data you collected has already limited you to two variables that have little to no link to each other.
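The height-and-phone-brand example can be sketched in a few lines of Python. This is only an illustration under loud assumptions: the data is synthetic, the brand names are hypothetical, and heights are deliberately generated independently of brand. The "linear equation" here is a least-squares fit on one-hot brand indicators, which for a single categorical predictor reduces to the mean height per brand.

```python
import random
from collections import defaultdict
from statistics import mean

random.seed(0)  # fixed seed so the sketch is reproducible

BRANDS = ["A", "B", "C", "D"]  # hypothetical phone brands

# Synthetic data (an assumption): height in cm is drawn independently of brand.
people = [(random.choice(BRANDS), random.gauss(170, 10)) for _ in range(2000)]


def fit_brand_means(rows):
    """Least-squares fit of height on one-hot brand indicators.

    With one categorical predictor, the least-squares solution is
    simply the mean height within each brand.
    """
    groups = defaultdict(list)
    for brand, height in rows:
        groups[brand].append(height)
    return {brand: mean(heights) for brand, heights in groups.items()}


def r_squared(rows, model):
    """Fraction of the variance in height explained by the brand model."""
    overall = mean(height for _, height in rows)
    ss_total = sum((height - overall) ** 2 for _, height in rows)
    ss_residual = sum((height - model[brand]) ** 2 for brand, height in rows)
    return 1 - ss_residual / ss_total


model = fit_brand_means(people)
r2 = r_squared(people, model)
print(f"R^2 = {r2:.4f}")  # close to zero: brand explains almost none of the variance
```

Swapping in a fancier model would not help here, since the signal is absent from the collected variables in the first place.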

You may think the example given here was silly, and it sure was, but it is not far from the reality of employing machine learning in an early-stage startup. Your data science results will only be as good as the team that is working on them. If your team does not have the industry acumen to look at the right data points and select the right methods to study the data, you are much better off not using data science.

In other words, data science is not a substitute for domain knowledge.

Big Data solutions are more advanced than those that are not

No, they are not.

A business solution is more advanced because it solves a problem more effectively or more efficiently, ideally both. As far as business problems are concerned, Big Data only deals with the implementation of a business solution, which in most cases does not change the business solution itself.

Big Data, as the name suggests, is a collection of technologies that help to distribute data storage and parallelize data processing in the wake of the modern-day data explosion. The traditional approach of simply beefing up a single server node just isn’t sufficient to handle high data volume and high data velocity. Instead of handling a couple of terabytes of data in total, we may now be handling terabytes of new data per day. Big Data, in layman’s terms, helps us distribute the workload across a whole network of servers.

With that said, even a simple data task like calculating the mean of a value may require Big Data if we need to calculate it over petabytes (i.e. thousands of terabytes) of data within a fixed time frame. Conversely, a complex linear algebra computation may not require Big Data if the data never reaches terabyte scale.
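To make the point about the mean concrete, here is a minimal sketch of how such a calculation is typically split across machines. The function names are hypothetical and the "nodes" are just list slices in one process; a real system would ship each chunk to a different server. Note that the business logic (a mean) is unchanged; only the implementation is distributed.

```python
def partial_stats(chunk):
    """Work done independently on each node: a partial sum and a count."""
    return sum(chunk), len(chunk)


def combine(stats):
    """Work done once at the end: merge the partial results into one mean."""
    total = sum(partial_sum for partial_sum, _ in stats)
    count = sum(partial_count for _, partial_count in stats)
    return total / count


values = list(range(1, 101))  # stand-in for a data set too large for one machine

# Split into uneven chunks (30, 30, 30, 10), as if stored on four nodes.
chunks = [values[i:i + 30] for i in range(0, len(values), 30)]

distributed_mean = combine([partial_stats(chunk) for chunk in chunks])
print(distributed_mean)  # 50.5, the same as sum(values) / len(values)
```

Each node returns a sum and a count rather than its own mean, because averaging the per-chunk means would give the wrong answer when chunks differ in size.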

There is no sense in which a business solution built on Big Data technologies is conceptually better, more advanced or more sophisticated.

In fact, Big Data technologies like Hadoop often come with a startup overhead that can introduce delays of up to minutes. Unless your data volume is high enough to justify the overhead, execution on Big Data technologies may even be slower than on a single machine.
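A back-of-the-envelope model shows where that overhead stops mattering. Everything here is an assumption made for illustration: the processing rate, the worker count and the 60-second startup overhead are hypothetical figures, not measurements of Hadoop or any other system, and the model assumes perfectly linear scaling across workers.

```python
def single_node_seconds(records, rate):
    """Time for one machine that processes `rate` records per second."""
    return records / rate


def cluster_seconds(records, rate, workers, startup_overhead):
    """Time for a cluster: fixed startup cost plus perfectly parallel work."""
    return startup_overhead + records / (rate * workers)


def break_even_records(rate, workers, startup_overhead):
    """Job size at which the single node and the cluster take equally long.

    Solves records / rate == startup_overhead + records / (rate * workers).
    """
    return startup_overhead * rate * workers / (workers - 1)


# Hypothetical parameters, chosen only to illustrate the trade-off.
RATE = 1_000_000      # records per second per machine
WORKERS = 10
OVERHEAD = 60.0       # seconds of job startup on the cluster

small_job = 1_000_000
large_job = 1_000_000_000

for job in (small_job, large_job):
    single = single_node_seconds(job, RATE)
    cluster = cluster_seconds(job, RATE, WORKERS, OVERHEAD)
    print(f"{job:>13,} records: single={single:.1f}s cluster={cluster:.1f}s")

print(f"break-even at about {break_even_records(RATE, WORKERS, OVERHEAD):,.0f} records")
```

Under these assumed numbers the single machine wins comfortably on the small job and loses on the large one; real clusters scale less than linearly, which pushes the break-even point further out.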

I can train a data scientist in 3 months

As suggested in previous sections, data science as a vague designation is actually a broad collection of areas of expertise, each of which requires deep industry-specific training and experience to cultivate.

Regardless of whether it’s Big Data or machine learning, speech recognition, natural language processing, DNA matching, causal inference and travel optimization are vastly different from one another in terms of data storage configuration, computing capacity provisioning and parallelization design.

In the past, database administrators, computer scientists, engineers and statisticians had to specialize according to their industry; in data science the norm is no different. Today, a data scientist who knows K-means and logistic regression is no more relevant than a statistician who has taken a survey class on data mining.

The amount of training required to make a person a good statistician, or a good database administrator, or a good computer scientist, is no different when you relabel the person as a data scientist.

So…

I am not here to say data science is not useful; it certainly is. I am a computer scientist by training, and I use machine learning and data storage technologies on a daily basis. But there are many scenarios in which a good algorithm or model is hands-down a better solution than a data-trained statistical model.

I know that it is simply a lot cooler to have a machine create a model from data, but it defeats the purpose of studying statistical trends, because if there is no uncertainty, probabilities just become pure logic. In other words, there is no point making a machine guess when you already know what a good approach is.

Finally, we just can’t reiterate this enough: employ data science for what it is good for (approximating human decision-making), and do not use it as a substitute for learning industry expertise and experience (algorithms and models).

Always remember that a data science system is only as good as the industry expertise and experience of the person who is designing it.