Common Sense Data Science

Data science is so hip today that everyone is doing it. Everyone is showing off their ‘data science muscles’ by how much data they can lift and in what amazing ways the lifting is done. However, only some are showing that they can work in a smart way, and even fewer are showing common sense.

Let’s take the problem of train delays as an example. If you are looking into ways to minimize train delays, you may notice that the main preventable cause of the delays is the malfunctioning of the doors. This leads to a reasonable business case: minimize delays by predicting failure and doing preventive maintenance.

The approach of many companies in implementing this business case would be to throw data science at the problem. But there are several levels of maturity for this scenario which I think should be highlighted since they can profoundly influence the results.

The first level would be represented by a company that hurriedly gathers all the data available from the train sensors and has its data scientists start crunching it and building a predictive model, without much questioning of the data provided to them.

For example, if the doors of the train have only a binary sensor which records nothing but “door open / door closed” signals, the feature engineering and the logic of the model will need quite some ‘muscle.’ A lot of data would be needed about train schedules, train delays, weather, etc., in order to try to predict a failure, and they would most likely end up with a model which still does not perform satisfactorily despite ingesting vast amounts of data. I would call this level pure-muscle-little-brain data science.
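To see how little a binary sensor gives you to work with, consider a sketch of the feature engineering it forces on you. This is a hypothetical illustration, not real train data: the event log, timestamps and the `close_durations` helper are all invented. About the only feature you can derive from open/closed events is timing.

```python
from datetime import datetime

# Toy event log from a hypothetical binary door sensor: (timestamp, state).
events = [
    ("2023-01-01 08:00:00", "open"),
    ("2023-01-01 08:00:12", "closed"),
    ("2023-01-01 08:30:00", "open"),
    ("2023-01-01 08:30:47", "closed"),  # an unusually slow cycle
]

def close_durations(events):
    """Seconds each open->closed cycle took: one of the few features
    a binary sensor permits. Everything else (schedules, weather, ...)
    must be joined in from external sources."""
    fmt = "%Y-%m-%d %H:%M:%S"
    durations = []
    opened = None
    for ts, state in events:
        t = datetime.strptime(ts, fmt)
        if state == "open":
            opened = t
        elif state == "closed" and opened is not None:
            durations.append((t - opened).total_seconds())
            opened = None
    return durations

print(close_durations(events))  # [12.0, 47.0]
```

Even this single feature says nothing about *why* a cycle was slow, which is exactly why the model needs so much auxiliary data to compensate.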

I call the next level pure-muscle-lots-of-brain data science, and it would be represented by a company where the data scientists would take the time to think about the data they are working with in relation to the domain and how the domain works.

Before ingesting, crunching and modelling vast amounts of train sensor data, a lot can be gained by spending some time understanding how the data provided is generated and whether it is relevant enough. If the data scientists first pinpoint the causes of door failure, they will be able to identify in more detail what kind of data they need in order to significantly increase the accuracy of their models.

For example, working with a dataset from a more sophisticated sensor which detects the pushes, pulls and vibrations of the doors would result in an almost trivial model, and its predictions would be more accurate than anything achievable by crunching any amount of data through any algorithm with any amount of muscle power.

The third level is, in my view, a collaborative data science. At this level of data science maturity, there can be a collaboration between vendors to avoid high costs and starting from square one. For example, not only train manufacturers but also bus manufacturers might be interested in sharing experiences and ideas about predictive maintenance of doors.

It makes sense that decisions made during the engineering development process can be leveraged later in the predictive maintenance process, and that a proper symbiosis and collaboration will optimize results. This is one way to collaborate.

Collaboration is beneficial among data scientists, too. Many of the data scientists I have talked to are afraid to ask other data scientists to exchange experience, ideas and even code, their primary concern being competitive advantage. But they shouldn’t be afraid.

Simple common sense should dictate that it is better to share experiences and thus build a much better base model than what can be done separately. This will in turn result not only in improvements of the training data sets but also in a variety of ideas and knowledge which could push all participants forward.

My hope is that we can have common sense and leap towards collaborative data science soon.

Commentary Competition

Enjoyed the topic? Have a relevant anecdote? Disagree with the author? Leave your two cents on this post in the comments below, and our favourite response will win a $50 Amazon gift card. The competition closes two weeks from the date of publication, and the winner will be announced in the next Simple Talk newsletter.

About the author

Feodor Georgiev

Feodor has a background of many years working with SQL Server and is now mainly focusing on data analytics, data science and R.

Over more than 15 years Feodor has worked on assignments involving database architecture, Microsoft SQL Server data platform, data model design, database design, integration solutions, business intelligence, reporting, as well as performance optimization and systems scalability.

In the past 3 years he has expanded his focus to coding in R for assignments relating to data analytics and data science.

Alongside his day-to-day work, he blogs, shares tips on forums and writes articles.