Storytellers of data, fortune tellers of the future
“What do data scientists, well, do?” Yukta’s blog discusses some of the most interesting and exciting facets of the data science field, and what is at the core of the data scientist profession.
Looking back on the last three years of my undergraduate journey, I cannot help but reflect on a particular question.
A question that has piqued my curiosity time and time again, and that has been asked by friends and family alike. A question that I have eventually had to ask myself, and that has prompted pondering and reflection for days on end.
"What do data scientists, well, do?"
It seems like a simple question, Google-able if you will. Yet upon searching, we are inundated with a sea of results that neither deliver the true definition of this up-and-coming occupation nor convey the true beauty of data science. The fact is, data science has long been misunderstood as a profession that deals with data in its primitive sense, accompanied by tools and algorithms that appear to us as a blur of software we believe to be at the crux of becoming a data scientist. And while web definitions don’t go much further in explaining the field beyond this, there is a lot more to the exciting world of data science.
The term data science was first coined by Peter Naur in 1974, in conjunction with his idea of datalogy, both dealing with the science of working with “data and data processes”. As such, data science as a concept is not even 50 years old, but it has its roots in a field that has existed for over 300 years - the field of statistics. Indeed, data science combines the time-tested theories and workings of statistics with our modern-day computing and data processing capabilities to bring the most meaning it can to ordinary datasets. By turning the once-manual calculations and models proposed for different statistical operations, whether it is hypothesis testing, discriminant analysis or regression, into programmable, tuneable algorithms that can run on computers in a matter of seconds, data science has broadened the horizons of the conservative world of statistics. So much so that we now have new statistical models derived entirely from computational capabilities, such as support vector machines. This has allowed for high-dimensional analyses, the sensemaking of ‘big data’ (a story for another day!), and revolutionary data insights.
As such, while most of us are aware of the idea that data science involves summarising and representing data, whether it is through pie charts or histograms, this is just a small part of the bigger data science operation - what we actually want to achieve is to understand why our data behaves like it does. We want to understand how our observations act, how our variables are connected in the real world, and how true, or ‘significant’, these connections really are.
Regression is the first place to start, relating one variable to another. How does a one-unit increase in sepal width affect sepal length? And how consistently does it do so? This is where the regression coefficients come into play, indicating the direction and magnitude of the relationship, and the coefficient of determination, indicating how much of the variation that relationship explains. Such analysis can be linear or non-linear, furnished with polynomial terms, or include more than one predictor variable, as in multiple regression.
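To make this concrete, here is a minimal sketch of a simple linear regression in plain Python, fitting sepal length against sepal width by least squares. The measurements are hypothetical values loosely inspired by the iris dataset, not real data:

```python
# Toy measurements (hypothetical values, loosely inspired by the iris dataset).
sepal_width  = [3.5, 3.0, 3.2, 3.1, 3.6, 3.9, 3.4, 3.4, 2.9, 3.1]
sepal_length = [5.1, 4.9, 4.7, 4.6, 5.0, 5.4, 4.6, 5.0, 4.4, 4.9]

def simple_linear_regression(x, y):
    """Least-squares fit of y = intercept + slope * x.

    Returns (intercept, slope, r_squared)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx                      # direction and magnitude
    intercept = mean_y - slope * mean_x
    # Coefficient of determination: share of the variance in y explained by x.
    ss_res = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    r_squared = 1 - ss_res / ss_tot
    return intercept, slope, r_squared

a, b, r2 = simple_linear_regression(sepal_width, sepal_length)
print(f"sepal_length ≈ {a:.2f} + {b:.2f} * sepal_width, R² = {r2:.2f}")
```

In practice a library such as statsmodels or scikit-learn would do this (and report significance), but the arithmetic underneath is exactly this.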
What about the case where we want to model a binary variable, such as ‘large’ and ‘small’ sepals? Cue classification. This involves using a host of different methods, whether it is logistic regression, random forest classifiers or K-nearest neighbour classifiers, to segregate the data based on Euclidean distances, the classes of similar points, hyperplanes, or other criteria.
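The K-nearest neighbour idea mentioned above can be sketched in a few lines: classify a new point by majority vote among the k training points closest to it in Euclidean distance. The (width, length) pairs and their ‘small’/‘large’ labels are made up for illustration:

```python
import math
from collections import Counter

def knn_predict(train_points, train_labels, query, k=3):
    """Classify `query` by majority vote among its k nearest
    training points, measured by Euclidean distance."""
    by_distance = sorted(
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    )
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]

# Hypothetical (sepal_width, sepal_length) pairs labelled by sepal size.
points = [(2.9, 4.4), (3.0, 4.6), (3.1, 4.7), (3.6, 5.2), (3.8, 5.5), (3.9, 5.4)]
labels = ["small", "small", "small", "large", "large", "large"]

print(knn_predict(points, labels, (3.7, 5.3)))  # prints "large"
```

Logistic regression or a random forest would carve up the same space very differently - by a fitted probability curve or by ensembles of decision boundaries - but the goal of assigning each point a class is the same.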
Then we have the odd circumstance of having plenty of predictor variables, but no response variable to model on! Here, we are said to conduct unsupervised learning, as we navigate through our data for interesting trends present in the variables themselves. We may want to run a cluster analysis to see if the observations can be grouped into several distinct, internally homogeneous clusters, or perhaps perform a principal component analysis to better understand if several underlying dimensions are at play.
As such, data science has never been about impassively applying algorithms to give meaning to an otherwise dull dataset - it is about holistically understanding what we are looking for, and what frame of mind, and thus what type of model, would best bring out the insights we seek.
Coming back to where we began - what do data scientists do? I believe we are statisticians at heart, and programmers by trade. Perhaps we are nothing but a passionate bunch of geeks working to better understand why things work the way they do. To understand how flights get delayed, to understand why COVID-19 cases spike, to understand when the sun comes out.
I suppose we are just storytellers of data in some sense - fortune tellers of the future in another.
Yukta studies BSc Data Science and Business Analytics in Malaysia.