Big Data Is Dead! Long Live Big Data!

Henry Wang

In the mid-19th century, cholera was rampant in the streets of London. An English physician named John Snow plotted the positions of cholera outbreaks, and his work led authorities to remove the pump handle of a well on Broad Street. Snow's meticulous spatial study of the outbreaks was credited with drawing the trail to that well.

Though there has been a great deal of recent media interest in "Big Data", data analysis is nothing new, and "big" is relative only to specific reference points. An oft-quoted (albeit now antiquated) statistic is that more data was transmitted in 2010 than in all previous years combined [i]. This exponential growth is indeed staggering when compared with the past, but the view that the data we are collecting is inordinately "big" is neither an imaginative nor a productive way of thinking about the possibilities that lie ahead. So while there has been much hype about big data, I will seek to address the real value of the data we are collecting today. The narrative is really about a paradigm shift towards data analysis that is happening across a variety of fields.

The Rise of Mobile Sensors

It is important to frame our thinking with the origins of this new data-driven paradigm. Today, the mobile phone is ubiquitous. The proliferation of cell phone technology has produced a massively distributed network of mobile sensors, collecting metadata such as location, time and communications. The increasing volume and variety of data collected via distributed sensor networks adds more granularity to our view of the world.

Take the humble map, a technology whose origins go back over two thousand years. For much of the last few hundred years, man has attempted to better catalog the physical features of our planet, reaching a zenith with the proliferation of satellite technology. Yet the information displayed on these maps was still largely static; mountains, after all, are not likely to change much over the course of weeks. What was missing was information about the dynamic systems that exist on top of these static features. Google Maps today displays not only static physical features but also social elements within human societies, such as user check-ins or ratings of restaurants. These esoteric forms of data, which reveal some structure in complex social systems, would have been completely impossible to capture without the mobile sensor networks out there today. The capture of greater volumes and variety of data thus forms the foundation for a new approach to understanding complex systems.

Complex Systems & Incomplete Understanding

In the 1960s, artificial intelligence (AI) was seen as a very real technology that would come of age in the near future. The initial approach to AI was to deterministically program instructions. While deterministic methods worked reasonably well for chess-playing machines, they lacked the ability to discern context in real-world decision making. Take, for example, an algorithm instructing a robot to pick an apple off a tree: our robot would most definitely end up destroyed should there exist a valley between itself and the tree.
The lack of context in decision making was an early stumbling block for the deterministic approach to AI, but these efforts were also emblematic of a wider problem: an incomplete understanding of the laws governing very complex systems, in fields ranging from political science and consumer behavior to economics. The key breakthrough for AI was the shift to statistical algorithms that sought to replicate outputs from inputs, an acknowledgement that we have an incomplete understanding of the immense complexity of the human brain. This replication was done by using algorithms to mimic the learning process undertaken by the brain. Such learning can be broken down into supervised and unsupervised learning. In supervised learning, one can think of the process as teaching a child to recognize objects from labeled pictures; the child can then be tested on a new set and iteratively improve its classification accuracy. In unsupervised learning, the process is analogous to arranging a field of unlabeled objects into groups of greatest similarity. Both of these new ways of thinking about AI were predicated on the availability of data to train an algorithm for better out-of-sample performance.

The success of AI methods in prediction (Amazon, Facebook, Target) has drawn a lot of attention, and there are many more fields facing similarly complex systems, where hidden interactions between model inputs are hard to account for in a static model. In predicting localized solar energy, for example, a variety of factors ranging from altitude, cloud coverage and pollution to solar insolation come into play in determining the realized output of a solar installation at ground level. One could attempt to develop deterministic models based on physics, but the complex interactions between model variables would make for an extremely complicated explicit model. The benefit of data-driven statistical models is that they can be trained to give better out-of-sample performance as more data is collected; more importantly, they are also polymorphic, and can adjust model parameters as new data picks up changes in the underlying system (a sketch of this contrast follows below).

Correlation vs. Causation

Cum hoc ergo propter hoc (correlation does not imply causation) is a familiar refrain for students being introduced to statistics, and this age-old caution will only gain relevance as data-driven techniques spread. While machine learning and data-driven methods hold immense promise for the social sciences, there is a key distinction between the computer-science and statistics driven field of machine learning and the more formal methods used in the social sciences. The core emphasis of machine learning is on mimicking outputs from past data; its focus is prediction, whether of continuous values in regression or discrete labeled classes in classification. Much of the new technology developed in machine learning and statistics, such as neural networks or random forests, is highly flexible in its use but more opaque in its operation, making for less interpretability. Contrast this with econometrics or formal methods in political science, where the mathematical technology in use is still relatively old (least squares regression) compared with cutting-edge statistical methods.
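To make this contrast concrete, here is a minimal sketch, not drawn from the article itself: the feature names and the data-generating process are invented for illustration, and scikit-learn's LinearRegression and RandomForestRegressor stand in for the classical and flexible approaches discussed above. The forest is simply refit once "new" observations arrive, mimicking a model whose parameters adjust as the underlying system drifts.

```python
# Illustrative sketch only: a classical least-squares model vs. a flexible
# random forest on simulated "localized solar output" data. The feature
# names and the data-generating process are hypothetical assumptions.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

def simulate(n):
    """Simulate solar output with interactions between its drivers."""
    altitude = rng.uniform(0, 2000, n)        # metres
    cloud = rng.uniform(0, 1, n)              # fraction of sky covered
    pollution = rng.uniform(0, 1, n)          # arbitrary haze index
    insolation = rng.uniform(200, 1000, n)    # W/m^2 at top of atmosphere
    # Nonlinear interactions: clouds and haze attenuate insolation,
    # while thinner air at altitude attenuates it less.
    output = (insolation * (1 - 0.7 * cloud) * (1 - 0.3 * pollution)
              * (1 + 0.0001 * altitude) + rng.normal(0, 20, n))
    X = np.column_stack([altitude, cloud, pollution, insolation])
    return X, output

X, y = simulate(2000)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Interpretable but rigid: ordinary least squares.
ols = LinearRegression().fit(X_train, y_train)
# Flexible but opaque: the forest captures the interactions automatically.
forest = RandomForestRegressor(n_estimators=200, random_state=0).fit(X_train, y_train)

print("OLS out-of-sample R^2:   ", round(r2_score(y_test, ols.predict(X_test)), 3))
print("Forest out-of-sample R^2:", round(r2_score(y_test, forest.predict(X_test)), 3))

# As new observations arrive, the flexible model is simply refit so that its
# parameters track any drift in the underlying system.
X_new, y_new = simulate(500)
forest.fit(np.vstack([X_train, X_new]), np.concatenate([y_train, y_new]))
```

In this toy setting the forest tends to score better out of sample because the true relationship involves interactions the linear model cannot express, which mirrors the trade-off between interpretability and predictive flexibility described above.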
Fields such as macroeconomics or politics have traditionally had a very direct tie-in with policy-makers, so it is understandable that much of the emphasis in these fields is on unearthing narratives and causation from data. The variety of data now being collected on social systems holds immense promise for fresh thinking about old problems in the social sciences. The array of data out there, together with new statistical technologies developed to deal with high dimensionality, means that one can get very creative in thinking about which features contribute to a prediction. But for this to happen, there has to be a mindset shift: from the strong desire to seek causality and narrative to an acceptance of mere correlation in the new age of high-dimensional data.

It's Not About the Size of the Data, It's About What You Do With It

New forms of data are all around us and will continue to be generated as sensor technologies penetrate more applications. Looking past the hype surrounding big data, the emphasis should not be on its size, which is very much a matter of the frame of reference. The key benefit of this new data is our ability to better understand and make predictions for extremely complex and dynamic systems. With the proliferation of machine learning methods into other fields, this could be an extremely exciting time to reinvigorate the social sciences with a new approach to modelling complex dynamic systems.

References: 

[i] Kirk Skaugen. “More Data Transmitted Over The Internet In 2010 Than All Years Combined”. The Future of Retail. October 24, 2011. http://www.psfk.com/2011/10/more-data-transmitted-over-the-internet-in-2...