The hubris of Big Data
“Data is a tool for enhancing intuition.”
- Hilary Mason, data scientist and founder of Fast Forward Labs
“If you wanna do data science, learn about cognitive biases, our alarming lack of statistical intuition, and how to correct for them.”
- Hugo Bowne-Anderson, Head of Data Science Evangelism and Marketing at Coiled
In February 2016 hackers issued thirty-five instructions via the Society for Worldwide Interbank Financial Telecommunication (SWIFT) network to transfer US$1 billion out of the account of a bank in South Asia. Thirty of the suspicious digital commands were blocked by the Federal Reserve Bank of New York, but five got through and were actioned. The fraud led to the successful transfer of US$101 million, most of which (US$81 million) ended up in the Philippines, and the rest (US$20 million) in Sri Lanka. A simple error by the hackers, a misspelling in one of the instructions, had alerted the New York bank to the scam. In the era of Big Data and cyber-security, heists don’t get much bigger, or more audacious, than the Bangladesh Bank Heist. The US Federal Bureau of Investigation believes that the government of North Korea was responsible, acting through its network of cyber criminals, which is known as the Lazarus Group and is famous within the international hacking community.
Unless we live in splendid isolation from the world of science and computing, perhaps as an undiscovered tribe in an impenetrable jungle in remote Papua New Guinea, most of us have heard the term ‘Big Data’. It’s the culmination of our advances in computing technology and the never-ending torrent of information provided by our digital activity. Big Data is the collection and analysis of very large treasure troves of assembled facts, figures, statistics and evidence.
For some, the unprecedented availability of such information heralds a golden age in which the judicious analysis of data will solve almost every imaginable human problem: from how best to target goods and services to consumers, to streamlining healthcare services, to predicting the occurrence and magnitude of earthquakes. Some claim that data has taken over from fossil fuels as society’s most important resource. Peter Sondergaard, Senior Vice President and Global Head of Research at Gartner, Inc., put it this way: “Information is the oil of the 21st century, and analytics is the combustion engine.” This landscape of infinite data may very well be the yellow brick road that leads us from data-impoverished Munchkin Country to the information-rich Emerald City.
For others, watchful of the infringements of personal freedom that access to such information might permit, and mindful of the data scammers, hucksters and shysters that are seemingly everywhere, we are heading for digital Armageddon: a digitally enhanced dystopia somewhere between Huxley’s Brave New World and the Big Brother of Orwell’s Nineteen Eighty-Four.
The idea of Big Data started to circulate in earnest some twenty years ago—around the time that Larry Page and Sergey Brin started work on the unheralded project that would become Google. Initially there was great excitement, and very little questioning about where it all might lead. However, a 1999 journal article by Steve Bryson and others in Communications of the ACM—one of the first to use the term ‘Big Data’—struck a chord of disharmony. It acknowledged the overwhelming nature of the mass of data available: “How will we utilize the seemingly out of control data that’s increasingly available?” the authors asked. “As more than one scientist has put it, it is just plain difficult to look at all the numbers. And as Richard W. Hamming, mathematician and pioneer computer scientist, pointed out, the purpose of computing is insight, not numbers”.
It’s not just the mass of data that is the problem, either. People have started to talk about the “three Vs” challenge: the volume of data, its variety and its velocity. So there is a lot of it, it comes in many different kinds, and it’s moving fast.
It’s only over the last decade that the ramifications of the analytical power provided by Big Data have really become clear. As in many new fields, there was considerable hype before people settled down to more sober reasoning. As early as 2008, Nature released a special issue dedicated to Big Data that explored potentially exciting applications as well as possible consequences. In the same year Wired’s then editor-in-chief, Chris Anderson, made the outrageous claim that the analysis of data would make traditional psychological and sociological research redundant: “This is a world where massive amounts of data and applied mathematics replace every other tool that might be brought to bear. Out with every theory of human behavior, from linguistics to sociology. Forget taxonomy, ontology, and psychology. Who knows why people do what they do? The point is they do it, and we can track and measure it with unprecedented fidelity. With enough data, the numbers speak for themselves.”
It’s easy to see why people get so enthusiastic. Big Data has everything covered: our genetic sequences, our electronic health and government records, our internet activity, our social networking, our location on the planet at any point in time, our search engine activity. Anything and everything that’s digitized is up for grabs, not just the occasional US$1 billion of Bangladeshi citizens’ savings. If we can tap into this information, we can change the face of almost every industry, and provide real-world solutions for almost every human predicament, banal and profound: from predicting where malaria has just spiked to working out how to pinpoint exoplanets, from planning where best to attack an enemy army to foreseeing the next pandemic (after we did so badly at predicting COVID-19), and on to working out what brand of peanut butter best suits consumers in David City, Butler County, Nebraska.
Already, most businesses use Big Data for targeted marketing. You will have searched for something on Google (luggage, say) and then found ads for airfares on your Facebook page. Or perhaps you haven’t used your gym membership for a while and, just when you’re thinking of cancelling, the gym contacts you, wondering if there’s anything they can do to help. This is Big Data in action, or more particularly the predictions it makes by assembling fragmented pieces of information about you. You might be happy about this. Or scared. Or both.
That’s not to say that there aren’t many Big Data success stories. Much of it is no longer “wow”, but just routine business and government practice. UPS, the package delivery company, has sensors in its vehicles to collect operational data, ensuring that any maintenance required is done preemptively. Many governments rapidly designed and implemented test-and-trace systems soon after the COVID-19 pandemic began (some better than others, but that’s a different story). They know where you have been and whom you might have infected or been infected by, and can thereby mitigate the contagion.
But Big Data isn’t just being used for business and government purposes. It’s also making improvements in areas like personalized healthcare. There are new possibilities for fresh understanding to be gleaned from electronic health record data. What’s the profile of patients most likely to be readmitted to hospital? If we know that, we can target those patients, and it’s easier to take preventative action. Which patterns in your test results might point to a rare disease, patterns that are apparent to an Artificial Intelligence (AI) algorithm filled to the brim with prior cases but not to most doctors acting alone? And which patients recover best from their breast cancer treatments, and what type of breast cancer care (chemotherapy, radiotherapy, surgery, or just watchful waiting) is most effective? This is more than invaluable information for future planning and clinical practice. It’s life-giving for the patients involved.
It all sounds as if it will produce breakthrough after breakthrough, but Big Data is not without its critics. As New York University psychology professor Gary Marcus and computer science professor Ernest Davis pointed out in a New York Times piece, there are a number of reasons (nine, apparently!) why we shouldn’t jump on the Big Data bandwagon quite so hastily.
Big Data can reveal a lot, but it can also provide misleading results. It cannot, for example, always accurately show causal relationships: that one thing determines another. Despite Chris Anderson’s claims, only expert analysis, experiments and clinical trials can do that. Unmediated data can also encourage apophenia: seeing meaningful connections where none exist. Just because two things appear together in a database doesn’t mean they have a relationship.
For example, a simple crunching of numbers might indicate that the incidence of both university applications and traffic offenses dropped in 2010. A pattern might be assumed. A conclusion might be drawn that one has led to the other. So we could plausibly, but utterly erroneously, extrapolate from this that fewer submissions for university entry mean fewer students on the roads attending interviews, and hence fewer speeding tickets issued. But Big Data cannot tell us if those things are connected, if they are causal or even if they correlate at all. The numbers can only ever be relied upon to tell us what, not why.
So many false conclusions have resulted from the faulty application of data that there’s a website (and book) dedicated to revealing the most absurd of these spurious correlations: (http://www.tylervigen.com/spurious-correlations). For instance, were you aware that Per Capita Cheese Consumption has a 94.7% correlation with The Number of People Who Died by Becoming Tangled in their Bed Sheets? Or that the Age of Miss America has an 87.0% correlation with Murders by Steam, Hot Vapors and Hot Objects? And who knew that the Letters in Winning Words in Scripps US National Spelling Bee has an 80.6% correlation with the Number of People Killed by Venomous Spiders? Statistical correlations don’t necessarily add up to anything much, and we need to be careful to distinguish, as Nate Silver, one of the best-known predictors, puts it, just what is signal and what is noise.
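A small simulation makes the point about signal and noise concrete. The Python sketch below is illustrative only and is not drawn from the website above: it generates a couple of hundred unrelated random series and reports the strongest correlation that turns up purely by chance when one of them is compared against all the others. The number of series, their length, the random-walk model and the function names are all assumptions chosen for the example.

```python
import random
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def random_walk(rng, length):
    """A series that drifts at random, like many yearly statistics do."""
    level = 0.0
    walk = []
    for _ in range(length):
        level += rng.gauss(0, 1)
        walk.append(level)
    return walk


def strongest_chance_correlation(n_series=200, length=10, seed=0):
    """Compare one random series against many others, all unrelated by
    construction, and return the largest absolute correlation found."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    target = random_walk(rng, length)
    candidates = [random_walk(rng, length) for _ in range(n_series)]
    return max(abs(pearson(target, c)) for c in candidates)


print(f"Strongest correlation found by chance: {strongest_chance_correlation():.3f}")
```

Because every series here is random, any strong correlation the search reports is noise by construction. The more series one trawls through, the more impressive the best accidental match tends to look.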
The results of Big Data can be skewed deliberately, not just unintentionally. Google results, for example, are increasingly being manipulated, and not just by Google insiders. (Google denies any “human curation”, arguing that its algorithms do all the work, but this misses the point that people create the algorithms in the first place.) Take just a couple of examples. There are highly lucrative services dedicated to changing what Google shows about an individual or a business. Many people want to hide negative online content that would otherwise appear on the first page of results. These companies post favorable stories until the bad news is displaced and appears far lower down. As almost no one looks beyond the second page of Google results, the negative results are rendered virtually invisible.
And who knew that false stories spread at scale on Twitter and Facebook could change the course of entire elections? (The 2016 US election, which put Donald Trump into the White House and Hillary Clinton into political oblivion, and Cambridge Analytica’s role in that outcome, have still to be fully understood.) A more current problem is the amplification of vaccine hesitancy during the biggest pandemic in 100 years. When you ask their opinion, people seem to be robust in their beliefs, but we have learned with the onset of Big Data that many people hold views that are fragile, and that they are easily led astray or, worse, downright manipulated. Indeed, an entire field, marketing, is set up for this purpose.
But there’s more. Wrong data and misleading use of it can become self-reinforcing, in what Marcus and Davis describe as an echo-chamber effect. For instance, Wikipedia uses Google Translate to render an entry in another language, but Google in turn draws information from Wikipedia. This creates a feedback loop, so any errors in Google Translate or Wikipedia are perpetuated. And people who subscribe to conspiracy theories (say, that Democrats are part of a nefarious pedophile ring trafficking children and are in any case out to destroy the world) are more likely than others to be vaccine hesitant, believe in UFOs, and think climate change is a hoax. Narratives are all too often conflated into a cluster of beliefs that people in that echo chamber mutually reinforce, bolstered by more and more “Big Data” opinions on Twitter, Facebook, YouTube, Instagram and many other social media platforms. It’s called motivated reasoning: your personal views, egged on by others, muddy the realities of those kinds of topics.
The raw volume of data available can also induce a form of data paralysis in everything from business to decisions about personal relationships. Far from promoting judicious, fact-based decisions, too much information can create doubt and uncertainty. The sheer logistics of referring to or trying to interpret massive databases can also lengthen the time to a decision and discourage essential intuition. And the volume is massive: some 2.4 million emails are sent around the world every second. That’s about 74 trillion every year. My joke: now you know why you fall behind on your email at just the point you get it up to date.
More astonishing, the world’s fastest computer can perform 42,010 teraflops. A teraflop is one trillion operations per second, so that figure measures how many operations the machine processes each second. (It’s a Japanese supercomputer, but lots of groups are working on making their supercomputers faster, so someone will outperform this in a year or less.) If you find this hard to fathom, think about it this way: if you performed one calculation every second, you would take 31,688 years to match what a one-teraflop computer does in a single second. Now do the math and multiply that by 42,010.
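The arithmetic behind that comparison can be checked in a few lines. The Python sketch below assumes a 365.25-day year and a person performing one calculation per second without rest; the function name years_to_match is hypothetical, introduced only for this illustration.

```python
# Assumption: an average year of 365.25 days, i.e. 31,557,600 seconds.
SECONDS_PER_YEAR = 365.25 * 24 * 60 * 60

# One teraflop is one trillion operations per second.
OPS_PER_TERAFLOP_SECOND = 10**12


def years_to_match(teraflops, hand_ops_per_second=1.0):
    """Years of non-stop hand calculation needed to equal the work a
    machine rated at `teraflops` completes in a single second."""
    machine_ops_in_one_second = teraflops * OPS_PER_TERAFLOP_SECOND
    seconds_by_hand = machine_ops_in_one_second / hand_ops_per_second
    return seconds_by_hand / SECONDS_PER_YEAR


print(f"{years_to_match(1):,.0f} years")       # 31,688 years
print(f"{years_to_match(42_010):,.0f} years")  # roughly 1.3 billion years
```

On these assumptions, one second of work by the machine quoted above corresponds to roughly 1.3 billion years of hand calculation.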
The list of ways in which Big Data can be misleading, transformed into misinformation, easily manipulated or even just misinterpreted by the reader or viewer is consequently long and troublesome. Along with questions about its subjectivity and its effect on what we “know” and how we know it (what philosophers call epistemology), there are also huge questions about the ethics, security and ownership of Big Data. Even if data is being used responsibly, should people be informed that their personal information is being used for research, or to influence their behaviors? What are the ethics surrounding online data collection? How can informed consent for the use of online data even be obtained? What is the cost of data to the consumer and taxpayer: who should fund this, and who has, or should have, access? There’s an adage from informatics: if you are not paying for it, you’re the product.
As National Security Agency (NSA) leaks and cyber-attacks on governments, businesses and individuals illustrate, the related civil rights implications of Big Data (mis)use are even more complex. What are the consequences of having so much of our personal information (phone records, internet searches, shopping history, social and professional networks) available to be mined by corporations and governments? Is that information safe? Is human privacy a right or a necessity, and is it something we’re willing to compromise for greater connection and security? What happens when a less than benign force is able to access with ease, and hold to ransom, an individual’s digital data? What about our humble bank accounts? If they are not secure, what is? It’s not just the Bangladesh Bank that is at risk.
Big Data is never going away, nor would anyone want it to, but it’s important to remember that bigger isn’t necessarily better. Problem solving isn’t just about numbers. Intelligence and ingenuity, in short human agency, are still required. And the attendant dangers of the availability and sheer volume of digital data need to be recognized. As Marcus and Davis say: “Big data is here to stay, as it should be. But let’s be realistic: it’s an important resource for anyone analyzing data, not a silver bullet.” And I say: it comes with great benefits, but massive risk to society and citizens. Who provides the safety-net protection for us all?
Adams, Susan (2013). 6 Steps to managing your online reputation. Forbes. March 14. http://www.forbes.com/sites/susanadams/2013/03/14/6-steps-to-managing-your-online-reputation/
Anderson, Chris (2008). The end of theory: the data deluge makes the scientific method obsolete. Wired. http://archive.wired.com/science/discoveries/magazine/16-07/pb_theory/
Anonymous (2008). Editorial. Nature 455 (1).
Anonymous (2008). Nature special issue. Nature 455 (7209): 1–136. http://www.nature.com/nature/journal/v455/n7209/
Anonymous (2013). Big Brother in the era of Big Data. Studio 360. June 21. http://www.studio360.org/story/299415-big-brother-in-the-era-of-big-data/
British Broadcasting Corporation (BBC) (2021). The Lazarus Hei$t. Podcast. https://www.bbc.co.uk/programmes/w13xtvg9
Boyd, Danah, Crawford, Kate (2012). Critical questions for Big Data. Information, Communication & Society 15 (5): 662–679.
Bryson, Steve, Kenwright, David, Cox, Michael, Ellsworth, David, Haimes, Robert (1999). Visually exploring gigabyte data sets in real time. Communications of the ACM 42 (8): 83–90.
Crawford, Kate (2014). When Big Data marketing becomes stalking. Scientific American, 310 (4). http://www.scientificamerican.com/article/when-big-data-marketing-becomes-stalking1/
Destiche, Aurielle (2010). Fleet tracking devices will be installed in 22,000 UPS trucks to cut costs and improve driver efficiency in 2010. FieldLogix. July. http://www.fieldtechnologies.com/gps-tracking-systems-installed-in-ups-trucks-driver-efficiency/
Glaser, John (2014). Solving Big Problems with Big Data. H&HN Daily. September 12. http://www.hhnmag.com/Daily/2014/Dec/Big-data-initiatives-solve-healthcare-issues-article-glaser