Big data is undoubtedly one of the big topics of the moment as society leaves bigger data trails than ever before. The efficiency with which we can capture information has also improved significantly, which allows companies to build ever more accurate customer profiles. One of the big challenges is to separate the wheat from the chaff in order to derive useful information – data for its own sake is not all that meaningful, especially in a world of Twitter and Facebook. However, big data promises to produce significant breakthroughs in artificial intelligence, although we should bear in mind that whilst sample sizes have increased, sample bias remains an issue.
As computing technology improves and the cost of digital storage declines, we are now able to generate and process quantities of data which were unimaginable just a few years ago. In 2005, there were around 1 billion global Internet users; today there are estimated to be almost 3.5 billion. Just as important, there has been a significant switch from the one-way flow of traffic from suppliers to consumers, which characterised the early years of Internet use, to a more interactive medium. Today, users send around 6,000 tweets, make 40,000 Google searches and send 2 million emails every second. The volume of text on the Internet is estimated at 1.1 zettabytes, which is approximately 305.5 billion pages of A4 paper, and is projected to rise to 2 zettabytes by 2019 (more than 550 billion sheets). And that is without the pictures! To take another example, the Large Hadron Collider generates 15 petabytes of data each year, equivalent to around 15,000 years of digital music.
Clearly, a lot of this information does not yield any valuable insight, but given the vast amount of available information even a small fraction of it is still too much for humans to reasonably digest. Even if we only require 0.5% of the information stored online, we would still need 1.5 billion sheets of A4. The problem is compounded by the fact that we do not necessarily know what is useful information and what can safely be discarded, so we have to scan far more than we require in order to sift out the good stuff. As a result, much progress has been made in recent years to devise methods of scanning large data sets in order to search for relevant information. Welcome to the world of big data.
Big data is often characterised by the so-called 3V principle: volume, variety and velocity. Each of these dimensions defines a characteristic which determines its information content, thus enabling those collecting it to decide how to proceed. Indeed, it is one thing to collect huge volumes of data but what matters to businesses is what they do with it.
Volume simply refers to the amount of available information – the more there is, the more time has to be devoted to extracting useful information. Variety refers to the number of different types of data, which increasingly comes in a wide array of formats (e.g. free-form text, spreadsheets, audio, video), and as variety increases so we have to apply different methods of processing the information. Finally, velocity is a measure of the speed with which data is collected and processed, which differs according to need (e.g. selling real-time advertising via smartphones, which depends on the location of the customer, requires exceptionally rapid input and output). Two other ‘V’s which make life difficult for data scientists are veracity (is it telling us what we think it is?) and variability (data inconsistencies make it difficult to derive processes to handle it).
Where does all this data come from? Some of it is merely the transcription of existing data into an electronic form that makes it more accessible. Wikipedia, for example, has helped to democratise knowledge in a way that was previously impossible. But a lot of it has come into being as a result of technological developments which allow the capture of much greater volumes of information. This has been facilitated by the rise of cloud computing which allows users to store, manage and process vast amounts of information in a network of remote servers (ironically, this is a reversal of the trend of recent decades which saw a shift from centralised towards local data storage). Perhaps even more important, the rise of social media such as Twitter and Facebook has vastly increased the volume of information pumped out (not to mention the rise of microblogging sites in China such as Tencent or Sina Weibo).
As a result, our daily activities mean that we leave an increasingly large data trail behind us. This is nothing new, but what is new is the ability to capture and store it. Supermarkets started issuing loyalty cards in the early 1980s in a bid to retain the custom of shoppers, but it was only in the mid 1990s that they began to collect large quantities of information on customers in return for small discounts. Information is distilled from data, and the more information we have, the more knowledge we accumulate. To the extent that knowledge is power, it stands to reason that those with the data in the digital age are those with the power. This raises a big question of how much control we should be prepared to give up, and there are legal issues about who owns the information that most of us have until now simply given away for free.
Consider, for example, the position of Amazon and Google which, because they are relatively new tech companies, started collecting data on their clients from the start which has given them a dominant position in their respective markets. On the one hand this allows them to target their services more effectively. But on the other, they are open to criticism that they subtly manipulate the behaviour of the consumers by persuading them to consider the purchase of other products (Amazon) or guiding them to the websites of paying customers in order to capture more revenue-generating clicks (Google).
Anyone who has ever bought anything from Amazon will know that before they check out, they are reminded that customers also bought other items. One of the insights from behavioural economics is that this sort of subtle nudge is more effective in changing consumer behaviour than a loud message imploring customers to buy. For a company which has a huge database profiling its customers, devising the right sort of nudge is relatively easy. Each customer has a unique profile in which their purchase history is tracked. In principle, it is straightforward to construct an aggregate measure which correlates customers and purchases. But Amazon sold 5 billion items in 2014 alone, which vastly increases the scale of the problem – and also the cost.
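To make the idea concrete, the "customers also bought" measure can be sketched as a simple co-purchase count. This is an illustrative toy, not Amazon's actual method: the baskets below are hypothetical data, and a real system would use far more sophisticated scoring at vastly greater scale.

```python
from collections import Counter
from itertools import combinations

# Hypothetical purchase baskets (illustrative data only)
baskets = [
    {"book", "kettle"},
    {"book", "bookmark"},
    {"book", "bookmark", "lamp"},
    {"kettle", "mug"},
]

# Count how often each pair of products appears in the same basket
pair_counts = Counter()
for basket in baskets:
    for pair in combinations(sorted(basket), 2):
        pair_counts[pair] += 1

def recommend(product, top_n=3):
    """Return the items most often co-purchased with `product`."""
    scores = Counter()
    for (a, b), n in pair_counts.items():
        if a == product:
            scores[b] += n
        elif b == product:
            scores[a] += n
    return [item for item, _ in scores.most_common(top_n)]

print(recommend("book"))  # 'bookmark' co-occurs twice, so it ranks first
```

The cost point in the text follows directly: the number of candidate pairs grows roughly with the square of the catalogue size, which is why brute-force counting becomes expensive at Amazon's scale.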
However, Amazon’s genius over the years has been to persuade people to buy things that they never knew they needed. But in addition to any impact which this might have in generating additional sales, such customer profiling also helps Amazon to optimise its inventory management. If there is a strong correlation between the sales of various products, the company will have a fair idea of how demand for one product is likely to impact on the inventories of others. In this way, Amazon can use its profiling techniques to manage its inventory levels more efficiently and thereby keep down costs. Given the size and complexity of the company’s offerings, it is proving increasingly difficult to devise such matching techniques using human brain power alone. Increasingly, machine learning algorithms will be employed to mine these huge databases to optimise business processes.
Indeed, big data is proving to be an ideal platform on which machine learning can be perfected. Like humans, algorithms learn by doing, but computers are less adept than humans at learning from a small number of observations, so they currently require huge volumes of information to reach the same end result. A human brain contains around 100 billion neurons, each of which is connected to another 10,000 neurons, which means that there are up to one quadrillion synapses (neural connections). Computers do not have that sort of processing power, but they can use big data sets more efficiently, which allows them to detect patterns by churning through data very quickly.
Ultimately, such algorithms can only derive results on a probabilistic basis. But if we think of probability as representing a fixed long-run relative frequency, in which the likelihood of an event emerges as a ratio from an infinitely large sample size, then the more data we have, the more likely it is that the observed result is close to the true answer. You may have noticed over recent months a huge improvement in the quality of output from Google Translate. This is the result of a project in which artificial intelligence (AI) algorithms were let loose on a huge quantity of data in order to improve the predictive ability of the language generator. Whilst big data could indeed be the facilitator which permits the development of AI on the scale which was originally imagined back in the 1950s, some degree of caution is warranted. Even big data sets can contain biases which mean that the sample mean is not an unbiased estimate.
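The long-run frequency argument can be illustrated with a short simulation. Assuming a known "true" probability of 0.3 (an arbitrary choice for illustration), the observed frequency drifts towards the true value as the sample grows:

```python
import random

random.seed(42)  # fixed seed so the run is reproducible
true_p = 0.3     # the "true" long-run frequency we are trying to recover

# Estimate the probability from samples of increasing size: the larger
# the sample, the closer the observed frequency tends to sit to true_p.
for n in (100, 10_000, 1_000_000):
    hits = sum(random.random() < true_p for _ in range(n))
    print(n, hits / n)
```

Note the caveat in the text still applies: this convergence assumes the draws are representative. If the sampling process itself is biased, more data simply converges more confidently to the wrong answer.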
One area where the bias issue should be treated with caution is analysis of the huge data sets provided by Facebook and Twitter. These platforms contain huge amounts of text information which can be extracted at relatively little cost and has given rise to a whole new industry devoted to tracking the information content from social media platforms. However, we need to bear in mind that whilst social media are often a useful barometer of public opinion, they are biased towards younger, urban-dwelling, high-income users. Relying on tweets without accounting for this bias risks repeating the classic mistake made when trying to predict the US presidential results in 1936 and 1948, when the polling samples were skewed by the inclusion of those picked at random from the phone book, at a time when telephone penetration was low.
Nonetheless, when the US president uses Twitter as one of his primary communication platforms, we need to pay more attention to the information it contains. To demonstrate how easy this analysis is, we scraped 10,000 Twitter feeds through to early July which contain the president’s Twitter handle (@realDonaldTrump). Note these are not all tweets from the president: the exercise is designed to measure public opinion regarding Mr Trump, so we are interested in people’s responses. We then ran the results through the Hu and Liu Opinion Lexicon of 6,800 positive and negative words.1 On the basis of this simplistic analysis, we calculate that negative tweets outweigh positive ones by more than 2 to 1 with roughly one third neutral (Chart 1).
Based on a webscrape of 10,000 tweets
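The mechanics of lexicon-based scoring are straightforward, which is part of its appeal. The sketch below is a minimal illustration of the approach rather than our actual pipeline: the word lists are tiny stand-ins for the full 6,800-word Hu and Liu lexicon, and the example tweets are invented.

```python
import re

# Tiny illustrative stand-ins for the full Hu and Liu Opinion Lexicon
POSITIVE = {"great", "win", "strong", "good"}
NEGATIVE = {"bad", "sad", "weak", "disaster"}

def classify(tweet):
    """Label a tweet positive, negative or neutral by lexicon word counts."""
    words = re.findall(r"[a-z']+", tweet.lower())  # strip punctuation
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

tweets = ["What a great day", "Total disaster, so sad", "Just had lunch"]
print([classify(t) for t in tweets])  # ['positive', 'negative', 'neutral']
```

The 2-to-1 negative ratio reported above comes from applying exactly this kind of count at scale; the weakness, of course, is that bag-of-words scoring misses negation and sarcasm, which is why we describe the analysis as simplistic.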
Companies are interested in social media platforms for similar reasons: they want to maximise the number of ‘likes’ they receive on Twitter and Facebook on the basis that their customers are active social media users. The theory is that raising the number of ‘likes’ enhances brand recognition and thereby generates additional sales. This raises the obvious question of whether it is possible to mine the information content to determine if particular words raise the number of ‘likes’. Facebook founder Mark Zuckerberg apparently urged his developers to find ways to rank posts so that important content, such as significant life events, shows up higher in an individual’s feed. The solution was to use ‘congratulations’ as a trigger word when used in the comments. Experiments by savvy Facebook users have certainly shown that certain words do appear to boost ‘likes’. This raises the incentive for companies to conduct extensive data analysis to assess the predictive content of keywords for the number of Facebook ‘likes’. It may (literally) pay dividends.
Another application of textual analysis which has found widespread application in the academic literature is the webscraping of newspaper websites to look for words associated with economic policy uncertainty (EPU). Analysis of the US data shows, for example, that sharp increases in the EPU Index foreshadow a decline in investment, output and employment. Chart 2 shows a global index. We do see a jump in uncertainty around the time of 9/11 and again in 2008 following the Lehman’s collapse. But on a global basis, the index suggests that the rise of President Trump, aided and abetted by the Brexit result, prompted an unprecedented rise in the index in 2016. Whilst it is interesting that markets have remained broadly stable, with measures of risk remaining surprisingly low, we should continue to keep an eye on EPU indices: they may be a reflection of that which has hitherto been unmeasurable – uncertainty.
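An EPU-style index is built by flagging articles whose text combines terms about the economy, policy and uncertainty. The sketch below captures the spirit of that keyword approach; the term lists and articles are illustrative assumptions, not the published index methodology in full.

```python
import re

# Illustrative keyword sets: an article is flagged only if it contains at
# least one term from EACH of the three categories.
ECONOMY = {"economy", "economic"}
POLICY = {"regulation", "congress", "legislation", "deficit", "central bank"}
UNCERTAINTY = {"uncertain", "uncertainty"}

def is_epu_article(text):
    """Flag an article as reflecting economic policy uncertainty."""
    text_l = text.lower()
    words = set(re.findall(r"[a-z]+", text_l))
    return (any(t in words for t in ECONOMY)
            and any(t in text_l for t in POLICY)   # substring: allows phrases
            and any(t in words for t in UNCERTAINTY))

articles = [
    "Economic outlook uncertain as congress debates the deficit",
    "Local team wins the cup final",
]
share = sum(map(is_epu_article, articles)) / len(articles)
print(share)  # 0.5: one of the two articles is flagged
```

The index itself is then the share of flagged articles over time, scaled so that movements can be compared across periods (hence the standard-deviation units in Chart 2).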
Standard deviations from average of 1997 to 2016
Much of the work of the investment advice industry is not disseminated via Facebook or Twitter, and tends to be communicated in the form of longer printed documents or PDFs. In a bid to demonstrate the themes which our analysis has focused on, we took the past eight years of analysis contained in the ‘Thinking Ahead’ magazine and ran it through a series of statistical programmes. On our calculations, after stripping out common words such as articles (the/a/an) and conjunctions (and/for/but, etc.), the magazine has produced roughly 1.3 million words. We show the main results in the form of a word cloud (Chart 3), in which the three most prominent words are ‘market’ and ‘risk’, followed by ‘investor’. Of the roughly 34,000 individual words identified in this analysis, a relatively small number of them occur very frequently: just 18 words cover the top 10% of all those used. What is particularly notable is that despite the wide variety of themes which we have covered in this magazine over the years, our focus has remained on the issue which matters most for markets: generating investor return whilst balancing the risks.
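The word-cloud exercise reduces to tokenising the text, dropping common stop words and counting what remains. A minimal sketch follows; the stop-word list is a small illustrative subset of the one we actually used, and the sample sentence is invented for demonstration.

```python
import re
from collections import Counter

# Small illustrative subset of a stop-word list (articles, conjunctions, etc.)
STOP_WORDS = {"the", "a", "an", "and", "for", "but", "of", "in", "to", "who"}

def word_frequencies(text):
    """Tokenise, drop stop words, and count the remaining words."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return Counter(t for t in tokens if t not in STOP_WORDS)

sample = "The market rewards the investor who balances risk and return in the market."
freqs = word_frequencies(sample)
print(freqs.most_common(3))  # 'market' tops the list with a count of 2
```

A word cloud is then simply a rendering of this frequency table, with font size proportional to each word's count.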
Big data promises to revolutionise the way we look at the world but we should not go overboard. Before too long, there will almost certainly be a spectacular miss which will bring out the critics in droves. However, humans also make mistakes and we learn from them. In the not-too-distant future, so will the machines. But the real trick, as former boss of Hewlett-Packard Carly Fiorina once said, will be to turn data into information, and information into insight. We are not quite there yet.
1 Available at www.cs.uic.edu/~liub/FBS/sentiment-analysis.html