Two months ago Donald Trump shocked the world by winning the election. Regardless of your political persuasion or whom you supported, you were likely surprised by his win. Data analysis of elections is almost a national pastime, and nearly all of the models and analyses predicted a different outcome. Even the stock market priced in a different result, which led to a brief period of havoc while prices corrected.
Similarly, people have built predictive models of stock prices, climate change, day-to-day weather, and even your purchasing habits. Plenty of these models are accurate and will predict outcomes well. The important thing to understand, however, is that every model is built on assumptions.
For much of my last job I managed a team of Data Analysts, or Data Scientists as they are now more commonly called. During this time it was my team's task to build models to predict everything from breakdown rates of instruments to buying patterns of customers. Using these analyses we would attempt to position our organization to respond to customer needs at minimum cost. Sometimes we got it right, in which case our engineers covered the customer's need in a timely manner and at the lowest cost to our company. Other times we got it badly wrong and had to employ temps and other mitigating measures.
The difference between success and failure in these analyses usually came down to whether we had made the correct assumptions as the basis for our analysis. Had we correctly assumed the customer's usage rate of their instrument? The cleanliness of their lab? Whether they regularly performed preventative maintenance? These assumptions were the starting point for every model.
Mean, Median, and Distribution
There is a popular saying: "there are lies, damned lies, and statistics." You can make statistics say almost anything simply by modifying the underlying assumptions. Take the average net worth of a society as a simple example. If you assumed wealth was evenly dispersed, you would use the mean, the sum of all values divided by their count. If you assumed significant variation, you might instead take the median, the middle value when everything is sorted. But even that might not tell the whole story. Imagine a striated society where most people are either very poor or very rich, with almost no one in the middle. In that case neither the mean nor the median is informative; you would want to look at the full distribution of net worth.
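As a rough sketch (with entirely made-up net worth figures), here is how the mean, the median, and a simple bucketed distribution each tell a different story about the same striated society:

```python
from statistics import mean, median
from collections import Counter

# Hypothetical net worth figures in $1,000s: most people are either
# quite poor or quite rich, with almost no one in the middle.
net_worth = [10, 12, 15, 20, 25, 900, 950, 1000, 1100, 1200]

print(mean(net_worth))    # 523.2 -- far above what most people actually have
print(median(net_worth))  # 462.5 -- lands in the empty middle of the striation
# Only a distribution view exposes the two clusters.
buckets = Counter((w // 500) * 500 for w in net_worth)
for low in sorted(buckets):
    print(f"${low}k-${low + 499}k: {buckets[low]} people")
```

Both summary numbers land where almost nobody actually is; the bucket counts (5 poor, 2 upper, 3 rich) are what reveal the real shape.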
The other major thing you need to consider is whether your sample truly represents the population. If you are predicting an election, for example, you are likely basing your predictions on polls, social media, or both. The issue, of course, is that a large portion of society may not be represented in those sources. The idea that your sample represents society is itself another one of those assumptions.
This is also why predicting individual stock outcomes is so hard. Even for the longest-running stocks, the data set of outcomes spans perhaps 180 years. Yet 180 years contains only about 150 overlapping 30-year periods of the kind that would represent your investment horizon, and far fewer truly independent ones. It is extremely likely that some future 30-year period will not resemble any of those previous periods, which explains why individual stock prediction is not particularly reliable. Overall economic data has a longer track record, which is why it is likely at least slightly more reliable. Still, even when statistics get it right, they have not told you "Why?"
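The window arithmetic above is easy to check; a tiny sketch, assuming annual data and no real prices:

```python
# How many 30-year investment horizons fit inside 180 years of history?
years_of_data = 180
horizon = 30

overlapping = years_of_data - horizon + 1   # one window starting each year
non_overlapping = years_of_data // horizon  # truly independent windows
print(overlapping)      # 151
print(non_overlapping)  # 6
```

The overlapping windows share most of their years, so the history contains far less independent evidence than "150 periods" suggests.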
Why, or What Does the Data Mean?
Often we look at data and see one situation occurring at the same time as another. If the two occur together frequently enough, we call them correlated. Correlation is helpful because it tells us how often two things happen together (or don't). What it cannot tell you is causation. What truly caused something can only be determined with tools like the scientific method and controlled testing. Why? Because even if two things occur together, it may be pure coincidence, or both may be driven by some altogether different phenomenon. I saw this regularly when we would look at a data set superficially before splitting it by things like customer type or region. Before splitting you might see an improvement following some action, but once you split the data you could see the apparent change was driven by a demographic variable, not the action itself. Knowing which variables to control for is determined by those scientific tests, and those tests are also what should ultimately justify the assumptions I mentioned a few paragraphs ago.
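The splitting effect described above can be sketched with invented numbers: the overall "success rate" rises even though neither customer segment improved at all, purely because the mix of segments shifted.

```python
# (successes, trials) for two hypothetical customer segments,
# before and after some action we are tempted to credit.
before = {"small_lab": (40, 100), "large_lab": (90, 100)}
after  = {"small_lab": (20, 50),  "large_lab": (135, 150)}

def overall_rate(groups):
    wins = sum(s for s, n in groups.values())
    total = sum(n for _, n in groups.values())
    return wins / total

print(overall_rate(before))  # 0.65
print(overall_rate(after))   # 0.775 -- looks like the action helped
for seg in before:
    s0, n0 = before[seg]
    s1, n1 = after[seg]
    print(seg, s0 / n0, "->", s1 / n1)  # each segment is exactly unchanged
```

Each segment's rate is identical before and after (0.40 and 0.90); only the proportion of large-lab customers grew, which alone lifts the aggregate.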
The final important aspect of data analysis is benchmarking. Benchmarking means choosing a comparable peer against which to judge your numbers. Say you are looking at a company and trying to decide whether it is using too much leverage. You would compare it to other companies in its industry to determine the true situation. Looking at the company in a vacuum would ignore whether it is an asset-dependent business like a railroad, where heavy debt is normal. Comparing in the context of all factors tells you which company in an industry is the better bet to succeed. The risk, again, lies in the assumptions behind the comparison. Should Airbnb, for example, be compared to Marriott? Visa to Amex (only one actually provides the lending behind its cards)?
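A minimal sketch of that kind of peer benchmark, using entirely made-up debt-to-equity figures for fictional railroads:

```python
# Debt-to-equity ratios for a (fictional) railroad peer group.
peers = {"RailCo A": 1.8, "RailCo B": 2.1, "RailCo C": 1.6}
target_name, target_de = "RailCo D", 1.9

peer_avg = sum(peers.values()) / len(peers)
verdict = "in line with" if abs(target_de - peer_avg) < 0.5 else "far from"
print(f"{target_name} D/E {target_de:.1f} is {verdict} the peer average {peer_avg:.2f}")
# A 1.9 D/E would look alarming next to an asset-light software firm,
# but against an asset-heavy railroad peer group it is unremarkable.
```

The whole judgment hinges on the assumption baked into `peers`: pick the wrong comparison set and the same 1.9 ratio reads as reckless.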
Data Analysis Takeaways
Ultimately, whenever you see a model or prediction, ask what assumptions were used to make it, what variables were controlled for, what benchmark (if any) was used, and how the results were interpreted. Ask what evidence was used to formulate and justify those assumptions. Since this is a finance blog, that applies above all to stock and economic predictions. As you dig you will find that the better models have significant testing behind them, while many others are just gut feel. Even market prediction models with far more data and far more testing still go wrong from time to time. When they do, we learn one more thing about our world that lets us iterate closer to a correct model. In the interim, until something like a financial model is perfected, you are probably better off investing in index funds. If you venture beyond them, always remember to ask.
Do you use Data Analysis to choose investments?