The ability to analyze large amounts of data may be quite valuable, but it is not an easy solution. We have had the capability for a long time; what is new is the scale. Our digital world creates large amounts of data in various forms. Big data means the capability to analyze large amounts of information rapidly.
Analysis means recognizing changes, trends, patterns and correlations. The traditional method of analyzing data means that the data is categorized and structured in tables. The analysis is made from a snapshot of the collected data.
Volume, variety and velocity are the three V’s of big data. Volume means large amounts of data, variety means many types of data such as text, sound and pictures, and velocity means using real-time data. I am sure that there are successful applications of big data, but I am equally sure that there are also a lot of disappointments. The vendors of big data solutions are the loudest proponents of the method, and they list the success stories. It is quite natural that nobody wants to talk about the failures.
I have some perspective on this approach, as my first career was in statistics. Here is a simple predictive model which I made of monthly visits to my website. In November 2011 I calculated the expected visits for the next two years, and the model has worked pretty well for 17 months now. I did this out of curiosity; I wanted to see if I could make a working model. The predicted total sum of visits is just 2% off the actual number.
This simple model is not big data; it is just an example showing that things like web visits can be predicted. If you don’t know how it is done, it may look magical. The model is a combination of a trend and a pattern and is really very basic.
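A minimal sketch of what a trend-plus-pattern model of this kind could look like. It is not the actual model described above: the straight-line trend, the 12-month cycle, the function names and the visit counts are all assumptions made up for illustration.

```python
# Hypothetical monthly visit counts for two full years (invented numbers).
visits = [410, 395, 430, 420, 400, 360, 330, 350, 440, 470, 480, 390,
          470, 450, 495, 480, 465, 420, 385, 410, 505, 530, 545, 450]


def fit_trend(values):
    """Least-squares straight line through (0, v0), (1, v1), ...

    Returns (intercept, slope).
    """
    n = len(values)
    mean_t = (n - 1) / 2
    mean_v = sum(values) / n
    sxx = sum((t - mean_t) ** 2 for t in range(n))
    sxy = sum((t - mean_t) * (v - mean_v) for t, v in enumerate(values))
    slope = sxy / sxx
    return mean_v - slope * mean_t, slope


def fit_pattern(values, intercept, slope, period=12):
    """Average deviation from the trend line for each month of the cycle."""
    buckets = [[] for _ in range(period)]
    for t, v in enumerate(values):
        buckets[t % period].append(v - (intercept + slope * t))
    return [sum(b) / len(b) for b in buckets]


def forecast(values, horizon, period=12):
    """Extend the trend line and add the repeating monthly pattern."""
    intercept, slope = fit_trend(values)
    pattern = fit_pattern(values, intercept, slope, period)
    n = len(values)
    return [intercept + slope * t + pattern[t % period]
            for t in range(n, n + horizon)]


# Forecast the following twelve months from the invented history.
next_year = forecast(visits, 12)
print([round(v) for v in next_year])
```

The trend captures the long-term growth and the pattern captures the months that are regularly above or below it. Comparing the forecast total with the actual total afterwards is the kind of check mentioned above.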
Another thing I have learned from running surveys is that free-text answers are very valuable but difficult to analyze. At first there is a mass of text, but after some analysis it is usually possible to find some patterns. I have not yet found software which could analyze the data for me, so I do it by reading and tagging the text until I have found the patterns. Actually, the lack of patterns is information too.
Not all models work so well. I have spent a lot of time trying to find something in a large collection of data. It is far from easy, and it takes a lot of trial and error to find a useful way of looking at the data. There are many ways to organize data, and it can be crucial that the data has the correct structure. Another common problem is the quality of the data. In demos and in fiction this is made to look easy, but in real life it is like searching for a needle in a very big haystack. Here are some problems I see with big data.
Amount of (significant) observations
A typical method in statistical analysis is looking for significant observations. A significant observation is one which has a low probability of happening by chance. When you have a large amount of data, there will be a large number of statistics to analyze, and some random differences will appear significant. If you have 1000 variables, it is possible to calculate 1000*999/2 = 499 500 correlations. Even if the variables are completely unrelated, almost 25 000 of these (5% of 499 500) will be “significant” by the normal 5% significance test. It will be quite hard to work out which of the changes or correlations are actually significant and which are noise.
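The arithmetic can be checked with a small simulation. The sketch below is only an illustration under stated assumptions: it uses 40 purely random variables instead of 1000 to keep it fast, and it approximates the 5% test with the large-sample rule |r| > 1.96/sqrt(n) rather than an exact t-test. The function names are made up for this example.

```python
import math
import random


def pair_count(n_vars):
    """Number of distinct variable pairs, n*(n-1)/2."""
    return n_vars * (n_vars - 1) // 2


def pearson(xs, ys):
    """Plain Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def count_chance_hits(n_vars=40, n_obs=100, seed=1):
    """Correlate unrelated random variables and count 'significant' pairs."""
    rng = random.Random(seed)
    data = [[rng.gauss(0, 1) for _ in range(n_obs)] for _ in range(n_vars)]
    # Assumption: large-sample approximation of the two-sided 5% test.
    threshold = 1.96 / math.sqrt(n_obs)
    hits = sum(
        1
        for i in range(n_vars)
        for j in range(i + 1, n_vars)
        if abs(pearson(data[i], data[j])) > threshold
    )
    return hits, pair_count(n_vars)


# The figures from the text: 499500 pairs, about 5% of them expected by chance.
print(pair_count(1000), round(0.05 * pair_count(1000)))

hits, pairs = count_chance_hits()
print(f"{hits} of {pairs} pairs of pure noise look significant")
```

Every one of the hits is a false alarm, because the data contains no real relationships at all.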
The same goes for patterns. It is possible to generate an endless number of graphical visualizations of all kinds of data. Humans are good at seeing patterns everywhere. The constellations are a good example: random patterns of stars become constellations and acquire non-existent meaning. The more data we have, the more significant-looking but random patterns we will find.
I have read comments that a high volume of data can be better than complex processing. I doubt it. The benefit of using small samples is that one can invest more in controlling the quality of the data, which leads us to the next challenge.
Complexity and errors
Errors in the data or in the application of a method can lead to wrong conclusions. Complex mathematical models and calculations can contain all kinds of errors. It is dangerous to use a statistical method without understanding it; unfortunately, modern tools make it too easy. One example is the effect of outliers in correlation-based tools. Outliers are observations which are different from the others. A single outlier can create a false correlation if it is sufficiently far outside the normal variation. The outlier can represent a special case of unusual behavior or just be an error, and data errors are common, even in production systems. The more data there is, the more difficult it will be to remove errors and outliers. And removing outliers can be dangerous, as they can represent a significant change in behavior.
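The outlier effect is easy to reproduce. The following sketch assumes 200 observations of two unrelated random variables and one extreme point at (50, 50), which could be a data entry error. All values and names are invented for illustration.

```python
import math
import random


def pearson(xs, ys):
    """Plain Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def outlier_demo(n_obs=200, outlier=(50.0, 50.0), seed=7):
    """Correlation of unrelated data, without and with one extreme point."""
    rng = random.Random(seed)
    xs = [rng.gauss(0, 1) for _ in range(n_obs)]
    ys = [rng.gauss(0, 1) for _ in range(n_obs)]
    r_clean = pearson(xs, ys)
    # Append a single observation far outside the normal variation.
    r_with = pearson(xs + [outlier[0]], ys + [outlier[1]])
    return r_clean, r_with


r_clean, r_with = outlier_demo()
print(f"without outlier: r = {r_clean:.2f}")
print(f"with one outlier: r = {r_with:.2f}")
```

One point out of 201 is enough to turn a correlation near zero into a strong one, which is why a plot of the raw data is worth looking at before trusting the coefficient.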
Big Data is not just a tool
There are two more V’s in the concept of big data: value and veracity. The idea of investing in big data is to find some value in it. The output of big data analysis is information. This means that somebody must be able and willing to change things based on the information. The value will come from the decisions. That may be a bigger challenge than the technical solution. An investment in big data will bring value only if
- the analysis creates information
- the interpretation is correct
- the decision makers trust the information
- their reaction is correct and timely
That is a lot of ifs, and together they are the last V, veracity. I have seen that decision makers will not apply findings which they do not understand, which means in practice that you have to use some very simple and graphical way to convince people.
The decision to use big data methods needs to be business driven. If IT sees the possibility, it needs to convince the business of the value. The danger is that some vendor may have sold the idea to the business, which then wants to apply it without understanding the risks and costs. The best approach would be to let them play with the data first and run pilot tests with a temporary solution before investing in a production environment.
I find it hard to believe that volume is a must for testing an idea. If a concept does not work with a sample of the data, it is highly unlikely that it will work with the full data set. Big data is not magic.