Why Is the Big Data Phenomenon in Trouble? They Forgot Applied Statistics

2014 May 18

In this post, Jeff Leek makes a very relevant point about data analysis.

At a time when vendors of Business Intelligence software, and even vendors of database management systems, try to persuade managers, directors, and decision makers that we need more data, this post simply says: “No, learn statistics first!”

One reason is that when you actually take the time to do an analysis right, with careful attention to all the sources of variation in the data, it is almost a law that you will have to make smaller claims than you could if you just shoved your data in a machine learning algorithm and reported whatever came out the other side.

The prime example in the press is Google Flu trends. Google Flu trends was originally developed as a machine learning algorithm for predicting the number of flu cases based on Google Search Terms. While the underlying data management and machine learning algorithms were correct, a misunderstanding about the uncertainties in the data collection and modeling process have led to highly inaccurate estimates over time. A statistician would have thought carefully about the sampling process, identified time series components to the spatial trend, investigated why the search terms were predictive and tried to understand what the likely reason that Google Flu trends was working.

As we have seen, lack of expertise in statistics has led to fundamental errors in both genomic science and economics. In the first case a team of scientists led by Anil Potti created an algorithm for predicting the response to chemotherapy. This solution was widely praised in both the scientific and popular press. Unfortunately the researchers did not correctly account for all the sources of variation in the data set and had misapplied statistical methods and ignored major data integrity problems. The lead author and the editors who handled this paper didn’t have the necessary statistical expertise, which led to major consequences and cancelled clinical trials.

At the end, the author asks a question that I find extremely relevant: “When thinking about the big data era, what are some statistical ideas we’ve already figured out?”

I have a few:

1) Sample size determination for building models, whether the population size is known or unknown;

2) Design of Experiments;

3) Exploratory Data Analysis.
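
To make the first item concrete, here is a minimal sketch of one common way to determine a sample size when estimating a proportion. It uses Cochran's formula for an unknown (effectively infinite) population and the finite population correction when the population size is known. This is my own illustration, not something taken from Jeff Leek's post; the function name `sample_size` and the defaults (95% confidence, so z = 1.96, and the conservative p = 0.5) are assumptions chosen for the example.

```python
import math


def sample_size(margin_of_error, z=1.96, p=0.5, population=None):
    """Sample size needed to estimate a proportion (illustrative helper).

    margin_of_error: desired half-width of the confidence interval (e.g. 0.05)
    z: z-score for the confidence level (1.96 is roughly 95%)
    p: expected proportion; 0.5 is the most conservative choice
    population: known population size, or None if unknown / very large
    """
    # Cochran's formula for an unknown (infinite) population
    n0 = (z ** 2) * p * (1 - p) / margin_of_error ** 2
    if population is None:
        return math.ceil(n0)
    # Finite population correction when the population size is known
    n = n0 / (1 + (n0 - 1) / population)
    return math.ceil(n)


if __name__ == "__main__":
    print(sample_size(0.05))                    # unknown population
    print(sample_size(0.05, population=10000))  # known population of 10,000
```

With these assumptions, a 5% margin of error calls for 385 respondents when the population is unknown and 370 when it is known to be 10,000. The point is that required sample size depends mostly on the precision you want, not on how much data you happen to have.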