Lessons from the Kaggle inClass competition

A lot has already been said about Kaggle here, but one aspect of this machine learning competition site that I find extremely positive is that it consistently surfaces very creative approaches to prediction problems and to building classification models.

In this No Free Hunch post, a winning team described some of its methods, and the main lesson is the same one Frank Harrell emphasizes in his excellent book: always look at the data.

Below are some highlights from the interview about the methods they used.

On the initial data-processing methods

[…]From the very beginning, our top priority was to develop useful features. Knowing that we would learn more powerful statistical learning methods as our Stanford course progressed, we made sure that we had the features ready so we would be able to apply various models to them quickly and easily.[…]

[…]When we later applied the boosted decision trees model, we derived additional predictors that expressed the variance in the number of subscriptions bought – theorizing that the decision tree would be more easily able to separate “stable” accounts from “unstable” ones.

We created 277 features in the end, which we applied in different combinations. Surprisingly, our final model used only 18 of them.[…]
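To make that kind of derived predictor concrete, here is a minimal R sketch of a per-account variance feature. The table layout and column names are my own illustration, not the team's actual data or code.

```r
# Hypothetical example: derive variance features per account from a
# long-format table of yearly subscription counts. Column names are
# illustrative assumptions, not the team's actual schema.
subs <- data.frame(
  account_id      = rep(1:3, each = 4),
  year            = rep(2008:2011, times = 3),
  n_subscriptions = c(5, 5, 5, 5,   # a "stable" account
                      1, 8, 0, 7,   # an "unstable" account
                      3, 3, 4, 2)
)

# Variance (and mean) of subscriptions bought, aggregated per account;
# low variance suggests a "stable" account, high variance an "unstable" one
subs_var  <- tapply(subs$n_subscriptions, subs$account_id, var)
subs_mean <- tapply(subs$n_subscriptions, subs$account_id, mean)

features <- data.frame(account_id = as.integer(names(subs_var)),
                       subs_mean  = as.numeric(subs_mean),
                       subs_var   = as.numeric(subs_var))
print(features)
```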

On the supervised learning methods used

[…]Most importantly – and from the very beginning – we used 10-fold cross validation error as the metric to compare different learning methods and for optimization within models.

We started with multiple linear regression models. These simple models helped us become familiar with the data while we concentrated our initial efforts on preparing features for later use.[…]
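As a sketch of what that workflow looks like, the snippet below computes a 10-fold cross-validation RMSE for a plain linear model in base R. The dataset and formula are placeholders standing in for the competition data.

```r
# Minimal 10-fold cross-validation for a linear model (base R only).
# mtcars and the formula are stand-ins for the competition data.
set.seed(42)
k     <- 10
folds <- sample(rep(1:k, length.out = nrow(mtcars)))  # random fold labels

cv_rmse <- sapply(1:k, function(i) {
  train <- mtcars[folds != i, ]
  test  <- mtcars[folds == i, ]
  fit   <- lm(mpg ~ wt + hp, data = train)   # placeholder formula
  pred  <- predict(fit, newdata = test)
  sqrt(mean((test$mpg - pred)^2))             # RMSE on the held-out fold
})

mean(cv_rmse)  # the single number used to compare candidate models
```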

On the techniques employed

[…]We didn’t have much luck with SVM, BART and KNN. Perhaps we did not put enough effort into that, but since we already had very good results from using boosted trees, the bar was already quite high. Our biggest effort soon turned to tuning the boosted regression tree model parameters.

Using cross validation error, we tuned the following parameters: number of trees, bagfrac, shrinkage, and depth. We then tuned the minobsinnode parameter – we saw significant improvements when adjusting the parameter downwards from its default setting.

Our tuning process was both manual and automated. We wrote R scripts that randomly changed the parameters and set of predictors and then computed the 10-fold cross-validation error on each permutation. But these scripts were usually used only as a guide for approaches that we then further investigated manually. We used this as a kind of modified forward selection process.[…]
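Below is a hedged sketch of what such a script might look like, using the gbm package (whose bag.fraction and n.minobsinnode arguments match the parameters named above). The search ranges, dataset, and iteration count are illustrative assumptions; the team's actual script was not published.

```r
# Sketch of a randomized search over gbm parameters and predictor subsets,
# scored by 10-fold cross-validation error. Data, ranges, and iteration
# count are illustrative assumptions, not the team's actual script.
library(gbm)
set.seed(123)

response   <- "mpg"                            # placeholder target
candidates <- setdiff(names(mtcars), response)

results <- data.frame()
for (i in 1:20) {                              # 20 random configurations
  preds  <- sample(candidates, sample(2:5, 1)) # random predictor subset
  shrink <- runif(1, 0.005, 0.1)
  depth  <- sample(1:5, 1)
  minobs <- sample(2:10, 1)                    # tried downward from the default of 10
  bagfr  <- runif(1, 0.5, 1.0)

  fit <- gbm(reformulate(preds, response), data = mtcars,
             distribution      = "gaussian",
             n.trees           = 1000,
             shrinkage         = shrink,
             interaction.depth = depth,
             n.minobsinnode    = minobs,
             bag.fraction      = bagfr,
             cv.folds          = 10,
             verbose           = FALSE)

  results <- rbind(results, data.frame(
    cv_error   = min(fit$cv.error),            # CV error at the best iteration
    best_iter  = gbm.perf(fit, method = "cv", plot.it = FALSE),
    shrinkage  = shrink, depth = depth, minobs = minobs, bagfrac = bagfr,
    predictors = paste(preds, collapse = "+")
  ))
}

head(results[order(results$cv_error), ])       # best configurations first
```

Sorting the runs by cross-validation error and inspecting the winning predictor subsets is what makes a loop like this usable as a guide for the manual, forward-selection-style refinement the team describes.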