Lessons from Kaggle competitions
2016 Jan 01
It goes without saying how much Kaggle has been contributing to the Data Science community, and these lessons from Harasymiv show that those contributions go beyond the basics.
See below:
- XGBoost is the engine of choice for structured problems (where feature manufacturing, i.e. feature engineering, is the key). It is now available as a Python package. Behind XGBoost are the usual suspects - Random Forest and Gradient Boosted Trees. However, hyperparameter tuning only adds a few percentage points of accuracy on top; the major breakthroughs in predictive power come from feature manufacturing;
- Feature manufacturing (or otherwise randomly permuting features to find the most predictive combination) is the key process for structured problems. It is done either by iteratively trying various approaches (as thousands of individual contributors to Kaggle.com competitions do) or automatically (as DataRobot does; BTW, DataRobot is based partly in Boston and partly in Ukraine). Some Amazon engineers who attended from Seattle commented that they, too, are building a platform that would iteratively permute features at random (in a “genetic algorithm” fashion) to find the best features for structured problems;
- For unstructured problems (images, text, sound), neural networks run the show (including deep learning, which extracts features automatically, and its variants). A great example was the application of neural networks to the Diabetic Retinopathy problem at Kaggle.com, which surpassed commercially available products in accuracy;
- Kaggle.com is really suitable for two types of problems:
  - A problem that is already solved today but for which a more accurate solution is highly desirable, where any fraction of a percent in accuracy turns into millions of dollars (e.g. loan default rate prediction);
  - Problems that have never been tackled with machine learning, to see whether ML can help solve them (e.g. using EEG readings to predict epilepsy);
- Don’t expect data scientists to perform best in the office! Anthony mentioned his first successful 24h data science hackathon, where each hour his senior would guide him for 5 min, code himself for 15 min and then play basketball for 40 min. Personally, I find that walking, gardening and running are great creativity boosters. How will you work tomorrow? :)
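
To make the "randomly permute features to find the most predictive combination" idea from the list above a bit more concrete, here is a minimal sketch in plain Python. It is not what DataRobot, the Amazon engineers, or any Kaggle winner actually runs. It is a simple random search (not a full genetic algorithm, which would also mutate and recombine candidates), and it scores each candidate by its absolute Pearson correlation with the target, where a real pipeline would use a cross-validated model score. The names `OPS`, `random_feature_search` and `make_toy_data` are hypothetical and exist only for this illustration.

```python
import random
from itertools import combinations
from statistics import mean


def pearson(xs, ys):
    """Pearson correlation; returns 0.0 when either input is constant."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / (var_x * var_y) ** 0.5


# Hypothetical set of ways to combine two columns into a new feature.
OPS = {
    "add": lambda a, b: a + b,
    "sub": lambda a, b: a - b,
    "mul": lambda a, b: a * b,
}


def random_feature_search(columns, target, n_trials=50, seed=0):
    """Randomly try pairwise feature combinations and keep the best one.

    columns: dict mapping column name -> list of floats
    target:  list of floats, same length as each column
    Returns (score, description) of the best candidate found.
    """
    rng = random.Random(seed)
    pairs = list(combinations(sorted(columns), 2))
    best = (0.0, None)
    for _ in range(n_trials):
        a, b = rng.choice(pairs)
        op_name = rng.choice(sorted(OPS))
        candidate = [OPS[op_name](x, y) for x, y in zip(columns[a], columns[b])]
        # Assumption: correlation stands in for a proper validation score.
        score = abs(pearson(candidate, target))
        if score > best[0]:
            best = (score, f"{op_name}({a}, {b})")
    return best


def make_toy_data(n=200, seed=1):
    """Toy data where the target is exactly the product of x1 and x2."""
    rng = random.Random(seed)
    cols = {
        name: [rng.uniform(-1, 1) for _ in range(n)]
        for name in ("x1", "x2", "x3")
    }
    target = [a * b for a, b in zip(cols["x1"], cols["x2"])]
    return cols, target


if __name__ == "__main__":
    cols, target = make_toy_data()
    print(random_feature_search(cols, target, n_trials=200))
```

On the toy data no single raw column is linearly informative about the target, but the manufactured product of `x1` and `x2` is. That is the point of the lesson: the gain comes from the new feature, not from tuning a model on the raw columns.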