Utilização de Random Forests para o problema de compartilhamento de bicicletas em Seattle

2016 Jan 04

Não é mais necessário dizer que o futuro das cidades vai passar pela a análise de dados e principalmente pela a aplicação da inteligência para resolução de problemas dos pagadores de impostos.

Nesse caso específico o problema era que o sistema de trens urbanos de Seattle disponibiliza 500 bicicletas em suas estações e que a oferta dessas bicicletas deve estar ajustada com a demanda de cada estação.

Aqui está o post original, e a abordagem utilizada:

“From clustering, I discovered two distinct ecosystems of bike stations—Seattle, and the University District—based on traffic flows from station to station,” Sadler said. “It turned out that having separate models for each lent itself to much better predictions.”

Sadler modeled hourly supply and hourly demand separately for each of the two ecosystems, summing the result to predict the change in current bike count, based on the current bike count data from the Pronto API. To do this, he used multiple random forest algorithms, each tuned for a specific task.

“Having groups of smaller random forests worked much better than having a single large random forest try to predict everything,” Sadler said. “This is probably due to the different ecosystems having vastly different signals and different types of noise.”

The model—which is actually two models (a random forest for each ecosystem), of which the branches of each are composed of additional random forests—draws from historical demand based on the current season, current hour, and current weekend. It also uses meta information about each station, such as elevation, size, and proximity to other stations. The model leverages this information to discover signals and patterns in ride usage, then predicts based on the signal it finds.