A New Softmax Operator for Reinforcement Learning

In some classification problems, raw scores are not confined to the restricted space of probabilities (values that sum to one); the softmax operator maps such a vector of scores to the probability that a given tuple belongs to each class.
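
As a minimal sketch of that mapping (plain NumPy; the function name and the example scores are just illustrative):

```python
import numpy as np

def softmax(scores):
    """Map a vector of raw scores to a probability distribution over classes."""
    # Subtract the maximum for numerical stability; the result is unchanged.
    exps = np.exp(scores - np.max(scores))
    return exps / exps.sum()

# Example: raw scores of one tuple for three classes.
probs = softmax(np.array([2.0, 1.0, 0.1]))
print(probs)        # approximately [0.659 0.242 0.099]
print(probs.sum())  # 1.0
```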

There is a great definition of this function on Quora, along with how it is used as an activation function in a neural network, combining signals in the same way they are transmitted along axons.

The logistic softmax function is defined as:
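
(assuming the standard sigmoid form, consistent with the description below)

$$h_\theta(x) = \frac{1}{1 + e^{-\theta^{\top} x}}$$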

Here θ represents a vector of weights and x is a vector of input values; the function produces a scalar output defined by hθ(x) ∈ ℝ, 0 < hθ(x) < 1. For anyone who wants to dig deeper, this explanation is definitely a killer one.

This brief introduction was meant to lead into the paper below, which takes a great approach to softmax in the context of reinforcement learning. Enjoy.

A New Softmax Operator for Reinforcement Learning

Abstract:  A softmax operator applied to a set of values acts somewhat like the maximization function and somewhat like an average. In sequential decision making, softmax is often used in settings where it is necessary to maximize utility but also to hedge against problems that arise from putting all of one’s weight behind a single maximum utility decision. The Boltzmann softmax operator is the most commonly used softmax operator in this setting, but we show that this operator is prone to misbehavior. In this work, we study an alternative softmax operator that, among other properties, is both a non-expansion (ensuring convergent behavior in learning and planning) and differentiable (making it possible to improve decisions via gradient descent methods). We provide proofs of these properties and present empirical comparisons between various softmax operators.

Conclusions: We proposed the mellowmax operator as an alternative for the Boltzmann operator. We showed that mellowmax has several desirable properties and that it works favorably in practice. Arguably, mellowmax could be used in place of Boltzmann throughout reinforcement-learning research. Important future work is to expand the scope of investigation to the function approximation setting in which the state space or the action space is large and abstraction techniques are used. We expect mellowmax operator and its non-expansion property to behave more consistently than the Boltzmann operator when estimates of state–action values can be arbitrarily inaccurate. Another direction is to analyze the fixed point of planning, reinforcement-learning, and game-playing algorithms when using softmax and mellowmax operators. In particular, an interesting analysis could be one that bounds the suboptimality of fixed points found by value iteration under each operator. Finally, due to the convexity (Boyd & Vandenberghe, 2004) of mellowmax, it is compelling to use this operator in a gradient ascent algorithm in the context of sequential decision making. Inverse reinforcement-learning algorithms is a natural candidate given the popularity of softmax in this setting.
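
For reference, here is a minimal NumPy sketch of the two operators compared in the paper, the Boltzmann softmax and the proposed mellowmax, assuming the usual definitions boltz_β(x) = Σᵢ xᵢ e^(βxᵢ) / Σⱼ e^(βxⱼ) and mm_ω(x) = log((1/n) Σᵢ e^(ωxᵢ)) / ω; the function names and example values are mine, not the paper's:

```python
import numpy as np
from scipy.special import logsumexp

def boltzmann_softmax(x, beta):
    """Boltzmann softmax: sum_i x_i * exp(beta*x_i) / sum_j exp(beta*x_j)."""
    w = np.exp(beta * x - np.max(beta * x))  # shift for numerical stability
    return float(np.dot(x, w) / w.sum())

def mellowmax(x, omega):
    """Mellowmax: log((1/n) * sum_i exp(omega*x_i)) / omega."""
    x = np.asarray(x)
    return float((logsumexp(omega * x) - np.log(len(x))) / omega)

# Both operators move from the mean (small parameter) toward the max (large parameter).
values = np.array([0.1, 0.5, 0.9])
for p in (0.1, 5.0, 50.0):
    print(p, boltzmann_softmax(values, p), mellowmax(values, p))
```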