Choosing a Learning Rate?

Ben Hammel shows how to do it.

Reduce learning rate on plateau

Every time the loss begins to plateau, the learning rate is decreased by a set fraction. The rationale is that the model has become caught in a region similar to the “high learning rate” scenario shown at the start of this post (or visualized in the ‘chaotic’ landscape of the VGG-56 model above). Reducing the learning rate lets the optimizer settle more efficiently into a minimum of the loss surface. At this point, one might worry about converging to a local minimum. This is where building intuition from an illustrative representation can betray you; I encourage you to convince yourself of the discussion in the “Local minima in deep learning” section.
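As a concrete sketch of this schedule, PyTorch ships it as torch.optim.lr_scheduler.ReduceLROnPlateau; the toy regression data and hyperparameters below are illustrative stand-ins, not an experiment from this post:

```python
import torch

# Toy regression problem (illustrative data only).
X = torch.randn(256, 10)
y = X.sum(dim=1, keepdim=True)

model = torch.nn.Linear(10, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = torch.nn.MSELoss()

# Cut the learning rate by a factor of 10 whenever the monitored
# loss fails to improve for `patience` consecutive epochs.
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, mode="min", factor=0.1, patience=5
)

for epoch in range(100):
    optimizer.zero_grad()
    loss = loss_fn(model(X), y)
    loss.backward()
    optimizer.step()
    scheduler.step(loss.item())  # the scheduler watches this metric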

Use a learning-rate finder

The learning-rate finder was first suggested by Leslie N. Smith and popularized by Jeremy Howard in Deep Learning: Lesson 2 (2018). There are lots of great references on how this works, but they usually stop at hand-wavy justifications. If you want more convincing, what follows is a simple implementation on a linear regression problem, followed by a theoretical justification.
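As a preview, here is a minimal sketch of the idea, assuming PyTorch; the data, sweep range, and step count are arbitrary illustrative choices, not the implementation discussed below. The learning rate is increased geometrically between optimizer steps while the loss is recorded at each value:

```python
import torch

# Toy linear-regression data: y = 3x + noise (illustrative).
X = torch.randn(512, 1)
y = 3 * X + 0.1 * torch.randn(512, 1)

model = torch.nn.Linear(1, 1)
loss_fn = torch.nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=1e-6)

# Sweep the learning rate geometrically from lr_min to lr_max,
# taking one optimizer step per value and recording the loss.
lr_min, lr_max, n_steps = 1e-6, 1.0, 100
gamma = (lr_max / lr_min) ** (1.0 / n_steps)

lr, lrs, losses = lr_min, [], []
for _ in range(n_steps):
    for group in optimizer.param_groups:
        group["lr"] = lr
    optimizer.zero_grad()
    loss = loss_fn(model(X), y)
    loss.backward()
    optimizer.step()
    lrs.append(lr)
    losses.append(loss.item())
    lr *= gamma

# A good learning rate sits where the recorded loss falls fastest,
# just before the curve bottoms out and starts to diverge.
```

Plotting losses against lrs on a log axis gives the characteristic finder curve; a rate from the steep descending region is the usual pick.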

Behavior of different learning rate values