Via Baidu Research Blog.

The ring allreduce approach removes a great deal of communication overhead when training deep neural networks across multiple GPUs. The approach used to propagate and update the gradients (and to control the convergence of the model) is well explained below:

*The Ring Allreduce*

*The main issue with the simplistic communication strategy described above was that the communication cost grew linearly with the number of GPUs in the system. In contrast, a ring allreduce is an algorithm for which the communication cost is constant and independent of the number of GPUs in the system, and is determined solely by the slowest connection between GPUs in the system; in fact, if you only consider bandwidth as a factor in your communication cost (and ignore latency), the ring allreduce is an optimal communication algorithm [4]. (This is a good estimate of communication cost when your model is large, and you need to send large amounts of data a small number of times.)*
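The bandwidth argument above can be made concrete with a back-of-the-envelope model. The sketch below (an illustration, not taken from the original post) compares the bytes carried by the bottleneck link under a naive all-to-one reduction against the per-GPU traffic of a ring allreduce, for a gradient of `D` elements across `N` GPUs:

```python
def naive_bottleneck_traffic(D, N):
    # Naive strategy: every worker sends its full gradient to one reducer,
    # which then broadcasts the summed result back. The reducer's link
    # carries (N-1)*D in and (N-1)*D out, so it grows linearly with N.
    return 2 * D * (N - 1)

def ring_traffic_per_gpu(D, N):
    # Ring allreduce: a scatter-reduce of N-1 steps followed by an
    # allgather of N-1 steps, each moving a chunk of D/N elements.
    # Total per GPU: 2*(N-1)*D/N, which approaches 2*D as N grows.
    return 2 * (N - 1) * (D / N)

D = 10**6  # hypothetical gradient size in elements
for N in (2, 4, 8, 16):
    print(N, naive_bottleneck_traffic(D, N), ring_traffic_per_gpu(D, N))
```

Note how the ring figure is bounded by `2*D` regardless of `N`, which is the "constant and independent of the number of GPUs" property the post describes.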

*The GPUs in a ring allreduce are arranged in a logical ring. Each GPU should have a left neighbor and a right neighbor; it will only ever send data to its right neighbor, and receive data from its left neighbor.*

*The algorithm proceeds in two steps: first, a scatter-reduce, and then, an allgather. In the scatter-reduce step, the GPUs will exchange data such that every GPU ends up with a chunk of the final result. In the allgather step, the GPUs will exchange those chunks such that all GPUs end up with the complete final result.*
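The two phases above can be sketched as a single-process simulation, where each "GPU" is just a Python list and "sending to the right neighbor" is indexing into the next list. This is an illustrative sketch of the scatter-reduce/allgather schedule, not Baidu's actual implementation:

```python
def ring_allreduce(buffers):
    """Sum-allreduce the lists in `buffers`, in place, via a logical ring."""
    n = len(buffers)            # number of simulated GPUs
    size = len(buffers[0])
    assert size % n == 0, "for simplicity, data length must divide evenly"
    chunk = size // n

    def sl(i):
        # Indices of chunk i (wrapping around the ring).
        i %= n
        return range(i * chunk, (i + 1) * chunk)

    # Phase 1: scatter-reduce. In step s, GPU g sends chunk (g - s) to its
    # right neighbor, which adds it into its own copy. After n-1 steps,
    # GPU g holds the fully summed chunk (g + 1) mod n.
    for step in range(n - 1):
        for g in range(n):
            right = (g + 1) % n
            for k in sl(g - step):
                buffers[right][k] += buffers[g][k]

    # Phase 2: allgather. Each GPU forwards the completed chunk it holds;
    # the receiver simply overwrites. After n-1 steps every GPU has the
    # complete final result.
    for step in range(n - 1):
        for g in range(n):
            right = (g + 1) % n
            for k in sl(g + 1 - step):
                buffers[right][k] = buffers[g][k]

    return buffers
```

For example, `ring_allreduce([[1, 2], [3, 4]])` leaves both "GPUs" holding `[4, 6]`, the elementwise sum. Each GPU only ever touches the buffers of its immediate ring neighbors, which is what keeps the cost independent of the ring size.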

More can be found in the original post. For an implementation, there is a GitHub project called Tensor All-reduce that can be used for distributed deep learning.