One of the biggest problems in data-parallel training is communication. When we only have a few GPUs and parameters measured in megabytes, it may not matter much exactly how those GPUs communicate. However, when your model has billions of parameters, the gradients may take gigabytes of space (because there is a gradient value for every parameter), and you are coordinating dozens of GPUs, the communication mechanism becomes crucial.
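To get a feel for the numbers, here is a minimal back-of-the-envelope sketch. It assumes one fp32 gradient value (4 bytes) per parameter; the parameter counts are illustrative, not tied to any particular model.

```python
# Rough size of the gradient payload exchanged in data-parallel training.
# Assumption: gradients are stored in fp32 (4 bytes per value), one value per parameter.
def gradient_size_gb(num_params: int, bytes_per_value: int = 4) -> float:
    """Return the gradient size in gigabytes for a model with num_params parameters."""
    return num_params * bytes_per_value / 1e9

# Illustrative model sizes: 125M, 1.3B, 7B, and 70B parameters.
for n_params in (125_000_000, 1_300_000_000, 7_000_000_000, 70_000_000_000):
    print(f"{n_params / 1e9:6.2f}B params -> {gradient_size_gb(n_params):7.1f} GB of gradients per step")
```

Even at 7B parameters, every training step produces tens of gigabytes of gradients that each GPU must share with the others, which is why the choice of communication pattern dominates at scale.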