2021 Michigan State University Engineering Graduate Research Symposium
Symposium by ForagerOne

Faster Distributed Machine Learning for Free



Presenter(s)

Xiaorui Liu

Poster Research Category

Artificial Intelligence, Machine Learning, Data Science

Abstract or Description

Parallelizing data analysis and machine learning tasks across distributed computing systems is promising because the computational power of many devices can be harnessed collaboratively for a potential speedup: ideally, a task should finish ten times faster when using ten computers. However, our research reveals that the synchronization cost of coordinating many devices is the major bottleneck to achieving such speedup. Specifically, synchronization requires communicating information between devices and can be notably slow. To overcome this challenge, we developed two generally applicable strategies, communication decentralization and communication compression, which together yield a very efficient distributed learning algorithm. In particular, communication compression shrinks the information during transmission, reducing the communicated bits by over 95%, say from 1 GB to 50 MB. Decentralization enables learning algorithms with flexible communication topologies, which further reduce the synchronization cost. Importantly, these solutions bring a remarkable speedup and make machine learning at scale practical without hurting the effectiveness of the learning tasks. The proposed algorithm achieves state-of-the-art performance both theoretically and empirically compared to existing algorithms in the literature. The paper has been published in a top-tier machine learning conference (ICLR 2021), and its impact is covered by a recent newsletter from ICER at MSU.
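To make the compression idea concrete, here is a minimal sketch of one common compression operator, top-k sparsification, which transmits only the largest-magnitude entries of an update. The operator, function name, and 5% ratio are illustrative assumptions, not necessarily the exact scheme used in the paper:

```python
import numpy as np

def topk_compress(grad, ratio=0.05):
    """Keep only the largest-magnitude entries (here 5%), zeroing the rest.

    A generic top-k sparsifier for illustration; the compression operator
    in the presented algorithm may differ.
    """
    k = max(1, int(grad.size * ratio))
    flat = grad.ravel()
    idx = np.argpartition(np.abs(flat), -k)[-k:]  # indices of the k largest
    compressed = np.zeros_like(flat)
    compressed[idx] = flat[idx]
    return compressed.reshape(grad.shape)

rng = np.random.default_rng(0)
g = rng.normal(size=1000)
c = topk_compress(g, ratio=0.05)
# Only ~5% of the entries would need to be transmitted; the rest are zero.
print(np.count_nonzero(c))  # 50
```

Sending just the 50 surviving index–value pairs instead of 1000 dense values is where the bandwidth saving comes from.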



Comments

Wei Jin · 4 years ago
This work looks very interesting! I am wondering if we can apply this idea to federated learning, where mobile devices are considered the client nodes. If the model on each device is small, will such compression still be necessary?
Xiaorui Liu · 4 years ago
Hi Wei, thanks for the question! Federated learning is definitely a well-suited problem setting for our approach; its advantages there are demonstrated in my response to Katy's comments. As for the case when the model is small, compression will still be very helpful. This is because communication happens at every learning iteration, and the number of iterations is typically huge, say 1,000, so the overall communication cost over the whole learning process remains very high. To conclude, our approach can be applied to federated learning, and compression is necessary even when the model is small.
Katy Colbry · 4 years ago
It sounds like this approach has some significant advantages in both speed and space/storage. I wonder if it is meant primarily for use in high-performance computing systems, like the MSU HPCC, where users submit specific jobs to be analyzed/computed? Or are there potential applications beyond HPC systems?
Xiaorui Liu · 4 years ago
Hi Katy, thanks for your interest in our approach! Our method is general and can be used in all kinds of distributed systems, including but not limited to the HPCC. As an example, the algorithm can be applied to machine learning tasks distributed over millions of mobile devices, where each device holds the user's private data. In this specific scenario, our approach has the following advantages: 1. Due to decentralization, users do not need to upload their private data to a central server, which preserves user privacy. 2. Due to decentralization, no central server has to collect information (e.g., models, data) from massive numbers of devices, which could be extremely costly in both speed and storage, as you mentioned; the approach therefore scales to problems of significantly larger size. 3. Due to the compression technique, over 95% of the communication cost can be eliminated. As a matter of fact, the speedup from our approach is more significant when network bandwidth is limited and the communication bottleneck is more dominant. For instance, communication between mobile devices is much more expensive than communication within the HPCC, where we could potentially deploy fast (but very expensive) network interfaces such as InfiniBand. The potential benefits beyond HPC systems are therefore even stronger. Returning to the HPCC case: the good news is that if we can reduce the communication cost by over 95%, we can instead choose much cheaper hardware to achieve the same performance, which should reduce both energy consumption and equipment cost. Let me know if you have further questions. Thanks.
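The decentralized communication described in this reply can be sketched as a simple gossip-averaging step, where each node mixes its model only with its neighbors rather than synchronizing through a central server. The ring topology and mixing weights below are illustrative assumptions, not the exact ones from the paper:

```python
import numpy as np

def gossip_round(models, weight=0.5):
    """One round of decentralized averaging on a ring topology.

    Each node keeps `weight` of its own model and splits the rest
    evenly between its two ring neighbors (a doubly stochastic mix).
    Topology and weights are illustrative, not the paper's exact choice.
    """
    n = len(models)
    mixed = []
    for i in range(n):
        left, right = models[(i - 1) % n], models[(i + 1) % n]
        mixed.append(weight * models[i] + (1 - weight) * 0.5 * (left + right))
    return mixed

models = [np.array([float(i)]) for i in range(4)]  # nodes start in disagreement
for _ in range(50):
    models = gossip_round(models)
# All nodes drift toward the global average (1.5) with no central server.
```

Repeated local mixing drives every node to the network-wide average, which is why a flexible topology can replace the costly all-to-server synchronization.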
Andrew Christlieb · 4 years ago
Nice work. I am wondering about the learning: it does not seem to stagnate with the number of epochs, which often happens. It is impressive. Is there an obvious reason why the other methods stagnate and your new distributed algorithm does not? Is it simply the fact that you can handle more data because of the parallel efficiency, or is there something else going on?
Xiaorui Liu · 4 years ago
Hi Andrew, thanks for the question. The nice part of our approach is that it doesn't require more epochs than its non-parallel counterparts, as shown in our theoretical analysis and experiments, while all methods in the literature have to sacrifice in this regard. There are two reasons: (1) Other methods don't handle the issues in decentralized communication well. For instance, when the data distribution over machines is heterogeneous, local learning pulls the local models in very different directions. This is totally fine with centralized communication, where a global synchronization resolves the conflict immediately; with decentralized communication, however, such conflict cannot be resolved immediately, so the solution becomes biased, which then requires a diminishing learning rate to ensure convergence to the optimum. The consequence is a much slower convergence rate. Our approach adopts a primal-dual method, and the dual variables essentially correct the bias caused by decentralization. (2) Other methods don't handle the compression error well. In our approach, the compression error vanishes entirely as we approach the optimal solution. Moreover, while the error does exist during the converging process, it is compensated in a smart way so that no information is lost. As a result, both our theory and experiments show that our approach maintains the iteration complexity as long as the network connectivity and the compression ratio are reasonable. I am not totally sure whether my answer aligns well with your question, but feel free to let me know if it doesn't. Thanks.
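The compensation mechanism mentioned in point (2) resembles error feedback: whatever the compressor drops is remembered locally and re-added before the next transmission, so nothing is permanently lost. A minimal sketch, where the variable names, the 10% ratio, and the top-k operator are illustrative assumptions rather than the paper's exact scheme:

```python
import numpy as np

def compress(v, ratio=0.1):
    """Top-k sparsifier (illustrative stand-in for the paper's operator)."""
    k = max(1, int(v.size * ratio))
    out = np.zeros_like(v)
    idx = np.argpartition(np.abs(v), -k)[-k:]
    out[idx] = v[idx]
    return out

# Error feedback: the part of each update dropped by compression is kept
# in a local memory and added back before the next compression.
memory = np.zeros(100)
rng = np.random.default_rng(1)
for step in range(5):
    grad = rng.normal(size=100)
    corrected = grad + memory   # compensate the previously dropped residual
    sent = compress(corrected)  # what is actually communicated (sparse)
    memory = corrected - sent   # residual carried over to the next step
```

Because the residual is always folded back in, the information dropped at any single step is only delayed, not discarded, which is what lets the compression error vanish near the optimum.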