Publication

ElastiQuant: elastic quantization strategy for communication efficient distributed machine learning in IoT

Sudharsan, Bharath
Breslin, John G.
Ali, Muhammad Intizar
Corcoran, Peter
Ranjan, Rajiv
Citation
Sudharsan, Bharath, Breslin, John G., Ali, Muhammad Intizar, Corcoran, Peter, & Ranjan, Rajiv. (2022). ElastiQuant: elastic quantization strategy for communication efficient distributed machine learning in IoT. In Proceedings of the 37th ACM/SIGAPP Symposium on Applied Computing, Virtual Event. https://doi.org/10.1145/3477314.3507135
Abstract
In distributed machine learning training, communicating model updates among workers has always been a bottleneck. The impact on the quality of the resulting models is greater when distributed training runs on devices with low hardware specifications and over uncertain real-world IoT networks, where congestion, latency, and bandwidth issues are common. In this scenario, gradient quantization combined with encoding is an effective way to reduce the cost of communicating model updates. Other approaches include limiting the client-server communication frequency, adaptive compression that varies the spacing between quantization levels, reusing outdated gradients, deep compression to reduce transmission packet size, and adaptive tuning of the number of bits transmitted per round. The optimization provided by such non-comprehensive approaches does not suffice for high-dimensional NN models with large model updates. This paper presents ElastiQuant, an elastic quantization strategy that aims to reduce the impact caused by limitations in distributed IoT training scenarios. The distinguishing highlights of this work are: (i) theoretical assurances and bounds on variance and the number of communication bits are provided, (ii) worst-case variance analysis is performed, and (iii) momentum is considered in the convergence assurance. Experimental evaluation of ElastiQuant and comparison with top schemes, obtained by distributed training of 5 ResNets on 18 edge GPUs over the ImageNet and CIFAR datasets, show: improved solution quality in terms of ≈ 2--11 % training loss reduction, ≈ 1--4 % accuracy boost, and ≈ 4--22 % variance drop; and positive scalability due to higher communication compression, which saves bandwidth and yields ≈ 4--30 min per-epoch training speedups.
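To illustrate the general idea of gradient quantization plus encoding that the abstract refers to, the sketch below shows a generic QSGD-style stochastic quantizer in NumPy: gradient magnitudes are mapped onto a small number of uniform levels and rounded stochastically so the quantizer stays unbiased, letting a worker transmit one norm, one sign bit, and a few level bits per coordinate instead of full-precision floats. This is an illustrative assumption for exposition only; it is not the ElastiQuant algorithm, and the function name and level count are hypothetical.

import numpy as np

def stochastic_quantize(grad, num_levels=16, rng=None):
    # Generic QSGD-style stochastic quantizer (illustrative, not ElastiQuant).
    rng = rng or np.random.default_rng()
    norm = np.linalg.norm(grad)
    if norm == 0.0:
        return np.zeros_like(grad), norm

    # Scale magnitudes into [0, num_levels] and round stochastically,
    # so the quantized gradient equals the true gradient in expectation.
    scaled = np.abs(grad) / norm * num_levels
    lower = np.floor(scaled)
    prob_up = scaled - lower
    levels = lower + (rng.random(grad.shape) < prob_up)

    # A worker would only need to send: the norm (1 float), the signs
    # (1 bit each), and the integer levels (log2(num_levels + 1) bits each).
    quantized = np.sign(grad) * levels / num_levels * norm
    return quantized, norm

# Usage: quantize a synthetic gradient and inspect the compression error.
g = np.random.randn(1_000_000).astype(np.float32)
q, _ = stochastic_quantize(g, num_levels=16)
print("relative error:", np.linalg.norm(q - g) / np.linalg.norm(g))

Fewer levels mean fewer bits per coordinate but a larger quantization variance; the paper's contribution lies in adapting this trade-off elastically and bounding the resulting variance and bit counts, which the sketch above does not attempt.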
Publisher
Association for Computing Machinery (ACM)
Publisher DOI
10.1145/3477314.3507135
Rights
Attribution-NonCommercial-NoDerivatives 4.0 International