Adaptive Gradient Compression: An Information-Theoretic Analysis of Entropy and Fisher-Based Learning Dynamics
DOI: https://doi.org/10.21512/ijcshai.v2i2.14533
Keywords: Gradient compression, entropy, Fisher information, learning dynamics, information theory, optimization efficiency, deep learning
Abstract
Deep neural networks require intensive computation and communication because of the large volume of gradient updates exchanged during training. This paper investigates Adaptive Gradient Compression (AGC), an information-theoretic framework that discards redundant gradients while preserving learning stability. Two independent compression mechanisms are analyzed: an entropy-based scheme, which filters out gradients carrying low informational uncertainty, and a Fisher-based scheme, which prunes gradients with low sensitivity to the loss curvature. Both approaches are evaluated on the CIFAR-10 dataset with a ResNet-18 model under identical hyperparameter settings. Results show that entropy-guided compression achieves a 33.8× reduction in gradient density with only a 4.4% decrease in test accuracy, while Fisher-based compression attains a 14.3× reduction and smoother convergence behavior. Despite modest increases in per-iteration latency, both methods maintain stable training and demonstrate that gradient redundancy can be systematically controlled through information metrics. These findings point to a pathway toward information-aware optimization, in which learning efficiency is governed by the informational relevance of gradients rather than their magnitude alone. The study also underscores the practical significance of integrating information theory into deep learning optimization: by transmitting only the gradients that carry higher information content, AGC mitigates communication bottlenecks in distributed training environments. Experimental analyses further reveal that adaptive compression adjusts to the evolving training dynamics, providing robustness across learning stages. The proposed framework can thus serve as a foundation for future low-overhead optimization methods that balance accuracy, stability, and efficiency, all crucial aspects of large-scale deep learning deployments in edge and cloud computing contexts.
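To make the two selection rules in the abstract concrete, the sketch below shows one plausible PyTorch-style realization: an entropy-guided mask that keeps gradient entries falling in low-probability (high-surprisal) histogram bins, and a Fisher-guided mask that keeps entries with the largest squared gradients as a diagonal-Fisher sensitivity proxy. The function names, the histogram-based surprisal estimate, the squared-gradient Fisher proxy, and the keep_ratio threshold are illustrative assumptions, not the authors' published implementation.

```python
# Minimal sketch of entropy- and Fisher-guided gradient masking (assumed design,
# not the paper's exact AGC procedure).
import torch


def entropy_mask(grad: torch.Tensor, bins: int = 32, keep_ratio: float = 0.1) -> torch.Tensor:
    """Keep entries whose values fall in low-probability (high-surprisal) histogram bins."""
    flat = grad.flatten().float()
    lo, hi = flat.min().item(), flat.max().item()
    if hi <= lo:  # degenerate case: all entries identical, nothing informative to keep
        return torch.zeros_like(grad, dtype=torch.bool)
    hist = torch.histc(flat, bins=bins, min=lo, max=hi)
    probs = hist / hist.sum()
    edges = torch.linspace(lo, hi, bins + 1, device=flat.device)
    bin_idx = torch.bucketize(flat, edges[1:-1])  # bin index of each entry, in [0, bins-1]
    surprisal = -torch.log(probs[bin_idx].clamp(min=1e-12))
    k = max(1, int(keep_ratio * flat.numel()))
    threshold = surprisal.topk(k).values.min()
    return (surprisal >= threshold).reshape(grad.shape)


def fisher_mask(grad: torch.Tensor, keep_ratio: float = 0.1) -> torch.Tensor:
    """Keep entries with the largest squared gradient, a diagonal-Fisher sensitivity proxy."""
    fisher = grad.flatten().float().pow(2)
    k = max(1, int(keep_ratio * fisher.numel()))
    threshold = fisher.topk(k).values.min()
    return (fisher >= threshold).reshape(grad.shape)


def compress_gradients(model: torch.nn.Module, mode: str = "entropy") -> None:
    """Zero out low-information gradient entries in place, before the optimizer step."""
    for p in model.parameters():
        if p.grad is None:
            continue
        mask = entropy_mask(p.grad) if mode == "entropy" else fisher_mask(p.grad)
        p.grad.mul_(mask.to(p.grad.dtype))
```

In a training loop, compress_gradients(model, mode="entropy") would sit between loss.backward() and optimizer.step(). The keep_ratio parameter sets the density of retained gradients; for instance, keeping roughly 3% of entries corresponds to about a 33.8× density reduction and roughly 7% to about 14.3×, matching the compression ratios reported in the abstract, although the paper's actual thresholding strategy may be adaptive rather than fixed.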
License
Copyright (c) 2025 Hidayaturrahman Hidayaturrahman

This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.