Adaptive Gradient Compression: An Information-Theoretic Analysis of Entropy and Fisher-Based Learning Dynamics

Authors

  • Hidayaturrahman Hidayaturrahman, Bina Nusantara University

DOI:

https://doi.org/10.21512/ijcshai.v2i2.14533

Keywords:

Gradient compression, entropy, Fisher information, learning dynamics, information theory, optimization efficiency, deep learning

Abstract

Deep neural networks require intensive computation and communication due to the large volume of gradient updates exchanged during training. This paper investigates Adaptive Gradient Compression (AGC), an information-theoretic framework that reduces redundant gradients while preserving learning stability. Two independent compression mechanisms are analyzed: an entropy-based scheme, which filters gradients with low informational uncertainty, and a Fisher-based scheme, which prunes gradients with low sensitivity to the loss curvature. Both approaches are evaluated on the CIFAR-10 dataset using a ResNet-18 model under identical hyperparameter settings. Results show that entropy-guided compression achieves a 33.8× reduction in gradient density with only a 4.4% decrease in test accuracy, while Fisher-based compression attains a 14.3× reduction and smoother convergence behavior. Despite modest increases in per-iteration latency, both methods maintain stable training and demonstrate that gradient redundancy can be systematically controlled through information metrics. These findings highlight a new pathway toward information-aware optimization, where learning efficiency is governed by the informational relevance of gradients rather than their magnitude alone. Furthermore, this study emphasizes the practical significance of integrating information theory into deep learning optimization. By selectively transmitting gradients that carry higher information content, AGC effectively mitigates communication bottlenecks in distributed training environments. Experimental analyses further reveal that adaptive compression adjusts to the evolving training dynamics, providing robustness across learning stages. The proposed framework can thus serve as a foundation for future low-overhead optimization methods that balance accuracy, stability, and efficiency, all of which are crucial for large-scale deep learning deployments in edge and cloud computing contexts.
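The abstract describes the two selection rules only at a high level. As a rough illustration, the sketch below shows how entropy- and Fisher-guided gradient masking of this kind could be wired into a PyTorch training loop; the scoring functions, histogram binning, entropy threshold, and keep ratio are illustrative assumptions, not the paper's actual implementation.

```python
import torch

def entropy_score(grad: torch.Tensor, bins: int = 32) -> float:
    """Shannon entropy of a gradient tensor's magnitude histogram (one scalar per tensor)."""
    mags = grad.detach().abs().flatten().float()
    if mags.numel() == 0 or float(mags.max()) == 0.0:
        return 0.0
    hist = torch.histc(mags, bins=bins, min=0.0, max=float(mags.max()))
    p = hist / hist.sum()
    p = p[p > 0]
    return float(-(p * p.log()).sum())

def entropy_filter(model: torch.nn.Module, threshold: float = 1.0) -> None:
    """Entropy-based scheme: zero out whole gradient tensors whose magnitude
    distribution carries little uncertainty (low Shannon entropy)."""
    for p in model.parameters():
        if p.grad is not None and entropy_score(p.grad) < threshold:
            p.grad.zero_()

def fisher_prune(model: torch.nn.Module, keep_ratio: float = 0.1) -> None:
    """Fisher-based scheme: keep only the elements with the largest squared
    gradients (a diagonal empirical-Fisher proxy) and zero out the rest."""
    for p in model.parameters():
        if p.grad is None:
            continue
        fisher = p.grad.detach() ** 2                    # element-wise sensitivity proxy
        k = max(1, int(keep_ratio * fisher.numel()))
        cutoff = torch.topk(fisher.flatten(), k).values.min()
        p.grad.mul_((fisher >= cutoff).to(p.grad.dtype))

# Usage sketch: apply ONE of the two rules between loss.backward() and optimizer.step().
# loss.backward()
# entropy_filter(model)        # or: fisher_prune(model)
# optimizer.step()
```

In this reading, each rule is applied on its own before the optimizer update, consistent with the abstract's framing of two independent compression mechanisms evaluated under identical hyperparameter settings.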


Author Biography

Hidayaturrahman Hidayaturrahman, Bina Nusantara University

Computer Science Department, School of Computer Science



Published

2025-10-30

How to Cite

Hidayaturrahman, H. (2025). Adaptive Gradient Compression: An Information-Theoretic Analysis of Entropy and Fisher-Based Learning Dynamics. International Journal of Computer Science and Humanitarian AI, 2(2), 49–58. https://doi.org/10.21512/ijcshai.v2i2.14533

Issue

Section

Articles