Author(s):
- Izmailov, Pavel
- Podoprikhin, Dmitrii
- Garipov, Timur
- Vetrov, Dmitry
- Wilson, Andrew Gordon
Abstract:
Deep neural networks are typically trained by optimizing a loss function with an SGD variant, in conjunction with a decaying learning rate, until convergence. We show that simple averaging of multiple points along the trajectory of SGD, with a cyclical or constant learning rate, leads to better generalization than conventional training. We also show that this Stochastic Weight Averaging (SWA) procedure finds much broader optima than SGD, and approximates the recent Fast Geometric Ensembling (FGE) approach with a single model. Using SWA we achieve notable improvement in test accuracy over conventional SGD training on a range of state-of-the-art residual networks, PyramidNets, DenseNets, and Shake-Shake networks on CIFAR-10, CIFAR-100, and ImageNet. In short, SWA is extremely easy to implement, improves generalization, and has almost no computational overhead.
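The averaging procedure described in the abstract can be written in a few lines. The sketch below is a minimal PyTorch-style illustration, not the authors' reference implementation: the optimizer settings and the `swa_start` and `swa_lr` values are placeholder assumptions, and a constant learning rate is used during the averaging phase (the paper also considers a cyclical schedule).

```python
import torch

def train_with_swa(model, loader, loss_fn, epochs=150, swa_start=100, swa_lr=0.05):
    """Minimal SWA sketch: run standard SGD, then keep a running average of the
    weights reached at the end of each epoch after `swa_start`."""
    optimizer = torch.optim.SGD(model.parameters(), lr=0.1,
                                momentum=0.9, weight_decay=1e-4)
    swa_state = None   # running average of the floating-point parameters
    n_averaged = 0

    for epoch in range(epochs):
        if epoch >= swa_start:
            # Switch to the constant SWA learning rate for the averaging phase.
            for group in optimizer.param_groups:
                group["lr"] = swa_lr

        # Standard SGD inner loop over mini-batches.
        for x, y in loader:
            optimizer.zero_grad()
            loss_fn(model(x), y).backward()
            optimizer.step()

        if epoch >= swa_start:
            # Update the running average of weights with the current iterate.
            current = {k: v.detach().clone()
                       for k, v in model.state_dict().items()
                       if v.is_floating_point()}
            if swa_state is None:
                swa_state = current
            else:
                for k in swa_state:
                    swa_state[k] += (current[k] - swa_state[k]) / (n_averaged + 1)
            n_averaged += 1

    # Copy the averaged weights back into the model.
    model.load_state_dict({**model.state_dict(), **swa_state})
    return model
```

Because batch-normalization running statistics correspond to individual SGD iterates rather than the averaged weights, they should be recomputed with one additional forward pass over the training data before evaluating the averaged model. Recent PyTorch releases provide utilities along these lines in torch.optim.swa_utils (AveragedModel, SWALR, update_bn), which would typically be preferable to hand-rolling the averaging as above.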
Document:
https://arxiv.org/abs/1803.05407
References:
- P. Chaudhari, Anna Choromanska, S. Soatto, Yann LeCun, C. Baldassi, C. Borgs, J. Chayes, Levent Sagun, and R. Zecchina. Entropy-SGD: Biasing gradient descent into wide valleys. In International Conference on Learning Representations (ICLR), 2017.
- Anna Choromanska, Mikael Henaff, Michael Mathieu, Gérard Ben Arous, and Yann LeCun. The loss surfaces of multilayer networks. In Artificial Intelligence and Statistics, pages 192–204, 2015.
- Laurent Dinh, Razvan Pascanu, Samy Bengio, and Yoshua Bengio. Sharp minima can generalize for deep nets. In International Conference on Machine Learning, pages 1019–1028, 2017.
- Felix Draxler, Kambis Veschgini, Manfred Salmhofer, and Fred Hamprecht. Essentially no barriers in neural network energy landscape. In Proceedings of the 35th International Conference on Machine Learning, pages 1308–1317, 2018.
- Timur Garipov, Pavel Izmailov, Dmitrii Podoprikhin, Dmitry P Vetrov, and Andrew Gordon Wilson. Loss surfaces, mode connectivity, and fast ensembling of DNNs. arXiv preprint arXiv:1802.10026, 2018.
- Xavier Gastaldi. Shake-shake regularization. arXiv preprint arXiv:1705.07485, 2017.
- Saeed Ghadimi and Guanghui Lan. Stochastic first- and zeroth-order methods for nonconvex stochastic programming. SIAM Journal on Optimization, 23(4):2341–2368, 2013.
- Ian J Goodfellow, Oriol Vinyals, and Andrew M Saxe. Qualitatively characterizing neural network optimization problems. International Conference on Learning Representations, 2015.
- Dongyoon Han, Jiwhan Kim, and Junmo Kim. Deep pyramidal residual networks. arXiv preprint arXiv:1610.02915, 2016.
- Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.
- Sepp Hochreiter and Jürgen Schmidhuber. Flat minima. Neural Computation, 9(1):1–42, 1997.
- Gao Huang, Zhuang Liu, Kilian Q Weinberger, and Laurens van der Maaten. Densely connected convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, volume 1, page 3, 2017.
- Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In International Conference on Machine Learning, pages 448–456, 2015.
- Nitish Shirish Keskar, Dheevatsa Mudigere, Jorge Nocedal, Mikhail Smelyanskiy, and Ping Tak Peter Tang. On large-batch training for deep learning: Generalization gap and sharp minima. International Conference on Learning Representations, 2017.
- Ilya Loshchilov and Frank Hutter. SGDR: Stochastic gradient descent with restarts. International Conference on Learning Representations, 2017.
- Stephan Mandt, Matthew D Hoffman, and David M Blei. Stochastic gradient descent as approximate Bayesian inference. The Journal of Machine Learning Research, 18(1):4873–4907, 2017.
- Kirill Neklyudov, Dmitry Molchanov, Arsenii Ashukha, and Dmitry Vetrov. Variance networks: When expectation does not meet your expectations. arXiv preprint arXiv:1803.03764, 2018.
- Boris T Polyak and Anatoli B Juditsky. Acceleration of stochastic approximation by averaging. SIAM Journal on Control and Optimization, 30(4):838–855, 1992.
- David Ruppert. Efficient estimations from a slowly convergent Robbins-Monro process. Technical report, Cornell University Operations Research and Industrial Engineering, 1988.
- Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, et al. ImageNet large scale visual recognition challenge. International Journal of Computer Vision, 115(3):211–252, 2015.
- Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
- Leslie N Smith and Nicholay Topin. Exploring loss function topology with cyclical learning rates. arXiv preprint arXiv:1702.04283, 2017.
- Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout: A simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research, 15(1):1929–1958, 2014.
- Sergey Zagoruyko and Nikos Komodakis. Wide residual networks. arXiv preprint arXiv:1605.07146, 2016.