Author(s):
- Golkar, Siavash
- Cranmer, Kyle
Abstract:
We introduce backdrop, a flexible and simple-to-implement method, intuitively described as dropout acting only along the backpropagation pipeline. Backdrop is implemented via one or more masking layers which are inserted at specific points along the network. Each backdrop masking layer acts as the identity in the forward pass, but randomly masks parts of the backward gradient propagation. Intuitively, inserting a backdrop layer after any convolutional layer leads to stochastic gradients corresponding to features of that scale. Therefore, backdrop is well suited for problems in which the data have a multi-scale, hierarchical structure. Backdrop can also be applied to problems with non-decomposable loss functions where standard SGD methods are not well suited. We perform a number of experiments and demonstrate that backdrop leads to significant improvements in generalization.
Document: https://arxiv.org/abs/1806.01337
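The mechanism described in the abstract (identity in the forward pass, stochastic masking of the gradient in the backward pass) can be sketched as a custom autograd function. The snippet below is a minimal illustration, assuming a PyTorch-style interface; the names _BackdropFn and BackdropMask, the masking probability p, and the choice of masking whole samples along the batch dimension are illustrative assumptions, not the authors' reference implementation.

```python
# Minimal sketch of a backdrop-style masking layer (assumes PyTorch; not the paper's code).
# Forward pass is the identity; backward pass randomly zeroes parts of the incoming
# gradient (here per sample along the batch dimension) and rescales the survivors.
import torch
import torch.nn as nn


class _BackdropFn(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x, p):
        ctx.p = p
        return x.view_as(x)  # identity in the forward pass

    @staticmethod
    def backward(ctx, grad_output):
        # Keep each sample's gradient with probability 1 - p; rescale so the
        # expected gradient is unchanged.
        keep = (torch.rand(grad_output.shape[0], device=grad_output.device) > ctx.p).float()
        keep = keep.view(-1, *([1] * (grad_output.dim() - 1)))
        return grad_output * keep / (1.0 - ctx.p), None  # no gradient w.r.t. p


class BackdropMask(nn.Module):
    """Illustrative masking layer: identity forward, stochastically masked backward."""

    def __init__(self, p=0.5):
        super().__init__()
        assert 0.0 <= p < 1.0
        self.p = p

    def forward(self, x):
        return _BackdropFn.apply(x, self.p) if self.training else x
```

In use, such a layer would be inserted after, for example, a convolutional block, so that gradients associated with that block's feature scale are stochastically dropped while the forward activations pass through unchanged; other masking granularities (e.g. spatial rather than per-sample) follow the same pattern.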