Optimization Matters: Guidelines to Improve Representation Learning with Deep Networks
Training deep neural networks remains a challenging problem, with open questions regarding convergence and the quality of learned representations. Gradient-based optimization methods are standard in practice, but their cases of failure and success are still poorly understood. In this context, we set out to better understand the convergence properties of different optimization strategies under different parameter choices. Our results show that (i) feature embeddings are strongly affected by the optimization settings, (ii) default parameters often yield suboptimal results, (iii) educated parameter choices bring significant improvements, and (iv) learning rate decay should always be considered. These findings offer guidelines for the training and deployment of deep networks.
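To make guideline (iv) concrete, the effect of learning rate decay can be illustrated with a minimal sketch (not the paper's code): plain gradient descent on the toy objective f(w) = w², comparing an aggressive constant learning rate against the same initial rate with exponential decay. All names and hyperparameters here are illustrative assumptions.

```python
def gradient_descent(lr_schedule, w=5.0, steps=50):
    """Minimize f(w) = w**2 with a per-step learning rate lr_schedule(t)."""
    for t in range(steps):
        grad = 2.0 * w                    # f'(w) = 2w
        w = w - lr_schedule(t) * grad
    return w

constant = lambda t: 1.0                  # fixed rate: oscillates, never converges
decayed  = lambda t: 1.0 * (0.9 ** t)     # same initial rate, decayed each step

w_const = gradient_descent(constant)
w_decay = gradient_descent(decayed)
print(abs(w_const), abs(w_decay))         # constant lr stalls; decayed lr converges
```

With the constant rate, each update flips the sign of w without shrinking it, so the iterate oscillates indefinitely; the decayed schedule uses the same initial step size but settles near the optimum, which is the behavior behind the recommendation to always consider learning rate decay.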