# Sparse Spectrum Warped Input Measures for Nonstationary Kernel Learning

NIPS 2020, (2020)

Abstract

We establish a general form of explicit, input-dependent, measure-valued warpings for learning nonstationary kernels. While stationary kernels are ubiquitous and simple to use, they struggle to adapt to functions that vary in smoothness with respect to the input. The proposed learning algorithm warps inputs as conditional Gaussian measu...

Introduction

- Many interesting real world phenomena exhibit varying characteristics, such as smoothness, across their domain.
- The typical kernel based learner canonically relies on a stationary kernel function, a measure of "similarity", to define the prior beliefs over the function space.
- Such a kernel cannot represent desirable nonstationary nuances, like varying spatial smoothness and sudden discontinuities.
- One obvious way to alleviate the problem of finding the appropriate kernel function given one’s data is hyperparameter optimisation.
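
To make the stationarity limitation above concrete, here is a minimal sketch showing that a stationary kernel depends only on the difference between its inputs, so shifting both inputs leaves the similarity unchanged. It assumes NumPy and a squared exponential kernel, neither of which this summary specifies; the name `rbf_kernel` and the sample points are illustrative only.

```python
import numpy as np

def rbf_kernel(x, y, lengthscale=1.0):
    """Squared exponential kernel; it sees the inputs only through x - y."""
    diff = np.asarray(x, dtype=float) - np.asarray(y, dtype=float)
    return float(np.exp(-0.5 * np.dot(diff, diff) / lengthscale**2))

# Two arbitrary points and an arbitrary translation (illustrative values).
x = np.array([0.2, -1.0])
y = np.array([0.5, 0.3])
shift = np.array([10.0, -4.0])

k_original = rbf_kernel(x, y)
k_shifted = rbf_kernel(x + shift, y + shift)

# Stationarity: translating both inputs does not change the kernel value,
# so the assumed smoothness is the same everywhere in the input space.
print(k_original, k_shifted)
```

Because the kernel value is identical in both locations, tuning a single lengthscale cannot make the model smooth in one region and rough in another.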

Highlights

- Many interesting real world phenomena exhibit varying characteristics, such as smoothness, across their domain
- In this paper we propose a method for nonstationary kernel learning, based on sparse spectral kernel representations
- We have provided implementations for the random Fourier features kernel (RFFS), RFFNS, and sparse spectrum warped input measures (SSWIM)
- We have proposed a crucial advance to the sparse spectrum Gaussian process framework to account for nonstationarity through a novel input warping formulation
- We introduced a novel form of input warping analytically incorporating complete Gaussian measures in the functional warping with the concept of pseudo-training data and latent self-supervision
- Our model suggests an interesting and effective inductive bias that is nicely interpreted as a learned conditional affine transformation
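
The "conditional affine transformation" reading in the highlights above can be illustrated with a short sketch. It assumes the warp takes the elementwise form `scale_fn(x) * x + shift_fn(x)` and composes it with a squared exponential kernel. The functions `scale_fn` and `shift_fn` are hand-picked and hypothetical, standing in for whatever the model learns, and the sketch ignores the measure-valued (uncertainty) part of the warping entirely. It is not the authors' implementation.

```python
import numpy as np

def rbf_kernel(x, y, lengthscale=1.0):
    """Stationary squared exponential kernel on already-warped inputs."""
    diff = np.atleast_1d(x).astype(float) - np.atleast_1d(y).astype(float)
    return float(np.exp(-0.5 * np.dot(diff, diff) / lengthscale**2))

def affine_warp(x, scale_fn, shift_fn):
    """Input-dependent affine map: x -> scale_fn(x) * x + shift_fn(x)."""
    x = np.atleast_1d(x).astype(float)
    return scale_fn(x) * x + shift_fn(x)

def warped_kernel(x, y, scale_fn, shift_fn):
    """Stationary kernel evaluated on warped inputs; nonstationary in x, y."""
    return rbf_kernel(affine_warp(x, scale_fn, shift_fn),
                      affine_warp(y, scale_fn, shift_fn))

# Hypothetical, hand-picked warp (NOT learned): stretches space away from 0.
scale_fn = lambda x: 1.0 + x**2
shift_fn = lambda x: np.zeros_like(x)

# Both pairs are 0.1 apart in the original input space.
k_near_origin = warped_kernel(0.0, 0.1, scale_fn, shift_fn)
k_far_from_origin = warped_kernel(2.0, 2.1, scale_fn, shift_fn)
print(k_near_origin, k_far_from_origin)
```

The two pairs have the same separation, yet the second pair is far less similar after warping, which is the kind of location-dependent smoothness a stationary kernel alone cannot express.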

Methods

**MSE**

| Dataset | GP | BWGP | MLWGP3 | MLWGP20 | SSWIM1 | SSWIM2 |
|---|---|---|---|---|---|---|
| abalone | 4.55 ± 0.14 | 4.55 ± 0.11 | 4.54 ± 0.10 | 4.59 ± 0.32 | 4.64 ± 0.13 | 4.50 ± 0.11 |
| creep | 584.9 ± 71.2 | 491.8 ± 36.2 | 502.3 ± 43.3 | 506.3 ± 46.1 | 483.69 ± 64.12 | 279.86 ± 31.88 |
| ailerons | 2.95 ± 0.16 | 2.91 ± 0.14 | 2.80 ± 0.11 | 3.42 ± 2.87 | 2.96 ± 0.08 | 2.83 ± 0.06 |

**MNLP**

| Dataset | GP | BWGP | MLWGP3 | MLWGP20 | SSWIM1 | SSWIM2 |
|---|---|---|---|---|---|---|
| abalone | 2.17 ± 0.01 | 1.99 ± 0.01 | 1.97 ± 0.02 | 1.99 ± 0.05 | 2.18 ± 0.01 | 2.17 ± 0.02 |
| creep | 4.46 ± 0.03 | 4.31 ± 0.04 | 4.21 ± 0.03 | 4.21 ± 0.08 | 4.45 ± 0.03 | 4.27 ± 0.03 |
| ailerons | -7.30 ± 0.01 | -7.38 ± 0.02 | -7.44 ± 0.01 | -7.45 ± 0.08 | -7.24 ± 0.01 | -7.00 ± 0.02 |

C Additional Experiments

C.1 Increasing number of pseudo-training points. For the "increasing number of pseudo-training points" experiment, the authors used 1 layer of warping with 256 features for both the warping and top-level predictive functions.

C.2 Increasing warping depth. The authors used 256 features and 1280 pseudo-training points for all of the experiments.

C.3 Complete real-dataset experiments table. Table 2 contains additional real-world experiments that extend the major experimental results from the main paper. Datasets listed alongside it include elevators (18, 8751) and airfoil (5, 1503).

C.4 Extended discussion. It is imperative to note that the aim here is not to claim any algorithmic dominance when comparing methods.

- The authors ran with 256 features and 1280 pseudo-training points for 150 steps, with 10 repeats, and evaluated the test RMSE and test MNLP on the test set for every single epoch of optimisation.
- Other loss functions and training schemes, such as leave-one-out cross validation
- These results corroborate long known discussions from [?] about the risk of overfitting from trusting the marginal likelihood with standard optimisation procedures; their importance seems to have been largely ignored in the evaluation of recent methodology innovations in the GP literature.
- The authors believe that a more open discussion should be on the table for analysing the interplay between model expressiveness and the effect this has on overfitting; this is especially pertinent to the GP literature, which has placed a large emphasis on the importance of the marginal likelihood as a valid hyperparameter optimisation loss
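
For readers unfamiliar with the marginal likelihood loss discussed above, the following sketch computes the negative log marginal likelihood for a feature-based (sparse spectrum style) GP. It assumes a unit Gaussian prior on the feature weights and Gaussian noise, so that `y ~ N(0, Phi Phi^T + noise_var * I)`. The function name `nlml_feature_gp` and the random data are illustrative, and this is the standard textbook expression rather than the authors' training code.

```python
import numpy as np

def nlml_feature_gp(Phi, y, noise_var):
    """Negative log marginal likelihood of y ~ N(0, Phi Phi^T + noise_var I).

    Uses the m x m matrix A = Phi^T Phi + noise_var I, which is cheaper than
    the n x n covariance when the number of features m is below n.
    """
    n, m = Phi.shape
    A = Phi.T @ Phi + noise_var * np.eye(m)
    L = np.linalg.cholesky(A)
    alpha = np.linalg.solve(L, Phi.T @ y)  # L^{-1} Phi^T y
    # Woodbury identity: y^T K^{-1} y = (y^T y - ||alpha||^2) / noise_var
    data_fit = (y @ y - alpha @ alpha) / noise_var
    # Determinant lemma: log|K| = log|A| + (n - m) * log(noise_var)
    log_det = 2.0 * np.sum(np.log(np.diag(L))) + (n - m) * np.log(noise_var)
    return 0.5 * (data_fit + log_det + n * np.log(2.0 * np.pi))

# Illustrative random features and targets (not from the paper).
rng = np.random.default_rng(0)
Phi = rng.normal(size=(50, 8))
y = rng.normal(size=50)
loss = nlml_feature_gp(Phi, y, noise_var=0.1)
print(loss)
```

Minimising this quantity over kernel hyperparameters only uses the training data, which is why the bullets above recommend also tracking test RMSE and MNLP during optimisation.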

Conclusion

- The authors have proposed a crucial advance to the sparse spectrum Gaussian process framework to account for nonstationarity through a novel input warping formulation.
- The authors' model suggests an interesting and effective inductive bias that is nicely interpreted as a learned conditional affine transformation.
- This perspective invites a fresh take on how the authors can discover more effective representations of nonstationary data

- Table 1: RMSE and MNLP metrics for various real world datasets. MSE and MNLP metrics for comparison with Warped and Bayesian Warped GPs [?]. MSE results for ailerons are ×10⁻⁸
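
The two metrics named in the table caption above can be computed as in the sketch below. It assumes a Gaussian predictive distribution with per-point mean and variance, and takes MNLP to be the mean negative log probability of the test targets under that distribution. The function names and toy numbers are illustrative.

```python
import numpy as np

def rmse(y_true, y_mean):
    """Root mean squared error of the predictive mean."""
    y_true, y_mean = np.asarray(y_true, float), np.asarray(y_mean, float)
    return float(np.sqrt(np.mean((y_true - y_mean) ** 2)))

def mnlp(y_true, y_mean, y_var):
    """Mean negative log probability under N(y_mean, y_var), per test point."""
    y_true, y_mean = np.asarray(y_true, float), np.asarray(y_mean, float)
    y_var = np.asarray(y_var, float)
    nlp = 0.5 * np.log(2.0 * np.pi * y_var) + 0.5 * (y_true - y_mean) ** 2 / y_var
    return float(np.mean(nlp))

# Toy example: the same prediction error with two different variances.
y_true = np.array([0.0])
y_mean = np.array([1.0])
mnlp_calibrated = mnlp(y_true, y_mean, np.array([1.0]))
mnlp_overconfident = mnlp(y_true, y_mean, np.array([0.01]))
print(rmse(y_true, y_mean), mnlp_calibrated, mnlp_overconfident)
```

RMSE only scores the predictive mean, whereas MNLP also penalises a predictive variance that is too small for the error actually made, so the two metrics can rank methods differently.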

Related work

- Foundational work [26, 27] on kernel based nonstationarity necessitated manipulation of the kernel function with expensive inference procedures. More recently, spectral representations of kernel functions have emerged through Bochner’s theorem [9]. In this paradigm, one constructs kernels in the Fourier domain via random Fourier features (RFFs) [10, 11], with extensions to nonstationarity via the generalised Fourier inverse transform [28, 23, 2, 29]. While general, these methods suffer from drawbacks such as expensive computations and overfitting due to over-parameterised models [2]. More expressive modelling frameworks [30, 31, 32, 33] have played a major role in expanding the efficacy of kernel based learning. Perhaps the best known in the recent literature are Deep Kernel Learning (DKL; Wilson et al. [22]) and the deep Gaussian process (DGP) [34] with its various extensions [25, 35, 36]. While functionally elegant, methods like DKL and DGP often rely on increasing the complexity of the composition to produce expressiveness, and are often unsuitable or unwieldy in practice, occasionally performing worse than stationary inducing point GPs [25]. A notable difference between DGP and SSWIM is that one should interpret our pseudo-training points as hyperparameters of the kernel, as opposed to parameters of a variational approximation. Simple bijective input warpings were considered in [37] for transforming nonstationary functions into better behaved functions. In [38] the authors augment the standard GP model by learning nonstationary, data dependent functions for the hyperparameters of a nonstationary squared exponential kernel [39]; however, this approach is limited to low dimensions. More recently, [40] explored a dynamical systems view of input warpings by processing the inputs through time dependent differential fields.
Less related models presented in Wang and Neal [41], Dutordoir et al. [42], and Snelson et al. [43] involve output warping, non-Gaussian likelihoods, and heteroscedastic noise. For the curious reader, we examine contrasting properties of output and input warping in the supplementary material.
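
The random Fourier feature construction mentioned in the related work above can be sketched as follows. It assumes a squared exponential kernel, whose spectral density under Bochner’s theorem is Gaussian, and uses the standard cosine feature map with random phases. The name `rff_features`, the feature count, and the sample data are illustrative choices, not values from the paper.

```python
import numpy as np

def rff_features(X, omegas, phases):
    """Map X (n, d) to (n, m) features so that Phi @ Phi.T approximates K."""
    m = omegas.shape[0]
    return np.sqrt(2.0 / m) * np.cos(X @ omegas.T + phases)

rng = np.random.default_rng(0)
d, m, lengthscale = 2, 4096, 1.0

# Frequencies sampled from the kernel's spectral density (Gaussian for the
# squared exponential kernel), plus uniform random phases.
omegas = rng.normal(scale=1.0 / lengthscale, size=(m, d))
phases = rng.uniform(0.0, 2.0 * np.pi, size=m)

X = rng.normal(size=(5, d))
Phi = rff_features(X, omegas, phases)
K_approx = Phi @ Phi.T

# Exact squared exponential kernel matrix for comparison.
sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
K_exact = np.exp(-0.5 * sq_dists / lengthscale**2)

max_err = float(np.max(np.abs(K_approx - K_exact)))
print(max_err)
```

The approximation error shrinks as the number of features grows, and working with the `(n, m)` feature matrix instead of the `(n, n)` kernel matrix is what makes the sparse spectrum approach scalable.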

References

- Andrew Y Ng, Adam Coates, Mark Diel, Varun Ganapathi, Jamie Schulte, Ben Tse, Eric Berger, and Eric Liang. Autonomous inverted helicopter flight via reinforcement learning. In Experimental robotics IX. Springer, 2006.
- Jean-Francois Ton, Seth Flaxman, Dino Sejdinovic, and Samir Bhatt. Spatial mapping with Gaussian processes and nonstationary Fourier features. Spatial statistics, 2018.
- Nir Friedman, Michal Linial, Iftach Nachman, and Dana Pe’er. Using Bayesian networks to analyze expression data. Journal of computational biology, 2000.
- Ruben Martinez-Cantin. Bayesian optimization with adaptive kernels for robot control. In 2017 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2017.
- Carl Edward Rasmussen. Gaussian processes in machine learning. In Advanced lectures on machine learning. Springer, 2004.
- H. Bauer. Probability theory and elements of measure theory. Probability and mathematical statistics. Academic Press, 1981.
- Miguel Lázaro-Gredilla, Joaquin Quiñonero-Candela, Carl Edward Rasmussen, and Aníbal R Figueiras-Vidal. Sparse spectrum Gaussian process regression. Journal of Machine Learning Research (JMLR), 2010.
- Yunpeng Pan, Xinyan Yan, Evangelos A. Theodorou, and Byron Boots. Prediction under uncertainty in sparse spectrum Gaussian processes with applications to filtering and control. In Proceedings of the 34th International Conference on Machine Learning (ICML), volume 70 of Proceedings of Machine Learning Research, 2017.
- Salomon Bochner. Vorlesungen über Fouriersche Integrale: von S. Bochner. Akad. Verl.-Ges., 1932.
- A. Rahimi and B. Recht. Random features for large-scale kernel machines. In Neural Information Processing Systems (NIPS), 2007.
- A. Rahimi and B. Recht. Weighted sums of random kitchen sinks: Replacing minimization with randomisation in learning. In Neural Information Processing Systems (NIPS), 2008.
- C Bishop. Pattern recognition and machine learning (information science and statistics), 1st edn. 2006. corr. 2nd printing edn. Springer, New York, 2007.
- Mauricio A Alvarez, Lorenzo Rosasco, and Neil D Lawrence. Kernels for Vector-Valued Functions: a Review. Technical report, MIT - Computer Science and Artificial Intelligence Laboratory, 2011.
- Carl Jidling, Niklas Wahlström, Adrian Wills, and Thomas B. Schön. Linearly constrained Gaussian processes. In Advances in Neural Information Processing Systems (NIPS), 2017.
- Rafael Oliveira, Lionel Ott, and Fabio Ramos. Bayesian optimisation under uncertain inputs. In International Conference on Artificial Intelligence and Statistics (AISTATS), Naha, Okinawa, Japan, 2019.
- Michalis Titsias. Variational learning of inducing variables in sparse Gaussian processes. In International Conference on Artificial Intelligence and Statistics (AISTATS), 2009.
- H. Neudecker, S. Liu, and W. Polasek. The Hadamard product and some of its applications in statistics. Statistics, 26(4):365–373, 1995.
- Rafael González and Richard Woods. Digital image processing. isbn: 9780131687288. Prentice Hall, 2008.
- Dheeru Dua and Casey Graff. UCI machine learning repository. http://archive.ics.uci.edu/ml, 2017.
- Luís Torgo. Regression datasets. https://www.dcc.fc.up.pt/~ltorgo/Regression/DataSets.html.
- D Cole, C Martin-Moran, AG Sheard, HKDH Bhadeshia, and DJC MacKay. Modelling creep rupture strength of ferritic steel welds. Science and Technology of Welding and Joining, 2000.
- Andrew Gordon Wilson, Zhiting Hu, Ruslan Salakhutdinov, and Eric P Xing. Deep kernel learning. In International Conference on Artificial Intelligence and Statistics (AISTATS), 2016.
- Sami Remes, Markus Heinonen, and Samuel Kaski. Non-stationary spectral kernels. In Advances in Neural Information Processing Systems (NIPS), 2017.
- James Hensman, Alexander G. de G. Matthews, and Zoubin Ghahramani. Scalable variational Gaussian process classification. In International Conference on Artificial Intelligence and Statistics (AISTATS), 2015.
- Hugh Salimbeni and Marc Deisenroth. Doubly stochastic variational inference for deep Gaussian processes. In Advances in Neural Information Processing Systems (NIPS), 2017.
- Dave Higdon, Jenise Swall, and J Kern. Non-stationary spatial modeling. Bayesian statistics, 1999.
- Christopher J Paciorek and Mark J Schervish. Nonstationary covariance functions for Gaussian process regression. In Advances in Neural Information Processing Systems (NIPS), 2004.
- Yves-Laurent Kom Samo and Stephen Roberts. Generalized spectral kernels. arXiv preprint arXiv:1506.02236, 2015.
- Shengyang Sun, Guodong Zhang, Chaoqi Wang, Wenyuan Zeng, Jiaman Li, and Roger Grosse. Differentiable compositional kernel learning for Gaussian processes. In International Conference on Machine Learning (ICML), 2018.
- Roberto Calandra, Jan Peters, Carl Edward Rasmussen, and Marc Peter Deisenroth. Manifold Gaussian processes for regression. In 2016 International Joint Conference on Neural Networks (IJCNN). IEEE, 2016.
- Andrew Gordon Wilson, David A Knowles, and Zoubin Ghahramani. Gaussian process regression networks. In Proceedings of the 29th International Conference on Machine Learning (ICML), pages 1139–1146, 2012.
- Paul D Sampson and Peter Guttorp. Nonparametric estimation of nonstationary spatial covariance structure. Journal of the American Statistical Association, 1992.
- Ethan B Anderes, Michael L Stein, et al. Estimating deformations of isotropic Gaussian random fields on the plane. The Annals of Statistics, 2008.
- Andreas Damianou and Neil Lawrence. Deep Gaussian processes. In International Conference on Artificial Intelligence and Statistics (AISTATS), pages 207–215, 2013.
- Kurt Cutajar, Edwin V Bonilla, Pietro Michiardi, and Maurizio Filippone. Random feature expansions for deep Gaussian processes. In International Conference on Machine Learning (ICML), 2017.
- Thang Bui, Daniel Hernández-Lobato, Jose Hernandez-Lobato, Yingzhen Li, and Richard Turner. Deep Gaussian processes for regression using approximate expectation propagation. In International Conference on Machine Learning (ICML), 2016.
- Jasper Snoek, Kevin Swersky, Rich Zemel, and Ryan Adams. Input warping for Bayesian optimization of non-stationary functions. In International Conference on Machine Learning (ICML), 2014.
- Markus Heinonen, Henrik Mannerström, Juho Rousu, Samuel Kaski, and Harri Lähdesmäki. Nonstationary Gaussian process regression with Hamiltonian Monte Carlo. In International Conference on Artificial Intelligence and Statistics (AISTATS), 2016.
- M. N. Gibbs. Bayesian Gaussian processes for regression and classification. Ph. D. Thesis, Department of Physics, University of Cambridge, 1997.
- Pashupati Hegde, Markus Heinonen, Harri Lähdesmäki, and Samuel Kaski. Deep learning with differential Gaussian process flows. In International Conference on Artificial Intelligence and Statistics (AISTATS), 2019.
- Chunyi Wang and Radford M. Neal. Gaussian Process Regression with Heteroscedastic or Non-Gaussian Residuals. Technical report, University of Toronto, Toronto, Canada, 2012. URL http://arxiv.org/abs/1212.6246.
- Vincent Dutordoir, Hugh Salimbeni, James Hensman, and Marc Deisenroth. Gaussian Process Conditional Density Estimation. In S Bengio, H Wallach, H Larochelle, K Grauman, N Cesa-Bianchi, and R Garnett, editors, Advances in Neural Information Processing Systems 31, pages 2385–2395. Curran Associates, Inc., 2018.
- Edward Snelson, Zoubin Ghahramani, and Carl E Rasmussen. Warped Gaussian processes. In Advances in Neural Information Processing Systems (NIPS), 2004.
- Ransalu Senanayake, Simon O’Callaghan, and Fabio Ramos. Predicting spatio-temporal propagation of seasonal influenza using variational Gaussian process regression. In Thirtieth AAAI Conference on Artificial Intelligence, 2016.
