Pre-trained Gaussian processes for Bayesian optimization – Google AI Blog

Bayesian optimization (BayesOpt) is a powerful tool widely used for global optimization tasks, such as hyperparameter tuning, protein engineering, synthetic chemistry, robot learning, and even baking cookies. BayesOpt is a great strategy for these problems because they all involve optimizing black-box functions that are expensive to evaluate. A black-box function's underlying mapping from inputs (configurations of the thing we want to optimize) to outputs (a measure of performance) is unknown. However, we can attempt to understand its internal workings by evaluating the function for different combinations of inputs. Because each evaluation can be computationally expensive, we need to find the best inputs in as few evaluations as possible. BayesOpt works by repeatedly constructing a surrogate model of the black-box function and strategically evaluating the function at the most promising or informative input location, given the information observed so far.

Gaussian processes are popular surrogate models for BayesOpt because they are easy to use, can be updated with new data, and provide a confidence level about each of their predictions. The Gaussian process model constructs a probability distribution over possible functions. This distribution is specified by a mean function (what these possible functions look like on average) and a kernel function (how much these functions can vary across inputs). The performance of BayesOpt depends on whether the confidence intervals predicted by the surrogate model contain the black-box function. Traditionally, experts use domain knowledge to quantitatively define the mean and kernel parameters (e.g., the range or smoothness of the black-box function) to express their expectations about what the black-box function should look like. However, for many real-world applications like hyperparameter tuning, it is very difficult to understand the landscapes of the tuning objectives. Even for experts with relevant experience, it can be challenging to narrow down appropriate model parameters.
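To make the mean function, kernel function, and per-prediction confidence levels concrete, here is a minimal sketch (plain NumPy, with illustrative toy data, not the models used in the paper) of conditioning a zero-mean Gaussian process with an RBF kernel on a few observations:

```python
import numpy as np

def rbf_kernel(a, b, length_scale=1.0, amplitude=1.0):
    """Squared-exponential kernel: similarity decays with input distance."""
    d = a[:, None] - b[None, :]
    return amplitude * np.exp(-0.5 * (d / length_scale) ** 2)

def gp_posterior(x_train, y_train, x_test, length_scale=1.0, noise=1e-6):
    """Condition a zero-mean GP on observations; return posterior mean/stddev."""
    k_tt = rbf_kernel(x_train, x_train, length_scale) + noise * np.eye(len(x_train))
    k_ts = rbf_kernel(x_train, x_test, length_scale)
    k_ss = rbf_kernel(x_test, x_test, length_scale)
    solve = np.linalg.solve(k_tt, k_ts)
    mean = solve.T @ y_train
    cov = k_ss - k_ts.T @ solve
    return mean, np.sqrt(np.clip(np.diag(cov), 0.0, None))

# Toy black-box observations at three inputs.
x_obs = np.array([0.2, 0.5, 0.9])
y_obs = np.sin(2 * np.pi * x_obs)
x_grid = np.linspace(0.0, 1.0, 5)
mu, sd = gp_posterior(x_obs, y_obs, x_grid)
# A 95% confidence interval at each grid point is mu ± 1.96 * sd.
```

The choice of `length_scale` here plays exactly the role the paragraph above describes: it encodes an expert's prior belief about how quickly the black-box function varies, and misjudging it makes the confidence intervals miss the true function.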

In “Pre-trained Gaussian processes for Bayesian optimization”, we consider the challenge of hyperparameter optimization for deep neural networks using BayesOpt. We propose Hyper BayesOpt (HyperBO), a highly customizable interface with an algorithm that removes the need for quantifying model parameters for Gaussian processes in BayesOpt. For new optimization problems, experts can simply select previous tasks that are relevant to the current task they are trying to solve. HyperBO pre-trains a Gaussian process model on data from those selected tasks, and automatically defines the model parameters before running BayesOpt. HyperBO enjoys theoretical guarantees on the alignment between the pre-trained model and the ground truth, as well as on the quality of its solutions for black-box optimization. We share strong results of HyperBO both on our new tuning benchmarks for near–state-of-the-art deep learning models and on classic multi-task black-box optimization benchmarks (HPO-B). We also demonstrate that HyperBO is robust to the selection of relevant tasks and has low requirements on the amount of data and tasks needed for pre-training.

In the traditional BayesOpt interface, experts need to carefully select the mean and kernel parameters for a Gaussian process model. HyperBO replaces this manual specification with a selection of related tasks, making Bayesian optimization easier to use. The selected tasks are used for pre-training, where we optimize a Gaussian process such that it can gradually generate functions that are similar to the functions corresponding to those selected tasks. The similarity manifests in individual function values and in variations of function values across the inputs.

Loss functions for pre-training

We pre-train a Gaussian process model by minimizing the Kullback–Leibler divergence (a commonly used divergence) between the ground truth model and the pre-trained model. Since the ground truth model is unknown, we cannot directly compute this loss function. To solve for this, we introduce two data-driven approximations: (1) Empirical Kullback–Leibler divergence (EKL), which is the divergence between an empirical estimate of the ground truth model and the pre-trained model; (2) Negative log likelihood (NLL), which is the sum of negative log likelihoods of the pre-trained model for all training functions. The computational cost of EKL or NLL scales linearly with the number of training functions. Moreover, stochastic gradient–based methods like Adam can be employed to optimize the loss functions, which further lowers the cost of computation. In well-controlled environments, optimizing EKL and NLL leads to the same result, but their optimization landscapes can be very different. For example, in the simplest case where the function only has one possible input, its Gaussian process model becomes a Gaussian distribution, described by the mean (m) and variance (s). Hence the loss function only has these two parameters, m and s, and we can visualize EKL and NLL as follows:
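In this single-input case, both approximations reduce to simple closed forms. The sketch below (NumPy, with made-up observation values; each "training function" contributes one value at the sole input) computes EKL and NLL as functions of just m and s:

```python
import numpy as np

# Hypothetical values observed at the single input across 5 training functions.
y = np.array([0.8, 1.1, 0.9, 1.3, 1.0])

def nll(m, s):
    """Negative log likelihood of the observations under N(m, s^2)."""
    return np.sum(0.5 * np.log(2 * np.pi * s**2) + (y - m) ** 2 / (2 * s**2))

def ekl(m, s):
    """KL( N(m_hat, s_hat^2) || N(m, s^2) ), with empirical estimates m_hat, s_hat
    standing in for the unknown ground truth."""
    m_hat, s_hat = y.mean(), y.std()
    return np.log(s / s_hat) + (s_hat**2 + (m_hat - m) ** 2) / (2 * s**2) - 0.5

# Both losses bottom out near the empirical mean and standard deviation,
# even though their landscapes look different away from the optimum.
```

Evaluating either loss on a grid of (m, s) values produces heatmaps like the ones visualized below.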

We simulate the loss landscapes of EKL (left) and NLL (right) for a simple model with parameters m and s. The colors represent a heatmap of the EKL or NLL values, where purple corresponds to higher values and blue denotes lower values. These two loss landscapes are very different, but they both aim to match the pre-trained model with the ground truth model.

Pre-training improves Bayesian optimization

In the BayesOpt algorithm, decisions on where to evaluate the black-box function are made iteratively. The decision criteria are based on the confidence levels provided by the Gaussian process, which are updated in each iteration by conditioning on previous data points acquired by BayesOpt. Intuitively, the updated confidence levels should be just right: not overly confident or too uncertain, since in either of these two cases, BayesOpt cannot make the decisions that would match what an expert would do.
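This iterative decision process can be sketched as a simple loop: condition the GP on everything observed so far, score candidate inputs with an acquisition function (here an upper confidence bound, one common choice; the toy black-box function and constants are illustrative, not from the paper), and evaluate at the best-scoring candidate:

```python
import numpy as np

def rbf(a, b, ls=0.2):
    return np.exp(-0.5 * ((a[:, None] - b[None, :]) / ls) ** 2)

def posterior(x_obs, y_obs, x_grid, noise=1e-6):
    """GP posterior mean and stddev on a grid, conditioned on observations."""
    k = rbf(x_obs, x_obs) + noise * np.eye(len(x_obs))
    ks = rbf(x_obs, x_grid)
    sol = np.linalg.solve(k, ks)
    mean = sol.T @ y_obs
    var = np.clip(1.0 - np.sum(ks * sol, axis=0), 0.0, None)
    return mean, np.sqrt(var)

def black_box(x):  # stand-in for the expensive function being maximized
    return -(x - 0.7) ** 2

x_grid = np.linspace(0.0, 1.0, 101)
x_obs = np.array([0.1, 0.9])
y_obs = black_box(x_obs)

for _ in range(10):
    mu, sd = posterior(x_obs, y_obs, x_grid)
    # Upper confidence bound: trade off predicted value (mu) vs. uncertainty (sd).
    x_next = x_grid[np.argmax(mu + 2.0 * sd)]
    x_obs = np.append(x_obs, x_next)
    y_obs = np.append(y_obs, black_box(x_next))

best = x_obs[np.argmax(y_obs)]  # approaches the true optimum at x = 0.7
```

If `sd` is systematically too small (overconfident), the loop stops exploring too early; if too large, it wastes evaluations exploring everywhere — which is why the calibration of the confidence levels matters so much.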

In HyperBO, we replace the hand-specified model in traditional BayesOpt with the pre-trained Gaussian process. Under mild conditions and with enough training functions, we can mathematically verify good theoretical properties of HyperBO: (1) Alignment: the pre-trained Gaussian process is guaranteed to be close to the ground truth model when both are conditioned on observed data points; (2) Optimality: HyperBO is guaranteed to find a near-optimal solution to the black-box optimization problem for any functions distributed according to the unknown ground truth Gaussian process.

We visualize the Gaussian process (areas shaded in purple are 95% and 99% confidence intervals) conditioned on observations (black dots) from an unknown test function (orange line). Compared to traditional BayesOpt without pre-training, the predicted confidence levels in HyperBO capture the unknown test function much better, which is a critical prerequisite for Bayesian optimization.

Empirically, to define the structure of pre-trained Gaussian processes, we choose to use very expressive mean functions modeled by neural networks, and apply well-defined kernel functions on inputs encoded to a higher dimensional space with neural networks.
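A minimal sketch of this structure (randomly initialized weights stand in for the pre-trained ones; layer sizes and names are illustrative, not HyperBO's actual architecture): a small neural network encodes inputs, a linear head on the encoding provides the mean function, and an RBF kernel is applied in the encoded space:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical weights for a tiny MLP; in HyperBO these would be learned
# by minimizing EKL or NLL during pre-training.
w1, b1 = rng.normal(size=(3, 8)), np.zeros(8)
w2, b2 = rng.normal(size=(8, 1)), np.zeros(1)

def encode(x):
    """Map raw inputs to a feature space with a small neural network."""
    return np.tanh(x @ w1 + b1)

def mean_fn(x):
    """Expressive neural-network mean function over the encoded inputs."""
    return (encode(x) @ w2 + b2).ravel()

def kernel_fn(a, b, length_scale=1.0):
    """A standard RBF kernel, applied to the NN-encoded inputs."""
    za, zb = encode(a), encode(b)
    sq = np.sum((za[:, None, :] - zb[None, :, :]) ** 2, axis=-1)
    return np.exp(-0.5 * sq / length_scale**2)

x = rng.uniform(size=(5, 3))  # 5 points in a 3-d hyperparameter space
m = mean_fn(x)                # per-point prior means
K = kernel_fn(x, x)           # positive semi-definite Gram matrix
```

Because the kernel is a well-defined (RBF) kernel composed with a deterministic encoder, the resulting Gram matrix stays valid (symmetric positive semi-definite) no matter what the network weights are.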

To evaluate HyperBO on challenging and realistic black-box optimization problems, we created the PD1 benchmark, which contains a dataset for multi-task hyperparameter optimization of deep neural networks. PD1 was developed by training tens of thousands of configurations of near–state-of-the-art deep learning models on popular image and text datasets, as well as a protein sequence dataset. PD1 contains approximately 50,000 hyperparameter evaluations from 24 different tasks (e.g., tuning Wide ResNet on CIFAR100), representing roughly 12,000 machine days of computation.

We demonstrate that after pre-training for only a few hours on a single CPU, HyperBO can significantly outperform BayesOpt with carefully hand-tuned models on unseen challenging tasks, including tuning ResNet50 on ImageNet. Even with only ~100 data points per training function, HyperBO performs competitively against baselines.

Tuning validation error rates of ResNet50 on ImageNet and Wide ResNet (WRN) on the Street View House Numbers (SVHN) dataset and CIFAR100. By pre-training on only ~20 tasks and ~100 data points per task, HyperBO can significantly outperform traditional BayesOpt (with a carefully hand-tuned Gaussian process) on previously unseen tasks.

Conclusion and future work

HyperBO is a framework that pre-trains a Gaussian process and subsequently performs Bayesian optimization with the pre-trained model. With HyperBO, we no longer have to hand-specify the exact quantitative parameters of a Gaussian process. Instead, we only need to identify related tasks and their corresponding data for pre-training. This makes BayesOpt both more accessible and more effective. An important future direction is to enable HyperBO to generalize over heterogeneous search spaces, for which we are developing new algorithms by pre-training a hierarchical probabilistic model.


The following members of the Google Research, Brain Team conducted this research: Zi Wang, George E. Dahl, Kevin Swersky, Chansoo Lee, Zachary Nado, Justin Gilmer, Jasper Snoek, and Zoubin Ghahramani. We'd like to thank Zelda Mariet and Matthias Feurer for help and consultation on transfer learning baselines. We'd also like to thank Rif A. Saurous for constructive feedback, and Rodolphe Jenatton and David Belanger for feedback on earlier versions of the manuscript. In addition, we thank Sharat Chikkerur, Ben Adlam, Balaji Lakshminarayanan, Fei Sha and Eytan Bakshy for comments, and Setareh Ariafar and Alexander Terenin for conversations on animation. Finally, we thank Tom Small for designing the animation for this post.
