HPOBench: Reproducible Benchmarks for HPO

Hyperparameter optimization (HPO) is a crucial component of AutoML. Recently developed multi-fidelity optimization approaches have been shown to be more efficient and powerful than existing black-box optimization methods.

However, the performance of an HPO algorithm depends heavily on the HPO problem it is applied to: the machine learning model to be evaluated, the configuration space to be explored, the fidelity to be applied, and even the software dependencies that the machine learning model relies on. All these factors make it difficult to compare different HPO methods consistently.

Additionally, evaluating an HPO problem can be quite expensive: for instance, it usually takes several hours to train a neural network and evaluate the corresponding configuration. Reproducing the results of an HPO algorithm on such problems becomes almost impossible with limited computational resources.

As the successor of HPOlib, HPOBench aims at solving the aforementioned issues with the following contributions:

  • A standard API that allows the users to evaluate multi-fidelity HPO on several benchmarks and their corresponding fidelities.
  • Containers that isolate the benchmarks from the computation environments and mitigate the problem of software dependencies.
  • Surrogate and tabular benchmarks that provide a cheap way of evaluating the target algorithms.
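
To illustrate why a tabular benchmark is cheap, here is a minimal sketch in which evaluating a configuration is a table lookup instead of a training run. The table contents, the hyperparameter names, and the helper `sample_configuration` are hypothetical and are not taken from HPOBench; the `function_value` and `cost` keys in the returned dictionary are likewise an assumption made for this illustration.

```python
import random

# Hypothetical precomputed table:
# (learning_rate, batch_size) -> {budget: (validation_loss, training_time_in_seconds)}
# The numbers are made up for illustration and do not come from a real benchmark.
TABLE = {
    (0.01, 32): {10: (0.52, 4.0), 100: (0.31, 40.0)},
    (0.01, 64): {10: (0.55, 3.0), 100: (0.35, 30.0)},
    (0.10, 32): {10: (0.48, 4.0), 100: (0.40, 40.0)},
    (0.10, 64): {10: (0.60, 3.0), 100: (0.44, 30.0)},
}


def objective_function(configuration, fidelity):
    """Look up the stored result; no model is trained here."""
    key = (configuration["learning_rate"], configuration["batch_size"])
    loss, cost = TABLE[key][fidelity["budget"]]
    return {"function_value": loss, "cost": cost}


def sample_configuration(rng):
    """Draw one of the tabulated configurations uniformly at random."""
    learning_rate, batch_size = rng.choice(sorted(TABLE))
    return {"learning_rate": learning_rate, "batch_size": batch_size}


rng = random.Random(1)
config = sample_configuration(rng)
result = objective_function(config, fidelity={"budget": 100})
print(config, result)
```

The stored training time is returned as `cost`, so an optimizer can still account for the time the original run would have taken even though the lookup itself is nearly free.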

HPOBench currently contains more than 100 multi-fidelity benchmark problems with various properties: numerical and categorical configuration spaces, and different levels of difficulty and complexity. Furthermore, HPOBench provides the results of several popular HPO packages, so that new HPO algorithms can easily be compared against them.

Evaluating a configuration using a Singularity container requires only 4 lines of code:

>>> from hpobench.container.benchmarks.nas.tabular_benchmarks import SliceLocalizationBenchmark
>>> b = SliceLocalizationBenchmark(rng=1)
>>> config = b.get_configuration_space(seed=1).sample_configuration()
>>> result_dict = b.objective_function(configuration=config, fidelity={"budget": 100}, rng=1)
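
A multi-fidelity optimizer would call `objective_function` many times at different fidelities. The following is a minimal sketch of successive halving against a toy stand-in, so that it runs without HPOBench or Singularity installed. `ToyBenchmark` and `successive_halving` are hypothetical names introduced for this illustration, and the `function_value` and `cost` keys of the result dictionary are assumed here rather than confirmed by the text above.

```python
import random


class ToyBenchmark:
    """Hypothetical stand-in for a benchmark object; not an HPOBench class."""

    def objective_function(self, configuration, fidelity):
        x = configuration["x"]
        budget = fidelity["budget"]
        # Loss is (x - 0.3)^2 plus a bias that shrinks as the budget grows.
        value = (x - 0.3) ** 2 + 1.0 / budget
        # Assume the evaluation cost is proportional to the budget.
        return {"function_value": value, "cost": float(budget)}


def successive_halving(benchmark, configs, min_budget, max_budget, eta=3):
    """Evaluate all configs cheaply, keep the best 1/eta, and raise the budget."""
    budget = min_budget
    total_cost = 0.0
    survivors = list(configs)
    while True:
        scored = []
        for config in survivors:
            result = benchmark.objective_function(
                configuration=config, fidelity={"budget": budget}
            )
            total_cost += result["cost"]
            scored.append((result["function_value"], config))
        # Sort by loss only; the configurations themselves are not comparable.
        scored.sort(key=lambda pair: pair[0])
        if budget >= max_budget or len(scored) == 1:
            best_value, best_config = scored[0]
            return best_config, best_value, total_cost
        keep = max(1, len(scored) // eta)
        survivors = [config for _, config in scored[:keep]]
        budget = min(max_budget, budget * eta)


rng = random.Random(1)
configs = [{"x": rng.random()} for _ in range(9)]
best_config, best_value, total_cost = successive_halving(
    ToyBenchmark(), configs, min_budget=1, max_budget=9, eta=3
)
print(best_config, best_value, total_cost)
```

With nine configurations, `eta=3`, and budgets 1, 3, and 9, each round spends the same total budget, which is the property that makes successive halving cheaper than evaluating every configuration at the highest fidelity.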

For more information, please check our HPOBench GitHub repository and the corresponding blog post:


  • HPOBench: A Collection of Reproducible Multi-Fidelity Benchmark Problems for HPO [pdf]
    dataset and benchmark track
  • Towards an Empirical Foundation for Assessing Bayesian Optimization of Hyperparameters [pdf] [poster]
    NeurIPS workshop on Bayesian Optimization in Theory and Practice
    This includes results for SMAC, Spearmint and TPE on the benchmarks we provided in HPOlib1.