Random Rotation Ensembles in Python
- 11. March 2023
- #datascience
The paper “Random Rotation Ensembles” by Blaser et al. introduces a simple but powerful machine learning model: train an ensemble of trees on rotations of the data set, then average their predictions.
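In symbols (matching the implementation at the bottom of this article): given \(M\) trees \(t_1, \dots, t_M\) and random rotation matrices \(R_1, \dots, R_M\), each tree \(t_m\) is trained on the rotated data set \(X R_m\), and the ensemble prediction for a new observation \(x\) (a row vector) is the average \(\hat{y}(x) = \frac{1}{M} \sum_{m=1}^{M} t_m(x R_m)\).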
In this article we present a simple Python implementation (bottom of this article), and compare the RandomRotationEnsemble with a few other tree-based regression models:
sklearn.tree.DecisionTreeRegressor
sklearn.ensemble.RandomForestRegressor
sklearn.ensemble.GradientBoostingRegressor
Model performance
The datasets used to assess model performance were retrieved using sklearn.datasets.fetch_openml.
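For instance, the us_crime data set used later in this article can be fetched by name. The call below is only a sketch: the name and default version are assumptions, and some OpenML data sets require extra preprocessing before modeling.
from sklearn.datasets import fetch_openml

# Fetch a regression data set from OpenML by name ("us_crime" is one of
# the data sets used in this article). Some data sets need additional
# preprocessing, e.g. imputing missing values or encoding categoricals.
data = fetch_openml(name="us_crime", as_frame=True)
X, y = data.data, data.target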
The figure below shows out-of-sample \(R^2\) test scores. Each model was evaluated using \(10\)-fold cross validation, and the resulting scores are plotted for each data set.
A few critical hyperparameters were tuned for each model (e.g. min_samples_leaf).
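The snippet below sketches this kind of evaluation, using the California housing data as a stand-in (it needs no preprocessing) and a placeholder min_samples_leaf value rather than a tuned one.
from sklearn.datasets import fetch_california_housing
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

# 10-fold cross validation, yielding one out-of-sample R^2 score per fold
X, y = fetch_california_housing(return_X_y=True)
model = RandomForestRegressor(min_samples_leaf=8)
scores = cross_val_score(model, X, y, cv=10, scoring="r2")
print(scores.mean())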
The RandomRotationEnsemble is certainly competitive, but it's no magic bullet: on some data sets it outperforms the other models, while on others it's the worst model.
Model boundaries
To gain some understanding of how the model works, we use \(2\) variables from the us_crime data set and plot the results.
An out-of-sample \(R^2\) score is also reported.
The RandomRotationEnsemble (code below) yields a smoother estimate of the target variable that is not axis-aligned (unlike the other models). Hyperparameters were tuned for each model on part of a training set, and the entire data set is plotted using black dots.
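A minimal sketch of how such a plot can be produced is shown below, with synthetic data and a single decision tree standing in for the actual data set and models.
import matplotlib.pyplot as plt
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Toy data with two features and an arbitrary smooth target
rng = np.random.default_rng(1)
X = rng.random((200, 2))
y = X[:, 0] + np.sin(3 * X[:, 1])
model = DecisionTreeRegressor(min_samples_leaf=8).fit(X, y)

# Predict over a dense grid, draw the prediction surface, and plot
# the data set on top using black dots
xx, yy = np.meshgrid(np.linspace(0, 1, 100), np.linspace(0, 1, 100))
zz = model.predict(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)
plt.contourf(xx, yy, zz, levels=25)
plt.scatter(X[:, 0], X[:, 1], color="black", s=5)
plt.show()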
Python code for RandomRotationEnsemble
Here is a minimal working Python snippet implementing a Random Rotation Ensemble. I assume that the reader is familiar with the scikit-learn API, and chose to keep the documentation to a minimum. If not, take a look at the paper “API design for machine learning software: experiences from the scikit-learn project” by Buitinck et al.
Python implementation: CPython
Python version : 3.9.4
IPython version : 7.22.0
sklearn: 1.1.2
numpy : 1.23.5
scipy : 1.9.3
import numpy as np
from sklearn.base import RegressorMixin, BaseEstimator
from sklearn.utils import check_X_y, check_array
from scipy.stats import special_ortho_group
from sklearn.tree import DecisionTreeRegressor
class RandomRotationEnsemble(BaseEstimator, RegressorMixin):
def __init__(
self,
n_estimators=10,
seed=1,
**kwargs,
):
"""An ensemble of DecisionTreeRegressors trained on random rotations.
Parameters
----------
n_estimators : int, optional
The number of estimators in the ensemble.
The default is 10.
seed : int, optional
An integer used to seed random number generation.
The default is 1.
**kwargs
Arguments passed to DecisionTreeRegressor.
"""
self.n_estimators = n_estimators
self.seed = seed
self.kwargs = kwargs
# Create a list of models
self.models = [
DecisionTreeRegressor(**self.kwargs, random_state=seed + estimator_num)
for estimator_num in range(self.n_estimators)
]
    def get_params(self, deep=True):
        # Copy so repeated calls do not mutate self.kwargs
        params = dict(self.kwargs)
        params["n_estimators"] = self.n_estimators
        params["seed"] = self.seed
        return params
    def set_params(self, **params):
        # Re-initialize with the updated parameters and return self,
        # as the scikit-learn API expects
        new_params = self.get_params()
        new_params.update(params)
        self.__init__(**new_params)
        return self
def fit(self, X, y):
# Check for NaN, infinite values, and so forth
X, y = check_X_y(X, y)
num_samples, num_features = X.shape
# Sample rotation matrices. The first one is always the identity
self.rotation_matrices = [np.eye(num_features)] + [
special_ortho_group(dim=num_features).rvs(
random_state=self.seed + estimator_num
)
for estimator_num in range(self.n_estimators - 1)
]
        # Fit each model on its own rotation of the data
        for rotmat, model in zip(self.rotation_matrices, self.models):
            model.fit(X @ rotmat, y)
        return self
def predict(self, X):
X = check_array(X)
# Predict using each model, then take mean value over every model
return np.array(
[
model.predict(X @ rotmat)
for (rotmat, model) in zip(self.rotation_matrices, self.models)
]
).mean(axis=0)
# =====================================================
# ================= EXAMPLE USAGE =====================
# =====================================================
from sklearn.datasets import fetch_california_housing
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
# Fetch and split data
X, y = fetch_california_housing(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
# Create model
model = Pipeline(
[
("scaler", StandardScaler()),
("estimator", RandomRotationEnsemble(n_estimators=100)),
]
)
# Search for best parameters
param_grid = {"esimator__min_samples_leaf": [2**i for i in range(1, 7)]}
search = GridSearchCV(model, param_grid=param_grid, cv=5)
search.fit(X_train, y_train)
# Set parameters and compute score
model.set_params(**search.best_params_)
cross_val_score(model, X_test, y_test, cv=5).mean() # 0.6675