Gaussian Processes III
Problem 3.1¶
The dataset AirPassengers contains the monthly airline passenger bookings of Pan Am from 1949 to 1960. It is a classic dataset, often used to illustrate time series modelling. The data were originally used to forecast future flight bookings in order to plan aircraft and crew demand.
a) Read in the built-in dataset AirPassengers and log-transform the number of air passengers. Hint:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from statsmodels.datasets import get_rdataset
data = get_rdataset("AirPassengers").data
y = data["value"].values.astype(float)b) Split the data set into a training and a test dataset which contains 30% of the entire data.
c) Detrend the log-transformed AirPassengers data by fitting a quadratic polynomial and compute the residual term, which you will model with a Gaussian process.
from sklearn.preprocessing import PolynomialFeatures
poly = PolynomialFeatures(degree=2, include_bias=True)
# res_train
# y_train_log
# trend_train

d) Model the residual term with a Gaussian process whose kernel combines two RBF kernels (one with a large length scale, one with a shorter length scale), an exponential sine squared kernel, and a white noise kernel.
e) Predict the residual term and reconstruct the original AirPassengers data. Then plot the predicted AirPassengers data.
f) Compute the MSE on the training and on the test dataset. In addition, determine the NLPD (negative log predictive density).
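Here, the NLPD scores each test point under the Gaussian predictive distribution with mean $\mu_i$ and standard deviation $\sigma_i$. The summed form used in the solution below is

$$\mathrm{NLPD} = \sum_{i=1}^{N}\left[\frac{(y_i - \mu_i)^2}{2\sigma_i^2} + \frac{1}{2}\log\!\big(2\pi\sigma_i^2\big)\right];$$

some references average over $N$ instead of summing.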
g) Replace the second-degree polynomial detrending with an appropriate kernel function and compute the MSE on the training and test datasets.
h) (Optional) Tune the hyperparameters.
Problem 3.2¶
In practice, it does not make much sense to use a GP for a classification problem we could just as well solve with logistic regression. Instead, we want to use a GP to model more complex data that is not well captured by less flexible models. In the following, we will model the probability of catching the flu as a function of age. It turns out that young and very old people have a higher risk than middle-aged people.
The dataset space_flu.csv is a synthetic dataset inspired by the previous description.
import pandas as pd
df_sf = pd.read_csv('../space_flu.csv')
age = df_sf.age.values[:, None]
space_flu = df_sf.space_flu
ax = df_sf.plot.scatter('age', 'space_flu', figsize=(8,5))
ax.set_yticks([0, 1])
ax.set_yticklabels(['healthy', 'sick'])

a) Define a Gaussian process classification model with scikit-learn to predict whether a person will be healthy or sick depending on age.
b) Transform the target variable with value set {0, 1} to {-1, 1}.
c) Fit a Gaussian process with a constant, an RBF, and a white noise kernel.
d) Predict the probability of being sick as a function of the predictor variable age and plot the predictions.
e) Report the training accuracy.
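For part a), here is a minimal sketch using scikit-learn's built-in GaussianProcessClassifier, which fits a Laplace approximation with a logistic link; the length scale below is an arbitrary starting value. The short solution further down takes a different route, regressing a latent ±1 target and applying a probit link.

# Sketch for a): direct GP classification (Laplace approximation, logistic link).
from sklearn.gaussian_process import GaussianProcessClassifier
from sklearn.gaussian_process.kernels import RBF, ConstantKernel

gpc = GaussianProcessClassifier(kernel=ConstantKernel(1.0) * RBF(length_scale=10.0))
gpc.fit(age, space_flu)
p_sick = gpc.predict_proba(age)[:, 1]  # probability of being sick per person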
Problem 3.3¶
In this exercise, we will classify setosa and versicolor flowers in the Iris dataset based on the predictors sepal length and sepal width using variational inference.
import numpy as np
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
iris = load_iris()
X = iris.data[:, :2]
y = iris.target
# Keep only setosa (0) and versicolor (1)
mask = (y == 0) | (y == 1)
X = X[mask]
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
X = np.array(X_scaled)
y = y[mask]

a) Optimize the ELBO to classify setosa and versicolor flowers in the Iris dataset based on the predictors sepal length and sepal width with variational GPs. Plot the posterior mean, the posterior variance, and the predicted class probabilities.
b) Optimize the ELBO to classify virginica and versicolor flowers in the Iris dataset based on the predictors sepal length and sepal width with variational GPs. Plot the posterior mean, the posterior variance, and the predicted class probabilities.
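For orientation: in both a) and b), the quantity being maximized is the evidence lower bound (ELBO). In its generic form for a variational GP with latent values $f_i$ and variational posterior $q$,

$$\mathcal{L} = \sum_{i=1}^{N} \mathbb{E}_{q(f_i)}\big[\log p(y_i \mid f_i)\big] \;-\; \mathrm{KL}\big(q(f)\,\|\,p(f)\big) \;\le\; \log p(y),$$

where the expected log likelihood rewards data fit and the KL term keeps the variational posterior close to the GP prior. (This is the standard bound; the notebook solution may parameterize it differently.)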
Short Solutions¶
Solution 3.1
a)
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from statsmodels.datasets import get_rdataset
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, ExpSineSquared, WhiteKernel
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
# 1. Load AirPassengers data
data = get_rdataset("AirPassengers").data
y = data["value"].values.astype(float)
n = len(y)
X = np.arange(n).reshape(-1, 1) # time index
y_log = np.log(y)  # work in log-space

b)
train_size = int(0.7 * n)
X_train, X_test = X[:train_size], X[train_size:]
y_train_log, y_test_log = y_log[:train_size], y_log[train_size:]
y_train, y_test = y[:train_size], y[train_size:]  # chronological split (no shuffling), as is standard for time series

c)
from sklearn.preprocessing import PolynomialFeatures
poly = PolynomialFeatures(degree=2, include_bias=True)
X_train_poly = poly.fit_transform(X_train)
X_test_poly = poly.transform(X_test)
trend_model = LinearRegression()
trend_model.fit(X_train_poly, y_train_log)
trend_train = trend_model.predict(X_train_poly)
trend_test = trend_model.predict(X_test_poly)
# Residuals for GP
res_train = y_train_log - trend_train

d)
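Before composing the kernel, a quick optional check (a sketch, not required by the exercise) is to plot the detrended residuals: a clear yearly oscillation with a slowly drifting amplitude motivates the kernel combination below.

# Optional: inspect the residuals the GP has to explain
plt.figure(figsize=(10, 3))
plt.plot(X_train.ravel(), res_train)
plt.xlabel("Month index")
plt.ylabel("log residual")
plt.title("Detrended log passengers (training residuals)")
plt.show()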
season = 0.7**2 * ExpSineSquared(length_scale=10.0, periodicity=12.0)
slow_amp = 1.0**2 * RBF(length_scale=200.0) # very slow growth
local = 0.3**2 * RBF(length_scale=3.0)
noise = WhiteKernel(noise_level=0.05**2)
kernel = season * slow_amp + local + noise
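# season * slow_amp is a locally periodic kernel: multiplying the periodic
# kernel by a long-length-scale RBF lets the seasonal amplitude drift slowly
# over the years, while the short-length-scale RBF ("local") absorbs smooth
# month-to-month variation on top of it.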
gp = GaussianProcessRegressor(
kernel=kernel,
normalize_y=False,
n_restarts_optimizer=12,
random_state=0
)
gp.fit(X_train, res_train)
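# After fitting, sklearn exposes the optimized log marginal likelihood,
# the objective the kernel hyperparameters were tuned against:
print("Log marginal likelihood:", gp.log_marginal_likelihood_value_)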
print("Fitted kernel:")
print(gp.kernel_)

e)
res_train_pred, res_train_std = gp.predict(X_train, return_std=True)
res_test_pred, res_test_std = gp.predict(X_test, return_std=True)
# Reconstruct log(y)
y_train_log_pred = trend_train + res_train_pred
y_test_log_pred = trend_test + res_test_pred
# Back to original scale
y_train_pred = np.exp(y_train_log_pred)
y_test_pred = np.exp(y_test_log_pred)
plt.figure(figsize=(14,6))
plt.plot(X, y, "k-", label="Data")
plt.plot(X_train, y_train_pred, "r--", label="Train prediction")
plt.plot(X_test, y_test_pred, "orange", label="Test prediction")
# 95% CI on test in original space (approx via log-normal)
upper_test = np.exp(y_test_log_pred + 1.96 * res_test_std)
lower_test = np.exp(y_test_log_pred - 1.96 * res_test_std)
plt.fill_between(
X_test.flatten(),
lower_test,
upper_test,
color="orange",
alpha=0.2,
label="Confidence interval (approx.)"
)
plt.title("AirPassengers - GP on detrended log-residuals")
plt.xlabel("Month index")
plt.ylabel("Passengers")
plt.legend()
plt.grid(True)
plt.show()

f)
train_mse = mean_squared_error(y_train, y_train_pred)
test_mse = mean_squared_error(y_test, y_test_pred)
# NLPD in log-space (approx), using residual GP std
nlpd = 0.5 * np.sum(
(y_test_log - y_test_log_pred)**2 / (res_test_std**2)
+ np.log(2 * np.pi * res_test_std**2)
)
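# Dividing by the number of test points gives a per-point NLPD,
# which is easier to compare across test sets of different sizes:
nlpd_per_point = nlpd / len(y_test_log)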
print(f"Train MSE: {train_mse:.3f}")
print(f"Test MSE: {test_mse:.3f}")
print(f"NLPD (log-space residual model): {nlpd:.3f}")
# Train MSE: 65.577
# Test MSE: 522.081
# NLPD (log-space residual model): -63.489

g)
from sklearn.gaussian_process.kernels import DotProduct, RationalQuadratic
# 1. TREND GP
trend_kernel = (
# fixed long-scale trend
RBF(length_scale=200.0, length_scale_bounds="fixed")
# mild global slope + stabilizer
+ DotProduct(sigma_0=0.1)
) + WhiteKernel(noise_level=1e-3)
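# The fixed long-scale RBF carries the smooth long-term growth, DotProduct
# adds a global linear component, and the small WhiteKernel keeps the
# covariance matrix numerically well-conditioned.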
gp_trend = GaussianProcessRegressor(
kernel=trend_kernel,
normalize_y=True,
optimizer=None # prevents overfitting wiggles into trend
)
gp_trend.fit(X_train, y_train_log)
trend_train = gp_trend.predict(X_train)
trend_test = gp_trend.predict(X_test)
# 2. Residuals
res_train = y_train_log - trend_train
# 3. Seasonality Model (fixed periodicity, flexible amplitude)
season_kernel = (
0.5**2 * ExpSineSquared(
periodicity=12.0,
periodicity_bounds="fixed",
length_scale=2.0
)
+ 0.2**2 * RationalQuadratic(
length_scale=2.0,
alpha=1.0
)
+ WhiteKernel(noise_level=0.02**2)
)
gp_season = GaussianProcessRegressor(
kernel=season_kernel,
normalize_y=True,
n_restarts_optimizer=8,
random_state=0
)
gp_season.fit(X_train, res_train)
res_pred_train = gp_season.predict(X_train)
res_pred_test = gp_season.predict(X_test)
# Reconstruct predictions
y_train_pred = trend_train + res_pred_train
y_test_pred = trend_test + res_pred_test
y_train_pred_exp = np.exp(y_train_pred)
y_test_pred_exp = np.exp(y_test_pred)
# Evaluation
train_mse = mean_squared_error(y_train, y_train_pred_exp)
test_mse = mean_squared_error(y_test, y_test_pred_exp)
print("Train MSE:", train_mse)
print("Test MSE:", test_mse)
# Train MSE: 12.068...
# Test MSE: 692.591...

h) Setting n_restarts_optimizer already tunes the hyperparameters: each fit restarts the L-BFGS maximization of the log marginal likelihood from random initializations within the kernel's hyperparameter bounds, and more restarts explore that space more thoroughly.
gp = GaussianProcessRegressor(
kernel=kernel,
alpha=0.0,
normalize_y=False,
n_restarts_optimizer=25, # <<< full hyperparameter tuning
random_state=0
)

Solution 3.2
a)
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.gaussian_process.kernels import RBF, WhiteKernel, ConstantKernel
from sklearn.metrics import accuracy_score
from sklearn.gaussian_process import GaussianProcessRegressor
from scipy.stats import norm
# Load data
df = pd.read_csv("../space_flu.csv")
X = df.age.values[:, None]
y = df.space_flu.values.astype(int)

b)
y_latent = 2 * y - 1  # map labels {0, 1} -> {-1, +1} for the latent regression

c)
kernel = (
ConstantKernel(0.5, (1e-3, 2.0)) *
RBF(length_scale=12.0, length_scale_bounds=(3.0, 200.0))
+ WhiteKernel(noise_level=0.4, noise_level_bounds=(1e-3, 5.0))
)
# Gaussian Process Regressor (latent)
gpr = GaussianProcessRegressor(
kernel=kernel,
optimizer="fmin_l_bfgs_b",
normalize_y=True
)
gpr.fit(X, y_latent)

d)
# Grid for visualization
X_new = np.linspace(0, 80, 300)[:, None]
f_mean_new, f_std_new = gpr.predict(X_new, return_std=True)
probs_new = norm.cdf(f_mean_new)
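# Note: norm.cdf(f_mean_new) pushes only the posterior mean through the
# probit. Averaging the probit over the latent Gaussian instead uses the
# identity E[Phi(f)] = Phi(mu / sqrt(1 + sigma^2)) and is slightly softer:
probs_avg = norm.cdf(f_mean_new / np.sqrt(1.0 + f_std_new**2))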
# Plot
plt.figure(figsize=(10,6))
plt.scatter(
X[:,0],
y + np.random.normal(0, 0.02, size=len(y)),
c=y,
cmap="coolwarm",
s=60,
edgecolor="k",
label="observed labels"
)
plt.plot(
X_new[:, 0],
probs_new,
"C2",
lw=3,
label="GPR + probit probability (regularized)"
)
plt.yticks([0, 1], ["healthy", "sick"])
plt.xlabel("age")
plt.title("Regularized Gaussian Process Regression -> Probit Classification")
plt.legend()
plt.grid(True)
plt.show()

e)
f_mean, f_std = gpr.predict(X, return_std=True)
probs = norm.cdf(f_mean)
y_pred = (probs > 0.5).astype(int)
acc = accuracy_score(y, y_pred)
print(f"Training accuracy after regularization: {acc:.3f}")
# Training accuracy after regularization: 0.917

Solution 3.3
See the Jupyter Notebook Solutions_VI_Iris.ipynb.
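Since the worked solution lives in the notebook, here is a minimal sketch of the ELBO optimization for a), assuming GPflow (and TensorFlow) is installed; the kernel choice and optimizer settings are illustrative, not necessarily the notebook's exact configuration.

import gpflow

# Variational GP with a Bernoulli likelihood (probit link by default);
# minimizing training_loss (= negative ELBO) maximizes the ELBO.
Y = y.reshape(-1, 1).astype(float)
model = gpflow.models.VGP(
    (X, Y),
    kernel=gpflow.kernels.SquaredExponential(lengthscales=[1.0, 1.0]),
    likelihood=gpflow.likelihoods.Bernoulli(),
)
gpflow.optimizers.Scipy().minimize(model.training_loss, model.trainable_variables)
print("ELBO after optimization:", model.elbo().numpy())

# Posterior mean/variance of the latent f and class probabilities on a grid
x1, x2 = np.meshgrid(np.linspace(-3, 3, 50), np.linspace(-3, 3, 50))
X_grid = np.column_stack([x1.ravel(), x2.ravel()])
f_mean, f_var = model.predict_f(X_grid)
p_class1, _ = model.predict_y(X_grid)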