Gaussian Processes II
Problem 2.1
Suppose we model the relationship of a real-valued response variable $y$ to a single real input, $x$, using a Gaussian process model in which the mean is zero and the covariances of the observed responses are given by:

$$\operatorname{Cov}(y_i, y_j) = K(x_i, x_j) + 0.5^2\,\delta_{ij},$$

with the noise-free covariance function defined by:

$$K(x, x') = \begin{cases} 1 - |x - x'| & \text{if } |x - x'| < 1, \\ 0 & \text{otherwise.} \end{cases}$$
Suppose we have four training cases, as follows:
| x | y |
|---|---|
| 0.5 | 2.0 |
| 2.8 | 3.3 |
| 1.6 | 3.0 |
| 3.9 | 2.7 |
Recall that the conditional mean of the response in a test case with input $x_*$, given the responses in the training cases, is:

$$\mu_* = k_*^\top C^{-1} y,$$

where:

- $y$ is the vector of observed responses in training cases,
- $y_*$ is the vector of responses in test cases,
- $C$ is the matrix of covariances for the responses in training cases,
- $k_*$ is the vector of covariances of the response in the test case with the responses in training cases.
a) Find the predictive mean for the response in a test case in which the input is $x_* = 1.2$.
b) Find the predictive mean and variance of the response in a test case in which the input is $x_* = 1.2$ by means of NumPy.
Problem 2.2
Recall that for a Gaussian process model, the predictive distribution for the response in a test case with input $x_*$ has mean and variance given by:

$$\mu_* = k_*^\top C^{-1} y, \qquad \sigma_*^2 = v - k_*^\top C^{-1} k_*,$$

where:

- $y$ is the vector of observed responses in training cases,
- $y_*$ is the vector of responses in test cases,
- $C$ is the matrix of covariances for the responses in training cases,
- $k_*$ is the vector of covariances of the response in the test case with the responses in training cases,
- $v$ is the prior variance of the response in the test case.
a) Suppose we have just one training case, with and . Suppose also that the noise-free covariance function is: and the variance of the noise is . Find the mean and variance of the predictive distribution for the response in a test case for which the value of the input is .
b) Repeat the calculations for (a), but using: What can you conclude from the result of this calculation?
Problem 2.3
Below are five functions randomly drawn from five different Gaussian processes. For all five Gaussian processes, the mean function is zero. The covariance functions are one of those listed below. For each of the five covariance functions below, indicate which of the five functions above is most likely to have been drawn from the Gaussian process with that covariance function.
(Note: The five functions a, b, c, d, e correspond to the plots in the document)
a) b) c) d) e)
Problem 2.4
Decide whether each of the following statements is true or false.
a) Consider the RBF kernel. As we substantially decrease the length scale, keeping the data and other parameters fixed, the resulting curves become more "wiggly".
b) In a GP with an RBF kernel, the covariance of two points depends only on their relative position, whereas in a GP with a linear kernel, the covariance also depends on their absolute location.
c) Conditioning in Gaussian processes corresponds to integrating over one of the dimensions.
d) In conventional regression methods we typically allow for variation in both the model class and the coefficients, whereas in Gaussian process regression the model class is fixed.
Problem 2.5
We intend to model the house price dataset:

```python
import numpy as np

x = np.array([50, 55, 59, 61, 79, 81, 88, 90, 91, 97, 99,
              105, 107, 110, 111, 112, 116, 117, 121, 123, 124, 125,
              135, 141, 142, 144, 145, 149, 150, 151])
y = np.array([0.36, 0.37, 0.28, 0.29, 0.3,
              0.5, 0.58, 0.61, 0.62, 0.78, 0.77, 0.83, 0.78, 0.84,
              0.91, 0.95, 1.05, 0.99, 0.97, 0.93, 0.81, 0.9, 1.1,
              0.98, 0.88, 1.05, 1.02, 1.1, 1.08, 1.12])
```

a) Using the first two observations of the house price dataset, fit a Gaussian Process with a squared exponential kernel specified by , , and .
b) Using the first five observations, fit a Gaussian Process with a squared exponential kernel specified by , , and .
c) Using all observations, fit a Gaussian Process with a squared exponential kernel specified by , , and .
Problem 2.6
In this exercise, you will perform the basic steps of Gaussian Process regression. To this end, you will work with the Credit dataset that you can load with Pandas from credit_data.csv.
a) Data standardization: Standardize the response variable Balance and plot the data points of Balance vs. Limit. Describe the relationship you observe. Does a linear model seem appropriate?
b) Choosing priors: Generate 10 samples from a zero-mean Gaussian process defined by a squared exponential kernel with , , and .
c) Computing the posterior: Compute the posterior distribution of function values evaluated at each of the points in , conditioned on Limit and Balance, using the squared exponential kernel with , , and . Plot the prior and posterior distributions of function values.
d) Log Marginal Likelihood: Compute the log marginal likelihood for several values of the length scale. Decompose it into three components: the constant term, the determinant term (model complexity), and the quadratic term (data fit). Plot each component separately and identify the length scale at which the log marginal likelihood attains its maximum.
e) Mean Log Posterior Predictive Density: Split the Credit dataset into a validation data set (40%) and a training data set (60%). Compute the mean log posterior predictive density (MLPPD) on the validation data set for several values of the length scale. Identify the length scale at which the MLPPD attains its maximum.
Short Solutions
Solution 2.1 a) All pairs of distinct training inputs are more than 1 apart, so their noise-free covariances are zero. The covariance matrix of the training responses is therefore $C = I + 0.5^2 I = 1.25\,I$. The inverse of this is $C^{-1} = 0.8\,I$. The vector of covariances of the test response with the training responses is $k_* = (0.3,\ 0,\ 0.6,\ 0)^\top$. So $k_*^\top C^{-1} = (0.24,\ 0,\ 0.48,\ 0)$. The predictive mean is $0.24 \cdot 2.0 + 0.48 \cdot 3.0 = 1.92$.
b)

```python
import numpy as np

# Training data
x_train = np.array([0.5, 2.8, 1.6, 3.9])
y_train = np.array([2.0, 3.3, 3.0, 2.7])

# Test input
x_test = np.array([1.2])

def custom_covariance(x, x_prime):
    """Noise-free covariance K(x, x') = max(0, 1 - |x - x'|)."""
    diff = np.abs(x[:, None] - x_prime[None, :])
    return np.where(diff < 1, 1 - diff, 0)

# Covariance matrices
K_train = custom_covariance(x_train, x_train)
noise = 0.5**2 * np.eye(len(x_train))
C_train = K_train + noise
k_star = custom_covariance(x_train, x_test)
k_star_star = custom_covariance(x_test, x_test)

C_train_inv = np.linalg.inv(C_train)
mu_pred = np.dot(k_star.T, np.dot(C_train_inv, y_train))
var_pred = k_star_star - np.dot(k_star.T, np.dot(C_train_inv, k_star))

print(f"Predictive mean: {mu_pred}")
print(f"Predictive variance: {var_pred}")
# Predictive mean at x*=1.2: [1.92]
# Predictive variance at x*=1.2: [[0.64]]
```

Solution 2.2 a) The mean of the predictive distribution is The variance is
b) The mean is The variance is But variances cannot be negative! We can conclude that is not a valid covariance function.
Solution 2.3
a) (d)
b) (a)
c) (e)
d) (b)
e) (c)
Solution 2.4 a) True b) True c) False d) True
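To see statements (a) and (b) concretely, here is a small illustrative sketch (not part of the original solution): it draws RBF samples at two length scales and compares the RBF and linear kernels when both input points are shifted.

```python
import numpy as np

def rbf(x1, x2, ell):
    """Squared exponential kernel k(x, x') = exp(-(x - x')^2 / (2 ell^2))."""
    return np.exp(-(x1[:, None] - x2[None, :]) ** 2 / (2 * ell ** 2))

rng = np.random.default_rng(0)
xs = np.linspace(0, 5, 200)

# Statement (a): smaller length scales give "wigglier" sample paths.
for ell in (2.0, 0.2):
    K = rbf(xs, xs, ell) + 1e-9 * np.eye(len(xs))  # jitter for stability
    f = rng.multivariate_normal(np.zeros(len(xs)), K)
    print(f"ell={ell}: mean |increment| = {np.abs(np.diff(f)).mean():.3f}")

# Statement (b): the RBF kernel depends only on x - x' (stationary), while
# the linear kernel k(x, x') = x * x' changes when both points are shifted.
a, b = np.array([1.0]), np.array([2.0])
print(rbf(a, b, 1.0), rbf(a + 10, b + 10, 1.0))  # equal covariances
print((a * b)[0], ((a + 10) * (b + 10))[0])      # 2.0 vs 132.0: not stationary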
Solution 2.5 (Code solution snippet provided in the text defines kernel functions and plotting. The specific plots for (a), (b), and (c) correspond to fitting with 2 points, 5 points, and all points respectively.)
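Since the snippet itself is not reproduced here, the following is a minimal sketch of such a fit, reusing the x and y arrays from Problem 2.5. The kernel parameters sigma_f, ell, and sigma_n are illustrative placeholders, as the values given in the problem statement were not preserved in this text.

```python
import numpy as np
import matplotlib.pyplot as plt

def se_kernel(x1, x2, sigma_f=1.0, ell=10.0):
    """Squared exponential kernel on 1-D inputs."""
    sq = (x1[:, None] - x2[None, :]) ** 2
    return sigma_f ** 2 * np.exp(-sq / (2 * ell ** 2))

def gp_fit(x_train, y_train, x_test, sigma_n=0.1):
    """Posterior mean and pointwise variance of the function at x_test."""
    C = se_kernel(x_train, x_train) + sigma_n ** 2 * np.eye(len(x_train))
    k_star = se_kernel(x_train, x_test)
    mu = k_star.T @ np.linalg.solve(C, y_train)
    cov = se_kernel(x_test, x_test) - k_star.T @ np.linalg.solve(C, k_star)
    return mu, np.diag(cov)

x_test = np.linspace(40.0, 160.0, 200)
for n in (2, 5, len(x)):  # parts (a), (b), (c)
    mu, var = gp_fit(x[:n].astype(float), y[:n], x_test)
    sd = np.sqrt(var)
    plt.fill_between(x_test, mu - 2 * sd, mu + 2 * sd, alpha=0.2)
    plt.plot(x_test, mu, label=f"n={n}")
plt.scatter(x, y, s=10, color="black")
plt.legend()
plt.show()
```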
Solution 2.6
a) The scatter plot shows a positive linear relationship. However, only approximately 75% of the variance is explained by a linear model. A GP is appropriate.
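As a rough check of the variance-explained figure, a least-squares fit can be computed directly. This sketch assumes credit_data.csv contains Limit and Balance columns, as in the exercise text.

```python
import numpy as np
import pandas as pd

# Rough check of the variance-explained figure via ordinary least squares.
df = pd.read_csv("credit_data.csv")
limit = df["Limit"].to_numpy(dtype=float)
balance = df["Balance"].to_numpy(dtype=float)
balance = (balance - balance.mean()) / balance.std()  # standardized response

slope, intercept = np.polyfit(limit, balance, 1)
resid = balance - (slope * limit + intercept)
print(f"R^2 of the linear fit: {1 - resid.var() / balance.var():.2f}")
```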
b) and c) The code provided defines the create_se_kernel, posterior, and plot_with_uncertainty functions; a sketch of plausible implementations follows.
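The original implementations are not reproduced here, so treat this as one possible shape for create_se_kernel and posterior. The hyperparameter values and the input grid are placeholders, since the values in the problem statement were not preserved.

```python
import numpy as np

def create_se_kernel(sigma_f=1.0, ell=1000.0):
    """Return a squared exponential kernel function k(x1, x2)."""
    def kernel(x1, x2):
        sq = (x1[:, None] - x2[None, :]) ** 2
        return sigma_f ** 2 * np.exp(-sq / (2 * ell ** 2))
    return kernel

def posterior(kernel, x_train, y_train, x_test, sigma_n=0.1):
    """Posterior mean and covariance of function values at x_test."""
    C = kernel(x_train, x_train) + sigma_n ** 2 * np.eye(len(x_train))
    k_star = kernel(x_train, x_test)
    mu = k_star.T @ np.linalg.solve(C, y_train)
    cov = kernel(x_test, x_test) - k_star.T @ np.linalg.solve(C, k_star)
    return mu, cov

# Part (b): 10 draws from the zero-mean prior over an illustrative input grid.
kernel = create_se_kernel()
x_grid = np.linspace(0.0, 14000.0, 100)  # placeholder range for Limit
K = kernel(x_grid, x_grid) + 1e-8 * np.eye(len(x_grid))  # jitter for stability
prior_samples = np.random.default_rng(0).multivariate_normal(
    np.zeros(len(x_grid)), K, size=10)
```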
d) The log_marginal_likelihood function is implemented to return const_term + det_term + quad_term. The optimal length scale found is around 2154 (see the sketch below).
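A sketch of the decomposition described above, with the constant, determinant (model complexity), and quadratic (data fit) terms returned separately. Hyperparameter values other than the scanned length scale are placeholders.

```python
import numpy as np

def log_marginal_likelihood(x_train, y_train, ell, sigma_f=1.0, sigma_n=0.1):
    """Return the three components of the GP log marginal likelihood."""
    sq = (x_train[:, None] - x_train[None, :]) ** 2
    C = sigma_f ** 2 * np.exp(-sq / (2 * ell ** 2)) \
        + sigma_n ** 2 * np.eye(len(x_train))
    n = len(y_train)
    const_term = -0.5 * n * np.log(2 * np.pi)
    # slogdet is numerically safer than log(det(C)) for large matrices.
    det_term = -0.5 * np.linalg.slogdet(C)[1]                 # model complexity
    quad_term = -0.5 * y_train @ np.linalg.solve(C, y_train)  # data fit
    return const_term, det_term, quad_term

# Scan a log-spaced grid of length scales, plot each component, and pick the
# maximizer of the summed terms (around 2154 per the solution above).
lengthscales = np.logspace(1, 5, 100)
```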
e) The MLPPD function computes the mean log posterior predictive density on the validation set. The optimal length scale is 2154.4 (a log length scale of roughly 7.68), matching the value found in part (d).
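A sketch of the MLPPD computation on a 60/40 split. The data loading assumes credit_data.csv with Limit and Balance columns as in the exercise; the hyperparameters other than the scanned length scale are placeholders.

```python
import numpy as np
import pandas as pd

def mlppd(x_train, y_train, x_val, y_val, ell, sigma_f=1.0, sigma_n=0.1):
    """Mean log posterior predictive density of the validation responses."""
    sq_tt = (x_train[:, None] - x_train[None, :]) ** 2
    C = sigma_f ** 2 * np.exp(-sq_tt / (2 * ell ** 2)) \
        + sigma_n ** 2 * np.eye(len(x_train))
    sq_tv = (x_train[:, None] - x_val[None, :]) ** 2
    k_star = sigma_f ** 2 * np.exp(-sq_tv / (2 * ell ** 2))
    solved = np.linalg.solve(C, k_star)
    mu = solved.T @ y_train
    # Predictive variance of the response includes the noise variance.
    var = sigma_f ** 2 - np.sum(k_star * solved, axis=0) + sigma_n ** 2
    log_dens = -0.5 * (np.log(2 * np.pi * var) + (y_val - mu) ** 2 / var)
    return log_dens.mean()

df = pd.read_csv("credit_data.csv")
x_all = df["Limit"].to_numpy(dtype=float)
y_all = df["Balance"].to_numpy(dtype=float)
y_all = (y_all - y_all.mean()) / y_all.std()  # standardized response

# 60% training / 40% validation split.
rng = np.random.default_rng(0)
idx = rng.permutation(len(x_all))
n_train = int(0.6 * len(x_all))
train, val = idx[:n_train], idx[n_train:]

scores = [mlppd(x_all[train], y_all[train], x_all[val], y_all[val], ell)
          for ell in np.logspace(1, 5, 50)]
# The length scale maximizing `scores` is the reported optimum.
```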