{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Problem Sheet 3"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Bias-variance decomposition\n",
    "\n",
    "## Task (a) (Warm-up)\n",
    "Assume that $h_{\\theta}(x) = \\theta x$, and $(X,Y) \\sim \\mathbb{P}$, where $Y = \\theta^{true} X + \\xi$ for some fixed \"ground truth\" parameter $\\theta^{true} \\in \\mathbb{R}$, and $\\xi$ is a random variable independent of $X$ with $\\mathbb{E}[\\xi] = 0$ and $\\mathrm{Var}(\\xi)<\\infty$. Let the dataset $D=\\{(x_1,y_1),\\ldots,(x_m,y_m)\\}$ contain $m$ independent samples of $(X,Y)$. Assume quadratic pointwise loss, $\\ell(y,\\hat y) = (y-\\hat y)^2.$\n",
    "- Find the empirical risk minimizer $\\theta^* = \\arg\\min_{\\theta \\in \\mathbb{R}} L_{D}(\\theta)$.\n",
    "- Since data are sampled at random, the entire dataset $D$, and hence $\\theta^*$ which depends only on $D$, can be also seen as realisations of random variables $\\Delta$ and $\\Theta^*$, respectively. Let $\\mathbb{E}_{\\Delta}$ denote the expectation with respect to the distribution of $\\Delta$ only. Prove that $\\mathbb{E}_{\\Delta}[\\Theta^*] = \\theta^{true}$."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Task (b)\n",
    "- In the assumptions of Task (a), **prove** that $\\theta^{best} = \\theta^{true}$, where $\\theta^{best} = \\mathbb{E}[XY]/\\mathbb{E}[X^2]$ is the expected risk minimizer from Task (a) of Problem Sheet 2.\n",
    "- Hence prove that $L_{bias} = L(\\mathbb{E}_{\\Delta}[\\Theta^*])$"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Task (c): why the variance loss is a variance\n",
    "- In the assumptions of Task (a), **prove** that $\\mathbb{E}_{\\Delta} [L_{var}(\\Theta^*)] = \\mathrm{Var}_{\\Delta}[\\Theta^*] \\mathbb{E}[X^2]$, where $\\mathrm{Var}_{\\Delta}$ is the variance with respect to the distribution of $\\Delta$.\n",
    "\n",
    "_Hint: start with adding and subtracting_ $\\mathbb{E}_{\\Delta}[\\Theta^*]X$ under the loss, \n",
    "$$\n",
    "L_{var}(\\Theta^*) + L_{bias} = L(\\Theta^*) = \\mathbb{E}[(\\Theta^* X - Y)^2] = \\mathbb{E}[((\\Theta^* X - \\mathbb{E}_{\\Delta}[\\Theta^*]X) + (\\mathbb{E}_{\\Delta}[\\Theta^*]X - Y))^2],\n",
    "$$\n",
    "_expand the square under the expectation, and finally take_ $\\mathbb{E}_{\\Delta}$ _of everything. Note that_ **precomputed** $\\Delta$ _is independent of_ **new** $(X,Y)$."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "---"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Task 0: simulating the \"true\" distribution\n",
    "- **Copy** over the solution of Task 1 of Problem Sheet 2 (generation of synthetic data)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Task 1: K-fold cross validation\n",
    "- **Copy** over functions `features`, `optimise_loss` and `split_data` from Problem Sheet 2.\n",
    "- **Write a Python function** `cv(X,Y,K,n)` that takes as input arrays X and Y, the number of folds K, and the polynomial degree n, and computes the cross validation loss as follows:\n",
    " - create an integer array `ind` containing a random shuffle of the $0,\\ldots,N-1$ index sequence, where $N$ is the size of X and Y.\n",
    " - for each $k=0,\\ldots,K-1$,\n",
    "   - let $X_{train}, Y_{train}, X_{test},Y_{test}$ be training and test arrays produced from `split_data` with K folds applied to shuffled arrays `X[ind]`, `Y[ind]`, such that $X_{test},Y_{test}$ are the $k$-th chunks of `X[ind]`, `Y[ind]`, respectively.\n",
    "   - compute $\\boldsymbol\\theta^{(k)} = \\arg\\min_{\\boldsymbol\\theta\\in\\mathbb{R}^{n+1}}L_{D_{train}}(\\boldsymbol\\theta)$ using `features` to compute the Vandermonde matrix, and `optimise_loss` to solve the equations, defined by $D_{train} = (X_{train}, Y_{train})$.\n",
    "   - compute the $k$-th test loss $L_k:=L_{D_{test}^{(k)}}(\\boldsymbol\\theta^{(k)})$, where $D_{test}^{(k)}:=(X_{test},Y_{test})$.\n",
    " - Return the cross validation loss $L_{cv} = \\frac{1}{K} \\sum_{k=0}^{K-1}L_k$   "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Task 2: model selection\n",
    "- **Vary** $n$ from 0 to 8 and **plot** the $5$-fold cross validation loss for datasets created in Task 0 as a function of $n$.\n",
    "- **Find** which value of $n$ gives the smallest cross validation loss. Can we expect this value if we know how the datasets were produced in Task 0?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Task 3: convergence of the cross validation loss\n",
    "- **Vary** the number of samples in Task 0 in a range 30, 100, 300, 1000, 3000, and for each corresponding realisation of X and Y repeat the cross validation loss plotting in Task 2. What you observe as the number of samples gets larger?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.11.1"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}