{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Problem Sheet 4\n",
    "\n",
    "## Clustering of text documents\n",
    "\n",
    "In this problem sheet, we will consider clustering algorithms and their application to text data."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Task (a) (Warm-up): centroid of a set\n",
    "\n",
    "Let $C$ be a set of vectors $\\mathbf{x} \\in \\mathbb{R}^n$. \n",
    "\n",
    "- **Prove** that the vector\n",
    "$$\n",
    "\\boldsymbol\\mu = \\frac{1}{|C|} \\sum_{\\mathbf{x} \\in C} \\mathbf{x}\n",
    "$$\n",
    "(where $|C|$ is the cardinality of $C$) minimises the so-called _centroid_ loss $L(\\boldsymbol\\mu) := \\sum_{\\mathbf{x} \\in C} \\|\\mathbf{x}-\\boldsymbol\\mu\\|_2^2$."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Task (b): K-means with cosine similarity score\n",
    "\n",
    "- **Prove** that if $\\|\\mathbf{x}\\|_2=\\|\\boldsymbol\\mu\\|_2=1$ $\\forall \\mathbf{x},\\boldsymbol\\mu \\in \\mathbb{R}^n$, then \n",
    "$$\n",
    "\\|\\mathbf{x} - \\boldsymbol\\mu\\|_2^2 = 2 - 2\\cos\\angle(\\mathbf{x},\\boldsymbol\\mu).\n",
    "$$\n",
    "\n",
    "_Hint: expand $\\|\\mathbf{x} - \\boldsymbol\\mu\\|_2^2$ using inner products of vectors._\n",
    "\n",
    "**Remark**: therefore, lower Euclidean distance is equivalent to higher cosine score, and the standard K-means algorithm with Euclidean distance can be used if term-to-document vectors are produced with `TfidfVectorizer`."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Task (c) (Warm-up): single linkage document clustering\n",
    "Consider the squared distance function $d(\\mathbf{x},\\boldsymbol\\mu)^2 := 1 - \\cos\\angle(\\mathbf{x},\\boldsymbol\\mu)$,\n",
    "and the following term-to-document matrix:\n",
    "$$\n",
    "X = \\begin{bmatrix}\n",
    "     1 & 1 & 0 & 0 & 1 \\\\\n",
    "     0 & 1 & 1 & 0 & 1 \\\\\n",
    "     0 & 0 & 0 & 1 & 1\n",
    "    \\end{bmatrix}.\n",
    "$$\n",
    "\n",
    "- **Draw** the clustering dendrogram of the single linkage method applied to $\\mathbf{x}_1,\\mathbf{x}_2,\\mathbf{x}_3$ (the rows of $X$)."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Task (d): K-means document clustering\n",
    "Consider the same distance function and term-to-document matrix as in Task (c).\n",
    "\n",
    "- Trace one iteration of the 2-means algorithm, that is, **compute** the clusters and their total K-means loss for the following options for the fixed centroids:\n",
    "\n",
    "  - $\\boldsymbol\\mu_1 = \\mathbf{x}_1$, $\\boldsymbol\\mu_2 = \\mathbf{x}_2$,\n",
    "  - $\\boldsymbol\\mu_1 = \\mathbf{x}_1$, $\\boldsymbol\\mu_2 = \\mathbf{x}_3$.\n",
    "\n",
    "- Which centroids give the **smallest** K-means loss?"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "---"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Task 1 (Warm-up)\n",
    "**Write** (or borrow) Python **code** to apply the K-means algorithm (with $K=2$) to the term-to-document matrix in Tasks (c) and (d)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Task 2: Information retrieval of Bath pages\n",
    "\n",
    "Write a Python code which:\n",
    "- Loads the Bath website text data from `BathPages.npz`\n",
    "- Computes its TF $\\cdot$ IDF term-to-document matrix with the 2-norm of each row equal 1 as required in Task (b)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Task 3: K-means clustering of Bath pages\n",
    "\n",
    "- **Write** a Python **code** that:\n",
    "  - varies the number of clusters $K$ from 2 to 7 (inclusive);\n",
    "  - for each $K$ runs the $K$-means algorithm 10 times to clusterise the rows of the TF $\\cdot$ IDF term-to-document matrix from Task 2;\n",
    "  - for each $K$ and each run of the $K$-means computes and stores the Silhouette Coefficient of the clustering computed in that run;\n",
    "  - once both loops are complete, finds the clustering with the maximum Silhouette Coefficient among those computed;\n",
    "  - prints the titles of Bath Pages in each cluster in the maximum-Silhouette clustering.\n",
    "\n",
    "_Hint: recall the `sklearn.metrics.silhouette_score` function. You can use the `append` method of a `list` to simplify your code within the loops._\n",
    "  "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.10.0"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}