{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Problem Sheet 4\n", "\n", "## Clustering of text documents\n", "\n", "In this problem sheet, we will consider clustering algorithms and their application to text data." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Task (a) (Warm-up): centroid of a set\n", "\n", "Let $C$ be a set of vectors $\\mathbf{x} \\in \\mathbb{R}^n$. \n", "\n", "- **Prove** that the vector\n", "$$\n", "\\boldsymbol\\mu = \\frac{1}{|C|} \\sum_{\\mathbf{x} \\in C} \\mathbf{x}\n", "$$\n", "(where $|C|$ is the cardinality of $C$) minimises the so-called _centroid_ loss $L(\\boldsymbol\\mu) := \\sum_{\\mathbf{x} \\in C} \\|\\mathbf{x}-\\boldsymbol\\mu\\|_2^2$." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Task (b): K-means with cosine similarity score\n", "\n", "- **Prove** that if $\\|\\mathbf{x}\\|_2=\\|\\boldsymbol\\mu\\|_2=1$ $\\forall \\mathbf{x},\\boldsymbol\\mu \\in \\mathbb{R}^n$, then \n", "$$\n", "\\|\\mathbf{x} - \\boldsymbol\\mu\\|_2^2 = 2 - 2\\cos\\angle(\\mathbf{x},\\boldsymbol\\mu).\n", "$$\n", "\n", "_Hint: expand $\\|\\mathbf{x} - \\boldsymbol\\mu\\|_2^2$ using inner products of vectors._\n", "\n", "**Remark**: therefore, lower Euclidean distance is equivalent to higher cosine score, and the standard K-means algorithm with Euclidean distance can be used if term-to-document vectors are produced with `TfidfVectorizer`." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Task (c) (Warm-up): single linkage document clustering\n", "Consider the squared distance function $d(\\mathbf{x},\\boldsymbol\\mu)^2 := 1 - \\cos\\angle(\\mathbf{x},\\boldsymbol\\mu)$,\n", "and the following term-to-document matrix:\n", "$$\n", "X = \\begin{bmatrix}\n", " 1 & 1 & 0 & 0 & 1 \\\\\n", " 0 & 1 & 1 & 0 & 1 \\\\\n", " 0 & 0 & 0 & 1 & 1\n", " \\end{bmatrix}.\n", "$$\n", "\n", "- **Draw** the clustering dendrogram of the single linkage method applied to $\\mathbf{x}_1,\\mathbf{x}_2,\\mathbf{x}_3$ (the rows of $X$)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Task (d): K-means document clustering\n", "Consider the same distance function and term-to-document matrix as in Task (c).\n", "\n", "- Trace one iteration of the 2-means algorithm, that is, **compute** the clusters and their total K-means loss for the following options for the fixed centroids:\n", "\n", " - $\\boldsymbol\\mu_1 = \\mathbf{x}_1$, $\\boldsymbol\\mu_2 = \\mathbf{x}_2$,\n", " - $\\boldsymbol\\mu_1 = \\mathbf{x}_1$, $\\boldsymbol\\mu_2 = \\mathbf{x}_3$.\n", "\n", "- Which centroids give the **smallest** K-means loss?" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "---" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Task 1 (Warm-up)\n", "**Write** (or borrow) Python **code** to apply the K-means algorithm (with $K=2$) to the term-to-document matrix in Tasks (c) and (d)." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Task 2: Information retrieval of Bath pages\n", "\n", "Write a Python code which:\n", "- Loads the Bath website text data from `BathPages.npz`\n", "- Computes its TF $\\cdot$ IDF term-to-document matrix with the 2-norm of each row equal 1 as required in Task (b)." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Task 3: K-means clustering of Bath pages\n", "\n", "- **Write** a Python **code** that:\n", " - varies the number of clusters $K$ from 2 to 7 (inclusive);\n", " - for each $K$ runs the $K$-means algorithm 10 times to clusterise the rows of the TF $\\cdot$ IDF term-to-document matrix from Task 2;\n", " - for each $K$ and each run of the $K$-means computes and stores the Silhouette Coefficient of the clustering computed in that run;\n", " - once both loops are complete, finds the clustering with the maximum Silhouette Coefficient among those computed;\n", " - prints the titles of Bath Pages in each cluster in the maximum-Silhouette clustering.\n", "\n", "_Hint: recall the `sklearn.metrics.silhouette_score` function. You can use the `append` method of a `list` to simplify your code within the loops._\n", " " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.10.0" } }, "nbformat": 4, "nbformat_minor": 2 }