diff --git a/Round_5_-_Clustering.ipynb b/Round_5_-_Clustering.ipynb new file mode 100644 index 0000000000000000000000000000000000000000..680c3e8183f84e4a5d7b8dd5e6cfdc81fe5df053 --- /dev/null +++ b/Round_5_-_Clustering.ipynb @@ -0,0 +1,1921 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": { + "deletable": false, + "editable": false, + "nbgrader": { + "cell_type": "markdown", + "checksum": "38ea68f5a3197ef39aad2eb736828a68", + "grade": false, + "grade_id": "cell-708d46d3f9180abe", + "locked": true, + "schema_version": 3, + "solution": false + } + }, + "source": [ + "# Machine Learning with Python - Clustering\n", + "\n", + "\n", + "\n", + "Suppose that we are running a Cafe in Helsinki and want to explore whether there are distinct customer segments in our clientele, in order to find the best marketing strategy for the next summer. Such a customer segmentation can be done efficiently using **clustering methods**. \n", + "\n", + "In this exercise you will learn how to group a set of data points (e.g. representing customers of a cafe) into coherent groups (**clusters** or segments) using **clustering methods**. You will learn about the hard clustering method **k-means** and a soft clustering method based on a probabilistic **Gaussian mixture model** (GMM) for the data points. \n", + "\n", + "Both $k$-means and soft clustering via GMM assume that data points are represented by feature vectors $\\mathbf{x}^{(i)}=\\big(x^{(i)}_{1},\\ldots,x^{(i)}_{n}\\big)^{T}$ in the [Euclidean space](https://en.wikipedia.org/wiki/Euclidean_space) $\\mathbb{R}^{n}$. Moreover, these methods use the [Euclidean distance](https://en.wikipedia.org/wiki/Euclidean_distance) $\\|\\mathbf{x}^{(i)}-\\mathbf{x}^{(j)}\\|_2 = \\sqrt{\\sum_{t=1}^{n} \\big( x^{(i)}_{t} - x^{(j)}_{t} \\big)^{2} }$ between two data points as a measure for the (dis-)similarity between them.\n", + "\n", + "In some applications, it is beneficial to use a different concept of similarity which is not directly tied to the Euclidean distance. Hence, we will also consider the clustering method [DBSCAN](https://en.wikipedia.org/wiki/DBSCAN) which uses a \"non-Euclidean\" notion of similarity between data points." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "deletable": false, + "editable": false, + "nbgrader": { + "cell_type": "markdown", + "checksum": "81103ecb53a4c14726d5008d2466bd39", + "grade": false, + "grade_id": "cell-32738734cf6f1a4f", + "locked": true, + "schema_version": 3, + "solution": false + } + }, + "source": [ + "## Learning goals \n", + "\n", + "After this round, you should \n", + "\n", + "- be able to use k-means for hard clustering of a data set. \n", + "- be able to use GMM for soft clustering of a data set. \n", + "- be able to use DBSCAN for hard clustering of data points having a non-Euclidean structure. \n", + "- be able to choose between different clustering methods. \n", + "\n", + "\n", + "## Additional Material \n", + "\n", + "* Prof. A. Ng explaining [hard-clustering via k-Means](https://www.youtube.com/watch?v=hDmNF9JG3lo)\n", + "* Prof. A. 
Ihler on [soft-clustering with Gaussian mixture models](https://www.youtube.com/watch?v=qMTuMa86NzU)\n", + "* scikit-learn page on [k-means](https://scikit-learn.org/stable/modules/clustering.html#k-means)\n", + "* scikit-learn page on [Gaussian mixture models](https://scikit-learn.org/stable/modules/mixture.html#mixture)\n", + "* scikit-learn page on [DBSCAN](https://scikit-learn.org/stable/modules/clustering.html#dbscan)\n", + "* Chapter 8 in the [course book](https://arxiv.org/abs/1805.05052)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "deletable": false, + "editable": false, + "nbgrader": { + "cell_type": "markdown", + "checksum": "538f7afa33d972ecf49f2898ec1e07cc", + "grade": false, + "grade_id": "cell-8d16ae084e9499d4", + "locked": true, + "schema_version": 3, + "solution": false + } + }, + "source": [ + "## Introduction\n", + "\n", + "Clustering methods partition a dataset $\\mathbf{X} = \\{ \\mathbf{x}^{(i)} \\}_{i=1}^{m}$, consisting of $m$ data points $\\mathbf{x}^{(i)} \\in \\mathbb{R}^{n}$, into a small number of groups or \"clusters\" $\\mathcal{C}_{1},\\ldots,\\mathcal{C}_{k}$. Each cluster $\\mathcal{C}_{l}$ represents a subset of data points which are more similar to each other than to data points in another cluster. The precise meaning of two data points being \"similar\" depends on the application at hand. \n", + "\n", + "Clustering methods do not require labeled data and can be applied to data points characterized solely by their features $\\mathbf{x}^{(i)}$. Therefore, clustering methods are an example of **unsupervised machine learning methods**. However, clustering methods are often used in combination (e.g., as a preprocessing step) with supervised learning methods such as regression or classification. \n", + "\n", + "Clustering methods can be roughly grouped into\n", + "\n", + "* Hard clustering methods, which assign each data point to exactly one cluster, and \n", + "* Soft clustering methods, which assign each data point to several different clusters with varying degrees of belonging.\n", + "\n", + "Hard clustering can be interpreted as a special case of soft-clustering where the degrees of belonging are enforced to be either 0 (no belonging) or 1 (belongs).\n", + "\n", + "In this exercise we will implement one popular method for hard clustering, the k-means algorithm, and one popular method for soft clustering, which is based on a probabilistic Gaussian mixture model (GMM). These two methods use a notion of similarity that is tied to the Euclidean geometry of $\\mathbb{R}^{n}$. \n", + "In some applications, it is more useful to use a different notion of similarity. The hard clustering method DBSCAN is an example of a clustering method which uses a non-Euclidean notion of similarity. " + ] + }, + { + "cell_type": "markdown", + "metadata": { + "deletable": false, + "editable": false, + "nbgrader": { + "cell_type": "markdown", + "checksum": "7044166f621c0ada1dbb3a0ecc9a0963", + "grade": false, + "grade_id": "cell-7e096db2d9e8c24e", + "locked": true, + "schema_version": 3, + "solution": false + } + }, + "source": [ + "## The Data\n", + "\n", + "The file \"data.csv\" contains $m=400$ rows representing the data points $\\mathbf{x}^{(i)}=\n", + "\\big( x_{1}^{(i)},x_{2}^{(i)} \\big)$, for $i=1,\\ldots,m$. The first column of the $i$-th row in the file contains the age $x_{1}^{(i)}$ of the $i$-th customer. The second column contains the amount $x_{2}^{(i)}$ of money spent by the $i$-th customer." 
+ ] + }, + { + "cell_type": "markdown", + "metadata": { + "deletable": false, + "editable": false, + "nbgrader": { + "cell_type": "markdown", + "checksum": "56d337b0cf551cbb11bf357ad7d0344c", + "grade": false, + "grade_id": "cell-88316e7c619d6981", + "locked": true, + "schema_version": 3, + "solution": false + } + }, + "source": [ + "<a id='handsondata'></a>\n", + "<div class=\" alert alert-info\">\n", + " <b>Demo.</b> Loading the Data. \n", + " \n", + "\n", + "The following code snippet reads in $m=400$ data points from the csv file \"data.csv\". Data points are represented by feature vectors $\\mathbf{x}^{(i)} \\in \\mathbb{R}^{n}$, with $n=2$, for $i=1,\\ldots,m=400$. The feature vectors are stacked into the data matrix \n", + "$\\mathbf{X}= \\big( \\mathbf{x}^{(1)},\\ldots,\\mathbf{x}^{(m)} \\big)^{T} \\in \\mathbb{R}^{m \\times n} \\tag{1}$ and then depicted using a scatter plot.\n", + "\n", + " </div>" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "deletable": false, + "editable": false, + "nbgrader": { + "cell_type": "code", + "checksum": "5ddf5b7a018c1c89855206a94fad555f", + "grade": false, + "grade_id": "cell-25bd30950c8c3bbe", + "locked": true, + "schema_version": 3, + "solution": false + } + }, + "outputs": [], + "source": [ + "# Import required libraries\n", + "import pandas as pd\n", + "import matplotlib.pyplot as plt\n", + "from IPython.display import display\n", + "import numpy as np\n", + "\n", + "\n", + "# Read in data from the csv file and store it in the data matrix X.\n", + "df = pd.read_csv(\"/coursedata/R5_Clustering/data.csv\")\n", + "X = df.to_numpy()\n", + "\n", + "# Display first 5 rows\n", + "print(\"First five datapoints:\")\n", + "display(df.head(5)) " + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "deletable": false, + "editable": false, + "nbgrader": { + "cell_type": "code", + "checksum": "7dffd624a6c2061b0d3ad79deb7c5595", + "grade": false, + "grade_id": "cell-2c535af6d5fafda4", + "locked": true, + "schema_version": 3, + "solution": false + } + }, + "outputs": [], + "source": [ + "def plotting(data, centroids=None, clusters=None, show=True):\n", + " # This function will later on be used for plotting the clusters and centroids. 
For now we use it just to make a scatter plot of the data\n", + "    # Input: the data as an array, cluster means (centroids), cluster assignments in {0,1,...,k-1} \n", + "    # Output: a scatter plot of the data in the clusters with cluster means\n", + "    plt.figure(figsize=(8,6))\n", + "    data_colors = ['orangered', 'dodgerblue', 'springgreen']\n", + "    centroid_colors = ['red', 'darkblue', 'limegreen'] # Colors for the centroids\n", + "    plt.style.use('ggplot')\n", + "    plt.title(\"Data\")\n", + "    plt.xlabel(\"feature $x_1$: customers' age\")\n", + "    plt.ylabel(\"feature $x_2$: money spent during visit\")\n", + "\n", + "    alp = 0.5 # data points alpha\n", + "    dt_sz = 20 # marker size for data points \n", + "    cent_sz = 130 # centroid sz \n", + "    \n", + "    if centroids is None and clusters is None:\n", + "        plt.scatter(data[:,0], data[:,1], s=dt_sz, alpha=alp, c=data_colors[0])\n", + "    if centroids is not None and clusters is None:\n", + "        plt.scatter(data[:,0], data[:,1], s=dt_sz, alpha=alp, c=data_colors[0])\n", + "        plt.scatter(centroids[:,0], centroids[:,1], marker=\"x\", s=cent_sz, c=centroid_colors[:len(centroids)])\n", + "    if centroids is not None and clusters is not None:\n", + "        plt.scatter(data[:,0], data[:,1], c=[data_colors[i] for i in clusters], s=dt_sz, alpha=alp)\n", + "        plt.scatter(centroids[:,0], centroids[:,1], marker=\"x\", c=centroid_colors[:len(centroids)], s=cent_sz)\n", + "    if centroids is None and clusters is not None:\n", + "        plt.scatter(data[:,0], data[:,1], c=[data_colors[i] for i in clusters], s=dt_sz, alpha=alp)\n", + "    \n", + "    if show:\n", + "        plt.show()\n", + "\n", + "# Plot the (unclustered) data\n", + "plotting(X) " + ] + }, + { + "cell_type": "markdown", + "metadata": { + "deletable": false, + "editable": false, + "nbgrader": { + "cell_type": "markdown", + "checksum": "d3bc0e9c8566269aebe14b8c954c97c5", + "grade": false, + "grade_id": "cell-f1d445ffe42b0659", + "locked": true, + "schema_version": 3, + "solution": false + } + }, + "source": [ + "# Hard Clustering\n", + "\n", + "Hard clustering partitions a data set $\\mathbf{X}=\\{\\mathbf{x}^{(1)},\\ldots,\\mathbf{x}^{(m)}\\}$ into $k$ non-overlapping clusters $\\mathcal{C}_1,\\ldots,\\mathcal{C}_k$. Each data point is assigned to precisely one cluster. We denote by $y^{(i)} \\in \\{1,\\ldots,k\\}$ the index of the cluster to which the $i$-th data point $\\mathbf{x}^{(i)}$ belongs. \n", + "\n", + "The formal setup of hard clustering is quite similar to that of classification methods. We can interpret the cluster index $y^{(i)}$ as the label (quantity of interest) associated with the $i$-th data point. In contrast to classification problems, clustering methods do not require any labeled data points. \n", + "\n", + "Clustering methods do not require knowledge of the **true cluster assignment** for any data point. Instead, clustering methods learn a reasonable cluster assignment for a data point based on the intrinsic geometry of the entire dataset $\\mathbb{X}$. Clustering methods are referred to as **unsupervised machine learning** methods since they do not need supervision in the form of labeled data. \n", + "\n", + "\n", + "\n", + "## The k-Means Algorithm\n", + "\n", + "A popular method for hard clustering is the k-means algorithm, which takes as input a list of data points $\\mathbb{X}= \\{ \\mathbf{x}^{(1)},...,\\mathbf{x}^{(m)} \\}$ and groups them into $k$ non-overlapping clusters $\\mathcal{C}_{1},\\ldots,\\mathcal{C}_{k}$. 
Each (non-empty) cluster $\\mathcal{C}_{c} \\subseteq \\mathbb{X}$ is characterized by the cluster mean \n", + "\n", + "\\begin{equation*}\n", + "\\mathbf{m}^{(c)} = (1/|\\mathcal{C}_{c}|) \\sum_{\\mathbf{x}^{(i)} \\in \\mathcal{C}_{c}} \\mathbf{x}^{(i)}, \n", + "\\end{equation*}\n", + "\n", + "where $|\\mathcal{C}_c|$ denotes the number of data points in the cluster $\\mathcal{C}_c$.\n", + "\n", + "If we knew the cluster means $\\mathbf{m}^{(c)}$ for each cluster, we could assign each data point $\\mathbf{x}^{(i)}$ to the cluster with index $y^{(i)}$ whose mean is closest to $\\mathbf{x}^{(i)}$: \n", + "\\begin{equation}\n", + "\\| \\mathbf{x}^{(i)} - \\mathbf{m}^{(y^{(i)})} \\| = {\\rm min}_{c \\in \\{1,\\ldots,k\\}}\\| \\mathbf{x}^{(i)} - \\mathbf{m}^{(c)} \\|. \n", + "\\end{equation} \n", + "However, in order to determine the cluster means $\\mathbf{m}^{(c)}$, we would already need the cluster assignments $y^{(i)}$ in the first place. This instance of a [chicken-and-egg dilemma](https://en.wikipedia.org/wiki/Chicken_or_the_egg) is resolved by the $k$-means algorithm as follows:\n", + "\n", + "* __Input__: data points $\\mathbf{x}^{(i)} \\in \\mathbb{R}^{n}$, for $i=1,\\ldots,m$ and number $k$ of clusters\n", + "\n", + "\n", + "* __Initialization__: choose initial cluster means $\\mathbf{m}^{(1)},\\ldots,\\mathbf{m}^{(k)} \\in \\mathbb{R}^{n}$\n", + "\n", + "\n", + "* __Repeat Until Stopping Condition is Met:__ \n", + "\n", + "    * __Update Cluster Assignments__: assign each data point to the nearest cluster: \n", + "    \n", + "    for each data point $i=1,\\ldots,m$, set \n", + "    \n", + "    \\begin{equation*}\n", + "    y^{(i)} = \\underset{c' \\in \\{1,\\ldots,k\\}}{\\operatorname{argmin}} \\|\\mathbf{x}^{(i)} - \\mathbf{m}^{(c')}\\|^2 , \n", + "    \\tag{1}\n", + "    \\end{equation*}\n", + "    \n", + "    * __Update Cluster Means__: determine cluster means for new cluster assignments \n", + "    \n", + "    for each cluster $c=1,\\ldots,k$, set \n", + "    \\begin{equation*}\n", + "    \\mathbf{m}^{(c)} = \\frac{1}{\\mid\\{i: y^{(i)}= c\\}\\mid}{\\sum_{i: y^{(i)}= c}\\mathbf{x}^{(i)}} \n", + "    \\label{mean}\n", + "    \\tag{2}\n", + "    \\end{equation*}\n", + "    where $\\{i: y^{(i)}= c\\}$ represents the set of data points belonging to cluster $c$ and $\\mid\\{i: y^{(i)}= c\\}\\mid$ the size of cluster $c$. \n", + "    \n", + "\n", + "\n", + "The $k$-means algorithm is best understood by walking through an example. To this end, we apply the $k$-means algorithm to the customer data discussed above." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "deletable": false, + "editable": false, + "nbgrader": { + "cell_type": "markdown", + "checksum": "f1f28aaeaf989beb7d10c3c0b89cce75", + "grade": false, + "grade_id": "cell-3804f5df8a9fd322", + "locked": true, + "schema_version": 3, + "solution": false + } + }, + "source": [ + "<a id='handsondata'></a>\n", + "<div class=\" alert alert-info\">\n", + " <b>Demo.</b> Apply k-means. \n", + "\n", + "The code snippet below uses the `KMeans` class in scikit-learn to group the Cafe customers into $k=3$ clusters $\\mathcal{C}_{1}$,$\\mathcal{C}_{2}$ and $\\mathcal{C}_{3}$ using k-means. We hypothesize that these clusters represent three different customer segments. The resulting clusters are depicted in a scatter plot with distinct colors for the different clusters. The cluster means (centers) $\\mathbf{m}^{(1)}$, $\\mathbf{m}^{(2)}$ and $\\mathbf{m}^{(3)}$ are represented by crosses. 
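Under the hood, each `KMeans` iteration performs exactly the two updates (1) and (2). As a purely illustrative, minimal NumPy sketch of one such iteration (assuming that no cluster becomes empty; the demo itself relies on scikit-learn):\n", + "\n", + "```python\n", + "import numpy as np\n", + "\n", + "def kmeans_iteration(X, means):\n", + "    # Update cluster assignments: nearest cluster mean for each data point, cf. (1)\n", + "    dists = np.linalg.norm(X[:, None, :] - means[None, :, :], axis=2)\n", + "    y = dists.argmin(axis=1)\n", + "    # Update cluster means: average of the points assigned to each cluster, cf. (2)\n", + "    means = np.array([X[y == c].mean(axis=0) for c in range(means.shape[0])])\n", + "    return y, means\n", + "```\n", + "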
\n", + "\n", + " </div>" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "deletable": false, + "editable": false, + "nbgrader": { + "cell_type": "code", + "checksum": "7c915f78d08beae5c90124cacb940333", + "grade": false, + "grade_id": "cell-b0f1fbd252a8619f", + "locked": true, + "schema_version": 3, + "solution": false + } + }, + "outputs": [], + "source": [ + "from sklearn.cluster import KMeans\n", + "\n", + "m, n = X.shape # Get the number of data points m and number of features n\n", + "\n", + "k = 3 # Define number of clusters to use\n", + "\n", + "cluster_means = np.zeros((k,n)) # Store the resulting clustering means in the rows of this np array\n", + "cluster_labels = np.zeros(m) # Store here the resulting cluster indices (one for each data point)\n", + "\n", + "k_means = KMeans(n_clusters = k, max_iter = 100) # Create k-means object with k=3 clusters and using maximum 100 iterations\n", + "k_means.fit(X) # Fit the k-means object (find the cluster labels for the datapoints in X)\n", + "cluster_means = k_means.cluster_centers_ # Get cluster means (centers)\n", + "cluster_indices = k_means.labels_ # Get the cluster labels for each data point\n", + "\n", + "# Plot the clustered data\n", + "plotting(X, cluster_means, cluster_indices)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "deletable": false, + "editable": false, + "nbgrader": { + "cell_type": "markdown", + "checksum": "fbba340b4711f5ccb92356e8e4264816", + "grade": false, + "grade_id": "cell-499e32db19e29a2b", + "locked": true, + "schema_version": 3, + "solution": false + } + }, + "source": [ + "<a id='handsondata'></a>\n", + "<div class=\" alert alert-warning\">\n", + " <b>Student Task.</b> Apply k-means. \n", + " \n", + "\n", + "Apply the k-means algorithm on the customer data in the matrix `X` using the `scikit-learn` class `KMeans`. The class allows specifying the number of clusters and number of iterations with the input parameters `n_clusters` and `max_iter`. Use `max_iter=10` so that the function alternatingly updates the cluster assignment and cluster means $10$ times. Apply k-means with $k=2$ clusters. Do not set any other input parameter so that their default values are used. \n", + "\n", + "Store the resulting cluster means in the numpy array `cluster_means` of shape (2,2) and the resulting cluster assignments in the numpy array `cluster_indices` of shape (400, ). 
\n", + "\n", + "[Documentation of k-means in scikit-learn](https://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html)\n", + "\n", + " </div>" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "deletable": false, + "nbgrader": { + "cell_type": "code", + "checksum": "aca16ad5065e9834b2b1cf2efea56288", + "grade": false, + "grade_id": "cell-ba554a7805d8665b", + "locked": false, + "schema_version": 3, + "solution": true + } + }, + "outputs": [], + "source": [ + "from sklearn.cluster import KMeans\n", + "\n", + "m, n = X.shape # Get the number of data points m and number of features n\n", + "\n", + "k = 2 # The number of clusters to use\n", + "\n", + "np.random.seed(1) # Set random seed for reproducability (DO NOT CHANGE THIS!)\n", + "\n", + "### STUDENT TASK ###\n", + "# ...\n", + "# cluster_means = ...\n", + "# cluster_indices = ...\n", + "# YOUR CODE HERE\n", + "raise NotImplementedError()\n", + "\n", + "# Plot the clustered data\n", + "plotting(X, cluster_means, cluster_indices)\n", + "print(\"The final cluster mean values are:\\n\", cluster_means)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "deletable": false, + "editable": false, + "nbgrader": { + "cell_type": "code", + "checksum": "ce014bd4f8646a48ba4115fa44c0b49d", + "grade": true, + "grade_id": "cell-54e9055fb23e8bbb", + "locked": true, + "points": 3, + "schema_version": 3, + "solution": false + } + }, + "outputs": [], + "source": [ + "# Perform some sanity checks on the outputs\n", + "assert X.shape == (400, 2), f'numpy array X has wrong shape'\n", + "assert cluster_means.shape == (2, 2), f'numpy array cluster_means has wrong shape'\n", + "assert cluster_indices.shape == (400,), f'numpy array cluster indices has wrong shape'\n", + "assert cluster_means[0, 0] > 30, f'the value at cluster_means[0,0] is too small'\n", + "\n", + "print('Sanity check tests passed!')\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "deletable": false, + "editable": false, + "nbgrader": { + "cell_type": "markdown", + "checksum": "4bb4fbac1ad21a51aa6441f0987293a2", + "grade": false, + "grade_id": "cell-75969c53d25b1a55", + "locked": true, + "schema_version": 3, + "solution": false + } + }, + "source": [ + "## Handling Local Minima\n", + "\n", + "The k-means algorithm can be interpreted as method for minimizing the **clustering error**\n", + "\n", + "\\begin{equation}\n", + "\\mathcal{E} \\big( \\{\\mathbf{m}^{(c)}\\}_{c=1}^{k},\\{y^{(i)}\\}_{i=1}^{m} \\mid \\{\\mathbf{x}^{(i)}\\}_{i=1}^{m} \\big)\n", + "=(1/m) \\sum_{i=1}^{m} {\\left\\|\\mathbf{x}^{(i)}-\\mathbf{m}^{(y^{(i)})}\\right\\|^2}, \n", + "\\label{EqErr}\n", + "\\tag{3}\n", + "\\end{equation}\n", + "\n", + "which is defined as the mean of the squared distances between each data point $\\mathbf{x}^{(i)}$ and the mean $\\mathbf{m}^{(y^{(i)})}$ of its assigned cluster.\n", + "\n", + "On each iteration, the k-means algorithm first minimizes the clustering error with respect to the cluster assignments $\\{y^{(i)}\\}_{i=1}^{m}$ conditionally on the current means $\\mathbf{m}^{(y^{(i)})}$. Next, the algorithm minimizes the clustering error with respect to the cluster means $\\mathbf{m}^{(y^{(i)})}$ given the newly assigned clusters. 
By repeating this alternating minimization, the k-means algorithm moves towards progressively better clusterings with lower clustering errors.\n", + "\n", + "The optimization interpretation of k-means allows us to define a criterion for when to stop iterating the cluster assignment and means updates. The input parameter `tol` of the Python class `KMeans` allows us to specify a relative tolerance that is used to declare convergence (see [documentation](https://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html)). When the cluster centers change by sufficiently little (relative to `tol`) between two consecutive iterations, the k-means algorithm terminates.\n", + "\n", + "Since the clustering error is a [non-convex function](https://stats.stackexchange.com/questions/324561/difference-between-convex-and-concave-functions) of the cluster means and assignments, the k-means method might get trapped in a [**local minimum**](https://en.wikipedia.org/wiki/Maxima_and_minima#/media/File:Extrema_example_original.svg). This means that the k-means algorithm terminates before finding the best possible clustering, as measured by the clustering error. \n", + "\n", + "In particular, the initial cluster means have a significant effect on the final clustering of the k-means algorithm. For some selections of initial means, the best possible clustering might be unattainable. As such, it is useful to repeat k-means **several times with different initializations** for the cluster means and choose the cluster assignment resulting in the smallest clustering error among all repetitions. \n", + "\n", + "This is what we will do in the next task. However, instead of repeating k-means until convergence we will only perform a fixed number of iterations for each choice of initial clusters." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "deletable": false, + "editable": false, + "nbgrader": { + "cell_type": "markdown", + "checksum": "703956e7f02e9d7be393655f57731edf", + "grade": false, + "grade_id": "cell-c0fcfeaa318281b6", + "locked": true, + "schema_version": 3, + "solution": false + } + }, + "source": [ + "<div class=\"alert alert-warning\">\n", + "\n", + "<b> Student Task.</b> Repeat $k$-means To Escape Local Minima.\n", + "\n", + "Consider using $k$-means for a fixed number of $k=3$ clusters and $L=10$ iterations. Instead of running $k$-means once using some initial choice for the cluster means, we repeat $k$-means for a total of $50$ repetitions. We enumerate the repetitions using the index $r=0,1,\\ldots,49$. \n", + "\n", + "* For the $r$th repetition, we use the $r$th row of the numpy arrays `init_means_cluster1`, `init_means_cluster2`, `init_means_cluster3` as initial cluster means for the Python function `KMeans`. The initial cluster means are defined by passing a matrix of shape `(n_clusters, n_features)` in the `init` parameter of the `KMeans` object. The first row of this matrix should contain the mean of the first cluster etc. When creating the `KMeans` object, also set the parameter `n_init=1`.\n", + " \n", + " \n", + "* The clustering error obtained from the resulting cluster assignments of $k$-means during the $r$th repetition is stored in the $r$th entry of the numpy array `clustering_err` (shape (50,)). \n", + "\n", + " \n", + "* The cluster assignments obtained in those repetitions that resulted in the smallest and largest clustering error should be stored in the numpy arrays `best_assignment` and `worst_assignment`, each of shape (400,1). 
The index (starting at 0) of the repetition yielding the smallest clustering error should be stored in the variable `min_ind`. The index (starting at 0) of the repetition yielding the largest clustering error should be stored in the variable `max_ind`. \n", + "\n", + "**Hint:** The sum of the squared distances of the data points to their respective centers is stored in the attribute `KMeans.inertia_` after fitting the k-means model. This value can be used to calculate the clustering error with the formula in (3). \n", + "\n", + "</div>" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "deletable": false, + "nbgrader": { + "cell_type": "code", + "checksum": "5570b09c6358546192e4838c2e372392", + "grade": false, + "grade_id": "cell-f82df05617001b3b", + "locked": false, + "schema_version": 3, + "solution": true + } + }, + "outputs": [], + "source": [ + "\n", + "m = X.shape[0] # Number of data points\n", + "\n", + "min_ind = 0 # Store here the index of the repetition yielding smallest clustering error \n", + "max_ind = 0 # .... largest clustering error\n", + "\n", + "cluster_assignment = np.zeros((50, m), dtype=np.int32) # Array for storing clustering assignments\n", + "clustering_err = np.zeros(50,) # Array for storing the clustering errors for each assignment\n", + "\n", + "np.random.seed(42) # Set random seed for reproducibility (DO NOT CHANGE THIS!)\n", + "\n", + "init_means_cluster1 = np.random.randn(50,2) # Use the rows of this numpy array to init k-means \n", + "init_means_cluster2 = np.random.randn(50,2) # Use the rows of this numpy array to init k-means \n", + "init_means_cluster3 = np.random.randn(50,2) # Use the rows of this numpy array to init k-means \n", + "\n", + "best_assignment = np.zeros((m,1)) # Store here the cluster assignment achieving the smallest clustering error\n", + "worst_assignment = np.zeros((m,1)) # Store here the cluster assignment achieving the largest clustering error\n", + "\n", + "### STUDENT TASK ###\n", + "# loop 0,...,49\n", + "# ...\n", + "# end loop\n", + "# min_ind = ...\n", + "# max_ind = ...\n", + "# best_assignment = ...\n", + "# worst_assignment = ...\n", + "# YOUR CODE HERE\n", + "raise NotImplementedError()\n", + "\n", + "# Plot the best and worst cluster assignments (w.r.t. 
clustering error)\n", + "print(\"Cluster assignment with smallest clustering error:\")\n", + "plotting(X, clusters=best_assignment)\n", + "print(\"Cluster assignment with largest clustering error:\")\n", + "plotting(X, clusters=worst_assignment)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "deletable": false, + "editable": false, + "nbgrader": { + "cell_type": "code", + "checksum": "c34a4719bdf53c3cde86195b11d02678", + "grade": true, + "grade_id": "cell-07f03a2383510f62", + "locked": true, + "points": 3, + "schema_version": 3, + "solution": false + } + }, + "outputs": [], + "source": [ + "# Perform some sanity checks on the solution\n", + "assert any(best_assignment != 0), 'You have to assign value for best_assignment ' \n", + "assert any(worst_assignment != 0), 'You have to assign value for worst_assignment ' \n", + "assert best_assignment.shape[0] == 400, 'incorrect cluster labels for minimal clustering error'\n", + "assert worst_assignment.shape[0] == 400, 'incorrect cluster labels for maximal clustering error'\n", + "print('Sanity check tests passed!')\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "deletable": false, + "editable": false, + "nbgrader": { + "cell_type": "markdown", + "checksum": "0a9a3cac2c5ccd027664a2d52f393aa7", + "grade": false, + "grade_id": "cell-faf9155650cb8972", + "locked": true, + "schema_version": 3, + "solution": false + } + }, + "source": [ + "## How Many Clusters ? \n", + "\n", + "Sometimes it is not clear what a good choice for the number $k$ of clusters should be. One possible (data-driven) method to choose $k$ is to run k-means for increasing values of $k$ until the clustering error is below a prescribed level (say $10$ percent). \n", + "\n", + "How would you expect the error to behave with respect to the number of clusters? Do you think that the clustering error is a good selection criteria for the number of clusters?" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "deletable": false, + "editable": false, + "nbgrader": { + "cell_type": "markdown", + "checksum": "0a8397d926ecb21b00bf38f6749dd660", + "grade": false, + "grade_id": "cell-4241c9a7fb62aa51", + "locked": true, + "schema_version": 3, + "solution": false + } + }, + "source": [ + "<div class=\"alert alert-warning\">\n", + "\n", + "<b> Student Task.</b> Try out different number $k$ of clusters. \n", + "\n", + "\n", + "Apply k-means to the Cafe customer data for the choices $k=1,\\ldots,8$. For each choice of $k$, fit `KMeans` using $L=100$ iterations. Store the resulting clustering error in the numpy array `err_clustering` of shape (8,). \n", + "\n", + " \n", + "**Hint:** The sum of the squared distances of the data points to their respective centers is stored in the attribute `KMeans.inertia_` after fitting the k-means model. This value can be used to calculate the clustering error with the formula in (3). 
\n", + "</div>" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "deletable": false, + "nbgrader": { + "cell_type": "code", + "checksum": "cd186dd3dfb981dcff951d39f6844acc", + "grade": false, + "grade_id": "cell-1866c8dbba1169f6", + "locked": false, + "schema_version": 3, + "solution": true + } + }, + "outputs": [], + "source": [ + "m = X.shape[0] # Number of data points\n", + "err_clustering = np.zeros(8) # Array for storing clustering errors\n", + "\n", + "### STUDENT TASK ###\n", + "# YOUR CODE HERE\n", + "raise NotImplementedError()\n", + "\n", + "print(f'Clustering errors: \\n{err_clustering}')\n", + "\n", + "# Plot the clustering error as a function of the number k of clusters\n", + "plt.figure(figsize=(8,6))\n", + "plt.plot(range(1,9), err_clustering)\n", + "plt.xlabel('Number of clusters')\n", + "plt.ylabel('Clustering error')\n", + "plt.title(\"The number of clusters vs clustering error\")\n", + "plt.show() " + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "deletable": false, + "editable": false, + "nbgrader": { + "cell_type": "code", + "checksum": "58cd0ead7fd5d73901e20595c606bb64", + "grade": true, + "grade_id": "cell-04389af635df57fc", + "locked": true, + "points": 3, + "schema_version": 3, + "solution": false + } + }, + "outputs": [], + "source": [ + "# Perform some sanity checks on the outputs\n", + "assert all(err_clustering > 0), 'Store clustering errors for varing number of clusters' \n", + "assert err_clustering.shape == (8,), 'Incorrect shape for errors of models. Use the pre-defined variable err_clustering'\n", + "np.testing.assert_allclose(err_clustering[0], 9.5, atol = 0.5 ), 'clustering error when using one cluster is incorrect!'\n", + "np.testing.assert_allclose(err_clustering[7], 1, atol = 0.5), 'clustering error when using eight clusters is incorrect!'\n", + "\n", + "print('Sanity check tests passed!')\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "deletable": false, + "editable": false, + "nbgrader": { + "cell_type": "markdown", + "checksum": "8b1d6c9310d76f0a9b3f5150de95082b", + "grade": false, + "grade_id": "cell-e0bb173981deb14e", + "locked": true, + "schema_version": 3, + "solution": false, + "task": false + } + }, + "source": [ + "In the figure above, we can see that the clustering error decreases monotonously with an increasing number of clusters. This is expected since the addition of an additional cluster cannot result in a larger average distance from the points to their corresponding cluster means. It naturally follows that the raw clustering error is not a good measure for the selection of the number of clusters since it will always favor a larger number. \n", + "\n", + "The selection of a suitable number of clusters is a central part of solving clustering problems, and many techniques have been developed for this purpose. We do not consider this aspect of clustering problems in this course in more detail, but additional information can be found on, for example, [wikipedia](https://en.wikipedia.org/wiki/Determining_the_number_of_clusters_in_a_data_set). 
" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "deletable": false, + "editable": false, + "nbgrader": { + "cell_type": "markdown", + "checksum": "3de44f2f8044785bef46205139666a55", + "grade": false, + "grade_id": "cell-563bae1795984110", + "locked": true, + "schema_version": 3, + "solution": false + } + }, + "source": [ + "# Soft Clustering \n", + "\n", + "The information provided by the cluster assignments $y^{(i)}$, for $i=1,\\ldots,m$, delivered by $k$-means is rather coarse-grained, as each data point belongs to only one cluster. Even if two data points belong to the same cluster, their location within the cluster might be very different. \n", + "\n", + "Consider the data set whose scatter plot is shown below. The data points represented by blue dots that lie outside the biggest circle are somewhat in-between different clusters. However, the cluster assignments delivered by k-means do not reflect the different locations of data points relative to the center of the cluster.\n", + "\n", + "<center>\n", + " <img src=\"../../../coursedata/R5_Clustering/graph_example.png\" alt=\"Example of not good hard-clustering\"/>\n", + "</center>\n", + "\n", + "In many applications it is useful to measure the degree of a data point belonging to a cluster. This can be done using probabilistic clustering methods that determine the probabilities of each datapoint belonging to the different clusters.\n", + "\n", + "\n", + "## Probabilistic Models\n", + "\n", + "When applying a **probabilistic model**, we interpret data points as realizations of random variables and try to fit a probability distribution to them. For example, when applying a Gaussian model to a dataset we assume that the data is normally distributed, and fit a normal distribution to the data by finding the optimal values for the mean and variance of the distribution, e.g. the mean and variance that maximize the likelihood of observing the data (assuming the data is generated from a Gaussian distribution). \n", + "\n", + "In this section we will demonstrate how to use the [`GaussianMixture`](https://scikit-learn.org/stable/modules/generated/sklearn.mixture.GaussianMixture.html#sklearn.mixture.GaussianMixture) to fit a mixture of Gaussians to datasets. After this, we will consider the use of the Gaussian Mixture Model in soft clustering.\n", + "\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "deletable": false, + "editable": false, + "nbgrader": { + "cell_type": "markdown", + "checksum": "b33b5f2c4e5df02b5215c124ea68daa1", + "grade": false, + "grade_id": "cell-7770eda90b5ace82", + "locked": true, + "schema_version": 3, + "solution": false, + "task": false + } + }, + "source": [ + "<a id='FitProbabilisticDemo'></a>\n", + "<div class=\" alert alert-info\">\n", + " <b>Demo.</b> Fitting a Gaussian Mixture Model to Data. \n", + " \n", + "\n", + "The code snippet below fits a mixture of two Gaussian distributions to data points which are generated using a random number generator. In particular, the data points are sampled from a mixture of two Gaussian distributions with different mean and covariance. Gaussian mixture models (GMM) are implemented in the Python class [`GaussianMixture`](https://scikit-learn.org/stable/modules/generated/sklearn.mixture.GaussianMixture.html#sklearn.mixture.GaussianMixture). Fitting a GMM to data points is done using the function `GaussianMixture.fit()`. 
Thus, instead of fitting a GMM using pen and paper (solving a difficult [maximum likelihood problem](https://stephens999.github.io/fiveMinuteStats/intro_to_em.html)), we can use one line of Python code :-) \n", + "\n", + " </div>" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "import numpy as np\n", + "import matplotlib.pyplot as plt\n", + "from matplotlib.colors import LogNorm\n", + "from sklearn.mixture import GaussianMixture\n", + "\n", + "m = 300 # Define number of data points to generate from each distribution\n", + "np.random.seed(0) # Set random seed for reproducibility (DO NOT CHANGE!)\n", + "\n", + "# Generate spherical data centered on (20, 20)\n", + "shifted_gaussian = np.random.randn(m, 2) + np.array((20, 20))\n", + "\n", + "# Generate zero centered stretched Gaussian data\n", + "C = np.array([[0., -0.7], [3.5, .7]]) \n", + "stretched_gaussian = np.dot(np.random.randn(m, 2), C)\n", + "\n", + "# Concatenate the two datasets into the final simulated dataset\n", + "X_sim = np.vstack([shifted_gaussian, stretched_gaussian])\n", + "\n", + "# Fit a Gaussian Mixture Model with two components\n", + "gmm = GaussianMixture(n_components=2, covariance_type='full', random_state=1)\n", + "gmm.fit(X_sim)\n", + "\n", + "# Display predicted scores by the model as a contour plot\n", + "x = np.linspace(-20, 30, 100)\n", + "y = np.linspace(-20, 40, 100)\n", + "X_mg, Y_mg = np.meshgrid(x, y)\n", + "XX = np.array([X_mg.ravel(), Y_mg.ravel()]).T\n", + "Z = -gmm.score_samples(XX)\n", + "Z = Z.reshape(X_mg.shape)\n", + "\n", + "plt.figure(figsize=(8,6))\n", + "CS = plt.contour(X_mg, Y_mg, Z, norm=LogNorm(vmin=1.0, vmax=1000.0), levels=np.logspace(0, 3, 10))\n", + "CB = plt.colorbar(CS, shrink=0.8, extend='both')\n", + "plt.scatter(X_sim[:, 0], X_sim[:, 1], .8) # Scatterplot the datapoints onto the same figure\n", + "\n", + "plt.title('Negative log-likelihood predicted by a GMM')\n", + "plt.axis('tight')\n", + "plt.show()" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "deletable": false, + "editable": false, + "nbgrader": { + "cell_type": "markdown", + "checksum": "3910419dccc9c1eef1ba7ef06d9d6860", + "grade": false, + "grade_id": "cell-62c4dfa465ca226a", + "locked": true, + "schema_version": 3, + "solution": false, + "task": false + } + }, + "source": [ + "## Gaussian Mixture Models \n", + "\n", + "Let us now consider soft-clustering methods which provide more fine-grained information about the cluster structure of a data set. In our customer segmentation problem, we might need some measure for the extent (or degree) by which a customer belongs to various groups. Soft-clustering methods associate each data point $\\mathbf{x}^{(i)}$ with a \"cluster-membership\" vector $\\mathbf{y}^{(i)}= (y^{(i)}_1,...,y^{(i)}_k) \\in [0,1]^k$. The entry $y^{(i)}_c$ is the degree of confidence by which we assign $\\mathbf{x}^{(i)}$ to cluster $\\mathcal{C}_c$. \n", + "\n", + "A principled approach to soft-clustering is based on interpreting data points $\\mathbf{x}^{(i)}$ as realizations of a random vector $\\mathbf{x}$ with probability distribution $p(\\mathbf{x})$. Soft-clustering assumes that each data point is obtained by randomly drawing from one of $k$ different clusters. Each cluster $\\mathcal{C}_{c}$ corresponds to a [Gaussian random vector](https://en.wikipedia.org/wiki/Multivariate_normal_distribution) with some mean vector $\\mathbf{\\mu}^{(c)}$ and covariance matrix $\\mathbf{C}^{(c)}$. 
The probability density function (pdf) of a Gaussian random vector $\\mathbf{x}$ with mean $\\mathbf{\\mu}$ and covariance matrix $\\mathbf{C}$ is given as \n", + "\n", + "\\begin{equation}\n", + "\\mathcal{N}(\\mathbf{x} ; \\mathbf{\\mu}, \\mathbf{C}) = \\frac{1}{\\sqrt{{\\rm det} \\big(2 \\pi \\mathbf{C}\\big)}} {\\rm exp } \\bigg( -\\frac{1}{2} \\big(\\mathbf{x} - \\mathbf{\\mu} \\big)^{T} \\mathbf{C}^{-1} \\big(\\mathbf{x} - \\mathbf{\\mu}\\big)\\bigg).\n", + "\\end{equation}\n", + "\n", + "Note that this expression is only valid for Gaussian random vectors having a non-singular (invertible) covariance matrix $\\mathbf{C}$. The resulting pdf of a data point is a **Gaussian mixture model (GMM)** \n", + "\\begin{equation}\n", + "p(\\mathbf{x}) = \\sum_{c=1}^{k} p_{c} \\mathcal{N}(\\mathbf{x};\\mathbf{\\mu}^{(c)},\\mathbf{C}^{(c)}). \n", + "\\end{equation}\n", + "The coefficients $p_{c} \\geq 0$ are required to satisfy $\\sum_{c=1}^{k}p_c=1$ and represent the (prior) probability that a data point is drawn from cluster $\\mathcal{C}_{c}$. \n", + "\n", + "Note that the GMM $p(\\mathbf{x})$ is parametrized by \n", + "\n", + "* the cluster probabilities $p_{1},p_{2}, ..., p_k$, \n", + "* the cluster means $\\mathbf{\\mu}^{(1)},\\mathbf{\\mu}^{(2)},..., \\mathbf{\\mu}^{(k)}$ \n", + "* and the covariance matrices $\\mathbf{C}^{(1)},\\mathbf{C}^{(2)},..., \\mathbf{C}^{(k)}$.\n", + "\n", + "Using the GMM $p(\\mathbf{x})$, we can make the notion of a degree of belonging precise. In particular, we define the degree $y_{c}^{(i)}$ of a data point $\\mathbf{x}^{(i)}$ belonging to cluster $\\mathcal{C}_{c}$ as the (posterior) probability that $\\mathbf{x}^{(i)}$ is generated (drawn) from the Gaussian distribution associated with $\\mathcal{C}_{c}$: \n", + "\n", + "$$y^{(i)}_c = \\frac{p_{c} \\mathcal{N}(\\mathbf{x}^{(i)} ; \\mathbf{\\mu}^{(c)}, \\mathbf{C}^{(c)})}{\\sum_{c'=1}^k p_{c'} \\mathcal{N}(\\mathbf{x}^{(i)} ; \\mathbf{\\mu}^{(c')}, \\mathbf{C}^{(c')})} $$\n", + "\n", + "After determining the degrees of belonging $y^{(i)}_c$, we can update our guess for (estimate of) the cluster probabilities $p_{c}$, cluster means $\\mathbf{\\mu}^{(c)}$ and covariance matrices $\\mathbf{C}^{(c)}$.\n", + "\n", + "In summary, this algorithm (an instance of the expectation-maximization (EM) algorithm) consists of four steps:\n", + "\n", + "* __Step 1 - Initialize the cluster parameters. These are the probabilities, means and covariances for every cluster.__\n", + "* __Step 2 - Update the degrees of belonging $y^{(i)}_c$ of each data point $\\mathbf{x}^{(i)}$ to every cluster $c$.__\n", + "* __Step 3 - Update cluster probabilities $p_{c}$, means $\\mathbf{\\mu}^{(c)}$ and covariances $\\mathbf{C}^{(c)}$.__\n", + "* __Step 4 - If stopping criterion is not satisfied, go to step 2__" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "deletable": false, + "editable": false, + "nbgrader": { + "cell_type": "markdown", + "checksum": "d1be3aeb72ef251ef91f036bb358aa5b", + "grade": false, + "grade_id": "cell-421919c3669a6f95", + "locked": true, + "schema_version": 3, + "solution": false + } + }, + "source": [ + "<a id='handsondata'></a>\n", + "<div class=\" alert alert-info\">\n", + " <b>Demo.</b> Plotting GMM. \n", + "\n", + "The code snippet below implements a helper-function `plot_GMM()` which can be used to illustrate a GMM along with a scatter plot of the data points. 
You do not need to understand the details, but feel free to explore it.\n", + "\n", + " </div>" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "deletable": false, + "editable": false, + "nbgrader": { + "cell_type": "code", + "checksum": "1ac41d19150e9ea1e57a69702dfd6610", + "grade": false, + "grade_id": "cell-1d31463f3e548c97", + "locked": true, + "schema_version": 3, + "solution": false + } + }, + "outputs": [], + "source": [ + "def plot_GMM(data, means, covariances, k, clusters=None):\n", + " \n", + " ## Select three colors for the plot\n", + " # if you want to plot curves k>3, extend these lists of colors\n", + " data_colors = ['orangered', 'dodgerblue', 'springgreen'] # colors for data points\n", + " centroid_colors = ['red', 'darkblue', 'limegreen'] # colors for the centroids\n", + " \n", + " k = means.shape[0]\n", + " plt.figure(figsize=(8,6)) # Set figure size\n", + " if clusters is None:\n", + " plt.scatter(data[:,0], data[:,1], s=13, alpha=0.5)\n", + " else:\n", + " plt.scatter(data[:,0], data[:,1], c=[data_colors[i] for i in clusters], s=13, alpha=0.5)\n", + "\n", + " # Visualization of results\n", + " x_plot = np.linspace(19, 35, 100)\n", + " y_plot = np.linspace(0, 12, 100)\n", + " x_mesh, y_mesh = np.meshgrid(x_plot, y_plot)\n", + " pos = np.empty(x_mesh.shape + (2,))\n", + " pos[:,:,0] = x_mesh \n", + " pos[:,:,1] = y_mesh\n", + "\n", + " # For each cluster, plot the pdf defined by the mean and covariance\n", + " for i in range(k):\n", + " z = multivariate_normal.pdf(pos, mean = means[i,:], cov = covariances[i])\n", + " plt.contour(x_mesh, y_mesh, z, 4, colors=centroid_colors[i], alpha=0.5)\n", + " plt.scatter(means[i,0], means[i,1], marker='x', c=centroid_colors[i])\n", + "\n", + " plt.title(\"Soft clustering with GMM\")\n", + " plt.xlabel(\"feature x_1: customers' age\")\n", + " plt.ylabel(\"feature x_2: money spent during visit\")\n", + " plt.show()" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "deletable": false, + "editable": false, + "nbgrader": { + "cell_type": "markdown", + "checksum": "ca658bce084b205be5d4b5e2ee726317", + "grade": false, + "grade_id": "cell-5fe73c8270fab14d", + "locked": true, + "schema_version": 3, + "solution": false + } + }, + "source": [ + "<a id='handsondata'></a>\n", + "<div class=\" alert alert-warning\">\n", + " <b>Student task.</b> Soft clustering using GMM. \n", + "\n", + "Your task is to perform soft clustering on the customer data using a GMM model. 
The means of the components of the fitted GMM should be stored in the variable `means`, the covariance matrices in `covariances`, and the cluster labels (or indices) in `cluster_labels`.\n", + "\n", + "After fitting the `GaussianMixture` object, the means can be found in the `.means_` attribute and the covariances in the `.covariances_` attribute.\n", + " \n", + "The final result of the clustering will be plotted below.\n", + "</div>" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "deletable": false, + "nbgrader": { + "cell_type": "code", + "checksum": "fa1ff224d95c88928c5c873bb7e16baf", + "grade": false, + "grade_id": "cell-84901f86938b51a6", + "locked": false, + "schema_version": 3, + "solution": true, + "task": false + } + }, + "outputs": [], + "source": [ + "from sklearn.mixture import GaussianMixture\n", + "from scipy.stats import multivariate_normal # Multivariate normal random variable\n", + "\n", + "m, n = X.shape # Get the number of data points m and number of features n\n", + "\n", + "# Define the number of clusters\n", + "k = 3\n", + "\n", + "means = np.zeros((k, n)) # Array for storing the cluster means\n", + "covariances = np.zeros((k, n, n)) # Array for storing the covariance matrices\n", + "cluster_labels = np.zeros(m) # Array for storing the cluster labels of each data point\n", + "\n", + "np.random.seed(1) # Set random seed for reproducibility \n", + "\n", + "### STUDENT TASK ###\n", + "# ...\n", + "# ...\n", + "# y_pred = ...\n", + "# means = ...\n", + "# covariances = ...\n", + "# YOUR CODE HERE\n", + "raise NotImplementedError()\n", + "\n", + "plot_GMM(X, means, covariances, k, cluster_labels)\n", + "print(\"The means are:\\n\", means, \"\\n\")\n", + "print(\"The covariance matrices are:\\n\", covariances)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "deletable": false, + "editable": false, + "nbgrader": { + "cell_type": "code", + "checksum": "c60f60677d2994054d00032cfb8ec323", + "grade": true, + "grade_id": "cell-9bc17a16e26f09ea", + "locked": true, + "points": 3, + "schema_version": 3, + "solution": false, + "task": false + } + }, + "outputs": [], + "source": [ + "# Perform some sanity check on the outputs\n", + "assert means.shape == (3, 2), \"The shape of 'means' is wrong!\"\n", + "assert covariances.shape == (3, 2, 2), \"The shape of 'covariances' is wrong!\"\n", + "assert cluster_labels.shape == (X.shape[0],)\n", + "\n", + "print('Sanity checks passed')" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "deletable": false, + "editable": false, + "nbgrader": { + "cell_type": "markdown", + "checksum": "55f88544e01dcfd355244ee2764e0700", + "grade": false, + "grade_id": "cell-1977954f1d4bdfaa", + "locked": true, + "schema_version": 3, + "solution": false, + "task": false + } + }, + "source": [ + "From the figure, we can see that the red cluster seems quite distinct from the other data points. In contrast, the blue and green clusters seem to have considerable overlap. " + ] + }, + { + "cell_type": "markdown", + "metadata": { + "deletable": false, + "editable": false, + "nbgrader": { + "cell_type": "markdown", + "checksum": "024adf3a750b5c51f2caec36c0c0f909", + "grade": false, + "grade_id": "cell-6a7b4694364079a7", + "locked": true, + "schema_version": 3, + "solution": false + } + }, + "source": [ + "# Density Based Clustering\n", + "\n", + "Both k-means and GMM cluster data points according to their Euclidean distance. 
This is in many cases a suitable measure of the similarity between two data points. However, in some applications the data conforms to a different, non-Euclidean structure. \n", + "\n", + "One important non-Euclidean structure is based on the notion of connectivity. Here, two data points are considered similar if one can be reached from the other via intermediate data points, where consecutive points have a small Euclidean distance. Two data points can therefore be similar even if their Euclidean distance is large.\n", + "\n", + "<img src=\"../../../coursedata/R5_Clustering/DBSCAN.png\" alt=\"Drawing\" style=\"width: 500px\"/>\n", + "\n", + "\n", + "**Density-based spatial clustering of applications with noise (DBSCAN)** is a hard clustering method that uses a connectivity based similarity measure. In contrast to k-means and GMM, DBSCAN does not require the number of clusters to be pre-defined; the number of clusters will depend on its parameters. Moreover, DBSCAN can detect outliers, which can be interpreted as degenerate clusters consisting of exactly one data point. For a detailed discussion of how DBSCAN works, we refer to https://en.wikipedia.org/wiki/DBSCAN. \n", + "\n", + "<img src=\"../../../coursedata/R5_Clustering/DBSCAN_tutorial.gif\" alt=\"Drawing\" style=\"width: 450px;\"/>\n", + "\n", + "DBSCAN is implemented in the scikit-learn class `DBSCAN`. [Documentation can be found here](https://scikit-learn.org/stable/modules/generated/sklearn.cluster.DBSCAN.html). DBSCAN requires specification of two design parameters `eps` and `min_samples`. The meaning of these parameters is well explained [here](https://scikit-learn.org/stable/modules/generated/sklearn.cluster.DBSCAN.html). \n", + "The method `fit_predict(X)` fits the model to the data points in `X` and returns their cluster labels." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "deletable": false, + "editable": false, + "nbgrader": { + "cell_type": "markdown", + "checksum": "0cd8226ea6cbbdefc7f166ab7ab71d07", + "grade": false, + "grade_id": "cell-5c140c77d4aa7add", + "locked": true, + "schema_version": 3, + "solution": false + } + }, + "source": [ + "<a id='handsondata'></a>\n", + "<div class=\" alert alert-info\">\n", + " <b>Demo.</b> Generating two datasets with different structures. \n", + "\n", + "The code snippet below creates two datasets $\\mathbb{X}^{(1)}$ and $\\mathbb{X}^{(2)}$ with different structures, which will be used to demonstrate the usefulness of DBSCAN. 
The datasets are stored in the numpy arrays `dataset1` and `dataset2` respectively.\n", + "\n", + "</div>" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "deletable": false, + "editable": false, + "nbgrader": { + "cell_type": "code", + "checksum": "5da02b67dc0e036ce2518c4734c6e62c", + "grade": false, + "grade_id": "cell-a83d63263f658e6a", + "locked": true, + "schema_version": 3, + "solution": false + } + }, + "outputs": [], + "source": [ + "import numpy as np\n", + "import matplotlib.pyplot as plt\n", + "from sklearn.metrics import silhouette_score\n", + "from sklearn import cluster, datasets, mixture\n", + "from sklearn.neighbors import kneighbors_graph\n", + "\n", + "np.random.seed(844) # Set random seed for reproducibility\n", + "\n", + "# Create dataset with separate Gaussian clusters\n", + "clust1 = np.random.normal(5, 2, (1000,2))\n", + "clust2 = np.random.normal(15, 3, (1000,2))\n", + "clust3 = np.random.multivariate_normal([17,3], [[1,0],[0,1]], 1000)\n", + "clust4 = np.random.multivariate_normal([2,16], [[1,0],[0,1]], 1000)\n", + "dataset1 = np.concatenate((clust1, clust2, clust3, clust4))\n", + "\n", + "# Create dataset containing circular data\n", + "dataset2 = datasets.make_circles(n_samples=1000, factor=.5, noise=.05)[0]\n", + "\n", + "# Function for plotting clustering output on two datasets\n", + "def cluster_plots(data_1, data_2, clusters_1, clusters_2, title1='Dataset 1', title2='Dataset 2'):\n", + " fig, ax = plt.subplots(1, 2, figsize=(12,6))\n", + " ax[0].set_title(title1,fontsize=14)\n", + " ax[0].set_xlim(min(data_1[:,0]), max(data_1[:,0]))\n", + " ax[0].set_ylim(min(data_1[:,1]), max(data_1[:,1]))\n", + " ax[0].scatter(data_1[:,0], data_1[:,1], s=13, lw=0, c=clusters_1)\n", + " ax[1].set_title(title2,fontsize=14)\n", + " ax[1].set_xlim(min(data_2[:,0]), max(data_2[:,0]))\n", + " ax[1].set_ylim(min(data_2[:,1]), max(data_2[:,1]))\n", + " ax[1].scatter(data_2[:,0], data_2[:,1], s=13, lw=0, c=clusters_2)\n", + " fig.tight_layout()\n", + " plt.show()\n", + "\n", + "# Plot the unclustered datasets (i.e. all points belonging to cluster 1)\n", + "cluster_plots(dataset1, dataset2, np.ones(dataset1.shape[0]), np.ones(dataset2.shape[0]))" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "While the first dataset consists of clusters in the Euclidean sense, the structure of the second dataset is different. Visually, it seems clear that there are two clusters corresponding to the data in the inner and outer ring respectively. Let us apply k-means to both datasets and evaluate whether the resulting clusterings match our intuitive ones." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "<a id='handsondata'></a>\n", + "<div class=\" alert alert-info\">\n", + " <b>Demo.</b> K-means on Euclidean and non-Euclidean clusters. 
\n", + "\n", + "The code snippet below uses k-means to cluster the data `dataset1` and `dataset2`, and plots the resulting clusterings for evaluation of the results.\n", + "\n", + "</div>" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "deletable": false, + "editable": false, + "nbgrader": { + "cell_type": "code", + "checksum": "c723be75ca45a92a2599b25ee4eb5004", + "grade": false, + "grade_id": "cell-e1b686855e4ec2c5", + "locked": true, + "schema_version": 3, + "solution": false + } + }, + "outputs": [], + "source": [ + "from sklearn.cluster import KMeans\n", + "\n", + "# Perform k-means clustering on both datasets, and get the clusters for each datapoint\n", + "k_means_1 = KMeans(n_clusters=4)\n", + "k_means_2 = KMeans(n_clusters=2)\n", + "clusters_1 = k_means_1.fit_predict(dataset1)\n", + "clusters_2 = k_means_2.fit_predict(dataset2)\n", + "\n", + "# Plot the clustered datasets\n", + "cluster_plots(dataset1, dataset2, clusters_1, clusters_2)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "deletable": false, + "editable": false, + "nbgrader": { + "cell_type": "markdown", + "checksum": "8bc8fffa8a28c540a5f3f794172f4882", + "grade": false, + "grade_id": "cell-932e7a28147c62dd", + "locked": true, + "schema_version": 3, + "solution": false + } + }, + "source": [ + "The shortcomings of k-means when applied to the circular data are apparent in the figure above. While k-means produces a reasonable clustering of the first dataset $\\mathbb{X}^{(1)}$, it fails to find the intrinsic cluster structure of the second dataset $\\mathbb{X}^{(2)}$. Next, you will apply DBSCAN to the same datasets and in particular explore whether this method results in a better clustering for the second dataset." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "deletable": false, + "editable": false, + "nbgrader": { + "cell_type": "markdown", + "checksum": "85fac725672277b9b4b32046b4a7ec9b", + "grade": false, + "grade_id": "cell-251a1bf4fbd36501", + "locked": true, + "schema_version": 3, + "solution": false + } + }, + "source": [ + "<div class=\"alert alert-warning\">\n", + "\n", + "<b> Student Task.</b> Clustering with DBSCAN. \n", + "\n", + "Apply DBSCAN to the two datasets $\\mathbb{X}^{(1)}$ and $\\mathbb{X}^{(2)}$ stored in the numpy arrays `dataset1` and `dataset2`.\n", + "Use the parameter choices `min_samples=5`, `metric='euclidean'` and `eps` value $1$ for $\\mathbb{X}^{(1)}$ and value $0.1$ for dataset $\\mathbb{X}^{(2)}$. \n", + "Use the `fit_predict(dataset)` method to obtain the cluster assignments for the data points in each of the two data sets and store them in the numpy arrays `clusters_1` and `clusters_2`, respectively. \n", + "\n", + "* `clusters_1` should be of shape $(4000, )$\n", + "\n", + "* `clusters_2` should be of shape $(1000, )$\n", + "\n", + "* for each dataset, count the number of data points that do not belong to any cluster and Store them in the variables `dataset1_noise_points` and `dataset2_noise_points`. The assigned cluster label for these data points is `-1`. 
\n", + "\n", + "\n", + "</div>" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "deletable": false, + "nbgrader": { + "cell_type": "code", + "checksum": "9635eab2590824141c35a16633447a7a", + "grade": false, + "grade_id": "cell-0067b2f63bfa39f5", + "locked": false, + "schema_version": 3, + "solution": true + } + }, + "outputs": [], + "source": [ + "from sklearn.cluster import DBSCAN\n", + "\n", + "# Define eps values for the two datasets\n", + "eps_1 = 1\n", + "eps_2 = 0.1\n", + "\n", + "### STUDENT TASK ###\n", + "# clusters_1 = DBSCAN(eps = ... \n", + "# clusters_2 = DBSCAN(eps = ...\n", + "# dataset1_noise_points = ...\n", + "# dataset2_noise_points = ...\n", + "# YOUR CODE HERE\n", + "raise NotImplementedError()\n", + "\n", + "print(f'Noise points in Dataset 1:\\n {dataset1_noise_points}/{len(clusters_1)} \\n')\n", + "print(f'Noise points in Dataset 2:\\n {dataset2_noise_points}/{len(clusters_2)} \\n')\n", + "\n", + "# Plot the clustered datasets\n", + "cluster_plots(dataset1, dataset2, clusters_1, clusters_2)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "deletable": false, + "editable": false, + "nbgrader": { + "cell_type": "code", + "checksum": "02409ac3d20aef5c2e2bfbe615ffdc79", + "grade": true, + "grade_id": "cell-bbfb7bb9dc2cbb53", + "locked": true, + "points": 3, + "schema_version": 3, + "solution": false + } + }, + "outputs": [], + "source": [ + "# Perform some sanity checks on the outputs\n", + "assert clusters_1.shape == (4000,), 'Shape of dbscan_dataset1 is wrong.'\n", + "assert clusters_2.shape == (1000,), 'Shape of dbscan_dataset1 is wrong.'\n", + "assert dataset1_noise_points < 50, 'Number of noise points in dataset 1 should be less than 50.'\n", + "assert dataset2_noise_points < 5, 'Number of noise points in dataset 2 should be less than 5.'\n", + "\n", + "print('Sanity check tests passed!')\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "deletable": false, + "editable": false, + "nbgrader": { + "cell_type": "markdown", + "checksum": "0e5c1108a782f1705389a99e208e9ae6", + "grade": false, + "grade_id": "cell-bca398236dbb4364", + "locked": true, + "schema_version": 3, + "solution": false, + "task": false + } + }, + "source": [ + "If you solved the exercise correctly, it should be evident that DBSCAN performs significantly better on the second dataset compared to the k-means algorithm. We can also observe that some of the datapoints in both datasets are left unclustered. " + ] + }, + { + "cell_type": "markdown", + "metadata": { + "deletable": false, + "editable": false, + "nbgrader": { + "cell_type": "markdown", + "checksum": "80a97cb10032777607b31768667b7dfa", + "grade": false, + "grade_id": "cell-ababb5d9a609b7b1", + "locked": true, + "schema_version": 3, + "solution": false, + "task": false + } + }, + "source": [ + "## Combining Clustering with Logistic Regression\n", + "\n", + "Assume that we have found different customer segments in the customer dataset using clustering. Now, we would like to provide each new customer targeted offers on our products based on the segment that the customer belongs to. This means that we would have to classify each new customer into one of the customer segments.\n", + "\n", + "A simple approach to this is to use some classification criteria based on the clustering algorithm used. For example, if we used k-means for clustering we could assign each new customer to the class whose mean is closest to the new data point. 
A more refined approach is to first find a reasonable clustering of the dataset with some clustering algorithm, and then train a separate classifier (e.g. logistic regression) on the clustered data, using the cluster assignments as the labels of the data points." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "deletable": false, + "editable": false, + "nbgrader": { + "cell_type": "markdown", + "checksum": "8207b0d6c5c35d177e3c85908cc61cc7", + "grade": false, + "grade_id": "cell-88645d8c3565c185", + "locked": true, + "schema_version": 3, + "solution": false, + "task": false + } + }, + "source": [ + "<a id='clusterregression'></a>\n", + "<div class=\" alert alert-info\">\n", + " <b>Demo.</b> Combining clustering with logistic regression. \n", + "\n", + "The code snippet below demonstrates how to use clustering in conjunction with logistic regression to classify new customers into one of the customer segments found by clustering. In this example we use k-means for clustering and logistic regression for classification, but these could be replaced by alternative clustering and classification algorithms. The process consists of the following steps:\n", + " \n", + "- Choose the number $k$ of presumed clusters in the dataset and use k-means to cluster the data.\n", + " \n", + " \n", + "- Define a supervised classification problem with feature matrix $\mathbf{X}$ and label vector $y$, which contains the labels obtained by clustering.\n", + " \n", + " \n", + "- Use logistic regression to find the decision boundaries between the $k$ classes (previously clusters).\n", + " \n", + "</div>" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "deletable": false, + "editable": false, + "nbgrader": { + "cell_type": "code", + "checksum": "e4bdaebf7100140298d761a5c6746337", + "grade": false, + "grade_id": "cell-a3e23e9a802f58d3", + "locked": true, + "schema_version": 3, + "solution": false, + "task": false + } + }, + "outputs": [], + "source": [ + "from sklearn.linear_model import LogisticRegression\n", + "\n", + "np.random.seed(42) # Set random seed for reproducibility\n", + "\n", + "k = 2 # Select the number of clusters \n", + "k_means = KMeans(n_clusters=k)\n", + "k_means.fit(X)\n", + "y = k_means.labels_ # Define the label vector y as the cluster labels\n", + "\n", + "# Fit logistic regression model \n", + "log_reg = LogisticRegression()\n", + "log_reg.fit(X, y)\n", + "\n", + "# Get the weights and intercept of the decision boundary for plotting\n", + "w = log_reg.coef_\n", + "intercept = log_reg.intercept_\n", + "\n", + "# Plot clusters and decision boundary\n", + "plotting(X, clusters=y, show=False)\n", + "ax = plt.gca() # Get current axes \n", + "x_min, x_max = ax.get_xlim() # Get x-axis limits from axes\n", + "x_g = np.linspace(x_min, x_max) # Grid of x_1 values for plotting the decision boundary\n", + "decision_boundary = (-intercept - w[:,0] * x_g) / w[:,1] # Solve w_1*x_1 + w_2*x_2 + b = 0 for x_2\n", + "y_lim = ax.get_ylim() \n", + "plt.plot(x_g, decision_boundary, color='navy') # Plot decision boundary\n", + "plt.ylim(y_lim)\n", + "plt.show()" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "deletable": false, + "editable": false, + "nbgrader": { + "cell_type": "markdown", + "checksum": "9c1338733bf7f38f35c146372de14181", + "grade": false, + "grade_id": "cell-cbf2ec0b82d8a530", + "locked": true, + "schema_version": 3, + "solution": false, + "task": false + } + }, + "source": [ + "We can see that the linear decision boundary nicely separates the two customer segments. 
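Given a dataset containing data on new customers, we could now use the fitted logistic regression model to classify these customers into one of the two segments. A minimal sketch, again with a hypothetical `new_customers` array holding the same two features as `X`:\n", + "\n", + "```python\n", + "new_segments = log_reg.predict(new_customers)          # hard segment labels (0 or 1)\n", + "segment_probs = log_reg.predict_proba(new_customers)   # per-segment membership probabilities\n", + "```\n", + "\n", + "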
There is one significant caveat to take into account when combining clustering with classification in the way described here. In an ordinary classification problem the labels are assigned manually, so they can be taken to be accurate (up to human judgement). However, when the labels are obtained by clustering, there is no similar guarantee that they are reasonable or that the classes actually represent something meaningful. One should thus strive to validate the reasonableness of the clustering whenever possible." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "deletable": false, + "editable": false, + "nbgrader": { + "cell_type": "markdown", + "checksum": "ad186857b271a95b5669fdf15b553ed1", + "grade": false, + "grade_id": "cell-dcdf87f2b12d531a", + "locked": true, + "schema_version": 3, + "solution": false + } + }, + "source": [ + "<a id='QuestionR5_1'></a>\n", + "<div class=\" alert alert-warning\">\n", + " <p><b>Student Task.</b> Question R5.1. </p>\n", + "\n", + " <p>Which of the following statements is true?</p>\n", + "\n", + "<ol>\n", + " <li> Clustering methods require labeled data points.</li>\n", + " <li> Clustering methods aim at predicting a numeric quantity.</li>\n", + " <li> Clustering methods do not require any labels.</li>\n", + " <li> Clustering methods can only be used for fewer than 500 data points.</li>\n", + "</ol> \n", + "\n", + "</div>" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "deletable": false, + "nbgrader": { + "cell_type": "code", + "checksum": "0c50a328b5cd904b31a86ba31050be46", + "grade": false, + "grade_id": "cell-1e6f109f6ceac301", + "locked": false, + "schema_version": 3, + "solution": true + } + }, + "outputs": [], + "source": [ + "# answer_R5_Q1 = ...\n", + "# YOUR CODE HERE\n", + "raise NotImplementedError()" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "deletable": false, + "editable": false, + "nbgrader": { + "cell_type": "code", + "checksum": "03b7efbb57d6f8fe02ca4aa7edb8a085", + "grade": true, + "grade_id": "cell-b241eab691108c69", + "locked": true, + "points": 1, + "schema_version": 3, + "solution": false + } + }, + "outputs": [], + "source": [ + "# this cell is for tests\n", + "assert answer_R5_Q1 in [1,2,3,4], 'answer_R5_Q1 should be an integer between 1 and 4.'\n", + "print('Sanity check tests passed!')\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "deletable": false, + "editable": false, + "nbgrader": { + "cell_type": "markdown", + "checksum": "b4e6e5d300151d66b0bdf14e683644d9", + "grade": false, + "grade_id": "cell-5c79c55384709e4c", + "locked": true, + "schema_version": 3, + "solution": false + } + }, + "source": [ + "<a id='QuestionR5_2'></a>\n", + "<div class=\" alert alert-warning\">\n", + " <p><b>Student Task.</b> Question R5.2. 
</p>\n", + "\n", + " <p>Which of the following statements is true?</p>\n", + "\n", + "<ol>\n", + " <li> DBSCAN is a soft clustering method.</li>\n", + " <li> DBSCAN automatically determines the number of clusters.</li>\n", + " <li> DBSCAN requires labeled data.</li>\n", + " <li> DBSCAN can only be used for less than $1000$ data points.</li>\n", + "</ol> \n", + "\n", + "</div>" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "deletable": false, + "nbgrader": { + "cell_type": "code", + "checksum": "b3ed013ab359fb0b0317ab5aee9a09b3", + "grade": false, + "grade_id": "cell-84ca5f04b7227c8d", + "locked": false, + "schema_version": 3, + "solution": true + } + }, + "outputs": [], + "source": [ + "# answer_R5_Q2 = ...\n", + "# YOUR CODE HERE\n", + "raise NotImplementedError()" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "deletable": false, + "editable": false, + "nbgrader": { + "cell_type": "code", + "checksum": "f5e651e9b2ba5f1ed791875cf3dac808", + "grade": true, + "grade_id": "cell-c8e0f373e1823bbf", + "locked": true, + "points": 1, + "schema_version": 3, + "solution": false + } + }, + "outputs": [], + "source": [ + "# this cell is for tests\n", + "assert answer_R5_Q2 in [1,2,3,4], '\"answer_R5_Q2\" Value should be an integer between 1 and 4.'\n", + "print('Sanity check tests passed!')\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "deletable": false, + "editable": false, + "nbgrader": { + "cell_type": "markdown", + "checksum": "ce9910cdd1e52ac05dccc5dcaeeb0a0e", + "grade": false, + "grade_id": "cell-e155345970d1af9c", + "locked": true, + "schema_version": 3, + "solution": false + } + }, + "source": [ + "<a id='QuestionR5_3'></a>\n", + "<div class=\" alert alert-warning\">\n", + " <p><b>Student Task.</b> Question R5.3. 
</p>\n", + "\n", + " <p>What is the maximum length of feature vectors $\\mathbf{x}^{(i)}$ that can be handled using k-means ?</p>\n", + "\n", + "<ol>\n", + " <li> Given sufficient computational resources, k-means can handle arbitrarily long feature vectors.</li>\n", + " <li> $100$ </li>\n", + " <li> $1000$ </li>\n", + " <li> $10^{10}$ </li>\n", + "</ol> \n", + "\n", + "</div>" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "deletable": false, + "nbgrader": { + "cell_type": "code", + "checksum": "d7b7641c19d82dc1cadb71d1736f0fee", + "grade": false, + "grade_id": "cell-8f6781ca15b315b0", + "locked": false, + "schema_version": 3, + "solution": true + } + }, + "outputs": [], + "source": [ + "# answer_R5_Q3 = ...\n", + "# YOUR CODE HERE\n", + "raise NotImplementedError()" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "deletable": false, + "editable": false, + "nbgrader": { + "cell_type": "code", + "checksum": "4aafa2d537b10f4345c602d6eabf6a9f", + "grade": true, + "grade_id": "cell-dc7686f9360d87f4", + "locked": true, + "points": 1, + "schema_version": 3, + "solution": false + } + }, + "outputs": [], + "source": [ + "# this cell is for tests\n", + "assert answer_R5_Q3 in [1,2,3,4], '\"answer_R5_Q3\" Value should be an integer between 1 and 4.'\n", + "print('Sanity check tests passed!')\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "deletable": false, + "editable": false, + "nbgrader": { + "cell_type": "markdown", + "checksum": "ba5d8187538198d1fef99187e00846c6", + "grade": false, + "grade_id": "cell-7c215a9a84b020a8", + "locked": true, + "schema_version": 3, + "solution": false, + "task": false + } + }, + "source": [ + "<a id='QuestionR5_4'></a>\n", + "<div class=\" alert alert-warning\">\n", + " <p><b>Student Task.</b> Question R5.4. 
</p>\n", + "\n", + " <p>In which of the following ways is clustering combined with classification in this notebook?</p>\n", + "\n", + "<ol>\n", + " <li> Clustering is used to separate the training and validation sets used for classification </li>\n", + " <li> Clustering is used to find clusters within different classes </li>\n", + " <li> Clustering is used to give a label for each datapoint, after which a supervised classification algorithm is trained on the labelled data.</li>\n", + " <li> Classification is used to validate the reasonability of the clustering</li>\n", + "</ol> \n", + "\n", + "</div>" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "deletable": false, + "nbgrader": { + "cell_type": "code", + "checksum": "7686c82c053473872cc949ce1637258d", + "grade": false, + "grade_id": "cell-d523029a39e429a7", + "locked": false, + "schema_version": 3, + "solution": true, + "task": false + } + }, + "outputs": [], + "source": [ + "# answer_R5_Q4 = ...\n", + "# YOUR CODE HERE\n", + "raise NotImplementedError()" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "deletable": false, + "editable": false, + "nbgrader": { + "cell_type": "code", + "checksum": "e43db741419b2ea99492e47ec59a9bb6", + "grade": true, + "grade_id": "cell-59b838799b7d2d99", + "locked": true, + "points": 1, + "schema_version": 3, + "solution": false, + "task": false + } + }, + "outputs": [], + "source": [ + "# this cell is for tests\n", + "assert answer_R5_Q4 in [1,2,3,4], '\"answer_R5_Q4\" Value should be an integer between 1 and 4.'\n", + "print('Sanity check tests passed!')\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.7.3" + }, + "varInspector": { + "cols": { + "lenName": 16, + "lenType": 16, + "lenVar": 40 + }, + "kernels_config": { + "python": { + "delete_cmd_postfix": "", + "delete_cmd_prefix": "del ", + "library": "var_list.py", + "varRefreshCmd": "print(var_dic_list())" + }, + "r": { + "delete_cmd_postfix": ") ", + "delete_cmd_prefix": "rm(", + "library": "var_list.r", + "varRefreshCmd": "cat(var_dic_list()) " + } + }, + "types_to_exclude": [ + "module", + "function", + "builtin_function_or_method", + "instance", + "_Feature" + ], + "window_display": false + } + }, + "nbformat": 4, + "nbformat_minor": 4 +}