diff --git a/Round_6_-_Dimensionality_Reduction.ipynb b/Round_6_-_Dimensionality_Reduction.ipynb new file mode 100644 index 0000000000000000000000000000000000000000..9bc24489d9fda67e7179ce3a4ed3cf034b756864 --- /dev/null +++ b/Round_6_-_Dimensionality_Reduction.ipynb @@ -0,0 +1,1823 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": { + "deletable": false, + "editable": false, + "nbgrader": { + "cell_type": "markdown", + "checksum": "8a7407672fbc81d46b89f4036e0480ef", + "grade": false, + "grade_id": "cell-50a11b583482297d", + "locked": true, + "schema_version": 3, + "solution": false + } + }, + "source": [ + "# Machine Learning with Python - Dimensionality Reduction\n", + "\n", + "Machine learning applications often involve high-dimensional data. Consider a $1000 \\times 1000$ pixel image. If we stack the intensities of pixels into a vector, we obtain a vector of length $10^6$. In this round, we study dimensionality reduction methods that transform long feature vectors into shorter feature vectors which retain most of the relevant information contained in the raw long vectors. Such dimensionality reduction is useful for at least three reasons: \n", + "\n", + "- Shorter feature vectors typically imply **fewer computations** required by ML methods. \n", + "- Shorter (but informative) feature vectors favor **good generalization** of ML methods from training data to new data points. Using very long feature vectors bears the risk of overfitting the training data. \n", + "- Transforming long raw vectors (e.g. obtained from the pixel intensities of a snapshot taken with a megapixel camera) to vectors of length 2 (or 3) allows us to **visualize** data points in a scatter plot.\n", + "\n", + "In this notebook, we specifically focus on dimensionality reduction by principal component analysis (PCA). In PCA, the data is linearly transformed into the lower-dimensional representation that loses the least information over the entire dataset."
+ ] + }, + { + "cell_type": "markdown", + "metadata": { + "deletable": false, + "editable": false, + "nbgrader": { + "cell_type": "markdown", + "checksum": "841960154221606c5b3503c75c1b59ab", + "grade": false, + "grade_id": "cell-23dd358e94895c83", + "locked": true, + "schema_version": 3, + "solution": false + } + }, + "source": [ + "## Learning goals\n", + "\n", + "After this round, you should \n", + "\n", + "- understand the basic idea behind dimensionality reduction. \n", + "- be able to implement PCA using the Python library `scikit-learn`. \n", + "- understand the trade-off between amount of dimensionality reduction and information loss. \n", + "- be able to combine PCA with a supervised ML method such as linear regression. " + ] + }, + { + "cell_type": "markdown", + "metadata": { + "deletable": false, + "editable": false, + "nbgrader": { + "cell_type": "markdown", + "checksum": "fb9546471d5d2dc19a9169338d53716b", + "grade": false, + "grade_id": "cell-9c8204a98447db2b", + "locked": true, + "schema_version": 3, + "solution": false + } + }, + "source": [ + "## Additional Material \n", + "* [Video](https://www.youtube.com/watch?v=FgakZw6K1QQ) by StatQuest on PCA\n", + "* [Video lecture](https://www.youtube.com/watch?v=Zbr5hyJNGCs) of Prof. Andrew Ng on dimensionality reduction \n", + "* [Video lecture](https://www.youtube.com/watch?v=cnCzY5M3txk) of Prof. Andrew Ng on dim. red. for data visualization \n", + "* [Video lecture](https://www.youtube.com/watch?v=T-B8muDvzu0) of Prof. Andrew Ng on principal component analysis\n", + "* https://stats.stackexchange.com/questions/2691/making-sense-of-principal-component-analysis-eigenvectors-eigenvalues\n", + "* Chapter 9 of the [Course Book](https://arxiv.org/abs/1805.05052) - Dimensionality Reduction." 
+ ] + }, + { + "cell_type": "markdown", + "metadata": { + "deletable": false, + "editable": false, + "nbgrader": { + "cell_type": "markdown", + "checksum": "740eb7f88b7bbdc90aacacd3e8f8db54", + "grade": false, + "grade_id": "cell-b119cb801fe9fd35", + "locked": true, + "schema_version": 3, + "solution": false + } + }, + "source": [ + "## Bananas and Apples\n", + "\n", + "<a id=\"data\"></a>\n", + "Consider a dataset containing $m=30$ image files of apples and bananas: \n", + "\n", + "* 15 images of apples stored in the files named `1.jpg` to `15.jpg`\n", + "* 15 images of bananas stored in the files named `16.jpg` to `30.jpg`\n", + "\n", + "The files contain color images, but for our purposes we convert them to grayscale images. We can represent each pixel of a grayscale image by a number between 0 (black pixel) and 255 (white pixel). The size of each image is $50\\times50$ pixels. Thus, we can represent each fruit image by the \"raw\" feature vector $\\mathbf{z} =\\big(z_{1},\\ldots,z_{D}\\big)^{T} \\in \\mathbb{R}^{D}$ of length $D = 2500$. The $j$th entry $z_{j}$ is the grayscale level of the $j$th pixel." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "deletable": false, + "editable": false, + "nbgrader": { + "cell_type": "markdown", + "checksum": "c2c0a22e80127c102c3fc3ad44892c9e", + "grade": false, + "grade_id": "cell-4da09c247a436939", + "locked": true, + "schema_version": 3, + "solution": false + } + }, + "source": [ + "<a id='handsondata'></a>\n", + "<div class=\" alert alert-info\">\n", + " <b>Demo.</b> Loading the Data. \n", + " \n", + " \n", + "The following code block loads the images, converts them into grayscale images and stores them in the matrix $\\mathbf{Z} \\in \\mathbb{R}^{30 \\times 2500}$ whose $i$th row $\\mathbf{z}^{(i)} \\in \\mathbb{R}^{2500}$ contains the grayscale intensities for the $i$th image. 
The first three apple images and the first three banana images are displayed.\n", + "\n", + " </div>" + ] + }, + { + "cell_type": "code", + "execution_count": 1, + "metadata": { + "deletable": false, + "editable": false, + "nbgrader": { + "cell_type": "code", + "checksum": "85ef4be0d441044ab9e1ce9bce9bf15a", + "grade": false, + "grade_id": "importPIL", + "locked": true, + "schema_version": 3, + "solution": false + } + }, + "outputs": [], + "source": [ + "# Import required libraries (packages) for this exercise\n", + "from PIL import Image\n", + "import matplotlib.pyplot as plt\n", + "import numpy as np\n", + "\n", + "plt.style.use('ggplot') # Change style for nice plots :)" + ] + }, + { + "cell_type": "code", + "execution_count": 2, + "metadata": { + "deletable": false, + "editable": false, + "nbgrader": { + "cell_type": "code", + "checksum": "a02ae9b0c51a42e28bc777621fb62a74", + "grade": false, + "grade_id": "cell-5cfd1c8afd6117c0", + "locked": true, + "schema_version": 3, + "solution": false + } + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "The shape of the datamatrix Z is (30, 2500)\n" + ] + }, + { + "data": { + "image/png": "\n", + "text/plain": [ + "<Figure size 720x720 with 6 Axes>" + ] + }, + "metadata": {}, + "output_type": "display_data" + } + ], + "source": [ + "m = 30 # Number of images to include in dataset\n", + "dataset = np.zeros((m,50,50), dtype=np.uint8) # create numpy array for images and fill with zeros \n", + "D = 50*50 # length of raw feature vectors \n", + "\n", + "for i in range(1, m+1):\n", + "    # With convert('L') we convert the images to grayscale\n", + "    try:\n", + "        img = Image.open('/coursedata/fruits/%s.jpg'%(str(i))).convert('L') # Read in image from jpg file\n", + "    except FileNotFoundError:\n", + "        img = Image.open('../../data/fruits/%s.jpg'%(str(i))).convert('L') # Fallback path if you are doing the exercise locally\n", + "    dataset[i-1] = np.array(img, dtype=np.uint8) # Convert image to numpy array with grayscale values\n", 
+ " \n", + "# Store raw image data in matrix Z\n", + "Z = dataset.reshape(m,-1) # Flatten each 50 x 50 image into one row, so that Z has shape (m, 2500)\n", + "print(\"The shape of the datamatrix Z is\", Z.shape) \n", + "\n", + "# Display first three apple images (1.jpg, 2.jpg, 3.jpg) \n", + "# and first three banana images (16.jpg, 17.jpg, 18.jpg)\n", + "fig, ax = plt.subplots(3, 2, figsize=(10,10), gridspec_kw = {'wspace':0, 'hspace':0})\n", + "for i in range(3):\n", + "    for j in range(2):\n", + "        ax[i,j].imshow(dataset[i + (15*j)], cmap='gray')\n", + "        ax[i,j].axis('off')\n", + "plt.show()" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "deletable": false, + "editable": false, + "nbgrader": { + "cell_type": "markdown", + "checksum": "451b161856747ffddbfdcf8648b602b7", + "grade": false, + "grade_id": "cell-1d01aef88d590379", + "locked": true, + "schema_version": 3, + "solution": false, + "task": false + } + }, + "source": [ + "## Principal Component Analysis\n", + "<a id=\"Q1\"></a>\n", + "\n", + "### Basic Idea of Linear Dimensionality Reduction \n", + "\n", + "Suppose we have at our disposal a dataset \n", + "\n", + "\\begin{equation}\n", + " \\mathbf{Z} = \\begin{bmatrix}\n", + " z_1^{(1)} & z_2^{(1)} & \\ldots & z_D^{(1)} \\\\\n", + " z_1^{(2)} & z_2^{(2)} & \\ldots & z_D^{(2)} \\\\\n", + " \\vdots & \\vdots &\\ddots & \\vdots \\\\\n", + " z_1^{(m)} & z_2^{(m)} & \\ldots & z_D^{(m)}\n", + " \\end{bmatrix}\\in \\mathbb{R}^{m \\times D},\n", + "\\end{equation}\n", + "\n", + "where $m$ is the number of datapoints and $D$ the number of features. 
The goal of dimensionality reduction is to transform the dataset $\\mathbf{Z}$ into a lower dimensional dataset\n", + "\n", + "\\begin{equation}\n", + " \\mathbf{X} = \\begin{bmatrix}\n", + " x_1^{(1)} & x_2^{(1)} & \\ldots & x_n^{(1)} \\\\\n", + " x_1^{(2)} & x_2^{(2)} & \\ldots & x_n^{(2)} \\\\\n", + " \\vdots & \\vdots &\\ddots & \\vdots \\\\\n", + " x_1^{(m)} & x_2^{(m)} & \\ldots & x_n^{(m)}\n", + " \\end{bmatrix}\\in \\mathbb{R}^{m \\times n},\n", + "\\end{equation}\n", + "\n", + "where $n < D$. As was mentioned in the introduction, the low-dimensional representation\n", + "\n", + "- reduces the computational requirements of training an ML model on the dataset\n", + "- provides short but informative representations of the datapoints, which might improve the generalization capability of an ML model\n", + "- makes it easier to visualize high-dimensional data.\n", + "\n", + "In **linear dimensionality reduction** the datapoints $\\mathbf{z}^{(i)}, i=1,\\ldots,m$ are transformed to the corresponding low-dimensional representations $\\mathbf{x}^{(i)}$ with a linear transformation \n", + "\n", + "\\begin{equation}\n", + "\\begin{aligned}\n", + " &\\mathbf{x}^{(i)} =\\mathbf{W} \\mathbf{z}_c^{(i)}= \\mathbf{W} \\big( \\mathbf{z}^{(i)} - \\overline{\\mathbf{z}}\\big),& \\overline{\\mathbf{z}} = (1/m) \\sum_{i=1}^m \\mathbf{z}^{(i)}.\n", + "\\end{aligned}\n", + "\\end{equation}\n", + "\n", + "Here, the compression matrix $\\mathbf{W} \\in \\mathbb{R}^{n \\times D}$ maps the centered feature vectors $\\mathbf{z}_c^{(i)} \\in \\mathbb{R}^D$ to their low-dimensional representations $\\mathbf{x}^{(i)} \\in \\mathbb{R}^n$. Observe that the sample mean $\\overline{\\mathbf{z}}$ is subtracted from the datapoint $\\mathbf{z}^{(i)}$ before performing the linear transformation with $\\mathbf{W}$. 
This results in the transformed dataset $\\mathbf{X}$ being centered at the origin.\n", + "\n", + "\n", + "### Optimal dimensionality reduction\n", + "\n", + "The central objective in linear dimensionality reduction is to find the optimal compression matrix $\\mathbf{W}$ **that results in the least amount of information loss** when transforming $\\mathbf{Z}$ into its lower-dimensional representation $\\mathbf{X}$. Let us assume that the rows of $\\mathbf{W}$ are orthogonal unit vectors.\n", + "\n", + "A natural measure of the information loss incurred by the transformation is the **reconstruction error**\n", + "\n", + "\\begin{equation}\n", + "\\begin{aligned}\n", + " \\mathcal{E}(\\mathbf{W}) &= (1/m) \\sum_{i=1}^{m} \\| \\mathbf{z}_c^{(i)} - \\hat{\\mathbf{z}}_c^{(i)} \\|^{2} \\\\ &= (1/m) \\sum_{i=1}^{m} \\| \\mathbf{z}_c^{(i)} - \\mathbf{W}^T \\mathbf{x}^{(i)} \\|^{2} \\\\ &= (1/m) \\sum_{i=1}^{m} \\| \\mathbf{z}_c^{(i)} - \\mathbf{W}^T \\mathbf{W} \\mathbf{z}_c^{(i)} \\|^{2}.\n", + "\\end{aligned}\n", + "\\end{equation}\n", + "\n", + "Here, the centralized **reconstruction** $\\hat{\\mathbf{z}}_c^{(i)} \\in \\mathbb{R}^D$ of the datapoint $\\mathbf{z}_c^{(i)}$ is obtained by the formula\n", + "\n", + "\\begin{equation}\n", + "\\hat{\\mathbf{z}}_c^{(i)} = \\mathbf{W}^T \\mathbf{x}^{(i)} = \\mathbf{W}^T \\mathbf{W}\\mathbf{z}_c^{(i)}. \n", + "\\end{equation}\n", + "\n", + "By multiplying the transformed datapoint $\\mathbf{x}^{(i)}$ by $\\mathbf{W}^T$ from the left, the transformed datapoints are defined in terms of the original features of the data (i.e. in terms of the coordinates in $\\mathbb{R}^D$). This enables us to calculate the distance between the original and compressed datapoints in the original space $\\mathbb{R}^D$, and hence also the reconstruction error. 
Furthermore, it can be shown that the matrix $\\mathbf{W}^T\\mathbf{W}$ is an [**orthogonal projection**](https://en.wikipedia.org/wiki/Projection_(linear_algebra)) matrix that (orthogonally) projects the datapoints $\\mathbf{z}_c^{(i)}$ onto the subspace spanned by the rows of $\\mathbf{W}$.\n", + "\n", + "Note that the reconstruction error can be given equivalently in terms of the non-centralized datapoints $\\mathbf{z}^{(i)}$ and reconstructions $\\hat{\\mathbf{z}}^{(i)}$:\n", + "\n", + "\\begin{equation}\n", + "\\begin{aligned}\n", + " \\mathcal{E}(\\mathbf{W}) &= (1/m) \\sum_{i=1}^{m} \\| \\mathbf{z}_c^{(i)} - \\hat{\\mathbf{z}}_c^{(i)} \\|^{2} \\\\ &= (1/m) \\sum_{i=1}^{m} \\| \\mathbf{z}_c^{(i)} + \\overline{\\mathbf{z}} - \\overline{\\mathbf{z}} - \\hat{\\mathbf{z}}_c^{(i)} \\|^{2} \\\\ &= (1/m) \\sum_{i=1}^{m} \\| (\\mathbf{z}_c^{(i)} + \\overline{\\mathbf{z}}) - (\\hat{\\mathbf{z}}_c^{(i)} + \\overline{\\mathbf{z}}) \\|^{2} \\\\ & = (1/m) \\sum_{i=1}^{m} \\| \\mathbf{z}^{(i)} - \\hat{\\mathbf{z}}^{(i)} \\|^{2}.\n", + "\\end{aligned}\n", + "\\end{equation}\n", + "\n", + " More descriptively speaking, the reconstruction error gives the mean squared distance between the true datapoints $\\mathbf{z}^{(i)}$ and the reconstructed datapoints $\\hat{\\mathbf{z}}^{(i)} = \\mathbf{W}^T \\mathbf{x}^{(i)} + \\overline{\\mathbf{z}}$. 
This is a very intuitive quantification of information loss - the less information is lost in the transformation to the lower dimensional space, the closer the reconstruction $\\hat{\\mathbf{Z}}$ is to the true data $\\mathbf{Z}$.\n", + "\n", + "The fundamental result of **principal component analysis** (PCA) is that the reconstruction error is minimized when\n", + "\n", + "\\begin{equation}\n", + " \\mathbf{W} = \\mathbf{W}_{\\rm PCA} = \\big(\\mathbf{u}^{(1)}, \\mathbf{u}^{(2)}, \\ldots, \\mathbf{u}^{(n)} \\big)^T,\n", + "\\end{equation}\n", + "\n", + "where the rows $\\mathbf{u}^{(i)}$ are the first $n$ **principal components** of $\\mathbf{Z}$. The first principal component is chosen so that it corresponds to the direction in $\\mathbb{R}^D$ along which the data $\\mathbf{Z}$ exhibits the most variance. The subsequent components are chosen so that the $i$:th component corresponds to the direction that exhibits the most variance, **while being orthogonal** to the components $1 \\leq j < i$.\n", + "\n", + "When we use PCA to transform the datapoints in $\\mathbf{Z}$ to the lower dimensional space $\\mathbb{R}^n$ (where each datapoint has $n$ features) we say that we perform PCA with $n$ components. An important consequence of the definition of $\\mathbf{W}_{\\rm PCA}$ is that the order of the rows in $\\mathbf{W}_{\\rm PCA}$ is the same, regardless of the number of components used in the transformation. Given two PCA matrices $\\mathbf{W}_n$ and $\\mathbf{W}_k$ with $n$ and $k < n$ components respectively, the first $k$ components in $\\mathbf{W}_n$ will be identical to the ones in $\\mathbf{W}_k$.\n", + "\n", + "### Example: PCA in 2d with one component\n", + "\n", + "The image below visualizes principal component analysis in 2-d, and shows the datapoints along with the first - and in this case the only - principal component. 
When the dataset is transformed into a lower dimensional representation, the data points are transformed onto the subspace spanned by the principal component as shown by the red lines.\n", + "\n", + "<img src=\"../../../coursedata/R6_DimRed/pca.png\" alt=\"Drawing\" style=\"width: 600px;\"/>\n", + "\n", + "In this case, the transformed data is one-dimensional and each datapoint will only be characterized by its value along the axis spanned by the 1st principal component. Even though this is not shown in the image, it is easy to imagine the green line being the sole axis and the datapoints being located on that 1-d axis at the points indicated by the red lines.\n", + "\n", + "When the data is reconstructed the resulting data points $\\hat{\\mathbf{z}}^{(i)}$ will be located on the 1st principal component in 2-d space as shown in the image. As such, the red lines represent the difference between the original and reconstructed datapoints, and the reconstruction error corresponds to the mean of the squared lengths of these lines.\n", + "\n", + "Based on the image, it should be quite easy to believe that the principal component in fact minimizes the reconstruction error. If we would move or rotate the green component, the total squared distances would inevitably get larger. We can also verify that this component also corresponds to the direction of largest variance in the data. \n", + "\n", + "### Transforming entire datasets\n", + "\n", + "So far in this notebook, the transformation formulas have been of the form used to transform one datapoint at a time. In practice, we typically want to compress or reconstruct the entire dataset at once. 
The compression of the entire dataset $\\mathbf{Z}$ is done by the transform\n", + "\n", + "\\begin{equation}\n", + "\\mathbf{X} = \\mathbf{Z}_c\\mathbf{W}_{\\rm PCA}^T = \\big(\\mathbf{Z} - \\overline{\\mathbf{Z}}\\big) \\mathbf{W}_{\\rm PCA}^T,\n", + "\\end{equation}\n", + "\n", + "where $\\overline{\\mathbf{Z}}$ is a matrix with the sample mean vector in every row.\n", + "\n", + "In order to understand why this transform works, first recall that the individual datapoints in the matrices $\\mathbf{Z}$ and $\\mathbf{X}$ are stacked row-wise. By standard matrix multiplication we get that the $i$:th row in $\\mathbf{X}$ corresponding to the $i$:th compressed datapoint is given as\n", + "\n", + "\\begin{equation}\n", + " {\\mathbf{x}^{(i)}}^T = {\\mathbf{z}_c^{(i)}}^T \\mathbf{W}_{\\rm PCA}^T,\n", + "\\end{equation}\n", + "\n", + "which is clearly just the transposed equivalent of \n", + "\n", + "\\begin{equation}\n", + " {\\mathbf{x}^{(i)}} = \\mathbf{W}_{\\rm PCA} {\\mathbf{z}_c^{(i)}}.\n", + "\\end{equation}\n", + "\n", + "Hence we can use the matrix multiplication defined above to compress the dataset instead of compressing each individual datapoint separately. By similar arguments we find that the formula for the reconstructed data is\n", + "\n", + "\\begin{equation}\n", + " \\hat{\\mathbf{Z}} = \\mathbf{X}\\mathbf{W}_{\\rm PCA} + \\overline{\\mathbf{Z}}.\n", + "\\end{equation}\n", + "\n", + "If you are not completely convinced by this, it is a good exercise to verify that the rows of $\\hat{\\mathbf{Z}}$ contain the reconstructed datapoints!\n", + "\n", + "**General note:** In this section we have defined $\\mathbf{W}_{\\rm PCA}$ such that the principal components are in the rows of the matrix. In contrast, some sources define the matrix such that the components are in the columns. If you consult other resources and find seemingly different transformation formulas, be sure to check the definition of the matrix in order to avoid unnecessary confusion!" 
+ ] + }, + { + "cell_type": "markdown", + "metadata": { + "deletable": false, + "editable": false, + "nbgrader": { + "cell_type": "markdown", + "checksum": "13cd2cc75df4b9b71bc9a3722817191c", + "grade": false, + "grade_id": "cell-c034306c58b538dd", + "locked": true, + "schema_version": 3, + "solution": false, + "task": false + } + }, + "source": [ + "## PCA Off-The-Shelf in Python\n", + "\n", + "<img src=\"../../../coursedata/R6_DimRed/transforms.png\" alt=\"transforms\" />\n", + "\n", + "The Python library `scikit-learn` provides the class `PCA`. This class provides methods to compute the optimal compression matrix $\\mathbf{W}_{\\rm PCA}$ of the data $\\mathbf{Z}$, as well as methods to perform PCA transformations and reconstructions. The image above shows how the methods of the `PCA` object correspond to the transformations presented in this notebook. Below, we give a brief description of the most important methods in the class.\n", + "\n", + "- `PCA.fit(Z)` calculates $\\mathbf{W}_{\\rm PCA}$ and stores it in the attribute `PCA.components_`. Note that the data matrix `Z` should be uncentered, since the `PCA` class stores the sample mean of the data that is used when transforming and reconstructing data with the fitted `PCA` object.\n", + "\n", + "- `PCA.transform(Z)` performs PCA transformation on `Z` in order to get the lower dimensional representation `X` of the data. Observe that the function performs the centralization of `Z` \"under the hood\", and as such the resulting compression is of the form $\\mathbf{X} = \\mathbf{Z}_c\\mathbf{W}^T$.\n", + "\n", + "- `PCA.inverse_transform(X)` performs the inverse operation to `PCA.transform(Z)`, that is, it takes as input the PCA transformed data `X` and returns a reconstruction `Z_hat` with the same dimensionality as the original data `Z`. 
Mathematically, the reconstruction is of the form $\\hat{\\mathbf{Z}} = \\mathbf{XW} + \\overline{\\mathbf{Z}}$, where each row of $\\overline{\\mathbf{Z}}$ is the sample mean." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "deletable": false, + "editable": false, + "nbgrader": { + "cell_type": "markdown", + "checksum": "d02a37d58e56ef2d918849d62a9e0b09", + "grade": false, + "grade_id": "cell-0792063f0c3913c0", + "locked": true, + "schema_version": 3, + "solution": false + } + }, + "source": [ + "<a id='ImplementPCA'></a>\n", + "<div class=\" alert alert-warning\">\n", + " <b>Student Task.</b> Compute PCA. <br/>\n", + "\n", + "Apply PCA to the data matrix $\\mathbf{Z}$ obtained from the fruit images. In particular, compute the optimal compression matrix `W_pca` for the specified number $n$ of PCs (the value of n is set in the code snippet below). \n", + "The compression matrix should be stored in the numpy array `W_pca` of shape $(n,2500)$. Also, compute the corresponding reconstruction error\n", + "\\begin{equation}\n", + "(1/m) \\sum_{i=1}^{m} \\big\\| \\mathbf{z}^{(i)} - \\hat{\\mathbf{z}}^{(i)} \\big\\|^{2}_{2}= \n", + "(1/m) \\sum_{i=1}^{m} \\big\\| \\mathbf{z}_c^{(i)} - \\mathbf{W}_{\\rm PCA}^{T} \\mathbf{W}_{\\rm PCA} \\mathbf{z}_c^{(i)} \\big\\|^{2}_{2}, \\quad \\mathbf{z}_c^{(i)} = \\mathbf{z}^{(i)} - \\overline{\\mathbf{z}}. \n", + "\\end{equation} \n", + "\n", + "You should store the reconstruction error in the variable `err_pca`. \n", + " \n", + "Hints: \n", + "- Use the Python class `PCA` from the library `sklearn.decomposition` to compute the optimal compression matrix $\\mathbf{W}_{\\rm PCA}$ for the feature vectors (representing fruit images) in the rows of `Z`.\n", + "- Use the methods described above to calculate the reconstruction `Z_hat`. 
You can take advantage of the `pca.transform` and `pca.inverse_transform` methods.\n", + "- Finally, use the resulting `Z_hat` to calculate the reconstruction error according to the formula above. The easiest way to calculate the error might be to expand the formula for the reconstruction error above, and try to understand how this formula could be obtained using the elements of the matrix subtraction `Z - Z_hat`.\n", + "- documentation: https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html\n", + " \n", + "\n", + "</div>" + ] + }, + { + "cell_type": "code", + "execution_count": 3, + "metadata": { + "deletable": false, + "nbgrader": { + "cell_type": "code", + "checksum": "526b8cb3427f3d4ae6859951dcd740e4", + "grade": false, + "grade_id": "cell-3e96f85639fb5b97", + "locked": false, + "schema_version": 3, + "solution": true + } + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Shape of Z: (30, 2500)\n" + ] + }, + { + "ename": "NameError", + "evalue": "name 'W_pca' is not defined", + "output_type": "error", + "traceback": [ + "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m", + "\u001b[0;31mNameError\u001b[0m Traceback (most recent call last)", + "\u001b[0;32m<ipython-input-3-b109370def79>\u001b[0m in \u001b[0;36m<module>\u001b[0;34m\u001b[0m\n\u001b[1;32m 8\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 9\u001b[0m \u001b[0mprint\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34mf'Shape of Z: {Z.shape}'\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m---> 10\u001b[0;31m \u001b[0mprint\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34mf'Shape of compression matrix W: {W_pca.shape}'\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 11\u001b[0m \u001b[0mprint\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34mf'PCA error: 
{err_pca}'\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", + "\u001b[0;31mNameError\u001b[0m: name 'W_pca' is not defined" + ] + } + ], + "source": [ + "from sklearn.decomposition import PCA\n", + "\n", + "n = 10 # Define the number of principal components\n", + "\n", + "### STUDENT TASK ###\n", + "# YOUR CODE HERE\n", + "\n", + "\n", + "print(f'Shape of Z: {Z.shape}')\n", + "print(f'Shape of compression matrix W: {W_pca.shape}')\n", + "print(f'PCA error: {err_pca}')" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "deletable": false, + "editable": false, + "nbgrader": { + "cell_type": "code", + "checksum": "873c7fe22fa7b9716fe8fbf60d152011", + "grade": true, + "grade_id": "cell-fe48a7f5675575ee", + "locked": true, + "points": 3, + "schema_version": 3, + "solution": false + } + }, + "outputs": [], + "source": [ + "# Perform sanity checks on the results\n", + "assert W_pca.shape == (n, Z.shape[1]), \"Output matrix (W_pca) dimensions are wrong.\"\n", + "assert err_pca <= 1e6, \"The reconstruction error is too high.\"\n", + "\n", + "print('Sanity checks passed!')\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "deletable": false, + "editable": false, + "nbgrader": { + "cell_type": "markdown", + "checksum": "1f850119c48796138b2bd39fa05f02c8", + "grade": false, + "grade_id": "cell-2164880970b6db31", + "locked": true, + "schema_version": 3, + "solution": false, + "task": false + } + }, + "source": [ + "## Interpretation of the compressed data and compression matrix\n", + "\n", + "While we know that the features of the original dataset $\\mathbf{Z}$ represent the intensities at each pixel, we have not yet considered how the compression matrix and compressed data can be interpreted. 
As was shown in the example earlier in the notebook, **each principal component $\\mathbf{u}^{(i)}$ stored in the rows of $\\mathbf{W}_{\\rm PCA}$ corresponds to a direction in the space that is spanned by the features of $\\mathbf{Z}$**. With the image data used in this notebook, each of these directions corresponds to a vector of pixel values and can be represented as an image.\n", + "\n", + "By multiplying the centralized feature vector $\\mathbf{z}_c$ with the compression matrix $\\mathbf{W}_{\\rm PCA}$, we get the compressed vector $\\mathbf{x}$ in which each element $x_i$ represents the position of the centralized datapoint $\\mathbf{z}_c$ along the $i$:th principal component. In essence, **the values of the elements $x_i$ quantify the amount of the principal component $\\mathbf{u}^{(i)}$ present in the datapoint $\\mathbf{z}_c$**. With this in mind, it is very intuitive that the centralized reconstruction $\\hat{\\mathbf{z}}_c$ is given by the linear combination \n", + "\n", + "\\begin{equation}\n", + "\\hat{\\mathbf{z}}_c = \\mathbf{W}_{\\rm PCA}^T \\mathbf{x} = \\left[ \\mathbf{u}^{(1)}, \\mathbf{u}^{(2)}, \\ldots, \\mathbf{u}^{(n)}\\right] \\mathbf{x} = \\sum_{i=1}^n \\mathbf{u}^{(i)}x_i\n", + "\\end{equation}\n", + "\n", + "of the principal components $\\mathbf{u}^{(i)}$ in the rows of $\\mathbf{W}_{\\rm PCA}$ (and the columns of the transpose). In the case of the image data, this means that each image can be given as a linear combination of images corresponding to the principal components.\n", + "\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "deletable": false, + "editable": false, + "nbgrader": { + "cell_type": "markdown", + "checksum": "092f4f2bec42d5436ffe746217a4fa77", + "grade": false, + "grade_id": "cell-4b57a2247992e427", + "locked": true, + "schema_version": 3, + "solution": false + } + }, + "source": [ + "<a id='reconstructionerror'></a>\n", + "<div class=\" alert alert-info\">\n", + " <b>Demo.</b> Principal Directions. 
<br/> \n", + " \n", + "Keeping the above in mind, it is instructive to examine the principal components $\\mathbf{u}^{(1)},\\mathbf{u}^{(2)},\\ldots,\\mathbf{u}^{(n)}$ in the rows of the optimal compression matrix $\\mathbf{W}_{\\rm PCA}$. Since each successive principal component explains a decreasing amount of variance in the original data, it is especially interesting to examine the first principal components as they explain the largest amount of variance.\n", + "\n", + "The code snippet below plots the first five principal directions.\n", + "</div>" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "deletable": false, + "editable": false, + "nbgrader": { + "cell_type": "code", + "checksum": "0aa04edc265f5b867b43b53aea71a75d", + "grade": false, + "grade_id": "cell-090d1f05afe9bc56", + "locked": true, + "schema_version": 3, + "solution": false + } + }, + "outputs": [], + "source": [ + "def plot_princ_comp(W_pca):\n", + "    fig, ax = plt.subplots(1, 5, figsize=(15,15))\n", + "    # Select the PCs we are plotting\n", + "    # You can change these to see what other components look like\n", + "    plot_pd = [0,1,2,3,4]\n", + "\n", + "    for i in range(len(plot_pd)):\n", + "        ax[i].imshow(np.reshape(W_pca[plot_pd[i]], (50,50)), cmap='gray')\n", + "        ax[i].set_title(\"Principal Direction %d\"%(plot_pd[i] + 1))\n", + "        ax[i].set_axis_off() # Remove x- and y-axis from each image\n", + "    plt.show()\n", + "\n", + "plot_princ_comp(W_pca)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "deletable": false, + "editable": false, + "nbgrader": { + "cell_type": "markdown", + "checksum": "76397d3e4930e0fc089b4f21bdc75bb8", + "grade": false, + "grade_id": "cell-95adb56538f4c2c8", + "locked": true, + "schema_version": 3, + "solution": false, + "task": false + } + }, + "source": [ + "Now we can try to interpret the meaning of the principal directions from the images above. 
For example, principal direction 1 seems to correspond to the level of some kind of \"appleness\" of the image, whereas the third direction seems to capture some kind of \"banananess\". This kind of interpretation can help us gain insight into a dataset by finding meaning in the directions of the largest variance. " + ] + }, + { + "cell_type": "markdown", + "metadata": { + "deletable": false, + "editable": false, + "nbgrader": { + "cell_type": "markdown", + "checksum": "d4d0deaf46a35a7b98b9c7e04c7f0a6f", + "grade": false, + "grade_id": "cell-e84eac83a215b2b4", + "locked": true, + "schema_version": 3, + "solution": false + } + }, + "source": [ + "## Trading Compression Against Information Loss\n", + "<a id=\"Q2\"></a>\n", + "\n", + "Next, we study the effect of using different numbers $n$ of PCs for the new feature vector $\\mathbf{x}$. In particular, we will examine whether there is a trade-off between the amount of compression (larger for smaller $n$) and the amount of information loss, which is quantified by the reconstruction error." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "deletable": false, + "editable": false, + "nbgrader": { + "cell_type": "markdown", + "checksum": "76257dda8d7cbf77c6b2c1dad8b262e5", + "grade": false, + "grade_id": "cell-1a63c986f3d58d4c", + "locked": true, + "schema_version": 3, + "solution": false + } + }, + "source": [ + "<a id='reconstructionerror'></a>\n", + "<div class=\" alert alert-warning\">\n", + " <p><b>Student Task.</b> Reconstruction Error vs. Number of PCs. <br/> </p>\n", + " \n", + "Use the fruit image data `Z` and analyze how the reconstruction error of PCA changes with the number $n$ of PCs used. In particular, apply PCA to the dataset `Z` for a varying number of PCs $n=1,\\ldots,m$. 
For each choice for $n$, compute the corresponding reconstruction error and store it in the numpy array `err_pca` of shape (m, ).\n", + "\n", + " \n", + " \n", + "Hints:\n", + " \n", + "For each number $n$ of PCs:\n", + " \n", + "- Compute the compressed dataset $\\mathbf{X}$.\n", + " \n", + "- Use $\\mathbf{X}$ to compute the optimal reconstruction $\\widehat{\\mathbf{Z}}$.\n", + " \n", + "- Use $\\widehat{\\mathbf{Z}}$ to compute the reconstruction error $(1/m) \\sum_{i=1}^{m} \\big\\| \\mathbf{z}^{(i)} - \\widehat{\\mathbf{z}}^{(i)}\\big\\|^{2}$.\n", + " \n", + "- Store the reconstruction error in the numpy array `err_pca`. The first entry should be the reconstruction error obtained for $n=1$.\n", + "\n", + "</div>" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "deletable": false, + "nbgrader": { + "cell_type": "code", + "checksum": "c6584465080a5dd9f1a23c5902ef0c6c", + "grade": false, + "grade_id": "cell-65dd9f2f576d010d", + "locked": false, + "schema_version": 3, + "solution": true + } + }, + "outputs": [], + "source": [ + "err_pca = np.zeros(m) # Array for storing the PCA errors\n", + "\n", + "for n_minus_1 in range(m):\n", + " ### STUDENT TASK ### \n", + " # Compute the reconstruction error for PCA using n PCs and store it in err_pca[n-1]\n", + " # YOUR CODE HERE\n", + " raise NotImplementedError()\n", + " \n", + "# Plot the number of PCs vs the reconstruction error\n", + "plt.figure(figsize=(8,5))\n", + "plt.plot([i + 1 for i in range(m)], err_pca)\n", + "plt.xlabel('Number of PCs ($n$)')\n", + "plt.ylabel(r'$\\mathcal{E}(\\mathbf{W}_{\\rm PCA})$')\n", + "plt.title('Number of PCs vs the reconstruction error')\n", + "plt.show() " + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "deletable": false, + "editable": false, + "nbgrader": { + "cell_type": "code", + "checksum": "8e5aa6af7c4d9f826a22d0c2d4811a4d", + "grade": true, + "grade_id": "cell-4c0ec1f5bc750139", + "locked": true, + "points": 3, 
+ "schema_version": 3, + "solution": false + } + }, + "outputs": [], + "source": [ + "# Perform some sanity checks on the outputs\n", + "assert err_pca.shape == (m, ), \"shape of err_pca is wrong.\"\n", + "assert err_pca[0] > err_pca[m-1], \"values of err_pca are incorrect\"\n", + "assert err_pca[0] > err_pca[1], \"values of err_pca are incorrect\"\n", + "\n", + "print('Sanity checks passed!')\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "deletable": false, + "editable": false, + "nbgrader": { + "cell_type": "markdown", + "checksum": "e536cb43e2b84036f34476c139c7221b", + "grade": false, + "grade_id": "cell-dd63af0edecdc31f", + "locked": true, + "schema_version": 3, + "solution": false, + "task": false + } + }, + "source": [ + "If the task is solved correctly, we can see that the reconstruction error decreases quite smoothly with an increasing number of components. \n", + "\n", + "## Choosing the number of components in PCA\n", + "\n", + "An important decision that has to be made when applying PCA is the choice of the number of components $n$. The correct choice of $n$ is highly specific to the application, and there is no hard rule based on which to choose an objectively correct value. \n", + "\n", + "A simple but common approach for choosing $n$ is to choose a threshold proportion of total variance explained (e.g. 0.8 or 0.9), and select the smallest number of components such that their cumulative explained variance exceeds the threshold. By using this approach, we are effectively choosing an upper bound for the tolerated information loss." 
+ ] + }, + { + "cell_type": "markdown", + "metadata": { + "deletable": false, + "editable": false, + "nbgrader": { + "cell_type": "markdown", + "checksum": "65da255494c0ba9470e04ca44e03c99f", + "grade": false, + "grade_id": "cell-9d7f977c8fa11c3f", + "locked": true, + "schema_version": 3, + "solution": false, + "task": false + } + }, + "source": [ + "<a id='explainedvariance'></a>\n", + "<div class=\" alert alert-warning\">\n", + " <p><b>Student Task.</b> Proportion of variance explained. <br/> </p>\n", + " \n", + "Use the fruit image data `Z` and `PCA` to find the minimum number of components $n$ such that the cumulative proportion of explained variance of the $n$ first components exceeds $0.9$. \n", + " \n", + "You are required to store \n", + " \n", + "- an $(n_{\\rm max}, )$ array containing the **proportions of total variance explained** by each of the individual components in the variable `var_ratio`, \n", + "- an $(n_{\\rm max}, )$ array containing the **cumulative proportion of total variance explained** in the variable `cum_var_ratio`, \n", + "- the minimum number of components whose cumulative proportion of explained variance exceeds `threshold` in the variable `n_threshold`\n", + "\n", + " \n", + "**Hints**:\n", + " \n", + "- Fit a PCA model with `n_max` components.\n", + " \n", + "- The proportions of total variance explained by each component can be found in the attribute `PCA.explained_variance_ratio_`.\n", + " \n", + "- `np.cumsum()` is convenient for calculating the cumulative proportions of total variance explained.\n", + " \n", + "- `np.where(condition)` can be a useful tool for finding `n_threshold`.\n", + "</div>" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "deletable": false, + "nbgrader": { + "cell_type": "code", + "checksum": "cc92265c0422b1746e8367b74eefa342", + "grade": false, + "grade_id": "cell-84120c5379d42699", + "locked": false, + "schema_version": 3, + "solution": true, + "task": false + } + }, + 
"outputs": [], + "source": [ + "n_max = m # Maximum amount of components\n", + "threshold = 0.9 # Threshold for selecting the number of components\n", + "### STUDENT TASK ###\n", + "# ...\n", + "# var_ratio = ...\n", + "# cum_var_ratio = ...\n", + "# n_threshold = ...\n", + "# YOUR CODE HERE\n", + "raise NotImplementedError()\n", + " \n", + "print(f\"Number of components selected: {n_threshold}\")\n", + "print(f\"Proportion of variance explained: {cum_var_ratio[n_threshold-1]}\")\n", + "\n", + "# Plot the number of PCs vs the reconstruction error\n", + "fig, ax = plt.subplots(2, 1, figsize=(8,12))\n", + "x_bar = range(1, n_max+1)\n", + "ax[0].bar(x_bar, var_ratio)\n", + "ax[0].set_title(\"Proportion of explained variance\")\n", + "ax[0].set_xlabel(\"Component\")\n", + "ax[0].set_ylabel(\"Proportion\")\n", + "barlist = ax[1].bar(x_bar, cum_var_ratio)\n", + "barlist[n_threshold-1].set_color('b')\n", + "ax[1].plot([0,31], [0.9, 0.9], '--', color='black')\n", + "ax[1].set_title(\"Cumulative proportion of explained variance\")\n", + "ax[1].set_xlabel(\"Component\")\n", + "ax[1].set_ylabel(\"Proportion\")\n", + "plt.show() " + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "deletable": false, + "editable": false, + "nbgrader": { + "cell_type": "code", + "checksum": "b3dc894310e7ddc50f9bb8b6b1a6a254", + "grade": true, + "grade_id": "cell-e96a6ae4c245d3d7", + "locked": true, + "points": 2, + "schema_version": 3, + "solution": false, + "task": false + } + }, + "outputs": [], + "source": [ + "# Perform some sanity checks on the outputs\n", + "assert var_ratio.shape == (30,), \"var_ratio is of the wrong shape!\"\n", + "assert cum_var_ratio.shape == (30,), \"cum_var_ratio is of the wrong shape!\"\n", + "assert n_threshold in range(10, 20), \"n_threshold is too low or too high!\"\n", + "\n", + "print(\"Sanity checks passed!\")\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Next, we will examine how increasing the 
number of components in the PCA transformation affects the quality of the reconstructions in practice. Will we be able to see the decrease in reconstruction error with our own eyes in the case of the fruit images?\n", + "\n", + "In the demo below, we will plot reconstructed images for PCA compressions with a different number of components. By relying on intuition and the results above, we should expect that a larger number of components results in a reconstruction that is closer to the original image." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "deletable": false, + "editable": false, + "nbgrader": { + "cell_type": "markdown", + "checksum": "c66a0c741f432deb03bc7d0038231566", + "grade": false, + "grade_id": "cell-45b5e1214d2b641c", + "locked": true, + "schema_version": 3, + "solution": false + } + }, + "source": [ + "<a id='reconstructionerror'></a>\n", + "<div class=\" alert alert-info\">\n", + " <b>Demo.</b> Reconstructions. <br/> \n", + " \n", + "The code snippet below shows the reconstructions $\\widehat{\\mathbf{z}}^{(i)} = \\mathbf{W}_{\\rm PCA}^{T} \\mathbf{x}^{(i)} + \\overline{\\mathbf{z}}$ for the number of PCs $n=1,5,10,20,30$. \n", + "\n", + "In the code, the `PCA` class is only used to obtain the compression matrix $\\mathbf{W}_{\\rm PCA}$. 
The PCA transformation and reconstruction are then done by matrix operations.\n", + "</div>" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "deletable": false, + "editable": false, + "nbgrader": { + "cell_type": "code", + "checksum": "b6bc8f6dc281670f4fd4862bdc0fc24d", + "grade": false, + "grade_id": "cell-925684dea4eb902f", + "locked": true, + "schema_version": 3, + "solution": false + } + }, + "outputs": [], + "source": [ + "from matplotlib.lines import Line2D\n", + "import matplotlib.gridspec as gridspec\n", + "import warnings\n", + "warnings.filterwarnings(\"ignore\")\n", + "\n", + "## Input:\n", + "# Z: Dataset\n", + "# n: number of dimensions\n", + "# m_pics: number of pics per class (Apple, Banana). Min 1, Max 15\n", + "def plot_reconstruct(Z, W_pca, n, m_pics=3):\n", + "\n", + " # Center Z\n", + " Z_centered = Z - np.mean(Z, axis=0)\n", + " # x=w*z\n", + " X_pca = np.matmul(W_pca[:n,:], Z_centered[:,:,None])\n", + " # x_reversed=r*x + mean(z)\n", + " Z_hat = np.matmul(W_pca[:n,:].T, X_pca)[:,:,0] + np.mean(Z, axis=0)\n", + " \n", + " # Setup figure size that scales with number of images\n", + " fig = plt.figure(figsize = (10,10))\n", + " \n", + " # Setup a (n_pics,2) grid of plots (images)\n", + " gs = gridspec.GridSpec(1, 2*m_pics)\n", + " gs.update(wspace=0.0, hspace=0.0)\n", + " for i in range(m_pics):\n", + " for j in range(0,2):\n", + " # Add a new subplot\n", + " ax = plt.subplot(gs[0, i+j*m_pics])\n", + " # Insert image data into the subplot\n", + " ax.imshow(np.reshape(Z_hat[i+(15*j)], (50,50)), cmap='gray', interpolation='nearest')\n", + " # Remove x- and y-axis from each plot\n", + " ax.set_axis_off()\n", + " \n", + " plt.subplot(gs[0,0]).set_title(\"Reconstructed images using %d PCs:\"%(n), size='large', y=1.08)\n", + " plt.show()\n", + " \n", + "pca = PCA(n_components=m) # create the object\n", + "pca.fit(Z) # compute optimal transform W_PCA\n", + "W_pca = pca.components_\n", + " \n", + "# The values of PCS n to 
plot for. You can change these to experiment\n", + "num_com = [1, 5, 10, 20, 30]\n", + "for n in num_com:\n", + " # If you want to print different amount of pictures, you can change the value of m_pics. (1-15)\n", + " plot_reconstruct(Z, W_pca, n, m_pics=3)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "deletable": false, + "editable": false, + "nbgrader": { + "cell_type": "markdown", + "checksum": "fa82b659a4eb7c36c3347ef46745b717", + "grade": false, + "grade_id": "cell-6a62d117f2e7f864", + "locked": true, + "schema_version": 3, + "solution": false, + "task": false + } + }, + "source": [ + "Observing the reconstructed images, one can clearly see that the reconstructions improve with an increasing number of principal components. This is in line with the decreasing reconstruction error seen in the student task \"Reconstruction Error vs. Number of PC\"." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "deletable": false, + "editable": false, + "nbgrader": { + "cell_type": "markdown", + "checksum": "d9cb396b60b29eefec761bda97c95c8f", + "grade": false, + "grade_id": "cell-44992b7e2e2296cb", + "locked": true, + "schema_version": 3, + "solution": false + } + }, + "source": [ + "## PCA for Data Visualization\n", + "<a id=\"Q3\"></a>\n", + "\n", + "An important application of PCA is data visualization. By using PCA to reduce the dimensionality of higher dimensional data to two or three dimensions, it is possible to visualize high dimensional data along those directions in which the data exhibits the largest variance. \n", + "\n", + "For example, when using PCA with $n=2$ PCs, the resulting feature vectors $\\mathbf{x}^{(1)} = \\mathbf{W}_{\\rm PCA} \\mathbf{z}_c^{(1)}, \\ldots, \\mathbf{x}^{(m)} = \\mathbf{W}_{\\rm PCA} \\mathbf{z}_c^{(m)} \\in \\mathbb{R}^{2}$ can be visualized in a two-dimensional scatterplot whose x-axis represents the first PC $x_{1}^{(i)}$ and y-axis the second PC $x_{2}^{(i)}$." 
+ ] + }, + { + "cell_type": "markdown", + "metadata": { + "deletable": false, + "editable": false, + "nbgrader": { + "cell_type": "markdown", + "checksum": "6f881e659f25b3864276af435ce4bdd5", + "grade": false, + "grade_id": "cell-6fa5af31e9012065", + "locked": true, + "schema_version": 3, + "solution": false + } + }, + "source": [ + "<a id='dataprojection'></a>\n", + "<div class=\" alert alert-warning\">\n", + " <b>Student task.</b> Scatter Plots using Two PCs. <br/> \n", + "\n", + "The following code visualizes the data as a scatter plot using either the first two PCs $x_{1}$, $x_{2}$ or using the 8th and 9th PCs $x_{8}$ and $x_{9}$. Your task is to fit the PCA model, transform the data `Z` to the lower dimensional transformation `X`, and store the 1st and 2nd, and the 8th and 9th PCs in the variables `X_PC12` and `X_PC89` respectively. Once again, remember that the indexing starts from 0 in Python!\n", + " \n", + "The rest of the code plots two scatterplots using the selected pairs of features.\n", + "\n", + "</div>" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "deletable": false, + "nbgrader": { + "cell_type": "code", + "checksum": "ddc91348f72fe1ebdfefacce8c2f529e", + "grade": false, + "grade_id": "cell-0854759cfbcb921e", + "locked": false, + "schema_version": 3, + "solution": true, + "task": false + } + }, + "outputs": [], + "source": [ + "### STUDENT TASK ###\n", + "# Fit PCA with m components\n", + "# .\n", + "# .\n", + "# .\n", + "# X_PC12 = X[:,...]\n", + "# X_PC89 = X[:,...]\n", + "# YOUR CODE HERE\n", + "raise NotImplementedError()\n", + "\n", + "plt.rc('axes', labelsize=14) # Set fontsize of the x and y labels\n", + " \n", + "fig, ax = plt.subplots(2, 1, figsize=(8,12)) # Create two subplots\n", + "\n", + "# Scatterplot of the compressed data w.r.t PCs 1 and 2\n", + "ax[0].scatter(X_PC12[:15,0], X_PC12[:15,1], c='r',marker='o', label='Apple')\n", + "ax[0].scatter(X_PC12[15:,0], X_PC12[15:,1], c='y', marker='^', 
label='Banana')\n", + "ax[0].set_title('using first two PCs $x_{1}$ and $x_{2}$ as features')\n", + "ax[0].legend()\n", + "ax[0].set_xlabel('$x_{1}$')\n", + "ax[0].set_ylabel('$x_{2}$')\n", + " \n", + "# Scatterplot of the compressed data w.r.t PCs 8 and 9\n", + "ax[1].scatter(X_PC89[:15,0], X_PC89[:15,1], c='r', marker='o', label='Apple')\n", + "ax[1].set_title('using 8th and 9th PC as features')\n", + "ax[1].scatter(X_PC89[15:,0], X_PC89[15:,1], c='y', marker='^', label='Banana')\n", + "ax[1].legend()\n", + "ax[1].set_xlabel('$x_{8}$')\n", + "ax[1].set_ylabel('$x_{9}$')\n", + "\n", + "plt.show()" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "deletable": false, + "editable": false, + "nbgrader": { + "cell_type": "code", + "checksum": "670042342d204d202cf699b282e42bc4", + "grade": true, + "grade_id": "cell-a2687bd41e35d92d", + "locked": true, + "points": 1, + "schema_version": 3, + "solution": false, + "task": false + } + }, + "outputs": [], + "source": [ + "# Perform some sanity checks on the results\n", + "assert X_PC12.shape == (30,2), \"X_PC12 is of the wrong shape!\"\n", + "assert X_PC89.shape == (30,2), \"X_PC89 is of the wrong shape!\"\n", + "\n", + "print(\"Sanity checks passed!\")\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "From the plots above, it is easy to confirm that the data exhibits the largest variance along the first principal component, with successive components explaining a decreasing amount of variance.\n", + "\n", + "By visualizing the data with respect to the first principal components, we can obtain insights into the data. For example, the first scatterplot seems to indicate that the images containing apples and bananas can be separated quite well using the first principal component. 
This coincides with what we inferred earlier in the notebook - that the first principal component corresponds to some kind of \"appleness\" of the image.\n", + "\n", + "It is hardly surprising that we could find this kind of separation in the case of the fruit image dataset, so it can seem that the scatterplot is not tremendously useful. However, on other datasets one might find less obvious relationships with respect to some principal components that might provide insights that are not easily obtainable otherwise!" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "deletable": false, + "editable": false, + "nbgrader": { + "cell_type": "markdown", + "checksum": "f744ecfbb7e7da5f8816d496304e7390", + "grade": false, + "grade_id": "cell-b70e976f9acfca7f", + "locked": true, + "schema_version": 3, + "solution": false + } + }, + "source": [ + "## Using Linear Regression with PCA for House Price Prediction \n", + "\n", + "<a id=\"Q2\"></a>\n", + "\n", + "Recall from round 4 that a high dimensional predictor (e.g. linear regression with many features) is prone to overfitting the training set, which results in a poor predictive capability on new data. In round 4, we examined two methods of mitigating overfitting - choosing a model with only some features, and regularization. PCA compression provides a possible alternative to the former. Instead of selecting $n$ of the original features to be used in our model, we can use the first $n$ principal components as the features on which we train our linear model.\n", + "\n", + "We now show how PCA can be used as a pre-processing step for another ML method, such as linear regression. To this end, we consider the task of predicting the price $y$ of a house based on several features $x_1,\\ldots, x_n$ of this house. In \"ML language\", the goal is to learn a good predictor $h(\\mathbf{x})$ for the price $y$ of a house. 
The prediction $h(\\mathbf{x})$ is based on the features $\\mathbf{x} = \\big(x_{1},\\ldots,x_{n}\\big)^{T}$ such as the average number of rooms per dwelling $x_{1}$ or the nitric oxides concentration $x_{2}$ near the house. \n", + "\n", + "### The Data\n", + "\n", + "To evaluate the quality of a predictor $h(\\mathbf{x})$, we evaluate its error (loss) on historic recordings of house sales (for which we know the price in hindsight). These recordings consist of $m$ data points. Each data point is characterized by the house features $\\mathbf{x}^{(i)} \\in \\mathbb{R}^{n}$ and the selling price $y^{(i)}$." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "deletable": false, + "editable": false, + "nbgrader": { + "cell_type": "markdown", + "checksum": "08dccfd92012cc438183cb35fba1e002", + "grade": false, + "grade_id": "cell-20a3b32c473f955e", + "locked": true, + "schema_version": 3, + "solution": false + } + }, + "source": [ + "<a id='handsondata'></a>\n", + "<div class=\" alert alert-info\">\n", + "<p><b>Demo.</b> Loading the Data.</p>\n", + " \n", + "The following code snippet defines a function `Z,y= GetFeaturesLabels(m,D)` which reads in data of previous house sales. The input parameters are the number `m` of data points and the number `D` of features to be used for each data point. The function returns a matrix $\\mathbf{Z}$ and vector $\\mathbf{y}$. \n", + "\n", + "The features $\\mathbf{z}^{(i)}$ of the sold houses are stored in the rows of the numpy array `Z` (of shape (m,D)) and the corresponding selling prices $y^{(i)}$ in the numpy array `y` (shape (m,1)). The two arrays represent the feature matrix $\\mathbf{Z} = \\begin{pmatrix} \\mathbf{z}^{(1)} & \\ldots & \\mathbf{z}^{(m)} \\end{pmatrix}^{T}$ and the label vector $\\mathbf{y} = \\big( y^{(1)}, \\ldots, y^{(m)} \\big)^{T}$. 
\n", + "\n", + "</div>" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "deletable": false, + "editable": false, + "nbgrader": { + "cell_type": "code", + "checksum": "fba769478734be45dccfd6741d5d199b", + "grade": false, + "grade_id": "cell-ac3ef96fb0f17261", + "locked": true, + "schema_version": 3, + "solution": false + } + }, + "outputs": [], + "source": [ + "# import \"Pandas\" library/package (and use shorthand \"pd\" for the package) \n", + "# Pandas provides functions for loading (storing) data from (to) files\n", + "import pandas as pd \n", + "from matplotlib import pyplot as plt \n", + "from IPython.display import display, HTML\n", + "import numpy as np \n", + "import random\n", + "\n", + "def GetFeaturesLabels(m=10, D=10):\n", + " house_dataset = load_boston()\n", + " house = pd.DataFrame(house_dataset.data, columns=house_dataset.feature_names) \n", + " x1 = house['TAX'].values.reshape(-1,1) # vector whose entries are the tax rates of the sold houses\n", + " x2 = house['AGE'].values.reshape(-1,1) # vector whose entries are the ages of the sold houses\n", + " x1 = x1[0:m]\n", + " x2 = x2[0:m]\n", + " np.random.seed(43)\n", + " Z = np.hstack((x1,x2,np.random.randn(m,D))) \n", + " \n", + " Z = Z[:,0:D] \n", + "\n", + " y = house_dataset.target.reshape(-1,1) # create a vector whose entries are the labels for each sold house\n", + " y = y[0:m]\n", + " \n", + " return Z, y" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "deletable": false, + "editable": false, + "nbgrader": { + "cell_type": "markdown", + "checksum": "0512c8d4399d61c38a135509d52e5bf4", + "grade": false, + "grade_id": "cell-996a5a4dccea7328", + "locked": true, + "schema_version": 3, + "solution": false + } + }, + "source": [ + "<a id='pcaandlinreg'></a>\n", + "<div class=\" alert alert-warning\">\n", + " <p><b>Student Task.</b> PCA with Linear Regression. 
<br/> </p>\n", + "\n", + "In the code snippet below we combine PCA with linear regression, and calculate the training and validation errors for the linear model trained on PCA transformed data for different numbers of PCs.\n", + " \n", + "First, we read in $m=500$ data points representing house sales. Each house sale is represented by a long feature vector $\\mathbf{z}^{(i)}\\in \\mathbb{R}^{D}$ of length $D=10$. \n", + " \n", + "Next, we define the feature matrix `Z_pca` that contains the features of the first 480 data points and which will be used to fit the PCA model. We also define the feature matrix `Z_reg` and label vector `y_reg` containing the features and labels of the rest of the data points. These will be used for training and validating a linear regression model.\n", + "\n", + "Furthermore, in order to estimate the generalization capability of the linear models trained on PCA transformed data with a different number of PCs, we will split the regression dataset `Z_reg`, `y_reg` into a training set `Z_train`, `y_train` and a validation set `Z_val`, `y_val`.\n", + " \n", + "**Your task** is to implement the contents of the loop that calculates the training and validation errors for the linear models trained on PCA transformed data with a different number $n=1,\\ldots,D$ of components. For each $n$ you should:\n", + " \n", + "- Fit a PCA model with $n$ components on `Z_pca`\n", + " \n", + "- Transform the feature matrices `Z_train` and `Z_val` to lower dimensional versions using PCA\n", + " \n", + "- Use the PCA transformed training data to fit a linear regression model. When initializing `LinearRegression` please use `LinearRegression(fit_intercept=True)`\n", + " \n", + "- Calculate the training and validation errors (MSE) of the linear regression and store these at index `n-1` in the arrays `err_train` and `err_val` respectively. 
Remember to use the PCA transformed validation features to predict the labels of the validation set!\n", + "\n", + "</div>" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "deletable": false, + "nbgrader": { + "cell_type": "code", + "checksum": "171e71637253700e008bf80c6cec8fb4", + "grade": false, + "grade_id": "cell-dd44ad8b1d0fc8db", + "locked": false, + "schema_version": 3, + "solution": true + } + }, + "outputs": [], + "source": [ + "from sklearn.linear_model import LinearRegression\n", + "from sklearn.model_selection import train_test_split\n", + "from sklearn.datasets import load_boston\n", + "from sklearn.metrics import mean_squared_error\n", + "\n", + "m = 500 # total number of data points \n", + "D = 10 # length of raw feature vector \n", + "\n", + "Z, y = GetFeaturesLabels(m, D) # read in m data points from the house sales database \n", + " \n", + "# Use these features for PCA \n", + "Z_pca = Z[:480,:] # read out feature vectors of first 480 data points \n", + "\n", + "# Use these features and labels for linear regression (with transformed features)\n", + "Z_reg = Z[480:,:] # Read out feature vectors of last 20 data points \n", + "y_reg = y[480:,:] # Read out labels of last 20 data points \n", + "\n", + "# Datasets which will be preprocessed with PCA and used with linear regression\n", + "Z_train, Z_val, y_train, y_val = train_test_split(Z_reg, y_reg, test_size=0.2, random_state=42)\n", + "\n", + "err_train = np.zeros(D) # Array for storing training errors of the linear regression model\n", + "err_val = np.zeros(D) # Array for storing validation errors of the linear regression model\n", + "\n", + "for n in range(1, D+1):\n", + " # Create the PCA object and fit\n", + " # transform long feature vectors (length D) to short ones (length n)\n", + " ### STUDENT TASK ### \n", + " # -Fit a PCA model with n components\n", + " # -Compress Z_train and Z_val using PCA \n", + " # -Use the compressed features to fit a linear model\n", + 
" # -Calculate the training and validation errors of the \n", + " # YOUR CODE HERE\n", + " raise NotImplementedError()\n", + "\n", + "# Plot the training and validation errors\n", + "plt.figure(figsize=(8,5))\n", + "plt.plot(range(1, D+1), err_val, label=\"validation\")\n", + "plt.plot(range(1, D+1), err_train, label=\"training\")\n", + "plt.xlabel('number of PCs ($n$)')\n", + "plt.ylabel(r'error')\n", + "plt.legend()\n", + "plt.title('validation/training error vs. number of PCs')\n", + "plt.show()" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "deletable": false, + "editable": false, + "nbgrader": { + "cell_type": "code", + "checksum": "65c31f55422b5dc2330f6e4b4d9ac402", + "grade": true, + "grade_id": "cell-5bc71b4230da31ed", + "locked": true, + "points": 3, + "schema_version": 3, + "solution": false + } + }, + "outputs": [], + "source": [ + "# Perform sanity checks on the outputs\n", + "assert err_train[0] >= err_train[4], \"values in err_train wrong\"\n", + "assert err_val[1] <= err_val[4], \"values in err_val wrong\"\n", + "\n", + "\n", + "print('Sanity checks passed!')" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "deletable": false, + "editable": false, + "nbgrader": { + "cell_type": "markdown", + "checksum": "9b5426ab2ec358668c11ece8c032ac10", + "grade": false, + "grade_id": "cell-5c325e740c3059c3", + "locked": true, + "schema_version": 3, + "solution": false, + "task": false + } + }, + "source": [ + "If the task is correctly solved, one should see that while the training error decreases monotonically with an increasing number of components, the validation error does not. Rather, there is an optimum number of components to be used. This is similar to what was observed in round 4, where we searched for the optimum number of features to be used for predicting the grayscale values of pixels. 
The difference is that in the task above we use PCA for each $n$ to obtain the $n$ features that capture as much of the variance of the original data as possible (for $n$ features).\n", + "\n", + "While this technique seems very convenient based on this example, there are some caveats that are good to be aware of. The first, and perhaps most important in an ML setting, is that **even though the PCA features maximize the explained variance in the data $\\mathbf{Z}$, they do not necessarily contain the most information w.r.t. the target $\\mathbf{y}$**. In fact, PCA pre-processing might even worsen outcomes by losing information that is relevant w.r.t. the target variable.\n", + "\n", + "The second major caveat is that PCA pre-processing might result in features that are difficult to interpret, which reduces the ability to infer relationships between real-life features and the target. This is perhaps less important in the realm of machine learning where interpretability is not always a priority, but it is nevertheless good to be aware of this.\n", + "\n", + "Overall, pre-processing with PCA can be useful in many situations, but due to its caveats it is best not to use it indiscriminately. It is better to evaluate its use on a case-by-case basis, and apply it in your final model if its use results in a better predictive capability."
+ ] + }, + { + "cell_type": "markdown", + "metadata": { + "deletable": false, + "editable": false, + "nbgrader": { + "cell_type": "markdown", + "checksum": "cedecf02062de2c3d568bcaae3eada7b", + "grade": false, + "grade_id": "cell-1a277e08a4fc39b4", + "locked": true, + "schema_version": 3, + "solution": false + } + }, + "source": [ + "## The Final Quiz" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "deletable": false, + "editable": false, + "nbgrader": { + "cell_type": "markdown", + "checksum": "4c044d28435b47af480979cd2b731fff", + "grade": false, + "grade_id": "cell-75f0e24b621cdbfa", + "locked": true, + "schema_version": 3, + "solution": false + } + }, + "source": [ + "<a id='QuestionR6_1'></a>\n", + "<div class=\" alert alert-warning\">\n", + " <p><b>Student Task.</b> Question R6.1. </p>\n", + "\n", + " <p>Which of the following statements is true?</p>\n", + "\n", + "<ol>\n", + " <li> Dimensionality reduction helps to avoid overfitting.</li>\n", + " <li> Dimensionality reduction can only be used for labeled data points.</li>\n", + " <li> Dimensionality reduction can only be used for vectors having no more than $100$ entries.</li>\n", + " <li> Dimensionality reduction is a classification method.</li>\n", + "</ol> \n", + "\n", + "</div>" + ] + }, + { + "cell_type": "code", + "execution_count": 4, + "metadata": { + "deletable": false, + "nbgrader": { + "cell_type": "code", + "checksum": "ae20ca729f574283b39af1abc46299bd", + "grade": false, + "grade_id": "cell-8502c4de767cf645", + "locked": false, + "schema_version": 3, + "solution": true + } + }, + "outputs": [], + "source": [ + "# answer_R6_Q1 = ...\n", + "# YOUR CODE HERE\n", + "answer_R6_Q1 = 1" + ] + }, + { + "cell_type": "code", + "execution_count": 5, + "metadata": { + "deletable": false, + "editable": false, + "nbgrader": { + "cell_type": "code", + "checksum": "34e3a546a60f44a88893da1a83a234ed", + "grade": true, + "grade_id": "cell-5eae8fef7c012c4e", + "locked": true, + "points": 1, + 
"schema_version": 3, + "solution": false + } + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Sanity check tests passed!\n" + ] + } + ], + "source": [ + "# this cell is for tests\n", + "assert answer_R6_Q1 in [1,2,3,4], '\"answer_R6_Q1\" Value should be an integer between 1 and 4.'\n", + "print('Sanity check tests passed!')\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "deletable": false, + "editable": false, + "nbgrader": { + "cell_type": "markdown", + "checksum": "f22f74e5bcbc9da10d1fcf7032ae59c4", + "grade": false, + "grade_id": "cell-74fbeaadae128433", + "locked": true, + "schema_version": 3, + "solution": false + } + }, + "source": [ + "<a id='QuestionR6_2'></a>\n", + "<div class=\" alert alert-warning\">\n", + " <p><b>Student Task.</b> Question R6.2. </p>\n", + "\n", + " <p>Which of the following statements is true?</p>\n", + "\n", + "<ol>\n", + " <li> In general, dimensionality reduction increases the computational requirements of fitting a model on the data.</li>\n", + " <li> No information in the original data is lost when performing dimensionality reduction via PCA.</li>\n", + " <li> In general, dimensionality reduction reduces the computational requirements of fitting a model on the data.</li>\n", + " <li> PCA cannot be used for data visualization.</li>\n", + "</ol> \n", + "\n", + "</div>" + ] + }, + { + "cell_type": "code", + "execution_count": 6, + "metadata": { + "deletable": false, + "nbgrader": { + "cell_type": "code", + "checksum": "db84b2dc8c17c572aa42b0b78e599a61", + "grade": false, + "grade_id": "cell-aa7254f504d06bd6", + "locked": false, + "schema_version": 3, + "solution": true + } + }, + "outputs": [], + "source": [ + "# answer_R6_Q2 = ...\n", + "# YOUR CODE HERE\n", + "answer_R6_Q2 = 3" + ] + }, + { + "cell_type": "code", + "execution_count": 7, + "metadata": { + "deletable": false, + "editable": false, + "nbgrader": { + "cell_type": "code", + "checksum": "86043b4a16ac8e0f7f67e5361f987c1a", 
+ "grade": true, + "grade_id": "cell-d73c926ba7d72ae2", + "locked": true, + "points": 1, + "schema_version": 3, + "solution": false + } + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Sanity check tests passed!\n" + ] + } + ], + "source": [ + "# this cell is for tests\n", + "assert answer_R6_Q2 in [1,2,3,4], '\"answer_R6_Q2\" Value should be an integer between 1 and 4.'\n", + "print('Sanity check tests passed!')\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "deletable": false, + "editable": false, + "nbgrader": { + "cell_type": "markdown", + "checksum": "e6e1dbff9a9ea55a9171ff50cc79e5d2", + "grade": false, + "grade_id": "cell-d5d1e56fd6f7109c", + "locked": true, + "schema_version": 3, + "solution": false + } + }, + "source": [ + "<a id='QuestionR6_3'></a>\n", + "<div class=\" alert alert-warning\">\n", + " <p><b>Student Task.</b> Question R6.3. </p>\n", + "\n", + " <p>Consider using PCA as preprocessing to transform long raw feature vectors $\\mathbf{z} \\in \\mathbb{R}^{D}$ to shorter feature vectors $\\mathbf{x} \\in \\mathbb{R}^{n}$ that are then used in linear regression. 
What is the effect of using a larger number $n$ of principal components in PCA?</p>\n", + "\n", + "<ol>\n", + " <li> The training error obtained from linear regression will decrease.</li>\n", + " <li> The training error obtained from linear regression will increase.</li>\n", + "</ol> \n", + "\n", + "</div>" + ] + }, + { + "cell_type": "code", + "execution_count": 10, + "metadata": { + "deletable": false, + "nbgrader": { + "cell_type": "code", + "checksum": "8cc75b2e0c8f17ff230c0cb0d35d125a", + "grade": false, + "grade_id": "cell-5910f33ccb2a7e21", + "locked": false, + "schema_version": 3, + "solution": true + } + }, + "outputs": [], + "source": [ + "# answer_R6_Q3 = ...\n", + "# YOUR CODE HERE\n", + "answer_R6_Q3 = 1" + ] + }, + { + "cell_type": "code", + "execution_count": 11, + "metadata": { + "deletable": false, + "editable": false, + "nbgrader": { + "cell_type": "code", + "checksum": "7f280ab4c9f5d78b17e154b16de1d4ed", + "grade": true, + "grade_id": "cell-c839258ed02d6407", + "locked": true, + "points": 1, + "schema_version": 3, + "solution": false + } + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Sanity check tests passed!\n" + ] + } + ], + "source": [ + "# this cell is for tests\n", + "assert answer_R6_Q3 in [1,2], '\"answer_R6_Q3\" Value should be an integer between 1 and 2.'\n", + "print('Sanity check tests passed!')\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "deletable": false, + "editable": false, + "nbgrader": { + "cell_type": "markdown", + "checksum": "bf0ebaa2de1491ab72ea9ac21fc572bb", + "grade": false, + "grade_id": "cell-719534be8870a6d6", + "locked": true, + "schema_version": 3, + "solution": false, + "task": false + } + }, + "source": [ + "<a id='QuestionR6_3'></a>\n", + "<div class=\" alert alert-warning\">\n", + " <p><b>Student Task.</b> Question R6.4. 
</p>\n", + "\n", + " <p>Consider using PCA as preprocessing to transform long raw feature vectors $\\mathbf{z} \\in \\mathbb{R}^{D}$ to shorter feature vectors $\\mathbf{x} \\in \\mathbb{R}^{n}$ that are then used in linear regression. Will this always result in better performance on new data for a linear regression model?</p>\n", + "\n", + "<ol>\n", + " <li> No.</li>\n", + " <li> Yes.</li>\n", + "</ol> \n", + "\n", + "</div>" + ] + }, + { + "cell_type": "code", + "execution_count": 12, + "metadata": { + "deletable": false, + "nbgrader": { + "cell_type": "code", + "checksum": "90604076301c53038cb9592871cf264f", + "grade": false, + "grade_id": "cell-815f1b53c548da39", + "locked": false, + "schema_version": 3, + "solution": true, + "task": false + } + }, + "outputs": [], + "source": [ + "# answer_R6_Q4 = ...\n", + "# YOUR CODE HERE\n", + "answer_R6_Q4 = 1" + ] + }, + { + "cell_type": "code", + "execution_count": 13, + "metadata": { + "deletable": false, + "editable": false, + "nbgrader": { + "cell_type": "code", + "checksum": "e132ae8bd281328dbc81071e63b7c20a", + "grade": true, + "grade_id": "cell-c2e7f22d5a3da9ee", + "locked": true, + "points": 1, + "schema_version": 3, + "solution": false, + "task": false + } + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Sanity check tests passed!\n" + ] + } + ], + "source": [ + "# this cell is for tests\n", + "assert answer_R6_Q4 in [1,2], '\"answer_R6_Q4\" Value should be an integer between 1 and 2.'\n", + "print('Sanity check tests passed!')\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": 
"ipython3", + "version": "3.7.3" + }, + "varInspector": { + "cols": { + "lenName": 16, + "lenType": 16, + "lenVar": 40 + }, + "kernels_config": { + "python": { + "delete_cmd_postfix": "", + "delete_cmd_prefix": "del ", + "library": "var_list.py", + "varRefreshCmd": "print(var_dic_list())" + }, + "r": { + "delete_cmd_postfix": ") ", + "delete_cmd_prefix": "rm(", + "library": "var_list.r", + "varRefreshCmd": "cat(var_dic_list()) " + } + }, + "types_to_exclude": [ + "module", + "function", + "builtin_function_or_method", + "instance", + "_Feature" + ], + "window_display": false + } + }, + "nbformat": 4, + "nbformat_minor": 4 +}