From 63b5e90feb47ce7103ac08f7d82688a6e30b7481 Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Elias=20Ervel=C3=A4?= <elias.m.ervela@utu.fi> Date: Wed, 12 Jan 2022 15:29:12 +0000 Subject: [PATCH] Upload New File --- Round_3_-_Classification.ipynb | 2390 ++++++++++++++++++++++++++++++++ 1 file changed, 2390 insertions(+) create mode 100644 Round_3_-_Classification.ipynb diff --git a/Round_3_-_Classification.ipynb b/Round_3_-_Classification.ipynb new file mode 100644 index 0000000..ef32e69 --- /dev/null +++ b/Round_3_-_Classification.ipynb @@ -0,0 +1,2390 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": { + "deletable": false, + "editable": false, + "nbgrader": { + "cell_type": "markdown", + "checksum": "ff90179949bfd98c0feae3316c9eb419", + "grade": false, + "grade_id": "cell-9d7b14515f624f0c", + "locked": true, + "schema_version": 3, + "solution": false, + "task": false + } + }, + "source": [ + "# Machine Learning with Python - Classification\n", + "\n", + "In this exercise, you will learn how to formulate and solve a **classification problem**. A classification problem amounts to finding a good predictor or classifier which maps a given data point via its features to a predicted value of its label (which is the quantity of interest). \n", + "\n", + "Remember that **regression problems** are machine learning problems that involve data points with a numeric label such as the grayscale level of a pixel. In contrast, **classification problems** arise from data points whose labels have only a finite number of different values. The most simple classification problem is a **binary classification problem** where the label can take on only two different values such as $y=0$ vs. $y=1$ or $y$=\"picture includes a pedestrian crossing\" vs. $y$=\"picture does not include pedestrian crossing\". The label $y$ of a data point indicates to which class (or category) the data point belongs to. \n", + "\n", + "We consider two widely used methods for solving classification problems: **logistic regression** and **decision trees**. These two methods differ in the choice of hypothesis space, i.e., the set of predictor functions $h(\\mathbf{x})$ that map the features $\\mathbf{x}$ of a data point to a predicted label $\\hat{y}=h(\\mathbf{x})$ (which is hopefully a good approximation of the true label $y$). \n", + "\n", + "We mainly consider binary classification problems with data points having labels from a set of size two such as $\\{0,1\\}$ or {\"image shows a crossing\", \"image shows no crossing\"}. However, we will also discuss a simple approach to upgrade any binary classification method to solve classification problems with more than two label values such as {\"image shows one crossing\", \"image shows more than one crossing\", \"image shows no crossing\"}. We refer to classification problems with more than two label values (or categories) as **multi-class classification problems**. " + ] + }, + { + "cell_type": "markdown", + "metadata": { + "deletable": false, + "editable": false, + "nbgrader": { + "cell_type": "markdown", + "checksum": "fd6135f6e9830207ab8e836a9e36cf4f", + "grade": false, + "grade_id": "cell-02d2af48a0fcd974", + "locked": true, + "schema_version": 3, + "solution": false, + "task": false + } + }, + "source": [ + "## Learning goals\n", + "\n", + "After this round, you should \n", + "\n", + "- be able to model \"real-world\" applications as classification problems by identifying features and labels. 
\n", + "- be able to solve classification problems using logistic regression or decision trees. \n", + "- be able to assess the reliability of classifications provided by logistic regression. \n", + "- know about the differences between decision trees and logistic regression. \n", + "- know how to extend binary classification methods to multi-class problems where labels can take on more than two different values. \n", + "\n", + "\n", + "## Additional Material \n", + "* Relevant Sections in [Course Book](https://arxiv.org/abs/1805.05052) (Chapter 2, 3.4 and 3.6)\n", + "* [video-lecture](https://www.youtube.com/watch?v=-la3q9d7AKQ) of Prof. Andrew Ng on classification problems and logistic regression \n", + "* [video-lecture](https://www.youtube.com/watch?v=ZvaELFv5IpM) of Prof. Andrew Ng on extending binary classification methods to multi-class problems " + ] + }, + { + "cell_type": "markdown", + "metadata": { + "deletable": false, + "editable": false, + "nbgrader": { + "cell_type": "markdown", + "checksum": "bdc75c1953229442a72721880429b7c1", + "grade": false, + "grade_id": "cell-a5592b3f1c6bf91d", + "locked": true, + "schema_version": 3, + "solution": false + } + }, + "source": [ + "## The Problem\n", + "\n", + "<img src=\"../../../coursedata/R3_Classification/CrossingDetection.png\" alt=\"Drawing\" style=\"width: 600px;\"/>\n", + "\n", + "The city planners of Helsinki are regularly sending small airplanes over Helsinki to take high-resolution aerial photographs of different city areas. These aerial photographs are available via the open data service at https://kartta.hel.fi. It is important to monitor the condition of pedestrian crossings to determine if a renewal is necessary. To this end, we want to find those areas which contain a pedestrian crossing. \n", + "\n", + "In this exercise, you will learn how to use classification methods to determine if a particular area contains a pedestrian (zebra) crossing or not. We model this pedestrian crossing detection as a machine learning problem. The problem amounts to learn a predictor (or classifier) map $h(\\mathbf{x})$ which delivers a predicted label $\\hat{y} = h(\\mathbf{x})$ which indicates if a certain area contains a pedestrian crossing or not. The classification is based on numeric features $\\mathbf{x}=\\big(x_{1},\\ldots,x_{n}\\big)^{T}$ that are computed from an aerial photograph of the area in question. \n", + "\n", + "We will solve this binary classification problem using two different classification methods: logistic regression and decision trees. These two methods differ in the choice of hypothesis space. Decision tree classifiers use a flow-chart representation of the predictor function. In contrast, logistic regression uses the hypothesis space of linear predictor functions which is also used in linear regression (see Round 2 - Regression). \n", + "\n", + "The difference between logistic and linear regression is the set of label values, which is the real numbers for linear regression and a set of size two for logistic regression. Another difference between linear and logistic regression is the loss function. While linear regression is based on minizing the squared error loss, logistic regression minimizes the logistic loss function which will be explained below. 
\n", + "\n", + "As you might already know, most machine learning problems (and methods) consist of three components: \n", + "\n", + "* some **data** (a bunch of data points, each of which is characterized by features and labels) \n", + "* a **hypothesis space** (consisting of a set of predictor functions from features to labels)\n", + "* a **loss function** which is used to assess the quality of a particular predictor function \n", + "\n", + "In what follows, we will discuss particular choices for these three components to solve the pedestrian crossing detection problem. \n", + "\n", + "## The Data\n", + "\n", + "ML methods aim at finding a good predictor map (or classifier) $h(\\mathbf{x})$ which takes some features $\\mathbf{x}$ as input and outputs a guess (or estimate) for the label $y$ of the data point (which represents an area of Helsinki in our application). To measure the quality of a particular predictor $h(\\mathbf{x})$ we try it out on data points for which we know already the true label values $y$. The basic principle of classification methods is then to find (or learn) the best predictor function out of a set of computationally feasible functions (the hypothesis space). \n", + "\n", + "We have access to a data set consisting of $m=178$ data points $\\big(\\mathbf{x}^{(i)},y^{(i)}\\big)$ for $i=1,\\ldots,m$. Each data point $\\big(\\mathbf{x}^{(i)},y^{(i)}\\big)$ represents a particular city area. This city area is characterized by several features $\\mathbf{x}^{(i)}=\\big(x^{(i)}_{1},\\ldots,x^{(i)}_{n}\\big)^{T}$ that are computed from an aerial photograph of that area. You can learn more about some efficient methods for automatically determining relevant features of an image [here](https://opencv-python-tutroals.readthedocs.io/en/latest/py_tutorials/py_feature2d/py_table_of_contents_feature2d/py_table_of_contents_feature2d.html). The data point $\\big(\\mathbf{x}^{(i)},y^{(i)}\\big)$ is also characterized by the true label $y^{(i)}$ which has been found out by a city planner who manually inspected the areal photograph. \n", + "\n", + "We can use the **labeled data** $\\big(\\mathbf{x}^{(i)},y^{(i)}\\big)$ for $i=1,\\ldots,m$, to find a good predictor map $h(\\mathbf{x})$. In contrast to regression problems, where the ouput $h(\\mathbf{x})$ of a predictor map is a (real) number, here the ouput $h(\\mathbf{x})$ is a discrete label value or category. In this case it is customary to use the term **classifier** for the prediction map $h(\\mathbf{x})$.\n", + "\n", + "A good classifier $h(\\mathbf{x})$ should at least agree well with similar human judgment,\n", + "\\begin{equation} \n", + "\\underbrace{y^{(i)}}_{\\mbox{label by human}} \\approx \\underbrace{h(\\mathbf{x}^{(i)})}_{\\mbox{predicted label } \\hat{y}^{(i)}} \\mbox{ for all } i =1,\\ldots,m. \n", + "\\end{equation}\n", + "\n", + "To sum up, \n", + "* The dataset contains information about $m=178$ areas in city of Helsinki. \n", + "* For each area, a feature vector $\\mathbf{x}^{(i)}$ containing $n=13$ features has been determined. 
\n", + "* For each area, a city planner determined the class $c^{(i)}$ which is either \n", + " * $c^{(i)} = 0$ (area has no pedestrian crossing)\n", + " * $c^{(i)}=1$ (area has one pedestrian crossing) \n", + " * $c^{(i)}=2$ (area has more than one pedestrian crossing)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "deletable": false, + "editable": false, + "nbgrader": { + "cell_type": "markdown", + "checksum": "455e2bf213d421aed7be0e59bd9b82b3", + "grade": false, + "grade_id": "cell-412f596e08777d50", + "locked": true, + "schema_version": 3, + "solution": false + } + }, + "source": [ + "<a id='demoboundary'></a>\n", + "<div class=\"alert alert-info\">\n", + " \n", + "### Demo. Load Data.\n", + "\n", + "The code snippet below loads the dataset with information of $m=178$ city areas. The $i$-th data point is characterized by the feature vector $\\mathbf{x}^{(i)} \\in \\mathbb{R}^{n}$ and the category $c^{(i)} \\in \\{0,1,2\\}$ as determined by a human eye. The features and categories for the first five images $i=1,\\dots,5$ are displayed. \n", + "</div>" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "deletable": false, + "editable": false, + "nbgrader": { + "cell_type": "code", + "checksum": "65f403136e789158431cf3db05076f42", + "grade": false, + "grade_id": "cell-76512dcd520fdc10", + "locked": true, + "schema_version": 3, + "solution": false, + "task": false + } + }, + "outputs": [], + "source": [ + "############################# IMPORTANT! #############################\n", + "# This cell needs to be run to load the necessary libraries and data #\n", + "######################################################################\n", + "\n", + "%matplotlib inline\n", + "import numpy as np\n", + "from numpy.linalg import norm\n", + "import pandas as pd\n", + "import seaborn as sns\n", + "import matplotlib.pyplot as plt\n", + "#from unittest.mock import patch\n", + "#from plotchecker import ScatterPlotChecker\n", + "# import scikit-learn metrics module for accuracy calculation\n", + "from sklearn import metrics\n", + "from sklearn import datasets\n", + "from sklearn import preprocessing\n", + "from sklearn.tree import DecisionTreeClassifier\n", + "from sklearn.linear_model import LogisticRegression" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "deletable": false, + "editable": false, + "nbgrader": { + "cell_type": "code", + "checksum": "8a4fe6b6bb4d261964404afca804c56c", + "grade": false, + "grade_id": "cell-f5276acd5385a727", + "locked": true, + "schema_version": 3, + "solution": false, + "task": false + } + }, + "outputs": [], + "source": [ + "# Load the dataset and store data and labels in variables\n", + "data = pd.read_csv(\"/coursedata/R3_Classification/image_data.csv\", header = None)\n", + "c = pd.read_csv(\"/coursedata/R3_Classification/image_labels.csv\", header = None)\n", + "\n", + "# Add column names to feature dataframe (x1,...,x13)\n", + "data.columns = [\"x\" + str(i) for i in range(data.shape[1])]\n", + "\n", + "# Add labels (target) column to the dataframe\n", + "data['target'] = c\n", + "\n", + "# Add column containing the class names corresponding to the target \n", + "category_names = [\"zero crossing\", \"one crossing\", \"multiple crossings\" ] # set possible categories for our labels\n", + "data['class'] = data['target'].map(lambda ind: category_names[ind])\n", + "\n", + "# Print information of dataset\n", + "print(f\"Data shape:\\t{data.shape} \\nLabels shape: \\t{c.shape}\")\n", + 
"print(f\"Number of samples from Class 0: {sum(c[0] == 0)}\")\n", + "print(f\"Number of samples from Class 1: {sum(c[0] == 1)}\")\n", + "print(f\"Number of samples from Class 2: {sum(c[0] == 2)}\")\n", + "\n", + "# Display first five datapoints using pandas\n", + "print(\"\\nThe first five data points:\")\n", + "display(data.head(5))" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "deletable": false, + "editable": false, + "nbgrader": { + "cell_type": "markdown", + "checksum": "7bc6a6ab0b4aadb780f1af31295755ba", + "grade": false, + "grade_id": "cell-4a084869afc0c5ec", + "locked": true, + "schema_version": 3, + "solution": false + } + }, + "source": [ + "### Features and Labels \n", + "\n", + "Remember our goal is to classify an area based on features of an areal photograph of that area. The $i$-th area is characterized by the features $x^{(i)}_{1},\\ldots,x^{(i)}_{13}$ which we collect into the **feature vector** $\\mathbf{x}^{(i)} = \\big(x_{1}^{(i)},x_{2}^{(i)}, ... x_{13}^{(i)} \\big)^{T} \\in \\mathbb{R}^{13}$. It will be convenient to stack the feature vectors $\\mathbf{x}^{(i)} \\in \\mathbb{R}^{13}$, obtained for all data points $i=1,\\dots,m$, into the feature matrix \n", + "\n", + "<a id='xm'></a>\n", + "\\begin{equation*}\n", + " \\mathbf{X} = \\big(\\mathbf{x}^{(1)},\\dots,\\mathbf{x}^{(178)}\\big)^T=\\begin{bmatrix}\n", + " x^{(1)}_{1} & \\dots & x^{(1)}_{13} \\\\\n", + " \\vdots & \\ddots & \\vdots\\\\\n", + " x^{(178)}_{1} & \\dots & x^{(178)}_{13}\n", + " \\end{bmatrix},\\ \\mathbf{X} \\in \\mathbb{R}^{m \\times n},\\ \\text{where } m=178, n=13.\n", + " \\tag{1}\n", + "\\end{equation*}\n", + "\n", + "Besides its features $\\mathbf{x}^{(i)}$, the $i$-th area is characterized by the category $c^{(i)} \\in \\{0,1,2\\}$ which has been determined by a human expert. In principle, we could directly use the category $c^{(i)}$ as the label or quantity of interest. However, we will first consider the simpler problem of determining if a particular area does not have a pedestrian crossing ($c^{(i)} = 0$) of if it has some pedestrian crossing ($c^{(i)}=1$ or $c^{(i)}=2$). Thus, we define the label of an area as $y^{(i)}=1$ if $c^{(i)} = 0$. Otherwise, we define the label $y^{(i)}=0$ if the area shows at least one pedestrican crossing, corresponding to $c^{(i)} = 1$ or $c^{(i)}=2$. \n", + "\n", + "It will be convenient to collect the labels of all images into the label vector \n", + "\n", + "<a id='vy'></a>\n", + "\\begin{equation*}\n", + " \\mathbf{y}=\\big(y^{(1)},y^{(2)},\\ldots,y^{(m)} \\big)^{T} = \\begin{bmatrix}\n", + " y^{(1)}\\\\\n", + " y^{(2)}\\\\\n", + " \\vdots\\\\\n", + " y^{(m)}\n", + " \\end{bmatrix} \\in \\mathbb{R}^{m}.\n", + " \\tag{2}\n", + "\\end{equation*}\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "deletable": false, + "editable": false, + "nbgrader": { + "cell_type": "markdown", + "checksum": "6044a86148b78e93ecf17843569e8fdd", + "grade": false, + "grade_id": "cell-17287368c834f17a", + "locked": true, + "schema_version": 3, + "solution": false + } + }, + "source": [ + "<a id='featurefunction'></a>\n", + "<div class=\" alert alert-warning\">\n", + " <b>Student Task.</b> Feature Matrix. \n", + "\n", + "Implement a Python function `X=feature_matrix()` which loads the image dataset and returns the feature matrix ([1](#xm)) of size $178 \\times 13$ containing $n=13$ features for each of the $m=178$ areas. The $i$-th row of the feature matrix contains the features $x^{(i)}_{1},\\ldots,x^{(i)}_{n}$ of the $i$-th area. 
\n", + "\n", + "\n", + "* Use `pandas.read_csv()` function ([link to docs](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html)) to load .csv file. \n", + " \n", + "* We also need to use preprocessing step ([read more here](https://scikit-learn.org/stable/modules/preprocessing.html#preprocessing-data)) as our features are in different scales (some of the features are in the range of 0-1 and some in the order of hundreds and thousands). Large difference in the range of the features may negatively affect learning predictor with certain algorithms (which may be expressed as `ConvergenceWarning` error). To preprocess data matrix X use `preprocessing.scale(X)` function ([see examples here](https://scikit-learn.org/stable/modules/preprocessing.html#standardization-or-mean-removal-and-variance-scaling)). \n", + " \n", + "* Function should return preprocessed (scaled) data matrix.\n", + "\n", + "</div>" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "deletable": false, + "nbgrader": { + "cell_type": "code", + "checksum": "d962feba819f5d4cc242a54348deec16", + "grade": false, + "grade_id": "cell-577310fb782f61d2", + "locked": false, + "schema_version": 3, + "solution": true + } + }, + "outputs": [], + "source": [ + "def feature_matrix():\n", + " \"\"\"\n", + " Generate a feature matrix representing different descriptive statistics of an image in the dataset.\n", + "\n", + " :return: array-like, shape=(m, n), feature-matrix with n features for each of m images. \"\"\" \n", + " \n", + " file = \"/coursedata/R3_Classification/image_data.csv\" \n", + " \n", + " ### STUDENT TASK ### \n", + " # X = ...\n", + " # X_scaled = ...\n", + " \n", + " # YOUR CODE HERE\n", + " raise NotImplementedError()\n", + " \n", + " return X_scaled" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "deletable": false, + "editable": false, + "nbgrader": { + "cell_type": "code", + "checksum": "289ccac4feb5b7813e60b5ed5957039c", + "grade": true, + "grade_id": "cell-840700cb14517bbf", + "locked": true, + "points": 1, + "schema_version": 3, + "solution": false + } + }, + "outputs": [], + "source": [ + "test_matrix = feature_matrix()\n", + "assert test_matrix.shape == (178,13), f'feature_matrix returns wrong matrix for m=1. It should be shape (178,13), but you gave {test_matrix.shape}'\n", + "print('All tests passed!')" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "deletable": false, + "editable": false, + "nbgrader": { + "cell_type": "markdown", + "checksum": "b01d4c1bb805292960fcddb52617af15", + "grade": false, + "grade_id": "cell-acac139fee4f522f", + "locked": true, + "schema_version": 3, + "solution": false + } + }, + "source": [ + "<a id='labelfunction'></a>\n", + "<div class=\" alert alert-warning\">\n", + " <b>Student Task.</b> Label Vector. \n", + "\n", + "Implement a Python function `y=labels()` which loads the image dataset and returns the label vector ([2](#vy)) of length $m$ where `m` is the number of images described in the dataset. The $i$th entry $y^{(i)}$ of the returned vector should be $y^{(i)}=1$ if the $i$th image is from Class 0 and $y^{(i)}=0$ otherwise.\n", + " \n", + "* Use `pandas.read_csv()` function ([link to docs](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html)) to load .csv file. 
\n", + "\n", + "* The shape of the label vector `y` should be (178,) not (178,1) as later on we will fit our data to LogisticRegression classifier and required shape of `y` is (n_samples,) [see more here](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html#sklearn.linear_model.LogisticRegression.fit).\n", + " \n", + "</div>" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "deletable": false, + "nbgrader": { + "cell_type": "code", + "checksum": "1eaedc2c498f7f62aa42d96fbb755f31", + "grade": false, + "grade_id": "cell-b67c7f64aa86fbba", + "locked": false, + "schema_version": 3, + "solution": true + } + }, + "outputs": [], + "source": [ + "def labels():\n", + " \"\"\" \n", + " :return: array-like, shape=(m,), label-vector\n", + " \"\"\" \n", + " file = \"/coursedata/R3_Classification/image_labels.csv\"\n", + " \n", + " ### STUDENT TASK ###\n", + " # YOUR CODE HERE\n", + " raise NotImplementedError()\n", + " return y" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "deletable": false, + "editable": false, + "nbgrader": { + "cell_type": "code", + "checksum": "88f245e286b97bb357ed8e7e5101ffb7", + "grade": true, + "grade_id": "cell-1230c58d375ceb1a", + "locked": true, + "points": 3, + "schema_version": 3, + "solution": false + } + }, + "outputs": [], + "source": [ + "test_labels = labels()\n", + "assert test_labels.shape == (178, ), f'Your label vector is incorrect shape. It should be (178,), but you gave {test_labels.shape}'\n", + "for i in [1,20,40,58]:\n", + " assert test_labels[i] == 1, f'Image sample should be from class 0, but you labeled it as other class'\n", + "for i in [59,80,100,150,177]:\n", + " assert test_labels[i] == 0, f'Image sample should be from class 1 or 2, but you labeled it as from class 0'\n", + "\n", + "print('All tests passed!')" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "deletable": false, + "editable": false, + "nbgrader": { + "cell_type": "markdown", + "checksum": "53dc6f8242f98cd295dd1b943b0adff2", + "grade": false, + "grade_id": "cell-e6a842c140cb4d36", + "locked": true, + "schema_version": 3, + "solution": false + } + }, + "source": [ + "<a id='demoboundary'></a>\n", + "<div class=\"alert alert-info\">\n", + " \n", + "### Demo. Visualize Data Points.\n", + "\n", + "The code snippet below uses the functions from the previous tasks to load the features $\\mathbf{x}^{(i)}$ and labels $y^{(i)}$ of the images. We then visualize these data points using a scatter plot. In this scatter plot, the $i$-th data point is represented by either a dot (when $y^{(i)} =0$) or a cross ($y^{(i)}=1$) located at the coordinates given by the first two features $x_{1}^{(i)}$ and $x_{2}^{(i)}$. 
\n", + "</div>" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "deletable": false, + "editable": false, + "nbgrader": { + "cell_type": "code", + "checksum": "a94919e5021eff6cfe51b34d82e8dcce", + "grade": false, + "grade_id": "cell-062120c0d0af4eb0", + "locked": true, + "schema_version": 3, + "solution": false + } + }, + "outputs": [], + "source": [ + "# Load the features\n", + "X = feature_matrix()\n", + "# Load labels\n", + "y = labels() \n", + "\n", + "idx_1 = np.where(y == 1) # indices of class 0 images\n", + "idx_2 = np.where(y == 0) # indices of not class 0 images\n", + "\n", + "# Plot scatterplot of dataset with different markings for class 0 images\n", + "fig, axes = plt.subplots(figsize=(10, 6))\n", + "axes.scatter(X[idx_1, 0], X[idx_1, 1], c='green', marker ='x', label='y =1; Class 0 image')\n", + "axes.scatter(X[idx_2, 0], X[idx_2, 1], c='brown', marker ='o', label='y =0; Class 1 or Class 2 image')\n", + "\n", + "# Set axis labels and legend\n", + "axes.legend(loc='upper left', fontsize=12)\n", + "axes.set_xlabel('feature x1', fontsize=16)\n", + "axes.set_ylabel('feature x2', fontsize=16)\n", + "plt.show()" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "deletable": false, + "editable": false, + "nbgrader": { + "cell_type": "markdown", + "checksum": "f95e7c5ec373f6e8cb5b96ec6b78974c", + "grade": false, + "grade_id": "cell-8a2013786c9827ce", + "locked": true, + "schema_version": 3, + "solution": false + } + }, + "source": [ + "## Logistic Regression\n", + "<img src=\"../../../coursedata/R3_Classification/Log_Reg2.png\" alt=\"Drawing\" style=\"width: 600px;\"/>\n", + "\n", + "Our goal is to determine the label $y$ of an image based on its features $\\mathbf{x}=\\big(x_{1},\\ldots,x_{n}\\big)^{T}$. The label is $y=1$ if the image belongs to class 0 and $y=0$ otherwise (either class 1 or class 2). \n", + "\n", + "Similar to linear regression, **logistic regression** uses a linear function of the form $h^{(\\mathbf{w})}(\\mathbf{x})= \\mathbf{w}^{T} \\mathbf{x}$, with some weight vector $\\mathbf{w} \\in \\mathbb{R}^{n}$, to predict the label $y$ based on the features $\\mathbf{x}$. \n", + "\n", + "At this point it might seem strange to use the real-valued function $h^{(\\mathbf{w})}(\\mathbf{x})=\\mathbf{w}^{T} \\mathbf{x}$ for predicting the binary label $y \\in \\{0,1\\}$. Indeed, while the label $y$ can take on only values $0$ or $1$, the function $h^{(\\mathbf{w})}(\\mathbf{x})$ can take on any real number. \n", + "\n", + "However, it turns out to be useful to use the real-valued function $h^{(\\mathbf{w})}(\\mathbf{x})$ for predicting binary labels. First of all, we can easily obtain a predicted label $\\hat{y} \\in \\{0,1\\}$ simply by using the sign of $h^{(\\mathbf{w})}(\\mathbf{x})$, \n", + "\\begin{equation}\n", + "\\hat{y} = \\begin{cases} 1 & \\mbox{ for } h^{(\\mathbf{w})}(\\mathbf{x}) \\geq 0 \\\\ 0 & \\mbox{ for } h^{(\\mathbf{w})}(\\mathbf{x}) < 0. \\end{cases}\n", + "\\end{equation} \n", + "What is more, we can accompany the predicted label by a **measure of the confidence (or reliability)** in the classification $\\hat{y}$ using the magnitude $|h^{(\\mathbf{w})}(\\mathbf{x})|$.\n", + "\n", + "This, rather intuitive but informal, interpretation of the predictor $h^{(\\mathbf{w})}(\\mathbf{x})=\\mathbf{w}^{T} \\mathbf{x}$ can be made mathematically precise by using a **probabilistic model** for the labels of data points. In particular, we could model the true label $y$ of an image as a **random variable**. 
In particular, we consider $y$ as a (realization of a) binary random variable taking on the value $y=1$ with probability \n",
+ "\\begin{align} \n",
+ "{\\rm Prob}(y=1; \\mathbf{w}) & = \\frac{1}{1+{\\rm exp}(-h^{(\\mathbf{w})}(\\mathbf{x}))} = \\frac{1}{1+{\\rm exp}(-\\mathbf{w}^{T}\\mathbf{x})}. \n",
+ "\\end{align}\n",
+ "\n",
+ "Note that the probability depends on the weight vector $\\mathbf{w}$, which has to be determined (learnt) from data.\n",
+ "\n",
+ "Since the label $y$ must take on either $1$ or $0$, which implies ${\\rm Prob}(y=0;\\mathbf{w}) + {\\rm Prob}(y=1;\\mathbf{w})=1$, we have \n",
+ "\n",
+ "\\begin{align} \n",
+ "{\\rm Prob}(y=0; \\mathbf{w}) & = 1- \\frac{1}{1+{\\rm exp}(-\\mathbf{w}^{T}\\mathbf{x})} = \\frac{1}{1+{\\rm exp}(\\mathbf{w}^{T}\\mathbf{x})}. \n",
+ "\\end{align}\n",
+ "\n",
+ "To evaluate the probability ${\\rm Prob}(y=1;\\mathbf{w})$, we need to specify the weight vector $\\mathbf{w}$ and we need to know the feature vector $\\mathbf{x}$ of an image. The feature vector $\\mathbf{x}$ of an image is available via data analysis. The more challenging part is to come up with a good choice for the weight vector $\\mathbf{w}$. \n",
+ "\n",
+ "A principled approach to find or **learn** a good choice for the weight vector $\\mathbf{w}$ is to maximize the probability (or likelihood) of the labels $y^{(i)}$, $i=1,\\ldots,m$, for the image-samples in our dataset. \n",
+ "This **maximum likelihood** approach amounts to the following optimization problem \n",
+ "\\begin{equation}\n",
+ "\\tag{3}\n",
+ "\\widehat{\\mathbf{w}} = {\\rm argmax}_{\\mathbf{w}} \\prod_{i=1}^{m} {\\rm Prob}(y = y^{(i)}; \\mathbf{w}),\n",
+ "\\label{logloss_ml}\n",
+ "\\end{equation} \n",
+ "\n",
+ "where the likelihood is maximized with respect to $\\mathbf{w}$.\n",
+ "\n",
+ "The product over all samples $i=1,\\ldots,m$ arises from the assumption that the samples are realizations of independent and identically distributed (i.i.d.) random variables. We will not use this probabilistic interpretation in what follows; instead, we will show that this maximum likelihood approach is equivalent to the minimization of a certain loss function, the **logistic loss**. \n",
+ "\n",
+ "As detailed in the course book (Section 3.4), solving the above maximum likelihood problem is equivalent to minimizing the average **logistic loss**. The logistic loss incurred by a linear predictor $h^{(\\mathbf{w})}(\\mathbf{x})=\\mathbf{w}^{T} \\mathbf{x}$, using the weight vector $\\mathbf{w}$, when applied to a data point with features $\\mathbf{x}$ and true label $y$ is defined as:\n",
+ "\n",
+ "\\begin{equation*}\n",
+ " \\mathcal{L}\\big((\\mathbf{x},y);\\mathbf{w}\\big) = -y\\ln\\big(\\sigma\\big( \\mathbf{w}^{T} \\mathbf{x} \\big)\\big)-(1-y)\\ln\\big(1-\\sigma\\big(\\mathbf{w}^{T}\\mathbf{x}\\big) \\big). 
\n", + " \\label{loss}\n", + " \\tag{4}\n", + "\\end{equation*}\n", + "Here, we use the **sigmoid function** \n", + "\\begin{equation*}\n", + " \\sigma(z)= \\frac{1}{1+{\\rm exp}(-z)}\n", + " \\label{sigmoid}\n", + " \\tag{5}\n", + "\\end{equation*}\n", + "\n", + "Since we have $m=178$ labeled samples with features $\\mathbf{x}^{(i)}$ and labels $y^{(i)}$, for $i=1,\\ldots,m$, we can evaluate the logistic loss for all those samples to obtain the average loss or **empirical risk** \n", + "\\begin{align}\n", + "\\mathcal{E}(\\mathbf{w}) \n", + "& = (1/m) \\sum_{i=1}^{m} \\mathcal{L}((\\mathbf{x}^{(i)},y^{(i)}),\\ h^{(\\mathbf{w})}) \\nonumber \\\\ \n", + "& = (1/m) \\sum_{i=1}^{m}\\big[ -y^{(i)}\\ln\\big({\\rm Prob}(y=1; \\mathbf{w})\\big)-(1-y^{(i)})\\ln\\big(1-{\\rm Prob}(y=1; \\mathbf{w})\\big) \\big] \\\\\n", + "& = (1/m) \\sum_{i=1}^{m}\\big[ -y^{(i)}\\ln\\big(\\sigma(\\mathbf{w}^{T}\\mathbf{x}^{(i)})\\big)-(1-y^{(i)})\\ln\\big(1-\\sigma(\\mathbf{w}^{T}\\mathbf{x}^{(i)})\\big) \\big]. \n", + " \\label{erm}\n", + " \\tag{6}\n", + "\\end{align}\n", + "The empirical risk $\\mathcal{E}(\\mathbf{w})$ is a measure for how well a classifier $h^{(\\mathbf{w})}=\\mathbf{w}^{T} \\mathbf{x}$ agrees with the labeled data points $\\big(\\mathbf{x}^{(i)},y^{(i)}\\big)$ for $i=1,\\ldots,m$. If the risk $\\mathcal{E}(\\mathbf{w})$ is small, then the classifier agrees well with the labeled data points. \n", + "\n", + "Naturally, we should chose the weight vector $\\mathbf{w}$ to make $\\mathcal{E}(\\mathbf{w})$ as small as possible. It turns out that chosing the weight vector in order to minimize the empirical risk is in fact the same as chosing the weight vector via the maximum likelihood estimate $\\widehat{\\mathbf{w}}$ \\eqref{logloss_ml}: \n", + "\n", + "\\begin{align}\n", + "\\widehat{\\mathbf{w}} & = {\\rm argmin}_{\\mathbf{w} \\in \\mathbb{R}^{n}} \\mathcal{E}(\\mathbf{w}) \\nonumber \\\\ \n", + "& = {\\rm argmin}_{\\mathbf{w} \\in \\mathbb{R}^{n}} (1/m) \\sum_{i=1}^{m} \\big[-y^{(i)}\\ln\\big(\\sigma(\\mathbf{w}^{T}\\mathbf{x}^{(i)})\\big)-(1-y^{(i)})\\ln\\big(1- \\sigma(\\mathbf{w}^{T}\\mathbf{x}^{(i)})\\big)\\big]. \n", + "\\end{align}\n", + "\n", + "\n", + "Note that the empirical risk $\\mathcal{E}( \\mathbf{w})$ is a differentiable convex function of the weight vector $\\mathbf{w}$. Such functions can be minimized efficiently using [gradient descent](https://en.wikipedia.org/wiki/Gradient_descent) (see [course book](https://arxiv.org/pdf/1805.05052.pdf), Chapter 5 for more details). Moreover, the Python library `scikit-learn` provides the class `LogisticRegression()` for linear classifiers that are optimizing using the logistic loss. In particular, the function [`LogisticRegression.fit(X, y)`](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html#sklearn.linear_model.LogisticRegression.fit) minimizes the empirical risk for data points whose features are stored in the numpy array `X` and labels are stored in the numpy array `y`.\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "deletable": false, + "editable": false, + "nbgrader": { + "cell_type": "markdown", + "checksum": "bd58a7bdff6fc4948d57b7d58ef12942", + "grade": false, + "grade_id": "cell-3f26e27d536e5c4f", + "locked": true, + "schema_version": 3, + "solution": false, + "task": false + } + }, + "source": [ + "<b><font size=4>Summary. 
Logistic Regression</font></b>\n",
+ "\n",
+ "* Linear functions can be used in classification tasks\n",
+ "* One of the most popular linear classifiers is logistic regression\n",
+ "* For a linear predictor $h^{(\\mathbf{w})}(\\mathbf{x})= \\mathbf{w}^{T} \\mathbf{x}$, we can assign a class label based on the sign of $h^{(\\mathbf{w})}(\\mathbf{x})$: if $h^{(\\mathbf{w})}(\\mathbf{x}) \\geq 0$ the assigned class is 1 and if $h^{(\\mathbf{w})}(\\mathbf{x}) < 0$ the assigned class is 0.\n",
+ "* We want to choose a weight vector $\\mathbf{w}$ that maximizes the probability (or likelihood) of the observed labels\n",
+ "* Maximizing this probability is equivalent to minimizing the average logistic loss\n"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "deletable": false,
+ "editable": false,
+ "nbgrader": {
+ "cell_type": "markdown",
+ "checksum": "f424a6b7ebbe40c7ef08243ba95aeaa8",
+ "grade": false,
+ "grade_id": "cell-c7cdbcf0e4bd37f8",
+ "locked": true,
+ "schema_version": 3,
+ "solution": false,
+ "task": false
+ }
+ },
+ "source": [
+ "<img src=\"../../../coursedata/R3_Classification/logreg1.jpg\" alt=\"Drawing\" style=\"width: 500px;\"/>"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "deletable": false,
+ "editable": false,
+ "nbgrader": {
+ "cell_type": "markdown",
+ "checksum": "7e2af29bbf3ebb801da2a6f575fbbddc",
+ "grade": false,
+ "grade_id": "cell-9ae08df3ffd3299d",
+ "locked": true,
+ "schema_version": 3,
+ "solution": false
+ }
+ },
+ "source": [
+ "<a id='demoboundary'></a>\n",
+ "<div class=\" alert alert-info\">\n",
+ " \n",
+ "### Demo. Logistic Loss.\n",
+ "\n",
+ "The code snippet below plots the logistic loss $\\mathcal{L}\\big((\\mathbf{x},y);\\mathbf{w}\\big) = -y\\ln\\big(\\sigma\\big( \\mathbf{w}^{T} \\mathbf{x} \\big)\\big)-(1-y)\\ln\\big(1-\\sigma\\big(\\mathbf{w}^{T}\\mathbf{x}\\big) \\big)$ as a function of the predictor value $h(\\mathbf{x}) = \\mathbf{w}^{T} \\mathbf{x}$. The value $\\mathbf{w}^{T} \\mathbf{x}$ is interpreted as the confidence in the true label being equal to $1$. As soon as $\\mathbf{w}^{T} \\mathbf{x}> 0$ we classify a data point as $\\hat{y} = 1$ and the absolute value $|\\mathbf{w}^{T} \\mathbf{x}|$ quantifies the confidence in this classification result. If the true label is $y=1$, we would like the loss $\\mathcal{L}\\big((\\mathbf{x},y);\\mathbf{w}\\big)$ to decrease as $\\mathbf{w}^{T} \\mathbf{x}$ increases (towards $+\\infty$). Similarly, if the true label of the data point is $y=0$, we would like the loss to decrease as $\\mathbf{w}^{T} \\mathbf{x}$ decreases towards $- \\infty$, since we are increasingly confident in the correct classification $\\hat{y}=0$. 
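Before the plotting code below, here is a minimal sketch (not part of the graded material) of how the empirical risk (6) could be evaluated for a candidate weight vector. It assumes the `feature_matrix()` and `labels()` helpers implemented in the tasks above; logistic regression then amounts to searching for the weight vector with the smallest value of this quantity.

```python
import numpy as np

X = feature_matrix()      # shape (178, 13), scaled features
y = labels()              # shape (178,), binary labels

def sigmoid(z):
    # sigmoid function, cf. equation (5)
    return 1 / (1 + np.exp(-z))

def empirical_logistic_risk(w, X, y):
    p = sigmoid(X @ w)    # sigma(w^T x^(i)) for every data point
    # average logistic loss over all data points, cf. equations (4) and (6)
    return np.mean(-y * np.log(p) - (1 - y) * np.log(1 - p))

w_candidate = np.zeros(X.shape[1])                 # a (poor) candidate weight vector
print(empirical_logistic_risk(w_candidate, X, y))  # ln(2) ~ 0.693 for w = 0
```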
\n", + "</div>" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "deletable": false, + "editable": false, + "nbgrader": { + "cell_type": "code", + "checksum": "a078c98ab28df33f7297296b5b23c499", + "grade": false, + "grade_id": "cell-75e9af43f11ba370", + "locked": true, + "schema_version": 3, + "solution": false + } + }, + "outputs": [], + "source": [ + "# Define sigmoid function according to formula (5)\n", + "def sigmoid_func(x):\n", + " return 1/(1 + np.exp(-x))\n", + "\n", + "# Choose values (w^T*x) for calculating loss \n", + "range_x = np.arange(-5 , 5 , 0.01)\n", + "\n", + "# Calculate logistic loss for y=1 and y=0\n", + "logloss_y1 = -np.log(sigmoid_func(range_x))\n", + "logloss_y0 = -np.log(1-sigmoid_func(range_x))\n", + "\n", + "# Set fontsizes for matplotlib\n", + "plt.rc('legend', fontsize=16) \n", + "plt.rc('axes', labelsize=16) \n", + "plt.rc('xtick', labelsize=12) \n", + "plt.rc('ytick', labelsize=12) \n", + " \n", + "# Plot the results, using the plot function in matplotlib.pyplot.\n", + "fig, axes = plt.subplots(1, 1, figsize=(10, 6)) \n", + "axes.plot(range_x, logloss_y1, linestyle=':', label=r'$y=1$', linewidth=3.0)\n", + "axes.plot(range_x, logloss_y0, label=r'$y=0$', linewidth=2.0)\n", + "\n", + "# Set axis labels and title\n", + "axes.set_xlabel(r'$\\mathbf{w}^{T}\\mathbf{x}$')\n", + "axes.set_ylabel(r'$\\mathcal{L}((y,\\mathbf{x});\\mathbf{w})$')\n", + "axes.set_title(\"logistic loss\", fontsize=20)\n", + "axes.legend()\n", + "plt.show()" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "deletable": false, + "editable": false, + "nbgrader": { + "cell_type": "markdown", + "checksum": "3ea1be584b1d3d12f742d10f5bcaba36", + "grade": false, + "grade_id": "cell-2e9a1e2bdd68d5a2", + "locked": true, + "schema_version": 3, + "solution": false + } + }, + "source": [ + "<a id='demoboundary'></a>\n", + "<div class=\" alert alert-warning\">\n", + "<b>Student Task.</b> Logistic vs. Squared Error Loss.\n", + " \n", + "Extend the demo above by adding the squared error loss $(y-\\mathbf{w}^{T}\\mathbf{x})^{2}$ for cases $y=1$. Store the values of the loss function for various values of $\\mathbf{w}^{T} \\mathbf{x}$ in the numpy array `squaredloss_y1`. These numpy array should have the same shape as the numpy array `range_x` which is already created in the code snippet below. 
\n", + "\n", + "</div>" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "deletable": false, + "nbgrader": { + "cell_type": "code", + "checksum": "c79af42d4bcad4f31318b09915a254d3", + "grade": false, + "grade_id": "cell-22fedfdf6e7061dc", + "locked": false, + "schema_version": 3, + "solution": true + } + }, + "outputs": [], + "source": [ + "# define sigmoid function according to formula (5)\n", + "def sigmoid_func(x):\n", + " return 1/(1 + np.exp(-x))\n", + "\n", + "# Choose values (w^T*x) for calculating loss \n", + "range_x = np.arange(-2 , 4 , 0.01)\n", + "\n", + "# Calculate logistic loss for y=1\n", + "logloss_y1 = -np.log(sigmoid_func(range_x))\n", + "\n", + "### STUDENT TASK ###\n", + "# squaredloss_y1 = ...\n", + "# YOUR CODE HERE\n", + "raise NotImplementedError()\n", + "\n", + "# Font sizes for matplotlib\n", + "plt.rc('legend', fontsize=12) \n", + "plt.rc('axes', labelsize=16) \n", + "plt.rc('xtick', labelsize=12) \n", + "plt.rc('ytick', labelsize=12) \n", + "\n", + "# Plot the results\n", + "# IMPORTANT!: Please don't change below code for plotting, else the tests will fail and you will lose points\n", + "fig, axes = plt.subplots(1, 1, figsize=(10, 6))\n", + "axes.plot(range_x, logloss_y1, linestyle=':', label=r'logistic loss $y=1$', linewidth=3.0)\n", + "axes.plot(range_x, squaredloss_y1/2, label=r'squared error for $y=1$', linewidth=2.0)\n", + "\n", + "# Set axes label and legend\n", + "axes.set_xlabel(r'$\\mathbf{w}^{T}\\mathbf{x}$')\n", + "axes.set_ylabel(r'$\\mathcal{L}((y,\\mathbf{x});\\mathbf{w})$')\n", + "axes.legend()\n", + "plt.show()" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "deletable": false, + "editable": false, + "nbgrader": { + "cell_type": "markdown", + "checksum": "874e312cec51278d5db79e64f7ba360c", + "grade": false, + "grade_id": "cell-02571ba51c2e57d6", + "locked": true, + "schema_version": 3, + "solution": false, + "task": false + } + }, + "source": [ + "We can see that logistic loss for y=1 (true label) decreasing with $\\mathbf{w}^{T}\\mathbf{x}$ getting larger.\n", + "\n", + "In contrast, squared error loss increasing when the $\\mathbf{w}^{T}\\mathbf{x}$ is far from 1 (true label). Therefore, squared error loss for the datapoints which are $\\mathbf{w}^{T}\\mathbf{x}$>0, but far from true label (in this case y=1) will be large, which is bad loss function for assessing the quality of a predictor $h^{(\\mathbf{w})}(\\mathbf{x})$\n", + "\n", + "\\begin{equation}\n", + "\\hat{y} = \\begin{cases} 1 & \\mbox{ for } h^{(\\mathbf{w})}(\\mathbf{x}) \\geq 0 \\\\ 0 & \\mbox{ for } h^{(\\mathbf{w})}(\\mathbf{x}) < 0. 
\\end{cases}\n", + "\\end{equation} \n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "deletable": false, + "editable": false, + "nbgrader": { + "cell_type": "code", + "checksum": "d8b0d62a427b85c130ea12748139e592", + "grade": true, + "grade_id": "cell-7edf2b5d6d7d517e", + "locked": true, + "points": 1, + "schema_version": 3, + "solution": false + } + }, + "outputs": [], + "source": [ + "# Perform some sanity checks on the outputs\n", + "print(f\"First entry of squaredloss_y1: {squaredloss_y1[0]}\")\n", + "print(f\"Last entry of squaredloss_y1: {squaredloss_y1[-1]}\")\n", + "np.testing.assert_allclose(squaredloss_y1[0], 9.0, atol=1e-2, err_msg=\"First entry of squaredloss_y1 should be equal to approximately 9.0\")\n", + "np.testing.assert_allclose(squaredloss_y1[-1], 8.94, atol=1e-2, err_msg=\"Last entry of squaredloss_y1 should be equal to approximately 8.94\")\n", + "print('Sanity check tests passed!')\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "deletable": false, + "editable": false, + "nbgrader": { + "cell_type": "markdown", + "checksum": "4808af6bceba6a8d2e7de0623bb76f48", + "grade": false, + "grade_id": "cell-4e550d3a580bea73", + "locked": true, + "schema_version": 3, + "solution": false + } + }, + "source": [ + "<a id='logisticregression'></a>\n", + "<div class=\" alert alert-warning\">\n", + "<b>Student Task.</b> Logistic Regression. \n", + "\n", + "Use the Python function `sklearn.linear_model.LogisticRegression` to:\n", + "\n", + "* Initialize the logistic regression model with `LogisticRegression(random_state=0,C=1e6)`. Refer to [documentation](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html).\n", + "* Compute the optimal weight vector $\\widehat{\\mathbf{w}}$ which minimizes the average logistic loss on the training data $(\\mathbf{x}^{(i)},y^{(i)})$ for $i=1,\\ldots,m$. You can use the function [`LogisticRegression.fit(X, y)`](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html#sklearn.linear_model.LogisticRegression.fit) which uses as input the feature matrix $\\mathbf{X} \\in \\mathbb{R}^{m \\times n}$ and the label vector $\\mathbf{y}=\\big(y^{(1)},\\ldots,y^{(m)}\\big)^{T}$.\n", + " \n", + "* Predict labels for the image data using the function [`predict(X)`](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html#sklearn.linear_model.LogisticRegression.predict) which uses as input the feature matrix $\\mathbf{X} \\in \\mathbb{R}^{m \\times n}$. The obtained predicted labels should be stored in the variable `y_pred`. Refer to [documentation](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html#sklearn.linear_model.LogisticRegression.predict). \n", + "</div>" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "deletable": false, + "nbgrader": { + "cell_type": "code", + "checksum": "e406bb2e26f93e7ef8fb2939ae6ec540", + "grade": false, + "grade_id": "cell-c2e7b88452b27df0", + "locked": false, + "schema_version": 3, + "solution": true + } + }, + "outputs": [], + "source": [ + "# Load the features and labels\n", + "X = feature_matrix()\n", + "y = labels() \n", + "\n", + "### BEGIN STUDENT TASK ###\n", + "# log_reg = ...\n", + "# log_reg. 
...\n", + "# y_pred = ...\n", + "# YOUR CODE HERE\n", + "raise NotImplementedError()" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "deletable": false, + "editable": false, + "nbgrader": { + "cell_type": "code", + "checksum": "783c379d334d06ebd30eeaf6b9774bab", + "grade": true, + "grade_id": "cell-510a06d027433f2c", + "locked": true, + "points": 3, + "schema_version": 3, + "solution": false + } + }, + "outputs": [], + "source": [ + "# Tests\n", + "assert y_pred.shape == (178,), \"y_pred has wrong dimensions.\"\n", + "print('Sanity check tests passed!')\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "deletable": false, + "editable": false, + "nbgrader": { + "cell_type": "markdown", + "checksum": "0f5cf46c401732b543c9daea263948a6", + "grade": false, + "grade_id": "cell-a3956e310092c75b", + "locked": true, + "schema_version": 3, + "solution": false + } + }, + "source": [ + "### Decision Boundary of Logistic Regression \n", + "\n", + "After we have learnt the predictor function, or its weight $\\widehat{\\mathbf{w}}$, we can classify any new image according to $\\hat{y}=1$ if $h(\\mathbf{x}) = \\widehat{\\mathbf{w}}^{T} \\mathbf{x} > 0$ and $\\hat{y}=0$ otherwise. Thus, our classifier divides all image-samples into two halfspaces of $\\mathbb{R}^{2}$ (since we are only using two features to characterize an image). \n", + "\n", + "The image-samples which are classified as $\\hat{y} =1$ belong to the halfspace $\\{ \\mathbf{x}: \\hat{\\mathbf{w}}^{T} \\mathbf{x} > 0 \\}$, while all the images which are classified as $\\hat{y} = 0$ belong to the half-space $\\{ \\mathbf{x}: \\hat{\\mathbf{w}}^{T} \\mathbf{x} < 0 \\}$. These two half-spaces are separated by the decision boundary $\\{ \\mathbf{x}: \\hat{\\mathbf{w}}^{T} \\mathbf{x} =0\\}$. \n", + "\n", + "For most training data, the decision boundary determined by logistic regression will not perfectly separate the training data points according to $y^{(i)}=1$ and $y^{(i)}=0$. Thus, we typically have training samples with the same true label but which are located at oposite sides of the decision boundary. However, the decision boundary will be chosen such that on each side one class dominates. \n", + "\n", + "The decision boundary provides also a geometric interpretation of the magnitude $|\\widehat{\\mathbf{w}}^{T} \\mathbf{x}|$ as the normal distance of a data point with features $\\mathbf{x}$ to the decision boundary. Thus, the larger $|\\widehat{\\mathbf{w}}^{T} \\mathbf{x}|$, the farther away is the data point from the decision boundary and, in turn, the more reliable is the predicted label $\\hat{y}$ for this data point. On the other hand, if $|\\widehat{\\mathbf{w}}^{T} \\mathbf{x}| \\approx 0$, then the data point with features $\\mathbf{x}$ is close to the decision boundary, i.e., it is a border case which cannot be classified reliably.\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "deletable": false, + "editable": false, + "nbgrader": { + "cell_type": "markdown", + "checksum": "58ddc6468608b8aeda9fa5ac8dc79799", + "grade": false, + "grade_id": "cell-72420c22ddf65bb7", + "locked": true, + "schema_version": 3, + "solution": false + } + }, + "source": [ + "<a id='demoboundary'></a>\n", + "<div class=\" alert alert-info\">\n", + " \n", + "### Demo. 
Linear Decision Boundary.\n", + "\n", + "The code snippet below learns a linear predictor function $h(\\mathbf{x}) =\\widehat{\\mathbf{w}}^{T} \\mathbf{x}$ using logistic regression using only the first two features $x_{1}$ and $x_{2}$ of an image. It then creates a scatter plot of the image samples $\\big(\\mathbf{x}^{(i)},y^{(i)}\\big)$ for $i=1,\\ldots,m$. All samples with $y^{(i)} = 1$ are indicated by \"x\" while all samples with true label $y^{(i)} =0$ are indicated by \"o\". The scatter plot also indicates the decision boundary $\\{\\mathbf{x}: \\widehat{\\mathbf{w}}^{T} \\mathbf{x}=0 \\}$. \n", + "\n", + "<p>Note that the training data is not perfectly separable by a linear decision boundary. In other words, there is no straight line such that on each side of the line are only data points with the same label. \n", + "</p>\n", + "</div>" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "deletable": false, + "editable": false, + "nbgrader": { + "cell_type": "code", + "checksum": "01975a22091b4aac89a41c988b72d14c", + "grade": false, + "grade_id": "cell-999d7b972b0c0540", + "locked": true, + "schema_version": 3, + "solution": false + } + }, + "outputs": [], + "source": [ + "# Load the features and labels\n", + "X = feature_matrix()\n", + "y = labels() \n", + "\n", + "# Use only the first two features\n", + "X = X[:,:2] \n", + "\n", + "# Create an instance of Logistic Regression Classifier and fit the data.\n", + "log_reg = LogisticRegression(C=1e6, random_state=0)\n", + "log_reg.fit(X, y)\n", + "\n", + "# Get the weights of the fitted model\n", + "w = log_reg.coef_ \n", + "w = w.reshape(-1)\n", + "\n", + "# Get the predicted labels for X\n", + "y_pred = log_reg.predict(X)\n", + "\n", + "# Calculate the accuracy of the predictions\n", + "accuracy = metrics.accuracy_score(y, y_pred)\n", + "print(f\"Accuracy of classification: {round(100*accuracy, 2)}%\")\n", + "\n", + "# Minimum and maximum values of features x1 and x2\n", + "x1_min, x2_min = np.min(X, axis=0)\n", + "x1_max, x2_max = np.max(X, axis=0)\n", + "\n", + "# Plot the decision boundary h(x) = 0\n", + "# for data with 2 features this means w1x1 + w2x2 + bias = 0 --> x2 = (-1/w2)*(w1x1+bias)\n", + "x_grid = np.linspace(x1_min, x1_max, 100)\n", + "y_boundary = (-1/w[1])*(x_grid*w[0] + log_reg.intercept_)\n", + "\n", + "plt.figure(figsize=(10, 6))\n", + "idx_1 = np.where(y == 1)[0] # index of each class 0 image.\n", + "idx_2 = np.where(y == 0)[0] # index of each not class 0 image.\n", + "plt.scatter(X[idx_1, 0], X[idx_1, 1], marker='x', s=50, label=r'$c^{(i)}=0$')\n", + "plt.scatter(X[idx_2, 0], X[idx_2, 1], marker='o', s=50, label=r'$c^{(i)} \\in \\{1,2\\}$')\n", + "plt.plot(x_grid, y_boundary, color='green', label=\"Decision boundary\")\n", + "plt.xlabel(r'$x_{1}$')\n", + "plt.ylabel(r'$x_{2}$')\n", + "plt.title(\"Decision boundary of logistic regression classifier\", fontsize=16)\n", + "plt.xlim(x1_min-.5, x1_max+.5)\n", + "plt.ylim(x2_min-1, x2_max+1)\n", + "plt.legend()\n", + "plt.show()" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "deletable": false, + "editable": false, + "nbgrader": { + "cell_type": "markdown", + "checksum": "4ae0233b38a7e732ad80a1ea75229bde", + "grade": false, + "grade_id": "cell-14d4b1ed797dc0ad", + "locked": true, + "schema_version": 3, + "solution": false, + "task": false + } + }, + "source": [ + " <b><center><font size=4>Bonus Task!</font></center></b> \n", + " \n", + " Bonus task worth of 50 points.\n", + " \n", + " Explain what regularization parameter C 
does in scikit-learn LogisticRegression and how to choose appropriate C. Please send your answer to Alex or the course assistants via e-mail or on Slack!\n", + " " + ] + }, + { + "cell_type": "markdown", + "metadata": { + "deletable": false, + "editable": false, + "nbgrader": { + "cell_type": "markdown", + "checksum": "8508373fd699fb8a54719b4b7d758e5b", + "grade": false, + "grade_id": "cell-c57cc077bea09f6e", + "locked": true, + "schema_version": 3, + "solution": false, + "task": false + } + }, + "source": [ + "### Accuracy - How well did we do?\n", + "After we have computed the optimal weight $\\widehat{\\mathbf{w}}$ using logistic regression, we can calculate the accuracy of the resulting classifier as the fraction of correctly labeled images (for which $y^{(i)} = \\hat{y}^{(i)}$):\n", + "\n", + "\\begin{equation*}\n", + " \\text{Accuracy} =\\dfrac{1}{m} \\sum_{i=1}^{m} \\mathcal{I}(\\hat{y}^{(i)} = y^{(i)})\n", + " \\label{acc}\n", + " \\tag{7}\n", + "\\end{equation*}\n", + "\n", + "Here $\\mathcal{I}(\\hat{y}^{(i)} = y^{(i)})$ denotes the indicator function which is equal to one if $\\hat{y}^{(i)} = y^{(i)}$ and equal to zero otherwise. " + ] + }, + { + "cell_type": "markdown", + "metadata": { + "deletable": false, + "editable": false, + "nbgrader": { + "cell_type": "markdown", + "checksum": "d7461a5ddd138df58313724a8c12785b", + "grade": false, + "grade_id": "cell-c26db4492eed8d77", + "locked": true, + "schema_version": 3, + "solution": false + } + }, + "source": [ + "<a id='logregaccuracy'></a>\n", + "<div class=\" alert alert-warning\">\n", + "<b>Student Task.</b> Compute Accuracy. \n", + " \n", + "Implement a Python function `calculate_accuracy(y, y_hat)` which takes as inputs a vector $\\mathbf{y}=\\big(y^{(1)},\\ldots,y^{(m)}\\big)^{T}$ of true labels and another vector $\\mathbf{\\hat{y}}=\\big(\\hat{y}^{(1)},\\ldots,\\hat{y}^{(m)}\\big)^{T}$ containing predicted labels.\n", + "The function should return the accuracy according to above definition as a **percentage**. Thus, if all samples are classified correctly, the returned value should be $100$. \n", + "</div>" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "deletable": false, + "nbgrader": { + "cell_type": "code", + "checksum": "b33a9905b54a83907ec17e7221c313ce", + "grade": false, + "grade_id": "cell-3b89e9bd39367cc1", + "locked": false, + "schema_version": 3, + "solution": true + } + }, + "outputs": [], + "source": [ + "def calculate_accuracy(y, y_hat):\n", + " \"\"\"\n", + " Calculate accuracy of your prediction\n", + " \n", + " :param y: array-like, shape=(m,), correct label vector\n", + " :param y_hat: array-like, shape=(m,), label-vector prediction\n", + " \n", + " :return: scalar-like, percentual accuracy of your prediction\n", + " \"\"\"\n", + " ### STUDENT TASK ###\n", + " # YOUR CODE HERE\n", + " raise NotImplementedError()\n", + " return accuracy" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "deletable": false, + "editable": false, + "nbgrader": { + "cell_type": "markdown", + "checksum": "cafacfceee12d9f2f6c2bb2711041b2b", + "grade": false, + "grade_id": "cell-4df84611b1439fcd", + "locked": true, + "schema_version": 3, + "solution": false + } + }, + "source": [ + "In the next cell we execute the implemented function and test that it works properly." 
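If you want an independent cross-check of your implementation (optional, not required by the test below), `sklearn.metrics.accuracy_score` computes the same fraction of matching labels. A minimal sketch with made-up label vectors:

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Made-up label vectors, only to illustrate definition (7):
# 4 out of 5 predictions match the true labels, so the accuracy is 80%.
y_true = np.array([1, 0, 1, 1, 0])
y_hat = np.array([1, 0, 0, 1, 0])

print(100 * accuracy_score(y_true, y_hat))   # 80.0
print(100 * np.mean(y_true == y_hat))        # same value, computed directly from (7)
```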
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "deletable": false, + "editable": false, + "nbgrader": { + "cell_type": "code", + "checksum": "28b283a66b6e19f22f87d501fd01af74", + "grade": true, + "grade_id": "cell-62f57c5e32ee62b6", + "locked": true, + "points": 3, + "schema_version": 3, + "solution": false + } + }, + "outputs": [], + "source": [ + "# Get data and select 10 features to be used for prediction\n", + "X = feature_matrix()\n", + "X = X[:,:10]\n", + "y = labels()\n", + "\n", + "# Train logistic regression model and get predicted labels\n", + "log_reg = LogisticRegression(random_state=0)\n", + "log_reg = log_reg.fit(X, y)\n", + "y_pred = log_reg.predict(X)\n", + " \n", + "# Perform some sanity checks on the outputs\n", + "test_acc = calculate_accuracy(y, y_pred)\n", + "print (f'Accuracy of the result is: {test_acc}')\n", + "assert 80 < test_acc < 100, \"Your accuracy should be above 80% and less than 100%\"\n", + "assert test_acc < 99, \"Your accuracy was too good. You are probably not using correct methods.\"\n", + "print('Sanity check tests passed!')\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "deletable": false, + "editable": false, + "nbgrader": { + "cell_type": "markdown", + "checksum": "2ee589f2b9f058bbfd5c65a461f4512f", + "grade": false, + "grade_id": "cell-7ffafb13858edab3", + "locked": true, + "schema_version": 3, + "solution": false + } + }, + "source": [ + "## Multiclass Classification\n", + "\n", + "So far, we have considered the problem of classifying an image as $y=1$ if it belongs to \"class 0\" and as $y=0$ if not, i.e., if it belongs to \"class 1\" or \"class 2\". We have solved this binary classification problem using logistic regression. However, our ultimate goal is to classify an image according to all three categories of the images. \n", + "\n", + "There is a simple but useful trick for **upgrading** any binary classification method to handle more than two different label values or classes. The idea behind this trick which is known as **one vs. rest** is quite simple: just split the multiclass classification problem into several subproblems, each subproblem being one binary classification problem. We then apply a binary classification method (such as logistic regression) to each of the subproblems and combine their results to obtain a predicted label for the multiclass problem. \n", + "\n", + "For the image classification problem, using the three classes \"0\", \"1\" or \"2\", we obtain the following binary classification subproblems: \n", + "\n", + "- subproblem 0: classify samples into \"Class 0\" $(y=1)$ vs. \"not Class 0\" $(y=0)$ \n", + "- subproblem 1: classify samples into \"Class 1\" $(y=1)$ vs. \"not Class 1\" $(y=0)$ \n", + "- subproblem 2: classify samples into \"Class 2\" $(y=1)$ vs. \"not Class 2\" $(y=0)$\n", + "\n", + "Each subproblem amounts to testing if the image belongs to a particular class or not. The $k$th subproblem can be solved using logistic regression yielding a predictor $h^{(\\mathbf{w}_{k})}(\\mathbf{x})= (\\mathbf{w}_{k})^{T} \\mathbf{x}$. The predictor $h^{(\\mathbf{w}_{k})}(\\mathbf{x})= (\\mathbf{w}_{k})^{T} \\mathbf{x}$ indicates how likely the image belongs to the class $k$. We then assign the image to those class $k$ for which $h^{(\\mathbf{w}_{k})}(\\mathbf{x})$ is largest. \n", + "\n", + "### Example\n", + "\n", + "Assume we want to classify a new data point (which is different from the $m$ data points in our dataset). 
To this end, we compute the feature vector $\\mathbf{x}=(x_{1},x_{2},...,x_{13})^{T}$ of this new data point and apply the three subproblem predictors, yielding the following prediction values: \n", + "\n", + "* subproblem 0: $h^{(\\mathbf{w}_{0})}(\\mathbf{x}) = 0.1$ (\"Class 0 vs. not Class 0\")\n", + "* subproblem 1: $h^{(\\mathbf{w}_{1})}(\\mathbf{x}) = 0.4$ (\"Class 1 vs. not Class 1\") \n", + "* subproblem 2: $h^{(\\mathbf{w}_{2})}(\\mathbf{x}) = 0.8$ (\"Class 2 vs. not Class 2\")\n", + "\n", + "From these results, we can see that the predictor $h^{(\\mathbf{w}^{(\\rm Class 2)})}(x)$ for subproblem 3 (`Class 2` vs. `not Class 2`) yields the highest confidence. Hence, we classify this new data point as `Class 2`. \n", + "\n", + "<img src=\"../../../coursedata/R3_Classification/Regression_Zebra.png\" alt=\"Drawing\" style=\"width: 400px;\"/>\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "deletable": false, + "editable": false, + "nbgrader": { + "cell_type": "markdown", + "checksum": "09b61d342ecdb6a546c83d5a19b9a7b8", + "grade": false, + "grade_id": "cell-07f770c2370f39a3", + "locked": true, + "schema_version": 3, + "solution": false + } + }, + "source": [ + "<a id='demoboundary'></a>\n", + "<div class=\" alert alert-info\">\n", + " \n", + "### Demo. Multiclass Classification.\n", + "\n", + "The code snippet below illustrates how multiclass classification via logistic regression can be implemented using the `scikit-learn` Python library ([click here](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html))\n", + "\n", + "</div>\n", + "\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "deletable": false, + "editable": false, + "nbgrader": { + "cell_type": "code", + "checksum": "8333826239cdc3bee137ae914a53f08e", + "grade": false, + "grade_id": "cell-68c53dcd4442a408", + "locked": true, + "schema_version": 3, + "solution": false + } + }, + "outputs": [], + "source": [ + "# Load features and choose the first 8 features\n", + "X = feature_matrix()\n", + "X = X[:,:8]\n", + "\n", + "# Load labels. This time, keep all classes as is (0,1,2)\n", + "y = pd.read_csv(\"/coursedata/R3_Classification/image_labels.csv\", header=None).to_numpy()\n", + "y = y.reshape(-1)\n", + "\n", + "# Fit logistic regression model\n", + "log_reg = LogisticRegression(random_state=0, multi_class=\"ovr\") # set multi_class to one versus rest ('ovr')\n", + "log_reg = log_reg.fit(X, y)\n", + "\n", + "# Predict labels and probabilities\n", + "y_pred = log_reg.predict(X)\n", + "pred_probabilities = log_reg.predict_proba(X)\n", + "\n", + "print(f\"Predicted classes: {set(y_pred)}\")\n", + "print(f\"Predicted probability of each class for first data point: {pred_probabilities[0]}\")\n", + "print(f\"Predicted class for first data point: {y_pred[0]}\")\n", + "print(f\"True class for first data point: {y[0]}\")" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "deletable": false, + "editable": false, + "nbgrader": { + "cell_type": "markdown", + "checksum": "e755bb8899284c8ce6f3c320aaee5919", + "grade": false, + "grade_id": "cell-bd7b40d03f750e19", + "locked": true, + "schema_version": 3, + "solution": false + } + }, + "source": [ + "### Confusion Matrix\n", + "\n", + "Computing the accuracy, as the fraction of correctly classified data points for which $\\hat{y}^{(i)}=y^{(i)}$, is only one possible way to check how well you did. In some applications the accuracy is not very useful as a quality measure. 
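As a small, made-up illustration of this point, consider a trivial classifier that always predicts the majority class on a heavily imbalanced label vector; its accuracy looks excellent even though it never detects the rare class:

```python
import numpy as np

# Hypothetical imbalanced labels: 95 data points of class 0, only 5 of class 1
y = np.array([0] * 95 + [1] * 5)

# A trivial "classifier" that always predicts the majority class 0
y_hat = np.zeros_like(y)

print(100 * np.mean(y == y_hat))  # 95.0 % accuracy, yet class 1 is never detected
```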
In particular, for applications where the different classes occur with significantly different frequency (\"imbalanced data\"). A more fine-grained assessment of a classification method is provided by computing the confusion matrix. The confusion matrix considers the perfomance of a classifier individually for each possible value of the true label. In contrast, the accuracy is an average measure that averages over all possible label values.\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "deletable": false, + "editable": false, + "nbgrader": { + "cell_type": "markdown", + "checksum": "57f501bdcd59fc1cdd6a35b5f345d850", + "grade": false, + "grade_id": "cell-2ca0ad43155b3cf7", + "locked": true, + "schema_version": 3, + "solution": false + } + }, + "source": [ + "<a id='logregconf'></a>\n", + "<div class=\" alert alert-info\">\n", + " \n", + "### Demo. Confusion Matrix. \n", + "\n", + "\n", + "Confusion matrix can be visualized with [`sklearn.metrics.plot_confusion_matrix`](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.plot_confusion_matrix.html#sklearn-metrics-plot-confusion-matrix). This function returns object `display`, where confusion matrix can be accessed with `display.confusion_matrix`.\n", + "\n", + "The entry in the $i$th row and $j$th colums of the confusion matrix is the number of images that have true label $y=i$ but are classified as $\\hat{y}=j$. \n", + "\n", + "You can read more about the confusion matrix and why it is useful at: https://en.wikipedia.org/wiki/Confusion_matrix\n", + " \n", + "</div> " + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "from sklearn.metrics import plot_confusion_matrix\n", + "\n", + "# Define class labels for confusion matrices\n", + "classes = ['Class 0','Class 1','Class 2']\n", + "\n", + "# Define plotting options (title, normalization, axes index)\n", + "options = [(\"Confusion matrix\", None, 0),\n", + " (\"Normalized confusion matrix\", 'true', 1)]\n", + "\n", + "# Plot confusion matrices\n", + "fig, axes = plt.subplots(1, 2, figsize=(15, 5)) \n", + "plt.rc('font', size=14)\n", + "for title, normalize, ax_idx in options:\n", + " # main parameters of function `plot_confusion_matrix` are:\n", + " # trained classifier (log_reg), data (X, y)\n", + " disp = plot_confusion_matrix(log_reg, X, y,\n", + " display_labels=classes,\n", + " cmap=plt.cm.Blues,\n", + " normalize=normalize, ax=axes[ax_idx])\n", + " disp.ax_.set_title(title)\n", + " print(title + \":\")\n", + " # Disp.confusion_matrix returns confusion matrix as np.array\n", + " print(f\"{disp.confusion_matrix}\\n\")\n", + "\n", + "plt.show()" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "deletable": false, + "editable": false, + "nbgrader": { + "cell_type": "markdown", + "checksum": "41c3011c8054325ae82f7010837484df", + "grade": false, + "grade_id": "cell-a0ee4435a6c55ab0", + "locked": true, + "schema_version": 3, + "solution": false + } + }, + "source": [ + "## Reliability of (Confidence in) Classifications\n", + "\n", + "We now show how to use logistic regression to get a measure of reliability of the predicted label $\\hat{y}$. \n", + "\n", + "In particular, given an image with features $\\mathbf{x}$, logistic regression computes the predicted label $\\hat{y}$ using the sign of $h^{(\\mathbf{w})}(\\mathbf{x})=\\mathbf{w}^{T} \\mathbf{x}$. Moreover, we can use the magnitude of $h^{(\\mathbf{w})}=\\mathbf{w}^{T} \\mathbf{x}$ as a measure for the reliability of the classification. 
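To make this concrete, the following sketch (with hypothetical values of $\mathbf{w}^{T}\mathbf{x}$) shows how the magnitude of $h^{(\mathbf{w})}(\mathbf{x})$ translates into the estimated probability discussed below:

```python
import numpy as np

def sigmoid(z):
    """Logistic function that maps w^T x to an estimated probability."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical values of w^T x for three images
scores = np.array([3.2, 0.1, -2.5])
print(sigmoid(scores))  # approximately [0.96, 0.52, 0.08]

# large |w^T x|  -> probability close to 0 or 1 (confident classification)
# small |w^T x|  -> probability close to 1/2    (unreliable classification)
```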
If this value is too small, we conclude that logistic regression was not able to reliably classify the image and we should send it to a human for a more rigorous analysis. \n", + "\n", + "In what follows, we will apply a Python library function `log_reg.predict_proba(X)` to compute a confidence measure for the resulting classification $\\hat{y}$. Instead of using the magnitude of $h^{(\\mathbf{w})}=\\mathbf{w}^{T} \\mathbf{x}$, they use a related but different measure for the reliability. In particular, for a binary classification problem, this method computes the (estimated) probability ${\\rm Prob}(y=1; \\mathbf{w})= \\frac{1}{1+{\\rm exp}(-\\mathbf{w}^{T}\\mathbf{x})}$ that the true label is $1$. " + ] + }, + { + "cell_type": "markdown", + "metadata": { + "deletable": false, + "editable": false, + "nbgrader": { + "cell_type": "markdown", + "checksum": "7e22e87ff96bfc8261dd47e904977d8d", + "grade": false, + "grade_id": "cell-5e63cf757270a439", + "locked": true, + "schema_version": 3, + "solution": false + } + }, + "source": [ + "<a id='logregprobs'></a>\n", + "<div class=\" alert alert-warning\">\n", + "<b>Student Task.</b> Confidence in Classifications. \n", + "\n", + "Remember that logistic regression computes an optimal weight vector $\\widehat{\\mathbf{w}}$ for a linear predictor funtion $h(\\mathbf{x}) = \\mathbf{w}^{T} \\mathbf{x}$ in order to minimize the average logistic loss incurred over some given labeled data points $(\\mathbf{x}^{(i)},y^{(i)})$, $i=1,\\ldots,m$ for which we already know the true labels $y^{(i)}$. \n", + "\n", + "One of the appealing properties of logistic regression is that it not only provides a tool for classifying data points, i.e., computing a predicted label $\\hat{y}$, but also allows to quantify the reliability of (confidence in) the predicted label $\\hat{y}$. \n", + "\n", + "Logistic regression uses a probabilistic model that allows to compute the (estimated) probabilities ${\\rm Prob}(y=c|\\widehat{\\mathbf{w}})$ that the true label $y$ takes on a particular value $c$, e.g., $c=0,1,2$ in the image labelling application. Given an image with features $\\mathbf{x}$, we choose the predicted label $\\hat{y}$ as the particular value $c \\in \\{0,1,2\\}$ which yields the maximum probability ${\\rm Prob}(y=c|\\widehat{\\mathbf{w}})$. However, if this maximum probability ${\\rm Prob}(y=\\widehat{y}|\\widehat{\\mathbf{w}})$ is close to $1/2$ then the classification should be considered highly unreliable. \n", + " \n", + "- Use the Python function `log_reg.predict_proba(X)`, which reads in the feature matrix $\\mathbf{X} \\in \\mathbb{R}^{m \\times n}$ whose rows contain the feature vectors $\\mathbf{x}^{(i)}$. For each data point with features $\\mathbf{x}^{(i)}$, the function computes the probabilities ${\\rm Prob}(y^{(i)}=c|\\widehat{\\mathbf{w}})$ of the true label $y^{(i)}$ belonging to the classes $c=\\{0,1,2\\}$.\n", + "- The Python function `log_reg.predict_proba(X)` returns a numpy array of shape (m,3) which represents a matrix \n", + "$\\mathbf{T} \\in \\mathbb{R}^{m \\times 3}$. The $i$th row of $\\mathbf{T}$ represents the probabities ${\\rm Prob}(y^{(i)}=0|\\widehat{\\mathbf{w}})$, ${\\rm Prob}(y^{(i)}=1|\\widehat{\\mathbf{w}})$ and ${\\rm Prob}(y^{(i)}=2|\\widehat{\\mathbf{w}})$. 
The predicted label $\\hat{y}$ is then the label $c$ with the maximum probability, \n", + "$$ {\\rm Prob}(y^{(i)}= \\hat{y}^{(i)}|\\widehat{\\mathbf{w}}) = \\max_{c\\in \\{0,1,2\\}} {\\rm Prob}(y^{(i)}= c|\\widehat{\\mathbf{w}}).$$\n", + "\n", + "- Count the data points for which the predictions have a confidence of less than 90%. E.g., if predictions for a data point are \"class 0\": 89%, \"class 1\": 6% and \"class 2\": 5%, then the sample is discarded since we are not confident enough in the classification (which would be $\\hat{y}=0$ in this case). Store the total number of discarded data points in the variable `n_discarded`. \n", + "\n", + "\n", + "Hint: For more information, we refer to the [documentation](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html#sklearn.linear_model.LogisticRegression.predict_proba).\n", + "</div>" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "deletable": false, + "nbgrader": { + "cell_type": "code", + "checksum": "6cef53f4044fa6fbd2c6754fa4d7508f", + "grade": false, + "grade_id": "cell-dfba9a6dac8bc476", + "locked": false, + "schema_version": 3, + "solution": true + } + }, + "outputs": [], + "source": [ + "# Predict the probabilities\n", + "# y_probs = ...\n", + "# YOUR CODE HERE\n", + "raise NotImplementedError()\n", + "\n", + "# Show the predicted probabilities of the first five data points\n", + "print('First five samples and their probabilities of belonging to classes 0, 1 and 2:')\n", + "for i in range(5):\n", + " print(f\"Probabilities of Sample {i+1}: Class 0: {round(100*y_probs[i][0],2)}%, Class 1: {round(100*y_probs[i][1],2)}%, Class 2: {round(100*y_probs[i][2],2)}%\")\n", + "\n", + "n_discarded = 0\n", + "# YOUR CODE HERE\n", + "raise NotImplementedError()\n", + "print('Number of discarded samples:', n_discarded)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "deletable": false, + "editable": false, + "nbgrader": { + "cell_type": "code", + "checksum": "ff82f9dc77b1d3d45894c1a79140a041", + "grade": true, + "grade_id": "cell-ef79bd22eecd6d3d", + "locked": true, + "points": 3, + "schema_version": 3, + "solution": false + } + }, + "outputs": [], + "source": [ + "# Perform some sanity checks on the outputs\n", + "assert n_discarded > 10, 'Number of discarded samples should be above 10.'\n", + "print('Sanity check tests passed!')\n", + "\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "deletable": false, + "editable": false, + "nbgrader": { + "cell_type": "markdown", + "checksum": "b2da595fb185180348a99bdcf66dcf8e", + "grade": false, + "grade_id": "cell-ed0fbfd2850ca1d9", + "locked": true, + "schema_version": 3, + "solution": false + } + }, + "source": [ + "## Decision Trees\n", + "\n", + "We will learn another classification method which is referred to as **decision trees**. A decision tree is a flowchart-like representation of a predictor function $h(\\mathbf{x})$ that reads in the features $\\mathbf{x}$ of a data point and ouputs a predicted label $\\hat{y}=h(\\mathbf{x})$. The decision tree consists of **nodes** which represent certain tests, e.g., \"is the first feature $x_{1}$ larger than 10?\". The nodes are connected by **branches** that correspond to the result or outcome of a test (there is one outgoing branch for each possible answer of a test node). By following the branches, we end up at a leaf node (which has no further branches). Each leaf node is associated with a certain output value $h(\\mathbf{x})$. 
The picture below depicts a decision tree with test nodes colored blue and leaf nodes colored orange and green. \n", + "\n", + "<img src=\"../../../coursedata/R3_Classification/Decision_Tree.png\" alt=\"Drawing\" style=\"width: 400px;\"/>\n", + "\n", + "Now you might wonder how do we choose the test nodes? The basic idea is the same as in linear or logistic regression, we try out many different decision trees (using different choices of test nodes) and pick the one which results in the smallest average loss incurred on some labeled training data points $(\\mathbf{x}^{(i)},y^{(i)})$. However, in contrast to logistic regression, this learning or optimization problem involves searching over a discrete set of different configurations of test nodes instead of a continuous convex optimization of a weight vector $\\mathbf{w}$. This makes learning decision trees compupationally more challenging compared to logistic regression which allows to use efficient convex optimization methods (such as plain gradient descent). However, there have been developed clever ways to learn good decision trees with a reasonable amout of computational resources. \n", + "\n", + "Video on basic concept of decision trees:\n", + "\n", + "- https://www.youtube.com/watch?v=9w16p4QmkAI\n", + "\n", + "If you want to learn more details about decision trees, beyond the requirements of this course, we refer you to: \n", + "\n", + "- https://en.wikipedia.org/wiki/Decision_tree_learning#Gini_impurity\n", + "- https://en.wikipedia.org/wiki/Decision_tree_learning#Information_gain\n", + "- https://sebastianraschka.com/faq/docs/decisiontree-error-vs-entropy.html" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "deletable": false, + "editable": false, + "nbgrader": { + "cell_type": "markdown", + "checksum": "fa2a354420dc2b46d9d005d4560de6f8", + "grade": false, + "grade_id": "cell-9970cfed7abbad39", + "locked": true, + "schema_version": 3, + "solution": false + } + }, + "source": [ + "<a id='demotreeboundary'></a>\n", + "<div class=\" alert alert-info\">\n", + "<b>Demo.</b> Decision Boundary of a Decision Tree.\n", + "\n", + "The code snippet below learns a predictor function $h(\\mathbf{x})$ using decision trees based on the first two features $x_{1}$ and $x_{2}$ of the images. It then creates a scatter plot of the training samples $(\\mathbf{x}^{(i)},y^{(i)}$. All samples with $y^{(i)} = 1$ are indicated by \"x\" while all samples with true label $y^{(i)} =0$ are indicated by \"o\". The scatter plot also indicates the decision boundary $\\{\\mathbf{x}: \\widehat{\\mathbf{w}}^{T} \\mathbf{x}=0 \\}$. \n", + "\n", + "Note that the training data is not perfectly separable by a linear decision boundary. 
\n", + "\n", + "</div>" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "def plot_decision_boundary(clf, X, Y, cmap='Paired_r'):\n", + " \"\"\"Function with which to plot decision boundary\"\"\"\n", + " h = 0.02\n", + " x_min, x_max = X[:,0].min() - 10*h, X[:,0].max() + 10*h\n", + " y_min, y_max = X[:,1].min() - 10*h, X[:,1].max() + 10*h\n", + " x = np.arange(x_min, x_max, h)\n", + " y = np.arange(y_min, y_max, h)\n", + " xx, yy = np.meshgrid(x, y)\n", + " Z = clf.predict(np.c_[xx.ravel(), yy.ravel()])\n", + " Z = Z.reshape(xx.shape)\n", + " \n", + " indx_1 = np.where(Y == 1)[0] # index of each class 0 iamge.\n", + " indx_2 = np.where(Y == 0)[0] # index of each not class 0 image\n", + " \n", + " plt.figure(figsize=(10,6))\n", + " plt.contourf(xx, yy, Z, cmap=cmap, alpha=0.25)\n", + " plt.contour(xx, yy, Z, colors='k', linewidths=0.5)\n", + " plt.scatter(X[indx_1, 0], X[indx_1, 1],marker='x',label='class 0', edgecolors='k')\n", + " plt.scatter(X[indx_2, 0], X[indx_2, 1],marker='o',label='class 1', edgecolors='k')\n", + " plt.xlabel(r'Feature 1')\n", + " plt.ylabel(r'Feature 2')" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "deletable": false, + "editable": false, + "nbgrader": { + "cell_type": "code", + "checksum": "8de6893ae2db2dbe2d9485b4a17d3195", + "grade": false, + "grade_id": "cell-ec23f55820ad7734", + "locked": true, + "schema_version": 3, + "solution": false + } + }, + "outputs": [], + "source": [ + "# Load data and select only the first two features\n", + "X = feature_matrix()\n", + "X = X[:,:2]\n", + "y = labels()\n", + "\n", + "clf = DecisionTreeClassifier() # define object \"clf\" which represents a decision tree\n", + "clf.fit(X, y) # learn a decision tree that fits well the labeled images \n", + "y_pred = clf.predict(X) # compute the predicted labels for the images\n", + "\n", + "# Calculate the accuracy score of the predictions\n", + "accuracy = metrics.accuracy_score(y, y_pred)\n", + "print(f\"Accuracy: {round(100*accuracy, 2)}%\")\n", + "\n", + "# Plot decision boundary\n", + "plot_decision_boundary(clf, X, y)\n", + "plt.show()" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "deletable": false, + "editable": false, + "nbgrader": { + "cell_type": "markdown", + "checksum": "acdbb2d675d2b87914b10a42a3a1c7ee", + "grade": false, + "grade_id": "cell-66550b25821ecf7d", + "locked": true, + "schema_version": 3, + "solution": false + } + }, + "source": [ + "<a id='dtclassifier'></a>\n", + "<div class=\" alert alert-warning\">\n", + "<b>Student Task.</b> Decision Tree Classifier.\n", + " \n", + "Create a decision tree classifier using the sklearn DecisionTreeClassifier imported in the previous cell. Use the following parameters for the classifier: `DecisionTreeClassifier(random_state=0, criterion='entropy')`. The argument `criterion` corresponds to a particular choice for the loss function to be used. For background information consult the [`documentation`](https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html).\n", + "\n", + "Choose or learn a good decision tree using the [`fit`](https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html#sklearn.tree.DecisionTreeClassifier.fit) function. 
Using the learnt decision tree, compute the predicted labels $\\hat{y}^{(i)}$ for the training data using the function [`DecisionTreeClassifier.predict`](https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html#sklearn.tree.DecisionTreeClassifier.predict) and save it to numpy array `y_pred`.\n", + "</div>" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "deletable": false, + "nbgrader": { + "cell_type": "code", + "checksum": "e02819db07fe3f9fbeeb2e2b13f34c5c", + "grade": false, + "grade_id": "cell-53c7cbd776f98a85", + "locked": false, + "schema_version": 3, + "solution": true + } + }, + "outputs": [], + "source": [ + "# Load data to feature matrix X and label vector y \n", + "X = feature_matrix()\n", + "y = pd.read_csv(\"/coursedata/R3_Classification/image_labels.csv\", header = None).to_numpy()\n", + "y = y.reshape(-1)\n", + "feature_cols = [\"x\" + str(i) for i in range(len(X[0,:]))] # needed for visualization\n", + "\n", + "### STUDENT TASK ###\n", + "# clf = ...\n", + "# clf. ...\n", + "# y_pred = ...\n", + "# accuracy = ...\n", + "# YOUR CODE HERE\n", + "raise NotImplementedError()\n", + "\n", + "# Model Accuracy, how often is the classifier correct?\n", + "print(\"Accuracy:\", round(100*accuracy, 2), '%')" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "deletable": false, + "editable": false, + "nbgrader": { + "cell_type": "code", + "checksum": "bb9b467f1ed53af7e66cdeedd1c24c07", + "grade": true, + "grade_id": "cell-785d28c1d47fecbf", + "locked": true, + "points": 3, + "schema_version": 3, + "solution": false + } + }, + "outputs": [], + "source": [ + "# Perform some sanity checks on the outputs\n", + "assert X.shape == (178, 13), \"Training set label matrix has wrong dimensions.\"\n", + "assert y.shape == (178,), \"label vector has wrong dimensions.\"\n", + "assert y_pred.shape == (178,), \"Prediction vector has wrong dimensions.\"\n", + "print('Sanity check tests passed!')\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "deletable": false, + "editable": false, + "nbgrader": { + "cell_type": "markdown", + "checksum": "a8f857e5eafd96f5545c0ca077339f3f", + "grade": false, + "grade_id": "cell-9967527de1fab4cc", + "locked": true, + "schema_version": 3, + "solution": false + } + }, + "source": [ + "<a id='dtcm'></a>\n", + "<div class=\" alert alert-warning\">\n", + "<b>Student Task.</b> Confusion Matrix.\n", + "\n", + "Use the `plot_confusion_matrix` of `sklearn.metrics` to calculate and visualize non-normalized confusion matrix for data (X, y) and classifier (clf) obtained from the previous student task. Store the resulting confusion matrix in a variable named `cm`.\n", + "\n", + "Hints:\n", + "\n", + "* Look at the Demo.Confusion Matrix. 
\n", + "* Object returned by `plot_confusion_matrix` (`display`) is used to access confusion matrix with `display.confusion_matrix`.\n", + "</div>" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "deletable": false, + "nbgrader": { + "cell_type": "code", + "checksum": "0e9169c6c170d87c5e5f8afe6b83e8e8", + "grade": false, + "grade_id": "cell-aaf67e0222724ba5", + "locked": false, + "schema_version": 3, + "solution": true, + "task": false + } + }, + "outputs": [], + "source": [ + "# Define class names and title for confusion matrix\n", + "classes = ['Class 0','Class 1','Class 2']\n", + "title = \"Confusion matrix, without normalization\"\n", + "\n", + "fig, axes = plt.subplots(1, 1, figsize=(8, 5)) \n", + "\n", + "# Create a display object with plot_confusion_matrix\n", + "# Set parameters display_labels=classes, cmap=plt.cm.Blues, ax=axes \n", + "# disp = ...\n", + "# cm = ...\n", + "# YOUR CODE HERE\n", + "raise NotImplementedError()\n", + "\n", + "# Print title and confusion matrix\n", + "print(title)\n", + "print(cm)\n", + "\n", + "plt.show()" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "deletable": false, + "editable": false, + "nbgrader": { + "cell_type": "code", + "checksum": "6f0b440e1502ac197d143ff4986bfd6a", + "grade": true, + "grade_id": "cell-695db031852aab15", + "locked": true, + "points": 1, + "schema_version": 3, + "solution": false + } + }, + "outputs": [], + "source": [ + "# Perform some sanity check on the result\n", + "assert cm.shape == (3,3), \"Confusion Matrix has wrong dimensions.\"\n", + "print('Sanity check tests passed!')\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "deletable": false, + "editable": false, + "nbgrader": { + "cell_type": "markdown", + "checksum": "a54370c6c030aaa26dc2dd0787dd2d90", + "grade": false, + "grade_id": "cell-77e3903a0aa76769", + "locked": true, + "schema_version": 3, + "solution": false + } + }, + "source": [ + "<a id='dtvis'></a>\n", + "<div class=\" alert alert-info\">\n", + " \n", + "### Demo. Visualizing the decision tree. 
\n", + "- Run the below cell to visualize the decision tree.\n", + "</div>" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "deletable": false, + "editable": false, + "nbgrader": { + "cell_type": "code", + "checksum": "3916efdba2124d92375f253862e79c41", + "grade": false, + "grade_id": "cell-cb4bc6e0bf21e9a6", + "locked": true, + "schema_version": 3, + "solution": false + } + }, + "outputs": [], + "source": [ + "# Load libraries\n", + "from sklearn.tree import export_graphviz\n", + "from sklearn.externals.six import StringIO \n", + "from IPython.display import Image \n", + "import pydotplus\n", + "\n", + "X = feature_matrix()\n", + "feature_cols = [\"x\" + str(i) for i in range(len(X[0,:]))] # needed for visualization\n", + "\n", + "# Visualize the decision tree\n", + "dot_data = StringIO()\n", + "export_graphviz(clf, out_file=dot_data, \n", + " filled=True, rounded=True,\n", + " special_characters=True,feature_names = feature_cols,class_names=['0','1', '2'])\n", + "graph = pydotplus.graph_from_dot_data(dot_data.getvalue())\n", + "Image(graph.create_png())" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "deletable": false, + "editable": false, + "nbgrader": { + "cell_type": "markdown", + "checksum": "7816286f45ad7e13c51b5d7096cd5c28", + "grade": false, + "grade_id": "cell-d5344511aa2b9686", + "locked": true, + "schema_version": 3, + "solution": false + } + }, + "source": [ + "### Difference between logistic regression and decision trees\n", + "\n", + "The two classification methods logistic regression and decision trees both aim at learning a good predictor $h(\\mathbf{x})$ which allows determining the label $y$ of the data point based on some features $\\mathbf{x}$. These two classification methods differ in the form of predictor function $h(\\mathbf{x})$ they are using. Logistic regression uses linear predictor functions $h(\\mathbf{x})=\\mathbf{w}^{T} \\mathbf{x}$ (which are thresholded to get discrete label predictions $\\hat{y}$). \n", + "\n", + "In contrast to linear functions used in logistic regression, decision trees use predictor functions that are obtained from flow charts (decision trees) consisting of various tests on the features $\\mathbf{x}$. Using sufficiently large decision trees allows to represent highly non-linear functions $h(\\mathbf{x})$. In particular, decision trees can perfectly separate data points (according to their labels) which cannot be separated by any straight line (which are the only possible decision boundaries for logistic regression). \n", + "\n", + "<table><tr>\n", + " <td><img src='../../../coursedata/R3_Classification/lr1.png' style=\"width: 300px;\"></td>\n", + " <td><img src='../../../coursedata/R3_Classification/tree1.png' style=\"width: 300px;\"></td>\n", + "</tr></table>" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "deletable": false, + "editable": false, + "nbgrader": { + "cell_type": "markdown", + "checksum": "dced0ae96ea864b86ad43e22a09166ea", + "grade": false, + "grade_id": "cell-61864c669c276aa4", + "locked": true, + "schema_version": 3, + "solution": false, + "task": false + } + }, + "source": [ + "<a id='Bonus trees'></a>\n", + "<div class=\" alert alert-warning\">\n", + " <b>Bonus Task.</b> Descision trees. \n", + " \n", + "Bonus task worth of 50 points.\n", + " \n", + "We can see that with descision trees classification we can achieve 100% accuracy on training dataset. Not surprisingly, this can lead to overfitting and poor performance on validation datasets. 
Explain, how to regulate decision tree complexity and avoid overfitting?\n", + "</div>" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "deletable": false, + "editable": false, + "nbgrader": { + "cell_type": "markdown", + "checksum": "54b116e3c5d467d4ce82d01dea5b4177", + "grade": false, + "grade_id": "cell-75e7a30646e71d67", + "locked": true, + "schema_version": 3, + "solution": false + } + }, + "source": [ + "## Take Home Quiz " + ] + }, + { + "cell_type": "markdown", + "metadata": { + "deletable": false, + "editable": false, + "nbgrader": { + "cell_type": "markdown", + "checksum": "2d3eff8671d4d80d7c8d9a924499490f", + "grade": false, + "grade_id": "cell-30450576e6be42cf", + "locked": true, + "schema_version": 3, + "solution": false + } + }, + "source": [ + "Try to answer the following questions by setting the `answer_R3_Q??` variable for each question to the number of the correct answer. For example, if you think that the second answer in the first quizz question is the right one, then set `answer_R3_Q1=2`. " + ] + }, + { + "cell_type": "markdown", + "metadata": { + "deletable": false, + "editable": false, + "nbgrader": { + "cell_type": "markdown", + "checksum": "69fcb270573fce73b463fbc14e0f6511", + "grade": false, + "grade_id": "cell-fe3d2c11cdd46ede", + "locked": true, + "schema_version": 3, + "solution": false + } + }, + "source": [ + "<a id='QuestionR3_1'></a>\n", + "<div class=\" alert alert-warning\">\n", + "<b>Student Task.</b> Question R3.1. \n", + "\n", + "<p>How many features can be used for logistic regression?</p>\n", + "\n", + "<ol>\n", + " <li>None</li>\n", + " <li>One (1)</li>\n", + " <li>Thirteen (13)</li>\n", + " <li>Any number of features (given enough computational resources)</li>\n", + "</ol> \n", + "\n", + "</div>" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "deletable": false, + "nbgrader": { + "cell_type": "code", + "checksum": "1ae9538f2563e810a9f0d4d25c07683a", + "grade": false, + "grade_id": "cell-ed5e2813d39894af", + "locked": false, + "schema_version": 3, + "solution": true + } + }, + "outputs": [], + "source": [ + "# answer_Q1\n", + "\n", + "# answer_R3_Q1 = ...\n", + "# YOUR CODE HERE\n", + "raise NotImplementedError()" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "deletable": false, + "editable": false, + "nbgrader": { + "cell_type": "code", + "checksum": "e5f46be87e9299a73eb90ae511eb3ce3", + "grade": true, + "grade_id": "cell-436d8f13df175189", + "locked": true, + "points": 1, + "schema_version": 3, + "solution": false + } + }, + "outputs": [], + "source": [ + "# this cell is for tests\n", + "assert answer_R3_Q1 in [1,2,3,4], '\"answer_R3_Q1\" Value should be an integer between 1 and 4.'\n", + "print('Sanity check tests passed!')\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "deletable": false, + "editable": false, + "nbgrader": { + "cell_type": "markdown", + "checksum": "907253cb2a2c65b21cbf9950fa50d7c4", + "grade": false, + "grade_id": "cell-5241e1c99a943e6d", + "locked": true, + "schema_version": 3, + "solution": false + } + }, + "source": [ + "<a id='QuestionR3_2'></a>\n", + "<div class=\" alert alert-warning\">\n", + "<b>Student Task.</b> Question R3.2. 
\n", + "\n", + "<p>When performing logistic regression, we are trying to....</p>\n", + "\n", + "<ol>\n", + " <li>Solving a minimum likelihood problem.</li>\n", + " <li>Maximize the average logistic loss.</li>\n", + " <li>Minimize the average logistic loss.</li>\n", + "</ol> \n", + "\n", + "</div>" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "deletable": false, + "nbgrader": { + "cell_type": "code", + "checksum": "669c8e1208a066aa66a4038378e12452", + "grade": false, + "grade_id": "cell-28b9b67d4ee831b7", + "locked": false, + "schema_version": 3, + "solution": true + } + }, + "outputs": [], + "source": [ + "# answer_Q2\n", + "\n", + "# answer_R3_Q2 = ...\n", + "# YOUR CODE HERE\n", + "raise NotImplementedError()" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "deletable": false, + "editable": false, + "nbgrader": { + "cell_type": "code", + "checksum": "4cc7a691ecf1d0d41c44ff7a85cb4df8", + "grade": true, + "grade_id": "cell-4716309299410987", + "locked": true, + "points": 1, + "schema_version": 3, + "solution": false + } + }, + "outputs": [], + "source": [ + "# this cell is for tests\n", + "assert answer_R3_Q2 in [1,2,3], '\"answer_R3_Q2\" Value should be an integer between 1 and 3.'\n", + "print('Sanity check tests passed!')\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "deletable": false, + "editable": false, + "nbgrader": { + "cell_type": "markdown", + "checksum": "52f59f5c6d5c98900b2dc15b1d7b9367", + "grade": false, + "grade_id": "cell-07adb8f984372edf", + "locked": true, + "schema_version": 3, + "solution": false + } + }, + "source": [ + "<a id='QuestionR3_3'></a>\n", + "<div class=\" alert alert-warning\">\n", + "<b>Student Task.</b> Question R3.3. \n", + "\n", + "<p>Consider an arbitrary set of $m$ labeled data points having two features $\\mathbf{x}^{(i)} \\in \\big(x^{(i)}_{1},x^{(i)}_{2}\\big)^{T}$ and a binary label $y^{(i)} \\in \\{0,1\\}$. How large can the sample size $m$ be such that we can for sure always find a straight line such that all points $\\mathbf{x}^{(i)}$ with the same label $y^{(i)}$ lie on the same side (but not on top) of the line. 
</p>\n", + "\n", + "<ol>\n", + " <li>$m \\leq 2$</li>\n", + " <li>$m = 3$</li>\n", + " <li>$m = 4$</li>\n", + " <li>$m = 6$</li>\n", + "</ol> \n", + "\n", + "</div>" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "deletable": false, + "nbgrader": { + "cell_type": "code", + "checksum": "d970f3f38f8aeb5db7c864e7d34d4d6f", + "grade": false, + "grade_id": "cell-f9e25d95622fae74", + "locked": false, + "schema_version": 3, + "solution": true + } + }, + "outputs": [], + "source": [ + "# answer_Q3\n", + "\n", + "# answer_R3_Q3 = ...\n", + "# YOUR CODE HERE\n", + "raise NotImplementedError()" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "deletable": false, + "editable": false, + "nbgrader": { + "cell_type": "code", + "checksum": "6a33cb12f6820e7e08123493a926e5af", + "grade": true, + "grade_id": "cell-909e127af7a42f2d", + "locked": true, + "points": 1, + "schema_version": 3, + "solution": false + } + }, + "outputs": [], + "source": [ + "# this cell is for tests\n", + "\n", + "\n", + "assert answer_R3_Q3 in [1,2,3,4], '\"answer_R3_Q3\" Value should be an integer between 1 and 4.'\n", + "print('Sanity check tests passed!')\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "deletable": false, + "editable": false, + "nbgrader": { + "cell_type": "markdown", + "checksum": "da050d974a7794221cff5be0318f1857", + "grade": false, + "grade_id": "cell-5c0377d1f8952d32", + "locked": true, + "schema_version": 3, + "solution": false, + "task": false + } + }, + "source": [ + "<a id='QuestionR3.4.'></a>\n", + "<div class=\" alert alert-warning\">\n", + "<b>Student Task.</b> Question R3.4. \n", + "Maximizing the probability (or likelihood) of the labels $y^{(i)}$, for $i=1,\\ldots,m$ belonging to a class is:\n", + "\n", + "1. Equivalent to maximizing logistic loss\n", + "2. Equivalent to minimizing logistic loss\n", + "3. 
Maximum likelihood problem is not related to logistic loss\n", + "</div>" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "deletable": false, + "nbgrader": { + "cell_type": "code", + "checksum": "53e6f0a38adc3937e94adb8487c8d441", + "grade": false, + "grade_id": "cell-47b671d4c55f6b4d", + "locked": false, + "schema_version": 3, + "solution": true, + "task": false + } + }, + "outputs": [], + "source": [ + "# answer_Q4\n", + "\n", + "# answer_R3_Q4 = ...\n", + "# YOUR CODE HERE\n", + "raise NotImplementedError()" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "deletable": false, + "editable": false, + "nbgrader": { + "cell_type": "code", + "checksum": "92c49bdcd6815ad1582c963de90fecd6", + "grade": true, + "grade_id": "cell-39ba770c34d6dcb1", + "locked": true, + "points": 1, + "schema_version": 3, + "solution": false, + "task": false + } + }, + "outputs": [], + "source": [ + "# this cell is for tests\n", + "assert answer_R3_Q4 in [1,2,3], '\"answer_R3_Q4\" Value should be an integer between 1 and 3.'\n", + "print('Sanity check tests passed!')\n" + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.7.3" + }, + "toc": { + "base_numbering": 1, + "nav_menu": {}, + "number_sections": false, + "sideBar": true, + "skip_h1_title": false, + "title_cell": "Table of Contents", + "title_sidebar": "Contents", + "toc_cell": false, + "toc_position": { + "height": "calc(100% - 180px)", + "left": "10px", + "top": "150px", + "width": "304.8px" + }, + "toc_section_display": true, + "toc_window_display": true + }, + "varInspector": { + "cols": { + "lenName": 16, + "lenType": 16, + "lenVar": 40 + }, + "kernels_config": { + "python": { + "delete_cmd_postfix": "", + "delete_cmd_prefix": "del ", + "library": "var_list.py", + "varRefreshCmd": "print(var_dic_list())" + }, + "r": { + "delete_cmd_postfix": ") ", + "delete_cmd_prefix": "rm(", + "library": "var_list.r", + "varRefreshCmd": "cat(var_dic_list()) " + } + }, + "types_to_exclude": [ + "module", + "function", + "builtin_function_or_method", + "instance", + "_Feature" + ], + "window_display": false + } + }, + "nbformat": 4, + "nbformat_minor": 2 +} -- GitLab