diff --git a/exercise4.ipynb b/exercise4.ipynb
new file mode 100644
index 0000000000000000000000000000000000000000..09e9fcc54f36dd9b4a82096966cd366b441c6a84
--- /dev/null
+++ b/exercise4.ipynb
@@ -0,0 +1,1419 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "# Exercise 4"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 2,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import pandas as p\n",
+    "import numpy as np\n",
+    "import scipy\n",
+    "import scipy.stats"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## 1. Two-variable tests with toy data"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 3,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "A = (34, 23, 51, 47, 34)\n",
+    "B = (48, 27, 33, 45, 41, 35)\n",
+    "C = (34, 53, 54, 28, 52, 29)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "- Choose suitable statistical tests to compare pairs A&B, A&C and B&C. Justify your choices. What hypotheses do the tests concern? <br>\n",
+    "- Calculate the P-values. What can you conclude based on the observed p-values?"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 4,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "0.6185050948808057 0.8776995343877783 0.04821299036621768\n"
+     ]
+    }
+   ],
+   "source": [
+    "#testing if the data is normally distributed\n",
+    "print(scipy.stats.shapiro(A)[1],\n",
+    "scipy.stats.shapiro(B)[1],\n",
+    "scipy.stats.shapiro(C)[1])"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
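+    "The Shapiro-Wilk test does not reject normality for A (p = 0.62) or B (p = 0.88), but it does for C (p = 0.048 < 0.05). <br>\n",
+    "For A&B I chose the unpaired t-test, because both samples are consistent with a normal distribution and are assumed to be independent; its null hypothesis is that the two population means are equal.<br>\n",
+    "For A&C and B&C I chose the Mann-Whitney U test, because C is not normally distributed and the samples are assumed to be independent; its null hypothesis is that the two samples come from the same distribution."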
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 43,
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "text/plain": [
+       "0.9507932942353805"
+      ]
+     },
+     "execution_count": 43,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "#A&B\n",
+    "scipy.stats.ttest_ind(A,B)[1]"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "The means of A and B do not differ significantly: the p-value (0.95) is far above 0.05, so the null hypothesis of equal means is not rejected."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 6,
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "text/plain": [
+       "0.5189924682098411"
+      ]
+     },
+     "execution_count": 6,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "#A&C\n",
+    "scipy.stats.mannwhitneyu(A,C)[1]"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Because the p-value (0.52) is well above 0.05, A and C do not differ statistically significantly."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 7,
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "text/plain": [
+       "0.5887445887445888"
+      ]
+     },
+     "execution_count": 7,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "#B&C\n",
+    "scipy.stats.mannwhitneyu(B,C)[1]"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Because the p-value (0.59) is well above 0.05, B and C do not differ statistically significantly."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "- The description of the data is deliberately vague. Can you come up with other plausible tests for each pair?"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
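+    "We don't know whether the samples are related, because the data description is vague. If they were paired, we could use the paired t-test for A&B and the Wilcoxon signed-rank test for A&C and B&C. As given, though, A has five values while B and C have six, so only B&C could be tested as pairs without dropping a value."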
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## 2. More two-variable tests"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "- Select the correct statistical tests to compare the following pairs and calculate the P-values. What hypotheses do the tests concern? What can you conclude based on the observed p-values?"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 8,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "#Group D\n",
+    "testD = (5.6, 3.1, 8.7, 4.5, 6.7, 4.5)\n",
+    "re_testD = (6.1, 5.8, 8.5, 5.3, 7.2, 5.1)\n",
+    "\n",
+    "#Group E\n",
+    "testE = (4.5, 3.9, 7.1, 4.3, 6.9, 8.2, 7.6)\n",
+    "re_testE = (4.9, 4.7, 7.8, 4.8, 7.5, 7.8, 8.1)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 9,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "testD: 0.8002394946806008 re-testD: 0.3768611442017307 \n",
+      " testE: 0.16666537264976267 re-testE: 0.014585478888414122\n"
+     ]
+    }
+   ],
+   "source": [
+    "#testing if the data is normally distributed\n",
+    "print(\"testD:\", scipy.stats.shapiro(testD)[1],\n",
+    "\"re-testD:\", scipy.stats.shapiro(re_testD)[1],\"\\n\",\n",
+    "\"testE:\", scipy.stats.shapiro(testE)[1],\n",
+    "\"re-testE:\", scipy.stats.shapiro(re_testE)[1])"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Based on the Shapiro-Wilk p-values, testD, re-testD and testE are consistent with a normal distribution, but re-testE is not (p = 0.015). The test and re-test within a group are paired, but the two groups are independent of each other. <br>\n",
+    "- That is why I chose the paired t-test for testD and re-testD within group D\n",
+    "- the Wilcoxon signed-rank test for testE and re-testE within group E\n",
+    "- the unpaired t-test for testD and testE\n",
+    "- and the Mann-Whitney U test for re-testD and re-testE\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 10,
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "text/plain": [
+       "0.09740501217589806"
+      ]
+     },
+     "execution_count": 10,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "#test and re-test in group D\n",
+    "scipy.stats.ttest_rel(testD,re_testD)[1]"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "There is no significant difference between the testD and the re-testD, because the P-value > 0.05."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 11,
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "text/plain": [
+       "0.03125"
+      ]
+     },
+     "execution_count": 11,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "#test and re-test in group E\n",
+    "scipy.stats.wilcoxon(testE,re_testE)[1]"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "There is a significant difference between testE and re-testE, because the p-value (0.031) < 0.05."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 12,
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "text/plain": [
+       "0.6040909505950958"
+      ]
+     },
+     "execution_count": 12,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "#testD and testE\n",
+    "scipy.stats.ttest_ind(testD,testE)[1]"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "There is no significant difference between the testD and the testE, because the P-value > 0.05."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 13,
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "text/plain": [
+       "0.9429784240576059"
+      ]
+     },
+     "execution_count": 13,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "#re-testD and re-testE\n",
+    "scipy.stats.mannwhitneyu(re_testD,re_testE)[1]"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "There is no significant difference between the re-testD and the re-testE, because the P-value > 0.05."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## 3. More two-variable tests (continued)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "- Calculate the Pearson correlation coefficient and its P-value."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 14,
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "text/plain": [
+       "PearsonRResult(statistic=0.9773664314916523, pvalue=0.00014623458861244028)"
+      ]
+     },
+     "execution_count": 14,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "scipy.stats.pearsonr(testE,re_testE)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "- Calculate the Spearman correlation coefficient and its P-value."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 15,
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "text/plain": [
+       "SignificanceResult(statistic=0.9369749612033814, pvalue=0.0018510301964418925)"
+      ]
+     },
+     "execution_count": 15,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "scipy.stats.spearmanr(testE,re_testE)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "- What hypotheses do the tests concern? What can you conclude based on the observed p-values?"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Both tests have the null hypothesis that testE and re-testE are uncorrelated. The Pearson correlation coefficient is 0.977, which indicates a very strong positive linear relationship between testE and re-testE. Because the p-value (about 0.00015) is so small, the correlation is significant. <br>\n",
+    "<br>\n",
+    "The Spearman correlation coefficient is 0.937, which likewise indicates a strong positive monotonic relationship between the test and re-test scores. Because the p-value (about 0.0019) is well below 0.05, this correlation is also significant. "
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## 4. Advertisements"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 16,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "df1 = p.read_csv('ads-image.csv', \n",
+    "                 # use row 0 as column names\n",
+    "                header=0,\n",
+    "                sep=\",\")"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 17,
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "text/html": [
+       "<div>\n",
+       "<style scoped>\n",
+       "    .dataframe tbody tr th:only-of-type {\n",
+       "        vertical-align: middle;\n",
+       "    }\n",
+       "\n",
+       "    .dataframe tbody tr th {\n",
+       "        vertical-align: top;\n",
+       "    }\n",
+       "\n",
+       "    .dataframe thead th {\n",
+       "        text-align: right;\n",
+       "    }\n",
+       "</style>\n",
+       "<table border=\"1\" class=\"dataframe\">\n",
+       "  <thead>\n",
+       "    <tr style=\"text-align: right;\">\n",
+       "      <th></th>\n",
+       "      <th>person_id</th>\n",
+       "      <th>amount_spent</th>\n",
+       "    </tr>\n",
+       "  </thead>\n",
+       "  <tbody>\n",
+       "    <tr>\n",
+       "      <th>0</th>\n",
+       "      <td>87434</td>\n",
+       "      <td>14.01</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>1</th>\n",
+       "      <td>24844</td>\n",
+       "      <td>10.46</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>2</th>\n",
+       "      <td>23566</td>\n",
+       "      <td>21.32</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>3</th>\n",
+       "      <td>79972</td>\n",
+       "      <td>121.10</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>4</th>\n",
+       "      <td>37170</td>\n",
+       "      <td>14.26</td>\n",
+       "    </tr>\n",
+       "  </tbody>\n",
+       "</table>\n",
+       "</div>"
+      ],
+      "text/plain": [
+       "   person_id  amount_spent\n",
+       "0      87434         14.01\n",
+       "1      24844         10.46\n",
+       "2      23566         21.32\n",
+       "3      79972        121.10\n",
+       "4      37170         14.26"
+      ]
+     },
+     "execution_count": 17,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "df1.head()"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 18,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "df2 = p.read_csv('ads-video.csv', \n",
+    "                 # use row 0 as column names\n",
+    "                header=0,\n",
+    "                sep=\",\")"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 19,
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "text/html": [
+       "<div>\n",
+       "<style scoped>\n",
+       "    .dataframe tbody tr th:only-of-type {\n",
+       "        vertical-align: middle;\n",
+       "    }\n",
+       "\n",
+       "    .dataframe tbody tr th {\n",
+       "        vertical-align: top;\n",
+       "    }\n",
+       "\n",
+       "    .dataframe thead th {\n",
+       "        text-align: right;\n",
+       "    }\n",
+       "</style>\n",
+       "<table border=\"1\" class=\"dataframe\">\n",
+       "  <thead>\n",
+       "    <tr style=\"text-align: right;\">\n",
+       "      <th></th>\n",
+       "      <th>person_id</th>\n",
+       "      <th>amount_spent</th>\n",
+       "    </tr>\n",
+       "  </thead>\n",
+       "  <tbody>\n",
+       "    <tr>\n",
+       "      <th>0</th>\n",
+       "      <td>5370</td>\n",
+       "      <td>12.73</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>1</th>\n",
+       "      <td>60667</td>\n",
+       "      <td>25.42</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>2</th>\n",
+       "      <td>96716</td>\n",
+       "      <td>22.08</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>3</th>\n",
+       "      <td>99764</td>\n",
+       "      <td>0.52</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>4</th>\n",
+       "      <td>75990</td>\n",
+       "      <td>49.63</td>\n",
+       "    </tr>\n",
+       "  </tbody>\n",
+       "</table>\n",
+       "</div>"
+      ],
+      "text/plain": [
+       "   person_id  amount_spent\n",
+       "0       5370         12.73\n",
+       "1      60667         25.42\n",
+       "2      96716         22.08\n",
+       "3      99764          0.52\n",
+       "4      75990         49.63"
+      ]
+     },
+     "execution_count": 19,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "df2.head()"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "- Is there statistical evidence to claim that the total amount spent by customers is different if they click on image advertisements than on video advertisements?\n",
+    "- Explain the assumptions you made about how the data was collected and how it affected your choice of the test."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 20,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "1.2337474438356741e-14\n"
+     ]
+    },
+    {
+     "data": {
+      "text/plain": [
+       "6.508535542505141e-13"
+      ]
+     },
+     "execution_count": 20,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "#check if the datasets are normally distributed\n",
+    "print(scipy.stats.shapiro(df1.iloc[:,1])[1])\n",
+    "scipy.stats.shapiro(df2.iloc[:,1])[1]"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Neither sample is normally distributed. I also assume that the amount_spent values in the two datasets are not paired, because I don't think the same customers were followed on both image ads and video ads, in the same order. That is why I chose the Mann-Whitney U test."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 21,
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "text/plain": [
+       "0.10522752647898526"
+      ]
+     },
+     "execution_count": 21,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "scipy.stats.mannwhitneyu(df1.iloc[:,1],df2.iloc[:,1])[1]"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "With a p-value of about 0.105 (> 0.05), the test gives no statistical evidence that the total amount spent by customers differs between image ads and video ads. "
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Different approach:<br>\n",
+    "The two datasets can be treated as paired for the customers they share, but the rows are not in the same order and the datasets do not contain the same number of customers. So I merged the dataframes on person_id (inner join) and ran the statistical test on the merged dataframe. "
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 22,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "paired_df = p.merge(df1,df2, on=\"person_id\", how = \"inner\")"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 23,
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "text/html": [
+       "<div>\n",
+       "<style scoped>\n",
+       "    .dataframe tbody tr th:only-of-type {\n",
+       "        vertical-align: middle;\n",
+       "    }\n",
+       "\n",
+       "    .dataframe tbody tr th {\n",
+       "        vertical-align: top;\n",
+       "    }\n",
+       "\n",
+       "    .dataframe thead th {\n",
+       "        text-align: right;\n",
+       "    }\n",
+       "</style>\n",
+       "<table border=\"1\" class=\"dataframe\">\n",
+       "  <thead>\n",
+       "    <tr style=\"text-align: right;\">\n",
+       "      <th></th>\n",
+       "      <th>person_id</th>\n",
+       "      <th>amount_spent_x</th>\n",
+       "      <th>amount_spent_y</th>\n",
+       "    </tr>\n",
+       "  </thead>\n",
+       "  <tbody>\n",
+       "    <tr>\n",
+       "      <th>0</th>\n",
+       "      <td>87434</td>\n",
+       "      <td>14.01</td>\n",
+       "      <td>40.54</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>1</th>\n",
+       "      <td>24844</td>\n",
+       "      <td>10.46</td>\n",
+       "      <td>78.26</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>2</th>\n",
+       "      <td>23566</td>\n",
+       "      <td>21.32</td>\n",
+       "      <td>5.66</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>3</th>\n",
+       "      <td>79972</td>\n",
+       "      <td>121.10</td>\n",
+       "      <td>24.05</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>4</th>\n",
+       "      <td>37170</td>\n",
+       "      <td>14.26</td>\n",
+       "      <td>48.67</td>\n",
+       "    </tr>\n",
+       "  </tbody>\n",
+       "</table>\n",
+       "</div>"
+      ],
+      "text/plain": [
+       "   person_id  amount_spent_x  amount_spent_y\n",
+       "0      87434           14.01           40.54\n",
+       "1      24844           10.46           78.26\n",
+       "2      23566           21.32            5.66\n",
+       "3      79972          121.10           24.05\n",
+       "4      37170           14.26           48.67"
+      ]
+     },
+     "execution_count": 23,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "paired_df.head()"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 24,
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "text/plain": [
+       "(171, 3)"
+      ]
+     },
+     "execution_count": 24,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "paired_df.shape"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 25,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "2.655470011519053e-11\n"
+     ]
+    },
+    {
+     "data": {
+      "text/plain": [
+       "8.710116256139686e-13"
+      ]
+     },
+     "execution_count": 25,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "#check if the variables are normally distributed\n",
+    "print(scipy.stats.shapiro(paired_df.iloc[:,1])[1])\n",
+    "scipy.stats.shapiro(paired_df.iloc[:,2])[1]"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Neither variable is normally distributed after merging either. So let's use the Wilcoxon signed-rank test."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 26,
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "text/plain": [
+       "0.2537141492862989"
+      ]
+     },
+     "execution_count": 26,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "scipy.stats.wilcoxon(paired_df.iloc[:,1],paired_df.iloc[:,2])[1]"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "The conclusion is the same: with a p-value of about 0.254 (> 0.05), this test also gives no statistical evidence that the total amount spent by customers differs between image ads and video ads."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## 5. Electric bikes (continued)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "- The model answer from the previous task is used as the basis for my own answer."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 27,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "df = p.read_csv('bikes.data.csv')\n",
+    "\n",
+    "categorical = [\n",
+    "     'ticket',\n",
+    "     'month',\n",
+    "     'location_from',\n",
+    "     'location_to',\n",
+    "     'assistance',\n",
+    "]\n",
+    "quantitative = [\n",
+    "     'cost',\n",
+    "     'duration',\n",
+    "     'distance',\n",
+    "]"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 28,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# remove outliers\n",
+    "\n",
+    "df = df[df.cost < 100]\n",
+    "df = df[df.duration < 10000]\n",
+    "\n",
+    "# fix negative distances\n",
+    "\n",
+    "df.distance = np.abs(df.distance)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 29,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# remove rides that are likely just tests\n",
+    "\n",
+    "df = df[df.distance != 0]\n",
+    "df = df[df.duration >= 60]\n",
+    "# (zero costs are likely due to season tickets)\n",
+    "\n",
+    "# Further properties to consider\n",
+    "# - rides with very high average speed\n",
+    "# - rides that start and end in the same location\n",
+    "# - assistance off/on vs. energy used/collected"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "- Is there statistical evidence to claim that the travel times tend to be shorter or longer for the single than for the season ticket type?"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 30,
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "text/html": [
+       "<div>\n",
+       "<style scoped>\n",
+       "    .dataframe tbody tr th:only-of-type {\n",
+       "        vertical-align: middle;\n",
+       "    }\n",
+       "\n",
+       "    .dataframe tbody tr th {\n",
+       "        vertical-align: top;\n",
+       "    }\n",
+       "\n",
+       "    .dataframe thead th {\n",
+       "        text-align: right;\n",
+       "    }\n",
+       "</style>\n",
+       "<table border=\"1\" class=\"dataframe\">\n",
+       "  <thead>\n",
+       "    <tr style=\"text-align: right;\">\n",
+       "      <th></th>\n",
+       "      <th>ticket</th>\n",
+       "      <th>cost</th>\n",
+       "      <th>month</th>\n",
+       "      <th>location_from</th>\n",
+       "      <th>location_to</th>\n",
+       "      <th>duration</th>\n",
+       "      <th>distance</th>\n",
+       "      <th>assistance</th>\n",
+       "      <th>energy_used</th>\n",
+       "      <th>energy_collected</th>\n",
+       "    </tr>\n",
+       "  </thead>\n",
+       "  <tbody>\n",
+       "    <tr>\n",
+       "      <th>0</th>\n",
+       "      <td>single</td>\n",
+       "      <td>0.35</td>\n",
+       "      <td>9</td>\n",
+       "      <td>MICROTEKNIA</td>\n",
+       "      <td>PUIJONLAAKSO</td>\n",
+       "      <td>411.0</td>\n",
+       "      <td>2150</td>\n",
+       "      <td>1</td>\n",
+       "      <td>19.0</td>\n",
+       "      <td>2.7</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>1</th>\n",
+       "      <td>single</td>\n",
+       "      <td>1.20</td>\n",
+       "      <td>5</td>\n",
+       "      <td>SATAMA</td>\n",
+       "      <td>KEILANKANTA</td>\n",
+       "      <td>1411.0</td>\n",
+       "      <td>7130</td>\n",
+       "      <td>1</td>\n",
+       "      <td>53.8</td>\n",
+       "      <td>15.3</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>2</th>\n",
+       "      <td>savonia</td>\n",
+       "      <td>0.00</td>\n",
+       "      <td>9</td>\n",
+       "      <td>TASAVALLANKATU</td>\n",
+       "      <td>NEULAMÄKI</td>\n",
+       "      <td>1308.0</td>\n",
+       "      <td>5420</td>\n",
+       "      <td>1</td>\n",
+       "      <td>43.0</td>\n",
+       "      <td>9.9</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>3</th>\n",
+       "      <td>savonia</td>\n",
+       "      <td>0.00</td>\n",
+       "      <td>10</td>\n",
+       "      <td>TORI</td>\n",
+       "      <td>KAUPPAKATU</td>\n",
+       "      <td>1036.0</td>\n",
+       "      <td>1180</td>\n",
+       "      <td>1</td>\n",
+       "      <td>6.5</td>\n",
+       "      <td>2.1</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>4</th>\n",
+       "      <td>single</td>\n",
+       "      <td>0.30</td>\n",
+       "      <td>9</td>\n",
+       "      <td>TORI</td>\n",
+       "      <td>TORI</td>\n",
+       "      <td>319.0</td>\n",
+       "      <td>1120</td>\n",
+       "      <td>1</td>\n",
+       "      <td>13.7</td>\n",
+       "      <td>1.2</td>\n",
+       "    </tr>\n",
+       "  </tbody>\n",
+       "</table>\n",
+       "</div>"
+      ],
+      "text/plain": [
+       "    ticket  cost  month   location_from   location_to  duration  distance  \\\n",
+       "0   single  0.35      9     MICROTEKNIA  PUIJONLAAKSO     411.0      2150   \n",
+       "1   single  1.20      5          SATAMA   KEILANKANTA    1411.0      7130   \n",
+       "2  savonia  0.00      9  TASAVALLANKATU     NEULAMÄKI    1308.0      5420   \n",
+       "3  savonia  0.00     10            TORI    KAUPPAKATU    1036.0      1180   \n",
+       "4   single  0.30      9            TORI          TORI     319.0      1120   \n",
+       "\n",
+       "   assistance  energy_used  energy_collected  \n",
+       "0           1         19.0               2.7  \n",
+       "1           1         53.8              15.3  \n",
+       "2           1         43.0               9.9  \n",
+       "3           1          6.5               2.1  \n",
+       "4           1         13.7               1.2  "
+      ]
+     },
+     "execution_count": 30,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "df.head()"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 31,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "#variables for the durations for each ticket type\n",
+    "single_duration = df[df['ticket'] == 'single']['duration']\n",
+    "season_duration = df[df['ticket'] == 'season']['duration']"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 32,
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "text/html": [
+       "<div>\n",
+       "<style scoped>\n",
+       "    .dataframe tbody tr th:only-of-type {\n",
+       "        vertical-align: middle;\n",
+       "    }\n",
+       "\n",
+       "    .dataframe tbody tr th {\n",
+       "        vertical-align: top;\n",
+       "    }\n",
+       "\n",
+       "    .dataframe thead th {\n",
+       "        text-align: right;\n",
+       "    }\n",
+       "</style>\n",
+       "<table border=\"1\" class=\"dataframe\">\n",
+       "  <thead>\n",
+       "    <tr style=\"text-align: right;\">\n",
+       "      <th></th>\n",
+       "      <th>duration</th>\n",
+       "      <th>distance</th>\n",
+       "      <th>cost</th>\n",
+       "      <th>count</th>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>ticket</th>\n",
+       "      <th></th>\n",
+       "      <th></th>\n",
+       "      <th></th>\n",
+       "      <th></th>\n",
+       "    </tr>\n",
+       "  </thead>\n",
+       "  <tbody>\n",
+       "    <tr>\n",
+       "      <th>savonia</th>\n",
+       "      <td>138370.0</td>\n",
+       "      <td>494990</td>\n",
+       "      <td>2.00</td>\n",
+       "      <td>218</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>season</th>\n",
+       "      <td>317094.0</td>\n",
+       "      <td>1315560</td>\n",
+       "      <td>3.00</td>\n",
+       "      <td>457</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>single</th>\n",
+       "      <td>637522.0</td>\n",
+       "      <td>2559320</td>\n",
+       "      <td>552.35</td>\n",
+       "      <td>815</td>\n",
+       "    </tr>\n",
+       "  </tbody>\n",
+       "</table>\n",
+       "</div>"
+      ],
+      "text/plain": [
+       "         duration  distance    cost  count\n",
+       "ticket                                    \n",
+       "savonia  138370.0    494990    2.00    218\n",
+       "season   317094.0   1315560    3.00    457\n",
+       "single   637522.0   2559320  552.35    815"
+      ]
+     },
+     "execution_count": 32,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "summary = df.groupby('ticket')[['duration', 'distance', 'cost']].sum()\n",
+    "summary['count'] = df.ticket.value_counts()\n",
+    "summary"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 33,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "0        411.0\n",
+      "1       1411.0\n",
+      "4        319.0\n",
+      "5       1185.0\n",
+      "6        817.0\n",
+      "         ...  \n",
+      "1762    1597.0\n",
+      "1763    1066.0\n",
+      "1767    4400.0\n",
+      "1768    1167.0\n",
+      "1770     199.0\n",
+      "Name: duration, Length: 815, dtype: float64\n",
+      "8        797.0\n",
+      "11      1014.0\n",
+      "12       822.0\n",
+      "16       905.0\n",
+      "28       233.0\n",
+      "         ...  \n",
+      "1755     113.0\n",
+      "1756     256.0\n",
+      "1761     934.0\n",
+      "1765     696.0\n",
+      "1773     478.0\n",
+      "Name: duration, Length: 457, dtype: float64\n"
+     ]
+    }
+   ],
+   "source": [
+    "print(single_duration)\n",
+    "print(season_duration)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 34,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "1.6945360817801978e-35 4.119831123406427e-10\n"
+     ]
+    }
+   ],
+   "source": [
+    "# Shapiro-Wilk test: check whether each duration sample is normally distributed\n",
+    "print(scipy.stats.shapiro(single_duration)[1],\n",
+    "      scipy.stats.shapiro(season_duration)[1])"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Neither sample is normally distributed (both Shapiro-Wilk p-values are far below 0.05). Every row is its own journey, so I assume the samples are not paired. That is why I chose the Mann-Whitney U test. Because we care about the direction of the difference, I test the one-sided alternative that the season-ticket durations are significantly greater than the single-ticket durations."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 35,
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "text/plain": [
+       "0.8260791304465724"
+      ]
+     },
+     "execution_count": 35,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "scipy.stats.mannwhitneyu(season_duration, single_duration, alternative=\"greater\")[1]"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "With a p-value of about 0.83, we cannot reject the null hypothesis: the test gives no evidence that the season-ticket durations are significantly greater than the single-ticket durations."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 36,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "sample_size = len(season_duration)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 37,
+   "metadata": {},
+   "outputs": [],
+   "source": [
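+    "# Draw a reproducible random subsample of single-ticket durations with the same length as the season sample\n",
+    "sample_single = single_duration.sample(n=sample_size, random_state=1)"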
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 38,
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "text/plain": [
+       "PearsonRResult(statistic=-0.0984255769075546, pvalue=0.03542681854380087)"
+      ]
+     },
+     "execution_count": 38,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
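+    "# Pearson correlation between the two equal-length samples; the journeys are not paired, so interpret the result with caution\n",
+    "scipy.stats.pearsonr(season_duration, sample_single)"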
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "- Is there statistical evidence to claim that the savonia ticket type differs from the others with respect to how often the electric assistance is used?"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 39,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "single_assistance = df[df['ticket'] == 'single']['assistance']\n",
+    "season_assistance = df[df['ticket'] == 'season']['assistance']\n",
+    "savonia_assistance = df[df['ticket'] == 'savonia']['assistance']"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 40,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "4.90915909646714e-49 1.7328341801106013e-40 1.261976416979814e-26\n"
+     ]
+    }
+   ],
+   "source": [
+    "#check if variables are normally distributed\n",
+    "print(scipy.stats.shapiro(single_assistance)[1],\n",
+    "      scipy.stats.shapiro(season_assistance)[1],\n",
+    "      scipy.stats.shapiro(savonia_assistance)[1])"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
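+    "None of the three samples is normally distributed and the samples are not paired, so I use the Mann-Whitney U test to compare savonia with each of the other two ticket types."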
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 41,
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "text/plain": [
+       "0.0010338714188769154"
+      ]
+     },
+     "execution_count": 41,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "scipy.stats.mannwhitneyu(savonia_assistance, single_assistance)[1]"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 42,
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "text/plain": [
+       "1.00709460090142e-05"
+      ]
+     },
+     "execution_count": 42,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "scipy.stats.mannwhitneyu(savonia_assistance, season_assistance)[1]"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
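+    "Both p-values are well below 0.05 (about 0.001 against single and about 0.00001 against season), so there is statistically significant evidence that the savonia ticket type differs from the other two in how often the electric assistance is used."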
+   ]
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "Python 3",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.11.0"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 2
+}