diff --git a/exercise4.ipynb b/exercise4.ipynb new file mode 100644 index 0000000000000000000000000000000000000000..09e9fcc54f36dd9b4a82096966cd366b441c6a84 --- /dev/null +++ b/exercise4.ipynb @@ -0,0 +1,1419 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Exercise 4" + ] + }, + { + "cell_type": "code", + "execution_count": 2, + "metadata": {}, + "outputs": [], + "source": [ + "import pandas as p\n", + "import numpy as np\n", + "import scipy\n", + "import scipy.stats" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## 1. Two-variable tests with toy data" + ] + }, + { + "cell_type": "code", + "execution_count": 3, + "metadata": {}, + "outputs": [], + "source": [ + "A = (34, 23, 51, 47, 34)\n", + "B = (48, 27, 33, 45, 41, 35)\n", + "C = (34, 53, 54, 28, 52, 29)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "- Choose suitable statistical tests to compare pairs A&B, A&C and B&C. Justify your choices. What hypotheses do the tests concern? <br>\n", + "- Calculate the P-values. What can you conclude based on the observed p-values?" + ] + }, + { + "cell_type": "code", + "execution_count": 4, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "0.6185050948808057 0.8776995343877783 0.04821299036621768\n" + ] + } + ], + "source": [ + "#testing if the data is normally distributed\n", + "print(scipy.stats.shapiro(A)[1],\n", + "scipy.stats.shapiro(B)[1],\n", + "scipy.stats.shapiro(C)[1])" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Datasets A and B are normally distributed, but C is not because the P-value < 0.05. <br>\n", + "The statistical test that I chose for A&B is the t-test, because they are both normally distributed and they are unpaired.<br>\n", + "For the A&C and B&C I chose the Mann-Whitney U test, because the variable C is not normally distributed and the variables are unpaired." 
+ ] + }, + { + "cell_type": "code", + "execution_count": 43, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "0.9507932942353805" + ] + }, + "execution_count": 43, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "#A&B\n", + "scipy.stats.ttest_ind(A,B)[1]" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "The means of A and B are not significantly different, because the P-value (0.95) is well above 0.05." + ] + }, + { + "cell_type": "code", + "execution_count": 6, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "0.5189924682098411" + ] + }, + "execution_count": 6, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "#A&C\n", + "scipy.stats.mannwhitneyu(A,C)[1]" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Because the P-value (0.52) is above 0.05, A and C are not statistically significantly different." + ] + }, + { + "cell_type": "code", + "execution_count": 7, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "0.5887445887445888" + ] + }, + "execution_count": 7, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "#B&C\n", + "scipy.stats.mannwhitneyu(B,C)[1]" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Because the P-value (0.59) is above 0.05, B and C are not statistically significantly different." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "- The description of the data is deliberately vague. Can you come up with other plausible tests for each pair?" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Because the data description is vague, we don't know whether the samples are paired. If they were, a paired test would be appropriate instead, for example the Wilcoxon signed-rank test for B&C. A paired test is not possible for A&B or A&C as given, because A has five observations while B and C have six." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## 2. 
More two-variable tests" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "- Select the correct statistical tests to compare the following pairs and calculate the P-values. What hypotheses do the tests concern? What can you conclude based on the observed p-values?" + ] + }, + { + "cell_type": "code", + "execution_count": 8, + "metadata": {}, + "outputs": [], + "source": [ + "#Group D\n", + "testD = (5.6, 3.1, 8.7, 4.5, 6.7, 4.5)\n", + "re_testD = (6.1, 5.8, 8.5, 5.3, 7.2, 5.1)\n", + "\n", + "#Group E\n", + "testE = (4.5, 3.9, 7.1, 4.3, 6.9, 8.2, 7.6)\n", + "re_testE = (4.9, 4.7, 7.8, 4.8, 7.5, 7.8, 8.1)" + ] + }, + { + "cell_type": "code", + "execution_count": 9, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "testD: 0.8002394946806008 re-testD: 0.3768611442017307 \n", + " testE: 0.16666537264976267 re-testE: 0.014585478888414122\n" + ] + } + ], + "source": [ + "#testing if the data is normally distributed\n", + "print(\"testD:\", scipy.stats.shapiro(testD)[1],\n", + "\"re-testD:\", scipy.stats.shapiro(re_testD)[1],\"\\n\",\n", + "\"testE:\", scipy.stats.shapiro(testE)[1],\n", + "\"re-testE:\", scipy.stats.shapiro(re_testE)[1])" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "So, the testD, re-testD and the testE are normally distributed based on their P-values, but the re-testE is not. The different datasets within the groups are paired but different groups are not. 
<br>\n", + "- That's why I chose the paired t-test for testD and re-testD within group D\n", + "- the Wilcoxon signed-rank test for testE and re-testE within group E\n", + "- the unpaired t-test for testD and testE\n", + "- and the Mann-Whitney U test for re-testD and re-testE\n" + ] + }, + { + "cell_type": "code", + "execution_count": 10, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "0.09740501217589806" + ] + }, + "execution_count": 10, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "#test and re-test in group D\n", + "scipy.stats.ttest_rel(testD,re_testD)[1]" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "There is no significant difference between testD and re-testD, because the P-value (0.097) > 0.05." + ] + }, + { + "cell_type": "code", + "execution_count": 11, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "0.03125" + ] + }, + "execution_count": 11, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "#test and re-test in group E\n", + "scipy.stats.wilcoxon(testE,re_testE)[1]" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "There is a significant difference between testE and re-testE, because the P-value (0.031) < 0.05." + ] + }, + { + "cell_type": "code", + "execution_count": 12, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "0.6040909505950958" + ] + }, + "execution_count": 12, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "#testD and testE\n", + "scipy.stats.ttest_ind(testD,testE)[1]" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "There is no significant difference between testD and testE, because the P-value (0.60) > 0.05." 
+ ] + }, + { + "cell_type": "code", + "execution_count": 13, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "0.9429784240576059" + ] + }, + "execution_count": 13, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "#re-testD and re-testE\n", + "scipy.stats.mannwhitneyu(re_testD,re_testE)[1]" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "There is no significant difference between the re-testD and the re-testE, because the P-value > 0.05." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## 3. More two-variable tests (continues)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "- Calculate the Pearson correlation coefficient and its P-value." + ] + }, + { + "cell_type": "code", + "execution_count": 14, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "PearsonRResult(statistic=0.9773664314916523, pvalue=0.00014623458861244028)" + ] + }, + "execution_count": 14, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "scipy.stats.pearsonr(testE,re_testE)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "- Calculate the Spearman correlation coefficient and its P-value." + ] + }, + { + "cell_type": "code", + "execution_count": 15, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "SignificanceResult(statistic=0.9369749612033814, pvalue=0.0018510301964418925)" + ] + }, + "execution_count": 15, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "scipy.stats.spearmanr(testE,re_testE)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "- What hypotheses do the tests concern? What can you conclude based on the observed p-values?" 
+ ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "The Pearson correlation coefficient is very high (0.977), which indicates a very strong positive linear relationship between testE and re-testE. The null hypothesis is that there is no linear correlation; because the p-value (0.00015) is far below 0.05, we reject it and the correlation is significant. <br>\n", + "<br>\n", + "The Spearman correlation coefficient is 0.937, which also indicates a strong positive monotonic relationship between the test and re-test scores. The null hypothesis is that there is no monotonic association; because the p-value (0.0019) is well below 0.05, the correlation is significant. " + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## 4. Advertisements" + ] + }, + { + "cell_type": "code", + "execution_count": 16, + "metadata": {}, + "outputs": [], + "source": [ + "df1 = p.read_csv('ads-image.csv', \n", + " # use row 0 as column names\n", + " header=0,\n", + " sep=\",\")" + ] + }, + { + "cell_type": "code", + "execution_count": 17, + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "<div>\n", + "<style scoped>\n", + " .dataframe tbody tr th:only-of-type {\n", + " vertical-align: middle;\n", + " }\n", + "\n", + " .dataframe tbody tr th {\n", + " vertical-align: top;\n", + " }\n", + "\n", + " .dataframe thead th {\n", + " text-align: right;\n", + " }\n", + "</style>\n", + "<table border=\"1\" class=\"dataframe\">\n", + " <thead>\n", + " <tr style=\"text-align: right;\">\n", + " <th></th>\n", + " <th>person_id</th>\n", + " <th>amount_spent</th>\n", + " </tr>\n", + " </thead>\n", + " <tbody>\n", + " <tr>\n", + " <th>0</th>\n", + " <td>87434</td>\n", + " <td>14.01</td>\n", + " </tr>\n", + " <tr>\n", + " <th>1</th>\n", + " <td>24844</td>\n", + " <td>10.46</td>\n", + " </tr>\n", + " <tr>\n", + " <th>2</th>\n", + " <td>23566</td>\n", + " <td>21.32</td>\n", + " </tr>\n", + " <tr>\n", + " <th>3</th>\n", + " <td>79972</td>\n", + " <td>121.10</td>\n", + " </tr>\n", + " <tr>\n", + " <th>4</th>\n", + " <td>37170</td>\n", + " <td>14.26</td>\n", + " </tr>\n", + " </tbody>\n", + 
"</table>\n", + "</div>" + ], + "text/plain": [ + " person_id amount_spent\n", + "0 87434 14.01\n", + "1 24844 10.46\n", + "2 23566 21.32\n", + "3 79972 121.10\n", + "4 37170 14.26" + ] + }, + "execution_count": 17, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "df1.head()" + ] + }, + { + "cell_type": "code", + "execution_count": 18, + "metadata": {}, + "outputs": [], + "source": [ + "df2 = p.read_csv('ads-video.csv', \n", + " # use row 0 as column names\n", + " header=0,\n", + " sep=\",\")" + ] + }, + { + "cell_type": "code", + "execution_count": 19, + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "<div>\n", + "<style scoped>\n", + " .dataframe tbody tr th:only-of-type {\n", + " vertical-align: middle;\n", + " }\n", + "\n", + " .dataframe tbody tr th {\n", + " vertical-align: top;\n", + " }\n", + "\n", + " .dataframe thead th {\n", + " text-align: right;\n", + " }\n", + "</style>\n", + "<table border=\"1\" class=\"dataframe\">\n", + " <thead>\n", + " <tr style=\"text-align: right;\">\n", + " <th></th>\n", + " <th>person_id</th>\n", + " <th>amount_spent</th>\n", + " </tr>\n", + " </thead>\n", + " <tbody>\n", + " <tr>\n", + " <th>0</th>\n", + " <td>5370</td>\n", + " <td>12.73</td>\n", + " </tr>\n", + " <tr>\n", + " <th>1</th>\n", + " <td>60667</td>\n", + " <td>25.42</td>\n", + " </tr>\n", + " <tr>\n", + " <th>2</th>\n", + " <td>96716</td>\n", + " <td>22.08</td>\n", + " </tr>\n", + " <tr>\n", + " <th>3</th>\n", + " <td>99764</td>\n", + " <td>0.52</td>\n", + " </tr>\n", + " <tr>\n", + " <th>4</th>\n", + " <td>75990</td>\n", + " <td>49.63</td>\n", + " </tr>\n", + " </tbody>\n", + "</table>\n", + "</div>" + ], + "text/plain": [ + " person_id amount_spent\n", + "0 5370 12.73\n", + "1 60667 25.42\n", + "2 96716 22.08\n", + "3 99764 0.52\n", + "4 75990 49.63" + ] + }, + "execution_count": 19, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "df2.head()" + ] + }, + { + "cell_type": "markdown", + 
"metadata": {}, + "source": [ + "- Is there statistical evidence to claim that the total amount spent by customers is different if they click on image advertisements than on video advertisements?\n", + "- Explain the assumptions you made about how the data was collected and how it affected your choice of the test." + ] + }, + { + "cell_type": "code", + "execution_count": 20, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "1.2337474438356741e-14\n" + ] + }, + { + "data": { + "text/plain": [ + "6.508535542505141e-13" + ] + }, + "execution_count": 20, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "#check if the datasets are normally distributed\n", + "print(scipy.stats.shapiro(df1.iloc[:,1])[1])\n", + "scipy.stats.shapiro(df2.iloc[:,1])[1]" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Neither is normally distributed. And I assume that the variables amount_spent in both datasets are not paired, because I don't think they could follow the same customer on image ads and on video ads in the same order. So, that's why I chose the Mann-Whitney test." + ] + }, + { + "cell_type": "code", + "execution_count": 21, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "0.10522752647898526" + ] + }, + "execution_count": 21, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "scipy.stats.mannwhitneyu(df1.iloc[:,1],df2.iloc[:,1])[1]" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "The conclusion of this test can be said, that there is statistical evidence to claim that the total amount spent by customer is not significally different on image ads and video ads. 
" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Diffrent approach:<br>\n", + "The different datasets are paired because they contain same customers but they are not in the same order and they do not contain the same amount of customers. So I merged the dataframes togheter on the person_id and do the statistical test on this dataframe. " + ] + }, + { + "cell_type": "code", + "execution_count": 22, + "metadata": {}, + "outputs": [], + "source": [ + "paired_df = p.merge(df1,df2, on=\"person_id\", how = \"inner\")" + ] + }, + { + "cell_type": "code", + "execution_count": 23, + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "<div>\n", + "<style scoped>\n", + " .dataframe tbody tr th:only-of-type {\n", + " vertical-align: middle;\n", + " }\n", + "\n", + " .dataframe tbody tr th {\n", + " vertical-align: top;\n", + " }\n", + "\n", + " .dataframe thead th {\n", + " text-align: right;\n", + " }\n", + "</style>\n", + "<table border=\"1\" class=\"dataframe\">\n", + " <thead>\n", + " <tr style=\"text-align: right;\">\n", + " <th></th>\n", + " <th>person_id</th>\n", + " <th>amount_spent_x</th>\n", + " <th>amount_spent_y</th>\n", + " </tr>\n", + " </thead>\n", + " <tbody>\n", + " <tr>\n", + " <th>0</th>\n", + " <td>87434</td>\n", + " <td>14.01</td>\n", + " <td>40.54</td>\n", + " </tr>\n", + " <tr>\n", + " <th>1</th>\n", + " <td>24844</td>\n", + " <td>10.46</td>\n", + " <td>78.26</td>\n", + " </tr>\n", + " <tr>\n", + " <th>2</th>\n", + " <td>23566</td>\n", + " <td>21.32</td>\n", + " <td>5.66</td>\n", + " </tr>\n", + " <tr>\n", + " <th>3</th>\n", + " <td>79972</td>\n", + " <td>121.10</td>\n", + " <td>24.05</td>\n", + " </tr>\n", + " <tr>\n", + " <th>4</th>\n", + " <td>37170</td>\n", + " <td>14.26</td>\n", + " <td>48.67</td>\n", + " </tr>\n", + " </tbody>\n", + "</table>\n", + "</div>" + ], + "text/plain": [ + " person_id amount_spent_x amount_spent_y\n", + "0 87434 14.01 40.54\n", + "1 24844 10.46 78.26\n", + "2 23566 21.32 5.66\n", 
+ "3 79972 121.10 24.05\n", + "4 37170 14.26 48.67" + ] + }, + "execution_count": 23, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "paired_df.head()" + ] + }, + { + "cell_type": "code", + "execution_count": 24, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "(171, 3)" + ] + }, + "execution_count": 24, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "paired_df.shape" + ] + }, + { + "cell_type": "code", + "execution_count": 25, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "2.655470011519053e-11\n" + ] + }, + { + "data": { + "text/plain": [ + "8.710116256139686e-13" + ] + }, + "execution_count": 25, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "#check if the variables are normally distributed\n", + "print(scipy.stats.shapiro(paired_df.iloc[:,1])[1])\n", + "scipy.stats.shapiro(paired_df.iloc[:,2])[1]" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Neither are still not normally distributed. So let's use the Wilcoxon signed-rank test." + ] + }, + { + "cell_type": "code", + "execution_count": 26, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "0.2537141492862989" + ] + }, + "execution_count": 26, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "scipy.stats.wilcoxon(paired_df.iloc[:,1],paired_df.iloc[:,2])[1]" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "The conclusion of this test can also be said, that there is statistical evidence to claim that the total amount spent by customer is not significally different on image ads and video ads based on the p-value." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## 5. 
Electric bikes (continues)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "- model answer from the previous task as a basis for my own answer" + ] + }, + { + "cell_type": "code", + "execution_count": 27, + "metadata": {}, + "outputs": [], + "source": [ + "df = p.read_csv('bikes.data.csv')\n", + "\n", + "categorical = [\n", + " 'ticket',\n", + " 'month',\n", + " 'location_from',\n", + " 'location_to',\n", + " 'assistance',\n", + "]\n", + "quantitative = [\n", + " 'cost',\n", + " 'duration',\n", + " 'distance',\n", + "]" + ] + }, + { + "cell_type": "code", + "execution_count": 28, + "metadata": {}, + "outputs": [], + "source": [ + "# remove outliers\n", + "\n", + "df = df[df.cost < 100]\n", + "df = df[df.duration < 10000]\n", + "\n", + "# fix negative distances\n", + "\n", + "df.distance = np.abs(df.distance)" + ] + }, + { + "cell_type": "code", + "execution_count": 29, + "metadata": {}, + "outputs": [], + "source": [ + "# remove rides that are likely just tests\n", + "\n", + "df = df[df.distance != 0]\n", + "df = df[df.duration >= 60]\n", + "# (zero costs are likely due to season tickets)\n", + "\n", + "# Further properties to consider\n", + "# - rides with very high average speed\n", + "# - rides that start and end in the same location\n", + "# - assistance off/on vs. energy used/collected" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "- Is there statistical evidence to claim that the travel times tend to be shorter or longer for the single than for the season ticket type?" 
+ ] + }, + { + "cell_type": "code", + "execution_count": 30, + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "<div>\n", + "<style scoped>\n", + " .dataframe tbody tr th:only-of-type {\n", + " vertical-align: middle;\n", + " }\n", + "\n", + " .dataframe tbody tr th {\n", + " vertical-align: top;\n", + " }\n", + "\n", + " .dataframe thead th {\n", + " text-align: right;\n", + " }\n", + "</style>\n", + "<table border=\"1\" class=\"dataframe\">\n", + " <thead>\n", + " <tr style=\"text-align: right;\">\n", + " <th></th>\n", + " <th>ticket</th>\n", + " <th>cost</th>\n", + " <th>month</th>\n", + " <th>location_from</th>\n", + " <th>location_to</th>\n", + " <th>duration</th>\n", + " <th>distance</th>\n", + " <th>assistance</th>\n", + " <th>energy_used</th>\n", + " <th>energy_collected</th>\n", + " </tr>\n", + " </thead>\n", + " <tbody>\n", + " <tr>\n", + " <th>0</th>\n", + " <td>single</td>\n", + " <td>0.35</td>\n", + " <td>9</td>\n", + " <td>MICROTEKNIA</td>\n", + " <td>PUIJONLAAKSO</td>\n", + " <td>411.0</td>\n", + " <td>2150</td>\n", + " <td>1</td>\n", + " <td>19.0</td>\n", + " <td>2.7</td>\n", + " </tr>\n", + " <tr>\n", + " <th>1</th>\n", + " <td>single</td>\n", + " <td>1.20</td>\n", + " <td>5</td>\n", + " <td>SATAMA</td>\n", + " <td>KEILANKANTA</td>\n", + " <td>1411.0</td>\n", + " <td>7130</td>\n", + " <td>1</td>\n", + " <td>53.8</td>\n", + " <td>15.3</td>\n", + " </tr>\n", + " <tr>\n", + " <th>2</th>\n", + " <td>savonia</td>\n", + " <td>0.00</td>\n", + " <td>9</td>\n", + " <td>TASAVALLANKATU</td>\n", + " <td>NEULAMÄKI</td>\n", + " <td>1308.0</td>\n", + " <td>5420</td>\n", + " <td>1</td>\n", + " <td>43.0</td>\n", + " <td>9.9</td>\n", + " </tr>\n", + " <tr>\n", + " <th>3</th>\n", + " <td>savonia</td>\n", + " <td>0.00</td>\n", + " <td>10</td>\n", + " <td>TORI</td>\n", + " <td>KAUPPAKATU</td>\n", + " <td>1036.0</td>\n", + " <td>1180</td>\n", + " <td>1</td>\n", + " <td>6.5</td>\n", + " <td>2.1</td>\n", + " </tr>\n", + " <tr>\n", + " <th>4</th>\n", + " 
<td>single</td>\n", + " <td>0.30</td>\n", + " <td>9</td>\n", + " <td>TORI</td>\n", + " <td>TORI</td>\n", + " <td>319.0</td>\n", + " <td>1120</td>\n", + " <td>1</td>\n", + " <td>13.7</td>\n", + " <td>1.2</td>\n", + " </tr>\n", + " </tbody>\n", + "</table>\n", + "</div>" + ], + "text/plain": [ + " ticket cost month location_from location_to duration distance \\\n", + "0 single 0.35 9 MICROTEKNIA PUIJONLAAKSO 411.0 2150 \n", + "1 single 1.20 5 SATAMA KEILANKANTA 1411.0 7130 \n", + "2 savonia 0.00 9 TASAVALLANKATU NEULAMÄKI 1308.0 5420 \n", + "3 savonia 0.00 10 TORI KAUPPAKATU 1036.0 1180 \n", + "4 single 0.30 9 TORI TORI 319.0 1120 \n", + "\n", + " assistance energy_used energy_collected \n", + "0 1 19.0 2.7 \n", + "1 1 53.8 15.3 \n", + "2 1 43.0 9.9 \n", + "3 1 6.5 2.1 \n", + "4 1 13.7 1.2 " + ] + }, + "execution_count": 30, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "df.head()" + ] + }, + { + "cell_type": "code", + "execution_count": 31, + "metadata": {}, + "outputs": [], + "source": [ + "#variables for the durations for each ticket type\n", + "single_duration = df[df['ticket'] == 'single']['duration']\n", + "season_duration = df[df['ticket'] == 'season']['duration']" + ] + }, + { + "cell_type": "code", + "execution_count": 32, + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "<div>\n", + "<style scoped>\n", + " .dataframe tbody tr th:only-of-type {\n", + " vertical-align: middle;\n", + " }\n", + "\n", + " .dataframe tbody tr th {\n", + " vertical-align: top;\n", + " }\n", + "\n", + " .dataframe thead th {\n", + " text-align: right;\n", + " }\n", + "</style>\n", + "<table border=\"1\" class=\"dataframe\">\n", + " <thead>\n", + " <tr style=\"text-align: right;\">\n", + " <th></th>\n", + " <th>duration</th>\n", + " <th>distance</th>\n", + " <th>cost</th>\n", + " <th>count</th>\n", + " </tr>\n", + " <tr>\n", + " <th>ticket</th>\n", + " <th></th>\n", + " <th></th>\n", + " <th></th>\n", + " <th></th>\n", + " </tr>\n", 
+ " </thead>\n", + " <tbody>\n", + " <tr>\n", + " <th>savonia</th>\n", + " <td>138370.0</td>\n", + " <td>494990</td>\n", + " <td>2.00</td>\n", + " <td>218</td>\n", + " </tr>\n", + " <tr>\n", + " <th>season</th>\n", + " <td>317094.0</td>\n", + " <td>1315560</td>\n", + " <td>3.00</td>\n", + " <td>457</td>\n", + " </tr>\n", + " <tr>\n", + " <th>single</th>\n", + " <td>637522.0</td>\n", + " <td>2559320</td>\n", + " <td>552.35</td>\n", + " <td>815</td>\n", + " </tr>\n", + " </tbody>\n", + "</table>\n", + "</div>" + ], + "text/plain": [ + " duration distance cost count\n", + "ticket \n", + "savonia 138370.0 494990 2.00 218\n", + "season 317094.0 1315560 3.00 457\n", + "single 637522.0 2559320 552.35 815" + ] + }, + "execution_count": 32, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "summary = df.groupby('ticket')[['duration', 'distance', 'cost']].sum()\n", + "summary['count'] = df.ticket.value_counts()\n", + "summary" + ] + }, + { + "cell_type": "code", + "execution_count": 33, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "0 411.0\n", + "1 1411.0\n", + "4 319.0\n", + "5 1185.0\n", + "6 817.0\n", + " ... \n", + "1762 1597.0\n", + "1763 1066.0\n", + "1767 4400.0\n", + "1768 1167.0\n", + "1770 199.0\n", + "Name: duration, Length: 815, dtype: float64\n", + "8 797.0\n", + "11 1014.0\n", + "12 822.0\n", + "16 905.0\n", + "28 233.0\n", + " ... 
\n", + "1755 113.0\n", + "1756 256.0\n", + "1761 934.0\n", + "1765 696.0\n", + "1773 478.0\n", + "Name: duration, Length: 457, dtype: float64\n" + ] + } + ], + "source": [ + "print(single_duration)\n", + "print(season_duration)" + ] + }, + { + "cell_type": "code", + "execution_count": 34, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "1.6945360817801978e-35 4.119831123406427e-10\n" + ] + } + ], + "source": [ + "#check if variables are normally distributed\n", + "print(scipy.stats.shapiro(single_duration)[1],\n", + " scipy.stats.shapiro(season_duration)[1])" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Neither of them is normally distributed. Every row is a different is its own journey so I assume that the variables are not paired. Thats why I chose to use the Mann-Whitney test. And because we want to know the direction of the test I am testing if the season duration is significally greater than the single duration." + ] + }, + { + "cell_type": "code", + "execution_count": 35, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "0.8260791304465724" + ] + }, + "execution_count": 35, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "scipy.stats.mannwhitneyu(season_duration, single_duration, alternative=\"greater\")[1]" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "The result of the test tells us that the season duration is not significally greater than the single duration. 
" + ] + }, + { + "cell_type": "code", + "execution_count": 36, + "metadata": {}, + "outputs": [], + "source": [ + "sample_size = len(season_duration)" + ] + }, + { + "cell_type": "code", + "execution_count": 37, + "metadata": {}, + "outputs": [], + "source": [ + "sample_single = single_duration.sample(n=sample_size, random_state=1)" + ] + }, + { + "cell_type": "code", + "execution_count": 38, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "PearsonRResult(statistic=-0.0984255769075546, pvalue=0.03542681854380087)" + ] + }, + "execution_count": 38, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "scipy.stats.pearsonr(season_duration,sample_single)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "- Is there statistical evidence to claim that the savonia ticket type differs from the others with respect to how often the electric assistance is used?" + ] + }, + { + "cell_type": "code", + "execution_count": 39, + "metadata": {}, + "outputs": [], + "source": [ + "single_assistance = df[df['ticket'] == 'single']['assistance']\n", + "season_assistance = df[df['ticket'] == 'season']['assistance']\n", + "savonia_assistance = df[df['ticket'] == 'savonia']['assistance']" + ] + }, + { + "cell_type": "code", + "execution_count": 40, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "4.90915909646714e-49 1.7328341801106013e-40 1.261976416979814e-26\n" + ] + } + ], + "source": [ + "#check if variables are normally distributed\n", + "print(scipy.stats.shapiro(single_assistance)[1],\n", + " scipy.stats.shapiro(season_assistance)[1],\n", + " scipy.stats.shapiro(savonia_assistance)[1])" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "not any of them are normal, not paired, so mann-whitney test" + ] + }, + { + "cell_type": "code", + "execution_count": 41, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + 
"0.0010338714188769154" + ] + }, + "execution_count": 41, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "scipy.stats.mannwhitneyu(savonia_assistance, single_assistance)[1]" + ] + }, + { + "cell_type": "code", + "execution_count": 42, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "1.00709460090142e-05" + ] + }, + "execution_count": 42, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "scipy.stats.mannwhitneyu(savonia_assistance, season_assistance)[1]" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Based on these p-values there are significant difference on the savonia ticket type on how often the assistance is used. " + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.11.0" + } + }, + "nbformat": 4, + "nbformat_minor": 2 +}