Commit c813fb43 authored by Elias Ervelä

Upload New File

parent b956190e
%% Cell type:markdown id: tags:
Elias Ervelä <br>
student number 518434 <br>
emerve@utu.fi <br>
December 4, 2020 <br>
%% Cell type:markdown id: tags:
# Data Analysis and Knowledge Discovery: Exercise 3, Supervised learning
%% Cell type:markdown id: tags:
This is the template for the third exercise. The idea of this exercise is to apply supervised learning to predict the ship type from certain attributes (speed, destination harbour...) using a K nearest neighbors (kNN) classifier. The data is available on the Moodle course page: shipdata_2020.xlsx. <br>
General guidance for exercises is given on the Moodle course page. <br>
- answer all the questions below
- write easily readable code, include explanations of what your code does
- make informative illustrations: include labels for x- and y-axes, legends and captions for your plots
- do not change anything manually or outside the script in the data file
- before saving the ipynb file (and possible printing) run: "Restart & Run all", to make sure you return a file that works as expected
- name your file as DAKD2020_ex3_firstname_lastname.ipynb
- +1 bonus point requires a correct solution and also thorough analysis. Discuss also how the results could be improved
- if you encounter problems, Google first. If you can't find an answer to the problem, don't hesitate to ask in the Moodle discussion or directly: pekavir@utu.fi
- Note! Don't leave it to the last moment! No feedback service during the weekend
- The deadline is **Friday 4th of December 23:59**
%% Cell type:markdown id: tags:
## Data import
%% Cell type:markdown id: tags:
Gather *all* packages needed for this notebook here:
%% Cell type:code id: tags:
``` python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
```
%% Cell type:markdown id: tags:
Import the data.
%% Cell type:markdown id: tags:
Let's import the data from my Google Drive.
I used this guide for help: https://buomsoo-kim.github.io/colab/2018/04/16/Importing-files-from-Google-Drive-in-Google-Colab.md/
%% Cell type:code id: tags:
``` python
from pydrive.auth import GoogleAuth
from pydrive.drive import GoogleDrive
from google.colab import auth
from oauth2client.client import GoogleCredentials
```
%% Cell type:code id: tags:
``` python
auth.authenticate_user()
gauth = GoogleAuth()
gauth.credentials = GoogleCredentials.get_application_default()
drive = GoogleDrive(gauth)
```
%% Cell type:code id: tags:
``` python
downloaded = drive.CreateFile({'id':"1Pkdj3ZSe_ipq31Z7EY2Lpgwdb7oebEjl"}) # replace the id with id of file you want to access
downloaded.GetContentFile('shipdata_2020.xlsx') # replace the file name with your file
```
%% Cell type:code id: tags:
``` python
data = pd.read_excel('shipdata_2020.xlsx')
```
%% Cell type:code id: tags:
``` python
data
```
%% Output
MMSI Speed COG ... Gross_tonnage Length Breadth
0 212209000 10.1377 64.3074 ... 3416 94.91 15.34
1 212436000 13.5256 77.0755 ... 6280 116.90 18.00
2 219082000 9.9416 74.6762 ... 9980 141.20 21.90
3 219083000 11.6038 74.7529 ... 9980 141.20 21.60
4 219426000 11.9203 56.3253 ... 3219 99.90 15.00
.. ... ... ... ... ... ... ...
129 273374820 10.0396 74.6253 ... 4979 139.90 16.70
130 273385070 9.3507 74.5454 ... 4979 139.90 16.94
131 273388150 9.7668 68.7159 ... 5075 140.85 16.86
132 636092755 11.1554 73.7013 ... 23240 183.00 27.37
133 357100000 11.2703 59.3888 ... 43717 229.04 32.31
[134 rows x 8 columns]
%% Cell type:markdown id: tags:
## Data preprocessing
%% Cell type:markdown id: tags:
- First, find out how many different destinations there are in the data. Do you need to do any preprocessing? **1p**
- Destination harbor is a categorical variable. It needs to be converted into numerical form. Explain why this step is needed. You can use get_dummies from pandas to implement one-hot coding for categorical features **1p**
- Plot Gross tonnage versus the ship Length. Use different colors for different ship types. According to the plot, there is one clear outlier. Find the correct value from marinetraffic.com, and make the correction **1p**
- It is good to exploit domain knowledge and make some reasonable transformation to the feature values to improve the expected results and/or to avoid redundancy. Find out what gross tonnage means. Make some transformation to Length values to acquire a linear relationship between the transformed length and Gross tonnage values **1p**
- The numerical variables have quite different ranges. To ensure that all variables can have the same importance on the model, perform Z-score standardization. Perform it for speed, transformed length, and breadth **1p**
%% Cell type:markdown id: tags:
**First, find out how many different destinations there are in the data. Do you need to do any preprocessing? 1p**
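
One way to spot spelling variants automatically is to fold diacritics before counting the destinations. This is only a minimal sketch: the sample below is a hypothetical stand-in for `data['Destination']`, and `fold_diacritics` is a helper name introduced here for illustration.

```python
import unicodedata

import pandas as pd


def fold_diacritics(name: str) -> str:
    """Return the name with combining accents removed, e.g. 'ä' becomes 'a'."""
    decomposed = unicodedata.normalize("NFKD", name)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


# Hypothetical sample standing in for data['Destination']
destinations = pd.Series(["Kotka", "Sillamäe", "Sillamae", "Porvoo", "Sillamäe"])

folded = destinations.map(fold_diacritics)

# Group the original spellings by their folded form and keep
# the groups that contain more than one distinct spelling
variants = destinations.groupby(folded).unique()
suspects = variants[variants.map(len) > 1]

n_raw = destinations.nunique()   # distinct spellings as written
n_folded = folded.nunique()      # distinct names after folding
print(n_raw, n_folded)
print(suspects)
```

If the two counts differ, the listed groups are candidates for merging. Whether two spellings really mean the same harbour still has to be checked by hand.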
%% Cell type:code id: tags:
``` python
data['Destination'].unique()
```
%% Output
array(['Hamina', 'Helsinki', 'Kotka', 'Kronshtadt', 'Kunda', 'Muuga',
'Paldiski', 'Porvoo', 'Primorsk', 'Sillamäe', 'Sillamae',
'Tallinn', 'Ust-Luga', 'Valko-Loviisa', 'Viipuri', 'Vuosaari',
'Vysotsk'], dtype=object)
%% Cell type:code id: tags:
``` python
# Sillamäe and Sillamae most likely mean the same place. Let's change them all to Sillamae
data[data['Destination'] == 'Sillamäe']['Destination']
```
%% Output
76 Sillamäe
77 Sillamäe
Name: Destination, dtype: object
%% Cell type:code id: tags:
``` python
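# Replace by value instead of by hard-coded row numbers, so this also works if the row order changes
data['Destination'] = data['Destination'].replace('Sillamäe', 'Sillamae')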
```
%% Cell type:code id: tags:
``` python
# Let's check that it worked
data['Destination'].unique()
```
%% Output
array(['Hamina', 'Helsinki', 'Kotka', 'Kronshtadt', 'Kunda', 'Muuga',
'Paldiski', 'Porvoo', 'Primorsk', 'Sillamae', 'Tallinn',
'Ust-Luga', 'Valko-Loviisa', 'Viipuri', 'Vuosaari', 'Vysotsk'],
dtype=object)
%% Cell type:markdown id: tags:
**Destination harbor is a categorical variable. It needs to be converted into numerical form. Explain why this step is needed. You can use get_dummies from pandas to implement one-hot coding for categorical features 1p**
%% Cell type:markdown id: tags:
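Because kNN classifies a ship by computing distances between feature vectors, every feature must be numerical. One-hot coding gives each harbour its own 0/1 column, so no artificial ordering or spacing between harbours is introduced, as would happen if the harbours were simply numbered.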
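
The difference between numbering the harbours and one-hot coding them can be seen in the distances they produce. The following is a small sketch with three harbour names from the data; the `distance` helper is introduced here only for illustration.

```python
import numpy as np
import pandas as pd

# Three harbour names taken from the destination list
harbours = pd.Series(["Hamina", "Helsinki", "Vysotsk"])

# Integer codes impose an arbitrary order: Hamina=0, Helsinki=1, Vysotsk=2
codes = harbours.astype("category").cat.codes.to_numpy(dtype=float)

# One-hot coding gives each harbour its own 0/1 column
onehot = pd.get_dummies(harbours).to_numpy(dtype=float)


def distance(a, b):
    """Euclidean distance between two feature vectors (or scalars)."""
    return float(np.linalg.norm(np.atleast_1d(a) - np.atleast_1d(b)))


d_codes_near = distance(codes[0], codes[1])     # Hamina vs Helsinki
d_codes_far = distance(codes[0], codes[2])      # Hamina vs Vysotsk
d_onehot_near = distance(onehot[0], onehot[1])  # Hamina vs Helsinki
d_onehot_far = distance(onehot[0], onehot[2])   # Hamina vs Vysotsk

print(d_codes_near, d_codes_far)
print(d_onehot_near, d_onehot_far)
```

With integer codes, Vysotsk ends up twice as far from Hamina as Helsinki is, purely because of alphabetical order. With one-hot coding, all pairs of different harbours are equally far apart.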
%% Cell type:code id: tags:
``` python
dest = pd.get_dummies(data['Destination'])
dest
```
%% Output
Hamina Helsinki Kotka ... Viipuri Vuosaari Vysotsk
0 1 0 0 ... 0 0 0
1 1 0 0 ... 0 0 0
2 1 0 0 ... 0 0 0
3 1 0 0 ... 0 0 0
4 1 0 0 ... 0 0 0
.. ... ... ... ... ... ... ...
129 0 0 0 ... 0 0 1
130 0 0 0 ... 0 0 1
131 0 0 0 ... 0 0 1
132 0 0 0 ... 0 0 1
133 0 0 0 ... 0 0 1
[134 rows x 16 columns]
%% Cell type:markdown id: tags:
**Plot Gross tonnage versus the ship Length. Use different colors for different ship types. According to the plot, there is one clear outlier. Find the correct value from marinetraffic.com, and make the correction 1p**
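
Besides looking at the plot, an outlier like this can also be flagged numerically. The sketch below is one possible heuristic, not part of the exercise: it compares each ship's gross tonnage to its cubed length and flags ratios more than a factor of 10 away from the median. The first seven rows use values from the data preview above; the last row pairs the outlier's gross tonnage with a hypothetical length, and the factor-of-10 threshold is an assumption.

```python
import numpy as np
import pandas as pd

# Gross tonnage and length pairs. The first seven appear in the data preview;
# the last row is the outlier's gross tonnage with a HYPOTHETICAL length.
ships = pd.DataFrame({
    "Gross_tonnage": [3416, 6280, 9980, 3219, 4979, 23240, 43717, 30026],
    "Length": [94.91, 116.90, 141.20, 99.90, 139.90, 183.00, 229.04, 35.0],
})

# Tonnage scales roughly with volume, so compare it to length cubed
ratio = ships["Gross_tonnage"] / ships["Length"] ** 3

# Deviation from the median ratio in orders of magnitude
log_dev = np.log10(ratio / ratio.median())

# Assumed threshold: more than a factor of 10 away from the median ratio
ships["suspect"] = log_dev.abs() > 1.0
suspects = ships[ships["suspect"]]
print(suspects)
```

A flagged row is only a candidate for correction. The value itself still has to be looked up from a reliable source before changing it.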
%% Cell type:code id: tags:
``` python
# Let's find out the different types of ships
data['Ship_type'].unique()
```
%% Output
array(['Cargo', 'Tanker', 'Tug'], dtype=object)
%% Cell type:code id: tags:
``` python
# Plot length against gross tonnage, one color per ship type
for ship_type, color in [('Cargo', 'r'), ('Tanker', 'b'), ('Tug', 'g')]:
    subset = data[data['Ship_type'] == ship_type]
    plt.plot(subset['Gross_tonnage'], subset['Length'], 'o', color=color, label=ship_type)
plt.xlabel('Gross tonnage')
plt.ylabel('Length')
plt.legend()
plt.show()
```
%% Output
%% Cell type:code id: tags:
``` python
# I can see a clear outlier: a tanker that has <100 length and >20000 gross tonnage
data.loc[(data['Length']<100) & (data['Ship_type'] == 'Tanker') & (data['Gross_tonnage']>20000),['Gross_tonnage']]
```
%% Output
Gross_tonnage
83 30026
%% Cell type:code id: tags:
``` python
# Let's change 30026 to the real value, 326
data.loc[(data['Length']<100) & (data['Ship_type'] == 'Tanker') & (data['Gross_tonnage']>20000),['Gross_tonnage']] = 326
```
%% Cell type:code id: tags:
``` python
# Let's plot again to check what it looks like now
for ship_type, color in [('Cargo', 'r'), ('Tanker', 'b'), ('Tug', 'g')]:
    subset = data[data['Ship_type'] == ship_type]
    plt.plot(subset['Gross_tonnage'], subset['Length'], 'o', color=color, label=ship_type)
plt.xlabel('Gross tonnage')
plt.ylabel('Length')
plt.legend()
plt.show()
```
%% Output
%% Cell type:markdown id: tags:
**It is good to exploit domain knowledge and make some reasonable transformation to the feature values to improve the expected results and/or to avoid redundancy. Find out what gross tonnage means. Make some transformation to Length values to acquire a linear relationship between the transformed length and Gross tonnage values 1p**
%% Cell type:markdown id: tags:
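Gross tonnage is the ship's total internal volume V multiplied by a coefficient that depends on that volume. To be exact:
`Gross_tonnage = V * (0.2 + 0.02 * log10(V))`
Volume is length * breadth * depth (m^3). If ships have roughly similar proportions, V is approximately proportional to length^3, so cubing the length should give a roughly linear relationship with gross tonnage. Leaving out the constant 0.2 term as a simplification, I use:
`length^3 * log10(length^3) ~ gross_tonnage`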
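
The effect of the transformation can be checked on idealized numbers. This is a sketch under stated assumptions: the ships are hypothetical, all with the same proportions so that V = c * length^3, and the constant `c` is made up for illustration. Only the gross tonnage formula itself comes from the definition above.

```python
import numpy as np


def gross_tonnage(volume_m3):
    """GT = K * V with K = 0.2 + 0.02 * log10(V)."""
    volume_m3 = np.asarray(volume_m3, dtype=float)
    return volume_m3 * (0.2 + 0.02 * np.log10(volume_m3))


def pearson(x, y):
    """Pearson correlation coefficient between two 1-D arrays."""
    return float(np.corrcoef(x, y)[0, 1])


# HYPOTHETICAL ships with identical proportions: V = c * length^3
c = 0.012                               # made-up shape constant
lengths = np.linspace(30.0, 230.0, 50)  # lengths in metres
gt = gross_tonnage(c * lengths ** 3)

r_raw = pearson(lengths, gt)
r_cubed = pearson(lengths ** 3, gt)
r_cubed_log = pearson(lengths ** 3 * np.log10(lengths ** 3), gt)

print(r_raw, r_cubed, r_cubed_log)
```

On these idealized ships the raw length correlates noticeably less well with gross tonnage than either cubed variant does. Real ships differ in proportions, so the real data will be noisier than this.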
%% Cell type:code id: tags:
``` python
# Let's check whether the transformed length looks linear against gross tonnage
for ship_type, color in [('Cargo', 'r'), ('Tanker', 'b'), ('Tug', 'g')]:
    subset = data[data['Ship_type'] == ship_type]
    length_cubed = subset['Length']**3
    plt.plot(subset['Gross_tonnage'], length_cubed * np.log10(length_cubed), 'o', color=color, label=ship_type)
plt.xlabel('Gross tonnage')
plt.ylabel('Length^3 * log10(Length^3)')
plt.legend()
plt.show()
```
%% Output
%% Cell type:code id: tags:
``` python
# Let's add a column for this transformation
data['Length_transformed'] = (data['Length']**3)*(np.log10(data['Length']**3))
data
```
%% Output
MMSI Speed COG ... Length Breadth Length_transformed
0 212209000 10.1377 64.3074 ... 94.91 15.34 5.071453e+06
1 212436000 13.5256 77.0755 ... 116.90 18.00 9.910062e+06
2 219082000 9.9416 74.6762 ... 141.20 21.90 1.815643e+07
3 219083000 11.6038 74.7529 ... 141.20 21.60 1.815643e+07
4 219426000 11.9203 56.3253 ... 99.90 15.00 5.980718e+06
.. ... ... ... ... ... ... ...
129 273374820 10.0396 74.6253 ... 139.90 16.70 1.762655e+07
130 273385070 9.3507 74.5454 ... 139.90 16.94 1.762655e+07
131 273388150 9.7668 68.7159 ... 140.85 16.86 1.801271e+07
132 636092755 11.1554 73.7013 ... 183.00 27.37 4.159621e+07
133 357100000 11.2703 59.3888 ... 229.04 32.31 8.506501e+07
[134 rows x 9 columns]
%% Cell type:markdown id: tags:
**The numerical variables have quite different ranges. To ensure that all variables can have the same importance on the model, perform Z-score standardization. Perform it for speed, transformed length, and breadth 1p**
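
One detail worth knowing before standardizing: pandas and NumPy use different defaults for the standard deviation. The sketch below uses the Speed and Breadth values of the first five rows from the data preview to show both variants. Either is a valid Z-score, as long as the same one is used consistently.

```python
import numpy as np
import pandas as pd

# Speed and Breadth of the first five rows in the data preview
sample = pd.DataFrame({
    "Speed": [10.1377, 13.5256, 9.9416, 11.6038, 11.9203],
    "Breadth": [15.34, 18.00, 21.90, 21.60, 15.00],
})

# pandas: .std() defaults to the sample standard deviation (ddof=1)
z_sample = (sample - sample.mean()) / sample.std()

# NumPy: .std() defaults to the population standard deviation (ddof=0)
values = sample.to_numpy()
z_population = (values - values.mean(axis=0)) / values.std(axis=0)

print(z_sample)
print(z_population)
```

The two results differ only by the constant factor sqrt(n / (n - 1)), which gets close to 1 as the number of rows grows.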
%% Cell type:code id: tags:
``` python
# Z-score standardization: subtract the column mean and divide by the standard deviation
data_std = data.copy()
for col in ['Speed', 'Length_transformed', 'Breadth']:
    data_std[col] = (data_std[col] - data_std[col].mean()) / data_std[col].std()
```
%% Cell type:code id: tags:
``` python
data_std
```
%% Output
MMSI Speed COG ... Length Breadth Length_transformed
0 212209000 -0.160696 64.3074 ... 94.91 -0.487276 -0.557527
1 212436000 1.574301 77.0755 ... 116.90 -0.219871 -0.410920
2 219082000 -0.261122 74.6762 ... 141.20 0.172188 -0.161060
3 219083000 0.590117 74.7529 ... 141.20 0.142030 -0.161060
4 219426000 0.752202 56.3253 ... 99.90 -0.521456 -0.529977
.. ... ... ... ... ... ... ...
129 273374820 -0.210935 74.6253 ... 139.90 -0.350558 -0.177115
130 273385070 -0.563732 74.5454 ... 139.90 -0.326431 -0.177115
131 273388150 -0.350640 68.7159 ... 140.85 -0.334473 -0.165415
132 636092755 0.360484 73.7013 ... 183.00 0.722077 0.549150
133 357100000 0.419326 59.3888 ... 229.04 1.218685 1.866228
[134 rows x 9 columns]
%% Cell type:markdown id: tags:
## Classification accuracy with random training and test sets
%% Cell type:markdown id: tags:
Predict the **ship type** using **speed, destination, transformed length, and breadth** as features. Find an estimate of the classification accuracy (the ratio of correctly classified ships to the total number of ships) using *random training and test sets*. <br>
- Produce training and test data **1p**
- Gather the normalized features and one-hot-coded destination columns as array __X__ (input variables), and the ship type as array **y** (output variable)
- Divide the data randomly into training (20%) and test (80%) sets
- Do you need to use stratification? Explain your decision
- Train the model and test its performance **1p**
- Use kNN classifier with k=3
- Print out the confusion matrix. How does the model perform with different ship types?
- What is the (total) classification accuracy?
- Repeat the calculation 1000 times with different split of training/test data, and make a histogram of the results for classification accuracy **1p**
- Discuss your results **1p**
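
The steps listed above can be sketched end to end with NumPy alone. This is an illustrative sketch, not the solution to the exercise: the data are hypothetical 2-D points with imbalanced classes, and `stratified_split`, `knn_predict` and `confusion_matrix` are helper names written here for illustration (libraries such as scikit-learn provide ready-made equivalents). The test fraction follows the 20% training / 80% test split as written in the task, and only 100 repetitions are run to keep the example quick.

```python
import numpy as np


def stratified_split(y, test_fraction, rng):
    """Return (train_idx, test_idx) with class proportions kept in both parts."""
    train_idx, test_idx = [], []
    for label in np.unique(y):
        members = rng.permutation(np.flatnonzero(y == label))
        n_test = int(round(test_fraction * len(members)))
        test_idx.extend(members[:n_test])
        train_idx.extend(members[n_test:])
    return np.array(train_idx), np.array(test_idx)


def knn_predict(X_train, y_train, X_test, k=3):
    """Majority vote among the k nearest training points (Euclidean distance)."""
    dists = np.linalg.norm(X_test[:, None, :] - X_train[None, :, :], axis=2)
    nearest = np.argsort(dists, axis=1)[:, :k]
    preds = []
    for row in y_train[nearest]:
        labels, counts = np.unique(row, return_counts=True)
        preds.append(labels[np.argmax(counts)])
    return np.array(preds)


def confusion_matrix(y_true, y_pred, labels):
    """Rows are true classes, columns are predicted classes."""
    index = {label: i for i, label in enumerate(labels)}
    cm = np.zeros((len(labels), len(labels)), dtype=int)
    for t, p in zip(y_true, y_pred):
        cm[index[t], index[p]] += 1
    return cm


# HYPOTHETICAL 2-D data with imbalanced classes
rng = np.random.default_rng(0)
labels = np.array(["Cargo", "Tanker", "Tug"])
sizes = [60, 50, 10]
centres = np.array([[0.0, 0.0], [4.0, 0.0], [0.0, 4.0]])
X = np.vstack([rng.normal(c, 1.0, size=(n, 2)) for c, n in zip(centres, sizes)])
y = np.repeat(labels, sizes)

accuracies = []
for _ in range(100):  # the exercise asks for 1000 repetitions
    train_idx, test_idx = stratified_split(y, test_fraction=0.8, rng=rng)
    y_pred = knn_predict(X[train_idx], y[train_idx], X[test_idx], k=3)
    accuracies.append(np.mean(y_pred == y[test_idx]))

# Confusion matrix of the last split and a histogram of all accuracies
cm = confusion_matrix(y[test_idx], y_pred, labels)
counts, bin_edges = np.histogram(accuracies, bins=10)
print(cm)
print(np.mean(accuracies))
```

Splitting each class separately is what keeps the small class present in both sets. With a purely random split and only a few ships of one type, that type could be missing from the training set entirely, and kNN could then never predict it.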
%% Cell type:markdown id: tags:
**Gather the normalized features and one-hot-coded destination columns as array __X__ (input variables), and the ship type as array y (output variable)**
%% Cell type:code id: tags:
``` python
X = data_std[['Speed','Length_transformed','Breadth']]
X = pd.concat([X,dest], axis=1)
X
```
%% Output
Speed Length_transformed Breadth ... Viipuri Vuosaari Vysotsk
0 -0.160696 -0.557527 -0.487276 ... 0 0 0
1 1.574301 -0.410920 -0.219871 ... 0 0 0
2 -0.261122 -0.161060 0.172188 ... 0 0 0
3 0.590117 -0.161060 0.142030 ... 0 0 0
4 0.752202 -0.529977 -0.521456 ... 0 0 0
.. ... ... ... ... ... ... ...
129 -0.210935 -0.177115 -0.350558 ... 0 0 1
130 -0.563732 -0.177115 -0.326431 ... 0 0 1
131 -0.350640 -0.165415 -0.334473 ... 0 0 1
132 0.360484 0.549150 0.722077 ... 0 0 1
133 0.419326 1.866228 1.218685 ... 0 0 1
[134 rows x 19 columns]
%% Cell type:code id: tags:
``` python
y = data_std[['Ship_type']]
y
```
%% Output
Ship_type
0 Cargo
1 Tanker
2 Tanker
3 Tanker
4 Tanker
.. ...
129 Tanker
130 Tanker
131 Tanker
132 Tanker
133 Cargo
[134 rows x 1 columns]
%% Cell type:markdown id: tags:
**Divide the data randomly into training (20%) and te