%% Cell type:markdown id: tags:
Elias Ervelä <br>
Student number 518434 <br>
emerve@utu.fi <br>
Feb 1, 2021 <br>
%% Cell type:markdown id: tags:
# Exercise 1 | TKO_2096 Application of Data Analysis 2021
%% Cell type:markdown id: tags:
#### Nested cross-validation for k-nearest neighbors <br>
- Use Python 3 to program a nested cross-validation for the k-nearest neighbors (kNN) method so that the number of neighbors k is automatically selected from the range 1 to 10. In other words, the base learning algorithm is kNN, but the actual learning algorithm, whose prediction performance will be evaluated with nested CV, is kNN with automatic CV-based model selection (see the lectures and the pseudocode presented in them for more on this interpretation).
- As a kNN implementation, you can use sklearn: http://scikit-learn.org/stable/modules/neighbors.html but your own kNN implementation can also be used if you want to keep more control over what happens in the learning process. The CV implementation should be easily modifiable, since the forthcoming exercises involve different problem-dependent CV variations.
- Use the nested CV implementation on the iris data and report the resulting classification accuracy. Hint: you can use the nested CV example in the sklearn documentation (https://scikit-learn.org/stable/auto_examples/model_selection/plot_nested_cross_validation_iris.html) as a starting point and compare your nested CV implementation against it (a condensed version is sketched right after this list), but do NOT use the ready-made CV implementations of sklearn: the point of the exercise is to learn to split the data on your own. The other exercises need more sophisticated data splitting that is not necessarily available in libraries.
- Return your solution for each exercise BOTH as a Jupyter Notebook file and as a PDF file generated from it.
- Return the report to the course page no later than **Monday 1st of February**.
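%% Cell type:markdown id: tags:
For reference, the ready-made nested CV from the linked sklearn example can be condensed to a few lines. This is a sanity check for comparison only, not a solution to the exercise (which requires manual splitting): `GridSearchCV` performs the inner model selection and `cross_val_score` the outer evaluation.
%% Cell type:code id: tags:
``` python
# Sanity-check sketch using sklearn's ready-made nested CV
# (NOT the exercise solution, which must split the data manually).
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

iris = load_iris()
param_grid = {"n_neighbors": list(range(1, 11))}  # candidate values of k
# Inner CV: GridSearchCV selects k; outer CV: cross_val_score evaluates it.
inner_model = GridSearchCV(KNeighborsClassifier(), param_grid, cv=5)
scores = cross_val_score(inner_model, iris.data, iris.target, cv=5)
print("sklearn reference nested CV accuracy:", scores.mean())
```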
%% Cell type:markdown id: tags:
## Import libraries
%% Cell type:code id: tags:
``` python
# In this cell, import all the libraries you need. For example:
import numpy as np
import pandas as pd
from sklearn.datasets import load_iris # Iris dataset
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import StratifiedKFold
```
%% Cell type:markdown id: tags:
## Results of the nested cross-validation
%% Cell type:code id: tags:
``` python
# Load the Iris dataset into feature and target arrays
iris = load_iris()
X_iris = iris.data
y_iris = iris.target
```
%% Cell type:code id: tags:
``` python
# In this cell, run the script for nested CV and print the result.
test_splits = 5  # Number of folds in the outer CV
val_splits = 5   # Number of folds in the inner CV

# I wasn't sure whether StratifiedKFold was allowed here, but I used it.
skf_test = StratifiedKFold(n_splits=test_splits, shuffle=True)
skf_val = StratifiedKFold(n_splits=val_splits, shuffle=True)

test_score_sum = 0   # Running sum of the outer test-set scores
outer_iteration = 0  # Tracks the outer iteration

# Estimate the performance of kNN with automatic CV-based model selection.
for train_val_index, test_index in skf_test.split(X_iris, y_iris):
    outer_iteration += 1
    print("Outer CV iteration: ", outer_iteration)
    # Split the data into test and train/validation parts
    X_train_val, X_test = X_iris[train_val_index], X_iris[test_index]
    y_train_val, y_test = y_iris[train_val_index], y_iris[test_index]
    best_performance = -1  # Best validation performance in this iteration
    best_k = 0             # Value of k that achieved it
    # Go through k = 1, ..., 10 and pick the best-performing value.
    for k in range(1, 11):
        val_score_sum = 0  # Running sum of the validation scores
        # Evaluate the performance of the given k with the inner CV.
        # The inner indices refer to positions within X_train_val,
        # so they must index X_train_val/y_train_val, not X_iris/y_iris.
        for train_index, val_index in skf_val.split(X_train_val, y_train_val):
            X_train, X_val = X_train_val[train_index], X_train_val[val_index]
            y_train, y_val = y_train_val[train_index], y_train_val[val_index]
            neigh = KNeighborsClassifier(n_neighbors=k)  # Set up kNN
            neigh.fit(X_train, y_train)                  # Train the model
            val_score_sum += neigh.score(X_val, y_val)   # Score on the validation fold
        avg_performance = val_score_sum / val_splits     # Average validation performance
        print("  Avg val score with k=", k, ": ", avg_performance)
        if avg_performance > best_performance:
            best_performance = avg_performance
            best_k = k
    print("  Best k: ", best_k)
    # Retrain on the whole X_train_val with the best k and test it
    neigh = KNeighborsClassifier(n_neighbors=best_k)
    neigh.fit(X_train_val, y_train_val)
    test_score_sum += neigh.score(X_test, y_test)
    print("  Test score with k=", best_k, ": ", neigh.score(X_test, y_test))

avg_test_score = test_score_sum / test_splits
print("Avg test score:", avg_test_score)
```
%% Output
Outer CV iteration: 1
Avg val score with k= 1 : 0.9916666666666668
Avg val score with k= 2 : 0.975
Avg val score with k= 3 : 0.9666666666666668
Avg val score with k= 4 : 0.9666666666666668
Avg val score with k= 5 : 0.9583333333333334
Avg val score with k= 6 : 0.9666666666666666
Avg val score with k= 7 : 0.9583333333333334
Avg val score with k= 8 : 0.975
Avg val score with k= 9 : 0.975
Avg val score with k= 10 : 0.975
Best k: 1
Test score with k= 1 : 0.9666666666666667
Outer CV iteration: 2
Avg val score with k= 1 : 0.9833333333333334
Avg val score with k= 2 : 0.975
Avg val score with k= 3 : 0.9666666666666668
Avg val score with k= 4 : 0.9666666666666668
Avg val score with k= 5 : 0.975
Avg val score with k= 6 : 0.9833333333333334
Avg val score with k= 7 : 0.9833333333333334
Avg val score with k= 8 : 0.9833333333333334
Avg val score with k= 9 : 0.9833333333333334
Avg val score with k= 10 : 0.9833333333333334
Best k: 1
Test score with k= 1 : 0.9333333333333333
Outer CV iteration: 3
Avg val score with k= 1 : 1.0
Avg val score with k= 2 : 0.9833333333333334
Avg val score with k= 3 : 0.9583333333333334
Avg val score with k= 4 : 0.9666666666666666
Avg val score with k= 5 : 0.9583333333333334
Avg val score with k= 6 : 0.975
Avg val score with k= 7 : 0.9833333333333334
Avg val score with k= 8 : 0.9833333333333334
Avg val score with k= 9 : 0.9833333333333334
Avg val score with k= 10 : 0.9833333333333334
Best k: 1
Test score with k= 1 : 0.9666666666666667
Outer CV iteration: 4
Avg val score with k= 1 : 0.9833333333333334
Avg val score with k= 2 : 0.975
Avg val score with k= 3 : 0.9583333333333334
Avg val score with k= 4 : 0.9583333333333334
Avg val score with k= 5 : 0.9583333333333333
Avg val score with k= 6 : 0.9583333333333334
Avg val score with k= 7 : 0.9583333333333334
Avg val score with k= 8 : 0.9666666666666668
Avg val score with k= 9 : 0.9666666666666666
Avg val score with k= 10 : 0.975
Best k: 1
Test score with k= 1 : 0.9333333333333333
Outer CV iteration: 5
Avg val score with k= 1 : 1.0
Avg val score with k= 2 : 0.9833333333333334
Avg val score with k= 3 : 0.9583333333333334
Avg val score with k= 4 : 0.9666666666666666
Avg val score with k= 5 : 0.9583333333333334
Avg val score with k= 6 : 0.9666666666666666
Avg val score with k= 7 : 0.975
Avg val score with k= 8 : 0.975
Avg val score with k= 9 : 0.9666666666666666
Avg val score with k= 10 : 0.9833333333333334
Best k: 1
Test score with k= 1 : 1.0
Avg test score: 0.96
%% Cell type:markdown id: tags:
These results are surprisingly good: every outer fold selected k = 1, and the average test accuracy is 0.96.
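One way to put the average of 0.96 into perspective is to look at the spread of the five outer-fold test scores. A minimal sketch, with the scores copied by hand from the printout above (the variable `fold_scores` is illustrative, not part of the script):
%% Cell type:code id: tags:
``` python
import numpy as np

# Outer-fold test scores copied from the printout above.
fold_scores = [0.9666666666666667, 0.9333333333333333, 0.9666666666666667,
               0.9333333333333333, 1.0]
print("mean:", np.mean(fold_scores))  # matches the reported 0.96
print("std: ", np.std(fold_scores))   # fold-to-fold variation
```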
%% Cell type:code id: tags:
``` python
# Written after returning the exercise: the same nested CV wrapped in a function.
def kNN_nestedCrossValidation(X, y, test_splits, val_splits, k_range, print_steps):
    from sklearn.model_selection import StratifiedKFold, LeaveOneOut
    from sklearn.neighbors import KNeighborsClassifier

    # "loo" selects leave-one-out; otherwise stratified k-fold is used.
    if test_splits == "loo":
        splits_test = LeaveOneOut()
        test_splits = len(X)  # LOO over X yields len(X) outer folds
    else:
        splits_test = StratifiedKFold(n_splits=test_splits, shuffle=True)
    if val_splits == "loo":
        splits_val = LeaveOneOut()
    else:
        splits_val = StratifiedKFold(n_splits=val_splits, shuffle=True)

    test_score_sum = 0   # Running sum of the outer test-set scores
    outer_iteration = 0  # Tracks the outer iteration

    # Estimate the performance of kNN with automatic CV-based model selection.
    for train_val_index, test_index in splits_test.split(X, y):
        outer_iteration += 1
        if print_steps:
            print("Outer CV iteration: ", outer_iteration)
        # Split the data into test and train/validation parts
        X_train_val, X_test = X[train_val_index], X[test_index]
        y_train_val, y_test = y[train_val_index], y[test_index]
        best_performance = -1  # Best validation performance in this iteration
        best_k = 0             # Value of k that achieved it
        # Go through each k in k_range and pick the best-performing value.
        for k in k_range:
            val_score_sum = 0  # Running sum of the validation scores
            n_val_folds = 0    # Count the inner folds (needed when LOO is used)
            # Evaluate the performance of the given k with the inner CV.
            # The inner indices refer to positions within X_train_val.
            for train_index, val_index in splits_val.split(X_train_val, y_train_val):
                X_train, X_val = X_train_val[train_index], X_train_val[val_index]
                y_train, y_val = y_train_val[train_index], y_train_val[val_index]
                neigh = KNeighborsClassifier(n_neighbors=k)  # Set up kNN
                neigh.fit(X_train, y_train)                  # Train the model
                val_score_sum += neigh.score(X_val, y_val)   # Score on the validation fold
                n_val_folds += 1
            avg_performance = val_score_sum / n_val_folds    # Average validation performance
            if print_steps:
                print("  Avg val score with k=", k, ": ", avg_performance)
            if avg_performance > best_performance:
                best_performance = avg_performance
                best_k = k
        if print_steps:
            print("  Best k: ", best_k)
        # Retrain on the whole X_train_val with the best k and test it
        neigh = KNeighborsClassifier(n_neighbors=best_k)
        neigh.fit(X_train_val, y_train_val)
        test_score_sum += neigh.score(X_test, y_test)
        if print_steps:
            print("  Test score with k=", best_k, ": ", neigh.score(X_test, y_test))

    avg_test_score = test_score_sum / test_splits
    print("Avg test score:", avg_test_score)
```
%% Cell type:code id: tags:
``` python
from sklearn.datasets import load_iris # Iris dataset
iris = load_iris()
X = iris.data
y = iris.target
kNN_nestedCrossValidation(X, y, test_splits=10, val_splits=2, k_range=range(1,3), print_steps=False)
```
%% Output
Avg test score: 0.9600000000000002
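%% Cell type:markdown id: tags:
The function also accepts the string "loo" for either split argument. A usage sketch with leave-one-out in the outer loop (one outer iteration per sample, so noticeably slower than k-fold):
%% Cell type:code id: tags:
``` python
# Same X and y as above; LOO outer CV with a 5-fold inner CV.
kNN_nestedCrossValidation(X, y, test_splits="loo", val_splits=5,
                          k_range=range(1, 11), print_steps=False)
```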