%% Cell type:markdown id: tags:
<div class="alert alert-block" style="color: green">
<h1><center> DAKD 2024 EXERCISE 2: SUPERVISED LEARNING </center></h1>
%% Cell type:markdown id: tags:
### Fill in your name, student id number and email address
#### name: Emmi Ylikoski
#### student id: 2110170
#### email: eoylik@utu.fi
%% Cell type:markdown id: tags:
The previous exercise was about <i>data understanding</i> and <i>data preparation</i>, which formed the basis for the modeling phase of the data mining process. Many modeling techniques make assumptions about the data, so the exploration and preparation phases can't be ignored. Now, as we have checked the validity of data and familiarized ourselves with it, we can move on to the next stage of the Cross-Industry Standard Process for Data Mining (CRISP-DM), which is <font color = green>modeling</font>.
The questions to be answered at this stage could include:
- What kind of model architecture best fits our data?
- How well does the model perform technically?
- Could we improve its performance?
- How do we evaluate the model's performance?
<i>Machine learning</i> is a subfield of artificial intelligence that provides automatic, objective and data-driven techniques for modeling the data. Its two main branches are <i>supervised learning</i> and <i>unsupervised learning</i>, and in this exercise, we are going to use the former, <font color = green>supervised learning</font>, for classification and regression tasks.
For classification, the data remains the same as in the previous exercise, but I've already cleaned it up for you. Some data pre-processing steps are still required to ensure that it's in an appropriate format so that models can learn from it. Even though we are not conducting any major data exploration or data preparation this time, <i>you should never forget them in your future data analyses</i>.
-----
#### General Guidance for Exercises
- <b>Complete all tasks:</b> Make sure to answer all questions, even if you cannot get your script to fully work.
- <b>Code clarity:</b> Write clear and readable code. Include comments to explain what your code does.
- <b>Effective visualizations:</b> Ensure all plots have labeled axes, legends, and captions. Your visualizations should clearly represent the underlying data.
- <b>Notebook organization:</b> You can add more code or markdown cells to improve the structure of your notebook as long as it maintains a logical flow.
- <b>Submission:</b> Submit both the .ipynb and .html or .pdf versions of your notebook. Before finalizing your notebook, use the "Restart & Run All" feature to ensure it runs correctly.
- <b>Grading criteria:</b>
- The grading scale is *Fail*/*Pass*/*Pass with honors* (+1).
- To pass, you must complete the required parts 1-4.
- To achieve Pass with honors, complete the bonus exercises.
- <b>Technical issues:</b>
- If you encounter problems, start with an online search to find solutions but do not simply copy and paste code. Understand any code you use and integrate it appropriately.
- Cite all external sources used, whether for code or explanations.
- If problems persist, ask for help in the course discussion forum, at exercise sessions, or via email at tuhlei@utu.fi, aibekt@utu.fi.
- <b>Use of AI and large language models:</b>
- We do not encourage the use of AI tools like ChatGPT. If you use them, critically evaluate their outputs.
- Describe how you used the AI tools in your work, including your input and how the output was beneficial.
- <b>Time management:</b> Do not leave your work until the last moment. No feedback will be available during weekends.
- <b>Additional notes:</b>
- You can find the specific deadlines and session times for each assignment on the Moodle course page.
- Ensure all your answers are concise—typically a few sentences per question.
- Your .ipynb notebook is expected to be run to completion, which means that it should execute without errors when all cells are run in sequence.
<font color = green> The guided exercise session is held on the 27th of November at 14:15-16:00, at lecture hall X, Natura building.</font>
<font color = red size = 4>The deadline is the 2nd of December at 23:59</font>. Late submissions will not be accepted unless there is a valid excuse for an extension which should be asked **before** the original deadline.
------
%% Cell type:markdown id: tags:
### <font color = red> Packages needed for this exercise: </font>
You can use other packages as well, but this exercise can be completed with those below.
%% Cell type:code id: tags:
``` python
import numpy as np
import pandas as pd
# Visualization packages - matplotlib and seaborn
# Remember that pandas is also handy and capable when it comes to plotting!
import seaborn as sns
import matplotlib.pyplot as plt
# Machine learning package - scikit-learn
from sklearn import metrics
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split, cross_val_score, LeaveOneOut, GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import Ridge
# Show the plots inline in the notebook
%matplotlib inline
```
%% Cell type:markdown id: tags:
______________
## <font color = lightcoral>1. Classification using k-nearest neighbors </font>
%% Cell type:markdown id: tags:
We start exploring the world of data modeling by using the <font color = lightcoral>K-Nearest Neighbors (k-NN) algorithm</font>. The k-NN algorithm is a classic supervised machine learning technique based on the assumption that data points with similar features tend to belong to the same class, and thus are likely to be near each other in feature space.
In our case, we'll use the k-NN algorithm to *predict the presence of cardiovascular disease* (CVD) using all the other variables as <font color = lightcoral>features</font> in the given data set. I.e. the <font color = lightcoral>target variable</font> that we are interested in is `cardio`. Let's have a brief look at the features again:
| Feature | Type | Explanation |
| :- | :- | :-
| age | numeric | The age of the patient in years
| sex | binary | Female == 0, Male == 1
| height | numeric | Measured height of the patient (cm)
| weight | numeric | Measured weight of the patient (kg)
| ap_hi | numeric | Measured Systolic blood pressure
| ap_lo | numeric | Measured Diastolic blood pressure
| smoke | binary | A subjective feature based on asking the patient whether or not he/she smokes
| alco | binary | A subjective feature based on asking the patient whether or not he/she consumes alcohol
| active | binary | A subjective feature based on asking the patient whether or not he/she exercises regularly
| cholesterol | categorical | Cholesterol associated risk information evaluated by a doctor
| gluc | categorical | Glucose associated risk information evaluated by a doctor
But first, we need data for the task. The code for loading the data into the environment is provided for you. The code should work but make sure that you have the CSV file of the data in the same directory where you have this notebook file.
**Exercise 1 A)**
Take a random sample of 1000 rows from the dataframe using a fixed random seed. Print the first 15 rows to check that everything is ok with the dataframe.
*Note: As mentioned, the data remains the same, but cholesterol has been one-hot-encoded for you already. There's a new variable, `gluc` (about glucose aka blood sugar levels), which is also one-hot-encoded for you. It has similar values as `cholesterol`.*
%% Cell type:code id: tags:
``` python
### Loading code provided
# ------------------------------------------------------
# The data file should be at the same location as the
# exercise file to make sure the following lines work!
# Otherwise, fix the path.
# ------------------------------------------------------
# Path for the data
data_path = 'ex2_cardio_data.csv'
# Read the CSV file
cardio_data = pd.read_csv(data_path)
```
%% Cell type:code id: tags:
``` python
### Code - Resample and print 15 rows
cardio_data_resampled = cardio_data.sample(n = 1000, random_state = 1)
cardio_data_resampled.head(15)
```
%% Output
age sex height weight ap_hi ap_lo smoke alco active cardio \
1483 40 1 164 65.0 120 80 0 0 1 0
2185 51 1 183 78.0 110 80 0 0 1 0
2520 50 1 160 78.0 150 90 0 0 1 0
3721 41 0 170 90.0 120 80 0 0 1 0
3727 45 1 169 64.0 120 80 1 1 1 0
4524 43 1 168 89.0 140 100 0 0 1 1
234 42 1 167 62.0 120 80 0 0 1 0
4735 64 0 158 65.0 120 80 0 0 1 1
5839 41 1 181 70.0 120 80 0 0 1 1
2939 49 0 164 92.0 120 80 0 0 1 0
3053 59 1 178 80.0 150 90 0 0 1 0
867 50 0 162 75.0 100 70 0 0 1 0
276 50 0 160 62.0 120 80 0 0 1 0
5798 57 1 160 62.0 130 80 0 0 1 1
3512 49 0 162 72.0 120 80 0 0 1 0
cholesterol_normal cholesterol_at_risk cholesterol_elevated \
1483 1 0 0
2185 1 0 0
2520 0 0 1
3721 1 0 0
3727 1 0 0
4524 1 0 0
234 0 1 0
4735 1 0 0
5839 1 0 0
2939 1 0 0
3053 1 0 0
867 1 0 0
276 1 0 0
5798 1 0 0
3512 1 0 0
gluc_normal gluc_at_risk gluc_elevated
1483 1 0 0
2185 1 0 0
2520 0 0 1
3721 1 0 0
3727 1 0 0
4524 1 0 0
234 1 0 0
4735 1 0 0
5839 1 0 0
2939 1 0 0
3053 1 0 0
867 1 0 0
276 1 0 0
5798 1 0 0
3512 1 0 0
%% Cell type:code id: tags:
``` python
# just testing it worked the way I wanted
cardio_data_resampled.shape
```
%% Output
(1000, 16)
%% Cell type:markdown id: tags:
----
We have the data so now, let's put it to use. All the analyses will be done based on this sample of 1000.
To teach the k-NN algorithm (or any other machine learning algorithm) to recognize patterns, we need <font color = lightcoral>training data</font>. However, to assess how well a model has learned these patterns, we require <font color = lightcoral>test data</font> which is new and unseen by the trained model. It's important to note that the test set is not revealed to the model until after the training is complete.
So, to *estimate the performance of a model*, we may use a basic <font color = lightcoral>train-test split</font>. The term "split" is there because we literally split the data into two sets.
Before the exercise itself, we might as well discuss the reproducibility of the experiments we conduct in research. It would be quite a nightmare if our code produced different results on every run. To address this, we can set a **random seed** to ensure that any random processes, such as splitting our dataset into training and test sets, yield consistent results across multiple runs. By using a fixed random seed, we enhance the reproducibility of our experiments, making it easier to validate findings. In fact, we already used one when sampling our subset from the loaded dataset.
**Exercise 1 B)**
Gather the features into one array and the target variable into another array. Create training and test data by splitting the data into training (80%) and test (20%) sets. Use a fixed random seed to ensure that even if you execute this cell hundreds of times, you will get the same split each time.
- Do you need stratification for our dataset? Explain your decision.
%% Cell type:code id: tags:
``` python
### Code - Train-test split
# Arrays for features and target variable
features = ["age","sex","height","weight","ap_hi","ap_lo","smoke","alco","active","cholesterol_normal","cholesterol_at_risk",
"cholesterol_elevated","gluc_normal","gluc_at_risk","gluc_elevated"]
target = ["cardio"]
# Separate dataframes for features and the target variable
data_features = cardio_data_resampled[features]
data_target = cardio_data_resampled[target]
# checking how the target variable's values are divided so I can know if stratification is needed
print(data_target.value_counts())
# splitting the data into train and test set
x_train, x_test, y_train, y_test = train_test_split(data_features, data_target, test_size=0.2, train_size=0.8, random_state=1, stratify=data_target)
```
%% Output
cardio
0 707
1 293
Name: count, dtype: int64
%% Cell type:markdown id: tags:
<font color = lightcoral> \<Write your answer here\></font>
%% Cell type:markdown id: tags:
I printed the value counts of the target variable to see how the samples are divided between the two classes. Class 0 appears much more often than class 1, meaning that the classes are quite imbalanced. Because of this, I think we need stratification, so that both the train and test sets have the same class ratio as the whole sample.
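%% Cell type:markdown id: tags:
As a quick sanity check (a minimal sketch, assuming the `data_target`, `y_train` and `y_test` variables from the split above), the class proportions can be compared between the full sample and the two subsets; with stratification they should be practically identical.
%% Cell type:code id: tags:
``` python
# Sanity check: a stratified split should preserve the ~70/30 class ratio
# in both the training and the test set.
print("full sample:", data_target["cardio"].value_counts(normalize=True).round(3).to_dict())
print("train set:  ", y_train["cardio"].value_counts(normalize=True).round(3).to_dict())
print("test set:   ", y_test["cardio"].value_counts(normalize=True).round(3).to_dict())
```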
%% Cell type:markdown id: tags:
----------
**Exercise 1 C)**
Standardize the numerical features in both the train and test sets.
- Explain how the k-NN model makes predictions about whether or not a patient has cardiovascular disease (CVD) when the features are not standardized. Specifically, discuss how the varying scales of different features can influence the model's predictions, and how standardization would change this influence.
*Note: Some good information about preprocessing and how to use it for train and test data can be found at https://scikit-learn.org/stable/modules/preprocessing.html*.
%% Cell type:code id: tags:
``` python
### Code - Standardization
# Separating the numerical features
numerical_features = ["age", "height", "weight", "ap_hi", "ap_lo"]
# then standardizing them with StandardScaler (documentation found in the link provided)
scaler = StandardScaler()
scaler.fit(x_train[numerical_features])
x_train[numerical_features] = scaler.transform(x_train[numerical_features])
x_test[numerical_features] = scaler.transform(x_test[numerical_features])
```
%% Cell type:code id: tags:
``` python
# checking it worked the way I wanted to
x_train
```
%% Output
age sex height weight ap_hi ap_lo smoke alco \
1573 0.520514 0 0.063826 -0.195042 -0.174525 -0.080723 0 0
3762 -0.788498 0 0.294348 1.257000 -1.153627 -0.353896 0 0
5818 1.684081 1 -1.204047 -0.577158 0.151842 -0.217309 0 0
5300 -0.497606 1 0.179087 -0.806428 0.478210 0.055864 0 0
5393 0.520514 0 -0.743002 0.110651 -0.500892 -0.217309 0 0
... ... ... ... ... ... ... ... ...
457 -1.370281 0 -0.973524 0.492768 -0.174525 -0.080723 0 0
2688 -0.206715 0 -0.743002 -1.264967 -0.174525 -0.080723 0 0
1817 -1.224836 0 -0.166696 -1.647083 -0.827260 -0.217309 0 0
5627 0.229623 0 -0.512480 0.187075 -0.174525 -0.080723 0 0
5969 1.102297 0 0.063826 -0.424311 -0.174525 -0.080723 0 0
active cholesterol_normal cholesterol_at_risk cholesterol_elevated \
1573 0 1 0 0
3762 1 0 1 0
5818 1 1 0 0
5300 1 1 0 0
5393 1 1 0 0
... ... ... ... ...
457 1 1 0 0
2688 1 1 0 0
1817 1 1 0 0
5627 1 1 0 0
5969 0 1 0 0
gluc_normal gluc_at_risk gluc_elevated
1573 1 0 0
3762 1 0 0
5818 1 0 0
5300 1 0 0
5393 1 0 0
... ... ... ...
457 1 0 0
2688 1 0 0
1817 1 0 0
5627 1 0 0
5969 1 0 0
[800 rows x 15 columns]
%% Cell type:code id: tags:
``` python
# checking it worked the way I wanted to
x_test
```
%% Output
age sex height weight ap_hi ap_lo smoke alco \
2285 1.393189 1 1.562221 0.722037 0.151842 -0.080723 0 0
141 -1.079390 1 0.640132 -0.347888 -0.174525 -0.080723 0 0
3512 -0.497606 0 -0.281958 -0.042195 -0.174525 -0.080723 0 0
3422 -0.206715 0 0.063826 -0.653581 -0.500892 -0.080723 0 0
5320 1.102297 1 0.063826 -0.577158 0.478210 -0.080723 0 0
... ... ... ... ... ... ... ... ...
5578 0.375068 0 -1.088786 1.639116 -0.500892 -0.217309 0 0
3607 -1.806619 0 0.294348 -0.959274 -0.500892 -0.080723 0 0
1941 -1.224836 0 0.524870 -0.424311 -0.174525 -0.080723 0 0
5692 0.520514 0 -0.512480 -0.195042 1.130944 0.055864 0 0
1385 -1.952065 0 -0.973524 -0.271465 -0.174525 -0.080723 0 0
active cholesterol_normal cholesterol_at_risk cholesterol_elevated \
2285 1 0 0 1
141 1 1 0 0
3512 1 1 0 0
3422 1 1 0 0
5320 1 1 0 0
... ... ... ... ...
5578 0 0 0 1
3607 1 1 0 0
1941 0 1 0 0
5692 1 1 0 0
1385 1 1 0 0
gluc_normal gluc_at_risk gluc_elevated
2285 1 0 0
141 1 0 0
3512 1 0 0
3422 1 0 0
5320 1 0 0
... ... ... ...
5578 1 0 0
3607 1 0 0
1941 1 0 0
5692 1 0 0
1385 1 0 0
[200 rows x 15 columns]
%% Cell type:markdown id: tags:
<font color = lightcoral> \<Write your answer here\></font>
%% Cell type:markdown id: tags:
If the features are not standardized, the features with much larger scales would have a much bigger impact on the prediction, overruling the features with smaller scales. In our dataset, height, for example, has a bigger scale than most of the other numerical features, so when the k-NN model calculates distances it would have a much bigger impact on the distance and therefore on the predicted result, even though it is no more important than any other feature. When the features are standardized, they all carry equal weight in the prediction, because they are all on the same scale and none of them dominates just because it has larger values.
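%% Cell type:markdown id: tags:
To make the scale effect concrete, below is a small illustrative sketch with made-up numbers (not taken from our dataset): one large-scale feature dominates the Euclidean distance that k-NN uses, and standardization removes that dominance.
%% Cell type:code id: tags:
``` python
# Two toy patients described by (age in years, systolic blood pressure in mmHg).
# The blood-pressure difference (20 mmHg) dwarfs the age difference (5 years),
# so it dominates the raw Euclidean distance regardless of which feature matters more.
a = np.array([40.0, 120.0])
b = np.array([45.0, 140.0])
print("raw distance:", np.linalg.norm(a - b))

# After standardization (fitted on a small made-up reference sample),
# both features contribute on a comparable scale to the distance.
reference = np.array([[40.0, 120.0], [45.0, 140.0], [50.0, 130.0], [60.0, 150.0]])
toy_scaler = StandardScaler().fit(reference)
a_scaled, b_scaled = toy_scaler.transform([a, b])
print("scaled distance:", np.linalg.norm(a_scaled - b_scaled))
```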
%% Cell type:markdown id: tags:
-------
It's time for us to train the model!
**Exercise 1 D)**
Train a k-NN model with $k=3$. Print out the confusion matrix and use it to compute the accuracy, the precision and the recall.
- What does each cell in the confusion matrix represent in the context of our dataset?
- How does the model perform with the different classes? Where do you think the differences come from? Interpret the performance metrics you just computed.
- With our dataset, why should you be a little more cautious when interpreting the accuracy?
*Note: We are very aware that there are functions available for these metrics, but this time, please calculate them using the confusion matrix.*
%% Cell type:code id: tags:
``` python
# First we need to change the y_train dataframe to a 1-D numpy array because the classifier expects the labels in that form
y_train = y_train.iloc[:, 0].to_numpy()
```
%% Cell type:code id: tags:
``` python
### Code - the kNN classifier
# training the model
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(x_train, y_train)
# prediction
prediction = knn.predict(x_test)
# confusion matrix
conf_matrix = metrics.confusion_matrix(y_test, prediction, labels=[0, 1])
conf_matrix
```
%% Output
array([[125, 16],
[ 37, 22]])
%% Cell type:code id: tags:
``` python
# metrics
# Unpack the confusion matrix: rows = actual (0, 1), columns = predicted (0, 1)
tn, fp, fn, tp = conf_matrix.ravel()
print(f"accuracy: {(tn + tp) / (tn + fp + fn + tp)}")  # correct predictions / all predictions
print(f"precision: {tp / (tp + fp)}")  # True Positives / (True Positives + False Positives)
print(f"recall: {tp / (tp + fn)}")  # True Positives / (True Positives + False Negatives)
```
%% Output
accuracy: 0.735
precision: 0.5789473684210527
recall: 0.3728813559322034
%% Cell type:markdown id: tags:
<font color = lightcoral> \<Write your answer here\></font>
%% Cell type:markdown id: tags:
**- What does each cell in the confusion matrix represent in the context of our dataset?** <br>
<br>
In the confusion matrix the rows represent the actual labels and the columns the predicted labels: the first row/column corresponds to class 0 and the second to class 1. <br> So:
- Cell (0,0) represents the number of True Negatives, which means that the actual and predicted label is 0
- Cell (0,1) False Positives, which means that the actual label is 0 and the predicted label is 1
- Cell (1,0) False Negatives, which means that the actual label is 1 and the predicted label is 0
- Cell (1,1) True Positives, which means that the actual label is 1 and the predicted label is 1.
<br>
<br>
**- How does the model perform with the different classes? Where do you think the differences come from? Interpret the performance metrics you just computed.** <br>
<br>
The model seems to perform a bit better on class 0, because it has predicted more True Negatives than True Positives; of course, class 0 is also much more common, since the split between the two classes was about 70/30. There are also more False Negatives than False Positives, which means that the model misclassifies class 1 more often than class 0.
<br>
<br>
The accuracy is pretty decent in my opinion: the model is right about 70% of the time. The precision is a bit lower, which means that a fair share of the positive predictions are wrong. The recall, on the other hand, is very low compared to the accuracy and precision, which means that the model does not detect the actual positives very well.
<br>
<br>
These differences in the metrics and between the classes could come from the fact that class 1 does not appear in the data as often as class 0, which can make it harder for the model to predict class 1 because there is less training data for that class.
**- With our dataset, why should you be a little more cautious when interpreting the accuracy?**
<br>
<br>
I think it is because there is some imbalance between the target classes. The ratio between the classes is about 70/30, so a model that always predicted 0 would still reach about 70% accuracy, close to what we got, while failing to identify any samples from class 1. In our case we especially want to avoid False Negatives, because we want to diagnose people with the disease correctly: with a false negative, someone could miss the diagnosis and therefore not receive the care that they need.
%% Cell type:markdown id: tags:
__________
## <font color = royalblue> 2. Classification accuracy using leave-one-out cross-validation
%% Cell type:markdown id: tags:
While the train-test split may provide us with an unbiased estimate of the performance, we only evaluate the model once. Especially when dealing with small datasets, the test set itself will be very small. How can we be sure that the evaluation is accurate with this small test set and not just good (or bad) luck? And what if we'd like to compare two models and one seems to be better -- how can we be sure that it's not just a coincidence?
Well, there's a great help available and it's called <font color = royalblue>cross-validation</font>. With its help, we can split the dataset into multiple different training and test sets, which allows us to evaluate models across various data partitions. This time, we'll take a closer look at the <font color = royalblue>leave-one-out cross-validation</font>.
%% Cell type:markdown id: tags:
**Exercise 2**
Let's keep the focus on detecting the CVD, so once again we utilize the k-NN model (with $k=3$) to predict the presence of the disease. Now, apply leave-one-out cross-validation to assess whether the k-NN model is suitable for addressing the problem. You may use the entire sample of 1000 on this task.
- What can you say about the accuracy compared to the previous task?
- What do you think: Does the k-NN model work for the problem in hand? Explain your answer.
*Tip: This can certainly be done manually, but `cross_val_score` is also a very handy function.*
%% Cell type:code id: tags:
``` python
# Let's combine our train and test sets because here we use the whole 1000 samples
x_train = pd.concat([x_train, x_test])
y_test = y_test.iloc[:, 0].to_numpy()
y_train = np.append(y_train, y_test)
print(len(y_train))
print(x_train.shape)
```
%% Output
1000
(1000, 15)
%% Cell type:code id: tags:
``` python
### Code - Leave-one-out cross-validation
# model
model = KNeighborsClassifier(n_neighbors=3)
# leave-one-out cross-val
loo = LeaveOneOut()
scores = cross_val_score(model, x_train, y_train, cv=loo, scoring="accuracy")
print(f"mean accuracy between the splits: {scores.mean()}")
print(f"standard deviation of accuracy: {scores.std()}")
```
%% Output
mean accuracy between the splits: 0.732
standard deviation of accuracy: 0.4429175995600084
%% Cell type:markdown id: tags:
<font color = royalblue> \<Write your answer here\></font>
%% Cell type:markdown id: tags:
**- What can you say about the accuracy compared to the previous task?**
<br>
The accuracy doesn't change much with leave-one-out cross-validation, which makes the estimate more reliable than in the previous task. However, I would still take into account the class imbalance of the target variable, and I also printed out the standard deviation of the accuracy, which indicates that there is still some variation between the individual splits. And again, because the ratio between the target classes is about 70/30, that accuracy isn't that good, because the model could still always predict 0 and get about 70% accuracy.
<br>
**- What do you think: Does the k-NN model work for the problem in hand? Explain your answer.**
<br>
I don't think it works that well for this problem, because the model does not detect the actual positives very well. It can predict the negatives well, but it does not generalize well enough for class 1. And as I said, False Negatives are costly in our case, because some people would go undiagnosed.
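%% Cell type:markdown id: tags:
One way to back this up (a sketch, assuming the combined 1000-sample `x_train`/`y_train` from above and `cross_val_predict` from scikit-learn) is to collect the leave-one-out prediction for every sample and summarize them per class:
%% Cell type:code id: tags:
``` python
from sklearn.model_selection import cross_val_predict

# Each sample gets one prediction from the model trained on the other 999 samples.
loo_predictions = cross_val_predict(KNeighborsClassifier(n_neighbors=3),
                                    x_train, y_train, cv=LeaveOneOut())
# A single confusion matrix / per-class report over all leave-one-out predictions
print(metrics.confusion_matrix(y_train, loo_predictions))
print(metrics.classification_report(y_train, loo_predictions, digits=3))
```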
%% Cell type:markdown id: tags:
____________
## <font color = forestgreen> 3. Model selection with leave-one-out cross-validation
%% Cell type:markdown id: tags:
So far, we've trained one model at a time and I've given the value of k for you. Accuracy is what it is (no spoilers here), but could we still do a little better? Let's explore that possibility through a process known as <font color=green>hyperparameter tuning</font>. Cross-validation is an especially important tool for this task. Note here that model selection and model evaluation (or assessment) are two different things: we use model selection to estimate the performance of various models in order to identify the model which is most likely to provide the "best" predictive performance for the task. And when we have found this most suitable model, we *assess* its performance and generalisation power on unseen data.
This time, we're going to train multiple models, let's say 30, and our goal is to select the best K-Nearest Neighbors model from this set. Most models come with various hyperparameters that require careful selection, and the k-NN model is no exception. Although we're talking about the number of neighbors here, it's important to note that k-NN also has several other hyperparameters, such as the distance measure used. However, for the sake of simplicity, this time we'll focus solely on fine-tuning the number of nearest neighbors, that is, the value of k, and use default values for all the other hyperparameters.
Let's focus on the model selection part here for the sake of comprehending the cross-validation itself. We'll get to the whole pipeline, which also includes model assessment, later on.
**Exercise 3**
Find the optimal k value from a set of $k=1...30$ using leave-one-out cross-validation. Plot the accuracies vs. the k values. Again, you may use the entire sample of 1000 on this task.
- Which value of k produces the best accuracy when using leave-one-out cross-validation? Compare the result to the previous model with $k=3$.
- If the number of k is still increased, what is the limit that the accuracy approaches? Why?
- Discuss the impact of choosing a very small or very large number of neighbors on the k-NN model's ability to distinguish between the healthy individuals and the ones with CVD.
%% Cell type:code id: tags:
``` python
### Code - Select best model
# setting parameters
accuracies = []
loo = LeaveOneOut()
# iterating k from 1 to 30, doing leave-one-out cross-validation on each iteration and appending the accuracy to a list
for i in range(1, 31):
model = KNeighborsClassifier(n_neighbors=i)
scores = cross_val_score(model, x_train, y_train, cv=loo, scoring="accuracy")
accuracies.append(scores.mean())
# getting the best accuracy and the k value that produced it
best = max(accuracies)
print(f"Best accuracy: {best} was achieved with k number of neighbors: {accuracies.index(best)+1}")
```
%% Output
Best accuracy: 0.738 was achieved with k number of neighbors: 11
%% Cell type:code id: tags:
``` python
# just for myself to see
print(accuracies)
```
%% Output
[0.69, 0.722, 0.732, 0.72, 0.728, 0.723, 0.728, 0.722, 0.726, 0.731, 0.738, 0.73, 0.738, 0.732, 0.735, 0.726, 0.726, 0.719, 0.723, 0.726, 0.726, 0.714, 0.723, 0.713, 0.716, 0.719, 0.718, 0.718, 0.717, 0.718]
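%% Cell type:markdown id: tags:
The same model selection can also be expressed with `GridSearchCV` (already imported at the top). This is only a cross-check sketch, assuming the same 1000-sample `x_train`/`y_train` as above; it should agree with the loop above on the best k.
%% Cell type:code id: tags:
``` python
# Equivalent hyperparameter search: k = 1...30, leave-one-out cross-validation,
# accuracy as the scoring metric.
param_grid = {"n_neighbors": list(range(1, 31))}
grid_search = GridSearchCV(KNeighborsClassifier(), param_grid,
                           cv=LeaveOneOut(), scoring="accuracy")
grid_search.fit(x_train, y_train)
print("best k:", grid_search.best_params_["n_neighbors"])
print("best LOO accuracy:", grid_search.best_score_)
```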
%% Cell type:code id: tags:
``` python
### Code - Plot the accuracies vs. the values for k
plt.plot(range(1,31), accuracies, marker="o", label = "accuracy")
plt.xlabel("Number of Neighbors")
plt.ylabel("Accuracy")
plt.title("k-NN Accuracy vs. Number of Neighbors")
plt.axvline(11, color='r', label=f'Optimal k = {11}')
plt.axvline(13, color='r', label=f'Optimal k = {13}')
plt.legend()
plt.show()
```
%% Output
%% Cell type:markdown id: tags:
<font color = green>\<Write your answer here\></font>
%% Cell type:markdown id: tags:
**- Which value of k produces the best accuracy when using leave-one-out cross-validation? Compare the result to the previous model with $k=3$.**
<br>
The k values of 11 and 13 produce the best accuracy, 0.738. The change in accuracy between the different models is not that big. This tells me that the model can't generalize that well for this specific problem, even though we use cross-validation and parameter tuning.
<br>
<br>
**- If the number of k is still increased, what is the limit that the accuracy approaches? Why?**
<br>
<br>
The accuracy tends to approach something between 0.71 and 0.72. When the number of neighbors is increased, the model takes more data points into account when making the prediction, and the majority class among those neighbors is assigned as the prediction. When the number of neighbors is too large, the majority class in the dataset starts to dominate the predictions, because the model can no longer detect local variations in the data and the minority class goes unseen. So the accuracy approaches the 70/30 class ratio, because the model ends up predicting only the majority class.
<br>
<br>
**- Discuss the impact of choosing a very small or very large number of neighbors on the k-NN model's ability to distinguish between the healthy individuals and the ones with CVD.**
<br>
<br>
As said in the previous answer, when k is too large the model tends to predict only the majority class, which in our case is 0, because it can no longer detect the variations in the data. So many people would go undiagnosed.
<br>
<br>
If k is very small, the model can learn and predict the training data really well, but it can't generalize to unseen test data that well. So again it doesn't work well for diagnosing people with the disease if it can't generalize beyond the training data.
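%% Cell type:markdown id: tags:
To make that limit concrete, the accuracy of always predicting the majority class can be checked directly (a small sketch, assuming `y_train` holds all 1000 labels as above):
%% Cell type:code id: tags:
``` python
# A "classifier" that always predicts the majority class (0 = no CVD) is right
# exactly as often as the majority class occurs; this is roughly the limit that
# the k-NN accuracy approaches as k grows very large.
majority_share = (y_train == 0).mean()
print(f"majority-class baseline accuracy: {majority_share}")
```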
%% Cell type:markdown id: tags:
_____________________
## <font color = darkorange> 4. Ridge regression </font>
%% Cell type:markdown id: tags:
The previous exercises were about classification. Now, we are ready to see another kind of supervised learning - regression - as we are changing our main goal from predicting discrete classes (healthy/sick) to estimating continuous values. The following exercises are going to involve utilizing one regression model, <font color = darkorange>Ridge Regression</font>, and our goal is to evaluate the performance of this model.
%% Cell type:markdown id: tags:
Let's change the dataset to make the following exercises more intuitive. The new dataset is about brushtail possums and it includes variables such as
| Feature | Type | Explanation |
| :- | :- | :-
|sex|binary| Sex, either male (0) or female (1)
|age|numeric| Age in years
|len_head|numeric| Head length in mm
|width_skull|numeric| Skull width in mm
|len_earconch|numeric| Ear conch length in mm
|width_eye|numeric| Distance from medial canthus to lateral canthus of right eye, i.e., eye width in mm
|len_foot|numeric| Foot length in mm
|len_tail|numeric| Tail length in mm
|chest|numeric| Chest girth in mm
|belly |numeric| Belly girth in mm
|len_total|numeric| Total length in mm
In this case, our target variable will be *the age of the possum*. The data for this exercise has been modified from the original source.
There's the code chunk for loading data provided again. <font color = red>Again, the data file should be located in the same directory as this notebook file!</font>
%% Cell type:code id: tags:
``` python
### Loading code provided
# ------------------------------------------------------
# The data file should be at the same location as the
# exercise file to make sure the following lines work!
# Otherwise, fix the path.
# ------------------------------------------------------
# Data path
data_path = 'ex2_possum_data.csv'
# Load the data
possum_data = pd.read_csv(data_path)
```
%% Cell type:markdown id: tags:
-----------
Regression allows us to examine <font color = darkorange>relationships between two or more variables</font>. This relationship is represented by an *equation*, which itself represents how a change in one variable affects another on average. For example, we could examine how a change in a possum's total length affects, on average, its estimated age.
We start by examining those relationships between the variables in the given dataset.
**Exercise 4 A)**
Plot pairwise relationships between the age variable and the others where you color the samples based on the sex variable.
- Which body dimensions seem to be most correlated with age? And are there any variables that seem to have no correlation with it?
- Are there any differences in the correlations between males and females?
*Tip: `seaborn.pairplot()` is handy with the parameters `(x,y)_vars` and `hue`. You actually can fit a linear model to draw a regression line with the parameter `kind` set to `"reg"`.*
%% Cell type:code id: tags:
``` python
possum_data
```
%% Output
sex age len_head width_skull len_earconch width_eye len_foot \
0 0 8.0 94.1 60.4 54.5 15.2 74.5
1 1 6.0 92.5 57.6 51.2 16.0 72.5
2 1 6.0 94.0 60.0 51.9 15.5 75.4
3 1 6.0 93.2 57.1 52.2 15.2 76.1
4 1 2.0 91.5 56.3 53.2 15.1 71.0
.. ... ... ... ... ... ... ...
96 0 1.0 89.5 56.0 46.8 14.8 66.0
97 0 1.0 88.6 54.7 48.0 14.0 64.4
98 1 6.0 92.4 55.0 45.4 13.0 63.5
99 0 4.0 91.5 55.2 45.9 15.4 62.9
100 1 3.0 93.6 59.9 46.0 14.8 67.6
len_tail chest belly len_total
0 360.0 280.0 360.0 890.0
1 365.0 285.0 330.0 915.0
2 390.0 300.0 340.0 955.0
3 380.0 280.0 340.0 920.0
4 360.0 285.0 330.0 855.0
.. ... ... ... ...
96 365.0 230.0 270.0 815.0
97 390.0 250.0 330.0 825.0
98 380.0 250.0 300.0 890.0
99 365.0 250.0 290.0 825.0
100 400.0 285.0 335.0 890.0
[101 rows x 11 columns]
%% Cell type:code id: tags:
``` python
### Code - Pairplot
g = sns.pairplot(possum_data, y_vars=["age"], hue="sex", kind="reg")
# plt.title() would only title the last subplot, so use the figure-level suptitle instead
g.fig.suptitle("Pairwise relationships between age and the other variables", y=1.02)
```
%% Output
Text(0.5, 1.02, 'Pairwise relationships between age and the other variables')
%% Cell type:markdown id: tags:
<font color = darkorange>\<Write your answer here\>
%% Cell type:markdown id: tags:
**- Which body dimensions seem to be most correlated with age? And are there any variables that seem to have no correlation with it?**
<br>
<br>
The length of the ear conch seems to have no strong correlation with age. The length of the foot also has only a weak correlation with age. The belly and chest girths, on the other hand, seem to be the most correlated with age.
<br>
<br>
**- Are there any differences in the correlations between males and females?**
<br>
<br>
There are some notable differences in the correlations between males and females. For example, tail length seems to be strongly correlated with age for females but not for males, while skull width seems to have a strong correlation with age for males but not for females. Total length seems to be correlated with age for females but not so much for males.
%% Cell type:markdown id: tags:
------
Before the regression analysis itself, let's check that our dataset is in a proper format. We'll also perform the train-test split as we're going to test the overall performance of the model using the test set.
**Exercise 4 B)**
Do you need to prepare the data a little? Explain your decision. Perform the train-test (80/20) split.
*Note: Set the features in the dataframe named as `possum_X` so you can play around with the upcoming code snippet.*
%% Cell type:code id: tags:
``` python
### Code - Data preparation
# arrays for features and the target variable
features2 = ["sex", "len_head", "width_skull", "len_earconch", "width_eye", "len_foot", "len_tail", "chest", "belly", "len_total"]
target2 = ["age"]
# Separate dataframes for features and the target variable
possum_X = possum_data[features2]
possum_Y = possum_data[target2]
# splitting the data into train and test set
X_train, X_test, Y_train, Y_test = train_test_split(possum_X, possum_Y, test_size=0.2, train_size=0.8, random_state=1)
```
%% Cell type:code id: tags:
``` python
# standardizing numerical features
# Separating the numerical features
numerical_features2 = ["len_head", "width_skull", "len_earconch", "width_eye", "len_foot", "len_tail", "chest", "belly", "len_total"]
# then standardizing them with the StandardScaler
scaler = StandardScaler()
scaler.fit(X_train[numerical_features2])
X_train[numerical_features2] = scaler.transform(X_train[numerical_features2])
X_test[numerical_features2] = scaler.transform(X_test[numerical_features2])
```
%% Cell type:code id: tags:
``` python
# checking if it worked
X_test
```
%% Output
sex len_head width_skull len_earconch width_eye len_foot len_tail \
94 0 0.194444 4.229331 -1.026541 -0.571192 -0.713125 -1.007116
78 1 -2.036627 -1.005788 -1.370358 0.264699 -1.852730 -0.229920
17 0 0.287405 -0.396220 1.699441 0.171822 1.170711 -0.488986
100 1 0.318392 1.109774 -0.609048 -0.292562 -0.247980 1.583536
36 1 -1.014053 -0.718932 0.864455 -1.407082 0.589280 -1.007116
85 0 1.588863 0.571919 -0.559931 -0.664068 -0.992212 -1.007116
55 0 3.076244 2.149626 -0.977424 -0.385438 1.054425 1.583536
83 0 1.836760 1.396629 -0.412581 -0.106808 0.705566 2.360732
82 1 -1.354911 -1.292644 -1.173891 -1.407082 -1.945759 0.806341
52 0 2.270580 1.683485 -0.412581 1.193466 0.007849 0.547275
95 1 0.225431 -0.216935 -1.149333 -1.035575 -0.899183 0.806341
44 0 1.867747 2.293054 -0.879190 1.750726 -0.410781 -1.525247
31 1 0.535302 -0.037650 0.864455 -0.199685 1.426541 1.065406
93 0 -0.487272 -1.328501 -0.609048 -0.571192 -1.131755 0.547275
65 0 1.681825 -0.432077 -0.707282 -0.106808 -0.852668 0.288210
35 0 0.225431 0.894632 0.864455 -0.199685 1.310255 -1.007116
66 1 -0.208388 -0.145221 -1.075657 -1.964342 -0.759640 0.547275
70 1 -1.199975 -1.722928 -0.633606 -0.385438 -1.666672 0.547275
81 0 -0.952079 -0.216935 -1.223008 -0.571192 -0.713125 1.842602
80 0 -1.292937 -0.790646 -1.198449 1.100589 -1.410842 -0.488986
33 0 -0.611221 -0.396220 1.134598 -0.664068 1.031168 -0.229920
chest belly len_total
94 0.714590 -0.213867 -0.772908
78 -0.476393 -0.213867 -1.253162
17 0.476393 -0.213867 0.667853
100 0.714590 0.315288 0.427726
36 0.476393 -0.390251 -1.133099
85 0.952786 -0.743021 -0.652845
55 2.381965 1.197212 2.108614
83 -0.476393 1.197212 1.388233
82 -0.476393 -0.566636 -0.172591
52 0.238197 -0.390251 1.556322
95 0.476393 0.844442 -0.172591
44 0.476393 0.844442 -0.532781
31 0.476393 0.491673 1.628360
93 -0.952786 -0.390251 -0.292655
65 0.714590 0.491673 0.187599
35 -0.714590 1.197212 0.187599
66 0.000000 0.491673 -0.052528
70 -0.476393 0.491673 -1.013035
81 0.000000 -0.390251 1.148107
80 -0.952786 -1.448560 -1.613352
33 -0.476393 -1.448560 -0.412718
%% Cell type:markdown id: tags:
<font color = darkorange>\<Write your answer here\></font>
%% Cell type:markdown id: tags:
I standardized the numerical features so they would all be on the same scale, so that none of them has a much bigger impact on the prediction just because it contains bigger values than the rest.
%% Cell type:markdown id: tags:
------
Regarding Ridge Regression, we'll focus on the hyperparameter called $\lambda$ (read as 'lambda'), the regularization term (or penalty term or L2 penalty, however we'd like to call it).
**Exercise 4 C)**
Fit a ridge regression model with the whole training set. For the hyperparameter 'lambda', use 64. Evaluate the model using the test set and describe the results. For evaluating on the test set, use a metric called mean absolute error (MAE).
- How well did the model perform in estimating the possums' ages?
- How do you interpret the MAE in our case when the target variable is age?
%% Cell type:code id: tags:
``` python
## Code - Ridge regression
# model
rig = Ridge(alpha=64)
# fitting training data
rig.fit(X_train, Y_train)
# making predictions for the test data
predictions2 = rig.predict(X_test)
# calculating the MAE
MAE = metrics.mean_absolute_error(Y_test, predictions2)
print(f"Age predictions: {predictions2}")
print(f"MAE: {MAE}")
```
%% Output
Age predictions: [[3.84077859]
[3.01248392]
[4.12395 ]
[4.34062733]
[3.20616884]
[3.86200805]
[5.57810952]
[4.81658784]
[3.04225297]
[4.80800151]
[3.93167973]
[4.57525185]
[4.53022639]
[3.2698389 ]
[4.33866247]
[3.95309768]
[3.52138533]
[3.3189372 ]
[3.73755716]
[2.92022269]
[3.24543027]]
MAE: 1.3946477474717445
%% Cell type:code id: tags:
``` python
# Coefficients
rig.coef_
```
%% Output
array([[0.02888464, 0.19298885, 0.06390551, 0.07666833, 0.17864283,
0.01086109, 0.12134348, 0.19812574, 0.17526048, 0.12698785]])
%% Cell type:code id: tags:
``` python
X_train
```
%% Output
sex len_head width_skull len_earconch width_eye len_foot len_tail \
32 0 -0.363324 -0.790646 0.815339 -0.292562 0.496251 0.029145
40 0 -2.253537 -0.969931 0.667988 -1.221329 -1.387585 -2.561508
39 1 -0.487272 -0.647218 0.717105 -1.407082 0.961396 -0.488986
38 1 -2.439459 -1.902213 1.208273 -1.964342 0.007849 -1.525247
46 1 -0.301350 -0.145221 -0.314347 -0.199685 -0.852668 0.547275
.. ... ... ... ... ... ... ...
75 0 -2.098601 -2.440067 -1.345800 -0.292562 -1.364328 -0.229920
9 1 -0.239375 0.428491 1.208273 -0.664068 0.519509 0.288210
72 0 -0.053453 -0.001793 -0.510814 2.493739 -0.968955 2.101667
12 0 0.783199 1.109774 0.324171 0.636205 0.542766 -0.488986
37 0 -0.053453 -0.288649 0.250496 0.729082 -0.061922 -0.748051
chest belly len_total
32 -1.429179 -0.919406 0.427726
40 -0.714590 0.138903 -2.453796
39 0.000000 -0.919406 -0.652845
38 -0.952786 -2.683254 -2.934050
46 0.476393 1.197212 0.187599
.. ... ... ...
75 -2.381965 -1.448560 -1.493289
9 0.238197 -0.213867 0.547789
72 -0.476393 0.138903 0.427726
12 0.000000 -0.213867 0.547789
37 0.000000 -0.919406 -1.613352
[80 rows x 10 columns]
%% Cell type:markdown id: tags:
<font color = darkorange>\<Write your answer here\></font>
%% Cell type:markdown id: tags:
**- How well did the model perform in estimating the possums' ages?**
<br>
I feel like the model did pretty decently, because the MAE is pretty low. Being off by about a year and a half is not that bad in my opinion, and the variation in age in the test set is about 5 years, so compared to that our MAE is pretty good. I printed out the coefficients so we can see which features are more important in the regression. We can see that neither sex nor foot length has a large effect on predicting the age, but head length and chest girth do, based on the coefficients.
<br>
<br>
**- How do you interpret the MAE in our case when the target variable is age?**
<br>
I interpret the MAE as meaning that our predictions are, on average, off by about 1.4 years, in either direction.
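%% Cell type:markdown id: tags:
For scale, the MAE can be compared against a trivial baseline that always predicts the mean training-set age (a sketch, assuming the `Y_train`/`Y_test` split above); a useful model should clearly beat this.
%% Cell type:code id: tags:
``` python
# Baseline: predict the mean age of the training set for every test possum.
baseline_prediction = np.full(len(Y_test), float(Y_train["age"].mean()))
baseline_mae = metrics.mean_absolute_error(Y_test, baseline_prediction)
print(f"baseline MAE (always predict the mean age): {baseline_mae}")
```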
%% Cell type:markdown id: tags:
Now that we have fitted the regression model, let's break it down to better understand what is actually happening here. Remember that the model here is essentially just a linear regression model with an added regularization term to deal with e.g. overfitting and multicollinearity. We can write the equation used by the model to predict a possum's age as:
$$
\text{Predicted age} = w_1 \times \text{Sex} + w_2 \times \text{Head length} + w_3 \times \text{Skull width} + ... + w_{10} \times \text{Total length} + \text{Bias}
$$
As mentioned earlier, regression focuses on the relationships between the features and the target variable. In the equation above, each feature contributes a certain amount to the predicted age, based on the weight $w_i$ learned for that feature. For example, if the total length of a possum has a large positive weight, it suggests that possums with greater length are predicted to be older. On the other hand, if the skull width of a possum has a negative weight, it indicates that possums with wider skulls are predicted to be younger. In this case, as skull width increases, the predicted age decreases.
Different classes have different class attributes that you can access after e.g. fitting a model, and the `Ridge` class is no exception: For example, the `coef_` variable contains the learned weights $w_1, ..., w_{10}$ that represent the relationship between the features and the target (a.k.a. age) variable. The `intercept_` variable holds the bias term (or the intercept, however we want to call it).
We can now write down the equation used by our fitted model. You can experiment with it by adjusting the regularization term or using a different sample, if you'd like, to see how the weights and bias change. This is just extra!
%% Cell type:code id: tags:
``` python
# NOTE: To make this code chunk work with the already fitted model, the
# variable names below need to match the ones used above (here the fitted
# model is `rig` and the initial feature dataframe is `possum_X`).
coefficients = rig.coef_[0]       # change the variable name if yours differs
bias = rig.intercept_[0]          # change the variable name if yours differs
feature_names = possum_X.columns  # change the variable name if yours differs
# Let's write the equation
equation = 'Predicted age = '
for i in range(len(coefficients)):
equation += f'{coefficients[i]:.3f}*{feature_names[i]} + '
equation += f'{bias:.3f}'
print(equation)
```
%% Output
Predicted age = 0.029*sex + 0.193*len_head + 0.064*width_skull + 0.077*len_earconch + 0.179*width_eye + 0.011*len_foot + 0.121*len_tail + 0.198*chest + 0.175*belly + 0.127*len_total + 3.838
%% Cell type:markdown id: tags:
________________
## <font color = slategrey> BONUS: Feature selection - most useful features in predicting cardiovascular diseases </font>
%% Cell type:markdown id: tags:
You can stop here and get the "pass" grade! To get the pass with honors, you need to do the following exercise. This means you'll get one bonus point for the exam.
The exercise may require you to do some research of your own. You are also required to **explain** the steps you choose with your own words, and show that you tried to understand the idea behind the task. There's no single correct solution for this so just explain what you did and especially ***why*** you did it. Please note that submitting only code will not be awarded a pass with honors.
----------------
Due to the lack of resources and time, doctors can't measure all the values represented in the given cardio dataset. Fortunately, eager students are willing to help: Your task is to identify <font color = slategrey>five [5] most useful features</font> for predicting the presence of the CVD from the dataset. The steps needed for this job are presented above except the feature selection part. You must remember not to leak any information from the test set when selecting the features, i.e., you try to find those five features using only the training set.
Regarding the feature selection itself, you're asked to use <font color = slategrey>Random Forest</font>. To do this, use the Random Forest classifier's built-in feature importance estimation in scikit-learn. Explain briefly the working of the model on the given cardio dataset: How does the model select features that are relevant in predicting CVD?
Evaluate the model of your choice using accuracy and the area under the ROC curve (AUC). Draw the corresponding curve in a plot. **Discuss** your findings and results.
What goes wrong in your AUC analysis, if you use the predictions from the `predict()` function instead of the `predict_proba()` function to calculate the AUC?
%% Cell type:code id: tags:
``` python
## Code - Bonus task
```
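%% Cell type:markdown id: tags:
For reference, one possible outline is sketched below (an illustrative sketch only, not a model answer). It assumes a fresh stratified 80/20 split of the 1000-row cardio sample from Part 1 (`data_features`/`data_target`), since the earlier `x_train`/`y_train` were combined in Part 2. It fits a Random Forest on the training set only, so no test information leaks into the feature selection, keeps the five features with the highest impurity-based importances, and evaluates a forest restricted to those features with accuracy, AUC and the ROC curve.
%% Cell type:code id: tags:
``` python
from sklearn.ensemble import RandomForestClassifier

# Fresh stratified split of the 1000-row sample so the feature selection
# sees only training data.
Xb_train, Xb_test, yb_train, yb_test = train_test_split(
    data_features, data_target["cardio"], test_size=0.2,
    random_state=1, stratify=data_target["cardio"])

# Rank the features by the Random Forest's built-in impurity-based importances.
rf = RandomForestClassifier(n_estimators=500, random_state=1)
rf.fit(Xb_train, yb_train)
importances = pd.Series(rf.feature_importances_, index=Xb_train.columns)
top5 = importances.sort_values(ascending=False).head(5).index.tolist()
print("five most useful features:", top5)

# Evaluate a forest trained on only those five features.
rf_top5 = RandomForestClassifier(n_estimators=500, random_state=1)
rf_top5.fit(Xb_train[top5], yb_train)
test_probabilities = rf_top5.predict_proba(Xb_test[top5])[:, 1]  # P(cardio = 1), needed for AUC
test_predictions = rf_top5.predict(Xb_test[top5])
print("accuracy:", metrics.accuracy_score(yb_test, test_predictions))
print("AUC:", metrics.roc_auc_score(yb_test, test_probabilities))

# ROC curve for the five-feature model.
fpr, tpr, _ = metrics.roc_curve(yb_test, test_probabilities)
plt.plot(fpr, tpr, label="Random Forest (top 5 features)")
plt.plot([0, 1], [0, 1], linestyle="--", label="chance level")
plt.xlabel("False positive rate")
plt.ylabel("True positive rate")
plt.title("ROC curve for CVD prediction with 5 selected features")
plt.legend()
plt.show()
```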
%% Cell type:markdown id: tags:
<font color = slategrey>\<Write your answer here\></font>