%% Cell type:markdown id: tags:
<div class="alert alert-block alert-warning">
<h1><center> DAKD 2023 EXERCISE 1: DATA UNDERSTANDING </center></h1>
%% Cell type:markdown id: tags:
This exercise relates to the _data understanding_ and _data preparation_ stages of the Cross-Industry Standard Process for Data Mining (CRISP-DM) model presented on the course. The questions at this stage of a data-analysis project are, for example:
- Is the data quality sufficient?
- How can we check the data for problems?
- How do we have to clean the data?
- How is the data best transformed for modeling?
It may be tempting to just run a model on data without checking it. However, skipping these basic checks can ruin your whole analysis, invalidate your results, and mislead you in further analyses. There is no excuse for not plotting the data and checking that it is clean and as we expect. In this exercise we do just that: check the validity of the data and familiarize ourselves with a dataset, also discussing preprocessing and multi-dimensional plotting.
------------
%% Cell type:markdown id: tags:
### <font color = red> *** FILL YOUR INFORMATION BELOW *** </font>
Iida Pyykkönen <br>
526289 <br>
iapyyk@utu.fi <br>
10.11.2023 <br>
%% Cell type:markdown id: tags:
#### General guidance for exercises
- You can add more code and markup cells, as long as the flow of the notebook stays readable and logical.
- Answer **all** questions (except the bonus if you do not want to attempt it), even if you can't get your script to fully work.
- Write clear and easily readable code, and include explanations of what your code does.
- Make informative illustrations: include labels for the x and y axes, legends and captions for your plots.
- Before saving the ipynb file (and possibly printing it), run "Restart & Run All" to make sure you return a file that works as expected.
- Grading: *Fail*/*Pass*/*Pass with honors* (+1)
- If you encounter problems, Google first. If you can't find an answer to the problem, don't hesitate to ask in the Moodle discussion or directly via Moodle chat.
- While using ChatGPT to generate solutions can be very tempting, the main purpose of the exercises is to support the learning process. If you do end up using generative AI models, avoid direct copy-paste without first understanding the generated code, and make sure to write a short description of how you used ChatGPT in the context of these exercises (what was your input, and how did you benefit from the output?).
- When submitting the exercise, make sure to return both an **ipynb-file** and an **html-file**. Your .ipynb notebook is expected to run to completion, which means that it should execute without errors when all cells are run in sequence.
- Don't leave it to the last moment! No feedback service during weekends.
%% Cell type:markdown id: tags:
### <font color = red> Packages needed for this exercise: </font>
- The exercise can be done without importing any extra packages. You can import new ones, but bear in mind that if you are importing many new packages, you may be complicating your answer.
%% Cell type:code id: tags:
``` python
# --- Libraries with a short description ---
import pandas as pd # for data manipulation
import matplotlib.pyplot as plt # for plotting
import numpy as np # for numeric calculations and making simulated data
import seaborn as sns # for plotting, an extension on matplotlib
# - sklearn has many data analysis utility functions like scaling as well as a large variety of modeling tools.
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import minmax_scale
from sklearn.preprocessing import scale
from sklearn.manifold import TSNE
# This forces plots to be shown inline in the notebook
%matplotlib inline
```
%% Cell type:markdown id: tags:
<div class="alert alert-block alert-warning">
<h1><center> PLOTTING TUTORIAL </center></h1>
%% Cell type:markdown id: tags:
This small explanation of the matplotlib package aims to avoid confusion and help you avoid common mistakes and frustration. Matplotlib is an object-oriented plotting package with the benefit of giving the user a lot of control. The downside is that it can be confusing to new users. **If you are having problems with the plotting exercises, return to this tutorial, as it explains the concepts needed to do the exercises!**
-----------
%% Cell type:markdown id: tags:
### Figure and axes
All plots in matplotlib are structured with the **<font color = dimgrey> figure </font>** and **<font color = blue> axes </font>** objects.
- The **<font color = dimgrey> figure </font>** object is a container for all plotting elements (in other words, everything we see).
- A figure can have many **<font color = blue> axes </font>**. They are the objects you plot on to. The axes can be anywhere inside the figure and can even overlap. The position of an axes is defined relative to the figure.
The **<font color = blue> axes </font>** objects have the methods you will use to define most of your plots. For example, axes.hist() is used to draw a histogram and axes.set_title() to give one axes a title. The name of the object can be a bit confusing, as it does not refer to the axes in the way "x-axis" does but to the container of a single plot.
--------------
- Below is an example that illustrates how **<font color = dimgrey> figures </font>** and **<font color = blue> axes </font>** work together in matplotlib. The comments explain what is done on every row of code. <font color = green> You are encouraged to play around with it, but it's not required in terms of the exercise </font>. Below, we will create all figures and axes separately, but later on we will use a quicker way to do so.
This is not yet a part of the exercises themselves and you do not need to change anything!
%% Cell type:code id: tags:
``` python
# --- Let's make some example data. ---
x_example_data = np.linspace(0,5,10)
y_example_data = x_example_data**2
```
%% Cell type:code id: tags:
``` python
### Create a figure ###
example_figure = plt.figure(figsize =(5,5)) # you give the size of the figure as a tuple of inches
### Create an axes separately and add it to the figure ###
example_axes_outer = example_figure.add_axes([0.1, 0.1, 0.9, 0.9]) # the argument gives the position and size of the axes as fractions of the figure. The order is left, bottom, width, height.
### Set labels and titles to the axes ###
example_axes_outer.set_xlabel("This is how you set an x-axis label to an axes")
example_axes_outer.set_ylabel("The y-label of an axes is set like this")
example_axes_outer.set_title("We learned how to give an axes a title!")
example_axes_inner = example_figure.add_axes([0.45, 0.45, 0.4, 0.3])
example_axes_inner.set_title("This axes has a title too")
### Add something to the axes ###
example_axes_inner.scatter(x_example_data, y_example_data)
# Multiple things, like lines, can be plotted on the same axes.
example_axes_outer.plot(x_example_data**4, y_example_data**2)
example_axes_outer.plot(x_example_data**7, y_example_data**2)
# If you want to add other objects, like text, you add them to an axes too.
# Now you specify the location in the data coordinates of the parent axes.
example_axes_inner.text(3, 5, "This is a text object relative to the inner axes")
# Many more things can be added to axes in a similar way, not just text.
# For more information there are many good tutorials available, for example on YouTube.
```
%% Output
Text(3, 5, 'This is a text object relative to the inner axes')
%% Cell type:markdown id: tags:
### Subplots: creating multiple axes and placing them in a grid on the figure
An established convention when using matplotlib is to start plotting by calling the **<font color = blue> plt.subplots </font>** function, which automatically creates a figure and a specified number of axes in a grid inside it, and automatically links the axes to the figure. Even when creating just one axes, this is an often-used way to start making a plot.
The most important arguments to **<font color = blue> plt.subplots </font>** are **nrows**, **ncols**, **figsize**, **sharex** and **sharey**
- **nrows** controls how many rows of subplots there will be in the grid, **ncols** controls the number of columns
- **figsize** is a tuple, e.g. (1,5), which controls the size of the **<font color = dimgrey> figure </font>**; first is the width and then the height.
- **sharex** (True, False) tells matplotlib whether all axes in the grid should have the same x-axis scale and ticks; **sharey** does the same for the y-axis.
--------
Below, an example of creating subplots is presented. There is also a template-like example of how to fill the subplots in a loop using Python's **enumerate** function for indexing into the subplots. The function **enumerate()** will give you an additional int index over the object you are looping over. This index can be used to access the different subplot elements, like the axes for each of the subplots.
**<font color = dimgrey> plt.tight_layout() </font>** is also a good command to know with subplots. It attempts to automatically arrange the different axes in a pretty way. It should be called after the plot is finished.
%% Cell type:code id: tags:
``` python
# ----- Create some random data for the example, 3 continuous numeric features and 3 binary -----
# don't worry about understanding the functions; they create lists using a shorthand for a for loop called a list comprehension.
numeric_datas = [np.random.rand(10,2) for _ in range(0,3)] # this creates a list of 10x2 arrays of random data, using a list comprehension
binary_datas = [(np.unique(np.random.randint(0, 2, size= 10), return_counts = True)[1]) for _ in range(0,3)] # create a list of count arrays (counts of 0s and 1s in 10 random coin tosses)
```
%% Cell type:code id: tags:
``` python
# Create figure with six axes in a 2*3 grid and set up titles --------------------------------------------------------
fig, axes = plt.subplots(2,3, figsize = (10,5)) # now axes have indexes like axes[i, j]
numeric_plot_titles = ['scatter_plot_1', 'a second plot', 'yet a third plot' ] # some titles for the different axes
binary_plot_titles = ['coin_tosses1', 'tossing again', 'still tossing' ] # some titles for the different axes
# Enumerate the index into the axes, fill the first 3 columns of the first row with scatterplots of numeric_datas --------
i = 0 # for indexing to the row of the axes [**i**, j]
for j, numeric_data in enumerate(numeric_datas): # j = [0,1, ... n_datasets] for filling the columns, i stays constant as it's the row
    axes[i, j].scatter(x = numeric_data[:, 0], y = numeric_data[:, 1]) # plots are called on the axes
    axes[i, j].set_title(numeric_plot_titles[j]) # set a title for each axes
plt.tight_layout()
# Plot the binary data -----------------------------------------------------------------------------------------------
i = 1 # second row
for j, binary_data in enumerate(binary_datas): # j = [0,1, ... n_datasets] for filling the columns, i stays constant as it's the row
    axes[i, j].bar(x = ["0","1"], height = binary_data) # make a barplot
    axes[i, j].set_title(binary_plot_titles[j]) # set a title for each axes
    axes[i, j].set_ylim((0,10)) # set the y-axis limits; set_xlim works the same way.
fig.suptitle("fig.suptitle gives the figure a title and axes.set_title the axes")
plt.tight_layout()
```
%% Output
%% Cell type:markdown id: tags:
#### <font color = maroon> Seaborn and matplotlib </font>
- Finally, it is good to know that the popular Seaborn plotting library is based on matplotlib, and was designed to be an extension of it and to be more user-friendly and faster to use.
- One tip in particular that might help new users of seaborn is that there are two kinds of plotting functions: figure-level and axes-level. Axes-level plots can be placed into subplots like the matplotlib plots in the example above, whereas figure-level plots create and manage the figure entirely within seaborn. (For more information on this see https://seaborn.pydata.org/tutorial/function_overview.html)
- For axes-level plots, the matplotlib axes object is usually given to the seaborn plotting function as an argument. There is an example below.
%% Cell type:code id: tags:
``` python
fig, axes = plt.subplots(2)
# make some data
random_data_a = np.random.rand(30)
random_data_b = np.random.rand(100)
# plot the data we made
sns.histplot(data = random_data_a, ax = axes[0]) # we make a seaborn plot and put it into one of the axes we created
sns.histplot(data = random_data_b, ax = axes[1]) # we make a seaborn plot and put it into one of the axes we created
```
%% Output
<Axes: ylabel='Count'>
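%% Cell type:markdown id: tags:
For contrast, a figure-level seaborn function creates and manages its own figure, so no matplotlib axes is passed in. Below is a minimal, optional sketch (not part of the exercise) that plots the same random data as above with the figure-level counterpart of histplot.
%% Cell type:code id: tags:
``` python
# displot is the figure-level counterpart of histplot: it builds its own figure
# and returns a seaborn FacetGrid instead of a matplotlib Axes.
sns.displot(data = random_data_b, kind = "hist")
```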
%% Cell type:markdown id: tags:
<div class="alert alert-block alert-warning">
<h1><center> START OF EXERCISES </center></h1>
%% Cell type:markdown id: tags:
## <font color = dimgrey> 1. Introduction to the dataset </font>
%% Cell type:markdown id: tags:
The dataset in this exercise contains comprehensive health information from hospital patients with and without cardiovascular disease. The target variable "cardio" reflects the presence or absence of the disease, which is characterized by a buildup of fatty deposits inside the arteries (blood vessels) of the heart.
-------
As is often the case with data analysis projects, the features/variables have been retrieved from different sources:
- doctors' notes (texts)
- examination variables that come from a database containing lab results or were taken during a doctor's examination
- self-reported variables
--------------
The exercise data has the following columns/attributes:
| Feature | Type | Explanation |
| :- | :- | :- |
| age | numeric | The age of the patient in days |
| gender | binary | Male/Female |
| body_mass | numeric | Measured weight of the patient (kg) |
| height | numeric | Measured height of the patient (cm) |
| blood_pressure_high | numeric | Measured systolic blood pressure |
| blood_pressure_low | numeric | Measured diastolic blood pressure |
| smoke | binary | A subjective feature based on asking the patient whether or not he/she smokes |
| active | binary | A subjective feature based on asking the patient whether or not he/she exercises regularly |
| serum_lipid_level | categorical | Serum lipid / cholesterol associated risk information evaluated by a doctor |
| family_history | binary | Indicator for the presence of a family history of cardiovascular disease, based on the medical records of patients |
| cardio | binary | Whether or not the patient has been diagnosed with cardiac disease |
%% Cell type:markdown id: tags:
-----------
#### ***Reading data***
It is good practice to read the features in using their correct types instead of fixing them later. Below, there is ready-made code for you to read in the data, using the data types and column names listed in the above table. Don't change the name of the variable, _data_. It is important in later exercises (for example in ex. 5e) that this is the name of the variable. <font color = red> If you have the dataset in the same folder as this notebook, the path already given to you should work. </font>
---------------
%% Cell type:code id: tags:
``` python
# --- READ IN DATA (no need to change) --------
data_path = "CardioCare_ex1.csv" # if you just give the name of the file it will look for the data in the same folder as your script
data = pd.read_csv(data_path, dtype = {'age': 'int', 'height': 'int', 'body_mass':'int', 'blood_pressure_low':'int', 'blood_pressure_high':'int', 'gender': 'boolean', 'smoke': 'boolean',
                                       'active':'boolean', 'cardio':'boolean', 'serum_lipid_level':'category', 'family_history':'boolean'}) # the main data you use in this exercise should have this variable name, so that code given for you further on will run.
```
%% Cell type:markdown id: tags:
---------
***Exercise 1 a)***
1. First, print out the first five rows of the data.
2. Then, save the feature names to lists by their types: make three lists named **numeric_features**, **binary_features** and **categorical_features**, containing the **names** of the features of each corresponding type (*for the purposes of this exercise, binary variables can also be called booleans*).
_When working with DataFrames, it can be incredibly helpful to organize column names into a list or lists. This simplifies data manipulation and analysis and can be used to easily select, filter, or perform operations on specific sets of columns; it also prevents typing errors and avoids repetition!_
_For example, you can access all columns in your DataFrame with numeric features using the data[numeric_features] notation_
%% Cell type:code id: tags:
``` python
# First five rows
print(data.head(5))
# Feature names
numeric_features = ['age', 'body_mass', 'height', 'blood_pressure_high', 'blood_pressure_low']
binary_features = ['gender', 'smoke', 'active', 'family_history', 'cardio']
categorical_features = ['serum_lipid_level']
```
%% Output
age gender height body_mass blood_pressure_high blood_pressure_low \
0 19797 False 161 55 102 68
1 22571 True 178 68 120 70
2 16621 True 169 69 120 80
3 16688 False 156 77 120 80
4 19498 True 170 98 130 80
smoke active cardio serum_lipid_level family_history
0 False True False elevated False
1 False False False normal False
2 False True False normal False
3 False True False normal False
4 True True True elevated False
%% Cell type:markdown id: tags:
_________
## <font color = dimgrey> 2. Checking data quality
Often in data analysis projects the data has not been gathered for the analysis at hand, but originally for other reasons. Because of this, the features are most often not nicely formatted and may contain mistakes. It might be tempting to just use the data as is with a model, but it is very important to first check the data for possible mistakes, as they can make all the conclusions you draw from your analysis misleading. One good routine for checking data quality is to first calculate descriptive statistics and then to plot the features to check whether the values are realistic.
-----------
Some descriptive statistics don't really make sense for certain kinds of features. In pandas, like in many other packages, some functions work differently depending on the data type of a column. In the following exercise we will look at the descriptive statistics of the data as well as how the behavior can change when the data types are different.
%% Cell type:markdown id: tags:
----------
***2 a)*** Print out the data types of your dataset below.
_Perhaps the most common data types in pandas (see https://pandas.pydata.org/docs/user_guide/basics.html#basics-dtypes) are **float**, **int**, **bool** and **category**._
%% Cell type:code id: tags:
``` python
data.dtypes
```
%% Output
age int64
gender boolean
height int64
body_mass int64
blood_pressure_high int64
blood_pressure_low int64
smoke boolean
active boolean
cardio boolean
serum_lipid_level category
family_history boolean
dtype: object
%% Cell type:markdown id: tags:
--------
***2 b)*** Use the **DataFrame.describe() method** in the cell below on your data.
%% Cell type:code id: tags:
``` python
data.describe()
```
%% Output
age height body_mass blood_pressure_high \
count 210.000000 210.000000 210.000000 210.000000
mean 19455.504762 164.180952 73.895238 127.857143
std 2429.010199 7.534648 14.612326 17.508947
min 14367.000000 142.000000 45.000000 90.000000
25% 17635.750000 158.000000 64.000000 120.000000
50% 19778.000000 164.000000 70.000000 120.000000
75% 21230.500000 170.000000 81.000000 140.000000
max 23565.000000 195.000000 125.000000 190.000000
blood_pressure_low
count 210.000000
mean 81.814286
std 9.947652
min 50.000000
25% 80.000000
50% 80.000000
75% 90.000000
max 120.000000
%% Cell type:markdown id: tags:
--------
***2 c)*** Did you get all of the features' statistics or not? What do you think happened?
%% Cell type:markdown id: tags:
Only the numeric features were summarized: by default, describe() includes only the numeric columns and leaves out the boolean and categorical ones.
%% Cell type:markdown id: tags:
----------
***2 d)*** Calculate descriptives for the binary (boolean) features and the categorical feature <br>
_tip: in Python, data structures of the same type can in many cases be concatenated using the + operator. If you're using the lists of names you created to subset, you can concatenate the two lists of feature names and use the resulting list to help you subset the dataframe_
%% Cell type:code id: tags:
``` python
data[binary_features].describe()
```
%% Output
gender smoke active family_history cardio
count 210 210 210 210 210
unique 2 2 2 2 2
top False False True False False
freq 129 186 162 128 105
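%% Cell type:markdown id: tags:
The concatenation tip above could also be used to cover the categorical feature in the same call. A minimal sketch (output not shown here):
%% Cell type:code id: tags:
``` python
# Concatenate the name lists with + and describe the binary and categorical features together
data[binary_features + categorical_features].describe()
```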
%% Cell type:markdown id: tags:
----------
Now, we will see ***what would have happened if the data had been read in using the default settings***, giving no arguments to pd.read_csv and therefore no information about the types of the features (dtypes).
Run the cell below (no need to modify the code) and look at the output produced with the wrongly read data. Compare it with the output of the cell where you used the correctly read data to get the descriptives.
%% Cell type:code id: tags:
``` python
# read in the dataset with no arguments
wrongly_read_data = pd.read_csv(data_path)
# calculate descriptives for the data that was wrongly read in.
wrongly_read_data.describe()
```
%% Output
age gender height body_mass blood_pressure_high \
count 210.000000 210.000000 210.000000 210.000000 210.000000
mean 19455.504762 0.385714 164.180952 73.895238 127.857143
std 2429.010199 0.487927 7.534648 14.612326 17.508947
min 14367.000000 0.000000 142.000000 45.000000 90.000000
25% 17635.750000 0.000000 158.000000 64.000000 120.000000
50% 19778.000000 0.000000 164.000000 70.000000 120.000000
75% 21230.500000 1.000000 170.000000 81.000000 140.000000
max 23565.000000 1.000000 195.000000 125.000000 190.000000
blood_pressure_low smoke active cardio family_history
count 210.000000 210.000000 210.000000 210.000000 210.000000
mean 81.814286 0.114286 0.771429 0.500000 0.390476
std 9.947652 0.318918 0.420916 0.501195 0.489023
min 50.000000 0.000000 0.000000 0.000000 0.000000
25% 80.000000 0.000000 1.000000 0.000000 0.000000
50% 80.000000 0.000000 1.000000 0.500000 0.000000
75% 90.000000 0.000000 1.000000 1.000000 1.000000
max 120.000000 1.000000 1.000000 1.000000 1.000000
%% Cell type:markdown id: tags: %% Cell type:markdown id: tags:
***2 e)*** Looking at the above output, can you now say whats wrong with this presentation and why it was important to define the data types? ***2 e)*** Looking at the above output, can you now say whats wrong with this presentation and why it was important to define the data types?
%% Cell type:markdown id: tags: %% Cell type:markdown id: tags:
If the datatypes are not defined, every variable is treated as numerical. If the datatypes are not defined, every variable is treated as numerical.
%% Cell type:markdown id: tags:
-----------------------
## 3. Plotting numeric features
Descriptives don't really give a full or intuitive picture of the distribution of features. Next, we will make use of different plots to check the data quality.
%% Cell type:markdown id: tags:
----------
***3 a)*** Plot the numeric features as histograms (see the tutorial if you need help).
_tip: if you give only one grid-size argument to plt.subplots(), like plt.subplots(3), the grid will be one-dimensional and you can index it with only one indexer._
%% Cell type:code id: tags:
``` python
# Creating the figure and subplots
fig, axes = plt.subplots(5, 1, figsize=(7,20))
# Adding a histogram of each numeric feature to the figure
columns = ['age', 'height', 'body_mass', 'blood_pressure_high', 'blood_pressure_low']
i = 0
for column in columns:
    sns.histplot(data[column], ax = axes[i], bins=25)
    i = i + 1
```
%% Output
%% Cell type:markdown id: tags:
_______
## 4. Plotting binary and categorical features
%% Cell type:markdown id: tags:
***4 a)*** Plot **barplots** for each of the non-numeric features. **Use fractions, not the raw frequencies of the levels of these features**.
--------------
_tip: For plotting, see the documentation on axes.bar. To get the fractions, see the value_counts function and its optional argument normalize._
_If you read in the dtypes as pandas boolean, it is in some cases easier to work with other packages, such as matplotlib, when the values are represented as numbers [0,1] and not True or False. If you get errors you can try to cast them temporarily to int or float with astype. This does not mean that you've done the exercise incorrectly, just that you have to change them for the plotting package._
%% Cell type:code id: tags:
``` python
# Creating the figure and subplots
fig, axes = plt.subplots(2, 3, figsize=(11,11))
# Adding plots to the figure
data.gender.value_counts(normalize=True).plot.bar(ax=axes[0, 0], xlabel='Gender')
data.smoke.value_counts(normalize=True).plot.bar(ax=axes[0, 1], xlabel='Smoke')
data.active.value_counts(normalize=True).plot.bar(ax=axes[0 , 2], xlabel='Active')
data.family_history.value_counts(normalize=True).plot.bar(ax=axes[1,0], xlabel='Family history')
data.cardio.value_counts(normalize=True).plot.bar(ax=axes[1,1], xlabel='Cardio')
data.serum_lipid_level.value_counts(normalize=True).plot.bar(ax=axes[1,2], xlabel='Serum lipid level')
```
%% Output
<Axes: xlabel='Serum lipid level'>
%% Cell type:markdown id: tags:
**4 b)** Do you see something odd with one of the features? Fix it.
If you read the dtype of the categorical feature as pandas categorical, **you also have to use the pandas function remove_categories to remove the category level from the feature**, even if you have already removed the value. You can do this like: _data['feature_name'] = data['feature_name'].cat.remove_categories("category name to delete")_
%% Cell type:markdown id: tags:
The serum_lipid_level feature contains a typo category, "elev ated", which should belong to the "elevated" level. Below, the typo category is removed and the affected rows are set back to "elevated".
%% Cell type:code id: tags:
``` python
# Removing the typo category (its values become NaN)
data['serum_lipid_level'] = data['serum_lipid_level'].cat.remove_categories("elev ated")
# Filling the resulting missing values with the correct category
data['serum_lipid_level'] = data['serum_lipid_level'].fillna('elevated')
```
%% Cell type:markdown id: tags:
-------------
## 5. Feature generation and exploration
%% Cell type:markdown id: tags:
Feature engineering is a crucial step in the process of preparing data for most data analysis projects. It involves creating new features or modifying existing ones to improve the performance of predictive models. Feature engineering is a combination of domain knowledge, creativity, and data analysis, and it can have a significant impact on the success of a data analysis project.
--------------
%% Cell type:markdown id: tags:
**BMI**, or **Body Mass Index**, is a simple numerical measure that is commonly used to assess an individual's body weight in relation to their height. In our use case, BMI can be a useful indicator in the prediction of cardiovascular problems, as it provides a well-established link between obesity and an increased risk of developing the disease.
\begin{align*}
\text{BMI} & = \frac{\text{Body mass (kg)}}{(\text{height (m)})^2}
\end{align*}
---------------------------------------
***5 a)*** Generate a new feature based on the provided formula, using 'height' and 'body_mass', and name it **BMI**
_tip: In the case of our dataset the height is in centimeters, so make sure to convert it into meters_
%% Cell type:code id: tags:
``` python
# Calculating the BMIs
BMI = data['body_mass']/((data['height']/100)**2)
# Adding the BMIs to the dataframe
data.insert(11, "BMI", BMI, True)
# Checking the result
BMI.head(5)
```
%% Output
0 21.218317
1 21.461937
2 24.158818
3 31.640368
4 33.910035
dtype: float64
%% Cell type:markdown id: tags:
***5 b)*** Using the previously calculated feature **BMI** generate a new feature named **BMI_category** that categorizes the values into groups, according to the standard BMI categories:
- Underweight: BMI less than 18.5
- Normal Weight: BMI between 18.5 and 24.9
- Overweight: BMI between 25 and 29.9
- Obese: BMI of 30 or greater
%% Cell type:code id: tags:
``` python
# Creating the BMI categories
BMI_category = pd.cut(BMI,bins=[0,18.5,24.9,29.9,100],labels=['Underweight','Normal Weight','Overweight','Obese'])
# Adding the categorised BMIs to the data
data.insert(12, "BMI_category", BMI_category, True)
# Plotting the categorised BMIs
sns.histplot(BMI_category)
```
%% Output
<Axes: ylabel='Count'>
%% Cell type:markdown id: tags:
Now that we have our BMI values, it's a good practice to see if we can spot a hidden trend in our data.
***5 c)*** Create a countplot to visualize the distribution of cardio (target variable) within the different BMI categories.
%% Cell type:code id: tags: %% Cell type:code id: tags:
``` python ``` python
# Creating a new dataframe of BMI trends # Creating a new dataframe of BMI trends
bmi_trends = pd.DataFrame(data['cardio']) bmi_trends = pd.DataFrame(data['cardio'])
bmi_trends.insert(1, "bmi_cat", BMI_category, True) bmi_trends.insert(1, "bmi_cat", BMI_category, True)
# Plotting the BMI trends # Plotting the BMI trends
sns.histplot(data=bmi_trends, x='bmi_cat', hue='cardio', multiple='dodge', shrink=.8) sns.histplot(data=bmi_trends, x='bmi_cat', hue='cardio', multiple='dodge', shrink=.8)
``` ```
%% Output %% Output
<Axes: xlabel='bmi_cat', ylabel='Count'> <Axes: xlabel='bmi_cat', ylabel='Count'>
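%% Cell type:markdown id: tags:
The exercise asks specifically for a countplot; the dodged histogram above conveys the same information, but for reference, below is a minimal sketch using seaborn's countplot directly. It assumes the data dataframe already contains the BMI_category and cardio columns created above.
%% Cell type:code id: tags:
``` python
import seaborn as sns
import matplotlib.pyplot as plt

# Counts of cardio outcomes within each BMI category (equivalent to the dodged histogram above)
ax = sns.countplot(data=data, x='BMI_category', hue='cardio')
ax.set_xlabel('BMI category')
ax.set_ylabel('Count')
ax.set_title('Cardio diagnosis by BMI category')
plt.show()
```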
%% Cell type:markdown id: tags: %% Cell type:markdown id: tags:
***5 d)*** Can you notice any relationship or visible trend? ***5 d)*** Can you notice any relationship or visible trend?
%% Cell type:markdown id: tags: %% Cell type:markdown id: tags:
The share of people with a cardiovascular diagnosis (cardio = 1) grows with BMI category: overweight and especially obese people are more likely to have been diagnosed with the disease than people of normal weight.
%% Cell type:markdown id: tags: %% Cell type:markdown id: tags:
Below there is ready-made code for you to appropriately add the newly created features to the right column type list. You don't need to change anything in the code; just make sure that the names of the added features are as specified earlier (**BMI** and **BMI_category**)
%% Cell type:code id: tags: %% Cell type:code id: tags:
``` python ``` python
# ---- Add features to column type list (no need to change) -------- # ---- Add features to column type list (no need to change) --------
numeric_features.append("BMI") numeric_features.append("BMI")
data['BMI_category'] = data['BMI_category'].astype('category') data['BMI_category'] = data['BMI_category'].astype('category')
categorical_features.append("BMI_category") categorical_features.append("BMI_category")
``` ```
%% Cell type:markdown id: tags: %% Cell type:markdown id: tags:
------------- -------------
## 6. Preprocessing numeric features ## 6. Preprocessing numeric features
%% Cell type:markdown id: tags: %% Cell type:markdown id: tags:
Scaling the data improves the performance of machine learning algorithms in many cases, or perhaps better put, skipping it can ruin performance. For instance, with distance-based algorithms covered in the course, such as PCA, t-SNE and KNN, features with large values can dominate the distance calculations.
----------- -----------
We will look at two commonly used ways of bringing values to the same scale: **min-max scaling to [0,1]** and **standardizing the features to 0 mean and unit variance**. We will see that this decision has implications for how the data looks afterwards. Standardizing values is very common in statistics, and min-max scaling is used for example in training neural networks, where we want the range to match the range of an activation function in the network. It's good to know both.
Two functions, sklearn.preprocessing.minmax_scale and sklearn.preprocessing.scale, have been imported for you, and you can use them in the following exercises.
__________________________ __________________________
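%% Cell type:markdown id: tags:
The difference between the two methods is easy to see on a small toy array before touching the actual dataset. The sketch below is only an illustration; it imports minmax_scale and scale explicitly so the cell is self-contained.
%% Cell type:code id: tags:
``` python
import numpy as np
from sklearn.preprocessing import minmax_scale, scale

toy = np.array([1.0, 2.0, 3.0, 4.0, 10.0])

# Min-max scaling squeezes the values into [0, 1]
print("min-max scaled:", minmax_scale(toy))
# Standardization centers the values at mean 0 with unit variance
print("standardized:  ", scale(toy))
```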
%% Cell type:markdown id: tags: %% Cell type:markdown id: tags:
**6 a)** Min-max scale the numeric attributes to [0,1] and **store the results in a new dataframe called data_min_maxed**. You might have to wrap the result in a dataframe again using pd.DataFrame()
%% Cell type:code id: tags: %% Cell type:code id: tags:
``` python
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()
# Min-max scaling the numeric columns to [0, 1]
minmax = scaler.fit_transform(data[numeric_features])
# Wrapping the result in a new dataframe, keeping the original column names and row index
data_min_maxed = pd.DataFrame(minmax, columns=numeric_features, index=data.index)
```
%% Cell type:markdown id: tags: %% Cell type:markdown id: tags:
**6 b)** Standardize numeric attributes to 0 mean and unit variance and **store the results in a new dataframe called data_standardized** **6 b)** Standardize numeric attributes to 0 mean and unit variance and **store the results in a new dataframe called data_standardized**
%% Cell type:code id: tags: %% Cell type:code id: tags:
``` python
# Standardizing the numeric columns to 0 mean and unit variance
stand = scale(data[numeric_features])
# Wrapping the result in a new dataframe, keeping the original column names and row index
data_standardized = pd.DataFrame(stand, columns=numeric_features, index=data.index)
```
%% Cell type:markdown id: tags: %% Cell type:markdown id: tags:
**6 c)** Make two boxplots of the 'age' feature, one with data_min_maxed and one with data_standardized. Preferably put the plots side by side and give each plot a title. See the tutorial in the beginning for help.
%% Cell type:code id: tags: %% Cell type:code id: tags:
``` python
# Creating the figure and subplots
fig, axes = plt.subplots(1, 2, figsize=(6, 4))
# Boxplots of 'age' for the min-max scaled and the standardized data, each with its own title
data_min_maxed.age.plot.box(ax=axes[0])
axes[0].set_title('Age, min-max scaled')
data_standardized.age.plot.box(ax=axes[1])
axes[1].set_title('Age, standardized')
plt.tight_layout()
```
%% Output %% Output
<Axes: > <Axes: >
%% Cell type:markdown id: tags: %% Cell type:markdown id: tags:
**6 d)** Describe what you would expect to see in these two boxplots. How would the characteristics of the boxplots differ for min-max scaled data and standardized data? **6 d)** Describe what you would expect to see in these two boxplots. How would the characteristics of the boxplots differ for min-max scaled data and standardized data?
_tip: Consider factors like the location of the mean, and the range of values presented._ _tip: Consider factors like the location of the mean, and the range of values presented._
%% Cell type:markdown id: tags: %% Cell type:markdown id: tags:
In the standardized data the values are centered on a mean of 0 and spread over a few standard deviations in each direction, so the box sits around 0 and the whiskers extend both below and above it. In the min-max scaled data all values are squeezed into [0, 1], so the range shown is much narrower.
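%% Cell type:markdown id: tags:
As a quick check of this expectation, the summary statistics of the two scaled 'age' columns can be printed. This is only a verification sketch and assumes data_min_maxed and data_standardized from 6 a) and 6 b) are in memory.
%% Cell type:code id: tags:
``` python
# Comparing the distribution of 'age' after the two scalings
print("Min-max scaled age:")
print(data_min_maxed['age'].describe().round(3))
print("\nStandardized age:")
print(data_standardized['age'].describe().round(3))
```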
%% Cell type:markdown id: tags: %% Cell type:markdown id: tags:
--------- ---------
Let's see the differences between these preprocessing methods through an example. We will add an "outlier" point (a point with an extreme value) to replace the
last value in both datasets, then min-max scale and standardize again and plot. The code to add the value is given for you and you shouldn't change it.
-------------------- --------------------
***6 e) Do the following:***
1. Take the age data with the added outlier (age_w_outlier) provided for you.
2. Make a variable age_w_outlier_minmaxed containing the min-maxed values of age_w_outlier.
3. Make a variable age_w_outlier_standardized containing the standardized values of age_w_outlier.
%% Cell type:code id: tags: %% Cell type:code id: tags:
``` python ``` python
### Add an outlier, DONT CHANGE THIS CELL CODE, JUST RUN IT ### ### Add an outlier, DONT CHANGE THIS CELL CODE, JUST RUN IT ###
data_w_outlier = data.copy() #data should be the name of the variable where you have stored your data! data_w_outlier = data.copy() #data should be the name of the variable where you have stored your data!
data_w_outlier.loc[data.shape[0] -1 , 'age'] = 150 #change the last value of age to be 150 data_w_outlier.loc[data.shape[0] -1 , 'age'] = 150 #change the last value of age to be 150
age_w_outlier = data_w_outlier.age age_w_outlier = data_w_outlier.age
``` ```
%% Cell type:code id: tags: %% Cell type:code id: tags:
``` python ``` python
# Min maxing and standardizing the age with added outlier # Min maxing and standardizing the age with added outlier
age_w_outlier_minmaxed = minmax_scale(age_w_outlier) age_w_outlier_minmaxed = minmax_scale(age_w_outlier)
age_w_outlier_standardized = scale(age_w_outlier) age_w_outlier_standardized = scale(age_w_outlier)
``` ```
%% Cell type:markdown id: tags: %% Cell type:markdown id: tags:
***Below there is pre-written code for you to plot the different cases. It should run if you have named your features appropriately. Run it.***
%% Cell type:code id: tags: %% Cell type:code id: tags:
``` python ``` python
# Wrap in a dataframe that will have two features - the age feature without the outlier, and the age feature with it, min-maxed. # Wrap in a dataframe that will have two features - the age feature without the outlier, and the age feature with it, min-maxed.
minmaxed_datas = pd.DataFrame({"minmaxed_age_no_outlier" : data_min_maxed.age, minmaxed_datas = pd.DataFrame({"minmaxed_age_no_outlier" : data_min_maxed.age,
"minmaxed_age_with_outlier": age_w_outlier_minmaxed }) "minmaxed_age_with_outlier": age_w_outlier_minmaxed })
# Wrap in a dataframe that will have two features - the age feature without the outlier, and the age feature with it, standardized. # Wrap in a dataframe that will have two features - the age feature without the outlier, and the age feature with it, standardized.
standardized_datas = pd.DataFrame({"standardized_data_no_outlier" : data_standardized.age, standardized_datas = pd.DataFrame({"standardized_data_no_outlier" : data_standardized.age,
"standardized_data_w_outlier": age_w_outlier_standardized }) "standardized_data_w_outlier": age_w_outlier_standardized })
axes_minmaxed = minmaxed_datas[['minmaxed_age_no_outlier', 'minmaxed_age_with_outlier']].plot(kind='box', title='Minmax with and without outlier') axes_minmaxed = minmaxed_datas[['minmaxed_age_no_outlier', 'minmaxed_age_with_outlier']].plot(kind='box', title='Minmax with and without outlier')
axes_std = standardized_datas[['standardized_data_no_outlier', 'standardized_data_w_outlier']].plot(kind='box', title='Standardized with and without outlier') axes_std = standardized_datas[['standardized_data_no_outlier', 'standardized_data_w_outlier']].plot(kind='box', title='Standardized with and without outlier')
``` ```
%% Output %% Output
%% Cell type:markdown id: tags: %% Cell type:markdown id: tags:
---------- ----------
**6 f) Look at the output of the above cell and answer the following**: **6 f) Look at the output of the above cell and answer the following**:
1. Can you notice a difference between the two cases (min-maxed and standardized)? 1. Can you notice a difference between the two cases (min-maxed and standardized)?
2. Can you say something about the difference of the effect of min-maxing and standardization? 2. Can you say something about the difference of the effect of min-maxing and standardization?
%% Cell type:markdown id: tags: %% Cell type:markdown id: tags:
The outlier has a big effect on the min-maxed data and only a small one on the standardized data. Min-max scaling divides by the range (max minus min), so a single extreme value changes that denominator and squeezes all the other scaled values together, whereas the mean and standard deviation used in standardization barely move when only one point changes.
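%% Cell type:markdown id: tags:
To make this concrete, the sketch below compares the quantities the two methods divide by, with and without the outlier. It assumes data and age_w_outlier are still in memory; the exact numbers depend on the dataset.
%% Cell type:code id: tags:
``` python
import numpy as np

age = data['age'].to_numpy()
age_out = age_w_outlier.to_numpy()

# Min-max scaling divides by the range (max - min); one extreme value changes it a lot
print("range without outlier:", np.ptp(age), "  with outlier:", np.ptp(age_out))
# Standardization divides by the standard deviation, which barely moves when one point changes
print("std   without outlier:", round(age.std(), 2), "  with outlier:", round(age_out.std(), 2))
```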
%% Cell type:markdown id: tags: %% Cell type:markdown id: tags:
--------------- ---------------
## 7. Preprocessing categorical features ## 7. Preprocessing categorical features
%% Cell type:markdown id: tags: %% Cell type:markdown id: tags:
We can roughly divide categorical variables/features into two types: ***nominal categorical*** and ***ordinal categorical*** variables/features. Some cases are clear in terms of which of the two a feature falls into: for example, nationality has no natural order, whereas the grade someone is in at school does. **One-hot encoding** was presented in the lectures and will be used in the following exercises with different learning methods.
----- -----
***Nominal categorical features need to be encoded***, because not encoding them implies that they have an order. For example, consider a dataset where rows correspond to different countries, encoded arbitrarily with numbers, e.g. Finland = 1, Norway = 2 and so on. For some analyses and methods this would imply that Norway is somehow "greater" in value than Finland. For some algorithms it would also imply that some countries are "closer" to each other than others.
------ ------
***Ordinal categorical features do not necessarily need to be encoded***, but there are cases where it can be wise. One such case is when the categories are not evenly spaced, which applies to the 'serum_lipid_level' feature with the levels 'normal', 'elevated' and 'at risk': it's not clear that these are equally far apart. When unsure, it may also be better to one-hot encode, and many packages do it for you behind the scenes. Here we decide to one-hot encode.
--------------------- ---------------------
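%% Cell type:markdown id: tags:
As a small illustration of the point about nominal features (a toy example, not part of the exercise data): integer-coding countries imposes an artificial order and distance, while one-hot encoding treats them symmetrically.
%% Cell type:code id: tags:
``` python
import pandas as pd

countries = pd.Series(['Finland', 'Norway', 'Sweden', 'Finland'], name='country')

# Integer codes imply Finland < Norway < Sweden, which is meaningless here
print(countries.astype('category').cat.codes.tolist())

# One-hot encoding gives each country its own 0/1 column with no implied order
print(pd.get_dummies(countries))
```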
%% Cell type:markdown id: tags: %% Cell type:markdown id: tags:
***7 a)*** One-hot-encode the serum_lipid_level-feature and add the one-hot features to the data. Give the new features meaningful names. Print the first rows of the resulting dataframe. ***7 a)*** One-hot-encode the serum_lipid_level-feature and add the one-hot features to the data. Give the new features meaningful names. Print the first rows of the resulting dataframe.
_tip: pandas has a function for this, google!_ _tip: pandas has a function for this, google!_
%% Cell type:code id: tags: %% Cell type:code id: tags:
``` python ``` python
# One-hot encoding serum lipid level # One-hot encoding serum lipid level
encoded = pd.get_dummies(data['serum_lipid_level'], prefix='serum_lipid') encoded = pd.get_dummies(data['serum_lipid_level'], prefix='serum_lipid')
# Creating a new dataframe with the encoded features # Creating a new dataframe with the encoded features
data_with_encoding = pd.concat([data, encoded], axis = 1) data_with_encoding = pd.concat([data, encoded], axis = 1)
data_with_encoding.head(5) data_with_encoding.head(5)
``` ```
%% Output %% Output
age gender height body_mass blood_pressure_high blood_pressure_low \ age gender height body_mass blood_pressure_high blood_pressure_low \
0 19797 False 161 55 102 68 0 19797 False 161 55 102 68
1 22571 True 178 68 120 70 1 22571 True 178 68 120 70
2 16621 True 169 69 120 80 2 16621 True 169 69 120 80
3 16688 False 156 77 120 80 3 16688 False 156 77 120 80
4 19498 True 170 98 130 80 4 19498 True 170 98 130 80
smoke active cardio serum_lipid_level family_history BMI \ smoke active cardio serum_lipid_level family_history BMI \
0 False True False elevated False 21.218317 0 False True False elevated False 21.218317
1 False False False normal False 21.461937 1 False False False normal False 21.461937
2 False True False normal False 24.158818 2 False True False normal False 24.158818
3 False True False normal False 31.640368 3 False True False normal False 31.640368
4 True True True elevated False 33.910035 4 True True True elevated False 33.910035
BMI_category serum_lipid_at risk serum_lipid_elevated \ BMI_category serum_lipid_at risk serum_lipid_elevated \
0 Normal Weight 0 1 0 Normal Weight 0 1
1 Normal Weight 0 0 1 Normal Weight 0 0
2 Normal Weight 0 0 2 Normal Weight 0 0
3 Obese 0 0 3 Obese 0 0
4 Obese 0 1 4 Obese 0 1
serum_lipid_normal serum_lipid_normal
0 0 0 0
1 1 1 1
2 1 2 1
3 1 3 1
4 0 4 0
%% Cell type:markdown id: tags: %% Cell type:markdown id: tags:
---------- ----------
%% Cell type:markdown id: tags: %% Cell type:markdown id: tags:
<div class="alert alert-block alert-warning"> <div class="alert alert-block alert-warning">
<h1><center> BONUS EXERCISES </center></h1> <h1><center> BONUS EXERCISES </center></h1>
%% Cell type:markdown id: tags: %% Cell type:markdown id: tags:
- Below are the bonus exercises. You can stop here, and get the "pass" grade. - Below are the bonus exercises. You can stop here, and get the "pass" grade.
- By doing the bonus exercises below, you can get a "pass with honors", which means you will get one point bonus for the exam. - By doing the bonus exercises below, you can get a "pass with honors", which means you will get one point bonus for the exam.
The following exercises are more challenging and not as straightforward, and they may require some research of your own. However, perfect written answers are not required; answers that show that you have tried to understand the problems and explain them in your own words are enough.
%% Cell type:markdown id: tags: %% Cell type:markdown id: tags:
____________ ____________
## <font color = dollargreen > 8. BONUS: Dimensionality reduction and plotting with PCA </font> ## <font color = dollargreen > 8. BONUS: Dimensionality reduction and plotting with PCA </font>
In the lectures, PCA was introduced as a dimensionality reduction technique. Here we will use it to reduce the dimensionality of the numeric features of this dataset and use the resulting compressed view of the dataset to plot it. This means you have to run PCA and then project the data you used to fit the PCA into the new space, where the principal components are the axes.
____________ ____________
%% Cell type:markdown id: tags: %% Cell type:markdown id: tags:
------------- -------------
**8 a)** Do PCA with two components with and without z-score standardization **for the numeric features in the data**. **8 a)** Do PCA with two components with and without z-score standardization **for the numeric features in the data**.
%% Cell type:code id: tags: %% Cell type:code id: tags:
``` python ``` python
# --- Your code for 8 a) here --- #
``` ```
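%% Cell type:markdown id: tags:
One possible sketch for 8 a): fit a two-component PCA on the raw numeric features and on their z-score standardized versions. It assumes the data dataframe and the numeric_features list defined earlier in the notebook.
%% Cell type:code id: tags:
``` python
from sklearn.decomposition import PCA
from sklearn.preprocessing import scale

X_raw = data[numeric_features].to_numpy()
X_std = scale(X_raw)  # z-score standardization: 0 mean, unit variance per feature

# PCA without standardization
pca_raw = PCA(n_components=2)
proj_raw = pca_raw.fit_transform(X_raw)

# PCA with standardization
pca_std = PCA(n_components=2)
proj_std = pca_std.fit_transform(X_std)

print("Explained variance ratio (raw):         ", pca_raw.explained_variance_ratio_)
print("Explained variance ratio (standardized):", pca_std.explained_variance_ratio_)
```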
%% Cell type:markdown id: tags: %% Cell type:markdown id: tags:
------------- -------------
**8 b)** Plot the data projected onto the PCA space as a scatterplot, the x-axis being one component and the y-axis the other. **Add the total explained variance to your plot as an annotation**. See the documentation of the PCA method on how to get the explained variance.
- _Tip: It may be easier to try the seaborn scatterplot for this one. For help, see the documentation on how to do annotation (see tutorial). The total explained variance is the sum of both components' explained variances_.
- _Tip 2_: Depending on how you approach annotating the plot, you might have to cast the feature name to a string. One nice way to format values in Python is the f-string, which allows you to insert expressions inside strings (see example below):
------ ------
name = "Valtteri"<br>
print(f"hello_{name}") print(f"hello_{name}")
--------- ---------
You can also set the number of wanted decimals for floats<br> You can also set the number of wanted decimals for floats<br>
For example f'{float_variable:.2f}' would keep 2 decimals of the float in the resulting string
---------- ----------
%% Cell type:code id: tags: %% Cell type:code id: tags:
``` python ``` python
# --- Your code for 8 b) --- you can make more cells if you like --- # --- Your code for 8 b) --- you can make more cells if you like ---
``` ```
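%% Cell type:markdown id: tags:
A possible sketch for 8 b), assuming proj_raw, proj_std, pca_raw and pca_std from the 8 a) sketch above. Each scatterplot is colored by the cardio feature and annotated with the total explained variance of its two components.
%% Cell type:code id: tags:
``` python
import matplotlib.pyplot as plt
import seaborn as sns

fig, axes = plt.subplots(1, 2, figsize=(12, 5))

for ax, proj, pca, title in [(axes[0], proj_raw, pca_raw, 'PCA, no standardization'),
                             (axes[1], proj_std, pca_std, 'PCA, standardized')]:
    sns.scatterplot(x=proj[:, 0], y=proj[:, 1], hue=data['cardio'], s=10, ax=ax)
    total_var = pca.explained_variance_ratio_.sum()
    ax.set_title(title)
    ax.set_xlabel('PC 1')
    ax.set_ylabel('PC 2')
    ax.annotate(f'total explained variance: {total_var:.2f}',
                xy=(0.05, 0.95), xycoords='axes fraction', va='top')

plt.tight_layout()
plt.show()
```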
%% Cell type:markdown id: tags: %% Cell type:markdown id: tags:
**8 c) Gather information for the next part of the exercise and print out the following things:** **8 c) Gather information for the next part of the exercise and print out the following things:**
- First, the standard deviation of the original data features (not standardized, and with the numeric features only). - First, the standard deviation of the original data features (not standardized, and with the numeric features only).
- Second, the standard deviation of the standardized numeric features - Second, the standard deviation of the standardized numeric features
%% Cell type:code id: tags: %% Cell type:code id: tags:
``` python ``` python
# --- Your code for 8 c) here --- # # --- Your code for 8 c) here --- #
``` ```
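%% Cell type:markdown id: tags:
A sketch for 8 c), assuming data, numeric_features and the standardized array X_std from the 8 a) sketch. The standard deviations of the standardized features should all be approximately 1.
%% Cell type:code id: tags:
``` python
import pandas as pd

# Standard deviation of the original numeric features
print("Original numeric features:")
print(data[numeric_features].std().round(2))

# Standard deviation after z-score standardization
print("\nStandardized numeric features:")
print(pd.DataFrame(X_std, columns=numeric_features).std().round(2))
```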
%% Cell type:markdown id: tags: %% Cell type:markdown id: tags:
---------- ----------
**8 d) Look at the output above and the explained variance information you added as annotations to the plots. Try to think about the following questions and give a short answer of what you think has happened:** **8 d) Look at the output above and the explained variance information you added as annotations to the plots. Try to think about the following questions and give a short answer of what you think has happened:**
1. Where do you think the difference between the amounts of explained variance might come from? 1. Where do you think the difference between the amounts of explained variance might come from?
2. Can you say something about why it is important to scale the features for PCA by looking at the evidence you've gathered?
__Answer in your own words, here it is not important to get the perfect answer but to try to think and figure out what has happened__ __Answer in your own words, here it is not important to get the perfect answer but to try to think and figure out what has happened__
------------ ------------
%% Cell type:markdown id: tags: %% Cell type:markdown id: tags:
<font color="green">Your answer for 8 d)</font> <font color="green">Your answer for 8 d)</font>
%% Cell type:markdown id: tags: %% Cell type:markdown id: tags:
------------------ ------------------
## <font color = dollargreen > 9. Bonus: t-SNE and high dimensional data </font> ## <font color = dollargreen > 9. Bonus: t-SNE and high dimensional data </font>
%% Cell type:markdown id: tags: %% Cell type:markdown id: tags:
Another method that can be used to plot high-dimensional data introduced in the lectures was t-distributed Stochastic Neighbor Embedding (t-SNE). Another method that can be used to plot high-dimensional data introduced in the lectures was t-distributed Stochastic Neighbor Embedding (t-SNE).
%% Cell type:markdown id: tags: %% Cell type:markdown id: tags:
***9 a)*** Run t-SNE for both standardized and non-standardized data (as you did with PCA).
%% Cell type:code id: tags: %% Cell type:code id: tags:
``` python ``` python
# --- Your code for 9 a) here --- # # --- Your code for 9 a) here --- #
``` ```
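%% Cell type:markdown id: tags:
One possible sketch for 9 a): run t-SNE on the raw and on the standardized numeric features. t-SNE is slow on large datasets, so the sketch works on a random subsample, which is an assumption rather than a requirement of the exercise; data and numeric_features are the variables defined earlier.
%% Cell type:code id: tags:
``` python
from sklearn.manifold import TSNE
from sklearn.preprocessing import scale

# Subsample to keep the t-SNE runtime reasonable
sample = data.sample(n=min(5000, len(data)), random_state=0)
X_raw = sample[numeric_features].to_numpy()
X_std = scale(X_raw)

tsne_raw = TSNE(n_components=2, random_state=0).fit_transform(X_raw)
tsne_std = TSNE(n_components=2, random_state=0).fit_transform(X_std)
```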
%% Cell type:markdown id: tags: %% Cell type:markdown id: tags:
***9 b)*** Plot the t-SNE results similarly to PCA, making the color of the points correspond to the levels of the cardio feature, but using only the numerical features as the basis of the t-SNE.
%% Cell type:code id: tags: %% Cell type:code id: tags:
``` python ``` python
# --- Code for 9 b) --- # # --- Code for 9 b) --- #
``` ```
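%% Cell type:markdown id: tags:
A sketch for 9 b), assuming tsne_raw, tsne_std and sample from the 9 a) sketch above; the points are colored by the cardio feature.
%% Cell type:code id: tags:
``` python
import matplotlib.pyplot as plt
import seaborn as sns

fig, axes = plt.subplots(1, 2, figsize=(12, 5))

sns.scatterplot(x=tsne_raw[:, 0], y=tsne_raw[:, 1], hue=sample['cardio'], s=10, ax=axes[0])
axes[0].set_title('t-SNE, no standardization')

sns.scatterplot(x=tsne_std[:, 0], y=tsne_std[:, 1], hue=sample['cardio'], s=10, ax=axes[1])
axes[1].set_title('t-SNE, standardized')

for ax in axes:
    ax.set_xlabel('t-SNE 1')
    ax.set_ylabel('t-SNE 2')

plt.tight_layout()
plt.show()
```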
%% Cell type:markdown id: tags: %% Cell type:markdown id: tags:
***9 c)*** ***9 c)***
- What do you think might have happened between the two runs of t-SNE on unstandardized and standardized data? Why is it important to standardize before using the algorithm? - What do you think might have happened between the two runs of t-SNE on unstandardized and standardized data? Why is it important to standardize before using the algorithm?
_Here the aim is to think about this and learn, not to come up with a perfect explanation. Googling is encouraged. Think about whether t-SNE is a distance-based algorithm or not._
%% Cell type:markdown id: tags: %% Cell type:markdown id: tags:
<font color="green">Your answer for 9 c)</font> <font color="green">Your answer for 9 c)</font>
%% Cell type:code id: tags: %% Cell type:code id: tags:
``` python ``` python
``` ```