Commit ee7f6890 authored by Emmi Ylikoski
Upload New File

parent cc657953
%% Cell type:markdown id: tags:
# Exercise 1
%% Cell type:code id: tags:
``` python
import numpy as N
import pandas as P
```
%% Cell type:markdown id: tags:
## 1.1 Data structures
%% Cell type:code id: tags:
``` python
A = P.Series([5, 8, 7, 6, 8, 4], name="A")
B = P.Series([1.3, 2.1, 1.8, 1.2, 1.4, 2.3], name="B")
C = P.Series(["y", "y", "n", "y", "n", "n"], name="C")
print(A)
print(B)
print(C)
```
%% Output
0 5
1 8
2 7
3 6
4 8
5 4
Name: A, dtype: int64
0 1.3
1 2.1
2 1.8
3 1.2
4 1.4
5 2.3
Name: B, dtype: float64
0 y
1 y
2 n
3 y
4 n
5 n
Name: C, dtype: object
%% Cell type:code id: tags:
``` python
df1 = P.concat([A,B,C], axis=1)
df1
```
%% Output
A B C
0 5 1.3 y
1 8 2.1 y
2 7 1.8 n
3 6 1.2 y
4 8 1.4 n
5 4 2.3 n
%% Cell type:code id: tags:
``` python
df1.iloc[2,1]
```
%% Output
1.8
%% Cell type:code id: tags:
``` python
df1.iloc[3]
```
%% Output
A 6
B 1.2
C y
Name: 3, dtype: object
%% Cell type:code id: tags:
``` python
subset = df1.iloc[1:5,[1,2]]
subset
```
%% Output
B C
1 2.1 y
2 1.8 n
3 1.2 y
4 1.4 n
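%% Cell type:markdown id: tags:
The slice `1:5` above returns rows 1 to 4 because `.iloc` works on positions and excludes the stop value. A minimal sketch of the same selection with label-based `.loc`, which includes both ends of the slice. The frame `demo` is a stand-in rebuilt here so the cell runs on its own; its name is not part of the exercise.
%% Cell type:code id: tags:
```python
import pandas as P

# Stand-in copy of the toy frame from section 1.1.
demo = P.DataFrame({
    "A": [5, 8, 7, 6, 8, 4],
    "B": [1.3, 2.1, 1.8, 1.2, 1.4, 2.3],
    "C": ["y", "y", "n", "y", "n", "n"],
})

# Position-based: stop position 5 is excluded, so rows 1..4 are returned.
by_position = demo.iloc[1:5, [1, 2]]

# Label-based: both slice ends are included, so the stop label is 4.
by_label = demo.loc[1:4, ["B", "C"]]

# Both selections contain the same rows and columns.
print(by_position.equals(by_label))
```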
%% Cell type:code id: tags:
``` python
df1.transpose()
```
%% Output
0 1 2 3 4 5
A 5 8 7 6 8 4
B 1.3 2.1 1.8 1.2 1.4 2.3
C y y n y n n
%% Cell type:markdown id: tags:
## 1.2 Thyroid disease
%% Cell type:code id: tags:
``` python
df2 = P.read_csv('allbp.data', header=None)
```
%% Cell type:code id: tags:
``` python
df2.head()
```
%% Output
0 1 2 3 4 5 6 7 8 9 ... 20 21 22 23 24 25 26 27 28 \
0 41 F f f f f f f f f ... t 125 t 1.14 t 109 f ? SVHC
1 23 F f f f f f f f f ... t 102 f ? f ? f ? other
2 46 M f f f f f f f f ... t 109 t 0.91 t 120 f ? other
3 70 F t f f f f f f f ... t 175 f ? f ? f ? other
4 70 F f f f f f f f f ... t 61 t 0.87 t 70 f ? SVI
29
0 negative.|3733
1 negative.|1442
2 negative.|2965
3 negative.|806
4 negative.|2807
[5 rows x 30 columns]
%% Cell type:code id: tags:
``` python
df2.info()
```
%% Output
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2800 entries, 0 to 2799
Data columns (total 30 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 0 2800 non-null object
1 1 2800 non-null object
2 2 2800 non-null object
3 3 2800 non-null object
4 4 2800 non-null object
5 5 2800 non-null object
6 6 2800 non-null object
7 7 2800 non-null object
8 8 2800 non-null object
9 9 2800 non-null object
10 10 2800 non-null object
11 11 2800 non-null object
12 12 2800 non-null object
13 13 2800 non-null object
14 14 2800 non-null object
15 15 2800 non-null object
16 16 2800 non-null object
17 17 2800 non-null object
18 18 2800 non-null object
19 19 2800 non-null object
20 20 2800 non-null object
21 21 2800 non-null object
22 22 2800 non-null object
23 23 2800 non-null object
24 24 2800 non-null object
25 25 2800 non-null object
26 26 2800 non-null object
27 27 2800 non-null object
28 28 2800 non-null object
29 29 2800 non-null object
dtypes: object(30)
memory usage: 656.4+ KB
%% Cell type:code id: tags:
``` python
df2.describe()
```
%% Output
0 1 2 3 4 5 6 7 8 9 ... 20 \
count 2800 2800 2800 2800 2800 2800 2800 2800 2800 2800 ... 2800
unique 94 3 2 2 2 2 2 2 2 2 ... 2
top 59 F f f f f f f f f ... t
freq 75 1830 2470 2760 2766 2690 2759 2761 2752 2637 ... 2616
21 22 23 24 25 26 27 28 29
count 2800 2800 2800 2800 2800 2800 2800 2800 2800
unique 218 2 139 2 210 1 1 5 2800
top ? t ? t ? f ? other negative.|3733
freq 184 2503 297 2505 295 2800 2800 1632 1
[4 rows x 30 columns]
%% Cell type:code id: tags:
``` python
df2.columns = [
    "age",
    "sex",
    "on thyroxine",
    "query on thyroxine",
    "on antithyroid medication",
    "sick",
    "pregnant",
    "thyroid surgery",
    "I131 treatment",
    "query hypothyroid",
    "query hyperthyroid",
    "lithium",
    "goitre",
    "tumor",
    "hypopituitary",
    "psych",
    "TSH measured",
    "TSH",
    "T3 measured",
    "T3",
    "TT4 measured",
    "TT4",
    "T4U measured",
    "T4U",
    "FTI measured",
    "FTI",
    "TBG measured",
    "TBG",
    "referral source",
    "classes",
]
```
%% Cell type:code id: tags:
``` python
df2.head()
```
%% Output
age sex on thyroxine query on thyroxine on antithyroid medication sick \
0 41 F f f f f
1 23 F f f f f
2 46 M f f f f
3 70 F t f f f
4 70 F f f f f
pregnant thyroid surgery I131 treatment query hypothyroid ... TT4 measured \
0 f f f f ... t
1 f f f f ... t
2 f f f f ... t
3 f f f f ... t
4 f f f f ... t
TT4 T4U measured T4U FTI measured FTI TBG measured TBG referral source \
0 125 t 1.14 t 109 f ? SVHC
1 102 f ? f ? f ? other
2 109 t 0.91 t 120 f ? other
3 175 f ? f ? f ? other
4 61 t 0.87 t 70 f ? SVI
classes
0 negative.|3733
1 negative.|1442
2 negative.|2965
3 negative.|806
4 negative.|2807
[5 rows x 30 columns]
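%% Cell type:markdown id: tags:
The `classes` values shown above combine a label and a number, separated by `.|`. A possible way to split them is sketched below on three values copied from the output. Reading the number as a record identifier is an assumption, and the names `label` and `record_id` are hypothetical.
%% Cell type:code id: tags:
```python
import pandas as P

# Three values copied from the head() output above.
classes = P.Series(["negative.|3733", "negative.|1442", "negative.|806"], name="classes")

# partition() splits at the first occurrence of the literal separator
# and returns three columns: text before, the separator, text after.
parts = classes.str.partition(".|")

label = parts[0]                    # e.g. "negative"
record_id = parts[2].astype(int)    # assumed to be a record identifier

print(P.DataFrame({"label": label, "record_id": record_id}))
```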
%% Cell type:code id: tags:
``` python
df2.shape
```
%% Output
(2800, 30)
%% Cell type:markdown id: tags:
- How many observations and how many variables are there in the data? <br>
The data contain 2800 observations (rows) and 30 variables (columns).
%% Cell type:code id: tags:
``` python
df2.replace(['?', 'nan', 'missing'], N.nan, inplace=True)
```
%% Output
/var/folders/fp/cf7b8z110lj8yjpy9rj8f5fr0000gn/T/ipykernel_5773/4070518099.py:1: FutureWarning: Downcasting behavior in `replace` is deprecated and will be removed in a future version. To retain the old behavior, explicitly call `result.infer_objects(copy=False)`. To opt-in to the future behavior, set `pd.set_option('future.no_silent_downcasting', True)`
df2.replace(['?', 'nan', 'missing'], N.nan, inplace=True)
%% Cell type:code id: tags:
``` python
df2.isna().sum()
```
%% Output
age 1
sex 110
on thyroxine 0
query on thyroxine 0
on antithyroid medication 0
sick 0
pregnant 0
thyroid surgery 0
I131 treatment 0
query hypothyroid 0
query hyperthyroid 0
lithium 0
goitre 0
tumor 0
hypopituitary 0
psych 0
TSH measured 0
TSH 284
T3 measured 0
T3 585
TT4 measured 0
TT4 184
T4U measured 0
T4U 297
FTI measured 0
FTI 295
TBG measured 0
TBG 2800
referral source 0
classes 0
dtype: int64
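%% Cell type:markdown id: tags:
The `?` placeholders could also be marked as missing while the file is parsed, through the `na_values` argument of `read_csv`. The sketch below uses two made-up rows held in memory instead of `allbp.data`, with a hypothetical four-column subset of the names, so the values are illustrative only.
%% Cell type:code id: tags:
```python
import io
import pandas as P

# Two made-up rows in a comma-separated layout, shortened to four columns.
sample = io.StringIO("41,F,1.3,?\n?,M,?,0.91\n")

demo = P.read_csv(
    sample,
    header=None,
    names=["age", "sex", "TSH", "T4U"],  # hypothetical subset of the column names
    na_values="?",                       # treat "?" as missing during parsing
)

# Columns that only hold numbers and "?" are parsed as float64 directly.
print(demo.isna().sum())
print(demo.dtypes)
```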
%% Cell type:code id: tags:
``` python
df2.dtypes
```
%% Output
age object
sex object
on thyroxine object
query on thyroxine object
on antithyroid medication object
sick object
pregnant object
thyroid surgery object
I131 treatment object
query hypothyroid object
query hyperthyroid object
lithium object
goitre object
tumor object
hypopituitary object
psych object
TSH measured object
TSH object
T3 measured object
T3 object
TT4 measured object
TT4 object
T4U measured object
T4U object
FTI measured object
FTI object
TBG measured object
TBG float64
referral source object
classes object
dtype: object
%% Cell type:code id: tags:
``` python
columns_to_change = ["age","TSH", "T3", "TT4", "T4U", "FTI"]
```
%% Cell type:code id: tags:
``` python
for column in columns_to_change:
    # Missing values were already replaced with NaN, so the cast to float succeeds.
    df2[column] = df2[column].astype(float)
```
%% Cell type:code id: tags:
``` python
df2
```
%% Output
age sex on thyroxine query on thyroxine on antithyroid medication sick \
0 41.0 F f f f f
1 23.0 F f f f f
2 46.0 M f f f f
3 70.0 F t f f f
4 70.0 F f f f f
... ... .. ... ... ... ...
2795 70.0 M f f f f
2796 73.0 M f t f f
2797 75.0 M f f f f
2798 60.0 F f f f f
2799 81.0 F f f f f
pregnant thyroid surgery I131 treatment query hypothyroid ... \
0 f f f f ...
1 f f f f ...
2 f f f f ...
3 f f f f ...
4 f f f f ...
... ... ... ... ... ...
2795 f f f f ...
2796 f f f f ...
2797 f f f f ...
2798 f f f f ...
2799 f f f f ...
TT4 measured TT4 T4U measured T4U FTI measured FTI TBG measured \
0 t 125.0 t 1.14 t 109.0 f
1 t 102.0 f NaN f NaN f
2 t 109.0 t 0.91 t 120.0 f
3 t 175.0 f NaN f NaN f
4 t 61.0 t 0.87 t 70.0 f
... ... ... ... ... ... ... ...
2795 t 155.0 t 1.05 t 148.0 f
2796 t 63.0 t 0.88 t 72.0 f
2797 t 147.0 t 0.80 t 183.0 f
2798 t 100.0 t 0.83 t 121.0 f
2799 t 114.0 t 0.99 t 115.0 f
TBG referral source classes
0 NaN SVHC negative.|3733
1 NaN other negative.|1442
2 NaN other negative.|2965
3 NaN other negative.|806
4 NaN SVI negative.|2807
... ... ... ...
2795 NaN SVI negative.|3689
2796 NaN other negative.|3652
2797 NaN other negative.|1287
2798 NaN other negative.|3496
2799 NaN SVI negative.|724
[2800 rows x 30 columns]
%% Cell type:code id: tags:
``` python
df2.dtypes
```
%% Output
age float64
sex object
on thyroxine object
query on thyroxine object
on antithyroid medication object
sick object
pregnant object
thyroid surgery object
I131 treatment object
query hypothyroid object
query hyperthyroid object
lithium object
goitre object
tumor object
hypopituitary object
psych object
TSH measured object
TSH float64
T3 measured object
T3 float64
TT4 measured object
TT4 float64
T4U measured object
T4U float64
FTI measured object
FTI float64
TBG measured object
TBG float64
referral source object
classes object
dtype: object
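%% Cell type:markdown id: tags:
`astype(float)` works here because the `?` entries were replaced with NaN beforehand. If unparseable strings were still present, `P.to_numeric` with `errors="coerce"` would be one alternative, turning them into NaN during the conversion. A small sketch on made-up values:
%% Cell type:code id: tags:
```python
import pandas as P

# Made-up values: three numeric strings and one placeholder.
raw = P.Series(["41", "23", "?", "70"], name="age")

# errors="coerce" converts anything that cannot be parsed into NaN.
numeric = P.to_numeric(raw, errors="coerce")

print(numeric)
```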
%% Cell type:markdown id: tags:
## 1.3 Thyroid disease (continued)
%% Cell type:code id: tags:
``` python
yes_no_columns = [2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 18, 20, 22, 24, 26]
```
%% Cell type:code id: tags:
``` python
for column in yes_no_columns:
    # Share of rows flagged "t" (true) in this column.
    yes_share = (df2.iloc[:, column] == "t").sum() / len(df2)
    print(f"Column: {df2.columns[column]} %Yes: {yes_share:.2%}\n")
```
%% Output
Column: on thyroxine %Yes: 11.79%
Column: query on thyroxine %Yes: 1.43%
Column: on antithyroid medication %Yes: 1.21%
Column: sick %Yes: 3.93%
Column: pregnant %Yes: 1.46%
Column: thyroid surgery %Yes: 1.39%
Column: I131 treatment %Yes: 1.71%
Column: query hypothyroid %Yes: 5.82%
Column: query hyperthyroid %Yes: 6.18%
Column: lithium %Yes: 0.50%
Column: goitre %Yes: 0.89%
Column: tumor %Yes: 2.54%
Column: hypopituitary %Yes: 0.04%
Column: psych %Yes: 4.82%
Column: TSH measured %Yes: 89.86%
Column: T3 measured %Yes: 79.11%
Column: TT4 measured %Yes: 93.43%
Column: T4U measured %Yes: 89.39%
Column: FTI measured %Yes: 89.46%
Column: TBG measured %Yes: 0.00%
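%% Cell type:markdown id: tags:
The same shares can be computed without an explicit row count: comparing with `"t"` gives a boolean frame, and the mean of a boolean column is the fraction of `True` values. The sketch below runs on a made-up two-column frame, not on `df2`.
%% Cell type:code id: tags:
```python
import pandas as P

# Made-up flag columns using the same "t"/"f" coding.
flags = P.DataFrame({
    "on thyroxine": ["f", "t", "f", "f"],
    "sick": ["f", "f", "f", "f"],
})

# eq("t") yields booleans; their column-wise mean is the share of "t".
yes_share = flags.eq("t").mean()

for name, share in yes_share.items():
    print(f"Column: {name} %Yes: {share:.2%}")
```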
%% Cell type:code id: tags:
``` python
calculate_columns = ["TSH", "T3", "TT4", "T4U", "FTI", "TBG"]
```
%% Cell type:code id: tags:
``` python
((df2["TSH"] ** 2).sum())/df2["TSH"].notna().sum()
```
%% Output
481.7251481915739
%% Cell type:code id: tags:
``` python
for column in calculate_columns:
    # Mean of squares: divide by the number of non-missing values,
    # not by the boolean mask itself (which would return a whole Series).
    calculation = (df2[column] ** 2).sum() / df2[column].notna().sum()
    print(f"{column}: {calculation}")
```
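%% Cell type:markdown id: tags:
The mean of squares divides the sum of squared values by the number of non-missing entries. Since `mean()` skips NaN by default, squaring and then taking the mean should give the same result for several columns at once. A sketch on made-up numbers:
%% Cell type:code id: tags:
```python
import numpy as N
import pandas as P

# Made-up measurements with one missing value per column.
demo = P.DataFrame({"TSH": [1.0, N.nan, 3.0], "T3": [2.0, 2.0, N.nan]})

# Explicit form: sum of squares over the count of non-missing values.
explicit = (demo ** 2).sum() / demo.notna().sum()

# Shortcut: mean() ignores NaN, so it divides by the same count.
shortcut = (demo ** 2).mean()

print(explicit)
print(shortcut)
```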