Commit fb0a70e4 authored by Robert Siipola

initial commit

%% Cell type:code id: tags:
``` python
import numpy as np
import pylab as pl
import matplotlib.pyplot as plt
import scipy.spatial as sp
import scipy.stats as st
%pylab inline
```
%% Cell type:markdown id: tags:
In the first part, we were supposed to run kNN on normally distributed random data with randomly assigned class labels. We start by creating the functions. The first one needed is the preprocess function, which splits the given data into a single test point and a training set.
%% Cell type:code id: tags:
``` python
def preprocess(data, classifiers, index):
    test = data[index]
    train = np.delete(data, index, 0)
    test_class = classifiers[index]
    train_class = np.delete(classifiers, index, 0)
    return test, train, test_class, train_class
```
%% Cell type:markdown id: tags:
Then the nearest neighbour function. It computes the Euclidean distances between the test point and every point in the training set, takes a majority vote among the k (here k = 3) nearest neighbours, and returns the winning class as the prediction.
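
The same idea can be sketched without an explicit distance loop. This is only an illustration under assumptions: the name `knn_predict` and the toy data are made up here, and ties are resolved by taking the smallest label rather than by a random choice.

```python
import numpy as np

def knn_predict(test, train, train_class, k=3):
    """Predict the class of `test` by majority vote among its k nearest neighbours."""
    # Treat 1-D input as one feature per sample
    train = np.asarray(train, dtype=float).reshape(len(train), -1)
    test = np.asarray(test, dtype=float).reshape(1, -1)
    # Euclidean distance from the test point to every training point
    distance = np.sqrt(((train - test) ** 2).sum(axis=1))
    # Classes of the k closest training points
    nearest = np.asarray(train_class)[np.argsort(distance)[:k]]
    labels, counts = np.unique(nearest, return_counts=True)
    # Majority vote; np.argmax picks the first (smallest) label on a tie
    return labels[np.argmax(counts)]

# Toy data, made up for illustration
toy_train = np.array([0.0, 0.1, 0.2, 5.0, 5.1])
toy_class = np.array([0, 0, 1, 1, 1])
print(knn_predict(0.05, toy_train, toy_class, k=3))
```

With two classes and an odd k the vote can never tie, so the tie rule only matters for other settings.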
%% Cell type:code id: tags:
``` python
# Function returns the predicted class
# Prediction happens by majority voting, based on the classes of the k nearest neighbours
# If there's a tie, a class is chosen at random
def knn(test, train, train_class, k):
    distance = np.zeros(len(train))
    # Calculate the distance between the test point and the training data
    for i in range(0, len(train)):
        distance[i] = sp.distance.euclidean(test, train[i])
    # Sort in ascending order (smallest first)
    idx = np.argsort(distance)
    # Keep the classes of the k nearest neighbours
    k_class = train_class[idx][0:k]
    # Checks how many unique classes there are, and their frequencies
    unique, counts = np.unique(k_class, return_counts=True)
    # Majority vote. With two classes and k = 3 there are no ties,
    # but if one occurs, one of the tied classes is picked at random
    tied = unique[counts == counts.max()]
    pred = np.random.choice(tied)
    return pred
```
%% Cell type:markdown id: tags:
Next, a function to do the leave-one-out cross-validation (loocv) is needed. Effectively it just has the preprocess function and knn inside a for-loop, and goes through the given data one element at a time. It then returns the predictions for the data.
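
To make the splitting explicit, here is a small sketch of the index pairs that leave-one-out produces. The helper name `leave_one_out_splits` is hypothetical and not part of the exercise code.

```python
import numpy as np

def leave_one_out_splits(n_samples):
    """Yield (test_index, train_indexes) for every leave-one-out round."""
    all_idx = np.arange(n_samples)
    for i in range(n_samples):
        # Every sample except i is used for training
        yield i, np.delete(all_idx, i)

for test_idx, train_idx in leave_one_out_splits(4):
    print(test_idx, train_idx)
```

Each sample is the test point exactly once, so n samples give n rounds with n - 1 training points each.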
%% Cell type:code id: tags:
``` python
def loocv(data_arr, classifiers, k):
    prediction = np.zeros(len(data_arr))
    for i in range(0, len(data_arr)):
        test, train, test_class, train_class = preprocess(data_arr, classifiers, i)
        prediction[i] = knn(test, train, train_class, k)
    return prediction
```
%% Cell type:markdown id: tags:
Lastly, a function is needed to calculate the C-index. This has been covered extensively in previous works/lectures, so only the code for the C-index is presented.
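
As a brief reminder, the pairwise quantity that the code below computes can be written as follows. The notation is assumed here for illustration and is not taken from the lectures: $y_i$ are the true labels, $\hat{y}_i$ the predictions, and $P$ the set of pairs with different true labels.

$$
C = \frac{1}{|P|} \sum_{(i,j) \in P} h_{ij},
\qquad
P = \{(i,j) : i < j,\ y_i \neq y_j\},
\qquad
h_{ij} =
\begin{cases}
1 & \text{if } \hat{y}_i \text{ and } \hat{y}_j \text{ are ordered like } y_i \text{ and } y_j \\
0.5 & \text{if } \hat{y}_i = \hat{y}_j \\
0 & \text{otherwise}
\end{cases}
$$

A value of 0.5 corresponds to chance level and 1 to a perfect ordering. With two classes, this quantity equals the area under the ROC curve.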
%% Cell type:code id: tags:
``` python
def cind(true_labels, predictions):
    n = 0
    h_sum = 0
    for i in range(0, len(true_labels)):
        t = true_labels[i]
        p = predictions[i]
        for j in range(i+1, len(true_labels)):
            nt = true_labels[j]
            # Named npred rather than np, so the numpy alias is not shadowed
            npred = predictions[j]
            # Only pairs with different true labels are counted
            if(t != nt):
                n = n+1
                if((p < npred and t < nt) or (p > npred and t > nt)):
                    h_sum = h_sum + 1
                elif((p < npred and t > nt) or (p > npred and t < nt)):
                    h_sum = h_sum + 0
                elif(p == npred):
                    h_sum = h_sum + 0.5
    # float() avoids integer division when h_sum happens to be an int
    C_index = float(h_sum) / n
    return C_index
```
%% Cell type:markdown id: tags:
The functions are then combined in a for-loop, where we use four different sample sizes (10, 50, 100, 500) and run leave-one-out CV a hundred times on random data for each size.
%% Cell type:code id: tags:
``` python
C = np.zeros((4,100))
sample_size = np.array([10,50,100,500])
for i in range(0, 4):
    for j in range(0,100):
        x = np.random.randn(sample_size[i])
        # Integer division, so that the array length is an int
        y0 = np.zeros(sample_size[i]//2)
        y1 = np.ones(sample_size[i]//2)
        y = np.append(y0, y1)
        np.random.shuffle(y)
        pred = loocv(x, y, 3)
        C[i,j] = cind(y, pred)
```
%% Cell type:code id: tags:
``` python
size = np.array([10,50,100,500])
for i in range(0, 4):
    k = size[i]
    print("The mean with a sample size of %s is %s and the standard deviation is %s" % (k, np.mean(C[i]), np.std(C[i])))
```
%% Cell type:markdown id: tags:
As the sample size grows, the C-index estimates vary less and cluster more tightly around chance level (0.5), which is what we expect when the labels are independent of the data. This can already be seen from the standard deviation, which shrinks as the sample size grows. For a sample size of 10, there were 19 instances where the C-index was 0.7 or greater. At 50 samples, there was only one such instance, and for the larger sample sizes there were none. The histograms are shown below.
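
The three numbers discussed above (mean, standard deviation and the count of runs at or above 0.7) can be collected with a small helper. This is a sketch only: `summarize_runs` is a hypothetical name and the example values are made up, not results from the experiment.

```python
import numpy as np

def summarize_runs(c_values, threshold=0.7):
    """Return mean, standard deviation and the number of runs at or above `threshold`."""
    c_values = np.asarray(c_values, dtype=float)
    n_high = int((c_values >= threshold).sum())
    return c_values.mean(), c_values.std(), n_high

# Hypothetical C-index values from five repetitions, for illustration only
example = [0.45, 0.70, 0.55, 0.30, 0.75]
mean, std, n_high = summarize_runs(example)
print(mean, std, n_high)
```

Applied to each row of `C`, this would give one summary per sample size.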
%% Cell type:code id: tags:
``` python
num = np.where(C[0] >= 0.7)
print(len(num[0]))
b = np.histogram(C[0], 10, range = (0, 1))
x = np.array(b[1]).flatten()
x = x[0:len(x)-1]
# x holds the left bin edges, so the bars are aligned to the edge
plt.bar(x, b[0], width = 0.1, color = "orange", align = "edge")
plt.title("Sample size of 10")
```
%%%% Output: execute_result
<matplotlib.text.Text at 0x10b36e190>
%%%% Output: display_data
![]()
%% Cell type:code id: tags:
``` python
num = np.where(C[1] >= 0.7)
print(len(num[0]))
b = np.histogram(C[1], 25, range = (0, 1))
x = np.array(b[1]).flatten()
x = x[0:len(x)-1]
# x holds the left bin edges, so the bars are aligned to the edge
plt.bar(x, b[0], width = 0.040, color = "orange", align = "edge")
plt.title("Sample size of 50")
plt.xlim(0.2, 0.8)
```
%%%% Output: execute_result
(0.2, 0.8)
%%%% Output: display_data
![]()
%% Cell type:code id: tags:
``` python
num = np.where(C[2] >= 0.7)
print(len(num[0]))
b = np.histogram(C[2], 50, range = (0, 1))
x = np.array(b[1]).flatten()
x = x[0:len(x)-1]
# x holds the left bin edges, so the bars are aligned to the edge
plt.bar(x, b[0], width = 0.02, color = "orange", align = "edge")
plt.title("Sample size of 100")
plt.xlim(0.3, 0.7)
```
%%%% Output: execute_result
(0.3, 0.7)
%%%% Output: display_data
![]()
%% Cell type:code id: tags:
``` python
num = np.where(C[3] >= 0.7)
print(len(num[0]))
b = np.histogram(C[3], 100, range = (0, 1))
x = np.array(b[1]).flatten()
x = x[0:len(x)-1]
# x holds the left bin edges, so the bars are aligned to the edge
plt.bar(x, b[0], width = 0.01, color = "orange", align = "edge")
plt.title("Sample size of 500")
plt.xlim(0.4, 0.6)
```
%%%% Output: execute_result
(0.4, 0.6)
%%%% Output: display_data
![]()
%% Cell type:markdown id: tags:
The second part of the exercise examines how feature selection can be implemented correctly and incorrectly. The first step was to create a function "feature_select", which checks for correlation between each feature (column) of the data and the class labels. The function calculates Kendall's tau between each data column and the class labels, and returns the indexes of the columns with the largest absolute tau values.
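
A toy example may help to see what the ranking does. The sketch below uses made-up data in which only the middle column follows the labels. The helper name `rank_features_by_tau` is hypothetical, and `scipy.stats.kendalltau` is assumed to return the statistic first and the p-value second.

```python
import numpy as np
import scipy.stats as st

def rank_features_by_tau(data, labels):
    """Return column indexes sorted by decreasing |Kendall tau| with the labels."""
    tau = np.zeros(data.shape[1])
    for j in range(data.shape[1]):
        tau[j], _ = st.kendalltau(data[:, j], labels)
    return np.argsort(-np.abs(tau))

# Toy data, made up for illustration: column 1 increases with the labels
toy_labels = np.array([-1, -1, -1, 1, 1, 1])
toy_data = np.array([
    [0.3, 0.1, 2.0],
    [1.9, 0.2, 0.5],
    [0.7, 0.3, 1.1],
    [1.2, 0.8, 1.6],
    [0.1, 0.9, 0.9],
    [1.5, 1.0, 1.3],
])
ranking = rank_features_by_tau(toy_data, toy_labels)
print(ranking[0])
```

Taking the absolute value means that strongly negative correlations are ranked as highly as strongly positive ones.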
%% Cell type:code id: tags:
``` python
def feature_select(data, classifiers, no_of_feat):
    tau = np.zeros(len(data.T))
    p_value = np.zeros(len(data.T))
    for i in range(0, len(data.T)):
        tau[i], p_value[i] = st.kendalltau(data.T[i], classifiers)
    tau = np.absolute(tau)
    idx = np.argsort(-tau)
    return idx[0:no_of_feat]
```
%% Cell type:markdown id: tags:
The wrong way to do feature selection is to select the features using the full data set, and only then split the data for leave-one-out cross-validation. This is implemented in the "wrong_feat_preprocess" and "wrong_feat_loocv" functions below.
%% Cell type:code id: tags:
``` python
# This function should do regular preprocessing, but it uses the feature_select function to determine
# which features to use. So all of the samples are used, but not all the columns.
def wrong_feat_preprocess(data, classifiers, index, number_of_features):
    used_indexes = feature_select(data, classifiers, number_of_features)
    new_set = data.T[used_indexes]
    new_set = new_set.T
    test, train, test_class, train_class = preprocess(new_set, classifiers, index)
    return test, train, test_class, train_class
```
%% Cell type:code id: tags:
``` python
def wrong_feat_loocv(data, classifiers, number_of_features, k):
    prediction = np.zeros(len(data))
    for i in range(0, len(data)):
        test, train, test_class, train_class = wrong_feat_preprocess(data, classifiers, i, number_of_features)
        prediction[i] = knn(test, train, train_class, k)
    return prediction
```
%% Cell type:markdown id: tags:
The right way to do feature selection is to first split off the test point and then select the features using only the training set. This is implemented below in the "right_feat_loocv" function.
%% Cell type:code id: tags:
``` python
# Compute the leave-one-out cross-validation estimate with a 3-nearest-neighbours classifier so that
# the feature selection is done internally on each round of cross-validation.
# That is, on each round the best 10 features are selected using only the 49 training examples.
def right_feat_loocv(data, classifiers, number_of_features, k):
    prediction = np.zeros(len(data))
    for i in range(0, len(data)):
        test, train, test_class, train_class = preprocess(data, classifiers, i)
        idx = feature_select(train, train_class, number_of_features)
        new_test = test[idx]
        new_train = train.T[idx]
        new_train = new_train.T
        # Store into the local array (the original wrote to an undefined local name "pred")
        prediction[i] = knn(new_test, new_train, train_class, k)
    return prediction
```
%% Cell type:markdown id: tags:
To test the functions, a data set of 50 samples with 1000 features each is created (X), along with 50 class labels (Y), 25 of them -1 and 25 of them 1, in randomly shuffled order.
%% Cell type:code id: tags:
``` python
X = np.random.randn(50, 1000)
Y0 = np.ones(25)*-1
Y1 = np.ones(25)
Y = np.append(Y0, Y1)
Y = Y.astype(float)
np.random.shuffle(Y)
```
%% Cell type:markdown id: tags:
Running the two methods gave a C-index of 0.74 for the wrong method and 0.34 for the correct method. (I got 0.82 and 0.48 in the first run.) There's naturally some amount of randomness in the data, but 0.34 seems like quite a bad result. In any case, it's clear that the wrong way gives a much better-looking result, but only because the features were selected using the labels of the test points as well, which leaks information into the evaluation.
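
Why selecting on the full data inflates the estimate can be illustrated on pure noise. The sketch below uses its own seeded data and hypothetical names (`X_demo`, `y_demo`, `abs_tau_per_feature`) and only 200 features to keep it quick. It is not a rerun of the experiment above.

```python
import numpy as np
import scipy.stats as st

def abs_tau_per_feature(data, labels):
    """Absolute Kendall tau between every column of `data` and `labels`."""
    out = np.zeros(data.shape[1])
    for j in range(data.shape[1]):
        tau, _ = st.kendalltau(data[:, j], labels)
        out[j] = abs(tau)
    return out

rng = np.random.RandomState(0)
X_demo = rng.randn(50, 200)                            # noise features, unrelated to the labels
y_demo = rng.permutation(np.repeat([-1.0, 1.0], 25))   # balanced random labels

abs_tau = abs_tau_per_feature(X_demo, y_demo)
top10 = np.sort(abs_tau)[-10:]                         # the 10 features that would be selected
print("mean |tau| over all features:", abs_tau.mean())
print("mean |tau| over the 10 selected features:", top10.mean())
```

The selected features look correlated with the labels simply because they are the extremes among many candidates. When the test point's label takes part in that selection, the classifier is evaluated on a pattern it has already seen.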
%% Cell type:code id: tags:
``` python
pred = wrong_feat_loocv(X, Y, 10, 3)
wrong_feat_C = cind(Y, pred)
print(wrong_feat_C)
```
%% Cell type:code id: tags:
``` python
predictions = right_feat_loocv(X, Y, 10, 3)
right_feat_C = cind(Y, predictions)
print(right_feat_C)
```
%% Cell type:code id: tags:
``` python
```