What is Machine Learning?
Machine learning is a hot topic in the industry, and in recent years many articles, books, and videos have been published on the subject. However, as a programmer without much math background, I often found them difficult to digest. Luckily, Andrew Ng published an online course that gave me a good taste of what machine learning is about. To strengthen my knowledge, I found it better to get my hands dirty by working through some examples in code. I would like to document my journey here and hopefully benefit others going through the same process. Before that, let me quickly review what machine learning is. Among all the definitions, I like this one the most:
Machine learning is semi-automated extraction of knowledge from data
The idea is to apply algorithms to data in order to extract patterns and knowledge from it; that knowledge is then often used to answer your questions about the data.
Supervised vs Unsupervised Learning
There are two main categories of machine learning:
- supervised – we tell the machine the answers during training, and afterwards we expect it to give us answers for unseen data.
- unsupervised – we expect the machine to look at the data and figure out the patterns for us without labeled training data. It is normally used to extract the structure that best represents the dataset, and there is no "right answer" (e.g. clustering). The sketch below makes the difference concrete.
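Here is a minimal sketch of that difference using scikit-learn (the library we use later in this post). The toy arrays are made up purely for illustration: the supervised model is given both the data and the answers (fit(X, y)), while the unsupervised model only ever sees the data (fit(X)).

from sklearn.neighbors import KNeighborsClassifier
from sklearn.cluster import KMeans

# toy data: 6 observations with 2 features each, plus a label for each observation
X = [[1, 1], [1, 2], [2, 1], [8, 8], [8, 9], [9, 8]]
y = [0, 0, 0, 1, 1, 1]

# supervised: the model learns from the data AND the answers (y)
clf = KNeighborsClassifier(n_neighbors=3)
clf.fit(X, y)
print(clf.predict([[2, 2], [9, 9]]))   # labels for unseen points, e.g. [0 1]

# unsupervised: the model only sees the data and groups it by structure
km = KMeans(n_clusters=2, random_state=0)
km.fit(X)
print(km.labels_)                      # cluster assignments; no "right answer" to check against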
Common questions people ask about machine learning
- Which model should I use?
- Which attributes of my data should I include in the model?
- How do I tune the model for the best performance?
- How do I make sure the model I am building will "generalize" to unseen data?
- How can I estimate how well the model is likely to perform on unseen data?
Get your hands dirty with Python
Here we will use scikit-learn in Python to prototype. scikit-learn groups its machine learning models into a few main areas:
- Classification (supervised, predict categorical response)
- Regression (supervised, predict continuous response)
- Clustering (unsupervised)
The scikit-learn cheatsheet (linked in the references below) helps you pick the right model for your task.
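The worked example below is a classification task, but for completeness here is a minimal sketch of the regression case (predicting a continuous response). The tiny hours-studied vs. exam-score dataset is made up purely for illustration.

from sklearn.linear_model import LinearRegression

# made-up data: hours studied (single feature) vs. exam score (continuous response)
X = [[1], [2], [3], [4], [5]]
y = [52, 57, 61, 68, 71]

reg = LinearRegression()
reg.fit(X, y)
print(reg.coef_, reg.intercept_)   # learned slope and intercept
print(reg.predict([[6]]))          # predicted score for 6 hours of study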
Classification
For the first example, we will classify iris species using the iris dataset provided by scikit-learn. Studies show that petal and sepal measurements can be used to tell iris species apart, so we are going to run this data against two different classification models:
- K-nearest neighbors (KNN) – find the K nearest neighbors and classify by majority vote. Sometimes KNN cannot classify with confidence: for example, with 3 labels (A, B, C) and K=5, your 5 nearest neighbors could have labels A=2, B=2, and C=1. A and B are tied, so the prediction cannot be made with confidence.
- Logistic regression – uses a different hypothesis class to predict the probability that a given example belongs to the "1" class versus the "0" class. Although we have more than two classes in this example, we can still solve it with logistic regression by breaking the problem into multiple binary logistic regressions (i.e. one-vs-rest). We will peek at these per-class probabilities right after the code block below. For more about logistic regression, you can watch Andrew Ng's lecture video on the topic.

from sklearn.datasets import load_iris

iris = load_iris()            # popular dataset that ships with scikit-learn
type(iris)                    # sklearn.datasets.base.Bunch

print iris.data               # 2D array: one row per observation, one column per feature
print type(iris.data)         # numpy.ndarray
print iris.data.shape         # (150L, 4L) = 150 rows and 4 columns
print iris.feature_names      # list of feature names
print iris.target             # ndarray of 0, 1, 2 (the encoded species)
print iris.target_names       # labels
print iris.target.shape       # (150L,) = 150 rows

# store feature matrix in "X" (uppercase: 2D matrix)
X = iris.data
# store response vector in "y" (lowercase: 1D vector)
y = iris.target

# K-nearest neighbors with K=1
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier(n_neighbors=1)
knn.fit(X, y)
knn.predict([[3, 5, 4, 2]])   # output=[2]

x_new = [[3, 5, 4, 2], [5, 4, 3, 2]]
knn.predict(x_new)            # output=[2, 1]

# K-nearest neighbors with K=5
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X, y)
knn.predict(x_new)            # output=[1, 1]

# Logistic regression
from sklearn.linear_model import LogisticRegression
logreg = LogisticRegression()
logreg.fit(X, y)
logreg.predict(x_new)         # output=[2, 0]
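To see the majority vote and the one-vs-rest probabilities described above, both classifiers expose a predict_proba method. This short follow-up assumes the knn, logreg, and x_new variables from the block above; the exact numbers will vary with your scikit-learn version.

# per-class probabilities for the same two new observations
print(knn.predict_proba(x_new))      # fraction of the 5 nearest neighbors falling in each class
print(logreg.predict_proba(x_new))   # one-vs-rest probabilities, one column per class

# the predicted class is simply the column with the highest probability
import numpy as np
print(np.argmax(logreg.predict_proba(x_new), axis=1))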
Pick the right model
As you can see, the three models above gave us different predictions, so which one should we use? We can pick one through a systematic validation approach such as:
- train/test split – train and test the model on different subsets of the data; evaluating on the same data you trained on easily rewards overfitting.
- k-fold cross-validation – a single train/test split can have high variance because the score depends on which observations land in the test set. To reduce this, perform the train/test split k times on different partitions and average the scores. Generally, 10-fold is recommended because it works well for many cases.
Train/Test Split Approach
from sklearn.cross_validation import train_test_split   # sklearn.model_selection in newer versions
from sklearn import metrics

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=4)

# validate logistic regression
logreg = LogisticRegression()
logreg.fit(X_train, y_train)
y_pred = logreg.predict(X_test)
print metrics.accuracy_score(y_test, y_pred)   # output=0.95

# KNN with K=5
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)
y_pred = knn.predict(X_test)
print metrics.accuracy_score(y_test, y_pred)   # output=0.966666666667

# KNN with K=1
knn = KNeighborsClassifier(n_neighbors=1)
knn.fit(X_train, y_train)
y_pred = knn.predict(X_test)
print metrics.accuracy_score(y_test, y_pred)   # output=0.95

# From the accuracy check above, KNN with K=5 is the best of the three models.
# But is K=5 the best K? We can search for an even better value with the code below.

# try K=1 through K=25 and record the testing accuracy
k_range = range(1, 26)
scores = []
for k in k_range:
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(X_train, y_train)
    y_pred = knn.predict(X_test)
    scores.append(metrics.accuracy_score(y_test, y_pred))

# import Matplotlib (scientific plotting library)
import matplotlib.pyplot as plt
# allow plots to appear within the notebook
%matplotlib inline

# plot the relationship between K and testing accuracy
plt.plot(k_range, scores)
plt.xlabel('Value of K for KNN')
plt.ylabel('Testing Accuracy')

# From the plot, K=6 through K=16 give the best accuracy.
# We pick K=11 and re-train on the full dataset for prediction.

# instantiate the model with the best known parameters
knn = KNeighborsClassifier(n_neighbors=11)
# train the model with X and y (not X_train and y_train)
knn.fit(X, y)
# make a prediction for an out-of-sample observation
knn.predict([[3, 5, 4, 2]])
K-fold Cross-Validation Approach
from sklearn.cross_validation import cross_val_score   # sklearn.model_selection in newer versions

# 10-fold cross-validation with K=5 for KNN (the n_neighbors parameter)
knn = KNeighborsClassifier(n_neighbors=5)
scores = cross_val_score(knn, X, y, cv=10, scoring='accuracy')
print scores          # an array of 10 scores
print scores.mean()   # 0.966666666667

# search for an optimal value of K for KNN
k_range = range(1, 31)
k_scores = []
for k in k_range:
    knn = KNeighborsClassifier(n_neighbors=k)
    scores = cross_val_score(knn, X, y, cv=10, scoring='accuracy')
    k_scores.append(scores.mean())
print k_scores

# Hard to read from a list of numbers, so visualize it as a graph.
import matplotlib.pyplot as plt
%matplotlib inline

# plot the value of K for KNN (x-axis) versus the cross-validated accuracy (y-axis)
plt.plot(k_range, k_scores)
plt.xlabel('Value of K for KNN')
plt.ylabel('Cross-Validated Accuracy')

# The graph shows maximum accuracy at several different values of K.
# Pick the highest such K, because a larger K gives a less complex (smoother) model.

# 10-fold cross-validation with the best KNN model
knn = KNeighborsClassifier(n_neighbors=20)
print cross_val_score(knn, X, y, cv=10, scoring='accuracy').mean()   # output: 0.98

# check logistic regression using 10-fold cross-validation as well
from sklearn.linear_model import LogisticRegression
logreg = LogisticRegression()
print cross_val_score(logreg, X, y, cv=10, scoring='accuracy').mean()   # output: 0.953333333333
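The manual loop above is easy to follow, but scikit-learn can also combine the parameter search and the cross-validation into one step with GridSearchCV. This is only a sketch of that alternative, reusing X and y from earlier; in older scikit-learn versions the import lives in sklearn.grid_search instead of sklearn.model_selection.

from sklearn.model_selection import GridSearchCV   # sklearn.grid_search in older versions
from sklearn.neighbors import KNeighborsClassifier

# try every K from 1 to 30 with 10-fold cross-validation, scored on accuracy
param_grid = {'n_neighbors': list(range(1, 31))}
grid = GridSearchCV(KNeighborsClassifier(), param_grid, cv=10, scoring='accuracy')
grid.fit(X, y)

print(grid.best_params_)   # the best K found by the search
print(grid.best_score_)    # cross-validated accuracy of that model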
References
- An Introduction to Statistical Learning (book)
- Learning Paradigms (video)
- Kaggle competitions
- Code of the video series from Kelvin
- scikit-learn documentation: Dataset loading utilities
- Jake VanderPlas: Fast Numerical Computing with NumPy (slides, video)
- Scott Shell: An Introduction to NumPy (PDF)
- 15 hours of expert videos on ML
- Machine Learning Cheatsheet
- Deep Learning Tutorial
- Machine Learning in Action (book)
- Natural Language Processing with Python
- Python pipelining
- Neural Networking/ Deep Learning