## What is Machine Learning?

Machine learning is a hot topic in the industry. In recent years, many articles, books, and videos have been published on the subject. However, as a programmer without much math background, I often found them difficult to digest. Luckily, Andrew Ng published an online course that gave me a good taste of what machine learning is about. To strengthen my knowledge, I found it better to get my hands dirty by working through some examples in code. I would like to document my journey here, and hopefully it benefits those going through the same process. Before that, let me quickly review what machine learning is. Among all the definitions, I like this one the most:

Machine learning is semi-automated extraction of knowledge from data

The idea is to apply algorithms to data in order to extract patterns or knowledge from it. That knowledge is then often used to answer questions about the data.

## Supervised vs Unsupervised Learning

There are two main categories of machine learning:

**Supervised** – we tell the machine the answers during training. After that, we expect it to give us answers for unseen data.

**Unsupervised** – we expect the machine to look at the data and figure out the patterns for us, without being given any answers. It is normally used to extract the structure that best represents the dataset, and there is no “right answer” (e.g. clustering).
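To make the distinction concrete, here is a minimal sketch in plain Python. The measurements, the labels "small" and "large", and both helper functions are made up for illustration. They are not part of any library, and the clustering routine is only a bare-bones two-group version of the idea.

```python
# Toy 1-D data (made-up numbers, loosely resembling petal lengths).
train_x = [1.4, 1.5, 1.3, 4.7, 4.5, 4.9]
train_y = ["small", "small", "small", "large", "large", "large"]


def predict_supervised(x, xs, ys):
    """Supervised: answer with the label of the closest training example."""
    nearest = min(range(len(xs)), key=lambda i: abs(xs[i] - x))
    return ys[nearest]


def cluster_unsupervised(xs, iterations=10):
    """Unsupervised: split the points into two groups without any labels."""
    centers = [min(xs), max(xs)]  # simple starting guess
    groups = []
    for _ in range(iterations):
        # Assign every point to its closest center.
        groups = [0 if abs(x - centers[0]) <= abs(x - centers[1]) else 1 for x in xs]
        # Move each center to the mean of the points assigned to it.
        for c in (0, 1):
            members = [x for x, g in zip(xs, groups) if g == c]
            if members:
                centers[c] = sum(members) / len(members)
    return groups


print(predict_supervised(1.6, train_x, train_y))  # small
print(cluster_unsupervised(train_x))              # [0, 0, 0, 1, 1, 1]
```

The supervised function can name its answer because it was given the names. The unsupervised one can only say which points belong together.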

## Common questions people ask about machine learning

- Which model should I use?
- Which attributes of my data should I include in the model?
- How do I optimize the model for the best performance?
- How do I make sure the model I am building will “generalize” to unseen data?
- How can I estimate how well the model is likely to perform on unseen data?

## Get your hands dirty with Python

Here we will use scikit-learn in Python to prototype. scikit-learn covers several areas of machine learning models, including:

- **Classification** (supervised, predicts a categorical response)
- **Regression** (supervised, predicts a continuous response)
- **Clustering** (unsupervised)

The scikit-learn documentation provides a cheat sheet (the “Choosing the right estimator” flowchart) that helps you pick the right model for your task.

### Classification

For the first example, we will classify iris species using the iris dataset provided by scikit-learn. Studies show that petal size can be used to tell the iris species apart. The dataset records sepal and petal length and width for each flower, and we are going to run these measurements against two different classification models:

**K-nearest neighbors (KNN)** – find the K nearest neighbors and classify by majority vote. There are times when KNN cannot classify cleanly. For example, if you have 3 labels (i.e. A, B, C) with K=5, your 5 nearest neighbors could have the labels A=2, B=2 and C=1. In this case, A and B are tied and you cannot classify with confidence.

**Logistic regression** – uses a different hypothesis class to predict the probability that a given example belongs to the “1” class versus the probability that it belongs to the “0” class. In this example, although we have more than 2 classes, we can still solve it through logistic regression by breaking the problem down into multiple logistic regressions (i.e. one vs. rest). For more information about logistic regression, see Andrew Ng's video lecture on the topic.
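The tie case described above can be sketched in a few lines of plain Python. The `majority_vote` helper is hypothetical and written only for illustration. It is not how scikit-learn resolves ties internally.

```python
from collections import Counter


def majority_vote(neighbor_labels):
    """Return (label, is_tie) for the labels of the K nearest neighbors."""
    counts = Counter(neighbor_labels).most_common()
    top_label, top_count = counts[0]
    # A tie exists when another label has the same highest count.
    is_tie = len(counts) > 1 and counts[1][1] == top_count
    return top_label, is_tie


# K=5 with three labels: A=2, B=2, C=1, so there is no clear winner.
print(majority_vote(["A", "B", "A", "B", "C"]))  # ('A', True)
# K=5 with a clear majority for A.
print(majority_vote(["A", "A", "A", "B", "C"]))  # ('A', False)
```

With only two classes, an odd K rules out ties. With three or more classes, ties are possible for any K greater than 1.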

```python
from sklearn.datasets import load_iris

iris = load_iris()          # popular dataset that ships with scikit-learn
print(type(iris))           # a Bunch object (dictionary-like container)
print(iris.data)            # 2-D array of measurements
print(type(iris.data))      # numpy.ndarray
print(iris.data.shape)      # (150, 4) = 150 rows and 4 columns
print(iris.feature_names)   # list of feature names
print(iris.target)          # ndarray of 0, 1, 2
print(iris.target_names)    # species labels
print(iris.target.shape)    # (150,) = 150 rows

# store feature matrix in "X" (uppercase because it is a matrix)
X = iris.data
# store response vector in "y" (lowercase because it is a vector)
y = iris.target

# K-nearest neighbors with K=1
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier(n_neighbors=1)
knn.fit(X, y)
knn.predict([[3, 5, 4, 2]])  # predict expects a 2-D input; reported output = [2]
x_new = [[3, 5, 4, 2], [5, 4, 3, 2]]
knn.predict(x_new)  # reported output = [2, 1]

# K-nearest neighbors with K=5
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X, y)
x_new = [[3, 5, 4, 2], [5, 4, 3, 2]]
knn.predict(x_new)  # reported output = [1, 1]

# Logistic regression
from sklearn.linear_model import LogisticRegression
logreg = LogisticRegression()
logreg.fit(X, y)
logreg.predict(x_new)  # reported output = [2, 0]; may differ across scikit-learn versions
```
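The one-vs-rest idea mentioned for logistic regression can also be sketched without scikit-learn. The per-class scores below are made-up numbers standing in for the output of three separately trained binary models, and `one_vs_rest_predict` is a hypothetical helper, not a library function.

```python
import math


def sigmoid(z):
    """Squash a raw score into a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))


def one_vs_rest_predict(scores_by_class):
    """Each binary model answers "is it my class or not?".
    The class whose model is most confident wins."""
    probs = {label: sigmoid(z) for label, z in scores_by_class.items()}
    best_label = max(probs, key=probs.get)
    return best_label, probs


# Hypothetical raw scores from three binary classifiers for one flower.
scores = {"setosa": -2.0, "versicolor": 0.5, "virginica": 1.5}
label, probs = one_vs_rest_predict(scores)
print(label)  # virginica
```

Note that the three probabilities come from independent models, so they do not have to sum to 1.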

## Pick the right model

As you can see, the 3 models above (KNN with K=1, KNN with K=5, and logistic regression) gave us different predictions. Which one should we use? We can pick one through a systematic validation approach such as:

**Train/test split** – the model is trained and tested on different data. Testing on the same data used for training rewards overfitting.

**K-fold cross-validation** – a single train/test split can suffer from high variance, because the result depends on which samples end up in the test set. To fix this problem, you can perform the train/test split k times on different train/test sets and average the results to gain confidence. Generally, 10-fold is recommended as it works well in many cases.
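To show what "k times on different train/test sets" means mechanically, here is a minimal sketch of how the index sets for k folds could be generated. The `k_fold_indices` function is hypothetical and deliberately simplified. It uses consecutive folds with no shuffling, whereas real implementations usually shuffle or stratify as well.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, test_indices) pairs for k consecutive folds."""
    indices = list(range(n_samples))
    # Spread the remainder so fold sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size


folds = list(k_fold_indices(n_samples=10, k=5))
for train, test in folds:
    print("train:", train, "test:", test)
```

Every sample lands in the test set exactly once, so the averaged score no longer depends on one lucky or unlucky split.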

### Train/Test Split Approach

```python
# continues from the code above: X, y, KNeighborsClassifier and LogisticRegression are already defined
# train_test_split lives in sklearn.model_selection (sklearn.cross_validation in old releases)
from sklearn.model_selection import train_test_split
from sklearn import metrics

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=4)

# validate logistic regression
logreg = LogisticRegression()
logreg.fit(X_train, y_train)
y_pred = logreg.predict(X_test)
print(metrics.accuracy_score(y_test, y_pred))  # reported output = 0.95

# KNN with K=5
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)
y_pred = knn.predict(X_test)
print(metrics.accuracy_score(y_test, y_pred))  # reported output = 0.966666666667

# KNN with K=1
knn = KNeighborsClassifier(n_neighbors=1)
knn.fit(X_train, y_train)
y_pred = knn.predict(X_test)
print(metrics.accuracy_score(y_test, y_pred))  # reported output = 0.95

# From the accuracy check above, KNN with K=5 is the better model among them.
# But is K=5 the best? We can locate an even better value with the code below.

# try K=1 through K=25 and record testing accuracy
k_range = range(1, 26)
scores = []
for k in k_range:
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(X_train, y_train)
    y_pred = knn.predict(X_test)
    scores.append(metrics.accuracy_score(y_test, y_pred))

# import Matplotlib (scientific plotting library)
import matplotlib.pyplot as plt

# allow plots to appear within the notebook (IPython/Jupyter magic)
%matplotlib inline

# plot the relationship between K and testing accuracy
plt.plot(k_range, scores)
plt.xlabel('Value of K for KNN')
plt.ylabel('Testing Accuracy')

# From the plot, you can tell K=6 to K=16 is the best range of K to pick from.
# We pick K=11 and re-train on our full dataset for prediction using the code below.

# instantiate the model with the best known parameters
knn = KNeighborsClassifier(n_neighbors=11)

# train the model with X and y (not X_train and y_train)
knn.fit(X, y)

# make a prediction for an out-of-sample observation (2-D input)
knn.predict([[3, 5, 4, 2]])
```
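The accuracy figure used above is simply the fraction of test samples predicted correctly. A hand-rolled equivalent, written here only for illustration with made-up labels, looks like this:

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


y_true = [0, 1, 2, 2, 1]  # made-up true labels
y_pred = [0, 1, 2, 1, 1]  # made-up predictions, one of them wrong
print(accuracy(y_true, y_pred))  # 4 of 5 correct -> 0.8
```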

### K-fold Cross-Validation Approach

```python
# continues from the code above: X, y and KNeighborsClassifier are already defined
# cross_val_score lives in sklearn.model_selection (sklearn.cross_validation in old releases)
from sklearn.model_selection import cross_val_score

# 10-fold cross-validation with K=5 for KNN (the n_neighbors parameter)
knn = KNeighborsClassifier(n_neighbors=5)
scores = cross_val_score(knn, X, y, cv=10, scoring='accuracy')
print(scores)         # an array of 10 scores
print(scores.mean())  # reported output = 0.966666666667

# search for an optimal value of K for KNN
k_range = range(1, 31)
k_scores = []
for k in k_range:
    knn = KNeighborsClassifier(n_neighbors=k)
    scores = cross_val_score(knn, X, y, cv=10, scoring='accuracy')
    k_scores.append(scores.mean())
print(k_scores)

# Hard to see from a set of numbers. Visualize it through a graph.
import matplotlib.pyplot as plt

# allow plots to appear within the notebook (IPython/Jupyter magic)
%matplotlib inline

# plot the value of K for KNN (x-axis) versus the cross-validated accuracy (y-axis)
plt.plot(k_range, k_scores)
plt.xlabel('Value of K for KNN')
plt.ylabel('Cross-Validated Accuracy')

# The graph shows the max accuracy at several different values of K.
# Pick the highest of those K values because it gives the least complex model.

# 10-fold cross-validation with the best KNN model
knn = KNeighborsClassifier(n_neighbors=20)
print(cross_val_score(knn, X, y, cv=10, scoring='accuracy').mean())  # reported output = 0.98

# check logistic regression using 10-fold cross-validation as well
from sklearn.linear_model import LogisticRegression
logreg = LogisticRegression()
print(cross_val_score(logreg, X, y, cv=10, scoring='accuracy').mean())  # reported output = 0.953333333333
```
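The rule "several K values tie for the best score, so take the largest one" can be applied in code rather than read off the plot. The sketch below uses made-up scores and a hypothetical `best_k` helper. The `tolerance` parameter is an assumption added to absorb floating-point noise.

```python
def best_k(k_values, mean_scores, tolerance=1e-9):
    """Return the largest K whose score ties the best score."""
    best = max(mean_scores)
    tied = [k for k, s in zip(k_values, mean_scores) if best - s <= tolerance]
    return max(tied)


k_values = [1, 2, 3, 4, 5]
mean_scores = [0.96, 0.98, 0.97, 0.98, 0.95]  # made-up numbers
print(best_k(k_values, mean_scores))  # 4
```

In the notebook above, the same call would be `best_k(list(k_range), k_scores)`.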

## Reference

- An Introduction to Statistical Learning (book)
- Learning Paradigms (video)
- Kaggle competitions
- Code of the video series from Kelvin
- scikit-learn documentation: Dataset loading utilities
- Jake VanderPlas: Fast Numerical Computing with NumPy (slides, video)
- Scott Shell: An Introduction to NumPy (PDF)
- 15 hours of expert videos on ML
- Machine Learning Cheatsheet
- Deep Learning Tutorial
- Machine Learning in Action (book)
- Natural Language Processing with Python
- Python pipelining
- Neural Networking/ Deep Learning
