Machine Learning From Scratch

What is Machine Learning?

Machine learning is a hot topic in the industry. In recent years, many articles, books, and videos have been published on the subject. However, as a programmer without much of a math background, I often found them difficult to digest. Luckily, Andrew Ng's online course gave me a good taste of what machine learning is about. To strengthen my knowledge, I found it helpful to get my hands dirty by working through some examples in code. I would like to document my journey here and hopefully benefit others going through the same process. Before that, let me start with a quick review of what machine learning is. Among all the definitions, I like this one the most:

Machine learning is semi-automated extraction of knowledge from data

The idea is to apply algorithms to data to extract patterns or knowledge from it; that knowledge is often then used to answer questions about the data.

Supervised vs Unsupervised Learning

There are two main categories of machine learning:

  • supervised – we tell the machine the answers over the course of training. After that, we expect it to give us answers for unseen data.


  • unsupervised – we expect the machine to look at the data and figure out patterns for us without prior training. Normally, it is used to extract the structure that best represents the dataset, and there is no “right answer” (e.g. clustering).
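As a quick illustration of the two categories (a minimal sketch using scikit-learn's iris data; the specific model choices here are mine, just for demonstration), a supervised classifier is trained with known answers, while an unsupervised algorithm finds structure on its own:

```python
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier
from sklearn.cluster import KMeans

X, y = load_iris(return_X_y=True)

# Supervised: we provide the answers (y) during training.
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X, y)
print(knn.predict(X[:1]))    # predicted species for a sample

# Unsupervised: no answers given; the algorithm groups the data itself.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
kmeans.fit(X)
print(kmeans.labels_[:5])    # cluster assignments; no "right answer"
```

Note that the clusters KMeans finds are just groupings; they carry no species labels unless we map them ourselves.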

Common questions people ask about machine learning

  • Which model should I use?
  • Which attributes of my data should I include in the model?
  • How do I optimize this model for the best performance?
  • How do I make sure the model I am building will “generalize” to unseen data?
  • How can I estimate how the model is likely to perform on unseen data?

Get your hands dirty with Python

Here we will use scikit-learn in Python to prototype. Scikit-learn covers a few areas of machine learning models:

  • Classification (supervised, predict categorical response)
  • Regression (supervised, predict continuous response)
  • Clustering (unsupervised)

Scikit-learn provides a cheatsheet that helps you pick the right model for your task.



For the first example, we will classify iris species using the iris dataset provided by scikit-learn. Studies show that petal size can be used to tell iris species apart. So we are going to run the data against two different classification models:

  • K-nearest neighbors (KNN) – find the K nearest neighbors and classify by majority vote. There are cases where KNN cannot decide. For example, with 3 labels (i.e. A, B, C) and K=5, your 5 nearest neighbors could have labels A=2, B=2, and C=1. Here A and B are tied, so you cannot classify with confidence.
  • Logistic regression – uses a different hypothesis class to predict the probability that a given example belongs to the “1” class versus the probability that it belongs to the “0” class. Although we have more than 2 classes in this example, we can still solve it with logistic regression by breaking the problem down into multiple logistic regressions (i.e. one vs. rest). For more about logistic regression, I recommend Andrew Ng's course lectures.
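The two classifiers above can be sketched in scikit-learn like this (the sample measurement at the end is a made-up unseen flower, not a row from the dataset):

```python
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression

iris = load_iris()
X, y = iris.data, iris.target  # 4 features: sepal/petal length and width (cm)

# KNN with K=5: classify by majority vote of the 5 nearest neighbors.
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X, y)

# Logistic regression: scikit-learn handles the multi-class case internally.
logreg = LogisticRegression(max_iter=200)
logreg.fit(X, y)

# A hypothetical unseen measurement to classify.
sample = [[3.0, 5.0, 4.0, 2.0]]
print(knn.predict(sample))     # species index predicted by KNN
print(logreg.predict(sample))  # species index predicted by logistic regression
```

The two models can disagree on the same input, which is exactly why we need a principled way to pick between them.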


Pick the right model

As you can see, the models above gave us different predictions, so which one should we use? We can pick it through a systematic validation approach such as:

  • train/test split – the model is trained and tested on different data; otherwise it easily overfits.
  • k-fold cross-validation – a single train/test split may suffer from high variance, since the result depends on which points end up in the test set. To fix this, you can perform the train/test split k times on different partitions and average the results. Generally, 10-fold is recommended as it works well for many cases.

Train/Test Split Approach

K-fold Cross-Validation Approach
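Both validation approaches take only a few lines in scikit-learn (the `test_size` and `random_state` values here are arbitrary choices for illustration):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# Train/test split: hold out 40% of the data and evaluate on it.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.4, random_state=4)
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)
print(knn.score(X_test, y_test))   # accuracy on the held-out test set

# 10-fold cross-validation: average accuracy over 10 different splits.
scores = cross_val_score(KNeighborsClassifier(n_neighbors=5), X, y, cv=10)
print(scores.mean())
```

The cross-validated mean is a more stable estimate than any single split, at the cost of training the model k times.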

