Cheatsheet - Math in Machine Learning

What do I need in Math for Machine Learning?

You need basic probability and statistics, calculus, linear algebra and convex optimization.

Optimization

Why convex optimization?

Many methods in machine learning are based on finding parameters that minimise some objective function. Very often, the objective function is a weighted sum of two terms: a cost function and a regularization term. In statistical terms, these correspond to the negative log-likelihood and the negative log-prior. If both components are convex and the weights are non-negative, then their sum is also convex.

Why is a convex function such a big deal? If the objective function is convex, every local minimum is also a global minimum, and if it is strictly convex there is at most one minimiser. Non-convex functions may have several local minima, that is, multiple points that are the best in their local neighbourhood but are not globally optimal. Therefore, if you have a non-convex problem, there is usually no practical way to test whether the solution you have found is indeed the best one.
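To make the difference concrete, here is a minimal sketch that runs plain gradient descent from two starting points on a convex function and on a non-convex one. The two example functions, the learning rate and the step count are illustrative assumptions, not something prescribed by the text above.

```python
def gradient_descent(grad, x0, lr=0.01, steps=2000):
    """Plain gradient descent on a 1-D function, given its derivative."""
    x = x0
    for _ in range(steps):
        x -= lr * grad(x)
    return x


# Convex example (assumed for illustration): f(x) = x^2, f'(x) = 2x
def convex_grad(x):
    return 2 * x


# Non-convex example (assumed for illustration): g(x) = x^4 - 3x^2 + x
def nonconvex(x):
    return x**4 - 3 * x**2 + x


def nonconvex_grad(x):
    return 4 * x**3 - 6 * x + 1


# On the convex function both starting points end at the same minimum.
convex_left = gradient_descent(convex_grad, -2.0)
convex_right = gradient_descent(convex_grad, 2.0)

# On the non-convex function each start gets stuck in a different local minimum.
nonconvex_left = gradient_descent(nonconvex_grad, -2.0)
nonconvex_right = gradient_descent(nonconvex_grad, 2.0)

print(convex_left, convex_right)
print(nonconvex_left, nonconvex(nonconvex_left))
print(nonconvex_right, nonconvex(nonconvex_right))
```

On the convex function the starting point does not matter. On the non-convex one, the run that starts at 2.0 settles near x = 1.13, which is only a local minimum: the point near x = -1.30 found by the other run has a lower function value.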

Log-Loss

Log-loss measures the performance of a classifier that outputs a probability for each class, rather than just the most likely class. It is a “soft” measure of accuracy that incorporates the idea of probabilistic confidence. It is intimately tied to information theory: log-loss is the cross entropy between the distribution of the true labels and the predicted probabilities. Minimizing the cross entropy pushes the predicted probabilities toward the true labels, which usually, though not always, improves the accuracy of the classifier as well.

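Now let us look closely at the log-loss formula, where \(y_i\) is the true label (0 or 1) and \(p_i\) is the predicted probability that \(y_i = 1\). There are four main cases for the values of \(y_i\) and \(p_i\):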

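\(\text{Case 1: } y_i = 1,\ p_i = \text{High},\ 1 - y_i = 0,\ 1 - p_i = \text{Low}\)
\(\text{Case 2: } y_i = 1,\ p_i = \text{Low},\ 1 - y_i = 0,\ 1 - p_i = \text{High}\)
\(\text{Case 3: } y_i = 0,\ p_i = \text{Low},\ 1 - y_i = 1,\ 1 - p_i = \text{High}\)
\(\text{Case 4: } y_i = 0,\ p_i = \text{High},\ 1 - y_i = 1,\ 1 - p_i = \text{Low}\)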

Log Loss Formula:

\(\text{logLoss} = -\frac{1}{N}\sum_{i=1}^{N}\left(y_i \log p_i + (1 - y_i)\log(1 - p_i)\right)\)
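The four cases above can be checked numerically with a small sketch of the formula. The function name `log_loss`, the clipping constant `eps` (added only to avoid taking log(0)) and the probabilities 0.9 and 0.1 standing in for "High" and "Low" are assumptions made for illustration.

```python
import math


def log_loss(y_true, p_pred, eps=1e-15):
    """Average binary log-loss; eps clips probabilities away from 0 and 1."""
    total = 0.0
    for y, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1 - eps)  # assumption: clip to avoid log(0)
        total += y * math.log(p) + (1 - y) * math.log(1 - p)
    return -total / len(y_true)


case1 = log_loss([1], [0.9])  # y = 1, p high: confident and right, about 0.105
case2 = log_loss([1], [0.1])  # y = 1, p low: confident and wrong, about 2.303
case3 = log_loss([0], [0.1])  # y = 0, p low: confident and right, about 0.105
case4 = log_loss([0], [0.9])  # y = 0, p high: confident and wrong, about 2.303

print(case1, case2, case3, case4)
```

Cases 1 and 3 are confident correct predictions and give a small loss, while cases 2 and 4 are confident wrong predictions and are penalised heavily.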

Common graphs

1. Linear

\(y = mx + b\) where m is the slope and b is the y-intercept. Both are constants. If m < 0, the line falls to the right. If m = 0, it is a horizontal line. If m > 0, the line rises to the right.

2. Power

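\(y=ax^b\) where a and b are constants. Taking a > 0: if b is a fraction between 0 and 1, the curve is concave for x > 0 and, for x > 1, lies closer to the x-axis as the power gets smaller. If b is an integer greater than 1, the curve is convex when b is even, while for odd b it is convex only for x ≥ 0. If b is a negative integer, the graph splits into two branches on either side of a vertical asymptote at x = 0.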
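The statement about even and odd powers can be spot-checked with the midpoint inequality f((u + v) / 2) ≤ (f(u) + f(v)) / 2, which every convex function satisfies. The helper name `is_midpoint_convex` and the sample grid are assumptions for illustration; passing the check on a finite grid is evidence of convexity, not a proof.

```python
def is_midpoint_convex(f, points, tol=1e-12):
    """Check f((u + v) / 2) <= (f(u) + f(v)) / 2 for every pair of sample points."""
    return all(
        f((u + v) / 2) <= (f(u) + f(v)) / 2 + tol
        for u in points
        for v in points
    )


points = [i / 4 for i in range(-12, 13)]      # grid from -3 to 3
nonneg_points = [p for p in points if p >= 0]  # the part with x >= 0

square_convex = is_midpoint_convex(lambda x: x**2, points)            # even power
cube_convex = is_midpoint_convex(lambda x: x**3, points)              # odd power
cube_convex_nonneg = is_midpoint_convex(lambda x: x**3, nonneg_points)

print(square_convex, cube_convex, cube_convex_nonneg)  # True False True
```

The cube fails on the full grid because it bends downward for negative x, for example between u = -3 and v = -1.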

3. Quadratic

\(y=ax^2 + bx + c\) where a, b and c are constants and a ≠ 0. The graphs are parabolas. If a is positive, the parabola opens upward. If a is negative, it opens downward.

4. Polynomial

\(y=a_nx^n + a_{n-1}x^{n-1} + \dots + a_1x + a_0\) where \(a_n, a_{n-1}, \dots, a_0\) are constants. Only non-negative whole number powers of x are allowed. Linear and quadratic functions are special cases of polynomial functions. Two quick checks can rule out a graph as a polynomial, although passing both does not prove that it is one. First, it must be a function, so it needs to pass the vertical line test. Second, the graph of every polynomial function is a smooth, unbroken curve.

So far we have been talking about univariate polynomials. If the function has more than one variable, it becomes a multivariate polynomial. Optimizing a multivariate polynomial is not trivial; it is a subject of multivariable calculus. You can get a taste of it from this video or you can take the course from Khan Academy here to build up the foundation.

5. Exponential

\(y = ab^x\) where a and b are constants, with a > 0 and b > 0. If the base b is greater than 1, the result is exponential growth. If b is between 0 and 1, the result is exponential decay.
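A tiny sketch of growth versus decay, evaluating the formula at a few points. The helper name `exponential` and the chosen values a = 1, b = 2 and b = 0.5 are assumptions for illustration.

```python
def exponential(a, b, x):
    """Evaluate y = a * b**x."""
    return a * b**x


xs = [0, 1, 2, 3, 4]
growth = [exponential(1.0, 2.0, x) for x in xs]  # base b > 1
decay = [exponential(1.0, 0.5, x) for x in xs]   # base 0 < b < 1

print(growth)  # [1.0, 2.0, 4.0, 8.0, 16.0]
print(decay)   # [1.0, 0.5, 0.25, 0.125, 0.0625]
```

With b = 2 the value doubles at every step, and with b = 0.5 it halves.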

6. Logarithmic

\(y = a \ln(x) + b\) where ln is the natural logarithm and a and b are constants. The function is defined only for positive x.

7. Sinusoidal

\(y=a \sin(bx + c)\) where a, b and c are constants. Functions of this kind are useful for describing anything that has a wave shape with respect to position or time.

Derivative Rules

General Rules

\(\frac{dc}{dx} = 0\) (derivative of a constant)
\(\frac{dx}{dx} = 1\) (derivative of a variable with respect to itself)
\(\frac{d}{dx}(f+g-h) = \frac{df}{dx}+\frac{dg}{dx}-\frac{dh}{dx}\) (The derivative of a sum or difference is equal to the sum or difference of the derivatives)
\(\frac{d}{dx}cf(x) = c\frac{d}{dx}f(x)\)

Product Rule

\(\frac{d}{dx}f \cdot g = f\frac{dg}{dx} + g\frac{df}{dx}\)
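As a quick worked example of the product rule (the choice of f and g is ours, purely for illustration), take \(f = x^2\) and \(g = x^3\), and differentiate each factor with the power rule from the next section:

\(\frac{d}{dx}(x^2 \cdot x^3) = x^2 \cdot 3x^2 + x^3 \cdot 2x = 3x^4 + 2x^4 = 5x^4\)

This agrees with differentiating \(x^5\) directly: \(\frac{d}{dx}x^5 = 5x^4\).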

Power Rule

\(\frac{d}{dx}x^n = nx^{n-1}\)

Chain Rule

If f is a function of g and g is a function of x, then the derivative of f with respect to x equals the derivative of f with respect to g times the derivative of g with respect to x:

\(\frac{df(g(x))}{dx} = \frac{df(g)}{dg}\cdot\frac{dg(x)}{dx}\)

Example:
Let \(f(x) = x^5\) and \(g(x) = x^2+ 1\). Then, \(f(g(x)) = (x^2+ 1)^5\).

The derivative of f(g(x)) can be calculated as below:

\(\frac{df(g)}{dg} = 5g^4 = 5(x^2+1)^4\); \(\frac{dg(x)}{dx} = 2x \). Then \(\frac{df(g(x))}{dx} = 5(x^2+1)^4 \cdot 2x\)
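The result of the example above can be sanity-checked numerically by comparing it with a central-difference approximation of the derivative. The helper names, the step size `h` and the test point x = 1.5 are assumptions for illustration.

```python
def central_difference(func, x, h=1e-5):
    """Approximate func'(x) with a symmetric finite difference."""
    return (func(x + h) - func(x - h)) / (2 * h)


def composite(x):
    """f(g(x)) = (x^2 + 1)^5 from the example."""
    return (x**2 + 1) ** 5


def chain_rule_derivative(x):
    """Derivative obtained with the chain rule: 5 (x^2 + 1)^4 * 2x."""
    return 5 * (x**2 + 1) ** 4 * 2 * x


x = 1.5
numeric = central_difference(composite, x)
analytic = chain_rule_derivative(x)

print(numeric, analytic)  # the two values should agree to several decimal places
```

The finite difference is only an approximation, so a small mismatch in the last digits is expected.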

Law of Logarithms

The natural logarithm has the number e as its base, and it is the logarithm used in nearly all theoretical work. e is approximately equal to 2.718. You can read this post if you want to get an intuition of what it is. We denote the log function with base e as “ln x”. That is to say, \(\ln x = \log_{e}x\).

\(\log AB = \log A + \log B\)
\(\log\frac{A}{B} = \log A - \log B\)
\(\log A^n = n\log A\)
\(\frac{d}{dx} e^x = e^x\)
\(\frac{d}{dx} \ln x = \frac{1}{x}\)
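The three logarithm laws can be verified with Python's `math` module. The values of A, B and n and the list of probabilities are arbitrary assumptions for illustration. The last check shows why the product law matters for likelihoods: the log of a product of probabilities equals the sum of their logs.

```python
import math

A, B, n = 8.0, 2.0, 3  # arbitrary positive values chosen for illustration

product_ok = math.isclose(math.log(A * B), math.log(A) + math.log(B))
quotient_ok = math.isclose(math.log(A / B), math.log(A) - math.log(B))
power_ok = math.isclose(math.log(A**n), n * math.log(A))

# Product law applied to a likelihood: log of a product = sum of logs.
probs = [0.9, 0.8, 0.7]
log_of_product = math.log(math.prod(probs))
sum_of_logs = sum(math.log(p) for p in probs)
likelihood_ok = math.isclose(log_of_product, sum_of_logs)

print(product_ok, quotient_ok, power_ok, likelihood_ok)
```

`math.isclose` is used instead of `==` because floating-point rounding can make the two sides differ in the last digit.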
