Machine Learning Mathematics

The Mathematics of Machine Learning Cover.png

Mathematics is the study of topics such as quantity, structure, space and change. Mathematics provides a means to represent the real world in the form of numbers, relationships and equations.

Machine learning employs a wide variety of mathematical concepts and functions in order to create processes that:

  • analyze vast amounts of data in workable timeframes
  • discover actions, relationships and forces occurring in nature 
  • model the real world
  • interact with the real world

In short, mathematics is the base methodology employed to create machine learning models that are used to address challenging applications such as image recognition, face recognition, natural language processing and pattern recognition: 

Mathematics -> Models -> Applications

Machine Learning Mathematics Topics

Activation Functions

Activation functions define the output of a graph node given a set of inputs, as illustrated below. Activation functions: 

  • provide the capability to introduce non-linearity in order to model non-linear aspects of the real world
  • are monotonic - that is, they either constantly increase or decrease - this is important in neural network training to avoid chaotic behavior  

A good discussion of activation functions can be found here.

An activation function receives the sum of inputs which are modified by weights and a bias, modifies the input sum and produces an output:


During neural network training, the weights and bias are repeatedly modified in order to produce a neural network result that has a minimal error (loss) compared to observed training examples.


Activation Function DATA FLOW Equation

The activation function data flow equation is shown below. Input values v, weights w and bias b are fed to a summation function which passes its output s to an activation function g which in turn passes its outputs t to other network nodes.


the activation function data flow equation is:

Activation Function Examples from Nature

An example from biology is the shift in the action potential of a brain cell:

By Original by en:User:Chris 73, updated by en:User:Diberri, converted to SVG by tiZom - Own work, CC BY-SA 3.0,

An example is the haemodynamic response function, which shows the body responding to environmental changes by delivering nutrients such as oxygen and glucose to stressed tissues:

By Lionfish0 - Own work, Public Domain,

Activation Function Examples in Neural Networks

In neural networks, activation functions are used in combination with other functions such a matrix addition and multiplication. The networks below show two examples from Google research. The network on the left was designed by human experts and the network on the right by an automated process:

Rectifier (RELU) Activation function

A rectifier function (also referred to as a Rectified Linear Unit or ReLU) is defined as: 

Rectified linear units, compared to sigmoid function or similar activation functions, allow for faster and effective training of deep neural architectures on large and complex datasets. As shown below, it changes any negative values to zero over the defined functional space:

Common positive comments about ReLU activation functions include:

  • Faster convergence of stochastic gradient descent. Due to their linear, non-saturating form.
  • Simpler, faster computation. Due to their linear form not involving exponential operations.

Common negative comments include:

  • They can be fragile and 'die'. If the learning rate is set too high, a large gradient flowing through a ReLU neuron can cause weights to update is a way that the neuron will never activate again.

Sigmoid Activation Function

A sigmoid function has a characteristic "S"-shaped curve defined, using e (Euler's Number), by the formula:

The derivative of the function is:

The curve produced has a fairly gradual rise:

By Qef (talk) - Created from scratch with gnuplot, Public Domain,

Common negative comments about the sigmoid activation function include:

  • Sigmoids can saturate and kill gradients. Gradients (change) at the tails are almost zero. 
  • Sigmoid outputs are all positive values. This can bias network results. The effect can be mitigated by not using sigmoids in the final layers of a network.

Softmax Activation Function

The Softmax function normalizes (in the case of softmax, "squashes") a K-dimensional vector z of arbitrary real values to a K-dimensional vector of real values in the range [0, 1] that add up to 1.


then the total of all values of z used as a power of e (Euler's Number) is:

and the individual normalized values j of K dimensioned vector z are:

Note that we're using the values of z as a power of e (Euler's Number) instead of using just z alone. This will change negative values to positive values and give disproportionately more weight to originally positive values.

The output of the softmax function can be used to represent a categorical distribution – that is, a probability distribution over K different possible outcomes, as illustrated below:

TANH Activation Function

The tanh (Hyperbolic Tangent) function is the hyperbolic analogue of the tan circular function used throughout trigonometry.

The equation for tanh is:

The derivative of tanh is:

Compared to the Sigmoid function, tanh produces a more rapid rise in result values.

Common negative comments about tanh activation functions include:

  • Tanh can saturate and kill gradients. Gradients (change) at the tails of are almost zero. 


Calculus is used in machine learning to represent change. Training machine learning models and performing inference against those models involves a lot of change measurement, and calculus is used extensively for these operations.


Backpropagation is a calculus based function used to incrementally modify the weights and biases of neural networks in order to minimize the error of predictions vs. training values. To understand the backpropagation equations, let's first look at the backpropagation flow of data for a very simple 3 node neural network:


Differential Calculus

Differential Calculus is concerned with the study of the rates at which quantities change.

An example is using a derivative to measure change behavior near a point in a dimensional space. If we have a function:

then the derivative function is expressed as the change in f(x), also thought of as the y axis, with respect to every point x:

This is also indicating the slope or gradient of the curve at point x:

By Minestrone Soup - commons, CC BY-SA 3.0,

Differential Calculus Rules Summary

The rules below provide a way to quickly determine the derivative of common functions. More complex functions can be broken down into progressively smaller components until these rules can be applied.


rules are:

Differential Calculus Chain Rule

The differential calculus chain rule is a formula for computing the derivative of the composition of two or more functions. That is, if f and g are functions, then the chain rule expresses the derivative of their composition. So if we let:

then the derivative of this function can be expressed as:

An example is the derivative of:

Splitting it into two functions, it's made up of an outer function f(g) and an inner function g(x):

Taking the derivative of each of these separately:

Putting it all together:

Differential Calculus Sum and Difference Rule

In calculus, the sum rule is used to differentiate functions of the form:

The derivatives of these functions are:

An example is:

Differential Calculus Power Rule

In calculus, the power rule is used to differentiate functions, whenever r is a real number, of the form:

The differential of the function is:

An example is:

Differential Calculus Product Rule

In calculus, the product rule is a formula used to find the derivatives of products of two or more functions. It may be stated as:

An example is:


Gradients are the multi-dimensional equivalent to derivatives in differential calculus. The example below depicts the gradients (arrows in red) below the depiction (curved grid in blue) of a function of two variables:

By IkamusumeFan - Own work, CC BY-SA 4.0,

Gradient Boosting

Gradient boosting uses an ensemble of prediction models to improve prediction results, as illustrated below:

Histogram of Oriented Gradients

The Histogram of Oriented Gradients technique counts occurrences of gradient orientation in localized portions of an image.

Integral Calculus

Integral Calculus is used to sum values within a dimensional space, such as determining the area under a curve. An integral function is defined as:

This is illustrated by the area S below:

By 4C - Own work, based on JPG version, CC BY-SA 3.0,

Stochastic Gradient Descent

Stochastic gradient descent uses iterative calculations to find a minima or maxima in a multi-dimensional space. The words Stochastic Gradient Descent (SGD) in the context of machine learning mean:

  • Stochastic: random processes are used
  • Gradient: a derivative based change in a function output value
  • Descent: moving from greater to lesser values of a function output value

Derivatives are key to making stochastic gradient descent work, so let's take a look at them first before tackling SGD. Derivatives are shown graphically in the illustration below, where:

The curved blue line shows the function:

The derivative of this function is:

The tangent line of a curve shows its slope, which is also the derivative. The two straight lines show the tangent of f(x) at two points, 

red line - shows the tangent for x=2:

green line - shows the tangent for x=1:

Gradients provide a couple of key pieces of information that are essential for performing SGD:

  • The direction of descent - in the graph above, moving left descends the curve
  • The size of descent steps - when the derivative (slope) is larger, steps can be larger

Now let's take a look at a formula for using SGD to find a minimum for our function above. The formula iteratively uses a step function to subtract values from x to arrive at the estimated minimum:


In the graph illustration above:

Using SGD, we're iteratively reducing the value of x, using the formula above, to find the minimal value of y. The multiplier used in the x reduction equation is called the learning rate:

There are a number of approaches to calculating the learning rate. These include:

  • Momentum -remembers the update at each iteration and determines the next update as a linear combination of the gradient and the previous update
  • Averaging - records an average of its parameter vector over time
  • AdaGrad - increases the initial learning rate for more sparse parameters and decreases the learning rate for less sparse ones
  • RMSProp - (for Root Mean Square Propagation) is a method in which the learning rate is adapted for each of the parameters - the idea is to divide the learning rate for a weight by a running average of the magnitudes of recent gradients for that weight
  • Adam - (short for Adaptive Moment Estimation) is an update to the RMSProp optimizer - in this optimization algorithm, running averages of both the gradients and the second moments of the gradients are used

So as we reduce the value of x the value of y will also decrease. Because we want to find the minimum value of y we want to stop iterating the reduction of x when y=0. In our example, we're able to look at the graph and see that the minimum is at y=0. But in actual SGD applications, the value of y is unknown. We won't be able to determine the exact minimum, so we'll want to stop the iteration of x when the reductions in x get very small.

Below are the calculations from running a number of iterations for the calculation of y based on the values shown:

Screen Shot 2017-09-11 at 4.53.15 PM.png

The (x,y) pair values are plotted in the graph below. Notice that as the value of y approaches 0 the descent slows.:

SGD will typically be used to find the minimum for a function of multiple variables, not just one as we used in our example above. Also, the vertical axis dependent variable is typically referred to as a cost, or loss, function. 

The illustration below shows a cost/loss function J with two independent variables:

Moving down the gradient descent curve might look something like this:

Let's say we have a cost/loss function J that is similar to the single variable function we explored but with two variables:

We would want to find the minimum of this function, which can be expressed as:

In our simple example above, we used the derivative of our function as part of the calculation to modify the variable x. Because we're now dealing with two variables, we'll use partial derivatives with respect to each variable.


Geometry (from the Ancient Greekgeo- "earth", -metron "measurement") deals with calculations of shape, size, relative position of figures, and the properties of space.

Euclidean Geometry

Euclidean geometry deals with measurement and configuration of one, two, and three dimensional spaces. An example, shown below, is the simplified representation of real world objects for use in navigating self-driving vehicles:


A manifold is a topological space that locally resembles Euclidean space near each point. More precisely, each point of an n-dimensional manifold has a neighbourhood that is homeomorphic to the Euclidean space of dimension n. Below is an example of a Boy's surface manifold:

CC BY-SA 3.0,

Graph Theory

Graph theory is the study of graphs, which are mathematical structures used to model pairwise relations between objects. Graphs are important in machine learning because they can be used to define complex data flows.


A graph is a structure of related objects. Below is an example of a graph with 6 nodes (also called vertices) 7 lines (also called edges).

The illustration below is of a graph characterized by the vertices V and edges E:

By User:AzaToth - Image:6n-graf.png simlar input data, Public Domain,

Graph Node Functions

Graph nodes perform a variety of machine learning functions, inspired by neuron functions in the brain.


Activation functions generate an output value based on an input value and function equation. Below is a graph segment depicting an activation function: 


An example of an activation function is the sigmoid function:


Convolution performs a data operation using input matrix and bias data to produce an output matrix. For more complex graphs, this process can involve a large number of computations. A good discussion and demonstration of convolution can be found here.

Below is an example of a graph segment depicting a simple convolution node:


An example of calculating one output cell is:


Dropout is a regularization technique for reducing overfitting in neural networks by preventing complex co-adaptations on training data. It is a very efficient way of performing model averaging with neural networks. The term "dropout" refers to dropping out units (both hidden and visible) in a neural network. Below is an example of a graph segment depicting a dropout node:



Input functions collect and organize data that will be passed on to other graph nodes. Below is an example of a graph segment depicting an input node:



Matrix operations such as addition, multiplication and rotation use input matrices to create a new output matrix. Below is a graph segment depicting a matrix operation node:



Memory functions retain information for use by other graph nodes.


Output functions collect and organize data that will be passed out of the graph. Below is an example of a graph segment depicting an output node:



Pooling functions combine and simplify data. Multiple data elements from an input matrix are combined into a single element in an output matrix. A good discussion of pooling can be found here.

Below is a representation of a pooling node:


Types of pooling functions include:

  • max - the maximum element of an input group is used for the output pooled element
  • average - the average of the input group elements is used for the output pooled element

Parameters used for the pooling operation include:

  • filter - sets the dimensions of the filter grouping used for pooling (e.g. 2x2)
  • stride - sets how many cells the pooling filter moves for each pooling operation (e.g. 1); stride is normally set so that the output volume is an integer and not a fraction
  • padding - adds zero value cells around an input matrix in order to maintain the same spatial dimensions for the input and output volume

Graph Data Flows

Graphs such as Neural Networks employ a number of different types of data flow as detailed below.



Backpropagation is used to modify the weights and biases applied to data flowing between neural network nodes. The backpropagation function B( ) repeatedly modifies weights and biases to reach a minimum value for the error value.

Below is an example flow:



A convolutional feed-forward data flow is commonly used for image processing neural networks. Below is an example flow:



Data operations include data transfer, concatenation, and copy. Below is a graph segment showing these flows:



A deep learning graph is commonly used to address complex neural network tasks. Below is an example flow:



In a feed forward graph, connections between nodes do not form a cycle. Data flows forward from node to node and nodes perform graph node functions. These graphs form the basis for feed-forward neural networks. Below is a graph feed-forward data flow segment example:



In a loop data flow, the function node is iterated on to repeatedly process new inputs and outputs. These data flows are used in recurrent and long short-term memory networks. Below is a graph loop data flow segment example:



In a recurrent graph, data is passed from one node back to another. These data flows form the basis for recurrent neural networks. Below is a graph recurrent data flow segment example:


The Brain as a Graph

The brain can be seen as a graph, where neurons act as nodes and axons act as edge connectors. This has given inspiration to the development of Artificial Neural Networks.

By user:Looie496 created file, US National Institutes of Health, National Institute on Aging created original -, Public Domain,

Nearest Neighbor Algorithm

The nearest neighbor algorithm is a computational method for finding an efficient path through a graph, using these steps

  1. start on an arbitrary vertex as current vertex.
  2. find out the shortest edge connecting current vertex and an unvisited vertex V.
  3. set current vertex to V.
  4. mark V as visited.
  5. if all the vertices in domain are visited, then terminate.
  6. Go to step 2.

Information Theory

Information theory studies the quantificationstorage, and communication of information.


Information entropy is defined as the average amount of information produced by a probabilistic stochastic source of data. Generally, information entropy is the average amount of information conveyed by an event, when considering all possible outcomes. Entropy is zero when one outcome is certain to occur.

Cross Entropy (loss)

The cross entropy between two probability distributions over the same underlying set of events measures the average number of bits needed to identify an event drawn from the set.

Cross entropy can be used to calculate loss. The equation for cross entropy is:


An example is:

Linear Algebra

Linear algebra deals with calculations of spaces defined by lines, planes and vectors.

Eigenvalues and Eigenvectors

Eigenvalues together with eigenvectors define aspects of linear transforms on spaces. An eigenvector defines a direction that is not changed. An eigenvalue defines a length change related to the eigenvector.

Below, in a shear mapping linear transform, the red arrow changes direction but the blue arrow does not. The blue arrow is an eigenvector of this shear mapping because it doesn't change direction, and since its length is unchanged, its eigenvalue is 1.

By TreyGreer62 - Image:Mona Lisa-restored.jpg, CC0,

Euclidean Vectors

A Euclidean vector is a geometric object that has magnitude (or length) and direction, such as in the example below: 

By User:Acdx - Self-made, based on en:Image:Plane Cartesian vector.png, CC BY-SA 3.0,


A matrix (plural matrices) is a rectangular array of numberssymbols, or expressions, arranged in rows and columns. Matrices allow the efficient and compact storage of information that can then be used in many types of machine learning functions. Important matrix operations include:

By Lakeworks - Own work, GFDL,


A scalar is an element of a field which is used to define a vector space. In the diagram below, X and Y are scalars. The vector V is defined, in part, using the X and Y scalars.

By Jakob.scholbach - Own work, CC BY-SA 3.0,

Vector Norms

A vector norm is a function that assigns a strictly positive length or size to each vector in a vector space—save for the zero vector, which is assigned a length of zero.

Vector Spaces

A vector space (also called a linear space) is a collection of objects called vectors, which may be added together and multiplied ("scaled") by numbers, called scalars. In the example below, the blue shaded vector space is defined by the vectors r1, r2, and r3:

By Claudio Rocchini - Own work, CC BY 3.0,

Mathematical Constants

A mathematical constant is:

  • a special number
  • usually a real number
  • significantly interesting in some way
  • sometimes a relationship between measurements
  • frequently used in mathematical formulas and calculations

euler's number

Euler's number, or number e, is the base of the natural logarithm: the unique number whose natural logarithm is equal to one. It is approximately equal to:


By Richard F. Lyon - made myself, alt version of Logarithm plots.svg with better text, CC BY-SA 3.0,

The Derivative of e to the Power of x

The function:

gives a curve at which the slope at any value x is also the value of y, which means the derivative is equal to the function itself, or:

Characteristics of e

e also has these important characteristics:

  • e is the base rate of growth shared by all continually growing processes
  • e shows up whenever systems grow exponentially and continuously, such as population, interest calculations and radioactive decay
  • every rate of growth can be considered a scaled version of e (unit growth, perfectly compounded)
  • e allows taking a simple growth rate where all change happens as the end of a finite period and find the impact of compound, continuous growth at infinitely smaller intervals

Compound Growth using e

Let's take a closer look at this last point, the impact of smaller intervals of compound growth calculation. Let:

The equation for calculating F is:

If we perform the calculation only once per period, then, n=1 and:

On the other extreme, if we perform compounding calculations continually letting n --> infinity, then:

So now we see how e fits into this equation. But how did it get there? To explore this, let's start with a simplified version of equation (1) above and let:

With these values, equation (1) becomes:

Now let's solve this equation for values of n. After n=10, we'll use a logarithmic scale and jump by a factor of 10 for each new calculation so we can see what happens with very large values of n:

In graph form, it looks like the chart below. Notice the jump between 10 and 100 as we move from a linear graph to a logarithmic graph for values of F:

What we see is that as the value of n approaches infinity, the value of F converges on the value of e: 2.7182820... This means that:

Another equation for calculating the value of e is:

An Example of Compound Growth using e

Let's look at an example of equation (1) when we let n --> infinity and have other values that are not equal to 1 for I, r, t:

equation (1) looks like this:

Combining the Rate of Growth and Time Periods using e

Using n --> infinity allows us to combine rate of growth and time periods calculated into one number. In the example above:

Deriving the Use of e in Compound Growth Equations

We haven't yet explained how we derived the following substitution in equation (1):

In other words, how did we simplify to using only e and rt as shown in equation (7)? Explaining it is a somewhat lengthy process, so we'll go through it in detailed steps...

Changing r/n to a Denominator

First, let's change r/n to a denominator using the algebraic rule of complex fraction inversion:

Applying the Power Rule to the Exponents

Next we're going to use the algebra power rule which says:

To do this, we'll first let:

and substitute it into our equation (8) so that::

Now we're going to introduce r into the exponent nt by adding it in as both a numerator and denominator that cancel each other out and rearranging the single fraction into two fractions, so that:

Now we'll apply the power rule from equation (9) with our new exponent values of x from equation (11) to get :

Next we'll replace x in equation (13) with the value from above:

Showing How e Enters our Equation

We still have to see how e enters this equation. To do this, within equation (14) let:

Within this equation, now look at:

Remember, we already know from equation (5) above that:

So equation (1) is basically the same as y except we're using n/r instead of just n. Now consider this:

Think about it, n/r might approach infinity at a somewhat slower pace than n, but it's still approaching infinity. So now we can say that for y from equation (15):

Using Backward Substitution to Show a Final Result

Now we can do backward substitutions through our sequence of equations to show a final result.

Combining equations (16) and (15) we have:

Combining equations (17) and (14) we have:

Combining equations (18) and (13) we have:

Combining equations (19) and (11) we have:

Combining equations (20) and (1) we have:

Which, finally, shows how we got to equation (3):


Pi the ratio of a circle's circumference to its diameter. It is approximately equal to:


By Kjoonlee, based on previous work by w:User:Papeschr - Own work, Public Domain,

Mathematical Models

Machine Learning models use mathematical concepts, structures, and algorithms to produce analytic and predictive results. The table and diagrams below provides a framework for understanding the types and varieties of these models using a number of grouping dimensions. 

In the table: x indicates usage, X indicates the most common usage:

Base Models

Base Models are characterized by their fundamentally different approaches to Machine Learning and fall into two groups - graphs and relationships.


These models are based on data graphs.

Artificial Neural Networks

Neural Network.jpg

data flows through graph nodes

Decision Tree.jpg

tree branching movement through graph nodes

Probabilistic Graphical Network.jpg

probabilistic movement between graph nodes


These models are based on data relationships (correlation & dependence).

Cluster Analysis

Cluster Analysis.png

 object groupings

Gaussian Processes.png

continuous function relationships

Regression Analysis.png

variable relationships

Model Aspects

Capabilities & Extensions

Model Extensions are variations on base models the provide additional capabilities and approaches to the model fundamentals.

Application Examples

Model Applications indicate some of the major application areas of Base Models.

Curve (Model) Fitting

Curve fitting is the process of constructing a curve, or mathematical function, that has the best fit to a series of data points.

In the example below, the dotted black line represents fitting points generated with a sine function. The solid lines are various degree polynomials:

  • Red - first degree
  • Green - second degree
  • Orange - third degree
  • Blue - fourth degree

By Krishnavedala - Own work, CC0,


In overfitting, a statistical model describes random error or noise instead of the underlying relationship. Overfitting occurs when a model is excessively complex, such as having too many parameters relative to the number of observations. A model that has been overfit has poor predictive performance, as it overreacts to minor fluctuations in the training data.

In the example below, the green line represents an overfitted model and the black line represents a regularized model. While the green line best follows the training data, it is too dependent on it and it is likely to have a higher error rate on new unseen data, compared to the black line.

By Chabacano - Own work, GFDL,


Regularization is a process of introducing or removing information in order to solve an ill-posed problem or to prevent overfitting, as illustrated in the example below. Machine learning dropout functions are a form of regularization. Regularization is often accomplished by adding a tuning parameter to a model training process.

Support Vector Machine

A Support Vector Machine (SVM) is a set of algorithms constructed for classification and regression analysis. SVM models are normally trained using example datasets labeled as belonging to specific categories. These data points form clusters in a multi-dimensional space. The term support vector refers to vector lines passing through the data points at the edge of a cluster as shown below.

By Cyc - Own work, Public Domain,

By Cyc - Own work, Public Domain,


Underfitting occurs when a statistical model or machine learning algorithm cannot capture the underlying trend of the data. Underfitting would occur, for example, when fitting a linear model to non-linear data. Such a model would have poor predictive performance, as shown in the figure below:

Mathematical Operations

An operation is a calculation from zero or more input values (called operands) to an output value.

Mathematical Symbols

There are a large number of mathematical symbols, which are used for representing mathematical:

For a list of mathematical symbols, go here.

Basic Mathematical Operations

Basic operations include:

  • Elementary Arithmetic - e.g. addition, subtraction, division, multiplication
  • Exponentiation - one number (base) raised to the power (exponent) of another number
  • Logarithms - inverse of exponentiation, the logarithm of a number is the exponent to which another fixed number, the base, must be raised to produce that number - as an example, since 2^3=8, log2(8)=3
  • Square root - a square root of a number a is a number y such that y^2 = a; in other words, a number y whose square (the result of multiplying the number by itself, or yy) is a.
  • Summationthe addition of a sequence of numbers

Matrix Mathematics

A matrix (plural matrices) is a rectangular array of numberssymbols, or expressions, arranged in rows and columns. Matrix mathematics is important in machine learning because data is often sent between the nodes in a model graph and operated on in a matrix format. Matrices are a mathematically convenient way to apply mathematical operations to sets of data.

Matrix Notation

Matrices are commonly written in box brackets or parentheses:


Basic matrix operations include addition, scalar multiplication and transposition.

Matrix Multiplication

Matrix multiplication or the matrix product is a binary operation that produces a matrix from two matrices.

If A is an n × m matrix and B is an m × p matrix, so that A has the same number of columns as B has rows:

the matrix product AB (denoted without multiplication signs or dots) is defined to be the n × p matrix:

An example of multiplying AB and BA is shown below, notice that in the result, the number of rows is determined by the left matrix operand and the number of columns by the right matrix operand:

For examples of multiplications of matrices of various combinations of dimensions, go here.


Probability is the measure of the likelihood that an event will occur. Probability is quantified as a number between 0 and 1, where, loosely speaking, 0 indicates impossibility and 1 indicates certainty. The higher the probability of an event, the more certain that the event will occur.

Central Limit Theorem

The central limit theorem establishes that, in most situations, when independent random variables are added, their properly normalized sum tends toward a normal distribution (a bell curve) even if the original variables themselves are not normally distributed.

This provides a way to understand characteristics of a population of data points using only samples taken from that population.

To see this visually, look at the example below. Random 0s and 1s were generated, and then their means calculated for sample sizes ranging from 1 to 512. Note that as the sample size increases the tails become thinner and the distribution becomes more concentrated around the mean.

By Daniel Resende - [github](, CC BY-SA 4.0,

Independent Events

If two events, A and B, are independent then the joint probability is multiplicative:

Events are independent if the occurrence of one does not affect the probability of occurrence of the other.

A simple example of coin tosses demonstrates this. If a coin is tossed twice, there are four possible combinations of results for which side of the coin is facing up in the first and second toss:

  • head tail
  • head head
  • tail tail
  • tail head

If we let

  • A = first toss
  • B = second toss

then the probability of each of these four result combinations is:

Let's see why. For the first toss, there are only two possible results:

  • head
  • tail

If the coin toss is fair (no bias in the toss is introduced) there's a .50 probability of getting a head facing up. This is because of the finite number of possible results, in this case two - head or tail. No other outcome is possible. This is called a finite geometry, any geometric system that has only a finite number of points, in our case two. We are assuming that the coin cannot land on its edge, so it can only land in one of two ways, showing a head or tail facing up.

Let's say the coin is tossed and a head is the result:

  • head

This means that the first event is now in the past and cannot be changed. This is due to the arrow of time, a concept that says time only moves forward, not backward. So at this point, there is no possibility that the first toss produced a tail. There's a zero probability that the first toss is a tail. This eliminates two of the previously possible results:

  • tail tail
  • tail head 

We are now left with only two possible outcomes for our experiment:

  • head head
  • head tail

The second toss result is not effected by the first toss result. The toss results are independent. Again, there's a .50 probability of a head or tail in the second toss. Let's say we get this result:

  • tail

Now that this result has occurred, it also cannot be changed due to the arrow of time.

So our final result is:

  • head tail

Looking at the tosses and possible results from a point in time before the tosses were made, we can see that our four possible results each had a .25 probability of happening:

  • head tail: .25 probability
  • head head: .25 probability
  • tail tail: .25 probability
  • tail head: .25 probability

Why? Because the second toss probabilities were only .50 of .50. The first toss had already occurred and eliminated two of the potential result combinations, leaving only two of the four available total results.

You can see these results visually with the coin combinations pictured below:

Of the four possible combinations, only one is possible with two tosses of the coin. Since each combination is equally possible, each has a .25 probability of happening.

Mutually Exclusive Events

If two events are mutually exclusive then only one or the other of the two can occur and the probability of either occurring is additive:

In probability theory, events E1, E2, ..., En are said to be mutually exclusive if the occurrence of any one of them implies the non-occurrence of the remaining n − 1 events.

A good example of the probability of mutually exclusive events is calculating the probability of drawing a King or Queen from a deck of cards. There are 4 Kings in the deck, so the probability of drawing a King is 4/52, or 1/13. And since there are also 4 Queens in the deck, the probability of drawing a Queen is also 1/13. So, if:

then the probability of drawing a King or Queen is:

Probability Density Function

The Probability Density Function is used to specify the probability of the random variable falling within a particular range of values, as opposed to taking on any one value. This probability is given by the integral of this variable’s PDF over that range—that is, it is given by the area under the density function but above the horizontal axis and between the lowest and greatest values of the range. The probability density function is nonnegative everywhere, and its integral over the entire space is equal to one. In the graphic below, IQR is the interquartile range.

By Jhguch at en.wikipedia, CC BY-SA 2.5,


Statistics deals with the the collection, analysis, interpretation, presentation, and organization of data. Below are some of the more prominent aspects of statistics used in Machine Learning.

Accuracy and Precision

Accuracy and precision describe the proximity of measurement results to the true value and variation in measurement data, as illustrated below:

By Pekaje at English Wikipedia - Transferred from en.wikipedia to Commons., GFDL,

A/B Testing

A/B testing is a controlled experiment with two variants, A and B.

By Maxime Lorant -, CC BY-SA 4.0,


A statistic is biased if it is systematically different from the statistical (population) parameter being estimated, as illustrated below:


Classification is the problem of identifying to which of a set of categories (sub-populations) a new observation belongs, on the basis of a training set of data containing observations (or instances) whose category membership is known.


Correlation is any of a broad class of statistical relationships involving dependence. The example below shows the Pearson correlation coefficient of x and y for each set:

By DenisBoigelot, original uploader was Imagecreator - Own work, original uploader was Imagecreator, CC0,

Cluster Analysis

Cluster analysis is used to group sets of objects based on data characteristics and measurement methods. Below is an example of using k-means clustering to partition n observations into k clusters in which each observation belongs to the cluster with the nearest mean:

Coefficient of Determination

The coefficient of determination, denoted R2 or r2 and pronounced "R squared", is the proportion of the variance in the dependent variable that is predictable from the independent variable(s).

In the example below, since the regression line does not miss any of the points by very much, the R2 of the regression is relatively high:

By Stpasha - Own work, Public Domain,

Data Cleaning

Data cleaning is the process of detecting and correcting (or removing) corrupt or inaccurate records from a record set, table, or database and refers to identifying incomplete, incorrect, inaccurate or irrelevant parts of the data and then replacing, modifying, or deleting the dirty or coarse data. Data quality screens include:


Extrapolation is the process of estimating, beyond the original observation range, the value of a variable on the basis of its relationship with another variable, such as shown in the example below:

By Berland - self-made in Gnuplot. Question mark added in Inkscape., Public Domain,

False Positive and False Negative Errors

False positive and false negative errors can occur during statistical hypothesis testing. False positives and false negatives are also called type I and type II errors.

To understand these errors, first consider the hypotheses being tested. These can be stated as the null (H0) and alternative hypothesis (H1):

  • Null Hypothesis (H0): also thought of as the default state of nature ... an example being healthy persons in a hypothesis test of the side effects of a new vaccination.
  • Alternative Hypothesis (H1): is the negative of the null hypothesis, and indicates a state that deviates from the default ... an example being sick persons as a result of the side effects of a new vaccination.

During hypothesis testing, false testing results can occur:

  • False Positive (type I error): the null hypothesis is true, but is rejected ... an example is a healthy person being indicated as being made sick in the test of a new vaccination.
  • False Negative (type II error): the null hypothesis is false, but fails to be rejected ... an example is a sick person being indicated as being healthy in the test of a new vaccine.

Hypothesis Testing

A hypothesis is a proposed explanation for a phenomenon. Hypothesis testing tests is a hypothesis on the basis of observing a process that is modeled via a set of random variables.

An example of hypothesis testing a the courtroom trial: 


Interpolation is a method of constructing new data points within the range of a discrete set of known data points. Below is an example of linear interpolation:

By Berland - self-made in Gnuplot, Public Domain,

K-Means Clustering

k-means clustering aims to partition n observations into k clusters in which each observation belongs to the cluster with the nearest mean, serving as a prototype of the cluster. This results in a partitioning of the data space into Voronoi cells.

By I, Weston.pace, CC BY-SA 3.0,

The elbow method looks at the percentage of variance explained as a function of the number of clusters: One should choose a number of clusters so that adding another cluster doesn't give much better modeling of the data. More precisely, if one plots the percentage of variance explained by the clusters against the number of clusters, the first clusters will add much information (explain a lot of variance), but at some point the marginal gain will drop, giving an angle in the graph. The number of clusters is chosen at this point, hence the "elbow criterion".

CC BY-SA 3.0,

Law of Large Numbers

The law of large numbers is a theorem that describes the result of performing the same experiment a large number of times. According to the law, the average of the results obtained from a large number of trials should be close to the expected value, and will tend to become closer as more trials are performed.

Proofs of the law of large numbers are shown here. Intuitively, it's possible to understand what's happening when considering:

  • The experiment results are derived from a universe of potential results where an actual theoretical mean exists. In the example below, it's the mean of dice rolls, which is 3.5.
  • The first experiment result establishes an initial distance from the theoretical mean.
  • In order for the second experiment result has a finite probability of moving the observed averages farther away from the theoretical mean. That probability is a fractional number between 0 and 1. Let's call that probability p-2.
  • Subsequent experiment results also have probabilities, p-n.
  • Each experiment result is an independent event. That is, one does not depend on the other.
  • The joint probability of a group of independent events occurring is determined by their product.
  • So in our example, the joint probability would be: p-2 x p-3 x p-4 x ... p-n
  • An example with a small number of results is: .84 x .76 x .63 x .58 = .2332
  • As you can see, the joint probability gets small very quickly.
  • This is an over simplification, but it helps understand the mathematics at work.

In the sample experiment below, notice that by 1000 trials, the observed averages variation from the theoretical mean has been reduced to almost zero. And by only a very small number of trials, the observed averages have come very close to the theoretical mean.

By Pred - Own work, CC0,

Mean Squared Error

Mean squared error (MSE) is used to measure the average distance of observed data points from predicted data points. Squaring the numbers eliminates negative values. A formula for MSE is:  


Logistic regression

Logistic regression is a regression model where the dependent variable (DV) is categorical.

Loss (cost) Function

A loss function or cost function is a function that calculates the difference between true and estimated values. An optimization problem seeks to minimize a loss function. In machine learning, stochastic gradient descent and backpropagation are techniques used to reduce loss.

Precision and Recall

Precision and recall are based on an understanding and measure of relevance.

  • Precision: (also called positive predictive value) is the fraction of relevant instances among the retrieved instances.
  • Recall: (also known as sensitivity) is the fraction of relevant instances that have been retrieved over the total amount of relevant instances.

By Walber - Own work, CC BY-SA 4.0,

Regression Analysis

Regression analysis is a process for estimating relationships among variables. In a multiple regression model, there are multiple independent variables. The example below shows a liner regression (red line) on a data set (blue dots):

By Sewaqu - Own work, Public Domain,


Repeatability is the variation in measurements taken by a single person or instrument on the same item, under the same conditions, and in a short period of time.


Resampling is a process of systematically changing sample data sets used for testing a model or hypothesis. There are a number of approaches that can be used, such as:

  • Permutation: sample composition is randomly changes.
  • Bootstrap: systematic sample changes.
  • Jacknife: systematically recomputing the statistic estimate, leaving out one or more observations at a time from the sample set.
  • Cross-validation: subsets of the data are held out for use as validating sets.

Sample Size Determination

Sample size determination is the act of choosing the number of observations or replicates to include in a statistical sample.

A general formula for determining sample size is:


  • the size of the population from which the sample is taken is at least 10 times the size of the sample - as the population size increases above that level, the sample size would increase somewhat, but quickly level off at a constant number - this shows how large populations can be accurately sampled using relatively small sample sizes
  • s = sample size
  • z = Z-score - which relates to the desired confidence level of the result
    • 90% confidence = 1.645 Z-score
    • 95% confidence = 1.960 Z-score
    • 99% confidence = 2.576 Z-score
  • d = standard of deviation - how much variance expected in the responses
    • a safe starting value is .5
  • m = margin of error - how much error to allow in the result
    • a typical value would be +/- 5%, or .05

an example using:

  • z = 1.96
  • d = .5
  • m = .05

the calculation is:


Sampling is concerned with the selection of a subset of individuals from within a statistical population to estimate characteristics of the whole population. Important factors for proper sampling include:

By Dan Kernler - Own work, CC BY-SA 4.0,

Selection Bias

Selection bias is the bias introduced by the selection of individuals, groups or data for analysis in such a way that proper randomization is not achieved, thereby ensuring that the sample obtained is not representative of the population intended to be analyzed.

Standard Deviation

The standard deviation (also represented by the Greek letter sigma σ or the Latin letter s) is a measure that is used to quantify the amount of variation or dispersion of a set of data values. A low standard deviation indicates that the data points tend to be close to the mean  of the set, while a high standard deviation indicates that the data points are spread out over a wider range of values. The standard deviation of a random variablestatistical populationdata set, or probability distribution is the square root of its variance. A simplified way to think of this is that roughly two thirds of the data points are within one standard deviation of the mean and 95% within two standard deviations.

Statistical Classification

Statistical classification is used to group observed values into categories. Below is an example of using a Support Vector Machine to group data points into categories based on their distance from a dividing vector line:

By Cyc - Own work, Public Domain,

Statistical Inference

Statistical inference is the process of deducing properties of an underlying distribution by analysis of data.

In machine learning, you'll often see the process of using a model to determine the characteristics of input data as inference. For example, the input might be the image of a person and inference is used to perform face recognition to determine the identity of the person.

Statistical Power

The statistical power of a binary hypothesis test is the probability that the test correctly rejects the null hypothesis (H0) when the alternative hypothesis (H1) is true. It can be equivalently thought of as the probability of accepting the alternative hypothesis (H1) when it is true—that is, the ability of a test to detect an effect, if the effect actually exists.

Power analysis is an experimental technique for determining the effect of different sample sizes.

Statistical Reliability and Validity

Reliability is a measure of the overall consistency of a measurement. Validity is the extent to which a measurement corresponds accurately to the real world, as illustrated below:


The trigonometric functions are functions of an angle. The derivation of the word comes from the ancient Greek (trigon) for triangle. The functions relate the angles of a triangle to the lengths of its sides. The most familiar trigonometric functions are the sinecosine, and tangent. The trigonometric functions of angle theta are illustrated in the diagram below:

By Original: Steven G. Johnson at English WikipediaDerivative work: Limaner - Circle-trig6.png, CC BY-SA 3.0,