
Synced Tradition and Machine Learning Series | Part 3: Optimization Basics – Probabilities and Inference

Probability theory is a mathematical framework for quantifying our uncertainty about the world, and is a fundamental building block in the study of machine learning. The purpose of this article is to provide the vocabulary and mathematics needed before applying probability theory to machine learning tasks.



The Mathematics of Probability

In this section, we will briefly go over the basics of probability to provide readers with a quick recap of some vocabulary and important axioms needed to fully leverage the theory as a tool for machine learning.

Probability is all about the possibility of various outcomes. The set of all possible outcomes is called the sample space, often denoted as S. For example, the sample space for a coin flip is {heads, tails}.

In order to be consistent with the relative frequency interpretation, any definition of “probability of an event” must satisfy certain properties. Modern probability theory begins with the construction of a set of axioms specifying that probability assignments must satisfy the properties below:

  • Axiom I: 0 ≤ P(A) ≤ 1 for any event A
  • Axiom II: P(S) = 1
  • Axiom III: if A ∩ B = ∅ (A and B are disjoint), then P(A ∪ B) = P(A) + P(B)
Or, in plain English:

  • A set S of all possible outcomes is identified. Axiom I states that the probability of an event A is non-negative (between 0 and 1, to be exact)
  • The probabilities of all outcomes in S should sum to 1. Axiom II states that there is a fixed total amount of probability (mass)
  • Axiom III states that the total probability (mass) in two disjoint events is the sum of the individual probabilities (masses)

Note that probability theory does not concern itself with how the probabilities are obtained or with what they mean. Any assignment of probabilities to events that satisfies the above axioms is legitimate.

A random variable, X, is a variable that randomly takes on values from a sample space. For example, if x denotes an outcome of a coin-flip experiment, we might discuss a specific outcome as x = heads. Random variables can be either discrete, like the coin, or continuous (able to take on an uncountably infinite number of possible values).

To describe the likelihood of each possible value of a random variable X, we specify a probability distribution function (PDF). We write X ~ P(x) to indicate that X is a random variable drawn from a probability distribution P(x). PDFs are described differently depending on whether the random variable is discrete or continuous.

Discrete Distributions
Discrete random variables are described with a probability mass function (PMF). A PMF assigns a probability to each value in the variable’s sample space. For example, the PMF of the uniform distribution over n possible outcomes: P(X=x) = 1/n.
That is, “The probability of X taking on the value x is 1 divided by the number of possible values”. This is called the uniform distribution because each outcome is equally likely (the likelihood is spread uniformly over all possible values). Fair dice rolls are modelled by a uniform distribution since each face of the die is equally likely. A biased die is modelled by a categorical distribution, where each outcome is assigned a different probability.
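As a quick sanity check, we can simulate a biased die and confirm that the empirical frequencies approach the categorical PMF. The face probabilities below are assumed values, chosen purely for illustration:

```python
import random
from collections import Counter

# Hypothetical biased die: a categorical distribution that assigns a
# different probability to each face (the weights must sum to 1).
faces = [1, 2, 3, 4, 5, 6]
probs = [0.1, 0.1, 0.1, 0.1, 0.1, 0.5]

random.seed(0)
samples = random.choices(faces, weights=probs, k=100_000)

# Empirical relative frequencies approach the PMF as the sample grows.
freqs = {f: c / len(samples) for f, c in Counter(samples).items()}
print(freqs[6])  # close to 0.5
```

With `probs = [1/6] * 6` the same code simulates the uniform distribution of a fair die.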

Continuous Distributions
Continuous random variables are described by PDFs. This can be a bit more difficult to understand. We generally indicate the PDF for a random variable x as f(x). PDFs map an infinite sample space to relative likelihood values. Unlike discrete distributions, the value of the PDF at X = x is not the actual probability of x. This is a common misconception when people first start dabbling with probability theory. Since there are infinitely many values that x could take on, the probability of x taking on any specific value is actually 0!

Joint Probability Distributions
A distribution over multiple random variables is called a joint probability distribution. Consider two random variables X and Y. The probability of the pair of random variables is written as P(X=x, Y=y) or just P(x, y). This is understood as “The probability that X has the outcome of x and Y has the outcome of y”. For example, let X be the outcome of a coin toss and Y be the outcome of a dice roll. P(heads, 6) is the probability that the coin flipped heads and the dice rolled a 6. If both random variables are discrete, we can represent the joint distribution as a simple table of probabilities.
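The coin-and-die joint distribution can be built directly as a table. The sketch below assumes the coin and die are fair and independent, so each entry is just the product of the two marginal probabilities:

```python
from fractions import Fraction

# Marginal PMFs of a fair coin X and a fair die Y.
coin = {"heads": Fraction(1, 2), "tails": Fraction(1, 2)}
die = {y: Fraction(1, 6) for y in range(1, 7)}

# Joint PMF as a table keyed by (x, y); independence means P(x, y) = P(x)P(y).
joint = {(x, y): px * py for x, px in coin.items() for y, py in die.items()}

print(joint[("heads", 6)])  # 1/12
print(sum(joint.values()))  # 1 — a valid PMF sums to one
```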

Marginal Probability Distributions
The joint PMF/PDF provides information about the joint behaviour of X and Y. Sometimes, we are also interested in the probabilities of events involving each of the random variables in isolation. Consider two discrete random variables X and Y. We can determine the marginal probability mass functions as follows:

P_X(x) = P(X = x) = Σ_y P(x, y)

and similarly, for Y,

P_Y(y) = P(Y = y) = Σ_x P(x, y)
The marginal PMFs satisfy all the properties of single-variable PMFs, and they supply the information required to compute the probability of events involving the corresponding random variable.
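A minimal sketch of marginalization: given a joint PMF stored as a table (the numbers below are assumed, chosen only for illustration), each marginal is obtained by summing over the other variable:

```python
# Illustrative joint PMF for X in {0, 1} and Y in {0, 1}.
joint = {
    (0, 0): 0.1, (0, 1): 0.3,
    (1, 0): 0.2, (1, 1): 0.4,
}

p_x = {}
p_y = {}
for (x, y), p in joint.items():
    p_x[x] = p_x.get(x, 0.0) + p  # P(X=x) = sum over y of P(x, y)
    p_y[y] = p_y.get(y, 0.0) + p  # P(Y=y) = sum over x of P(x, y)

print(p_x)  # P(X=0) ≈ 0.4, P(X=1) ≈ 0.6
print(p_y)  # P(Y=0) ≈ 0.3, P(Y=1) ≈ 0.7
```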

Note, it is generally not possible to deduce the relative frequencies of pairs of values X and Y from the relative frequencies of X and Y in isolation. The same is true for PMFs: In general, knowledge of the marginal PMFs is insufficient to specify the joint PMF.

Conditional Probability Distributions
Quite often we are interested in determining whether two events, A and B, are related in the sense that knowledge about the occurrence of one, say B, alters the likelihood of the occurrence of the other, A. This requires that we compute the conditional probability P(A|B). That is, “the probability of event A given event B has happened.” Conditional probability is defined as follows:

P(A|B) = P(A ∩ B) / P(B),  provided P(B) > 0

Or, mathematically, for random variables it can be written as

P(x|y) = P(x, y) / P(y)
Knowledge that event B has occurred implies that the outcome of the experiment is in the set B. Therefore, we can view the experiment as now having the reduced sample space B when we are dealing with P(A|B).

In addition, by multiplying both sides of the last equation by P(y) we get the chain rule of probability, P(x, y) = P(x|y) ⋅ P(y).
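Using a small joint table (assumed illustrative values), we can compute a conditional probability and verify the chain rule numerically:

```python
# Illustrative joint PMF for X in {0, 1} and Y in {0, 1}.
joint = {
    (0, 0): 0.1, (0, 1): 0.3,
    (1, 0): 0.2, (1, 1): 0.4,
}

# Marginal P(Y=1) by summing out X.
p_y1 = joint[(0, 1)] + joint[(1, 1)]

# Conditional P(X=0 | Y=1) = P(X=0, Y=1) / P(Y=1)
p_x0_given_y1 = joint[(0, 1)] / p_y1
print(p_x0_given_y1)  # 0.3 / 0.7

# Chain rule: P(x, y) = P(x|y) * P(y)
print(p_x0_given_y1 * p_y1)  # recovers the joint entry 0.3
```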

Bayes’ Rule
From the discussion of conditional probability, we have the chain rule for two variables in two equivalent ways:

  • P(x, y) = P(x|y) ⋅ P(y)
  • P(x, y) = P(y|x) ⋅ P(x)

If we set both right-hand sides equal to each other and divide by P(y), we get Bayes’ rule:

P(x|y) = P(y|x) ⋅ P(x) / P(y)
Bayes’ rule is crucially important in statistics and machine learning. Bayes’ rule is often applied in the following situation. Suppose we have some random experiment in which the events of interest form a partition. The “a priori probabilities” of these events, P(x), are the probabilities of the events before the experiment is performed. Now suppose that the experiment is performed, and we are informed that y is the outcome. The “a posteriori probabilities” are the probabilities of the events in the partition, P(x|y), given this additional information. Bayes’ rule is the driving force behind Bayesian statistics. This simple rule allows us to update our beliefs about quantities as we gather more observations from data.
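A classic numeric sketch of this prior-to-posterior update, with assumed illustrative numbers: a condition with 1% prevalence and a test with 90% sensitivity and a 5% false-positive rate.

```python
# A priori probability P(x): prevalence of the condition.
p_disease = 0.01
# Likelihoods P(y|x): test behaviour for each underlying state.
p_pos_given_disease = 0.90
p_pos_given_healthy = 0.05

# Total probability of a positive test, P(y).
p_pos = (p_pos_given_disease * p_disease
         + p_pos_given_healthy * (1 - p_disease))

# A posteriori probability via Bayes' rule: P(x|y) = P(y|x) P(x) / P(y)
p_disease_given_pos = p_pos_given_disease * p_disease / p_pos
print(round(p_disease_given_pos, 3))  # ≈ 0.154
```

Even with a positive result, the posterior stays low because the prior is small — exactly the kind of belief update Bayes' rule formalizes.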

Useful Probability Distributions

In data modelling, there are a few collections of probability distributions that are commonly seen. We will discuss these distributions to provide readers with some insight when encountering them in real-life situations.

Distribution over integers

Binomial Distribution
The first distribution we will look at is the binomial distribution. A binomial distribution can be thought of as the probability of a “success” or “failure” outcome in an experiment or survey that is repeated multiple times. A common example is a biased coin flip: we flip a coin with probability f of being heads N times and observe the number of heads r. The probability of observing exactly r heads is

P(r | f, N) = (N choose r) ⋅ f^r (1 − f)^(N−r)

The binomial distribution occurs frequently in real life. For example, when a new drug is introduced to cure a disease, it either cures the disease (success) or it doesn’t cure the disease (failure). If you purchase a lottery ticket, you are either going to win money or not. Essentially, anything you can think of that can only be a success or a failure can be represented by a binomial distribution.
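The binomial PMF can be computed directly with the standard library; the parameters below (N = 10 flips, f = 0.6) are assumed for illustration:

```python
import math

def binomial_pmf(r, n, f):
    # P(r | f, N) = (N choose r) f^r (1 - f)^(N - r)
    return math.comb(n, r) * f**r * (1 - f)**(n - r)

# Probability of exactly 7 heads in 10 flips of a coin with f = 0.6.
print(round(binomial_pmf(7, 10, 0.6), 4))

# A valid PMF: probabilities over all possible r sum to 1.
total = sum(binomial_pmf(r, 10, 0.6) for r in range(11))
print(total)  # 1.0 (up to floating-point error)
```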

Poisson Distribution
The next integer distribution is the Poisson distribution with parameter λ > 0. The Poisson distribution is used to model the number of events occurring within a given time interval. We can interpret λ as the parameter which indicates the average number of events in the given time interval. It is written as

P(r | λ) = e^(−λ) λ^r / r!,  r ∈ {0, 1, 2, …}
We can interpret the above as the probability that exactly r events occur in a Poisson experiment when the mean number of events is λ. We provide an example of Poisson distribution usage below.

Consider births in a hospital that occur randomly at an average rate of 1.8 births per hour. What is the probability of observing 4 births in a given hour at the hospital? In this example, let r = 4, which represents the number of births in an hour. Given the mean λ = 1.8, we can use the formula to calculate the probability of observing exactly 4 births in a given hour as

P(r = 4 | λ = 1.8) = e^(−1.8) ⋅ 1.8^4 / 4! ≈ 0.0723
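The hospital example can be reproduced in a few lines using the Poisson PMF:

```python
import math

def poisson_pmf(r, lam):
    # P(r | λ) = e^(−λ) λ^r / r!
    return math.exp(-lam) * lam**r / math.factorial(r)

# Probability of exactly 4 births in an hour at an average rate of 1.8/hour.
print(round(poisson_pmf(4, 1.8), 4))  # ≈ 0.0723
```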
The next distribution we will explore is the exponential distribution (on integers). The exponential distribution often occurs in situations where we are interested in the amount of time (or the number of time steps, in the integer case) until some specific event occurs. The exponential distribution on integers is written as:

P(r | f) = f^r (1 − f) = e^(−λr) (1 − e^(−λ)),  r ∈ {0, 1, 2, …}

where λ = ln(1/f).

Distribution over unbounded real numbers

Gaussian Distribution, Student-t distribution
In practice, many datasets consist of real numbers. In contrast to the discrete probability distributions discussed, continuous data do not have discrete values. Instead, we must use a curve or function that describes the probability density over the range of the distribution.
The Gaussian (also known as Normal) distribution describes a special class of such distributions that are symmetric and can be described by two parameters, mean µ and standard deviation σ. It is written as:

P(x | µ, σ) = (1 / (√(2π) σ)) exp(−(x − µ)^2 / (2σ^2))
It is sometimes useful to work with the quantity τ ≡ 1/σ^2, which is called the precision parameter of the Gaussian. The Gaussian distribution is widely used and often asserted to be a very common distribution in the real world. However, caution is warranted when modelling data with Gaussian distributions. I will explain why.

The Gaussian distribution is also a unimodal distribution, that is, a distribution with one clear peak. In fact, the Gaussian is a rather extreme unimodal distribution with very light tails, i.e., its log-probability-density decreases quadratically. The typical deviation of x from µ is σ, but the respective probabilities that x deviates from µ by more than 2σ, 3σ, 4σ, and 5σ are 0.046, 0.003, 6×10^(-5), and 6×10^(-7). However, in my experience, deviations from the mean four or five times greater than the typical deviation may be rare in real data, but seldom as rare as 6×10^(-5): real data often has heavier tails than the Gaussian predicts. With that said, the Gaussian distribution is an incredibly common distribution used in analyzing machine learning data (we will discuss this later).
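These tail figures can be checked directly: for a Gaussian, P(|x − µ| > kσ) = erfc(k/√2), and the complementary error function is in Python's standard library:

```python
import math

def gaussian_tail(k):
    # P(|x - µ| > kσ) for a Gaussian, via the complementary error function.
    return math.erfc(k / math.sqrt(2))

for k in (2, 3, 4, 5):
    print(k, gaussian_tail(k))
# ≈ 0.046, 0.003, 6e-5, 6e-7 — matching the figures above
```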

It is very common to see data modelled as a mixture of Gaussians. A mixture of two Gaussians, for example, is defined by two means, two standard deviations, and two mixing coefficients π_1 and π_2, satisfying π_1 + π_2 = 1, π_i ≥ 0. It is written as:

P(x | µ_1, µ_2, σ_1, σ_2, π_1, π_2) = (π_1 / (√(2π) σ_1)) exp(−(x − µ_1)^2 / (2σ_1^2)) + (π_2 / (√(2π) σ_2)) exp(−(x − µ_2)^2 / (2σ_2^2))
If we take an appropriately weighted mixture of an infinite number of Gaussians, all having mean µ, we obtain a Student-t distribution:

P(x | µ, s, n) = (1/Z) ⋅ 1 / (1 + (x − µ)^2 / (n s^2))^((n+1)/2)

where

Z = √(π n s^2) ⋅ Γ(n/2) / Γ((n+1)/2)

and n is called the number of degrees of freedom and Γ is the gamma function. The Student’s t distribution is often used in hypothesis testing. This distribution is used to estimate the mean of a normally distributed population when the sample size is small, and to test the statistical significance of the difference between two sample means or to form confidence intervals for small sample sizes.

If n > 1 then the Student distribution has a mean and that mean is µ. If n > 2, the distribution also has a finite variance, σ^2 = ns^2/(n-2). As n → ∞, the Student distribution approaches the normal distribution with mean µ and standard deviation s. The Student distribution arises both in classical statistics and in Bayesian inference. From a Bayesian viewpoint it is helpful to think of the t-distribution as a continuous mixture of normal distributions with different variances. In the special case where n = 1, the Student distribution is called the Cauchy distribution.

Distribution over positive real numbers

Exponential Distribution
We will first discuss the exponential distribution. In the study of continuous-time stochastic processes, the exponential distribution is usually used to model the time until something happens in the process. It is written as:

P(x | s) = (1/s) e^(−x/s),  x ≥ 0
In more classical probability studies, the exponential is the single most important continuous distribution for building and understanding continuous-time Markov chains. From the distribution equation, we can see that as 1/s gets larger, the thing we’re waiting for to happen in the process tends to happen more quickly, hence we think of 1/s as a rate. In some nomenclature, it is more common to substitute λ for 1/s in the equation:

P(x | λ) = λ e^(−λx),  x ≥ 0
An important characteristic of the exponential distribution is the “memoryless” property, which means that the future lifetime of a given object has the same distribution, regardless of the time it existed. In other words, time has no effect on future outcomes.

This can be illustrated with the following example. Consider that at time 0 we start an alarm clock that will ring after a time X that is exponentially distributed with rate λ. Let us call X the lifetime of the clock. For any t > 0, we have that:

P(X > t) = e^(−λt)

Now we go away and come back at time s to discover that the alarm has not yet gone off. That is, we have observed the event {X > s}. If we let Y denote the remaining lifetime of the clock given that {X > s}, then

P(Y > t | X > s) = P(X > s + t) / P(X > s) = e^(−λ(s+t)) / e^(−λs) = e^(−λt) = P(X > t)
This implies that the remaining lifetime after we observe that the alarm has not yet gone off at time s has the same distribution as the original lifetime X. The important thing to note is that this implies that the distribution of the remaining lifetime does not depend on s. This is the memoryless property, because we do not need to remember when the clock was started. In other words, given that the alarm has currently not yet gone off, I can forget the past and still know the distribution of the time from my current time to the time the alarm will go off. The memoryless property is crucial for the study of continuous time Markov chains.
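The memoryless property can also be checked by simulation: among samples that have survived past time 2, the remaining lifetime has the same distribution as a fresh lifetime. The rate below is an assumed value for illustration:

```python
import random
import math

random.seed(1)
rate = 0.5
n = 200_000
samples = [random.expovariate(rate) for _ in range(n)]

# Unconditional probability of surviving past time 1.
p_survive_1 = sum(x > 1 for x in samples) / n

# Conditional probability of surviving one MORE unit, given survival to time 2.
alive_at_2 = [x for x in samples if x > 2]
p_one_more = sum(x > 3 for x in alive_at_2) / len(alive_at_2)

# Memorylessness: both should be close to e^(−rate·1).
print(p_survive_1, p_one_more, math.exp(-rate))
```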

The exponential distribution is widely used to describe events recurring at random points in time, such as the time between failures of electronic equipment or the time between arrivals at a service booth. It is related to the Poisson distribution, which describes the number of occurrences of an event in a given interval of time.

Gamma Distribution
The gamma distribution is like a Gaussian distribution, except while the Gaussian goes from −∞ to ∞, gamma distributions go from 0 to ∞. Just as the Gaussian distribution has two parameters, µ and σ, which control the mean and width of the distribution, the gamma distribution has two parameters. The gamma distribution is the product of the one-parameter exponential distribution with a polynomial, x^(c−1). The exponent c in the polynomial is the second parameter. The probability density function is written as

P(x | s, c) = (1 / (Γ(c) s)) ⋅ (x/s)^(c−1) e^(−x/s),  x ≥ 0
Fun fact: the gamma distribution is named after its normalizing constant.

One useful intuition for understanding the exponential distribution is that it predicts the wait time until the very first event. The gamma distribution, on the other hand, predicts the wait time until the k-th event occurs. The gamma distribution shows up in applications such as wait time modelling, reliability (failure) modelling, and service time modelling (Queuing Theory).
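The "wait until the k-th event" intuition can be checked by simulation: summing k independent exponential waits produces a gamma-distributed wait with shape k. The parameters below are assumed for illustration:

```python
import random
import statistics

random.seed(0)
k, rate = 3, 2.0   # wait for the 3rd event; events occur at rate 2 per unit time
n = 100_000

# Sum of k independent exponential waits = gamma with shape k, scale 1/rate.
waits = [sum(random.expovariate(rate) for _ in range(k)) for _ in range(n)]

print(statistics.mean(waits))      # ≈ k/rate = 1.5
print(statistics.variance(waits))  # ≈ k/rate² = 0.75
```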

Applications of General Probability in Machine Learning

The beginning of this article served as a brief introduction to important terms in probability to allow us to propose machine learning questions in a probabilistic setting. This section will discuss some of the applications that can be enabled based on what we have discussed.

Supervised Learning

In supervised machine learning, the goal is to learn from labelled data. Data being labelled means that for some inputs X, we know the desired outputs Y. Some possible tasks include:

  • identify/classify an image
  • detect if an email/file is spam/malicious
  • predict the price of a stock given some features about the company

How are these applications accomplished? We can learn the parameters of a mapping from X to Y in various ways. For example, you could learn the conditional probability P(Y|X). That is, a probability distribution over possible values of Y given that you’ve observed a new sample X. Alternatively, we could instead try to learn P(X|Y), the probability distribution over inputs given labels.

Unsupervised Learning

Unsupervised learning is a broad set of techniques for learning from unlabelled data, where we just have some samples X but no output Y. Characterizing the distribution of unlabelled data is useful for many tasks. One example is anomaly detection. If we learn P(X), where X represents normal bank transactions, then we can use P(X) to measure the likelihood of future transactions. If we observe a transaction with low probability, we can flag it as suspicious and possibly fraudulent.
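A minimal sketch of this idea, assuming transaction amounts are roughly Gaussian; the data and threshold below are synthetic and chosen purely for illustration:

```python
import math
import random
import statistics

random.seed(42)
# Synthetic "normal" transaction amounts, assumed Gaussian for illustration.
normal = [random.gauss(50.0, 10.0) for _ in range(10_000)]
mu = statistics.mean(normal)
sigma = statistics.stdev(normal)

def log_density(x):
    # Log of the Gaussian PDF under the fitted parameters — our learned P(X).
    return -0.5 * math.log(2 * math.pi * sigma**2) - (x - mu)**2 / (2 * sigma**2)

threshold = -8.0  # assumed cutoff below which a transaction is flagged

print(log_density(55) > threshold)   # typical amount: not flagged
print(log_density(500) > threshold)  # extreme amount: flagged as suspicious
```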

Common unsupervised tasks include:

Clustering
Clustering is one of the canonical problems of unsupervised learning. Given some data points originating from separate groups, how can we determine to which group each point belongs? One method is to assume that each group is generated from a different probability distribution. Solving the problem then becomes a matter of finding the most likely configuration of these distributions.

Dimension reduction, Embedding
Tasks in this category involve taking high dimensional data and projecting it onto a meaningful lower dimensional space. High dimensional data takes up memory, slows down computations, and is hard to visualize and interpret. We’d like to have ways of reducing the data to a lower dimension without losing too much information. One can think of this problem as finding a distribution in a lower dimensional space with similar characteristics to the distribution of the original data.

Reinforcement Learning

Reinforcement learning is concerned with training artificial agents to perform specific tasks. The agents learn by taking actions in their environment and observing reward signals based on their decision/behaviour. The goal of the agent is to maximize its expected long-term reward. Probability is used in several aspects of the reinforcement learning process. The agent’s learning process often revolves around quantifying the “reward” for taking one specific action over another.

Applications of Common Distributions in Machine Learning

In this section, we will discuss some of the common use-cases in which probability distributions show up in machine learning.

Binomial Distribution
The most common use-case in machine learning or AI is a binary classification problem. That is, you want to train and validate an algorithm that predicts whether a particular observation belongs to a given class (0 or 1 scenarios). The most basic algorithm used to do this is a logistic regression model.

For example, you might want to predict, given an image of an animal, whether it is a dog. Here, the algorithm outputs the probability that the animal in the image is a dog, given the input characteristics (the pixel values of the image).

Another example would be if you want to predict the risk for hospital readmission within 30-days following patient discharge. The ‘risk’ here is nothing but the probability of getting re-admitted within 30 days given the patient characteristics (demographics, medical history, biochemical profile, etc.).

In all such cases, the dependent variable (aka target variable, response variable) is a binary variable that can take either 0 or 1. Under this setting, the objective of a logistic regression model is to estimate the probability ‘p’ of success given the input characteristics.
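A minimal logistic regression fit by gradient descent on a tiny synthetic 1-D dataset; this is a sketch, not a production implementation, and all data and hyperparameters below are assumed:

```python
import math

# Toy 1-D dataset: small x values belong to class 0, large ones to class 1.
xs = [0.5, 1.0, 1.5, 3.0, 3.5, 4.0]
ys = [0, 0, 0, 1, 1, 1]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

w, b, lr = 0.0, 0.0, 0.5
for _ in range(2000):
    gw = gb = 0.0
    for x, y in zip(xs, ys):
        p = sigmoid(w * x + b)  # model's estimate of P(y=1 | x)
        gw += (p - y) * x       # gradient of the log loss w.r.t. w
        gb += (p - y)           # gradient of the log loss w.r.t. b
    w -= lr * gw / len(xs)
    b -= lr * gb / len(xs)

print(sigmoid(w * 0.5 + b))  # low probability for a class-0 point
print(sigmoid(w * 4.0 + b))  # high probability for a class-1 point
```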

Exponential Distribution
The exponential distribution we discussed earlier is a building block in the construction of Markov chains. Briefly, a Markov chain is a mathematical system that transitions from one state to another according to certain probabilistic rules. The defining characteristic of a Markov chain is that no matter how the process arrived at its present state, the distribution over possible future states is fixed. In other words, the probability of transitioning to any particular state depends solely on the current state and the time elapsed, not on the sequence of states that preceded it (recall the memoryless property). We will not go into the mathematical details of Markov chains, but only discuss their applications.

In recent years, Markov chains have become a popular graph-based model in data mining and machine learning. As a graph model, Markov chains have been incorporated into various semi-supervised learning algorithms and used for classification: the posterior probability of unlabelled data belonging to different classes can be estimated by computing the probability that a Markov random walk starting from the unlabelled data reaches labelled data (i.e., a different state of the chain).

Gaussian Distribution
Gaussian distributions are the most “natural” distributions, and they show up everywhere in our daily lives. In addition, they have many properties that make them easy to analyze mathematically. Thus, the Gaussian distribution is more often used in the analysis of ML algorithms.

For example, we often assume that data or signal error follows a Gaussian distribution. This is because machine learning (and statistics as well) treats data as a mix of deterministic and random parts. The random part of the data usually has, or is assumed to have, a Gaussian distribution. Why can we do this? Because of the Central Limit Theorem (CLT), which states that the sum of a large number of independent variables, each having a small influence on the result, approximately follows a normal distribution.

In machine learning, we often want to express a dependent variable as some function of a number of independent variables (for mathematical simplicity). If this function is a sum (or can be expressed as a sum of other functions) and the number of independent variables is high, then the dependent variable should be approximately normally distributed (due to the CLT).
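The CLT is easy to see in a quick simulation: sums of 100 independent uniform(0, 1) variables have mean 50 and standard deviation √(100/12) ≈ 2.89, and their histogram is close to Gaussian even though each summand is far from it:

```python
import random
import statistics

random.seed(0)
# Each sum adds 100 independent uniform(0, 1) draws.
n_sums = 20_000
sums = [sum(random.random() for _ in range(100)) for _ in range(n_sums)]

print(statistics.mean(sums))   # ≈ 100 * 0.5 = 50
print(statistics.stdev(sums))  # ≈ sqrt(100 / 12) ≈ 2.89
```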

As for treating the error signal as Gaussian: in a typical machine learning problem we might encounter errors from many different sources (e.g. measurement errors, data entry errors, classification errors, data corruption, etc.), and it’s not unreasonable to conclude that the combined effect of all of these errors is approximately normal due to the CLT (although, of course, we should always double-check!).

Gamma Distribution
Gamma distributions show up in many inference problems in which a positive quantity is inferred from data. Examples include inferring the variance of Gaussian noise from some noise samples and inferring the rate parameter of a Poisson distribution from the count.

Furthermore, it is important to note the significance of the gamma function. In addition to its use in the gamma distribution, the gamma function is also used to define several other distributions, such as the Beta distribution, Dirichlet distribution, Chi-squared distribution, and Student’s t-distribution.

For data scientists, machine learning engineers and researchers, the Gamma function is probably one of the most widely used functions because it is employed in many distributions. These distributions are then used for Bayesian inference and stochastic processes (such as queueing models).


The goal of this article was to introduce and utilize the language of probability so that we can view machine learning problems in a probabilistic setting. We first went over the basic terminology and notation of important concepts in probability. Next, we discussed a few very important distributions that come up in practice. Finally, we provided some discussion of the applications in which general probability and probability distributions may show up.


Reference: Information Theory, Inference and Learning Algorithms, David J. C. MacKay.

Author: Joshua Chou | Editor: H4O & Michael Sarazen

