This is the first in a special *Synced* series of introductory articles on traditionally theoretical fields of study and their impact on modern-day machine learning.

## 1. Introduction

A cornerstone of information theory is the idea of quantifying how much information there is in a message or, more generally, in an event. The basic measure used to quantify information in this regard is entropy, which is the topic of this article.

Classically, information-theoretic quantities such as entropy and relative entropy arise again and again in response to fundamental questions in communication and statistics. These quantities have also found their way into machine learning, where they have become increasingly important.

The concept of *information* is too broad to be captured completely by a single definition. However, for any probability distribution, the concept of *entropy* is defined to provide properties that agree with the intuitive notion of what a measure of information should be. Other related notions of uncertainty include the *conditional entropy* H(X|Y), which is the entropy of a random variable X conditional on the knowledge of another random variable Y.

In this review, we introduce most of the basic definitions required for the subsequent development of information theory. After defining concepts such as entropy and mutual information, we establish useful properties such as the chain rule of the quantities. Finally, we try to provide some examples of these concepts in the realm of machine learning.

## 2. Entropy, Relative Entropy, and Mutual Information

### 2.1 Entropy

We first introduce a common definition of entropy. Let X be a discrete random variable with alphabet *X* and probability mass function *p(x)* = Pr{X = *x*}, *x* ∈ *X*. The entropy H(X) of X is defined as

H(X) = −∑_{x ∈ X} p(x) log p(x).

The logarithm is taken to be base 2, and entropy can be expressed in bits.

We can think of entropy as a measure of the uncertainty of the random variable X. As a simple example, consider a random variable X representing a coin flip with Pr{X = heads} = Pr{X = tails} = 1/2 (a fair coin). Then H(X) = −p(heads) log p(heads) − p(tails) log p(tails) = −(1/2) log(1/2) − (1/2) log(1/2) = 1. Thus, the number of bits (entropy) required to express the outcome of the coin flip (random variable X) is 1 bit. In other words, the outcome of the coin flip can be encoded into a single bit of information, i.e., 0 = heads, 1 = tails.

Take a second example, which is an extreme of the above case. Consider flipping a biased coin that only lands heads, i.e., Pr{X = heads} = 1, Pr{X = tails} = 0. The entropy is H(X) = −(1) log(1) − (0) log(0) = 0 + 0 = 0, where we use the convention (0) log(0) = 0. Thus, the entropy of flipping this biased coin is 0. In other words, we require no bits to describe the event of flipping the biased coin because we already know the outcome. Moreover, going back to the concept of uncertainty, flipping the biased coin has zero uncertainty since the outcome is always known.
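The two coin calculations above can be reproduced with a short sketch (the `entropy` helper is our own naming, not a library function):

```python
import math

def entropy(probs):
    """Shannon entropy in bits, using the convention 0 * log(0) = 0."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# A fair coin carries 1 bit of uncertainty.
h_fair = entropy([0.5, 0.5])

# A coin that always lands heads carries no uncertainty at all.
h_biased = entropy([1.0, 0.0])
```

Because entropy depends only on the probabilities, any two-outcome distribution {1/2, 1/2} gives the same 1 bit, whatever the outcomes are called.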

Note that entropy is a function of the distribution of X. It does not depend on the actual values taken by the random variable X, but only on the probabilities.

The entropy of X can also be interpreted as the expected value of the random variable log(1/p(X)), where X is drawn according to the probability mass function *p(x)*. This is defined as

H(X) = E log(1/p(X)).

We denote expectation by E. That is, if X ∼ p(x), the expected value of the random variable g(X) is written as

E_p g(X) = ∑_{x ∈ X} g(x) p(x),

or E g(X) for simplicity. Thus, we can see that H(X) = E g(X), where g(X) = log(1/p(X)).

Some important properties of entropy are discussed below.

**Entropy is non-negative, H(X) ≥ 0.**

This can be easily proven: the probability p(x) of a random variable always lies in the range 0 ≤ p(x) ≤ 1, which implies that log(1/p(x)) ≥ 0, and a sum of non-negative terms cannot be negative.

**Entropy can be changed from one base to another by multiplying by the appropriate factor, H_b(X) = (log_b a) H_a(X).**

This can be seen by applying the log property log_b(p) = (log_b a)(log_a p). This enables us to change the base of the logarithm in the definition.

### 2.2 Joint Entropy and Conditional Entropy

The definition of entropy given in section 2.1 is with respect to a single random variable. We now extend the definition to a pair of random variables (X, Y).

The **joint entropy** H(X, Y) of a pair of discrete random variables (X, Y) with a joint distribution p(x, y) is defined as

H(X, Y) = −∑_{x ∈ X} ∑_{y ∈ Y} p(x, y) log p(x, y),

which can also be expressed as

H(X, Y) = −E log p(X, Y).

The **conditional entropy** of a random variable given another random variable is defined as the expected value of the entropies of the conditional distributions (remember that entropy is a function of some distribution), averaged over the conditioning random variable. The conditional entropy *H(Y|X)* is defined as

H(Y|X) = ∑_{x ∈ X} p(x) H(Y|X = x) = −∑_{x ∈ X} ∑_{y ∈ Y} p(x, y) log p(y|x),

where the random variables (X, Y) have the joint distribution p(x, y).

A useful property (or theorem) is the **chain rule** relating entropy, conditional entropy, and joint entropy. It is described as

H(X, Y) = H(X) + H(Y|X).

We can think of this as the entropy of a pair of random variables being the entropy of one plus the conditional entropy of the other.
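The chain rule H(X, Y) = H(X) + H(Y|X) can be checked numerically on a small joint distribution (the table and helper names below are arbitrary illustrative choices):

```python
import math

def H(probs):
    """Shannon entropy in bits of a list of probabilities."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# An arbitrary joint distribution p(x, y) over X, Y in {0, 1}.
joint = {(0, 0): 1/2, (0, 1): 1/4, (1, 0): 1/8, (1, 1): 1/8}

# Marginal p(x), joint entropy H(X, Y), and entropy H(X).
px = {x: sum(p for (a, _), p in joint.items() if a == x) for x in (0, 1)}
H_XY = H(joint.values())
H_X = H(px.values())

# Conditional entropy from its definition: H(Y|X) = -sum p(x,y) log p(y|x).
H_Y_given_X = -sum(p * math.log2(p / px[x]) for (x, _), p in joint.items())
```

Evaluating both sides shows that H_XY and H_X + H_Y_given_X agree, as the chain rule promises.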

It is important to note that **conditional entropy is not (in general) symmetric**, i.e., H(Y|X) ≠ H(X|Y). However,

H(X) − H(X|Y) = H(Y) − H(Y|X), a property often exploited in information and coding theory.

Another useful property is that **conditioning reduces entropy**. Mathematically, this is expressed as

H(X|Y) ≤ H(X),

with equality if and only if X and Y are independent. This will be proven after we introduce mutual information.

### 2.3 Relative Entropy and Mutual Information

We now introduce two related concepts: **relative entropy** and **mutual information**. Relative entropy, denoted D(p||q), is a measure of the distance between two distributions. In statistics, it arises as an expected logarithm of the likelihood ratio. In machine learning, relative entropy has also been adopted as the objective function to be minimized during training. For example, given a set of input data and target (reference) data, we pass the input data into a model to obtain a test output. We can then measure and compare the distributions of the test output and the target output data. The closer the two distributions (smaller relative entropy), the better the model is at predicting outputs for the input data.

Another way to interpret relative entropy is as a measure of the inefficiency of assuming that the distribution is q when the true distribution is p. Let us use the coin flip example again. Consider a biased coin, where the (true) distribution is p, i.e., Pr{heads} = 1, Pr{tails} = 0. For this coin, if we construct a code to describe the result of the coin flip, the code would have 0 bits since we already know the outcome. However, someone who does not know that the coin is biased might construct a code with respect to a fair coin with distribution denoted as q, i.e., Pr{heads} = 1/2 = Pr{tails}. The code will have a length of 1 bit, which can be described as H(p) + D(p||q) = 1, where H(p) = 0 and D(p||q) = 1.

To see this calculation, we will first define relative entropy. The relative entropy or **Kullback–Leibler distance** between two probability mass functions p(x) and q(x) is defined as

D(p||q) = ∑_{x ∈ X} p(x) log( p(x) / q(x) ).

In the above definition, we use the conventions that 0 log(0/0) = 0 and 0 log(0/q) = 0, and the convention (based on continuity arguments) that p log(p/0) = ∞.

Thus, if there is any symbol x ∈ X such that p(x) > 0 and q(x) = 0, then D(p||q) = ∞. D(p||q) can often be used as a distance measure defined on p and q. Intuitively, when p=q, our coding would be optimal, so D(p||q)=0. In general, the closer q is to p, the smaller D(p||q) would be. Note that D(p||q) is not equal to D(q||p), and they have different interpretations.

With this definition, let us go back to the coin example and show the calculations:

D(p||q) = 1 × log( 1 / (1/2) ) + 0 × log( 0 / (1/2) ) = log 2 = 1 bit.

Thus, we get H(p) + D(p||q) = 1 as stated. We can think of H(p) as the minimum number of bits required to express the (coin flip) event and D(p||q) as a penalty to the required number of bits if using a wrong distribution.
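A short sketch of this penalty calculation, with the conventions for zero probabilities built in (the `kl_divergence` helper is our own naming):

```python
import math

def kl_divergence(p, q):
    """D(p||q) in bits, with the conventions 0*log(0/q) = 0 and p*log(p/0) = inf."""
    total = 0.0
    for pi, qi in zip(p, q):
        if pi == 0:
            continue            # 0 * log(0/q) = 0
        if qi == 0:
            return math.inf     # p * log(p/0) = infinity
        total += pi * math.log2(pi / qi)
    return total

p = [1.0, 0.0]  # the true, always-heads coin
q = [0.5, 0.5]  # the wrongly assumed fair coin
penalty = kl_divergence(p, q)  # H(p) + D(p||q) = 0 + 1 = 1 bit
```

Note the asymmetry: D(q||p) is infinite here, since q puts positive probability on tails, an outcome that p rules out entirely.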

Next, we introduce **mutual information**, which is a measure of the amount of information that one random variable contains about another random variable. The mutual information, denoted I(X; Y), is the relative entropy between the joint distribution p(x, y) and the product distribution p(x)p(y):

I(X; Y) = ∑_{x ∈ X} ∑_{y ∈ Y} p(x, y) log( p(x, y) / (p(x)p(y)) ).

Here, p(x, y) is the joint probability mass function of the two random variables X and Y, and p(x), p(y) are marginal probability mass functions of X and Y, respectively.

By rearranging mutual information, we can see that

I(X; Y) = H(X) - H(X|Y).

That is, mutual information is the reduction in the uncertainty of one random variable due to knowledge of the other: I(X;Y) measures the reduction of uncertainty of X (or Y) due to knowing Y (or X). Note that the conditional entropy H(Y|X) measures the uncertainty of Y that remains given that we know X.

In addition, since H(X, Y) = H(X) + H(Y|X), we can write mutual information as

I(X; Y) = H(X) + H(Y) − H(X, Y).

Finally, we note that

I(X; X) = H(X) − H(X|X) = H(X).

That is, the mutual information of a random variable with itself is the entropy of the random variable. This is the reason that entropy is sometimes referred to as self-information. The above three variations of mutual information can be summarized in the following Venn diagram.
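Both the definitional form of I(X; Y) and the identity I(X; Y) = H(X) + H(Y) − H(X, Y) can be verified on a small, arbitrarily chosen joint distribution:

```python
import math

def H(probs):
    """Shannon entropy in bits of a list of probabilities."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# An arbitrary joint distribution over X, Y in {0, 1} and its marginals.
joint = {(0, 0): 1/2, (0, 1): 1/4, (1, 0): 1/8, (1, 1): 1/8}
px = {x: sum(p for (a, _), p in joint.items() if a == x) for x in (0, 1)}
py = {y: sum(p for (_, b), p in joint.items() if b == y) for y in (0, 1)}

# Definition: I(X;Y) = D( p(x,y) || p(x)p(y) ).
I_def = sum(p * math.log2(p / (px[x] * py[y])) for (x, y), p in joint.items())

# Identity: I(X;Y) = H(X) + H(Y) - H(X,Y).
I_identity = H(px.values()) + H(py.values()) - H(joint.values())
```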

We will now introduce **conditional mutual information**. The conditional mutual information of random variables X and Y given Z is defined by

I(X; Y|Z) = H(X|Z) − H(X|Y, Z).

### 2.4 Useful Properties and Their Consequences

Here we will discuss a few properties of entropy, relative entropy, and mutual information. These properties will allow us to gain insight on how they are used in machine learning and why they work. We also introduce some relevant prerequisites.

**Convexity of a function**

A function f(x) is said to be convex over an interval (a, b) if for every x_1, x_2 ∈ (a, b) and 0 ≤ λ ≤ 1,

f(λ x_1 + (1−λ) x_2 ) ≤ λ f(x_1) + (1−λ) f(x_2).

In addition,

- a function f is said to be strictly convex if equality holds only if λ = 0 or λ = 1.
- a function f is concave if −f is convex.
- examples of convex functions include x^2, e^x, x log(x) (for x ≥ 0), etc.
- examples of concave functions include sqrt(x) and log(x) for x≥0.
- linear functions ax + b are both convex and concave.
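The defining inequality can be spot-checked numerically for one convex and one concave example (the points and grid below are arbitrary illustrative choices):

```python
import math

# Check f(l*x1 + (1-l)*x2) <= l*f(x1) + (1-l)*f(x2) on a grid of lambdas
# for f(x) = x^2 (convex), and the reversed inequality for log(x) (concave).
x1, x2 = 0.5, 4.0
lambdas = [i / 10 for i in range(11)]

convex_holds = all(
    (l * x1 + (1 - l) * x2) ** 2 <= l * x1 ** 2 + (1 - l) * x2 ** 2 + 1e-12
    for l in lambdas
)
concave_holds = all(
    math.log(l * x1 + (1 - l) * x2)
    >= l * math.log(x1) + (1 - l) * math.log(x2) - 1e-12
    for l in lambdas
)
```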

**Jensen’s Inequality**

Jensen’s inequality states that if f is a convex function and X is a random variable, then

E f(X) ≥ f(E X).

In other words, the expectation of the function f(X) is greater than or equal to the function evaluated at the mean of the random variable X. A proof by induction is provided. First, by definition of convexity, we have

p_1 f(x_1) + p_2 f(x_2) ≥ f(p_1 x_1 + p_2 x_2),

where in this case the probability mass function is a two-mass-point distribution. Suppose the theorem is true for distributions with k − 1 mass points. Then we can write p′_i = p_i/(1 − p_k) for i = 1, 2, …, k − 1, and this will give us

∑_{i=1}^{k} p_i f(x_i) = p_k f(x_k) + (1 − p_k) ∑_{i=1}^{k−1} p′_i f(x_i)
≥ p_k f(x_k) + (1 − p_k) f( ∑_{i=1}^{k−1} p′_i x_i )
≥ f( p_k x_k + (1 − p_k) ∑_{i=1}^{k−1} p′_i x_i )
= f( ∑_{i=1}^{k} p_i x_i ),

where the first inequality follows from the induction hypothesis and the second inequality follows from the definition of convexity. Jensen’s inequality is used to prove the non-negativity of D(p||q).

**Non-negativity of D(p||q)**

Let p(x) and q(x), where x ∈ X, be two probability mass functions. Then D(p||q) ≥ 0, with equality if and only if p(x) = q(x) for all x. The proof of this is as follows. Let A = {x : p(x) > 0} be the support set of p(x). Then

−D(p||q) = −∑_{x ∈ A} p(x) log( p(x) / q(x) )
= ∑_{x ∈ A} p(x) log( q(x) / p(x) )
≤ log( ∑_{x ∈ A} p(x) · q(x)/p(x) )
= log( ∑_{x ∈ A} q(x) )
≤ log 1 = 0,

where the first inequality follows from Jensen’s inequality (log is concave, so E log Y ≤ log E Y). Multiplying both sides by −1 gives our claim, D(p||q) ≥ 0. In addition, since log(t) is strictly concave in t, equality holds if and only if q(x)/p(x) is constant, i.e., p(x) = q(x) for all x. From the above analysis, we can understand why relative entropy, or KL-divergence, is important in the context of machine learning.

- Convexity of D(p||q) in the pair (p, q) guarantees that any local minimum found during optimization is a global minimum.
- The non-negativity property makes it a good “penalty” function to add to a minimization problem.
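These properties are visible in a toy model-fitting exercise. In the sketch below (all helper names are ours), the Bernoulli parameter that minimizes D(p||q) against an empirical distribution is the same one that maximizes the likelihood of the observed flips:

```python
import math

# Empirical distribution of 10 coin flips: 7 heads, 3 tails.
p_emp = [0.7, 0.3]

def kl(p, q):
    """D(p||q) in bits (q assumed positive wherever p > 0)."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def log_likelihood(theta, heads=7, tails=3):
    """Log-likelihood of the observed flips under a Bernoulli(theta) model."""
    return heads * math.log(theta) + tails * math.log(1 - theta)

# Scan candidate models q = [theta, 1 - theta] over a grid.
thetas = [i / 100 for i in range(1, 100)]
theta_min_kl = min(thetas, key=lambda t: kl(p_emp, [t, 1 - t]))
theta_max_ll = max(thetas, key=log_likelihood)
```

Both criteria select theta = 0.7, the empirical frequency of heads, which is exactly the maximum-likelihood estimate.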

When p is taken as an empirical distribution of values in our observed data and q is taken as the distribution given by a probabilistic model to be estimated, minimizing the KL-divergence D(p||q) is equivalent to maximizing the likelihood of our data. This is why relative entropy is often adopted in machine learning applications.

**Chain rule for entropy and mutual information**

We first introduce the chain rule of entropy. Let X_1, X_2, …, X_n be drawn according to the joint distribution p(x_1, x_2, …, x_n). Then

H(X_1, X_2, …, X_n) = ∑_{i=1}^{n} H(X_i | X_{i−1}, …, X_1).

This can be easily proven by repeatedly expanding the joint entropy of a pair of random variables:

H(X_1, X_2) = H(X_1) + H(X_2|X_1),
H(X_1, X_2, X_3) = H(X_1) + H(X_2, X_3|X_1) = H(X_1) + H(X_2|X_1) + H(X_3|X_2, X_1),

and so on for n variables.

Similarly, the chain rule also applies to mutual information. Recalling the definition of conditional mutual information, the chain rule of information is the following:

I(X_1, X_2, …, X_n; Y) = ∑_{i=1}^{n} I(X_i; Y | X_{i−1}, X_{i−2}, …, X_1).

The proof follows from the chain rule for entropy:

I(X_1, …, X_n; Y) = H(X_1, …, X_n) − H(X_1, …, X_n | Y)
= ∑_{i=1}^{n} [ H(X_i | X_{i−1}, …, X_1) − H(X_i | X_{i−1}, …, X_1, Y) ]
= ∑_{i=1}^{n} I(X_i; Y | X_{i−1}, …, X_1).

**Nonnegativity of mutual information**

For any two random variables X, Y, we have I(X; Y) ≥ 0, with equality if and only if X and Y are independent. The proof of this theorem can be easily seen by expressing mutual information as a relative entropy:

I(X; Y) = D( p(x, y) || p(x)p(y) ) ≥ 0,

where the last inequality is from the fact that relative entropy is non-negative.

Let us take a closer look at this definition of mutual information. Note that we now have to think of a distribution over (X, Y) pairs when looking at the KL-divergence. That is, when we compute the KL-divergence, the sum is taken over all the (x, y) pairs where x is a value of X and y is a value of Y. We can see that p(X)p(Y) also gives us a distribution over such pairs. This is why the interpretation of I(X;Y) as a measure of the association/dependency of X and Y makes sense. The closer X and Y are to being independent, the closer p(X,Y) and p(X)p(Y) would be.

**Conditioning reduces entropy**

Recall that in section 2.2 we stated H(X|Y) ≤ H(X), or “information cannot hurt.” The proof can be easily seen as

H(X) − H(X|Y) = I(X; Y) ≥ 0,

where the inequality is from the non-negativity of mutual information. This theorem has a very intuitive meaning: knowing another random variable Y can only reduce the uncertainty in X. Note that this is true only on average. Specifically, H(X|Y = y) may be greater than, less than, or equal to H(X), but on average

H(X|Y) = ∑_y p(y) H(X|Y = y) ≤ H(X).

For example, in a court case, specific new evidence might increase uncertainty, but on average, evidence decreases uncertainty.
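This averaging effect can be made concrete with a small example of our own construction: one particular observation raises the uncertainty about X, yet the average over all observations lowers it:

```python
import math

def H(probs):
    """Shannon entropy in bits of a list of probabilities."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Y = 1 with prob 0.2 (X is then a fair coin); Y = 2 with prob 0.8 (X is certain).
p_y = {1: 0.2, 2: 0.8}
p_x_given_y = {1: [0.5, 0.5], 2: [1.0, 0.0]}

# The marginal of X is p(X=1) = 0.2*0.5 + 0.8*1.0 = 0.9.
H_X = H([0.9, 0.1])                  # about 0.47 bits

H_X_given_y1 = H(p_x_given_y[1])     # 1 bit: this observation *increases* uncertainty
H_X_given_Y = sum(p_y[y] * H(p_x_given_y[y]) for y in p_y)  # 0.2 bits on average
```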

We can take a look at some examples of how these concepts are applied to machine learning. For example, if X is the category of a document and Y is a word, I(X;Y) can be used to select words that are associated with topics. Such words are presumably most useful for classifying documents into categories. KL-divergence D(p||q) can be used to perform retrieval if p is a query unigram language model and q is a document unigram language model.

Another example is the use of mutual information as a criterion for feature selection and feature transformations. It can be used to characterize both the relevance and redundancy of variables, such as the minimum redundancy feature selection.

## 3. An Information/Coding Theory Perspective

In this section, I would like to present a few different interpretations of these concepts from a more traditional information and coding theory perspective.

The entropy of a random variable is a measure of the uncertainty of the random variable; it is a measure of the (minimum) amount of information required on the average to describe the random variable.

Let’s revisit the earlier idea of constructing a “code” to represent the outcome of an experiment (or event). For example, suppose we wish to encode 4 possible messages A, B, C, and D. What are the possible ways to encode them?

Let’s assume that each symbol is equally likely, i.e., p(A) = p(B) = p(C) = p(D) = 1/4. First, let’s calculate the entropy associated with this encoding, ENC_1.

H(ENC_1) = Pr{A}log(1/Pr{A})+Pr{B}log(1/Pr{B})+Pr{C}log(1/Pr{C})+Pr{D}log(1/Pr{D})

= (1/4)log(4) + (1/4)log(4) + (1/4)log(4) + (1/4)log(4)

= 2 bits

If we use the encoding ENC_1 = {A↦00, B↦01, C↦10, D↦11}, the average length of the code is 2 bits (calculated below). This is equal to the entropy calculated above.

Length(ENC_1) = 2×Pr{A} + 2×Pr{B} + 2×Pr{C} + 2×Pr{D}

= 2×(1/4) + 2×(1/4) + 2×(1/4) + 2×(1/4)

= 2

However, if the probability distribution of A, B, C, and D is not uniform, this encoding may not be optimal. For example, let {0.8, 0.1, 0.05, 0.05} be the probability distribution of A, B, C, and D, respectively. The entropy of this distribution is

H(ENC_2) = Pr{A}log(1/Pr{A})+Pr{B}log(1/Pr{B})+Pr{C}log(1/Pr{C})+Pr{D}log(1/Pr{D})

= (0.8)log(1/0.8) + (0.1)log(1/0.1) + (0.05)log(1/0.05) + (0.05)log(1/0.05)

= 1.0219 bits

Thus, we can see that only 1.0219 bits per symbol are required on average to represent the information. Since there is no such thing as 1.02 bits, we can only construct codes where A, B, C, and D are mapped to an integer number of bits. However, H(ENC_2) tells us that we can do much better than the 2 bits of ENC_1.

A Huffman tree is provided in Figure 2. to illustrate the idea of “decreasing the cost of representing high probability symbols.” This is also known as **entropy coding**.

Under the coding in Figure 2, we have ENC’ = {A↦0, B↦10, C↦110, D↦111}. The average length is now L(ENC’) = 0.8×1 + 0.1×2 + 0.05×3 + 0.05×3 = 1.3 bits. Note that this number is still above the 1.02-bit entropy, but it is well below the 2 bits of encoding each symbol with a fixed length. This is why we can interpret entropy as the *minimum required number* of bits to represent a distribution.
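The average-length arithmetic can be computed directly (the prefix code below is one valid choice of codewords with lengths 1, 2, 3, 3):

```python
# Average codeword length under p = {A: 0.8, B: 0.1, C: 0.05, D: 0.05},
# comparing an entropy-style prefix code against the fixed 2-bit code ENC_1.
probs = {"A": 0.8, "B": 0.1, "C": 0.05, "D": 0.05}
code = {"A": "0", "B": "10", "C": "110", "D": "111"}  # a prefix-free code

avg_len = sum(probs[s] * len(code[s]) for s in probs)   # 1.3 bits per symbol
fixed_len = sum(probs[s] * 2 for s in probs)            # 2.0 bits per symbol
```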

Next, we will discuss KL-divergence. As mentioned earlier, we can interpret the KL-divergence as the average number of bits that are wasted by encoding events from a distribution p with a code based on a “wrong” distribution q. Recalling that the entropy of ENC_2 is the minimum number of bits required to represent the distribution p = {0.8, 0.1, 0.05, 0.05}, we now look at the KL-divergence when we use q = {1/4, 1/4, 1/4, 1/4} to encode p.

We can see that if we use distribution q (equal probability) to construct code, we will need 2 bits, which is exactly H(p) + D(p||q) bits on average.

H(p) + D(p||q) = 1.0219 + 0.9781

= 2

Hence, this is why D(p||q) can be thought of as a penalty for using the wrong assumption on the underlying distribution.
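A short numeric check of this penalty identity:

```python
import math

p = [0.8, 0.1, 0.05, 0.05]    # the true symbol distribution
q = [0.25, 0.25, 0.25, 0.25]  # the wrongly assumed uniform distribution

H_p = -sum(pi * math.log2(pi) for pi in p)                   # about 1.0219 bits
D_pq = sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q))  # about 0.9781 bits
# A code built for q spends H(p) + D(p||q) = 2 bits per symbol on average.
```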

Finally, mutual information I(X;Y) gives us a measure of how much information is shared between two random variables X and Y. It is the relative entropy between the joint distribution and product of the marginal distributions of the two random variables. The interpretation of mutual information is the average reduction in the length of a codeword for Y given that X is known.

## 4. Concluding Remarks

In this article, we have introduced some basic information theory concepts — entropy, conditional entropy, relative entropy, and mutual information. We have also discussed several properties related to these concepts and utilized them to prove additional useful theorems. Finally, we have provided interpretations of the concepts from a classical information and coding theory perspective and discussed their use in machine learning.

Upcoming installments of this special *Synced* series will introduce other traditionally analysis-and-optimization-focused fields of study, such as convex analysis and linear and non-linear programming, and their impact on modern-day machine learning.

**Author**: Joshua Chou | **Editor**: H4O & Michael Sarazen

