UL — Info Theory

Ganesh Walavalkar
Apr 23, 2020
Information Theory

In a machine learning context, every input vector X and output vector Y can be considered a probability density function. Information theory is a mathematical framework that enables us to compare these probability density functions and ask questions such as: are these input vectors similar? Does this feature carry any information at all?

Entropy — is a measure of the unpredictability of a state, or the average information content. So if a coin is fair, i.e. the probability of heads or tails is exactly half, then the entropy of the event of tossing that coin is 1. Conversely, if the coin is unfair and always gives either heads or tails, then the entropy of tossing this unfair coin is 0, as we always know what the outcome will be. In the case of the unfair coin there is no (zero) new information to be gained from tossing it.
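
As a quick sanity check, here is a tiny Python sketch (my own, not part of the course notes) that computes H(X) = −∑ P(x) log₂ P(x) in bits for a fair and an unfair coin:

```python
import numpy as np

def entropy(probs):
    """Shannon entropy in bits; terms with zero probability contribute nothing."""
    probs = np.asarray(probs, dtype=float)
    nonzero = probs[probs > 0]
    return -np.sum(nonzero * np.log2(nonzero))

print(entropy([0.5, 0.5]))  # fair coin -> 1.0 bit
print(entropy([1.0, 0.0]))  # coin that always lands heads -> 0.0 bits
```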

Information between Two Variables

If knowing the value of one variable improves our ability to predict a second variable, then the first variable contains information that helps in predicting the second. Two related quantities are as follows:

Joint Entropy
The joint entropy is given by the following formula:
H(X, Y) = −∑ P(X, Y) log P(X, Y)
Joint entropy is the randomness contained in two variables together.

Conditional Entropy
The conditional entropy is given by the following formula:
H(Y | X) = −∑ P(X, Y) log P(Y | X)
Conditional entropy is the randomness of one variable given knowledge of the other variable.

If X and Y are independent (X ⊥ Y), then
H(Y | X) = H(Y)
H(X, Y) = H(X) + H(Y)
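
To make these quantities concrete, here is a small Python sketch on a made-up 2×2 joint distribution where X and Y happen to be independent; it checks both identities above (logs are base 2, so the numbers are in bits):

```python
import numpy as np

def entropy(p):
    p = np.asarray(p, dtype=float).ravel()
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

# Joint distribution P(X, Y) as a table: rows index X, columns index Y.
# X and Y are independent here, so P(X, Y) = P(X) P(Y).
p_x = np.array([0.3, 0.7])
p_y = np.array([0.5, 0.5])
p_xy = np.outer(p_x, p_y)

h_xy = entropy(p_xy)                                      # H(X, Y)
p_y_given_x = p_xy / p_xy.sum(axis=1, keepdims=True)      # P(Y | X = x) for each row x
h_y_given_x = -np.sum(p_xy * np.log2(p_y_given_x))        # H(Y|X) = -sum P(x,y) log P(y|x)

print(np.isclose(h_y_given_x, entropy(p_y)))              # True: H(Y|X) = H(Y)
print(np.isclose(h_xy, entropy(p_x) + entropy(p_y)))      # True: H(X, Y) = H(X) + H(Y)
```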

Mutual Information

Consider the conditional entropy H(Y | X). This conditional entropy may be small if X gives a great deal of information about Y. For example, observing that people are buying lots of water bottles tells us a great deal about whether a tornado or hurricane is on its way.

I(X, Y) = H(Y) − H(Y | X)
So mutual information is the reduction in the randomness of one variable given knowledge of the other variable.
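
Here is a short Python sketch of that formula on a toy joint distribution I made up for the water-bottles-and-storm example; the numbers are purely illustrative:

```python
import numpy as np

def entropy(p):
    p = np.asarray(p, dtype=float).ravel()
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

# P(X, Y): X = "people are buying lots of water", Y = "a storm is coming" (hypothetical numbers).
p_xy = np.array([[0.45, 0.05],
                 [0.10, 0.40]])
p_x = p_xy.sum(axis=1)
p_y = p_xy.sum(axis=0)

h_y = entropy(p_y)
h_y_given_x = -np.sum(p_xy * np.log2(p_xy / p_x[:, None]))  # -sum P(x,y) log P(y|x)
print(h_y - h_y_given_x)  # I(X, Y) ~= 0.4 bits of reduction in uncertainty about Y
```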

Kullback–Leibler Divergence

KL divergence measures the difference between two distributions. It is given by the following formula:

D(p || q) = ∫ p(x) log (p(x)/q(x)) dx

The divergence is always non-negative and is zero only when p = q. Minimizing the KL divergence is another way of fitting your existing model to your data.
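
For a discrete sketch of the same idea, with distributions I chose just for illustration, D(p || q) = ∑ p(x) log(p(x)/q(x)):

```python
import numpy as np

def kl_divergence(p, q):
    """KL divergence in bits; assumes q(x) > 0 wherever p(x) > 0."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    mask = p > 0
    return np.sum(p[mask] * np.log2(p[mask] / q[mask]))

p = np.array([0.5, 0.5])        # fair coin
q = np.array([0.9, 0.1])        # heavily biased coin
print(kl_divergence(p, p))      # 0.0: divergence is zero only when p = q
print(kl_divergence(p, q))      # > 0, and note D(p||q) != D(q||p): KL is not symmetric
print(kl_divergence(q, p))
```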

Conclusion

We looked at information theory: entropy, including joint and conditional entropy, as well as mutual information and KL divergence.

References: heavily borrowed from my notes for CS7641, so thanks to Prof. Charles Isbell, Prof. Michael Littman, and TA Pushkar. Errors are all mine.
