Entropy and Information Gain

This is mostly based on PowerPoint slides written by Andrew W. Moore of Carnegie Mellon University

Let's say we have a variable X that can have four possible values, A, B, C, and D.

We might see something like BAACACCDDCDADABCDBBB

Where:

Example 1

 P(X=A) = 1/4 P(X=B) = 1/4 P(X=C) = 1/4 P(X=D) = 1/4

If we can only encode this data using 0s and 1s (think of Morse code), we can code each value using 2 bits.

 Value   Code
 A       00
 B       01
 C       10
 D       11
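As a quick sketch (plain Python, using the codeword table above), encoding the sample string confirms that this scheme costs 2 bits per symbol:

```python
# Fixed-length 2-bit code for four equally likely values.
code = {"A": "00", "B": "01", "C": "10", "D": "11"}

def encode(message, code):
    """Concatenate the codeword for each symbol in the message."""
    return "".join(code[ch] for ch in message)

message = "BAACACCDDCDADABCDBBB"  # the sample sequence from above
bits = encode(message, code)
print(len(bits) / len(message))  # 2.0 bits per symbol
```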

What if, as in the English language, all letters are not equally likely? That is, different letters appear with different probabilities.

Example 2

 P(X=A) = 1/2 P(X=B) = 1/4 P(X=C) = 1/8 P(X=D) = 1/8

Can we invent a code that uses, on average, 1.75 bits?

 Value   Code
 A       0
 B       10
 C       110
 D       111

(This code is not unique)

1/2 * 1 + 1/4 * 2 + 1/8 * 3 + 1/8 * 3 = 1.75
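A minimal check of that arithmetic, using the codewords tabulated above:

```python
# Variable-length code from Example 2 and the probability of each value.
code = {"A": "0", "B": "10", "C": "110", "D": "111"}
probs = {"A": 1/2, "B": 1/4, "C": 1/8, "D": 1/8}

# Average code length = sum of P(value) * codeword length.
avg_bits = sum(probs[v] * len(code[v]) for v in code)
print(avg_bits)  # 1.75

# The code is also prefix-free: no codeword is a prefix of another,
# so a stream of bits can be decoded unambiguously.
words = list(code.values())
prefix_free = not any(a != b and b.startswith(a) for a in words for b in words)
print(prefix_free)  # True
```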

What if we had a variable with three equally likely values?

 P(X=A) = 1/3 P(X=B) = 1/3 P(X=C) = 1/3

A naïve coding would use 2 bits for each value.

 Value   Code
 A       00
 B       01
 C       10

Can we devise a code that uses fewer bits on average? The code below averages 5/3 ≈ 1.67 bits. (Theoretically we need only log2(3) ≈ 1.58496 bits.)

 Value   Code
 A       0
 B       10
 C       11
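The 1.58496 figure is just the entropy of three equally likely values, log2(3); a one-line check:

```python
import math

# Entropy of three equally likely values: -3 * (1/3) * log2(1/3) = log2(3).
h = -sum((1/3) * math.log2(1/3) for _ in range(3))
print(round(h, 5))  # 1.58496
```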

There is obviously something "mathematical" going on here!!

Let's go back to Example 2.

Example 2

 P(X=A) = 1/2 P(X=B) = 1/4 P(X=C) = 1/8 P(X=D) = 1/8

Intuitively speaking, if A occurs half the time, then we need only one bit to code A.

B occurs only 1/4 of the time, so we can afford a longer codeword for it; proportionally, B contributes only 1/4 * 2 = 0.5 bits to the average.

Where's the math?

The number of bits needed to code a value occurring with probability p is:

bits = -log2(p)

bits(A) = -log2(1/2) = 1

bits(B) = -log2(1/4) = 2

bits(C) = bits(D) = -log2(1/8) = 3

Substituting to find the average we have:

H(X) = -pA log2(pA) - pB log2(pB) - pC log2(pC) - pD log2(pD)

H(X) = 1/2 * 1 + 1/4 * 2 + 1/8 * 3 + 1/8 * 3 = 1.75

Therefore values occurring with higher probability are coded with fewer bits.

H(X) is the Entropy of X.
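The formula above generalizes to any distribution. A small helper (an illustrative sketch, not from the slides) reproduces both examples:

```python
import math

def entropy(probs):
    """Shannon entropy in bits: H = -sum(p * log2(p)), skipping zero terms."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Example 2: P(A)=1/2, P(B)=1/4, P(C)=P(D)=1/8.
print(entropy([1/2, 1/4, 1/8, 1/8]))  # 1.75

# Example 1: four equally likely values need the full 2 bits.
print(entropy([1/4] * 4))  # 2.0
```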

General Case:  Given that S is our sample, with c possible values occurring with probabilities p1, p2, ..., pc:

H(S) = -p1 log2(p1) - p2 log2(p2) - ... - pc log2(pc) = -Σ pi log2(pi)

Entropy is a measure of how "mixed up" an attribute is.  It is sometimes equated to the purity or impurity of a variable.

High entropy means that we are sampling from a roughly uniform (boring) distribution.  Its histogram would be flat, so we have an equal chance of obtaining any possible value.

Low entropy means that the distribution varies: it has peaks and valleys.  The frequency histogram would have many lows and maybe one or two highs, so the variable is more predictable.

Entropy is a measure of how pure or impure a variable is.

Birdie seed example.  Bag A contains equal amounts of peanuts and sunflower seeds.

Bag B contains a few peanuts and is mostly sunflower seeds.

Bag B has low entropy and Bag A has high entropy.  We can more easily predict which seed we would pull out of Bag B.

For a two-valued variable, entropy ranges from 0 (all instances have the same value) to 1 (equal numbers of instances of each value).  More generally, the maximum is log2 of the number of possible values.

How can we predict output Y given input X?

Example 3

 Outlook    Play Tennis
 Sunny      No
 Sunny      No
 Overcast   Yes
 Rain       Yes
 Rain       Yes
 Rain       No
 Overcast   Yes
 Sunny      No
 Sunny      Yes
 Rain       Yes
 Sunny      Yes
 Overcast   Yes
 Overcast   Yes
 Rain       No

P(Play Tennis = Yes) = 9/14

P(Play Tennis = No) = 5/14

Entropy(Play Tennis) = -9/14 log2(9/14) - 5/14 log2(5/14) = .940
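Computing the same .940 from the raw labels (a sketch; the column is typed in from the table above):

```python
import math
from collections import Counter

def entropy(labels):
    """Entropy of a list of class labels, in bits."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

# The Play Tennis column from Example 3 (9 Yes, 5 No).
play = ["No", "No", "Yes", "Yes", "Yes", "No", "Yes",
        "No", "Yes", "Yes", "Yes", "Yes", "Yes", "No"]
print(f"{entropy(play):.3f}")  # 0.940
```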

Suppose we try to predict output of Play Tennis based on an input of Outlook.  In other words, knowing the value of Outlook can I decide whether or not to play tennis?

Srain = [3+, 2-]

P(Play Tennis = Yes | Outlook = Rain) = 3/5

P(Play Tennis = No | Outlook = Rain) = 2/5

H(Play Tennis | Outlook = Rain) = -3/5 log2(3/5) - 2/5 log2(2/5) = .971

This is the specific conditional entropy for rain, H(Y | X = rain)

General Format:  H(Y | X = v) where v is some value

Sovercast = [4+, 0-]

H(Play Tennis | Outlook = Overcast) = -4/4 log2(4/4) = 0

Ssunny = [2+, 3-]

H(Play Tennis | Outlook = Sunny) = -2/5 log2(2/5) - 3/5 log2(3/5) = .971

Conditional entropy is the expected number of bits needed to transmit Y if both sides will know the value of X.

It is the average of the specific conditional entropies of Y, weighted by the probability of each value of X.

P(rain) = 5/14   P(overcast) = 4/14   P(sunny) = 5/14

Entropy(Play Tennis | Outlook) = 5/14 * .971 + 4/14 * 0 + 5/14 * .971 = .694
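The weighted average can be checked directly from the per-value splits (a sketch using the counts above):

```python
import math

def split_entropy(pos, neg):
    """Entropy of a two-class split with pos positives and neg negatives."""
    total = pos + neg
    h = 0.0
    for c in (pos, neg):
        if c:
            h -= (c / total) * math.log2(c / total)
    return h

# (positives, negatives) for each Outlook value, from the table above.
splits = {"Rain": (3, 2), "Overcast": (4, 0), "Sunny": (2, 3)}
total = sum(p + n for p, n in splits.values())  # 14 examples

cond = sum(((p + n) / total) * split_entropy(p, n) for p, n in splits.values())
print(f"{cond:.3f}")  # 0.694
```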

By knowing Outlook, how much information have I gained?

I have reduced the number of bits needed to send my message by:

Entropy (Play Tennis) - Entropy (Play Tennis | Outlook) = .940 - .694 = .246

I need .246 fewer bits to send my message if I know the Outlook.

Information Gain is the number of bits saved, on average, if we transmit Y and both receiver and sender know X.

Gain(Y, X) = Entropy(Y) - Entropy(Y | X) = Entropy(Y) - Σv P(X = v) Entropy(Y | X = v)

Here, Gain(Play Tennis, Outlook) = .940 - .694 = .246
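Putting the whole calculation together as one function (an illustrative sketch, with the Example 3 columns typed in):

```python
import math
from collections import Counter

def entropy(labels):
    """Entropy of a list of class labels, in bits."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(xs, ys):
    """Gain(Y, X) = H(Y) - H(Y | X), where H(Y | X) is the average of the
    per-value entropies of Y weighted by P(X = v)."""
    n = len(ys)
    cond = 0.0
    for v in set(xs):
        subset = [y for x, y in zip(xs, ys) if x == v]
        cond += (len(subset) / n) * entropy(subset)
    return entropy(ys) - cond

outlook = ["Sunny", "Sunny", "Overcast", "Rain", "Rain", "Rain", "Overcast",
           "Sunny", "Sunny", "Rain", "Sunny", "Overcast", "Overcast", "Rain"]
play = ["No", "No", "Yes", "Yes", "Yes", "No", "Yes",
        "No", "Yes", "Yes", "Yes", "Yes", "Yes", "No"]

# Prints 0.247; the text's .246 comes from rounding .940 and .694 first.
print(f"{information_gain(outlook, play):.3f}")
```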