Entropy and Information Gain

This is mostly based on PowerPoint slides written by Andrew W. Moore of Carnegie Mellon University:

http://www.autonlab.org/tutorials/infogain.html

 

Let's say we have a variable X that can have four possible values, A, B, C, and D.

 

We might see something like BAACACCDDCDADABCDBBB

 

Where:

Example 1

P(X=A) = 1/4

P(X=B) = 1/4

P(X=C) = 1/4

P(X=D) = 1/4

 

If we can only encode this data using 0s and 1s (think of Morse code), we can code each value using 2 bits.

 

Value    Code
A        0 0
B        0 1
C        1 0
D        1 1
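
As a quick check, here is a minimal sketch (Python assumed; the sample string is the one from the text) showing that this fixed-length code costs exactly 2 bits per value:

# Minimal sketch: encode the example sequence with the fixed-length 2-bit code.
code = {"A": "00", "B": "01", "C": "10", "D": "11"}
message = "BAACACCDDCDADABCDBBB"
encoded = "".join(code[ch] for ch in message)
print(len(encoded) / len(message))   # 2.0 bits per value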

 

What if, as in the English language, all letters are not equally likely? That is, different letters appear with different probabilities.

 

Example 2

P(X=A) = 1/2

P(X=B) = 1/4

P(X=C) = 1/8

P(X=D) = 1/8

 

Can we invent a code that uses, on average, 1.75 bits?

 


 



Value    Code
A        0
B        1 0
C        1 1 0
D        1 1 1

 

(This code is not unique)

 

Average bits per value = 1/2 * 1 + 1/4 * 2 + 1/8 * 3 + 1/8 * 3 = 1.75
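
A small sketch (Python assumed) confirming that the code above averages 1.75 bits when values appear with the probabilities of Example 2:

# Average codeword length, weighted by each value's probability.
code = {"A": "0", "B": "10", "C": "110", "D": "111"}
probs = {"A": 1/2, "B": 1/4, "C": 1/8, "D": 1/8}
avg_bits = sum(p * len(code[v]) for v, p in probs.items())
print(avg_bits)   # 1.75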

What if we had a variable with three equally likely values?

 

P(X=A) = 1/3

P(X=B) = 1/3

P(X=C) = 1/3

 

A naïve coding would use 2 bits for each value:

 

Value    Code
A        0 0
B        0 1
C        1 0

 

 

Can we devise a code that uses only about 1.6 bits on average?  (Theoretically we need only log2(3) ≈ 1.58496 bits.)

 

Value    Code
A        0
B        1 0
C        1 1
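
As a rough check (Python assumed), we can compare the average length of the code above with the theoretical bound quoted in the text:

import math

# Average length of the code above for three equally likely values,
# versus the theoretical minimum of log2(3) bits.
code = {"A": "0", "B": "10", "C": "11"}
avg_bits = sum((1/3) * len(bits) for bits in code.values())
print(avg_bits)        # 5/3, about 1.67 bits per value
print(math.log2(3))    # about 1.58496 bits

So this particular code averages about 1.67 bits per value, which is already better than the naive 2 bits but still above the theoretical floor of 1.58496 bits.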

 

 

There is obviously something "mathematical" going on here!!

 

Let's go back to Example 2.

 

Example 2

P(X=A) = 1/2

P(X=B) = 1/4

P(X=C) = 1/8

P(X=D) = 1/8

 

 

Intuitively speaking, if A occurs half the time, then we need only one bit to code A. 

 

B occurs only 1/4 of the time, so we can afford to give it a longer code; proportionally, B contributes less than 1 bit, on average, to each value we send.

 

Where's the math?

 

The number of bits needed to code value A, occurring with a probability of 1/2, is:

 

bits = -log2(p), where p is the probability with which a particular value occurs

bits(A) = -log2(1/2) = 1

bits(B) = -log2(1/4) = 2

bits(C) = bits(D) = -log2(1/8) = 3

 

Substituting to find the average we have:

 

H(X) = -pA log2(pA) - pB log2(pB) - pC log2(pC) - pD log2(pD)

H(X) = 1/2 * 1 + 1/4 * 2 + 1/8 * 3 + 1/8 * 3 = 1.75
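
A minimal sketch (Python assumed) of the same calculation, with the entropy formula written as a small reusable function of my own:

import math

def entropy(probs):
    # H(X) = -sum of p * log2(p) over all values with nonzero probability.
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Example 2: P(A) = 1/2, P(B) = 1/4, P(C) = 1/8, P(D) = 1/8
print(entropy([1/2, 1/4, 1/8, 1/8]))   # 1.75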

 

 


Therefore values occurring with higher probability are coded with fewer bits. 

 

H(X) is the Entropy of X.

 

General Case:  Given that S is our sample and p_i is the proportion of S belonging to class i,

Entropy(S) = - Σ p_i log2(p_i)   (summed over all classes i)

Entropy is a measure of how "mixed up" an attribute is.  It is sometimes equated to the purity or impurity of a variable.

 

High entropy means that we are sampling from a (nearly) uniform, "boring" distribution.  Its histogram is flat, so every possible value is about equally likely and each sample is hard to predict.

Low entropy means that the distribution varies: its frequency histogram has many lows and maybe one or two highs, so the values we draw are more predictable.

 

Entropy is a measure of how pure or impure a variable is.

 

Bird seed example:  Bag A contains equal amounts of peanuts and sunflower seeds.
                    Bag B contains a few peanuts and is mostly sunflower seeds.

Bag A has high entropy; Bag B has low entropy.  It is easier to predict which kind of seed we would most likely pull out of Bag B.
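
A short sketch (Python assumed; the exact mix in Bag B is not given in the text, so the 10%/90% split below is only illustrative):

import math

def entropy(probs):
    return -sum(p * math.log2(p) for p in probs if p > 0)

print(entropy([0.5, 0.5]))   # Bag A: 1.0, maximally mixed up
print(entropy([0.1, 0.9]))   # Bag B (assumed 10% peanuts): about 0.47, much purer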

 

For a two-valued variable, entropy ranges from 0 (all instances have the same value) to 1 (an equal number of instances of each value).  In general, the maximum entropy for a variable with k values is log2(k).

 

How can we predict an output Y given an input X?

 

Example 3

Outlook      Play Tennis
Sunny        No
Sunny        No
Overcast     Yes
Rain         Yes
Rain         Yes
Rain         No
Overcast     Yes
Sunny        No
Sunny        Yes
Rain         Yes
Sunny        Yes
Overcast     Yes
Overcast     Yes
Rain         No

 

P(Play Tennis = Yes) = 9/14

P(Play Tennis = No) = 5/14

 

Entropy(Play Tennis) = -9/14 log2(9/14) - 5/14 log2(5/14) = .940
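
The same number can be checked with a small sketch (Python assumed), computing the entropy directly from the class counts:

import math

def entropy_from_counts(counts):
    # Entropy of a label distribution given its raw counts.
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

print(entropy_from_counts([9, 5]))   # about 0.940 for 9 Yes and 5 No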

 

Suppose we try to predict the output Play Tennis based on an input of Outlook.  In other words, knowing the value of Outlook, can I decide whether or not to play tennis?

 

S_rain = [3+, 2-]

P(Play Tennis = yes | Outlook = Rain) = 3/5

P(Play Tennis = no | Outlook = Rain) = 2/5

H(Play Tennis | Outlook = Rain) = -3/5 log2(3/5) - 2/5 log2(2/5) = .971

 

This is the specific conditional entropy for rain, H(Y | X = rain)

General Format:  H(Y | X = v) where v is some value

S_overcast = [4+, 0-]

H(Play Tennis | Outlook = Overcast) = -4/4 log2(4/4) - 0/4 log2(0/4) = 0

(taking 0 log2(0) = 0; all four Overcast days are Yes, so there is no uncertainty)

 

S_sunny = [2+, 3-]

H(Play Tennis | Outlook = Sunny) = -2/5 log2(2/5) - 3/5 log2(3/5) = .971

 

 

Conditional entropy, H(Y | X), is the expected number of bits needed to transmit Y if both sender and receiver know the value of X.

It is the average of the specific conditional entropies H(Y | X = v), weighted by P(X = v).

 

P(rain) = 5/14   P(overcast) = 4/14   P(sunny) = 5/14

Entropy(Play Tennis | Outlook) = 5/14 * .971 + 4/14 * 0 + 5/14 * .971 = .694
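
A small sketch (Python assumed) of this weighted average, using the per-value counts read off the table above:

import math

def entropy_from_counts(counts):
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

# (Yes, No) counts for each value of Outlook, out of 14 rows in total.
subsets = {"Rain": (3, 2), "Overcast": (4, 0), "Sunny": (2, 3)}
total = 14

cond_entropy = sum(
    (sum(counts) / total) * entropy_from_counts(counts)   # P(X = v) * H(Y | X = v)
    for counts in subsets.values()
)
print(cond_entropy)   # about 0.694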

 

By knowing Outlook, how much information have I gained?

 

I have reduced the number of bits needed to send my message by:

 

Entropy (Play Tennis) - Entropy (Play Tennis | Outlook) = .940 - .694 = .246

 

On average, I need .246 fewer bits to send my message if I know the Outlook.

 

Information Gain is the number of bits saved, on average, when we transmit Y and both receiver and sender know the value of X.

 

Gain(Y, X) = Entropy(Y) - Entropy(Y | X) = Entropy(Y) - Σ P(X = v) * Entropy(Y | X = v), summed over all values v of X.
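
Putting the pieces together, here is a self-contained sketch (Python assumed) that computes the information gain of Outlook directly from the Example 3 table:

import math
from collections import Counter, defaultdict

def entropy_from_counts(counts):
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

# The (Outlook, Play Tennis) pairs from Example 3.
data = [("Sunny", "No"), ("Sunny", "No"), ("Overcast", "Yes"), ("Rain", "Yes"),
        ("Rain", "Yes"), ("Rain", "No"), ("Overcast", "Yes"), ("Sunny", "No"),
        ("Sunny", "Yes"), ("Rain", "Yes"), ("Sunny", "Yes"), ("Overcast", "Yes"),
        ("Overcast", "Yes"), ("Rain", "No")]

# Entropy of the output Y (Play Tennis).
h_y = entropy_from_counts(Counter(y for _, y in data).values())

# Conditional entropy of Y given X (Outlook): weighted average over X's values.
groups = defaultdict(list)
for x, y in data:
    groups[x].append(y)
h_y_given_x = sum(
    (len(ys) / len(data)) * entropy_from_counts(Counter(ys).values())
    for ys in groups.values()
)

print(h_y - h_y_given_x)   # information gain of Outlook, about 0.246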