Log likelihood statistic and topic signature words
Lecture 10
Grows very slowly
log(1) = 0
log(2) = 0.6931472
log(3) = 1.098612
log(4) = 1.386294
log(5) = 1.609438
log(6) = 1.791759
log(7) = 1.94591
log(8) = 2.079442
log(9) = 2.197225
log(10) = 2.302585
• log(x) < 0 if 0 < x < 1
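The values listed above are natural logarithms (base e). A minimal sketch, assuming Python's standard `math` module, that recomputes them and checks the sign of the logarithm for arguments between 0 and 1; the names `log_table` and `small_logs` are illustrative only:

```python
import math

# math.log with a single argument is the natural logarithm (base e).
log_table = {x: math.log(x) for x in range(1, 11)}
for x, value in log_table.items():
    print(f"log({x}) = {value:.7g}")

# For 0 < x < 1 the logarithm is negative, so log probabilities are negative.
small_logs = {x: math.log(x) for x in (0.5, 0.1, 0.001)}
for x, value in small_logs.items():
    print(f"log({x}) = {value:.7g}")
```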
Log probability arithmetic
• Probabilities P(w) are very small numbers
  – Underflow
• Using log probabilities solves the problem
• Working with logs is also more efficient
  – Summation instead of multiplication
log(p1 p2 p3 … pn) = log(p1) + log(p2) + … + log(pn)
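The underflow problem can be illustrated with a short sketch. The data here is hypothetical (400 words, each with probability 0.001), and `log_prob_of_sequence` is a helper name introduced only for this example:

```python
import math

def log_prob_of_sequence(probs):
    """Return log(p1 * p2 * ... * pn), computed as a sum of logs."""
    return sum(math.log(p) for p in probs)

# Hypothetical data: 400 words, each with probability 1e-3.
probs = [1e-3] * 400

# Direct multiplication: the true value, 1e-1200, is far below the
# smallest positive double, so the running product collapses to 0.0.
product = 1.0
for p in probs:
    product *= p

# Summing logs stays in a comfortable range: 400 * log(1e-3).
log_product = log_prob_of_sequence(probs)

print(product)
print(log_product)
```

Once the product has underflowed to zero, every word sequence looks equally impossible; the log sums can still be compared.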
Plain text
• Howe, the fourth Cabinet member to resign because of disputes over policy toward the European Community, shocked the House of Commons on Nov. 13 by calling Mrs. Thatcher a threat to Britain's vital interests.
Part of speech tagging
• Howe_NNP ,_, the_DT fourth_JJ Cabinet_NNP member_NN to_TO resign_VB because_IN of_IN disputes_NNS over_IN policy_NN toward_IN the_DT European_NNP Community_NNP ,_, shocked_VBD the_DT House_NNP of_IN Commons_NNPS on_IN Nov._NNP 13_CD by_IN calling_VBG Mrs._NNP Thatcher_NNP a_DT threat_NN to_TO Britain_NNP 's_POS vital_JJ interests_NNS ._.
Named entity recognition
• Howe/PERSON ,/O the/O fourth/O Cabinet/ORGANIZATION member/O to/O resign/O because/O of/O disputes/O over/O policy/O toward/O the/O European/LOCATION Community/LOCATION ,/O shocked/O the/O House/ORGANIZATION of/ORGANIZATION Commons/ORGANIZATION on/O Nov./O 13/O by/O calling/O Mrs./PERSON Thatcher/PERSON a/O threat/O to/O Britain/LOCATION 's/O vital/O interests/O ./O
Likelihood ratio (section 5.3.4)
• One more way of deciding “is a given word representative of what an article is about?”
• T: cluster of articles (such as those we have seen in homework 1)
• NT: background collection
  – For homework 1 we also used a background collection to compute idf. This was all the articles that we are not currently summarizing.
Is a word w a topic word?
• Two possibilities
  – Either w is very indicative of the topic of the cluster and appears more often in T than in NT
  – Or w occurs with the same frequency in both T and NT
What is the likelihood of a word occurring n times in a document?
• Binomial distribution
  – Word w occurring == success
  – Other word occurring == failure
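Under the binomial model, the probability of k successes in n trials with success probability p is C(n, k) p^k (1 - p)^(n - k). A minimal sketch of that formula in log space follows. The notation is an assumption made for this example: k is the number of times w occurs, n is the document length in words, and p is the probability that a word is w. The counts are hypothetical.

```python
import math

def binomial_log_likelihood(k, n, p):
    """log of C(n, k) * p**k * (1 - p)**(n - k), assuming 0 < p < 1."""
    # lgamma(m + 1) == log(m!), which avoids computing huge factorials.
    log_coeff = math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
    return log_coeff + k * math.log(p) + (n - k) * math.log(1 - p)

# Hypothetical counts: w occurs 3 times in a 100-word document,
# and each word is w with probability 0.01.
log_lik = binomial_log_likelihood(3, 100, 0.01)
print(log_lik, math.exp(log_lik))
```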
What is the likelihood of the data under the two models?
• -2 log(lambda) has a chi-square distribution
  – So we can look up the probability of getting a given value; if the probability is very low, we can assume H2 holds (w appears more often in T than in NT)
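The slide does not spell out lambda, so the sketch below rests on assumptions: lambda is taken to be the ratio of the binomial likelihood under a single shared probability p for T and NT to the likelihood under separate probabilities p1 and p2, each estimated from counts, and the chi-square distribution is assumed to have 1 degree of freedom. All function names and counts are hypothetical.

```python
import math

def _xlogy(x, y):
    """x * log(y), with the convention 0 * log(0) = 0."""
    return 0.0 if x == 0 else x * math.log(y)

def log_l(k, n, p):
    """Binomial log likelihood without the C(n, k) term, which cancels in the ratio."""
    return _xlogy(k, p) + _xlogy(n - k, 1 - p)

def neg2_log_lambda(k1, n1, k2, n2):
    """-2 log(lambda) for w seen k1 times in n1 words of T and k2 times in n2 words of NT."""
    p = (k1 + k2) / (n1 + n2)   # same frequency in T and NT
    p1 = k1 / n1                # frequency in T
    p2 = k2 / n2                # frequency in NT
    return 2 * (log_l(k1, n1, p1) + log_l(k2, n2, p2)
                - log_l(k1, n1, p) - log_l(k2, n2, p))

def chi2_sf_1df(x):
    """P(X > x) for a chi-square variable with 1 degree of freedom (assumed)."""
    return math.erfc(math.sqrt(max(x, 0.0) / 2))

# Hypothetical counts: 30 of 10,000 words in T, 50 of 1,000,000 words in NT.
stat = neg2_log_lambda(30, 10_000, 50, 1_000_000)
p_value = chi2_sf_1df(stat)
print(stat, p_value)

# Same relative frequency in both collections: the statistic is about 0.
same = neg2_log_lambda(10, 1_000, 100, 10_000)
print(same, chi2_sf_1df(same))
```

The statistic is also large when w is much rarer in T than in NT, so a topic-word test along these lines would presumably also check that k1 / n1 > k2 / n2.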