
Natural Language Processing (4)
Zhao Hai 赵海
Department of Computer Science and Engineering
Shanghai Jiao Tong University
zhaohai@cs.sjtu.edu.cn

Outline
• Lexicons and Lexical Analysis
  - Finite State Models and Morphological Analysis
  - Collocation

Lexicons and Lexical Analysis (202) Morphemes
• Morphemes are the smallest meaningful units of language and are typically word stems or affixes.
• For example, the word “books” can be divided into two morphemes, ‘book’ and ‘s’, where ‘s’ is a plural suffix.

Lexicons and Lexical Analysis (203) Morphology (1/3)
Morphology is generally divided into two types:
1. Inflectional morphology covers the variant forms of nouns, adjectives and verbs owing to changes in:
  - Person (first, second, third);
  - Number (singular, plural);
  - Tense (present, future, past);
  - Gender (masculine, feminine, neuter).

Lexicons and Lexical Analysis (204) Morphology (2/3)
2. Derivational morphology is the formation of a new word by the addition of an affix, but it also includes cases of derivation without an affix:
  - disenchant (V) + -ment → disenchantment (N);
  - reduce (V) + -tion → reduction (N);
  - record (V) → record (N);
  - progress (N) → progress (V).

Lexicons and Lexical Analysis (205) Morphology (3/3)
• Most morphological analysis programs tend to deal only with inflectional morphology, and assume that derivational variants will be listed separately in the lexicon.

Lexicons and Lexical Analysis (206) Analysis of Plurals
• The following word-stems obey regular rules for the generation of plurals:
  - CHURCHES → CHURCH + ES;
  - SPOUSES → SPOUSE + S;
  - FLIES → FLY + IES;
  - PIES → PIE + S.
• The remaining word-stems are irregular:
  - MICE → MOUSE;
  - FISH (plural identical to the singular);
  - ROOVES → ROOF + VES;
  - BOOK ENDS → BOOK END + S;
  - LIEUTENANTS GENERAL → LIEUTENANT (+S) GENERAL.

Lexicons and Lexical Analysis (207) Analysis of Inflectional Variants
• The following word-stems obey regular rules:
  - LODGING → LODGE + ING;
  - BANNED → BAN + NED;
  - FUMED → FUME + D;
  - BREACHED → BREACH + ED;
  - TAKEN → TAKE + N.
• The following word-stems are irregular:
  - TAUGHT → TEACH;
  - FOUGHT → FIGHT;
  - TOOK → TAKE.

Lexicons and Lexical Analysis (208) Morphological Analyzer
• A morphological analyzer must be able to undo the spelling rules for adding affixes.
• For example, the analyzer must be able to interpret “moved” as ‘move’ plus ‘ed’.
• For English, a few rules cover the generation of plurals and other inflections such as verb endings.
• The main problem is where a rule has exceptions, which have to be listed explicitly, or where it is not clear which rule applies, if any.
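
The idea of undoing spelling rules, with exceptions listed explicitly, can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the tiny `LEXICON` and `IRREGULAR` tables, the function names, and the `+s` / `+ed` / `+ing` affix labels are hypothetical and not part of the lecture; a real analyzer needs a full lexicon and a more careful rule set.

```python
# Hypothetical toy lexicon of known stems; a real system would load a full dictionary.
LEXICON = {"church", "spouse", "fly", "pie", "move", "lodge", "ban", "fume", "breach"}

# Exceptions are listed explicitly because no spelling rule predicts them.
IRREGULAR = {"mice": ("mouse", "+PL"), "taught": ("teach", "+PAST"), "took": ("take", "+PAST")}

def candidates(word):
    """Yield (stem, affix) guesses obtained by undoing one spelling rule."""
    if word.endswith("ies"):
        yield word[:-3] + "y", "+s"        # flies -> fly
    if word.endswith("es"):
        yield word[:-2], "+s"              # churches -> church
    if word.endswith("s"):
        yield word[:-1], "+s"              # spouses -> spouse, pies -> pie
    if word.endswith("ed"):
        yield word[:-2], "+ed"             # breached -> breach
        yield word[:-1], "+ed"             # moved -> move, fumed -> fume
        if len(word) > 4 and word[-3] == word[-4]:
            yield word[:-3], "+ed"         # banned -> ban (doubled consonant)
    if word.endswith("ing"):
        yield word[:-3], "+ing"
        yield word[:-3] + "e", "+ing"      # lodging -> lodge

def analyze(word):
    """Return (stem, affix) for a known word form, or None if no rule applies."""
    word = word.lower()
    if word in IRREGULAR:
        return IRREGULAR[word]
    for stem, affix in candidates(word):
        if stem in LEXICON:                # accept the first guess whose stem is a known word
            return stem, affix
    return None

for w in ["moved", "flies", "pies", "banned", "mice"]:
    print(w, analyze(w))
```

Checking each guess against the lexicon is what resolves cases where more than one rule could apply, such as PIES (PIE + S, not PY + IES).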

Lexicons and Lexical Analysis (209) Finite State Transducers (FSTs) (1/4)
• Finite-state transducers (FSTs) are automata for which each transition has an output label in addition to the more familiar input label.
• Transducers transform (transduce) input strings into output strings. The output symbols come from a finite set, usually called the output alphabet.
• Since the input and output alphabets are frequently the same, there is usually no distinction between them; that is, only the input alphabet is given.

Lexicons and Lexical Analysis (210) Finite State Transducers (FSTs) (2/4)
Definition: A finite-state transducer (FST) is a 5-tuple M = (Q, Σ, E, i, F), where Q is a finite set of states, i ∈ Q is the initial state, F ⊆ Q is a set of final states, Σ is a finite alphabet, and E ⊆ Q × (Σ ∪ {ε}) × Σ* × Q is the set of transitions (arcs).
Σ* is the set of all possible words over Σ: Σ* = {v | v = v_1 v_2 … v_n for n ≥ 1 and v_i ∈ Σ for all 1 ≤ i ≤ n} ∪ {ε}

Lexicons and Lexical Analysis (211) Finite State Transducers (FSTs) (3/4)
Definition: Further, we define the state transition function δ : Q × (Σ ∪ {ε}) → 2^Q (the power set of Q) as follows:
  δ(p, a) = {q ∈ Q | ∃ v ∈ Σ* : (p, a, v, q) ∈ E},
and the emission function λ : Q × (Σ ∪ {ε}) × Q → 2^(Σ*) is defined as:
  λ(p, a, q) = {v ∈ Σ* | (p, a, v, q) ∈ E}

Lexicons and Lexical Analysis (212) Finite State Transducers (FSTs) (4/4)
Ex.: Let M = (Q_M, Σ_M, E_M, i_M, F_M) be an FST, where Q_M = {0, 1, 2}, Σ_M = {a, b, c}, E_M = {(0, a, b, 1), (0, a, c, 2)}, i_M = 0 and F_M = {1, 2}. M transduces a to b or a to c.
Note that for visualizing transducers we use the colon to separate the input and output labels of a transduction.
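
The definition and the example machine can be made concrete with a short Python sketch. It is a minimal illustration, not a full implementation: the class name `FST` and the method `transduce` are hypothetical, arcs are stored as 4-tuples (p, a, v, q) as in the definition, and arcs with ε as input are deliberately not handled.

```python
from collections import defaultdict

class FST:
    """Toy finite-state transducer M = (Q, Sigma, E, i, F) with arcs (p, a, v, q)."""

    def __init__(self, arcs, initial, finals):
        self.initial = initial
        self.finals = set(finals)
        # Index arcs by (source state, input symbol) -> list of (output string, target state).
        self.arcs = defaultdict(list)
        for p, a, v, q in arcs:
            self.arcs[(p, a)].append((v, q))

    def transduce(self, word):
        """Return the set of output strings for `word` (epsilon-input arcs are not handled)."""
        configs = {(self.initial, "")}           # (current state, output produced so far)
        for symbol in word:
            configs = {(q, out + v)
                       for state, out in configs
                       for v, q in self.arcs.get((state, symbol), [])}
        return {out for state, out in configs if state in self.finals}

# The example machine: arcs {(0, a, b, 1), (0, a, c, 2)}, initial state 0, final states {1, 2}.
M = FST(arcs=[(0, "a", "b", 1), (0, "a", "c", 2)], initial=0, finals={1, 2})
print(M.transduce("a"))   # the two outputs b and c (set order may vary)
print(M.transduce("b"))   # empty set: no arc reads b
```

Tracking a set of (state, output) configurations is what lets the sketch follow both arcs out of state 0 at once, mirroring the fact that δ returns a set of states.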

Lexicons and Lexical Analysis (213) A Simple FST
Ex.: Morphological analysis for the word “happy” and its derived forms:
  - happy → happy;
  - happier → happy+er;
  - happiest → happy+est

Lexicons and Lexical Analysis (214) Specification for the Simple FST
• Arcs labeled by a single letter have that letter as both the input and the output. Nodes that are double circles indicate success states, that is, acceptable words.
• The dashed link, indicating a jump, is not formally necessary but is useful for showing the break between the processing of the root form and the processing of the suffix.
• No input is represented as an empty symbol ε.

Lexicons and Lexical Analysis (215) A Fragment of an FST
This FST accepts the following words, which all start with t: tie (state 4), ties (10), trap (7), traps (10), try (11), tries (15), to (16), torch (19), torches (15), toss (21), and tosses (15). In addition, it outputs the corresponding morphemic sequences: tie, tie+s, trap, trap+s, try, try+s, to, torch, torch+s, toss, toss+s.

Lexicons and Lexical Analysis (216) Specification for the Fragment of an FST (1/2)
• The entire lexicon can be encoded as an FST that encodes all the legal input words and transforms them into morphemic sequences.
• The FSTs for the different suffixes need only be defined once, and all root forms that allow that suffix can point to the same node.

Lexicons and Lexical Analysis (217) Specification for the Fragment of an FST (2/2)
• Words that share a common prefix (such as torch, toss, and so on) also can share the same nodes, greatly reducing the size of the network.
• Note that you may pass through acceptable states along the way when processing a word.

Lexicons and Lexical Analysis (218) References
J. Hopcroft, J. Ullman. 1979. Introduction to Automata Theory, Languages and Computation. Addison-Wesley Series in Computer Science, Addison-Wesley, Reading, Massachusetts, Menlo Park, California, London.
M. Mohri. 1997. Finite-state transducers in language and speech processing. Computational Linguistics 23.

Outline
• Lexicons and Lexical Analysis
  - Finite State Models and Morphological Analysis
  - Collocation
    - Motivations
    - Frequency
    - Mean and Variance

Lexicons and Lexical Analysis (219) Collocation (1) Definition (1/6)
• A collocation is an expression consisting of two or more words that correspond to some conventional way of saying things.
• For example,
  - noun phrases: strong tea; weapons of mass destruction;
  - phrasal verbs: to make up;
  - other phrases: the rich and powerful.

Lexicons and Lexical Analysis (220) Collocation (2) Compositionality (2/6)
• We call a natural language expression compositional if the meaning of the expression can be predicted from the meaning of the parts.
• Collocations are characterized by limited compositionality, in which there is usually an element of meaning added to the combination.
• For example, in the case of strong tea, strong has acquired the meaning ‘rich in some active agent’, which is closely related to, but slightly different from, the basic sense ‘having great physical strength’.

Lexicons and Lexical Analysis (221) Collocation (3) Non-Compositionality (3/6)
• Idioms are the most extreme examples of non-compositionality. For instance, the idioms to kick the bucket (翘辫子, ‘to die’) and to hear it through the grapevine (听小道消息, ‘to hear a rumor’) only have an indirect historical relationship to the meanings of the parts of the expression.
• Most collocations exhibit milder forms of non-compositionality, like the expression international best practice. It is very nearly a systematic composition of its parts, but still has an element of added meaning.

Lexicons and Lexical Analysis (222) Collocation (4) Other Terms (4/6)
• There is considerable overlap between the concept of collocation and notions like term, technical term, and terminological phrase.
• The above three terms are commonly used when collocations are extracted from technical domains (in a process called terminology extraction).

Lexicons and Lexical Analysis (223) Collocation (5) Applications (5/6)
• Collocations are important for a number of applications:
  - natural language generation (to make sure that the output sounds natural and mistakes like powerful tea or to take a decision are avoided);
  - computational lexicography (to automatically identify the important collocations to be listed in a dictionary entry);

Lexicons and Lexical Analysis (224) Collocation (6) Applications (6/6)
  - parsing (so that preference can be given to parses with natural collocations);
  - corpus linguistic research (for instance, the study of social phenomena like the reinforcement of cultural stereotypes through language).

Lexicons and Lexical Analysis (225) Collocation (7) Frequency (1/11)
• Surely the simplest method for finding collocations in a text corpus is counting.
• If two words occur together a lot, then that is evidence that they have a special function that is not simply explained as the function that results from their combination.

Lexicons and Lexical Analysis (226) Collocation (8) Frequency (2/11)
• The table (New York Times corpus) shows the bigrams (sequences of two adjacent words) that are most frequent in the corpus and their frequency.
• Except for New York, all the bigrams are pairs of function words.
• A function word is a word which has no lexical meaning of its own, and whose sole function is to express grammatical relationships; examples are prepositions, articles, and conjunctions.

Lexicons and Lexical Analysis (227) Collocation (9) Frequency (3/11)
• But just selecting the most frequently occurring bigrams is not very interesting.
• Justeson and Katz (1995): pass the candidate phrases through a part-of-speech filter which only lets through those patterns that are likely to be “phrases”.

Lexicons and Lexical Analysis (228) Collocation (10) Frequency (4/11)
[Table: part-of-speech tag patterns used by the filter; not reproduced in this text]

Lexicons and Lexical Analysis (229) Collocation (11) Frequency (5/11)
• Each pattern is followed by an example from the text which is used as a test set. In these patterns A refers to an adjective, P to a preposition, and N to a noun.
• The next table shows the most highly ranked phrases after applying the filter.
• The results are surprisingly good. There are only 3 bigrams that we would not regard as non-compositional phrases: last year, last week, and first time.
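
A part-of-speech filter of this kind can be sketched as follows. The sketch rests on assumptions that should be read as such: the pattern table is not reproduced in this text, so the two-word patterns A N and N N are assumed here; the function name `filtered_bigram_counts` is hypothetical; and the input is a hand-tagged toy sequence rather than the output of a real tagger.

```python
from collections import Counter

# Assumed two-word tag patterns (A = adjective, N = noun); the full pattern
# table is not reproduced in this text, so this set is an illustration only.
BIGRAM_PATTERNS = {("A", "N"), ("N", "N")}

def filtered_bigram_counts(tagged_tokens):
    """Count adjacent word pairs whose tag sequence matches one of the patterns."""
    counts = Counter()
    for (w1, t1), (w2, t2) in zip(tagged_tokens, tagged_tokens[1:]):
        if (t1, t2) in BIGRAM_PATTERNS:
            counts[(w1, w2)] += 1
    return counts

# Hand-tagged toy input (hypothetical); D = determiner, P = preposition, V = verb.
tagged = [("the", "D"), ("strong", "A"), ("tea", "N"), ("of", "P"), ("the", "D"),
          ("last", "A"), ("year", "N"), ("was", "V"), ("strong", "A"), ("tea", "N")]
print(filtered_bigram_counts(tagged).most_common())
```

Function-word pairs such as (of, the) never reach the frequency ranking because their tag sequence matches no pattern, which is the effect the filter is meant to have.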

Lexicons and Lexical Analysis (230) Collocation (12) Frequency (6/11)
[Table: most highly ranked phrases after applying the filter; not reproduced in this text]

Lexicons and Lexical Analysis (231) Collocation (13) Frequency (7/11)
• York City is an artefact of the way we have implemented the Justeson and Katz filter.
• The full implementation would search for the longest sequence that fits one of the part-of-speech patterns and would thus find the longer phrase New York City.
• The twenty highest ranking phrases containing strong and powerful all have the form A N (where A is either strong or powerful). They have been listed in the following table.

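Lexicons and Lexical Analysis (232) Collocation (14) Frequency (8/11)
[Table: highest ranking phrases of the form A N containing strong and powerful; not reproduced in this text]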

Lexicons and Lexical Analysis (233) Collocation (15) Frequency (9/11)
• Given the simplicity of the method, these results are surprisingly accurate.
• For example, they give evidence that strong challenge and powerful computers are correct whereas powerful challenge and strong computers are not.
• However, we can also see the limits of a frequency-based method.

Lexicons and Lexical Analysis (234) Collocation (16) Frequency (10/11)
• Neither strong tea nor powerful tea occurs in the New York Times corpus.
• However, searching the larger corpus of the World Wide Web we find 799 examples of strong tea and 17 examples of powerful tea (the latter mostly in the computational linguistics literature on collocations), which indicates that the correct phrase is strong tea.
• A Google search in 2013 indicates:
  - 520,000 for ‘strong tea’
  - 27,000 for ‘powerful tea’

Lexicons and Lexical Analysis (235) Collocation (17) Frequency (11/11)
• A simple quantitative technique (the frequency filter) combined with a small amount of linguistic knowledge (the importance of parts of speech) goes a long way.
• Later we will use a stop list that excludes words whose most frequent tag is not a verb, noun or adjective.

Lexicons and Lexical Analysis (236) Collocation (18) Mean and Variance (1)
• Frequency-based search works well for fixed phrases.
• But many collocations consist of two words that stand in a more flexible relationship to one another.
• Consider the verb knock and one of its most frequent arguments, door.

Lexicons and Lexical Analysis (237) Collocation (19) Mean and Variance (2)
• Here are some examples of knocking on or at a door:
  - She knocked on his door.
  - They knocked at the door.
  - 100 women knocked on Donaldson’s door.
  - A man knocked on the metal front door.

Lexicons and Lexical Analysis (238) Collocation (20) Mean and Variance (3)
• The words that appear between knocked and door vary and the distance between the two words is not constant, so a fixed phrase approach would not work here.
• But there is enough regularity in the patterns to allow us to determine that knock is the right verb to use in English for this situation, not hit, beat or rap.

Lexicons and Lexical Analysis (239) Collocation (21) Mean and Variance (4)
• To simplify matters we only look at fixed phrase collocations in most cases, and usually at just bigrams.
• We define a collocational window (usually a window of 3 to 4 words on each side of a word), and we enter every word pair in there as a collocational bigram. Then we proceed to do our calculations as usual on this larger pool of bigrams.

Lexicons and Lexical Analysis (240) Collocation (22) Mean and Variance (5) Using a three word collocational window to capture bigrams at a distance 42
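
Collecting bigrams at a distance can be sketched as below. The function name `window_bigrams` and the choice to record the offset alongside each pair are assumptions made for illustration. The sketch only looks to the right of each word, which still produces every pair within the window exactly once.

```python
def window_bigrams(tokens, window=3):
    """Pair each word with each of the next `window` words, recording the offset."""
    pairs = []
    for i, left in enumerate(tokens):
        for offset in range(1, window + 1):
            if i + offset < len(tokens):
                pairs.append((left, tokens[i + offset], offset))
    return pairs

tokens = "she knocked on his door".split()
pairs = window_bigrams(tokens, window=3)
for pair in pairs:
    print(pair)
```

With a window of three words, the pair (knocked, door) is captured at offset 3 even though the two words are not adjacent.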

Lexicons and Lexical Analysis (241) Collocation (23) Mean and Variance (6)
• The mean- and variance-based methods by definition look at the pattern of varying distance between two words.
• One way of discovering the relationship between knocked and door is to compute the mean and variance of the offsets (signed distances) between the two words in the corpus.

Lexicons and Lexical Analysis (242) Collocation (24) Mean and Variance (7)
• The mean is simply the average offset. For the four example sentences given earlier, we compute the mean offset between knocked and door as follows:
  $\mu = \frac{1}{4}(3 + 3 + 5 + 5) = 4.0$
• This assumes a tokenization of Donaldson’s as three words: Donaldson, apostrophe, and s.

Lexicons and Lexical Analysis (243) Collocation (25) Mean and Variance (8)
• The variance measures how much the individual offsets deviate from the mean. We estimate it as follows:
  $s^2 = \frac{\sum_{i=1}^{n} (d_i - \mu)^2}{n - 1}$
  where n is the number of times the two words co-occur, $d_i$ is the offset for co-occurrence i, and μ is the mean.

Lexicons and Lexical Analysis (244) Collocation (26) Mean and Variance (9)
• As is customary, we use the standard deviation s, the square root of the variance, to assess how variable the offset between two words is. The standard deviation for the four examples of knocked / door (offsets 3, 3, 5, 5 with mean 4.0) is:
  $s = \sqrt{\frac{1}{3}\left((3 - 4.0)^2 + (3 - 4.0)^2 + (5 - 4.0)^2 + (5 - 4.0)^2\right)} \approx 1.15$
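
The mean and standard deviation of the offsets for the four knocked / door sentences can be checked with a short script. The helper names `offsets` and `mean_and_std` are hypothetical, and the sentences are pre-tokenized by hand, with Donaldson’s split into three tokens as assumed in the text.

```python
from math import sqrt

def offsets(sentences, w1, w2):
    """Signed distance from the first w1 to the first w2 in each tokenized sentence."""
    return [s.index(w2) - s.index(w1) for s in sentences if w1 in s and w2 in s]

def mean_and_std(ds):
    """Mean offset and sample standard deviation (n - 1 in the denominator)."""
    n = len(ds)
    mu = sum(ds) / n
    var = sum((d - mu) ** 2 for d in ds) / (n - 1)
    return mu, sqrt(var)

sentences = [
    "She knocked on his door".split(),
    "They knocked at the door".split(),
    "100 women knocked on Donaldson ' s door".split(),   # Donaldson's as three tokens
    "A man knocked on the metal front door".split(),
]
ds = offsets(sentences, "knocked", "door")
mu, s = mean_and_std(ds)
print(ds, mu, round(s, 2))   # [3, 3, 5, 5] 4.0 1.15
```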

Lexicons and Lexical Analysis (245) Collocation (27) Mean and Variance (10)
• The mean and standard deviation characterize the distribution of distances between two words in a corpus.
• We can use this information to discover collocations by looking for pairs with low standard deviation.
• We can also explain the information that variance gets at in terms of peaks in the distribution of one word with respect to another.

Lexicons and Lexical Analysis (246) Collocation (28) Mean and Variance (11) The variance of strong with respect to opposition is small 48

Lexicons and Lexical Analysis (247) Collocation (29) Mean and Variance (12)
Because of this greater variability we get a higher standard deviation and a mean that is between positions -1 and -2 (-1.45).

Lexicons and Lexical Analysis (248) Collocation (30) Mean and Variance (13)
The high standard deviation indicates this randomness, which in turn indicates that ‘for’ and ‘strong’ do not form an interesting collocation.

Lexicons and Lexical Analysis (249) Collocation (31) Mean and Variance (14) 51

Lexicons and Lexical Analysis (250) Collocation (32) Mean and Variance (15)
• If the mean is close to 1.0 and the standard deviation low, as is the case for New York, then we have the type of phrase that Justeson and Katz’ frequency-based approach will also discover.
• If the mean is much greater than 1.0, then a low standard deviation indicates an interesting phrase.

Lexicons and Lexical Analysis (251) Collocation (33) Mean and Variance (16)
• High standard deviation indicates that the two words of the pair stand in no interesting relationship, as demonstrated by the four high-variance examples.
• More interesting are the cases in between, word pairs that have large counts for several distances in their collocational distribution.

Lexicons and Lexical Analysis (252) Collocation (34) References
J. S. Justeson and S. M. Katz. 1995. Technical terminology: some linguistic properties and an algorithm for identification in text. Natural Language Engineering 1.
M. A. K. Halliday. 1966. Lexis as a linguistic level. In C. E. Bazell, J. C. Catford, M. A. K. Halliday, and R. H. Robins (eds.), In memory of J. R. Firth. London: Longmans.
F. Smadja. 1993. Retrieving collocations from text: Xtract. Computational Linguistics 19.

Outline
• Lexicons and Lexical Analysis
  - Collocation
    - Hypothesis Testing
    - T Test
    - Mutual Information

Lexicons and Lexical Analysis (254) Collocation (35) Hypothesis Testing (1/5)
• One difficulty that we have glossed over so far is that high frequency and low variance can be accidental.
• For example, if the two constituent words of a frequent bigram like new companies are frequently occurring words (as new and companies are), then we expect the two words to co-occur a lot just by chance, even if they do not form a collocation.

Lexicons and Lexical Analysis (255) Collocation (36) Hypothesis Testing (2/5)
• What we really want to know is whether two words occur together more often than chance.
• Assessing whether or not something is a chance event is one of the classical problems of statistics. It is usually couched in terms of hypothesis testing.

Lexicons and Lexical Analysis (256) Collocation (37) Hypothesis Testing (3/5)
• We formulate a null hypothesis H_0 that there is no association between the words beyond chance occurrences, and compute the probability p that the event would occur if H_0 were true.
• We then reject H_0 if p is too low (typically if beneath a significance level of p < 0.05, 0.01, 0.005, or 0.001) and retain H_0 as possible otherwise.

Lexicons and Lexical Analysis (257) Collocation (38) Hypothesis Testing (4/5)
• We need to formulate a null hypothesis which states what should be true if two words do not form a collocation.
• For such a free combination of two words we will assume that each of the words w_1 and w_2 is generated completely independently of the other, and so their chance of coming together is simply given by:
  $P(w_1 w_2) = P(w_1) P(w_2)$

Lexicons and Lexical Analysis (258) Collocation (39) Hypothesis Testing (5/5)
• The model implies that the probability of co-occurrence is just the product of the probabilities of the individual words.
• This is a rather simplistic model, and not empirically accurate, but for now we adopt independence as our null hypothesis.

Lexicons and Lexical Analysis (259) Collocation (40) The T Test (1)
• We need a statistical test that tells us how probable or improbable it is that a certain constellation will occur.
• A test that has been widely used for collocation discovery is the t test.
• The t test looks at the mean and variance of a sample of measurements, where the null hypothesis is that the sample is drawn from a distribution with mean μ.

Lexicons and Lexical Analysis (260) Collocation (41) The T Test (2)
  $t = \frac{\bar{x} - \mu}{\sqrt{s^2 / N}}$
  where $\bar{x}$ is the sample mean, $s^2$ is the sample variance, N is the sample size, and μ is the mean of the distribution.
• If the t statistic is large enough we can reject the null hypothesis.

Lexicons and Lexical Analysis (261) Collocation (42) The T Test (3)
• The t test looks at the difference between the observed ($\bar{x}$) and expected (μ) means, scaled by the variance of the data.
• It tells us how likely one is to get a sample of that mean and variance (or a more extreme mean and variance) assuming that the sample is drawn from a normal distribution with mean μ.

Lexicons and Lexical Analysis (262) Collocation (43) The T Test (4)
• For instance, suppose our null hypothesis is that the mean height of a population of men is 158 cm.
• We are given a sample of 200 men with $\bar{x} = 169$ and $s^2 = 2600$ and want to know whether this sample is from the general population (the null hypothesis) or whether it is from a different population of taller men.

Lexicons and Lexical Analysis (263) Collocation (44) The T Test (5)
• This gives us the following t according to the above formula:
  $t = \frac{169 - 158}{\sqrt{2600 / 200}} \approx 3.05$
• We can also find out exactly how large it has to be by looking up the table of the t distribution.

Lexicons and Lexical Analysis (264) Collocation (45) The T Test (6)
• If we look up the value of t that corresponds to a confidence level of α = 0.005, we will find 2.576. Since the t we got is larger than 2.576, we can reject the null hypothesis with 99.5% confidence.
• So we can say that the sample is not drawn from a population with mean 158 cm, and our probability of error is less than 0.5%.
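
The height example can be reproduced in a few lines. The function name `t_statistic` and its parameter names are hypothetical; the numbers (mean 158 cm under the null hypothesis, a sample of 200 with mean 169 and variance 2600, critical value 2.576) are the ones given in the text.

```python
from math import sqrt

def t_statistic(sample_mean, mu, sample_var, n):
    """One-sample t statistic: (x_bar - mu) / sqrt(s^2 / N)."""
    return (sample_mean - mu) / sqrt(sample_var / n)

t = t_statistic(sample_mean=169, mu=158, sample_var=2600, n=200)
print(round(t, 2))   # 3.05
print(t > 2.576)     # True: reject the null hypothesis at alpha = 0.005
```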

Lexicons and Lexical Analysis (265) Collocation (46) The T Test (7)
• To see how to use the t test for finding collocations, let us compute the t value for new companies.
• We think of the text corpus as a long sequence of N bigrams, and the samples are then indicator random variables that take on the value 1 when the bigram of interest occurs, and are 0 otherwise.

Lexicons and Lexical Analysis (266) Collocation (47) The T Test (8)
• In the corpus, new occurs 15,828 times, companies 4,675 times, and there are 14,307,668 tokens overall.
• Using maximum likelihood estimates, we can compute the probabilities of new and companies as follows:
  $P(\text{new}) = \frac{15828}{14307668}$, $P(\text{companies}) = \frac{4675}{14307668}$

Lexicons and Lexical Analysis (267) Collocation (48) The T Test (9)
• The null hypothesis is that occurrences of new and companies are independent.
• If the null hypothesis is true, then the process of randomly generating bigrams of words and assigning 1 to the outcome new companies and 0 to any other outcome can be treated as a Bernoulli trial.

Lexicons and Lexical Analysis (268) Collocation (49) The T Test (10)
• With p = P(new) P(companies) ≈ 3.615 × 10^-7 as the success probability under the null hypothesis, the mean for this distribution is μ = p and the variance is σ² = p(1 − p), which is approximately p. The approximation holds since for most bigrams p is small.
• It turns out that there are actually 8 occurrences of new companies among the 14,307,668 bigrams in our corpus. So, for the sample, we have that the sample mean is:
  $\bar{x} = \frac{8}{14307668} \approx 5.591 \times 10^{-7}$

Lexicons and Lexical Analysis (269) Collocation (50) The T Test (11)
• Now we have everything we need to apply the t test:
  $t = \frac{\bar{x} - \mu}{\sqrt{s^2 / N}} \approx \frac{5.591 \times 10^{-7} - 3.615 \times 10^{-7}}{\sqrt{5.591 \times 10^{-7} / 14307668}} \approx 0.999932$
• This t value of 0.999932 is not larger than 2.576, the critical value for α = 0.005. So we cannot reject the null hypothesis that new and companies occur independently and do not form a collocation.
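
The whole calculation for new companies can be replayed from the raw counts given in the text. The variable names are hypothetical; the only modelling choices are the ones stated above, namely independence as the null hypothesis and the Bernoulli variance, which is approximately equal to the sample mean.

```python
from math import sqrt

N = 14_307_668                          # tokens (treated as the number of bigrams)
c_new, c_companies, c_bigram = 15_828, 4_675, 8

mu = (c_new / N) * (c_companies / N)    # expected bigram probability under independence
x_bar = c_bigram / N                    # observed bigram probability (sample mean)
s2 = x_bar * (1 - x_bar)                # Bernoulli variance, approximately x_bar

t = (x_bar - mu) / sqrt(s2 / N)
print(round(t, 4))                      # 0.9999
print(t > 2.576)                        # False: independence cannot be rejected
```

Replacing the three counts is all that is needed to score another bigram from the same corpus.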

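Lexicons and Lexical Analysis (270) Collocation (51) The T Test (12)
[Table: t values for ten bigrams that occur exactly 20 times in the corpus; not reproduced in this text]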

Lexicons and Lexical Analysis (271) Collocation (52) The T Test (13)
• The above table shows t values for ten bigrams that occur exactly 20 times in the corpus.
• For the top five bigrams, we can reject the null hypothesis that the component words occur independently for α = 0.005, so these are good candidates for collocations.
• The bottom five bigrams fail the test for significance, so we will not regard them as good candidates for collocations.

Lexicons and Lexical Analysis (272) Collocation (53) The T Test (14)
• Note that a frequency-based method would not be able to rank the ten bigrams since they occur with exactly the same frequency.
• We can see that the t test takes into account the number of co-occurrences of the bigram relative to the frequencies of the component words.

Lexicons and Lexical Analysis (273) Collocation (54) The T Test (15)
• If a high proportion of the occurrences of both words (Ayatollah Ruhollah, videocassette recorder) or at least a very high proportion of the occurrences of one of the words (unsalted) occurs in the bigram, then its t value is high.
• This criterion makes intuitive sense.

Lexicons and Lexical Analysis (274) Collocation (55) The T Test (16)
• The analysis in the table includes some stop words; without stop words, it is actually hard to find examples that fail significance. (Note: A stop word is a word that is common and frequently used, such as the, a, for, of, etc.)
• It turns out that most bigrams attested in a corpus occur significantly more often than chance.
• For 824 out of the 831 bigrams that occurred 20 times in our corpus the null hypothesis of independence can be rejected.

Lexicons and Lexical Analysis (275) Collocation (56) The T Test (17)
• But we would only classify a fraction as true collocations. The reason for this surprisingly high proportion of possibly dependent bigrams is that language itself, if compared with a random word generator, is very regular so that few completely unpredictable events happen.
• The t test and other statistical tests are most useful as a method for ranking collocations. The level of significance itself is less useful.


Lexicons and Lexical Analysis (276) Entropy (1/5)
• The entropy (or self-information) is the average uncertainty of a single random variable:
  $H(p) = H(X) = -\sum_{x \in \mathcal{X}} p(x) \log_2 p(x)$
Note: Let p(x) be the probability mass function of a random variable X over a discrete set of symbols (or alphabet) $\mathcal{X}$: $p(x) = P(X = x)$, $x \in \mathcal{X}$

Lexicons and Lexical Analysis (277) Entropy (2/5)
• Entropy measures the amount of information in a random variable. It is normally measured in bits (hence the log to the base 2), but using any other base yields only a linear scaling of results.
• For example, suppose you are reporting the result of rolling an 8-sided die. Then the entropy is:
  $H(X) = -\sum_{i=1}^{8} p(i) \log_2 p(i) = -\sum_{i=1}^{8} \frac{1}{8} \log_2 \frac{1}{8} = -\log_2 \frac{1}{8} = \log_2 8 = 3$ bits
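
The die example is easy to verify numerically. The function name `entropy` is hypothetical; it takes a list of probabilities and skips zero-probability outcomes, following the usual convention that they contribute nothing to the sum.

```python
from math import log2

def entropy(probs):
    """Shannon entropy in bits; zero-probability outcomes contribute nothing."""
    return -sum(p * log2(p) for p in probs if p > 0)

print(entropy([1 / 8] * 8))   # 3.0 bits for a fair 8-sided die
print(entropy([0.5, 0.5]))    # 1.0 bit for a fair coin
```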

Lexicons and Lexical Analysis (278) Entropy (3/5)
• The joint entropy of a pair of discrete random variables X, Y is the amount of information needed on average to specify both their values. It is defined as:
  $H(X, Y) = -\sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} p(x, y) \log p(x, y)$

Lexicons and Lexical Analysis (279) Entropy (4/5)
• The conditional entropy of a discrete random variable Y given another X, for X, Y ~ p(x, y), expresses how much extra information you still need to supply on average to communicate Y given that the other party knows X:
  $H(Y|X) = \sum_{x \in \mathcal{X}} p(x) H(Y|X = x) = \sum_{x \in \mathcal{X}} p(x) \left[ -\sum_{y \in \mathcal{Y}} p(y|x) \log p(y|x) \right] = -\sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} p(x, y) \log p(y|x)$

Lexicons and Lexical Analysis (281) Mutual Information (1/7)
• The difference H(X) − H(X|Y) is called the mutual information between X and Y: I(X; Y) = H(X) − H(X|Y) = H(Y) − H(Y|X)
• It is the reduction in uncertainty of one random variable due to knowing about another.
• In other words, it is the amount of information one random variable contains about another.

Lexicons and Lexical Analysis (282) Mutual Information (2/7)
[Figure: diagram relating H(X, Y), H(X), H(Y), H(X|Y), H(Y|X) and I(X; Y)]

Lexicons and Lexical Analysis (283) Mutual Information (3/7)
• Mutual information is a symmetric, non-negative measure of the common information in the two variables.
• People often think of mutual information as a measure of dependence between variables.
• However, it is actually better to think of it as a measure of independence because:

Lexicons and Lexical Analysis (284) Mutual Information (4/7)
  - It is 0 only when two variables are independent, but
  - For two dependent variables, mutual information grows not only with the degree of dependence, but also according to the entropy of the variables.
• $I(X; Y) = H(X) - H(X|Y) = H(X) + H(Y) - H(X, Y)$
  $= \sum_{x} p(x) \log \frac{1}{p(x)} + \sum_{y} p(y) \log \frac{1}{p(y)} + \sum_{x, y} p(x, y) \log p(x, y)$

Lexicons and Lexical Analysis (285) Mutual Information (5/7)
  $I(X; Y) = \sum_{x, y} p(x, y) \log \frac{p(x, y)}{p(x)\, p(y)}$
Since H(X|X) = 0, note that: H(X) = H(X) − H(X|X) = I(X; X)
• This illustrates both why entropy is also called self-information, and how the mutual information between two totally dependent variables is not constant but depends on their entropy.
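
Both properties, zero for independent variables and I(X; X) = H(X) for totally dependent ones, can be checked on small joint distributions. The function name `mutual_information` and the dictionary representation of the joint distribution are assumptions made for this sketch.

```python
from math import log2

def mutual_information(joint):
    """I(X; Y) in bits from a dict {(x, y): p(x, y)}."""
    px, py = {}, {}
    for (x, y), p in joint.items():          # marginals p(x) and p(y)
        px[x] = px.get(x, 0.0) + p
        py[y] = py.get(y, 0.0) + p
    return sum(p * log2(p / (px[x] * py[y]))
               for (x, y), p in joint.items() if p > 0)

independent = {(x, y): 0.25 for x in "ab" for y in "ab"}   # p(x, y) = p(x) p(y)
identical = {("a", "a"): 0.5, ("b", "b"): 0.5}             # Y is a copy of X, H(X) = 1 bit
print(mutual_information(independent))   # 0.0
print(mutual_information(identical))     # 1.0
```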

Lexicons and Lexical Analysis (286) Mutual Information (6/7)
• An information-theoretically motivated measure for discovering interesting collocations is pointwise mutual information (Church et al. (1991), Church & Hanks (1989) and Hindle (1990)).
• Fano (1961) originally defined mutual information between particular events x′ and y′; in our case the events are occurrences of particular words.

Lexicons and Lexical Analysis (287) Mutual Information (7/7)
  $I(x', y') = \log_2 \frac{P(x' y')}{P(x') P(y')}$
• This type of mutual information is roughly a measure of how much one word tells us about the other.

Lexicons and Lexical Analysis (287) About Definitions
• The definition of mutual information used here is common in corpus linguistic studies, but is less common in Information Theory.
• It is important to check what a mathematical concept is a formalization of.
• We will see that pointwise mutual information is of limited utility for acquiring these types of linguistic properties.

Lexicons and Lexical Analysis (288) Mutual Information Exp. (1/3)
• These two types of mutual information are quite different creatures.
• When we apply this definition to the 10 collocations from the previous table, we get the same ranking as with the t test. See the following table:

Lexicons and Lexical Analysis (289) Mutual Information Exp. (2/3)
[Table: mutual information scores for the ten bigrams; not reproduced in this text]

Lexicons and Lexical Analysis (290) Mutual Information Exp. (3/3)
• As usual, we use maximum likelihood estimates to compute the probabilities.
• The amount of information we have about the occurrence of Ayatollah at position i in the corpus increases by 18.38 bits if we are told that Ruhollah occurs at position i + 1.
• In other words, we can be much more certain that Ruhollah will occur next if we are told that Ayatollah is the current word.
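
Pointwise mutual information from maximum likelihood estimates can be sketched as follows. The function name `pmi` is hypothetical. The corpus size and the bigram count of 20 come from the text; the unigram counts 42 (Ayatollah) and 20 (Ruhollah) are NOT given in this text and are assumed here because they reproduce the quoted 18.38 bits.

```python
from math import log2

def pmi(c_xy, c_x, c_y, n):
    """Pointwise mutual information in bits from raw counts (maximum likelihood estimates)."""
    return log2((c_xy / n) / ((c_x / n) * (c_y / n)))

N = 14_307_668
# Assumed counts: the bigram occurs 20 times (as stated for the table); the unigram
# counts 42 and 20 are not given in the text and are chosen to match 18.38 bits.
score = pmi(c_xy=20, c_x=42, c_y=20, n=N)
print(round(score, 2))   # 18.38
```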

Lexicons and Lexical Analysis (291) Mutual Information Fails: χ2 Test (1/5)
• Unfortunately, this measure of “increased information” is in many cases not a good measure of what an interesting correspondence between two events is.
• Consider the two examples in the following table of counts of word correspondences between French and English sentences in the Hansard corpus, an aligned corpus of debates of the Canadian parliament.
• Let’s look at two French words:
  - chambre: room, house
  - communes: common

Lexicons and Lexical Analysis (292) Mutual Information Fails: χ2 Test (2/5)
• Mutual information gives a higher score to (communes, house).
Note: The χ2 test is Pearson’s chi-square test. The χ2 statistic sums the squared differences between observed and expected values in all cells of the table, scaled by the magnitude of the expected values.
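
For a 2×2 table of counts the statistic can be computed directly from its definition. The function name `chi_square_2x2` is hypothetical, and the counts in the usage line are made up for illustration; they are not the Hansard figures, which are not reproduced in this text.

```python
def chi_square_2x2(o11, o12, o21, o22):
    """Pearson's chi-square for a 2x2 table: sum over cells of (O - E)^2 / E."""
    observed = [[o11, o12], [o21, o22]]
    n = o11 + o12 + o21 + o22
    rows = [o11 + o12, o21 + o22]
    cols = [o11 + o21, o12 + o22]
    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            expected = rows[i] * cols[j] / n          # expected count under independence
            chi2 += (observed[i][j] - expected) ** 2 / expected
    return chi2

# Hypothetical counts for illustration only (not the Hansard figures).
print(round(chi_square_2x2(10, 20, 30, 40), 4))   # 0.7937
```

In the word-correspondence setting, the four cells would be the numbers of aligned sentence pairs that contain both words, only one of them, or neither.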

Lexicons and Lexical Analysis (293) Mutual Information Fails: χ2 Test (3/5)
• The reason that house frequently appears in translations of French sentences containing chambre and communes is that the most common use of house is the phrase House of Commons which corresponds to Chambre de communes in French.
• But it is easy to see that communes is a worse match for house than chambre since most occurrences of house occur without communes on the French side.

Lexicons and Lexical Analysis (294) Mutual Information Fails: χ2 Test (4/5)
• The χ2 test is able to infer the correct correspondence whereas mutual information gives preference to the incorrect pair (communes, house).

Lexicons and Lexical Analysis (295) Mutual Information Fails: χ2 Test (5/5)
• The higher mutual information value for communes reflects the fact that communes causes a larger decrease in uncertainty.
• In contrast, the χ2 is a direct test of probabilistic dependence, which in this context we can interpret as the degree of association between two words and hence as a measure of their quality as translation pairs and collocations.

Lexicons and Lexical Analysis (296) Frequency Matters (1/5) The table shows a second problem with using mutual information for finding collocations. (Table: statistics over corpora of different sizes.) 100

Lexicons and Lexical Analysis (297) Frequency Matters (2/5) q We show ten bigrams that occur exactly once in the first 1000 documents of the reference corpus and their mutual information score based on the 1000 documents. q The right half of the table shows the mutual information score based on the entire reference corpus (about 23,000 documents). 101

Lexicons and Lexical Analysis (298) Frequency Matters (3/5) q The larger corpus of 23,000 documents makes some better estimates possible, which in turn leads to a slightly better ranking. q The bigrams marijuana growing and new converts (arguably collocations) have moved up and Reds survived (definitely not a collocation) has moved down. 102

Lexicons and Lexical Analysis (299) Frequency Matters (4/5) q However, what is striking is that even after going to a corpus 10 times larger, 6 of the bigrams still occur only once. q As a consequence, they have inaccurate maximum likelihood estimates and artificially inflated mutual information scores. q None of the 6 is a collocation. 103

Lexicons and Lexical Analysis (300) Frequency Matters (5/5) q None of the measures works very well for low-frequency events. q But there is evidence that sparseness is a particularly difficult problem for mutual information. 104

Lexicons and Lexical Analysis (301) When Mutual Information Works (1/5) q Consider two extreme cases: perfect dependence and perfect independence of the occurrences of the two words. q For perfect dependence (they only occur together) we have: $I(x, y) = \log_2 \frac{P(x, y)}{P(x)P(y)} = \log_2 \frac{P(x)}{P(x)P(y)} = \log_2 \frac{1}{P(y)}$. That is, among perfectly dependent bigrams, as they get rarer, their mutual information increases. 105

Lexicons and Lexical Analysis (302) When Mutual Information Works (2/5) q For perfect independence (the occurrence of one does not give us any information about the occurrence of the other) we have: $I(x, y) = \log_2 \frac{P(x, y)}{P(x)P(y)} = \log_2 \frac{P(x)P(y)}{P(x)P(y)} = \log_2 1 = 0$. 106

Lexicons and Lexical Analysis (303) When Mutual Information Works (3/5) q We can say that mutual information is a good measure of independence. Values close to 0 indicate independence (independent of frequency). q But it is a bad measure of dependence because for dependence the score depends on the frequency of the individual words. 107
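
The two extreme cases can be checked numerically. This is a small sketch with made-up probabilities; the helper name `pmi_from_probs` is hypothetical. Under perfect dependence P(x, y) = P(x) = P(y), so the score equals log2(1 / P(y)) and grows as the words get rarer; under perfect independence P(x, y) = P(x) P(y), so the score is 0 whatever the frequencies are.

```python
import math

def pmi_from_probs(p_xy, p_x, p_y):
    """Pointwise mutual information in bits from probabilities."""
    return math.log2(p_xy / (p_x * p_y))

# Perfect dependence: the two words only ever occur together.
common = pmi_from_probs(0.1, 0.1, 0.1)
rare = pmi_from_probs(0.00001, 0.00001, 0.00001)

# Perfect independence: the joint probability is the product of the marginals.
independent = pmi_from_probs(0.01 * 0.02, 0.01, 0.02)

print(common, rare, independent)
```

The rare pair scores far higher than the common pair even though both are equally (perfectly) dependent, which is exactly why the score is a poor measure of dependence.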

Lexicons and Lexical Analysis (304) When Mutual Information Works (4/5) q Other things being equal, bigrams composed of low-frequency words will receive a higher score than bigrams composed of high-frequency words. q That is the opposite of what we would want a good measure to do, since higher frequency means more evidence and we would prefer a higher rank for bigrams for whose interestingness we have more evidence. 108

Lexicons and Lexical Analysis (305) When Mutual Information Works (5/5) q One solution that has been proposed for this is to use a cutoff and to look only at words with a frequency of at least 3. However, such a move does not solve the underlying problem; it only ameliorates its effects. q Since pointwise mutual information does not capture the intuitive notion of an interesting collocation very well, it often goes unused even in practical applications that make it available. 109
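
The cutoff idea can be sketched as follows. This is an illustrative sketch, not a reference implementation: the function name `rank_by_pmi`, the toy token list, and the choice to normalize both unigram and bigram counts by the number of tokens are all assumptions made for brevity. The cutoff is applied to the individual words, as stated above; applying it to the bigram count instead is a possible variant.

```python
import math
from collections import Counter

def rank_by_pmi(tokens, min_count=3):
    """Rank adjacent word pairs by pointwise mutual information.

    Bigrams containing a word seen fewer than min_count times are skipped.
    """
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n = len(tokens)
    scored = []
    for (w1, w2), c in bigrams.items():
        if unigrams[w1] < min_count or unigrams[w2] < min_count:
            continue  # too little evidence for a reliable estimate
        score = math.log2((c / n) / ((unigrams[w1] / n) * (unigrams[w2] / n)))
        scored.append(((w1, w2), score))
    return sorted(scored, key=lambda item: item[1], reverse=True)

# Toy corpus (hypothetical): one frequent pair and one pair seen only once.
tokens = ("new york " * 3 + "rare pair").split()
no_cutoff = rank_by_pmi(tokens, min_count=1)
with_cutoff = rank_by_pmi(tokens, min_count=3)
print(no_cutoff[0][0], with_cutoff[0][0])
```

Without the cutoff the pair seen once ranks first; with it, the pair is simply hidden from the ranking. The bias toward low-frequency events among the remaining bigrams is unchanged, which is why the cutoff only ameliorates the problem.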

Lexicons and Lexical Analysis (307) Summary (1/5) q There are actually different definitions of the notion of collocation. q For instance: a sequence of two or more consecutive words that has the characteristics of a syntactic and semantic unit, and whose exact and unambiguous meaning or connotation cannot be derived directly from the meaning or connotation of its components (Choueka, 1988). 110

Lexicons and Lexical Analysis (308) Summary (2/5) q The following criteria are typical of linguistic treatments of collocations. Non-compositionality is the main one we have relied on here. Ø Non-compositionality. The meaning of a collocation is not a straightforward composition of the meanings of its parts. Either the meaning is completely different from the free combination (such as idioms) or there is a connotation or added element of meaning that cannot be predicted from the parts. 111

Lexicons and Lexical Analysis (309) Summary (3/5) Ø Non-substitutability. We cannot substitute near-synonyms for the components of a collocation. For example, we can’t say yellow wine instead of white wine even though yellow is as good a description of the color of white wine as white is (it is kind of a yellowish white). 112

Lexicons and Lexical Analysis (310) Summary (4/5) Ø Non-modifiability. Many collocations cannot be freely modified with additional lexical material or through grammatical transformations. This is especially true for frozen expressions like idioms. For example, we can’t modify frog in "to get a frog in one’s throat" (to have a hoarse, uncomfortable throat) to produce "to get an ugly frog in one’s throat", although usually nouns like frog can be modified by adjectives like ugly. 113

Lexicons and Lexical Analysis (311) Summary (5/5) q A nice way to test whether a combination is a collocation is to translate it into another language. q If we cannot translate the combination word by word, then that is evidence that we are dealing with a collocation. For example, translating make a decision into French one word at a time gives faire une décision, which is incorrect; the correct translation is prendre une décision. 114

Lexicons and Lexical Analysis (312) References Ø K. W. Church and P. Hanks. 1990. Word Association Norms, Mutual Information and Lexicography. Computational Linguistics, Vol. 16, No. 1. Ø T. Fontenelle et al. 1994. Survey of Collocation Extraction Tools. Technical Report, University of Liege, Belgium. Ø J. Hodges et al. 1996. An Automated System that Assists in the Generation of Document Indexes. Natural Language Engineering No. 2. 115