Number of slides: 24
Automatic Term Weighting, Lexical Statistics and … Quantitative Terminology

Kyo Kageura
National Institute of Informatics
July 05, 2003
Project
- To rescue/recover the sphere of lexicology
- To release the richness and productivity of lexico-conceptual sets from the dominance of discourse…
  - while maintaining the traceable procedure in the process of doing this
  - and starting from textual corpora
Contents
- Sphere of texts and sphere of lexicon/lexicology
- Three (representative) methods of automatic term weighting and their meanings
- From corpus-based lexical statistics to (still) corpus-based quantitative lexicology
- Measuring lexical productivity in the lexicon (i.e. the lexicological concept of productivity) from textual data, with some experiments
- Conclusions
Textual Sphere and Lexicological Sphere

[Figure: diagram of the textual sphere, containing terms and complex terms ("this exists"), set against the lexicological sphere (lexicology, quantitative lexicology); so what about talking about lexicology when talking about corpus-based…]
Lexicological Sphere and Texts
- Lexicology deals with the actual set of words
  - which does not mean it is natural history
- A lexicological model with expectations addresses the "realistic possibility of existence," not permissible forms or a fantasy land
  - thus actual data is required
  - and the primary language data is texts
- Thus the recovery of lexicological characteristics becomes the task of lexicology
Automatic Term Weighting (ATW)
- Reviewing some representative ATW methods gives important insights into the current topic
  - while at the same time giving insights into the ATWs themselves
- We look at:
  - tf-idf (and its information-theoretic interpretation by Aizawa)
  - Term representativeness (by Hisamitsu)
  - Lexical measure (by Nakagawa)
  - an ordering which goes from texts to lexicology, almost.
ATW 1: tf-idf

[Matrix: documents d1 … dD by terms t1 … tT, with cell values fij]

- tf-idf and many other similar measures, in fact most of those used in IR, are based on the document-term matrix, which has a formal duality. Thus the weight of a term is always and only meaningful vis-à-vis the given set of documents or its population (df·itf thus also makes sense, as in the probabilistic model).
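The tf-idf weighting can be sketched in a few lines. This is a minimal illustration over invented toy data (the corpus and function name are not from the talk), using the plain tf × log(D/df) form rather than Aizawa's information-theoretic reformulation:

```python
import math

# Toy corpus: each document is a list of term tokens (illustrative data only).
docs = [
    ["theory", "physics", "theory"],
    ["physics", "education"],
    ["education", "psychology", "theory"],
]

def tfidf(term, doc, docs):
    """Weight of `term` in `doc`: raw term frequency times log(D / df),
    where df is the number of documents containing the term."""
    tf = doc.count(term)
    df = sum(1 for d in docs if term in d)
    return tf * math.log(len(docs) / df)

# "theory" occurs twice in docs[0] and appears in 2 of the 3 documents,
# so its weight there is 2 * log(3/2).
print(tfidf("theory", docs[0], docs))
```

Note how the weight depends on the whole document set through df: this is exactly the duality point, the weight is only meaningful vis-à-vis the given documents.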
ATW 2: Term representativeness
- You shall know the meaning of a word by the company it keeps (or: see the friends to know a person, if there are any, anyway)
- To calculate the weight of a term ti, take the distribution of words that accompany ti within a certain window size, and calculate the distance between this and the distribution of a random chunk of the same window size (NB: size normalisation is necessary due to the LNRE nature of language data).
ATW 2: Term representativeness
- This method discards the factor of dominant or minor discourse at the level of observed texts (or: does no favour to people who randomly buy friends with money).
- This method calculates what the term ti, if it appears at all, can attract at the level of discourse (depending on the nature of the window the method takes, of course).
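The comparison step can be approximated in spirit as follows: contrast the word distribution in windows around the focal term with the distribution of a random chunk of the same size. This sketch uses KL divergence as the distance and invented toy data; Hisamitsu's actual measure and its baseline normalisation differ in detail:

```python
import math
from collections import Counter

def distribution(tokens):
    """Relative-frequency distribution over a list of word tokens."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

def kl_divergence(p, q, eps=1e-9):
    """KL(p || q), smoothing words unseen in q with a tiny epsilon."""
    return sum(pw * math.log(pw / q.get(w, eps)) for w, pw in p.items())

# Invented example: words co-occurring with the focal term in its windows,
# vs. words drawn from a random chunk of the same total size.
context_of_term = ["inference", "logic", "proof", "logic"]
random_chunk    = ["the", "logic", "data", "proof"]

score = kl_divergence(distribution(context_of_term), distribution(random_chunk))
# A larger distance from the random baseline suggests a more
# "representative" (topically committed) term.
```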
ATW 3: Nakagawa's method
- Observe the number of different elements (element types) that accompany ti within complex lexical units in texts.
- This therefore reflects the lexical productivity of the focal element ti, but together with the degree of its use in discourse (texts).
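The counting step can be sketched as follows (toy data and names invented; Nakagawa's full measure goes on to combine such counts with frequency):

```python
# Toy list of complex terms observed in running text, each given as a
# (modifier, head) pair; data invented for illustration.
complex_terms = [
    ("information", "system"), ("expert", "system"),
    ("database", "system"), ("information", "theory"),
]

def companion_types(element, terms):
    """Distinct element types accompanying `element` inside complex terms:
    (modifier types when it is the head, head types when it is a modifier)."""
    as_head     = {m for (m, h) in terms if h == element}
    as_modifier = {h for (m, h) in terms if m == element}
    return len(as_head), len(as_modifier)

# "system" is headed by three different modifiers; "information" modifies
# two different heads.
print(companion_types("system", complex_terms))
print(companion_types("information", complex_terms))
```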
ATW to Quantitative Lexicology
- To characterise the lexicological nature of elements from their occurrence in texts:
  - As in Hisamitsu's term representativeness, the "discourse size" factor should be reduced, but more essentially;
  - As in Nakagawa's method, the point of observation should be limited to complex terms (or to those which are supposed to be, or can be, registered in the lexicon/lexicological sphere).
A Quantitative Terminological Study
- Aim: to recover the productivity of the constituent elements of simplex and complex terms as heads.
- Observe, like Nakagawa, the window range of simplex and complex terms in texts, e.g.:

  ＜理論/物理＞と＜教育/心理＞は＜観察＞できる＜範囲＞では似通っており、＜計算/機/科学＞は＜理論/物理＞より高い＜値＞で＜推移＞している。

  ("＜Theoretical/physics＞ and ＜educational/psychology＞ are similar within the observable ＜range＞, while ＜computation/machine/science＞ stays at a higher ＜value＞ than ＜theoretical/physics＞"; the brackets mark simplex and complex terms.)
Some preconditions/assumptions
- The corpus and the target terminological space should:
  - belong to and represent the same domain
  - cover the same period of time
  - in general match qualitatively
- We are concerned with defining a measure which can compare the "productivity" of elements in the same lexicological/terminological sphere.
Definition of measures (a)
- f(i, N): frequency of ti in a text of size N
  - This is the extent of use in discourse; it has nothing to do with lexicological productivity
- d(i, N): number of different complex words whose head is ti in a text of size N
  - the first manifestation of lexicological productivity
  - basically identical to Nakagawa (2000)
  - thus this is the point of departure
Definition of measures (b)
- d(i, N) is the manifestation of the productivity of ti as it occurs in the corpus
- d(i, N) is sensitive to the extent of use of the focal element in the textual corpus; e.g. the following can be the case:

               X = N    X = 2N
     d(i, X)     500       600
     d(j, X)     400       800
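Both measures can be computed directly from a stream of term occurrences. A minimal sketch with invented data (each token is a head paired with its modifier, or None when the element occurs as a simplex term):

```python
# Invented stream of term occurrences: (head, modifier-or-None).
tokens = [
    ("system", "information"), ("system", None), ("system", "expert"),
    ("model", "statistical"), ("system", "information"), ("model", None),
]

def f(element, tokens):
    """f(i, N): token frequency of the element in the text."""
    return sum(1 for (h, m) in tokens if h == element)

def d(element, tokens):
    """d(i, N): number of *different* complex words headed by the element."""
    return len({(h, m) for (h, m) in tokens if h == element and m is not None})

# "system" occurs four times but heads only two distinct complex words,
# since ("system", "information") is repeated.
print(f("system", tokens), d("system", tokens))
```

The contrast between f and d is the point of the slide: d counts types, so repeated use in discourse inflates f without necessarily inflating d.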
Definition of measures (c)
- A better measure for manifested productivity: the overall transition pattern of d(i, λN), where λ takes a positive real value (à la Hisamitsu).
- The measure for potential productivity: d(i) = d(i, λN) as λ → ∞, discarding all the quantitative factors.
- This can be computed by LNRE models.
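The transition pattern d(i, λN) for λ ≤ 1 can be observed empirically by truncating the text; extrapolation beyond the corpus (λ > 1, and the limit d(i)) is what requires the LNRE models. A sketch of the empirical part only, with invented data:

```python
import random

# Invented stream of (head, modifier) term tokens.
random.seed(0)
modifiers = ["info", "expert", "data", "web", "meta", "proto"]
stream = [("system", random.choice(modifiers)) for _ in range(200)]

def d_growth(element, stream, lambdas=(0.25, 0.5, 0.75, 1.0)):
    """Empirical d(i, lambda*N): distinct complex-term types headed by
    `element` in growing prefixes of the text."""
    curve = []
    for lam in lambdas:
        prefix = stream[: int(lam * len(stream))]
        curve.append(len({t for t in prefix if t[0] == element}))
    return curve

curve = d_growth("system", stream)
# The curve is non-decreasing; how fast it is still growing at lambda = 1
# indicates how much productivity remains unobserved.
```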
The measures and probability distributions
- Three distributions:
  1) The occurrence probability of heads in the theoretical lexicological space.
  2) The occurrence probability of modifiers for each head.
  3) The probability of use of the head in the text.
- Relations:
  - f(i, N) ⇔ 3)
  - d(i) ⇔ 1)
  - d(i, N) ⇔ 2), 3)
Experiments (1/5)
- Artificial intelligence abstracts in Japanese

     #Abst    #Token (Smp/Cmp)    #Type (Smp/Cmp)
     1816     299846 / 230708     8764 / 23243

- Four elements are observed: "system" and "model" (general), and "knowledge" and "information" (specific).
Experiments (2/5)

                   f(i, N)   f_smp(i, N)   f_cmp(i, N)   d(i, N)
     system           1970           723          1247       502
     model            1015           328           687       263
     knowledge        1191           748           443       137
     information       637           369           268       155

     (f_smp: occurrences as a simplex term; f_cmp: occurrences within complex terms)
Experiments (4/5)

                   LNRE model   p-value    MSE              d(i)
     system        GIGP            0.96   2.19   273,402,688,337
     model         IGP             0.47   2.88     3,676,671,255
     knowledge     Log.N           0.88   2.72               689
     information   IGP             0.84   2.32               667
Experiments (5/5)

     f(i, N):    S > K > M > I
     d(i, λN):   S > M > I > K
     d(i):       S > M > K > I

     (S = system, M = model, K = knowledge, I = information)

- General elements, such as "system" or "model," have high lexicological productivity, while subject-specific elements, such as "knowledge" or "information," have rather low productivity.
Summary
- Starting from the observation of ATW methods and going on to examine a corpus-based quantitative terminological study, we:
  - clarified the position of lexicology/lexicon
  - clarified the basic framework of quantitative lexicology/terminology, with relevant measures
  - gave some corresponding distributions
  - gave the framework for interpreting the measures
  - carried out experiments…
Remaining problems
- The concepts of "lexicologisation" and "word"
  - to be registered in the lexicon
  - to be consolidated as a lexical unit within the syntagmatic stream of language manifestations
- The distribution of complex words in texts and the word unit
  - "reference+head" vs. "modifier+head"
- The former is related to an essential concept(ualisation) of lexicon/lexicology…