
Sentence Semantic Distance and Novelty Detection
Hua-Ping Zhang, zhanghp@software.ict.ac.cn
LCC Group, Software Division, Inst. of Computing Tech., CAS
2003-10-20
Inst. of Computing Tech., CAS
Happy birthday to Wang Shu-Xi! Happy birthday to Yu Man-Quan!
Outline
• Motivation
• Semantic distance computation
• Overview of WordNet
• WordNet-based sentence semantic distance computation
• Novelty detection using sentence semantic distance
• Experiments and analysis
• Problems and future work
• Conclusion
Motivation
• Introduction to the Novelty Track in TREC: The Novelty Track is designed to investigate systems' abilities to locate relevant AND new information within a set of documents relevant to a TREC topic. Systems are given the topic and a set of relevant documents ordered by date, and must identify sentences containing relevant and/or new information in those documents.
Motivation II
• Topic (the original slide showed an example TREC topic)
Motivation III
• Contrast between a sentence and a document
– Content: 6-12 words on average vs. over 30 sentences
– Overlap: very little vs. key words occurring frequently
– Information (after stemming and stop-word removal): 4 or 5 (word, frequency) pairs with frequencies mostly 1, vs. dozens of (word, frequency) pairs
Motivation IV
• Sample
– Event:
1) However, the Scottish team was the first to make a clone from an adult animal cell.
2) The seminar was held in the context of a recently reported sheep cloning case in Britain.
– Opinion:
1) Daily we read news stories about dissatisfaction with managed care, Medicare fraud and overbilling.
2) Eighty percent agreed with this.
• Conclusion: Considering words alone falls far short of the requirement; we must expand a sentence as much as possible.
Semantic distance computation
• Semantic distance or similarity computation
• Previous work focuses on word-word semantic distance
– Corpus-based: co-occurrence distribution [Church and Hanks 1989; Grefenstette 1992]; knowledge-free
– Ontology-based: a taxonomy or other pseudo-knowledge: WordNet, HowNet, TongYiCi CiLin [Liu Qun (2002), Wang Bin (1999), Li Su-Jian (2002), Che Wan-Xiang (2003)]
– Hybrid [Jay J. Jiang 1997]
Semantic distance computation II
• Corpus-based approach
– Select a group of words as features and train a feature vector for each word; then compute similarity via vector operations (e.g. cosine)
– Assumption: similar words <-> similar contexts
– Lu Song (1999), Dagan (1999)
– "Invariable point" [Bai Shuo 1995] for word clustering
• [Zhang San] {eats} <rice>
• [Li Si] {wears} <clothes>
• One word, one category ____?____ All words, one category
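The corpus-based approach above can be sketched as follows: build a co-occurrence vector for each word and compare vectors with cosine similarity. This is a minimal illustration; the toy corpus, the window size, and the choice of raw counts as features are assumptions for the sketch, not details from the talk.

```python
import math
from collections import Counter

def cooccurrence_vector(target, corpus, window=2):
    """Count words appearing within +/-window of each occurrence of target."""
    vec = Counter()
    for sent in corpus:
        for i, w in enumerate(sent):
            if w == target:
                for j in range(max(0, i - window), min(len(sent), i + window + 1)):
                    if j != i:
                        vec[sent[j]] += 1
    return vec

def cosine(u, v):
    """Cosine similarity of two sparse count vectors."""
    dot = sum(u[k] * v.get(k, 0) for k in u)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

# toy corpus: "cat" and "dog" share contexts, so they come out similar
corpus = [
    "the cat sat on the mat".split(),
    "the dog sat on the rug".split(),
    "the cat chased the dog".split(),
]
sim = cosine(cooccurrence_vector("cat", corpus), cooccurrence_vector("dog", corpus))
```

On this toy corpus the two vectors overlap heavily, so the similarity is close to 1, which is exactly the "similar words <-> similar contexts" assumption at work.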
Semantic distance computation III
• Liu Qun and Li Su-Jian (2002), using HowNet: for two Chinese words W1 and W2, where W1 has n senses (concepts) S11, S12, ..., S1n and W2 has m senses S21, S22, ..., S2m, the similarity of W1 and W2 is defined as the maximum of the similarities between their concepts, i.e. Sim(W1, W2) = max over i, j of Sim(S1i, S2j). For two primitives, Sim(p1, p2) = α / (d + α)    (3)
where p1 and p2 are primitives, d is the path length between p1 and p2 in the primitive hierarchy (a positive integer), and α is a tunable parameter.
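The HowNet-style measure above can be sketched in a few lines: primitive similarity α/(d + α), and word similarity as the maximum over all sense pairs. The sense names, path lengths, and the setting α = 1.6 are toy assumptions for illustration only.

```python
def primitive_sim(d, alpha=1.6):
    # Sim(p1, p2) = alpha / (d + alpha), where d is the path length
    # between the two primitives in the primitive hierarchy.
    return alpha / (d + alpha)

def word_sim(senses1, senses2, path_len, alpha=1.6):
    # Word-level similarity: the maximum similarity over all sense pairs,
    # as in the slide's definition.
    return max(primitive_sim(path_len(s1, s2), alpha)
               for s1 in senses1 for s2 in senses2)

# toy sense inventory and hierarchy path lengths (illustrative only)
toy_paths = {("fruit_a", "fruit_b"): 2, ("fruit_a", "tool_x"): 8,
             ("plant_a", "fruit_b"): 3, ("plant_a", "tool_x"): 9}

def path_len(s1, s2):
    return toy_paths[(s1, s2)]

sim = word_sim(["fruit_a", "plant_a"], ["fruit_b", "tool_x"], path_len)
```

Taking the maximum over sense pairs means two polysemous words count as similar whenever any of their senses are close, which matches the definition quoted from the paper.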
Semantic distance computation IV
• Wang Bin, using TongYiCi CiLin (the original slide showed the CiLin category tree, with levels O; A-L; a-l; 01-...; dotted lines mark the path from an upper-level node to a lower-level node)
Semantic distance computation V
• Che Wan-Xiang
– Edit distance: deletion, insertion, substitution (costs estimated from HowNet and TongYiCi CiLin)
(a) The edit distance between "爱吃苹果" (love eating apples) and "喜欢吃香蕉" (like eating bananas) is 4, as shown by the four dotted lines in the original slide;
(b) the improved edit distance between the two phrases is 1.1, where the cost of "爱" -> "喜欢" (love -> like) is 0.5 and the cost of "苹果" -> "香蕉" (apple -> banana) is [cut off in the original]
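The "improved" edit distance above can be sketched as a standard dynamic-programming edit distance whose substitution cost comes from a semantic resource instead of a flat 0/1. The cost table below is a toy stand-in for HowNet/CiLin estimates; the 0.6 for apple -> banana is an assumption chosen so the toy total matches the slide's 1.1, since the slide's own value is cut off.

```python
def weighted_edit_distance(a, b, sub_cost):
    """Edit distance with unit insert/delete cost and a semantic substitution cost."""
    m, n = len(a), len(b)
    d = [[0.0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = float(i)
    for j in range(n + 1):
        d[0][j] = float(j)
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            d[i][j] = min(d[i - 1][j] + 1,                      # delete a[i-1]
                          d[i][j - 1] + 1,                      # insert b[j-1]
                          d[i - 1][j - 1] + sub_cost(a[i - 1], b[j - 1]))
    return d[m][n]

# toy semantic substitution costs (in the paper they are estimated
# from HowNet / TongYiCi CiLin; these numbers are illustrative)
costs = {("love", "like"): 0.5, ("apple", "banana"): 0.6}

def sub_cost(x, y):
    if x == y:
        return 0.0
    return costs.get((x, y), costs.get((y, x), 1.0))

dist = weighted_edit_distance(["love", "eat", "apple"],
                              ["like", "eat", "banana"], sub_cost)
```

With these toy costs the cheapest alignment substitutes love -> like (0.5) and apple -> banana (0.6) and keeps "eat", giving 1.1 rather than the plain edit distance of 2.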
Semantic distance computation VI
• Ontology-based: simple but effective and understandable; however, subjective
• Corpus-based: ignores the inherent semantic knowledge; practical for applications in parsing, semantics, or language usage
• Hybrid: uses the taxonomy as its frame and uses a corpus to compute edge values instead of setting them arbitrarily
Overview of WordNet
• URL: http://www.cogsci.princeton.edu/~wn/
• Developed by the psycholinguistics laboratory at Princeton University
– Originally a psycholinguistic model of human lexical memory
– Now widely used in natural language processing
• A free online lexical database
• Corresponding versions have been developed for many languages
– European languages: EuroWordNet
– Chinese: CCD (Chinese Concept Dictionary)
WordNet II
• Synset
– A concept is represented by a set of synonyms (a synset)
– Each concept has a descriptive gloss
• Relations
– Hypernym/hyponym relations (hyponymy, troponymy)
– Synonymy and antonymy
– Part-whole and entailment relations (entailment, meronymy)
– ...
WordNet III
Organization of noun concepts (shown as a hierarchy diagram in the original slide)
WordNet IV
Organization of adjective concepts (shown as a diagram in the original slide)
WordNet V
• Covers only open-class words: adj, n, v, adv
• Scale: see attachment
• Data format: see attachment
• API: how to use it? wninit(); (word, POS) [getindex(char *, int)]; (synset offset) [read_synset(int, long, char *)] -> tagged frequency
– e.g. be%2:42:03:: 1 10720 (sense_key, sense_number, tag_cnt)
WordNet-based sentence semantic distance
• Information content
– P(c) vs. 1/P(c)
– IC(c) = -log P(c)   ("A dog bites a man" vs. "A man bites a dog")
– Entropy: H = ∑ P(c) · IC(c)
• Information content of a synset: IC(c) = -log P(c), where P(c) = freq(c)/N and freq(c) = ∑ freq(w) over all words w in c, or in any c* that is a descendant of c
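The information-content definition above can be sketched directly: propagate word counts up the taxonomy, then take IC(c) = -log(freq(c)/N). The taxonomy and the counts below are toy assumptions, not WordNet data.

```python
import math

# toy taxonomy: child -> parent (None marks the root)
parent = {"dog": "canine", "wolf": "canine", "canine": "animal",
          "cat": "animal", "animal": None}

# toy corpus word counts
word_freq = {"dog": 40, "wolf": 5, "cat": 30, "canine": 3, "animal": 2}
N = sum(word_freq.values())

def freq(c):
    """freq(c) = sum of freq(w) for w in c or any descendant of c."""
    total = word_freq.get(c, 0)
    for child, p in parent.items():
        if p == c:
            total += freq(child)
    return total

def information_content(c):
    """IC(c) = -log P(c), with P(c) = freq(c)/N."""
    return -math.log(freq(c) / N)
```

The root subsumes every count, so P(root) = 1 and IC(root) = 0, while rare, specific concepts such as "wolf" get high information content; this is the property the edge weights on the next slide build on.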
WordNet-based sentence semantic distance II
• Edge value: Wt(c, p) = (β + (1-β)·Ē/E(p)) · (1 + 1/d(p))^α · [IC(c) - IC(p)] · T(c, p)
where c is the child, p the parent, E the density, d(p) the depth of p, and T(c, p) the link type
• Focusing on IC only, this reduces to Wt(c, p) = IC(c) - IC(p)
• [Jay J. Jiang, 1997] Dist(w1, w2) = IC(c1) + IC(c2) - 2·IC(LSuper(c1, c2))
– LSuper: lowest super-ordinate of c1 and c2
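The IC-based distance on this slide can be sketched as follows: find the lowest super-ordinate by walking up the taxonomy, then apply Dist = IC(c1) + IC(c2) - 2·IC(LSuper). The taxonomy and the IC values are toy assumptions (IC increasing with specificity, as the previous slide requires).

```python
# toy taxonomy: child -> parent (None marks the root)
parent = {"dog": "canine", "wolf": "canine", "canine": "animal",
          "cat": "feline", "feline": "animal", "animal": None}

# toy information-content values, larger for more specific concepts
ic = {"animal": 0.0, "canine": 1.2, "feline": 1.5,
      "dog": 2.3, "wolf": 3.0, "cat": 2.1}

def ancestors(c):
    """Chain from c up to the root, including c itself."""
    chain = []
    while c is not None:
        chain.append(c)
        c = parent[c]
    return chain

def lsuper(c1, c2):
    """Lowest super-ordinate: the first ancestor of c1 that also covers c2."""
    anc2 = set(ancestors(c2))
    for a in ancestors(c1):  # ordered from c1 upward, so the first hit is lowest
        if a in anc2:
            return a

def jc_distance(c1, c2):
    """Dist(c1, c2) = IC(c1) + IC(c2) - 2 * IC(LSuper(c1, c2))."""
    return ic[c1] + ic[c2] - 2 * ic[lsuper(c1, c2)]
```

As expected, dog and wolf (sharing the specific ancestor "canine") come out closer than dog and cat, whose only common ancestor is the root.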
WordNet-based sentence semantic distance III
• Focusing only on hypernym links is not enough.
• We introduce more relations, such as:
– VERBGROUP ($): develop 6, acquire 5, evolve
– Similar (&): e.g. the raw entry for "aborning": & 00003552 a 0000, & 00003671 a 0000, ! 00003777 a 0101
– Derivation: adj -> adv; v <-> n; v -> adj
– Noun-noun: ISMEMBERPTR, ISSTUFFPTR, ISPARTPTR, HASMEMBERPTR, HASSTUFFPTR, HASPARTPTR
– Example: friend -> friendly -> friendliness
Inst. Of Computing Tech, CAS Word. Net-based Sentence semantic distance IV • Word-Sentence Semantic Distance (WSSD) WSSD(W, S)=min {WWSD(W, wi)| wi ∈S } where W is a word, S is a sentence and wi is a word in sentence S. • Sentence-Sentence Semantic Distance (SSSD)
WordNet-based sentence semantic distance V
(The original slide showed a diagram: words w1...w6 of sentence A and words W1...W4 of sentence B, with arrows for the semantic distance from each word wi to sentence B and from each word Wi to sentence A; the sentence semantic distance combines these word-sentence distances.)
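The WSSD/SSSD construction above can be sketched as follows. WSSD follows the slide's min-definition exactly; for SSSD the diagram does not fix the exact combination rule, so averaging the word-to-sentence distances in both directions is one plausible symmetric choice, and the 0/1 word-word distance is a toy stand-in for the WordNet-based WWSD.

```python
def wssd(word, sentence, wwsd):
    """Word-sentence distance: minimum word-word distance to any word of the sentence."""
    return min(wwsd(word, w) for w in sentence)

def sssd(sent_a, sent_b, wwsd):
    """Sentence-sentence distance: average the word-to-sentence distances
    in both directions and combine them symmetrically (one plausible rule;
    the slide's diagram does not pin down the exact combination)."""
    a_to_b = sum(wssd(w, sent_b, wwsd) for w in sent_a) / len(sent_a)
    b_to_a = sum(wssd(w, sent_a, wwsd) for w in sent_b) / len(sent_b)
    return (a_to_b + b_to_a) / 2

# toy word-word distance: 0 for identical words, 1 otherwise
def toy_wwsd(w1, w2):
    return 0.0 if w1 == w2 else 1.0

d_same = sssd(["a", "b"], ["a", "b"], toy_wwsd)
d_diff = sssd(["a", "b"], ["c", "d"], toy_wwsd)
```

Averaging both directions keeps the measure symmetric even when the sentences have different lengths, which the one-sided min-sums alone would not guarantee.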
Novelty detection using sentence semantic distance
• What factors determine whether a sentence S is novel (new)?
– Semantic distance between S and the topic T: less is better
– Semantic distance between S and the previously accepted content: more is better
– Word overlap with T: more is better
– Word overlap with previous sentences: less is better
– Is S a paragraph head? A head sentence is more likely to be new.
– ...
Novelty detection using sentence semantic distance II
• How to link the various factors to a decision?
– A binary decision: new or not.
– Each factor is treated as a dimension, together forming a feature (factor) vector; different dimensions carry different weights.
• The problem turns into binary categorization: assign the factor vector of a relevant sentence S to one of two categories, new or not new.
• Similar approaches can be applied to relevance detection.
Novelty detection using sentence semantic distance III
• Train on known results using the winnow algorithm.
– Factors should share the same direction and range: 0 -> 1, ascending
– N2 XIE19970228.0169:05  0.430 1.00 0.174 1.00 1  ->  1
– N2 XIE19970228.0169:06  0.364 0.52 0.143 0.81 1  ->  0
– N2 XIE19970302.0039:04  0.5978 1.0 0.20 1.0 1  ->  ?
– Decision rule: ∑ Wi·Fi > N => new; otherwise not new
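The training scheme above can be sketched as follows. Classic winnow uses binary features and multiplicative weight updates; since the factors here lie in [0, 1], this sketch scales each update exponent by the feature value, which is one plausible adaptation rather than necessarily the exact variant used in the talk. The two labeled rows from the slide serve as the toy training set; alpha = 2 and the threshold theta = 1 are illustrative settings.

```python
def train_winnow(samples, n_features, alpha=2.0, theta=1.0, epochs=10):
    """Winnow sketch: weights start at 1; on a mistake, each weight is
    promoted (x alpha**f) for a missed positive or demoted (x alpha**-f)
    for a false positive, with f the feature's value in [0, 1]."""
    w = [1.0] * n_features
    for _ in range(epochs):
        for feats, label in samples:
            pred = 1 if sum(wi * fi for wi, fi in zip(w, feats)) > theta else 0
            if pred != label:
                factor = alpha if label == 1 else 1.0 / alpha
                w = [wi * (factor ** fi) for wi, fi in zip(w, feats)]
    return w

def classify(w, feats, theta=1.0):
    """Decision rule from the slide: sum(Wi * Fi) > threshold => new."""
    return 1 if sum(wi * fi for wi, fi in zip(w, feats)) > theta else 0

# the two labeled training rows from the slide (five factors -> new?)
samples = [
    ([0.430, 1.00, 0.174, 1.00, 1.0], 1),
    ([0.364, 0.52, 0.143, 0.81, 1.0], 0),
]
w = train_winnow(samples, n_features=5)

# the slide's unlabeled third row, scored with the learned weights
decision = classify(w, [0.5978, 1.0, 0.20, 1.0, 1.0])
```

The multiplicative updates make winnow robust when many factors turn out to be irrelevant, since useless dimensions are driven toward zero weight quickly.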
Experiments and analysis
• ICT03NOV4OTP: word overlap with the topic and the previous valid context
– Averages over 50 topics: precision 0.59; recall 0.70; F 0.610
– Best: N5  209 227 200  0.88 0.96 0.917
– Worst: N49  50 5 0  0.000
– Not bad for such simple information.
Experiments and analysis II
• ICT03NOV4LFF: word overlap + semantic similarity with the topic and the previous valid context
– Averages over 50 topics: precision 0.59; recall 0.64; F 0.568
– Best: N5  209 224 197  0.88 0.94 0.910
– Worst: N46  93 2 0  0.000
– A slight reduction after introducing semantic distance.
Experiments and analysis III
• ICT03NOV4ALL: word overlap + semantic similarity with the topic and the previous valid context + head-sentence feature
– Averages over 50 topics: precision 0.60; recall 0.68; F 0.598
– Best: N5  209 224 197  0.88 0.94 0.910
– Worst: N46  93 2 0  0.000
– Improvement after adding the head-sentence information.
Problems and future work
• Corpus-based semantic computation depends heavily on the corpus; data sparseness is the bottleneck, especially since we still lack a large-scale semantic corpus.
• The semantic computation model should unify taxonomy relations with other relations.
• The winnow training algorithm has deficiencies in modeling (factors, decision).
Problems and future work II
• Some modifications could be applied to the semantic computation.
• Within the current approach, several components could be optimized: 1) extend the semantic distance between a selected sentence and its previous one to the whole valid context; 2) winnow initialization; 3) the winnow update step; 4) KNN or another classifier to replace winnow.
Problems and future work III
• Synset-based VSM or synset overlap to expand the sentence, since overlap/VSM has proved simple but effective.
Conclusion
• Semantic distance computation using WordNet is effective for many applications, such as disambiguation and bilingual word alignment; it can compare words whose surface forms differ.
• Factors and decisions can be converted into a categorization problem.
• Relevance and novelty detection seem promising though difficult; the work is still at an early stage, and more approaches should be tried. The best results determine the best approach; good ideas should be validated against final results.
Acknowledgements
• Dr. Jian Sun, for instructive discussion and for providing papers on semantic computation and novelty.
• Mr. Wei-Feng Pan, for the winnow program.
• Associate Prof. Qun Liu, for discussion on semantic similarity.
• Stanford Univ., for providing WordNet.
Thanks for your attention!