05019d1aea6728ba8fed31d4af3a4881.ppt
- Number of slides: 69
Implications of Web 2.0 on Information Research
Wen-Lian Hsu (許聞廉)
Institute of Information Science, Academia Sinica, Taiwan
hsu@iis.sinica.edu.tw
Outline
o What is Web 2.0?
o Web 2.0 and Research
  n Human-based Computation
  n Folksonomy (Social Tagging)
  n Academic Data Analysis
  n GEO-Info
o Conclusion
3
What is Web 2.0?
o Web 2.0 Conference (October 2004)
o Tim O'Reilly
  n The Web As a Platform
  n Harnessing Collective Intelligence
  n Data is the Next Intel Inside
  n End of the Software Release Cycle
  n Lightweight Programming Models
  n Software Above the Level of a Single Device
  n Rich User Experiences
Key Web 2.0 services/applications
o Blogs
o Wikis
o Tagging and social bookmarking
o Multimedia sharing
o RSS and syndication
o Podcasting
o P2P
Social Bookmarking
Source: http://funp.com/push/
Source: http://digg.com/
Source: http://www.hemidemi.com/
[Figure labels: Social bookmark, AdSense, Blog, Content, Comments]
Source: http://carol.bluecircus.net/
Skype
Source: S. A. Baset, H. Schulzrinne (September 14, 2004). An Analysis of the Skype Peer-to-Peer Internet Telephony Protocol. Technical Report. Columbia University.
Wikipedia 10
Second Life 11
Symbiosis is the Key
[Figure labels: Blog, Social bookmark]
The Web Changes in Several Dimensions
o Dynamics
o Heterogeneity
o Collaboration
o Composition
o Socialization
Current Research Activities
o Information Retrieval on Blogs
  n NTCIR-7 CLIRB (Cross-Lingual Information Retrieval for Blog)
o Question Answering on Blogs
  n TREC 2007 QA Track
o Question Answering on Wikipedia
  n QA@CLEF 2007
  n CLEF 2006 WiQA: given a Wikipedia page, locate information snippets in Wikipedia
o PASCAL Ontology Learning Challenge
  n Ontology construction
  n Ontology extension
  n Ontology population
  n Concept naming
o LinkKDD 2006, Textlink 2007, MRDM 2007
International Competition
o 1st/9 place in the NTCIR 5 2005 CLQA Chinese Question Answering Contest (44.5%)
o 1st/13 place in the WS CityU closed track of the SIGHAN 2006 Word Segmentation Contest (97.2%)
o 2nd/10 place in the WS CKIP closed track of the SIGHAN 2006 Word Segmentation Contest (95.7%)
o 2nd/8 place in the NER CityU closed track of the SIGHAN 2006 Named Entity Recognition Contest (88%)
o 1st place in the NTCIR 6 2006 CLQA Chinese Question Answering Contest (55.3%)
o 1st place in the NTCIR 6 2006 CLQA English-Chinese Question Answering Contest (34%)
Factoid Questions
o PERSON: 請問芬蘭第一位女總統為誰? (Who is Finland's first woman president?)
o LOCATION: 請問狂牛症最早起源於何國? (In which country did mad cow disease originate?)
o ORGANIZATION: 請問收購南韓三星汽車的外國廠商為何? (Which foreign corporation bought South Korea's Samsung Motors?)
o TIME
o NUMBER
o ARTIFACT
IASL QA Architecture
[Figure: stages Question Processing, Passage Retrieval, Answer Extraction, Answer Ranking; components SVM, InfoMap, Mencius, ME, AutoTag, Lucene, Filter; data: documents, word index, char index, Answers]
Chinese Question Taxonomy for NTCIR CLQA Factoid Question Answering 18
Knowledge Representation of Chinese Questions
Chinese Question: 2004年奧運在哪一個城市舉行? (In which city were the Olympics held in 2004?)
[5 Time]: [3 Organization]: [7 Q_Location]: ([9 LocationRelatedEvent])
QC by SVM
o Two types of features are used for CQC
  n Syntactic features
    o Bag-of-Words
      n character-based bigram (CB)
      n word-based bigram (WB)
    o Part-of-Speech (POS)
      n AUTOTAG: POS tagger developed by CKIP, Academia Sinica
  n Semantic features
    o HowNet Senses
      n HowNet Main Definition (HNMD)
      n HowNet Definition (HND)
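The two bag-of-words feature types can be illustrated with a minimal sketch. The function names `char_bigrams` and `word_bigrams` are hypothetical, and the word segmentation is assumed to be supplied by an external segmenter; the slides do not show how the features were actually computed or fed to the SVM.

```python
from collections import Counter


def char_bigrams(text):
    """Character-based bigrams (CB): every pair of adjacent characters."""
    chars = [c for c in text if not c.isspace()]
    return [a + b for a, b in zip(chars, chars[1:])]


def word_bigrams(words):
    """Word-based bigrams (WB): every pair of adjacent, already segmented words."""
    return list(zip(words, words[1:]))


question = "誰是台灣國防部長"
# Assumed segmentation, given by hand here for illustration only.
segmented = ["誰", "是", "台灣", "國防部長"]

# Sparse count vectors like these could serve as input features for a classifier.
cb_features = Counter(char_bigrams(question))
wb_features = Counter(word_bigrams(segmented))
print(cb_features)
print(wb_features)
```

Character bigrams need no segmenter, which makes them a convenient baseline for Chinese text; word bigrams depend on segmentation quality.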
Question Classification Accuracy 21
Answer Extraction
Passages: 廿一世紀美國總統 總統父子檔美國第二對 美國總統性事錄 翻開美國總統傳訊史 美國總統匆忙赴晚宴 陸文斯基瘋狂愛上美國總統大選選舉人票分析 前越南總統阮文紹病逝美國 美國總統柯林頓表示
Processing steps: Answer Extraction (Mencius), Filter
Extracted answers: 陸文斯基 (Lewinsky), 阮文紹 (Nguyen Van Thieu), 柯林頓 (Clinton)
Templates generated by local alignment
o Template containing only NEs
  n ..因/Cbb/O 台中縣/Nc/LOC 議長/Na/OCC 顏清標/Nb/PER 涉嫌/VK/O..
  n ..清朝/Nd/O 台灣/Nc/LOC 巡撫/Na/OCC 劉銘傳/Nb/PER 所/D/O..
  n Template: LOC OCC PER
o Template containing POS tags
  n 被/P/O 大陸/Nc/LOC 國家/Na/O 主席/Na/OCC 江澤民/Nb/O 形容為/VG/O..
  n .. /COMMA/O 香港/Nc/LOC 行政/Na/O 長官/Na/OCC 董建華/Nb/PER 近日..
  n 俄羅斯/Nc/LOC 男子/Na/O 選手/Na/OCC 史莫契柯夫/Nb/O 在/P/O..
  n Template: LOC Na OCC Nb
o Template containing partial POS tags and words
  n 由/P/O 建業/Nc/O 所長/Na/OCC 張龍憲/Nb/PER 擔任/VG/O
  n 由/P/O 安侯/Nb/O 所長/Na/OCC 魏忠華/Nb/PER 擔任/VG/O
  n Template: 由 N 所長 PER 擔任
o Template with gap '-'
  n 在/P/O 卡達首都/Nc/LOC 多哈/D/PER, LOC 舉行/VC/O
  n 於/P/O 國父紀念館/Nc/ORG 舉行/VC/O
  n 在/P/O 國父紀念館/Nc/ORG 廣場/Nc/O 舉行/VC/O
  n Template: P Nc - 舉行
Answer Extraction from Template
o Question: 誰是台灣國防部長? (Who is Taiwan's Minister of National Defense?)
  n Q-TYPE: PERSON
  n Q-KEYWORD: 台灣 國防部長
o Tagged Passages
  n 美國/Nc/LOC 國防部長/Na/OCC 柯恩/Nb/PER 今天/Nd/O 表示/VE/O ,/COMMA/O 華府/Nc/ORG, LOC 當局/Na/O 正/D/O 設法/VF/O 釐清/VC/O 台灣/Nc/LOC
  n 前任/A/O 美國/Nc/LOC 國防部長/Na/OCC 溫柏格/Nb/PER 認為/VE/O ,/COMMACATEGORY/O
  n 【/PAR/O 路透/Nb/ORG 東京/Nc/LOC 十九日/Nd/TIME 電/VC/ART 】/PAREN/O 台灣/Nc/LOC 國防部長/Na/OCC 唐飛/Nb/PER 昨天/Nd/O
o Template matching and Relation building
  n Template: LOC OCC PER
  n Relation:
    o 美國, 國防部長, 溫柏格, 柯恩
    o 台灣, 國防部長, 唐飛
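The template matching and relation building on this slide can be sketched as follows. This is an illustration under stated assumptions, not the IASL implementation: the helpers `parse`, `match_template`, and `answer` are hypothetical names, the passages are shortened versions of the slide's examples, and tokens with a compound NE tag such as "ORG, LOC" are left out for simplicity.

```python
from collections import defaultdict


def parse(tagged):
    """Split space-separated 'word/POS/NE' items into (word, pos, ne) triples."""
    return [tuple(item.split("/")) for item in tagged.split()]


def match_template(tokens, template):
    """Yield the words of every window whose consecutive NE tags equal the template."""
    n = len(template)
    for i in range(len(tokens) - n + 1):
        window = tokens[i:i + n]
        if [ne for _, _, ne in window] == list(template):
            yield tuple(word for word, _, _ in window)


passages = [
    "美國/Nc/LOC 國防部長/Na/OCC 柯恩/Nb/PER 今天/Nd/O 表示/VE/O",
    "前任/A/O 美國/Nc/LOC 國防部長/Na/OCC 溫柏格/Nb/PER 認為/VE/O",
    "台灣/Nc/LOC 國防部長/Na/OCC 唐飛/Nb/PER 昨天/Nd/O",
]

# Relation building: group every matched PER under its (LOC, OCC) pair.
relations = defaultdict(list)
for passage in passages:
    for loc, occ, per in match_template(parse(passage), ("LOC", "OCC", "PER")):
        relations[(loc, occ)].append(per)


def answer(keywords):
    """Look up PERSON candidates whose relation matches the question keywords."""
    return relations.get(tuple(keywords), [])


print(answer(["台灣", "國防部長"]))
```

Keying the relations by the question keywords is what lets the lookup return 唐飛 rather than the two U.S. defense secretaries who appear in the same passages.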
Answer Extraction from Template
o Question: 黛安娜王妃的死亡車禍事故發生在哪裡? (Where did Princess Diana's fatal car accident happen?)
  n Q-TYPE: LOCATION
  n Q-KEYWORD: 黛安娜 王妃 死亡 車禍 事故 發生
o Tagged Passages
  n ..則/D/O 把/P/O 英國/Nc/LOC 黛安娜/Nb/PER 王妃/Na/O 的/DE/O 巴黎/Nc/LOC 死亡/VH/O 車禍/Na/O ,/COMMA/O 搬上/VC/O 舞台/Na/O..
  n ..英國/Nc/LOC 王妃/Na/O 黛安娜/Nb/PER 離開/VC/O 人世/Nc/O 四個多月/Nd/TIME..
o Template matching and Relation building
  n Template:
    o PER Na DE LOC - Na
    o LOC Na PER - VC
  n Relation:
    o 黛安娜/PER, 王妃/Na, 巴黎/LOC, 車禍/Na
    o 英國/LOC, 黛安娜/PER, 王妃/Na, 離開/VC
Answer Ranking
o Features are combined as a weighted sum
o Answer Ranking Features
  n IR Score
  n Answer Frequency (voting)
  n * QFocus adjacency:
    o "美國總統[布希]表示" ("U.S. President [Bush] said")
    o "前往[惠氏藥廠]參觀" ("went to visit [Wyeth Pharmaceuticals]")
  n * Question Term and Answer Term (QAT) Co-occurrence
  n * Answer Template
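A weighted-sum combination of the ranking features might look like the sketch below. The weights, the feature values, and the names `WEIGHTS`, `score`, and `rank` are all hypothetical; the slides give neither the actual weights nor how they were tuned.

```python
# Hypothetical weights for the five features listed on the slide.
WEIGHTS = {
    "ir": 0.30,        # IR score of the source passage
    "freq": 0.20,      # answer frequency (voting)
    "qfocus": 0.20,    # QFocus adjacency
    "qat": 0.15,       # question term / answer term co-occurrence
    "template": 0.15,  # answer template match
}


def score(features, weights=WEIGHTS):
    """Weighted sum of feature values; features that are missing count as 0."""
    return sum(w * features.get(name, 0.0) for name, w in weights.items())


def rank(candidates, weights=WEIGHTS):
    """Sort (answer, features) pairs from highest to lowest combined score."""
    return sorted(candidates, key=lambda c: score(c[1], weights), reverse=True)


# Made-up feature values for two candidate answers.
candidates = [
    ("柯恩", {"ir": 0.9, "freq": 0.5, "qfocus": 0.0, "qat": 0.2, "template": 0.0}),
    ("唐飛", {"ir": 0.8, "freq": 0.5, "qfocus": 1.0, "qat": 0.6, "template": 1.0}),
]

for name, feats in rank(candidates):
    print(name, round(score(feats), 3))
```

With a linear combination, a candidate with a slightly lower IR score can still win when the adjacency and template features agree, as in this made-up example.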
Web 2.0 and Research
o Human-based Computation
o Folksonomy (Social Tagging)
o Academic Data Analysis
o GEO-Info
Human-based Computation 28
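Human-based Computation
o Social Search
  n wayfinding tools informed by human judgment
o CAPTCHA
  n reversed Turing test (in a Turing test a human questions the system; here the system questions the user)
o Interactive Genetic Algorithm (IGA)
  n a genetic algorithm informed by human judgment: a person supplies the result of the fitness function
  n Example: drawing a suspect's portrait. The system generates candidate portraits with a GA, the eyewitness scores which ones look most like the suspect, and the process repeats until the portrait is close to the suspect's appearance.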
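The interactive genetic algorithm idea can be sketched in a few lines. Everything specific here is an assumption made for illustration: the bit-string genome, the population size, the one-point crossover, the single-bit mutation, and the function name `interactive_ga`. The only point taken from the slide is that the fitness value comes from a person (the `rate` callback) rather than from a coded function.

```python
import random


def interactive_ga(rate, genome_len=8, pop_size=6, generations=20, seed=0):
    """Evolve bit-string genomes, using rate(genome) as the human-supplied fitness."""
    rng = random.Random(seed)
    pop = [[rng.randint(0, 1) for _ in range(genome_len)] for _ in range(pop_size)]
    for _ in range(generations):
        # Human judgment replaces the fitness function: best-rated genomes first.
        ranked = sorted(pop, key=rate, reverse=True)
        parents = ranked[: pop_size // 2]      # keep the top half unchanged
        children = []
        while len(parents) + len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            cut = rng.randrange(1, genome_len)
            child = a[:cut] + b[cut:]          # one-point crossover
            child[rng.randrange(genome_len)] ^= 1  # flip one random bit
            children.append(child)
        pop = parents + children
    return max(pop, key=rate)


# Stand-in for the eyewitness: scores a genome by similarity to a hidden target.
target = [1, 0, 1, 1, 0, 0, 1, 0]


def witness(genome):
    return sum(x == y for x, y in zip(genome, target))


best = interactive_ga(witness)
print(best, witness(best))
```

In a real system `rate` would display the candidate (for example a rendered face) and wait for the person's score, so the population size and the number of generations have to stay small.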
CAPTCHA
Completely Automated Public Turing test to tell Computers and Humans Apart
o A CAPTCHA is a type of challenge-response test used in computing to determine whether the user is human. (Wikipedia)
Source: http://recaptcha.net/
CAPTCHA
[Figure labels: blog, CAPTCHA, Recognized text, Unrecognized text]
The ESP Game
o A two-player game
o The goal is to guess what your partner is typing for each image.
o Once you both type the same word(s), you score points.
Source: http://www.espgame.org/
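The agreement rule of the game can be illustrated with a small sketch. It is a simplification under stated assumptions: the function name `esp_round` is hypothetical, both players' guesses are given as complete lists, and timing is ignored, whereas in the real game the guesses arrive one at a time.

```python
def esp_round(guesses_a, guesses_b):
    """Return the first word of player A that player B also typed, or None."""
    typed_by_b = {word.strip().lower() for word in guesses_b}
    for word in guesses_a:
        normalized = word.strip().lower()
        if normalized in typed_by_b:
            # An agreed word is a plausible label for the image.
            return normalized
    return None


label = esp_round(["dog", "grass", "ball"], ["puppy", "Ball", "grass"])
print(label)
```

Because the two players cannot communicate, a word they both type independently is likely to describe the image, which is how playing the game produces image labels as a by-product.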
The Phetch Game: play as a describer
The Phetch Game: play as a seeker
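How about a game for describing idioms?
[Figure: description given: 壞事做太多 ("has done too many bad deeds"); candidate idioms: 高抬貴手, 不動如山, 如沐春風]
o 罄竹難書: 壞事做太多 (has done too many bad deeds)
o 虎頭蛇尾: 做事沒有毅力 (lacks perseverance in doing things)
o ...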
Folksonomy (Social Tagging) 36
Folksonomy (Social Tagging)
o Also known as social tagging, collaborative tagging, social classification, social indexing
o Folksonomy is the practice and method of collaboratively creating and managing tags to annotate and categorize content. (Wikipedia)
38
del.icio.us
Tags: descriptive words applied by users to links. Tags are searchable.
My Tags: words I've used to describe links in a way that makes sense to me.
Semantic Web Source: Tim Berners-Lee 40
Using Folksonomy to Help Semantic Web
o Top-down Semantic Annotation
  n Approach
    o Define an ontology first
    o Use the ontology to add semantic markup to web resources.
    o The semantics is provided by the ontology, which is shared among different web agents and applications.
  n Problem
    o Negotiation
    o Evolution (hard to maintain)
    o High barrier (background)
Source: Xian Wu, Lei Zhang, Yong Yu. "Exploring Social Annotations for the Semantic Web"
Using Folksonomy to Help Semantic Web
o Bottom-up approach with social tagging
o Advantage
  n No common ontology or dictionary is needed
  n Easy to access
  n Sensitive to information drift
o Disadvantage
  n Ambiguity Problem: for example, "XP" can refer to either "Extreme Programming" or "Windows XP".
  n Group Synonymy Problem: two seemingly different annotations may bear the same meaning.
Source: Xian Wu, Lei Zhang, Yong Yu. "Exploring Social Annotations for the Semantic Web"
Or Is Folksonomy the Solution?
o Ontology is Overrated
  n Classification of the web has failed
  n Classification itself is filled with bias and error
  n Tagging is the solution
Source: http://www.shirky.com/writings/ontology_overrated.html
Academic Data Analysis 44
Academic Data Analysis
o Users participate and interact with data and people
o e-Lib / Lib 2.0 concepts are being added to applications, so search platforms provide open APIs for collecting more data
  n Add My Library, Tag (e.g. CiteULike, BibSonomy)
  n Add Comments, Rating, Recommendation (e.g. TechLens)
  n Domain Focus Groups (e.g. Botanicus)
o Citation index (papers, journals/conferences, authors): arXiv, Google Scholar, Windows Live Academic Search, CiteSeer, PubMed
An Example
o Let's use TechLens as an example to imagine what research on IR/NLP can do.
[Figure labels: Authors, Readers, Papers]
The Terminology
o Entities
o References: "Aho, A. V."; "Alfred V Aho"; "Alfred Aho"; "AV Aho"
o Links: "Alfred Aho, John Hopcroft, Jeffrey Ullman"; "AV Aho, BW Kernighan, PJ Weinberger"
o Entity Groups: G1 (Programming Languages), G2 (Databases), G3 (Algorithms)
Imagine how we can make use of them
[Figure labels: Papers, Reference Extraction, Entity Resolution, Authors, Rating, Comments, Readers]
New Research Topics
o Given these changes, a key emerging challenge for data mining is dealing with richly structured, heterogeneous datasets and finding the patterns behind them.
o Several lines of research focus on these problems, such as
  n (Social) Network Analysis
  n Link Mining
  n PASCAL Ontology Learning Challenge
  n ...
Society
o Nodes: individuals (Authors, Readers)
o Links: social relationships (family, work, friendship, belonging to, etc.)
o S. Milgram (1967) Six Degrees of Separation, John Guare Science
o Social networks: many individuals with diverse social interactions between them.
Source: www.cs.uiuc.edu/~hanj
Communication networks
o The Earth is developing an electronic nervous system, a network with diverse nodes and links:
  n computers, phone lines, routers, TV cables, satellites, EM waves
o Artifacts in TechLens
  n papers, relations between artifacts, user IPs, comments, responses, ...
o Communication networks: many nonidentical components with diverse connections between them.
Source: www.cs.uiuc.edu/~hanj
Link-based Object Ranking
o Perhaps the best-known link mining task is link-based object ranking (LBR), a primary focus of the link analysis community.
o The objective of LBR is to exploit the link structure of a graph to order or prioritize the set of objects within the graph.
o Example
  n PageRank
  n Which paper is most important in this area?
  n Which journal/conference is most important in this area?
  n Which topic is important in this area?
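As one concrete instance of link-based object ranking, the sketch below runs a basic PageRank power iteration over a tiny citation graph. The graph, the damping factor of 0.85, the fixed iteration count, and the function name `pagerank` are assumptions chosen for illustration; nodes without outgoing links spread their rank uniformly, which is one common convention among several.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Power-iteration PageRank over a dict mapping node -> list of cited nodes."""
    nodes = set(links) | {t for targets in links.values() for t in targets}
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        new_rank = {node: (1.0 - damping) / n for node in nodes}
        for node in nodes:
            targets = links.get(node, [])
            if targets:
                share = damping * rank[node] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # Dangling node: distribute its rank evenly over all nodes.
                share = damping * rank[node] / n
                for target in nodes:
                    new_rank[target] += share
        rank = new_rank
    return rank


# Hypothetical citation graph: papers A and B cite C, and C cites D.
citations = {"A": ["C"], "B": ["C"], "C": ["D"], "D": []}
ranks = pagerank(citations)
for paper in sorted(ranks, key=ranks.get, reverse=True):
    print(paper, round(ranks[paper], 3))
```

The same routine could rank journals or conferences by building the graph over venues instead of papers.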
Link-based Object Classification / Link-based Classification (LBC)
o Predicting the category of an object based on its attributes, its links, and the attributes of linked objects
o Web: predict the category of a web page, based on words that occur on the page, links between pages, anchor text, HTML tags, etc.
o Citation: predict the topic of a paper, based on word occurrence, citations, co-citations
o Epidemics: predict disease type based on characteristics of the people; predict a person's age based on the ages of people they have been in contact with and the disease type
Group Detection
o Cluster the nodes in the graph into groups that share common characteristics; that is, predict when a set of entities belongs to the same group by clustering on both object attribute values and link structure.
  n Web: identifying communities
  n Citation: identifying research communities
Entity Resolution
o Predicting when two objects are the same, based on their attributes and their links
  n Web: predicting when two sites are mirrors of each other
  n Citation: predicting when two citations refer to the same paper
  n Epidemics: predicting when two disease strains are the same
  n Biology: learning when two names refer to the same protein
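A deliberately coarse, attribute-only sketch of entity resolution for author names is shown below, using name variants of the kind that appear on the terminology slide. The key (surname plus first initial) and the function names `name_key` and `resolve` are assumptions made for illustration; a real resolver would also use links such as co-authorship to separate different people who share a key.

```python
from collections import defaultdict


def name_key(name):
    """Reduce an author string to (surname, first initial)."""
    name = name.replace(".", " ").strip()
    if "," in name:
        # "Aho, A V": the surname comes first
        surname, given = [part.strip() for part in name.split(",", 1)]
    else:
        # "Alfred V Aho" or "AV Aho": the surname comes last
        parts = name.split()
        surname, given = parts[-1], " ".join(parts[:-1])
    return surname.lower(), given[:1].lower()


def resolve(references):
    """Group reference strings that share the same name key."""
    groups = defaultdict(list)
    for ref in references:
        groups[name_key(ref)].append(ref)
    return dict(groups)


refs = ["Aho, A. V.", "Alfred V Aho", "Alfred Aho", "AV Aho", "BW Kernighan"]
for key, members in resolve(refs).items():
    print(key, members)
```

Such a key over-merges (two different authors named "A. Aho" would collapse into one entity), which is exactly where the link evidence mentioned on the slide would come in.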
Link Prediction
o Predict whether a link exists between two entities, based on attributes and other observed links
  n Web: predict whether there will be a link between two pages
  n Citation: predict whether a paper will cite another paper, or predict the venue type of a publication (conference, journal, workshop) based on properties of the paper
  n Epidemics: predict who a patient's contacts are (in epidemiology one needs to trace the origin of the disease, i.e. the source of infection)
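One simple baseline for link prediction scores each unlinked pair by the number of neighbors the two nodes share. The sketch below uses that common-neighbors heuristic on a made-up undirected graph; the heuristic, the graph, and the function name `common_neighbor_scores` are assumptions for illustration and are not taken from the slides, which also mention attribute-based evidence.

```python
from itertools import combinations


def common_neighbor_scores(graph):
    """Score every unlinked node pair by the number of shared neighbors."""
    scores = {}
    for u, v in combinations(sorted(graph), 2):
        if v not in graph[u]:
            scores[(u, v)] = len(graph[u] & graph[v])
    return scores


# Hypothetical undirected graph stored as node -> set of neighbors.
graph = {
    "a": {"b", "c"},
    "b": {"a", "c", "d"},
    "c": {"a", "b", "d"},
    "d": {"b", "c", "e"},
    "e": {"d"},
}

scores = common_neighbor_scores(graph)
for pair in sorted(scores, key=scores.get, reverse=True):
    print(pair, scores[pair])
```

The highest-scoring pairs are the candidate links, for example two papers or authors that are not yet connected but share many neighbors.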
Other Possible Research Directions
o Expert Finding
  n e.g. suggesting paper reviewers or conference committee members
o Ecological Evolution of Some Research
  n e.g. how one topic receives different solutions over a time period
  n A domain's topic distribution
GEO-Info (Geographic Information)
GEO-Info
[Figure labels: GIS (limited users, limited usage); Google Earth/Map (open to everyone; GML, photo sharing, user annotation); user participation (Google Earth Community, Google Earth Blog, Ogle Earth, ...)]
Some Research Topics
o By now, a lot of information can be combined into Google Earth/Map through KML.
o Because such information can be integrated by geocoding, some models become very interesting, such as
  n Photo annotation, sharing, and search
  n Live information
  n Planning
  n 3D, flight animation
  n Travel experience, comments
  n Transportation information, survival information
  n Climate change
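To make the KML route concrete, the sketch below builds a minimal KML document with a single placemark. The function name `placemark_kml`, the place name, and the coordinates are hypothetical; the element names and the longitude,latitude,altitude order follow the KML 2.2 format.

```python
from xml.sax.saxutils import escape


def placemark_kml(name, lon, lat, description=""):
    """Return a minimal KML document with one point placemark."""
    return (
        '<?xml version="1.0" encoding="UTF-8"?>\n'
        '<kml xmlns="http://www.opengis.net/kml/2.2">\n'
        "  <Placemark>\n"
        f"    <name>{escape(name)}</name>\n"
        f"    <description>{escape(description)}</description>\n"
        "    <Point>\n"
        # KML lists longitude first, then latitude, then altitude.
        f"      <coordinates>{lon},{lat},0</coordinates>\n"
        "    </Point>\n"
        "  </Placemark>\n"
        "</kml>\n"
    )


# Made-up example: a tagged photo location with illustrative coordinates.
doc = placemark_kml("Park & Garden", 121.5, 25.04, "photo tags: park, lake")
print(doc)
```

A geocoded photo, comment, or transport record can be exported this way, one placemark per item, and then loaded as a layer in Google Earth or Google Maps.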
Some Information bundled with Google Earth/Map
o Example location: 中山公園 (Zhongshan Park)
o Photo sharing (photos & tags)
o Integrated with YouTube (videos & tags)
Some Applications
o Integrate more information on the map
o Personal life information integration
o GeoDDupe: A Novel Interface for Interactive Entity Resolution in Geospatial Data
Photo link with Map
Source: http://www.panoramio.com
Image-based Rendering (IBR)
o IBR relies on a set of two-dimensional images of a scene to generate a three-dimensional model and then render novel views of the scene.
o Web 2.0 enables sharing of photographs on a truly massive scale.
Microsoft Photosynth
o From SIFT to Photosynth
Conclusion
o Research results can be easily integrated on the Web 2.0 platform
o This makes restricted-domain research (such as image-based rendering) more useful for the public
  n Software agents
o Benefit human-based computation
o Certain research topics will be easier to tackle, such as personalization in virtual worlds (more data available)
o Data becomes more task oriented (e.g. Wikipedia)
o More versatile data networks are available
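Research Assistants Wanted (applicants on substitute military service are welcome)
1. Graduate degree in a computing-related field.
2. Able to read research papers in English.
3. Enthusiasm for research on Chinese natural language processing (the Natural Input Method, question answering systems) or bioinformatics (bioinformatics algorithms, biomedical literature retrieval and analysis).
4. Familiar with at least one of C/C++/C#/Java, with problem-solving ability.
5. For input-method-related positions, experience with either WinCE or the Win32 API is a plus.
6. Good at communication and teamwork.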
Acknowledgement
o I would also like to thank two of my Ph.D. students who helped organize the slides:
  n 李政緯, 呂俊宏
Thank You 69