
1b51f02079d1218f24bb29c3015a6ed8.ppt
- Number of slides: 1
Cross-Corpus Analysis with Topic Models
Padhraic Smyth, Mark Steyvers, Dave Newman, Chaitanya Chemudugunta
University of California, Irvine

Analysis, Exploration, and Retrieval of Information across Multiple Corpora
E.g. emails, intelligence reports, news articles. We looked at:
  - New York Times articles: 3000 articles that mention “Enron”
  - Enron email data: 500,000 emails, 11k different authors, 1999-2002
  - PubMed: 15,000 articles; queries can return 100k or more articles
• Applications:
  - Corpus comparison: automatically compare topics across 2 different corpora
  - Cross-corpus retrieval: given a document in corpus A, find similar documents in corpus B
  - “GateKeeper”: given a document in corpus A, compute the likelihood of finding matching documents in corpus B, without looking at individual document records

Probabilistic Topic Models
• topic = distribution over words
• document = mixture of topics
• Topic models can be learned automatically using statistical learning [e.g. Griffiths and Steyvers (2004)]
Terrorism: SEPT_11 WAR SECURITY IRAQ TERRORISM NATION KILLED AFGHANISTAN ATTACKS OSAMA_BIN_LADEN AMERICAN ATTACK NEW_YORK_REGION NEW MILITARY NEW_YORK WORLD NATIONAL QAEDA TERRORIST_ATTACKS
Wall Street: WALL_STREET ANALYSTS INVESTORS FIRM GOLDMAN_SACHS FIRMS INVESTMENT MERRILL_LYNCH COMPANIES SECURITIES RESEARCH STOCK BUSINESS ANALYST WALL_STREET_FIRMS SALOMON_SMITH_BARNEY CLIENTS INVESTMENT_BANKING INVESTMENT_BANKERS INVESTMENT_BANKS
Stock Market: WEEK DOW_JONES POINTS 10_YR_TREASURY_YIELD PERCENT CLOSE NASDAQ_COMPOSITE STANDARD_POOR CHANGE FRIDAY DOW_INDUSTRIALS GRAPH_TRACKS EXPECTED BILLION NASDAQ_COMPOSITE_INDEX EST_02 PHOTO_YESTERDAY YEN 10 500_STOCK_INDEX
Bankruptcy: BANKRUPTCY CREDITORS BANKRUPTCY_PROTECTION ASSETS COMPANY FILED BANKRUPTCY_FILING ENRON BANKRUPTCY_COURT KMART CHAPTER_11 FILING COOPER BILLIONS COMPANIES BANKRUPTCY_PROCEEDINGS DEBTS RESTRUCTURING CASE GROUP

Collocation Topic Model
• New model combines frequent word combinations (collocations) with topics
• Model automatically extracts topics and word combinations
• Collocations in topics improve interpretability: e.g. “United_States”, “Sept_11”, “Osama_Bin_Laden”
[Diagram: TOPIC MIXTURE; TOPIC . . . TOPIC; WORD . . . WORD; X . . .]
For each document, choose a mixture of topics
For every word slot, sample a topic
If x=0, sample a word from the topic
If x=1, sample a word from the distribution based on previous word

Corpus Comparison
• What are the topical similarities and differences between two large sets of documents?
• Example: PubMed papers before 1980 compared with 2003 …
  - Pre 1980 Topics: Cattle diseases (6.7), Ricin binding (6.1), Brucellosis (4.1), Animal infections (3.7), Proteins (3.3)
  - Common Topics: Child mortality, Plague study, Patient diagnosis, Cases reported, Cell marrow
  - 2003 Topics: SARS (11.0), Gene mutations (5.5), Biological agents (5.5), Gene sequences (5.0), HIV (4.5)
• Example: PubMed papers from China and Israel…
  - China Topics: Cell marrow (30.0), Serum levels (24.5), Gene sequences (22.2), Antibodies (13.5), SARS (10.0)
  - Common Topics: Animal infections, Acid mass detection, Cattle diseases, Nerve motor study, Vaccination
  - Israel Topics: Biological agents (24.5), Terrorist injuries (14.9), West nile virus (12.2), September 11 (11.0), Public health (8.2)

GateKeeper
• Application: an analyst wants to check whether some report X (the query) has any similar documents in a secure database at a different agency. The analyst uses the “gatekeeper” to assess whether there are any relevant documents before going through the lengthy process of securing access.
• Problem: the information retrieval model cannot have access to individual documents either; it only has summaries of topics across the whole database.
• Solution: use the log likelihood of the query document under the topic model, using only the topics.
• Simulation: assume Biobase docs are the secure database. Probe with (relevant) new Biobase docs or (irrelevant) computer science docs from CiteSeer. The figure shows that relevant documents can be discriminated from irrelevant documents based on this global measure.
[Figure panels: TOPIC MODEL and WORD MODEL, each comparing BIOBASE and CITESEER]

Example Cross Corpus Retrieval
• Example: two corpora, Enron emails and New York Times articles that mention “Enron”
• Problem: how to find Enron emails relevant to a New York Times article (or vice versa)?
• Approach:
  1) Train two separate topic models
  2) Map the query into the topic space of the other corpus
  3) Calculate relevance by proximity in topic space (e.g. using Jensen-Shannon divergence)
Query: NYT Article
the four remaining directors of the enron corporation who oversaw the company s rise to the top 10 of the fortune 500 and its collapse into bankruptcy last year quit today the resignations of the directors robert a belfer norman p blake_jr wendy l gramm and herbert s winokur_jr were accepted unanimously by the board enron said the board announced in february 2002 its intent to conduct an orderly transition to a board composed of new independent directors …
Best matching ENRON email
earlier today we issued a press_release announcing further progress on the transition of enron s board of directors at a board meeting this morning the board accepted the resignations of the remaining four long standing directors robert a belfer norman p blake dr wendy l gramm and herbert s winokur jr the board also elected raymond s troubh as interim chairman of the board and noted that they have identified three independent director candidates whose election is pending feedback from the creditors committee …
[Figures: P(topic) for the query and for the best matching email. Topics labeled in both plots: “board members committee directors gisb”, “ken_lay ken ceo chairman president”. Other labeled topics: “plan week plans process working”, “enron_north_america enron_corp enron_employee enron_stock”]
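Step 3 of the cross-corpus retrieval approach (relevance by proximity in topic space, e.g. Jensen-Shannon divergence) can be illustrated with a minimal sketch. It assumes that steps 1 and 2 are already done, so the query and every candidate document are given as topic mixtures over the same set of topics. The function names, document ids, and topic vectors below are hypothetical and chosen only for illustration; they are not taken from the poster, and the poster does not say which logarithm base or ranking details were used.

```python
import math


def kl_divergence(p, q):
    """KL(p || q) in bits. Terms with p_i == 0 contribute nothing."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(p, q):
    """Jensen-Shannon divergence between two discrete distributions.

    With base-2 logs the value lies in [0, 1]: 0 for identical
    distributions, 1 for distributions with disjoint support.
    """
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]  # midpoint distribution
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


def rank_by_topic_proximity(query_topics, corpus_topics):
    """Return (doc_id, divergence) pairs, most similar document first."""
    scored = [
        (doc_id, js_divergence(query_topics, theta))
        for doc_id, theta in corpus_topics.items()
    ]
    return sorted(scored, key=lambda pair: pair[1])


# Hypothetical topic mixtures (assumption: the query has already been
# mapped into the same 4-topic space as the candidate emails).
query = [0.70, 0.20, 0.05, 0.05]
emails = {
    "email_a": [0.65, 0.25, 0.05, 0.05],  # similar topic profile
    "email_b": [0.05, 0.05, 0.70, 0.20],  # mostly different topics
    "email_c": [0.25, 0.25, 0.25, 0.25],  # uniform mixture
}

ranking = rank_by_topic_proximity(query, emails)
best_id, best_score = ranking[0]
print(best_id, round(best_score, 4))
```

Jensen-Shannon divergence is symmetric and stays finite even when one mixture assigns zero probability to a topic, which makes it easier to use for ranking than plain KL divergence.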