- Number of slides: 54
Managing Shared Information: Statistical Methods for Validating, Integrating, and Analyzing Text Data from Multiple Sources ChengXiang (“Cheng”) Zhai Department of Computer Science University of Illinois at Urbana-Champaign http://www.cs.uiuc.edu/homes/czhai AFOSR Information Operations and Security Annual PI Meeting, Arlington, VA, Aug. 7, 2013 1
Managing Shared Information Decision=? Information Need How to manage shared information? - validate textual claims? - integrate scattered opinions? - summarize & analyze opinions? Information Nuggets Information Sharing Network 2
Opinionated text data are very useful for decision support Opinionated Text Data Decision Making & Analytics “Which cell phone should I buy?” “What are the winning features of iPhone over BlackBerry?” “How do people like this new drug?” “How is Obama’s health care policy received?” “Which presidential candidate should I vote for?” … 3
However, a user will be overwhelmed with lots of text data… How can I integrate and organize all opinions? How can I digest them all? What can I believe? Any pattern? 4
Research Questions • How can we integrate scattered opinions? • How can we validate and summarize opinions? • How can we analyze online opinions to discover patterns? • How can we do all these in a general way with no or minimal human effort? – Must work for all topics – Must work for different natural languages Solutions: Statistical Methods for Text Data Mining 5
Rest of the talk: general methods for 1. Opinion Integration & Validation 2. Opinion Summarization 3. Opinion Analysis 6
Outline 1. Opinion Integration & Validation 2. Opinion Summarization 3. Opinion Analysis 7
How to digest all scattered opinions? Need tools to automatically integrate all scattered opinions (190,451 posts; 4,773,658 results) 8
Observation: two kinds of opinions (190,451 posts; 4,773,658 results). Can we combine them?
Expert opinions • CNET editor’s review • Wikipedia article • Well-structured • Easy to access • Maybe biased • Outdated soon
Ordinary opinions • Forum discussions • Blog articles • Represent the majority • Up to date • Hard to access • Fragmented 9
Opinion Integration Strategy 1 [Lu & Zhai WWW 2008] Align scattered opinions with well-structured expert reviews Yue Lu, ChengXiang Zhai. Opinion Integration Through Semi-supervised Topic Modeling, Proceedings of the World Wide Web Conference 2008 (WWW'08), pages 121-130. 10
Review-Based Opinion Integration
Input: • Topic (e.g. iPod) • Expert review with aspects (Design, Battery, Price, ...) • Text collection of ordinary opinions, e.g. Weblogs
Output: Integrated Summary • Review aspects (Design, Battery, Price), each with similar opinions and supplementary opinions • Extra aspects (e.g. iTunes, warranty)
Sample opinion snippets: “cute… tiny…”, “..thicker..”, “last many hrs”, “die out soon”, “could afford it”, “still expensive”, “… easy to use…”, “…better to extend..” 11
Results: Product (iPhone) • Opinion Integration with review aspects
Aspect: Activation
– Review article: “You can make emergency calls, but you can't use any other functions…”
– Similar opinions: N/A
– Supplementary opinions (unlock/hack iPhone): “… methods for unlocking the iPhone have emerged on the Internet in the past few weeks, although they involve tinkering with the iPhone hardware…”
Aspect: Battery
– Review article: “rated battery life of 8 hours talk time, 24 hours of music playback, 7 hours of video playback, and 6 hours on Internet use.”
– Similar opinions (confirm the opinions from the review): “iPhone will Feature Up to 8 Hours of Talk Time, 6 Hours of Internet Use, 7 Hours of Video Playback or 24 Hours of Audio Playback”
– Supplementary opinions (additional info under real usage): “Playing relatively high bitrate VGA H.264 videos, our iPhone lasted almost exactly 9 freaking hours of continuous playback with cell and WiFi on (but Bluetooth off).” 12
Results: Product (iPhone) • Opinions on extra aspects
Supplementary opinions on extra aspects, with support counts:
– Support 15 (another way to activate iPhone): “You may have heard of iASign … an iPhone Dev Wiki tool that allows you to activate your phone without going through the iTunes rigamarole.”
– Support 13 (iPhone trademark originally owned by Cisco): “Cisco has owned the trademark on the name "iPhone" since 2000, when it acquired InfoGear Technology Corp., which originally registered the name.”
– Support 13 (a better choice for smart phones?): “With the imminent availability of Apple's uber cool iPhone, a look at 10 things current smartphones like the Nokia N95 have been able to do for a while and that the iPhone can't currently match...” 13
As a result of integration… What matters most to people? Price Bluetooth & Wireless Activation 14
Sample Results of “Obama” Wikipedia article about “Barack Obama” Relevant discussions in blog 15
What if we don’t have expert reviews? How can we organize scattered opinions? (190,451 posts; 4,773,658 results) Exploit online ontology!
Expert opinions • CNET editor’s review • Wikipedia article • Well-structured • Easy to access • Maybe biased • Outdated soon
Ordinary opinions • Forum discussions • Blog articles • Represent the majority • Up to date • Hard to access • Fragmented 16
Opinion Integration Strategy 2 [Lu et al. COLING 2010] Organize scattered opinions using an ontology Yue Lu, Huizhong Duan, Hongning Wang and ChengXiang Zhai. Exploiting Structured Ontology to Organize Scattered Online Opinions, Proceedings of COLING 2010 (COLING 10), pages 734-742. 17
Sample Ontology: 18
Ontology-Based Opinion Integration
Input: • Topic = “Abraham Lincoln” (exists in ontology) • Aspects from ontology (more than 50): Professions, Quotations, Date of Birth, Parents, …, Place of Death • Scattered opinion sentences
Output: • Subset of aspects (e.g. Professions, Quotations, …), ordered to optimize readability • Matching opinions for each selected aspect 19
Sample Results: Freebase + Blog (Ronald Reagan; Sony Cybershot DSC-W200) 20
More opinion integration results are available at: http://sifaka.cs.uiuc.edu/~yuelu2/opinionintegration/ 21
Opinion Validation: What to Believe? [Vydiswaran et al. KDD 2011] A General Framework for Textual Claim Validation V. G. Vinod Vydiswaran, ChengXiang Zhai, Dan Roth, Content-driven Trust Propagation Framework, Proc. 2011 ACM SIGKDD Int. Conf. on Knowledge Discovery and Data Mining (KDD'11), Aug. 2011. 22
Motivation • How does a user know what to believe and what to trust? • Task: annotate textual claims and their sources with trustworthiness scores • Solution: a general content-based trust framework, capturing the following intuition: – Not all claims are equally trustable – Not all sources are equally trustable – Trustable sources tend to provide trustable claims 23
Content-Based Trust Propagation Framework
Diagram layers: Different Sources (e.g., news, Web); Textual Support Evidence; Textual Claims (e.g. blog articles): Claim 1, Claim 2, ..., Claim n
• First trust framework to model trust of textual claims directly
• Combines multiple kinds of evidence (source trust, evidence trust, and claim trust) in a unified framework to enable mutual reinforcement of different scores
• General and thus applicable to multiple domains 24
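The mutual reinforcement between source trust and claim trust can be illustrated with a small iterative sketch. This is not the update rule from the KDD'11 paper; it is a simplified stand-in that assumes each piece of evidence is a (source, claim, support) triple with a hypothetical support score in [0, 1], and it folds evidence trust into that support score. The function name `propagate_trust` and all data are illustrative.

```python
from collections import defaultdict


def propagate_trust(evidence, n_iters=20):
    """Iteratively score sources and claims (simplified illustration).

    evidence: list of (source, claim, support) triples; support in [0, 1]
    is a hypothetical score for how strongly the source's text backs the claim.
    """
    sources = {s for s, _, _ in evidence}
    claims = {c for _, c, _ in evidence}
    source_trust = {s: 0.5 for s in sources}  # uninformative starting point
    claim_trust = {c: 0.5 for c in claims}

    for _ in range(n_iters):
        # Claim trust: support of its evidence, weighted by source trust.
        num, den = defaultdict(float), defaultdict(float)
        for s, c, support in evidence:
            num[c] += source_trust[s] * support
            den[c] += source_trust[s]
        claim_trust = {c: num[c] / den[c] if den[c] > 0 else 0.0 for c in claims}

        # Source trust: trust of the claims it backs, weighted by support.
        num, den = defaultdict(float), defaultdict(float)
        for s, c, support in evidence:
            num[s] += claim_trust[c] * support
            den[s] += support
        source_trust = {s: num[s] / den[s] if den[s] > 0 else 0.0 for s in sources}

    return source_trust, claim_trust


# Hypothetical evidence: two news sources and one blog, two claims.
evidence = [
    ("news_a", "claim_1", 0.9),
    ("news_b", "claim_1", 0.8),
    ("news_b", "claim_2", 0.3),
    ("blog_x", "claim_2", 0.2),
]
source_trust, claim_trust = propagate_trust(evidence)
print(source_trust)
print(claim_trust)
```

Each pass lets better-supported claims raise the trust of the sources behind them, and trusted sources in turn count for more when scoring claims, which is the intuition listed on the Motivation slide.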
Sample Results of the New Framework Which news source should you trust? It depends on the genres! Incorporating the trust model improves retrieval accuracy for news 25
Outline 1. Opinion Integration & Validation 2. Opinion Summarization 3. Opinion Analysis 26
Need for opinion summarization (1,432 customer reviews): How can we help users digest these opinions? 27
Nice to have…. Can we do this in a general way? 28
Opinion Summarization 1: [Mei et al. WWW 07] Multi-Aspect Topic Sentiment Summarization Qiaozhu Mei, Xu Ling, Matthew Wondra, Hang Su, ChengXiang Zhai, Topic Sentiment Mixture: Modeling Facets and Opinions in Weblogs, Proceedings of the World Wide Web Conference 2007 (WWW'07), pages 171-180 29
Multi-Faceted Sentiment Summary (query=“Da Vinci Code”)
Facet 1: Movie
– Neutral: “... Ron Howards selection of Tom Hanks to play Robert Langdon.” “Directed by: Ron Howard Writing credits: Akiva Goldsman ...” “After watching the movie I went online and some research on ...”
– Positive: “Tom Hanks stars in the movie, who can be mad at that?” “Tom Hanks, who is my favorite movie star act the leading role.” “Anybody is interested in it?”
– Negative: “But the movie might get delayed, and even killed off if he loses.” “protesting ... will lose your faith by ... watching the movie.” “... so sick of people making such a big deal about a FICTION book and movie.”
Facet 2: Book
– Neutral: “I remembered when i first read the book, I finished the book in two days.” “I’m reading “Da Vinci Code” now.”
– Positive: “Awesome book.” “So still a good book to past time.”
– Negative: “... so sick of people making such a big deal about a FICTION book and movie.” “This controversy book cause lots conflict in west society.”
… 30
Separate Theme Sentiment Dynamics “book” “religious beliefs” 31
Can we make the summary more concise? What if the user is using a smart phone? (The multi-faceted “Da Vinci Code” sentiment summary, with Facet 1: Movie and Facet 2: Book under Neutral / Positive / Negative, is shown again.) 32
Opinion Summarization 2: [Ganesan et al. WWW 12] “Micro” Opinion Summarization Kavita Ganesan, ChengXiang Zhai and Evelyne Viegas, Micropinion Generation: An Unsupervised Approach to Generating Ultra-Concise Summaries of Opinions, Proceedings of the World Wide Web Conference 2012 (WWW'12), pages 869-878, 2012. 33
Micro Opinion Summarization • Generate a set of non-redundant phrases (micropinions): – Summarizing key opinions in text – Short (2-5 words) – Readable • Micropinion summary for a restaurant: “Good service”, “Delicious soup dishes” • Emphasize (1) ultra-concise nature of phrases; (2) abstractive summarization, e.g. “Room is large” + “Room is clean” becomes “large clean room” 34
Overview of summarization algorithm
Input: text to be summarized
Step 1: Shortlist high-frequency unigrams (count > median). Unigrams: very, nice, place, clean, problem, dirty, room, …
Step 2: Form seed bigrams by pairing unigrams. Shortlist by Srep (Srep > σrep). Seed bigrams: very + clean, very + dirty, very + nice, and pairings of clean / dirty / nice with place / room, … 35
Overview of summarization algorithm
Step 3: Generate higher-order n-grams. • Concatenate existing candidates + seed bigrams • Prune non-promising candidates using Srep & Sread (Srep < σrep; Sread < σread) • Eliminate redundancies (sim(mi, mj)) • Repeat process on shortlisted candidates (until no possibility of expansion)
Candidates + seed bigrams = new candidates:
– very clean + clean rooms, clean bed = very clean rooms, very clean bed
– very dirty + dirty room, dirty pool = very dirty room, very dirty pool
– very nice + nice place, nice room = very nice place, very nice room
Step 4: Final summary. Sort candidates by objective function value. Add phrases until the summary size limit σss is reached.
Sorted candidates: 0.9 very clean rooms; 0.8 friendly service; 0.7 dirty lobby and pool; 0.5 nice and polite staff; … 36
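Steps 1 to 4 can be sketched in a few lines of Python. This is a minimal illustration, not the implementation from the WWW'12 paper: the scoring functions `s_rep` and `s_read` are passed in as parameters, and the toy scorers below (sentence co-occurrence for representativeness, seen-bigram ratio for readability) are hypothetical stand-ins. The sketch also assumes that sim is Jaccard word overlap, that the objective is Srep + Sread, and that σss is a budget on the total number of words.

```python
from collections import Counter
from statistics import median


def jaccard(a, b):
    """Word-overlap similarity between two phrases (tuples of words)."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)


def micropinions(sentences, s_rep, s_read, sigma_rep, sigma_read,
                 sigma_ss, max_len=5, sigma_sim=0.5):
    tokens = [s.lower().split() for s in sentences]

    # Step 1: shortlist unigrams whose count is above the median count.
    counts = Counter(w for sent in tokens for w in sent)
    cutoff = median(counts.values())
    unigrams = [w for w, c in counts.items() if c > cutoff]

    # Step 2: pair unigrams into seed bigrams; keep representative ones.
    seeds = [(a, b) for a in unigrams for b in unigrams
             if a != b and s_rep((a, b)) > sigma_rep]

    # Step 3: grow candidates by joining them with seed bigrams that start
    # with the candidate's last word; prune on both scores; repeat.
    candidates, frontier = set(seeds), set(seeds)
    while frontier:
        grown = set()
        for cand in frontier:
            if len(cand) >= max_len:
                continue
            for first, second in seeds:
                new = cand + (second,)
                if (first == cand[-1] and second not in cand
                        and new not in candidates
                        and s_rep(new) > sigma_rep
                        and s_read(new) > sigma_read):
                    grown.add(new)
        candidates |= grown
        frontier = grown

    # Step 4: rank by s_rep + s_read, then greedily add non-redundant
    # phrases while the word budget sigma_ss allows.
    def score(p):
        return s_rep(p) + s_read(p)

    readable = [p for p in candidates if s_read(p) > sigma_read]
    summary, used = [], 0
    for p in sorted(readable, key=lambda p: (-score(p), p)):
        if used + len(p) > sigma_ss:
            continue
        if any(jaccard(p, q) >= sigma_sim for q in summary):
            continue
        summary.append(p)
        used += len(p)
    return [" ".join(p) for p in summary]


def make_toy_scorers(sentences):
    """Hypothetical scorers computed from the input text itself."""
    tokens = [s.lower().split() for s in sentences]
    seen = {(a, b) for sent in tokens for a, b in zip(sent, sent[1:])}

    def s_rep(phrase):  # share of sentences containing every word
        return sum(all(w in sent for w in phrase) for sent in tokens) / len(tokens)

    def s_read(phrase):  # share of adjacent word pairs seen in the text
        pairs = list(zip(phrase, phrase[1:]))
        return sum(p in seen for p in pairs) / len(pairs)

    return s_rep, s_read


reviews = ["very clean rooms", "very clean rooms", "very clean bed", "dirty pool"]
s_rep, s_read = make_toy_scorers(reviews)
print(micropinions(reviews, s_rep, s_read,
                   sigma_rep=0.4, sigma_read=0.5, sigma_ss=6))
```

Passing the scorers in as functions keeps the search loop independent of how representativeness and readability are actually estimated, so the toy scorers can be swapped for stronger ones without touching the candidate generation.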
Performance comparisons (reviews of 330 products): proposed method works the best. [Figure: ROUGE-2 recall (y-axis, 0.00 to 0.09) versus summary size in max words (x-axis, 5 to 30) for KEA, Tfidf, Opinosis, and WebNGram] 37
The program can generate meaningful novel phrases. Example: unseen n-gram (Acer AL2216 Monitor): “wide screen lcd monitor is bright” (readability: -1.88, representativeness: 4.25). Related snippets in original text: “…plus the monitor is very bright…”, “…it is a wide screen, great color, great quality…”, “…this lcd monitor is quite bright and clear…” 38
A Sample Summary: Canon PowerShot SX120 IS • Easy to use • Good picture quality • Crisp and clear • Good video quality. Useful for pushing opinions to devices where the screen is small: e-reader/tablet, smart phones, cell phones 39
Outline 1. Opinion Integration & Validation 2. Opinion Summarization 3. Opinion Analysis 40
Motivation How to infer aspect ratings? How to infer aspect weights? Value Location Service … 41
Opinion Analysis: [Wang et al. KDD 2010] & [Wang et al. KDD 2011] Latent Aspect Rating Analysis Hongning Wang, Yue Lu, ChengXiang Zhai. Latent Aspect Rating Analysis on Review Text Data: A Rating Regression Approach, Proceedings of the 17th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD'10), pages 115-124, 2010. Hongning Wang, Yue Lu, ChengXiang Zhai, Latent Aspect Rating Analysis without Aspect Keyword Supervision, Proceedings of the 18th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD'11), 2011, pages 618-626. 42
Latent Aspect Rating Analysis • Given a set of review articles about a topic with overall ratings • Output – Major aspects commented on in the reviews – Ratings on each aspect – Relative weights placed on different aspects by reviewers • Many applications – Opinion-based entity ranking – Aspect-level opinion summarization – Reviewer preference analysis – Personalized recommendation of products – … 43
Solution: Latent Rating Regression Model
Pipeline: Reviews + overall ratings → Aspect Segmentation → Aspect segments → Latent Rating Regression → Aspect Rating and Aspect Weight (annotation: topic model for aspect discovery)
Aspect segment (term: count) | Term weights | Aspect Rating | Aspect Weight
location: 1, amazing: 1, walk: 1, anywhere: 1 | 0.0, 0.9, 0.1, 0.3 | 1.3 | 0.2
room: 1, nicely: 1, appointed: 1, comfortable: 1 | 0.1, 0.7, 0.1, 0.9 | 1.8 | 0.2
nice: 1, accommodating: 1, smile: 1, friendliness: 1, attentiveness: 1 | 0.6, 0.8, 0.7, 0.8, 0.9 | 3.8 | 0.6 44
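The numbers on this slide can be checked with a short forward computation: each aspect rating is the sum of term count times term weight within its segment, and the aspect ratings are then combined using the aspect weights. The sketch below covers only this forward direction under a simple linear assumption; the actual model infers the latent term weights and aspect weights from reviews and overall ratings, which is not shown. The pairing of term weights with terms follows the order on the slide, the aspect label "service" for the third segment is a guess, and the combined overall rating is computed here rather than taken from the slide.

```python
def aspect_rating(term_counts, term_weights):
    """Aspect rating as the sum of (term count * term weight)."""
    return sum(n * term_weights[t] for t, n in term_counts.items())


def overall_rating(ratings, weights):
    """Overall rating as the aspect-weighted sum of aspect ratings."""
    return sum(weights[a] * ratings[a] for a in ratings)


# Aspect segments (term: count) as shown on the slide.
segments = {
    "location": {"location": 1, "amazing": 1, "walk": 1, "anywhere": 1},
    "room": {"room": 1, "nicely": 1, "appointed": 1, "comfortable": 1},
    "service": {"nice": 1, "accommodating": 1, "smile": 1,
                "friendliness": 1, "attentiveness": 1},
}

# Term weights per aspect, paired with terms in slide order (assumed).
term_weights = {
    "location": {"location": 0.0, "amazing": 0.9, "walk": 0.1, "anywhere": 0.3},
    "room": {"room": 0.1, "nicely": 0.7, "appointed": 0.1, "comfortable": 0.9},
    "service": {"nice": 0.6, "accommodating": 0.8, "smile": 0.7,
                "friendliness": 0.8, "attentiveness": 0.9},
}

aspect_weights = {"location": 0.2, "room": 0.2, "service": 0.6}

ratings = {a: aspect_rating(segments[a], term_weights[a]) for a in segments}
overall = overall_rating(ratings, aspect_weights)
print({a: round(r, 2) for a, r in ratings.items()}, round(overall, 2))
```

The per-aspect sums reproduce the slide's aspect ratings of 1.3, 1.8, and 3.8, which supports reading the term-weight column in the order given.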
Sample Result 1: Aspect-Specific Sentiment Lexicon
Value              Rooms               Location            Cleanliness
resort 22.80       view 28.05          restaurant 24.47    clean 55.35
value 19.64        comfortable 23.15   walk 18.89          smell 14.38
excellent 19.54    modern 15.82        bus 14.32           linen 14.25
worth 19.20        quiet 15.37         beach 14.11         maintain 13.51
bad -24.09         carpet -9.88        wall -11.70         smelly -0.53
money -11.02       smell -8.83         bad -5.40           urine -0.43
terrible -10.01    dirty -7.85         road -2.90          filthy -0.42
overprice -9.06    stain -5.85         website -1.67       dingy -0.38
Uncover sentiment information directly from the data 45
Sample Result 2: Rated Aspect Summarization (Hotel Max in Seattle)
Aspect: Value
– Rating 3.1: “Truly unique character and a great location at a reasonable price Hotel Max was an excellent choice for our recent three night stay in Seattle.”
– Rating 1.7: “Overall not a negative experience, however considering that the hotel industry is very much in the impressing business there was a lot of room for improvement.”
Aspect: Location
– Rating 3.7: “The location, a short walk to downtown and Pike Place market, made the hotel a good choice.”
– Rating 1.2: “When you visit a big metropolitan city, be prepared to hear a little traffic outside!”
Aspect: Business Service
– Rating 2.7: “You can pay for wireless by the day or use the complimentary Internet in the business center behind the lobby though.”
– Rating 0.9: “My only complaint is the daily charge for internet access when you can pretty much connect to wireless on the streets anymore.” 46
Sample Result 3: Comparative analysis of opinion holders
Reviewers emphasizing more on ‘value’ tend to prefer cheaper hotels
City            Avg. Price   Group    Val/Loc   Val/Rm   Val/Ser
Amsterdam       241.6        top-10   190.7     214.9    221.1
                             bot-10   270.8     333.9    236.2
Barcelona       280.8        top-10   270.2     196.9    263.4
                             bot-10   330.7     266.0    203.0
San Francisco   261.3        top-10   214.5     249.0    225.3
                             bot-10   321.1, 311.4 (only two of the three values survive in the source)
Florence        272.1        top-10   269.4     248.9    220.3
                             bot-10   298.9     293.4    292.6
47
Sample Result 4: Analysis of Latent Preferences
              Expensive Hotel        Cheap Hotel
              5 Stars    3 Stars     5 Stars    1 Star
Value         0.134      0.148       0.171      0.093
Room          0.098      0.162       0.126      0.121
Location      0.171      0.074       0.161      0.082
Cleanliness   0.081      0.163       0.116      0.294
Service       0.251, 0.101, 0.049 (only three of the four values survive in the source)
People like expensive hotels because of good service
People like cheap hotels because of good value 48
Sample Result 5: Personalized Ranking of Entities. Query: 0.9 value, 0.1 others (compared with non-personalized ranking) 49
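One way to read the query on this slide is as a weight vector over aspects: 0.9 on value and the remaining 0.1 on everything else. The sketch below ranks entities by that weighted combination of inferred aspect ratings and compares it with a plain average. The scoring rule, the even split of the 0.1 across the other aspects, the hotel names, and all ratings are assumptions made for illustration; the slide does not give the actual ranking function.

```python
def personalized_score(ratings, focus, focus_weight=0.9):
    """Weight the focus aspect heavily; spread the rest over other aspects."""
    others = [r for a, r in ratings.items() if a != focus]
    rest = sum(others) / len(others) if others else 0.0
    return focus_weight * ratings[focus] + (1 - focus_weight) * rest


def rank(entities, key):
    """Return entity names sorted from best to worst under the given key."""
    return sorted(entities, key=lambda name: key(entities[name]), reverse=True)


# Hypothetical aspect ratings for two hotels.
hotels = {
    "Hotel A": {"value": 4.5, "room": 3.0, "location": 3.0},
    "Hotel B": {"value": 3.0, "room": 5.0, "location": 5.0},
}

personalized = rank(hotels, lambda r: personalized_score(r, "value"))
non_personalized = rank(hotels, lambda r: sum(r.values()) / len(r))
print(personalized, non_personalized)
```

With these made-up ratings the value-focused query puts Hotel A first, while the unweighted average prefers Hotel B, which is the kind of difference the personalized versus non-personalized comparison is meant to show.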
Sample Result 6: Discover consumer preferences battery life accessory service file format volume video 50
Summary
Shared opinionated text data are very useful for decision making
Users face significant challenges in integrating, validating, and digesting opinions
1. Opinion Integration & Validation - Leverage expert reviews [WWW 08] - Leverage ontology [COLING 10] - Content-based trust framework [KDD 11]
2. Opinion Summarization - Aspect sentiment summary [WWW 07] - Micro opinion summary [WWW 12]
3. Opinion Analysis - Two-stage rating analysis [KDD 10] - Unified rating analysis [KDD 11] 51
Future Work: Putting All Together Decision=? Information Need Information Sharing Network Preliminary Result: FindiLike System 52
Acknowledgments • Collaborators: Yue Lu, Qiaozhu Mei, Kavita Ganesan, Hongning Wang, Vinod Vydiswaran, and many others • Funding Thanks to MURI! 53
Thank You! Questions/Comments? 54


