f7a66405b6dbc31fb214f6c169b1176e.ppt
- Number of slides: 39
Criticism Mining: Text Mining Experiments on Book, Movie and Music Reviews
Xiao Hu, J. Stephen Downie, M. Cameron Jones
The International Music Information Retrieval Systems Evaluation Lab (IMIRSEL)
University of Illinois at Urbana-Champaign
THE ANDREW W. MELLON FOUNDATION
Agenda
- Motivation
- Customer reviews in epinions.com
- Experimental Setup
- Data set
- Results
- Conclusions & Future Work
Motivation
- Critical consumer-generated reviews of humanities materials
  - A rich resource of reviewers' opinions and of background / contextual information
  - Self-organized: this paves the way for automatic processing
- Text mining: mature and ready to use
- Criticism mining: a tool to help humanities scholars locate, organize, and analyze critical review content
Customer Reviews
- Published on www.epinions.com
- Focused on books, movies and music
- Each review is associated with:
  - a genre label
  - a numerical quality rating
The numerical rating associated with each review was used in our experiments.
Music Genres
- 28 major genre categories: Jazz, Rock, Country, Classical, Blues, Gospel, Punk, …
- Renaissance, Medieval, Baroque, Romantic, …
Experimental Setup
Goal: build and evaluate a prototype criticism mining system that could automatically:
- predict the genre of the work being reviewed
- predict the quality rating assigned to the reviewed item
- differentiate book reviews from movie reviews, especially for items in the same genre
- differentiate fiction book reviews from non-fiction book reviews
Data set
| Reviews on | Book | Movie | Music |
|---|---|---|---|
| No. of reviews | 1,800 | 1,650 | 1,800 |
| No. of genres | 9 | 11 | 12 |
| Mean of review length (words) | 1,095 | 1,514 | 1,547 |
| Std. dev. of review length (words) | 446 | 672 | 784 |
| Term list size (terms) | 41,060 | 47,015 | 47,864 |
Genre Taxonomy (numbers in parentheses mark comparable book and movie genres)
- Book: Action / Thriller (1), Juvenile Fiction (2), Humor (3), Horror (4), Music & Performing Arts (5), Science Fiction & Fantasy (6), Biography & Autobiography, Mystery & Crime, Romance
- Movie: Action / Adventure (1), Children (2), Comedies (3), Horror / Suspense (4), Musical & Performing Arts (5), Science-Fiction / Fantasy (6), Documentary, Dramas, Education / General Interest, Japanimation (Anime), War
Genre Taxonomy: Book
- Fiction: Action / Thriller (1), Juvenile Fiction (2), Horror (4), Science Fiction & Fantasy (6), Mystery & Crime, Romance
- Non-fiction: Humor (3), Music & Performing Arts (5), Biography & Autobiography
Genre Taxonomy: Music
- Blues, Classical, Country, Electronic, Gospel, Hardcore/Punk, Heavy Metal, International, Jazz Instrument, Pop Vocal, R&B, Rock & Pop
- The genre labels and the rating information provided the ground truth for the experiments
Data Preprocessing z. HTML tags were stripped out; z. Stop words were NOT stripped out; z. Punctuation was NOT stripped out; ØThey may contain stylistic information z. Tokens were stemmed
Categorization Model & Implementation
- Naïve Bayesian (NB) Classifier
  - Computationally efficient
  - Empirically effective
- Text-to-Knowledge (T2K) Toolkit
  - A text mining framework
  - Ready-to-use modules and itineraries
  - Natural Language Processing tools integrated
  - Supports fast prototyping of text mining
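For readers unfamiliar with the model, here is a minimal Naïve Bayes text classifier in plain Python. It is a sketch, not the T2K implementation: the multinomial event model, the add-one (Laplace) smoothing parameter `alpha`, and the class name `NaiveBayes` are all assumptions made for illustration.

```python
import math
from collections import Counter, defaultdict


class NaiveBayes:
    """Multinomial Naive Bayes over token lists (illustrative only)."""

    def __init__(self, alpha: float = 1.0):
        self.alpha = alpha                      # smoothing strength (assumed)
        self.class_docs = Counter()             # documents seen per class
        self.term_counts = defaultdict(Counter) # token counts per class
        self.vocab = set()                      # the "term list"

    def fit(self, docs, labels):
        for tokens, label in zip(docs, labels):
            self.class_docs[label] += 1
            self.term_counts[label].update(tokens)
            self.vocab.update(tokens)
        return self

    def predict(self, tokens):
        total_docs = sum(self.class_docs.values())
        vocab_size = len(self.vocab)
        best_label, best_score = None, -math.inf
        for label, n_docs in self.class_docs.items():
            counts = self.term_counts[label]
            denom = sum(counts.values()) + self.alpha * vocab_size
            # Log prior plus the sum of smoothed log likelihoods.
            score = math.log(n_docs / total_docs)
            for token in tokens:
                if token in self.vocab:  # ignore tokens never seen in training
                    score += math.log((counts[token] + self.alpha) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label
```

Training is a single counting pass and prediction is a sum of logarithms per class, which is why the model is cheap enough for term lists of tens of thousands of entries.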
NB itinerary in T2K (diagram): Data Preprocessing, NB Classifier
Results & Discussions
Genre Classification
| Reviews on | Book | Movie | Music |
|---|---|---|---|
| Number of genres | 9 | 11 | 12 |
| Reviews in each genre | 200 | 150 | 150 |
| Term list size (terms) | 41,060 | 47,015 | 47,864 |
| Mean of review length (words) | 1,095 | 1,514 | 1,547 |
| Std Dev of review length (words) | 446 | 672 | 784 |
| Mean of precision | 72.18% | 67.70% | 78.89% |
| Std Dev of precision | 1.89% | 3.51% | 4.11% |
5-fold random cross-validation for book and movie reviews; 3-fold random cross-validation for music reviews
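The "mean of precision" and "std dev of precision" figures come from random k-fold cross-validation. A minimal sketch of that procedure follows. The `evaluate` callback is hypothetical: it stands for "train on these indices, test on those, return the precision". Whether the slides report the sample or the population standard deviation is not stated; the sample version is assumed here.

```python
import random
import statistics


def k_fold_indices(n_items: int, k: int, seed: int = 0) -> list[list[int]]:
    """Shuffle item indices and deal them into k roughly equal folds."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def cross_validate(n_items: int, k: int, evaluate, seed: int = 0):
    """Return (mean, sample std dev) of the per-fold scores.

    evaluate(train_idx, test_idx) is a caller-supplied function that trains
    on train_idx, tests on test_idx, and returns a score such as precision.
    """
    folds = k_fold_indices(n_items, k, seed)
    scores = []
    for i, test_idx in enumerate(folds):
        # Every fold except the i-th one is used for training.
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        scores.append(evaluate(train_idx, test_idx))
    return statistics.mean(scores), statistics.stdev(scores)
```

With k = 5 each review is used for testing exactly once and for training four times, so the reported standard deviation reflects how much the score varies with the random split.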
Confusion: Book Reviews
Classified as: Action, Bio., Hor., Hum., Juv., Mus., Mys., Rom., Sci.
Action: 0.61 0.06 0.01 0.02 0.03 0.20 0.05 0.02
Bio.: 0.04 0.70 0.01 0.05 0.03 0.13 0.01 0.03 0
Horror: 0.09 0 0.66 0 0.05 0 0.12 0.06
Humor: 0.01 0.10 0 0.74 0.03 0.08 0.01 0.03
Juvenile: 0.01 0 0.07 0.86 0.02 0
Music: 0 0.09 0 0 0.01 0.89 0 0 0.01
Mystery: 0.20 0 0.01 0 0.70 0.05 0.04
Romance: 0.06 0.01 0 0.04 0 0.08 0.78 0.03
Science: 0.03 0 0.02 0.01 0.11 0.03 0.01 0.13 0.66
Confusion: Movie Reviews
Classified as: Act., Ani., Chi., Com., Doc., Dra., Edu., Hor., Mus., Sci., War
Action 0.77 0 0 Anime 0 0.89 0.03 Children 0.02 0.01 Comedy 0.09 Docu. 0.01 0.02 0 0 0.10 0.09 0.03 0 0 0.05 0 0.95 0 0.01 0.06 0.52 0.03 0.17 0.06 0.01 0.03 0.01 0.02 0 0 0.04 0.63 0.01 0.19 0 0.02 Drama 0.16 0 0 0.12 0.10 0.45 0.03 0.01 0.04 Edu. 0 0 0.02 0.31 0.03 0.57 0 0 0.01 0.03 Horror 0.15 0.02 0.03 0.02 0.05 0.69 0 0.10 0.02 Music 0 0.01 0.18 0 0.81 0 Science 0.04 0.01 0.02 0 0.06 0.01 0.02 0.03 0 0.76 0.05 War 0.11 0 0.01 0.08 0.05 0.03 0.02 0 0.59
Confusion: Music Reviews
Classified as: Blu., Cla., Cou., Ele., Gos., Pun., Met., Int’l, Jazz, Pop., RB, Roc.
Blues 0.61 0 0.10 0 0.29 0.03 0 0 0 0.06 0 0 0.03 0 0 0.05 0.10 Classical 0 0.94 0 Country 0 0 0.92 0 Electr. 0 0 0 Gospel 0 0 0.05 0 Punk 0 0.05 0 0.71 0.05 0 0 0.19 Metal 0 0 0 0.89 0 0 0.11 Int’l 0 0.04 0.00 0.04 0 0.81 0 0.04 Jazz 0 0.04 0 0 0.89 0.04 0 0.04 Pop Vo. 0 0 0.04 0.07 0.68 0 0.11 R&B 0 0 0 Rock 0.03 0 0.92 0 0 0.80 0 0 0 0.06 0.88 0.06 0.03 0 0.89
Rating Classification z. Five-classification Ø 1 star vs. 2 stars vs. 3 stars vs. 4 stars vs 5 stars z. Binary Group classification Ø 1 star + 2 stars vs. 4 stars + 5 stars zad extremis classification Ø 1 star vs. 5 stars 5 fold random cross validation for all experiments
Rating: Book Reviews
| Experiments | 5 classes | Binary Group | Ad extremis |
|---|---|---|---|
| Number of classes | 5 | 2 | 2 |
| Reviews in each class | 200 | 400 | 300 |
| Term list size (terms) | 34,123 | 28,339 | 23,131 |
| Mean of review length (words) | 1,240 | 1,228 | 1,079 |
| Std Dev of review length (words) | 549 | 557 | 612 |
| Mean of precision | 36.70% | 80.13% | 80.67% |
| Std Dev of precision | 1.15% | 4.01% | 2.16% |
Rating: Movie Reviews
| Experiments | 5 classes | Binary Group | Ad extremis |
|---|---|---|---|
| Number of classes | 5 | 2 | 2 |
| Reviews in each class | 220 | 440 | 400 |
| Term list size (terms) | 40,235 | 36,620 | 31,277 |
| Mean of review length (words) | 1,640 | 1,645 | 1,409 |
| Std Dev of review length (words) | 788 | 770 | 724 |
| Mean of precision | 44.82% | 82.27% | 85.75% |
| Std Dev of precision | 2.27% | 2.02% | 1.20% |
Rating: Music Reviews
| Experiments | 5 classes | Binary Group | Ad extremis |
|---|---|---|---|
| Number of classes | 5 | 2 | 2 |
| Reviews in each class | 200 | 400 | |
| Term list size (terms) | 35,600 | 33,084 | 32,563 |
| Mean of review length (words) | 1,875 | 2,032 | 1,842 |
| Std Dev of review length (words) | 913 | 912 | 956 |
| Mean of precision | 44.25% | 79.75% | 85.94% |
| Std Dev of precision | 2.63% | 3.59% | 3.58% |
Confusion: Book Reviews (rating)
| Classified as | 1 star | 2 stars | 3 stars | 4 stars | 5 stars |
|---|---|---|---|---|---|
| 1 star | 0.45 | 0.24 | 0.11 | 0.05 | 0.04 |
| 2 stars | 0.21 | 0.36 | 0.17 | 0.06 | 0.07 |
| 3 stars | 0.15 | 0.19 | 0.28 | 0.17 | 0.17 |
| 4 stars | 0.09 | 0.12 | 0.22 | 0.41 | 0.26 |
| 5 stars | 0.10 | 0.09 | 0.21 | 0.31 | 0.46 |
Confusion: Movie Reviews (rating)
| Classified as | 1 star | 2 stars | 3 stars | 4 stars | 5 stars |
|---|---|---|---|---|---|
| 1 star | 0.49 | 0.15 | 0.04 | 0.05 | 0.07 |
| 2 stars | 0.19 | 0.45 | 0.24 | 0.13 | 0.03 |
| 3 stars | 0.17 | 0.23 | 0.28 | 0.16 | 0.16 |
| 4 stars | 0.08 | 0.11 | 0.27 | 0.41 | 0.20 |
| 5 stars | 0.07 | 0.06 | 0.17 | 0.27 | 0.54 |
Confusion: Music Reviews (rating)
| Classified as | 1 star | 2 stars | 3 stars | 4 stars | 5 stars |
|---|---|---|---|---|---|
| 1 star | 0.61 | 0.24 | 0.11 | 0.03 | 0 |
| 2 stars | 0.24 | 0.15 | 0.13 | 0.06 | 0 |
| 3 stars | 0.07 | 0.36 | 0.41 | 0.10 | 0.09 |
| 4 stars | 0.05 | 0.15 | 0.20 | 0.32 | 0.11 |
| 5 stars | 0.02 | 0.09 | 0.15 | 0.48 | 0.80 |
Classification of Book and Movie Reviews 1
- Reviews on all available genres
  - Books: 9 genres; Movies: 11 genres
- Reviews on individual, comparable genres

| Pair | Book | Movie |
|---|---|---|
| 1 | Action / Thriller | Action / Adventure |
| 2 | Juvenile Fiction | Children |
| 3 | Humor | Comedies |
| 4 | Horror | Horror / Suspense |
| 5 | Music & Performing Arts | Musical & Performing Arts |
| 6 | Science Fiction & Fantasy | Science-Fiction / Fantasy |
Classification of Book and Movie Reviews 2
- Eliminated words that directly suggest the categories:
  - "book", "movie", "fiction", "film", "novel", "actor", "actress", "read", "watch", "scene"
  - These occur frequently in one category but not in both
  - Removing them makes the task harder and avoids oversimplifying it
- Results suggest a stylistic difference between users' criticisms of books and of movies
5-fold random cross-validation for all experiments
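Removing the category-revealing words is a simple token filter applied before classification. The sketch below uses the word list from this slide. Whether the original filter ran before or after stemming, and whether it was case-sensitive, is not stated, so exact matching on lowercased tokens is an assumption, as are the names `BOOK_MOVIE_CUES` and `drop_cue_words`.

```python
# Words listed on the slide as directly suggesting "book" or "movie".
BOOK_MOVIE_CUES = frozenset({
    "book", "movie", "fiction", "film", "novel",
    "actor", "actress", "read", "watch", "scene",
})


def drop_cue_words(tokens, cues=BOOK_MOVIE_CUES):
    """Return the tokens with category-revealing words removed."""
    return [t for t in tokens if t.lower() not in cues]
```

The same function works for any other cue list by passing a different `cues` set.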
Book vs. Movie Reviews 1
| Genre | All Genres | Action | Horror |
|---|---|---|---|
| Number of classes | 2 | 2 | 2 |
| Reviews in each class | 800 | 400 | 400 |
| Term list size (terms) | 49,263 | 24,552 | 25,509 |
| Mean of review length (words) | 1,608 | 933 | 1,779 |
| Std Dev of review length (words) | 697 | 478 | 546 |
| Mean of precision | 94.28% | 95.63% | 98.12% |
| Std Dev of precision | 1.18% | 0.99% | 1.40% |
Book vs. Movie Reviews 2
| Genre | Humor / Comedy | Juvenile / Children |
|---|---|---|
| Number of classes | 2 | 2 |
| Reviews in each class | 400 | 400 |
| Term list size (terms) | 26,713 | 21,326 |
| Mean of review length (words) | 1,091 | 849 |
| Std Dev of review length (words) | 625 | 333 |
| Mean of precision | 99.13% | 97.87% |
| Std Dev of precision | 1.05% | 0.71% |
Book vs. Movie Reviews 3
| Genre | Music & Performing Arts | Science Fiction & Fantasy |
|---|---|---|
| Number of classes | 2 | 2 |
| Reviews in each class | 400 | 400 |
| Term list size (terms) | 23,217 | 25,088 |
| Mean of review length (words) | 791 | 1,011 |
| Std Dev of review length (words) | 531 | 544 |
| Mean of precision | 97.02% | 97.25% |
| Std Dev of precision | 1.49% | 1.91% |
Classification of Fiction and Non-fiction Book Reviews 1
- Fiction: Action / Thriller (1), Juvenile Fiction (2), Horror (4), Science Fiction & Fantasy (6), Mystery & Crime, Romance
- Non-fiction: Humor (3), Music & Performing Arts (5), Biography & Autobiography
Classification of Fiction and Non-fiction Book Reviews 2
- Eliminated words that directly suggest the categories:
  - "fiction", "novel", "character", "plot", and "story"
  - These occur frequently in one category but not in both
  - Removing them makes the task harder and avoids oversimplifying it
- Results suggest a stylistic difference between users' criticisms of fiction books and of non-fiction books
5-fold random cross-validation for all experiments
Fiction vs. Non-fiction Book Reviews
| Experiment | Fiction vs. Non-fiction |
|---|---|
| Number of classes | 2 |
| Reviews in each class | 600 |
| Term list size (terms) | 35,210 |
| Mean of review length (words) | 1,220 |
| Std Dev of review length (words) | 493 |
| Mean of precision | 94.67% |
| Std Dev of precision | 1.16% |
Confusion: Fiction vs. Non-fiction Book Reviews
| Classified as | Fiction | Non-fiction |
|---|---|---|
| Fiction | 0.98 | 0.02 |
| Non-fiction | 0.09 | 0.91 |
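A confusion matrix of this kind can be computed from paired lists of actual and predicted labels. The sketch below normalizes each row by the number of reviews in the actual class, so that each row sums to 1, as the two rows above do. The function name and the dictionary layout are illustrative assumptions, not the toolkit's output format.

```python
from collections import Counter


def confusion_matrix(actual, predicted, labels):
    """Return {actual_label: {predicted_label: proportion}} with rows summing to 1."""
    counts = Counter(zip(actual, predicted))
    matrix = {}
    for a in labels:
        row_total = sum(counts[(a, p)] for p in labels)
        matrix[a] = {
            p: (counts[(a, p)] / row_total if row_total else 0.0)
            for p in labels
        }
    return matrix
```

With equally sized classes, the average of the diagonal gives the overall proportion classified correctly. For the table above that is (0.98 + 0.91) / 2 = 0.945, roughly in line with the 94.67% mean precision reported for this experiment; the small gap is presumably due to rounding of the matrix entries.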
Conclusions
- Customer reviews are an excellent resource for studying humanities materials
- Successful experiments:
  - High classification precision for genres, ratings, book vs. movie reviews, and fiction vs. non-fiction book reviews
  - Reasonable confusions
- Text mining techniques can help find important information about the materials being reviewed
- Criticism mining: make the ever-growing consumer-generated review resources useful to humanities scholars
Future work z. More text mining techniques Ødecision trees, frequent pattern mining z. Other critical text Øblogs, wikis, etc z. Other facets of reviews Ø“usage” in music reviews z. Feature studies Ø answer the “why” questions
Questions? Thank you!