fe16b413bd7945d08bf80d7c23c629cb.ppt
- Number of slides: 26
Series-O-Rama: Search & Recommend TV series with SQL
Guillaume Cabanac (guillaume.cabanac@univ-tlse3.fr)
March 27th, 2012
http://bit.ly/series-o-rama 2012
Slide 2: Toulouse: A Picture is Worth a Thousand Words
- Capbreton: 3 h ride
- Ax-les-Thermes: 1 h 40 ride
- Collioure: 2 h 30 ride
- Toulouse: population 437,000; students 97,000
- Aberdeen: population 210,400; students ???
Slide 3: Telly Addicts Need Help to Find TV Series
- Main topics of Grey’s Anatomy?
  - Text mining, visualization
- Series about ‘plane crash island’?
  - Search engine
- What should I watch next?
  - Recommender system
- Sites shown on the slide: en.wikipedia.org, amazon.com
Slide 4: Text Mining: Let’s Crunch Subtitles
- Main topics of Grey’s Anatomy?
  - Text mining, visualization
- Series about ‘plane crash island’?
  - Search engine
- What should I watch next?
  - Recommender system
- Examples shown on the slide: Grey’s Anatomy, Cold Case
Slide 5: What’s in a Subtitle File?
- File name: Title – Season – Episode – Language.srt
  - 1 episode = 1 plain text file
- Synchronization: start --> stop
- Dialogue
  - We can easily extract words:
    [a, again*2, and, but, com, cuban, different, favorite, food, for*2, forum, going, great, happen*2, has, hungry, i*2, is, it, love, m, my, nice, night*2, miami, now, pork, s*2, sandwiches, something, the, to*2, tonight, town, www]
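The word extraction described on this slide can be sketched as follows. This is only an illustration: the talk's system is written in Java and Oracle, while this sketch uses Python, and the two-cue sample and the name `extract_words` are hypothetical. It assumes the usual SubRip layout (cue number, `start --> stop` line, dialogue lines).

```python
import re
from collections import Counter

# Hypothetical two-cue sample in SubRip (.srt) layout; not taken from the talk.
SAMPLE_SRT = """1
00:00:01,000 --> 00:00:03,500
Nice night.

2
00:00:04,000 --> 00:00:06,000
I'm hungry for something different now.
"""

# Matches the 'start --> stop' synchronization line of a cue.
TIMING = re.compile(r"^\d{2}:\d{2}:\d{2},\d{3} --> \d{2}:\d{2}:\d{2},\d{3}")


def extract_words(srt_text):
    """Return lowercase word counts from the dialogue lines of an .srt text."""
    counts = Counter()
    for line in srt_text.splitlines():
        line = line.strip()
        # Skip blank lines, cue numbers, and timing lines.
        if not line or line.isdigit() or TIMING.match(line):
            continue
        # Splitting on non-letters turns "I'm" into "i" and "m",
        # which matches the single-letter tokens listed on the slide.
        counts.update(re.findall(r"[a-z]+", line.lower()))
    return counts


words = extract_words(SAMPLE_SRT)
print(sorted(words.items()))
```

One caveat of this simple filter: a dialogue line made only of digits would be mistaken for a cue number and skipped.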
Slide 6: DB Technology at Work! [Home]
- 7,527 files = 337 MB
- 100% Java and Oracle
Slide 7: DB Technology at Work! [Search engine]
- Ranked list of results
Slide 8: DB Technology at Work! [Infos]
- Most popular terms
- Most related series
Slide 9: DB Technology at Work! [Recommendations]
Slide 10: DB Technology at Work! [Recommendations]
- I liked
- I disliked
- What should I watch next?
Slide 11: DB Technology at Work! [Recommendations]
- Ranked list of recommendations
Slide 12: How Does this Work?
Slide 13: Architecture and Data Model
- Offline: subtitles -> indexing -> DB
- Online: DB -> GUI (browsing, searching, recommending)
- Series = {idS, name}
    idS  name
    12   Lost
    45   Dexter
- Dict = {idT, term}
    idT  term
    8    plane
    27   killer
    29   crash
- Posting = {idT*, idS*, nb}
    idT*  idS*  nb
    27    45    89
    8     45    3
    8     12    90
Slide 14 (Theory): Text Indexing Pipeline
- Tokenization + lowercase: [the, plane, crashed, ..., planes, ..., is]
- Stopwords removal: [plane, crashed, ..., planes, ...]
- Stemming with Porter’s Stemmer (1980), http://qaa.ath.cx/porter_js_demo.html: [plane, crash, ..., plane, ...]
- Counting: {(plane, 48), (crash, 15), ...}
- Sample text: In 1720 Robert Gordon retired to Aberdeen having amassed a considerable fortune in Poland. On his death 11 years later he willed his entire estate to build a residential school for educating young boys. In the summer of 1750 the Robert Gordon’s Hospital was born. In 1881 this was converted into a day school to be known as Robert Gordon’s College. This school also began to hold day and evening classes for boys, girls and adults in primary, secondary, mechanical and other subjects …
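The four pipeline steps can be sketched in a few lines of Python. Two things here are assumptions rather than the talk's own choices: the stopword list is a tiny illustrative one, and `toy_stem` is a crude suffix stripper standing in for Porter's stemmer, not an implementation of it. The sample sentence is also hypothetical.

```python
import re
from collections import Counter

# Tiny illustrative stopword list (assumption; the talk does not list its own).
STOPWORDS = {"the", "is", "a", "an", "and", "of", "to", "in", "on"}


def toy_stem(word):
    """Crude suffix stripping; a stand-in for Porter's stemmer, NOT Porter itself."""
    for suffix in ("ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def index_text(text):
    """Tokenize + lowercase, remove stopwords, stem, then count."""
    tokens = re.findall(r"[a-z]+", text.lower())
    kept = [t for t in tokens if t not in STOPWORDS]
    return Counter(toy_stem(t) for t in kept)


# Hypothetical sample sentence, chosen to echo the slide's plane/crash example.
counts = index_text("The plane crashed on the island. Planes crash.")
print(counts.most_common())
```

With this input, "crashed" and "crash" end up under the same stem, as do "plane" and "planes", which is the effect the stemming step is meant to have.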
Slide 15 (Theory): Similarity of Paired Series
- Dice’s Coefficient (1945)
  - Based on set theory
  - Example: let us model a series as a set of terms
    House = {hospital, doctor, crazy, psycho}
    Grey’s = {doctor, care, hospital}
- A big limitation
  - The distribution of terms among series is ignored
  - It makes no difference whether a term occurs 1 time or 1,000 times
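As a worked example, Dice's coefficient for two sets A and B is 2·|A ∩ B| / (|A| + |B|). The sketch below applies it to the two term sets from the slide; the function name `dice` and the zero result for two empty sets are choices made for this illustration.

```python
def dice(a, b):
    """Dice's coefficient of two sets: 2 * |A & B| / (|A| + |B|)."""
    if not a and not b:
        return 0.0  # convention chosen here for two empty sets
    return 2 * len(a & b) / (len(a) + len(b))


house = {"hospital", "doctor", "crazy", "psycho"}
greys = {"doctor", "care", "hospital"}

# Two shared terms (hospital, doctor) out of 4 + 3 terms: 2 * 2 / 7.
print(round(dice(house, greys), 3))  # 0.571
```

Because the inputs are sets, a term that occurs once and a term that occurs a thousand times contribute equally, which is the limitation the slide points out.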
Slide 16 (Theory): Vector Space Model, Term Weighting
- Figure: term counts over the vocabulary, with the maximum count (max) marked for each series
- Raw TF, example term ‘survive’: dexter > lost
- Normalization, TF / max(TF): dexter < lost
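The effect of dividing by max(TF) can be illustrated with a short sketch. The counts below are hypothetical and the series are deliberately unnamed; the point is only that a ranking by raw counts can flip once each count is divided by the largest count in its own series.

```python
def normalize(raw_tf):
    """Divide every raw count by the largest count in the same series."""
    peak = max(raw_tf.values())
    return {term: nb / peak for term, nb in raw_tf.items()}


# Hypothetical raw counts: a long series and a short one.
long_series = {"people": 400, "survive": 40}
short_series = {"people": 50, "survive": 30}

# Raw counts favour the long series (40 > 30) ...
print(long_series["survive"], short_series["survive"])
# ... but the normalized weights favour the short one.
print(normalize(long_series)["survive"], normalize(short_series)["survive"])
```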
Slide 17 (Theory): Best Match Retrieval
- Figure: vector with term positions 1, 45, 1467, 6790
- 1 TV series = 1 vector
- Now, we know how to:
  - Find the most popular terms for a TV series
  - Compute similarity between TV series
  - Find TV series matching a query
Slide 18 (Theory): More on Term Weighting
- Figure: vector with term positions 1, 45, 1467, 6790
- 1 TV series = 1 vector
- All terms are supposed to be equally representative …
  - but ‘survive’ is way more unusual than ‘people’
  - ‘survive’ better represents Lost than ‘people’ does
- IDF: Inverse Document Frequency
Slide 19 (Theory): The Big Picture: TF*IDF
- An important term for series S is frequent in S and globally unusual
- 1 TV series = 1 vector
- Some limitations
  - Term positions? e.g., “ice truck killer” in Dexter
  - Stemming? e.g., christmas
  - Mixture of languages? e.g., amusant (FR) vs. fun (EN)
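A minimal sketch of TF*IDF weighting follows. The slides do not give the exact IDF formula, so the common form idf(t) = ln(N / df(t)) is assumed here, with N the number of series and df(t) the number of series containing t. The four-series corpus and its normalized TF values are hypothetical.

```python
import math

# Hypothetical normalized TF values for four unnamed series.
series_terms = {
    "series_a": {"people": 1.0, "survive": 0.5},
    "series_b": {"people": 0.8},
    "series_c": {"people": 1.0},
    "series_d": {"people": 0.9},
}


def idf(term):
    """Assumed form: ln(N / df). Returns 0.0 for a term found in no series."""
    n = len(series_terms)
    df = sum(1 for terms in series_terms.values() if term in terms)
    return math.log(n / df) if df else 0.0


def tf_idf(series, term):
    """Weight of a term in a series: normalized TF times IDF."""
    return series_terms[series].get(term, 0.0) * idf(term)


# 'people' occurs everywhere, so its IDF (and weight) is 0;
# 'survive' occurs in one series only, so it gets a positive weight.
print(tf_idf("series_a", "people"), tf_idf("series_a", "survive"))
```

Under this assumed formula a term present in every series carries no weight at all, which matches the slide's remark that an important term is frequent in S and globally unusual.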
Slide 20: Theory … and Practice
- Series = {idS, name, maxNb}
    idS  name    maxNb
    12   Lost    540
    45   Dexter  125
- Dict = {idT, term, idf}
    idT  term    idf
    8    plane   1.25
    27   killer  2.87
    29   crash   3.07
- Posting = {idT*, idS*, nb, tf}
    idT*  idS*  nb  tf
    27    45    89  0.71
    8     45    3   0.02
    8     12    90  0.16
Slide 21: Description of a TV Series
- Example: Lost (the query uses a join, ⋈)
- Many surnames need to be filtered out
Slide 22: Retrieval of TV Series, queries with 1 term
- Example query: survive (the query uses a join, ⋈)
- Importance of normalization
  - Stargate Atlantis: nb/maxNb = 63/1116 = 0.05645
  - Blade: nb/maxNb = 9/163 = 0.05521
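The single-term ranking can be reproduced with the two figures given on the slide. This Python sketch is only an illustration (the talk's system uses SQL on Oracle); the dictionary layout and variable names are hypothetical, while the counts are the slide's.

```python
# (nb, maxNb) for the term 'survive', as given on the slide.
postings = {
    "Stargate Atlantis": (63, 1116),
    "Blade": (9, 163),
}

# Normalized TF = nb / maxNb for each series.
tf = {series: nb / max_nb for series, (nb, max_nb) in postings.items()}

# Rank series by decreasing normalized TF.
ranked = sorted(tf, key=tf.get, reverse=True)

for series in ranked:
    print(series, round(tf[series], 5))
```

The raw counts differ by a factor of seven (63 against 9), yet after normalization the two series score almost the same.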
Slide 23: Retrieval of TV Series, queries with n terms
- Example query: survive mulder (the query uses a join, ⋈)
- 18 | X-Files
    term     tf     idf    tf * idf
    survive  0.014  0.107  0.001
    mulder   1.000  3.977  3.977
    sum                    3.978
- 67 | The Vampire Diaries
    term     tf     idf    tf * idf
    survive  0.028  0.107  0.003
    mulder   0.007  3.977  0.028
    sum                    0.031
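The scores on this slide are the sum, over the query terms, of tf * idf. The sketch below recomputes them from the slide's own tf and idf values; the function name `score` and the dictionary layout are hypothetical.

```python
# idf per term and tf per (series, term), as given on the slide.
idf = {"survive": 0.107, "mulder": 3.977}
tf = {
    "X-Files": {"survive": 0.014, "mulder": 1.000},
    "The Vampire Diaries": {"survive": 0.028, "mulder": 0.007},
}


def score(series, query_terms):
    """Sum of tf * idf over the query terms; missing terms contribute 0."""
    return sum(tf[series].get(t, 0.0) * idf.get(t, 0.0) for t in query_terms)


query = ["survive", "mulder"]
ranked = sorted(tf, key=lambda s: score(s, query), reverse=True)

for series in ranked:
    print(series, round(score(series, query), 3))
```

The rare term "mulder" dominates the score, so X-Files ranks far ahead even though "survive" is relatively more frequent in The Vampire Diaries.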
Slide 24: Computing Similarities Among TV Series (1/2)
- Similar to House?
- First, let’s compute the numerator, where:
  - Ai = terms from House
  - Bi = terms from another TV series
- The query uses a join (⋈)
Slide 25: Computing Similarities Among TV Series (2/2)
- Similar to House?
- The query uses three joins (⋈ ⋈ ⋈)
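The similarity formula itself did not survive in the extracted text. A numerator built from the terms Ai and Bi is consistent with cosine similarity, so the sketch below assumes that measure: the sum of Ai * Bi divided by the product of the two vector norms. The term weights are hypothetical and the second series is unnamed.

```python
import math


def cosine(a, b):
    """Cosine similarity of two sparse term-weight vectors (dicts)."""
    # Numerator: sum of Ai * Bi over the terms the two series share.
    numerator = sum(w * b[t] for t, w in a.items() if t in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return numerator / (norm_a * norm_b)


# Hypothetical term weights, for illustration only.
house = {"hospital": 0.9, "doctor": 0.8, "crazy": 0.3}
other = {"hospital": 0.7, "doctor": 0.6, "care": 0.5}

print(round(cosine(house, other), 3))
```

In a relational setting the numerator corresponds to joining the postings of the two series on the term identifier, which may be why the first slide shows a single join.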
Thank you
http://www.irit.fr/~Guillaume.Cabanac