Gegevensbanken (Databases) 2012: A bit about data mining and information retrieval
Bettina Berendt
http://people.cs.kuleuven.be/~bettina.berendt/

A bit about data mining and information retrieval: motivation and summary

Where are we?
Course topics, in order:
• Intro, ER
• EER, (E)ER to relational schema
• Relational model
• Relational algebra & relational calculus
• SQL
• Connecting programs to databases
• Functional dependencies & normalisation
• PHP
• Database security
• Memory and file organisation
• External hashing
• Index structures
• Query processing
• Transaction processing and concurrency control
• Data mining and information retrieval
• XML (and more about the Web as a database), NoSQL
• New themes / outlook
Lesson numbers (as listed): 1, 2, 2, 3, 4, 5, 6, 7, 8, 10, 11, 12, 13, 14, 15-17, 18, 9
Lecturers (as listed): ED, ED, ED, KV, KV, KV, BB, BB, ED

Who would a bank lend money to?
Database queries:
• Who has ever failed to repay a loan?
  SELECT DISTINCT Fname, Lname FROM Clients, Loans WHERE clientID = loantakerID AND paid = 'NO'
Data warehousing / Online Analytical Processing (OLAP):
• In which neighbourhoods did more than 20% of the clients fail to repay a loan last year?
Data mining:
• For which people can we expect that they will not repay a loan? (= neighbourhood, job, age, gender, ...)

Another application area
• The Web
• You use Web data mining every day

Indexing and ranking

Behaviour analysis for recommender systems

Text mining for recommender systems

Or also

Who buys the printer XYZ?
• My customer (etc.): database lookup
• "I don't know the answer, but the following 2398445 pages are relevant to your query": search engine / information retrieval / document retrieval
• This user (because of their profile, their postings, their friends and their friends' properties, …): data mining
• Someone who has just sold or thrown away their old printer: logic
Different methods of inference; different types of answers.
Describing / known data versus predicting.

What follows is also …
… a preview of several courses in the Master's programme, e.g.
• Advanced Databases
• Text-based Information Retrieval
• Current Trends in Databases
• Data Mining
Also interesting / related (logic!), but not today's topic:
• Modelling of Complex Systems

Agenda
• How do data become powerful? Mining & combination
• Methods (1): Classifier learning on relations
• Methods (2): Itemset mining
• From relations to texts
• Methods (3): Classifier learning on texts
• (A bit of) the KD process: preprocessing
• What do search engines do?
• What can WE do?

Knowledge discovery (and data mining)
"The non-trivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data."

Data mining techniques
• Exploratory data analysis with interactive, often visual methods
• Descriptive modelling (density estimation, cluster analysis and segmentation, dependency modelling)
• Predictive modelling (classification and regression)
  • The goal is to build a model that predicts the value of one variable from the known values of the other variables.
  • In classification the predicted value is a category;
  • in regression the value is quantitative.
• Discovering (local) patterns and rules
  • Typical examples are frequent patterns such as sets, sequences and subgraphs,
  • and rules that can be derived from them (e.g. association rules).

Particularly interesting when based on combined data ... and ...

Data
• relational data,
• texts,
• graphs,
• semi-structured data (e.g. Web clickstreams),
• images, …

Input data ...
Q: when does this person play tennis?

Outlook   Temp  Humidity  Windy  Play
Sunny     Hot   High      False  No
Sunny     Hot   High      True   No
Overcast  Hot   High      False  Yes
Rainy     Mild  High      False  Yes
Rainy     Cool  Normal    False  Yes
Rainy     Cool  Normal    True   No
Overcast  Cool  Normal    True   Yes
Sunny     Mild  High      False  No
Sunny     Cool  Normal    False  Yes
Rainy     Mild  Normal    False  Yes
Sunny     Mild  Normal    True   Yes
Overcast  Mild  High      True   Yes
Overcast  Hot   Normal    False  Yes
Rainy     Mild  High      True   No

Terminology (using a popular data example: the weather data above)
Rows:
• Instances (think of them as objects)
• Here: days, described by the columns
Columns:
• Features
• Here: Outlook, Temp, …
In this case, there is a feature with a special role:
• The class
• Here: Play (does X play tennis on this day?)
This is "relational DB mining". We will later see other types of data and the mining applied to them.

The goal: a decision tree for classification / prediction
In which weather will someone play (tennis etc.)?

Constructing decision trees
Strategy: top down, in recursive divide-and-conquer fashion
• First: select an attribute for the root node
  • Create a branch for each possible attribute value
• Then: split the instances into subsets
  • One for each branch extending from the node
• Finally: repeat recursively for each branch, using only the instances that reach the branch
• Stop if all instances have the same class
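The recursive strategy above can be sketched in a few lines of Python. This is a minimal illustration, not the WEKA implementation: the function names (`build_tree`, `classify`) and the dictionary layout are assumptions, the attribute choice is a placeholder (first remaining attribute) unless a `choose` function is passed in, and the four rows are a projection of the weather data onto two attributes.

```python
from collections import Counter

def build_tree(rows, attributes, target, choose=None):
    """Top-down, recursive divide-and-conquer construction of a decision tree."""
    classes = [r[target] for r in rows]
    # Stop: all instances have the same class, or no attribute is left to split on.
    if len(set(classes)) == 1 or not attributes:
        return Counter(classes).most_common(1)[0][0]
    # Select the attribute for this node (placeholder: the first remaining one).
    attr = choose(rows, attributes, target) if choose else attributes[0]
    remaining = [a for a in attributes if a != attr]
    node = {"attribute": attr, "branches": {}}
    # One branch per attribute value; recurse using only the instances that reach it.
    for value in sorted({r[attr] for r in rows}):
        subset = [r for r in rows if r[attr] == value]
        node["branches"][value] = build_tree(subset, remaining, target, choose)
    return node

def classify(tree, row):
    """Follow the branches until a leaf (a class label) is reached."""
    while isinstance(tree, dict):
        tree = tree["branches"][row[tree["attribute"]]]
    return tree

# Four weather instances, reduced to two attributes for illustration.
rows = [
    {"outlook": "sunny", "windy": False, "play": "no"},
    {"outlook": "overcast", "windy": False, "play": "yes"},
    {"outlook": "rainy", "windy": False, "play": "yes"},
    {"outlook": "rainy", "windy": True, "play": "no"},
]
tree = build_tree(rows, ["outlook", "windy"], "play")
print(classify(tree, {"outlook": "rainy", "windy": True}))
```

Passing an information-gain based `choose` function would turn this skeleton into the attribute-selection strategy discussed on the following slides.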

Which attribute to select?

Criterion for attribute selection
Which is the best attribute?
• We want to get the smallest tree
• Heuristic: choose the attribute that produces the "purest" nodes
Popular impurity criterion: information gain
• Information gain increases with the average purity of the subsets
Strategy: choose the attribute that gives the greatest information gain

Computing information
• Measure information in bits
• Given a probability distribution, the info required to predict an event is the distribution's entropy
• Entropy gives the information required in bits (can involve fractions of bits!)
Formula for computing the entropy:
$\mathrm{entropy}(p_1, p_2, \dots, p_n) = -p_1 \log_2 p_1 - p_2 \log_2 p_2 - \dots - p_n \log_2 p_n$
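As a small worked example of the entropy computation, the sketch below evaluates it on class counts. The helper name `info` mirrors the info([...]) notation used on the later slides; writing it as a Python function is an assumption made here for illustration.

```python
from math import log2

def info(counts):
    """Entropy in bits of the class distribution given by a list of counts."""
    total = sum(counts)
    # Terms with a zero count contribute nothing (0 * log 0 is taken as 0).
    return sum(-(c / total) * log2(c / total) for c in counts if c > 0)

print(round(info([9, 5]), 3))  # 0.94 (9 "yes" and 5 "no" instances)
print(info([4, 0]))            # 0.0  (a pure node needs no further information)
print(info([7, 7]))            # 1.0  (an even split needs one full bit)
```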

Example: attribute Outlook

Computing information gain
Information gain = information before splitting - information after splitting
gain(Outlook) = info([9, 5]) - info([2, 3], [4, 0], [3, 2]) = 0.940 - 0.693 = 0.247 bits
Information gain for the attributes of the weather data:
• gain(Outlook) = 0.247 bits
• gain(Temperature) = 0.029 bits
• gain(Humidity) = 0.152 bits
• gain(Windy) = 0.048 bits
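The four gains can be recomputed directly from the data. The sketch below assumes the standard 14-instance version of the weather data (9 "yes", 5 "no", which is what info([9, 5]) refers to); the names `WEATHER`, `info` and `gain` are chosen here for illustration.

```python
from collections import Counter, defaultdict
from math import log2

# Columns: Outlook, Temp, Humidity, Windy, Play (standard 14-instance weather data).
WEATHER = [
    ("Sunny", "Hot", "High", False, "No"), ("Sunny", "Hot", "High", True, "No"),
    ("Overcast", "Hot", "High", False, "Yes"), ("Rainy", "Mild", "High", False, "Yes"),
    ("Rainy", "Cool", "Normal", False, "Yes"), ("Rainy", "Cool", "Normal", True, "No"),
    ("Overcast", "Cool", "Normal", True, "Yes"), ("Sunny", "Mild", "High", False, "No"),
    ("Sunny", "Cool", "Normal", False, "Yes"), ("Rainy", "Mild", "Normal", False, "Yes"),
    ("Sunny", "Mild", "Normal", True, "Yes"), ("Overcast", "Mild", "High", True, "Yes"),
    ("Overcast", "Hot", "Normal", False, "Yes"), ("Rainy", "Mild", "High", True, "No"),
]
ATTRIBUTES = ["Outlook", "Temperature", "Humidity", "Windy"]

def info(counts):
    """Entropy in bits of a class distribution given as counts."""
    total = sum(counts)
    return sum(-(c / total) * log2(c / total) for c in counts if c > 0)

def gain(rows, attr_index):
    """Information before splitting minus the weighted information after splitting."""
    before = info(Counter(r[-1] for r in rows).values())
    groups = defaultdict(list)
    for r in rows:
        groups[r[attr_index]].append(r[-1])
    after = sum(len(g) / len(rows) * info(Counter(g).values())
                for g in groups.values())
    return before - after

for i, name in enumerate(ATTRIBUTES):
    print(f"gain({name}) = {gain(WEATHER, i):.3f} bits")
```

Outlook has the largest gain, so it would be chosen for the root node.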

Continuing to split (within the Outlook = Sunny branch)
• gain(Temperature) = 0.571 bits
• gain(Humidity) = 0.971 bits
• gain(Windy) = 0.020 bits

Final decision tree
• Note: not all leaves need to be pure; sometimes identical instances have different classes
• Splitting stops when the data can't be split any further

Q: Entropy: does this have anything to do with the thermodynamic concept (a measure of the disorder of something, a quantity that can only increase, whatever happens), or is it entirely unrelated?
A: Yes and no …
• Recommended source: Stanford Encyclopedia of Philosophy, http://plato.stanford.edu/entries/information-entropy/
• Somewhat shorter (but I cannot judge the content): http://en.wikipedia.org/wiki/Entropy_in_thermodynamics_and_information_theory

Data
• "Market basket data": attributes with Boolean domains
• In a table, each row is a basket (also: transaction)

Transaction ID  Attributes (basket items)
1               Spaghetti, tomato sauce
2               Spaghetti, bread
3               Spaghetti, tomato sauce, bread
4               Bread, butter

As a relational table

Transaction  Spaghetti  Tomato sauce  Bread  Butter
1            1          1             0      0
2            1          0             1      0
3            1          1             1      0
4            0          0             1      1
5            0          1             1      0

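Solution approach: the apriori principle and the pruning of the search tree
Apriori principle: every subset of a frequent itemset is itself frequent, so all supersets of an infrequent itemset can be pruned without counting them.
Itemsets in the search tree, by size:
• 4 items: {spaghetti, tomato sauce, bread, butter}
• 3 items: {spaghetti, tomato sauce, bread}, {spaghetti, tomato sauce, butter}, {spaghetti, bread, butter}, {tomato sauce, bread, butter}
• 2 items: {spaghetti, tomato sauce}, {spaghetti, bread}, {spaghetti, butter}, {tomato sauce, bread}, {tomato sauce, butter}, {bread, butter}
• 1 item: {spaghetti}, {tomato sauce}, {bread}, {butter}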


Generating large k-itemsets with Apriori

Transaction ID  Attributes (basket items)
1               Spaghetti, tomato sauce
2               Spaghetti, bread
3               Spaghetti, tomato sauce, bread
4               Bread, butter
5               Bread, tomato sauce

• Min. support = 40%
• Step 1: candidate 1-itemsets
  • Spaghetti: support = 3 (60%)
  • Tomato sauce: support = 3 (60%)
  • Bread: support = 4 (80%)
  • Butter: support = 1 (20%)
• Step 2: large 1-itemsets
  • Spaghetti
  • Tomato sauce
  • Bread
  Candidate 2-itemsets:
  • {spaghetti, tomato sauce}: support = 2 (40%)
  • {spaghetti, bread}: support = 2 (40%)
  • {tomato sauce, bread}: support = 2 (40%)
• Step 3: large 2-itemsets
  • {spaghetti, tomato sauce}
  • {spaghetti, bread}
  • {tomato sauce, bread}
  Candidate 3-itemsets:
  • {spaghetti, tomato sauce, bread}: support = 1 (20%)
• Step 4: large 3-itemsets
  • {} (none)
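The four steps can be traced with a compact level-wise implementation. This is a simplified sketch of the Apriori idea, not the optimised algorithm from the original paper: candidate generation uses a plain join-and-prune over sets, and the names `TRANSACTIONS`, `support` and `apriori` are chosen here for illustration.

```python
from itertools import combinations

TRANSACTIONS = [
    {"spaghetti", "tomato sauce"},
    {"spaghetti", "bread"},
    {"spaghetti", "tomato sauce", "bread"},
    {"bread", "butter"},
    {"bread", "tomato sauce"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item of the itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)

def apriori(transactions, min_support):
    """Return a dict mapping each large (frequent) itemset to its support."""
    items = sorted({i for t in transactions for i in t})
    candidates = [frozenset([i]) for i in items]
    frequent, k = {}, 1
    while candidates:
        # Count the candidates and keep those that reach the minimum support.
        level = {c: support(c, transactions) for c in candidates}
        level = {c: s for c, s in level.items() if s >= min_support}
        frequent.update(level)
        k += 1
        # Join: unions of two large (k-1)-itemsets that contain exactly k items.
        joined = {a | b for a in level for b in level if len(a | b) == k}
        # Prune (apriori principle): every (k-1)-subset must itself be large.
        candidates = [c for c in joined
                      if all(frozenset(s) in level for s in combinations(c, k - 1))]
    return frequent

frequent = apriori(TRANSACTIONS, min_support=0.4)
for itemset, s in sorted(frequent.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), f"{s:.0%}")
```

With a minimum support of 40% this yields the three large 1-itemsets and the three large 2-itemsets listed above; the only candidate 3-itemset has support 20% and is dropped.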

From itemsets to association rules
• Schema: IF subset THEN large k-itemset, with support s and confidence c
  • s = (support of the large k-itemset) / (number of tuples)
  • c = (support of the large k-itemset) / (support of the subset)
• Example:
  • IF {spaghetti} THEN {spaghetti, tomato sauce}
  • Support: s = 2 / 5 (40%)
  • Confidence: c = 2 / 3 (about 67%)
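The two formulas translate directly into code. The sketch below recomputes the example rule on the five baskets; the function name `rule_metrics` is an assumption made for illustration.

```python
TRANSACTIONS = [
    {"spaghetti", "tomato sauce"},
    {"spaghetti", "bread"},
    {"spaghetti", "tomato sauce", "bread"},
    {"bread", "butter"},
    {"bread", "tomato sauce"},
]

def rule_metrics(subset, itemset, transactions):
    """Support and confidence of the rule IF subset THEN itemset."""
    count_itemset = sum(itemset <= t for t in transactions)
    count_subset = sum(subset <= t for t in transactions)
    s = count_itemset / len(transactions)   # support of the itemset / number of tuples
    c = count_itemset / count_subset        # support of the itemset / support of the subset
    return s, c

s, c = rule_metrics({"spaghetti"}, {"spaghetti", "tomato sauce"}, TRANSACTIONS)
print(f"support = {s:.0%}, confidence = {c:.1%}")
```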

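It can be done better … (one possibility)
Q: the FP-tree

Header table (the Link column points to the item's nodes in the tree):
Item          Support
Bread         4
Spaghetti     3
Tomato sauce  3

FP-tree (item: count):
NULL
• Br: 4
  • S: 2
    • T: 1
  • T: 1
• S: 1
  • T: 1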
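To see where the counts in the tree come from, the following sketch builds the prefix tree for the five baskets. It only covers tree construction (no header links, no mining step), and the nested-dictionary layout and the name `build_fp_tree` are assumptions made for illustration.

```python
from collections import Counter

TRANSACTIONS = [
    {"spaghetti", "tomato sauce"},
    {"spaghetti", "bread"},
    {"spaghetti", "tomato sauce", "bread"},
    {"bread", "butter"},
    {"bread", "tomato sauce"},
]

def build_fp_tree(transactions, min_count):
    """Return (item order, tree); the tree maps item -> [count, children]."""
    counts = Counter(i for t in transactions for i in t)
    frequent = {i: c for i, c in counts.items() if c >= min_count}
    # Order items by descending support (ties broken alphabetically).
    order = sorted(frequent, key=lambda i: (-frequent[i], i))
    root = {}
    for t in transactions:
        node = root
        # Insert the transaction's frequent items as a path, sharing common prefixes.
        for item in (i for i in order if i in t):
            entry = node.setdefault(item, [0, {}])
            entry[0] += 1
            node = entry[1]
    return order, root

order, tree = build_fp_tree(TRANSACTIONS, min_count=2)
print(order)
print(tree)
```

Butter is infrequent and never enters the tree; four of the five baskets share the prefix "bread", which is why the tree is much smaller than the list of transactions.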

Texts as relations

Document  star  Britney  Spears  Big  Dipper  class
1         1     1        1       0    0       Celebrity
2         1     0        1       1    1       Astronomy
3         1     1        1       0    0       Celebrity
4         0     1        1       0    0       Celebrity

IF star AND Britney THEN Celebrity
IF star AND Dipper THEN Astronomy

Texts as itemsets ("sets of words")

Document  star  Britney  Spears  Big  Dipper
1         1     1        1       0    0
2         1     0        1       1    1
3         1     1        1       0    0
4         0     1        1       0    0

IF star AND Britney THEN Spears
IF star AND Dipper THEN Big

Texts as bags of words

Document  star  Britney  Spears  Big  Dipper
1         1     3        1       0    0
2         1     0        1       1    1
3         1     1        1       0    0
4         0     1        1       0    0

DB structures behind this: what is an index, and what is it for? (3) Finding (here: fully inverted files)

Texts as bags of words
Which documents are probably most relevant for a search for
• Britney
• star ?
This is a question of similarity between query and document, based on the bag-of-words counts above.
• Britney is very characteristic of doc 1.
• Star is not characteristic (it occurs in nearly every doc!).
• Term frequency / inverse document frequency: TF.IDF weights for words
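A small sketch of how such weights could be computed from the counts. It assumes the plain variant weight = tf × ln(N / df); the slides do not fix a particular formula and many variants exist, so the numbers are illustrative only. The names `TERMS`, `COUNTS` and `tf_idf` are chosen here.

```python
from math import log

TERMS = ["star", "Britney", "Spears", "Big", "Dipper"]
# Term counts per document, taken from the bag-of-words table.
COUNTS = {
    1: [1, 3, 1, 0, 0],
    2: [1, 0, 1, 1, 1],
    3: [1, 1, 1, 0, 0],
    4: [0, 1, 1, 0, 0],
}

def tf_idf(counts):
    """Weight each count by ln(N / df), where df = number of docs containing the term."""
    n_docs = len(counts)
    n_terms = len(next(iter(counts.values())))
    df = [sum(1 for row in counts.values() if row[j] > 0) for j in range(n_terms)]
    return {doc: [tf * log(n_docs / df[j]) if df[j] else 0.0
                  for j, tf in enumerate(row)]
            for doc, row in counts.items()}

weights = tf_idf(COUNTS)
for doc, row in weights.items():
    print(doc, {t: round(w, 2) for t, w in zip(TERMS, row) if w > 0})
```

In this variant a term that occurs in every document (here: Spears) gets weight 0, a rare term such as Dipper gets a high weight, and Britney weighs three times as much in doc 1 as in docs 3 and 4 because of its term frequency.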

Q: Is the idea here that you convert a web page into some kind of vector that holds the most important information? How does that work, and what does such a vector contain?

What makes people happy?

Happy in blogs

What are the characteristic words for these two moods?

"Well kids, I had an awesome birthday thanks to you. =D Just wanted to so thank you for coming and thanks for the gifts and junk. =) I have many pictures and I will post them later. hearts"
current mood:

"Home alone for too many hours, all week long ... screaming child, headache, tears that just won't let themselves loose ... and now I've lost my wedding band. I hate this."
current mood:

Data, data preparation and learning
• LiveJournal.com: optional mood annotation
• 10,000 blogs:
  • 5,000 happy entries / 5,000 sad entries
  • average size: 175 words / entry
  • post-processing: remove SGML tags, tokenization, part-of-speech tagging
• Quality of the automatic mood classification
  • naïve Bayes text classifier
  • five-fold cross-validation
  • accuracy: 79.13% (>> 50% baseline)
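To give an idea of what a naïve Bayes text classifier does, here is a minimal multinomial version with add-one smoothing. It is a sketch, not the setup used in the study: the four training entries are made up for illustration (using words that appear on these slides), and cross-validation is left out.

```python
from collections import Counter, defaultdict
from math import log

def train(docs):
    """docs: list of (tokens, label). Returns class counts, word counts, vocabulary."""
    class_docs = Counter(label for _, label in docs)
    word_counts = defaultdict(Counter)
    for tokens, label in docs:
        word_counts[label].update(tokens)
    vocab = {w for counts in word_counts.values() for w in counts}
    return class_docs, word_counts, vocab

def predict(model, tokens):
    """Pick the class with the highest log prior plus summed log word likelihoods."""
    class_docs, word_counts, vocab = model
    total_docs = sum(class_docs.values())
    best, best_score = None, float("-inf")
    for label in class_docs:
        score = log(class_docs[label] / total_docs)
        denom = sum(word_counts[label].values()) + len(vocab)
        for w in tokens:
            if w in vocab:  # words never seen in training are ignored
                score += log((word_counts[label][w] + 1) / denom)  # add-one smoothing
        if score > best_score:
            best, best_score = label, score
    return best

# Hypothetical toy training set, NOT the LiveJournal corpus.
training = [
    (["awesome", "birthday", "thanks", "gifts"], "happy"),
    (["lovely", "concert", "cool"], "happy"),
    (["headache", "tears", "hate"], "sad"),
    (["lonely", "cried", "tears"], "sad"),
]
model = train(training)
print(predict(model, ["birthday", "concert"]))
print(predict(model, ["tears", "headache"]))
```

In a five-fold cross-validation the labelled entries would be split into five parts, with each part used once for testing while the other four are used for training.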

Result: happiness factors derived from a corpus

Happiest words        Saddest words
yay       86.67       goodbye  18.81
shopping  79.56       hurt     17.39
awesome   79.71       tears    14.35
birthday  78.37       cried    11.39
lovely    77.39       upset    11.12
concert   74.85       sad      11.11
cool      73.72       cry      10.56
cute      73.20       died     10.07
lunch     73.02       lonely    9.50
books                 crying

But: the texts are not simply there …

Preprocessing (1): Data cleaning
Goal: get clean ASCII text
• Remove HTML markup*, pictures, advertisements, ...
• Automate this: wrapper induction
* Note: HTML markup may carry information too (e.g., <b> or <h1> marks something important), which can be extracted! (Depends on the application)
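A minimal sketch of markup removal with Python's standard `html.parser` module. It strips the tags and, following the note above, separately keeps the text found inside <b> and <h1>. This is hand-written cleaning for one simple page, not wrapper induction; the class name `TextExtractor` and the example page are made up for illustration.

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collects plain text, and separately the text inside 'important' tags."""
    IMPORTANT = {"b", "h1"}

    def __init__(self):
        super().__init__()
        self.text = []
        self.important = []
        self._depth = 0  # how many important tags are currently open

    def handle_starttag(self, tag, attrs):
        if tag in self.IMPORTANT:
            self._depth += 1

    def handle_endtag(self, tag):
        if tag in self.IMPORTANT and self._depth > 0:
            self._depth -= 1

    def handle_data(self, data):
        data = data.strip()
        if data:
            self.text.append(data)
            if self._depth:
                self.important.append(data)

page = "<html><body><h1>Printer XYZ</h1><p>A <b>fast</b> printer.</p></body></html>"
extractor = TextExtractor()
extractor.feed(page)
extractor.close()
print(" ".join(extractor.text))
print(extractor.important)
```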

Preprocessing (2): Further text preprocessing
Goal: get processable lexical / syntactical units
• Tokenize (find word boundaries)
• Lemmatize / stem, e.g. buyers → buyer; buyer, buying, ... → buy
• Remove stopwords
• Find Named Entities (people, places, companies, ...); filtering
• Resolve polysemy and homonymy: word sense disambiguation; "synonym unification"
• Part-of-speech tagging; filtering of nouns, verbs, adjectives, ...
Notes:
• Most steps are optional and application-dependent!
• Many steps are language-dependent; coverage of non-English languages varies
• Free and/or open-source tools or Web APIs exist for most steps
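The first three steps can be illustrated with a few lines of standard-library Python. This is a deliberately crude sketch: the stopword list is a tiny made-up sample, and the suffix stripping only stands in for a real stemmer or lemmatizer (it will mis-handle many English words).

```python
import re

# Tiny illustrative stopword list; real lists contain hundreds of words.
STOPWORDS = {"the", "a", "an", "and", "of", "to", "is", "are", "for", "in"}
# Crude suffix stripping, longest suffix first; NOT a real stemming algorithm.
SUFFIXES = ("ers", "ing", "er", "s")

def tokenize(text):
    """Lower-case the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())

def stem(word):
    """Strip one known suffix if at least three letters remain."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def preprocess(text):
    """Tokenize, remove stopwords, stem."""
    return [stem(t) for t in tokenize(text) if t not in STOPWORDS]

print(preprocess("The buyers are buying; a buyer buys."))
```

All four word forms in the example sentence are reduced to the same stem "buy", so later steps treat them as one term.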

Preprocessing (3): Creation of text representation
Goal: a representation that the modelling algorithm can work on
Most common forms: a text as
• a set or (more usually) bag of words / vector-space representation: term-document matrix with weights reflecting occurrence, importance, ...
• a sequence of words
• a tree (parse trees)
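A sketch of the set and bag-of-words representations as a term-document matrix. The two token lists are made up so that their counts match documents 1 and 2 of the earlier star / Britney example; the function name `term_document_matrix` is an assumption.

```python
from collections import Counter

def term_document_matrix(docs, binary=False):
    """docs: name -> list of tokens. Returns (vocabulary, name -> row of weights)."""
    vocab = sorted({t for tokens in docs.values() for t in tokens})
    matrix = {}
    for name, tokens in docs.items():
        counts = Counter(tokens)
        # Bag of words: occurrence counts; set of words: 1 if present, else 0.
        matrix[name] = [min(counts[term], 1) if binary else counts[term]
                        for term in vocab]
    return vocab, matrix

# Hypothetical token lists after preprocessing.
docs = {
    "doc1": ["star", "britney", "britney", "britney", "spears"],
    "doc2": ["star", "spears", "big", "dipper"],
}
vocab, matrix = term_document_matrix(docs)
_, sets = term_document_matrix(docs, binary=True)
print(vocab)
print(matrix)
print(sets)
```

The counts could then be replaced by other weights (for example TF.IDF) without changing the shape of the matrix.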

An important part of preprocessing: Named-entity recognition (1)

Q: If you enter several words in Google, are they combined with AND and OR, or is there more behind it?

Outlook: Natural language queries

Q: About the internet in general: should it be seen as one big unordered chaos of websites, or rather as many separate databases (for example one with all web pages from Belgium, or all web pages of an internet provider such as Telenet) that together make up the internet (and so allow one large, general database to distribute its tasks)?

What is this? Can we do something with it?

Linked Open Data (DBPedia and Freebase indicated in red circles)

Outlook
• How do data become powerful? Mining & combination
• Methods (1): Classifier learning on relations
• Methods (2): Itemset mining
• From relations to texts
• Methods (3): Classifier learning on texts
• (A bit of) the KD process: preprocessing
• What do search engines do?
• What can WE do?
• XML (etc.), NoSQL

Sources
• Methods (classifier learning)
  • Slides from the "WEKA book": Ian H. Witten, Eibe Frank, Mark A. Hall. Data Mining: Practical Machine Learning Tools and Techniques (Third Edition). 2011. http://www.cs.waikato.ac.nz/ml/weka/book.html
• Methods (itemset mining)
  • Agrawal R, Imielinski T, Swami AN. "Mining Association Rules between Sets of Items in Large Databases." SIGMOD, June 1993, 22(2): 207-16. http://rakesh.agrawal-family.com/papers/sigmod93assoc.pdf
  • Agrawal R, Srikant R. "Fast Algorithms for Mining Association Rules." VLDB, Sep 12-15 1994, Chile, 487-99. http://rakesh.agrawal-family.com/papers/vldb94apriori.pdf
• Methods (happy in blogs)
  • Mihalcea, R. & Liu, H. (2006). A corpus-based approach to finding happiness. In Proceedings of the AAAI Spring Symposium on Computational Approaches to Analyzing Weblogs.
  • and Rada Mihalcea's presentation at CAAW 2006