
30784612ce5e9983e83c15a87a37cf24.ppt
- Количество слайдов: 79
Chapter 10 Managing Databases with Oracle 9 i SII 654 Fall 2005 Fundamentals, Design, and Implementation, 9/e
Introduction § Oracle is the world’s most popular DBMS § It is a powerful and robust DBMS that runs on many different operating systems § Oracle DBMS engine: Personal Oracle and Enterprise Oracle § Example of Oracle products – SQL*Plus: a utility for processing SQL and creating components like stored procedures and triggers • PL/SQL is a programming language that adds programming constructs to the SQL language – Oracle Developer (Forms & Reports Builder) – Oracle Designer Copyright © 2004 Database Processing: Fundamentals, Design, and Implementation, 9/e by David M. Kroenke 2
Creating an Oracle Database § Installing Oracle – Install Oracle 9 i Client to use an already created database – Install Oracle 9 i Personal Edition to create your own databases § Three ways to create an Oracle database – Via the Oracle Database Configuration Assistant – Via the Oracle-supplied database creation procedures – Via the SQL CREATE DATABASE command Copyright © 2004 Database Processing: Fundamentals, Design, and Implementation, 9/e by David M. Kroenke 3
SQL*Plus § Oracle SQL*Plus or the Oracle Enterprise Manager Console may be used to manage an Oracle database § SQL*Plus is a text editor available in all Oracle § Except inside quotation marks of strings, Oracle commands are case-insensitive § The semicolon (; ) terminates a SQL statement § The right-leaning slash (/) executes SQL statement stored in Oracle buffer § SQL*Plus can be used to – Enter SQL statements – Submit SQL files created by text editors, e. g. , notepad, to Oracle Copyright © 2004 Database Processing: Fundamentals, Design, and Implementation, 9/e by David M. Kroenke 4
Example: SQL*Plus Prompt Copyright © 2004 Database Processing: Fundamentals, Design, and Implementation, 9/e by David M. Kroenke 5
SQL*Plus Buffer § SQL*Plus keeps the current statements in a multi-line buffer without executing it § LIST is used to see the contents of the buffer – LIST [line_number] is used to change the current line § CHANGE/astring/bstring/ is used to change the contents of the current line – astring = the string you want to change – bstring = what you want to change it to § Example: change/Table_Name/*/ – ‘Table_Name’ is replaced with ‘*’ Copyright © 2004 Database Processing: Fundamentals, Design, and Implementation, 9/e by David M. Kroenke 6
Example: SQL*Plus Buffer Copyright © 2004 Database Processing: Fundamentals, Design, and Implementation, 9/e by David M. Kroenke 7
Creating Tables § Some of the SQL-92 CREATE TABLE statements need to be modified for Oracle – – Oracle does not support a CASCADE UPDATE constraint Int data type is interpreted by Oracle as Number(38) Varchar data type is interpreted as Var. Char 2 Money or currency is defined in Oracle using the Numeric data type § Oracle sequences must be used for surrogate keys § DESCRIBE or DESC command is used to view table status Copyright © 2004 Database Processing: Fundamentals, Design, and Implementation, 9/e by David M. Kroenke 8
Oracle Data Types Copyright © 2004 Database Processing: Fundamentals, Design, and Implementation, 9/e by David M. Kroenke 9
Oracle Sequences § A sequence is an object that generates a sequential series of unique numbers § It is the best way to work with surrogate keys in Oracle § Two sequence methods – Next. Val provides the next value in a sequence – Curr. Val provides the current value in a sequence § Using sequences does not guarantee valid surrogate key values because it is possible to have missing, duplicate, or wrong sequence value in the table Copyright © 2004 Database Processing: Fundamentals, Design, and Implementation, 9/e by David M. Kroenke 10
Example: Sequences § Creating sequence CREATE SEQUENCE Cust. ID INCREMENT BY 1 START WITH 1000; § Entering data using sequence INSERT INTO CUSTOMER (Customer. ID, Name, Area. Code, Phone. Number) VALUES (Cust. ID. Next. Val, ‘Mary Jones’, ‘ 350’, ‘ 555– 1234); § Retrieving the row just created SELECT * FROM CUSTOMER WHERE Customer. ID = Cust. ID. Curr. Val Copyright © 2004 Database Processing: Fundamentals, Design, and Implementation, 9/e by David M. Kroenke 11
DROP and ALTER Statements § Drop statements may be used to remove structures from the database – DROP TABLE MYTABLE; • Any data in the MYTABLE table will be lost – DROP SEQUENCE My. Sequence; § ALTER statement may be used to drop (add) a column – ALTER TABLE MYTABLE DROP COLUMN My. Column; – ALTER TABLE MYTABLE ADD C 1 NUMBER(4); Copyright © 2004 Database Processing: Fundamentals, Design, and Implementation, 9/e by David M. Kroenke 12
TO_DATE Function § Oracle requires dates in a particular format § TO_DATE function may be used to identify the format – TO_DATE(‘ 11/12/2002’, ’MM/DD/YYYY’) • 11/12/2002 is the date value • MM/DD/YYYY is the pattern to be used when interpreting the date § TO_DATE function can be used with the INSERT and UPDATE statement to enter data – INSERT INTO T 1 VALUES (100, TO_DATE (‘ 01/05/02’, ‘DD/MM/YY’); Copyright © 2004 Database Processing: Fundamentals, Design, and Implementation, 9/e by David M. Kroenke 13
Creating Indexes § Indexes are created to – Enforce uniqueness on columns – Facilitate sorting – Enable fast retrieval by column values § Good candidates for indexes are columns that are frequently used with equal conditions in WHERE clause or in a join § Example: – CREATE INDEX Cust. Name. Idx ON CUSTOMER(Name); – CREATE UNIQUE INDEX Work. Unique. Index ON WORK(Title, Copy, Artist. ID); Copyright © 2004 Database Processing: Fundamentals, Design, and Implementation, 9/e by David M. Kroenke 14
Restrictions On Column Modifications § A column may be dropped at any time and all data will be lost § A column may be added at any time as long as it is a NULL column § To add a NOT NULL column – Add a NULL column – Fill the new column in every row with data – Change its structure to NOT NULL • ALTER TABLE T 1 MODIFY C 1 NOT NULL; Copyright © 2004 Database Processing: Fundamentals, Design, and Implementation, 9/e by David M. Kroenke 15
Creating Views § SQL-92 CREATE VIEW command can be used to create views in SQL*Plus § Oracle allows the ORDER BY clause in view definitions § Only Oracle 9 i supports the JOIN…ON syntax § Example: CREATE VIEW Customer. Interests AS SELECT C. Name as Customer, A. Name as Artist FROM CUSTOMER C JOIN CUSTOMER_ARTIST_INT I ON C. Customer. ID = I. Customer. ID JOIN ARTIST A ON I. Artist. ID = A. Artist. ID; Copyright © 2004 Database Processing: Fundamentals, Design, and Implementation, 9/e by David M. Kroenke 16
Enterprise Manager Console § The Oracle Enterprise Manager Console provides graphical facilities for managing an Oracle database § The utility can be used to manage – Database structures such as tables and views – User accounts, passwords, roles, and privileges § The Manager Console includes a SQL scratchpad for executing SQL statements Copyright © 2004 Database Processing: Fundamentals, Design, and Implementation, 9/e by David M. Kroenke 17
Application Logic § Oracle database application can be processed using – Programming language to invoke Oracle DBMS commands – Stored procedures – Start command to invoke database commands stored in. sql files – Triggers Copyright © 2004 Database Processing: Fundamentals, Design, and Implementation, 9/e by David M. Kroenke 18
Stored Procedures § A stored procedure is a PL/SQL or Java program stored within the database § Stored procedures are programs that can – – Have parameters Invoke other procedures and functions Return values Raise exceptions § A stored procedure must be compiled and stored in the database § Execute or Exec command is used to invoke a stored procedure – Exec Customer_Insert (‘Michael Bench’, ‘ 203’, ‘ 5552014’, ‘US’); Copyright © 2004 Database Processing: Fundamentals, Design, and Implementation, 9/e by David M. Kroenke 19
Example: Stored Procedure § § Insert Figure 10 -20 IN signifies input parameters OUT signifies an output parameter IN OUT signifies a parameter used for both input and output § Variables are declared after the keyword AS Copyright © 2004 Database Processing: Fundamentals, Design, and Implementation, 9/e by David M. Kroenke 20
Triggers § Oracle triggers are PL/SQL or Java procedures that are invoked when specified database activity occurs § Triggers can be used to – – – Enforce a business rule Set complex default values Update a view Perform a referential integrity action Handle exceptions Copyright © 2004 Database Processing: Fundamentals, Design, and Implementation, 9/e by David M. Kroenke 21
Triggers (cont. ) § Trigger types – A command trigger will be fired once per SQL command – A row trigger will be fired once for every row involved in the processing of a SQL command • Three types of row triggers: BEFORE, AFTER, and INSTEAD OF • BEFORE and AFTER triggers are placed on tables while INSTEAD OF triggers are placed on views • Each trigger can be fired on insert, update, or delete commands Copyright © 2004 Database Processing: Fundamentals, Design, and Implementation, 9/e by David M. Kroenke 22
Data Dictionary § Oracle maintains a data dictionary of metadata § The metadata of the dictionary itself are stored in the table DICT SELECT Table_Name, Comments FROM DICT WHERE Table_Name LIKE (‘%TABLES%’); § USER_TABLES contains information about user or system tables DESC USER_TABLES; Copyright © 2004 Database Processing: Fundamentals, Design, and Implementation, 9/e by David M. Kroenke 23
Example Oracle Metadata Copyright © 2004 Database Processing: Fundamentals, Design, and Implementation, 9/e by David M. Kroenke 24
Concurrency Control § Oracle processes database changes by maintaining a System Change Number (SCN) – SCN is a database-wide value that is incremented by Oracle when database changes are made § With SCN, SQL statements always read a consistent set of values; those that were committed at or before the time the statement was started § Oracle only reads committed changes; it will never reads dirty data Copyright © 2004 Database Processing: Fundamentals, Design, and Implementation, 9/e by David M. Kroenke 25
Oracle Transaction Isolation § Oracle supports the following transaction isolation levels – Read Committed: Oracle’s default transaction isolation level since it never reads uncommitted data changes – Serializable: Dirty reads are not possible, repeated reads yield the same results, and phantoms are not possible – Read Only: All statements read consistent data. No inserts, updates, or deletions are possible – Explicit locks: Not recommended Copyright © 2004 Database Processing: Fundamentals, Design, and Implementation, 9/e by David M. Kroenke 26
Oracle Security § Oracle security components: – An ACCOUNT is a user account – A PROFILE is a set of system resource maximums that are assigned to an account – A PRIVILEGE is the right to perform a task – A ROLE consists of groups of PRIVILEGEs and other ROLEs Copyright © 2004 Database Processing: Fundamentals, Design, and Implementation, 9/e by David M. Kroenke 27
Account System Privileges § Each ACCOUNT can be allocated many SYSTEM PRIVILEGEs and many ROLEs § An ACCOUNT has all the PRIVILEGEs – That have been assigned directly – Of all of its ROLEs that are inherited through ROLE connections § A ROLE can have many SYSTEM PRIVILEGEs and it may also have a relationship to other ROLEs § ROLEs simplify the administration of the database – A set of privileges can be assigned to or removed from a ROLE just once Copyright © 2004 Database Processing: Fundamentals, Design, and Implementation, 9/e by David M. Kroenke 28
Account Authentication § Accounts can be authenticated by – Password – The host operating system § Password management can be specified via PROFILEs Copyright © 2004 Database Processing: Fundamentals, Design, and Implementation, 9/e by David M. Kroenke 29
Oracle Recovery Facilities § Three file types for Oracle recovery: – Datafiles contain user and system data – Re. Do log files contain logs of database changes • On. Line Re. Do files are maintained on disk and contain the rollback segments from recent database changes • Offline or Archive Re. Do files are backups of the On. Line Re. Do files – Control files describe the name, contents, and locations of various files used by Oracle Copyright © 2004 Database Processing: Fundamentals, Design, and Implementation, 9/e by David M. Kroenke 30
Oracle Recovery Facilities (cont. ) § Oracle can operate in either ARCHIVELOG or NOARCHIVELOG mode – If running in ARCHIVELOG mode, Oracle logs all changes to the database – When the On. Line Re. Do files fill up, they are copied to the Archive Re. Do files § The Oracle Recovery Manager (RMAN) is a utility program used to create backups and to perform recovery Copyright © 2004 Database Processing: Fundamentals, Design, and Implementation, 9/e by David M. Kroenke 31
Types of Failure § Oracle recovery techniques depend on the type of failure – An application failure due to application logic errors – An instance failure occurs when Oracle itself fails due to an operating system or computer hardware failure • Oracle can recover from application and instance failure without using the archived log file – A media failure occurs when Oracle is unable to write to a physical file because of a disk failure or corrupted files • The database is restored from a backup Copyright © 2004 Database Processing: Fundamentals, Design, and Implementation, 9/e by David M. Kroenke 32
Oracle Backup Facilities § Two kinds of backups § A consistent backup: Database activity must be stopped and all uncommitted changes have been removed from the datafiles – Cannot be done if the database supports 24/7 operations § An inconsistent backup: Backup is made while Oracle is processing the database – An inconsistent backup can be made consistent by processing an archive log file Copyright © 2004 Database Processing: Fundamentals, Design, and Implementation, 9/e by David M. Kroenke 33
Chapter 10 Managing Databases with Oracle 9 i Instructor: Dragomir R. Radev Winter 2005 Fundamentals, Design, and Implementation, 9/e
KDD and Data Mining Instructor: Dragomir R. Radev Winter 2005 Fundamentals, Design, and Implementation, 9/e
The big problem § Billions of records § A small number of interesting patterns § “Data rich but information poor” Copyright © 2004 Database Processing: Fundamentals, Design, and Implementation, 9/e by David M. Kroenke 36
Data mining § Knowledge discovery § Knowledge extraction § Data/pattern analysis Copyright © 2004 Database Processing: Fundamentals, Design, and Implementation, 9/e by David M. Kroenke 37
Types of source data § § Relational databases Transactional databases Web logs Textual databases Copyright © 2004 Database Processing: Fundamentals, Design, and Implementation, 9/e by David M. Kroenke 38
Association rules § 65% of all customers who buy beer and tomato sauce also buy pasta and chicken wings § Association rules: X Y Copyright © 2004 Database Processing: Fundamentals, Design, and Implementation, 9/e by David M. Kroenke 39
Association analysis § IF 20 < age < 30 AND 20 K < INCOME < 30 K § THEN – Buys (“CD player”) § SUPPORT = 2%, CONFIDENCE = 60% Copyright © 2004 Database Processing: Fundamentals, Design, and Implementation, 9/e by David M. Kroenke 40
Basic concepts § Minimum support threshold § Minimum confidence threshold § Itemsets § Occurrence frequency of an itemset Copyright © 2004 Database Processing: Fundamentals, Design, and Implementation, 9/e by David M. Kroenke 41
Association rule mining § Find all frequent itemsets § Generate strong association rules from the frequent itemsets Copyright © 2004 Database Processing: Fundamentals, Design, and Implementation, 9/e by David M. Kroenke 42
Support and confidence § Support (X) § Confidence (X Y) = Support(X+Y) / Support (X) Copyright © 2004 Database Processing: Fundamentals, Design, and Implementation, 9/e by David M. Kroenke 43
Example TID T 100 T 200 T 300 T 400 T 500 T 600 T 700 T 800 T 900 List of item IDs I 1, I 2, I 5 I 2, I 4 I 2, I 3 I 1, I 2, I 4 I 1, I 3 I 2, I 3 I 1, I 2, I 3, I 5 I 1, I 2, I 3 Copyright © 2004 Database Processing: Fundamentals, Design, and Implementation, 9/e by David M. Kroenke 44
Example (cont’d) § § § § Frequent itemset l = {I 1, I 2, I 5} I 1 AND I 2 I 5 C = 2/4 = 50% I 1 AND I 5 I 2 AND I 5 I 1 I 2 AND I 5 I 2 I 1 AND I 5 I 3 I 1 AND I 2 Copyright © 2004 Database Processing: Fundamentals, Design, and Implementation, 9/e by David M. Kroenke 45
Example 2 TID date items T 100 10/15/99 {K, A, D, B} T 200 10/15/99 {D, A, C, E, B} T 300 10/19/99 {C, A, B, E} T 400 10/22/99 {B, A, D} min_sup = 60%, min_conf = 80% Copyright © 2004 Database Processing: Fundamentals, Design, and Implementation, 9/e by David M. Kroenke 46
Correlations § Corr (A, B) = P (A OR B) / P(A) P (B) § If Corr < 1: A discourages B (negative correlation) § (lift of the association rule A B) Copyright © 2004 Database Processing: Fundamentals, Design, and Implementation, 9/e by David M. Kroenke 47
Contingency table Game ^Game Sum Video 4, 000 3, 500 7, 500 ^Video 2, 000 500 2, 500 Sum 6, 000 4, 000 10, 000 Copyright © 2004 Database Processing: Fundamentals, Design, and Implementation, 9/e by David M. Kroenke 48
Example § § P({game}) = 0. 60 P({video}) = 0. 75 P({game, video}) = 0. 40 P({game, video})/(P({game})x(P({video })) = 0. 40/(0. 60 x 0. 75) = 0. 89 Copyright © 2004 Database Processing: Fundamentals, Design, and Implementation, 9/e by David M. Kroenke 49
Example 2 hotdogs ^hotdogs Sum hamburgers 2000 500 2500 ^hamburgers 1000 1500 2500 Sum 3000 2000 5000 Copyright © 2004 Database Processing: Fundamentals, Design, and Implementation, 9/e by David M. Kroenke 50
Classification using decision trees § Expected information need § I (s 1, s 2, …, sm) = - S pi log (pi) § s = data samples § m = number of classes Copyright © 2004 Database Processing: Fundamentals, Design, and Implementation, 9/e by David M. Kroenke 51
RID Age Income student credit buys? 1 <= 30 High No Fair No 2 <= 30 High No Excellent No 3 31. . 40 High No Fair Yes 4 > 40 Medium No Fair Yes 5 > 40 Low Yes Fair Yes 6 > 40 Low Yes Excellent No 7 31. . 40 Low Yes Excellent Yes 8 <= 30 Medium No Fair No 9 <= 30 Low Yes Fair Yes 10 > 40 Medium Yes Fair Yes 11 <= 30 Medium Yes Excellent Yes 12 31. . 40 Medium No Excellent Yes 13 31. . 40 High Yes Fair Yes 14 > 40 Medium no excellent no Copyright © 2004 Database Processing: Fundamentals, Design, and Implementation, 9/e by David M. Kroenke 52
Decision tree induction § I(s 1, s 2) = I(9, 5) = = - 9/14 log 9/14 – 5/14 log 5/14 = = 0. 940 Copyright © 2004 Database Processing: Fundamentals, Design, and Implementation, 9/e by David M. Kroenke 53
Entropy and information gain • E(A) = S S 1 j + … + smj s I (s 1 j, …, smj) Entropy = expected information based on the partitioning into subsets by A Gain (A) = I (s 1, s 2, …, sm) – E(A) Copyright © 2004 Database Processing: Fundamentals, Design, and Implementation, 9/e by David M. Kroenke 54
Entropy § Age <= 30 s 11 = 2, s 21 = 3, I(s 11, s 21) = 0. 971 § Age in 31. . 40 s 12 = 4, s 22 = 0, I (s 12, s 22) = 0 § Age > 40 s 13 = 3, s 23 = 2, I (s 13, s 23) = 0. 971 Copyright © 2004 Database Processing: Fundamentals, Design, and Implementation, 9/e by David M. Kroenke 55
Entropy (cont’d) § E (age) = 5/14 I (s 11, s 21) + 4/14 I (s 12, s 22) + 5/14 I (S 13, s 23) = 0. 694 § Gain (age) = I (s 1, s 2) – E(age) = 0. 246 § Gain (income) = 0. 029, Gain (student) = 0. 151, Gain (credit) = 0. 048 Copyright © 2004 Database Processing: Fundamentals, Design, and Implementation, 9/e by David M. Kroenke 56
Final decision tree age > 40 31. . 40 student credit yes no yes excellent no Copyright © 2004 Database Processing: Fundamentals, Design, and Implementation, 9/e by David M. Kroenke fair yes 57
Other techniques § Bayesian classifiers § X: age <=30, income = medium, student = yes, credit = fair § P(yes) = 9/14 = 0. 643 § P(no) = 5/14 = 0. 357 Copyright © 2004 Database Processing: Fundamentals, Design, and Implementation, 9/e by David M. Kroenke 58
Example § P (age < 30 | yes) = 2/9 = 0. 222 P (age < 30 | no) = 3/5 = 0. 600 P (income = medium | yes) = 4/9 = 0. 444 P (income = medium | no) = 2/5 = 0. 400 P (student = yes | yes) = 6/9 = 0. 667 P (student = yes | no) = 1/5 = 0. 200 P (credit = fair | yes) = 6/9 = 0. 667 P (credit = fair | no) = 2/5 = 0. 400 Copyright © 2004 Database Processing: Fundamentals, Design, and Implementation, 9/e by David M. Kroenke 59
Example (cont’d) § P (X | yes) = 0. 222 x 0. 444 x 0. 667 = 0. 044 § P (X | no) = 0. 600 x 0. 400 x 0. 200 x 0. 400 = 0. 019 § P (X | yes) P (yes) = 0. 044 x 0. 643 = 0. 028 § P (X | no) P (no) = 0. 019 x 0. 357 = 0. 007 § Answer: yes/no? Copyright © 2004 Database Processing: Fundamentals, Design, and Implementation, 9/e by David M. Kroenke 60
Predictive models § Inputs (e. g. , medical history, age) § Output (e. g. , will patient experience any side effects) § Some models are better than others Copyright © 2004 Database Processing: Fundamentals, Design, and Implementation, 9/e by David M. Kroenke 61
Principles of data mining § Training/test sets § Error analysis and overfitting error test training § Cross-validation input size § Supervised vs. unsupervised methods Copyright © 2004 Database Processing: Fundamentals, Design, and Implementation, 9/e by David M. Kroenke 62
Representing data § Vector space credit pay off default salary Copyright © 2004 Database Processing: Fundamentals, Design, and Implementation, 9/e by David M. Kroenke 63
Decision surfaces credit pay off default salary Copyright © 2004 Database Processing: Fundamentals, Design, and Implementation, 9/e by David M. Kroenke 64
Decision trees credit pay off default salary Copyright © 2004 Database Processing: Fundamentals, Design, and Implementation, 9/e by David M. Kroenke 65
Linear boundary credit pay off default salary Copyright © 2004 Database Processing: Fundamentals, Design, and Implementation, 9/e by David M. Kroenke 66
k. NN models § Assign each element to the closest cluster § Demos: – http: //www 2. cs. cmu. edu/~zhuxj/courseproject/knnd emo/KNN. html Copyright © 2004 Database Processing: Fundamentals, Design, and Implementation, 9/e by David M. Kroenke 67
Other methods § § Decision trees Neural networks Support vector machines Demos – http: //www. cs. technion. ac. il/~rani/Loc. Bo ost/ Copyright © 2004 Database Processing: Fundamentals, Design, and Implementation, 9/e by David M. Kroenke 68
arff files @data sunny, 85, FALSE, no @relation weather sunny, 80, 90, TRUE, no overcast, 83, 86, FALSE, yes @attribute outlook {sunny, overcast, rainy} @attribute temperature real @attribute humidity real @attribute windy {TRUE, FALSE} @attribute play {yes, no} rainy, 70, 96, FALSE, yes rainy, 68, 80, FALSE, yes rainy, 65, 70, TRUE, no overcast, 64, 65, TRUE, yes sunny, 72, 95, FALSE, no sunny, 69, 70, FALSE, yes rainy, 75, 80, FALSE, yes sunny, 75, 70, TRUE, yes overcast, 72, 90, TRUE, yes overcast, 81, 75, FALSE, yes rainy, 71, 91, TRUE, no Copyright © 2004 Database Processing: Fundamentals, Design, and Implementation, 9/e by David M. Kroenke 69
Weka http: //www. cs. waikato. ac. nz/ml/weka Methods: rules. Zero. R bayes. Naive. Bayes trees. j 48. J 48 lazy. IBk trees. Decision. Stump Copyright © 2004 Database Processing: Fundamentals, Design, and Implementation, 9/e by David M. Kroenke 70
k. Means clustering § http: //www. cc. gatech. edu/~dellaert/html/sof tware. html § java weka. clusterers. Simple. KMeans -t data/weather. arff Copyright © 2004 Database Processing: Fundamentals, Design, and Implementation, 9/e by David M. Kroenke 71
More useful pointers § http: //www. kdnuggets. com/ § http: //www. twocrows. com/booklet. htm Copyright © 2004 Database Processing: Fundamentals, Design, and Implementation, 9/e by David M. Kroenke 72
More types of data mining § § Classification and prediction Cluster analysis (clustering) Outlier analysis Evolution analysis Copyright © 2004 Database Processing: Fundamentals, Design, and Implementation, 9/e by David M. Kroenke 73
Clustering § Exclusive/overlapping clusters § Hierarchical/flat clusters Copyright © 2004 Database Processing: Fundamentals, Design, and Implementation, 9/e by David M. Kroenke 74
Methods § Single-linkage – One common pair is sufficient – disadvantages: long chains § Complete-linkage – All pairs have to match – Disadvantages: too conservative § Average-linkage § Centroid-based (online) – Look at distances to centroids Copyright © 2004 Database Processing: Fundamentals, Design, and Implementation, 9/e by David M. Kroenke 75
k-means § Needed: small number k of desired clusters § hard vs. soft decisions § Example: Weka Copyright © 2004 Database Processing: Fundamentals, Design, and Implementation, 9/e by David M. Kroenke 76
k-means 1 initialize cluster centroids to arbitrary vectors 2 while further improvement is possible do 3 for each document d do 4 find the cluster c whose centroid is closest to d 5 assign d to cluster c 6 end for 7 for each cluster c do 8 recompute the centroid of cluster c based on its documents 9 end for 10 end while Copyright © 2004 Database Processing: Fundamentals, Design, and Implementation, 9/e by David M. Kroenke 77
Example § Cluster the following vectors into two groups: – A = <1, 6> – B = <2, 2> – C = <4, 0> – D = <3, 3> – E = <2, 5> – F = <2, 1> Copyright © 2004 Database Processing: Fundamentals, Design, and Implementation, 9/e by David M. Kroenke 78
Demos § http: //vivisimo. com/ § http: //www. elet. polimi. it/upload/matteucc/Clustering/tutorial_h tml/Applet. KM. html § http: //cgm. cs. mcgill. ca/~godfried/student_projects/bonnef_kmeans § http: //www. cs. washington. edu/research/imagedatabase/dem o/kmcluster § http: //www. cc. gatech. edu/~dellaert/html/software. html § http: //www-2. cs. cmu. edu/~awm/tutorials/kmeans 11. pdf § http: //www. ece. neu. edu/groups/rpl/projects/kmeans/ § % cd /data 2/tools/weka-3 -3 -4 % export CLASSPATH=/data 2/tools/weka-3 -3 -4/weka. jar % java weka. clusterers. Simple. KMeans -t data/weather. arff Copyright © 2004 Database Processing: Fundamentals, Design, and Implementation, 9/e by David M. Kroenke 79
30784612ce5e9983e83c15a87a37cf24.ppt