a44ac13f9592a021fd7cd3d64342c8a1.ppt
- Количество слайдов: 134
SI 654 Database Application Design Winter 2003 Dragomir R. Radev 1 © 2002 by Prentice Hall
Database Processing Eighth Edition Managing Databases with Oracle Chapter 12 David M. Kroenke 2 © 2002 by Prentice Hall
What is Oracle? • Oracle is the world’s most popular DBMS that… – Is extremely powerful and robust – Runs on many different operating systems – Can be configured and tailored – Operates with most, if not all, addon products 3 © 2002 by Prentice Hall
Oracle Complexity • The power and flexibility of Oracle makes it very complex: – Installations are difficult – The configuration options are numerous – System requirements are high – System maintenance is complex 4 © 2002 by Prentice Hall
The Language of Oracle… SQL Plus • SQL Plus is used in Oracle to: – Define the structure of a database and the definition of the data – Insert, delete, and modify data – Define the behavior of the system through stored procedures and triggers – Retrieve data and generate reports 5 © 2002 by Prentice Hall
Gaining Access to SQL Plus • To gain access to SQL Plus, you will need a username and password and possibly a host string (depending on your system configuration) • When Oracle is first installed, it establishes several default accounts, namely. . . – internal/oracle (a privileged account) – sys/change_on_install (a privileged account) – system/manager (a privileged account) – scott/tiger (a non-privileged account) 6 © 2002 by Prentice Hall
Creating the Database • Ways to create an Oracle database: – Using SQL Plus • Start button –> Programs –> Oracle – Ora. Home 81 –> Applications Development –> SQL Plus – Using Oracle’s Database Configuration Assistant • Start button –> Programs –> Oracle – Ora. Home 81 –> Database Administration –> Database Configuration Assistant 7 © 2002 by Prentice Hall
Entering SQL Plus Commands • The SQL Plus Buffer – As a user types commands, the commands are saved into the SQL Plus buffer. • The SQL Plus Editor – Users may edit or alter SQL Plus commands using a text editor. 8 © 2002 by Prentice Hall
SQL Plus Buffer Commands • SQL Plus is not case sensitive (except within quotation marks). • List – displays the content of the SQL Plus buffer • List n – display line number n and changes the current line number to n • Change – performs a search/replace operation for the current line number • Semi-colon (; ) or slash (/) executes 9 © 2002 by Prentice Hall
SQL Plus Editor • The SQL Plus Edit command will launch the SQL Plus text editor • After the SQL statement is complete and correct, exit the editor • To execute the statement, type the slash key (/) at the SQL prompt • To retrieve an existing SQL file: – SQL> Edit file 1. sql 10 © 2002 by Prentice Hall
SQL Plus Commands • • Desc – lists the fields in the specified table Select – retrieve data Create – create objects Drop – delete objects Alter – change objects Insert – input data Delete – delete data Update – change data 11 © 2002 by Prentice Hall
Select Syntax Select field 1, field 2 From table_a, table_b; 12 © 2002 by Prentice Hall
Create Syntax Create Table tablename ( field 1 data_type(size) NOT NULL, field 2 data_type (size) NULL); Create Sequence table. ID Increment by 1 start with 1000; (this command creates a counter that automatically increments for each new record – does not ensure uniqueness) 13 © 2002 by Prentice Hall
Alter Table Syntax Alter Table tablename 1 Add Constraint Field. PK Primary Key (Field 1, Field 2); Alter Table tablename 2 Add Constraint Field. FK Foreign Key (Field 1, Field 2) references tablename 1 On Delete Cascade; 14 © 2002 by Prentice Hall
Insert Syntax Insert into tablename (field. ID, field 2) Values (field. ID. Next. Val, ‘data content’); 15 © 2002 by Prentice Hall
Drop Syntax • Drop Table tablename; • Drop Sequence field. ID; 16 © 2002 by Prentice Hall
Indexes • Indexes are used to enforce uniqueness and to enable fast retrieval of data. • Create Unique Index field. Index on Table(field 1, field 2); 17 © 2002 by Prentice Hall
Changing the Table Structures • Alter Table tablename add field 4 datatype (size); • Alter Table tablename Drop Column field 2; –you will permanently lose the data in field 2 • Alter Table tablename Modify field 3 not null; 18 © 2002 by Prentice Hall
Changing the Data… Update Syntax Update tablename Set field 1 = ‘value_a’ Where field 3 = value; 19 © 2002 by Prentice Hall
Check Constraint Provide a list of valid values or a valid range… Create Table tablename ( Field 1 datatype (size) Not Null, Field 2 datatype (size) Null Check (field 2 in (‘value_a’, ‘value_b’))); 20 © 2002 by Prentice Hall
Check Constraint • Alter Table tablename Add Constraint Date. Chk Check (Date. Field 1 <= Date. Field 2); • Alter Table tablename Add Constraint Num. Range Check (Field 1 Between 180 and 400); • Alter Table tablename Drop Constraint constraintname; 21 © 2002 by Prentice Hall
Views • Displaying the data from the database just the way a user wants it… Create View 1 As Select * From Tablename With Read Only; 22 © 2002 by Prentice Hall
PL/SQL • Allowing SQL to act more like a programming language. • Row-at-a-time versus set-at-a-time. • PL/SQL permits Cursors • A stored procedure is a PL/SQL (or other program) stored in the database. A stored procedure may have parameters. 23 © 2002 by Prentice Hall
PL/SQL Parameter Types • IN – specifies the input parameters • OUT – specifies the output parameters • IN OUT – a parameter that may be an input or an output 24 © 2002 by Prentice Hall
PL/SQL Code • Variables are declared following the AS keyword • The assignment operator is : = as follows variable 1 : = ‘value’ • Comments in PL/SQL are enclosed between /* and */ as follows… /* This is a comment */ 25 © 2002 by Prentice Hall
PL/SQL Control Structures FOR variable IN list_of_values LOOP Instructions END LOOP; IF condition THEN BEGIN Instructions END; 26 © 2002 by Prentice Hall
Saving, Compiling, and Executing PL/SQL Code • The last line in the PL/SQL procedure should be a slash (/). • The procedure must be saved to a file • To compile the procedure, type the keyword Start, followed by the procedure filename – START My. Prog. SQL • To see any reported errors, type SHOW ERRORS; • To execute the procedure type EXEC 27 My. Prog (‘parameter 1’, ‘parameter 2’); © 2002 by Prentice Hall
Triggers • A trigger is a stored procedure that is automatic invoked by Oracle when a specified activity occurs • A trigger is defined relative to the activity which invoked the trigger – BEFORE – execute the stored procedure prior to the activity – AFTER – execute the stored procedure after the activity – INSTEAD OF – execute the stored procedure in lue of the activity 28 © 2002 by Prentice Hall
Trigger Example Create or Replace Trigger triggername Before Insert or Update of fieldname on tablename For Each Row Begin /* instructions */ End; 29 © 2002 by Prentice Hall
A Trigger Knows the Old and New Values for Fields • The variable : new. fieldname 1 stores the new information for fieldname 1 as entered by the user. • The variable : old. fieldname 1 stores the information in fieldname 1 prior to the user’s request. 30 © 2002 by Prentice Hall
Activating a Trigger • The trigger must be saved to a file • To compile the trigger, type the keyword Start, followed by the trigger filename – START My. Trigger. SQL • To see any reported errors, type SHOW ERRORS; • If no errors were encountered, the trigger is automatically activated 31 © 2002 by Prentice Hall
Data Dictionary • The data dictionary contains information that Oracle knows about itself… the metadata. • It includes information regarding just about everything in the database including the structure and definition of tables, sequences, triggers, indexes, views, stored procedures, etc. • The data dictionary table names are stored in the DICT table. 32 © 2002 by Prentice Hall
Concurrency Control • Since Oracle only reads committed changes, dirty reads and lost updates are avoided • Transaction isolation levels: – Read Committed – Serializable – Read-only – Explicit Locks 33 © 2002 by Prentice Hall
Read Committed Transaction Isolation • Reads may not be repeatable (2 reads may result in 2 data values, based on timing of updates and reads) • Phantoms are possible (data from a read may be deleted after the read occurred) • Uses exclusive locks • Deadlocks are possible and are resolved by rolling-back one of the transactions 34 © 2002 by Prentice Hall
Serializable Transaction Isolation • Reads are always repeatable • Phantoms are avoided • Must issue the following command: Set Transaction Isolation Level Serializable; or Alter Sessions Set Isolation Level Serializable; • Coordinates activities in submission order. When this coordination detects difficulties, the application program(s) must intervene. 35 © 2002 by Prentice Hall
Read-only Transaction Isolation • An Oracle-only isolation level • No inserting, updating, or deleting is permitted 36 © 2002 by Prentice Hall
Explicit Locking • Not recommended • Oracle does not promote locks. As a result, a table may have many, many locks within it. Oracle manages these locks transparently. Issuing explicit locks may interfere with these transparent locks. 37 © 2002 by Prentice Hall
Oracle Security • Username and Password is used to manage DBMS access • Users may be assigned to one or more profiles • Oracle provides extensive resource limitations and access rights. These restrictions may be applied to users or profiles. • The SQL Grant operator provides additional access rights • The SQL Revoke operator remove access rights 38 © 2002 by Prentice Hall
Backup/Recovery • Committed changes are saved to destination Tablespaces. • Uncommitted changes are saved in the Rollback Tablespace. • Redo Log files save all changes made in the Tablespaces. • To start and/or recover from a system failure, the Control Files are read. 39 © 2002 by Prentice Hall
Archivelog • If Oracle is running in ARCHIVELOG mode, backup copies are made of the redo log files. • Otherwise, the redo log files are periodically overwritten with new information. 40 © 2002 by Prentice Hall
Types of Failures • Application Failure – When a program bug is encountered or when a program does not correctly respond to current system conditions. • Instance Failure – When Oracle is unable to do what it needs to do. • Media Failure – When a disk becomes inaccessible to Oracle. 41 © 2002 by Prentice Hall
Recovery of an Application Failure • Oracle rolls back uncommitted changes. 42 © 2002 by Prentice Hall
Recovery of an Instance Failure • Oracle would be restarted using the following sequence… – Read the Control File – Restore system to last known valid state – Roll forward changes not in system (replay the Redo Log Files) 43 © 2002 by Prentice Hall
Recovery from a Media Failure • Restore system from Backup • Read the Control File • Roll forward changes using the Archive Log Files (from the ARCHIVELOG) • Roll forward changes from the on-line Log Files (the most recent versions of the Logs) 44 © 2002 by Prentice Hall
Types of Recoveries • Consistent Backup – After the restoration, delete all uncommitted activities – This ensures consistency, may lose recent changes • Inconsistent Backup – After the restoration, all uncommitted activities remain 45 © 2002 by Prentice Hall
Database Processing Eighth Edition Networks, Multi-Tier Architectures, and XML Chapter 14 David M. Kroenke 46 © 2002 by Prentice Hall
Networks • A network is a collection of computers that communicate with one another using standard sets of rules, called protocols • Common Network Environments: – Internet – Intranet – Wireless Network Access 47 © 2002 by Prentice Hall
The Internet • Internet - a publicly accessible network of networks spanning the globe • Uses communications protocol called Transmission Control Program/Internet Protocol (TCP/IP) 48 © 2002 by Prentice Hall
Key Dates for the Internet • The Internet was born in the 1960’s by the US armed services and was called ARPANET • HTTP: Hyper. Text Transfer Protocol (used to create Web Pages) was created in 1989 by CERN – Key HTTP characteristics: • Request-based (waits for user action) • Stateless (does not sequence or remember activities) 49 © 2002 by Prentice Hall
HTTP: Stateless Property • In applications development, you may often wish to save the application state. • Several Internet tools exist to help accomplish this: – Microsoft Internet Information Server (IIS) – Microsoft Active Server Pages (ASP) – Java Servlets with Java Server. Pages (JSP) 50 © 2002 by Prentice Hall
The Intranet • Some organizations use Internet technologies to create their own privately accessible network called an intranet. • If a connection to the Internet does exists, it does so through a firewall • An intranet is almost always faster than the Internet 51 © 2002 by Prentice Hall
Firewall • Firewall - a security gateway that protects an organization from unauthorized access via the Internet • Consists of software and sometimes hardware components 52 © 2002 by Prentice Hall
Wireless Network Access • Due to less reliability, inferior screen displays, and slower transfer rates, the traditional wired protocols are not appropriate for wireless environments. • A few protocols have been developed which allow wireless devices to communicate via the Internet: – Wireless Application Protocol (WAP) – Wireless Markup Language (WML) 53 © 2002 by Prentice Hall
Multi-tier Architectures • Multi-tier Architectures – Tiers are the number of computers (serving a like function) that a user must use to satisfy his/her request. – Common tiers include Web server and database server. 54 © 2002 by Prentice Hall
A Three-Tier Architecture 55 © 2002 by Prentice Hall
Functions of Tiers 56 © 2002 by Prentice Hall
Processing at the Different Tiers • Since each tier serves a different function, each tier may have a different operating system and different application software offerings. 57 © 2002 by Prentice Hall
Processing • Client Processing – Using the browser (e. g. , Netscape Navigator) • Server Processing – Using Server Software (e. g. , ASP) 58 © 2002 by Prentice Hall
Windows 2000 Web Server Languages • • • Java. Script VBScript Perl Active. X Control Java 59 © 2002 by Prentice Hall
Standards & Languages Common With MS Web Server 60 © 2002 by Prentice Hall
Unix/Linux Web Server Environment • • Java. Script Java Applets Java Server Pages Perl Java CGI 61 © 2002 by Prentice Hall
N-Tier Processing • The 3 -Tier architecture may be extended to include additional tiers. • This produces a distributed processing model using various servers on the Internet 62 © 2002 by Prentice Hall
Markup Languages • Markup Languages are used to specify the appearance and behavior of Web Pages • Markup Language flavors: – HTML – a subset of the SGML – DHTML – RDS/ADO – XML 63 © 2002 by Prentice Hall
HTML • Hyper. Text Markup Language – PROS • Simple • Standardized – CONS • Static content • Limited connectivity • Mixed structure/content 64 © 2002 by Prentice Hall
DHTML • Dynamic Hyper. Text Markup Language • Encapsulates the entire HTML command set • Provides access to objects on the page using the Document Object Model (DOM) • Allows for Cascading Style Sheets (CSS) 65 © 2002 by Prentice Hall
Data Services • Data services allow Web pages to exchange data with databases • RDS is a set of Object. X controls – The data exchanges must be relatively simple • ADO is a set of Active. X Data Controls – These data exchanges may be more complex 66 © 2002 by Prentice Hall
Extensible Markup Language –XML • XML clearly separates content from structure and allows developers to easily define their own elements. • Rather than hard-coding Web pages, you create rules that govern how the document should look. Then merge the structure and the content files. So, the very nature of XML is dynamic. 67 © 2002 by Prentice Hall
Document Type Declaration – DTD • A DTD defines the data content and may provide the data values • While a DTD is desirable, it is not mandatory – XML documents using DTDs are termed type-valid documents – XML documents not using DTDs are termed not-type-valid documents 68 © 2002 by Prentice Hall
XML & CSS • Similar to DHTML, Cascading Style Sheets (CSS) may be used with XML documents to present a consistent, standardized Web site. 69 © 2002 by Prentice Hall
Extensible Style Language Transformation –XMLT • XMLT is used to transform one document into another document 70 © 2002 by Prentice Hall
XML Schema • XML Schema is the next generation of DTD • The schema itself is an XML document • A W 3 standard is currently being developed • A document that conforms to an XML Schema is termed schema-valid. 71 © 2002 by Prentice Hall
XML Schema Concepts • Simple Elements – Consist of a single content value • Complex Elements – Consist of multiple content values 72 © 2002 by Prentice Hall
XML Namespaces • Namespaces define where to look for files • An XML document may have: – Up to one default namespace – Many labeled namespaces • Naming conventions: – Must be unique within all schemas – Typically resembles a URL, but is not a URL 73 © 2002 by Prentice Hall
Wireless Application Protocol (WAP) • WAP has been developed to facilitate Web development for wireless devices such as Personal Data Assistants (PDA) or cellular phones 74 © 2002 by Prentice Hall
WAP Server • A WAP Server transforms XML documents into Wireless Markup Language (WML) – WML is a subset of XML • A WML Scripting Language also exists 75 © 2002 by Prentice Hall
XML and Database Applications • Any document that can process a DTD or XML Schema document can correctly interpret any arbitrary database view • XML can easily process multiple multivalued paths (several SQL statements would be required) 76 © 2002 by Prentice Hall
XML and Database Applications • The separation of structure and content allows for: – The same data to be displayed in many different ways – The same structure (report) may be regenerated many times with different/updated data. – Permits document validation checking 77 © 2002 by Prentice Hall
OASIS – Document structures may be published and made publicly available – Organization for the Advancement of Structured Information Standards (OASIS): • A clearinghouse for XML publications and schema standards 78 © 2002 by Prentice Hall
DBMS Integration of XML – Oracle • XML DOM parser • Xpath • XSQL – SQL Server • XML DOM parser • Xpath • ADO • ASP 79 © 2002 by Prentice Hall
Web-based databases 80 © 2002 by Prentice Hall
Types of databases • Textual databases • Semi-structured databases 81 © 2002 by Prentice Hall
Indexing textual data • Inverted files • Boolean queries • Signature files • Signature S 1 matches signature S 2 if S 2&S 1=S 2 82 © 2002 by Prentice Hall
XML-QL 83 © 2002 by Prentice Hall
XML-QL Two slides from Johannes Gehrke, Cornell University <IMG SRC=“xysq. gif” ALT=“(x+y)^2”> <apply> <power/> <apply> <plus/> <ci>x</ci> <ci>y</ci> </apply> <cn>2</cn> </apply> WHERE <BOOK> <NAME><LAST>$1</LAST></NAME> </BOOK> in “www. booklist. com/books. xml CONSTRUCT <RESULT> $1 </RESULT> 84 © 2002 by Prentice Hall
XML-QL (continued) WHERE <BOOK> $b <BOOK> IN “www. booklist. com/books. xml”, <AUTHOR> $n </AUTHOR> <PUBLISHED> $p </PUBLISHED> in $e CONSTRUCT <RESULT> <PUBLISHED> $p </PUBLISHED> WHERE <LAST> $l </LAST> IN $n CONSTRUCT <LAST> $l </LAST> </RESULT> 85 © 2002 by Prentice Hall
XML-QL (continued) <!ELEMENT book (author+, title, publisher)> <!ATTLIST book year CDATA> <!ELEMENT article (author+, title, year? , (shortversion|longversion))> <!ATTLIST article type CDATA> <!ELEMENT publisher (name, address)> <!ELEMENT author (firstname? , lastname)> 86 © 2002 by Prentice Hall
XML-QL (continued) WHERE <book> <publisher><name>Addison. Wesley</name></publisher> <title> $t</title> <author> $a</author> </book> IN "www. a. b. c/bib. xml" CONSTRUCT $a 87 © 2002 by Prentice Hall
XML-QL (continued) WHERE <book> <publisher><name>Addison. Wesley</></> <title> $t</> <author> $a</> IN "www. a. b. c/bib. xml" CONSTRUCT $a 88 © 2002 by Prentice Hall
XML-QL (continued) WHERE <book> <publisher><name>Addison-Wesley</></> <title> $t</> <author> $a</> IN "www. a. b. c/bib. xml" CONSTRUCT <result> <author> $a</> <title> $t</> 89 © 2002 by Prentice Hall
XML-QL (continued) <bib> <book year="1995"> <!-- A good introductory text --> <title> An Introduction to Database Systems </title> <author> <lastname> Date </lastname> </author> <publisher> <name> Addison-Wesley </name > </publisher> </book> <book year="1998"> <title> Foundation for Object/Relational Databases: The Third Manifesto </title> <author> <lastname> Date </lastname> </author> <lastname> Darwen </lastname> </author> <publisher> <name> Addison-Wesley </name > </publisher> </book> </bib> 90 © 2002 by Prentice Hall
XML-QL (continued) <result> <author> <lastname> Date </lastname> </author> <title> An Introduction to Database Systems </title> </result> <author> <lastname> Date </lastname> </author> <title> Foundation for Object/Relational Databases: The Third Manifesto </title> </result> <author> <lastname> Darwen </lastname> </author> <title> Foundation for Object/Relational Databases: The Third Manifesto </title> </result> 91 © 2002 by Prentice Hall
XML-QL (continued) WHERE <book > $p</> IN "www. a. b. c/bib. xml", <title > $t</>, <publisher><name>Addison-Wesley</>> IN $p CONSTRUCT <result> <title> $t </> WHERE <author> $a </> IN $p CONSTRUCT <author> $a</> 92 © 2002 by Prentice Hall
XML-QL (continued) <result> <title> An Introduction to Database Systems </title> <author> <lastname> Date </lastname> </author> </result> <title> Foundation for Object/Relational Databases: The Third Manifesto </title> <author> <lastname> Date </lastname> </author> <lastname> Darwen </lastname> </author> </result> 93 © 2002 by Prentice Hall
XML-QL (continued) WHERE <article> <author> <firstname> $f </> // firstname $f <lastname> $l </> // lastname $l </> CONTENT_AS $a IN "www. a. b. c/bib. xml" <book year=$y> <author> <firstname> $f </> // join on same firstname $f <lastname> $l </> // join on same lastname $l </> IN "www. a. b. c/bib. xml", y > 1995 CONSTRUCT <article> $a </> 94 © 2002 by Prentice Hall
XML-QL (continued) 95 © 2002 by Prentice Hall
XML-QL (continued) <!ATTLIST person ID ID #REQUIRED> <!ATTLIST article author IDREFS #IMPLIED> 96 © 2002 by Prentice Hall
XML-QL (continued) <person ID="o 123"> <firstname>John</firstname> <lastname>Smith<lastname> </person> <person ID="o 234">. . . </person> <article author="o 123 o 234"> <title>. . . </title> <year> 1995 </year> </article> 97 © 2002 by Prentice Hall
XML-QL (continued) 98 © 2002 by Prentice Hall
XML-QL (continued) WHERE <article><author><lastname> $n</></></> IN "abc. xml” WHERE <article author=$i> <title> </> ELEMENT_AS $t </>, <person ID=$i> <lastname> </> ELEMENT_AS $l </> CONSTRUCT <result> $t $l</> 99 © 2002 by Prentice Hall
Scalar values NOT! <title>A Trip to <titlepart> the Moon </titlepart></title> <title><CDATA> A Trip to </CDATA><titlepart><CDATA> the Moon</CDATA></titlepart></title> YES 100 © 2002 by Prentice Hall
Tag variables WHERE <$p> <title> $t </title> <year>1995</> <$e> Smith </> IN "www. a. b. c/bib. xml", $e IN {author, editor} CONSTRUCT <$p> <title> $t </title> <$e> Smith </> 101 © 2002 by Prentice Hall
Transforming data <!ELEMENT book (author+, title, publisher)> <!ATTLIST book year CDATA> <!ELEMENT article (author+, title, year? , (shortversion|longversion))> <!ATTLIST article type CDATA> <!ELEMENT publisher (name, address)> <!ELEMENT author (firstname? , lastname)> <!ELEMENT person (lastname, firstname, address? , phone? , publicationtitle*)> 102 © 2002 by Prentice Hall
Transforming data (cont’d) WHERE <$> <author> <firstname> $fn </> <lastname> $ln </> <title> $t </> IN "www. a. b. c/bib. xml", CONSTRUCT <person ID=Person. ID($fn, $ln)> <firstname> $fn </> <lastname> $ln </> <publicationtitle> $t </> 103 © 2002 by Prentice Hall
Integrating data from different sources WHERE <person> <name></> ELEMENT_AS $n <ssn> $ssn</> IN "www. a. b. c/data. xml", <taxpayer> <ssn> $ssn</> <income></> ELEMENT_AS $i </> IN "www. irs. gov/taxpayers. xml" CONSTRUCT <result> $n $i </> 104 © 2002 by Prentice Hall
Query blocks WHERE <$e> <title> $t </> <year> 1995 </> CONTENT_A $p IN "www. a. b. c/bib. xml" CONSTRUCT <result ID=Result. ID($p)> <title> $t </> { WHERE $e = "journal-paper", <month> $m </> IN $p CONSTRUCT <result ID=Result. ID($p)> <month> $m </> } { WHERE $e = "book", <publisher>$q </> IN $p CONSTRUCT <result ID=Result. ID($p)> <publisher>$q </> } 105 © 2002 by Prentice Hall
WSQ 106 © 2002 by Prentice Hall
Web-supported queries SIGMOD 2000 (Goldman and Widom) Web. Pages (Search. Exp, T 1, T 2, …, Tn, URL, Rank, Date) SELECT NAME, COUNT FROM STATES, WEBCOUNT WHERE NAME = T 1 ORDER BY COUNT DESC 107 © 2002 by Prentice Hall
KDD: Data Mining 108 © 2002 by Prentice Hall
The big problem • Billions of records • A small number of interesting patterns • “Data rich but information poor” 109 © 2002 by Prentice Hall
Data mining • Knowledge discovery • Knowledge extraction • Data/pattern analysis 110 © 2002 by Prentice Hall
Types of source data • • Relational databases Transactional databases Web logs Textual databases 111 © 2002 by Prentice Hall
Association rules • 65% of all customers who buy beer and tomato sauce also buy pasta and chicken wings • Association rules: X Y 112 © 2002 by Prentice Hall
Association analysis • IF 20 < age < 30 AND 20 K < INCOME < 30 K • THEN – Buys (“CD player”) • SUPPORT = 2%, CONFIDENCE = 60% 113 © 2002 by Prentice Hall
Basic concepts • Minimum support threshold • Minimum confidence threshold • Itemsets • Occurrence frequency of an itemset 114 © 2002 by Prentice Hall
Association rule mining • Find all frequent itemsets • Generate strong association rules from the frequent itemsets 115 © 2002 by Prentice Hall
Support and confidence • Support (X) • Confidence (X Y) = Support(X+Y) / Support (X) 116 © 2002 by Prentice Hall
Example 117 TID T 100 T 200 T 300 T 400 T 500 T 600 T 700 T 800 T 900 List of items IDs I 1, I 2, I 5 I 2, I 4 I 2, I 3 I 1, I 2, I 4 I 1, I 3 I 2, I 3 I 1, I 2, I 3, I 5 I 1, I 2, I 3 © 2002 by Prentice Hall
Example (cont’d) • • Frequent itemset l = {I 1, I 2, I 5} I 1 AND I 2 I 5 C = 2/4 = 50% I 1 AND I 5 I 2 AND I 5 I 1 I 2 AND I 5 I 2 I 1 AND I 5 I 3 I 1 AND I 2 118 © 2002 by Prentice Hall
Example 2 TID items T 100 10/15/99 {K, A, D, B} T 200 10/15/99 {D, A, C, E, B} T 300 10/19/99 {C, A, B, E} T 400 119 date 10/22/99 {B, A, D} min_sup = 60%, min_conf = 80% © 2002 by Prentice Hall
Correlations • Corr (A, B) = P (A OR B) / P(A) P (B) • If Corr < 1: A discourages B (negative correlation) • (lift of the association rule A B) 120 © 2002 by Prentice Hall
Contingency table Game ^Game Sum Video 4, 000 3, 500 7, 500 ^Video 2, 000 500 2, 500 Sum 6, 000 4, 000 10, 000 121 © 2002 by Prentice Hall
Example • • P({game}) = 0. 60 P({video}) = 0. 75 P({game, video}) = 0. 40 P({game, video})/(P({game})x(P({video}) ) = 0. 40/(0. 60 x 0. 75) = 0. 89 122 © 2002 by Prentice Hall
Example 2 hotdog ^hotdo Sum s gs hamburgers 2000 500 2500 ^hamburger 1000 s 1500 2500 Sum 2000 5000 3000 123 © 2002 by Prentice Hall
Classification using decision trees • Expected information need S • I (s 1, s 2, …, sm) = - pi log (pi) • s = data samples • m = number of classes 124 © 2002 by Prentice Hall
RID Age Income student credit buys? 1 <= 30 High No Fair No 2 <= 30 High No Excellent No 3 31. . 40 High No Fair Yes 4 > 40 Medium No Fair Yes 5 > 40 Low Yes Fair Yes 6 > 40 Low Yes Excellent No 7 31. . 40 Low Yes Excellent Yes 8 <= 30 Medium No Fair No 9 <= 30 Low Yes Fair Yes 10 > 40 Medium Yes Fair Yes 11 <= 30 Medium Yes Excellent Yes 12 31. . 40 Medium No Excellent Yes 13 31. . 40 High Yes Fair 14 > 40 Medium no excellent no Yes 125 © 2002 by Prentice Hall
Decision tree induction • I(s 1, s 2) = I(9, 5) = = - 9/14 log 9/14 – 5/14 log 5/14 = = 0. 940 126 © 2002 by Prentice Hall
Entropy and information gain • E(A) = S S 1 j + … + smj s I (s 1 j, …, smj) Entropy = expected information based on the partitioning into subsets by A Gain (A) = I (s 1, s 2, …, sm) – E(A) 127 © 2002 by Prentice Hall
Entropy • Age <= 30 s 11 = 2, s 21 = 3, I(s 11, s 21) = 0. 971 • Age in 31. . 40 s 12 = 4, s 22 = 0, I (s 12, s 22) = 0 • Age > 40 s 13 = 3, s 23 = 2, I (s 13, s 23) = 0. 971 128 © 2002 by Prentice Hall
Entropy (cont’d) • E (age) = 5/14 I (s 11, s 21) + 4/14 I (s 12, s 22) + 5/14 I (S 13, s 23) = 0. 694 • Gain (age) = I (s 1, s 2) – E(age) = 0. 246 • Gain (income) = 0. 029, Gain (student) = 0. 151, Gain (credit) = 0. 048 129 © 2002 by Prentice Hall
Final decision tree age > 40 31. . 40 student credit yes no yes excellent no fair yes 130 © 2002 by Prentice Hall
Other techniques • Bayesian classifiers • X: age <=30, income = medium, student = yes, credit = fair • P(yes) = 9/14 = 0. 643 • P(no) = 5/14 = 0. 357 131 © 2002 by Prentice Hall
Example • P (age < 30 | yes) = 2/9 = 0. 222 P (age < 30 | no) = 3/5 = 0. 600 P (income = medium | yes) = 4/9 = 0. 444 P (income = medium | no) = 2/5 = 0. 400 P (student = yes | yes) = 6/9 = 0. 667 P (student = yes | no) = 1/5 = 0. 200 P (credit = fair | yes) = 6/9 = 0. 667 P (credit = fair | no) = 2/5 = 0. 400 132 © 2002 by Prentice Hall
Example (cont’d) • P (X | yes) = 0. 222 x 0. 444 x 0. 667 = 0. 044 • P (X | no) = 0. 600 x 0. 400 x 0. 200 x 0. 400 = 0. 019 • P (X | yes) P (yes) = 0. 044 x 0. 643 = 0. 028 • P (X | no) P (no) = 0. 019 x 0. 357 = 0. 007 • Answer: yes/no? 133 © 2002 by Prentice Hall
More types of data mining • • Classification and prediction Cluster analysis Outlier analysis Evolution analysis 134 © 2002 by Prentice Hall
a44ac13f9592a021fd7cd3d64342c8a1.ppt