27ffd248ce6a46a79f88fd3bf575269e.ppt
- Number of slides: 66
Advanced Topics in Data Mining: Web Mining
Web Mining
Web Mining
• Applications are being ported to the Web at a rapid pace
• Online services such as America Online (AOL) and CompuServe (merged into AOL) are anxious to know user access patterns, not just "search" on the Web
• How does Amazon do it?
• Understanding Web user behavior is important
  – It can improve Web page organization
  – It can increase Web server performance
  – It can exploit Web advertising
  – It can increase business opportunity
Amazon Web Page Association Rules
More Information Desired
• Collecting statistical information (page hits) alone is insufficient because:
  – The hit frequency of a page depends not only on its content but also on its location
  – The number of users accessing a page is not available
  – Information on which pages are accessed together is not available
• Data mining in the Web (Web Mining)
  – Web Access Pattern Collection
  – Web User Pattern Mining
Web Access Pattern Collection
• Server-Based Data Collection
  – Who is visiting a given Web site, and what are they doing?
• Agent-Based Data Collection
  – Which Web sites has a particular user visited?
Server-Based Data Collection
• Examine the logs collected by HTTPd
  – Access Log (IP, Time, Access Data), Referred Log (A → B), Error Log, …
  – We can combine some of them for our use if necessary
• Problems
  – The use of proxy servers
  – The effect of caching
Server-Based Data Collection
Access Log
• Fields: IP/Domain Name, Time, Access Data
Referred Log (the caching problem is not taken into account)
Server-Based Data Collection
• Has to be done in accordance with technology advances
  – The use of Active Server Pages (Session ID available)
    • The use of proxy servers
    • The effect of caching
  – HTTPd 1.1
• Limitation
  – Can only capture user behavior while users are within this site
Agent-Based Data Collection
• Understanding individual Web behavior needs client-based data collection
• Results are useful
  – Better Personalized Service
  – Improved Web Page Organization
  – Better Pricing Policies
• Methods
  – Applets can only read/write files in their source servers
    • a big security constraint
  – Using Active Components (ActiveX Control) and Plug-ins
    • APCS (Access Pattern Collection Server)
APCS
Agent-Based Data Collection
• Very difficult to do for non-registered users in the current Web environment
  – It has to be conducted with users' consent
• Very dependent upon available Web technologies
Web User Pattern Mining
• Web user pattern mining discovers user access patterns in Web servers
• Pattern discovery and analysis tools
  – Some existing Web tools provide mechanisms for reporting user activity in the servers
  – Web Trends (http://www.webtrends.com.tw/)
  – Open Market (http://www.openmarket.com/)
  – NetGenesis (http://www.netgen.com/)
Path Traversal Patterns Mining
• Mining path traversal patterns in a distributed information-providing environment (WWW) where documents or objects are linked together (via hyperlinks) to facilitate interactive access
• The solution procedure consists of three steps:
  – Convert the original sequence of log data into a set of maximal forward references (MF)
    • Filter out the effect of some backward references
      – Backward references are mainly made for ease of traveling; filtering them lets us concentrate on mining meaningful user access sequences
      – Some objects are visited because of their locations rather than their content
  – Determine the frequent traversal patterns, i.e., large reference sequences, from the maximal forward references obtained
  – Determine the maximal reference sequences from the large reference sequences (trivial)
Step 1: MF References
• Suppose the traversal log contains the following traversal path for a user:
  – A, B, C, D, C, B, E, G, H, G, W, A, O, U, O, V
• When a backward reference occurs, a forward reference path terminates
• The set of maximal forward references is {ABCD, ABEGH, ABEGW, AOU, AOV}
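The conversion in Step 1 can be sketched in a few lines of Python. This is a minimal illustration rather than the original implementation: the function name `maximal_forward_references` and the single-character page names are assumptions made for the example. A page that is already on the current path is treated as a backward reference, and the current path is emitted only if the previous move was a forward one.

```python
def maximal_forward_references(traversal):
    """Split one user's traversal into maximal forward references (MFs)."""
    path = []               # current forward path from the session's start page
    moving_forward = False  # True if the previous step was a forward reference
    result = []
    for page in traversal:
        if page in path:
            # Backward reference: the forward path terminates here.
            if moving_forward:
                result.append("".join(path))
            # Back up to the revisited page and continue from there.
            del path[path.index(page) + 1:]
            moving_forward = False
        else:
            path.append(page)
            moving_forward = True
    if moving_forward and path:
        result.append("".join(path))  # the last forward path is also maximal
    return result


mfs = maximal_forward_references("ABCDCBEGHGWAOUOV")
print(mfs)
```

Run on the traversal path from the slide, the sketch yields the same five maximal forward references listed there.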
Step 1: Another Example
Step 1: Arrange Database Encoding
Step 1: Database Reduction
Step 2: Find Frequent Reference Sequences
• Two algorithms for finding frequent traversal patterns (frequent reference sequences, i.e., frequent consecutive subsequences)
  – Full-Scan (FS) Algorithm
    • FS utilizes key ideas of the DHP algorithm
  – Selective-Scan (SS) Algorithm
    • SS reduces the number of database scans
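To make the goal of Step 2 concrete, here is a plain level-wise search for frequent consecutive subsequences. It is deliberately not FS or SS: it uses no hash table, no database reduction, and one pass over the MFs per level. The function name and the support definition (the number of MFs that contain the sequence) are assumptions for this sketch.

```python
from collections import Counter


def frequent_reference_sequences(mf_refs, min_count):
    """Return {sequence: support} for consecutive subsequences with support >= min_count."""
    large = {}
    prev = None  # frequent sequences of length k-1
    k = 1
    while True:
        counts = Counter()
        for ref in mf_refs:
            # Distinct length-k windows (consecutive subsequences) of this MF.
            windows = {ref[i:i + k] for i in range(len(ref) - k + 1)}
            if prev is not None:
                # Apriori-style pruning: both (k-1)-length sub-windows must be frequent.
                windows = {w for w in windows if w[:-1] in prev and w[1:] in prev}
            counts.update(windows)
        current = {seq: c for seq, c in counts.items() if c >= min_count}
        if not current:
            return large
        large.update(current)
        prev = set(current)
        k += 1


large = frequent_reference_sequences(["ABCD", "ABEGH", "ABEGW", "AOU", "AOV"], 2)
print(sorted(large, key=lambda s: (len(s), s)))
```

With the five MFs from Step 1 and a minimum count of 2 (a threshold chosen only for illustration), the longest frequent reference sequence found is ABEG.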
Full-Scan (FS) Algorithm
• Generate L1 & Hash Table (Scan DB-1)
Generate L1 & Hash Table (Scan DB-1)
• h(x, y) = [(order of x) * 23 + (order of y)] mod 17
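The hash function above can be tried out with a small sketch. The slide does not define "order"; the code assumes it means the 1-based alphabetical position of a page name (A = 1, B = 2, ...). The helper `bucket_counts` is a hypothetical name: it hashes every adjacent page pair of each MF into a bucket, which is how a DHP-style first scan can collect information for pruning candidate 2-sequences.

```python
from collections import Counter


def order(page):
    # Assumption: "order" is the 1-based alphabetical position (A=1, B=2, ...).
    return ord(page) - ord("A") + 1


def h(x, y):
    # Hash function from the slide: [(order of x) * 23 + (order of y)] mod 17
    return (order(x) * 23 + order(y)) % 17


def bucket_counts(mf_refs):
    """Count how many adjacent page pairs fall into each hash bucket."""
    buckets = Counter()
    for ref in mf_refs:
        for i in range(len(ref) - 1):
            buckets[h(ref[i], ref[i + 1])] += 1
    return buckets


print(h("A", "B"))
print(bucket_counts(["ABCD", "ABEGH"]))
```

A pair can only be a frequent 2-sequence if its bucket count reaches the support threshold, so sparsely filled buckets let candidates be discarded before they are counted exactly.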
Generate C2
Generate L2 & Reduce DB (Scan DB-2)
Generate C3, L3 & Reduce DB (Scan DB-3)
Generate C4, L4 & Reduce DB (Scan DB-4)
Selective-Scan (SS) Algorithm (Scan DB-3)
Step 3: Generate Frequent Traversal Patterns (Maximal Reference Sequences)
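Step 3 keeps only those large reference sequences that are not contained in any other large reference sequence. The sketch below assumes that sequences are strings of single-character page names, so that Python's substring test corresponds to "is a consecutive subsequence of"; the function name and the input set are illustrative.

```python
def maximal_reference_sequences(large_seqs):
    """Keep the sequences that are not a consecutive subsequence of another one."""
    return sorted(
        s for s in large_seqs
        if not any(s != t and s in t for t in large_seqs)
    )


# Illustrative input: a set of large reference sequences (assumed, not from the slides).
large = {"A", "B", "E", "G", "O", "AB", "BE", "EG", "AO", "ABE", "BEG", "ABEG"}
print(maximal_reference_sequences(large))
```

The quadratic comparison is enough for a sketch, which fits the slide's remark that this step is trivial compared with Step 2.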
WAP-Mine Algorithm
• The key consideration is how to facilitate the tedious support counting and candidate generation operations in the mining procedure
• Given a Web Access Sequence database WAS and a support threshold ξ, mine the complete set of ξ-patterns of WAS

WAS
User ID   Web Access Sequence
100       abdac
200       eaebcac
300       babfaec
400       afbacfc
WAP-Mine Algorithm
(1) Scan WAS once and find all frequent-1 events
(2) Scan WAS again and construct a WAP-tree
(3) Recursively mine the WAP-tree using conditional search
Output: access patterns
Find All Frequent-1 Events (Min_Sup = 75%)

Item   Support Frequency
a      4
b      4
c      4
d      1
e      2
f      2

User ID   Web Access Sequence   Frequent Subsequence
100       abdac                 abac
200       eaebcac               abcac
300       babfaec               babac
400       afbacfc               abacc
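The first scan can be reproduced with a short sketch. The names `frequent_events` and `frequent_subsequences` are made up for this example; support is counted once per user sequence, and the threshold is taken as a fraction of the number of sequences, as in the 75% setting of the slide.

```python
from collections import Counter

WAS = {100: "abdac", 200: "eaebcac", 300: "babfaec", 400: "afbacfc"}


def frequent_events(was, min_sup):
    """Events that occur in at least min_sup (a fraction) of the access sequences."""
    counts = Counter()
    for seq in was.values():
        counts.update(set(seq))  # each event counts once per sequence
    threshold = min_sup * len(was)
    return {event for event, count in counts.items() if count >= threshold}


def frequent_subsequences(was, frequent):
    """Drop the non-frequent events from every sequence, keeping the order."""
    return {uid: "".join(e for e in seq if e in frequent) for uid, seq in was.items()}


frequent = frequent_events(WAS, 0.75)
print(sorted(frequent))
print(frequent_subsequences(WAS, frequent))
```

The resulting frequent subsequences are the inputs used to build the WAP-tree.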
WAP-Tree Construction
• Use the frequent events to register all count information for further mining

User ID   Frequent Subsequence
100       abac
200       abcac
300       babac
400       abacc
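A WAP-tree is essentially a prefix tree in which every node carries a count. The following is a simplified sketch of the insertion step only: the class and function names are assumptions, and the header table with node links that the mining phase relies on is left out.

```python
class WapNode:
    """One node of a simplified WAP-tree: an event label, a count, and children."""

    def __init__(self, event=None):
        self.event = event
        self.count = 0
        self.children = {}


def build_wap_tree(sequences):
    """Insert every frequent subsequence, sharing common prefixes."""
    root = WapNode()
    for seq in sequences:
        node = root
        for event in seq:
            if event not in node.children:
                node.children[event] = WapNode(event)
            node = node.children[event]
            node.count += 1  # one more sequence passes through this node
    return root


root = build_wap_tree(["abac", "abcac", "babac", "abacc"])
print(root.children["a"].count, root.children["b"].count)
```

Three of the four frequent subsequences start with a, so they share the branch under the root's a child, while babac starts a separate branch.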
Mining Web Access Patterns from WAP-Tree
Conditional Sequence Based on c

Sequence   Count
aba        2
ab         1
abca       1
ab         -1
baba       1
abac       1
aba        -1

Sequence   Count (after the negative counts are applied)
aba        1
abca       1
baba       1
abac       1

Item   Support Frequency
a      4
b      4
c      2

Generate Web Access Patterns: ac, bc
Mining Web Access Patterns from WAP-Tree
Conditional Sequence Based on ac

Sequence   Count
ab         3
b          1
bab        1
b          -1

Item   Support Frequency
a      4
b      4

Generate Web Access Patterns: aac, bac
Mining Web Access Patterns from WAP-Tree
Conditional Sequence Based on bac

Sequence   Count
a          3
ba         1

Item   Support Frequency
a      4
b      1

Generate Web Access Patterns: abac
Mining Web Access Patterns from WAP-Tree
Conditional Sequence Based on abac

Sequence   Count
a          4

No Web Access Patterns are Generated
Mining for Web Transactions • To capture Web customer buying behavior – It is not just market basket transaction for the set of items bought by a customer in a single purchase (Association Rules) – It is not just Web user travel patterns (Path Traversal Patterns) – It is an extension from path traversal patterns • Exploring the relationship between traveling and buying
Mining for Web Transactions
Web Transaction
→ Algorithm WR (Web-transaction-Record)
→ Web Transaction Records <Path: a Set of Purchases>
→ Algorithm WTM, MTSPJ, MTSPC
→ Frequent Transaction Patterns
→ Web Transaction Association Rules
Mining for Web Transactions
• Web-transaction-Record (WR) Algorithm
  – Extracts meaningful Web transaction records from the given Web transactions
• WTM (Web Transaction Mining) Algorithm
  – Mines Web transaction patterns
• MTS (Maximal Transaction Segment) Algorithms are improved versions of WTM
WTM Algorithm • It joins the purchased itemsets for generating candidate transaction patterns • WTM employs a two-level hash tree, called Web transaction tree, to store candidate transaction patterns – WTM hashes not only each item but also each purchase in the path
WTM Algorithm
DATABASE: Web Transactions

WT_ID   Path    Purchase
100     ABCE    B{i1}, C{i2}, E{i4}
        ASJL    S{i7}, L{i9}
200     ABFGH   B{i1}, H{i6}
        ABCE    B{i1}, C{i2}, E{i4}
        ASJLQ   S{i7}, Q{i10}
300     ABCE    B{i1}, E{i4}
        ABFG    B{i1}, G{i5}
        ASJL    S{i7}, J{i8}, L{i9}
400     ABD     D{i3}
        ABFG    G{i5}
        ASJLQ   S{i7}, J{i8}, Q{i10}
Support Count

WT_ID   Path    Purchase
100     ABCE    B{i1}, C{i2}, E{i4}
        ASJL    S{i7}, L{i9}
200     ABFGH   B{i1}, H{i6}
        ABCE    B{i1}, C{i2}, E{i4}
        ASJLQ   S{i7}, Q{i10}

Path   Purchase   Support Count
AB     B{i1}      2
ABC    C{i2}      2
WTM Algorithm (Support Count >= 2)

C1
Path    Purchase   Sup.
AB      B{i1}      3
ABC     C{i2}      2
ABD     D{i3}      1
ABCE    E{i4}      3
ABFG    G{i5}      2
ABFGH   H{i6}      1
AS      S{i7}      4
ASJ     J{i8}      2
ASJL    L{i9}      2
ASJLQ   Q{i10}     2

T1
Path    Purchase   Sup.
AB      B{i1}      3
ABC     C{i2}      2
ABCE    E{i4}      3
ABFG    G{i5}      2
AS      S{i7}      4
ASJ     J{i8}      2
ASJL    L{i9}      2
ASJLQ   Q{i10}     2
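The support counts in C1 can be checked with a small sketch. The database literal below is a reading of the flattened slide table, and the assignment of records to WT_ID 300 and 400 is inferred from the support counts, so treat it as an assumption. The function `support_count` and the (node, item) encoding of a purchase are also illustrative. A Web transaction is assumed to support a pattern if one of its path records contains all purchases of the pattern.

```python
# Assumed reading of the example database: WT_ID -> list of (path, {node: item}).
DB = {
    100: [("ABCE", {"B": "i1", "C": "i2", "E": "i4"}),
          ("ASJL", {"S": "i7", "L": "i9"})],
    200: [("ABFGH", {"B": "i1", "H": "i6"}),
          ("ABCE", {"B": "i1", "C": "i2", "E": "i4"}),
          ("ASJLQ", {"S": "i7", "Q": "i10"})],
    300: [("ABCE", {"B": "i1", "E": "i4"}),
          ("ABFG", {"B": "i1", "G": "i5"}),
          ("ASJL", {"S": "i7", "J": "i8", "L": "i9"})],
    400: [("ABD", {"D": "i3"}),
          ("ABFG", {"G": "i5"}),
          ("ASJLQ", {"S": "i7", "J": "i8", "Q": "i10"})],
}


def support_count(db, purchases):
    """Number of Web transactions with a path record containing all given purchases."""
    wanted = set(purchases)  # set of (node, item) pairs
    return sum(
        any(wanted <= set(bought.items()) for _path, bought in records)
        for records in db.values()
    )


print(support_count(DB, [("B", "i1")]))
print(support_count(DB, [("S", "i7"), ("J", "i8")]))
```

Under these assumptions the counts agree with the Sup. column of C1, for example 3 for B{i1} and 4 for S{i7}.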
WTM Algorithm (Support Count >= 2)

C2 (28 candidates in total; excerpt)
Path    Purchase        Sup.
ABC     B{i1} C{i2}     2
ABCE    B{i1} E{i4}     3
AS      B{i1} S{i7}     0
ABCE    C{i2} E{i4}     2
ASJ     B{i1} J{i8}     0
ASJ     S{i7} J{i8}     2
ASJL    S{i7} L{i9}     2
ASJLQ   S{i7} Q{i10}    2
ASJLQ   J{i8} Q{i10}    1
ASJLQ   L{i9} Q{i10}    0

T2
Path    Purchase        Sup.
ABC     B{i1} C{i2}     2
ABCE    B{i1} E{i4}     3
ABCE    C{i2} E{i4}     2
ASJ     S{i7} J{i8}     2
ASJL    S{i7} L{i9}     2
ASJLQ   S{i7} Q{i10}    2
WTM Algorithm (Support Count >= 2)

C3
Path   Purchase            Sup.
ABCE   B{i1} C{i2} E{i4}   2

T3
Path   Purchase            Sup.
ABCE   B{i1} C{i2} E{i4}   2
WTM Disadvantages • WTM may generate a lot of unqualified candidate transaction patterns without utilizing the paths of frequent transaction patterns • This will degrade the performance
MTSPJ Algorithm
• Algorithm MTSPJ uses the maximal transaction segment, which contains the frequent transaction patterns and the maximal path, to solve the unqualified candidate transaction pattern problem
• MTSPJ generates candidate transaction patterns only when a leaf node of the Web transaction tree is reached
MTSPJ Algorithm
• DATABASE: the same Web transactions (WT_ID 100 to 400) as in the WTM example
• Web transaction tree nodes: A, B, C, E, S, F, D, J, G, H, L, Q
MTSPJ Algorithm (Support Count >= 2)

C1
Path    Purchase   Sup.
AB      B{i1}      3
ABC     C{i2}      2
ABD     D{i3}      1
ABCE    E{i4}      3
ABFG    G{i5}      2
ABFGH   H{i6}      1
AS      S{i7}      4
ASJ     J{i8}      2
ASJL    L{i9}      2
ASJLQ   Q{i10}     2

T1
Path    Purchase   Sup.
AB      B{i1}      3
ABC     C{i2}      2
ABCE    E{i4}      3
ABFG    G{i5}      2
AS      S{i7}      4
ASJ     J{i8}      2
ASJL    L{i9}      2
ASJLQ   Q{i10}     2

Web transaction tree nodes: A, B, C, E, S, J, F, G, L, Q
MTSPJ Algorithm

Maximal Transaction Segment: ABCE with B{i1} C{i2} E{i4}
C2
Path    Purchase        Sup.
ABC     B{i1} C{i2}     2
ABCE    B{i1} E{i4}     3
ABCE    C{i2} E{i4}     2

Maximal Transaction Segment: ABFG with B{i1} G{i5}
C2
Path    Purchase        Sup.
ABFG    B{i1} G{i5}     1

Maximal Transaction Segment: ASJLQ with S{i7} J{i8} L{i9} Q{i10}
C2
Path    Purchase        Sup.
ASJ     S{i7} J{i8}     2
ASJL    S{i7} L{i9}     2
ASJL    J{i8} L{i9}     1
ASJLQ   S{i7} Q{i10}    2
ASJLQ   J{i8} Q{i10}    1
ASJLQ   L{i9} Q{i10}    0
MTSPJ Algorithm

C2
Path    Purchase        Sup.
ABC     B{i1} C{i2}     2
ABCE    B{i1} E{i4}     3
ABCE    C{i2} E{i4}     2
ABFG    B{i1} G{i5}     1
ASJ     S{i7} J{i8}     2
ASJL    S{i7} L{i9}     2
ASJL    J{i8} L{i9}     1
ASJLQ   S{i7} Q{i10}    2
ASJLQ   J{i8} Q{i10}    1
ASJLQ   L{i9} Q{i10}    0

T2
Path    Purchase        Sup.
ABC     B{i1} C{i2}     2
ABCE    B{i1} E{i4}     3
ABCE    C{i2} E{i4}     2
ASJ     S{i7} J{i8}     2
ASJL    S{i7} L{i9}     2
ASJLQ   S{i7} Q{i10}    2
MTSPJ Algorithm

Maximal Transaction Segment: ABCE with B{i1} C{i2} E{i4}
C3
Path   Purchase            Sup.
ABCE   B{i1} C{i2} E{i4}   2
MTSPC Algorithm
• MTSPC utilizes the LC (Large Count) to filter candidates
• Support Count >= 2

C1
Path    Purchase   Sup.
AB      B{i1}      3
ABC     C{i2}      2
ABD     D{i3}      1
ABCE    E{i4}      3
ABFG    G{i5}      2
ABFGH   H{i6}      1
AS      S{i7}      4
ASJ     J{i8}      2
ASJL    L{i9}      2
ASJLQ   Q{i10}     2

T1
Path    Purchase   Sup.
AB      B{i1}      3
ABC     C{i2}      2
ABCE    E{i4}      3
ABFG    G{i5}      2
AS      S{i7}      4
ASJ     J{i8}      2
ASJL    L{i9}      2
ASJLQ   Q{i10}     2

Web transaction tree nodes: A, B, C, E, S, J, F, G, L, Q
MTSPC Algorithm (K = 1)

Maximal Transaction Segment (maximal path ABCE)
Item     LC
B{i1}    1
C{i2}    1
E{i4}    1
|I| = 3 > 1
C2
Path    Purchase        Sup.
ABC     B{i1} C{i2}     2
ABCE    B{i1} E{i4}     3
ABCE    C{i2} E{i4}     2

Maximal Transaction Segment (maximal path ABFG)
Item     LC
B{i1}    1
G{i5}    1
|I| = 2 > 1
C2
Path    Purchase        Sup.
ABFG    B{i1} G{i5}     1

Maximal Transaction Segment (maximal path ASJLQ)
Item     LC
S{i7}    1
J{i8}    1
L{i9}    1
Q{i10}   1
|I| = 4 > 1
C2
Path    Purchase        Sup.
ASJ     S{i7} J{i8}     2
ASJL    S{i7} L{i9}     2
ASJL    J{i8} L{i9}     1
ASJLQ   S{i7} Q{i10}    2
ASJLQ   J{i8} Q{i10}    1
ASJLQ   L{i9} Q{i10}    0
MTSPC Algorithm

C2
Path    Purchase        Sup.
ABC     B{i1} C{i2}     2
ABCE    B{i1} E{i4}     3
ABCE    C{i2} E{i4}     2
ABFG    B{i1} G{i5}     1
ASJ     S{i7} J{i8}     2
ASJL    S{i7} L{i9}     2
ASJL    J{i8} L{i9}     1
ASJLQ   S{i7} Q{i10}    2
ASJLQ   J{i8} Q{i10}    1
ASJLQ   L{i9} Q{i10}    0

T2
Path    Purchase        Sup.
ABC     B{i1} C{i2}     2
ABCE    B{i1} E{i4}     3
ABCE    C{i2} E{i4}     2
ASJ     S{i7} J{i8}     2
ASJL    S{i7} L{i9}     2
ASJLQ   S{i7} Q{i10}    2
MTSPC Algorithm (K = 2)

T2
Path    Purchase        Sup.
ABC     B{i1} C{i2}     2
ABCE    B{i1} E{i4}     3
ABCE    C{i2} E{i4}     2
ASJ     S{i7} J{i8}     2
ASJL    S{i7} L{i9}     2
ASJLQ   S{i7} Q{i10}    2

Maximal Transaction Segment (maximal path ABCE)
Item     LC
B{i1}    2
C{i2}    2
E{i4}    2
|I| = 3 > 2
C3
Path   Purchase            Sup.
ABCE   B{i1} C{i2} E{i4}   2

Maximal Transaction Segment (maximal path ASJLQ)
Item     LC
S{i7}    3
J{i8}    1
L{i9}    1
Q{i10}   1
|I| = 1 < 2: No Generations
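The LC values on the K = 1 and K = 2 slides are consistent with the following reading, which is an interpretation rather than a rule stated on the slides: the LC of an item is the number of frequent K-purchase patterns that contain it, only items with LC >= K are kept in the set I, and candidates of size K + 1 are generated from a segment only if |I| > K. The sketch below encodes that assumed rule; the function name is made up, and items are abbreviated to their node letters.

```python
from itertools import combinations


def lc_candidates(segment_items, frequent_k, k):
    """Assumed LC filter: returns (LC per item, candidate (k+1)-item combinations)."""
    # LC of an item: number of frequent k-patterns that contain it.
    lc = {item: sum(item in pattern for pattern in frequent_k)
          for item in segment_items}
    kept = [item for item in segment_items if lc[item] >= k]
    if len(kept) <= k:
        return lc, []  # too few items survive: no candidates are generated
    return lc, list(combinations(kept, k + 1))


# T2 from the example, with each purchase abbreviated to its node letter.
T2 = [{"B", "C"}, {"B", "E"}, {"C", "E"}, {"S", "J"}, {"S", "L"}, {"S", "Q"}]

print(lc_candidates(["B", "C", "E"], T2, 2))
print(lc_candidates(["S", "J", "L", "Q"], T2, 2))
```

For the segment on path ASJLQ only S keeps an LC of at least 2, so no 3-purchase candidate is generated there, which matches the "No Generations" note on the slide.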
Mining for Web Transactions
• <ABCE : B{1}, E{4}> = 2
• <AB : B{1}> = 3
• We can derive <ABCE : B{1} => E{4}>
  – support_count(<ABCE : B{1} => E{4}>) = 2
  – confidence(<ABCE : B{1} => E{4}>) = support_count(<ABCE : B{1}, E{4}>) / support_count(<AB : B{1}>) = 2 / 3 ≈ 66.7%
Summary
• Data mining in the Web is an area of growing importance
  – In particular, with the emergence of electronic commerce (EC)
  – More and more applications will benefit from the knowledge gained from data mining
• Web Mining = Web Data Collection + Traditional Data Mining?
• Important Issues
  – Incremental Web Mining