Web Usage Mining: Processes and Applications
Qiaoyuan Jiang
CSE 8331
November 24, 2003

Outline
  • Brief overview of Web mining
  • Web usage mining
  • Application areas of Web usage mining
  • Future research directions
  • Conclusions

Web Mining
  • Web mining is the application of data mining techniques to discover and retrieve useful information and patterns from World Wide Web documents and services [Etzioni, 1996].

Web Mining Categories
  • Web Content Mining: extracting knowledge from the content of the Web
  • Web Structure Mining: discovering the model underlying the link structures of the Web
  • Web Usage Mining: discovering users' navigation patterns and predicting users' behavior

Web Usage Mining Processes
  • Preprocessing: conversion of the raw data into the data abstractions (users, sessions, episodes, clickstreams, and pageviews) needed before the data mining algorithms can be applied.
  • Pattern Discovery: the key component of Web usage mining (WUM), which draws on algorithms and techniques from data mining, machine learning, statistics, pattern recognition, and related research areas.
  • Pattern Analysis: validation and interpretation of the mined patterns.

Web Usage Mining Processes (Cont.)

Web Usage Mining: Preprocessing
  • Data Cleaning: remove outliers and/or irrelevant data.
  • User Identification: associate page references with different users.
  • Session Identification: divide all pages accessed by a user into sessions.
  • Path Completion: add important page access records that are missing from the access log due to browser and proxy server caching.
  • Formatting: format the sessions according to the type of data mining to be accomplished.
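The cleaning, user identification, and session identification steps can be illustrated with a minimal Python sketch. Several things here are assumptions rather than part of the slides: log records are taken to be already parsed into (user_key, timestamp_in_seconds, url) tuples, the user key (for example an IP address plus user agent) is assumed to identify a user, embedded resources are filtered by file extension, and a 30-minute inactivity timeout is used as the session boundary. The function name sessionize is hypothetical.

```python
from collections import defaultdict

SESSION_TIMEOUT = 30 * 60  # seconds; an assumed heuristic, not from the slides
RESOURCE_SUFFIXES = (".gif", ".jpg", ".png", ".css", ".js")  # assumed noise


def sessionize(records, timeout=SESSION_TIMEOUT):
    """Group (user_key, timestamp, url) records into per-user sessions."""
    by_user = defaultdict(list)
    for user, ts, url in records:
        # Data cleaning: skip requests for embedded resources.
        if url.lower().endswith(RESOURCE_SUFFIXES):
            continue
        # User identification: the user key is trusted as-is in this sketch.
        by_user[user].append((ts, url))

    sessions = []
    for user, hits in by_user.items():
        hits.sort()  # order each user's requests by time
        current = [hits[0][1]]
        for (prev_ts, _), (ts, url) in zip(hits, hits[1:]):
            # Session identification: a long gap starts a new session.
            if ts - prev_ts > timeout:
                sessions.append((user, current))
                current = []
            current.append(url)
        sessions.append((user, current))
    return sessions
```

Path completion and formatting are left out of the sketch, since both depend on the site topology and on the mining algorithm chosen downstream.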

Web Usage Mining: Preprocessing (Cont.)

Web Usage Mining: Pattern Discovery Tasks
  • Statistical Analysis
  • Clustering
  • Classification
  • Association Rules
  • Sequential Patterns
  • Dependency Modeling

Web Usage Mining: Pattern Discovery Tasks (Cont.)
  • Statistical Analysis: frequency analysis, mean, median, etc.
      • Improve system performance
      • Provide support for marketing decisions
      • Simplify the site modification task
  • Clustering:
      • Clustering of users helps to discover groups of users with similar navigation patterns => provide personalized Web content
      • Clustering of pages helps to discover groups of pages with related content => search engines

Web Usage Mining: Pattern Discovery Tasks (Cont.)
  • Classification: the technique of mapping a data item into one of several predefined classes
      • Develop profiles of users belonging to a particular class or category
  • Association Rules: discover correlations among pages accessed together by a client
      • Help restructure the Web site
      • Page prefetching
      • Develop e-commerce marketing strategies
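As a rough illustration of how correlations among pages accessed together are measured, the sketch below computes the support and confidence of a rule over a list of sessions. It assumes each session is simply a list of page identifiers and uses the standard definitions: support is the fraction of sessions containing all the pages, and confidence is support(antecedent and consequent) divided by support(antecedent). The function names are hypothetical.

```python
def support(sessions, pages):
    """Fraction of sessions that contain every page in `pages`."""
    pages = set(pages)
    return sum(pages <= set(s) for s in sessions) / len(sessions)


def confidence(sessions, antecedent, consequent):
    """Estimated probability of seeing `consequent` given `antecedent`."""
    both = support(sessions, set(antecedent) | set(consequent))
    ante = support(sessions, antecedent)
    return both / ante if ante else 0.0
```

A full miner such as Apriori would enumerate candidate page sets and keep only those above the thresholds; this sketch only scores a rule that is already given.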

Web Usage Mining: Pattern Discovery Tasks (Cont.)
  • Sequential Patterns: extract frequently occurring inter-session patterns such that the presence of a set of items is followed by another item in time order
      • Predict future user visit patterns => placing ads or recommendations
      • Page prefetching
  • Dependency Modeling: determine whether there are any significant dependencies among the variables in the Web domain
      • Predict future Web resource consumption
      • Develop business strategies to increase sales
      • Improve navigational convenience for users

Web Usage Mining: Pattern Analysis
  • Pattern analysis is the final stage of WUM; it involves the validation and interpretation of the mined patterns.
  • Validation: eliminate the irrelevant rules or patterns and extract the interesting ones from the output of the pattern discovery process.
  • Interpretation: the output of mining algorithms is mainly in mathematical form and is not suitable for direct human interpretation.

Web Usage Mining: Pattern Analysis Methodologies and Tools
  • Visualization: helps people understand both real and abstract concepts
      • WebViz: the Web is visualized as a directed graph
  • Query mechanism: allows analysts to extract only relevant and useful patterns by specifying constraints
      • WEBMINER
  • On-Line Analytical Processing (OLAP): enables analysts to perform ad hoc analysis of data in multiple dimensions for decision-making
      • WebLogMiner

WEBMINER Query Example
  • Find all association rules (ARs) with a minimum support of 1% and a minimum confidence of 90%. The analyst is interested only in clients from the ".edu" domain and in data from Nov. 1, 2003 onward, with page accesses that start with URL A and contain B and C, in that order:

    SELECT association-rules(A*B*C*)
    FROM log.data
    WHERE date >= 031101 AND domain = "edu" AND support = 1.0 AND confidence = 90.0

Application Areas for Web Usage Mining
  • Personalized: discover the preferences and needs of individual Web users in order to provide a personalized Web site for certain types of users
  • Impersonalized: examine general user navigation patterns in order to understand how general users use the site
      • System Improvement
      • Site Modification
      • Business Intelligence
      • Web Characterization

System Improvement
  • High performance is expected of a Web application, since it directly affects user satisfaction.
  • WUM provides a key to understanding Web traffic behavior.
  • Applications:
      • Developing policies for Web caching, network transmission, load balancing, or data distribution
      • Detecting intrusion, fraud, and attempted break-ins to the system

Site Modification
  • Besides its content, the structure of a Web site is another crucial attribute for attracting users.
  • WUM can provide detailed feedback on users' navigation behavior, which can be used to redesign the site structure for easier navigation.
  • Adaptive Web site project [Perkowitz & Etzioni, 1998-1999]

Business Intelligence
  • Information on how customers use a Web site is critical for marketers of e-commerce businesses.
  • WUM can support business process optimization and marketing decisions.
  • Business intelligence includes personalization for C2B systems.

Usage Characterization
  • Mining general usage patterns (not focused on any specific users or Web sites) helps in the study of how browsers are used and how users interact with a browser interface.
  • It also makes it possible to look at the dynamics of the Web and how it is growing.

Personalization
  • Choosing among thousands of options is a challenge for Web users.
  • Goal: provide users with dynamic content tailored to their individual interests.
  • Form: recommending one or more items or pages to a user, based on the user's profile and usage behavior, or on the patterns of past visitors who have similar profiles.
  • Performance measurement:
      • Effectiveness: accuracy + coverage
      • Scalability

Applications of Personalization
  • Customizing access to information sources
  • Filtering news or e-mails
  • Recommendation services for the browsing process
  • Tutoring systems
  • Search
  • More...

3 Phases of Personalization
  • Data preparation and transformation: data cleaning, filtering, transaction identification
  • Pattern discovery: discover usage patterns
  • Recommendation: generate personalized content for a user based on matching the user's session (online process)

Personalization Techniques: Collaborative Filtering (CF)
  • Pattern discovery: an online kNN algorithm is applied to user profiles in a given domain to match people who have the same tastes.
  • Recommendation: pages or items that interest the k nearest neighbors are assumed to interest the active user as well.
  • Drawbacks:
      • Online process => lack of scalability
      • Static user profiles => low quality of recommendations
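The kNN-based recommendation step might look like the following minimal sketch. It makes several assumptions that the slides do not specify: a user profile is a dict mapping page identifiers to interest weights, similarity is measured by cosine similarity, and candidate pages are scored by the similarity-weighted sum of the neighbors' weights. The names cosine and recommend_knn are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two sparse profiles (dict: page -> weight)."""
    dot = sum(u[p] * v[p] for p in u.keys() & v.keys())
    norm = (math.sqrt(sum(x * x for x in u.values()))
            * math.sqrt(sum(x * x for x in v.values())))
    return dot / norm if norm else 0.0


def recommend_knn(active, profiles, k=2, top_n=3):
    """Recommend pages liked by the k profiles most similar to `active`."""
    neighbors = sorted(profiles.values(),
                       key=lambda p: cosine(active, p),
                       reverse=True)[:k]
    scores = {}
    for nb in neighbors:
        sim = cosine(active, nb)
        for page, weight in nb.items():
            if page not in active:  # only suggest pages not yet visited
                scores[page] = scores.get(page, 0.0) + sim * weight
    # Highest score first; ties broken by page name for determinism.
    return sorted(scores, key=lambda p: (-scores[p], p))[:top_n]
```

The scan over every stored profile at recommendation time is what makes this approach hard to scale, which matches the first drawback listed above.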

Personalization Techniques: Clustering
  • Technique: clustering user transactions and pageviews
  • Advantages:
      • User preferences are learned automatically from usage data and are therefore up to date.
      • Better scalability through clustering
  • Drawbacks:
      • Low accuracy

Personalization Techniques: Association Rules (ARs)
  • Technique:
      • For each user, create a transaction containing all the items the user has ever accessed.
      • Find all rules that satisfy the given support and confidence.
      • For each active user, find all the rules supported by the user. Items predicted by these rules are the candidate recommendations.
  • Drawbacks:
      • All association rules must be discovered before generating recommendations. This can be improved by generating ARs in real time from a subset of transactions within the active user's neighborhood.
      • High support => better scalability and accuracy, but low coverage.
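The last step of the technique, matching an active user against previously mined rules, can be sketched as follows. The rule representation is an assumption: each rule is a tuple (antecedent set of pages, consequent page, confidence), and a rule counts as supported by the user when its whole antecedent appears in the user's accessed items. Ranking candidates by the highest confidence among matching rules is one simple choice, not something the slides prescribe. The function name recommend_from_rules is hypothetical.

```python
def recommend_from_rules(active_session, rules, top_n=3):
    """Return candidate pages predicted by rules the active user supports.

    `rules` is an iterable of (antecedent_pages, consequent_page, confidence).
    """
    seen = set(active_session)
    best = {}
    for antecedent, consequent, conf in rules:
        # A rule fires only if the user has accessed its whole antecedent
        # and has not already accessed the page it predicts.
        if set(antecedent) <= seen and consequent not in seen:
            best[consequent] = max(best.get(consequent, 0.0), conf)
    # Highest confidence first; ties broken by page name for determinism.
    return sorted(best, key=lambda p: (-best[p], p))[:top_n]
```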

Personalization Techniques: Sequential Patterns (SPs)
  • Technique: Markov model
  • Advantages:
      • Better accuracy: SPs contain more precise information about user navigation behavior.
  • Drawbacks:
      • Low recommendation coverage
      • More suitable for predictive tasks, e.g., Web prefetching
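A minimal sketch of the Markov model idea, assuming a first-order model (the next page depends only on the current page) estimated from session sequences. The slides do not state the model order, so this is only one possible reading; the names fit_markov and predict_next are hypothetical.

```python
from collections import Counter, defaultdict


def fit_markov(sessions):
    """Count page-to-page transitions over all sessions."""
    counts = defaultdict(Counter)
    for session in sessions:
        for cur, nxt in zip(session, session[1:]):
            counts[cur][nxt] += 1
    return counts


def predict_next(counts, page):
    """Return (most likely next page, estimated probability) or None."""
    followers = counts.get(page)
    if not followers:
        return None  # page never seen with a successor: no prediction
    total = sum(followers.values())
    # Highest count wins; ties broken by page name for determinism.
    nxt, n = max(followers.items(), key=lambda kv: (kv[1], kv[0]))
    return nxt, n / total
```

The None result for unseen pages is a small-scale picture of the low coverage drawback: the model can only predict from transitions it has actually observed.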

Personalization Techniques: Hybrid Models
  • Hybrid models automatically switch among different personalization models based on the localized degree of hyperlink connectivity.
      • High connectivity degree => non-SP models
      • Low connectivity degree and deeper navigation paths => SP models
  • Performance: better than any individual model

Future Research Directions
  • Usage Mining on the Semantic Web
      • WUM can help to build the Semantic Web.
      • With the Semantic Web, WUM can be improved.
  • Multimedia Web Data Mining
      • Representation of, problem solving with, and learning from multimedia data remain a challenge.

Future Research Directions (Cont.)
  • Soft Computing Technology for Web Mining
      • Fuzzy logic: deals with imprecise and conceptual data. Used in clustering Web log data and mining ARs.
      • Neural networks:
          • Adaptive to new data and information
          • Suitable for parallel processing
          • Robust to missing, confusing, or ill-defined data
          • Capable of modeling non-linear decision boundaries
          • Effective for learning user profiles
      • Genetic algorithms: randomized search and optimization guided by evaluation criteria
          • Efficient, adaptive, robust, parallelizable
          • Used in search and query optimization and in predicting user preferences

Future Research Directions (Cont.)
  • Analysis of Discovered Patterns
      • Research on efficient, flexible, and powerful analysis tools
  • More Applications
      • Temporal evolution of usage behavior
      • Improving Web services
      • Detecting credit card fraud
      • Privacy issues

Conclusions