By Uday Kumar WEB MINING

Agenda
» World Wide Web - a brief history
» Introduction to Data Mining Process & Techniques
» Web Mining
» Data Mining vs. Web Mining
» Classification of Web Mining
» Benefits & Application Areas of Web Mining
» Software
» Summary

World Wide Web - a brief history
» Who invented the World Wide Web? (Sir) Tim Berners-Lee invented it in 1989 while working at CERN, including the URL scheme and HTML, and in 1990 wrote the first server (httpd) and the first browser.
» The Web's characteristics:
  » Billions of documents authored by millions of diverse people
  » Distributed over millions of computers, connected by a variety of media
  » Large size, dynamic content, a time dimension, and multilingual content
  » Different data types: text, images, hyperlinks, and user usage information

Mining Large Data Sets - Motivation
» There is often information “hidden” in the data that is not readily evident.
» Human analysts may take weeks to discover useful information.
» Much of the data is never analyzed at all.

Data Mining

Data Mining - Definition
» It is commonly defined as the process of extracting meaningful information from data sources, e.g. databases, texts, images, the web, etc.
» It is the process of automatically extracting and generating predictive information from large data banks, which helps us understand current market trends and take proactive measures to gain the maximum benefit from them.

Data Mining Process

Data Mining Tasks
» Data mining makes use of various algorithms to perform a variety of tasks. These algorithms examine the sample data of a problem and determine a model that comes close to solving the problem.
» A predictive model enables you to predict the values of data by making use of known results from a different set of sample data. The tasks that form part of a predictive model are: Classification, Regression, Time Series Analysis.

Data Mining Tasks Contd.
» A descriptive model enables you to determine the patterns and relationships in sample data. The tasks that form part of a descriptive model are: Clustering, Summarization, Association Rules, Sequence Discovery.

Data Mining Tasks Contd.
» Classification: enables you to classify data in a large data bank into a predefined set of classes. Ex: People younger than 40 with a salary above 40k trade online.
» Regression: enables you to forecast data values based on present and past values. Ex: it helps an organization predict its need for new recruits and purchases based on the past and current growth rate.
» Time Series Analysis: enables you to predict future values when the current set of values is time dependent (monthly, yearly, ...).
» Summarization: enables you to summarize a large chunk of data, such as the content of a web page.

Data Mining Tasks Contd.
» Clustering: enables you to create new groups (clusters) based on the study of patterns and relationships between values of data in a data bank. It is similar to classification but does not require you to predefine the groups (it is also called unsupervised learning). Ex: Users A and B access similar URLs.
» Association Rules: define rules of association between data items and then use those rules to establish relationships. Ex: Find the items that tend to be purchased together and specify their relationship.
» Sequence Discovery: enables you to determine the sequential patterns that might exist in a large and unorganized data bank. Ex: crime detection.
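
The "items purchased together" example for association rules can be sketched with the two usual measures, support (how often a pair occurs) and confidence (how often the consequent appears when the antecedent does). This is a minimal illustration restricted to item pairs, not a full rule-mining algorithm; the baskets, the thresholds, and the function name `pair_rules` are all invented for the example.

```python
from collections import Counter
from itertools import combinations

# Hypothetical shopping baskets, invented for illustration.
BASKETS = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "jam"},
    {"milk", "butter"},
    {"bread", "butter", "jam"},
]


def pair_rules(baskets, min_support=0.4, min_confidence=0.7):
    """Return sorted (antecedent, consequent, support, confidence) tuples."""
    n = len(baskets)
    item_counts = Counter(item for basket in baskets for item in basket)
    pair_counts = Counter(
        pair for basket in baskets for pair in combinations(sorted(basket), 2)
    )
    rules = []
    for (a, b), count in pair_counts.items():
        support = count / n  # share of baskets containing both items
        if support < min_support:
            continue
        # Try the rule in both directions: a -> b and b -> a.
        for lhs, rhs in ((a, b), (b, a)):
            confidence = count / item_counts[lhs]
            if confidence >= min_confidence:
                rules.append((lhs, rhs, support, confidence))
    return sorted(rules)


rules = pair_rules(BASKETS)
for lhs, rhs, support, confidence in rules:
    print(f"{lhs} -> {rhs}: support={support:.2f}, confidence={confidence:.2f}")
```

With these made-up baskets, a rule such as "milk -> butter" survives because every basket containing milk also contains butter, while "butter -> milk" is dropped for low confidence.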

Data Mining Techniques
» Data mining is not so much a single technique as the idea that there is more knowledge hidden in the data than shows itself on the surface. Any technique that helps extract more out of your data is useful. Data mining techniques include:
» Statistical techniques: statistics is the branch of mathematics that deals with the collection and analysis of numerical data.
» Machine learning: the process of building a computer system that is capable of acquiring data and integrating it to generate useful knowledge.
» Decision trees: a decision tree is a tree-shaped structure in which each branch represents a classification question and each leaf represents a partition of the classified data.
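
To make the decision-tree description concrete, here is a small hand-built tree for the earlier "people under 40 with a salary above 40k trade online" example. It only shows how branches ask questions and leaves hold classes; the tree is written by hand rather than learned from data, and the field names `age` and `salary` and the leaf labels are assumptions.

```python
# Hand-built tree: each inner node asks one question, each leaf is a class.
# Thresholds follow the slide's example; layout and names are assumptions.
TREE = {
    "question": lambda person: person["age"] < 40,
    "yes": {
        "question": lambda person: person["salary"] > 40_000,
        "yes": "trades online",
        "no": "does not trade online",
    },
    "no": "does not trade online",
}


def classify(node, person):
    """Walk from the root to a leaf, answering one question per branch."""
    while isinstance(node, dict):
        node = node["yes"] if node["question"](person) else node["no"]
    return node


print(classify(TREE, {"age": 30, "salary": 50_000}))
print(classify(TREE, {"age": 45, "salary": 90_000}))
```

A real decision-tree learner would choose the questions and thresholds automatically from labeled sample data.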

Data Mining Techniques
» Hidden Markov Models: enable you to predict future actions in a time series. The model provides the probability of a future event, given the present and previous events.
» Neural networks: a large set of historical data is analyzed in order to predict the outcome of a particular future situation or problem.
» Genetic algorithms: given a set of sample data, a genetic algorithm (GA) helps determine the best model out of a set of candidate models to represent that data.
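
The idea of estimating the probability of a future event from the present one can be sketched with a plain first-order Markov chain. Note that this is deliberately simpler than a Hidden Markov Model: the states here are directly observed (pages visited), with no hidden layer. The sessions and the function name `fit_transitions` are invented for illustration.

```python
from collections import Counter, defaultdict


def fit_transitions(sequences):
    """Estimate P(next state | current state) from observed sequences."""
    counts = defaultdict(Counter)
    for sequence in sequences:
        for current, following in zip(sequence, sequence[1:]):
            counts[current][following] += 1
    return {
        state: {nxt: c / sum(followers.values()) for nxt, c in followers.items()}
        for state, followers in counts.items()
    }


# Hypothetical page-visit sequences, one list per user session.
SESSIONS = [
    ["home", "products", "cart"],
    ["home", "products", "home"],
    ["home", "about"],
    ["products", "cart", "checkout"],
]

probabilities = fit_transitions(SESSIONS)
most_likely = max(probabilities["products"], key=probabilities["products"].get)
print("After 'products' the most likely next page is:", most_likely)
```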

Data Mining vs. Web Mining
» Traditional data mining: data is structured and relational, with well-defined tables, columns, rows, keys, and constraints.
» Web data: semi-structured (HTML documents) and unstructured (free text); readily available; rich in features and patterns.

Problems when interacting with the Web
» Finding relevant information
» Creating new knowledge out of the information available on the Web
» Personalization of the information
» Learning about consumers or individual users

Web Mining

Web Mining - Definition
» “Web mining refers to the overall process of discovering potentially useful and previously unknown information or knowledge from the Web data.”
» The web mining process is similar to the data mining process; the difference is usually in the data collection.
» In data mining, the data is often already collected and stored in a data warehouse.
» In web mining, data collection can be a substantial task, especially for web structure and content mining, which involve crawling a large number of target web pages.

Web Mining - Subtasks
» Resource finding: retrieving intended documents
» Information selection/pre-processing: selecting and pre-processing specific information from the selected documents
» Generalization: discovering general patterns at individual web sites as well as across multiple web sites
» Analysis: validation and/or interpretation of the mined patterns

Web Mining Contd.
» Web Mining is not IR: Information retrieval (IR) is the automatic retrieval of all relevant documents while at the same time retrieving as few of the non-relevant documents as possible.
» Web Mining is not IE: Information extraction (IE) aims to extract the relevant facts from given documents. IE systems for the general Web are not feasible; most focus on specific Web sites or content.

Classification of Web Mining

Web Usage Mining
» Web Usage Mining refers to the discovery of user access patterns from web usage logs, which record every click made by each user.
» The usage data records the user's behavior when the user browses or makes transactions on the web site; it is mined in order to better understand and serve the needs of users or Web-based applications.
» It is an activity that involves the automatic discovery of patterns from one or more Web servers.

Web Usage Mining Contd.
» Organizations often generate and collect large volumes of data; most of this information is generated automatically by Web servers and collected in server logs.
» Analyzing such data can help these organizations determine: the value of particular customers, cross-marketing strategies across products, the effectiveness of promotional campaigns, etc.
» Typical sources of data:
  » automatically generated data stored in server access logs, proxy server logs, referrer logs, browser logs, bookmark data, mouse clicks and scrolls, and client-side cookies
  » user profiles
  » metadata: page attributes, content attributes, usage data

Web Usage Mining Contd.
» The first web analysis tools simply provided mechanisms to report user activity as recorded in the servers. Using such tools, it was possible to determine information such as: the number of accesses to the server, the times or time intervals of visits, and the domain names and URLs of users of the Web server.
» Two main categories:
  » Learning a user profile (personalized): Web users would be interested in techniques that learn their needs and preferences automatically.
  » Learning user navigation patterns (impersonalized): Information providers would be interested in techniques that improve the effectiveness of their Web site or bias users towards the goals of the site.

Web Usage Mining Contd.
» Web servers, Web proxies, and client applications can quite easily capture Web usage data.
» Web server log: records every visit to the pages, which files were requested and when, the IP address of the request, the error code, the number of bytes sent to the user, and the type of browser used.
» By analyzing the Web usage data, web mining systems can discover useful knowledge about a system's usage characteristics and the users' interests, which has various applications: personalization and collaboration in Web-based systems, marketing, Web site design and evaluation, decision support.

Web Server Log - A Sample
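
The slide's original sample is not reproduced here, so the following is a sketch of what one server log entry looks like and how its fields (IP address, time, requested file, status code, bytes sent, browser) can be pulled apart. It assumes the Combined Log Format commonly written by Apache-style servers; the log line itself, the pattern name `LOG_PATTERN`, and the function `parse_line` are invented for illustration.

```python
import re

# Pattern for one Combined Log Format entry (assumed format).
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) (?P<protocol>[^"]+)" '
    r'(?P<status>\d{3}) (?P<bytes>\d+|-) '
    r'"(?P<referrer>[^"]*)" "(?P<agent>[^"]*)"'
)

# Made-up log entry, not taken from a real server.
SAMPLE = (
    '192.0.2.10 - - [10/Oct/2023:13:55:36 +0000] '
    '"GET /products/index.html HTTP/1.1" 200 2326 '
    '"http://example.com/home.html" "Mozilla/5.0 (X11; Linux x86_64)"'
)


def parse_line(line):
    """Return the log fields as a dict, or None if the line does not match."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    record = match.groupdict()
    record["status"] = int(record["status"])
    # A "-" in the size field means no body was sent.
    record["bytes"] = 0 if record["bytes"] == "-" else int(record["bytes"])
    return record


record = parse_line(SAMPLE)
print(record["ip"], record["path"], record["status"], record["agent"])
```

Parsed records like this are the raw material for the usage-mining steps described on the neighboring slides.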

Web Usage Mining Contd.
» The technique of retrieving visitor information from web server log files and applying it to analyze data is known as Web Log Mining.
» The major types of log files are:
  » Access log: maintains a list of all the web pages that visitors have requested.
  » Agent log: contains information about the browser that was used to explore the various web pages.

Web Content Mining
» Web Content Mining (WCM) extracts or mines useful information or knowledge from web page contents.
» In this kind of mining, patterns are extracted from online sources such as HTML files, text documents, images, e-books or email messages, and audio or video.
» The concept of WCM is far wider than searching for a specific term, keyword extraction, or simple statistics of words and phrases in documents.
» A tool that performs WCM can summarize a web page so that you need not read the complete document, saving you time and energy.

Web Content Mining Contd.
The two basic approaches or models to implement WCM are:
» Local Knowledge Base Model: The abstract characterizations of several web pages are stored locally (i.e. references to several web sites relating to the categories are stored in a database and, based on the category selected, the search is performed within those web sites).
» Agent-Based Model: This approach applies Artificial Intelligence systems known as Web Agents, which can perform a search on behalf of a particular user to discover and organize documents on the web. Some web agents can apply individual user profiles when searching for information on the web, and organize and interpret the discovered information.

Preprocessing
» Content preparation: extract text from HTML; perform stemming; remove stop words; calculate collection-wide word frequencies (DF); calculate per-document term frequencies (TF).
» Vector creation: a common information retrieval technique. Each document (HTML page) is represented by a sparse vector of term weights. Typically, additional weight is given to terms appearing as keywords or in titles.
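
The preprocessing steps can be sketched end to end with the standard library. This is a simplified illustration under several assumptions: stemming and the extra weight for title terms are left out, the stop-word list is a tiny invented one, and the weighting is one common TF-IDF variant (term frequency times log(N/DF)), which the slide does not specify. The sample pages are made up.

```python
import math
import re
from collections import Counter
from html.parser import HTMLParser

STOP_WORDS = {"a", "an", "and", "in", "is", "of", "the", "to"}  # tiny sample list


class TextExtractor(HTMLParser):
    """Collects the text between tags, dropping the markup."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def tokens(html):
    """Extract text from HTML, lowercase it, and remove stop words."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    words = re.findall(r"[a-z]+", " ".join(parser.chunks).lower())
    return [word for word in words if word not in STOP_WORDS]


def tfidf_vectors(pages):
    """Return one sparse {term: weight} dict per page."""
    docs = [Counter(tokens(page)) for page in pages]     # per-document TF
    df = Counter(term for doc in docs for term in doc)   # collection-wide DF
    n = len(docs)
    # Terms present in every document get weight 0 and are left out (sparse).
    return [
        {term: tf * math.log(n / df[term]) for term, tf in doc.items() if df[term] < n}
        for doc in docs
    ]


PAGES = [
    "<html><body><h1>Web mining</h1><p>Mining the web logs</p></body></html>",
    "<p>Data mining of the databases</p>",
]
vectors = tfidf_vectors(PAGES)
print(vectors[0])
```

Here "mining" occurs in both pages, so it carries no discriminating weight and drops out of both vectors.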

Common Mining Techniques
The more basic and popular data mining techniques include:
» Classification: classifying server logs with decision trees or a Naive Bayes classifier to discover the profiles of users belonging to a particular category.
» Clustering: can be used to group users exhibiting similar browsing patterns.
» Associations: can be used to relate pages that are most often referenced together in a single server session.
The other significant ideas are: topic identification, tracking, and drift analysis; concept hierarchy creation; relevance of content.
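
As a rough illustration of grouping users with similar browsing patterns, the sketch below compares the sets of pages each user visited with Jaccard similarity and assigns each user to the first sufficiently similar group. The greedy grouping rule, the 0.5 threshold, the visit data, and the function names are all assumptions chosen for brevity; a production system would more likely use an established clustering algorithm.

```python
def jaccard(a, b):
    """Share of pages two users have in common, between 0 and 1."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def group_users(visits, threshold=0.5):
    """Greedy single pass: join the first group whose seed user is similar."""
    groups = []  # list of (seed_pages, [user, ...])
    for user, pages in visits.items():
        for seed_pages, members in groups:
            if jaccard(seed_pages, pages) >= threshold:
                members.append(user)
                break
        else:
            # No similar group found, so this user starts a new one.
            groups.append((pages, [user]))
    return [members for _, members in groups]


# Hypothetical pages visited per user, e.g. derived from server logs.
VISITS = {
    "A": {"/home", "/sports", "/scores"},
    "B": {"/home", "/sports", "/scores", "/news"},
    "C": {"/recipes", "/baking"},
    "D": {"/recipes", "/baking", "/home"},
}

clusters = group_users(VISITS)
print(clusters)
```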

Web Structure Mining
» Web Structure Mining discovers useful knowledge from hyperlinks, which represent the structure of the web.
» Web structure mining can be divided into two kinds:
  » Extracting patterns from hyperlinks in the web. A hyperlink is a structural component that connects the web page to a different location.
  » Mining the document structure, i.e. using the tree-like structure to analyze and describe the HTML or XML tags within the web page.
» It is the process of using graph theory to analyze the node and connection structure of a web site.
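
Both kinds of structure mining start from the same raw material: the tags and hyperlinks of a page. The sketch below uses Python's standard `html.parser` to collect a page's out-links and an indented outline of its tag tree. The class name `StructureParser`, the short list of void tags, and the sample page are assumptions made for the example.

```python
from html.parser import HTMLParser


class StructureParser(HTMLParser):
    """Collects out-links and a (depth, tag) outline of the tag tree."""

    # Tags that never have a closing tag (partial list, for illustration).
    VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}

    def __init__(self):
        super().__init__()
        self.out_links = []
        self.outline = []
        self._depth = 0

    def handle_starttag(self, tag, attrs):
        self.outline.append((self._depth, tag))
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.out_links.append(href)
        if tag not in self.VOID_TAGS:
            self._depth += 1

    def handle_endtag(self, tag):
        if tag not in self.VOID_TAGS:
            self._depth = max(0, self._depth - 1)


# Made-up page used only to exercise the parser.
PAGE = (
    "<html><body><h1>Title</h1>"
    '<p>See <a href="/about.html">about</a> and '
    '<a href="http://example.com/">example</a>.</p>'
    "</body></html>"
)

parser = StructureParser()
parser.feed(PAGE)
parser.close()

for depth, tag in parser.outline:
    print("  " * depth + tag)
print(parser.out_links)
```

Collecting `out_links` for every crawled page yields the link graph that algorithms such as HITS operate on.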

Web Structure Mining Contd.
Web structure is a useful source for extracting information such as:
» Web page classification: classifying web pages according to various topics
» Quality of a web page: the authority of a page on a topic; ranking of web pages
» Which pages to crawl: deciding which web pages to add to the collection of web pages
» Finding related pages: given one relevant page, find all related pages

Web Structure Mining Contd.
» Hyperlink-Induced Topic Search (HITS) is a common algorithm for knowledge discovery in the Web.
» The concept of HITS is

Web Structure Mining
» Identification of:
  » Authorities: authoritative, high-quality web pages on broad topics
  » Hubs: web pages that link to a collection of authorities
» A good authority is pointed to by many good hubs; a good hub points to many good authorities.
» Web structure mining has been largely influenced by research in social network analysis and citation analysis (bibliometrics).
» In-links: the hyperlinks pointing to a page. Out-links: the hyperlinks found in a page. Usually, the larger the number of in-links, the better a page is.
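
The mutual definition of hubs and authorities can be turned into a short iteration: authority scores are summed from the hubs linking in, hub scores are summed from the authorities linked to, and both are rescaled each round. The sketch below follows that textbook outline of HITS; the fixed iteration count, the Euclidean normalization, and the toy graph are assumptions, and a real deployment would first select a query-specific subgraph.

```python
import math


def hits(graph, iterations=50):
    """graph maps each page to the list of pages it links to."""
    pages = set(graph) | {t for targets in graph.values() for t in targets}
    hub = {page: 1.0 for page in pages}
    auth = {page: 1.0 for page in pages}
    for _ in range(iterations):
        # Authority score: sum of the hub scores of the pages linking in.
        auth = {page: 0.0 for page in pages}
        for source, targets in graph.items():
            for target in targets:
                auth[target] += hub[source]
        # Hub score: sum of the authority scores of the pages linked to.
        hub = {page: sum(auth[t] for t in graph.get(page, ())) for page in pages}
        # Rescale both score sets so the values stay bounded.
        for scores in (auth, hub):
            norm = math.sqrt(sum(v * v for v in scores.values())) or 1.0
            for page in scores:
                scores[page] /= norm
    return hub, auth


# Toy link graph, invented for illustration.
LINKS = {
    "hub1": ["authA", "authB"],
    "hub2": ["authA", "authB"],
    "hub3": ["authA"],
}

hub, auth = hits(LINKS)
print("best authority:", max(auth, key=auth.get))
print("best hub:", max(hub, key=hub.get))
```

In this toy graph `authA` ends up as the strongest authority because all three hubs point to it, and the hubs that point to both authorities score higher than the one that points to only one.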

Web Structure Mining Contd.
» Each Web page is a node of the Web graph.
» The out-degree of a node is the number of distinct links originating at that node and pointing to other nodes.
» The probability, at any step, that the person browsing will continue following links is a damping factor d = 0.85.
» N: number of web pages
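
The damping factor d and page count N on this slide match the usual description of PageRank, although the slide's own formula is not preserved. Assuming the commonly stated form, PR(p) = (1 - d)/N + d * (sum over pages q linking to p of PR(q) / out-degree(q)), a minimal power-iteration sketch looks like this. The handling of pages without out-links, the iteration count, and the toy graph are assumptions.

```python
def pagerank(graph, d=0.85, iterations=100):
    """graph maps each page to the list of pages it links to."""
    pages = sorted(set(graph) | {t for targets in graph.values() for t in targets})
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        # Rank held by pages without out-links is spread over all pages.
        dangling = sum(rank[page] for page in pages if not graph.get(page))
        new_rank = {page: (1.0 - d) / n + d * dangling / n for page in pages}
        for source in pages:
            targets = set(graph.get(source, ()))  # distinct out-links only
            for target in targets:
                # Each page passes its rank on, split evenly over its out-degree.
                new_rank[target] += d * rank[source] / len(targets)
        rank = new_rank
    return rank


# Toy link graph, invented for illustration.
WEB = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}

ranks = pagerank(WEB)
for page, score in sorted(ranks.items(), key=lambda item: -item[1]):
    print(page, round(score, 4))
```

Page D has no in-links, so it keeps only the baseline share (1 - d)/N, while C, which collects links from three pages, ranks highest.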

Application Areas of Web Mining
» E-commerce, Search Engines, Personalization, Website Design
Web mining applications
» Amazon.com, Google, DoubleClick, AOL, eBay, My.Yahoo, CiteSeer, I-MODE, v-TAG Web Mining Server

Applications Contd.
» Amazon: A host of Web mining techniques, e.g. associations between pages visited, click-path analysis, etc., are used to improve the customer's experience during a 'store visit'. Knowledge gained from Web mining is the key intelligence behind Amazon's features such as 'instant recommendations', 'purchase circles', 'wish-lists', etc.

Applications Contd.
» Google: Earlier search engines concentrated on Web content to return the pages relevant to a query. Google was the first to introduce the importance of the link structure in mining information from the web. PageRank, which measures the importance of a page, is the underlying technology in all Google search products. The PageRank technology, which makes use of the structural information of the Web graph, is the key to returning quality results relevant to a query.

Benefits of Web Mining
» Match your available resources to visitor interests
» Increase the value of each visitor
» Improve the visitor's experience at the website
» Perform targeted resource management
» Collect information in new ways
» Test the relevance of content and web site architecture

Web Mining Software
» Web Miner
» Sinope Summarizer
» Teleport Pro
» ClickTracks

Summary
» Major limitations of Web Mining research:
  » It is difficult to collect Web usage data across different Web sites.
  » There is a lack of suitable test collections that can be reused by researchers.
» Future research directions:
  » Multimedia data mining: a picture is worth a thousand words.
  » Multilingual knowledge extraction: Web page translations.
  » The Hidden Web: forms, dynamically generated web pages.
  » Semantic Web
  » Wireless Web: WML and HDML.