Research Issues in Web Data Mining Sanjay Kumar

Research Issues in Web Data Mining Sanjay Kumar Madria Department of Computer Science Purdue University, West Lafayette, IN skm@cs. purdue. edu Sourav Bhowmick, Ng Wee Keong and Lim Ee Peng Nanyang Technological University, Singapore Dawak'99 copy-right@sanjaymadria

WHOWEDA! www. cais. ntu. edu. sg: 8000/~whoweda • A Ware. House Of WEb DAta • Web Information Coupling Model (WICM) – Web Objects – Web Schema • Web Information Coupling Algebra • Web Information Maintenance • Web Mining and Knowledge discovery Dawak'99 copy-right@sanjaymadria

Web Objects • • • Node - url, title, format, size, date, text Link - source-url, target-url, label, link-type Web tuple Web table Web schema Web database Dawak'99 copy-right@sanjaymadria

WWW User Warehouse Concept Mart Web Information Mining System Web Querying & Analysis Component Web Information Coupling System Web Mart Dawak'99 Web Information Maintenance System Web Mart Web Warehouse copy-right@sanjaymadria Web Mart

User WWW Web Query & Display Warehouse Concept Mart Global Ranking Data Visualization Global Web Manipulation Global Web Coupling Pre processing Schema Tightness Web Warehouse Web Select Web Project Local Web Coupling Local Ranking Dawak'99 Web Join Local Web Manipulation copy-right@sanjaymadria Data Visualization Web Union Web Intersection Schema Tightness Schema Search Schema Match

Web Schema • Structural ‘summary’ of web table • Information Coupling using a Query graph • Query graph ->Web schema • directed graph as ordered 4 -tuple: – Set of node variables – Set of link variables – Connectivities – Predicates Dawak'99 copy-right@sanjaymadria

Dawak'99 copy-right@sanjaymadria

Informatio Headline n Square's article 1 homepage Headline article n News @TCS News specials Airport info Dawak'99 (List of video files) List of links to local news List of links to world news copy-right@sanjaymadria Local news 1 Local news World k news 1 World news t

e x y target_url CONTAINS url contains "article” “headlines” f target_URL CONTAINS "newshub/spe cials" g z label CONTAINS "Local News" h url CONTAINS "local" w Dawak'99 label CONTAINS "World News" copy-right@sanjaymadria url CONTAINS "world"

Information Square's homepage Headline article 1 List of links to local news Local news 1 News specials World news 1 Dawak'99 List of links to world news copy-right@sanjaymadria

Schema- example • Node variables: • Link variable: • Connectivities: and x<fh->w } Dawak'99 Xn = { x, y, z, w } Xl = { e, f, g } C = { x<e>y and x<fg->z copy-right@sanjaymadria

• Predicates • P={x. url=”http: //www. mediacity. com. sg/i -square”, • y. url CONTAINS “headlines” • e. target_url CONTAINS "article", • f. target. url CONTAINS "newshub/specials", • g. label CONTAINS "Local News", • z. url CONTAINS "local", • h. label CONTAINS "World News", • Dawak'99 CONTAINS "world" } w. url copy-right@sanjaymadria

Query Graph - Example • Query graph - same as schema • Informally, it is directed connected graph consists of nodes, links and keywords imposed on them. • Produce a list of diseases with their symptoms, evaluation procedures and treatment starting from the web site at http: //www. panacea. org/ • Web table Diseases Dawak'99 copy-right@sanjaymadria

Treatment list q g Treatment http: //www. panacea. org/ Issues y x f Symptoms list z List of Diseases e Evaluation w Dawak'99 copy-right@sanjaymadria p

q 1 Treatment list g 1 Treatment http: //www. panacea. org/ Issues x 0 y 1 AIDS List of Diseases e 1 f 1 Symptoms z 1 Symptoms list Evaluation w 1 Dawak'99 Elisa Test copy-right@sanjaymadria p 2

Example 2 • Produce a list of drugs, and their uses and side effects starting from the web site at http: //www. panacea. org/ • Web table Drugs Dawak'99 copy-right@sanjaymadria

http: //www. panacea. org/ Drug list a Issues c b r Side effects List of Diseases Use s k Dawak'99 Uses copy-right@sanjaymadria Side effects d

http: //www. panacea. org/ AIDS a 0 List of Diseases Drug list b 1 Issues Indavir c 1 r 1 Side effects Use s 1 k 1 Uses of Indavir Dawak'99 Side effects of Indavir copy-right@sanjaymadria d 1

WWW Data Mining • web structure mining : Web structure mining involves mining the web document’s structures and links. • web content mining : Web content mining describes the automatic search of information resources available on-line. • web usage mining : Web usage mining includes the data from server access logs, user registration or profiles, user sessions or transactions etc. Dawak'99 copy-right@sanjaymadria

Web Structure Mining : Issues · Measuring the frequency of the local links (links in the same server) in the web tuples in a web table. · web tuples have more information about interrelated documents that exists at the same server. · measures the completeness of the web site in a sense that most of the closely related information is available at the same site(server). · For example, an airline’s home page will have more local links connecting the “routing information with air-fares and schedules”. Dawak'99 copy-right@sanjaymadria

· Measuring the frequency of web tuples in a web table containing links which are interior; links which are within the same file. · measures a web document’s ability to crossreference other related web pages within the same document. · measures the flow of the web documents. Dawak'99 copy-right@sanjaymadria

· Measuring the frequency of web tuples in a web table that contains links that are global; links which span different web sites. · measures the visibility of the web documents and ability to relate similar or related documents across different sites. · For example, research documents related to “semistructured data” will be available at many sites and such sites should be visible to other related sites by providing cross references by the popular phrases such as “more related links”. Dawak'99 copy-right@sanjaymadria

· Measuring the frequency of identical web tuples that appear in a web table or among the web tables. · measures the replication of web documents and may help in identifying the mirrored sites. · What is the in-degree and out-degree of each node (web document)? What is the meaning of high and low in- and out-degrees? · Locating links to popular web sites in the web tuples in a table. · Number of web tuples are returned in response to a query on some popular phrases such as “Bioscience” with respect to queries containing keywords like “earth-science”. Dawak'99 copy-right@sanjaymadria

· discover the nature of the hyperlinks in the web sites of a particular domain. · What information do they provide and how are they related conceptually. · Is it possible to extract a conceptual hierarchical information for designing web sites of a particular domain. · generalizing the flow of information in web sites representing some particular domain. Dawak'99 copy-right@sanjaymadria

Web Bags and Web Structure Mining • Most of the search engines fail to handle the following knowledge discovery goals: · locate the most visible web sites or documents for reference. Many paths (high fan in) can reach that sites or documents. · locate the most luminous web sites or documents for reference. web sites or documents which have the most number of outgoing links. · find the most traversed path for a particular query result. To identify the set of most popular interlinked web documents that have been traversed frequently to obtain the query result. Dawak'99 copy-right@sanjaymadria

Applications of Visibility • Association rules • e-commerce Dawak'99 copy-right@sanjaymadria

Dawak'99 copy-right@sanjaymadria

• From the results returned, find most visible pages. Assume Z 1 is the most visible page with the given threshold. • This gives estimates about different restaurants selling pizzas. • Lower threshold gives you set (Z 1, Z 2) as visible pages, which sells both pizza and pasta. • Generalize rules such as out of 66% of restaurants which offer pizza to their customers, 33% also offers pasta. Dawak'99 copy-right@sanjaymadria

E-commerce application • My web site’s visibility is going down!!!! Dawak'99 copy-right@sanjaymadria

Application - Luminosity • Association rules such as X% of all the companies which makes a product “A”, Y% of them also makes a set of products “B and C”. • Exmple - certain companies (33%) if they make a product A also make products B and C. • the company C makes only the product A. • That is, 66% of companies which make a product “A” , 33% of them also make products B and C. Dawak'99 copy-right@sanjaymadria

Dawak'99 copy-right@sanjaymadria

Web Content Mining · what does it mean to mine content from the web? · Is extracting information from a very small subset of all HTML web pages is also an instance of web data mining? · mining a subset of web pages stored in one or more web tables is more feasible option. · Similarity and difference between web content mining in web warehouse context and conventional data mining. Dawak'99 copy-right@sanjaymadria

· Selection of type of data in the WWW to do web content mining. · Cleaning of selected data to mine effectively. · Types of knowledge that can be discovered in a web warehouse context. · Discovery of types of information hidden in a web warehouse which are useful for decision making. · specify, measure and justify the interestingness of the discovered knowledge · knowledge to be discovered are as follows: generalized relation, characteristic rule, discriminate rule, classification rule, association rule, and deviation rule. Dawak'99 copy-right@sanjaymadria

· Do the data mining techniques applicable to web mining and if yes, how? For example, we are interested in generating the following types of rules: 40% of web tuples (i. . e, web pages) in response to a “travel information query from Hong Kong to Macau” suggest that popular means of traveling is by ferry. · To derive some additional knowledge in a web warehouse for web content mining. · mining previously unknown knowledge in a web warehouse. · Presentation of discovered knowledge to the users to expedite complex decision making. Dawak'99 copy-right@sanjaymadria

Web Usage Mining • discovery of user access patterns from web servers; user profile, access pattern for pages, etc. used for efficient and effective web site management and the user behavior. · In WHOWEDA, the user initiates a coupling framework to collect related information. · For example, coupling a query graph “to find the hotel information” with the query graph “to find the places of interest”. · From this query graph, we can generate some user access pattern of coupling framework like “ 50% of users who query “hotel” also couple their query with “places of interest”. Dawak'99 copy-right@sanjaymadria

· find coupled concepts from the coupling framework. · helps in organizing web sites. · For example, web documents that provide information on “hotels” should also have hyperlinks to web pages providing information on “places of interest”. Dawak'99 copy-right@sanjaymadria

Warehouse Concept Mart • Knowledge discovery in web data becomes more and more complex due to the large number of data on WWW. · build the concept hierarchies involving web data to use them in knowledge discovery. · collection of concept hierarchies a Warehouse Concept Mart (WCM). · concept mart is build by extracting and generalizing terms from web documents to represent classification knowledge of a given class hierarchy. Dawak'99 copy-right@sanjaymadria

· For unclassified words, they can be clustered based on their common properties. Once the clusters are decided, the keywords can be labeled with their corresponding clusters, and common features of the terms are summarized to form the concept description. · associate a weight at each level of concept marts to evaluate the importance of a term with respect to the concept level in the concept hierarchy. Dawak'99 copy-right@sanjaymadria

Web Concept Mart Applications • Intelligent answering of web queries · supply the threshold for a given key word in the warehouse concept mart and the words with the threshold more than the given value can be taken into consideration when answering the query. · use different levels of concepts in the warehouse concept mart or can provide approximate answers. · provide the user some knowledge in framing the global coupling query graph. • Example - DBMS and Oracle Dawak'99 copy-right@sanjaymadria

• Web mining and Concept Mart • mine the association between words appearing in the concept mart at various levels and in the web tuples returned as the result of a query. · Mining knowledge at multiple levels may help WWW users to find some interesting rules that are difficult to be discovered otherwise. · A knowledge discovery process may climb up and step down to different concepts in the warehouse concept mart’s level with user’s interactions and instructions including different threshold values. · capture the flow of web sites of particular domain; helpful in location information Dawak'99 copy-right@sanjaymadria

Conclusions · web mining issues in context of the web warehousing project called WHOWEDA (Warehouse of Web Data). · discussed web mining issues with respect to web structure, web content and web usage. · Our focus is to design tools and techniques for web mining to generate some useful knowledge from the WWW data. · We are working on formal algorithms to generate association rules and classification rules. Dawak'99 copy-right@sanjaymadria