1 Semantic Web Usage Mining – Overview and Case Studies
Bettina Berendt
Humboldt University Berlin, Institute of Information Systems
www.wiwi.hu-berlin.de/~berendt

2 Goals and top-level questions
• Make the world's knowledge available to the world
• How do people discover knowledge on the Web?
• How can more knowledge sources contribute to the Web?

3 Approaches to the current Web's biggest challenges: lots of data, human-understandable
Web Mining extracts implicit knowledge
The Semantic Web makes knowledge machine-understandable
Semantic Web Mining:
• use semantics to improve mining
• use mining results to generate semantics
[Berendt, Hotho, & Stumme, Proc. ISWC 2002]
[Berendt, Mladenic, et al. (Eds.), From Web to Semantic Web, Springer LNAI 2004]
[Berendt, Grobelnik, Mladenic et al. (Eds.), Semantics, Web, and Mining, Springer LNAI 2006]

4 Agenda: Web Mining. Why?

5 1. What should I buy?

6 2. Where do I find relevant information on ...?

7 3. "What do people do there?"

8 4. How can a site be made usable – for a worldwide audience?

9 5a. Why go to a shop ... if everything is available on the Internet?

10 5b. What is my site worth for my business?

11 6. How to help people become active members of the knowledge society – help them to contribute content?

12 Agenda: Web Mining. How?

13 Web Mining
Knowledge discovery (aka data mining): "the non-trivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data." ¹
Web Mining: the application of data mining techniques on the content, (hyperlink) structure, and usage of Web resources.
Web mining areas:
• Web content mining
• Web structure mining
• Web usage mining
¹ Fayyad, U. M., Piatetsky-Shapiro, G., Smyth, P., & Uthurusamy, R. (Eds.) (1996). Advances in Knowledge Discovery and Data Mining. Boston, MA: AAAI/MIT Press.

14 Data analysis: the textbook version
• The meaning of attributes is clear
• The meaning of attribute values is clear
Data modelling can be applied directly (e.g., regression, classification, clustering, association-rule discovery)
(A simplified extract from the "adult" dataset in the UCI machine learning repository)

15 Data analysis: the reality
Data mining / knowledge discovery process: CRISP-DM
...
p3ee24304.dip.t-dialin.net - - [19/Mar/2002:12:03:51 +0100] "GET /search.html?t=jane%20austen&SID=023785&ord=asc HTTP/1.0" 200 1759
p3ee24304.dip.t-dialin.net - - [19/Mar/2002:12:05:06 +0100] "GET /search.html?t=jane%20austen&m=video&SID=023785&ord=desc HTTP/1.0" 200 8450
p3ee24304.dip.t-dialin.net - - [19/Mar/2002:12:06:41 +0100] "GET /view.asp?id=3456&SID=023785 HTTP/1.0" 200 3478
...
• What is the meaning of the attributes?
• What is the meaning of the attribute values?
• Data modelling is only one part!
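To make the "meaning of attributes" question concrete, the following is a minimal sketch of turning one of the log lines above into named attributes. It assumes the lines follow the common log format; the field names (`host`, `path`, `params`, ...) and the helper `parse_line` are illustrative choices, not part of the talk.

```python
import re
from urllib.parse import urlsplit, parse_qs

# Assumed layout: host, two unused fields, [timestamp], "METHOD URL PROTOCOL", status, size
LOG_RE = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\]\s*'
    r'"(?P<method>\S+) (?P<url>\S+) (?P<protocol>[^"]+)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-)'
)

def parse_line(line):
    """Return a dict of named attributes for one log line, or None if it does not match."""
    m = LOG_RE.match(line)
    if m is None:
        return None
    rec = m.groupdict()
    parts = urlsplit(rec["url"])
    rec["path"] = parts.path
    # Keep the first value of each query parameter; percent-encoding is decoded.
    rec["params"] = {k: v[0] for k, v in parse_qs(parts.query).items()}
    return rec

line = ('p3ee24304.dip.t-dialin.net - - [19/Mar/2002:12:05:06 +0100] '
        '"GET /search.html?t=jane%20austen&m=video&SID=023785&ord=desc HTTP/1.0" 200 8450')
rec = parse_line(line)
print(rec["path"], rec["params"])
```

Parsing only recovers the syntax. Whether `t` is a title or a search term, and whether `m=video` is a refinement of the previous request, is exactly the semantic question the slide raises.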

16 Where does semantics come in?

17 Agenda: Semantic Web. How?

18 What is an ontology?
"an explicit specification of a shared conceptualisation" (Gruber, 1993)
Definition. Core ontology with axioms: a structure O := (C, ≤_C, R, σ, ≤_R, A) consisting of
• two disjoint sets C (concept identifiers) and R (relation identifiers)
• a partial order ≤_C on C (concept hierarchy or taxonomy)
• a function σ: R → C⁺ (signature), where C⁺ is the set of all finite tuples of elements in C
• a partial order ≤_R on R (relation hierarchy), where r1 ≤_R r2 implies
  – |σ(r1)| = |σ(r2)| and
  – π_i(σ(r1)) ≤_C π_i(σ(r2)) for all 1 ≤ i ≤ |σ(r1)|, with π_i the projection on the i-th component
• a set A of axioms in a logical language L
[Stumme, Hotho, & Berendt, Journal of Web Semantics, 2006, and sources there]
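The condition on the relation hierarchy can be checked mechanically. The sketch below is one possible encoding, not the authors' implementation: both hierarchies are stored as hypothetical child-to-parents dictionaries, signatures as tuples, and the concept and relation names in the toy data are invented for illustration.

```python
def leq(parents, a, b):
    """True if a ≤ b in the reflexive-transitive closure of a child -> parents map."""
    seen, stack = set(), [a]
    while stack:
        x = stack.pop()
        if x == b:
            return True
        if x in seen:
            continue
        seen.add(x)
        stack.extend(parents.get(x, ()))
    return False

def relation_hierarchy_ok(sigma, concept_parents, relation_parents):
    """Check: r1 ≤_R r2 implies equal arity and component-wise ≤_C of the signatures.

    Checking direct (child, parent) pairs is enough, because equal arity and
    component-wise ≤_C both carry over along chains by transitivity.
    """
    for r1, supers in relation_parents.items():
        for r2 in supers:
            s1, s2 = sigma[r1], sigma[r2]
            if len(s1) != len(s2):
                return False
            if not all(leq(concept_parents, c1, c2) for c1, c2 in zip(s1, s2)):
                return False
    return True

# Toy ontology (hypothetical): RESEARCHER ≤_C PERSON, worksAtProject ≤_R isInvolvedIn
concept_parents = {"RESEARCHER": ["PERSON"]}
relation_parents = {"worksAtProject": ["isInvolvedIn"]}
sigma = {
    "worksAtProject": ("RESEARCHER", "PROJECT"),
    "isInvolvedIn": ("PERSON", "PROJECT"),
}
print(relation_hierarchy_ok(sigma, concept_parents, relation_parents))
```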

19 Agenda: Web Mining, Semantic Web. Understand
...
p3ee24304.dip.t-dialin.net - - [19/Mar/2002:12:03:51 +0100] "GET /search.html?t=jane%20austen&SID=023785&ord=asc HTTP/1.0" 200 1759
p3ee24304.dip.t-dialin.net - - [19/Mar/2002:12:05:06 +0100] "GET /search.html?t=jane%20austen&m=video&SID=023785&ord=desc HTTP/1.0" 200 8450
p3ee24304.dip.t-dialin.net - - [19/Mar/2002:12:06:41 +0100] "GET /view.asp?id=3456&SID=023785 HTTP/1.0" 200 3478
...

20 Semantics of requests
Step 1: Domain ontology
• Community portal ka2portal.aifb.uni-karlsruhe.de
• Ontology-based:
  – Knowledge base in F-Logic
  – Static pages: annotations
  – Dynamic pages: generated from queries
  – Queries also in F-Logic
  – Logs contain these queries
[Oberle, Berendt, Hotho, & Gonzalez, Proc. AWIC 2003]

21 Semantics of requests
Step 2: Modelling requests and sessions-as-sets
An example query with concepts and relations:
FORALL N, PEOPLE <- PEOPLE:Employee[affiliation->>"http://www.anInstitute.org"] and PEOPLE:Person[lastName->>N].
Query = feature vector of concepts + relations
Session = feature vector of concepts + relations, summed over all queries in the session
Features (concepts and relations) shown on the slide: RESEARCHER, PERSON, PROJECT, PUBLICATION, RESEARCHTOPIC, EVENT, ORGANIZATION, RESEARCHINTEREST, LASTNAME, TITLE, ISABOUT, EVENTS, EVENTTITLE, WORKSATPROJECT, AUTHOR, AFFILIATION, ISWORKEDONBY, PROGRAMCOMMITTEE, EMPLOYS, NAME, RESEARCHGROUPS, EMAIL
Mining on these vectors: clustering, association rules, classification, ...
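The mapping from a logged query to a feature vector can be sketched as follows. This is a deliberately naive illustration rather than the method of the cited paper: it assumes concepts appear as `:Concept[` and relations as `relation->>` in the query text, and the function names are hypothetical.

```python
import re
from collections import Counter

# Assumed surface patterns in the F-Logic query text
CONCEPT_RE = re.compile(r":\s*([A-Za-z_]\w*)\s*\[")   # e.g. PEOPLE:Employee[
RELATION_RE = re.compile(r"([A-Za-z_]\w*)\s*->>")      # e.g. affiliation->>

def query_features(query):
    """Count the concepts and relations mentioned in one query."""
    feats = Counter()
    for name in CONCEPT_RE.findall(query):
        feats[name.upper()] += 1
    for name in RELATION_RE.findall(query):
        feats[name.upper()] += 1
    return feats

def session_features(queries):
    """Session vector = sum of the query vectors in the session."""
    total = Counter()
    for q in queries:
        total += query_features(q)
    return total

query = ('FORALL N, PEOPLE <- PEOPLE:Employee[affiliation->>"http://www.anInstitute.org"] '
         'and PEOPLE:Person[lastName->>N].')
print(query_features(query))
print(session_features([query, query]))
```

The resulting sparse vectors can then be passed to any standard clustering, association-rule, or classification routine.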

22 Semantics of sequences
Step 3: Strategy pattern discovery
An ontology of navigation strategies
• Define strategy templates as regular expressions
  – of requests (mapped to ontological entities)
  – of transitions (between ontological entities)
  Ex. [. search. * individual]
• Discover strategies by learning a strategy trie
(Trie node labels in the figure: affiliationSearch, 629; topicSearch, 312; ...; refinement, 113; repetition, 402; ...; repetition, 295; individual, 112; ...)
[Berendt & Spiliopoulou, VLDB Journal, 2000]
[Berendt, Data Mining and Knowledge Discovery, 2002]
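As a rough illustration of a strategy template, the sketch below encodes each session as a space-separated string of entity labels and matches a regular expression against it. The template (one or more searches directly followed by an individual page) and the labels are assumptions made for this example; it does not reproduce the template syntax or the trie learning of the cited work.

```python
import re

# Hypothetical template: one or more "search" requests, then an "individual" page
TEMPLATE = re.compile(r"\bsearch(?: search)* individual\b")

def encode(session):
    """Encode a session (list of ontological entity labels) as one string."""
    return " ".join(session)

def matches(session):
    return TEMPLATE.search(encode(session)) is not None

def support(sessions):
    """Fraction of sessions that contain the strategy."""
    if not sessions:
        return 0.0
    return sum(1 for s in sessions if matches(s)) / len(sessions)

sessions = [
    ["entry", "search", "search", "individual"],
    ["entry", "search", "entry"],
    ["search", "individual"],
]
print(support(sessions))
```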

23 NB: For more exploratory analyses: The Web Usage Miner WUM
select t
from node a b, template a * b as t
where a.url startswith "SEITE1-"
  and a.occurrence = 1
  and b.url contains "1SCHULE"
  and b.occurrence = 1
  and (b.support / a.support) >= 0.2
[Spiliopoulou, 1999; Berendt & Spiliopoulou, VLDB Journal, 2000]

24 Semantics of sequences
Step 4: Strategy pattern evaluation
Use strategy patterns' statistics to
• derive descriptive measures of patterns
  – support, confidence
  – popularity, effectiveness, efficiency
• apply inferential statistics to compare patterns
[Berendt, Data Mining and Knowledge Discovery, 2002]

25 Communication – Visual data mining
Step 5: Mapping an ontological relation over concepts to a linear order and to visual variables
(Figure labels: Concreteness; Goal: Individual page; Reach goal; More constraints on search; First search page; Refine search; Remain unspecific; Abandon search; Time)

26 Ad Q. 3: What do people do there?

27 Communication – Visual data mining
Step 5 – Example
(Figure labels: Search criterion location; Search criterion textual property; Individual page; Search with x parameters; Entry page)
[Berendt, Data Mining and Knowledge Discovery, 2002], [Berendt, Postproc. WebKDD 2001]

28 An online shop with a difference
[Berendt, Günther, & Spiekermann, Communications of the ACM, 2005]

29 Communication – Visual data mining
Step 6: Visual abstraction: new semantic patterns
(Figure labels: Closeness to product; Shopping for cameras; Shopping for jackets)
[Berendt, Data Mining and Knowledge Discovery, 2002], [Berendt, Postproc. WebKDD 2002]

30 Ad Q. 4: Worldwide usability

31 The impact of language and domain knowledge on search option choice
2 studies on the use of search options in the eHealth site:
• Webserver log: 3 928 235 requests / 277 809 sessions from 188 countries
  – 83.2% first-language users, 16.8% second-language users
• Webserver log + questionnaire: 165 (106) people from 34 countries
  – 84.9% first-language users, 15.1% second-language users
  – 10.4% physicians, 89.6% patients
• Results:
  – Search engine, alphabetical search: in particular first-language users, physicians
  – Content-organized search: in particular second-language patients
Domain knowledge compensates for limited language knowledge.
[Kralisch & Berendt, New Review of Hypermedia and Multimedia, 2005]

32 Semantics: Service ontology
(Figure labels: TOP; Alphabetical search; Diagnosis 21002; Search; Diagnosis info)

33 Results on frequent search patterns
Alphabetical search: hub-and-spoke → only linguistic relations (6.4%)
Diagnoses are "hubs" for navigation (5.3%, 4%)
Localization search: linear / depth-first → search refinement & medical knowledge (5%)
[Berendt, Postproc. WebKDD 2005]

34 Mining with ISOVIS: Semantic drill-down, visualizing detail & context
[Berendt, Postproc. WebKDD 2005]

35 Ad Q. 5: Shopping behaviour and Web site value

36 5. What is my site worth for my business?
• What are the conversion rates (how many visitors become buyers etc.)?
• Internet market shares [BCG 2002]
• A site is often only a part of a distribution strategy / one channel to reach customers. What are the cross-channel effects?

37 Semantics: The buying process as a service ontology

38 Mining (example): Association rules for investigating preferences in the buying process
Study based on ~100K sessions, ~13K transactions from 2002 at a leading European retailer of consumer electronics showed, among other things:
• Online payment → Direct delivery (s=0.27, c=0.97): < 1/3 traditional online users!
• Online payment → In-store pickup (s=0.02, c=0.03)
• Cash on delivery → Direct delivery (s=0.02, c=0.03)
• In-store payment → In-store pickup (s=0.69, c=0.94)
Site is primarily used for information search.
Key performance indicators ("Web metrics"), e.g.:
• conversion efficiency
• offline conversion
• effectiveness and efficiency of search options
[Berendt & Spiliopoulou, VLDB Journal, 2000; Berendt, Data Mining and Knowl. Discovery, 2002; Teltzrow & Berendt, Proc. WebKDD 2003]
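For readers unfamiliar with the s and c values, here is a small sketch of how support and confidence of a rule A → B are computed from transactions. The ten toy transactions are invented for illustration and are not the study's data.

```python
def rule_stats(transactions, antecedent, consequent):
    """Return (support, confidence) of the rule antecedent -> consequent.

    support    = share of all transactions containing antecedent and consequent
    confidence = share of transactions containing the antecedent that also
                 contain the consequent
    """
    n = len(transactions)
    n_a = sum(1 for t in transactions if antecedent <= t)
    n_ab = sum(1 for t in transactions if (antecedent | consequent) <= t)
    support = n_ab / n if n else 0.0
    confidence = n_ab / n_a if n_a else 0.0
    return support, confidence

# Hypothetical toy data: each transaction is a set of buying-process choices
transactions = (
    [{"online payment", "direct delivery"}] * 3
    + [{"online payment", "in-store pickup"}] * 1
    + [{"in-store payment", "in-store pickup"}] * 6
)
s, c = rule_stats(transactions, {"online payment"}, {"direct delivery"})
print(f"s={s:.2f}, c={c:.2f}")
```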

39 Agenda: Web Mining, (Semantic) Web

40 Step 6: Deployment of results
Example 1: Using results for site improvement
(Figure labels: City; Name)
Path analysis + metrics + χ² analysis showed:
• All search criteria were approx. equally effective
• Location-based search was most popular
• City-based search was most efficient ... but least popular
Modify site design to make efficient search more popular
[Berendt & Spiliopoulou, VLDB Journal, 2000; Berendt, Data Mining and Knowl. Discovery, 2002; Spiliopoulou & Pohle, DMKD, 2001]

41 Step 6: Deployment of results
Example 2: Using results for personalization
Recommendations for Web site design
[Kralisch, Eisend, & Berendt, Proc. HCI International, 2005]

42 Step 6: Deployment of results
Example 3: A privacy-preserving Web-metrics analysis service
[Teltzrow, Preibusch, & Berendt, IEEE EC Conf. 2004]

43 Agenda: Web Mining, Semantic Web. Contribute
...
136
Literaturverzeichnis
[1] Agarwal, R.; Krueger, B. P.; Scholes, G. D.; Yang, M.; Yom, J.; Mets, L.; Fleming, G. R. Ultrafast energy transfer in LHC-II revealed by three-pulse photon echo peak shift measurements, J. Phys. Chem. B, 2000, 104, 2908, ...

44 Data and metadata in the Digital Library EDOC
<BIBLIOGRAPHY><FLOAT><PAGENUMBER>136</PAGENUMBER></FLOAT>
<HEAD>Literaturverzeichnis</HEAD>
...
[2] Albrecht, T. F.; Bott, K.; Meier, T.; Schulze, A.; Koch, M.; Cundiff, S. T.; Feldmann, J.; Stolz, W.; Thomas, P.; Koch, S. W.; Göbel, E. O. Disorder mediated biexcitonic beats in semiconductor quantum wells, Phys. Rev. B, 1996, 54, 4436, ...
(http://edoc.hu-berlin.de/diml/dtd/xdiml.dtd)

45 Authoring support for document servers
• Surveys & Web usage mining analysis of a digital publishing service showed:
  – Metadata creation is one of the main barriers for contribution.
• Reasons include deficiencies in
  – information flow
  – understanding and use of structured search
  – education in structured writing
  – HCI aspects
Remedies bracketed against these deficiencies on the slide: Marketing; Education; Intelligent Authoring Tools
[Berendt, Brenstein, Li, & Wendland, Proc. ETD 2003]
[Berendt, Proc. AAAI Spring Symposium KCVC, 2005]

46 … and this has consequences (problems of the fully manual approach)
<BIBLIOGRAPHY><FLOAT><PAGENUMBER>136</PAGENUMBER></FLOAT>
<HEAD>Literaturverzeichnis</HEAD>
[1] Agarwal, R.; Krueger, B. P.; Scholes, G. D.; Yang, M.; Yom, J.; Mets, L.; Fleming, G. R. Ultrafast energy transfer in LHC-II revealed by three-pulse photon echo peak shift measurements, J. Phys. Chem. B, 2000, 104, 2908, ...

47 The fully automatic approach

48 Why is this a problem?
[Cardona & Marx, Physik Journal 2004]
[Berendt, in Neues Handbuch Hochschullehre, 2003]

49 Build a tool that is
• user-friendly
• intelligent
• modular and extensible

50 [Berendt, Dingel, & Hanser, Proc. ECDL 2006]

51 IR-THESIS – System architecture
(Figure labels: Text mining / Information Extraction tools; Web services; Databases (local and/or mirrored); VBA macro; other WS and info. sources)

52

53 Search and retrieval

54

55

56 Organisation of the literature / bibliography construction

57

58 Discussion

59

60 Writing corrected, XML annotated, and formatted

61 Conclusions and outlook
• Semantics are often necessary to do mining at all
• Semantics often allow the analyst to make more sense of the results
• Semantic Web Mining is semi-automatic: interactive tools!
• Standardisation can make the mining process more automatic
• Mining can help to generate semantics
• To what extent are further user and context modelling useful and/or necessary for valid conclusions (intentions, goals, constraints, …)?
• How can we encourage standards?
• When are explicit (formal) semantics better, when implicit semantics?
• How can we move beyond the Web ("ubiquitous environments")?
• How can privacy be protected in a data-rich and mining-rich world? (Are privacy semantics à la P3P a solution?)
• What do users want? What about other stakeholders? Whom and what and how to ask?

62 Thank you for your attention!

63 Discussion point 1: Is reference markup ontological / Semantic Web?
• DiML (Dissertation Markup Language), used in the case study above, is structured approximately like BibTeX (with the difference that the type of publication is an attribute, so there is only one top-level concept, "citation"). This also makes it comparable to Dublin Core. [The system in its latest versions also contains mappings to DC and other commonly used schemata.]
• This indeed makes it an extremely primitive "ontology" (essentially, a concept hierarchy with one concept, "publication", whose attributes have literals as value range: author, title, etc.).
• Extensions to make this "really semantic" include (some are part of our current work):
  – Author, affiliation, etc. as concepts with instances, as in Repec.org; this introduces relations like "is-author-of"
  – Unique identifiers of publications that allow the detection of duplicates, as in Citeseer
  – Links to libraries, as in OpenURL
  – Versioning and other interesting relations between different publications (cf. the Dublin Core element "relation")

64 Discussion point 2: Can folksonomies be used instead of ontologies? (1)
• This is a difficult question, not least because it is still unclear what exactly tags are:
  – an object-level summary and thus more content, or
  – a truly meta-level classification which comes from a set of labels that is categorically different from "just more content words"?
• In the following, I use the second interpretation. I refer to folksonomy tags as "concepts" because a folksonomy can formally be regarded as an extremely simple ontology: a set of concepts with no hierarchical or other relations between them.
• The answer to the question in the title of this slide depends on which aspect of folksonomies one is most interested in, and on how important one considers certain properties of ontologies to be.

65 Discussion point 2: Can folksonomies be used instead of ontologies? (2)
The answer tends to be YES when one focuses on
• WHO DEFINES THE CONCEPTS
  – All ontologies used in the case studies shown were based on or extended popular models and/or ontologies in the domain of investigation:
    – search in the educational portal: models of information search from information science;
    – shopping: models of the customer buying process from marketing;
    – shopping with bot assistance: the same + our design of questions, developed in conjunction with a major German retailer;
    – search in the medical portal: like search in the educational portal plus the medical ICD-9, the International Classification of Diseases;
    – DiML/DC.
  – But in fact, none of the ontologies used in the case studies here was a "standard" in the sense that many people agree on it and many applications use it. Indeed, there are precious few such standard behaviour models!
  – In that sense, the ontologies used here are, like much of the Semantic Web work, just one possibility proposed by a number of people (the research group + application partners), instead of the result of a standardisation effort.
• IN FOLKSONOMY-STYLE TAGGING, A RESOURCE USUALLY HAS MORE THAN ONE TAG
  – Any set of concepts that a group agrees on can be used.
  – In SWUM (Semantic Web Usage Mining), Web pages are mapped to single concepts (ex.: slides 22 ff.) or sets of concepts (ex.: slide 21). This set of concepts could also be a tag set as in del.icio.us.

66 Discussion point 2: Can folksonomies be used instead of ontologies? (3)
The answer tends to be MAYBE when one focuses on
• DYNAMICS: these introduce an instability of the mapping, which means that the patterns would change "depending on how you look at them", which may or may not be desirable.
  My opinion: This quickly becomes intractable, so an ontology-based treatment of different viewpoints and dynamics (ontology evolution) appears to be the better choice.
The answer tends to be NO when one focuses on
• FORMAL PROPERTIES
  – HIERARCHIES: generalization is an important feature of many mining algorithms (unless you abstract, you may not find any pattern).
  – (Non-hierarchical) RELATIONS:
    – In folksonomies, there are no relations on concepts. Therefore, meaningful visualizations become harder to produce (note that the stratograms shown on slides 27 and 29 require relations that induce a linear order on concepts).
    – Also, all other inference possibilities are lost.
• COMPARABILITY: The results of SWUM can only be compared (e.g., conversion rates in one site with those in another site) if stable and uniform ontologies are used.

67 Discussion point 3: Which of the techniques shown in this talk are being used in industry and other real-world sites? (1)
Pre-remark 1: The content of this talk was (recent) research, so it would be surprising to see it already incorporated into industrial practice. However, given that Web usage mining has been around for a number of years, the question is valid.
Pre-remark 2: Web usage mining is used on a large scale by search engines. Google says it, Yahoo! says it. Both say they rely on latent-semantic-indexing-style semantics rather than on Semantic-Web-style semantics (but they do use lexica and other helpers); the boundaries are fluid. They do not say much about the details of their algorithms; after all, mining is their business model. In any case, we believe that SWUM is applicable to analysing search when the focus is on what services of a site('s interface) are used, not when the content of searches is investigated (cf. content vs. service conceptual hierarchies in Berendt & Spiliopoulou, VLDB Journal 2000). Thus, the intended application areas of our techniques are not search engines, but retail, information, e-Government, etc. sites.
The question should therefore be rephrased as 3 questions:
• Do off-the-shelf software packages (used by end-user companies either on-site or in ASP mode, i.e. without external consultants to do the analyses) support Web usage mining, and specifically Web usage mining with semantics?
  – The answer is: Very partly.
• Do consultants offer SWUM analyses?
  – The answer is: partly.
• What are the likely reasons?
  – A tentative answer is: Perception problems and lack of incentives.

68 Discussion point 3 (2): Support in off-the-shelf software – basic forms of analysis
• Pageview counts and simple OLAP-type analyses (hits by country, by language, etc.) are pretty standard and supported even by most of the simplest freeware products (e.g., Analog). Their usage is very common in industry.
• State-of-the-art commercial analysis software like Webtrends allows a certain degree of programming for extracting more attributes that can be subjected to OLAP-type analyses (see below for an example).
• State-of-the-art software often also supports the extraction of more information transferred via Javascript. An example is Google Analytics.
• Syntax is generally the only basis. Semantics usually comes in only insofar as the Content Management System used by most sites today provides a certain frame of reference and "meaning".

69 Discussion point 3 (3): Support in off-the-shelf software – Conversion rates
• Software generally also supports the definition of simple templates from which conversion rates can be computed automatically (e.g., "a click on page X with referrer Y, or after a sequence of pages that started with referrer Y, is a converted customer brought to us from the banner shown on affiliated site S").
• Conversion rates are not only extremely simple (divide the number of sessions that reached X and then Y by the number of sessions that reached X), but also quite powerful: every success measure that can be defined via "reachability" can be cast as a conversion rate.
• The "3-click rule" (every page must be reachable with 3 clicks) is a related and equally simple-to-compute measure. That a page is reachable in 3 clicks can be computed from the site graph; that it is reached can be computed from frequent sequences. This only requires that the tool can compute frequent contiguous sequences, which is algorithmically simple and requires little thinking on the part of the analyst.
• For conversion-rate computations, semantics occurs in the simple sequence templates offered by the tools; the mapping is gathered from the users via Web forms or scripts.
• Conversion rates are also related to pricing models such as GoogleAds.
For a survey of software, see http://www.kdnuggets.com/solutions/web-mining.html
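The conversion-rate definition above translates almost directly into code. The following sketch assumes sessions are available as ordered lists of page identifiers; the page names and function names are hypothetical.

```python
def reached_then(session, x, y):
    """True if the session reaches page x and, at some later point, page y."""
    if x not in session:
        return False
    first_x = session.index(x)
    return y in session[first_x + 1:]

def conversion_rate(sessions, x, y):
    """Sessions that reached x and then y, divided by sessions that reached x."""
    reached_x = [s for s in sessions if x in s]
    if not reached_x:
        return 0.0
    converted = sum(1 for s in reached_x if reached_then(s, x, y))
    return converted / len(reached_x)

# Hypothetical toy sessions
sessions = [
    ["home", "product", "checkout"],   # reached product, then checkout
    ["home", "product"],               # reached product only
    ["checkout", "product"],           # checkout came before product
    ["home"],                          # never reached product
]
rate = conversion_rate(sessions, "product", "checkout")
print(f"{rate:.2f}")
```

Replacing the page identifiers by ontological concepts (for example, steps of a buying process) turns the same computation into a semantic conversion rate.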

70 Discussion point 3 (4): Support in off-the-shelf software – possibilities and limitations / example country & language
• Language
  – is usually defined as either the presentation language (in a site with dynamic pages generated by a content management system, this can easily be extracted)
  – or the language (assumed to be) preferred by the user (the browser setting, which in most cases is likely to be the default with which the browser is shipped).
• Country is inferred from the IP address and an IP-to-geo-coordinates mapping. Such mappings are provided by software like Maxmind. This is relatively reliable according to the producers and according to a test we did (publication in preparation).
• To obtain the user's native language, we inferred it from the Geo-IP mapping and official data on official languages in countries around the world. In a small experimental sample in which we asked users to specify their native language, we obtained quite high accuracy (Kralisch & Berendt, NRHM 2005).
• I do not know of data on the accuracy of the browser-setting-to-native-language mapping, or of data comparing it to the Geo-IP approach we used.
• But: only the combination presentation language + user's native language gives information about whether a user accesses content in his/her native language or in a foreign language, and this knowledge may be much more important for personalization than presentation language or preferred language alone (see Kralisch, Ph.D. dissertation 2006, http://edoc.hu-berlin.de/docviews/abstract.php?id=27410)
• Nonetheless, even the semantics of "presentation language =/= user language" are to my knowledge not utilized in off-the-shelf software. One reason is that awareness of the importance of language in Internet design has only begun to develop.

71 Discussion point 3 (5): Consultancy companies
• More advanced forms of conversion-rate analysis, which rely on (some) semantics, have been introduced or popularized by consultancy companies. Examples:
  – NetGenesis (Cutler & Sterne), "E-Metrics" White Paper, 2000, http://www.emetrics.org/articles/whitepaper.html
    The "funnel" metrics introduced there are now also offered, for example, by Google Analytics: http://www.google.com/analytics/feature_funnel.html
  – Accenture (R. Ghani), "Mining the Web to add semantics to retail data mining", in Berendt et al., Web Mining: From Web to Semantic Web (2004).
  – Survey by Anand et al., "On the deployment of Web usage mining", ibid.
• Unfortunately, publicly available data on Web usage are usually at a very high level of aggregation and (also for this reason) build on essentially non-semantic analysis types, e.g. http://www.nielsen-netratings.com/resources.jsp?section=pr_netv&nav=1

72 Discussion point 3 (6): Likely reasons
• One major problem is a divergence between the (current or definitional?) nature of data mining / knowledge discovery on the one hand, and business expectations on the other:
  – KD is still more an art than an engineering process, with few standards even for process.
  – Business often expects data mining to be a set of fully automatic, prepackaged black-box solutions.
• The CRISP-DM process model shown on slide 16, for example, is a very high-level attempt at standardisation which leaves many details open.
• In fact, it can be (and often is) argued that the search for interesting and novel patterns through exploratory data analysis by definition involves hand-crafting. Going back to the original definition of data mining (see slide 13), one could argue that looking for the values of pre-defined pattern templates (e.g., conversion rates) is the antithesis of "novel patterns" and thus by definition not data mining.
• On the other hand, Web usage mining is essentially market research: a study of user / consumer behaviour. Market research is an established discipline in which it is quite accepted that methods involve human intervention and interpretation rather than the automatic application of pre-packaged procedures (one example is the focus-group method).

73 Discussion point 3 (7): Likely reasons – contd.
• Maybe this is a perception problem: While it is clear that consumer opinions bear a strong qualitative element (such that focus groups cannot be prepared, administered and interpreted by a machine only), data mining carries the image of number crunching (implying that computers are the main actors here).
• In line with this, the responsible people often have disjoint qualifications: The market research people have a strong background in the relevant social-science methods; the IT people (who are expected to do the data mining "on the side") can use tools, but usually have limited knowledge about empirical methods in general or data mining in particular.
• This point was discussed at a panel at the WebKDD workshop at SIGKDD 2005. One result was that the job description "Chief Data Officer" (= a senior-management person with resources who knows about data mining in the sense of data analysis AND computers) was a really recent invention. In the meantime, data-mining consultancies filled the gap (but had to convince companies they were worth it).
• Or it is a problem of lacking standards (once we have behaviour models of retail sites, of education sites, etc., we can pre-package these behaviour ontologies and even compare sites).
• Standards (in behaviour modelling) require that there is an interest in what the behaviour models say, and an interest in being comparable to other sites. Encouraging developments in this direction can currently be observed in the digital libraries community.
... to be continued ...

74 Thank you for your questions!