Knowledge Discovery in Databases & Information Retrieval
University of Texas at Austin, School of Information
Knowledge Management Systems
Presented April 29, 2003, by Anne Marie Donovan

Knowledge Discovery in Databases
• “The nontrivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data” (Fayyad, Piatetsky-Shapiro, and Smyth, 1996, p. 30)
• Also known as knowledge extraction, information harvesting, data archeology, and information extraction (p. 28)

Information Retrieval
• “The methods and processes for searching relevant information out of information systems that contain extremely large numbers of documents” (Rocha, 2001, 1.1)
• “The ultimate goal of IR is to produce or recommend relevant information to users” (1.2)
• “Traditional IR does not identify users and classifies subjects only with unchanging keywords and categories” (1.2)

Institutions that use KDD/IR systems
• Require knowledge-based decisions
• Have a large quantity of accessible, relevant, historical and current data
• Have a high payoff for correct decisions
  • Financial: banking & investment
  • Medical: healthcare & insurance
  • Sales: marketing & customer relations (Piatetsky-Shapiro, 1998, Slides 28-31)

Database Management Systems
• File Systems
• Relational Database Management Systems (RDBMS)
• Object-Oriented Database Management Systems (OODBMS)
• Object-Relational Database Management Systems (ORDBMS) (Devarakonda, 2001, ORDBMS)

Relational Database Management Systems (RDBMS)
• Relational databases are composed of many relations in the form of two-dimensional tables of rows and columns
• RDBMS advantages include the SQL standard (enables migration between database systems), rapid data access, and large storage capacity
• RDBMS disadvantages include an inability to handle complex data types and relationships (Devarakonda, 2001, RDBMS)
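
The relational model described above can be sketched with Python's built-in sqlite3 module. This is only an illustration: the customer and purchase tables, their columns, and the sample rows are hypothetical and do not come from the cited source.

```python
import sqlite3


def high_value_customers(min_total):
    """Return (name, total) rows for customers whose purchases sum to at least min_total."""
    conn = sqlite3.connect(":memory:")  # throwaway in-memory database
    # Two relations (tables) of rows and columns, linked by customer id.
    # Table names, columns, and rows are hypothetical sample data.
    conn.executescript("""
        CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT);
        CREATE TABLE purchase (customer_id INTEGER REFERENCES customer(id), amount REAL);
        INSERT INTO customer VALUES (1, 'Ada'), (2, 'Ben');
        INSERT INTO purchase VALUES (1, 40.0), (1, 70.0), (2, 15.0);
    """)
    # A declarative SQL query joins the two tables and aggregates per customer.
    rows = conn.execute(
        """
        SELECT c.name, SUM(p.amount) AS total
        FROM customer AS c
        JOIN purchase AS p ON p.customer_id = c.id
        GROUP BY c.id, c.name
        HAVING SUM(p.amount) >= ?
        ORDER BY total DESC
        """,
        (min_total,),
    ).fetchall()
    conn.close()
    return rows


print(high_value_customers(100))
```

Because the query is plain SQL, the same statement should carry over to other relational systems with little change, which is the migration advantage the slide mentions.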

Object-Oriented Database Management Systems (OODBMS)
• OODBMS use abstract data types (ADTs) in which the internal data structure is hidden
• OODBMS data is managed through two sets of relations, one describing the interrelations of data items and another describing the abstract relationships
• OODBMS handle complex data relationships, but suffer from poor performance and problems of scalability (Devarakonda, 2001, OODBMS)

Object-Relational Database Management Systems (ORDBMS)
• ORDBMS store all database information in tables, but some entries have richer data structures, also called abstract data types (ADTs)
• ORDBMS exhibit features of both the relational and object models, such as scalability and support for rich data types
• Their main advantage is massive scalability (Devarakonda, 2001, ORDBMS)

The KDD Process
• Collecting and pre-processing data
  • The problem of continually increasing volumes of data
  • The problem of increasingly complex forms of data
• Identifying and extracting useful knowledge from large data repositories
  • What knowledge is in the data set?
  • What can be observed about the data set?
• Presenting the knowledge in usable forms (Fayyad et al., 1996)

The KDD Process (continued)
• Data management problems in data collection, storage, and retrieval
  • Translation, change detection, integration, duplication, summarization, aggregation, timeliness/datedness (Widom, 1995)
• The impracticality of manual analysis
  • Billions of records and hundreds of fields
  • Increasing desire for on-the-fly analysis and more flexible presentation (Fayyad et al., 1996, p. 28)

The KDD Process (continued)
• A need to automate the knowledge discovery and extraction processes
  • Data selection and pre-processing
  • Data transformation and mining
  • Interpretation and evaluation (p. 28)
• Automation requires attention to:
  • Data collection, storage, and retrieval
  • Statistical foundations of search and retrieval processes (p. 29)

Stages in the KDD process
• Learning the application domain
• Creating a target data set
• Data cleaning and preprocessing
• Data reduction and projection
• Choosing the function of data mining
• Choosing the data mining algorithm
• Data mining
• Interpretation
• Using discovered knowledge (pp. 30-31)
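
The middle stages listed above can be read as a chain of small functions. The sketch below is a toy illustration under assumed data: the record fields (region, product, amount), the per-product totals used as the "mining" step, and the share threshold are all hypothetical choices, not part of the cited process.

```python
from collections import Counter


def select_target(records, region):
    """Creating a target data set: keep one region's records."""
    return [r for r in records if r.get("region") == region]


def clean(records):
    """Data cleaning: drop records with a missing product or amount."""
    return [r for r in records if r.get("product") and r.get("amount") is not None]


def project(records):
    """Data reduction and projection: keep only the fields the mining step needs."""
    return [(r["product"], r["amount"]) for r in records]


def mine(pairs):
    """Data mining: total sales per product (a simple summarization)."""
    totals = Counter()
    for product, amount in pairs:
        totals[product] += amount
    return totals


def interpret(totals, min_share):
    """Interpretation: report products with at least min_share of total sales."""
    grand_total = sum(totals.values())
    if not grand_total:
        return []
    return sorted(p for p, t in totals.items() if t / grand_total >= min_share)


def run_kdd(records, region, min_share=0.5):
    """Run the stages in order, each feeding the next."""
    return interpret(mine(project(clean(select_target(records, region)))), min_share)
```

Keeping each stage as its own function makes it easy to swap the mining step for a different algorithm without touching selection or cleaning.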

Data mining
• The application of specific algorithms to a data set for the purpose of extracting data patterns (p. 28)
• “Fitting models to or determining patterns from observed data” (p. 31)
Data warehousing
• Collecting and “cleaning” transactional data to make it available for online analysis and decision support (p. 30)

Data mining tasks
• Classification: predicting an item class
• Forecasting: predicting a parameter value
• Clustering: finding groups of items
• Description: describing a group
• Deviation detection: finding changes
• Link analysis: finding relationships and associations
• Visualization: presenting data visually to facilitate human discovery (Piatetsky-Shapiro, 1998, Slide 17)
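
As one concrete reading of the deviation detection task above, the sketch below flags values that lie far from the mean of a series. The z-score rule and the default threshold of 2.0 are assumptions chosen for illustration; the cited slides do not prescribe a method.

```python
import statistics


def find_deviations(values, threshold=2.0):
    """Return the values lying more than `threshold` standard deviations from the mean."""
    mean = statistics.mean(values)
    spread = statistics.pstdev(values)  # population standard deviation
    if spread == 0:
        return []  # every value is identical, so nothing deviates
    return [v for v in values if abs(v - mean) / spread > threshold]


# Hypothetical daily transaction counts with one unusual day.
print(find_deviations([10, 11, 9, 10, 12, 10, 50]))
```

A single extreme value also inflates the standard deviation it is measured against, so more robust measures may be preferable on real data.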

Components of data mining systems
• Model functions: classification, regression, clustering, etc. (pp. 31-32)
• Model representation: decision trees and rules, linear models, non-linear models, example-based methods, etc. (p. 32)
• Preference criterion: quantitative criterion embedded in the search algorithm; implicit criterion embedded in the KDD process
• Search algorithms: parameter search (given a model) or model search over model space
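
The components above can be made concrete with a very small example: a linear model as the representation, summed squared error as the preference criterion, and an exhaustive grid as the parameter search. The function names and the candidate grids are hypothetical, and a grid search is only one of many possible search algorithms.

```python
def squared_error(slope, intercept, points):
    """Preference criterion: summed squared error of the line y = slope * x + intercept."""
    return sum((y - (slope * x + intercept)) ** 2 for x, y in points)


def parameter_search(points, slopes, intercepts):
    """Search algorithm: try every (slope, intercept) pair and keep the lowest-error one."""
    candidates = [(s, b) for s in slopes for b in intercepts]
    return min(candidates, key=lambda sb: squared_error(sb[0], sb[1], points))


# Hypothetical observations that lie exactly on y = 2x + 1.
observed = [(0, 1), (1, 3), (2, 5)]
print(parameter_search(observed, slopes=[0, 1, 2, 3], intercepts=[0, 1, 2]))
```

Model search, by contrast, would also vary the form of the model itself (for example, line versus curve) and run a parameter search for each candidate form.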

There is NO universal search algorithm
• Each type of search suits specific types of search problems
• The searcher must be careful to properly formulate the question
• The searcher must understand the search goal (p. 31)
Every search can be improved by an increase in data or query context

Creating context for KDD and IR
• Extending IR throughout the social network of an organization, e.g., Answer Garden (Ackerman, 1994; Ackerman and McDonald, 1996)
• Providing social context for data exchange, e.g., PeopleGarden (Xiong and Donath, 1999)
• Relational database reverse engineering, which “extracts a conceptual model from an existing relational database by analyzing data instances as well as metadata” (Lee and Hwang, 2002, Conclusion)

KD & IR problems for Web resources
• Collecting and pre-processing data
  • Even more continually changing data
  • Complex data; streaming & multi-media
• The problem of identifying and extracting useful knowledge from Web resources
  • No consistent data models; no context
  • A lack of descriptive information
• Presenting the knowledge in usable forms
  • More and more wireless devices and time-sensitive, multi-media applications

Current methods for Web KD & IR
• Collecting and pre-processing data
  • Web crawlers and link-based ranking
  • Human indexing and categorization
• Identifying and extracting useful knowledge from Web resources
  • Keyword search on natural language text
  • Topical directories or topical Web sites
• Presenting the knowledge in usable forms
  • Content presented in native format (plugins) or in HTML
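
Keyword search on natural language text, as listed above, is commonly built on an inverted index that maps each word to the documents containing it. The following is a minimal sketch under simplifying assumptions: whitespace tokenization, case folding, and AND semantics across query terms. The document ids and texts are hypothetical.

```python
from collections import defaultdict


def build_index(docs):
    """Map each lowercased word to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index


def search(index, query):
    """Return ids of documents containing every term in the query."""
    terms = query.lower().split()
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())  # intersect: all terms must match
    return result


# Hypothetical page texts keyed by a short id.
pages = {"a": "data mining tools", "b": "data warehousing", "c": "web mining"}
print(search(build_index(pages), "data mining"))
```

The index matches only literal words, which illustrates the limitation noted elsewhere in this deck: keywords carry no context about what a page means.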

Automating KD & IR for the Web
• Semantic markup to enable machine understanding/processing (RDF/S & DAML/OIL) & inference analysis
• Intelligent search engines and agents to exploit semantic statements
• Ontologies to provide context (a data model) for agents (Shah et al., 2002)
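
Semantic statements of the kind described above are often modeled as subject-predicate-object triples over which an agent can draw inferences. The sketch below uses plain Python tuples rather than a real RDF library; the predicate names "type" and "subClassOf" and the sample resources are simplified stand-ins chosen for illustration, not actual RDF/S or DAML+OIL syntax.

```python
def match(triples, s=None, p=None, o=None):
    """Return triples matching the pattern; None acts as a wildcard."""
    return [
        t for t in triples
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    ]


def infer_types(triples):
    """Add (x, type, C2) whenever (x, type, C1) and (C1, subClassOf, C2) both hold."""
    facts = set(triples)
    changed = True
    while changed:  # repeat until no new fact can be derived
        changed = False
        for x, p1, c1 in list(facts):
            if p1 != "type":
                continue
            for c, p2, c2 in list(facts):
                if p2 == "subClassOf" and c == c1 and (x, "type", c2) not in facts:
                    facts.add((x, "type", c2))
                    changed = True
    return facts


# Hypothetical statements: one resource and a two-level class hierarchy.
statements = {
    ("paper1", "type", "Article"),
    ("Article", "subClassOf", "Document"),
    ("Document", "subClassOf", "Resource"),
}
print(match(infer_types(statements), p="type", o="Document"))
```

A query for documents now finds paper1 even though no statement says so directly, which is the kind of context an ontology gives a search agent.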

Automating KD & IR for the Web (continued)
• Automated data collection, automated context collection (data pre-processing)
• Value-added services (query routing)
• Integrated query systems/knowledge delivery systems (accessibility)
• Social accounting metrics to provide context for humans (Smith, 2002, p. 52)

Enhanced presentation for the Web
• Reformatting for presentation
  • Differentiated service
  • Variable visualization
    • Adaptive graphics, “a unifying framework that allows visual representations of information to be customized and mixed together into new ones” (Boier-Martin, 2003, pp. 6-9)
    • Previewing & interactive content
    • Selective presentation & customized views

KDD and IR for pervasive computing
• Achieving “ubiquitous data access” (Cherniack, Franklin, & Zdonik, 2001, slide 7)
  • Data management problems
    • Dissemination (context-dependent pull/push)
    • Synchronization (multiple collectors/devices)
    • Recharging (renewing) multiple data streams
    • Profile-driven data management

KDD and IR for pervasive computing (continued)
• Achieving “ubiquitous data access” (Cherniack, Franklin, & Zdonik, 2001, slide 7)
  • Location-aware mobile devices
  • Service discovery for mobile services
  • Distributed sensors/collectors (slides 8-27)

Next generation KDD & IR will…
• Focus on solving business problems, not data analysis problems
• Embed knowledge discovery engines
• Integrate access to enterprise and external data on the back-end
• Integrate knowledge discovery process with knowledge delivery tools (Piatetsky-Shapiro, 1998, Slide 7)

Next generation KDD & IR will…
• Manage information retrieval contextually
• Allow contextual query/continuous query
• Synchronize multiple data flows from disparate sensors/input devices
• Enable KD in virtual networks of peer-to-peer databases (data “clusters” or “cubes”)
• Interpolate or extrapolate for missing data (Cherniack et al., 2001, slides 115-138)
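
Interpolating for missing data, the last item above, can be as simple as drawing a straight line between the nearest known readings. The sketch below assumes evenly spaced samples with gaps marked as None; it fills interior gaps only and leaves leading or trailing gaps untouched, since filling those would be extrapolation. The function name and sample series are hypothetical.

```python
def fill_gaps(series):
    """Linearly interpolate None entries that lie between two known values."""
    filled = list(series)
    known = [i for i, v in enumerate(filled) if v is not None]
    # Walk over each pair of neighbouring known readings and fill between them.
    for left, right in zip(known, known[1:]):
        step = (filled[right] - filled[left]) / (right - left)
        for i in range(left + 1, right):
            filled[i] = filled[left] + step * (i - left)
    return filled


# Hypothetical sensor stream with two dropped readings and a missing tail.
print(fill_gaps([1.0, None, None, 4.0, None]))
```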

Next generation KDD & IR will…
• Recognize individual users
• Characterize information resources
• Provide a way to exchange knowledge between users and information resources (push and pull of information)
• Adapt to the user community and enable the reuse and recombination of information as well as its exchange (Rocha, 2001, 1.2)

KDD research problems
• Massive data sets & high dimensionality
• User interaction & prior knowledge
• Determining statistical significance
• Missing data
• Understandability of patterns
• Management of changing data & knowledge
• Data integration
• Non-standard, multimedia, & object-oriented data (Fayyad, Piatetsky-Shapiro, & Smyth, 1996, pp. 33-34)

“Top Ten” IR research issues
• Integrated solutions
• Distributed IR
• Efficient, flexible indexing and retrieval
• “Magic” (automatic query expansion)
• Interfaces and browsing
• Routing and filtering
• Effective retrieval
• Multimedia retrieval
• Information extraction
• Relevance feedback (Croft, 1995)

Total Information Awareness - DARPA on the bleeding edge…
• New database technologies
  • Database architectures
  • Database population
  • New search algorithms and data models
• Genisys
  • Goal is to produce technology enabling ultra-large, all-source information repositories
  • http://www.darpa.mil/iao/Genisys.htm

Social Issues
• Communicating context
• Creating trust/social value
• Inciting cooperation/collaboration
• Privacy tradeoffs: convenience/service or security/privacy?

References
Ackerman, M. S. (1998, July). Augmenting the organizational memory: A field study of Answer Garden. ACM Transactions on Information Systems, 16(3), 203-204. Retrieved March 28, 2003 from http://doi.acm.org/10.1145/290159.290160
Ackerman, M. S., & Malone, T. W. (1990, April). Answer Garden: A tool for growing organizational memory. ACM SIGOIS Bulletin, 11(2-3), 31-39. Retrieved March 28, 2003 from http://doi.acm.org/10.1145/91474.91485
Ackerman, M. S., & McDonald, D. W. (1996). Proceedings of the ACM Conference on Computer-Supported Cooperative Work 1996 (CSCW 96 Boston, MA). Retrieved March 28, 2003 from http://doi.acm.org/10.1145/240080.240203
Boier-Martin, I. M. (2003, January/February). Adaptive graphics. In T. Rhyne (Ed.), Visualization Viewpoints, IEEE Computer Graphics and Applications, 23(1), 6-10. Retrieved April 5, 2003 from http://www.research.ibm.com/people/i/imartin/papers/visviewpoints.pdf

Chakrabarti, S., Srivastava, S., Subramanyam, M., & Tiwari, M. (2000). Using Memex to archive and mine community Web browsing experience. Paper presented at the 9th International World Wide Web Conference, Amsterdam, May 15-19, 2000. Retrieved April 12, 2003 from http://www9.org/w9cdrom/98/98.html
Croft, W. B. (1995, November). What do people want from information retrieval? The top 10 research issues for companies that use and sell IR systems. D-Lib Magazine. Retrieved April 5, 2003 from http://sunsite.anu.edu.au/mirrors/dlib/november95/11croft.html
DARPA Information Awareness Office. (2003a). Genisys. Retrieved from the DARPA Information Awareness Office Web site at: http://www.darpa.mil/iao/Genisys.htm
DARPA Information Awareness Office. (2003b). Total Information Awareness System. Retrieved from the DARPA Information Awareness Office Web site at: http://www.darpa.mil/iao/TIASystems.htm

Devarakonda, R. (2001, March). Object-Relational database systems - The road ahead. ACM Crossroads Student Magazine. Retrieved April 12, 2003 from www.acm.org/crossroads/xrds7-3/ordbms.html
Fayyad, U., Piatetsky-Shapiro, G., & Smyth, P. (1996, November). The KDD process for extracting useful knowledge from volumes of data. Communications of the ACM, 39(11), 27-34. Retrieved March 03, 2003 from http://wwwhome.cs.utwente.nl/~mpoel/colleges/dwdm/ACM_artikelen/fayyad2.pdf
Lee, D., & Hwang, Y. (2002, March 1). Extracting semantic metadata and its visualization. ACM Crossroads Student Magazine. Retrieved March 27, 2003 from www.acm.org/crossroads/xrds7-3/smeva.html
Piatetsky-Shapiro, G. (1998, December 4). Data mining and knowledge discovery tools: The next generation. Retrieved February 27, 2003 from kdnuggets.com at http://www.kdnuggets.com/gpspubs/dama-nextgen-98/index.htm

Rauber, A., Aschenbrenner, A., Witvoet, O., Bruckner, R. M., & Kaiser, M. (2002, December). Uncovering information hidden in Web archives: A glimpse at Web analysis building on data warehouses. D-Lib Magazine, 8(12). Retrieved March 28, 2003 from http://www.dlib.org/dlib/december02/rauber/12rauber.html
Rocha, L. M. (2001). TalkMine: A soft computing approach to adaptive knowledge recommendation [Electronic version]. In V. Loia & S. Sessa (Eds.), Studies in fuzziness and soft computing: Vol. 75. Soft computing agents: New trends for designing autonomous systems (pp. 89-116). New York: Springer. Retrieved March 28, 2003 from http://www.c3.lanl.gov/~rocha/softagents.html
Shah, U., Finin, T., Joshi, A., Cost, R. S., & Mayfield, J. (2002, November). Information retrieval on the Semantic Web. Paper presented at the ACM Conference on Information and Knowledge Management, November 2002. Retrieved March 28, 2003 from http://www.csee.umbc.edu/~finin/papers/cikm02.pdf

Smith, M. (2002). Tools for navigating large social cyberspaces. Communications of the ACM, 45(4), 51-55. Retrieved March 28, 2003 from http://delivery.acm.org/10.1145/510000/505272/p51smith.html?key1=505272&key2=5541680501&coll=GUIDE&dl=GUIDE&CFID=9914049&CFTOKEN=12943474
Whitted, T. (1999, July/August). Draw on the Wall. IEEE Computer Graphics and Applications, 19(4), 6-9. Retrieved April 8, 2003 from ieeexplore.ieee.org at: http://ieeexplore.ieee.org/iel5/38/16795/00773957.pdf?isNumber=16795&arnumber=773957&prod=JNL&arSt=6&ared=9&arAuthor=Whitted%2C+T.
Widom, J. (1995, November). Research problems in data warehousing. Proceedings of the 4th International Conference on Information and Knowledge Management (CIKM). Retrieved March 28, 2003 from http://www.ischool.utexas.edu/~i385tkms/readings/Widom-1995ResearchProblems.pdf

Xiong, R., & Donath, J. (1999). PeopleGarden: Creating data portraits for users. CHI Letters, 1(1), 37-44. Retrieved April 8, 2003 from http://smg.media.mit.edu/papers/Xiong/pgarden_uist99.pdf