- Number of slides: 43

From Data Integration to Community Information Management
AnHai Doan, University of Illinois
Joint work with Pedro DeRose, Robert McCann, Yoonkyong Lee, Mayssam Sayyadian, Warren Shen, Wensheng Wu, Quoc Le, Hoa Nguyen, Long Vu, Robin Dhamankar, Alex Kramnik, Luis Gravano, Weiyi Meng, Raghu Ramakrishnan, Dan Roth, Arnon Rosenthal, Clement Yu

Data Integration Challenge
- User: a new researcher
- Query: find houses with 4 bedrooms priced under 300K
- Sources: realestate.com, homeseekers.com, homes.com

Actually Bought a House in 2004
- Buying period
  - queried 7-8 data sources over 3 weeks
  - some of the sources are local, not “indexed” by national sources
  - 3 hours / night, 60+ hours
  - huge amount of time on querying and post-processing
- Buyer-remorse period
  - repeated the above for another 3 weeks!
We really need to automate data integration...

Architecture of Data Integration Systems
[Figure: the query "find houses with 4 bedrooms priced under 300K" is posed over a mediated schema, which sits above source schemas 1, 2, and 3; each source schema has a wrapper over homes.com, realestate.com, and houses.com respectively.]

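
The architecture on this slide can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the design of any system named in the talk: the listings, the source attribute names (`listed-price`, `beds`, `cost`, `num-bedrooms`), the mappings, and the helpers `make_wrapper` and `answer` are all hypothetical.

```python
from typing import Callable, Dict, List

Row = Dict[str, object]

# Hypothetical raw listings; each source uses its own attribute names.
SOURCES: Dict[str, List[Row]] = {
    "homes.com": [
        {"listed-price": 240_000, "beds": 4},
        {"listed-price": 320_000, "beds": 4},
    ],
    "realestate.com": [
        {"cost": 280_000, "num-bedrooms": 3},
        {"cost": 295_000, "num-bedrooms": 4},
    ],
}

# Assumed mappings: mediated-schema attribute -> source attribute.
MAPPINGS: Dict[str, Dict[str, str]] = {
    "homes.com": {"price": "listed-price", "bedrooms": "beds"},
    "realestate.com": {"price": "cost", "bedrooms": "num-bedrooms"},
}


def make_wrapper(mapping: Dict[str, str]) -> Callable[[Row], Row]:
    """Return a wrapper that rewrites a source row into the mediated schema."""
    def wrap(row: Row) -> Row:
        return {mediated: row[source] for mediated, source in mapping.items()}
    return wrap


def answer(predicate: Callable[[Row], bool]) -> List[Row]:
    """Evaluate a query over the mediated schema against every source."""
    results: List[Row] = []
    for name, rows in SOURCES.items():
        wrap = make_wrapper(MAPPINGS[name])
        for row in rows:
            mediated = wrap(row)
            if predicate(mediated):
                results.append({**mediated, "source": name})
    return results


# Find houses with 4 bedrooms priced under 300K.
houses = answer(lambda h: h["bedrooms"] == 4 and h["price"] < 300_000)
```

The user writes one query against `price` and `bedrooms`; the per-source mappings are exactly the part that schema matching tries to produce automatically.
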
Current State of Affairs
Vibrant research & industrial landscape
- Research since the 70s, accelerated in the past decade
  - database, AI, Web, KDD, Semantic Web communities
  - 14+ workshops in past 3 years: ISWC-03, IJCAI-03, VLDB-04, SIGMOD-04, DILS-04, IQIS-04, ISWC-04, WebDB-05, ICDE-05, DILS-05, IQIS-05, IIWeb-06, etc.
  - main database focuses:
    - modeling, architecture, query processing, schema/tuple matching
    - building specialized systems: life sciences, Deep Web, etc.
- Industry
  - 53 startups in 2002 [Wiederhold-02]
  - many new ones in 2005
Despite much R&D activity, however …

DI Systems are Still Very Difficult to Build and Maintain
- Builder must execute multiple tasks: select data sources, create wrappers, create mediated schemas, match schemas, eliminate duplicate tuples, monitor changes, etc.
- Most tasks are extremely labor intensive
- Total cost often at 35% of IT budget [Knoblock et al. 02]
  - systems often take months or years to develop
High cost severely limits deployment of DI systems

Data Integration @ Illinois
- Directions:
  - automate tasks to minimize human labor
  - leverage users to spread out the cost
  - simplify tasks so that they can be done quickly

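Sample Research on Automating Integration Tasks: Schema Matching
[Figure: the mediated schema (price, agent-name, address) is matched against the homes.com schema below; the figure marks a 1-1 match and a complex match relating address to city and state.]

homes.com
listed-price   contact-name   city      state
320K           Jane Brown     Seattle   WA
240K           Mike Smith     Miami     FL
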
Schema Matching is Ubiquitous!
- Fundamental problem in numerous applications
- Databases
  - data integration, model management
  - data translation, collaborative data sharing
  - keyword querying, schema/view integration
  - data warehousing, peer data management, …
- AI
  - knowledge bases, ontology merging, information gathering agents, ...
- Web
  - e-commerce, Deep Web, Semantic Web, Google Base, next version of My Web 2.0?
- eGovernment, bio-informatics, e-sciences

Why Schema Matching is Difficult
- Schema & data never fully capture semantics!
  - not adequately documented
- Must rely on clues in schema & data
  - using names, structures, types, data values, etc.
- Such clues can be unreliable
  - same name, different entities: area may mean location or square-feet
  - different names, same entity: area and address may both mean location
- Intended semantics can be subjective
  - house-style = house-description?
- Cannot be fully automated, needs user feedback

Current State of Affairs
- Schema matching is now a key bottleneck!
  - largely done by hand, labor intensive & error prone
  - data integration at GTE [Li&Clifton, 2000]
    - 40 databases, 27000 elements, estimated time: 12 years
- Numerous matching techniques have been developed
  - Databases: IBM Almaden, Wisconsin, Microsoft Research, Purdue, BYU, George Mason, Leipzig, NCSU, Illinois, Washington, ...
  - AI: Stanford, Toronto, Rutgers, Karlsruhe University, NEC, USC, …
  - "everyone and his brother is doing ontology mapping"
- Techniques are often synergistic, leading to multi-component matching architectures
  - each component employs a particular technique
  - final predictions combine those of the components

Example: LSD [Doan et al. SIGMOD-01]
- Mediated schema
  - address: Urbana, IL; Seattle, WA
  - agent-name: James Smith; Mike Doan
- Source schema of homes.com
  - area: Peoria, IL; Kent, WA
  - contact-agent: (206) 634 9435; (617) 335 4243
  - comments
- Pipeline: Name Matcher and Naive Bayes Matcher, then Combiner, then Constraint Enforcer, then Match Selector
- Combined predictions
  - area => (address, 0.7), (description, 0.3)
  - contact-agent => (agent-phone, 0.7), (agent-name, 0.3)
  - comments => (address, 0.6), (desc, 0.4)
- Constraint: only one attribute of source schema matches address
- Selected matches
  - area = address
  - contact-agent = agent-phone
  - ...
  - comments = desc

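
The constraint enforcement and match selection steps of this example can be illustrated with a small brute-force search. This is a sketch under stated assumptions, not LSD's actual algorithm: it starts from the confidence scores shown on the slide, reads the third prediction as belonging to `comments`, spells out `desc` as `description`, and scores a full assignment by simply summing confidences. The function name `select_matches` is hypothetical.

```python
from itertools import product
from typing import Dict

# Combined predictions read off the slide: source attribute -> {label: confidence}.
PREDICTIONS: Dict[str, Dict[str, float]] = {
    "area": {"address": 0.7, "description": 0.3},
    "contact-agent": {"agent-phone": 0.7, "agent-name": 0.3},
    "comments": {"address": 0.6, "description": 0.4},
}


def select_matches(
    predictions: Dict[str, Dict[str, float]], unique_label: str = "address"
) -> Dict[str, str]:
    """Pick the highest-scoring assignment in which at most one source
    attribute is mapped to `unique_label` (the constraint from the slide)."""
    attrs = list(predictions)
    best: Dict[str, str] = {}
    best_score = float("-inf")
    # Enumerate every combination of one candidate label per attribute.
    for labels in product(*(predictions[a] for a in attrs)):
        if labels.count(unique_label) > 1:
            continue  # violates the domain constraint
        score = sum(predictions[a][lab] for a, lab in zip(attrs, labels))
        if score > best_score:
            best, best_score = dict(zip(attrs, labels)), score
    return best


matches = select_matches(PREDICTIONS)
```

Without the constraint, both `area` and `comments` would take their top label `address`; with it, `comments` falls back to `description`, which agrees with the matches listed on the slide. Exhaustive enumeration is only workable for toy inputs like this one.
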
Multi-Component Matching Solutions
- Introduced in [Doan et al., WebDB-00, SIGMOD-01; Do&Rahm, VLDB-02; Embley et al. 02]
- Now commonly adopted, with industrial-strength systems
  - e.g., Protoplasm [MSR], COMA++ [Univ of Leipzig]
[Figure: three architectures, each listed from top to bottom.
  LSD: match selector, constraint enforcer, combiner, matcher 1 … matcher n.
  COMA: match selector, combiner, constraint enforcer, matcher 1 … matcher n.
  LSD-SF: matcher SF, combiner, matcher 1 … matcher n.]
- Such systems are very powerful ...
  - maximize accuracy; highly customizable
- ... but place a serious tuning burden on domain users

Tuning Schema Matching Systems
- Given a particular matching situation
  - how to select the right components?
  - how to adjust the multitude of knobs?
[Figure: an execution graph (match selector, constraint enforcer, combiner, matcher 1 … matcher n) is assembled from a library of matching components:
  - selectors: threshold selector, bipartite graph selector
  - enforcers: A* search enforcer, relax. labeler, ILP
  - combiners: average combiner, min combiner, max combiner, weighted sum combiner
  - matchers: q-gram name matcher, decision tree matcher, TF/IDF name matcher, Naïve Bayes matcher, SVM matcher
  - knobs of the decision tree matcher: characteristics of attr., split measure, post-prune?, size of validation set]
- Untuned versions produce inferior accuracy

But Tuning is Extremely Difficult
- Large number of knobs
  - e.g., 8-29 in our experiments
- Wide variety of techniques
  - database, machine learning, IR, information theory, etc.
- Complex interaction among components
- Not clear how to compare quality of knob configs
- Long-standing problem since the 80s, getting much worse with multiple-component systems
Developing efficient tuning techniques is now crucial

The eTuner Solution [VLDB-05a]
- Given schema S & matching system M
  - tunes M to maximize average accuracy of matching S with future schemas
  - such scenarios commonly occur in data integration, warehousing, supply chain
- Challenge 1: Evaluation
  - score each knob config K of matching system M
  - return K*, the one with the highest score
  - but how to score knob config K?
  - if we know a representative workload W = {(S, T1), ..., (S, Tn)} and the correct matches between S and T1, …, Tn, we can use W to score K
- Challenge 2: Huge or infinite search space

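
The evaluation idea in Challenge 1 (score each knob config K on a workload W with known matches, return the best one) can be sketched as follows. Everything concrete here is an assumption made for illustration: the toy prefix-based matcher, its single knob `prefix_len`, the two target schemas, and the exhaustive loop over four configs. The exhaustive loop in particular is a stand-in; the slides name a staged searcher for the real search space, which this sketch does not reproduce.

```python
from typing import Dict, List, Optional, Tuple

Matches = Dict[str, Optional[str]]
# Each workload entry: (target schema T_i, correct matches from S into T_i).
Workload = List[Tuple[List[str], Matches]]

S = ["id", "last", "salary"]

# Hypothetical workload W with known correct matches.
W: Workload = [
    (["lab", "last_name", "identifier", "salary_usd"],
     {"id": "identifier", "last": "last_name", "salary": "salary_usd"}),
    (["sale_date", "salary", "id_no", "surname"],
     {"id": "id_no", "last": None, "salary": "salary"}),
]


def toy_matcher(source: List[str], target: List[str], knobs: Dict[str, int]) -> Matches:
    """Hypothetical matcher M: map each source attribute to the first target
    attribute that starts with its first `prefix_len` characters."""
    k = knobs["prefix_len"]
    return {s: next((t for t in target if t.startswith(s[:k])), None) for s in source}


def accuracy(predicted: Matches, gold: Matches) -> float:
    """Fraction of source attributes whose predicted match equals the gold match."""
    return sum(predicted.get(s) == t for s, t in gold.items()) / len(gold)


def score(knobs: Dict[str, int]) -> float:
    """Average accuracy of the matcher under `knobs` over the workload W."""
    return sum(accuracy(toy_matcher(S, T, knobs), gold) for T, gold in W) / len(W)


def tune(configs: List[Dict[str, int]]) -> Dict[str, int]:
    """Return K*, the config with the highest score (first one wins ties)."""
    return max(configs, key=score)


best = tune([{"prefix_len": k} for k in range(1, 5)])
```

Short prefixes confuse `last` with `lab` and `salary` with `sale_date`, so the score rises with `prefix_len` on this workload. The scoring function only works because the correct matches in `W` are known, which is the gap the synthetic workload is meant to fill.
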
Solving Challenge 1: Generate Synthetic Input/Output
- Need workload W = {(S, T1), (S, T2), …, (S, Tn)}
- To generate W
  - start with S
  - perturb S to generate T1
  - perturb S to generate T2
  - etc.
- Knowing the perturbation means knowing the matches between S & Ti

Generate Synthetic Input/Output
Schema S, table Employees:

id   first   last    salary ($)
1    Bill    Laup    40,000 $
2    Mike    Brown   60,000 $

[Figure: S is perturbed in three steps.
  1. Perturb # of columns: the column first no longer appears.
  2. Perturb table and column names: Employees becomes Emps, with columns emp-last, id, wage.
  3. Perturb data tuples: e.g., wage values 45200 and 59328.
  Resulting known matches: Employees.last = Emps.emp-last, Employees.id = Emps.id, Employees.first = NONE, Employees.salary = Emps.wage.]
- Make sure tables do not share tuples
- Rules are applied probabilistically

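
A minimal sketch of this generation step is given below. It assumes a much smaller rule set than the slides describe: it only drops and renames columns, and leaves out table renaming and data-tuple perturbation. The `SYNONYMS` table, the probabilities, and the function names are hypothetical. The point it illustrates is that the correct matches are recorded while the perturbation is applied, so no manual labeling is needed.

```python
import random
from typing import Dict, List, Optional, Tuple

# Hypothetical rename rules: source column -> alternative names.
SYNONYMS: Dict[str, List[str]] = {
    "last": ["emp-last", "surname"],
    "salary": ["wage", "pay"],
}

Matches = Dict[str, Optional[str]]


def perturb(
    columns: List[str],
    rng: random.Random,
    p_drop: float = 0.3,
    p_rename: float = 0.5,
) -> Tuple[List[str], Matches]:
    """Derive one target schema from `columns` and return it together with
    the correct matches (source column -> target column, or None if dropped)."""
    target: List[str] = []
    matches: Matches = {}
    for col in columns:
        if rng.random() < p_drop:       # rule applied probabilistically
            matches[col] = None         # dropped column has no match
            continue
        new_name = col
        if col in SYNONYMS and rng.random() < p_rename:
            new_name = rng.choice(SYNONYMS[col])
        target.append(new_name)
        matches[col] = new_name
    rng.shuffle(target)                 # column order carries no information
    return target, matches


def generate_workload(columns: List[str], n: int, seed: int = 0) -> List[Tuple[List[str], Matches]]:
    """Generate n perturbed schemas T1..Tn with their known matches."""
    rng = random.Random(seed)
    return [perturb(columns, rng) for _ in range(n)]


workload = generate_workload(["id", "first", "last", "salary"], n=5)
```

Each entry of `workload` pairs a target schema with its ground-truth matches, which is the form of input a scoring procedure over knob configs would need.
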
The eTuner Architecture
[Figure: schema S and the perturbation rules feed the workload generator, which produces a synthetic workload of entries (S, T1, Ω1), (S, T2, Ω2), …, (S, Tn, Ωn); the staged searcher takes the synthetic workload, the tuning procedures, and matching tool M, and outputs the tuned matching tool M.]
- More details / experiments in Sayyadian et al., VLDB-05

eTuner: Current Status
- Only the first step
  - but now we have a line of attack for a long-standing problem
- Current directions
  - find optimal synthetic workload
  - develop faster search methods
  - extend for other matching scenarios
  - adapt ideas to scenarios beyond schema matching
    - wrapper maintenance [VLDB-05b]
    - domain-specific search engine?

Automate Integration Tasks: Summary
- Schema matching
  - architecture: WebDB-00, SIGMOD-01, WWW-02
  - long-standing problems: SIGMOD-04a, eTuner [VLDB-05a]
  - learning/other techniques: CIDR-03, VLDBJ-03, MLJ-03, WebDB-03, SIGMOD-04b, ICDE-05a, ICDE-05b
  - novel problem: debug schemas for interoperability [ongoing]
  - industry transfer: involving 2 startups
  - promote research area: workshop at ISWC-03, special issues in SIGMOD Record-04 & AI Magazine-05, survey
- Query reformulation: ICDE-02
- Mediated schema construction: SIGMOD-04b, ICDM-05, ICDE-06
- Duplicate tuple removal: AAAI-05, Tech Report 06a, 06b
- Wrapper maintenance: VLDB-05b

Research Directions
- Automate integration tasks
  - to minimize human labor
- Leverage users
  - to spread the cost
- Simplify integration tasks
  - so that they can be done quickly

The MOBS Project
- Learn from multitude of users to improve tool accuracy, thus significantly reducing builder workload
[Figure: questions are sent out and answers come back.]
- MOBS = Mass Collaboration to Build Systems

Mass Collaboration
- Build software artifacts
  - Linux, Apache server, other open-source software
- Knowledge bases, encyclopedia
  - wikipedia.com
- Review & technical support websites
  - amazon.com, epinions.com, quiq.com
- Detect software bugs
  - [Liblit et al. PLDI 03 & 05]
- Label images/pages on the Web
  - ESPgame, flickr, del.icio.us, My Web 2.0
- Improve search engines, recommender systems
Why not data integration systems?

Example: Duplicate Data Matching
- Serious problem in many settings (e.g., e.com)
  - Dell laptop X200 with mouse ...
  - Mouse for Dell laptop 200 series ...
  - Dell X200; mouse at reduced price ...
- Hard for machine, but easy for human

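
To see why these listings are hard for a machine, consider a simple token-overlap measure. The sketch below uses Jaccard similarity over lowercase word tokens; this is a generic baseline chosen for illustration, not a technique the slides attribute to any system, and the listing strings assume that "X 200" in the slide text stands for "X200".

```python
import re
from itertools import combinations
from typing import Dict, Set, Tuple

LISTINGS = [
    "Dell laptop X200 with mouse",
    "Mouse for Dell laptop 200 series",
    "Dell X200; mouse at reduced price",
]


def tokens(text: str) -> Set[str]:
    """Lowercase alphanumeric tokens of a listing."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def jaccard(a: str, b: str) -> float:
    """Size of the token intersection divided by size of the token union."""
    ta, tb = tokens(a), tokens(b)
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0


# Similarity for every pair of listings, keyed by their indexes.
scores: Dict[Tuple[int, int], float] = {
    (i, j): jaccard(LISTINGS[i], LISTINGS[j])
    for i, j in combinations(range(len(LISTINGS)), 2)
}
```

The first listing scores exactly the same against the second as against the third, so token overlap alone cannot tell which pairs describe the same product. A person reading the three lines resolves this quickly, which is the motivation for asking users.
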
Key Challenges
- How to modify tools to learn from users?
- How to combine noisy user answers?
  - multiple noisy oracles
  - build user models, learn them via interaction with users
  - novel form of active learning
- How to obtain user participation?
  - data experts, often willing to help (e.g., Illinois Fire Service)
  - may be asked to help (e.g., e.com)
  - volunteer (e.g., online communities), "payment" schemes

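
One simple way to combine noisy answers is to estimate how reliable each user is and weight votes accordingly. The sketch below is an assumption-laden illustration, not the MOBS method: it supposes yes/no questions, a handful of evaluation questions with known answers, Laplace-smoothed accuracy as the user model, and a log-odds weighted vote. All function and user names are hypothetical.

```python
import math
from typing import Dict


def estimate_reliability(
    answers: Dict[str, Dict[str, bool]], known: Dict[str, bool]
) -> Dict[str, float]:
    """User model: smoothed fraction of evaluation questions answered correctly.

    answers: user -> {question: answer}; known: question -> correct answer.
    """
    reliability: Dict[str, float] = {}
    for user, given in answers.items():
        graded = [q for q in given if q in known]
        correct = sum(given[q] == known[q] for q in graded)
        # Laplace smoothing keeps the estimate strictly between 0 and 1.
        reliability[user] = (correct + 1) / (len(graded) + 2)
    return reliability


def combine(votes: Dict[str, bool], reliability: Dict[str, float]) -> bool:
    """Weighted vote: each user contributes the log-odds of being right.

    Unknown users get reliability 0.5, i.e. weight 0. Ties resolve to False.
    Reliabilities must lie strictly between 0 and 1.
    """
    total = 0.0
    for user, vote in votes.items():
        p = reliability.get(user, 0.5)
        weight = math.log(p / (1 - p))
        total += weight if vote else -weight
    return total > 0


known = {"q1": True, "q2": False, "q3": True}
answers = {
    "ann": {"q1": True, "q2": False, "q3": True},   # always right
    "bob": {"q1": False, "q2": True, "q3": False},  # always wrong
}
reliability = estimate_reliability(answers, known)
decision = combine({"ann": True, "bob": False}, reliability)
```

A user who is consistently wrong gets a negative weight, so the "no" from `bob` counts as evidence for "yes". With a plain majority vote the two answers would simply cancel out.
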
Current Status
- Developed first-cut solutions
  - built prototype, experimented with 3-132 users, for source discovery and schema matching
  - improved accuracy by 9-60%, reduced workload by 29-88%
- Built two simple DI systems on Web
  - almost exclusively with users
- Building a real-world application
  - DBlife (more later)
- See [McCann et al., WebDB-03, ICDE-05, AAAI Spring Symposium-05, Tech Report-06]

Research Directions
- Automate integration tasks
  - to minimize human labor
- Leverage users
  - to spread the cost
- Simplify integration tasks
  - so that they can be done quickly

Simplify Mediated Schema: Keyword Search over Multiple Databases
- Novel problem
- Very useful for urgent / one-time DI needs
  - also when users are SQL-illiterate (e.g., Electronic Medical Records)
  - also on the Web (e.g., when data is tagged with some structure)
- Solution [Kite, Tech Report 06a]
  - combines IR, schema matching, data matching, and AI planning

Simplify Wrappers: Structured Queries over Text/Web Data
[Figure: a query of the form SELECT ... FROM ... WHERE ... is posed over e-mails, text, Web data, news, etc.]
- Novel problem
  - attracts attention from database / AI / Web researchers at Columbia, IBM TJ Watson/Almaden, UCLA, IIT-Bombay
- [SQOUT, Tech Report 06b], [SLIC, Tech Report 06c]

Research Directions
- Automate integration tasks
  - to minimize human labor
- Leverage users
  - to spread the cost
- Simplify integration tasks
  - so that they can be done quickly
Lessons: integration is difficult; do best-effort integration; integrate with text; should leverage human
Build on this to promote Community Information Management

Community Information Management
- Numerous communities on the Web
  - database researchers, movie fans, legal professionals, bioinformatics, etc.
  - enterprise intranets, tech support groups
- Each community = many disparate data sources + people
- Members often want to query, monitor, discover info.
  - any interesting connection between researchers X and Y?
  - list all courses that cite this paper
  - find all citations of this paper in the past one week on the Web
  - what is new in the past 24 hours in the database community?
  - which faculty candidates are interviewing this year, where?
Current integration solutions fall short of addressing such needs

Cimple Project @ Illinois/Wisconsin
- Software platform that can be rapidly deployed and customized to manage data-rich online communities
[Figure: data sources (researcher homepages, conference pages, group pages, DBworld mailing list, DBLP, Web pages, text documents) are turned into entities and relationships (e.g., Jim Gray, give-talk, SIGMOD-04), which support services: keyword search, SQL querying, question answering, browse, mining, alert/monitor, news summary.]
- Import & personalize data
- Share / aggregation
- Tag entities/relationship / create new contents
- Context-dependent services

Prototype System: DBlife
- 1164 data sources, crawled daily, 11000+ pages / day
- 160+ MB, 121400+ people mentions, 5600+ persons

Structure Related Challenges
[Figure: data sources (researcher homepages, conference pages, group pages, DBworld mailing list, DBLP, Web pages, text documents), extracted entities and relationships (e.g., Jim Gray, give-talk, SIGMOD-04), and services (keyword search, SQL querying, question answering, browse, mining, alert/monitor, news summary).]
- Extraction
  - better blackboxes, compose blackboxes, exploit domain knowledge
- Maintenance
  - critical, but very little has been done
- Exploitation
  - keyword search over extracted structure? SQL queries?
  - detect interesting events?

User Related Challenges
- Users should be able to
  - import whatever they want
  - correct/add to the imported data
  - extend the ER schema
  - create new contents for share/exchange
  - ask for context-dependent services
[Figure: ER fragment showing Jim Gray, give-talk, SIGMOD-04.]
- Examples
  - user imports a paper, system provides bib item
  - user imports a movie, adds desc, tags it for exchange
- Challenges
  - provide incentives, payment
  - handle malicious/spam users
  - share / aggregate user activities/actions/content

Comparison to Current My Web 2.0
- Cimple focuses on domain-specific communities
  - not the entire Web
- Besides page level
  - also considers finer granularities of entities / relations / attributes
  - leverages automatic “best-effort” data integration techniques
- Leverages user feedback to further improve accuracy
  - thus combines both automatic techniques and human efforts
- Considers the entire range of search + structured queries
  - and how to seamlessly move between them
- Allows personalization and sharing
  - considers context-dependent services beyond keyword search (e.g., selling, exchange)

Applying Cimple to My Web 2.0: An Example
- Going beyond just sharing Web pages
- Leveraging My Web 2.0 for other actions
  - e.g., selling, exchanging goods (turning it into a classified ads platform?)
- E.g., want to sell my house
  - create a page describing the house
  - save it to my account on My Web 2.0
  - tag it with “sell:house, sell, house, champaign, IL”
  - took me less than 5 minutes (not including creating the page)
  - now if someone searches for any of these keywords …

- Here a button can be added to facilitate the “sell” action
- provide context-dependent services

The Big Picture [Speculative Mode]
- Structured data (relational, XML)
  - Database: SQL
- Unstructured data (text, Web, email)
  - IR/Web/AI/Mining: keyword, QA
- Multitude of users
[Figure labels: Semantic Web; Industry/Real World]
- Many apps will involve all three
- Exact integration will be difficult
  - best-effort is promising
  - should leverage human
- Apps will want broad range of services
  - keyword search, SQL queries
  - buy, sell, exchange, etc.

Summary
- Data integration: crucial problem
  - at intersection of database, AI, Web, IR
- Integration @ Illinois in my group:
  - automate tasks to minimize human labor
  - leverage users to spread out the cost
  - simplify tasks so that they can be done quickly
- Best-effort integration, should leverage human
- The Cimple project @ Illinois/Wisconsin
  - builds on current work to study Community Information Management
- A step toward managing structured + text + users synergistically!
See “anhai” on Yahoo for more details