Information & Data Integration Combining information from multiple autonomous sources Kambhampati & Knoblock Information Integration on the Web (MA-1) 1

The end-game: 3 options
• Have an in-class final exam
  – 5/8 2:30 pm is the designated time
• Have a take-home exam
• Make the final homework into a take-home
  – ...and have a mandatory discussion class on 5/8 2:30 pm
Also, note the change in demo schedule

Today’s Agenda
• Discuss Semtag/Seeker
• Lecture start on Information Integration

Information Integration
• Combining information from multiple autonomous information sources
  – And answering queries using the combined information
• Many variations depending on
  – The type of information sources (text? Data? Combination?)
    • Data vs. Information integration
    • Horizontal vs. Vertical integration
  – The level of eagerness of the integration
    • Ad hoc vs. Pre-mediated integration
    • Pre-mediation itself can be Warehouse vs online approaches
  – Generality of the solution
    • Mashup vs. Mediated

Information Integration as making the database repository of the web.. (The “IR” View)
• Discovering information sources (e.g. deep web modeling, schema learning, …)
• Gathering data (e.g., wrapper learning & information extraction, federated search, …)
• Cleaning data (e.g., de-duping and linking records) to form a single [virtual] database
• Querying integrated information sources (e.g. queries to views, execution of web-based queries, …)
• Data mining & analyzing integrated information (e.g., collaborative filtering/classification learning using extracted data, …)
[Side labels: Linkage; Queries]

[Figure: The “DB” View. Integration as mediation over autonomous databases. Labels: Services; Webpages; Structured data; Sensors (streaming data); Source Trust; Source Fusion/Query Planning; Mediator; Executor; Monitor; Query; Answers]

(Skeptic’s corner) Who is dying to have it? (Applications)
• WWW:
  – Comparison shopping
  – Portals integrating data from multiple sources
  – B2B, electronic marketplaces
• Science and culture:
  – Medical genetics: integrating genomic data
  – Astrophysics: monitoring events in the sky.
  – Culture: uniform access to all cultural databases produced by countries in Europe, provinces in Canada
• Enterprise data integration
  – An average company has 49 different databases and spends 35% of its IT dollars on integration efforts

(Skeptic’s corner) Is it like Expedia/Travelocity/Orbitz…
• Surprisingly, NO!
• The online travel sites don’t quite need to do data integration; they just use SABRE
  – SABRE was started in the 60’s as a joint project between American Airlines and IBM
  – It is the de facto database for most of the airline industry (who voluntarily enter their data into it)
    • There are very few airlines that buck the SABRE trend; Southwest Airlines is one (which is why many online sites don’t bother with Southwest)
• So, online travel sites really are talking to a single database (not multiple data sources)…
  – To be sure, online travel sources do have to solve a hard problem. Finding an optimal fare (even at a given time) is basically computationally intractable (not to mention the issues brought in by fluctuating fares). So don’t be so hard on yourself
    • Check out http://www.maa.org/devlin_09_02.html

Are we talking “comparison shopping” agents?
• Certainly closer to the aims of these
• But:
  – Wider focus
    • Consider larger range of databases
    • Consider services
  – Implies more challenges
    • “Warehousing” may not work
    • Manual source characterization/integration won’t scale up

4/26
Information Integration – 2
Focus on Data Integration

[Figure: taxonomy of Information Integration. Labels: Text Integration; Data Integration; Soft-Joins; Collection Selection; Data aggregation (vertical integration); Data Linking (horizontal integration)]

Different “Integration” scenarios
• “Data Aggregation” (Vertical)
  – All sources export (parts of a) single relation
    • No need for joins etc
    • Could be warehouse or virtual
  – E.g. BibFinder, Junglee, Employeds etc
  – Challenges: Schema mapping; data overlap
• “Data Linking” (Horizontal)
  – Joins over multiple relations stored in multiple DBs
    • E.g. Softjoins in WHIRL
    • Ted Kennedy episode
  – Challenges: record linkage over text fields (object mapping); query reformulation
• “Collection Selection”
  – All sources export text documents
  – E.g. meta-crawler etc.
  – Challenges: Similarity definition; relevance handling
• All together (vertical & horizontal)
  – Many interesting research issues
  – ...but few actual fielded systems

Collection Selection

Collection Selection/Meta Search Introduction
• Metasearch Engine
  – A system that provides unified access to multiple existing search engines.
• Metasearch Engine Components
  – Database Selector
    • Identifying potentially useful databases for each user query
  – Document Selector
    • Identifying potentially useful documents returned from selected databases
  – Query Dispatcher and Result Merger
    • Ranking the selected documents

[Figure: metasearch pipeline. Stages: Collection Selection; Query Execution; Results Merging. Example collections: WSJ, WP, FT, CNN, NYT]

Evaluating collection selection
• Let c1..cj be the collections that are chosen to be accessed for the query Q. Let d1…dk be the top documents returned from these collections.
• We compare these results to the results that would have been returned from a central union database
  – Ground Truth: The ranking of documents that the retrieval technique (say vector space or Jaccard similarity) would have retrieved from a central union database that is the union of all the collections
• Compare precision of the documents returned by accessing
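The comparison against the central union database can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names (`jaccard`, `top_k`, `selection_precision`) and the toy collections are hypothetical, and Jaccard similarity over word sets stands in for whatever retrieval technique is actually used.

```python
def jaccard(query, doc):
    """Jaccard similarity between the word sets of a query and a document."""
    q, d = set(query.lower().split()), set(doc.lower().split())
    return len(q & d) / len(q | d) if q | d else 0.0


def top_k(query, docs, k):
    """Rank documents by Jaccard similarity to the query and keep the top k."""
    ranked = sorted(docs, key=lambda doc: (-jaccard(query, doc), doc))
    return ranked[:k]


def selection_precision(query, collections, chosen, k):
    """Fraction of the top-k from the chosen collections that also appear in
    the top-k of the central union of all collections (the ground truth)."""
    union = sorted({doc for docs in collections.values() for doc in docs})
    truth = set(top_k(query, union, k))
    selected = sorted({doc for name in chosen for doc in collections[name]})
    returned = top_k(query, selected, k)
    return sum(doc in truth for doc in returned) / len(returned)


# Toy collections (made up for illustration).
collections = {
    "c1": ["data integration on the web", "wrapper learning for web sources"],
    "c2": ["baseball scores today", "web data integration systems"],
    "c3": ["cooking with garlic", "travel fares and airlines"],
}
p_good = selection_precision("web data integration", collections, ["c1", "c2"], k=2)
p_bad = selection_precision("web data integration", collections, ["c3"], k=2)
```

Choosing c1 and c2 recovers both ground-truth documents, while choosing c3 recovers none, so the measure separates a good selection from a poor one.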

General Scheme & Challenges
• Get a representative of each of the databases
  – Representative is a sample of files from the database
    • Challenge: get an unbiased sample when you can only access the database through queries.
• Compare the query to the representatives to judge the relevance of a database
  – Coarse approach: Convert the representative files into a single file (super-document). Take the (vector) similarity between the query and the super-document of a database to judge that database’s relevance
  – Finer approach: Keep the representative as a mini-database. Union the mini-databases to get a central mini-database. Apply the query to the central mini-database and find the top k answers from it. Decide the relevance of each database based on which of the answers came from which database’s representative
    • You can use an estimate of the size of the database too
  – What about overlap between collections? (See ROSCO paper)

Uniform Probing for Content Summary Construction
• Automatic extraction of document frequency statistics from uncooperative databases
  – [Callan and Connell TOIS 2001], [Callan et al. SIGMOD 1999]
• Main Ideas
  – Pick a word and send it as a query to database D
    • RandomSampling-OtherResource (RS-Ord): from a dictionary
    • RandomSampling-LearnedResource (RS-Lrd): from retrieved documents
  – Retrieve the top-K documents returned
  – If the number of retrieved documents exceeds a threshold T, stop; otherwise restart at the beginning
  – k=4, T=300
  – Compute the sample document frequency for each word that appeared in a retrieved document.
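The probing loop can be sketched as follows, in the RS-Lrd flavor where later probe words are drawn from documents already retrieved. Everything here is an assumption made for illustration: `search` is a hypothetical stand-in for a remote search interface, `max_probes` is an added safety bound, and the loop stops once the sample reaches the threshold.

```python
import random
from collections import Counter


def search(database, word, k):
    """Stand-in for a remote search interface: first k documents containing word."""
    return [doc for doc in database if word in doc.split()][:k]


def sample_database(database, seed_words, k=4, threshold=300, max_probes=1000, seed=0):
    """Query-based sampling: probe with one word at a time until enough
    distinct documents have been collected (or max_probes is exhausted)."""
    rng = random.Random(seed)
    sample = set()                    # distinct documents retrieved so far
    vocabulary = list(seed_words)     # candidate probe words
    for _ in range(max_probes):
        if len(sample) >= threshold:
            break
        word = rng.choice(vocabulary)
        for doc in search(database, word, k):
            if doc not in sample:
                sample.add(doc)
                vocabulary.extend(doc.split())   # learn new probe words (RS-Lrd)
    # Sample document frequency: number of sampled documents containing each word.
    df = Counter(w for doc in sample for w in set(doc.split()))
    return sample, df


# Toy database (made up for illustration).
db = [
    "web data integration",
    "data warehouse design",
    "warehouse query planning",
    "query execution on the web",
    "planning under uncertainty",
]
sample, df = sample_database(db, ["web"], threshold=3)
```

The RS-Ord variant would skip the `vocabulary.extend` step and keep drawing probe words from the fixed dictionary.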

CORI Net Approach (Representative as a super document)
• Representative Statistics
  – The document frequency for each term and each database
  – The database frequency for each term
• Main Ideas
  – Visualize the representative of a database as a super document, and the set of all representatives as a database of super documents
  – Document frequency becomes term frequency in the super document, and database frequency becomes document frequency in the super database
  – Ranking scores can be computed using a similarity function such as the Cosine function
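A minimal sketch of the super-document idea, using a plain cosine over tf-idf style weights. The function name `rank_databases`, the particular idf formula, and the toy statistics are assumptions for illustration; this is not the scoring formula of the CORI system itself.

```python
import math


def rank_databases(query, doc_freq):
    """doc_freq maps database name -> {term: document frequency in that database}.
    Each database is one super document: its term frequency is the document
    frequency, and idf is derived from the database frequency of the term."""
    n_dbs = len(doc_freq)
    db_freq = {}
    for stats in doc_freq.values():
        for term in stats:
            db_freq[term] = db_freq.get(term, 0) + 1

    def idf(term):
        return math.log(1 + n_dbs / db_freq[term]) if term in db_freq else 0.0

    def weights(stats):
        return {t: f * idf(t) for t, f in stats.items()}

    q_terms = query.lower().split()
    q_vec = weights({t: q_terms.count(t) for t in set(q_terms)})
    q_norm = math.sqrt(sum(w * w for w in q_vec.values()))
    scores = {}
    for name, stats in doc_freq.items():
        d_vec = weights(stats)
        dot = sum(w * d_vec.get(t, 0.0) for t, w in q_vec.items())
        norm = q_norm * math.sqrt(sum(w * w for w in d_vec.values()))
        scores[name] = dot / norm if norm else 0.0
    return sorted(scores, key=lambda name: (-scores[name], name))


# Toy representative statistics (made up for illustration).
doc_freq = {
    "news": {"election": 40, "market": 25, "movie": 5},
    "film": {"movie": 60, "director": 30, "comedy": 20},
    "finance": {"market": 70, "stock": 50},
}
ranking = rank_databases("comedy movie", doc_freq)
```

For the query "comedy movie" the film database ranks first, the news database second (it mentions movies occasionally), and the finance database last.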

ReDDE Approach (Representative as a mini-collection)
• Use the representatives as mini-collections
• Construct a union-representative that is the union of the mini-collections (such that each document keeps information on which collection it was sampled from)
• Send the query first to the union-collection, get the top-k ranked results
  – See which of the results in the top-k came from which mini-collection. The collections are ranked in terms of how much their mini-collections contributed to the top-k answers of the query.
  – Scale the number of returned results by the expected size of the actual collection
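The steps above can be sketched as follows. This is a simplified illustration rather than the published ReDDE estimator: the name `redde_rank`, the word-overlap scoring, and the toy samples and size estimates are all assumptions.

```python
from collections import Counter


def redde_rank(query, samples, est_sizes, k=3):
    """samples: collection name -> list of sampled documents (the mini-collection).
    est_sizes: collection name -> estimated size of the full collection."""
    q = set(query.lower().split())
    # Union representative: every sampled document remembers its origin.
    union = [(doc, name) for name, docs in samples.items() for doc in docs]

    def score(doc):
        return len(q & set(doc.lower().split()))

    top = sorted(union, key=lambda pair: (-score(pair[0]), pair[0]))[:k]
    hits = Counter(name for doc, name in top if score(doc) > 0)
    # Scale each collection's hits by (estimated size / sample size).
    scaled = {name: hits[name] * est_sizes[name] / len(samples[name])
              for name in samples}
    return sorted(scaled, key=lambda name: (-scaled[name], name)), scaled


# Toy samples and size estimates (made up for illustration).
samples = {
    "A": ["web data integration", "web search engines"],
    "B": ["data integration systems", "cooking recipes"],
    "C": ["football scores", "tennis results"],
}
est_sizes = {"A": 1000, "B": 10000, "C": 5000}
ranking, scaled = redde_rank("data integration", samples, est_sizes, k=3)
```

A and B each contribute one relevant sampled document, but B is estimated to be ten times larger, so after scaling it is expected to hold more relevant documents and is ranked first.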

Data Integration

Models for Integration
Modified from Alon Halevy’s slides

Solutions for small-scale integration
• Mostly ad-hoc programming: create a special solution for every case; pay consultants a lot of money.
• Data warehousing: load all the data periodically into a warehouse. (Junglee did this, for employment classifieds)
  – 6-18 months lead time
  – Separates operational DBMS from decision support DBMS. (not only a solution to data integration).
  – Performance is good; data may not be fresh.
  – Need to clean, scrub your data.

The Virtual Integration Architecture
• Leave the data in the sources.
• When a query comes in:
  – Determine the relevant sources to the query
  – Break down the query into sub-queries for the sources.
  – Get the answers from the sources, and combine them appropriately.
• Data is fresh. Approach scalable
• Issues:
  – Relating Sources & Mediator
  – Reformulating the query
  – Efficient planning & execution
Garlic [IBM]; Hermes [UMD]; TSIMMIS, InfoMaster [Stanford]; DISCO [INRIA]; Information Manifold [AT&T]; SIMS/Ariadne [USC]; Emerac/Havasu [ASU]
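A toy mediator loop makes the query-time steps concrete. It is a deliberately small sketch: the source descriptions (`attrs`, `fetch`), the function names, and the data are hypothetical, and it only handles selection queries where one source can supply all requested attributes (no joins across sources).

```python
def relevant_sources(sources, wanted_attrs):
    """A source is relevant if it exports every attribute the query asks for."""
    return [s for s in sources if wanted_attrs <= set(s["attrs"])]


def answer(sources, wanted_attrs, predicate):
    """Send the same selection to each relevant source and union the answers."""
    results = set()
    for source in relevant_sources(sources, set(wanted_attrs)):
        for row in source["fetch"]():      # data stays at the source until query time
            if predicate(row):
                results.add(tuple(row[a] for a in wanted_attrs))
    return sorted(results)


# Hypothetical sources; each "fetch" stands in for a call to a remote source.
sources = [
    {"name": "S1", "attrs": ["title", "year"],
     "fetch": lambda: [{"title": "Annie Hall", "year": 1977},
                       {"title": "Psycho", "year": 1960}]},
    {"name": "S2", "attrs": ["title", "year", "genre"],
     "fetch": lambda: [{"title": "Annie Hall", "year": 1977, "genre": "Comedy"},
                       {"title": "Alien", "year": 1979, "genre": "Horror"}]},
    {"name": "S3", "attrs": ["cinema", "title"],
     "fetch": lambda: [{"cinema": "Roxy", "title": "Alien"}]},
]
answers = answer(sources, ["title", "year"], lambda row: row["year"] > 1970)
```

S3 is pruned because it does not export `year`, and the overlapping answer from S1 and S2 is returned only once.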

Desiderata for Relating Source-Mediator Schemas
• Expressive power: distinguish between sources with closely related data. Hence, be able to prune access to irrelevant sources.
• Easy addition: make it easy to add new data sources.
• Reformulation: be able to reformulate a user query into a query on the sources efficiently and effectively.
• Nonlossy: be able to handle all queries that can be answered by directly accessing the sources

(Skeptic’s corner) Why isn’t this just Databases / Distributed Databases?
• No common schema
  – Sources with heterogeneous schemas (and ontologies)
  – Semi-structured sources
• Legacy Sources
  – Not relational-complete
  – Variety of access/process limitations
• Autonomous sources
  – No central administration
  – Uncontrolled source content overlap
• Unpredictable run-time behavior
  – Makes query execution hard
• Predominantly “Read-only”
  – Could be a blessing: less worry about transaction management
  – (although the push now is to also support transactions on web)

Approaches for relating source & Mediator Schemas
• Global-as-view (GAV): express the mediated schema relations as a set of views over the data source relations
• Local-as-view (LAV): express the source relations as views over the mediated schema.
• Can be combined…?
(Callouts: “View” Refresher; Virtual vs Materialized; Differences minor for data aggregation…)
Let’s compare them in a movie database integration scenario..

Global-as-View
Express mediator schema relations as views over source relations.
Mediated schema: Movie(title, dir, year, genre), Schedule(cinema, title, time).

  Create View Movie AS
    select * from S1                   [S1(title, dir, year, genre)]
    union
    select * from S2                   [S2(title, dir, year, genre)]
    union                              [S3(title, dir), S4(title, year, genre)]
    select S3.title, S3.dir, S4.year, S4.genre
    from S3, S4
    where S3.title = S4.title

Mediator schema relations are virtual views on source relations.
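The view definition above can be tried out directly with Python's built-in `sqlite3` module. The rows are made up for illustration, and the four source relations are modeled as local tables, which a real mediator would not have.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE S1(title, dir, year, genre);
CREATE TABLE S2(title, dir, year, genre);
CREATE TABLE S3(title, dir);
CREATE TABLE S4(title, year, genre);
INSERT INTO S1 VALUES ('Annie Hall', 'Allen', 1977, 'Comedy');
INSERT INTO S2 VALUES ('Alien', 'Scott', 1979, 'Horror');
INSERT INTO S3 VALUES ('Psycho', 'Hitchcock');
INSERT INTO S4 VALUES ('Psycho', 1960, 'Thriller');

-- GAV: the mediated relation Movie is a virtual view over the sources.
CREATE VIEW Movie AS
  SELECT * FROM S1
  UNION
  SELECT * FROM S2
  UNION
  SELECT S3.title, S3.dir, S4.year, S4.genre
  FROM S3, S4 WHERE S3.title = S4.title;
""")

# A user query on the mediated schema; the engine unfolds the view for us.
rows = conn.execute(
    "SELECT title, dir FROM Movie WHERE year < 1978 ORDER BY title"
).fetchall()
```

Query reformulation here is just view unfolding: the query on `Movie` is rewritten into queries on S1..S4 by substituting the view definition.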

Local-as-View: example 1
Express source schema relations as views over mediator relations.
Mediated schema: Movie(title, dir, year, genre), Schedule(cinema, title, time).

  Create Source S1 AS                  [S1(title, dir, year, genre)]
    select * from Movie

  Create Source S3 AS                  [S3(title, dir)]
    select title, dir from Movie

  Create Source S5 AS                  [S5(title, dir, year), year > 1960]
    select title, dir, year from Movie
    where year > 1960 AND genre = “Comedy”

Sources are “materialized views” of mediator schema.
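With LAV, the mediator has to decide which source views can soundly answer a query on the mediated schema. The sketch below illustrates this for the three sources above under strong simplifying assumptions: only single-relation queries on Movie, only equality constraints in the source descriptions (S5's `year > 1960` condition is ignored), and hypothetical names throughout. It is not a general answering-queries-using-views algorithm.

```python
# Each source is described as a view over the mediated relation Movie:
# the attributes it exports, plus equality constraints known to hold for its tuples.
SOURCES = {
    "S1": {"attrs": {"title", "dir", "year", "genre"}, "known": {}},
    "S3": {"attrs": {"title", "dir"}, "known": {}},
    "S5": {"attrs": {"title", "dir", "year"}, "known": {"genre": "Comedy"}},
}


def usable_sources(select_attrs, equals, filter_attrs):
    """Sources that can soundly answer
         SELECT select_attrs FROM Movie WHERE attr = value (for each pair in equals)
       plus further filters over filter_attrs evaluated at the mediator.
    Returns source name -> equality checks the mediator must still apply."""
    plans = {}
    needed = set(select_attrs) | set(filter_attrs)
    for name, desc in SOURCES.items():
        remaining, ok = {}, True
        for attr, value in equals.items():
            if attr in desc["known"]:
                if desc["known"][attr] != value:
                    ok = False            # the description contradicts the query
            elif attr in desc["attrs"]:
                remaining[attr] = value   # exported, so it can be checked later
            else:
                ok = False                # the condition cannot be verified
        if ok and needed <= desc["attrs"]:
            plans[name] = remaining
    return plans


# Titles and directors of comedies made after 1970 (the year filter is applied
# by the mediator, so the source must export year).
plans = usable_sources(["title", "dir"], {"genre": "Comedy"}, ["year"])
```

S5 is usable even though it does not export `genre`, because its description guarantees that all its tuples are comedies; S3 is not, because nothing about genre or year can be verified for its tuples.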

GAV vs. LAV
Mediated schema: Movie(title, dir, year, genre), Schedule(cinema, title, time).
Source S4: S4(cinema, genre)
Lossy mediation
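One way to read the "lossy mediation" remark is sketched below with `sqlite3`. It assumes (this is not stated on the slide) that under GAV the source S4 is mapped into the mediated relations by padding the attributes it does not export with NULL; the rows are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE S4(cinema, genre);
INSERT INTO S4 VALUES ('Roxy', 'Comedy'), ('Rialto', 'Horror');

-- Assumed GAV mapping: each mediated relation is defined on its own, so the
-- attributes S4 does not export are padded with NULL.
CREATE VIEW Movie AS
  SELECT NULL AS title, NULL AS dir, NULL AS year, genre FROM S4;
CREATE VIEW Schedule AS
  SELECT cinema, NULL AS title, NULL AS time FROM S4;
""")

# "Which cinemas are playing comedies?" asked on the mediated schema.
via_gav = conn.execute("""
    SELECT s.cinema FROM Movie m, Schedule s
    WHERE m.title = s.title AND m.genre = 'Comedy'
""").fetchall()

# The same question asked of the source directly.
direct = conn.execute("SELECT cinema FROM S4 WHERE genre = 'Comedy'").fetchall()
```

The join on the NULL-padded `title` returns nothing, although S4 itself knows the answer: the cinema-genre association was lost in the mapping. Under LAV, S4 could instead be described as a view that joins Movie and Schedule on title and projects cinema and genre, which keeps the association.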

GAV vs. LAV
GAV
• Not modular
  – Addition of new sources changes the mediated schema
• Can be awkward to write mediated schema without loss of information
• Query reformulation easy
  – Reduces to view unfolding (polynomial)
  – Can build hierarchies of mediated schemas
• Best when
  – Few, stable data sources
  – Well-known to the mediator (e.g. corporate integration)
• Garlic, TSIMMIS, HERMES
LAV
• Modular: adding new sources is easy
• Very flexible: power of the entire query language available to describe sources
• Reformulation is hard
  – Involves answering queries only using views (can be intractable; see below)
• Best when
  – Many, relatively unknown data sources
  – Possibility of addition/deletion of sources
• Information Manifold, InfoMaster, Emerac, Havasu

Extremes of Laziness in Data Integration
• Fully Query-time II (blue sky for now)
  – Get a query from the user on the mediator schema
  – Go “discover” relevant data sources
  – Figure out their “schemas”
  – Map the schemas on to the mediator schema
  – Reformulate the user query into data source queries
  – Optimize and execute the queries
  – Return the answers
• Fully pre-fixed II
  – Decide on the only query you want to support
  – Write a (java)script that supports the query by accessing specific (predetermined) sources, piping results (through known APIs) to specific other sources
    • Examples include Google Map Mashups
• (Most interesting action is “in between”)
  – E.g. we may start with known sources and their known schemas, do hand-mapping and support automated reformulation and optimization

[Figure: mediator architecture. Sources: Services; Webpages; Structured data; Sensors (streaming data). Components: Source Fusion/Query Planning; Executor; Monitor. Inputs: Query; Preference/Utility Model; Probing Queries; Ontologies; Source/Service Descriptions; Source Trust. Flow labels: Source Calls; Answers; Updating Statistics; Replanning Requests; DWIM]
• Source Fusion/Query Planning needs to handle: multiple objectives, service composition, source quality & overlap
• Executor needs to handle: source/network interruptions, runtime uncertainty, replanning
• User queries refer to the mediated schema.
• Data is stored in the sources in a local schema.
• Content descriptions provide the semantic mappings between the different schemas.
• Mediator uses the descriptions to translate user queries into queries on the sources.

Dimensions to Consider
• How many sources are we accessing?
• How autonomous are they?
• Can we get meta-data about sources?
• Is the data structured?
  – Discussion about soft-joins. See next slide
• Supporting just queries or also updates?
• Requirements: accuracy, completeness, performance, handling inconsistencies.
• Closed world assumption vs. open world?
  – See next slide

Soft Joins.. WHIRL [Cohen]
• We can extend the notion of Joins to “Similarity Joins” where similarity is measured in terms of vector similarity over the text attributes. So, the join tuples are output in a ranked form, with the rank proportional to the similarity
• Neat idea… but does have some implementation difficulties
  – Most tuples in the cross-product will have non-zero similarities. So, need query processing that will somehow just produce highly ranked tuples
• Also other similarity/distance metrics may be used
  – E.g. Edit distance
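A naive similarity join can be sketched as below. It scores the whole cross-product, which is exactly the cost a real implementation would need to avoid, and it uses raw term-frequency cosine rather than the TF-IDF weighting a system like WHIRL would use. The function names and the example strings are made up for illustration.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between the term-frequency vectors of two strings."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[t] * vb[t] for t in va)
    norm = (math.sqrt(sum(v * v for v in va.values()))
            * math.sqrt(sum(v * v for v in vb.values())))
    return dot / norm if norm else 0.0


def soft_join(left, right, top_k):
    """Score every pair in the cross-product and return the top_k pairs
    with non-zero similarity, best first."""
    pairs = [(cosine(l, r), l, r) for l in left for r in right]
    pairs = [p for p in pairs if p[0] > 0]
    return sorted(pairs, key=lambda p: (-p[0], p[1], p[2]))[:top_k]


left = ["Edward M Kennedy", "John F Kerry"]
right = ["Ted Kennedy", "John Kerry", "Edward Norton"]
ranked = soft_join(left, right, top_k=10)
```

The pair ("John F Kerry", "John Kerry") ranks first because two of its tokens match, while "Edward M Kennedy" ties between "Ted Kennedy" and "Edward Norton" on a single shared token. That tie is the kind of ambiguity that makes record linkage over text fields hard.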

Source Descriptions
• Contains all meta-information about the sources:
  – Logical source contents (books, new cars).
  – Source capabilities (can answer SQL queries)
  – Source completeness (has all books).
  – Physical properties of source and network.
  – Statistics about the data (like in an RDBMS)
  – Source reliability
  – Mirror sources
  – Update frequency.

Source Access
• How do we get the “tuples”?
  – Many sources give “unstructured” output
    • Some inherently unstructured; while others “englishify” their database-style output
  – Need to (un)Wrap the output from the sources to get tuples
  – “Wrapper building”/Information Extraction
  – Can be done manually/semi-manually

Source Fusion/Query Planning
• Accepts user query and generates a plan for accessing sources to answer the query
  – Needs to handle tradeoffs between cost and coverage
  – Needs to handle source access limitations
  – Needs to reason about the source quality/reputation

Monitoring/Execution
• Takes the query plan and executes it on the sources
  – Needs to handle source latency
  – Needs to handle transient/short-term network outages
  – Needs to handle source access limitations
  – May need to re-schedule or re-plan