
- Number of slides: 33
Data Warehousing/Mining Comp 150 DW Semistructured Data Instructor: Dan Hebert Data Warehousing/Mining 1
Semistructured Data
- Everything that has no rigid schema
  - Schema is contained within the data (self-describing), OR
  - No separate schema, OR
  - Schema exists but places only loose constraints on data
- Emerged as an important topic for a variety of reasons
  - Many data sources, such as the WWW, that we would like to treat as databases but cannot for lack of a schema
  - Desirable to have an extremely flexible format for data exchange between disparate databases
  - May want to view structured data as semistructured data for the purpose of browsing
Motivation
- Some data really is unstructured/semistructured
  - World Wide Web
  - Data exchange formats
  - Some exotic database management systems, e.g., ACeDB, popular with biologists
- Data integration
- Browsing
Motivation - World Wide Web
- Why do we want to treat the Web as a database?
  - To maintain integrity
  - To query based on structure (as opposed to content)
  - To introduce some “organization”
- But the Web has no structure. The best we can say is that it is an enormous graph.
Motivation - Data Formats
- Much (probably most) of the world’s data is in data formats
- These are formats defined for the interchange and archiving of data
- Data formats vary in generality. ASN.1 and XDR are quite general
- Scientific data formats tend to be “fixed schemas”
- The textual representation given by data formats is sometimes not immediately translatable into a standard relational/object-oriented representation
Motivation - Data Integration
- Goal is to integrate all types of information, including unstructured information
  - Irregular or missing information, structure not fully known, dynamic schema evolution, etc.
- Traditional data models and languages are not well suited
  - Cannot accommodate heterogeneous data sets (different types and structures), etc.
  - Difficult to build software that will easily convert between two disparate models
- OEM (Object Exchange Model)
  - Semistructured data model from the TSIMMIS project at Stanford
  - Internal data structure for exchange of data between DBMSs
  - Used by other systems: e.g., Windows 95 registry, Lotus Notes
Motivation - Browsing
- To query a database one needs to understand the schema.
- However, schemas have opaque terminology, and the user may want to start by querying the data with little or no knowledge of the schema.
  - Where in the database is the string “Casablanca” to be found?
  - Are there integers in the database greater than 2^16?
  - What objects in the database have an attribute name that starts with “act”?
- While extensions to relational query languages have been proposed for such queries, there is no generic technique for interpreting them.
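The three browsing questions on this slide can be sketched over a schemaless structure as follows. This is only an illustration: the nested dict `DB`, its labels and values, and the helper `walk` are hypothetical, and the integer threshold is taken to be 2**16 as an assumption.

```python
from typing import Any, Iterator, Tuple

# Hypothetical sample data: a nested dict/list standing in for a database without a schema.
DB = {
    "Entry": [
        {"Movie": {"Title": "Casablanca", "Cast": ["Bogart", "Bergman"], "Year": 1942}},
        {"Movie": {"Title": "Sleeper", "actors": ["Allen"], "Budget": 2000000}},
    ]
}

def walk(node: Any, path: Tuple[str, ...] = ()) -> Iterator[Tuple[Tuple[str, ...], Any]]:
    """Yield (label path, atomic value) for every leaf reachable from node."""
    if isinstance(node, dict):
        for label, child in node.items():
            yield from walk(child, path + (label,))
    elif isinstance(node, list):
        # List members share the path of the list itself.
        for child in node:
            yield from walk(child, path)
    else:
        yield path, node

# Where in the database is the string "Casablanca" to be found?
where_casablanca = [p for p, v in walk(DB) if v == "Casablanca"]

# Are there integers greater than 2**16 (threshold assumed)?
big_ints = [v for _, v in walk(DB)
            if isinstance(v, int) and not isinstance(v, bool) and v > 2**16]

# Which attribute names start with "act"?
act_labels = sorted({label for p, _ in walk(DB) for label in p if label.startswith("act")})
```

None of the three queries names a table or column in advance; each one scans every path, which is exactly what a fixed-schema query language makes awkward.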
The Model
- Represent data as some kind of graph-like or tree-like model
  - Cycles are allowed, but the structures are usually referred to as trees
  - Several different approaches with minor differences (easy to convert)
    - Data on labels or edges, nodes carry information or not
- Straightforward to encode relational and object-oriented databases
  - Issue: object identity
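As a rough sketch of the claim that relational data is straightforward to encode, the function below turns a table into an edge-labeled tree. The function name `encode_relation`, the `"tuple"` edge label, and the sample rows are assumptions made for illustration. Node ids are invented by the encoder, which is where the object identity issue shows up: nothing in the relation itself supplies them.

```python
from itertools import count
from typing import Any, Dict, List, Tuple

Edge = Tuple[int, str, int]  # (source node id, edge label, target node id)

def encode_relation(name: str, rows: List[Dict[str, Any]]) -> Tuple[List[Edge], Dict[int, Any]]:
    """Encode a relation as an edge-labeled tree rooted at node 0.

    Returns the edge list plus a map from leaf node id to its atomic value.
    """
    ids = count(1)  # node 0 is reserved for the root
    edges: List[Edge] = []
    values: Dict[int, Any] = {}
    rel = next(ids)
    edges.append((0, name, rel))
    for row in rows:
        tup = next(ids)
        edges.append((rel, "tuple", tup))
        for attr, value in row.items():
            leaf = next(ids)
            edges.append((tup, attr, leaf))
            values[leaf] = value
    return edges, values

# Hypothetical sample relation.
movies = [
    {"Title": "Casablanca", "Year": 1942},
    {"Title": "Sleeper", "Year": 1973},
]
edges, values = encode_relation("Movie", movies)
```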
Querying Semistructured Data
- There are (at least) three approaches to this problem
  - Add arbitrary features to SQL or to your favorite query language
  - Find some principled approach to programs that are based on the type of the data
  - Represent the graph (or whatever the structure is) as appropriate predicates and use some variety of Datalog on that structure
The “Extend SQL” Approach
- In fact it is an attempt to extend the philosophy of OQL and comprehension syntax to these new structures
- It is the approach taken in the design of UnQL and also of Lorel
- Looks very similar to OQL (path expressions)
Example
    select Entry.Movie.Title
    from DB
    where Entry.Movie.Director ...
Syntax Issues
- Need (path) variables to tie paths and edges together
- Paths of arbitrary length
  - “Find all strings in db”
  - “Find whether “Allen” acted in “Casablanca””
  - Need regular expressions to constrain paths
- Rich set of overloadings for operators to deal with comparisons of objects with values and of values with sets
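One way to picture regular expressions over paths is to join the labels along each path with dots and match an ordinary regex against the result. This is a simplification for illustration, not the syntax of UnQL or Lorel; `leaves`, `select`, and the sample `DB` are hypothetical, and labels containing a dot would break the encoding.

```python
import re
from typing import Any, Iterator, List, Tuple

def leaves(node: Any, path: Tuple[str, ...] = ()) -> Iterator[Tuple[Tuple[str, ...], Any]]:
    """Yield (label path, atomic value) for every leaf under node."""
    if isinstance(node, dict):
        for label, child in node.items():
            yield from leaves(child, path + (label,))
    elif isinstance(node, list):
        for child in node:
            yield from leaves(child, path)
    else:
        yield path, node

def select(db: Any, path_regex: str) -> List[Any]:
    """Return the values whose dotted label path fully matches path_regex."""
    pattern = re.compile(path_regex)
    return [v for p, v in leaves(db) if pattern.fullmatch(".".join(p))]

# Hypothetical sample data with irregular structure under "Cast".
DB = {"Entry": [
    {"Movie": {"Title": "Casablanca",
               "Cast": {"Lead": "Bogart", "Support": ["Rains", "Lorre"]}}},
    {"TVShow": {"Title": "Cheers", "Cast": ["Danson"]}},
]}

# "Find all strings in db": a path of arbitrary length.
all_strings = [v for v in select(DB, r".*") if isinstance(v, str)]

# A fixed path, as in the OQL-style example.
movie_titles = select(DB, r"Entry\.Movie\.Title")

# Alternation plus an optional tail of any depth below Cast.
any_cast = select(DB, r"Entry\.(Movie|TVShow)\.Cast(\..*)?")

# "Find whether Allen acted in Casablanca": the entry plays the role of a
# path variable that ties the Title path and the Cast path together.
allen_in_casablanca = any(
    "Casablanca" in select(entry, r"Movie\.Title")
    and "Allen" in select(entry, r"Movie\.Cast(\..*)?")
    for entry in DB["Entry"]
)
```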
Underlying Computational Strategy
- Model the graph as a relational database and use a relational query language.
  - The database is one large relation (node-id, label, node-id)
  - Used by the Stanford group in LORE/LOREL
- Complications
  - Labels are from a heterogeneous set of types, so more than one relation is needed
  - Additional relations are needed if information is to be stored in nodes
  - Various navigation issues
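A minimal sketch of the (node-id, label, node-id) relation, using Python's built-in `sqlite3` module. The table names `edge` and `string_value`, the node numbering, and the sample movies are assumptions for illustration; this is not how LORE itself is implemented. A fixed path costs one self-join per step, and arbitrary-length paths need a recursive query.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# One large relation for the graph edges.
conn.execute("CREATE TABLE edge (src INTEGER, label TEXT, dst INTEGER)")
# An additional relation for information stored in nodes (strings only here).
conn.execute("CREATE TABLE string_value (node INTEGER PRIMARY KEY, val TEXT)")

conn.executemany("INSERT INTO edge VALUES (?, ?, ?)", [
    (0, "Entry", 1), (1, "Movie", 2), (2, "Title", 3), (2, "Director", 4),
    (0, "Entry", 5), (5, "Movie", 6), (6, "Title", 7), (6, "Director", 8),
])
conn.executemany("INSERT INTO string_value VALUES (?, ?)", [
    (3, "Casablanca"), (4, "Curtiz"), (7, "Sleeper"), (8, "Allen"),
])

# Titles of movies whose Entry.Movie.Director is "Allen": one join per path step.
titles = [row[0] for row in conn.execute("""
    SELECT t.val
    FROM edge e
    JOIN edge m  ON m.src  = e.dst
    JOIN edge ti ON ti.src = m.dst
    JOIN edge d  ON d.src  = m.dst
    JOIN string_value t  ON t.node  = ti.dst
    JOIN string_value dv ON dv.node = d.dst
    WHERE e.src = 0 AND e.label = 'Entry' AND m.label = 'Movie'
      AND ti.label = 'Title' AND d.label = 'Director' AND dv.val = 'Allen'
""")]

# "Find all strings in db": reachability over paths of arbitrary length.
all_strings = [row[0] for row in conn.execute("""
    WITH RECURSIVE reach(node) AS (
        SELECT 0
        UNION
        SELECT edge.dst FROM edge JOIN reach ON edge.src = reach.node
    )
    SELECT val FROM string_value JOIN reach ON string_value.node = reach.node
    ORDER BY val
""")]
```

Supporting integer or date values would mean further value relations, which is the first complication listed on the slide.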
Semistructured Data - Case Study: Object Exchange Model
OEM Features
- Common model for heterogeneous information exchange, self-describing
- Each object: (OID, Label, Type, Value)
  - OID = unique identifier or NULL
  - Label = character string descriptor
  - Type = atomic data type or set
  - Value = atomic value or set of object references
- “Help pages” for labels
- Query language OEM-QL
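The four-field object described above might be modeled along these lines. This is a sketch under assumptions: the class name `OEMObject`, the `describe` helper, the `&a1`-style identifiers, and the sample article are all hypothetical, and set members are held directly as Python references rather than as stored OIDs.

```python
from dataclasses import dataclass
from typing import List, Optional, Union

Atomic = Union[str, int, float, bool]

@dataclass
class OEMObject:
    oid: Optional[str]   # unique identifier, or None standing in for NULL
    label: str           # character string descriptor
    type: str            # an atomic type name such as "string", or "set"
    value: Union[Atomic, List["OEMObject"]]  # atomic value or set of object references

def describe(obj: OEMObject, indent: int = 0) -> List[str]:
    """Render an object and its descendants, one line per object."""
    pad = "  " * indent
    if obj.type == "set":
        lines = [f"{pad}<{obj.oid}, {obj.label}, set>"]
        for child in obj.value:
            lines.extend(describe(child, indent + 1))
        return lines
    return [f"{pad}<{obj.oid}, {obj.label}, {obj.type}, {obj.value!r}>"]

# Hypothetical sample object.
article = OEMObject("&a1", "article", "set", [
    OEMObject("&t1", "title", "string", "Semistructured Data"),
    OEMObject(None, "publication", "string", "IEEE Computer"),
])
rendered = describe(article)
```

Because every object carries its own label and type, a receiver can interpret `article` without consulting a separate schema, which is what "self-describing" means here.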
Representing Semistructured Data Using OEM
[Figure: OEM object representation, annotated with "Label" and "Memory Addresses"]
An OEM Query Language: OEM-QL
- Logic-based language for OEM
  - Match object patterns, generate variable bindings, construct new OEM objects from existing ones
- Get articles published in “IEEE Computer”
    P : P:
Semistructured Data - Case Study: WWW Extraction
Problem
- Lots of valuable information on the Web
  - irregular structure
  - highly dynamic
- Embedded in HTML
- Limited query facilities
Data Extraction Tool
- Flexible, easy to use
- Accommodate virtually any HTML source
- Interface with existing system, e.g., data warehouse, user interface for querying
[Figure: architecture diagram with components Query, World Wide Web, Extractor, Specification, WH Integrator, Data Warehouse]
Approach
- Extract Web data into OEM format
  - Query using OEM-QL
- Python-based, configurable parser
- Declarative description of HTML source
  - location of data on page
  - how to package data into OEM
- “Regular expression”-like syntax
- Human intelligence rather than A.I.
Extractor Specification
Consists of commands of the form: [ “variable(s)”, “source”, “pattern” ]
HTML Source File
| header 1 | header 2 | header 3 |
|---|---|---|
| text 1 | text 2 | text 3 |
. . .
Specification File
[ [“root”, “get('http://www.example.test/')”, “#” ],
  [“__tempvar 1”, “root”, “*
Result OEM Object . . .
Basic Syntax: Variable
- variable(l:p:t)
  - optional parameters for specification of corresponding OEM object
    - l: label name
    - t: type
    - p: parent object
- _variable
  - temporary data structure, does not appear as OEM object
Basic Syntax: Source
- split(variable, token)
  - creates a list with multiple elements using token as the element separator
- get(URL)
  - obtain contents of HTML file at address URL
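The two source functions could behave roughly as below. This is a guess at their semantics based only on the slide text, not the extractor's actual code: the UTF-8 decoding, the timeout parameter, and the dropping of empty pieces are all assumptions.

```python
from typing import List
from urllib.request import urlopen

def get(url: str, timeout: float = 10.0) -> str:
    """Fetch the page at url and return its text (UTF-8 assumed, bad bytes replaced)."""
    with urlopen(url, timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")

def split(text: str, token: str) -> List[str]:
    """Cut text at every occurrence of token, dropping empty pieces (an assumption)."""
    return [piece for piece in text.split(token) if piece]

# Splitting a table body into rows; get() is not called here to avoid network access.
rows = split("<tr>a</tr><tr>b</tr>", "</tr>")
```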
Basic Syntax: Patterns
- token1 # token2
  - match and store current input (between tokens)
- token1 * token2
  - match, don’t store current input (between tokens)
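One plausible reading of the two wildcards is a translation into ordinary regular expressions, with `#` becoming a capturing group and `*` a non-capturing skip. The functions `compile_pattern` and `extract` below are hypothetical; the non-greedy matching and the end-of-input anchor for a trailing wildcard are assumptions, not documented behavior of the extractor.

```python
import re
from typing import List, Optional

def compile_pattern(pattern: str) -> "re.Pattern":
    """Translate an extractor-style pattern into a regular expression.

    '#' matches and stores the input between tokens; '*' matches without storing.
    Everything else is treated as a literal token.
    """
    regex = ""
    for part in re.split(r"([#*])", pattern):
        if part == "#":
            regex += "(.*?)"
        elif part == "*":
            regex += ".*?"
        else:
            regex += re.escape(part)
    if pattern.endswith(("#", "*")):
        regex += r"\Z"  # a trailing wildcard runs to the end of the input
    return re.compile(regex, re.DOTALL)

def extract(text: str, pattern: str) -> Optional[List[str]]:
    """Return the stored pieces of the first match, or None if nothing matches."""
    match = compile_pattern(pattern).search(text)
    return list(match.groups()) if match else None

html = "<tr><td>text 1</td><td>text 2</td></tr>"
first_cell = extract(html, "<td>#</td>")
second_cell = extract(html, "<tr>*</td><td>#</td>")
whole_page = extract(html, "#")
no_header = extract(html, "<th>#</th>")
```

Returning None on a failed match leaves room for the caller to move on to the next input rather than abort.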
Syntactic Sugar
- Functions for extracting commonly used HTML constructs
  - extract_table(variable), pattern
  - split_table_row(variable)
  - split_table_column(variable)
  - extract_list(variable), pattern
  - split_list(variables)
Advanced Features
- Customization of output
  - structure, label names, data type, ...
- Extraction across multiple HTML pages
- Graceful recovery from parse errors
  - resume parsing using next input from source
- Multiple patterns in single command
  - follow different parse tree depending on structure in source
Sample Extraction Scenario
Extracted OEM Data OEM-QL query:
Evaluation
- Better than
  - writing programs
  - YACC, PERL, etc.
  - A.I.
- Can do better
  - GUI tool to simplify the generation of extractor specification
  - Machine learning or data mining techniques to automatically infer structure ...