Data Warehousing/Mining, Comp 150 DW: Semistructured Data

Number of slides: 33

Data Warehousing/Mining
Comp 150 DW
Semistructured Data
Instructor: Dan Hebert
Data Warehousing/Mining 1

Semistructured Data
• Everything that has no rigid schema
  – Schema is contained within the data (self-describing), OR
  – No separate schema, OR
  – Schema exists but places only loose constraints on data
• Emerged as an important topic for a variety of reasons
  – Many data sources, like the WWW, that we would like to treat as databases but cannot for lack of a schema
  – Desirable to have an extremely flexible format for data exchange between disparate databases
  – May want to view structured data as semistructured data for the purpose of browsing

Motivation
• Some data really is unstructured/semistructured
  – World Wide Web
  – Data exchange formats
  – Some exotic database management systems, e.g., ACeDB, popular with biologists
• Data integration
• Browsing

Motivation - World Wide Web
• Why do we want to treat the Web as a database?
  – To maintain integrity
  – To query based on structure (as opposed to content)
  – To introduce some “organization”
• But the Web has no structure. The best we can say is that it is an enormous graph.

Motivation - Data Formats
• Much (probably most) of the world’s data is in data formats
• These are formats defined for the interchange and archiving of data
• Data formats vary in generality. ASN.1 and XDR are quite general
• Scientific data formats tend to be “fixed schemas”
• The textual representation given by data formats is sometimes not immediately translatable into a standard relational/object-oriented representation

Motivation - Data Integration
• Goal is to integrate all types of information, including unstructured information
  – Irregular, missing information, structure not fully known, dynamic schema evolution, etc.
• Traditional data models and languages not well suited
  – Cannot accommodate heterogeneous data sets (different types and structures), etc.
  – Difficult to build software that will easily convert between two disparate models
• OEM (Object Exchange Model)
  – Semistructured data model from TSIMMIS project at Stanford
  – Internal data structure for exchange of data between DBMSs
  – Used by other systems: e.g., Windows 95 registry, Lotus Notes

Motivation - Browsing
• To query a database one needs to understand the schema.
• However, schemas have opaque terminology, and the user may want to start by querying the data with little or no knowledge of the schema.
  – Where in the database is the string “Casablanca” to be found?
  – Are there integers in the database greater than 2^16?
  – What objects in the database have an attribute name that starts with “act”?
• While extensions to relational query languages have been proposed for such queries, there is no generic technique for interpreting them.
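The three browsing questions on this slide can be answered without knowing a schema by walking the data itself. The following is a minimal Python sketch, assuming the database is held in memory as nested dicts and lists; the sample data and the `walk` helper are illustrative and not part of the course material.

```python
def walk(node, path=()):
    """Yield (path, value) for every atomic value reachable from node."""
    if isinstance(node, dict):
        for label, child in node.items():
            yield from walk(child, path + (label,))
    elif isinstance(node, list):
        for index, child in enumerate(node):
            yield from walk(child, path + (index,))
    else:
        yield path, node


# Hypothetical sample data with an irregular structure.
db = {
    "entry": [
        {"movie": {"title": "Casablanca", "year": 1942}},
        {"tv_show": {"title": "Example Show", "actors": ["Allen"], "viewers": 70000}},
    ]
}

# Where in the database is the string "Casablanca" to be found?
where_casablanca = [path for path, value in walk(db) if value == "Casablanca"]

# Are there integers in the database greater than 2^16?
big_ints = [
    value for _, value in walk(db)
    if isinstance(value, int) and not isinstance(value, bool) and value > 2**16
]

# Which attribute names start with "act"?
act_labels = sorted({
    label for path, _ in walk(db) for label in path
    if isinstance(label, str) and label.startswith("act")
})
```

The traversal treats labels as ordinary data, which is the point of the slide: the query inspects structure and content together instead of relying on a predeclared schema.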

The Model
• Represent data as some kind of graph-like or tree-like model
  – Cycles are allowed, but the structures are usually still referred to as trees
  – Several different approaches with minor differences (easy to convert): data on labels or edges, nodes carry information or not
• Straightforward to encode relational and object-oriented databases
  – Issue: object identity
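To make the encoding of a relational database concrete, here is a small Python sketch that turns a relation into an edge-labeled tree. It assumes one particular variant of the model (labels on edges, atomic values stored at leaf nodes); the function name `encode_relation` and the sample rows are hypothetical.

```python
from itertools import count


def encode_relation(name, rows):
    """Encode a relation (a list of dicts) as labeled edges plus leaf values.

    Returns (edges, values): edges are (source, label, target) triples and
    values maps each leaf node id to its atomic value. Node 0 is the root.
    """
    ids = count(1)
    edges, values = [], {}
    relation_node = next(ids)
    edges.append((0, name, relation_node))
    for row in rows:
        tuple_node = next(ids)
        edges.append((relation_node, "tuple", tuple_node))
        for attribute, value in row.items():
            leaf = next(ids)
            edges.append((tuple_node, attribute, leaf))
            values[leaf] = value
    return edges, values


edges, values = encode_relation("movies", [
    {"title": "Casablanca", "year": 1942},
    {"title": "Annie Hall"},  # a missing attribute is simply absent
])
```

The node ids here are generated during encoding and carry no meaning of their own, which is one way the object identity issue shows up: two encodings of the same relation need not agree on ids.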

Querying Semistructured Data
• There are (at least) three approaches to this problem
  – Add arbitrary features to SQL or to your favorite query language
  – Find some principled approach to programs that are based on the type of the data
  – Represent the graph (or whatever the structure is) as appropriate predicates and use some variety of datalog on that structure

The “Extend SQL” Approach
• In fact it is an attempt to extend the philosophy of OQL and comprehension syntax to these new structures
• It is the approach taken in the design of UnQL and also of Lorel
• Looks very similar to OQL (path expressions)

Example
select Entry.Movie.Title
from DB
where Entry.Movie.Director ...

Syntax Issues
• Need (path) variables to tie paths and edges together
• Paths of arbitrary length
  – “Find all strings in db”
  – “Find whether ‘Allen’ acted in ‘Casablanca’”
  – Need regular expressions to constrain paths
• Rich set of overloadings for operators to deal with comparisons of objects with values and of values with sets
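A regular expression over edge labels is one way to express paths of arbitrary length. The sketch below is a simplified illustration in Python, not the syntax of UnQL or Lorel: the graph, the dotted path notation, and the helpers `match_paths` and `acted_in` are all assumptions made for the example. Leaf values are represented directly as edge targets for brevity.

```python
import re

# Hypothetical edge-labeled graph: node -> list of (label, target).
graph = {
    "db": [("entry", "e1"), ("entry", "e2")],
    "e1": [("movie", "m1")],
    "e2": [("movie", "m2")],
    "m1": [("title", "Casablanca"), ("cast", "c1")],
    "m2": [("title", "Sleeper"), ("cast", "c2"), ("director", "Allen")],
    "c1": [("actor", "Bogart"), ("actor", "Bergman")],
    "c2": [("actor", "Allen")],
}


def match_paths(graph, start, pattern):
    """Return (path, end) pairs whose dotted label path fully matches pattern.

    Only simple paths are explored, so cycles cannot cause endless loops.
    """
    regex = re.compile(pattern)
    results = []

    def visit(node, labels, seen):
        if labels and regex.fullmatch(".".join(labels)):
            results.append((".".join(labels), node))
        for label, target in graph.get(node, []):
            if target not in seen:
                visit(target, labels + [label], seen | {target})

    visit(start, [], {start})
    return results


def acted_in(graph, actor, title):
    """Tie two paths together through a shared movie node (a path variable)."""
    movies = [
        node for _, node in match_paths(graph, "db", r"entry\.movie")
        if ("title", title) in graph.get(node, [])
    ]
    return any(
        end == actor
        for movie in movies
        for _, end in match_paths(graph, movie, r"(\w+\.)*actor")
    )


# "Find all strings in db": any path that ends at a leaf.
all_strings = sorted({end for _, end in match_paths(graph, "db", r".*") if end not in graph})
```

In `acted_in`, the movie node plays the role of the path variable: the title test and the actor search must both start from the same node, which is why a plain regular expression over the whole path is not enough.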

Underlying Computational Strategy
• Model graph as a relational database and use relational query language.
  – Database is one large relation (node-id, label, node-id)
  – Used by Stanford group in LORE/LOREL
• Complications
  – Labels are from heterogeneous set of types, need more than one relation
  – Additional relations if info to be stored in nodes
  – Various navigation issues
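The (node-id, label, node-id) relation can be tried out with Python's built-in `sqlite3` module. This is a sketch under assumptions, not a description of how LORE stores data: the table names `edge` and `str_value`, the sample rows, and the choice of a separate relation for string values are all illustrative.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# One large relation for the graph, plus an extra relation for values held at nodes.
conn.execute("CREATE TABLE edge (src INTEGER, label TEXT, dst INTEGER)")
conn.execute("CREATE TABLE str_value (node INTEGER PRIMARY KEY, value TEXT)")
conn.executemany("INSERT INTO edge VALUES (?, ?, ?)", [
    (0, "entry", 1), (1, "movie", 2), (2, "title", 3), (2, "director", 4),
    (0, "entry", 5), (5, "movie", 6), (6, "title", 7),
])
conn.executemany("INSERT INTO str_value VALUES (?, ?)", [
    (3, "Casablanca"), (4, "Curtiz"), (7, "Sleeper"),
])

# A fixed-length path (entry.movie.title) becomes a chain of self-joins.
titles = [row[0] for row in conn.execute("""
    SELECT v.value
    FROM edge e1
    JOIN edge e2 ON e2.src = e1.dst
    JOIN edge e3 ON e3.src = e2.dst
    JOIN str_value v ON v.node = e3.dst
    WHERE e1.src = 0 AND e1.label = 'entry'
      AND e2.label = 'movie' AND e3.label = 'title'
    ORDER BY v.value
""")]

# A path of arbitrary length needs recursion; UNION removes duplicates,
# so the query also terminates on graphs with cycles.
all_strings = [row[0] for row in conn.execute("""
    WITH RECURSIVE reach(node) AS (
        SELECT 0
        UNION
        SELECT e.dst FROM edge e JOIN reach r ON e.src = r.node
    )
    SELECT v.value
    FROM reach JOIN str_value v ON v.node = reach.node
    ORDER BY v.value
""")]
```

The two queries illustrate the navigation issue noted on the slide: every extra step in a path costs another self-join, and unbounded paths fall outside plain select-project-join queries.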

Semistructured Data - Case Study: Object Exchange Model

OEM Features
• Common model for heterogeneous information exchange, self-describing
• Each object: OID, Label, Type, Value
  – OID = unique identifier or NULL
  – Label = character string descriptor
  – Type = atomic data type or set
  – Value = atomic value or set of object references
• “Help pages” for labels
• Query language OEM-QL
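The four-field object described here maps naturally onto a small record type. The Python sketch below is an illustration only: the class name `OEMObject`, the `&`-prefixed identifiers, and the `render` output format are assumptions, not the TSIMMIS implementation.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Union


@dataclass
class OEMObject:
    oid: Optional[str]  # unique identifier, or None for NULL
    label: str          # character string descriptor
    type: str           # an atomic type name such as "string", or "set"
    value: Union[str, int, float, List[str]]  # atomic value, or list of OIDs


def children(obj: OEMObject, store: Dict[str, OEMObject]) -> List[OEMObject]:
    """Resolve the object references held by a set-valued object."""
    if obj.type != "set":
        return []
    return [store[oid] for oid in obj.value]


def render(obj: OEMObject, store: Dict[str, OEMObject], indent: int = 0) -> str:
    """Print an object together with its label and type (self-describing)."""
    pad = "  " * indent
    if obj.type == "set":
        inner = "\n".join(render(child, store, indent + 1) for child in children(obj, store))
        return f"{pad}<{obj.label} set {{\n{inner}\n{pad}}}>"
    return f"{pad}<{obj.label} {obj.type} {obj.value!r}>"


# Hypothetical objects; the OIDs are made up for the example.
objects = [
    OEMObject("&t1", "title", "string", "Casablanca"),
    OEMObject("&y1", "year", "integer", 1942),
    OEMObject("&m1", "movie", "set", ["&t1", "&y1"]),
]
store = {obj.oid: obj for obj in objects}
movie_labels = [child.label for child in children(store["&m1"], store)]
```

Because every object carries its own label and type, a receiver can interpret `render(store["&m1"], store)` without any separate schema.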

Representing Semistructured Data Using OEM
[Figure: OEM objects drawn with the callouts Label, Memory Addresses, Set Value, and Atomic Value; the objects shown are named b1, t, a, n, p, a1, v, w, x, ...]

An OEM Query Language: OEM-QL
• Logic-based language for OEM
  – Match object patterns, generate variable bindings, construct new OEM objects from existing ones
• Get articles published in “IEEE Computer”
  P : P: }>
• Get titles of books by “Jeff Ullman”
  : }>

Semistructured Data - Case Study: WWW Extraction

Problem
• Lots of valuable information on the Web
  – irregular structure
  – highly dynamic
• Embedded in HTML
• Limited query facilities

Data Extraction Tool
• Flexible, easy to use
• Accommodate virtually any HTML source
• Interface with existing systems, e.g., data warehouse, user interface for querying
[Figure: architecture diagram with the components Query, World Wide Web, Extractor, Specification, WH Integrator, and Data Warehouse]

Approach
• Extract Web data into OEM format
  – Query using OEM-QL
• Python-based, configurable parser
• Declarative description of HTML source
  – location of data on page
  – how to package data into OEM
• “Regular expression”-like syntax
• Human intelligence rather than A.I.

Extractor Specification
Consists of commands of the form:
[ "variable(s)", "source", "pattern" ]

HTML Source File
<HTML>
<HEAD>...
<TABLE>
<TR>
<TH><I> header 1 </I></TH>
<TH><I> header 2 </I></TH>
<TH><I> header 3 </I></TH>
</TR>
<TR>
<TD> text 1 </TD>
<TD><A HREF=http://www.stuff/> text 2 </A></TD>
<TD> text 3 </TD>
</TR>
...
</TABLE>
...
</BODY>
</HTML>

Specification File
[
  ["root", "get('http://www.example.test/')", "#"],
  ["__tempvar1", "root", "*<table>#</table>*"],
  ["__tempvar2", "split(__tempvar1, '</tr>')", "#"],
  ["rows", "__tempvar2[1:-1]", "#"],
  ["header1, header2_url, header2, header3", "rows", "*<td>#</td>*<a*href=#>#</a>*<td>#</td>*"]
]

Result OEM Object
...
<root complex {
  <rows complex {
    <header1 string "text 1">
    <header2_url string "http://www.stuff">
    <header2 string "text 2">
    <header3 string "text 3"> }>
  <rows complex { ... }> }>

Basic Syntax: Variable
• variable(l:p:t)
  – optional parameters for specification of corresponding OEM object
  – l: label name
  – t: type
  – p: parent object
• _variable
  – temporary data structure, does not appear as OEM object

Basic Syntax: Source
• split(variable, token)
  – creates a list with multiple elements using token as the element separator
• get(URL)
  – obtain contents of HTML file at address URL

Basic Syntax: Patterns
• token1 # token2
  – match and store current input (between tokens)
• token1 * token2
  – match, don’t store current input (between tokens)

Syntactic Sugar
• Functions for extracting commonly used HTML constructs
  – extract_table(variable), pattern
  – split_table_row(variable)
  – split_table_column(variable)
  – extract_list(variable), pattern
  – split_list(variables)

Advanced Features
• Customization of output
  – structure, label names, data type, ...
• Extraction across multiple HTML pages
• Graceful recovery from parse errors
  – resume parsing using next input from source
• Multiple patterns in single command
  – follow different parse tree depending on structure in source

Sample Extraction Scenario

Extracted OEM Data
OEM-QL query:
<city C {<high H> <low L>}> :
  <temperature {<city_temp {<country "Germany"> <city C> <high_today H> <low_today L>}>}>

Evaluation
• Better than
  – writing programs
  – YACC, PERL, etc.
  – A.I.
• Can do better
  – GUI tool to simplify the generation of extractor specification
  – Machine learning or data mining techniques to automatically infer structure
  ...
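The extractor commands described in the specification slides (the `*` and `#` patterns, `split`, and the `[1:-1]` slice) can be approximated in a few lines of Python. This is a rough sketch under assumptions, not the original tool: the helpers `compile_pattern`, `extract`, and `split` are hypothetical names, the patterns are translated to ordinary regular expressions with case-insensitive matching, and the HTML is supplied inline instead of being fetched with `get(URL)`.

```python
import re


def compile_pattern(pattern):
    """Translate an extractor pattern into a regular expression.

    '#' captures the text it matches; '*' matches text without storing it.
    Everything else is treated as a literal token.
    """
    parts = []
    for piece in re.split(r"([#*])", pattern):
        if piece == "#":
            parts.append("(.*?)")
        elif piece == "*":
            parts.append(".*?")
        else:
            parts.append(re.escape(piece))
    return re.compile("".join(parts), re.IGNORECASE | re.DOTALL)


def extract(pattern, text):
    """Return the stripped captures of pattern in text, or None if no match."""
    match = compile_pattern(pattern).fullmatch(text)
    return None if match is None else [group.strip() for group in match.groups()]


def split(text, token):
    """Split text on token, ignoring case, as in split(variable, token)."""
    return re.split(re.escape(token), text, flags=re.IGNORECASE)


# Inline stand-in for the page that get(URL) would return.
html = """<HTML> <HEAD></HEAD> <BODY>
<TABLE>
<TR> <TH><I> header 1 </I></TH> <TH><I> header 2 </I></TH> <TH><I> header 3 </I></TH> </TR>
<TR> <TD> text 1 </TD> <TD><A HREF=http://www.stuff/> text 2 </A></TD> <TD> text 3 </TD> </TR>
</TABLE>
</BODY> </HTML>"""

table = extract("*<table>#</table>*", html)[0]
rows = split(table, "</tr>")[1:-1]  # drop the header row and the trailing remainder

names = ["header1", "header2_url", "header2", "header3"]
records = []
for row in rows:
    fields = extract("*<td>#</td>*<a*href=#>#</a>*<td>#</td>*", row)
    if fields is not None:  # skip rows that do not fit the pattern
        records.append(dict(zip(names, fields)))
```

Each resulting record corresponds to one `rows` object in the OEM output, with the variable names from the last command used as labels. Skipping rows that fail to match is a simple stand-in for the error recovery listed under the advanced features.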