
  • Number of slides: 33

Data Warehousing/Mining
Comp 150 DW
Semistructured Data
Instructor: Dan Hebert

Semistructured Data
• Everything that has no rigid schema
  – Schema is contained within the data (self-describing), OR
  – No separate schema, OR
  – Schema exists but places only loose constraints on data
• Emerged as an important topic for a variety of reasons
  – Many data sources, like the WWW, which we would like to treat as databases but cannot for lack of a schema
  – Desirable to have an extremely flexible format for data exchange between disparate databases
  – May want to view structured data as semistructured data for the purpose of browsing

Motivation
• Some data really is unstructured/semistructured
  – World Wide Web
  – Data exchange formats
  – Some exotic database management systems, e.g., ACeDB, popular with biologists
• Data integration
• Browsing

Motivation - World Wide Web
• Why do we want to treat the Web as a database?
  – To maintain integrity
  – To query based on structure (as opposed to content)
  – To introduce some “organization”
• But the Web has no structure. The best we can say is that it is an enormous graph.

Motivation - Data Formats
• Much (probably most) of the world’s data is in data formats
• These are formats defined for the interchange and archiving of data
• Data formats vary in generality. ASN.1 and XDR are quite general
• Scientific data formats tend to be “fixed schemas”
• The textual representation given by data formats is sometimes not immediately translatable into a standard relational/object-oriented representation

Motivation - Data Integration
• Goal is to integrate all types of information, including unstructured information
  – Irregular, missing information, structure not fully known, dynamic schema evolution, etc.
• Traditional data models and languages not well suited
  – Cannot accommodate heterogeneous data sets (different types and structures), etc.
  – Difficult to build software that will easily convert between two disparate models
• OEM (Object Exchange Model)
  – Semistructured data model from the TSIMMIS project at Stanford
  – Internal data structure for exchange of data between DBMSs
  – Used by other systems: e.g., Windows 95 registry, Lotus Notes

Motivation - Browsing
• To query a database one needs to understand the schema.
• However, schemas have opaque terminology and the user may want to start by querying the data with little or no knowledge of the schema.
  – Where in the database is the string “Casablanca” to be found?
  – Are there integers in the database greater than 2^16?
  – What objects in the database have an attribute name that starts with “act”?
• While extensions to relational query languages have been proposed for such queries, there is no generic technique for interpreting them.
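The three browsing questions above can be sketched over a small in-memory structure. This is only an illustration, assuming nested Python dicts and lists stand in for a semistructured database; the sample data and the helper names (`walk`, `find_string`, `integers_greater_than`, `labels_starting_with`) are hypothetical and do not come from the slides.

```python
from typing import Any, Iterator, List, Tuple

# Hypothetical sample data: nested dicts/lists stand in for a semistructured database.
DB = {
    "Entry": [
        {"Movie": {"Title": "Casablanca", "Year": 1942,
                   "actors": ["Bogart", "Bergman"]}},
        {"Movie": {"Title": "Play It Again, Sam", "Year": 1972,
                   "actor": "Allen", "Budget": 1300000}},
    ]
}


def walk(node: Any, path: Tuple = ()) -> Iterator[Tuple[Tuple, Any]]:
    """Yield (path, value) for every node reachable from `node`."""
    yield path, node
    if isinstance(node, dict):
        for label, child in node.items():
            yield from walk(child, path + (label,))
    elif isinstance(node, list):
        for index, child in enumerate(node):
            yield from walk(child, path + (index,))


def find_string(db: Any, text: str) -> List[Tuple]:
    """Where in the database does the string `text` occur?"""
    return [path for path, value in walk(db) if value == text]


def integers_greater_than(db: Any, bound: int) -> List[Tuple[Tuple, int]]:
    """Which integers in the database exceed `bound`? (bool is excluded.)"""
    return [(path, value) for path, value in walk(db)
            if isinstance(value, int) and not isinstance(value, bool)
            and value > bound]


def labels_starting_with(db: Any, prefix: str) -> List[str]:
    """Which attribute names start with `prefix`?"""
    return sorted({label for path, _ in walk(db) for label in path
                   if isinstance(label, str) and label.startswith(prefix)})


print(find_string(DB, "Casablanca"))
print(integers_greater_than(DB, 2 ** 16))
print(labels_starting_with(DB, "act"))
```

None of the three functions needs to know the schema in advance: each simply traverses every path, which is exactly what makes such queries awkward to express in a schema-bound query language.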

The Model
• Represent data as some kind of graph-like or tree-like model
  – Cycles are allowed, but we usually refer to them as trees
  – Several different approaches with minor differences (easy to convert)
     · Data on labels or edges, nodes carry information or not
• Straightforward to encode relational and object-oriented databases
  – Issue: object identity

Querying Semistructured Data
• There are (at least) three approaches to this problem
  – Add arbitrary features to SQL or to your favorite query language
  – Find some principled approach to programs that are based on the type of the data
  – Represent the graph (or whatever the structure is) as appropriate predicates and use some variety of datalog on that structure

The “Extend SQL” Approach
• In fact it is an attempt to extend the philosophy of OQL and comprehension syntax to these new structures
• It is the approach taken in the design of UnQL and also of Lorel
• Looks very similar to OQL (path expressions)

Example
    select  Entry.Movie.Title
    from    DB
    where   Entry.Movie.Director . . .
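A rough sketch of how a path expression such as Entry.Movie.Title might be evaluated over nested data follows. The slide elides the condition on Entry.Movie.Director, so the equality test used here is purely an assumption; the sample data and the helpers `follow` and `select_titles` are hypothetical and are not part of UnQL, Lorel, or OQL.

```python
from typing import Any, List

# Hypothetical data; note the irregular entries (list-valued Director, no Movie at all).
DB = {
    "Entry": [
        {"Movie": {"Title": "Casablanca", "Director": "Curtiz"}},
        {"Movie": {"Title": "Annie Hall", "Director": ["Allen"]}},
        {"TVShow": {"Title": "Cheers"}},
    ]
}


def follow(node: Any, labels: List[str]) -> List[Any]:
    """Follow a path of edge labels from `node`; lists fan out to all members."""
    frontier = [node]
    for label in labels:
        next_frontier = []
        for current in frontier:
            items = current if isinstance(current, list) else [current]
            for item in items:
                if isinstance(item, dict) and label in item:
                    next_frontier.append(item[label])
        frontier = next_frontier
    # Flatten list-valued results so callers always see individual values.
    result: List[Any] = []
    for value in frontier:
        result.extend(value if isinstance(value, list) else [value])
    return result


def select_titles(db: Any, director: str) -> List[Any]:
    """select Entry.Movie.Title from DB where Entry.Movie.Director = director
    (the equality condition is an assumption; the slide elides it)."""
    titles = []
    for entry in follow(db, ["Entry"]):
        if director in follow(entry, ["Movie", "Director"]):
            titles.extend(follow(entry, ["Movie", "Title"]))
    return titles


print(select_titles(DB, "Curtiz"))
```

Entries that lack a Movie edge simply contribute nothing, rather than raising an error, which is the behavior one wants when the data has no rigid schema.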

Syntax Issues
• Need (path) variables to tie paths and edges together
• Paths of arbitrary length
  – “Find all strings in db”
  – “Find whether “Allen” acted in “Casablanca””
  – Need regular expressions to constrain paths
• Rich set of overloadings for operators to deal with comparisons of objects with values and of values with sets

Underlying Computational Strategy
• Model the graph as a relational database and use a relational query language.
  – The database is one large relation (node-id, label, node-id)
  – Used by the Stanford group in LORE/LOREL
• Complications
  – Labels are from a heterogeneous set of types, so more than one relation is needed
  – Additional relations are needed if information is to be stored in nodes
  – Various navigation issues
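The edge-relation idea can be illustrated with SQLite from the Python standard library. This is a minimal sketch under stated assumptions, not how LORE is actually implemented: the table names `edge` and `leaf`, the node numbering, and the sample data are all hypothetical. A fixed-length path becomes a chain of self-joins, and an arbitrary-length path ("find all strings in db") becomes a recursive query.

```python
import sqlite3


def build_db() -> sqlite3.Connection:
    """Store a small graph as (src, label, dst) edges plus a relation for leaf values."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE edge (src INTEGER, label TEXT, dst INTEGER)")
    # Extra relation because atomic values are stored in nodes (see "Complications").
    conn.execute("CREATE TABLE leaf (node INTEGER PRIMARY KEY, val TEXT)")
    conn.executemany("INSERT INTO edge VALUES (?, ?, ?)", [
        (0, "Entry", 1), (1, "Movie", 2), (2, "Title", 3), (2, "Director", 4),
        (0, "Entry", 5), (5, "Movie", 6), (6, "Title", 7),
    ])
    conn.executemany("INSERT INTO leaf VALUES (?, ?)", [
        (3, "Casablanca"), (4, "Curtiz"), (7, "Annie Hall"),
    ])
    return conn


# Fixed path Entry.Movie.Title from root node 0: one self-join per edge.
PATH_QUERY = """
SELECT leaf.val
FROM edge AS e1
JOIN edge AS e2 ON e2.src = e1.dst
JOIN edge AS e3 ON e3.src = e2.dst
JOIN leaf ON leaf.node = e3.dst
WHERE e1.src = 0 AND e1.label = 'Entry'
  AND e2.label = 'Movie' AND e3.label = 'Title'
ORDER BY leaf.val
"""

# Arbitrary-length paths: every string reachable from the root.
# UNION (not UNION ALL) removes duplicates, so cycles in the graph terminate.
ALL_STRINGS_QUERY = """
WITH RECURSIVE reach(node) AS (
    SELECT 0
    UNION
    SELECT edge.dst FROM edge JOIN reach ON edge.src = reach.node
)
SELECT leaf.val FROM leaf JOIN reach ON leaf.node = reach.node
ORDER BY leaf.val
"""

conn = build_db()
titles = [row[0] for row in conn.execute(PATH_QUERY)]
all_strings = [row[0] for row in conn.execute(ALL_STRINGS_QUERY)]
print(titles)
print(all_strings)
```

The self-join chain grows with the path length, which hints at why navigation is listed among the complications of this strategy.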

Semistructured Data - Case Study
Object Exchange Model

OEM Features
• Common model for heterogeneous information exchange, self-describing
• Each object: OID | Label | Type | Value
  – OID = unique identifier or NULL
  – Label = character string descriptor
  – Type = atomic data type or set
  – Value = atomic value or set of object references
• “Help pages” for labels
• Query language OEM-QL

Representing Semistructured Data Using OEM
[Figure: example OEM object beginning <collection, {b1, a1, . . . with callouts “Label”, “Memory Addresses”, “Set Value”, and “Atomic Value”, and object entries b1:, t:, a:, n:, p:, a1:, v:, w:, x:, . . .]
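The four-field OEM object described above (OID, label, type, value) can be sketched as a small Python data class. This is an illustrative assumption rather than the TSIMMIS implementation: the class name `OEMObject`, the `&`-prefixed OIDs, the `render` helper, and the sample objects are hypothetical, and child objects are held directly instead of through object references to keep the sketch short.

```python
from dataclasses import dataclass
from typing import List, Optional, Union


@dataclass
class OEMObject:
    oid: Optional[str]   # unique identifier, or None for NULL
    label: str           # character string descriptor
    type: str            # atomic type name, or "set"
    value: Union[str, int, float, List["OEMObject"]]  # atomic value or sub-objects


def render(obj: OEMObject, indent: int = 0) -> str:
    """Print an object in a nested <label type value> notation."""
    pad = "  " * indent
    if obj.type == "set":
        inner = "\n".join(render(child, indent + 1) for child in obj.value)
        return f"{pad}<{obj.label} set {{\n{inner}\n{pad}}}>"
    return f"{pad}<{obj.label} {obj.type} {obj.value!r}>"


# Hypothetical sample objects.
title = OEMObject("&t1", "title", "string", "Casablanca")
year = OEMObject(None, "year", "integer", 1942)
movie = OEMObject("&m1", "movie", "set", [title, year])

print(render(movie))
```

Because every object carries its own label and type, a consumer can interpret the data without consulting a separate schema, which is what "self-describing" means here.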

An OEM Query Language: OEM-QL
• Logic-based language for OEM
  – Match object patterns, generate variable bindings, construct new OEM objects from existing ones
• Get articles published in “IEEE Computer”
    P : P: }>
• Get titles of books by “Jeff Ullman”
    : }>

Semistructured Data - Case Study
WWW Extraction

Problem
• Lots of valuable information on the Web
  – irregular structure
  – highly dynamic
• Embedded in HTML
• Limited query facilities

Data Extraction Tool
• Flexible, easy to use
• Accommodate virtually any HTML source
• Interface with existing system, e.g., data warehouse, user interface for querying
[Figure labels: Query, World Wide Web, Extractor, WH Integrator, Data Warehouse, Specification]

Approach
• Extract Web data into OEM format
  – Query using OEM-QL
• Python-based, configurable parser
• Declarative description of HTML source
  – location of data on page
  – how to package data into OEM
• “Regular expression”-like syntax
• Human intelligence rather than A.I.

Extractor Specification
Consists of commands of the form:
    [ “variable(s)”, “source”, “pattern” ]

HTML Source File
    <HTML>
    <HEAD> . . .
    <TABLE>
    <TR>
      <TH><I> header1 </I></TH>
      <TH><I> header2 </I></TH>
      <TH><I> header3 </I></TH>
    </TR>
    <TR>
      <TD> text1 </TD>
      <TD><A HREF=http://www.stuff/> text2 </A></TD>
      <TD> text3 </TD>
    </TR>
    . . .
    </TABLE>
    . . .
    </BODY>
    </HTML>

Specification File
    [
      [ “root”, “get('http://www.example.test/')”, “#” ],
      [ “__tempvar1”, “root”, “*<table>#</table>*” ],
      [ “__tempvar2”, “split(__tempvar1, ’</tr>’)”, “#” ],
      [ “rows”, “__tempvar2[1:-1]”, “#” ],
      [ “header1, header2_url, header2, header3”, “rows”,
        “*<td>#</td>*<a*href=#>#</a>*<td>#</td>*” ]
    ]

Result OEM Object
    . . .
    <root complex {
      <rows complex {
        <header1 string “text1”>
        <header2_url string “http://www.stuff”>
        <header2 string “text2”>
        <header3 string “text3”> }>
      <rows complex { . . . }> }>

Basic Syntax: Variable
• variable(l:p:t)
  – optional parameters for specification of corresponding OEM object
     · l: label name
     · t: type
     · p: parent object
• _variable
  – temporary data structure, does not appear as OEM object

Basic Syntax: Source
• split(variable, token)
  – creates a list with multiple elements using token as the element separator
• get(URL)
  – obtain contents of HTML file at address URL

Basic Syntax: Patterns
• token1 # token2
  – match and store current input (between tokens)
• token1 * token2
  – match, don’t store current input (between tokens)

Syntactic Sugar
• Functions for extracting commonly used HTML constructs
  – extract_table(variable), pattern
  – split_table_row(variable)
  – split_table_column(variable)
  – extract_list(variable), pattern
  – split_list(variables)

Advanced Features
• Customization of output
  – structure, label names, data type, . . .
• Extraction across multiple HTML pages
• Graceful recovery from parse errors
  – resume parsing using next input from source
• Multiple patterns in single command
  – follow different parse tree depending on structure in source

Sample Extraction Scenario

Extracted OEM Data
OEM-QL query:
    <city C {<high H> <low L>}> :
      <temperature {<city_temp {<country “Germany”>
                                <city C>
                                <high_today H>
                                <low_today L>}>}>

Evaluation
• Better than
  – writing programs
  – YACC, PERL, etc.
  – A.I.
• Can do better
  – GUI tool to simplify the generation of extractor specification
  – Machine learning or data mining techniques to automatically infer structure . . .
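The extractor's pattern syntax (# stores the input between tokens, * skips it) can be approximated by translating a pattern into a regular expression. The sketch below is an assumption about how such matching could work, not the original extractor's code: `compile_pattern`, `extract`, and the inline `PAGE` string are hypothetical, and the page is embedded rather than fetched with get(URL). It mirrors the steps of the specification file: isolate the table, split it on </tr>, drop the first and last pieces, then apply the row pattern.

```python
import re
from typing import List, Optional

# Hypothetical stand-in for the fetched page.
PAGE = """<HTML><BODY><TABLE>
<TR><TH><I> header1 </I></TH><TH><I> header2 </I></TH><TH><I> header3 </I></TH></TR>
<TR><TD> text1 </TD><TD><A HREF=http://www.stuff/> text2 </A></TD><TD> text3 </TD></TR>
</TABLE></BODY></HTML>"""


def compile_pattern(pattern: str) -> "re.Pattern":
    """Translate '#' into a capturing lazy wildcard and '*' into a non-capturing one."""
    parts = []
    for piece in re.split(r"([#*])", pattern):
        if piece == "#":
            parts.append("(.*?)")        # match and store
        elif piece == "*":
            parts.append(".*?")          # match, don't store
        else:
            parts.append(re.escape(piece))  # literal token
    return re.compile("".join(parts), re.IGNORECASE | re.DOTALL)


def extract(pattern: str, text: str) -> Optional[List[str]]:
    """Return the stored pieces (whitespace-trimmed), or None if the pattern fails."""
    match = compile_pattern(pattern).fullmatch(text)
    if match is None:
        return None
    return [group.strip() for group in match.groups()]


# Steps corresponding to the specification file.
table = extract("*<table>#</table>*", PAGE)[0]
rows = re.split("</tr>", table, flags=re.IGNORECASE)[1:-1]
names = "header1, header2_url, header2, header3".split(", ")
records = [dict(zip(names, extract("*<td>#</td>*<a*href=#>#</a>*<td>#</td>*", row)))
           for row in rows]
print(records)
```

Matching is case-insensitive here because the specification uses lowercase tags while the source file uses uppercase ones; whether the original tool behaved the same way is not stated in the slides.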