Why do we care about databases in a

Why do we care about databases in a course about Web?

Administrivia • Midterm next class (in class) – Openbook/Open Notes/open course web (not general web) – Everything done until Sep 28 th (and covered in homework 2) – Questions? Slides adapted from Rao (ASU) & Franklin (Berkeley)

Adapting old disciplines for Web-age • Information (text) retrieval – Scale of the web – Hyper text/ Link structure – Authority/hub computations • Databases – Multiple databases • Heterogeneous, access limited, partially overlapping – Network (un)reliability • Datamining [Machine Learning/Statistics/Databases] – Learning patterns from large scale data Slides adapted from Rao (ASU) & Franklin (Berkeley)

Why do we care about databases? • Three reasons – Deep web is all databases… – We can do better with structured data… – Exposing databases on web changes their clientele. . Slides adapted from Rao (ASU) & Franklin (Berkeley)

Deep Web is databases. . • The crawlable web pages are just the tip of a huge ice berg that is deep web – Many web sites have huge backend databases that generate pages dynamically in response to queries • Airline fare databases; News paper classifieds etc. – By some estimates, deep web is 2 orders of magnitude bigger than the shallow (“html page”) web • We need to exploit deep web – Crawl/index deep web – Select databases relevant to a query – Provide information aggregation/integration services over deep web databases • . . and all the big kids are trying to gobble up anyone who is even going through the motions of doing these. . • …which leads to several DB challenges not addressed in traditional DBs – – Wrapper generation Schema mapping (automated) form filling Query optimization • Learning source profiles Slides adapted from Rao (ASU) & Franklin (Berkeley)

Web brings unwashed masses, unreliable medium as well as dirty data to databases. . • Web accessibility changes the user/data/medium profile significantly – from SQL gurus supporting financial data on dedicated DBMS to “ 2. 1 keyword query” instant gratification seekers working with dirty/inconsistent data over unreliable web. • Challenges – – How does one support keyword queries in databases? How does one support imprecise queries in databases? How do we handle incompleteness/inconsistency in databases? Does it make sense to focus on total response time minimization • As against a multi-objective cost/benefit optimization? The DB community has embraced these challenges --see Lowell Report Slides adapted from Rao (ASU) & Franklin (Berkeley)

Databases offer lessons on exploiting structure • We argued that structure (and semantics) help querying – If there is structure (as in databases) we can exploit it • Databases is an existing technology for exploiting some forms of structure – SQL may not look like much, but it is more expressive than keyword queries! – If not, we can extract structure and then exploit it • Challenges – Techniques for extracting information (NLP-lite) – Languages for representing/handling “Semi-structured” data – Standards for supporting/exploiting semantic tagging Slides adapted from Rao (ASU) & Franklin (Berkeley)

Before we play havoc with databases, let’s quickly review the traditional art of db management so we know all that needs to change Slides adapted from Rao (ASU) & Franklin (Berkeley)

Concepts covered so far … • Information Retrieval – Text retrieval – Hyper-linked text retrieval – Improvements… • Information Mining – Clustering techniques to improve result presentation – Classification and filtering techniques Slides adapted from Rao (ASU) & Franklin (Berkeley)

What Is a Database System? • Database: a very large, integrated collection of data. • Models a real-world enterprise – Entities (e. g. , teams, games) – Relationships (e. g. , The Patriots are playing in The Superbowl) – More recently, also includes active components , often called “business logic”. (e. g. , the BCS ranking system) • A Database Management System (DBMS) is a software system designed to store, manage, and facilitate access to databases. Slides adapted from Rao (ASU) & Franklin (Berkeley)

Functionality of a DBMS • Data Dictionary Management • Storage management – Data storage Definition Language (DDL) • High level query and data manipulation language – SQL/XQuery etc. – May tell us what we are missing in text-based search • Efficient query processing – May change in the internet scenario • Transaction processing • Resiliency: recovery from crashes, • Different views of the data, security – May be useful to model a collection of databases together • Interface with programming languages Slides adapted from Rao (ASU) & Franklin (Berkeley)

Traditional Database Architecture Query (SQL) Answer (relation) Database Manager (DBMS) -Storage mgmt -Query processing -View management -(Transaction processing) Database (relational) Slides adapted from Rao (ASU) & Franklin (Berkeley)

Building an Application with a Database System • Requirements modeling (conceptual, pictures) – Decide what entities should be part of the application and how they should be linked. • Schema design and implementation – Decide on a set of tables, attributes. – Define the tables in the database system. – Populate database (insert tuples). • Write application programs using the DBMS – Now much easier, with data management API Slides adapted from Rao (ASU) & Franklin (Berkeley)

Conceptual Modeling name category name ssn Student Course Takes quarter Advises Teaches Professor address name Slides adapted from Rao (ASU) & Franklin (Berkeley) field

Data Models • A data model is a collection of concepts for describing data. • A schema is a description of a particular collection of data, using a given data model. • The relational model of data is the most widely used model today. – Main concept: relation, basically a table with rows and columns. – Every relation has a schema, which describes the columns, or fields. Slides adapted from Rao (ASU) & Franklin (Berkeley)

Levels of Abstraction • Views describe how users see the data. • Conceptual schema defines logical structure • Physical schema describes the files and indexes used. View 1 View 2 View 3 Conceptual Schema Physical Schema Slides adapted from Rao (ASU) & Franklin (Berkeley) DB

Example: University Database • Conceptual schema: – Students(sid: string, name: string, login: string, age: integer, gpa: real) – Courses(cid: string, cname: string, View 1 credits: integer) • External Schema (View): – Course_info(cid: string, enrollment: in teger) • Physical schema: – Relations stored as unordered files. – Index on first column of Students. View 2 View 3 Conceptual Schema Physical Schema DB If five people are asked to come up with a schema Slides adapted from Rao (ASU) & Franklin (Berkeley) for the data, what are the odds that they will come up with the same schema?

Data Independence • Applications insulated from how data is structured and stored. • Logical data independence: Protection from changes in logical structure of data. • Physical data independence: Protection from changes in physical structure of data. View 1 View 2 View 3 Conceptual Schema Physical Schema • Q: Why are these particularly important for DBMS? Slides adapted from Rao (ASU) & Franklin (Berkeley) DB

Schema Design & Implementation • Table Students • Separates the logical view from the physical view of the data. Slides adapted from Rao (ASU) & Franklin (Berkeley)

Terminology Attribute names tuples Students (Arity=3) Slides adapted from Rao (ASU) & Franklin (Berkeley)

Querying a Database • Find all the students taking CSE 594 in Q 1, 2004 • S(tructured) Q(uery) L(anguage) select E. name from Enroll E where E. course=CS 490 i and E. quarter=“Winter, 2000” • Query processor figures out how to answer the query efficiently. Slides adapted from Rao (ASU) & Franklin (Berkeley)

Defining Views (Virtual) Views are relations, except that they are not physically stored. They are used mostly in order to simplify complex queries and to define conceptually different views of the database to different classes of users. View: purchases of telephony products: CREATE VIEW telephony-purchases AS SELECT product, buyer, seller, store FROM Purchase, Product WHERE Purchase. product = Product. name AND Product. category = “telephony” Slides adapted from Rao (ASU) & Franklin (Berkeley)

A Different View CREATE VIEW Seattle-view AS SELECT buyer, seller, product, store FROM Person, Purchase WHERE Person. city = “Seattle” AND Person. name = Purchase. buyer We can later use the views: SELECT name, store FROM Seattle-view, Product WHERE Seattle-view. product = Product. name AND Product. category = “shoes” What’s really happening when we query a view? ? Slides adapted from Rao (ASU) & Franklin (Berkeley)

Updating Views How can I insert a tuple into a table that doesn’t exist? CREATE VIEW bon-purchase AS SELECT store, seller, product FROM Purchase WHERE store = “The Bon Marche” If we make the following insertion: INSERT INTO bon-purchase VALUES (“the Bon Marche”, Joe, “Denby Mug”) We can simply add a tuple (“the Bon Marche”, Joe, NULL, “Denby Mug”) to relation Purchase. Slides adapted from Rao (ASU) & Franklin (Berkeley)

Non-Updatable Views Given Purchase (buyer, seller, store, product) Person( name, phone-num, city) CREATE VIEW Seattle-view AS SELECT seller, product, store FROM Person, Purchase WHERE Person. city = “Seattle” AND Person. name = Purchase. buyer How can we add the following tuple to the view? (Joe, “Shoe Model 12345”, “Nine West”) Slides adapted from Rao (ASU) & Franklin (Berkeley)

Materialized Views • Views whose corresponding queries have been executed and the data is stored in a separate database – Uses: Caching • Issues – Using views in answering queries • Normally, the views are available in addition to database – (so, views are local caches) • In information integration, views may be the only things we have access to. – An internet source that specializes in woody allen movies can be seen as a view on a database of all movies. Except, there is no database out there which contains all movies. . – Maintaining consistency of materialized views Slides adapted from Rao (ASU) & Franklin (Berkeley)

Query Optimization Goal: Declarative SQL query SELECT S. buyer FROM Purchase P, Person Q WHERE P. buyer=Q. name AND Q. city=‘seattle’ AND Q. phone > ‘ 5430000’ Inputs: • the query • statistics about the data (indexes, cardinalities, selectivity factors) • available memory Imperative query execution plan: buyer City=‘seattle’ phone>’ 5430000’ Buyer=name Purchase (Table scan) (Simple Nested Loops) Person (Index scan) Ideally: Want to find best plan. Practically: Avoid worst plans! Slides adapted from Rao (ASU) & Franklin (Berkeley)