LDAP, DNS, Keyword and Content Indexing
Zachary G. Ives
University of Pennsylvania
CIS 455 / 555 – Internet and Web Systems
February 1, 2010
Some portions derived from slides by Raghu Ramakrishnan

Readings & Reminders
§ Read next:
  § Doan et al. Chapter 8, XML, schema, and XPath
  § XSLT tutorial
§ Reminder: HW 1 Milestone 1 due Wednesday 2/3

Naming People and Devices: LDAP
§ Lightweight Directory Access Protocol
§ Hierarchical naming system that can be partitioned and replicated
§ See http://www.seas.upenn.edu/cets/answers/ldap.html to set up your email client to access Penn’s LDAP server

LDAP’s Schema
LDAP information has a schema with different levels of containers:
§ A unique name in LDAP is called a Distinguished Name (“dn”). It consists of a sequence of attributes representing a hierarchy, ordered from most specific to least specific (as in DNS names):
  o = organization; dc = domain component
  ou = organizational unit
  uid = user ID
  cn = common name
  c = country; st = state; l = locality
§ An entry can also have an objectClass, the type of entity
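To make the "most specific to least specific" ordering concrete, here is a minimal sketch that splits a DN string into its attribute/value components. The example DN and the helper names `parse_dn` and `hierarchy_path` are illustrative assumptions, not taken from Penn's directory, and the parser ignores escaped commas, which real DNs may contain.

```python
def parse_dn(dn):
    """Split a DN string into (attribute, value) pairs, most specific first.

    Simplification: assumes components are comma-separated and that no
    value contains an escaped comma.
    """
    pairs = []
    for part in dn.split(","):
        attr, _, value = part.partition("=")
        pairs.append((attr.strip().lower(), value.strip()))
    return pairs


def hierarchy_path(dn):
    """Return the same components ordered from least to most specific."""
    return list(reversed(parse_dn(dn)))


# Hypothetical DN, used only for illustration.
example = "uid=zives, ou=People, dc=upenn, dc=edu"

if __name__ == "__main__":
    print(parse_dn(example))
    print(hierarchy_path(example))
```

Reversing the components gives the path from the root of the directory tree down to the entry, which is the order in which a partitioned directory would be traversed.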

LDAP Hierarchy
[Figure: LDAP hierarchy, from Brad Marshall, LDAP Tutorial, quark.humbug.au/publications/ldap_tut.html]

Querying
LDAP queries are mostly attribute-value predicates:
§ uid=zives; o=upenn; c = usa
§ (|(cn=Susan Davidson)(cn=Boon Thau Loo)(cn=Val Tannen))
§ objectclass=posixAccount
§ (!cn=Val Tannen)
How might we process these queries?
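One possible answer to the question above is a full scan that evaluates the filter against every entry. The sketch below does that over a small in-memory list. The entries, the tuple encoding of filters, and the function names `matches` and `search` are assumptions made for illustration. A real server would parse the textual filter syntax and would normally consult attribute indexes rather than scan.

```python
# Toy directory: each entry maps an attribute name to a list of values.
# All entries are made up for illustration.
ENTRIES = [
    {"uid": ["zives"], "cn": ["Zachary Ives"], "o": ["upenn"],
     "objectclass": ["posixAccount"]},
    {"uid": ["val"], "cn": ["Val Tannen"], "o": ["upenn"],
     "objectclass": ["posixAccount"]},
    {"uid": ["susan"], "cn": ["Susan Davidson"], "o": ["upenn"],
     "objectclass": ["person"]},
]

# A filter is a nested tuple (an assumed encoding, not LDAP wire format):
#   ("=", attr, value) | ("&", f1, f2, ...) | ("|", f1, f2, ...) | ("!", f)


def matches(entry, flt):
    """Return True if the entry satisfies the filter."""
    op = flt[0]
    if op == "=":
        _, attr, value = flt
        values = entry.get(attr.lower(), [])
        return value.lower() in (v.lower() for v in values)
    if op == "&":
        return all(matches(entry, f) for f in flt[1:])
    if op == "|":
        return any(matches(entry, f) for f in flt[1:])
    if op == "!":
        return not matches(entry, flt[1])
    raise ValueError(f"unknown operator: {op}")


def search(entries, flt):
    """Full scan: return the uid of every entry that matches."""
    return [e["uid"][0] for e in entries if matches(e, flt)]


if __name__ == "__main__":
    either = ("|", ("=", "cn", "Susan Davidson"), ("=", "cn", "Val Tannen"))
    print(search(ENTRIES, either))
    print(search(ENTRIES, ("!", ("=", "cn", "Val Tannen"))))
```

The scan costs one pass over the directory per query. Equality predicates such as `uid=zives` are the ones that benefit most from the keyed indexes discussed later in these slides.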

The Backbone of Internet Naming: Domain Name Service
§ A simple, hierarchical name system with a distributed database: each domain controls its own names
[Figure: DNS hierarchy. Top-level domains (com, edu, …) at the top; domains such as amazon, columbia, upenn, and berkeley beneath them; hosts and subdomains such as www, cis, and sas at the lower levels]

Top-Level Domains (TLDs)
Mostly controlled by Network Solutions, Inc. today
§ .com: commercial
§ .edu: educational institution
§ .gov: US government
§ .mil: US military
§ .net: networks and ISPs (now also a number of other things)
§ .org: other organizations
§ 244 two-letter country suffixes, e.g., .us, .uk, .cz, .tv, …
§ some variants on this for other institutions, e.g., .eu
§ and a bunch of new suffixes that are not very common, e.g., .biz, .mobi, .name, .pro, …

Finding the Root
§ 13 “root servers” store entries for all top-level domains (TLDs)
§ DNS servers have a hard-coded mapping to root servers so they can “get started”

Excerpt from DNS Root Server Entries
    ; This file is made available by InterNIC registration services
    ; under anonymous FTP as
    ;     file   /domain/named.root
    ;
    ; formerly NS.INTERNIC.NET
    ;
    .                     3600000  IN  NS  A.ROOT-SERVERS.NET.
    A.ROOT-SERVERS.NET.   3600000      A   198.41.0.4
    ;
    ; formerly NS1.ISI.EDU
    ;
    .                     3600000      NS  B.ROOT-SERVERS.NET.
    B.ROOT-SERVERS.NET.   3600000      A   128.9.0.107
    ;
    ; formerly C.PSI.NET
    ;
    .                     3600000      NS  C.ROOT-SERVERS.NET.
    C.ROOT-SERVERS.NET.   3600000      A   192.33.4.12
(13 servers in total, A through M)

Supposing We Were to Build DNS
§ How would we start?
§ How is a lookup performed?
(Hint: what do you need to specify when you add a client to a network that doesn’t do DHCP?)
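One way to think about the lookup question is as an iterative walk that starts from a hard-coded root hint and follows referrals downward until some server holds the answer. The sketch below simulates that over an in-memory table. The server names, the zone contents, and the address (taken from the 192.0.2.0/24 documentation range) are all hypothetical, and no network traffic is involved.

```python
# Stand-in for the hard-coded list of root servers.
ROOT_HINTS = ["root"]

# Hypothetical servers: each knows some referrals (suffix -> next server)
# and some final answers (full name -> address).
SERVERS = {
    "root": {"referrals": {"edu": "edu-tld", "com": "com-tld"}, "answers": {}},
    "edu-tld": {"referrals": {"upenn.edu": "upenn-ns"}, "answers": {}},
    "com-tld": {"referrals": {}, "answers": {}},
    "upenn-ns": {"referrals": {},
                 "answers": {"www.cis.upenn.edu": "192.0.2.10"}},
}


def resolve(name, cache=None):
    """Resolve a name iteratively; return (address or None, servers asked)."""
    cache = {} if cache is None else cache
    if name in cache:
        return cache[name], []  # answered locally, nobody was asked
    labels = name.split(".")
    # Candidate suffixes, longest first: a.b.c -> ["a.b.c", "b.c", "c"]
    suffixes = [".".join(labels[i:]) for i in range(len(labels))]
    server = ROOT_HINTS[0]
    asked = []
    while True:
        asked.append(server)
        zone = SERVERS[server]
        if name in zone["answers"]:
            cache[name] = zone["answers"][name]
            return cache[name], asked
        referral = next(
            (zone["referrals"][s] for s in suffixes if s in zone["referrals"]),
            None,
        )
        if referral is None:
            return None, asked  # this server knows nothing more specific
        server = referral


if __name__ == "__main__":
    cache = {}
    print(resolve("www.cis.upenn.edu", cache))
    print(resolve("www.cis.upenn.edu", cache))  # second call hits the cache
```

The cache parameter hints at why repeated lookups rarely reach the root servers in practice.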

Issues in DNS
§ We know that everyone wants to be “mydomain”.com
  § How does this mesh with the assumptions inherent in our hierarchical naming system?
§ What happens if things move frequently?
§ What happens if we want to provide different behavior to different requestors (e.g., Akamai)?

Directories Summarized
An efficient way of finding data, assuming:
§ Data doesn’t change too often, hence it can be replicated and distributed
§ Hierarchy is relatively “wide and flat”
§ Caching is present, helping with repeated queries
Directories generally rely on names at their core
§ Sometimes we want to search based on other content, e.g., “key”s …

Finding Data by Content
We’ve seen two approaches:
§ Do all the work at the data stores: flood the network with requests
§ Have a directory based on names
An alternative, two-step process:
§ Build a content index over what’s out there
  § An index is a key -> value map
  § Typically limited in what kinds of queries can be supported
§ Most common instance: an index of document keywords

A Common Model for Search
§ Index the words in every document
  § “Inverted index”: word -> document (ID)
  § “Forward index”: document (ID) -> list of words

Inverted Indices
§ A conceptually very simple map-multiset data structure: <keyword, {list of occurrences}>
§ In its simplest form, each occurrence includes a document pointer (e.g., URI), perhaps a count and/or position
§ Requires two components, an indexer and a retrieval system
§ We’ll consider cost of building the index, plus searching the index using a single keyword
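The two components named above, an indexer and a retrieval step, can be sketched in a few lines. This is a minimal in-memory version under stated assumptions: documents are short strings keyed by an ID, tokenization is a simple lowercase split on non-alphanumerics, and each occurrence is stored as (document ID, count, positions). The function names `build_indexes` and `lookup` are illustrative.

```python
import re
from collections import defaultdict


def build_indexes(docs):
    """Indexer: build a forward index and an inverted index from {id: text}."""
    forward = {}                    # doc id -> list of words
    inverted = defaultdict(list)    # word -> [(doc id, count, positions)]
    for doc_id, text in docs.items():
        words = re.findall(r"[a-z0-9]+", text.lower())
        forward[doc_id] = words
        positions = defaultdict(list)
        for pos, word in enumerate(words):
            positions[word].append(pos)
        for word, plist in positions.items():
            inverted[word].append((doc_id, len(plist), plist))
    return forward, dict(inverted)


def lookup(inverted, keyword):
    """Retrieval: return the occurrence list for a single keyword."""
    return inverted.get(keyword.lower(), [])


if __name__ == "__main__":
    docs = {"d1": "the quick fox", "d2": "the fox and the dog"}
    forward, inverted = build_indexes(docs)
    print(lookup(inverted, "fox"))
    print(forward["d2"])
```

Here the inverted index is a hash table held in memory. The slides that follow ask how the same map could be laid out as a list or a tree instead.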

How Do We Lay Out an Inverted Index?
§ Some options:
  § Unordered list (e.g., a log)
  § Ordered list
  § Tree
  § Hash table

Unordered and Ordered Lists
§ Assume that we have entries such as: <keyword, #items, …
§ What does ordering buy us?
§ Assume that we adopt a model in which we use:
§ Do we get any additional benefits?
§ How about: where we fix the size of the keyword and the number of items?
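As one illustration of what ordering buys, the sketch below compares a linear scan of an unordered list with a binary search over a sorted one. The entry layout used here, a (keyword, list of document IDs) pair, is an assumption made for the example, since the slide's exact layouts are not recoverable from this transcript.

```python
import bisect


def find_unordered(entries, keyword):
    """Linear scan over (keyword, postings) pairs.

    Returns (postings or None, number of comparisons made).
    """
    for i, (key, postings) in enumerate(entries):
        if key == keyword:
            return postings, i + 1
    return None, len(entries)


def find_ordered(keys, postings, keyword):
    """Binary search over a sorted key list with a parallel postings list."""
    i = bisect.bisect_left(keys, keyword)
    if i < len(keys) and keys[i] == keyword:
        return postings[i]
    return None


if __name__ == "__main__":
    log = [("fox", ["d1", "d2"]), ("ant", ["d3"]), ("dog", ["d2"])]
    ordered = sorted(log)
    keys = [k for k, _ in ordered]
    posts = [p for _, p in ordered]
    print(find_unordered(log, "dog"))
    print(find_ordered(keys, posts, "dog"))
```

The unordered log is cheap to append to but costs a scan per lookup. The sorted list gives logarithmic lookups but makes each insertion shift later entries, which is the trade-off the tree-based layouts try to improve on.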

Tree-Based Indices
Trees have several benefits over lists:
§ Potentially logarithmic search time, as with a well-designed sorted list, IF the tree is balanced
§ Ability to handle variable-length records
We’ve already seen how trees might make a natural way of distributing data, as well
How does a binary search tree fare?
§ Cost of building?
§ Cost of finding an item in it?
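A small experiment can make the "IF the tree is balanced" caveat concrete. The sketch below builds a plain, unbalanced binary search tree and measures its height for two insertion orders. The class and function names are illustrative, and the key count of 200 is arbitrary.

```python
import random


class Node:
    """A node of a plain (unbalanced) binary search tree."""
    __slots__ = ("key", "left", "right")

    def __init__(self, key):
        self.key = key
        self.left = None
        self.right = None


def insert(root, key):
    """Insert key iteratively and return the (possibly new) root."""
    if root is None:
        return Node(key)
    cur = root
    while True:
        if key < cur.key:
            if cur.left is None:
                cur.left = Node(key)
                return root
            cur = cur.left
        elif key > cur.key:
            if cur.right is None:
                cur.right = Node(key)
                return root
            cur = cur.right
        else:
            return root  # key already present


def build(keys):
    root = None
    for key in keys:
        root = insert(root, key)
    return root


def height(root):
    """Number of levels, computed level by level to avoid deep recursion."""
    levels = 0
    level = [root] if root is not None else []
    while level:
        levels += 1
        level = [c for n in level for c in (n.left, n.right) if c is not None]
    return levels


if __name__ == "__main__":
    keys = list(range(200))
    print("sorted insertion order:", height(build(keys)))
    random.Random(0).shuffle(keys)
    print("shuffled insertion order:", height(build(keys)))
```

Inserting keys in sorted order, as an indexer reading an alphabetized word list would, degenerates the tree into a chain, so both building and searching become linear per item. That is the motivation for a structure that stays height-balanced regardless of insertion order.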

B+ Tree: A Flexible, Height-Balanced, High-Fanout Tree
§ Insert/delete at log_F N cost
  § (F = fanout, N = # leaf pages)
  § Keep tree height-balanced
§ Minimum 50% occupancy (except for root)
  § Each node contains d <= m <= 2d entries; d is called the order of the tree
§ Can search efficiently based on equality (or also range, though we don’t need that here)
[Figure: B+ tree layout. Index entries (direct search) on the upper levels; data entries (the "sequence set") at the leaf level]

Example B+ Tree
§ Data (inverted list ptrs) is at leaves; intermediate nodes have copies of search keys
§ Search begins at root, and key comparisons direct it to a leaf
§ Search for be↓, bobcat↓ ...
§ Based on the search for bobcat*, we know it is not in the tree!
[Figure: example B+ tree. The root holds search keys including art, best, but, and dog; the leaves hold the entries a↓ am↓ ant↓ art↓ best↓ bit↓ bob↓ but↓ can↓ cry↓ dog↓ dry↓ elf↓ fox↓]

Inserting Data into a B+ Tree
§ Find correct leaf L
§ Put data entry onto L
  § If L has enough space, done!
  § Else, must split L (into L and a new node L2)
    § Redistribute entries evenly, copy up middle key
    § Insert index entry pointing to L2 into parent of L
§ This can happen recursively
  § To split index node, redistribute entries evenly, but push up middle key. (Contrast with leaf splits.)
§ Splits “grow” tree; root split increases height
  § Tree growth: gets wider or one level taller at top
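The insertion steps above can be sketched as a short recursive routine. This is a minimal in-memory illustration, not the course's or Berkeley DB's implementation: it assumes order d = 2, stores bare keys in the leaves instead of inverted-list pointers, and omits deletion and sibling links. The starting tree in `build_example` is loosely modeled on the lecture's example, and the assumption that its first leaf already holds "an" is ours.

```python
import bisect

ORDER = 2  # d: a node may hold at most 2*d keys (assumed for this sketch)


class Leaf:
    def __init__(self, keys=None):
        self.keys = keys or []


class Index:
    def __init__(self, keys, children):
        self.keys = keys            # separator keys
        self.children = children    # len(children) == len(keys) + 1


def _insert(node, key):
    """Insert under node; return (separator, new right node) on a split."""
    if isinstance(node, Leaf):
        bisect.insort(node.keys, key)
        if len(node.keys) <= 2 * ORDER:
            return None
        mid = len(node.keys) // 2
        right = Leaf(node.keys[mid:])
        node.keys = node.keys[:mid]
        return right.keys[0], right   # copy up: key also stays in the leaf
    i = bisect.bisect_right(node.keys, key)
    split = _insert(node.children[i], key)
    if split is None:
        return None
    separator, right_child = split
    node.keys.insert(i, separator)
    node.children.insert(i + 1, right_child)
    if len(node.keys) <= 2 * ORDER:
        return None
    mid = len(node.keys) // 2
    pushed = node.keys[mid]
    right = Index(node.keys[mid + 1:], node.children[mid + 1:])
    node.keys = node.keys[:mid]
    node.children = node.children[:mid + 1]
    return pushed, right              # push up: key leaves this level


def insert(root, key):
    """Insert key and return the root, which changes if the root splits."""
    split = _insert(root, key)
    if split is None:
        return root
    separator, right = split
    return Index([separator], [root, right])  # tree grows one level taller


def contains(node, key):
    """Follow separators from the root down to a leaf, then check the leaf."""
    while isinstance(node, Index):
        node = node.children[bisect.bisect_right(node.keys, key)]
    return key in node.keys


def build_example():
    """Starting tree loosely modeled on the lecture example (see lead-in)."""
    leaves = [
        Leaf(["a", "am", "an", "ant"]), Leaf(["art", "be"]),
        Leaf(["best", "bit", "bob"]), Leaf(["but", "can", "cry"]),
        Leaf(["dog", "dry", "elf", "fox"]),
    ]
    return Index(["art", "best", "but", "dog"], leaves)


if __name__ == "__main__":
    root = insert(build_example(), "and")
    print(root.keys, [child.keys for child in root.children])
```

Inserting "and" overflows the first leaf, so "an" is copied up into the parent. That overflows the parent in turn, so "best" is pushed up into a new root, and the tree gains a level.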

Inserting “and↓” Example: Copy Up
§ Want to insert and↓ into its leaf; no room, so split & copy up
§ Entry to be inserted in parent node: an. (Note that key “an” is copied up and continues to appear in the leaf.)
[Figure: the tree before the insert, with root keys art, best, but, dog; the overflowing leaf splits into a↓ am↓ and an↓ and↓ ant↓]

Inserting “and↓” Example: Push Up (1/2)
§ Need to split node & push up
[Figure: the root now holds the keys an, art, best, but, dog above the leaf level]

Inserting “and↓” Example: Push Up (2/2)
§ Entry to be inserted in parent node: best. (Note that best is pushed up and only appears once in the index. Contrast this with a leaf split.)
[Figure: a new root holding best; beneath it, one index node holding an, art and another holding but, dog, above the leaf level]

Copying vs. Splitting, Summarized
§ Every keyword (search key) appears in at most one intermediate node
  § Hence, in splitting an intermediate node, we push up
§ Every inverted list entry must appear in the leaf
  § We may also need it in an intermediate node to define a partition point in the tree
  § We must copy up the key of this entry
§ Note that B+ trees easily accommodate multiple occurrences of a keyword

Virtues of the B+ Tree
§ B+ tree and other indices are quite efficient:
  § Height-balanced; log_F N cost to search
  § High fanout (F) means depth rarely more than 3 or 4
  § Almost always better than maintaining a sorted file
  § Typically, 67% occupancy on average
§ Berkeley DB library (C, C++, Java; Oracle) is a toolkit for B+ trees that you will be using later in the semester:
  § Interface: open B+ Tree; get and put items based on key
  § Handles concurrency, caching, etc.
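The claim that depth is rarely more than 3 or 4 follows from the log_F N search cost. The short calculation below works it through for a few sizes. The fanout of 134 (a node capacity of 200 at 67% occupancy) and the page counts are hypothetical numbers chosen only to show the arithmetic.

```python
def index_levels(num_leaf_pages, fanout):
    """Smallest h such that fanout**h >= num_leaf_pages.

    Uses integer arithmetic to avoid floating-point error in logarithms.
    """
    levels = 0
    reach = 1
    while reach < num_leaf_pages:
        reach *= fanout
        levels += 1
    return levels


if __name__ == "__main__":
    # Hypothetical effective fanout: capacity 200 at 67% occupancy.
    fanout = 134
    for pages in (10_000, 1_000_000, 100_000_000):
        print(pages, "leaf pages ->", index_levels(pages, fanout), "levels")
```

With these assumed numbers, even a hundred million leaf pages need only four levels of index above them, because 134 to the fourth power is roughly 322 million.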

How Do We Distribute a B+ Tree?
§ We need to host the root at one machine and distribute the rest
§ What are the implications for scalability?
  § Consider building the index as well as searching

Eliminating the Root
§ Sometimes we don’t want a tree-structured system because the higher levels can be a central point of congestion or failure
§ Two strategies:
  § Modified tree structure (e.g., BATON, Jagadish et al.)
  § Non-hierarchical structure (distributed hash table, discussed in a couple of weeks)
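As a rough preview of the non-hierarchical idea, the sketch below assigns each keyword to a machine by hashing it, so a lookup goes straight to one node without passing through a root. This is plain hash partitioning over a fixed node list, not a full distributed hash table: it has no routing, replication, or handling of nodes joining and leaving. The node names are hypothetical.

```python
import hashlib

NODES = ["node-0", "node-1", "node-2", "node-3"]  # hypothetical machines


def node_for(keyword, nodes=NODES):
    """Pick the node responsible for a keyword by hashing the keyword."""
    digest = hashlib.sha1(keyword.lower().encode("utf-8")).digest()
    return nodes[int.from_bytes(digest[:8], "big") % len(nodes)]


def partition(keywords, nodes=NODES):
    """Group keywords by the node that would store their inverted lists."""
    placement = {node: [] for node in nodes}
    for keyword in keywords:
        placement[node_for(keyword, nodes)].append(keyword)
    return placement


if __name__ == "__main__":
    print(partition(["ant", "art", "best", "dog", "fox"]))
```

Every client can compute `node_for` locally, so no single machine sees all requests. The cost is that changing the number of nodes remaps most keywords under this simple modulo scheme.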

Next Time
§ A standard medium for data interchange: XML