- Number of slides: 51
Metadata: Automated generation Thanks to: Carl Lagoze, Liz Liddy, Judith Klavans
Motivation
• In some cases metadata is important
  – Non-textual objects, especially data
  – Not just search (browse, similarity, etc.)
  – Intranets, specialized searching
  – Deep web
• Human-generated metadata is problematic
  – Expensive when professionally done
  – Flakey or malicious when non-professionally done
How much can automation help? • Trivial approaches – Page scraping and trivial parsing • Non-trivial approaches – Natural Language Processing – Machine Learning • Naïve Bayes • Support Vector Machines • Logistic Regression
DC-dot • Heuristic parsing of HTML pages to produce embedded Dublin Core Metadata • http://www.ukoln.ac.uk/metadata/dcdot/
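The kind of heuristic page parsing described above can be illustrated with a minimal sketch. This is not DC-dot's actual code: the class `_PageScraper`, the function `dublin_core_meta`, and the mapping from HTML `<title>` / `<meta>` tags to `DC.*` names are assumptions chosen for illustration.

```python
from html import escape
from html.parser import HTMLParser


class _PageScraper(HTMLParser):
    """Collects the <title> text and name/content pairs of <meta> tags."""

    def __init__(self):
        super().__init__()
        self.in_title = False
        self.title = ""
        self.meta = {}

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag == "title":
            self.in_title = True
        elif tag == "meta" and attributes.get("name") and attributes.get("content"):
            self.meta[attributes["name"].lower()] = attributes["content"]

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False

    def handle_data(self, data):
        if self.in_title:
            self.title += data


def dublin_core_meta(html_text):
    """Return embedded Dublin Core <meta> tags guessed from a page (hypothetical mapping)."""
    scraper = _PageScraper()
    scraper.feed(html_text)
    fields = {
        "DC.title": scraper.title.strip(),
        "DC.creator": scraper.meta.get("author", ""),
        "DC.subject": scraper.meta.get("keywords", ""),
        "DC.description": scraper.meta.get("description", ""),
    }
    # Only emit elements for which the heuristics found a value.
    return [
        f'<meta name="{name}" content="{escape(value, quote=True)}">'
        for name, value in fields.items()
        if value
    ]
```

Fields the page does not declare are simply omitted, which is the main weakness of such trivial approaches.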
Breaking the MetaData Generation Bottleneck • Syracuse University, U. Washington – Automatic Metadata Generation for course-oriented materials • Goal: Demonstrate feasibility of high-quality automatically generated metadata for digital libraries through Natural Language Processing • Data: Full-text resources from ERIC and the Eisenhower National Clearinghouse on Science & Mathematics • Metadata Schema: Dublin Core + Gateway for Educational Materials (GEM) Schema
Metadata Schema Elements Dublin Core Metadata Elements • Contributor • Coverage • Creator • Date • Description • Format • Identifier • Language • Publisher • Relation • Rights • Source • Subject • Title • Type GEM Metadata Elements • Audience • Cataloging • Duration • Essential Resources • Grade • Pedagogy • Quality • Standards
Method: Information Extraction
• Natural Language Processing
  – Technology which enables a system to accomplish human-like understanding of document contents
  – Extracts both explicit and implicit meaning
• Sublanguage Analysis
  – Utilizes domain and genre-specific regularities vs. full-fledged linguistic analysis
• Discourse Model Development
  – Extractions specialized for communication goals of document type and activities under discussion
Information Extraction: Types of Features
• Non-linguistic
  – Length of document
  – HTML and XML tags
• Linguistic
  – Root forms of words
  – Part-of-speech tags
  – Phrases (Noun, Verb, Proper Noun, Numeric Concept)
  – Categories (Proper Name & Numeric Concept)
  – Concepts (sense disambiguated words / phrases)
  – Semantic Relations
  – Discourse Level Components
Sample Lesson Plan Stream Channel Erosion Activity Student/Teacher Background: Rivers and streams form the channels in which they flow. A river channel is formed by the quantity of water and debris that is carried by the water in it. The water carves and maintains the conduit containing it. Thus, the channel is self-adjusting. If the volume of water, or amount of debris is changed, the channel adjusts to the new set of conditions. …. . Student Objectives: The student will discuss stream sedimentation that occurred in the Grand Canyon as a result of the controlled release from Glen Canyon Dam. …
NLP Processing of Lesson Plan
Input: The student will discuss stream sedimentation that occurred in the Grand Canyon as a result of the controlled release from Glen Canyon Dam.
Morphological Analysis: The student will discuss stream sedimentation that occurred in the Grand Canyon as a result of the controlled release from Glen Canyon Dam.
Lexical Analysis: The|DT student|NN will|MD discuss|VB stream|NN sedimentation|NN that|WDT occurred|VBD in|IN the|DT Grand|NP Canyon|NP as|IN a|DT result|NN of|IN the|DT controlled|JJ release|NN from|IN Glen|NP Canyon|NP Dam|NP .|.
NLP Processing of Lesson Plan (cont’d) Syntactic Analysis - Phrase Identification: The|DT student|NN will|MD discuss|VB
NLP Processing of Lesson Plan (cont’d)
Semantic Analysis Phase 2 - Event & Role Extraction
Teaching event: discuss
  actor: student
  topic: stream sedimentation
event: stream sedimentation
  location: Grand Canyon
  cause: controlled release
Automatically Generated Metadata
Title: Grand Canyon: Flood! - Stream Channel Erosion Activity
Grade Levels: 6, 7, 8
GEM Subjects: Science--Geology; Mathematics--Geometry; Mathematics--Measurement
Keywords:
  Proper Names: Colorado River (river), Grand Canyon (geography / location), Glen Canyon Dam (buildings&structures)
  Subject Keywords: channels, clayboard, conduit, controlled_release, cookie_sheet, cup, dam, flow_volume, hold, paper_towel, pencil, reservoir, rivers, roasting_pan, sand, sediment, streams, water
Automatically Generated Metadata (cont’d)
Pedagogy: Collaborative learning; Hands on learning
Tool For: Teachers
Resource Type: Lesson Plan
Format: text/HTML
Placed Online: 1998-09-02
Name: PBS Online
Role: onlineProvider
Homepage: http://www.pbs.org
Project CLiMB: Computational Linguistics for Metadata Building • Columbia University 2001-2004 • Extract metadata from associated scholarly texts • Use machine generation to assist expert catalogers
Problems in Image Access
• Cataloging digital images
• Traditional approach: manual expertise
  – Labor intensive
  – Expensive
  – General catalogue records not useful for discovery
• Can automated techniques help?
  – Using expert input
  – Understanding contextual information
  – Enhancing existing records
CLiMB Technical Contribution: CLiMB will identify and extract • proper nouns • terms and phrases from text related to an image: September 14, 1908, the basis of the Greenes' final design had been worked out. It featured a radically informal, V-shaped plan (that maintained the original angled porch) and interior volumes of various heights, all under a constantly changing roofline that echoed the rise and fall of the mountains behind it. The chimneys and foundation would be constructed of the sandstone boulders that comprised the local geology, and the exterior of the house would be sheathed in stained split-redwood shakes. - Edward R. Bosley. Greene & Greene. London: Phaidon, 2000. p. 127
Chinese Paper Gods Anne S. Goodrich Collection C. V. Starr East Asian Library, Columbia University
Pan-hu chih-shen God of tigers
Alex Katz American, born 1927 Six Women, 1975 Oil on canvas 114 x 282 in.
Alex Katz has developed a remarkable hybrid art that combines the aggressive scale and grandeur of modern abstract painting with a chic, impersonal realism. During the 1950s and 1960s, decades dominated by various modes of abstraction, Katz stubbornly upheld the validity of figurative painting. In major, mature works such as Six Women, the artist distances himself from his subject. Space is flattened, as are the personalities of the women, their features simplified and idealized: Katz’s models are as fetching and vacuous as cover girls. The artist paints them with the authority and license of a master craftsman, but his brush conveys little emotion or personality. In contrast to the turbulent paint effects favored by the abstract expressionist artists, Katz pacifies the surface of his picture. Through the virtuosic technique of painting wet-on-wet, he achieves a level and unifying smoothness. He further “cools” the image by adopting the casually cropped composition and overpowering size and indifference of a highway billboard or big-screen movie. In Six Women, Katz portrays a gathering of young friends at his Soho loft. The apparent informality of the scene is deceptive. It is, in fact, carefully staged. Note three pairs of figures: the foreground couple face each other; the middle ground pair alternately look out and into the picture; and the pair in the background stand at matching oblique angles. The artist also arranges the women into two conversational triangles. Katz studied each model separately, then artfully fit the models into the picture. The image suggests an actual event, but the only true event is the play of light. From the open windows, a cordial afternoon sunlight saturates the space, accenting the features of each woman. http://ncartmuseum.org/collections/offviewcaptions.shtml#alex
Segmentation • Determination of relevant segment • Difficult for Greene & Greene – The exact text related to a given image is difficult to determine – Use of text object identifier to find this text • Easy for Chinese Paper Gods and for various art collections • Decision: set initial values manually and explore automatic techniques
Text Analysis and Filtering
1. Divide text into words and phrases
2. Gather features for each word and phrase
   • E.g. Is it in the related text? Is it very frequent?
3. Develop algorithms using this information
4. Use formulae to rank for usefulness as potential metadata
What Features do we Track? • Lexical features – Proper noun, common noun • Relevancy to domain – Text Object Identifier (TOI) – Presence in the Art & Architecture Thesaurus – Presence in the back-of-book index • Statistical observations – Frequency in the text – Frequency across a larger set of texts, within and outside the domain
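How tracked features might feed a ranking formula can be shown with a minimal sketch. The CLiMB formulae are not given on the slides, so the function `rank_candidates`, the weights (+2 for a thesaurus hit, +1 for a back-of-book index hit), and the background-frequency penalty are all hypothetical.

```python
import re
from collections import Counter


def rank_candidates(text, thesaurus, index_terms, background_counts):
    """Rank single-word candidates by a hypothetical usefulness score.

    thesaurus / index_terms: sets of lower-cased terms (stand-ins for the
    Art & Architecture Thesaurus and a back-of-book index).
    background_counts: term -> count in a larger set of texts.
    """
    words = re.findall(r"[A-Za-z][A-Za-z-]+", text)
    counts = Counter(word.lower() for word in words)
    scored = []
    for term, count in counts.items():
        score = float(count)                       # frequency in the text
        if term in thesaurus:
            score += 2.0                           # assumed weight
        if term in index_terms:
            score += 1.0                           # assumed weight
        score /= 1.0 + background_counts.get(term, 0)  # penalise common words
        scored.append((term, score))
    return sorted(scored, key=lambda pair: -pair[1])
```

Dividing by the background count is one simple way to use "frequency across a larger set of texts"; a real system would likely tune or learn these weights from cataloger feedback.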
Techniques for Filtering
1. Take an initial guess
   • Collect input from users
   • Alter formulae based on feedback
2. Use automatic techniques to guess (machine-learning)
   • Collect input from users
   • Run programs to make predictions based on given opinions (Bayesian networks, classifiers, decision trees)
3. The CLiMB approach: Use both techniques!
Georgia O'Keeffe (American, 1887-1986) Cebolla Church, 1945 Oil on canvas, 20 1/16 x 36 1/4 in. (51.1 x 92.0 cm.) Purchased with funds from the North Carolina Art Society (Robert F. Phifer Bequest), in honor of Joseph C. Sloane, 72.18.1 North Carolina Museum of Art
MARC format
100 O’Keeffe, Georgia, ≠d 1887-1986.
245 Cebolla church ≠h [slide] / ≠c Georgia O’Keeffe.
260 ≠c 2003
300 1 slide : ≠b col.
500 Object date: 1945.
500 Oil on canvas.
500 20 x 36 in.
535 North Carolina Museum of Art ≠b Raleigh, N.C.
650 Painting, American ≠y 20th century.
650 Women artist ≠z United States
650 Church buildings in art.
Cebolla Church, 1945 Oil on canvas, 20 1/16 x 36 1/4 in. (51.1 x 92.0 cm.) Purchased with funds from the North Carolina Art Society (Robert F. Phifer Bequest), in honor of Joseph C. Sloane, 72.18.1 Driving through the New Mexican highlands near her home, Georgia O'Keeffe would often pass through the village of Cebolla with its rude adobe Church of Santo Niño. The artist was moved by the poignancy of the little building: its sagging, sun-bleached walls and rusted tin roof seemed so typical of the difficult life of the people. When O'Keeffe came to paint the church she addressed it directly, emphasizing its isolation and stark simplicity. Literally formed out of the earth, the building affirms the permanence and the hard, defiant patience of the people. For O’Keeffe, it symbolized human endurance and aspiration. "I have always thought it one of my very good pictures", she wrote, "though its message is not as pleasant as many others". And the question remains: What is that in the window?
MARC format with CLiMB subject terms
100 O’Keeffe, Georgia, ≠d 1887-1986.
245 Cebolla church ≠h [slide] / ≠c Georgia O’Keeffe.
260 ≠c 2003
300 1 slide : ≠b col.
500 Object date: 1945.
500 Oil on canvas.
500 20 x 36 in.
535 North Carolina Museum of Art ≠b Raleigh, N.C.
650 Painting, American ≠y 20th century.
650 Women artist ≠z United States
650 Church buildings in art.
CLiMB subject terms: New Mexican highlands; village of Cebolla; adobe Church of Santo Niño; sagging, sun-bleached walls; rusted tin roof; isolation; human endurance; window
Issues on OAI metadata for CiteSeer • Automatic metatagging for CiteSeer • Consistency/accuracy of metatags • Implementation architectures for metadata generation, storage and access – static vs dynamic • New metadata • Metadata Presentation
Automatic Metatagging
• Regular expressions
  – Generating wrappers manually using regular expression matching is tedious, time-consuming, error-prone, and not portable to a new domain
    • Widely varying format and content for the information to be extracted
    • Relies heavily on domain expertise/experience with the information to be extracted
    • Needs re-coding when a new expression appears
• Machine learning
  – Automatic OAI metadata wrapper generation
    • Learn from training samples
    • Automatic, time economical
    • Robust -- accuracy of the metatags
Metadata Exported
• Dublin Core metadata standard
• 14 elements:
  – Title
  – Creator
  – Subject: keywords
  – Description: abstract
  – Contributor
  – Publisher
  – Date: archive date
  – Type
  – Format
  – Identifier
  – Source
  – Relation
    • References
    • Is Referenced By
  – Language
  – Rights
• Missing item:
• Slide annotations (placement lost in extraction): Description of the document; Link to the original document; Relations among documents
Algorithms for MetaData Extraction: Regular Expression Match [citeseerdoc]
• Header Generation
  – The body of the document is stored in doc{bodytext}
  – if (there is a match for (Abstract | ABSTRACT | Introduction) in doc{bodytext}) {
        everything in between the beginning of doc and the match will be the header
    } else if (there is a match for (Reference | Bibliography), possibly preceded by a section number) {
        everything in between the beginning of doc{bodytext} and the occurrence [all of doc{bodytext}]
    }
  – If doc{header} as generated above is too long, it is truncated to nMaxHeaderLength and "..." is appended to it
  – If doc{header} as generated above is too short, the status of the task is Error
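The header-generation rules can be sketched in Python as follows. This is an illustration, not CiteSeer's code: the numeric limits are placeholder values, the fallback to the whole body when nothing matches is an assumption based on the "[all of doc{bodytext}]" note, and the error status is modelled as an exception.

```python
import re

N_MAX_HEADER_LENGTH = 2000  # assumed value; the slide only names nMaxHeaderLength
N_MIN_HEADER_LENGTH = 10    # assumed value; the slide only says "too short"

ABSTRACT_OR_INTRO = re.compile(r"Abstract|ABSTRACT|Introduction")
# Reference / Bibliography heading, possibly preceded by a section number.
REFERENCE_HEADING = re.compile(r"(?:\d+\.?\s*)?(?:Reference|Bibliography)")


def generate_header(bodytext):
    """Return the header portion of a document body, following the slide's rules."""
    match = ABSTRACT_OR_INTRO.search(bodytext) or REFERENCE_HEADING.search(bodytext)
    # Assumption: with no match at all, the whole body is treated as the header.
    header = bodytext[: match.start()] if match else bodytext
    if len(header) > N_MAX_HEADER_LENGTH:
        header = header[:N_MAX_HEADER_LENGTH] + "..."
    if len(header) < N_MIN_HEADER_LENGTH:
        raise ValueError("header too short")  # stands in for task status "Error"
    return header
```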
Algorithms for MetaData Extraction: Regular Expression Match
• Abstract extraction (starting from a copy of doc{bodytext}):
    if (there is a match for "abstract" later followed by a match for "Introduction") {
        everything in between is considered to be the abstract and is placed in doc{abstract}
    } else if (there is a match for "abstract" only) {
        from 100 to 1000 of the characters following the occurrence are aggregated and placed in doc{abstract}
    } else if (there is a match for the pattern "This (paper | memo | technical)(.*)\n\n") {
        everything in between the (paper | memo | technical) occurrence and the two new lines is considered to be the abstract and is placed in doc{abstract}
    } else {
        place in doc{abstract} the substring of doc{bodytext} that is located between the end of the header (as represented by doc{header}) and another 1000 characters further ahead ( and tags being removed)
    }
  – If the abstract (as stored in doc{abstract}) is too long, it is truncated to nMaxAbstractLength
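A minimal Python sketch of this cascade is given below. Several details are assumptions: "abstract" is matched case-insensitively, the "100 to 1000 characters" branch simply takes up to 1000 characters, the tag-removal step is omitted because the tag names are missing from the slide, and the value of `N_MAX_ABSTRACT_LENGTH` is a placeholder.

```python
import re

N_MAX_ABSTRACT_LENGTH = 1500  # assumed value; the slide only names nMaxAbstractLength

INTRODUCTION = re.compile(r"Introduction")
THIS_PAPER = re.compile(r"This (?:paper|memo|technical)(.*?)\n\n", re.DOTALL)


def extract_abstract(bodytext, header=""):
    """Pick an abstract from the body text using the four fallback rules."""
    abstract_match = re.search(r"abstract", bodytext, re.IGNORECASE)
    if abstract_match:
        intro_match = INTRODUCTION.search(bodytext, abstract_match.end())
        if intro_match:
            # Rule 1: everything between "abstract" and "Introduction".
            abstract = bodytext[abstract_match.end(): intro_match.start()]
        else:
            # Rule 2: a window of characters after "abstract".
            abstract = bodytext[abstract_match.end(): abstract_match.end() + 1000]
    else:
        paper_match = THIS_PAPER.search(bodytext)
        if paper_match:
            # Rule 3: text after "This paper/memo/technical" up to a blank line.
            abstract = paper_match.group(1)
        else:
            # Rule 4: the 1000 characters that follow the header.
            abstract = bodytext[len(header): len(header) + 1000]
    return abstract.strip()[:N_MAX_ABSTRACT_LENGTH]
```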
Algorithms for MetaData Extraction: Regular Expression Match
• Body Text and Reference Extraction
  – Localizes the start of the reference part: this is defined to be the last occurrence of one of the following words (followed or not by any number of tags of the form <[^>]*>): REFERENCES | References | BIBLIOGRAPHY | Bibliography | REFERENCES AND NOTES
  – This position identifies the substring that is to be put in doc{bodytext} and the substring that is to be put in doc{citetext}
  – Removes (occurrences of) appendices / acknowledgements from doc{citetext} (removes the word plus everything in between the new line and the next line)
  – Removes (occurrences of) figures / tables
  – Removes "extraneous garbage" after citations, that is, removes groups consisting of one new line followed by 0 to 40 characters that occur 10 times
  – Removes form feeds in doc{citetext}, i.e. occurrences of '\f'
  – If both doc{bodytext} and doc{citetext} are defined and have reached the minimum length (respectively nMinBodyTextLength and mMinCiteTextLength), then returns 0
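The first steps of this procedure, locating the last reference heading and splitting the text, might look like the sketch below. The function name `split_body_and_citations` is hypothetical, and only the split and the form-feed removal are shown; the appendix, figure, and "garbage" clean-up steps are left out.

```python
import re

# Longest alternative first so "REFERENCES AND NOTES" is not cut short.
REFERENCE_HEADING = re.compile(
    r"(?:REFERENCES AND NOTES|REFERENCES|References|BIBLIOGRAPHY|Bibliography)"
    r"(?:<[^>]*>)*"
)


def split_body_and_citations(text):
    """Split a document at the LAST reference heading into (bodytext, citetext)."""
    last_match = None
    for match in REFERENCE_HEADING.finditer(text):
        last_match = match
    if last_match is None:
        return text, ""
    bodytext = text[: last_match.start()]
    citetext = text[last_match.end():].replace("\f", "")  # drop form feeds
    return bodytext, citetext
```

Using the last occurrence rather than the first avoids splitting at an in-text mention such as "see References".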
Algorithms for MetaData Extraction: Regular Expression Match
• MakeReferenceList: the following operations are performed to identify how citations are marked in the document
  – removes page breaks
  – determines marker type by counting the number of occurrences of the following and comparing each count to nLines/5
    1. Bracket Square (reg. exp. is: m/\n\s*\[/sg)
    2. Bracket Parenthesis
    3. Naked Number
    4. Naked Number with Dot
  – determines yeartype
  – determines if cites are cleanly spaced in the text (i.e. more than nLines/5 cites per group of 2 to 5 lines)
  – determines format of cite start and cite end
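The marker-type test can be sketched as below. Only the square-bracket pattern comes from the slide; the other three patterns, the marker names, and the rule of picking the most frequent marker whose count exceeds nLines/5 are assumptions made for illustration.

```python
import re

MARKER_PATTERNS = {
    "bracket_square": re.compile(r"\n\s*\["),             # from the slide
    "bracket_parenthesis": re.compile(r"\n\s*\("),        # assumed
    "naked_number_dot": re.compile(r"\n\s*\d+\."),        # assumed
    "naked_number": re.compile(r"\n\s*\d+\b(?!\.)"),      # assumed
}


def detect_marker_type(citetext):
    """Guess how citations are marked, or return None if no marker is frequent enough."""
    n_lines = citetext.count("\n") + 1
    threshold = n_lines / 5
    counts = {
        name: len(pattern.findall(citetext))
        for name, pattern in MARKER_PATTERNS.items()
    }
    best = max(counts, key=counts.get)
    return best if counts[best] > threshold else None
```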
Algorithms for MetaData Extraction: Regular Expression Match
• Processes a single citation
  – The reference length, that is the length of item{text}, must be within nMinReferenceLength and nMaxReferenceLength
  – Removes extra spaces and carriage returns and stores the resulting reference text in item{text}
  – Check validity of citation: underlying call to parse::ValidCite to validate the citation; if the returned value is false, returns
  – Extract title information: underlying call to cite::GuessTitle to get the title of the paper/document to which this citation refers
  – Extract ID information: underlying call to parse::FindID; returns if it fails
  – Extract Year information: underlying call to parse::FindYear
  – Extract Context information: underlying call to parse::FindContextMatches; the result of this call (an array) is serialized to a string (separator is the character |) and stored in item{position}
  – Updates the initial parameter array by pushing the citation hash in the array items
Algorithms for MetaData Extraction: Regular Expression Match
• ValidCite: tests the validity of the citation and returns the result of the test
  – 4 tests to decide whether this citation, as represented by the hash, is valid:
    • if item{text} matches /^(proceeding | in proc)/i, returns 0 (this should be unnecessary once the citation delineation is improved)
    • if there is too little punctuation (count of commas, points and double quotes compared to (length of item{text})/50), returns 0
    • if two different years occur in the citation, returns 0
    • if there is a lack of words (at least two words of at least 3 letters each are expected), returns 0
    • returns 1 otherwise, that is, the citation is valid and has successfully passed all the tests above
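The four tests translate almost directly into Python. In this sketch the direction of the punctuation comparison (fewer marks than length/50 means invalid) and the letters-only definition of a "word" are assumptions; the function name `valid_cite` simply mirrors ValidCite.

```python
import re

YEAR = re.compile(r"\b(?:19|20)\d\d\b")


def valid_cite(text):
    """Return 1 if the citation text passes all four tests, else 0."""
    # Test 1: a fragment starting with "proceeding" / "in proc" is a mis-split citation.
    if re.match(r"(?:proceeding|in proc)", text, re.IGNORECASE):
        return 0
    # Test 2: too little punctuation relative to the length of the text.
    punctuation = sum(text.count(mark) for mark in ',."')
    if punctuation < len(text) / 50:
        return 0
    # Test 3: two different years suggest two citations merged into one.
    if len(set(YEAR.findall(text))) >= 2:
        return 0
    # Test 4: at least two words of at least three letters are expected.
    if len(re.findall(r"[A-Za-z]{3,}", text)) < 2:
        return 0
    return 1
```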
Algorithms for MetaData Extraction: Regular Expression Match
• FindID: set ID field (citation ID tag); returns 1 on error
  – Description: tries to locate the citation ID (in the document) using one of the following regular expressions:
    • \[[^\]]+\] : e.g. [3] or [Omlin 93]
    • \([^\)]+\) : e.g. (3) or (Omlin 93)
    • (\d+)\.? : e.g. 3.
  – If none of the above work, another attempt is made by extracting the year from the author field and setting this as the ID
  – If all attempts fail, returns 1
  – Returns 0 otherwise
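A minimal sketch of FindID follows. It assumes the marker sits at the start of the citation text and stores the whole match as the ID; the fallback that derives an ID from the year in the author field is not shown. The dictionary `item` stands in for the citation hash.

```python
import re

ID_PATTERNS = [
    re.compile(r"\[[^\]]+\]"),  # [3] or [Omlin 93]
    re.compile(r"\([^)]+\)"),   # (3) or (Omlin 93)
    re.compile(r"\d+\.?"),      # 3.
]


def find_id(item):
    """Set item["id"] from the citation text; return 0 on success, 1 on failure."""
    text = item["text"].lstrip()
    for pattern in ID_PATTERNS:
        match = pattern.match(text)  # assumption: the marker starts the citation
        if match:
            item["id"] = match.group(0)
            return 0
    return 1
```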
Algorithms for MetaData Extraction: Regular Expression Match
• FindYear: set year field
    if (year type has been set to "pastmarker" in doc and there is only one year occurrence in item{text} (pattern for year is \b(19|20)\d\d\b)) {
        item{year} is set to the year found
    } else {
        matches text with year pattern and sets item{year} to this year
    }
Architectures for Exporting Metadata
• Database Intermediated Metadata Database
  – For example, MySQL
    • Store the metadata records in the OAI XML format
    • A set of indexes
• Real-time Metadata Generation
  – No metadatabase
  – Generate the metadata records on the fly from the document database and assemble them into XML metadata format in response to the OAI protocol requests
• Integrated Real-time Metadata Generation
  – Integrate the metadata CGI with the CiteSeer CGI
  – Put the metadata exporting module as part of the CiteSeer package
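For the real-time option, assembling a record on the fly could look like the sketch below. It builds only the `oai_dc:dc` metadata element of an OAI response using the standard OAI and Dublin Core namespaces; the function `record_to_oai_dc`, the shape of the `doc` dictionary, and the chosen subset of fields are assumptions, not CiteSeer's implementation.

```python
import xml.etree.ElementTree as ET

OAI_DC = "http://www.openarchives.org/OAI/2.0/oai_dc/"
DC = "http://purl.org/dc/elements/1.1/"
ET.register_namespace("oai_dc", OAI_DC)
ET.register_namespace("dc", DC)

FIELDS = ("title", "creator", "subject", "description", "date", "identifier")


def record_to_oai_dc(doc):
    """Serialize one document record (a dict, hypothetical shape) as oai_dc XML."""
    root = ET.Element(f"{{{OAI_DC}}}dc")
    for field in FIELDS:
        values = doc.get(field, [])
        if isinstance(values, str):
            values = [values]
        for value in values:  # repeatable elements, e.g. several creators
            ET.SubElement(root, f"{{{DC}}}{field}").text = value
    return ET.tostring(root, encoding="unicode")
```

The same function could just as well fill a metadata database ahead of time; the trade-off on the slide is between storage and index maintenance on one side and per-request generation cost on the other.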
Developing Techniques for Automatic Metatagging • Automatic metatagging using SVM • Generic Summarization and Keywords Extraction using Mutual Reinforcement Principle and Sentence Clustering [zhaSumarize02]
Potentially Taggable Data
• Taggable [SVM]
  – Title
  – Creators
  – Address of the creators
  – Affiliation of the creators
  – Keywords
  – Abstract
  – Publisher
  – Archive date
  – Publish year
  – References
• Content descriptive metadata [zhaSumarize02]
  – Subject (keywords)
  – Description (abstract)
Automatic Metatagging using SVM • Classification approach for adaptive Information Extraction • Adaptive IE: to create systems adaptable to new applications/domains by using only an analyst’s knowledge, i.e. knowledge about the domain/scenario [AKTweb] • Classification approach allows systems to adapt to a new domain simply by using a standard set of features [ChieuIE02]
Metatagging using SVM – Basic Model
• Basic model:
  – Take the line as the unit to be classified
  – Classify each line into one or more of 14 classes
    • Title, author, affiliation, address, abstract, keyword, …
  – 14 classifiers - “one class vs. all others”
  – Feature Generation:
    • Convert the raw features to linguistic features
      – Domain databases (dictionary, city, country, state, postcode, name list)
      – Class top words (choose top words or phrases using DF thresholding from each of the 14 classes)
    • Categorical features: line position, line length, …
  – Feature Normalization
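Feature generation for one line might be sketched as follows, before any SVM is trained. The function `line_features`, the feature names, and the tiny lexicons in the usage example are hypothetical stand-ins for the domain databases; normalization is done here simply by expressing lexicon hits as a fraction of the words on the line.

```python
import re


def line_features(lines, index, lexicons):
    """Build a feature dict for lines[index].

    lexicons: name -> set of lower-cased words (stand-ins for the domain
    databases such as name lists or city lists).
    """
    words = re.findall(r"[A-Za-z]+", lines[index].lower())
    n_words = max(len(words), 1)
    features = {
        # Relative position of the line in the header, from 0.0 to 1.0.
        "line_position": index / max(len(lines) - 1, 1),
        "line_length": len(words),
    }
    for name, vocabulary in lexicons.items():
        hits = sum(word in vocabulary for word in words)
        features[f"frac_{name}"] = hits / n_words  # normalized to [0, 1]
    return features
```

Each of the 14 one-vs-rest classifiers would then be trained on vectors built from such dictionaries.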
Metatagging using SVM – Iterative Convergence Procedure
• Exploit sequence information
  – Normally the title is ahead of the authors
  – Abstract lines are continuous
• Algorithm:
  Step 1. Use the previous and next N lines’ tags as part of the current line’s features
  Step 2. Train SVM classifiers on this expanded set of features and get the new set of predicted tags for each line
  Step 3. Go to step 1 if the convergence constraint is not satisfied
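The loop structure of the three steps can be sketched independently of the classifier. Here `train_and_predict` is a hypothetical callable standing in for training the SVMs and predicting tags, and "tags no longer change" is an assumed convergence constraint, since the slide does not state one.

```python
def iterative_tagging(base_features, initial_tags, train_and_predict, n=1, max_iter=10):
    """Re-tag lines until the tags stop changing (assumed convergence test).

    base_features: one feature dict per line.
    train_and_predict: callable taking the expanded feature dicts and
    returning a list with one tag per line.
    """
    tags = list(initial_tags)
    for _ in range(max_iter):
        expanded = []
        for i, features in enumerate(base_features):
            row = dict(features)
            # Step 1: add the tags of the previous and next n lines as features.
            for offset in range(1, n + 1):
                row[f"prev{offset}_tag"] = tags[i - offset] if i - offset >= 0 else "NONE"
                row[f"next{offset}_tag"] = tags[i + offset] if i + offset < len(tags) else "NONE"
            expanded.append(row)
        # Step 2: train on the expanded features and get new predictions.
        new_tags = list(train_and_predict(expanded))
        # Step 3: stop when nothing changed, otherwise iterate again.
        if new_tags == tags:
            break
        tags = new_tags
    return tags
```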
Metatagging using SVM– extract metadata out of each line • Why: – One line could contain multiple authors – Author name, address and affiliation could be on the same line • Possible Techniques: – SVM Classification – HMM – Other chunking methods
Generic Summarization and Keywords Extraction using Mutual Reinforcement Principle and Sentence Clustering [zhaSumarize02]
• Each document is composed of two sets of objects:
  – A set of terms $T = \{t_1, \dots, t_n\}$
  – A set of sentences $S = \{s_1, \dots, s_n\}$
• Bipartite graph from T and S: if the term $t_i$ appears in sentence $s_j$, we create an edge from $t_i$ to $s_j$; $w_{ij}$ indicates the weight on the edge $(t_i, s_j)$.
• Mutual reinforcement principle (each sum runs over the edges of the graph):
  $$u(t_i) \propto \sum_{v(s_j) \sim u(t_i)} w_{ij}\, v(s_j), \qquad v(s_j) \propto \sum_{u(t_i) \sim v(s_j)} w_{ij}\, u(t_i)$$
  – u is the saliency score vector for terms; v is the saliency score vector for sentences
  – A term should have a high saliency score if it appears in many sentences with high saliency scores, while a sentence should have a high saliency score if it contains many terms with high saliency scores
• Metadata Output: Keywords, Keyphrases and Descriptions (salient sentences)
  – Hierarchical keywords, keyphrases
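One simple way to compute scores that satisfy the mutual reinforcement principle is alternating power iteration on the weight matrix. The sketch below assumes that approach; the function name, the fixed iteration count, and the L2 normalization are illustrative choices rather than the method of [zhaSumarize02], and sentence clustering is not shown.

```python
import math


def mutual_reinforcement(weights, iterations=50):
    """Return (u, v): saliency scores for terms and sentences.

    weights[i][j] is the weight w_ij of the edge between term i and
    sentence j (0 if the term does not appear in the sentence).
    """
    n_terms, n_sentences = len(weights), len(weights[0])
    u = [1.0] * n_terms
    v = [1.0] * n_sentences
    for _ in range(iterations):
        # A term is salient if it appears in salient sentences.
        u = [sum(weights[i][j] * v[j] for j in range(n_sentences)) for i in range(n_terms)]
        norm = math.sqrt(sum(x * x for x in u)) or 1.0
        u = [x / norm for x in u]
        # A sentence is salient if it contains salient terms.
        v = [sum(weights[i][j] * u[i] for i in range(n_terms)) for j in range(n_sentences)]
        norm = math.sqrt(sum(x * x for x in v)) or 1.0
        v = [x / norm for x in v]
    return u, v
```

The top-scoring terms can then be offered as keywords and the top-scoring sentences as the description.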
Existing Ontologies for Documents • Document Ontology: An ontology that models documents, particularly publications. – SHOE Ontology http://www.cs.umd.edu/projects/plus/SHOE/onts/index.html#doc • Many of the document types were borrowed from the Structuralist Dublin Core Resource Types proposal. Others were borrowed from the PubMed document classifications. – DAML Ontology http://www.daml.org/ontologies/62.html • DAML version of a SHOE ontology
SHOE Ontology for Publications
• Category:
  – Document
    • Publication
      – Article
        » BookArticle
        » ConferencePaper
        » JournalArticle
        » WorkshopPaper
      – Book
      – TechnicalReport
      – Thesis
        » DoctoralThesis
        » MastersThesis
• Relationships:
  – author(Document, Person)
  – authorOrg(Document, Organization)
  – containedIn(Document, Document)
  – publishDate(Document, .DATE)
  – publisher(Document, Organization)
  – subject(Document, .SHOEEntity)
  – title(Document, .STRING)
  – volume(Periodical, .NUMBER)
BibTex Ontology • BibTex Ontology – View publications as a citation list http://www.daml.org/publications/bibtex-ont.daml • Example of the BibTex Ontology http://www.daml.org/publications/sample-publications.daml