3eb2c1a8241f4bbf062f4bde0a218779.ppt
- Number of slides: 50
CLiMB: Computational Linguistics for Metadata Building
Center for Research on Information Access
Columbia University Libraries
January 21, 2003
CLiMB - Columbia University

Overall Goals
• Research: Development of richer retrieval through increased numbers of descriptors
• Research and Practice: Creation of enabling technologies for new large digitization projects
• Research and Practice: Expanded capability for cross-collection searching
• Practice: Development of a suite of CLiMB tools
• Resources: Vocabulary list that can be used by other visual resource professionals
The essence of CLiMB:
• Use scholars themselves as “catalogers” by utilizing scholarly publications
• Enhance existing descriptive metadata

Computational Linguistic Techniques
• What techniques have we tried?
– Goal: Identify high quality metadata terms
– Goal: Use metadata for finding images
• How well have they worked?
• What else do we want to try?

Text about Images
The Blacker House is known for its porte cochère and adjacent terraces. Samuel Parker Williams, an occasional Greene collaborator, worked on the site, particularly on the sandstone boulder foundation for the sleeping porch.
-- Based on Bosley

Techniques We Have Tried
Supervised (using existing resources)
– Matching algorithms: proper names & variants
– Back of book index analysis
– Composite list of terms from authoritative lists
Unsupervised
– Part of speech tagging
– Noun phrase identification
– Proper noun identification

What about LSI?
• Latent Semantic Indexing
• Builds a representation of a document
• Effective in information retrieval
• Why not for CLiMB?
– LSI is useful for text query and document retrieval
– LSI, a statistical technique, removes phrasal info
– CLiMB needs high quality phrases
– May be useful in later stages

Indexing for What Purpose
• Index = find important terms and phrases
– sleeping porch
– occasional collaborator
– sandstone boulder foundation
• Index = characterize a document with a set of terms that occur in the doc
– sleep*, porch, occas*, collaborat*, foundat*
– enables location of docs with a similar profile

Finding Similar Documents
• Linear Algebra Techniques
– Latent Semantic Indexing
  • Singular Value Decomposition (SVD)
– Semidiscrete Decomposition
• Vector Space Models
– Term by Document matrices
– Term Weighting
– Polysemy and Synonymy
• Clustering Techniques
– K-means
– EM Clustering
– Wavelet

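As a rough illustration of the vector space idea listed above, the sketch below builds raw term-count vectors for three toy documents and compares them with cosine similarity. The documents, the whitespace tokenizer, and the unweighted counts are assumptions made for the example; the slides do not specify a weighting scheme.

```python
import math
from collections import Counter

def term_vector(text):
    """Bag-of-words term counts for one document (lowercased, whitespace split)."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

# Toy documents (invented for illustration): each is one column of a
# term-by-document matrix.
docs = {
    "d1": "sleeping porch sandstone boulder foundation",
    "d2": "sandstone foundation sleeping porch terrace",
    "d3": "kitchen god paper print",
}
vectors = {name: term_vector(text) for name, text in docs.items()}

sim_12 = cosine(vectors["d1"], vectors["d2"])  # four of five terms shared
sim_13 = cosine(vectors["d1"], vectors["d3"])  # no terms shared
print(sim_12, sim_13)
```

Documents with overlapping vocabulary score close to 1, and documents with no shared terms score 0, which is the sense in which a term profile "enables location" of similar documents.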
Computational Linguistic Techniques
• What techniques have we tried?
– Goal: Identify high quality metadata terms
– Goal: Load metadata into image search database
– Goal: Use enriched metadata for finding images
• How well have they worked?
• What else do we want to try?

Art Object Identification (AO-ID)
• Need Unique Identifiers
– Key of database records
• Varies from collection to collection
– Greene & Greene: Project Names
– Chinese Paper Gods: God Names
– South Asian Temples: Temple Names

• Compile list of subject vocabulary
• Find meaningful terms in texts
• Segment relevant texts
• Collect terms from all sources
• Identify and link AO-ID described in text
• Determine term relationships
• Extract metadata
• Insert into existing metadata records
• Mount in image search platform
• Process queries and evaluate

Create Composite List of Subject Terms
Philosophy: Use whatever resources exist
• Catalog records
– Robert R. Blacker house (Pasadena, Calif.)
– Greene, Charles Sumner
– Blacker, Robert R.
• Art and Architecture Thesaurus
– porte cochère
• Back of the book index
– Blacker house

Progress – Composite List
• Greene & Greene
– Extracted back of the book indexes
– Direct matching of index terms to the text
• Terms found (highlighted in yellow)
– David Gamble
– Pasadena
– Westmoreland Place
– furniture

Three Term Types and Approaches
1) Art Object ID names and other proper nouns important to the domain (Charles Pratt)
– Named Entity noun phrase finders, POS taggers
2) Common noun terms, semantically significant to the domain (V-shaped plan)
– List of domain terms from authority sources
3) Common noun phrases in a generic domain vocabulary (chimney)
– Statistical methods for identifying relevant terms

Part of Speech (POS) taggers
• Why use a part of speech tagger?
– To identify nouns, verbs and proper nouns
• The Blacker House is known for its porte cochère…
– <Determiner>The
– <Proper_Noun>
  • <Singular_Proper_Noun>Blacker
  • <Singular_Proper_Noun>House
– <Verb_Present>is
– <Verb_Past_Participle>known
– <Preposition>for
– <Possessive_Pronoun>its
– <Adjective>adjacent
– <Noun_Plural>terraces

Part of Speech (POS) taggers
• Strength: An essential step that allows the rest of the system to work
• Weakness: The best POS taggers have 95% accuracy
– A typical 20-word sentence is likely to have a mistake!
• But: some errors do not matter much
– E.g. sleeping porch

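The remark about 20-word sentences can be checked with a short calculation. The sketch below assumes that tagging errors are independent from word to word, which the slide does not state and which real taggers only approximate.

```python
def p_at_least_one_error(per_word_accuracy, n_words):
    """Probability that a sentence contains at least one tagging error,
    assuming each word is tagged correctly independently of the others."""
    return 1.0 - per_word_accuracy ** n_words

p = p_at_least_one_error(0.95, 20)
print(round(p, 3))  # about 0.64
```

Under that assumption, roughly two out of three 20-word sentences would contain at least one mistagged word, even with 95% per-word accuracy.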
What We Tried: POS Taggers
• Mitre Alembic Workbench
– Freeware from Mitre Corporation
– Strong for proper nouns
– Average for common nouns
• IBM’s Nominator
– Accurate for both
– Restrictive licensing

Proper Nouns
• Alembic Workbench Results
– 91.2% recall
  • Misses The senior Pratt, Hall brothers
– 97.5% precision using Alembic
  • Successfully finds William Issac Ott, University of California
• This is very good!
• Highlighted in light green
– Mary Greene
– Persian
– Etc.

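For readers less familiar with the two measures, the sketch below shows how recall and precision are computed from a set of predicted names and a hand-labelled gold set. The two small sets are a toy example assembled from names on the slide plus one invented false positive; they are not the project's evaluation data and do not reproduce the reported figures.

```python
def precision_recall(predicted, gold):
    """Precision = correct predictions / all predictions;
    recall = correct predictions / all gold items."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall

# Toy data: a hypothetical hand-labelled gold set and tagger output.
gold = {"William Issac Ott", "University of California",
        "The senior Pratt", "Hall brothers"}
predicted = {"William Issac Ott", "University of California",
             "Sleeping Porch"}  # invented false positive

precision, recall = precision_recall(predicted, gold)
print(precision, recall)
```

High precision with somewhat lower recall, as reported for Alembic, means that nearly everything the tagger marks is a real proper noun, while some genuine names are still missed.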
Noun Phrase Chunking
[The [Blacker House]] is known for [[its Porte Cochère] and [adjacent terraces]]. [Samuel Parker Williams], [an occasional Greene collaborator], worked on [the site], particularly on [the [[sandstone boulder] foundation]] for [the [sleeping porch]].
-- Based on Bosley

NP Chunkers
• Columbia’s LinkIT
– Regular expression grammar over POS tags
– Improves Workbench results through finding simplex NPs
• LTChunk
– By LTG Group, University of Edinburgh
– Not as many NPs
• Arizona – commercialized
• IBM – also commercial

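To make "regular expression grammar over POS tags" concrete, here is a minimal sketch of the general technique. It is not LinkIT's actual grammar: the pattern (optional determiner, any adjectives, one or more nouns), the Penn Treebank-style tag names, and the hand-tagged input are all assumptions chosen for illustration.

```python
import re

# Map each POS tag to a one-character code so that a regular expression
# can run over the tag sequence. Unlisted tags become "x".
TAG_CODES = {"DT": "D", "PRP$": "D", "JJ": "J",
             "NN": "N", "NNS": "N", "NNP": "N", "NNPS": "N"}

# Simplex NP: optional determiner, any adjectives, one or more nouns.
NP_PATTERN = re.compile(r"D?J*N+")

def chunk_noun_phrases(tagged):
    """Return the simplex noun phrases in a list of (word, tag) pairs."""
    codes = "".join(TAG_CODES.get(tag, "x") for _, tag in tagged)
    return [" ".join(word for word, _ in tagged[m.start():m.end()])
            for m in NP_PATTERN.finditer(codes)]

# Hand-tagged input; a real pipeline would take this from a POS tagger.
tagged_sentence = [
    ("The", "DT"), ("Blacker", "NNP"), ("House", "NNP"),
    ("is", "VBZ"), ("known", "VBN"), ("for", "IN"),
    ("its", "PRP$"), ("porte", "NN"), ("cochère", "NN"),
    ("and", "CC"), ("adjacent", "JJ"), ("terraces", "NNS"),
]
phrases = chunk_noun_phrases(tagged_sentence)
print(phrases)
# ['The Blacker House', 'its porte cochère', 'adjacent terraces']
```

Because the pattern operates on tags rather than words, a tagging error changes the chunk boundaries, which is one reason tagger accuracy matters for the later steps.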
Results: Proper Nouns

Results: NP Chunking
• Highlighted in purple:
– The design process
– The southwest adobe-stucco
– July 1907

Experiments with Algorithms
• TF/IDF and term frequency ratios
– Filter technical terms from frequent common nouns
– Term frequency ratio algorithm to improve accuracy
• Co-occurrence
– Useful terms may appear near other good ones
• Machine learning
– Use learning algorithms to discover complex associational context

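The slides do not define the term frequency ratio, so the sketch below uses one common formulation as an assumption: a term's relative frequency in the domain text divided by its relative frequency in a general reference text. The two token lists and the `floor` smoothing value are invented for the example.

```python
from collections import Counter

def relative_freqs(tokens):
    """Relative frequency of each token in a token list."""
    counts = Counter(tokens)
    total = len(tokens)
    return {term: count / total for term, count in counts.items()}

def frequency_ratios(domain_tokens, general_tokens, floor=1e-6):
    """Ratio of domain to general relative frequency for every domain term.
    `floor` (a hypothetical smoothing value) avoids division by zero for
    terms that never occur in the general text."""
    domain = relative_freqs(domain_tokens)
    general = relative_freqs(general_tokens)
    return {term: freq / max(general.get(term, 0.0), floor)
            for term, freq in domain.items()}

domain_tokens = "the porch and the terrace face the garden porch".split()
general_tokens = "the cat and the dog saw the garden and the house".split()

ratios = frequency_ratios(domain_tokens, general_tokens)
for term, ratio in sorted(ratios.items(), key=lambda item: -item[1]):
    print(term, round(ratio, 2))
```

Terms such as "porch" that are frequent in the domain text but rare elsewhere receive a high ratio, while function words such as "the" stay near 1, which is how a ratio can separate technical terms from frequent common words.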
What is Segmentation?
• Divide texts into cohesive chunks
• Needed for determining associational context
• Needed to determine what terms are related to an art object

Results: Segmentation
• Use the frequency with which our terms appear within a document to estimate where the document is about that term
• This graph shows where different names are mentioned in Bosley on Greene & Greene, Ch. 5

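One simple way to turn mention frequency into segments is sketched below: count mentions of a name per paragraph, then group consecutive paragraphs that mention it. The paragraph granularity, the `min_mentions` threshold, and the sample paragraphs are assumptions for illustration and are not taken from the Bosley text or from the project's segmenter.

```python
def mention_profile(paragraphs, name):
    """Number of case-insensitive mentions of `name` in each paragraph."""
    needle = name.lower()
    return [paragraph.lower().count(needle) for paragraph in paragraphs]

def segments_about(paragraphs, name, min_mentions=1):
    """Group consecutive paragraphs mentioning `name` into
    (first_index, last_index) ranges."""
    profile = mention_profile(paragraphs, name)
    segments, start = [], None
    for i, count in enumerate(profile):
        if count >= min_mentions and start is None:
            start = i                        # a new segment opens here
        elif count < min_mentions and start is not None:
            segments.append((start, i - 1))  # the segment closed at i - 1
            start = None
    if start is not None:
        segments.append((start, len(profile) - 1))
    return segments

# Made-up paragraphs for illustration only.
paragraphs = [
    "The Blacker house has a porte cochère.",
    "Blacker terraces adjoin it.",
    "The Gamble house is in Pasadena.",
    "The Blacker sleeping porch rests on sandstone.",
]
blacker_segments = segments_about(paragraphs, "Blacker")
print(blacker_segments)  # [(0, 1), (3, 3)]
```

Plotting the per-paragraph profile for several names at once would give a picture similar in spirit to the graph described on this slide.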
What We’ve Tried: Segmenters
• Marti Hearst’s TextTiling
– Performs well for a general algorithm, but not sufficient for this specialized task
– M. Hearst, ACL, 1993
• F. Choi’s C99 segmenter
– Performance comparable to TextTiling
– F. Y. Y. Choi, NAACL, 2000
• Frequency ratio approach outperformed TextTiling
• In-house tool to be tested
– Kan & Klavans, WVLC-6, 1998, Segmenter

Meronymy as “Part-Of”
• Why is this potentially useful?
– A method for identifying “hot” paragraphs
• Descriptive text contains “part of” relations
• Details that correlate to the whole
– Porch is a part of house
• An early hypothesis – in testing stages

Meronymy for Cohesion
The Spinks house design is an elaboration of the rectangular, large-gabled form of the “California House” …. has … porches and terraces. In front, an expanse of …lawn rises nearly to the level of the entry terrace…. The front door is approached obliquely in the shaded recess of the terrace….

Meronymy and Other Relations
[Diagram. Node labels: The California House, Other Houses, Spinks House, porch, terrace, entry terrace, front entry, front door]

Progress – Project Name Matching
• Finding project names in Greene & Greene
• Challenge: finding variations
– AO-ID
– Robert Roe Blacker House
– RRB House
– The house
– 1214 Fairlawn Terrace
• Possible techniques to improve matching
– Developing a semi-automatic technique
– Use existing information to label text
– An iterative platform for manual intervention

Variants of The Culbertson House
• Cordelia A. Culbertson house (Pasadena, Calif.)
• Francis F. Prentiss house (Pasadena, Calif.)
• Culbertson sisters house (Pasadena, Calif.)
• Prentiss, Francis F.
• Culbertson, Cordelia A.
• Allen, Elizabeth S.
• Allen, Mrs. Dudley P.
• House was purchased by Allen, who remarried and became Prentiss!

Zaoshen (Chinese deity)
• USE FOR: Dingfuzhenjun (Chinese deity)
• USE FOR: Kitchen God (Chinese deity)
• USE FOR: Simingzaojun (Chinese deity)
• USE FOR: Simingzaoshen (Chinese deity)
• USE FOR: Ssu-ming-tsao-chün (Chinese deity)
• USE FOR: Ssu-ming-tsao-shen (Chinese deity)
• USE FOR: Ting-fu-chen-chün (Chinese deity)
• USE FOR: Tsao-shen (Chinese deity)
• USE FOR: Tsao-wang-yeh (Chinese deity)
• USE FOR: Zaojun (Chinese deity)
• USE FOR: Zaowang (Chinese deity)
• REFERENCE: Encyc. Britannica (Tsao Shen, pinyin Zao Shen, in Chinese mythology, the god of the kitchen (god of the hearth), who is believed to report to the celestial gods on family conduct and have it within his power to bestow poverty or riches on individual families; has also been confused with Ho Shen (god of fire) and Tsao Chün (Furnace Prince))

Some Data to Illustrate
• Unaltered Project Names
– 0 matches (both case sensitive and insensitive)
• Case Insensitive Project Name matching
– 4 matches
– {Theodore Irwin house} occurs 1 time
– {California Institute of Technology} occurs 1 time
– {William R. Thorsen house} occurs 1 time
– {William T. Bolton house} occurs 1 time
• At least double that number appear in the chapter

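The effect of case sensitivity and of name variants on match counts can be sketched as follows. The variant table, the choice of "Robert Roe Blacker House" as the identifier key, and the sample sentence are assumptions for illustration; the generic variant "The house" is left out because it would match any house.

```python
import re

def count_variant_matches(text, variants, ignore_case=True):
    """Count whole-phrase occurrences of every variant form, grouped by
    the identifier the variants belong to."""
    flags = re.IGNORECASE if ignore_case else 0
    counts = {}
    for ao_id, forms in variants.items():
        total = 0
        for form in forms:
            # Lookarounds keep a form from matching inside a longer word.
            pattern = re.compile(r"(?<!\w)" + re.escape(form) + r"(?!\w)", flags)
            total += len(pattern.findall(text))
        counts[ao_id] = total
    return counts

# Hypothetical variant table keyed by an assumed identifier.
VARIANTS = {
    "Robert Roe Blacker House": [
        "Robert Roe Blacker House", "RRB House", "1214 Fairlawn Terrace",
    ],
}

# Invented sample text.
text = ("The robert roe blacker house stands at 1214 Fairlawn Terrace. "
        "The RRB house has a porte cochère.")

insensitive = count_variant_matches(text, VARIANTS, ignore_case=True)
sensitive = count_variant_matches(text, VARIANTS, ignore_case=False)
print(insensitive, sensitive)
```

In this toy text the case-insensitive pass finds three mentions and the case-sensitive pass only one, which mirrors the kind of gap the slide reports between matching strategies.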
A Future Solution
• Bootstrapping algorithm
– Seed terms hand labelled
– Terms mapped into multi-dimensional feature space
– Other terms that are close to the seed terms are added to the set
• Features:
– Window size
– Headedness
– Modifier similar to that of a seed term

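The bootstrapping loop described above might look roughly like the sketch below. It is only an illustration of the general idea: Jaccard overlap between feature sets stands in for "closeness" in the feature space, and the feature names, the threshold, and the candidate terms are all hypothetical.

```python
def jaccard(a, b):
    """Overlap between two feature sets, from 0.0 (disjoint) to 1.0 (equal)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def bootstrap(seeds, candidates, features, threshold=0.5, max_rounds=10):
    """Grow the seed set: in each round, accept every candidate that is
    close enough to any term accepted so far; stop when nothing is added."""
    accepted = set(seeds)
    for _ in range(max_rounds):
        added = {
            c for c in candidates
            if c not in accepted
            and any(jaccard(features[c], features[s]) >= threshold
                    for s in accepted)
        }
        if not added:
            break
        accepted |= added
    return accepted

# Hypothetical features: head word, modifier type, and nearby words.
features = {
    "Blacker house": {"head:house", "mod:proper-name", "near:terrace"},
    "Gamble house": {"head:house", "mod:proper-name", "near:terrace",
                     "near:furniture"},
    "Thorsen house": {"head:house", "mod:proper-name", "near:furniture",
                      "near:stairway"},
    "sandstone boulder": {"head:boulder", "mod:material", "near:foundation"},
}

result = bootstrap({"Blacker house"}, features.keys(), features)
print(sorted(result))
# ['Blacker house', 'Gamble house', 'Thorsen house']
```

Here "Thorsen house" is not close enough to the hand-labelled seed, but it is accepted in the second round because it is close to "Gamble house", which shows why the process is iterative.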
Summary: Research Tools Tested
• Part of Speech Taggers
• Noun Phrase Chunkers
• Merging techniques
• Proper Noun Finders
• Proper Name Variant Finder
• Segmenters

Future: Determine relationships
• The Blacker House is related to Greene
– The Greenes built the house.
• Porte Cochère is related to Blacker House
– because it is directly a part of the house.
• William Issac Ott is related to
– Blacker House (on which he worked)
– Greene (with whom he worked).
• Detecting these semantic relationships statistically is a challenge for our next steps:
– Co-occurrence
– Use of subject headings
– Meronymy and other relations (WordNet)

Thank you! Any questions?
www.columbia.edu/cu/cria/climb