bf5f48018cddd2b787b6fbedaa2f26f9.ppt
- Number of slides: 24
Open Language Archives
Steven Bird, University of Pennsylvania
Gary Simons, SIL International
The World’s Languages
Countries with >150 languages
- New Guinea: 823
- Indonesia: 726
- Nigeria: 505
- India: 387
- Mexico: 288
- Cameroon: 279
- Australia: 235
- Congo (DRC): 218
- China (PRC): 201
- Brazil: 192
- USA: 176
- Philippines: 169
Major Language Archives
- American Philosophical Society
  - Wordlists, texts, manuscripts, audio; 200 languages
- National Anthropological Archives
  - manuscripts, field-notes, photographs, maps, video
  - 1,300 recordings of myths, legends, stories, songs
- Perseus Project
  - >70 million words of Greek, Latin, English, Italian, German
- Aboriginal Studies Electronic Data Archive
  - texts, dictionaries, grammars and teaching materials
  - 300 Australian languages
Major European Archives
- Germany
  - IDS: Institut für Deutsche Sprache (Mannheim)
  - BAS: Bavarian Archive of Speech (Munich)
- France
  - INALF: Institut National de la Langue Française (Paris)
  - LACITO: Langues et Civilisations à Tradition Orale (Paris)
- United Kingdom
  - OTA: Oxford Text Archive (Oxford)
- Many others …
Alaska Native Language Center
- Founded in 1972
- 20 native languages
- 10,000 documents
  - Texts
  - Ethnographies
  - Place names
  - Lexicons
- 3,000 recordings
An ANLC Record
- Title: Gwich’in Wordlist
- Author: Zimmerman, Herbert
- Date: 1959
- Language: Gwich’in
- Format: Non-digital
- Description: MS, 75 pp
- Description: 1400 items based on SIL schedule
RESOURCE TYPE? LANGUAGE NAME? AVAILABILITY?
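
The questions raised on this slide can be illustrated with a minimal sketch. It stores the record as a plain Python dictionary and reports which of the queried fields are absent. The field names `ResourceType` and `Availability`, and the helper `missing_fields`, are assumptions made for illustration; they are not part of the ANLC catalog. The `Language` field is present, but only as a free-text name.

```python
# The ANLC record from the slide, held as a plain dictionary.
anlc_record = {
    "Title": "Gwich’in Wordlist",
    "Author": "Zimmerman, Herbert",
    "Date": "1959",
    "Language": "Gwich’in",  # free-text name, not a standard code
    "Format": "Non-digital",
    "Description": ["MS, 75 pp", "1400 items based on SIL schedule"],
}

# Hypothetical checklist derived from the three questions on the slide.
# These field names are illustrative assumptions, not a real schema.
WANTED_FIELDS = ["ResourceType", "Language", "Availability"]


def missing_fields(record, wanted=WANTED_FIELDS):
    """Return the wanted field names that are absent or empty in the record."""
    return [name for name in wanted if not record.get(name)]


print(missing_fields(anlc_record))
```

A check of this kind only tells a cataloger what is absent; deciding how to fill the gaps still needs a shared vocabulary.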
American Indian Studies Research Institute, Indiana
- Interactive language lessons for American Indian languages
- Multimedia dictionaries
  - audio
  - photographic images
UC Berkeley Survey of Californian Languages
- 90 languages
- Field notes
- 750 cassettes
- Catalog is an HTML document
Typical…
Linguistic Data Consortium
- Data for new language technologies:
  - ASR, NLP, MT, IR, TREC, MUC, TDT, …
  - ~200 CD-ROM publications (largest 82 CDs)
  - >1 terabyte of audio data
- E.g. SWITCHBOARD Corpus
  - 2400 transcribed telephone calls
  - Distributed on 26 CDs (web is inappropriate)
  - Published, ISBN, distribution mechanism
ACL Natural Language Software Repository
- Hosted by the German Research Center for Artificial Intelligence (DFKI)
- Software metadata:
  - Authors
  - Functionality
  - Linguistic datatype (e.g. lexicon)
  - File format
  - Operating system availability
  - URL
Resource Types
- DATA
  - Sound recording
  - Shoebox of hand-written index cards
  - Descriptive grammar
- TOOLS
  - Software for creating, storing, querying and viewing language data
  - Formats for storage and interchange (e.g. TEI)
- ADVICE
  - Mailing list archives, FAQs
The Community
- Linguists
  - >13,000 members of LINGUIST
  - Ethnologue >500,000 page hits / month
- Engineers
  - ~1,000 organizations which buy LDC resources
- Language teachers
- Archivists
- Software developers
Challenges
- Endangered languages
  - Preserving languages before they die
- Endangered data
  - Saving old recordings before they disintegrate
- Best practices
  - Creating new data using XML and Unicode
- Finding aids
  - Locating resources (mailing lists)
Finding Aids
- Goal: “bringing like things together and differentiating among them” (Svenonius)
- Traditional databases versus the web
  - Metadata is coherent, but highly distributed
- We need a middle ground:
  - Bottom-up, distributed initiatives
  - Consistent, centralized finding aids
Language Archives within the OAI
- Specialist communities can define their own metadata format
- Service providers can exploit the metadata
- Philadelphia Workshop (December 2000)
  - linguists, anthropologists, archivists, engineers, funding agencies, publishers
  - North America, South America, Europe, Middle East, Africa, Asia, Australia
  - Commitment to implement OAI
Structure of OLAC
- Three groups:
  - Advisory board
  - Member archives
  - Participating data providers
- Three phases:
  - Alpha test [Dec 2000]
  - Pilot [Fall 2001]
  - Operational [Fall 2002]
Primary Service Provider
- Eastern Michigan Univ & Wayne State Univ
- Funded by NSF
- >13,000 members
- Complete union catalog
A Community defined by its metadata
- OPEN
  - Rights.openness
  - Format.openness
- LANGUAGE
  - Encoding scheme: RFC 1766
  - Subject.language
- ARCHIVES
  - Type.data
  - Type.functionality
Language Identification
- Existing standards (ISO 639, RFC 1766)
  - Incomplete: 7% coverage
  - Inconsistent: e.g. Quechua, Bantu (other)
  - Undocumented: only gives a name
- Issues to be addressed:
  - Impossible to create a static inventory
  - Multiple names for a language
    - E.g. Fedicca, Fadicha, Fedija, Fiadidja, Fiyadikkya
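
The "multiple names" issue can be sketched as a lookup from name variants to a single identifier. The example below is a rough illustration, not a description of any existing service: the code `xxx` is a placeholder (the slide does not say which identifier these names share), and `build_index` and `identify` are hypothetical helpers.

```python
# Placeholder identifier: the slide lists the name variants but not the
# code they map to, so "xxx" is an assumption used only for illustration.
PLACEHOLDER_CODE = "xxx"

# Variant names taken from the slide, grouped under one identifier.
VARIANTS = {
    PLACEHOLDER_CODE: ["Fedicca", "Fadicha", "Fedija", "Fiadidja", "Fiyadikkya"],
}


def build_index(variants):
    """Map every case-folded variant name to its identifier."""
    index = {}
    for code, names in variants.items():
        for name in names:
            index[name.casefold()] = code
    return index


def identify(name, index):
    """Return the identifier for a language name, or None if unknown."""
    return index.get(name.strip().casefold())


INDEX = build_index(VARIANTS)
print(identify("Fiadidja", INDEX))
print(identify("Unlisted name", INDEX))
```

Because new names and new languages keep appearing, an index like this has to be maintained as a living resource rather than a static list.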
SIL Ethnologue
- The only complete language identification scheme openly available on the web
- For each of 6,800 languages:
  - Language name and variants, 3-letter code
  - Population, location
  - Linguistic classification
  - Dialects, alternative names for dialects
  - Notes on language use and available literature
LDC Prototype Service Provider
- Harvests data from LDC, ELRA, DFKI
- Query for “language=Bulgarian”:
  - oai:ldc:LDC95T5
    - ECI Multilingual Text
    - Lang: Albanian, Bulgarian, Chinese, Czech, …
    - Applications: IR, MT, LM
  - oai:elra:L0030
    - Bulgarian Morphological Dictionary
    - Lang: Bulgarian
    - 67,500 entries, 242 inflectional types, …
  - oai:dfki:KPML
    - Grammar development workbench
    - Lang: Spanish, Russian, Japanese, Bulgarian, …
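
A query of this kind can be sketched as a filter over harvested records. The three records below copy the identifiers, titles, and languages shown on the slide (the language lists are truncated there, and stay truncated here). The fourth record, the dictionary layout, and the function `query_by_language` are assumptions added for illustration; they do not describe the actual prototype.

```python
# Records as they might look after harvesting. The first three come from
# the slide; their language lists are elided there and are left incomplete.
harvested = [
    {"id": "oai:ldc:LDC95T5", "title": "ECI Multilingual Text",
     "languages": ["Albanian", "Bulgarian", "Chinese", "Czech"]},
    {"id": "oai:elra:L0030", "title": "Bulgarian Morphological Dictionary",
     "languages": ["Bulgarian"]},
    {"id": "oai:dfki:KPML", "title": "Grammar development workbench",
     "languages": ["Spanish", "Russian", "Japanese", "Bulgarian"]},
    # Hypothetical record, added only so the filter has something to reject.
    {"id": "oai:example:0001", "title": "Hypothetical wordlist",
     "languages": ["Gwich’in"]},
]


def query_by_language(records, language):
    """Return the identifiers of records that list the given language."""
    wanted = language.casefold()
    return [
        record["id"]
        for record in records
        if wanted in (name.casefold() for name in record["languages"])
    ]


for identifier in query_by_language(harvested, "Bulgarian"):
    print(identifier)
```

Matching on language names works only when every archive spells them the same way, which is why a shared identification scheme matters.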
Our Experience with the OAI
- Experience of OLAC alpha testers
  - Harvesting protocol
  - Dublin Core
  - Specialized metadata
- OAI support
  - OAI representative at our meeting (Michael Nelson)
- Solves our problem with cataloging distributed, dynamic resources
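
As a rough sketch of what the harvesting protocol involves, the snippet below builds a harvesting request URL. It assumes the `ListRecords` verb and the `metadataPrefix` and `from` arguments of the published OAI protocol, with `oai_dc` as the Dublin Core prefix; the version the alpha testers used may have differed in detail. The base URL and the function `list_records_url` are hypothetical, and no network request is made.

```python
from urllib.parse import urlencode

# Hypothetical data provider endpoint, used only for illustration.
BASE_URL = "http://archive.example.org/oai"


def list_records_url(base_url, metadata_prefix="oai_dc", from_date=None):
    """Build a ListRecords request URL for a data provider.

    metadata_prefix selects the record format (oai_dc is Dublin Core);
    from_date, if given, limits harvesting to records changed since then.
    """
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if from_date is not None:
        params["from"] = from_date
    return base_url + "?" + urlencode(params)


print(list_records_url(BASE_URL))
print(list_records_url(BASE_URL, from_date="2001-01-01"))
```

A service provider would fetch such a URL from each data provider and merge the returned records into its own catalog.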
Challenges ahead…
- Large legacy catalogs
  - cleansing and exporting
  - hierarchical collections
- Overlap with other OAI groups
  - e-prints & digital museums
- OAI as a springboard
  - digitization of legacy data
  - formats for access in perpetuity