3edc54aaef8a801b6acd9768ed485cba.ppt
- Number of slides: 17
CIS 530 Orientation November 2001 Linguistic Data Consortium University of Pennsylvania Philadelphia, PA 19104 n CIS 530 Orientation - November 2001 1
Motivation
• There are several thousand languages. Over 320 are spoken by over 1,000 speakers. The ability to process foreign languages supports
  - global economy, internationalization of business, software localization, military roles, intelligence gathering, humanitarian efforts, foreign policy
• Developing language technology requires large amounts of data, appropriately selected, sampled, organized, and annotated in corpora
• Corpus creation requires special equipment, unique legal arrangements and business models, and specialized skills not usually taught in the programs of users of language data
• LDC exists to make language data broadly available for linguistic education, research, and technology development
LDC Role
• LDC began in 1993 as a specialized publisher of language data. The data was typically produced elsewhere.
  - Distributed over 14,000 copies of 196 corpora to >1,000 organizations worldwide
• LDC gradually developed the ability to create language resources locally
  - newswire/text collection, collection of conversational data via telephone, broadcast news collection
  - transcription, time-alignment, topic relevance annotation, named entity annotation, phonological/morphological resources
• LDC more recently extended its research program
  - TalkBank & Linguistic Exploration, Open Language Archives, African Language Lexicons, DASL
• Linguistic technologies
  - Information Detection, Extraction and Summarization
  - Speech Recognition and Speech Synthesis
  - Machine Translation
  - Language and Speaker Identification
  - Language Teaching, Linguistics
Annotating LDC Corpora: TDT
Topic Detection & Tracking (TDT) Corpora
• TDT4 Corpus (most recent) contains 9 months of data in 6 languages
• Subset of 4 months of English, Chinese, Arabic for annotation
• Topics selected and defined from all sources
• Topic is a specific event or activity along with all directly related events (e.g., Hurricane Mitch)
• Multiple levels of annotation
  - segmentation of audio signal into individual stories
  - topic-story relevance judgements
  - first story identification
  - story-link identification
• Millions of annotation decisions
Audio Segmentation
Using commercial transcripts or closed captions, annotators
• assess existing story boundaries
• add, delete, move boundaries as needed
• classify units as “news” or “not news” (commercials, etc.)
• set and confirm timestamps for all story boundaries
Topic-Story Annotation
• Annotators read and evaluate news stories against topic list
• Classify story as directly, briefly, or not at all related to a target topic
Annotating LDC Corpora: ACE
• Automatic Content Extraction Project (ACE)
  - Develop technology to support automatic processing of human language in text form
  - Classification, filtering, representing language content
• Four annotation tasks
  - Identify all nominal entities in news story
  - Categorize according to type
    - Persons, organizations, GPE, location, facility
    - Name, nominal, pronominal
  - Co-index all mentions of a single entity within story
  - Classify relations among entities
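One way to picture the entity tasks above is as a small data structure: each mention carries a level (name, nominal, pronominal), each entity carries a type, and co-indexing means that mentions of the same entity share an identifier. The sketch below is only an illustration under those assumptions. The class name `Mention`, the field names, the helper functions, and the sample sentence are all hypothetical and are not the ACE annotation format.

```python
from collections import defaultdict
from dataclasses import dataclass

# Categories as listed on the slide; the string spellings are assumptions.
ENTITY_TYPES = {"person", "organization", "GPE", "location", "facility"}
MENTION_LEVELS = {"name", "nominal", "pronominal"}

@dataclass(frozen=True)
class Mention:
    entity_id: str  # shared by every co-indexed mention of one entity
    start: int      # character offset where the mention begins
    end: int        # character offset just past the mention
    level: str      # one of MENTION_LEVELS

def make_mention(text, surface, entity_id, level):
    """Build a mention from the first occurrence of `surface` in `text`."""
    if level not in MENTION_LEVELS:
        raise ValueError(f"unknown mention level: {level}")
    start = text.index(surface)
    return Mention(entity_id, start, start + len(surface), level)

def coreference_chains(mentions):
    """Group mentions by entity id, i.e. collect the co-indexed mentions."""
    chains = defaultdict(list)
    for m in mentions:
        chains[m.entity_id].append(m)
    return dict(chains)

# Hypothetical sentence and annotations, invented for illustration.
text = "Lisa Stark is in Washington. She reports for ABC."
entity_types = {"E1": "person", "E2": "GPE", "E3": "organization"}
mentions = [
    make_mention(text, "Lisa Stark", "E1", "name"),
    make_mention(text, "Washington", "E2", "name"),
    make_mention(text, "She", "E1", "pronominal"),
    make_mention(text, "ABC", "E3", "name"),
]
chains = coreference_chains(mentions)
for entity_id, chain in chains.items():
    surfaces = [text[m.start:m.end] for m in chain]
    print(entity_id, entity_types[entity_id], surfaces)
```

Keeping the type on the entity and the level on the mention mirrors the split between "categorize according to type" and "name, nominal, pronominal" in the task list.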
Nominal Entity Tagging
• Best practices in use of large-scale corpora in study of linguistic variation
• Focus on -t/d deletion in American English (well-known variable)
• Four LDC Corpora, all created for linguistic technology development
• All data already transcribed, segmented to provide fine-grained access
• Basic demographic information available (gender, age, education, region, race/ethnicity)
DASL Technology
• Create concordance
  - regular expression search of corpus
• Create tag set
  - specify which factors to code
• Create annotation file
  - combines data with tag set
• Annotate using web browser
  - play each example; tool supports common audio formats
  - code factors in each factor group, adding comments when needed
  - demographic information displayed
• Save results and output to text file
  - can be exported to Excel spreadsheet, statistical analysis package
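The concordance step can be sketched as a regular expression search that returns each hit with its surrounding context. This is a minimal illustration, not the DASL tool itself. The function name `concordance`, the pattern `TD_PATTERN`, and the sample utterances are assumptions made for the example. The pattern is a rough orthographic stand-in for -t/d deletion candidates (a word ending in a consonant letter followed by t or d) and misses cases such as "missed".

```python
import re

def concordance(utterances, pattern, width=30):
    """Return (utterance_index, left_context, match, right_context) per hit."""
    regex = re.compile(pattern, re.IGNORECASE)
    hits = []
    for i, text in enumerate(utterances):
        for m in regex.finditer(text):
            left = text[max(0, m.start() - width):m.start()]
            right = text[m.end():m.end() + width]
            hits.append((i, left, m.group(0), right))
    return hits

# Hypothetical pattern: word-final consonant letter plus t or d.
TD_PATTERN = r"\b\w*[bcfghjklmnpqrsvwxz](?:t|d)\b"

# Hypothetical transcribed utterances, invented for illustration.
sample = ["I just missed the west side bus", "she told me to go"]

hits = concordance(sample, TD_PATTERN)
for idx, left, word, right in hits:
    print(f"{idx}: {left:>30} [{word}] {right}")
```

Each hit could then be paired with the tag set to form one row of the annotation file.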
TDT Overview
Transcripts
<DOC>
<DOCNO> ABC19981001.1830.0750 </DOCNO>
<DOCTYPE> NEWS STORY </DOCTYPE>
<DATE_TIME> 10/01/1998 18:42:30.46 </DATE_TIME>
<BODY>
<TEXT>
In the U.S. and Canada tonight, there is intense concern. It is fair to say, about the insulation used on 1,000 airplanes. It is the same insulation used on Swissair flight 111 and it has been linked to fires on three other planes. Swissair went down off Nova Scotia, which is why the Canadians are concerned. The company that made that planes warned of the fire hazard years ago. ABC's Lisa Stark is in Washington.
<TURN>
<ANNOTATION> Reporter: </ANNOTATION>
This is the type of insulation in question..
<TURN>
Lisa Stark, ABC News, Washington.
</TEXT>
</BODY>
<END_TIME> 10/01/1998 18:44:37.14 </END_TIME>
</DOC>

<W recid=1651> In
<W recid=1652> the
<W recid=1653> U.S.
<W recid=1654> and
<W recid=1655> Canada
<W recid=1656> tonight,
<W recid=1657> there
<W recid=1658> is
<W recid=1659> intense
<W recid=1660> concern.
<W recid=1661> It
<W recid=1662> is
<W recid=1663> fair
<W recid=1664> to
<W recid=1665> say,
<W recid=1666> about
<W recid=1667> the
<W recid=1668> insulation
<W recid=1669> used
<W recid=1670> on
<W recid=1671> 1,000
<W recid=1672> airplanes.
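A transcript in this layout can be read with a few regular expressions. A strict XML parser would likely reject it because `<TURN>` has no closing tag. The sketch below is only an illustration: the helper names `field` and `parse_time` are hypothetical, the sample is shortened from the slide, and the month/day/year reading of the timestamps is an assumption based on the date embedded in the DOCNO.

```python
import re
from datetime import datetime

# Shortened copy of the transcript on the slide.
SAMPLE = """<DOC>
<DOCNO> ABC19981001.1830.0750 </DOCNO>
<DOCTYPE> NEWS STORY </DOCTYPE>
<DATE_TIME> 10/01/1998 18:42:30.46 </DATE_TIME>
<BODY>
<TEXT>
In the U.S. and Canada tonight, there is intense concern.
<TURN>
Lisa Stark, ABC News, Washington.
</TEXT>
</BODY>
<END_TIME> 10/01/1998 18:44:37.14 </END_TIME>
</DOC>"""

def field(doc, tag):
    """Return the stripped text between <tag> and </tag>, or None."""
    m = re.search(rf"<{tag}>\s*(.*?)\s*</{tag}>", doc, re.DOTALL)
    return m.group(1) if m else None

def parse_time(stamp):
    """Parse a timestamp, assuming month/day/year order."""
    return datetime.strptime(stamp, "%m/%d/%Y %H:%M:%S.%f")

docno = field(SAMPLE, "DOCNO")
start = parse_time(field(SAMPLE, "DATE_TIME"))
end = parse_time(field(SAMPLE, "END_TIME"))
duration = (end - start).total_seconds()

# <TURN> marks a speaker change, so splitting on it yields speaker turns.
turns = [t.strip() for t in field(SAMPLE, "TEXT").split("<TURN>")]

print(docno, duration, len(turns))
```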
ASR Output
<X Bsec=749.13 Dur=1.34 Conf=NA>
<W recid=1822 Bsec=750.47 Dur=0.16 Clust=NA Conf=NA> IN
<W recid=1823 Bsec=750.63 Dur=0.10 Clust=NA Conf=NA> THE
<W recid=1824 Bsec=750.73 Dur=0.14 Clust=NA Conf=NA> U.
<W recid=1825 Bsec=750.87 Dur=0.16 Clust=NA Conf=NA> S.
<W recid=1826 Bsec=751.03 Dur=0.13 Clust=NA Conf=NA> AND
<W recid=1827 Bsec=751.16 Dur=0.41 Clust=NA Conf=NA> CANADA
<W recid=1828 Bsec=751.57 Dur=0.33 Clust=NA Conf=NA> TONIGHT
<W recid=1829 Bsec=751.90 Dur=0.21 Clust=NA Conf=NA> THERE
<W recid=1830 Bsec=752.11 Dur=0.26 Clust=NA Conf=NA> IS
<W recid=1831 Bsec=752.40 Dur=0.76 Clust=NA Conf=NA> INTENSE
<W recid=1832 Bsec=753.18 Dur=0.64 Clust=NA Conf=NA> CONCERN
<W recid=1833 Bsec=753.82 Dur=0.13 Clust=NA Conf=NA> IT
<W recid=1834 Bsec=753.95 Dur=0.12 Clust=NA Conf=NA> IS
<W recid=1835 Bsec=754.07 Dur=0.20 Clust=NA Conf=NA> FAIR
<W recid=1836 Bsec=754.27 Dur=0.09 Clust=NA Conf=NA> TO
<W recid=1837 Bsec=754.36 Dur=0.21 Clust=NA Conf=NA> SAY
<W recid=1838 Bsec=754.57 Dur=0.23 Clust=NA Conf=NA> ABOUT
<W recid=1839 Bsec=754.80 Dur=0.15 Clust=NA Conf=NA> THE
<W recid=1840 Bsec=754.95 Dur=0.69 Clust=NA Conf=NA> INSULATION
<W recid=1841 Bsec=755.64 Dur=0.44 Clust=NA Conf=NA> USED
<W recid=1842 Bsec=756.13 Dur=0.12 Clust=NA Conf=NA> ON
<W recid=1843 Bsec=756.25 Dur=0.06 Clust=NA Conf=NA> A
<W recid=1844 Bsec=756.31 Dur=0.57 Clust=NA Conf=NA> THOUSAND
<W recid=1845 Bsec=756.88 Dur=0.66 Clust=NA Conf=NA> AIRPLANES

<W recid=1651> In
<W recid=1652> the
<W recid=1653> U.S.
<W recid=1654> and
<W recid=1655> Canada
<W recid=1656> tonight,
<W recid=1657> there
<W recid=1658> is
<W recid=1659> intense
<W recid=1660> concern.
<W recid=1661> It
<W recid=1662> is
<W recid=1663> fair
<W recid=1664> to
<W recid=1665> say,
<W recid=1666> about
<W recid=1667> the
<W recid=1668> insulation
<W recid=1669> used
<W recid=1670> on
<W recid=1671> 1,000
<W recid=1672> airplanes.
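Because each ASR record carries timing, the word stream can be loaded into records and checked for silences between consecutive words. The sketch below assumes that `Bsec` is the start time in seconds and `Dur` the duration in seconds, which is an interpretation of the attribute names rather than something the slide states. The function names and the 0.025 s threshold are hypothetical.

```python
import re

# Four consecutive records copied from the slide.
ASR_SAMPLE = """\
<W recid=1829 Bsec=751.90 Dur=0.21 Clust=NA Conf=NA> THERE
<W recid=1830 Bsec=752.11 Dur=0.26 Clust=NA Conf=NA> IS
<W recid=1831 Bsec=752.40 Dur=0.76 Clust=NA Conf=NA> INTENSE
<W recid=1832 Bsec=753.18 Dur=0.64 Clust=NA Conf=NA> CONCERN
"""

W_RECORD = re.compile(r"<W\s+([^>]*)>\s*(\S+)")

def parse_asr(text):
    """Turn <W ...> TOKEN records into dicts with numeric timing fields."""
    words = []
    for attrs, token in W_RECORD.findall(text):
        fields = dict(pair.split("=", 1) for pair in attrs.split())
        words.append({
            "recid": int(fields["recid"]),
            "start": float(fields["Bsec"]),  # assumed: begin time in seconds
            "dur": float(fields["Dur"]),     # assumed: duration in seconds
            "token": token,
        })
    return words

def pauses(words, min_gap):
    """List (previous, next, gap) where the silence is at least min_gap."""
    found = []
    for prev, nxt in zip(words, words[1:]):
        gap = nxt["start"] - (prev["start"] + prev["dur"])
        if gap >= min_gap:
            found.append((prev["token"], nxt["token"], round(gap, 2)))
    return found

words = parse_asr(ASR_SAMPLE)
print([w["token"] for w in words])
print(pauses(words, min_gap=0.025))
```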
Boundary Table
<BOUNDARY docno=ABC19981001.1830.0617 doctype=MISCELLANEOUS Bsec=617.87 Esec=750.46 Brecid=1525 Erecid=1650>
<BOUNDARY docno=ABC19981001.1830.0750 doctype=NEWS Bsec=750.46 Esec=877.14 Brecid=1651 Erecid=2014>
<BOUNDARY docno=ABC19981001.1830.0877 doctype=NEWS Bsec=877.14 Esec=896.86 Brecid=2015 Erecid=2063>

Tokenized Text
<W recid=1632> The
<W recid=1633> most
<W recid=1634> luxurious
<W recid=1635> minivan
<W recid=1636> you
<W recid=1637> can
<W recid=1638> buy...
<W recid=1639> Chrysler
<W recid=1640> town
<W recid=1641> and
<W recid=1642> country.
<W recid=1643> We
<W recid=1644> call
<W recid=1645> it
<W recid=1646> limited.
<W recid=1647> You'll
<W recid=1648> call
<W recid=1649> it
<W recid=1650> unlimited.
<W recid=1651> In
<W recid=1652> the
<W recid=1653> U.S.
<W recid=1654> and
<W recid=1655> Canada
<W recid=1656> tonight,
<W recid=1657> there
<W recid=1658> is
<W recid=1659> intense
<W recid=1660> concern.
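The boundary table ties the token stream to stories: each row gives the first and last `recid` of one unit. A lookup from token to story can be sketched as follows, assuming that `Brecid` and `Erecid` are inclusive bounds. The rows on the slide are consistent with that (1650 ends the commercial, 1651 starts the news story), but the slide does not state it. The function names are hypothetical.

```python
import re

# The three rows shown on the slide.
BOUNDARY_TABLE = """\
<BOUNDARY docno=ABC19981001.1830.0617 doctype=MISCELLANEOUS Bsec=617.87 Esec=750.46 Brecid=1525 Erecid=1650>
<BOUNDARY docno=ABC19981001.1830.0750 doctype=NEWS Bsec=750.46 Esec=877.14 Brecid=1651 Erecid=2014>
<BOUNDARY docno=ABC19981001.1830.0877 doctype=NEWS Bsec=877.14 Esec=896.86 Brecid=2015 Erecid=2063>
"""

def parse_boundaries(text):
    """Parse <BOUNDARY ...> rows into dicts with integer recid bounds."""
    rows = []
    for attrs in re.findall(r"<BOUNDARY\s+([^>]*)>", text):
        row = dict(pair.split("=", 1) for pair in attrs.split())
        row["Brecid"] = int(row["Brecid"])
        row["Erecid"] = int(row["Erecid"])
        rows.append(row)
    return rows

def story_for_recid(rows, recid):
    """Return the row whose [Brecid, Erecid] range contains recid, else None."""
    for row in rows:
        if row["Brecid"] <= recid <= row["Erecid"]:
            return row
    return None

rows = parse_boundaries(BOUNDARY_TABLE)
for recid in (1650, 1651):
    row = story_for_recid(rows, recid)
    print(recid, row["docno"], row["doctype"])
```

With the rows above, token 1650 ("unlimited.") falls in the MISCELLANEOUS unit and token 1651 ("In") opens the NEWS story.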
Relevance Table
<ONTOPIC topicid=3016 level=YES docno=ABC19981001.1830.0750 fileid=19981001_1830_1900_ABC_WNT comments="NO">

Topic Definition
30016. SwissAir 111 Crash
Seminal Event
WHAT: SwissAir Flight 111 crashes
WHERE: Off the coast of Halifax, Nova Scotia.
WHEN: The crash occurs on 9/2/98; the investigation continues through the fall of 1998.
Topic Explication
The MD-11 aircraft was en route from New York to Geneva, Switzerland when it crashed into the Atlantic Ocean, killing all 229 people on board.
On topic: Stories covering the crash and ensuing investigation; plans to compensate the victims' families; any safety measures proposed or adopted as a direct result of this crash.
Rule of Interpretation
Rule 5: Accidents

Rule of Interpretation
5. Accidents: Examples - plane/car/train crash, bridge collapse, accidental shootings, boats sinking. The event would be causal activities and unavoidable consequences like death tolls, injuries, loss of property. The topic includes mourners' pursuit of legal action, investigations, issues with responsible parties (like drug and alcohol tests for drivers, etc.)
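Relevance rows like the one above can be collected into a topic-to-stories index. The sketch below is illustrative only: the first row is copied from the slide, while the second row, including its `BRIEF` level, document number, and file id, is invented to show filtering and is not corpus data. The function names are hypothetical.

```python
import re
from collections import defaultdict

# Matches name=value pairs; values may be double-quoted or bare.
ATTR = re.compile(r'(\w+)=("[^"]*"|[^\s>]+)')

def parse_tag(line):
    """Return the attributes of one tag as a dict, with quotes removed."""
    return {name: value.strip('"') for name, value in ATTR.findall(line)}

RELEVANCE_TABLE = [
    # Row shown on the slide.
    '<ONTOPIC topicid=3016 level=YES docno=ABC19981001.1830.0750 '
    'fileid=19981001_1830_1900_ABC_WNT comments="NO">',
    # Hypothetical row, invented for illustration only.
    '<ONTOPIC topicid=3016 level=BRIEF docno=XYZ00000000.0000.0000 '
    'fileid=hypothetical_file comments="NO">',
]

def stories_by_topic(lines, level="YES"):
    """Map each topic id to the docnos judged at the given relevance level."""
    index = defaultdict(list)
    for line in lines:
        row = parse_tag(line)
        if row["level"] == level:
            index[row["topicid"]].append(row["docno"])
    return dict(index)

index = stories_by_topic(RELEVANCE_TABLE)
print(index)
```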
Story Links
Story Link Table
<LINK seed_docno=APW19981122.0381 comp_docno=ABC19981001.1830.0750 label=Y>

Linked Story
<DOCNO> APW19981122.0381 </DOCNO>
<DOCTYPE> NEWS STORY </DOCTYPE>
<DATE_TIME> 11/22/1998 09:21:00 </DATE_TIME>
(...)
<HEADLINE> Swissair CEO defends installation of in-flight entertainment </HEADLINE>
<TEXT>
ZURICH, Switzerland (AP) _ Swissair ``did everything correctly'' in installing a state-of-the-art entertainment system switched off last month in the wake of the crash of Flight 111, the airline's chief executive said in an interview published Sunday. Swissair acted voluntarily to disconnect the video-on-demand system, connected to a power supply routed through the cockpit, after Canadian investigators detected signs of heat damage on wiring and other debris from the ceiling around the cockpit of the MD-11. (...)
</TEXT>
</BODY>
(...)
</DOC>