- Number of slides: 35
What's Happened Since the First SIGDAT Meeting? Kenneth Ward Church AT&T Labs-Research kwc@research.att.com
The First SIGDAT Meeting • WVLC-1 was held just before ACL-93 • Great turnout! – More like a conference than a workshop • We knew that corpora were “hot,” – but didn't appreciate just how hot they would turn out to be.
Sister meetings have also done very well since 1993 • Information Retrieval – http://www.acm.org/sigir/ • Digital Libraries – http://fox.cs.vt.edu/DL99/ • Machine Learning – http://www.cs.cmu.edu/Web/Groups/NIPS • Data-mining, Databases, Data Warehousing – http://www.acm.org/sigkdd/ – http://www.vldb.org/
Empiricism has a long history • In the 1950’s, empiricism dominated a broad set of fields: – from psychology (behaviorism) – to electrical engineering (information theory). • At the time, it was common practice in linguistics to classify words not only on the basis of their meanings – but also on the basis of their co-occurrence with other words. – “You shall know a word by the company it keeps” (Firth, 1957) • Regrettably, interest in empiricism faded in the 1960’s: – Chomsky's criticism of ngrams in Syntactic Structures (1957) and – Minsky and Papert's criticism of neural nets in Perceptrons (1969).
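One way to picture classifying words by their co-occurrence with other words is a simple neighbor count. The following is a minimal sketch, not a method from the talk: the function name `cooccurrence_counts`, the `window` parameter, and the toy sentence are all assumptions made for illustration.

```python
from collections import Counter, defaultdict


def cooccurrence_counts(tokens, window=2):
    """For each word, count the words seen within `window` positions of it."""
    counts = defaultdict(Counter)
    for i, word in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:  # a word is not its own neighbor
                counts[word][tokens[j]] += 1
    return counts


# Toy text (hypothetical, not drawn from any corpus).
tokens = ("strong tea and strong coffee but "
          "powerful computers and powerful engines").split()
company = cooccurrence_counts(tokens, window=1)

print(company["strong"].most_common())
print(company["powerful"].most_common())
```

Even on this tiny input, "strong" and "powerful" keep different company, which is the kind of distributional evidence the Firth quotation points to. Real work would use a large corpus and an association measure rather than raw counts.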
1990’s Revival • Empiricism regained a dominant position: – Ngrams and Hidden Markov Models (HMMs) became the method of choice in Speech. – Neural Networks (Perceptrons + Hidden Layers) helped create Machine Learning. • Empiricism → Rationalism → Empiricism – Oscillates about once a career • Mark Twain: Grandparents and Grandchildren have a natural alliance.
Why the Revival? “It was a bad idea then, and it is still a bad idea now” • More powerful computers?? • Availability of massive quantities of data!! – Text is available like never before. – Not long ago, the Brown Corpus was considered large. – But now, text is available like never before! • First came collection efforts (www.ldc.upenn.org), • And now everyone has access to the Web! • Experiments are routinely carried out on gigabytes of text. • Some researchers are even working with terabytes.
Big Changes Since 1993 • The Web, stupid! – Demos – Data • Research: – Shared resources + evaluation – Scale: How large is very large? – Increased breadth: Geography, Topics • Commercial: Wall Street & Main Street
The Web, Stupid! • If you publish a paper about neat stuff, it is expected that you will post it on the web. • I’ll mention just a few examples of neat stuff on the web. – Demos – Data – Tools
Lots of Neat Demos on the Web • Web Searching with Machine Translation – www.altavista.com (uses Systran) • Cross-Language Information Retrieval (CLIR): – www.xrce.xerox.com • Parallel Corpora: www-rali.iro.umontreal.ca • Latent Semantic Indexing (LSI) – superbook.bellcore.com/~remde/lsi – lsa.colorado.edu • Speech Synthesis: www.bell-labs.com/project/tts • Dotplot: www.cs.unm.edu/~jon/dotplot
Lots of Neat Data on the Web • WordNet: www.cogsci.princeton.edu/~wn • Linguistic Data Consortium (LDC): – www.ldc.upenn.org • SIGLEX: www.clres.com/siglex.html • Discourse Resource Initiative (DRI) – www.georgetown.edu/luperfoy/Discourse-Treebank/dri-home.html • The Federalist Papers: – www.mcs.net/~knautzr/fed
More Neat Data on the Web (in Lots of Languages) • Chinese: – rocling.iis.sinica.edu.tw – www.sinica.edu.tw • Japanese: cl.aist-nara.ac.jp/lab/resource.html – Electronic Dictionary Research (EDR): www.iijnet.or.jp/edr – Advanced Telecommunications Research (ATR): www.atr.co.jp – www.rdt.monash.edu.au/~jwb/japanese.html • Korean: korterm.kaist.ac.kr • European Language Resources Association (ELRA) – www.icp.grenet.fr/ELRA • Parallel Text (Resnik, ACL-99) – Canadian Hansards: WWW.Parl.GC.CA – Turkish: www.nlp.cs.bilkent.edu.tr – Swedish: svenska.gu.se
Lots of Neat Tools on the Web • Penntools (links to all over the world) – www.cis.upenn.edu/~adwait/penntools.html • Part of Speech Taggers (see above) • Juman/Chasen – pine.kuee.kyoto-u.ac.jp/nl-resource/juman.html – cl.aist-nara.ac.jp/lab/nlt/chasen.html • Suffix Arrays – http://cm.bell-labs.com/cm/cs/who/doug/ssort.c
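For readers who have not met suffix arrays, the idea can be sketched in a few lines: sort the start positions of all suffixes of the text, then find any n-gram by binary search over that sorted order. This is a naive illustration under stated assumptions, not the algorithm in the ssort.c program listed above; the function names `build_suffix_array` and `count_ngram` are hypothetical, and the sort here is far slower than a purpose-built suffix sorter.

```python
import bisect


def build_suffix_array(tokens):
    """Return the start positions of all suffixes, sorted lexicographically.

    Naive construction: each comparison may scan a whole suffix, so this
    is only suitable for small examples.
    """
    return sorted(range(len(tokens)), key=lambda i: tokens[i:])


def count_ngram(tokens, suffix_array, ngram):
    """Count occurrences of `ngram` (a list of tokens) by binary search."""
    n = len(ngram)
    # Truncating sorted suffixes to n tokens keeps them in sorted order,
    # so all suffixes that start with `ngram` form one contiguous run.
    # (Materializing the prefixes is for clarity only; it costs O(N * n).)
    prefixes = [tokens[i:i + n] for i in suffix_array]
    lo = bisect.bisect_left(prefixes, ngram)
    hi = bisect.bisect_right(prefixes, ngram)
    return hi - lo


tokens = "to be or not to be".split()
sa = build_suffix_array(tokens)

print(sa)
print(count_ngram(tokens, sa, ["to", "be"]))
print(count_ngram(tokens, sa, ["or", "to"]))
```

The appeal for corpus work is that one sorted index answers frequency queries for n-grams of any length, without fixing n in advance.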
Big Changes Since 1993 • The Web, stupid! – Demos – Data ▶ Research: – Shared resources + evaluation – Scale: How large is very large? – Increased breadth: Geography, Topics • Commercial: Wall Street & Main Street
Shared Resources + Evaluation • Common tasks: – TREC (trec.nist.gov), Tipster, MUC • Common benchmark corpora: Brown, Penn Treebank, Wall Street Journal, Switchboard • Shared lexical resources: WordNet (www.cogsci.princeton.edu/~wn/) • Common labeling conventions/standards in all areas of NLP from Speech to Discourse • Evaluation, evaluation – Required to get a paper accepted anywhere.
In 1993, it wasn’t like this... • Invited talks at ACL-93 – “Planning Multimodal Discourse” – “Transfers of Meaning” – “Quantificational Domains and Recursive Contexts” • Less sharing of resources • Evaluation not required
Empiricism vs. Rationalism • Pluses: Clear measurable progress – Speech Recognition – Part of Speech Tagging – Parsing • Minuses: Herd mentality, incrementalism, mindless metrics, duplicated effort – Recall: empiricism fell out of favor in the 1960s when methodology became too burdensome.
Big Changes Since 1993 • The Web, stupid! – Demos – Data • Research: – Shared resources + evaluation – Scale: How large is very large? – Increased breadth: Geography, Topics ▶ Commercial: Wall Street & Main Street
Main Street: Big change since 1993 • Large corpora are now having an impact on ordinary users: – Web search engines/portals – “Managing Gigabytes” is not just a popular book, but something that ordinary users are beginning to take for granted.
Huge Commercial Successes (Since 1993) • Information Retrieval & Digital Libraries – Web search engines/portals: highly successful on both Wall Street and Main Street • Invited talks from Lycos (1997) & Infoseek (1998) • Machine Translation & Speech – Available wherever software is sold – Can’t use a phone without talking to a computer
Big Changes Since 1993 • The Web, stupid! – Demos – Data • Research: – Shared resources + evaluation ▶ Scale: How large is very large? – Increased breadth: Geography, Topics • Commercial: Wall Street & Main Street
How Large is Very Large?
Mirror, mirror on the wall • Who is the largest of them all? – The Web? – Lexis-Nexis? – West? • We have had invited talks from all three – Web: Lycos (1997) & Infoseek (1998) – Lexis-Nexis (1993) – West (1997)
Big Changes Since 1993 • The Web, stupid! – Demos – Data • Research: – Shared resources + evaluation – Scale: How large is very large? ▶ Increased breadth: Geography, Topics • Commercial: Wall Street & Main Street
Internationalization • SIGDAT-93: Nearly equal participation – America: 4 papers – Asia: 4 papers – Europe: 3 papers • Great growth in activity around the world, especially Asia • SIGDAT has met in a dozen cities (50% in America) – America: Columbus, Cambridge, Philadelphia, Providence, Montreal, College Park – Asia: Kyoto, Beijing, Hong Kong – Europe: Dublin, Copenhagen, Granada
Some Topics that are Behind the International Expansion • Classic Issues – Machine Translation (MT) / Tools – Input Method Editor (IME): MS-IME 98 – Morphology: Juman, Chasen • New Issues – Cross-language Information Retrieval (CLIR) – Browsing the Internet: integrate IME + CLIR + MT – Parallel and comparable corpora – Terminology Extraction & Alignment – Suffix Arrays
Big Changes Since 1993 • The Web, stupid! – Demos – Data • Research: – Shared resources + evaluation – Scale: How large is very large? ▶ Increased breadth: Geography, Topics • Commercial: Wall Street & Main Street
Broader (and More Applied) View of Computational Linguistics • Data-mining, Databases, Data Warehousing • Digital Libraries • Information Retrieval, Categorization, Extraction • Lexicography • Machine Learning • Machine Translation • Speech • Text Analysis
Data-Mining Issues (How Large is Very Large?) • Similar technology to corpus-based methods • But much larger datasets – Newswire (AP): 1 million words per week – Telephone calls: 1-10 billion per month – IP packets: expected to be even larger • Tasks: Fraud, Marketing, Operations, Care – Identify knobs that business partners can turn • Increase demand (buy TV ads, reduce price) • Increase supply (buy network capacity, enhance operations) – Target opportunities for improvement (marketing prospects) – Track market response in real time (supply/demand by knob)
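To put the dataset sizes on this slide on a common footing, the rates can be converted to yearly totals. This is back-of-the-envelope arithmetic only: the 52-week and 12-month conversions are assumptions, and counting one call record as comparable to one word is a simplification made purely for the comparison.

```python
# Rates quoted on the slide.
NEWSWIRE_WORDS_PER_WEEK = 1_000_000
CALLS_PER_MONTH_LOW = 1_000_000_000
CALLS_PER_MONTH_HIGH = 10_000_000_000

# Assumed conversion factors to a yearly basis.
WEEKS_PER_YEAR = 52
MONTHS_PER_YEAR = 12

newswire_per_year = NEWSWIRE_WORDS_PER_WEEK * WEEKS_PER_YEAR
calls_per_year_low = CALLS_PER_MONTH_LOW * MONTHS_PER_YEAR
calls_per_year_high = CALLS_PER_MONTH_HIGH * MONTHS_PER_YEAR

# How many call records arrive for every newswire word.
ratio_low = calls_per_year_low / newswire_per_year
ratio_high = calls_per_year_high / newswire_per_year

print(f"Newswire: {newswire_per_year:,} words/year")
print(f"Calls: {calls_per_year_low:,} to {calls_per_year_high:,} per year")
print(f"Ratio: roughly {ratio_low:.0f}x to {ratio_high:.0f}x")
```

Under these assumptions the call data is two to three orders of magnitude larger than the newswire stream.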
Best of SIGDAT • Best Invited Talk • Work of Note (in Related Fields)
Best Invited Talk at a SIGDAT Meeting • Henry Kučera and Nelson Francis – Third Workshop on Very Large Corpora (1995) – Massachusetts Institute of Technology (MIT) – Cambridge, MA, USA • Described their work on the Brown Corpus – At a time when empiricism was out of fashion – especially at MIT – Personal & Touching (received standing ovation)
Work of Note • Statistical Machine Translation / Alignment – Brown et al. • Statistical Parsing (In 1993, poor use of lexical info) – Jelinek, Magerman, Charniak, Collins • Statistical PP Attachment – Hindle and Rooth • Word-sense Disambiguation – Yarowsky • Text-tiling (Discourse Parsing) – Hearst
Work of Note (in Related Fields) • Learning – Classification and Regression Trees (CART) – Ripper • Web Tools – Managing Gigabytes, Harvest, SGML → XML • Representation – Suffix Arrays – Latent Semantic Indexing
Summary: Reaching a Wider Audience • Commercial Successes – Main Street & Wall Street • Internationalization – Goal: equal rep from America, Asia & Europe • More topic areas – Information Retrieval, Speech, Machine Translation, Machine Learning, Data-mining
Self-organizing vs. EDA • Self-organizing: Learning, HMM – Statistics do it all • Manual – Wilks’ Stone Soup: Statistics don’t do nothing • Exploratory Data Analysis (EDA) – Hybrid of above
Time for a little controversy: Two types of Empiricism • New Linguistic Insights vs. Methodology • Reviewers do what reviewers do – Safe, conservative, seek precedents, case law – Reviewers go easy on methodology papers • Grim historical reminder: – Recall: empiricism fell out of favor in the 1960s when methodology became too burdensome. • Shouldn’t let the methodology get in the way of what we are here to do.


