
  • Number of slides: 24

Multimodal Technology Integration for News-on-Demand
SRI International
News-on-Demand Compare & Contrast
DARPA
September 30, 1998

Personnel
  • Speech: Dilek Hakkani, Madelaine Plauche, Zev Rivlin, Ananth Sankar, Elizabeth Shriberg, Kemal Sonmez, Andreas Stolcke, Gokhan Tur
  • Natural language: David Israel, David Martin, John Bear
  • Video Analysis: Bob Bolles, Marty Fischler, Marsha Jo Hannah, Bikash Sabata
  • OCR: Greg Myers, Ken Nitz
  • Architectures: Luc Julia, Adam Cheyer

SRI News-on-Demand Highlights
  • Focus on technologies
  • New technologies: scene tracking, speaker tracking, flash detection, sentence segmentation
  • Exploit technology fusion
  • MAESTRO multimedia browser

Outline
  • Goals for News-on-Demand
  • Component Technologies
  • The MAESTRO testbed
  • Information Fusion
  • Prosody for Information Extraction
  • Future Work
  • Summary

High-level Goal
Develop techniques to provide direct and natural access to a large database of information sources through multiple modalities, including video, audio, and text.

Information We Want
  • Geographical location
  • Topic of the story
  • News-makers
  • Who or what is in the picture
  • Who is speaking

Component Technologies
  • Speech processing
      - Automatic speech recognition (ASR)
      - Speaker identification
      - Speaker tracking/grouping
      - Sentence boundary/disfluency detection
  • Video analysis
      - Scene segmentation
      - Scene tracking/grouping
      - Camera flashes
  • Optical character recognition (OCR)
      - Video caption
      - Scene text (light or dark)
      - Person identification
  • Information extraction (IE)
      - Names of people, places, organizations
      - Temporal terms
      - Story segmentation/classification

Component Flowchart [figure]

MAESTRO
  • Testbed for multimodal News-on-Demand technologies
  • Links input data and output from component technologies through a common time line
  • MAESTRO “score” visually correlates component technologies' output
  • Easy to integrate new technologies through a uniform data representation format
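The idea of linking component outputs through a common time line can be illustrated with a minimal Python sketch. The slides do not describe MAESTRO's actual data format, so the `Annotation` record, its field names, and the `overlapping` helper are all hypothetical; the sample entries are invented for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Annotation:
    """One time-stamped output of a component (hypothetical record layout)."""
    track: str    # which component produced it, e.g. "asr", "ocr", "speaker"
    start: float  # seconds from the start of the broadcast
    end: float
    label: str


def overlapping(annotations, start, end):
    """Return annotations from any track that overlap [start, end), by start time."""
    hits = [a for a in annotations if a.start < end and a.end > start]
    return sorted(hits, key=lambda a: (a.start, a.track))


# Invented sample "score": every component writes into the same time base.
score = [
    Annotation("asr", 12.0, 14.5, "prime minister tony blair"),
    Annotation("ocr", 12.5, 16.0, "TONY BLAIR"),
    Annotation("speaker", 0.0, 11.0, "anchor"),
    Annotation("scene", 11.0, 20.0, "scene-3"),
]

for a in overlapping(score, 12.0, 13.0):
    print(f"{a.start:6.1f}-{a.end:6.1f}  {a.track:8s} {a.label}")
```

Because every record carries only a track name, a time span, and a label, a new component can be added by emitting records of the same shape, which is one plausible reading of "uniform data representation format".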

MAESTRO Interface [screenshot; labeled panels: IR Results, ASR Output, Score, Video]

The Technical Challenge
  • Problem: Knowledge sources are not always available or reliable
  • Approaches
      - Make existing sources more reliable
      - Combine multiple sources for increased reliability and functionality (fusion)
      - Exploit new knowledge sources

Two Examples
  • Technology fusion: Speech recognition + Named entity finding = better OCR
  • New knowledge source: Speech prosody for finding names and sentence boundaries

Fusion Ideas
  • Use the names of people detected in the audio track to suggest names in captions
  • Use the names of people detected in yesterday’s news to suggest names in audio
  • Use a video caption to identify a person speaking, and then use their voice to recognize them again
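The third idea (a caption names a speaker, and the voice then identifies them elsewhere) can be sketched as label propagation over speaker clusters. This is only an illustration under stated assumptions: the segment tuples, cluster IDs such as `spk0`, and the `name_speakers` function are hypothetical, and it assumes speaker tracking has already grouped segments by voice.

```python
def name_speakers(segments, captions):
    """Propagate caption names to every segment of the same voice cluster.

    segments: list of (start, end, cluster_id) from speaker tracking (assumed input).
    captions: list of (time, name) from caption OCR (assumed input).
    Returns a list of (start, end, name_or_None).
    """
    cluster_name = {}
    for time, name in captions:
        for start, end, cluster in segments:
            if start <= time < end:
                # Keep the first name seen for a cluster.
                cluster_name.setdefault(cluster, name)
    return [(s, e, cluster_name.get(c)) for s, e, c in segments]


# Invented example: one caption at t=12 s names the voice cluster "spk1".
segments = [
    (0.0, 10.0, "spk0"),
    (10.0, 25.0, "spk1"),
    (25.0, 30.0, "spk0"),
    (30.0, 42.0, "spk1"),
]
captions = [(12.0, "Tony Blair")]
labeled = name_speakers(segments, captions)
for row in labeled:
    print(row)
```

A caption can also name someone who is in view but not speaking, so a real system would need additional evidence before trusting the link; the sketch ignores that case.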

Information Fusion [diagram; labels: Text Recog “Moore”; + add to lexicon; ASR “moore”; NL “Moore”]

[Diagram: input modalities, technology components, and extracted information]
  • Input modalities: Video imagery, Audio track, Auxiliary text news sources
  • Technology components: Face Det/Rec, Scene Seg/Clust/Class, Video object tracking, Caption Recog, Speaker Seg/Clust/Class, Scene Text Det/Rec, Audio event detection, Speech Recog, Name Extraction, Topic detection
  • Extracted information: Who’s speaking, Who / What’s in view, Story topic, Geographic focus, Story start/end
  • Legend: Input processing paths, First-pass fusion opportunities

Augmented Lexicon Improves Recognition Results
  • Without lexicon: TONY BLAKJB; with lexicon: TONY BLAIR
  • Without lexicon: WNITEE SIATEE; with lexicon: UNITED STATES
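The slides do not say how the augmented lexicon is applied inside the text recognizer. As a rough stand-in, the effect can be imitated by post-correcting each OCR token against a lexicon of names taken from the ASR and name-extraction output. The sketch below uses Python's standard `difflib` for fuzzy matching; the `correct_caption` function, the lexicon contents, and the 0.5 similarity cutoff are assumptions made for illustration.

```python
import difflib


def correct_caption(ocr_text, lexicon, cutoff=0.5):
    """Replace each OCR token with its closest lexicon entry, if one is close enough.

    This is post-hoc string matching, not the recognizer-internal lexicon
    the slides refer to; cutoff is an illustrative similarity threshold.
    """
    corrected = []
    for token in ocr_text.split():
        match = difflib.get_close_matches(token, lexicon, n=1, cutoff=cutoff)
        corrected.append(match[0] if match else token)
    return " ".join(corrected)


# Hypothetical lexicon built from names found in the audio track.
lexicon = ["TONY", "BLAIR", "UNITED", "STATES", "MOORE"]

for raw in ("TONY BLAKJB", "WNITEE SIATEE"):
    print(raw, "->", correct_caption(raw, lexicon))
```

With this lexicon and cutoff, the two garbled captions from the slide map back to TONY BLAIR and UNITED STATES. A lower cutoff corrects more errors but also risks replacing correctly read words that are not in the lexicon.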

Prosody for Enhanced Speech Understanding
  • Prosody = rhythm and melody of speech
  • Measured through duration (of phones and pauses), energy, and pitch
  • Can help extract information crucial to speech understanding
  • Examples: sentence boundaries and named entities

Prosody for Sentence Segmentation
  • Finding sentence boundaries is important for information extraction and for structuring output for retrieval
  • Ex.: Any surprises? No. Tanks are in the area.
  • Experiment: Predict sentence boundaries based on duration and pitch using decision tree classifiers

Sentence Segmentation: Results
  • Baseline accuracy = 50% (same number of boundaries & non-boundaries)
  • Accuracy using prosody = 85.7%
  • Boundaries indicated by: long pauses, low pitch before, high pitch after
  • Pitch cues work much better in Broadcast News than in Switchboard
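The cues listed above (long pause, low pitch before, high pitch after) can be written out as a tiny hand-built decision tree. This is a sketch only: the experiment on the slides learned its trees from data, whereas the thresholds, the speaker-normalized pitch scale, and the toy examples here are invented for illustration and say nothing about the reported 85.7%.

```python
def is_boundary(pause_s, pitch_before, pitch_after):
    """Hand-written decision tree over prosodic features (illustrative thresholds).

    pause_s: pause length after the word, in seconds.
    pitch_before / pitch_after: speaker-normalized pitch around the word
    boundary (0 = the speaker's average), a hypothetical feature scale.
    """
    if pause_s >= 0.5:
        return True  # a long pause alone signals a boundary
    if pause_s >= 0.15:
        # Short pause: require a pitch reset (low before, high after).
        return pitch_before < -0.5 and pitch_after > 0.5
    return False


def accuracy(examples):
    """Fraction of (features, label) pairs the tree classifies correctly."""
    correct = sum(is_boundary(*features) == label for features, label in examples)
    return correct / len(examples)


# Invented toy set, balanced like the slide's test condition:
# equal numbers of boundaries and non-boundaries, so guessing gives 50%.
toy = [
    ((0.80, -1.0, 0.9), True),
    ((0.20, -0.8, 0.7), True),
    ((0.20, 0.1, 0.0), False),
    ((0.02, -0.9, 0.8), False),
]
print(accuracy(toy))
```

The balanced test set is what makes 50% the chance baseline: with equal numbers of boundaries and non-boundaries, always predicting either class is right exactly half the time.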

Prosody for Named Entities
  • Finding names (of people, places, organizations) is key to info extraction
  • Names tend to be important to content, hence prosodic emphasis
  • Prosodic cues can be detected even if words are misrecognized: could help find new named entities

Named Entities: Results
  • Baseline accuracy = 50%
  • Using prosody only: accuracy = 64.9%
  • N.E.s indicated by
      - longer duration (more careful pronunciation)
      - more within-word pitch variation
  • Challenges
      - only first mentions are accented
      - only one word in longer N.E. marked
      - non-names accented

Using Prosody in NoD: Summary
  • Prosody can help information extraction independent of word recognition
  • Preliminary positive results for sentence segmentation and N.E. finding
  • Other uses: topic boundaries, emotion detection

Ongoing and Future Work
  • Combine prosody and words for name finding
  • Implement additional fusion opportunities:
      - OCR helping speech
      - speaker tracking helping topic tracking
  • Leverage geographical information for recognition technologies

Conclusions
  • News-on-Demand technologies are making great strides
  • Robustness still a challenge
  • Improved reliability through data fusion and new knowledge sources