62c9762ee56ecaa2551aeddc889f7012.ppt
- Number of slides: 36
WP3 – Retrieval systems
Enabling Access to Sound Archives through Integration, Enrichment and Retrieval
Introduction to Workpackage: overview
- Objectives: to provide retrieval systems offering the ability
  - To search by various musical similarity measures.
  - To search for spoken words or phrases.
  - To search across different media for associated content.
  - Queries may be text-based, spoken or audio examples.
- Tasks:
  - T3.1: Music retrieval
  - T3.2: Speech retrieval
  - T3.3: Cross-media retrieval
  - T3.4: Vocal query interface
12 Month Review Meeting, Project #033902
Introduction to Workpackage: participants & schedule
- Participants and their contributions (mm = man-months)
  - QMUL: 22 mm – music & cross-media retrieval
  - DIT: 5 mm – music retrieval
  - ALL: 24 mm – speech retrieval & vocal queries
  - LFUI: 3 mm – integration of retrieval engines
  - NICE: 12 mm – retrieval for attributes of speech
- Schedule
  - T3.1 Music retrieval: month 9 – month 20
  - T3.2 Speech retrieval: month 8 – month 20
  - T3.3 Cross-media retrieval: month 1 – month 6, month 21 – month 26
  - T3.4 Vocal query interface: month 7 – month 20
Introduction to Workpackage: Task 3.1 - Music retrieval
- Searching and organizing music collections:
  - Adequate representation for the audio in the query
  - Textual and keyword: e.g. author, title, date, genre, etc.
  - Automatic feature extraction
    - Low-level acoustic similarity measures
    - Mid-level features – characterize the rhythmic structure
    - High-level features
      - musically relevant parameters
      - visualisation of key events along the assets' timeline
Introduction to Workpackage: Task 3.2 - Speech retrieval
- Retrieving the content of speech corpuses in English and Hungarian
  - Building test corpuses
  - Levels of recognition
    - Phoneme level recognition
    - Pronunciation dictionary filter or morphological analysis
    - Text corpus based language model
    - Phoneme and word level indexing
    - Fast retrieval
  - Improve the performance
Introduction to Workpackage: Task 3.3 - Cross-media retrieval
- Searching media in various formats (audio recordings, video recordings, notated scores, images)
  - Using metadata
  - Feature extraction
  - Similarity measures
  - Optimised multidimensional search methods
  - Video analysis
- The user can enter a piece of media as a query and might retrieve an entirely different type of media
Introduction to Workpackage: Task 3.4 - Vocal query interface
- Voice initiated media retrieval (without natural language processing)
  - Recording of query
  - Phoneme level recognition
  - Pronunciation dictionary based word(s) identification
  - Speaker adaptation
Deliverables
- D3.1 Report outlining retrieval system functionality and specification (Month 6)
- D3.2 Prototype on speech and music retrieval systems with vocal query interface (Month 20)
- D3.3 Prototype on cross-media retrieval system (Month 20)
Deliverable D3.1 – Report outlining retrieval system functionality and specification
Topics described:
- User requirements
- Relations to other work packages
- Music retrieval
- Speech retrieval
- Indexing and speaker retrieval
- Cross-media retrieval
- Vocal query interface
- Retrieval system integration and knowledge management – role of ontology
- Example user interfaces
Contributors: ALL, DIT, NICE, QMUL, RSAMD
Milestones
- M3.1 Initial vocal query system tested, initial speech and music retrieval algorithms developed (Month 8)
- M3.2 Vocal query is fully functional, speech and music retrieval implemented, cross-media retrieval method finalized (Month 14)
- M3.3 Vocal query finished, speech and music retrieval systems established, basic cross-media retrieval implemented (Month 20)
- M3.4 Cross-media retrieval fully functional, further work is only refinement and optimization (Month 26)
Milestones: M3.1 – Speech retrieval & Vocal query
- Vocal query system:
  - Ready for demonstration in Hungarian and in English
  - Phoneme level recognition implemented and tested; performance improvement in progress
  - Hungarian triphone management is under development
  - English pronunciation dictionary embedded, with multiple pronunciation variants
  - Morphological analyzer implemented for Hungarian pronunciation
  - The Hungarian version performs better than the English one; the reason is under investigation
- Speech retrieval:
  - See above
  - Language model established for the Hungarian version; for English we are looking for a good text corpus
  - Word level recognition implemented and under testing; performance depends on the content/domain of the speech
  - Phoneme based search finished; word based search is under implementation
Milestones: M3.1 – Music retrieval
- Music retrieval:
  - Extractors for tempo, key changes and mode detection have been implemented as VAMP plugins.
  - SoundBite similarity retrieval is fully functional and available as a Mac OS application.
  - A segmenter based on SoundBite has been implemented as a VAMP plugin.
  - A framework for the automatic extraction of audio features has been built. It uses VAMP plugins and outputs descriptors directly in RDF format, allowing easy integration with the ontology.
Workpackage Progress
- Parallel development of the separate retrieval engines is progressing well
- Results are in line with the schedule of the technical annex
- We aim to demonstrate our results by running software modules
- Scientific research introduces risk when targeting performance improvements
- Integration of the different retrieval engines into a common architecture and user interface will be challenging
- Utilization of the power of the ontology-centric approach
Workpackage Progress: Music retrieval - 1
- Music retrieval system
  - Searches assets according to their relevance to music-related queries, using various metadata and automatically extracted features to produce a ranked list of audio files.
  - Search methods:
    - Textual and keyword: e.g. author, title, date, genre, etc.
    - Similarity based on automatically extracted low-level features
    - Music-related parameters using automatically extracted descriptors: instrument, orchestration, tempo, key, etc.
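The similarity-based search method listed above can be pictured as ranking assets by how close their extracted feature vectors are to those of the query. The following is a minimal sketch only: it assumes fixed-length feature vectors and cosine similarity, neither of which is stated in the slides, and `rank_assets` and the sample library are hypothetical names used for illustration.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an all-zero vector is treated as dissimilar to everything
    return dot / (norm_a * norm_b)


def rank_assets(query_features, library):
    """Return (asset name, similarity) pairs, most similar first.

    `library` maps an asset name to its feature vector (assumed layout).
    Ties are broken alphabetically so the ranking is deterministic.
    """
    scored = [(cosine_similarity(query_features, feats), name)
              for name, feats in library.items()]
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [(name, score) for score, name in scored]


if __name__ == "__main__":
    demo_library = {"a.wav": [1.0, 0.0], "b.wav": [0.0, 1.0], "c.wav": [1.0, 1.0]}
    for name, score in rank_assets([1.0, 0.0], demo_library):
        print(f"{name}: {score:.3f}")
```

Any other distance over the same vectors could be swapped in without changing the ranking logic.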
Workpackage Progress: Music retrieval – 2 – Music Analysis module
[Block diagram. Labels: Archive application (musical audio); Input Audio File (PCM); Compression; De-Noising / Restoration; Source Separation; Optimal source separation & denoising parameters; Mid-Level Feature Extractors; Mid-level descriptors (similarity search); High-Level Features Extractors; High-level features (parametric search); Manual Entry Tags/Data; Manual Tags & Manual High Level Features; Reliability Metric; outputs to the PCM & compressed audio assets repository and to the Metadata Repository.]
Workpackage Progress: Music retrieval - 3
- Music Analysis module
  - Musically relevant descriptors are automatically extracted by a module running a series of “VAMP” plugins.
  - Descriptors returned by the plugins can be classified as:
    - Mid-level features, used by the retrieval system to search for similar audio assets (e.g. timbre profile).
    - High-level features, which enable the user to search for audio assets using musically relevant parameters. High-level descriptors are also used for the visualisation of key events along the assets' timeline (e.g. position of beats, bars, key changes and instruments), providing a considerable aid in the analysis of a piece of music.
Workpackage Progress: Music retrieval - 4
- Similarity Search
  - Based on the SoundBite algorithm: it allows simultaneous segmentation, thumbnailing and modelling of an audio asset
Workpackage Progress: Speech retrieval - 1
- Research and testing are carried out on well-prepared corpuses
  - Text and corresponding speech needed
  - 10-20 hours
  - Quality is very important
    - Low noise
    - Mixed sound sources omitted
    - Accent sensitive
  - Field interviews, phone conversations
  - Quality of the recording
  - Silence-to-silence segmentation, automated and manual
  - Half of the corpus for training
  - Half of the corpus for testing
- Hungarian corpus
  - Hungarian radio station ‘Kossuth’, broadcast quality
  - More than 20 hours, segmentation enhanced manually
- English corpus
  - TIMIT for research purposes
  - US Supreme Court recordings
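The automated silence-to-silence segmentation and the half/half corpus split mentioned above could look roughly like the sketch below. It is an illustration under stated assumptions, not the project's method: silence is detected with a simple mean-absolute-amplitude threshold per frame, and the function names, frame length and threshold are all hypothetical.

```python
def frame_energies(samples, frame_len):
    """Mean absolute amplitude of each consecutive frame of `frame_len` samples."""
    energies = []
    for start in range(0, len(samples), frame_len):
        frame = samples[start:start + frame_len]
        energies.append(sum(abs(s) for s in frame) / len(frame))
    return energies


def silence_to_silence_segments(samples, frame_len=160, threshold=0.02):
    """Return (start_sample, end_sample) spans of speech between silences.

    A frame counts as silence when its energy is below `threshold`
    (an assumed criterion; real systems usually add smoothing).
    """
    segments = []
    start = None
    for idx, energy in enumerate(frame_energies(samples, frame_len)):
        voiced = energy >= threshold
        if voiced and start is None:
            start = idx * frame_len              # speech begins
        elif not voiced and start is not None:
            segments.append((start, idx * frame_len))  # speech ends
            start = None
    if start is not None:                        # audio ended during speech
        segments.append((start, len(samples)))
    return segments


def split_half(segments):
    """First half of the segments for training, second half for testing."""
    middle = len(segments) // 2
    return segments[:middle], segments[middle:]


if __name__ == "__main__":
    audio = [0.0] * 4 + [0.5] * 8 + [0.0] * 4 + [0.3] * 4
    found = silence_to_silence_segments(audio, frame_len=4, threshold=0.05)
    print(found, split_half(found))
```

The manual checking step described in the slides would then correct the boundaries this kind of automatic pass gets wrong.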
Workpackage Progress: Speech retrieval - 2
- Preparation for speech recognition
  - Training on segmented corpuses
  - Importance of same accent and same speech content domain
- Layers of speech recognition
  - Phoneme level recognition -> acoustic score
  - Pronunciation dictionary/morphological analysis based recognition -> word exists or not
  - Language model -> final probability
- Mixed, phoneme and word based indexing, keeping probabilities
- Index based fast retrieval with score value
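One way to read "mixed, phoneme and word based indexing keeping probabilities" is an inverted index whose entries carry the recognition score, so retrieval can return hits ordered by score. The sketch below assumes that reading; the `ScoredIndex` class, its key layout and the sample values are hypothetical, and the slides only say that the real system relies on binary indexes.

```python
from collections import defaultdict


class ScoredIndex:
    """Inverted index from a term to scored occurrences in audio assets.

    A term is a tuple whose first element says which level it belongs to,
    e.g. ("word", "court") or ("phonemes", ("k", "ao", "r", "t")), so word
    and phoneme entries can share one index (an assumed design).
    """

    def __init__(self):
        self._postings = defaultdict(list)

    def add(self, term, asset_id, start_sec, score):
        """Record that `term` was recognised in `asset_id` at `start_sec`."""
        self._postings[term].append((asset_id, start_sec, score))

    def search(self, term, min_score=0.0):
        """Return occurrences of `term`, best score first."""
        hits = [p for p in self._postings.get(term, []) if p[2] >= min_score]
        return sorted(hits, key=lambda p: p[2], reverse=True)


if __name__ == "__main__":
    index = ScoredIndex()
    index.add(("word", "court"), "asset1", 12.5, 0.9)
    index.add(("word", "court"), "asset2", 3.0, 0.4)
    index.add(("phonemes", ("k", "ao", "r", "t")), "asset3", 7.0, 0.6)
    print(index.search(("word", "court")))
```

Because lookup is a dictionary access followed by a sort over one posting list, query time does not depend on the total length of the indexed audio.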
Workpackage Progress: Speech retrieval - 3
- Phoneme level recognition
  - ALL spent many months on research to refine its algorithms
  - Successful results
    - Using Gaussian mixture models in HMM nodes
    - Triphone and allophone identification and management
    - Speaker clustering
  - Our phoneme level recognition exceeded a 60% hit rate
  - This was not good enough to build reliable speech retrieval on
  - The workflow:
    - Input: silence-to-silence segments of the wave
    - Output: probable phoneme sequences with acoustic score
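To make "Gaussian mixture model in HMM nodes" concrete: each HMM state scores an acoustic feature vector with a weighted sum of Gaussians, and that log-likelihood feeds the acoustic score. The sketch below shows the standard formula for diagonal-covariance mixtures; the diagonal assumption, the function names and the toy parameters are illustrative and are not taken from the slides.

```python
import math


def log_gaussian_diag(x, mean, var):
    """Log density of a Gaussian with diagonal covariance at point x."""
    return sum(-0.5 * (math.log(2 * math.pi * v) + (xi - m) ** 2 / v)
               for xi, m, v in zip(x, mean, var))


def gmm_log_likelihood(x, weights, means, variances):
    """Log of sum_k weights[k] * N(x; means[k], variances[k]).

    Uses the log-sum-exp trick so that very small component
    likelihoods do not underflow to zero.
    """
    logs = [math.log(w) + log_gaussian_diag(x, m, v)
            for w, m, v in zip(weights, means, variances)]
    top = max(logs)
    return top + math.log(sum(math.exp(value - top) for value in logs))


if __name__ == "__main__":
    # Toy two-component mixture over one-dimensional features.
    weights = [0.5, 0.5]
    means = [[0.0], [5.0]]
    variances = [[1.0], [1.0]]
    for point in ([0.0], [2.5], [5.0]):
        print(point, gmm_log_likelihood(point, weights, means, variances))
```

In a recogniser these per-frame values would be accumulated along an HMM state path to give the acoustic score attached to each phoneme sequence.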
Workpackage Progress: Speech retrieval - 4
- Dictionary level
  - Filtering of feasible phoneme sequences by
    - a pronunciation dictionary in English (custom based pronunciation)
    - a morphological analyzer in Hungarian (rule based pronunciation)
  - The workflow:
    - Input: phoneme sequences with acoustic score
    - Output: feasible word sequences with kept acoustic score
- Language level
  - On the basis of big text corpuses we rank the probability of the word sequences
  - The solution performs much better on domain specific speech (legal, medical domain)
  - The workflow:
    - Input: word sequences with acoustic score
    - Output: word sequences with modified acoustic score
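The language-level workflow above (word sequences with an acoustic score in, word sequences with a modified score out) can be sketched as rescoring with a language model estimated from text. This example assumes a bigram model with log-probabilities and a fixed penalty for unseen word pairs; the slides do not say which kind of language model is used, and `rescore_with_bigrams`, `lm_weight` and `unseen_logprob` are hypothetical names.

```python
def rescore_with_bigrams(word_sequences, bigram_logprob,
                         lm_weight=1.0, unseen_logprob=-10.0):
    """Combine acoustic and language-model scores, best sequence first.

    word_sequences: list of (tuple of words, acoustic log score)
    bigram_logprob: dict mapping (previous word, word) to a log probability
    Unseen pairs get `unseen_logprob`, a crude stand-in for smoothing.
    """
    rescored = []
    for words, acoustic in word_sequences:
        lm_score = sum(bigram_logprob.get(pair, unseen_logprob)
                       for pair in zip(words, words[1:]))
        rescored.append((words, acoustic + lm_weight * lm_score))
    return sorted(rescored, key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    candidates = [(("the", "court"), -5.0), (("the", "caught"), -4.5)]
    legal_bigrams = {("the", "court"): -1.0}
    for words, score in rescore_with_bigrams(candidates, legal_bigrams):
        print(" ".join(words), score)
```

In the demo, the acoustically better "the caught" is overtaken by "the court" once the domain text is taken into account, which is consistent with the slide's remark that domain-specific speech benefits most.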
Workpackage Progress: Speech retrieval - 5
- Integration of speech retrieval into the EASAIER architecture
  - Speech retrieval works as a black box
  - Relies on binary indexes
  - Business logic layer needed
  - Temporal RDF triplets generated
  - User initiated retrieval performed
Workpackage Progress: Cross media retrieval - 1
- Cross-media retrieval
  - The CM retrieval engine and its functionalities were specified in Deliverable 3.1
  - Video analysis modules necessary for CMR are specified in the internal EASAIER software modules document and consist of:
    - Video Transcoding Module
    - Shot Detection and Key-Frame Extraction Module
    - Low-Level Feature Extraction Module
Workpackage Progress: Cross media retrieval – 2 – Video Analysis module
[Block diagram. Labels: Input Video File; Compression; Original video file; Streaming video file (e.g. MPEG-4); Audio Stream Extraction; PCM; Audio stream analysis; Video Segmentation and Keyframe extraction; Keyframes; Keyframe Analysis; KF Extracted Features; KF temporal data; Video segments temporal data; Video segments metadata; Manual Annotation; Metadata; Features; Temporal data; Multimedia assets repository; Metadata repository.]
Workpackage Progress: Cross media retrieval - 3
- For transcoding purposes we chose and tested the ffmpeg software
- A first version of Shot Detection and Key-Frame Extraction was developed at QMUL and is ready for integration
- The MPEG-7 eXperimentation Model (XM) will be integrated for the purpose of Low-Level Feature Extraction.
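The slides do not describe how the QMUL shot detector works. Purely as an illustration of the general idea, the sketch below uses a common textbook approach: compare grey-level histograms of consecutive frames, declare a shot boundary when they differ strongly, and take the middle frame of each shot as its key frame. The function names, bin count and threshold are assumptions.

```python
def histogram(frame, bins=8, max_value=256):
    """Normalised grey-level histogram of a frame given as a flat pixel list."""
    counts = [0] * bins
    for pixel in frame:
        counts[pixel * bins // max_value] += 1
    return [c / len(frame) for c in counts]


def detect_shots(frames, threshold=0.5):
    """Return the index of the first frame of every detected shot."""
    if not frames:
        return []
    boundaries = [0]
    previous = histogram(frames[0])
    for index in range(1, len(frames)):
        current = histogram(frames[index])
        # Half the L1 distance lies in [0, 1]; 1 means no overlap at all.
        difference = sum(abs(a - b) for a, b in zip(previous, current)) / 2
        if difference > threshold:
            boundaries.append(index)
        previous = current
    return boundaries


def key_frames(boundaries, frame_count):
    """Pick the middle frame of each shot as its key frame."""
    ends = boundaries[1:] + [frame_count]
    return [(start + end - 1) // 2 for start, end in zip(boundaries, ends)]


if __name__ == "__main__":
    clip = [[10] * 4, [12] * 4, [200] * 4, [210] * 4, [205] * 4]
    shots = detect_shots(clip)
    print(shots, key_frames(shots, len(clip)))
```

Gradual transitions such as fades would need a more elaborate test than this single-threshold comparison.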
Workpackage Progress: Vocal query
- Technology based on the first two layers of speech recognition (see above)
  - Phoneme level recognition
  - Pronunciation dictionary or morphological analyzer
- Process
  - Phoneme level recognition with acoustic score
  - Matching to the dictionary
  - The solution is the item with the best acoustic score that is also found in the dictionary
- Technology will be demonstrated
  - Speed has to be tuned
  - Performance under evaluation
  - Performance difference between English and Hungarian
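The vocal query process above reduces to a small selection rule: among the phoneme sequences proposed by the recogniser, keep those that match a dictionary pronunciation and return the one with the best acoustic score. A minimal sketch follows, assuming higher scores are better and that the dictionary maps each word to one or more pronunciation variants; `best_dictionary_match` and the toy phoneme symbols are hypothetical.

```python
def best_dictionary_match(hypotheses, pronunciations):
    """Return (word, score) for the best-scoring hypothesis found in the
    dictionary, or None when no hypothesis matches any pronunciation.

    hypotheses: list of (phoneme sequence, acoustic score), higher is better
    pronunciations: dict mapping a word to a list of phoneme sequences
    """
    lookup = {}
    for word, variants in pronunciations.items():
        for variant in variants:
            lookup[tuple(variant)] = word

    best = None
    for phonemes, score in hypotheses:
        word = lookup.get(tuple(phonemes))
        if word is not None and (best is None or score > best[1]):
            best = (word, score)
    return best


if __name__ == "__main__":
    recogniser_output = [
        (("k", "ao", "t"), -2.0),   # best acoustically, but not a known word
        (("k", "ae", "t"), -3.0),
        (("k", "ah", "t"), -4.0),
    ]
    dictionary = {"cat": [("k", "ae", "t")], "cut": [("k", "ah", "t")]}
    print(best_dictionary_match(recogniser_output, dictionary))
```

Exact matching is the simplest possible filter; a tolerant match, for example by edit distance over phonemes, would be a natural extension but is not described in the slides.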
Contributions and Connections with Other Workpackages
Upcoming Work Plan, Months 12-24 – Music related
- Up to month 20
  - Music retrieval: prototype available; the remaining time will be spent on integration, testing, and continuing development of music retrieval based on other similarity measures.
- Month 20
  - Deliver D3.2
- Between month 20 and 26
  - Music retrieval: the only further work is refinement based on user studies (WP7)
- Month 26
  - Deliver D3.3
Upcoming Work Plan, Months 12-24 – Speech related
- Before D3.2 – Month 20
  - Phoneme level recognition
    - Speed tuning
    - Language specific performance improvement
    - Triphone implementation in English
    - Speaker clustering
  - Dictionary level
    - More pronunciation variants in English
  - Language model
    - Bigger text corpuses in English and in Hungarian
  - Indexing and retrieval
    - Reimplementation for mixed phoneme and word search
- After Month 20
  - Testing and refinement
  - Performance tuning
Upcoming Work Plan, Months 12-24 – Cross media related
- All cross-media retrieval effort will be focused on coding and testing the software.
- Material produced for working documents (System Architecture, Software Modules, Metadata) specifies the software to be developed and integrated, and how this will be done.
- The required routines have for the most part been developed at a 'proof-of-concept' level (ontological relationships between different media, feature extraction for images and video, key frame display for retrieved video)
- Month 26
  - M3.4: Cross-media retrieval fully functional, further work is only refinement and optimization.
  - D3.3: Prototype on cross-media retrieval system
- Between month 20 and 26, QMUL will work closely with SILOGIC to integrate cross-media retrieval into the EASAIER system, which we intend to use as the prototype for this.
Demonstration: Overview
- Retrieval engines and tools are demonstrated separately
- Speech related demo
  - Segmentation user interface
  - Vocal query (isolated word search) in Hungarian
  - Vocal query (isolated word search) in English
- Music related demo
  - SoundBite – timbre-based music similarity embedded in iTunes
Demonstration: Segmentation
User interface for manual segmentation
- Synchronizing silence-to-silence speech segments to the text
- Checking automated silence-to-silence segmentation
- Synchronizing word boundaries to the text
- Synchronizing phoneme boundaries to the text
Demonstration: Segmentation – screen shot
Demonstration: Vocal query
Searching for a spoken word in a dictionary
- Recording input
- Phoneme level recognition
- Matching probable phoneme sequences to the content of the pronunciation dictionary
- Display of the most probable solution from the dictionary
Demonstration: Vocal query – screen shot
Demonstration: Music retrieval – screen shot