
- Number of slides: 76
Cross-Modal (Visual-Auditory) Denoising. Dana Segev, Yoav Y. Schechner, Michael Elad, Technion – Israel Institute of Technology
Motivation: a digits sequence; the noisy digits sequence; the same sequence denoised by the state-of-the-art algorithm of Cohen & Berdugo. Segev, Schechner, Elad, Cross-Modal Denoising
Motivation: Use one modality to denoise another? • Use video to denoise a soundtrack?
Noise: • Very intense • Non-stationary • Unknown • Unseen source • Single microphone
Cross-modal, example-based denoising: input video and very noisy audio → algorithm → output denoised audio. For human and machine hearing.
Intuition: input test set; training example set.
Speech Examples Extraction: each example spans about one syllable (~0.25 sec).
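Cutting a recording into syllable-length examples can be sketched as below. This is a minimal illustration assuming back-to-back, non-overlapping 0.25 s segments at the 8000 Hz speech audio rate given in the experiments; the function name `split_into_segments` is hypothetical, and the slides do not say whether segments overlap.

```python
import numpy as np

def split_into_segments(signal, sample_rate=8000, segment_sec=0.25):
    """Cut a 1-D signal into consecutive, non-overlapping segments.

    Assumption: segments abut with no overlap; a trailing remainder
    shorter than one segment is dropped.
    """
    seg_len = int(round(sample_rate * segment_sec))  # 2000 samples at 8 kHz
    n_segments = len(signal) // seg_len
    return np.reshape(signal[: n_segments * seg_len], (n_segments, seg_len))

audio = np.random.default_rng(0).standard_normal(8000 * 3 + 500)  # ~3.06 s of toy audio
segments = split_into_segments(audio)
print(segments.shape)  # (12, 2000)
```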
Music Segments Extraction: xylophone sound.
Examples Principle
Examples: Audio Only
Cross-Modal Denoising: ✓ Cross-modal representation. • Generating multimodal features. • Learning feature statistics. • Cross-modal pattern recognition. • Rendering a denoised signal.
Feature-space Creation: input video → video feature-space; input audio → audio feature-space.
Feature-space Creation: input audio-video → audio-video feature-space.
Feature-space Creation: training audio-video → audio-video examples feature-space.
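A joint audio-video feature-space can be illustrated by stacking each segment's audio and video feature vectors side by side. This is a sketch under stated assumptions: plain concatenation with a hypothetical `video_weight` scale factor, since the slides do not say how the two modalities are combined; the function name and the feature sizes are illustrative.

```python
import numpy as np

def joint_features(audio_feats, video_feats, video_weight=1.0):
    """Stack per-segment audio and video features into one audio-video vector.

    audio_feats: (n_segments, d_a); video_feats: (n_segments, d_v).
    Assumption: plain concatenation, with a hypothetical scalar weight
    on the video part.
    """
    audio_feats = np.asarray(audio_feats, dtype=float)
    video_feats = np.asarray(video_feats, dtype=float)
    if audio_feats.shape[0] != video_feats.shape[0]:
        raise ValueError("audio and video must have the same number of segments")
    return np.hstack([audio_feats, video_weight * video_feats])

a = np.ones((4, 13))   # e.g. 13 audio coefficients per segment (hypothetical size)
v = np.zeros((4, 6))   # e.g. 6 visual coefficients per segment (hypothetical size)
print(joint_features(a, v).shape)  # (4, 19)
```

The same function would be applied to the training examples and to the input, so that both live in the same feature-space.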
Distance-measure in feature-space
Distance-measure: nearest neighbor in feature-space
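The nearest-neighbor step can be sketched with the l2 norm, which the experiments list as the distance measure. This is a brute-force illustration; the helper names are hypothetical, and a large example set would normally use a spatial index instead.

```python
import numpy as np

def nearest_example(query, examples):
    """Index of the training example closest to `query` in the l2 norm."""
    dists = np.linalg.norm(examples - query, axis=1)
    return int(np.argmin(dists))

def nearest_examples(queries, examples):
    """Nearest-neighbor index for every input segment (rows of `queries`)."""
    # Pairwise squared l2 distances, shape (n_queries, n_examples).
    d2 = ((queries[:, None, :] - examples[None, :, :]) ** 2).sum(axis=2)
    return d2.argmin(axis=1)

examples = np.array([[0.0, 0.0], [1.0, 1.0], [5.0, 5.0]])
queries = np.array([[0.9, 1.2], [4.0, 4.5], [-0.1, 0.2]])
print(nearest_examples(queries, examples))  # [1 2 0]
```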
Examples: distance-measure
Rendering a denoised signal: noisy audio; clean segment; denoised audio.
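Rendering can be pictured as replacing each noisy input segment with the clean audio of the example chosen for it. The sketch below assumes the chosen clean segments are simply concatenated in time order; the name `render` is hypothetical, and a practical renderer might crossfade at segment boundaries to avoid clicks.

```python
import numpy as np

def render(clean_segments, chosen):
    """Build the denoised track from the clean audio of the chosen examples.

    clean_segments: (n_examples, seg_len) clean training audio, one row per example.
    chosen: one example index per input segment, in time order.
    Assumption: segments are abutted with no overlap or crossfade.
    """
    return np.concatenate([clean_segments[k] for k in chosen])

clean = np.arange(6).reshape(3, 2)   # three toy "clean segments" of two samples each
print(render(clean, [2, 0, 2]))      # [4 5 0 1 4 5]
```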
Examples: distance-measure
Cross-Modal Association: examples vs. input
Bartender experiment
Cross-Modal Association: examples vs. input
Cross-Modal Denoising: ✓ Cross-modal representation. • Generating multimodal features. • Learning feature statistics. ✓ Cross-modal pattern recognition (NN). ✓ Rendering a denoised signal.
Feature Statistics as a Prior: feature-space
Feature Statistics as a Prior: for the k-th example segment in feature-space:
Feature Statistics as a Prior: the syllable sequence bi - fif - ty - two in feature-space; for the k-th example segment, the clusters are bi, ty, ar, fif, two.
Feature Statistics as a Prior: counting transitions from the current cluster to the next cluster (bi, ty, fif, two, ar) in feature-space.
Feature Statistics as a Prior: syllable consecutive probability. A table counts, over the examples in the training set, how often each current cluster (bi, ty, fif, two, ar) is followed by each next cluster; normalizing these counts gives the probability of transition between clusters.
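Estimating the transition probabilities from cluster labels amounts to counting consecutive pairs and normalizing each row of the count table. This is a minimal sketch: the function name is hypothetical, the label sequences are toy data (not the slide's counts), and giving clusters with no outgoing transitions a uniform row is an assumption.

```python
import numpy as np

def transition_matrix(label_sequences, n_clusters):
    """Estimate P(next cluster = j | current cluster = i) from training label sequences.

    Counts consecutive pairs (i, j) and divides each row by its total.
    Rows with no outgoing transitions are left uniform (an assumption).
    """
    counts = np.zeros((n_clusters, n_clusters))
    for seq in label_sequences:
        for i, j in zip(seq[:-1], seq[1:]):
            counts[i, j] += 1
    totals = counts.sum(axis=1, keepdims=True)
    probs = np.full_like(counts, 1.0 / n_clusters)
    np.divide(counts, totals, out=probs, where=totals > 0)
    return counts, probs

names = ["bi", "ty", "ar", "fif", "two"]      # cluster indices 0..4
bi, ty, ar, fif, two = range(5)
sequences = [[bi, fif, ty, two], [bi, fif, ty, two], [ar, ty, two]]
counts, probs = transition_matrix(sequences, len(names))
print(probs[fif, ty])  # 1.0: in this toy data "fif" is always followed by "ty"
```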
Feature Statistics as a Prior: Hidden Markov Model; the hidden state moves from bi to fif after a time delay with transition probability P(bi → fif).
Feature Statistics as a Prior: Hidden Markov Model with audio noise; the hidden states bi → fif are linked by a time delay with transition probability P(bi → fif), and the audio observations are corrupted by added (+) noise.
Cross-Modal Association: examples vs. input
Cross-Modal Association: input video
Cross-Modal Association: the cost function combines a data term and a regularization term; minimizing it yields the optimal vector of indices.
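One plausible way to write such a cost, offered only as a hedged sketch because the slide's own formula did not survive extraction. All symbols here are hypothetical: $\mathbf{v}_t$ is the input feature vector of segment $t$, $\mathbf{f}_k$ the feature vector of example $k$, $c(k)$ the cluster of example $k$, $P$ the cluster transition probability, and $\lambda > 0$ a weight.

```latex
\hat{\mathbf{k}} = \arg\min_{\mathbf{k} = (k_1, \dots, k_T)}
  \underbrace{\sum_{t=1}^{T} \left\| \mathbf{v}_t - \mathbf{f}_{k_t} \right\|_2^2}_{\text{data term}}
  \; + \; \lambda \underbrace{\sum_{t=2}^{T} -\log P\big(c(k_{t-1}) \to c(k_t)\big)}_{\text{regularization term}}
```

In this form the data term favors examples whose features resemble the input (l2 distance, as in the experiments), and the regularization term penalizes unlikely syllable transitions.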
Cross-Modal Association: examples vs. input. • Dynamic programming • Complexity: expressed in terms of the numbers of nodes and edges in the graph.
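The dynamic-programming step can be illustrated with a generic Viterbi-style minimization of a per-segment data cost plus a pairwise cost between consecutive choices. With T input segments and K candidate examples, the trellis has T·K nodes and (T−1)·K² edges, so this runs in O(T·K²). It is a sketch, not the authors' code: `min_cost_path` is a hypothetical name, and using −λ·log of a transition probability as `pair_cost` is an assumption.

```python
import numpy as np

def min_cost_path(data_cost, pair_cost):
    """Minimize sum_t data_cost[t, k_t] + sum_{t>=1} pair_cost[k_{t-1}, k_t].

    data_cost: (T, K) cost of assigning example k to input segment t.
    pair_cost: (K, K) cost of choosing example j right after example i
               (e.g. -lambda * log of a transition probability; an assumption).
    """
    T, K = data_cost.shape
    acc = data_cost[0].copy()            # best cost of any path ending at node (0, k)
    back = np.zeros((T, K), dtype=int)   # best predecessor of each node
    for t in range(1, T):
        # cand[i, j]: best path ending at example i at t-1, then moving to j at t.
        cand = acc[:, None] + pair_cost
        back[t] = cand.argmin(axis=0)
        acc = cand.min(axis=0) + data_cost[t]
    # Trace the optimal vector of indices backwards from the cheapest end node.
    path = [int(acc.argmin())]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1], float(acc.min())

data = np.array([[0.0, 1.0], [1.0, 0.0], [0.0, 1.0]])
pair = np.array([[0.0, 5.0], [5.0, 0.0]])
print(min_cost_path(data, pair))  # ([0, 0, 0], 1.0)
```

In the toy example, picking the locally best example for every segment ([0, 1, 0]) would cost 10 because of the two expensive transitions; the dynamic program instead keeps example 0 throughout for a total cost of 1.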
Cross-Modal Association: examples vs. input
Cross-Modal Denoising: ✓ Cross-modal representation. • Generating multimodal features. ✓ Learning feature statistics. ✓ Cross-modal pattern recognition. ✓ Rendering a denoised signal.
Features: • Audio features. Requirements: sensitivity to sound perception; dimension reduction. Speech features: MFCCs. Music features: spectrogram of each segment. • Visual features. Requirements: focusing on the motion of interest; dimension reduction. Speech features: DCT coefficients. Music features: the spatial trajectory of a hitting rod.
Audio Features: MFCCs (Mel-frequency cepstral coefficients): audio signal → signal spectrum → Mel-frequency filter bank → log(·) → DCT → MFCCs.
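The MFCC pipeline on the slide (spectrum, Mel filter bank, log, DCT) can be sketched in plain NumPy as follows. The filter count (26), number of kept coefficients (13) and Hamming window are common choices, not values from the slides, and all function names are hypothetical.

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_filters, n_fft, sample_rate):
    """Triangular filters evenly spaced on the Mel scale, shape (n_filters, n_fft//2 + 1)."""
    mel_points = np.linspace(hz_to_mel(0.0), hz_to_mel(sample_rate / 2.0), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_points) / sample_rate).astype(int)
    fb = np.zeros((n_filters, n_fft // 2 + 1))
    for m in range(1, n_filters + 1):
        left, center, right = bins[m - 1], bins[m], bins[m + 1]
        for k in range(left, center):      # rising edge of the triangle
            fb[m - 1, k] = (k - left) / max(center - left, 1)
        for k in range(center, right):     # falling edge of the triangle
            fb[m - 1, k] = (right - k) / max(right - center, 1)
    return fb

def dct_ii(x, n_coeffs):
    """Orthonormal DCT-II of a 1-D vector, keeping the first n_coeffs coefficients."""
    n = x.shape[-1]
    k = np.arange(n_coeffs)[:, None]
    basis = np.cos(np.pi * k * (2 * np.arange(n) + 1) / (2 * n))
    scale = np.full(n_coeffs, np.sqrt(2.0 / n))
    scale[0] = np.sqrt(1.0 / n)
    return (x @ basis.T) * scale

def mfcc(frame, sample_rate=8000, n_filters=26, n_coeffs=13):
    """MFCCs of one frame: power spectrum -> Mel filter bank -> log -> DCT."""
    n_fft = len(frame)
    spectrum = np.abs(np.fft.rfft(frame * np.hamming(n_fft))) ** 2
    energies = mel_filterbank(n_filters, n_fft, sample_rate) @ spectrum
    return dct_ii(np.log(energies + 1e-10), n_coeffs)  # small offset avoids log(0)

t = np.arange(256) / 8000.0
coeffs = mfcc(np.sin(2 * np.pi * 440.0 * t))
print(coeffs.shape)  # (13,)
```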
Audio Features: spectrogram of each segment; xylophone signal → spectrogram → spectrogram accumulation.
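"Spectrogram accumulation" can be read as summing a segment's short-time magnitude spectra over time, so each segment becomes one spectral vector. That reading, the frame length (256), hop (128) and Hann window are assumptions; the 16000 Hz rate is the xylophone audio rate from the experiments, and the function name is hypothetical.

```python
import numpy as np

def segment_spectrum(segment, frame_len=256, hop=128):
    """Accumulate the magnitude spectrogram of one segment over time.

    Frames the segment, takes |FFT| of each windowed frame (the spectrogram),
    then sums over frames (assumed meaning of "accumulation").
    """
    if len(segment) < frame_len:
        raise ValueError("segment is shorter than one frame")
    window = np.hanning(frame_len)
    starts = range(0, len(segment) - frame_len + 1, hop)
    spectrogram = np.array([np.abs(np.fft.rfft(window * segment[s:s + frame_len]))
                            for s in starts])
    return spectrogram.sum(axis=0)  # shape (frame_len // 2 + 1,)

sr = 16000
t = np.arange(4000) / sr                      # a 0.25 s toy segment
spec = segment_spectrum(np.sin(2 * np.pi * 1000.0 * t))
print(int(np.argmax(spec)))  # 16: 1000 Hz / (16000 Hz / 256) = bin 16
```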
Visual Features (speech): the given movie.
Visual Features (speech): locking on the object of interest.
Visual Features (speech): extracting global motion by tracking.
Visual Features (speech): extracting features, namely the DCT coefficients that best represent the motion between frames.
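One way to realize "DCT coefficients which represent motion between frames" is to take the 2-D DCT of frame-to-frame differences of the tracked region and keep the coefficients that vary most. The variance-based selection, the function names and the toy crop size are assumptions, not details from the slides. In practice the kept indices would be chosen on the training set and reused for the input.

```python
import numpy as np

def dct_matrix(n):
    """Orthonormal DCT-II basis matrix of size (n, n)."""
    k = np.arange(n)[:, None]
    m = np.cos(np.pi * k * (2 * np.arange(n) + 1) / (2 * n))
    m[0] /= np.sqrt(2.0)
    return m * np.sqrt(2.0 / n)

def dct2(block):
    """2-D DCT-II of an image block (transform rows and columns)."""
    h, w = block.shape
    return dct_matrix(h) @ block @ dct_matrix(w).T

def motion_dct_features(frames, n_keep):
    """DCT of frame differences of a tracked region, keeping the most variable coefficients.

    frames: (n_frames, h, w) grayscale crops of the object of interest after tracking.
    Assumption: "coefficients that represent motion" = the n_keep coefficients
    with the largest variance across all frame differences.
    """
    diffs = np.diff(frames.astype(float), axis=0)          # motion between consecutive frames
    coeffs = np.array([dct2(d).ravel() for d in diffs])    # (n_frames - 1, h * w)
    keep = np.argsort(coeffs.var(axis=0))[::-1][:n_keep]   # most variable coefficients first
    return coeffs[:, keep], keep

frames = np.random.default_rng(1).standard_normal((5, 8, 8))
feats, keep = motion_dct_features(frames, 3)
print(feats.shape)  # (4, 3)
```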
Visual Features (xylophone): the given movie.
Visual Features (xylophone): locking on the object of interest.
Visual Features (xylophone): extracting global motion by tracking along the X, Y and Z axes.
Visual Features (xylophone): extracting features, namely the spatial coordinates (X, Y, Z) of the hitting rod.
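Turning the tracked rod coordinates into per-segment feature vectors can be sketched by flattening the (X, Y, Z) samples of each segment's frames into one vector. The flattening, the `frames_per_segment` parameter and the function name are assumptions; the slides only name the rod's spatial coordinates as the feature.

```python
import numpy as np

def trajectory_features(rod_xyz, frames_per_segment):
    """Turn a tracked hitting-rod trajectory into one feature vector per segment.

    rod_xyz: (n_frames, 3) spatial coordinates (X, Y, Z), one row per video frame.
    Assumption: each segment's feature is its frames' coordinates flattened
    into one vector; a trailing partial segment is dropped.
    """
    rod_xyz = np.asarray(rod_xyz, dtype=float)
    n_seg = len(rod_xyz) // frames_per_segment
    trimmed = rod_xyz[: n_seg * frames_per_segment]
    return trimmed.reshape(n_seg, frames_per_segment * 3)

rod = np.arange(30).reshape(10, 3)   # 10 toy frames of (X, Y, Z)
feats = trajectory_features(rod, 4)
print(feats.shape)  # (2, 12)
```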
Experiments: Speech • A corpus with a limited number of words and syllables: digits and bar beverages. • Video rate 25 fps, audio rate 8000 Hz. • K-means clustering, 350 clusters. • Distance measure: l2 norm. Xylophone • A corpus with a limited number of sounds. • Video rate 25 fps, audio rate 16000 Hz. • Distance measure: l2 norm.
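The K-means clustering of the speech examples can be illustrated with plain Lloyd iterations under the l2 norm. This is a generic sketch, not the authors' implementation: the function name, random initialization from data points and keeping empty clusters in place are assumptions. For the speech corpus one would call it with `k=350` on the example feature vectors; the pairwise-distance array uses n·k·d memory, which is fine for modest example sets.

```python
import numpy as np

def kmeans(x, k, n_iter=100, seed=0):
    """Plain Lloyd's k-means on the rows of x; returns (centroids, labels)."""
    x = np.asarray(x, dtype=float)
    rng = np.random.default_rng(seed)
    centroids = x[rng.choice(len(x), size=k, replace=False)].copy()

    def assign(c):
        # Squared l2 distance from every point to every centroid, then nearest centroid.
        d2 = ((x[:, None, :] - c[None, :, :]) ** 2).sum(axis=2)
        return d2.argmin(axis=1)

    for _ in range(n_iter):
        labels = assign(centroids)
        # Move each centroid to the mean of its points; empty clusters stay put.
        new = np.array([x[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
                        for j in range(k)])
        if np.allclose(new, centroids):
            break
        centroids = new
    return centroids, assign(centroids)

x = np.vstack([np.zeros((5, 2)), np.full((5, 2), 10.0)])  # two well-separated toy blobs
centroids, labels = kmeans(x, 2)
print(sorted(centroids[:, 0].tolist()))  # [0.0, 10.0]
```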
Xylophone: • Training duration: 103 sec • Testing duration: 100 sec • Noise: music from a song by GNR, SNR = 0.9 • Xylophone melody: SNR = 1
Experiments, Speech: Digits • Training duration: 60 sec • Testing duration: 240 sec • Noisy vs. denoised, SNR = 0.07
Experiments, Speech: Bartender • Training duration: 48 sec • Testing duration: 350 sec • Noise: music from a song by Phil Collins, SNR = 0.59 • Male speech; white Gaussian noise; SNR = 0.38
Summary: input video and very noisy audio → algorithm → output denoised audio. • Example-based • Hidden Markov Model • For human and machine hearing