Cross-Modal (Visual-Auditory) Denoising. Dana Segev, Yoav Y. Schechner, Michael Elad. Technion – Israel Institute of Technology

Motivation: Digits sequence. Noisy digits sequence. Denoised by the state-of-the-art algorithm of Cohen & Berdugo. Segev, Schechner, Elad, Cross-Modal Denoising

Motivation: Use one modality to denoise another? • Use video to denoise a soundtrack?

Noise: Very intense. Non-stationary. Unknown. Unseen source. Single microphone.

Input: video and very noisy audio; time (sec). Algorithm: cross-modal, example-based. Output: denoised audio. For human and machine hearing.

Intuition: Input (test set). Training (example set).

Speech Examples Extraction: ~syllable (0.25 sec)

Music Segments Extraction: Xylophone sound

Examples: Principle

Examples: Audio Only

Cross-Modal Denoising
  ✓ Cross-modal representation.
  • Generating multimodal features.
  • Learning feature statistics.
  • Cross-modal pattern recognition.
  • Rendering a denoised signal.

Feature-space Creation: Input video and its video feature-space; input audio and its audio feature-space; time (sec).

Feature-space Creation: Input audio-video and the audio-video feature-space; time (sec).

Feature-space Creation: Training audio-video and the audio-video examples feature-space; time (sec).

Distance-measure: Feature-space.

Distance-measure: Nearest neighbor in feature-space.
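The nearest-neighbor step can be illustrated with a minimal sketch. It assumes each segment is already summarized by a fixed-length feature vector and uses the l2 distance that the Experiments slide names; the function names and the toy vectors are hypothetical and not taken from the slides.

```python
import math


def l2_distance(a, b):
    """Euclidean (l2) distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def nearest_neighbor(query, examples):
    """Return the index of the example closest to `query` in feature space."""
    return min(range(len(examples)), key=lambda k: l2_distance(query, examples[k]))


# Hypothetical 2-D feature vectors standing in for training examples.
examples = [[0.0, 0.0], [1.0, 1.0], [4.0, 5.0]]
query = [0.9, 1.2]
best = nearest_neighbor(query, examples)  # index of the closest example
```

A linear scan like this costs one distance evaluation per example; it is meant to show the idea rather than an efficient search structure.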

Examples: Distance-measure

Rendering a denoised signal: Noisy audio. Clean segment. Denoised.
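One simple way to picture the rendering step: each noisy input segment is replaced by the clean audio of the example associated with it, and the clean segments are concatenated. The sketch below assumes non-overlapping segments and does no blending at the seams; the slides do not specify either detail, and the names are hypothetical.

```python
def render_denoised(indices, clean_segments):
    """Concatenate the clean example segment chosen for each input segment.

    indices[t] is the example index selected for input segment t;
    clean_segments[k] is the list of clean audio samples of example k.
    """
    output = []
    for k in indices:
        output.extend(clean_segments[k])
    return output


# Two hypothetical clean examples, two samples each.
clean_segments = [[0.1, 0.2], [0.3, 0.4]]
denoised = render_denoised([1, 0, 1], clean_segments)
```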

Examples: Distance-measure

Cross-Modal Association: Examples. Input.

Bartender experiment

Cross-Modal Association: Examples. Input.

Cross-Modal Denoising
  ✓ Cross-modal representation.
  • Generating multimodal features.
  • Learning feature statistics.
  ✓ Cross-modal pattern recognition (NN).
  ✓ Rendering a denoised signal.

Feature Statistics as a Prior: Feature-space. For the k-th example segment:

Feature Statistics as a Prior: Syllable sequence bi - fif - ty - two. Clusters in feature-space: bi, ty, ar, fif, two.

Feature Statistics as a Prior: Current cluster (bi, ty, ar, fif, two) against next cluster (bi, ty, fif, two, ar). Counts as extracted, cell positions not recoverable: 1, 1, 1, 2, 1, 1, 1.

Feature Statistics as a Prior: Syllable consecutive probability. Table with rows labeled by the current cluster and columns by the next cluster (bi, ty, fif, two, ar). Values as extracted, cell positions not recoverable: bi 26 5 1 53 23, ty 12, fif 22 60 43 17 6 4 1 5 3 13 6 12 21 7 2 7 11, two 2, ar 9. Annotations: number of examples in training set; the probability for transition between clusters.
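A transition table of this kind can be estimated by counting consecutive cluster labels in the training sequence and normalizing each row. The sketch below assumes that rows correspond to the current cluster and that probabilities are plain row-normalized counts; the exact formula on the slide is not preserved, and the label sequence here is made up for illustration.

```python
from collections import Counter


def transition_probabilities(labels, clusters):
    """Estimate P(next cluster | current cluster) from a sequence of labels."""
    counts = Counter(zip(labels, labels[1:]))  # consecutive (current, next) pairs
    probs = {}
    for cur in clusters:
        total = sum(counts[(cur, nxt)] for nxt in clusters)
        probs[cur] = {
            nxt: (counts[(cur, nxt)] / total if total else 0.0) for nxt in clusters
        }
    return probs


clusters = ["bi", "ty", "fif", "two", "ar"]
# Hypothetical training label sequence, not data from the slides.
labels = ["fif", "ty", "two", "bi", "ar", "fif", "ty", "bi"]
probs = transition_probabilities(labels, clusters)
```

In this toy sequence "fif" is always followed by "ty", so that row puts all its mass on one entry, while "ty" splits evenly between "two" and "bi".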

Feature Statistics as a Prior: Hidden Markov Model. States fif and bi; time delay; transition probability P between bi and fif.

Feature Statistics as a Prior: Hidden Markov Model with audio noise. States fif and bi; time delay; transition probability P between bi and fif; audio noise added (+).

Cross-Modal Association: Examples. Input.

Cross-Modal Association: Input video.

Cross-Modal Association: A cost function. A data term. A regularization term. Optimal vector of indices.

Cross-Modal Association: Examples. Input. • Dynamic Programming • Complexity: nodes, edges
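The dynamic-programming search for the optimal vector of indices can be sketched as a Viterbi-style pass over a trellis with one column per input segment and one node per example. The sketch assumes the cost is a sum of a per-segment data term and a pairwise regularization term between consecutive indices; the actual terms and the complexity expression on the slide are not preserved, so `data_cost` and `transition_cost` are hypothetical inputs.

```python
def best_index_sequence(data_cost, transition_cost):
    """Minimize sum_t data_cost[t][k_t] + sum_t transition_cost[k_(t-1)][k_t].

    data_cost[t][k]: cost of assigning example k to input segment t.
    transition_cost[j][k]: cost of example k following example j.
    Returns (list of example indices, total cost).
    """
    n_segments = len(data_cost)
    n_examples = len(data_cost[0])
    cost = list(data_cost[0])  # best cost of any path ending at each example
    back = []                  # back-pointers, one list per later segment
    for t in range(1, n_segments):
        new_cost, pointers = [], []
        for k in range(n_examples):
            j_best = min(range(n_examples),
                         key=lambda j: cost[j] + transition_cost[j][k])
            pointers.append(j_best)
            new_cost.append(cost[j_best] + transition_cost[j_best][k] + data_cost[t][k])
        cost = new_cost
        back.append(pointers)
    k = min(range(n_examples), key=lambda i: cost[i])
    total = cost[k]
    path = [k]
    for pointers in reversed(back):  # trace the best path backwards
        k = pointers[k]
        path.append(k)
    path.reverse()
    return path, total


# Toy problem: the data term alone would pick [0, 1, 0],
# but a switching penalty makes staying on example 0 cheaper.
data_cost = [[0.0, 1.0], [0.6, 0.5], [0.0, 1.0]]
transition_cost = [[0.0, 1.0], [1.0, 0.0]]
path, total = best_index_sequence(data_cost, transition_cost)
```

For N input segments and K examples, this trellis has N·K nodes and (N−1)·K² edges, so the sketch runs in O(N·K²) time.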

Cross-Modal Association: Examples. Input.

Cross-Modal Denoising
  ✓ Cross-modal representation.
  • Generating multimodal features.
  ✓ Learning feature statistics.
  ✓ Cross-modal pattern recognition.
  ✓ Rendering a denoised signal.

Features: requirements, speech features, music features
  • Audio features. Requirements: sensitivity to sound perception; dimension reduction. Speech features: MFCCs. Music features: spectrogram of each segment.
  • Visual features. Requirements: focusing on the motion of interest; dimension reduction. Speech features: DCT coefficients. Music features: the spatial trajectory of a hitting rod.

Audio Features: MFCCs (Mel-frequency Cepstral Coefficients). Audio signal → signal spectrum → Mel-frequency filter bank → log(·) → DCT → MFCCs.
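The MFCC chain (spectrum, Mel-frequency filter bank, logarithm, DCT) can be sketched for a single frame using only the Python standard library. Several choices below are assumptions not stated on the slide: the common mel formula 2595·log10(1 + f/700), triangular filters, 20 filters, 12 coefficients, no windowing, and a naive DFT that is only practical for short frames.

```python
import cmath
import math


def hz_to_mel(f):
    return 2595.0 * math.log10(1.0 + f / 700.0)


def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)


def power_spectrum(frame):
    """Naive DFT power spectrum for bins 0..N/2."""
    n = len(frame)
    spectrum = []
    for k in range(n // 2 + 1):
        s = sum(x * cmath.exp(-2j * math.pi * k * i / n) for i, x in enumerate(frame))
        spectrum.append(abs(s) ** 2 / n)
    return spectrum


def mel_filter_bank(n_filters, n_fft, sample_rate):
    """Triangular filters whose edges are equally spaced on the mel scale."""
    top = hz_to_mel(sample_rate / 2.0)
    points = [mel_to_hz(top * i / (n_filters + 1)) for i in range(n_filters + 2)]
    bin_hz = sample_rate / n_fft
    bank = []
    for m in range(1, n_filters + 1):
        lo, mid, hi = points[m - 1], points[m], points[m + 1]
        row = []
        for b in range(n_fft // 2 + 1):
            f = b * bin_hz
            if lo < f <= mid:
                row.append((f - lo) / (mid - lo))  # rising edge
            elif mid < f < hi:
                row.append((hi - f) / (hi - mid))  # falling edge
            else:
                row.append(0.0)
        bank.append(row)
    return bank


def dct_ii(values, n_coeffs):
    """Unnormalized DCT-II, keeping the first n_coeffs coefficients."""
    n = len(values)
    return [sum(v * math.cos(math.pi * c * (i + 0.5) / n) for i, v in enumerate(values))
            for c in range(n_coeffs)]


def mfcc(frame, sample_rate, n_filters=20, n_coeffs=12):
    spec = power_spectrum(frame)
    bank = mel_filter_bank(n_filters, len(frame), sample_rate)
    energies = [sum(w * p for w, p in zip(row, spec)) for row in bank]
    log_energies = [math.log(e + 1e-10) for e in energies]  # offset avoids log(0)
    return dct_ii(log_energies, n_coeffs)


# Hypothetical test frame: 256 samples of a 440 Hz tone at the 8000 Hz speech rate.
sample_rate = 8000
frame = [math.sin(2 * math.pi * 440 * i / sample_rate) for i in range(256)]
coeffs = mfcc(frame, sample_rate)
```

The DCT at the end is what provides the dimension reduction: only the first few coefficients of the log filter-bank energies are kept as the feature vector.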

Audio Features: Spectrogram of each segment. Xylophone signal. Spectrogram. Spectrogram accumulation.

Visual Features (speech): The given movie.

Visual Features (speech): Locking on the object of interest.

Visual Features (speech): Extracting global motion by tracking.

Visual Features (speech): Extracting features. DCT coefficients which highly represent motion between frames.

Visual Features (xylophone): The given movie.

Visual Features (xylophone): Locking on the object of interest.

Visual Features (xylophone): Extracting global motion by tracking; axes X, Y, Z.

Visual Features (xylophone): Extracting features. Hitting rod spatial coordinates (X, Y, Z).

Experiments
Speech
  • A corpus of a limited number of words and syllables: digits and bar beverages.
  • Video rate 25 fps, audio rate 8000 Hz.
  • K-means clustering, 350 clusters.
  • Distance measure: ℓ2 norm.
Xylophone
  • A corpus of a limited number of sounds.
  • Video rate 25 fps, audio rate 16000 Hz.
  • Distance measure: ℓ2 norm.
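The K-means clustering named for the speech experiment can be sketched with Lloyd's algorithm under the l2 norm. This is a toy version under stated assumptions: it initializes the centers with the first k points and runs a fixed number of iterations, neither of which is specified in the slides, and it uses 2 clusters on made-up points rather than the 350 clusters of the experiment.

```python
def squared_l2(a, b):
    """Squared Euclidean distance; same argmin as the l2 norm."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, n_iters=20):
    """Lloyd's algorithm with the first k points as initial centers."""
    centers = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(n_iters):
        # Assignment step: each point goes to its closest center.
        labels = [min(range(k), key=lambda c: squared_l2(p, centers[c])) for p in points]
        # Update step: each center moves to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return centers, labels


# Hypothetical 2-D feature vectors forming two well-separated groups.
points = [[0.0, 0.0], [5.0, 5.0], [0.0, 1.0], [5.0, 6.0], [1.0, 0.0], [6.0, 5.0]]
centers, labels = kmeans(points, 2)
```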

Xylophone
  • Training duration: 103 sec
  • Testing duration: 100 sec
  • Music from song by GNR: SNR = 0.9
  • Xylophone melody: SNR = 1

Experiments: Speech: Digits
  • Training duration: 60 sec
  • Testing duration: 240 sec
  • Noisy. Denoised. SNR = 0.07

Experiments: Speech: Bartender
  • Training duration: 48 sec
  • Testing duration: 350 sec
  • Music from song by Phil Collins: SNR = 0.59
  • Male speech
  • White Gaussian: SNR = 0.38

Summary: Input: video and very noisy audio; time (sec). Algorithm. Output: denoised audio. • Example-based • Hidden Markov Model. For human and machine hearing.