  • Number of slides: 60

Gesture Recognition in Complex Scenes (slide 1)
Vassilis Athitsos
Computer Science and Engineering Department
University of Texas at Arlington

Collaborators (slide 2)
  • Jonathan Alon (ex-BU, now Negevtech, Israel).
  • Jingbin Wang (ex-BU, now Google).
  • Quan Yuan (Boston University).
  • Alexandra Stefan (Boston University).
  • Stan Sclaroff (Boston University).
  • George Kollios (Boston University).
  • Margrit Betke (Boston University).

Motivation: ASL Dictionary (slide 3)

Motivation: ASL Dictionary (slide 4)
  • Addresses needs of a large community:
      • 500,000 to 2 million ASL users in the US.
      • ??? in the European Union.
  • Direct impact in education of Deaf children.
      • Most born to hearing parents, learn ASL at school.
  • Challenging problems in vision, learning, database indexing.
      • Large-scale motion-based video retrieval.
      • Efficient large-scale multiclass recognition.
      • Learning complex patterns from few examples.

Sources of Information (slide 5)
  • Hand motion.
  • Hand pose.
      • Shape.
      • Orientation.
  • Facial expressions.
  • Body pose.

Dynamic Gestures (slide 6)
  • What gesture did the user perform?
[Figure: example gesture, class “8”]

Typical Motion Recognition Approach (slide 7)
[Figure: input sequence → Detector → Tracker → trajectory → Classifier → class “0”]

Bottom-up Shortcoming (slide 8)
[Figure: input frame; hand likelihood]
  • Hand detection is often hard!
  • Color, motion, background subtraction are often not enough.
  • Bottom-up frameworks are a fundamental computer vision bottleneck.

Key Idea (slide 9)
[Figure: input frame; hand candidates]
  • Hand detection can return multiple candidates.
  • Design a recognition module for this type of input.

Nearest-Neighbor Recognition (slide 10)
[Figure: query compared against database gestures M1, M2, …, MN]
  • Question: how should we measure similarity?

Database Sequences (slide 11)
[Figure: example database gesture]
  • Assumption: hand location is known in all frames of the database gestures.
      • Database is built offline.
      • In worst case, manual annotation.
      • Online user experience is not affected.

Comparing Trajectories (slide 12)
[Figure: two hand trajectories, one with positions numbered 1 to 7 and one with positions numbered 1 to 8]
  • A point numbered i is the hand position at frame i.
  • Temporary assumption: known hand location.
  • How do we compare these trajectories?

Comparing Trajectories (slide 13)
[Figure: the two numbered hand trajectories]
  • Comparing i-th frame to i-th frame is problematic.
      • What do we do with frame 8?

Comparing Trajectories (slide 14)
[Figure: the two numbered hand trajectories]
  • Alignment: ((f1, g1), …, (fm, gm)).
      • Must include all frames of both sequences.
      • A frame can occur multiple consecutive times.

Comparing Trajectories (slide 15)
[Figure: the two numbered hand trajectories]
  • Example alignment: ((1, 1), (1, 2), (2, 3), (3, 4), (4, 5), (5, 6), (6, 7), (7, 8)).
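The alignment constraints from the slides can be checked mechanically. The following is a minimal sketch, not code from the talk: the function name is_valid_alignment is hypothetical, and the step rule (each index advances by 0 or 1 per pairing, never both by 0, starting at the first frames and ending at the last frames) is the usual warping-path convention, assumed here because the slides only state that all frames must appear and that a frame may repeat consecutively.

```python
def is_valid_alignment(pairs, len_f, len_g):
    """Return True if `pairs` is a warping path between two sequences.

    `pairs` holds 1-based (f, g) frame indices. Assumed rules (standard for
    DTW, not spelled out on the slides): start at (1, 1), end at
    (len_f, len_g), and move by (1, 0), (0, 1) or (1, 1) at every step.
    """
    if not pairs or pairs[0] != (1, 1) or pairs[-1] != (len_f, len_g):
        return False
    allowed_steps = {(1, 0), (0, 1), (1, 1)}
    for (f0, g0), (f1, g1) in zip(pairs, pairs[1:]):
        if (f1 - f0, g1 - g0) not in allowed_steps:
            return False
    return True


# The alignment shown on slide 15: a 7-frame trajectory against an 8-frame one.
slide_15 = [(1, 1), (1, 2), (2, 3), (3, 4), (4, 5), (5, 6), (6, 7), (7, 8)]
# Frame-to-frame pairing never reaches frame 8 of the second sequence.
frame_to_frame = [(i, i) for i in range(1, 8)]
```

Under these assumed rules, slide_15 passes the check, while frame_to_frame fails because frame 8 of the longer sequence is never used.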

Optimal Alignment (slide 16)
  • Cost of ((f1, g1), …, (fm, gm)) has two terms:
      • Correspondence cost: average cost of each (fi, gi).
      • Transition cost: cost of two consecutive pairings.
  • Dynamic Time Warping (DTW) computes the optimal alignment.
      • Complexity: quadratic in the length of the sequences.

DTW (slide 17)
[Figure: DTW table; query Q (frames 1, …, 50, …, 80) along one axis, model M (frames 1, …, 32, …, 51) along the other]
  • For each cell (i, j):
      • Compute the optimal alignment of M(1:i) to Q(1:j).
      • Answer depends only on (i-1, j), (i, j-1), (i-1, j-1).
      • Time complexity is proportional to the size of the table.
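The table-filling recurrence can be sketched in a few lines. This is an illustrative sketch under stated assumptions rather than the implementation used in the talk: the function name dtw is hypothetical, the correspondence cost is a summed Euclidean distance between 2D hand positions (the slides mention an average), and the transition cost is taken to be zero.

```python
import math


def dtw(model, query):
    """Align two lists of (x, y) hand positions; return (cost, alignment).

    Assumptions for this sketch: cost is the sum of Euclidean distances of
    the matched pairs, with no transition cost. Indices in the returned
    alignment are 1-based (model frame, query frame).
    """
    m, n = len(model), len(query)
    inf = float("inf")
    # table[i][j] = cost of the best alignment of model[:i] to query[:j]
    table = [[inf] * (n + 1) for _ in range(m + 1)]
    back = [[None] * (n + 1) for _ in range(m + 1)]
    table[0][0] = 0.0
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            local = math.dist(model[i - 1], query[j - 1])
            # Cell (i, j) depends only on its three neighbours.
            best_cost, best_cell = min(
                (table[i - 1][j], (i - 1, j)),
                (table[i][j - 1], (i, j - 1)),
                (table[i - 1][j - 1], (i - 1, j - 1)),
            )
            table[i][j] = local + best_cost
            back[i][j] = best_cell
    # Follow back-pointers from the last cell to recover the alignment.
    path, cell = [], (m, n)
    while cell != (0, 0):
        path.append(cell)
        cell = back[cell[0]][cell[1]]
    path.reverse()
    return table[m][n], path


model = [(0, 0), (1, 0), (2, 0)]
query = [(0, 0), (0, 0), (1, 0), (2, 0)]  # same motion, first frame repeated
cost, path = dtw(model, query)
```

Here the query repeats its first position, so the best alignment pairs model frame 1 with query frames 1 and 2 at zero cost. The two nested loops visit each table cell once, which is where the quadratic running time comes from.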

DSTW (slide 18)
[Figure: DSTW table; axes Q and M, with hand candidates 1, 2, …, K along a third axis]
  • Alignment: ((f1, g1, k1), …, (fm, gm, km)):
      • Matching cost: average cost of each (fi, gi, ki).
      • Transition cost: cost of two consecutive pairings.
  • How do we find the optimal alignment?

DSTW (slide 19)
[Figure: DSTW table; axes Q and M, with hand candidates 1, 2, …, K along a third axis]
  • For each cell (i, j, k):
      • Compute the optimal alignment of M(1:i) to Q(1:j), using the k-th candidate for frame Q(j).
      • Answer depends on (i-1, j, k), (i, j-1, *), (i-1, j-1, *).

DSTW (slide 20)
[Figure: DSTW table; axes Q and M, with hand candidates 1, 2, …, K along a third axis]
  • Result: optimal alignment.
      • ((f1, g1, k1), (f2, g2, k2), …, (fm, gm, km)).
      • We get hand locations for free!
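The three-index recurrence can be sketched as a direct extension of the DTW table. This is a hedged illustration, not the authors' code: the name dstw is hypothetical, the cost is a summed Euclidean distance with no transition cost (both simplifying assumptions), and each query frame is represented as a list of candidate (x, y) hand positions.

```python
import math


def dstw(model, candidates):
    """Align a model trajectory to a query with several hand candidates per frame.

    model: list of (x, y). candidates: one list of (x, y) per query frame.
    Returns (cost, alignment), the alignment being 1-based (i, j, k) triples.
    Assumptions for this sketch: summed Euclidean cost, no transition cost.
    """
    m, n = len(model), len(candidates)
    cost, back = {}, {}
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            # Predecessors on the previous query frame may use any candidate.
            any_k = [(i2, j - 1, k2)
                     for i2 in (i, i - 1) if i2 >= 1 and j > 1
                     for k2 in range(1, len(candidates[j - 2]) + 1)]
            for k, position in enumerate(candidates[j - 1], start=1):
                preds = list(any_k)
                if i > 1:
                    # Staying on query frame j keeps the same candidate k.
                    preds.append((i - 1, j, k))
                local = math.dist(model[i - 1], position)
                if preds:
                    best = min(preds, key=lambda cell: cost[cell])
                    cost[(i, j, k)] = local + cost[best]
                    back[(i, j, k)] = best
                else:  # only cell (1, 1, k) has no predecessor
                    cost[(i, j, k)] = local
                    back[(i, j, k)] = None
    last = [(m, n, k) for k in range(1, len(candidates[n - 1]) + 1)]
    end = min(last, key=lambda cell: cost[cell])
    path, cell = [], end
    while cell is not None:
        path.append(cell)
        cell = back[cell]
    path.reverse()
    return cost[end], path


model = [(0, 0), (1, 0), (2, 0)]
# Two candidates per query frame: one true hand position, one distractor.
candidates = [[(5, 5), (0, 0)], [(1, 0), (9, 9)], [(7, 7), (2, 0)]]
total, alignment = dstw(model, candidates)
hand_locations = [candidates[j - 1][k - 1] for _, j, k in alignment]
```

In this toy query the alignment picks candidate 2, then 1, then 2, so hand_locations recovers the true hand position in every frame as a by-product of recognition. The table has one extra dimension of size K, so the work grows with K as well as with the two sequence lengths.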

Application: Gesture Recognition with Short Sleeves! (slide 21)

Experiment: 10 Digits (slide 22)

Experiment: 10 Digits (slide 23)
  • Test set: 90 gestures, from 3 users.
  • Database: 90 gestures, from 3 users.
      • Each test gesture was only matched to the 60 examples from the other users.
  • Accuracy: 91%.
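The user-independent protocol (each gesture matched only against examples from the other users) can be sketched as a leave-one-user-out nearest-neighbor loop. The function name, the tuple layout, and the toy one-dimensional "gestures" below are all assumptions made for illustration; in the experiment the distance would be the DSTW matching cost.

```python
def leave_user_out_accuracy(examples, distance):
    """Nearest-neighbor accuracy where a gesture never sees its own user's data.

    examples: list of (user, label, gesture) tuples.
    distance: any function (gesture, gesture) -> number, lower means closer.
    """
    correct = 0
    for user, label, gesture in examples:
        # Only examples performed by other users are eligible matches.
        others = [(distance(gesture, g), other_label)
                  for u, other_label, g in examples if u != user]
        _, predicted = min(others, key=lambda pair: pair[0])
        correct += predicted == label
    return correct / len(examples)


# Toy data: made-up scalar stand-ins for gestures, not results from the talk.
toy = [("a", "0", 0.0), ("b", "0", 0.1), ("c", "0", 0.2),
       ("a", "1", 1.0), ("b", "1", 1.1), ("c", "1", 0.3)]
accuracy = leave_user_out_accuracy(toy, lambda x, y: abs(x - y))
```

With 90 gestures split evenly over 3 users, this filtering leaves exactly the 60 eligible database examples per test gesture mentioned on the slide.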

Discussion (slide 24)
  • Higher-level module (recognition) tolerant to lower-level (detection) ambiguities.
      • Recognition disambiguates detection.
      • This is important for designing plug-and-play modules.
  • Use in ASL dictionary.
      • User signs unknown word in front of computer.
      • Video sequences of signs are ranked in order of DSTW score.
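The dictionary lookup amounts to sorting the database by matching score. A minimal sketch, assuming a lower score means a better match (as with an alignment cost); the name rank_signs and the dictionary layout are hypothetical.

```python
def rank_signs(query, database, score):
    """Return sign names ordered from best to worst match.

    database: dict mapping a sign name to its stored example.
    score: function (query, example) -> number; lower is assumed better.
    """
    return sorted(database, key=lambda name: score(query, database[name]))


# Toy scalar stand-ins for sign videos, purely for illustration.
toy_database = {"hello": 5.0, "thanks": 1.0, "please": 3.0}
ranking = rank_signs(1.2, toy_database, lambda q, x: abs(q - x))
```

The user would then browse the ranked list from the top until the intended sign appears.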

Static Gestures (Hand Poses) (slide 25)
  • Given a hand model, and a single image of a hand, estimate:
      • 3D hand shape (joint angles).
      • 3D hand orientation.
[Figure: input image; articulated hand model with joints]

Static Gestures (slide 26)
  • Given a hand model, and a single image of a hand, estimate:
      • 3D hand shape (joint angles).
      • 3D hand orientation.
[Figure: input image; articulated hand model]

Similarity Based Matching (slide 27)
[Figure: query gesture q; database x1, x2, x3, …, xn]
  • Goal:
      • Estimate the class of query gesture q.
  • Method:
      • Find the most similar database gestures.

Problems (slide 28)
[Figure: query gesture q; database x1, x2, x3, …, xn]
  • How do we measure similarity?
      • Tolerate errors in feature extraction.
      • Hand detection and segmentation.
  • How do we achieve efficient retrieval?
      • Efficient approximations of slow similarity measures.

Goal: Hand Tracking Initialization (slide 29)
  • Given the 3D hand pose in the previous frame, estimate it in the current frame.
      • Problem: no good way to automatically initialize a tracker.
  • Rehg et al. (1995), Heap et al. (1996), Shimada et al. (2001), Wu et al. (2001), Stenger et al. (2001), Lu et al. (2003), …

Assumptions in Our Approach (slide 30)
  • A few tens of distinct hand shapes.

Assumptions in Our Approach (slide 31)
  • A few tens of distinct hand shapes.
      • All 3D orientations should be allowed.
      • Motivation: American Sign Language.

Assumptions in Our Approach (slide 32)
  • A few tens of distinct hand shapes.
      • All 3D orientations should be allowed.
      • Motivation: American Sign Language.
  • Input: single image, bounding box of hand.

Assumptions in Our Approach (slide 33)
[Figure: input image → skin detection → segmented hand]
  • We do not assume precise segmentation!
      • No clean contour extracted.

Approach: Database Search (slide 34)
[Figure: input image matched against the database]
  • Over 100,000 computer-generated images.
      • Known hand pose.

Why? (slide 35)
[Figure: input image]
  • We avoid direct estimation of 3D info.
      • With a database, we only match 2D to 2D.
  • We can find all plausible estimates.
      • Hand pose is often ambiguous.

Building the Database (slide 36)
[Figure: 26 hand shapes]

Building the Database (slide 37)
  • 4128 images are generated for each hand shape.
  • Total: 107,328 images.

Features: Edge Pixels (slide 38)
[Figure: input; edge image]
  • We use edge images.
      • Easy to extract.
      • Stable under illumination changes.

Similarity Measure: Chamfer Distance (slide 39)
[Figure: input; model; overlaying input and model]
  • How far apart are they?

Directed Chamfer Distance (slide 40)
  • Input: two sets of points.
      • red, green.
  • c(red, green):
      • Average distance from each red point to nearest green point.

Directed Chamfer Distance (slide 41)
  • Input: two sets of points.
      • red, green.
  • c(red, green):
      • Average distance from each red point to nearest green point.
  • c(green, red):
      • Average distance from each green point to nearest red point.

Chamfer Distance (slide 42)
  • Input: two sets of points.
      • red, green.
  • c(red, green):
      • Average distance from each red point to nearest green point.
  • c(green, red):
      • Average distance from each green point to nearest red point.
  • Chamfer distance: C(red, green) = c(red, green) + c(green, red)
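Both directed terms and their sum can be written directly from the definitions, with c(green, red) averaging over the green points. This is a brute-force sketch for small point sets; the function names are hypothetical, points are assumed to be (x, y) tuples, and a real system matching against roughly 100,000 images would need something faster (for example a precomputed distance transform, which is an assumption about practice, not something stated on the slides).

```python
import math


def directed_chamfer(a, b):
    """Average distance from each point in `a` to its nearest point in `b`."""
    return sum(min(math.dist(p, q) for q in b) for p in a) / len(a)


def chamfer(a, b):
    """Symmetric chamfer distance: the sum of the two directed distances."""
    return directed_chamfer(a, b) + directed_chamfer(b, a)


red = [(0, 0), (1, 0)]
green = [(0, 0), (4, 0)]
c_red_green = directed_chamfer(red, green)   # (0 + 1) / 2
c_green_red = directed_chamfer(green, red)   # (0 + 3) / 2
total = chamfer(red, green)
```

The toy sets show why both directions are needed: the red points lie close to green ones (0.5 on average), but the green point at (4, 0) is far from every red point, which only the green-to-red term (1.5) captures.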

Evaluating Retrieval Accuracy (slide 43)
[Figure: input; correct matches; incorrect matches]
  • A database image is a correct match for the input if:
      • the hand shapes are the same,
      • 3D hand orientations differ by at most 30 degrees.
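The correctness criterion can be expressed as a small predicate. The slides do not say how the orientation difference is measured; the sketch below assumes orientations are given as 3×3 rotation matrices and uses the angle of the relative rotation, which is one common choice. All names (rotation_angle_deg, is_correct_match, rot_z) are hypothetical.

```python
import math


def rotation_angle_deg(r1, r2):
    """Angle in degrees of the rotation taking orientation r1 to r2.

    r1, r2: 3x3 rotation matrices as nested lists. Uses
    trace(r1^T r2) = 1 + 2 cos(theta); the clamp guards against rounding.
    """
    trace = sum(r1[i][j] * r2[i][j] for i in range(3) for j in range(3))
    cos_theta = max(-1.0, min(1.0, (trace - 1.0) / 2.0))
    return math.degrees(math.acos(cos_theta))


def is_correct_match(shape_a, rot_a, shape_b, rot_b, max_deg=30.0):
    """Same hand shape, and orientations at most `max_deg` degrees apart."""
    return shape_a == shape_b and rotation_angle_deg(rot_a, rot_b) <= max_deg


def rot_z(deg):
    """Helper for the example: rotation about the z axis by `deg` degrees."""
    c, s = math.cos(math.radians(deg)), math.sin(math.radians(deg))
    return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]
```

For instance, the same shape rotated by 20 degrees counts as a correct match, while a 45-degree difference or a different hand shape does not.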

Evaluating Retrieval Accuracy (slide 44)
[Figure: input; correct matches; incorrect matches]
  • An input image has 25-35 correct matches among the 107,328 database images.
      • Ground truth for input images is estimated by humans.

Evaluating Retrieval Accuracy (slide 45)
[Figure: input; correct matches; incorrect matches]
  • Retrieval accuracy measure: what is the rank of the highest ranking correct match?
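The accuracy measure reduces to scanning the ranked retrieval list for the first correct match. A minimal sketch with hypothetical names; database images are assumed to be identified by hashable ids.

```python
def rank_of_first_correct(ranked_ids, correct_ids):
    """Return the 1-based rank of the highest-ranking correct match.

    ranked_ids: database image ids, best match first.
    correct_ids: set of ids that count as correct for this input.
    Returns None if no correct match appears in the list.
    """
    for rank, image_id in enumerate(ranked_ids, start=1):
        if image_id in correct_ids:
            return rank
    return None


# Toy retrieval result: the best-ranked correct image sits at position 2.
example_rank = rank_of_first_correct(["d3", "d7", "d1"], {"d1", "d7"})
```

A rank of 1 means the top retrieved image was already a correct match.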

Evaluating Retrieval Accuracy (slide 46)
[Figure: input and database images at rank 1, rank 2, rank 3, rank 4, rank 5, rank 6, …, with the highest ranking correct match marked]

Results on 703 Real Hand Images (slide 47)

Rank of highest ranking correct match    Percentage of test images
1                                        15%
1-10                                     40%
1-100                                    73%
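A table of this kind can be computed from the per-image ranks with a single helper. This is a sketch with a hypothetical function name, and the toy ranks are made up for illustration; they are not the ranks of the 703 test images.

```python
def percent_within(ranks, cutoff):
    """Percentage of test images whose best correct match has rank <= cutoff."""
    return 100.0 * sum(rank <= cutoff for rank in ranks) / len(ranks)


# Made-up ranks for four test images, purely to exercise the function.
toy_ranks = [1, 4, 12, 150]
summary = {cutoff: percent_within(toy_ranks, cutoff) for cutoff in (1, 10, 100)}
```

Because each row counts every image up to the cutoff, the percentages are cumulative and can only grow from one row to the next.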

Results on 703 Real Hand Images (slide 48)

Rank of highest ranking correct match    Percentage of test images
1                                        15%
1-10                                     40%
1-100                                    73%

  • Results are better on “nicer” images:
      • Dark background.
      • Frontal view.
      • For half the images, top match was correct.

Examples (slide 49)
[Figure: initial image; segmented hand; edge image; correct match, rank: 1]

Examples (slide 50)
[Figure: initial image; segmented hand; edge image; correct match, rank: 644]

Examples (slide 51)
[Figure: initial image; segmented hand; edge image; incorrect match, rank: 1]

Examples (slide 52)
[Figure: initial image; segmented hand; edge image; correct match, rank: 1]

Examples (slide 53)
[Figure: initial image; segmented hand; edge image; correct match, rank: 33]

Examples (slide 54)
[Figure: initial image; segmented hand; edge image; incorrect match, rank: 1]

Examples (slide 55)
[Figure: segmented hand and edge image for a “hard” case and an “easy” case]

Discussion (slide 56)
  • 3D pose estimation from a single image is hard!
  • What is our system good for?
      • Cleanly segmented frontal views.
      • Generating hypotheses that domain knowledge/constraints can disambiguate.
      • Tracker initialization and error recovery.
      • How would our system be integrated with a tracker?

Research Directions (slide 57)
  • More accurate similarity measures.
      • Problem: higher-level features are more informative, but harder to calculate.
  • Better tolerance to segmentation errors.
      • Clutter.
      • Incorrect scale and translation.

Current Work: ASL Dictionary (slide 58)

Current Work: ASL Dictionary (slide 59)
  • Computer vision challenge:
      • Estimate hand pose and motion accurately and fast.
      • Our existing hand pose method leaves many questions unanswered.
  • Machine learning challenge:
      • Currently, in DSTW, there is no learning.
      • Learn models of signs.
      • 4000 classes, 1-5 examples per sign.
  • Data mining challenge:
      • Indexing methods for large numbers of classes.

Comments, questions, complaints… (slide 60)
  • E-mail: athitsos@uta.edu
  • Web: http://crystal.uta.edu/~athitsos/
END