A Hand Gesture Detection Dataset

Content


This section presents every proposed dictionary, where a dictionary is understood as a set of gestures.

Data Set composition

Natural Data:

  • Number of subjects: 11
  • Static Pose videos: 1 per gesture (252 frames).
  • Execution videos: 5 per gesture.

Synthetic Data:

  • Number of Points of View per Pose: 200
  • Points of View domains: D1_eq_1, D1_eq2

Dictionaries:

Figure 1 shows the pose-based dictionaries, while Figures 2 and 3 show the motion-based and compound-gesture dictionaries, respectively.






Critical Factors

Temporal coherence: the examined data-sets provide either single images or videos, with sets of single images per gesture sample being the most common situation. However, the temporal continuity of video allows temporal filtering either in the analysis or in the decision phase. Moreover, considering videos as annotation units allows a closer adaptation to real situations, in which gestures are performed over several consecutive frames. Besides, the natural hand transitions during a gesture, which tend to be the hardest poses to model, are intrinsically included in videos.
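As an illustration of decision-phase temporal filtering, the per-frame decisions of any classifier can be smoothed with a sliding majority vote. This is a minimal sketch; the window length and the label names are illustrative assumptions, not part of the data-set:

```python
from collections import Counter, deque

def temporal_filter(frame_labels, window=5):
    """Smooth per-frame gesture decisions with a sliding majority vote."""
    history = deque(maxlen=window)
    smoothed = []
    for label in frame_labels:
        history.append(label)
        # The majority label over the last `window` frames suppresses
        # isolated misclassifications, e.g. on transition frames.
        smoothed.append(Counter(history).most_common(1)[0][0])
    return smoothed

# A single spurious 'B' inside a run of 'A' is voted away:
temporal_filter(["A", "A", "B", "A", "A", "A"])  # -> ['A', 'A', 'A', 'A', 'A', 'A']
```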
Representativeness: when designing and generating a data-set, one of the main objectives is to cover as many practical situations as possible. Along this line, the representativeness of a gesture data-set increases with the number of users, with their heterogeneity and with the variations in the point of view of the captures. Besides, the more numerous and heterogeneous the available dictionaries (according to the nature of their gestures), the more scenarios can be considered when designing a recognition solution. Finally, the availability of videos instead of single images provides transitory frames in which the performed gestures vary in appearance with respect to the iconic front-side models, which enhances representativeness.
Nature of gestures: in the state of the art (SoA) we can identify four different kinds of gestures: a) pose-based, defined entirely by the pose of the hand; b) motion-based, in which the hand pose is not relevant, i.e., the hand trajectory alone defines the gesture; c) pose-motion based, defined both by a pose and by a certain motion pattern in the execution; and d) compound, gestures composed of a sequence of pose-based gestures.
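The taxonomy above can be captured in a small data structure. The following sketch (names are illustrative, not taken from the data-set) also records the one property the taxonomy implies, namely that only motion-based gestures need no pose model:

```python
from enum import Enum

class GestureKind(Enum):
    """The four gesture kinds identified in the SoA."""
    POSE = "pose-based"            # defined entirely by the hand pose
    MOTION = "motion-based"        # defined by the hand trajectory alone
    POSE_MOTION = "pose-motion"    # a pose plus a motion pattern
    COMPOUND = "compound"          # a sequence of pose-based gestures

def requires_pose_model(kind):
    # Motion-based gestures are the only kind for which the hand pose
    # is irrelevant; every other kind needs a pose model.
    return kind is not GestureKind.MOTION
```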
Scalability: this refers to the capacity to easily extend a data-set with a new collection of gestures, which is a very valuable characteristic, since in practice it is hard to gather a representative group of users for a recording session.
Capture technology: RGB cameras are the most common technology due to their low cost. However, the trend in recent years is to use range information of the hand, obtained either via stereo-vision or via Time-Of-Flight (TOF) cameras. TOF technology has several advantages: it provides 3D data in a non-intrusive way, without markers or gloves, with a simpler set-up than stereo-vision systems, and it is robust to illumination conditions. Additionally, hand segmentation becomes easier than with colour data alone, and much simpler than in stereo-vision solutions, even in the presence of camera motion.
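A minimal sketch of why range data simplifies segmentation: assuming the hand is the closest object to the camera (a common set-up, though an assumption here, and the band width is arbitrary), a single depth threshold isolates it without colour cues or markers:

```python
import numpy as np

def segment_hand(depth, band=0.12):
    """Segment the hand from a TOF depth map by depth slicing.

    Assumes the hand is the closest object to the camera. `depth`
    holds range values in metres, with 0 marking invalid pixels;
    `band` is the depth slice kept beyond the nearest measurement.
    """
    valid = depth > 0
    nearest = depth[valid].min()
    # Everything within `band` metres of the closest valid point is
    # taken as hand: no colour cues, markers or gloves are needed.
    return valid & (depth <= nearest + band)

# Toy depth map: a hand patch at 0.5 m in front of a 2 m background.
depth = np.full((4, 4), 2.0)
depth[1:3, 1:3] = 0.5
mask = segment_hand(depth)  # True exactly on the 0.5 m patch
```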
Pose issues: here we group some problematic factors, either intrinsic to the gesture definition or introduced by the acquisition process, that may hinder pose detection with existing analysis techniques. These issues become significant when they make two or more poses appear more similar to each other:
• Finger occlusion, owing to either crossed fingers or a lateral point of view of the camera.
• Hand-core occlusion, understanding the hand-core as the part of the hand that is not fingers. Occlusion happens when the point of view of the camera hides the palm and the opisthenar area.
• 2D silhouettes with no protuberances: many hand gesture detection approaches in the SoA describe the detected silhouette rather than the hand itself. When the fingers of more than one gesture cannot be identified from the hand silhouette alone, those gestures lack a representative 2D silhouette and their detection becomes considerably harder.
• Forearm presence: the mis-segmentation of the forearm as part of the hand may increase the difficulty of later classifying a gesture that was trained from forearm-free samples. This only applies to videos capturing real users and depends on the acquisition process.
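To make the "silhouettes with no protuberances" issue concrete, a silhouette descriptor can be reduced to counting finger-like radial peaks; any two gestures that both yield zero peaks are indistinguishable to such a descriptor. This is purely illustrative (the bin count and threshold factor are arbitrary choices, not taken from any cited system; real solutions use proper contour analysis):

```python
import numpy as np

def count_protuberances(mask, factor=1.3):
    """Rough count of finger-like protuberances in a binary silhouette.

    Samples the silhouette radius from its centroid over 72 angular
    bins and counts contiguous runs of bins whose radius exceeds
    `factor` times the mean radius. A silhouette returning 0 is
    exactly the ambiguous "no protuberances" case.
    """
    ys, xs = np.nonzero(mask)
    cy, cx = ys.mean(), xs.mean()
    angles = np.arctan2(ys - cy, xs - cx)
    radii = np.hypot(ys - cy, xs - cx)
    # Maximum radius reached in each 5-degree angular bin.
    bins = ((angles + np.pi) / (2 * np.pi) * 72).astype(int) % 72
    profile = np.zeros(72)
    np.maximum.at(profile, bins, radii)
    peaks = profile > factor * profile[profile > 0].mean()
    # Count connected (circular) runs of peaked bins, i.e. fingers.
    return int(np.sum(peaks & ~np.roll(peaks, 1)))
```

A disc with two thin spikes yields 2, while a bare disc (no identifiable fingers) yields 0.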