This section presents each proposed dictionary (understood as a gesture set).
Data Set composition |
Natural Data:
- Number of subjects: 11
- Static Pose videos: 1 per gesture (252 frames).
- Execution videos: 5 per gesture.
Synthetic Data:
- Points of View per Pose: 200
- Points of View domains:
Dictionaries:
Figure 1 shows the pose-based dictionaries, while Figures 2 and 3 show the motion-based and compound gestures, respectively.
References:
(Kollorz et al., 2008) Kollorz, E., Penne, J., Hornegger, J., Barke, A.: Gesture recognition with a time-of-flight camera. Int. J. Intell. Syst. Technol. Appl. 5(3/4), 334–343 (2008).
(Molina et al., 2011) Molina, J., Escudero-Viñolo, M., Signoriello, A., Pardás, M., Ferrán, C., Bescós, J., Marqués, F., Martínez, J.: Real-time user independent hand gesture recognition from time-of-flight camera video using static and dynamic models. Machine Vision and Applications, pp. 1-18 (2011).
(Soutschek et al., 2008) Soutschek, S., Penne, J., Hornegger, J., Kornhuber, J.: 3-d gesture-based scene navigation in medical imaging applications using time-of-flight cameras. In: IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops, pp. 1-6 (2008).
|
Critical Factors |
Temporal coherence: the examined data-sets provide either single images or videos, sets of single images per gesture sample being the most common situation. However, video temporal continuity allows temporal filtering either in the analysis or in the decision phase. Moreover, considering videos as annotation units allows better adaptation to real situations, in which gestures are performed over several consecutive frames. Besides, natural hand transitions during a gesture, which tend to be the hardest poses to model, are intrinsically included in videos.
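The temporal filtering that video continuity enables in the decision phase can be illustrated with a minimal sketch: a sliding-window majority vote over per-frame classifier labels. The window size and the label alphabet here are illustrative assumptions, not part of the data-set specification.

```python
from collections import Counter, deque

def temporal_filter(frame_labels, window=5):
    """Smooth a sequence of per-frame gesture labels with a
    sliding-window majority vote (window size is an assumption)."""
    smoothed = []
    buf = deque(maxlen=window)  # keeps only the last `window` labels
    for label in frame_labels:
        buf.append(label)
        # The most frequent label in the current window wins;
        # ties go to the label seen first in the window.
        smoothed.append(Counter(buf).most_common(1)[0][0])
    return smoothed

# A noisy per-frame sequence: the spurious 'B' inside a run of 'A'
# is filtered out by the vote.
print(temporal_filter(list("AAABAAACCCCC")))
```

A real decision stage would likely also require a minimum run length before emitting a gesture event, but the vote alone already suppresses single-frame misclassifications.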
Representativeness: when designing and generating a data-set, one of the main objectives is to cover as many practical situations as possible. Along this line, the representativeness of a gesture data-set increases with the number of users, with their heterogeneity and with the variations in the point of view of the captures. Besides, the more available and heterogeneous the dictionaries (according to the nature of their gestures), the more scenarios can be considered when designing a recognition solution. Finally, the availability of videos instead of single images provides transitory frames in which the performed gestures vary in appearance with respect to the iconic models of their front-side versions, which enhances representativeness.
Nature of gestures:
in the SoA we can identify four different kinds of gestures: a)
pose-based, defined entirely by the pose of the hand; b) motion-based,
in which the hand pose is not relevant, i.e., the hand trajectory
explicitly defines the gesture; c) pose-motion based, defined both by a
pose and certain motion pattern in the execution; and d) compound,
which are gestures composed of a sequence of pose-based gestures.
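The four kinds above map naturally onto a small data model. The following is a hedged sketch; the gesture names and the `components` field are hypothetical illustrations, not entries of the proposed dictionaries.

```python
from dataclasses import dataclass
from enum import Enum, auto

class GestureKind(Enum):
    POSE = auto()         # defined entirely by the hand pose
    MOTION = auto()       # defined by the hand trajectory only
    POSE_MOTION = auto()  # a pose plus a motion pattern in the execution
    COMPOUND = auto()     # a sequence of pose-based gestures

@dataclass(frozen=True)
class Gesture:
    name: str
    kind: GestureKind
    # For COMPOUND gestures: ordered names of the constituent poses.
    components: tuple = ()

# Hypothetical dictionary entries for illustration.
stop = Gesture("stop", GestureKind.POSE)
swipe = Gesture("swipe_left", GestureKind.MOTION)
grab = Gesture("grab", GestureKind.COMPOUND, ("open_palm", "fist"))
```

Keeping the kind explicit lets a recognizer dispatch to a pose classifier, a trajectory matcher, or a pose-sequence parser per gesture.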
Scalability: this
refers to the capacity of easily extending a data-set to include a new
collection of gestures, which is a very valuable characteristic. In
practice, it is hard to collect a representative group of users to
perform a recording session.
Capture technology: RGB cameras are the most common technology due to their low cost. However, the trend in recent years is to use the hand's range information, either via stereo-vision or via Time-Of-Flight (TOF) cameras. TOF technology has several advantages: it allows obtaining 3D data in a non-intrusive way, without using markers or gloves, with a simpler set-up than stereo-vision systems, and it is robust to illumination conditions. Additionally, the hand segmentation process becomes easier than with exclusively color data, and much simpler than in stereo-vision solutions, even in the presence of camera motion.
Pose issues: here we group some problematic factors, either intrinsic to the gesture definition or introduced by the acquisition process, that may hinder pose detection with the existing analysis techniques. These issues become significant when they make two or more poses look alike:
• Finger occlusion, owing either to crossed fingers or to a lateral point of view of the camera.
• Hand-core occlusion, understanding the hand-core as the part of the hand that is not fingers. Occlusion happens when the point of view of the camera hides the palm and the opisthenar area.
• 2D silhouettes with no protuberances: many hand gesture detection approaches in the SoA use a description of the detected silhouette rather than that of the hand. When the fingers are not identifiable from the silhouette for more than one gesture, those gestures lack a representative 2D silhouette to distinguish them, and the detection task becomes more difficult.
• Forearm presence: the mis-segmentation of the forearm as part of the hand may increase the difficulty of later classifying a gesture that was trained from forearm-free samples. This only applies to videos capturing real users and depends on the acquisition process.
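The silhouette-based description mentioned in the third issue can be sketched with a toy protuberance counter: it assumes the silhouette contour is given as a list of (x, y) points and flags circular runs of points that lie far from the centroid. The threshold and the synthetic "star" contour are illustrative assumptions, not a method from the surveyed works.

```python
import math

def count_protuberances(contour, rel_threshold=0.7):
    """Count prominent protrusions (e.g. extended fingers) in a closed
    2D silhouette contour. A protrusion is a circular run of contour
    points whose distance to the centroid exceeds rel_threshold times
    the maximum centroid distance (threshold is an assumption)."""
    cx = sum(p[0] for p in contour) / len(contour)
    cy = sum(p[1] for p in contour) / len(contour)
    dists = [math.hypot(x - cx, y - cy) for x, y in contour]
    cutoff = rel_threshold * max(dists)
    above = [d > cutoff for d in dists]
    # Count rising edges; index -1 wraps, so the contour is circular.
    return sum(1 for i in range(len(above)) if above[i] and not above[i - 1])

# Synthetic 5-spike "star" standing in for a spread-hand silhouette:
# radii alternate between 3 (spike tip) and 1 (valley).
star = [(r * math.cos(math.radians(36 * k)), r * math.sin(math.radians(36 * k)))
        for k in range(10) for r in (3.0 if k % 2 == 0 else 1.0,)]
print(count_protuberances(star))  # prints 5
```

A near-circular silhouette (e.g. a closed fist) yields zero protuberances under this scheme, which is exactly the ambiguity the issue describes: two such gestures become indistinguishable from their silhouettes alone.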
|