Algorithm
Table footer: “–” indicates none (no information reported); this convention applies to the tables throughout this section.
The skin color method faces several challenges, such as illumination variation, background clutter and other types of noise. A study by Perimal et al. [ 37 ] provided 14 gestures under controlled room lighting using an HD camera at short distance (0.15 to 0.20 m); the gestures were tested against three parameters, noise, light intensity and hand size, which directly affect the recognition rate. Another study by Sulyman et al. [ 38 ] observed that the Y–Cb–Cr color space is beneficial for eliminating illumination effects, although bright light during capture still reduces accuracy. A study by Pansare et al. [ 11 ] normalized RGB to detect skin and applied a median filter to the red channel to reduce noise in the captured image; the Euclidean distance algorithm was then used for feature matching against a comprehensive dataset. A study by Rajesh et al. [ 15 ] used the HSI color space to segment the skin region under controlled environmental conditions, to ensure proper illumination and reduce error.
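As an illustration of such chrominance-based segmentation, the following minimal OpenCV sketch thresholds the Cb and Cr channels and median-filters the result; the numeric ranges are commonly cited approximations, not the exact values used by the studies above.

```python
# Minimal Y-Cb-Cr skin segmentation sketch (threshold values are illustrative).
import cv2
import numpy as np

def skin_mask_ycbcr(frame_bgr):
    """Return a binary mask of candidate skin pixels."""
    ycrcb = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2YCrCb)  # OpenCV channel order: Y, Cr, Cb
    lower = np.array([0, 133, 77], dtype=np.uint8)        # Y left unconstrained to reduce
    upper = np.array([255, 173, 127], dtype=np.uint8)     # sensitivity to illumination
    mask = cv2.inRange(ycrcb, lower, upper)
    return cv2.medianBlur(mask, 5)                        # suppress salt-and-pepper noise
```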
Another challenge with the skin color method is that the background must not contain any elements that match skin color. Choudhury et al. [ 39 ] suggested a novel hand segmentation combining the frame differencing technique with skin color segmentation, which recorded good results, but the method is still sensitive to scenes that contain moving objects in the background, such as moving curtains and waving trees. Stergiopoulou et al. [ 40 ] combined motion-based segmentation (a hybrid of image differencing and background subtraction) with skin color and morphology features to obtain a robust result that overcomes illumination and complex background problems. Another study by Khandade et al. [ 41 ] used a cross-correlation method to match the segmented hand against a dataset to achieve better recognition. Karabasi et al. [ 42 ] proposed hand gestures for deaf-mute communication based on mobile phones, which can translate sign language using the HSV color space. Zeng et al. [ 43 ] presented a hand gesture method to assist wheelchair users indoors and outdoors using red-channel thresholding with a fixed background to overcome illumination change. A study by Hsieh et al. [ 44 ] used face skin detection to define skin color; this system can correctly detect skin pixels under low lighting conditions, even when the face color is outside the normal range of skin chromaticity. Another study, by Bergh et al. [ 45 ], proposed a hybrid method combining a histogram with a pre-trained Gaussian mixture model to overcome lighting conditions. Pansare et al. [ 46 ] aligned two cameras (RGB and TOF) to improve skin color detection, using the depth property of the TOF camera to enhance detection and overcome background limitations.
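A hedged sketch of the hybrid idea in Choudhury et al. [ 39 ] follows: a pixel is kept only if it is both skin-colored and moving, which suppresses static skin-like background. It reuses `skin_mask_ycbcr()` from the previous sketch, and the motion threshold is an assumed value.

```python
# Hybrid frame differencing + skin color segmentation (illustrative sketch).
import cv2

def hand_mask(prev_gray, curr_gray, curr_bgr, motion_thresh=25):
    diff = cv2.absdiff(curr_gray, prev_gray)                  # frame differencing
    _, motion = cv2.threshold(diff, motion_thresh, 255, cv2.THRESH_BINARY)
    skin = skin_mask_ycbcr(curr_bgr)                          # from the sketch above
    return cv2.bitwise_and(motion, skin)                      # moving AND skin-colored
```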
This method extracts image features in order to model the visual appearance of the hand and compares these parameters with features extracted from the input image frames. The features are calculated directly from pixel intensities, without a prior segmentation step. The method runs in real time because the 2D image features are easy to extract, and it is considered easier to implement than the 3D model method. In addition, it can detect various skin tones. Using the AdaBoost learning algorithm, which maintains fixed features such as key points for a portion of the hand, the occlusion issue can be addressed [ 47 , 48 ]. Appearance-based approaches can be divided into two models: a motion model and a 2D static model. Table 2 presents a set of research papers that use different segmentation techniques based on appearance recognition to detect the region of interest (ROI).
A set of research papers that have used appearance-based detection for hand gesture application.
Author | Type of Camera | Resolution | Techniques/ Methods for Segmentation | Feature Extract Type | Classify Algorithm | Recognition Rate | No. of Gestures | Application Area | Dataset Type | Invariant Factor | Distance from Camera |
---|---|---|---|---|---|---|---|---|---|---|---|
[ ] | Logitech Quick Cam web camera | 320 × 240 pixels | Haar -like features & AdaBoost learning algorithm | hand posture | parallel cascade structure | above 90% | 4 hand postures | real-time vision-based hand gesture classification | Positive and negative hand sample collected by author | – | – |
[ ] | webcam-1.3 | 80 × 64 resize image for train | OTSU & canny edge detection technique for gray scale image | hand sign | feed-forward back propagation neural network | 92.33% | 26 static signs | American Sign Language | Dataset created by author | low differentiation | different distances |
[ ] | camera video | 320 × 240 pixels | Gaussian model describes hand color in HSV & AdaBoost algorithm | hand gesture | palm–finger configuration | 93% | 6 hand gestures | real-time hand gesture recognition method | – | – | – |
[ ] | camera–projector system | 384 × 288 pixels | background subtraction method | hand gesture | Fourier-based classification | 87.7% | 9 hand gestures | user-independent application | ground truth data set collected manually | point coordinates geometrically distorted & skin color | – |
[ ] | Monocular web camera | 320 × 240 pixels | combine Y–Cb–Cr & edge extraction & parallel finger edge appearance | hand posture based on finger gesture | finger model | – | 14 static gestures | substantial applications | The test data are collected from videos captured by web-camera | variation in lightness would result in edge extraction failure | ≤ 500 mm |
A study by Chen et al. [ 49 ] proposed two approaches for hand recognition. The first approach focused on posture recognition using Haar-like features, which can describe the hand posture pattern effectively, and used the AdaBoost learning algorithm to speed up performance and thus the classification rate. The second approach focused on gesture recognition, using a context-free grammar to analyze the syntactic structure based on the detected postures. Another study by Kulkarni and Lokhande [ 50 ] used a histogram technique to segment and observe images containing a large number of gestures, then applied edge detection with Canny, Sobel and Prewitt operators at different thresholds. Gesture classification was performed using a feed-forward back-propagation artificial neural network with supervised learning. Among the limitations reported by the authors, the histogram technique produces misclassifications because it can only be used for a small number of gestures that are completely distinct from each other. Fang et al. [ 51 ] used an extended AdaBoost method for hand detection and combined optical flow with a color cue for tracking; they also collected hand color from the neighborhood of the features’ mean position, using a single Gaussian model to describe hand color in HSV color space. Multiple features were extracted and gestures recognized using palm and finger decomposition, and scale-space feature detection was integrated into gesture recognition to overcome the aspect-ratio limitation facing most learning-based hand gesture methods. Licsár et al. [ 52 ] used a simple background subtraction method for hand segmentation and extended it to handle background changes, in order to face challenges such as skin-like colors and complex, dynamic backgrounds, then used a boundary-based method to classify hand gestures. Finally, Zhou et al. [ 53 ] proposed a novel method to extract the fingers directly: edges were extracted from the gesture images, the finger central area was obtained from those edges, and fingers were then obtained from the parallel-edge characteristics. The proposed system cannot recognize the side view of a hand pose. Figure 7 below shows a simple example of appearance-based recognition.
Example of appearance-based recognition using foreground extraction to segment only the ROI; the object features can be extracted using different techniques such as pattern or image subtraction and foreground–background segmentation algorithms.
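For the foreground extraction shown in Figure 7, one common realization is adaptive background subtraction; the sketch below uses OpenCV's MOG2 model, which is one possible choice rather than the specific algorithm used in the cited works.

```python
# Foreground (moving hand) extraction via adaptive background subtraction.
import cv2

subtractor = cv2.createBackgroundSubtractorMOG2(history=200, varThreshold=16)

def foreground_mask(frame_bgr):
    mask = subtractor.apply(frame_bgr)               # nonzero = foreground candidate
    kernel = cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (5, 5))
    return cv2.morphologyEx(mask, cv2.MORPH_OPEN, kernel)  # remove speckle noise
```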
According to the information in Table 2, the first row uses Haar-like features, which are well suited to analyzing ROI patterns efficiently. Haar-like features measure the contrast between dark and bright regions within a kernel, which is faster than pixel-based processing. In addition, they are robust to noise and lighting variation, because they compute the gray-value difference between the white and black rectangles. The recognition rate in the first row is above 90%; by comparison, the single Gaussian model used to describe hand color in HSV color space in the third row achieves 93%, although both proposed systems use the AdaBoost algorithm to speed up training and classification.
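To make the rectangle-contrast idea concrete, the sketch below evaluates a two-rectangle Haar-like feature with an integral image, which is what makes these features cheap to compute; it is a bare illustration, not the cascade of [ 49 ].

```python
# Two-rectangle Haar-like feature over an integral image (pure-numpy sketch).
import numpy as np

def haar_two_rect(gray, x, y, w, h):
    """Bright-left minus dark-right contrast of the (w x h) window at (x, y);
    assumes w is even."""
    ii = np.cumsum(np.cumsum(gray.astype(np.int64), axis=0), axis=1)  # integral image

    def rect_sum(x0, y0, x1, y1):        # inclusive corners, O(1) per rectangle
        s = int(ii[y1, x1])
        if x0 > 0: s -= int(ii[y1, x0 - 1])
        if y0 > 0: s -= int(ii[y0 - 1, x1])
        if x0 > 0 and y0 > 0: s += int(ii[y0 - 1, x0 - 1])
        return s

    half = w // 2
    left = rect_sum(x, y, x + half - 1, y + h - 1)
    right = rect_sum(x + half, y, x + w - 1, y + h - 1)
    return left - right   # gray-value difference, robust to uniform lighting shifts
```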
Motion-based recognition can be utilized for detection purposes, extracting the object across a series of image frames. The AdaBoost algorithm is used for object detection, characterization, movement modeling and the pattern recognition needed to recognize the gesture [ 16 ]. The main issue facing motion recognition arises when more than one gesture is active during the recognition process, and a dynamic background also has a negative effect. In addition, gestures may be lost through occlusion among tracked hand gestures, errors in region extraction from the tracked gesture, or the effect of long distance on the region's appearance. Table 3 presents a set of research papers that used different segmentation techniques based on motion recognition to detect the ROI.
A set of research papers that have used motion-based detection for hand gesture application.
Author | Type of Camera | Resolution | Techniques/ Methods for Segmentation | Feature Extract Type | Classify Algorithm | Recognition Rate | No. of Gestures | Application Area | Dataset Type | Invariant Factor | Distance from Camera |
---|---|---|---|---|---|---|---|---|---|---|---|
[ ] | off-the-shelf cameras | – | RGB, HSV, Y–Cb–Cr & motion tracking | hand gesture | histogram distribution model | 97.33% | 10 gestures | human–computer interface | Data set created by author | other object moving and background issue | – |
[ ] | Canon GL2 camera | 720 × 480 pixels | face detection & optical flow | motion gesture | leave-one-out cross-validation | – | 7 gestures | gesture recognition system | Data set created by author | – | – |
[ ] | time of flight (TOF) SR4000 | 176 × 144 pixels | depth information, motion patterns | motion gesture | motion patterns compared | 95% | 26 gestures | interaction with virtual environments | cardinal directions dataset | depth range limitation | 3000 mm |
[ ] | digital camera | – | YUV & CAMShift algorithm | hand gesture | naïve Bayes classifier | high | unlimited | human and machine system | Data set created by author | changed illumination, rotation problem, position problem | – |
Two stages for efficient hand detection were proposed in [ 54 ]. First, the hand is detected in each frame and its center point is used for tracking. In the second stage, a matching model is applied to each type of gesture using a set of features extracted from motion tracking to provide better classification; the main drawback is that skin color is affected by lighting variations, which leads to detecting non-skin regions. A standard face detection algorithm and optical flow computation were used in [ 55 ] to establish a user-centric coordinate frame in which motion features were used to recognize gestures, with classification by a multiclass boosting algorithm. A real-time dynamic hand gesture recognition system based on TOF was offered in [ 56 ], in which motion patterns were detected from hand gestures received as input depth images. These motion patterns were compared with hand motion classes computed from real dataset videos, without requiring a segmentation algorithm; the system provides good results apart from the depth-range limitation of the TOF camera. In [ 57 ], the YUV color space was used, with the help of the CAMShift algorithm, to distinguish between background and skin color, and a naïve Bayes classifier was implemented to assist with gesture recognition. The proposed system faces several challenges: illumination variation, where light changes affect the skin segmentation result; the degrees of freedom of the gesture, where rotation changes directly affect the output; and hand position capture, where a hand appearing in the corner of the frame may not be covered by the tracking dots, causing the user's gesture to be missed. In addition, hand size differs considerably between people, which may cause problems for the interaction system. However, the major remaining challenge is skin-like color in the scene, which affects the overall system and can invalidate the result. Figure 8 gives a simple example of hand motion recognition.
Example of motion recognition using frame-difference subtraction to extract hand features, where a moving object such as the hand is extracted from a fixed background.
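For the CAMShift-based tracking used in [ 57 ] above, the sketch below shows the standard OpenCV recipe: back-project a color histogram of the detected hand and let CAMShift follow the densest blob. Note that [ 57 ] worked in YUV; the hue-histogram variant here is the common OpenCV formulation, and the initial hand window `roi` is assumed to come from a prior detection step.

```python
# CAMShift hand tracking sketch (hue-histogram back-projection).
import cv2

def track_hand(frames, roi):                     # roi = (x, y, w, h) initial window
    x, y, w, h = roi
    hsv0 = cv2.cvtColor(frames[0], cv2.COLOR_BGR2HSV)
    hist = cv2.calcHist([hsv0[y:y+h, x:x+w]], [0], None, [16], [0, 180])
    cv2.normalize(hist, hist, 0, 255, cv2.NORM_MINMAX)
    term = (cv2.TERM_CRITERIA_EPS | cv2.TERM_CRITERIA_COUNT, 10, 1)
    window = roi
    for frame in frames[1:]:
        hsv = cv2.cvtColor(frame, cv2.COLOR_BGR2HSV)
        backproj = cv2.calcBackProject([hsv], [0], hist, [0, 180], 1)
        rot_rect, window = cv2.CamShift(backproj, window, term)
        yield rot_rect                           # rotated rectangle around the hand
```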
According to the information in Table 3, the recognition rate of the system in the first row is 97%, where a hybrid system based on skin detection and motion detection proved more reliable for gesture recognition: the moving hand is tracked using multiple track candidates based on standard deviation calculations for both the skin and motion cues. Every single gesture is encoded as a chain code, a simple model compared with the HMM, and gestures are classified using a histogram distribution model. The proposed system in the third row uses a TOF depth camera, where the motion patterns of a human arm model are used to define motion patterns; the authors confirm that using depth information to estimate hand trajectories improves the gesture recognition rate. Moreover, the proposed system needs no segmentation algorithm; it was examined using 2D and 2.5D approaches, where 2.5D performed better than 2D and gave a recognition rate of 95%.
Skeleton-based recognition specifies model parameters that can improve the detection of complex features [ 16 ]. Various representations of the skeleton data for the hand model can be used for classification: they describe geometric attributes and constraints and easily translate features and correlations in the data, focusing on geometric and statistical features. The most commonly used features are joint orientation, the distance between joints, skeletal joint locations, the angles between joints, and the trajectories and curvature of the joints. Table 4 presents a set of research papers that use different segmentation techniques based on skeletal recognition to detect the ROI.
Set of research papers that have used skeleton-based recognition for hand gesture application.
Author | Type of Camera | Resolution | Techniques/ Methods for Segmentation | Feature Extract Type | Classify Algorithm | Recognition Rate | No. of Gestures | Application Area | Dataset Type | Invariant Factor | Distance from Camera |
---|---|---|---|---|---|---|---|---|---|---|---|
[ ] | Kinect camera depth sensor | 512 × 424 pixels | Euclidean distance & geodesic distance | fingertip | skeleton pixels extracted | – | hand tracking | real time hand tracking method | – | – | – |
[ ] | Intel Real Sense depth camera | – | skeleton data | hand-skeletal joints’ positions | convolutional neural network (CNN) | 91.28% 84.35% | 14 gestures 28 gestures | classification method | Dynamic Hand Gesture-14/28 (DHG) dataset | only works on complete sequences | – |
[ ] | Kinect camera | 240 × 320 pixels | Laplacian-based contraction | skeleton point clouds | Hungarian algorithm | 80% | 12 gestures | hand gesture recognition method | ChaLearn Gesture Dataset (CGD2011) | lower HGR performance in the 0° viewpoint condition | – |
[ ] | RGB video sequence recorded | – | vision-based approach & skeletal data | hand and body skeletal features | skeleton classification network | – | hand gesture | sign language recognition | LSA64 dataset | difficulties in extracting skeletal data because of occlusions | – |
[ ] | Intel Real Sense depth camera | 640 × 480 pixels | depth and skeletal dataset | hand gesture | supervised learning classifier support vector machine (SVM) with a linear kernel | 88.24% 81.90% | 14 gestures 28 gestures | hand gesture application | SHREC 2017 track “3D Hand Skeletal Dataset” (created by authors) | – | – |
[ ] | Kinect v2 camera sensor | 512 × 424 pixels | depth metadata | dynamic hand gesture | SVM | 95.42% | 10 gestures 26 gestures | Arabic numbers (0–9) and letters (26) | authors’ own dataset | low recognition rate for “O”, “T” and “2” | – |
[ ] | Kinect RGB camera & depth sensor | 640 × 480 | skeleton data | hand blob | – | – | hand gesture | Malaysian sign language | – | – | – |
Hand segmentation using the depth sensor of the Kinect camera, followed by locating the fingertips using 3D connections, Euclidean distance and geodesic distance over hand skeleton pixels for increased accuracy, was proposed in [ 58 ]. A new 3D hand gesture recognition approach based on a deep learning model using parallel convolutional neural networks (CNNs) to process hand skeleton joint positions was introduced in [ 59 ]; the proposed system has the limitation that it works only on complete sequences. In [ 60 ], the optimal viewpoint was estimated and the gesture point cloud transformed using a curve skeleton to specify topology, then Laplacian-based contraction was applied to specify the skeleton points. The Hungarian algorithm was applied to calculate match scores for the skeleton point sets, but the joint tracking information acquired by Kinect is not accurate enough, which yields results with constant vibration. A novel method based on skeletal features extracted from RGB recorded video of sign language, which presents difficulties in extracting accurate skeletal data because of occlusions, was offered in [ 61 ]. A dynamic hand gesture approach using a depth and skeletal dataset for skeleton-based recognition was presented in [ 62 ], where a supervised learning classifier (SVM) with a linear kernel was used for classification. Another dynamic hand gesture recognition system, proposed in [ 63 ], used Kinect sensor depth metadata for acquisition and segmentation and extracted orientation features; SVM and HMM algorithms were utilized for classification and recognition, with SVM giving better results than HMM on measures such as elapsed time and average recognition rate. A hybrid method for hand segmentation based on depth and color data acquired by the Kinect sensor, with the help of skeletal data, was proposed in [ 64 ]. In this method, an image threshold is applied to the depth frame and a super-pixel segmentation method is used to extract the hand from the color frame, then the two results are combined for robust segmentation. Figure 9 shows an example of skeleton recognition.
Example of skeleton recognition using a depth and skeleton dataset to represent the hand skeleton model [ 62 ].
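Two of the skeleton features listed above, inter-joint distance and joint angle, reduce to a few lines of vector algebra once 3D joint coordinates are available; the sketch below uses hypothetical index-finger joints for illustration.

```python
# Inter-joint distance and joint-angle features from 3D joint coordinates.
import numpy as np

def joint_distance(a, b):
    return np.linalg.norm(a - b)

def joint_angle(parent, joint, child):
    """Angle (radians) at `joint` between the bones joint->parent and joint->child."""
    u, v = parent - joint, child - joint
    cos = np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
    return np.arccos(np.clip(cos, -1.0, 1.0))

# Hypothetical index-finger joints (mm): knuckle, middle joint, fingertip.
mcp, pip, tip = np.array([0., 0, 0]), np.array([0., 30, 0]), np.array([0., 50, 20])
print(joint_distance(mcp, tip), np.degrees(joint_angle(mcp, pip, tip)))  # ~53.9, 135.0
```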
According to the information in Table 4, the depth camera provides good segmentation accuracy, because it is not affected by lighting variations or cluttered backgrounds; the main issue is the detection range. The Kinect V1 sensor has an embedded system that returns the depth sensor's readings as metadata giving the coordinates of human body joints. Kinect V1 can track up to 20 skeletal joints, which helps to model the hand skeleton, while the Kinect V2 sensor can track 25 joints, for up to six people at the same time with full joint tracking, over a detection range of 0.5 to 4.5 m.
Various approaches have been proposed to solve hand gesture recognition using different types of cameras. A depth camera provides 3D geometric information about the object [ 65 ]. Two major approaches have been utilized: TOF sensing and light coding. The 3D data from a depth camera directly reflect the depth field, whereas a color image contains only a projection [ 66 ]. With this approach, lighting, shade and color do not affect the resulting image; however, the cost, size and availability of depth cameras limit their use [ 67 ]. Table 5 presents a set of research papers that use different segmentation techniques based on depth recognition to detect the ROI.
Set of research papers that have used depth-based detection for hand gesture and finger counting application.
Author | Type of Camera | Resolution | Techniques/ Methods for Segmentation | Feature Extract Type | Classify Algorithm | Recognition Rate | No. of Gestures | Application Area | Invariant Factor | Distance from Camera |
---|---|---|---|---|---|---|---|---|---|---|
[ ] | Kinect V1 | RGB - 640 × 480 depth - 320 × 240 | threshold & near-convex shape | finger gesture | finger–earth movers distance (FEMD) | 93.9% | 10 gestures | human–computer interactions (HCI) | – | – |
[ ] | Kinect V2 | RGB - 1920 × 1080 depth - 512 × 424 | local neighbor method & threshold segmentation | fingertip | convex hull detection algorithm | 96% | 6 gestures | natural human–robot interaction | – | (500–2000) mm |
[ ] | Kinect V2 | Infrared sensor depth - 512 × 424 | operation of depth and infrared images | finger counting & hand gesture | number of separate areas | – | finger count & two hand gestures | mouse-movement controlling | – | < 500 mm |
[ ] | Kinect V1 | RGB - 640 × 480 depth - 320 × 240 | depth thresholds | finger gesture | finger counting classifier & finger name collect & vector matching | 84% one hand 90% two hand | 9 gestures | chatting with speech | – | (500–800) mm |
[ ] | Kinect V1 | RGB - 640 × 480 depth - 320 × 240 | frame difference algorithm | hand gesture | automatic state machine (ASM) | 94% | hand gesture | human–computer interaction | – | – |
[ ] | Kinect V1 | RGB - 640 × 480 depth - 320 × 240 | skin & motion detection & Hu moments an orientation | hand gesture | discrete hidden Markov model (DHMM) | – | 10 gestures | human–computer interfacing | – | – |
[ ] | Kinect V1 | depth - 640 × 480 | range of depth image | hand gestures 1–5 | kNN classifier & Euclidian distance | 88% | 5 gestures | electronic home appliances | – | (250–650) mm |
[ ] | Kinect V1 | depth - 640 × 480 | distance method | hand gesture | – | – | hand gesture | human–computer interaction (HCI) | – | – |
[ ] | Kinect V1 | depth - 640 × 480 | threshold range | hand gesture | – | – | hand gesture | hand rehabilitation system | – | 400–1500 mm |
[ ] | Kinect V2 | RGB - 1920 × 1080 depth - 512 × 424 | Otsu’s global threshold | finger gesture | kNN classifier & Euclidian distance | 90% | finger count | human–computer interaction (HCI) | hand not identified if it’s not connected with boundary | 250–650 mm |
[ ] | Kinect V1 | RGB - 640 × 480 depth - 640 × 480 | depth-based data and RGB data together | finger gesture | distance from the device and shape-based matching | 91% | 6 gestures | finger mouse interface | – | (500–800) mm |
[ ] | Kinect V1 | depth - 640 × 480 | depth threshold and K-curvature | finger counting | depth threshold and K-curvature | 73.7% | 5 gestures | picture selection application | fingertips must be detected even while the hand moves or rotates | – |
[ ] | Kinect V1 | RGB - 640 × 480 depth - 320 × 240 | integrate the RGB and depth information | hand gesture | forward recursion & SURF | 90% | hand gesture | virtual environment | – | – |
[ ] | Kinect V2 | depth - 512 × 424 | skeletal data stream & depth & color data streams | hand gesture | support vector machine (SVM) & artificial neural networks (ANN) | 93.4% for SVM 98.2% for ANN | 24 alphabet hand gestures | American Sign Language | – | (500–800) mm |
The finger earth mover's distance (FEMD) approach was evaluated in terms of speed and precision, and compared with a shape-matching algorithm, using the depth map and color image acquired by the Kinect camera [ 65 ]. Improved depth-threshold segmentation was offered in [ 68 ] by combining depth and color information using a hierarchical scan method, followed by hand segmentation with the local neighbor method; this approach works over a range of up to two meters. A new method was proposed in [ 69 ] for a near depth range of less than 0.5 m, where skeletal data are not provided by Kinect; it was implemented using two image frames, depth and infrared. A depth threshold was used to segment the hand, then a K-means algorithm was applied to obtain the user's hand pixels [ 70 ]; next, Graham's scan algorithm was used to detect the convex hulls of the hand, merged with the result of a contour tracing algorithm to detect the fingertips. The depth image frame was analyzed to extract 3D hand gestures in real time using frame differences to detect moving objects [ 71 ]; the foremost region was used and classified with an automatic state machine algorithm. A skin–motion detection technique was used to detect the hand, then Hu moments were applied for feature extraction, after which an HMM was used for gesture recognition [ 72 ]. Depth range was utilized for hand segmentation, then Otsu's method was used to apply a threshold to the color frame after conversion to grayscale [ 14 ]; a kNN classifier was then used to classify gestures. In [ 73 ], the hand was segmented based on depth information using a distance method, and background subtraction and iterative techniques were applied to remove the depth image shadow and decrease noise. In [ 74 ], the segmentation used 3D depth data selected via a threshold range. In [ 75 ], the proposed algorithm used an RGB color frame converted to a binary frame using Otsu's global threshold; a depth range was then selected for hand segmentation and the two results aligned, and finally the kNN algorithm with Euclidean distance was used for finger classification. Depth data and an RGB frame were used together for robust hand segmentation, and the segmented hand was matched with a dataset classifier to identify the fingertip [ 76 ]; this framework was based on distance from the device and shape-based matching. Fingertip selection using a depth threshold and the K-curvature algorithm based on depth data was presented in [ 77 ]. A novel segmentation method was implemented in [ 78 ] by integrating RGB and depth data, with classification offered using speeded-up robust features (SURF). Depth information with skeletal and color data was used in [ 79 ] to detect the hand, and the segmented hand was matched with the dataset using an SVM and artificial neural networks (ANN) for recognition; the authors concluded that the ANN was more accurate than the SVM. Figure 10 shows an example of segmentation using the Kinect depth sensor.
Depth-based recognition: (a) hand joint distance from camera; (b) different feature extraction using the Kinect depth sensor.
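The depth-band segmentation and kNN classification steps recurring in Table 5 (e.g., [ 14 , 75 ]) can be sketched as follows; the 150 mm band and k = 3 are assumed values, and the feature vector is left abstract (Hu moments of the silhouette would be one choice).

```python
# Depth-threshold hand segmentation plus kNN classification (illustrative sketch).
import numpy as np

def hand_mask_from_depth(depth_mm, band=150):
    """Keep pixels within `band` mm of the nearest valid reading (0 = no data),
    assuming the hand is the closest object to the sensor."""
    valid = depth_mm[depth_mm > 0]
    if valid.size == 0:
        return np.zeros(depth_mm.shape, dtype=np.uint8)
    return ((depth_mm > 0) & (depth_mm < valid.min() + band)).astype(np.uint8) * 255

def knn_classify(feature, train_feats, train_labels, k=3):
    """Majority vote among the k nearest training samples (Euclidean distance)."""
    dists = np.linalg.norm(train_feats - feature, axis=1)
    nearest = train_labels[np.argsort(dists)[:k]]
    return np.bincount(nearest).argmax()
```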
The 3D model approach essentially depends on a 3D kinematic hand model with a large number of degrees of freedom, where the hand parameters are estimated by comparing the input image with the 2D appearance projected by the 3D hand model. The 3D model represents the human hand for pose estimation by forming a volumetric, skeletal or 3D mesh model that matches the user's hand, and the model parameters are updated through the matching process. A depth parameter is added to the model to increase accuracy. Table 6 presents a set of research papers based on 3D models.
Set of research papers that have used 3D model-based recognition for HCI, VR and human behavior application.
Author | Type of Camera | Techniques/ Methods for Segmentation | Feature Extract Type | Classify Algorithm | Type of Error | Hardware Run | Application Area | Dataset Type | Runtime Speed |
---|---|---|---|---|---|---|---|---|---|
[ ] | RGB camera | network directly predicts the control points in 3D | 3D hand poses, 6D object poses, object classes and action categories | PnP algorithm & single-shot neural network | Fingertips 48.4 mm Object coordinates 23.7 mm | real-time speed of 25 fps on an NVIDIA Tesla M40 | framework for understanding human behavior through 3D hand and object interactions | First-person hand action (FPHA) dataset | 25 fps |
[ ] | Prime sense depth cameras | depth maps | 3D hand pose estimation & sphere model renderings | Pose estimation neural network | mean joint error (stack = 1) 12.6 mm (stack = 2) 12.3 mm | – | design hand pose estimation using self-supervision method | NYU Hand Pose Dataset | – |
[ ] | RGB-D camera | Single RGB image direct feed to the network | 3D hand shape and pose | train networks with full supervision | Mesh error 7.95 mm Pose error 8.03 mm | Nvidia GTX 1080 GPU | design model for estimate 3D hand shape from a monocular RGB image | Stereo hand pose tracking benchmark (STB) & Rendered Hand Pose Dataset (RHD) | 50 fps |
[ ] | Kinect V2 camera | segmentation mask Kinect body tracker | hand | machine learning | Marker error 5% subset of the frames in each sequence & pixel classification error | CPU only | interactions with virtual and augmented worlds | Finger paint dataset & NYU dataset used for comparison | high frame-rate |
[ ] | raw depth image | CNN-based hand segmentation | 3D hand pose regression pipeline | CNN-based algorithm | 3D Joint Location Error 12.9 mm | Nvidia Geforce GTX 1080 Ti GPU | applications of virtual reality (VR) | dataset contains 8000 original depth images created by authors | – |
[ ] | Kinect V2 camera | bounding box around the hand & hand mask | hand | appearance and the kinematics of the hand | percentage of template vertices over all frames | – | Interaction with deformable object & tracking | synthetic dataset generated with the Blender modeling software | – |
[ ] | RGBD data from 3 Kinect devices | regression-based method & hierarchical feature extraction | 3D hand pose estimation | 3D hand pose estimation via semi-supervised learning. | Mean error 7.7 mm | NVIDIA TITAN Xp GPU | human–computer interaction (HCI), computer graphics and virtual/augmented reality | For evaluation ICVL Dataset & MSRA Dataset & NYU Dataset | 58 fps |
[ ] | single depth images. | depth image | 3D hand pose | 3D point cloud of hand as network input and outputs heat-maps | mean error distances | Nvidia TITAN Xp GPU | (HCI), computer graphics and virtual/augmented reality | For evaluation NYU dataset & ICVL dataset & MSRA datasets | 41.8 fps |
[ ] | depth images | predicting heat maps of hand joints in detection-based methods | hand pose estimation | dense feature maps through intermediate supervision in a regression-based framework | mean error 6.68 mm maximal per-joint error 8.73 mm | GeForce GTX 1080 Ti | (HCI), virtual and mixed reality | For evaluation ‘HANDS 2017′ challenge dataset & first-person hand action | – |
[ ] | RGB-D cameras | – | 3D hand pose estimation | weakly supervised method | mean error 0.6 mm | GeForce GTX 1080 GPU with CUDA 8.0. | (HCI), virtual and mixed reality | Rendered hand pose (RHD) dataset | – |
A study by Tekin et al. [ 80 ] proposed a new model to understand interactions between 3D hands and objects using a single RGB image, where a network trained end-to-end on single images jointly estimates the hand and object poses in 3D. Wan et al. [ 81 ] proposed 3D hand pose estimation from a single depth map using a self-supervised neural network that approximates the hand surface with a set of spheres. A novel method of estimating full 3D hand shape and pose was presented by Ge et al. [ 82 ], based on a single RGB image, where a graph convolutional neural network (Graph CNN) is utilized to reconstruct the full 3D mesh of the hand surface. Another study by Taylor et al. [ 83 ] proposed a new system for tracking the human hand that combines a surface model with a new energy function optimized continuously and jointly over pose and correspondences, and can track the hand several meters from the camera. Malik et al. [ 84 ] proposed a novel CNN-based algorithm that automatically learns to segment the hand from a raw depth image and estimates the 3D hand pose, including the structural constraints of the hand skeleton. Tsoli et al. [ 85 ] presented a novel method to track a complex deformable object in interaction with a hand. Chen et al. [ 86 ] proposed a self-organizing hand network (SO-HandNet), which achieves 3D hand pose estimation via semi-supervised learning, using an end-to-end regression method on a single depth image. Another study by Ge et al. [ 87 ] proposed a point-to-point regression method for 3D hand pose estimation in single depth images. Wu et al. [ 88 ] proposed novel hand pose estimation from a single depth image, combining detection-based and regression-based methods to improve accuracy. Cai et al. [ 89 ] presented a way to adapt a weakly labeled real-world dataset from a fully annotated synthetic dataset with the aid of low-cost depth images, taking only RGB inputs for 3D joint predictions. Figure 11 shows an example of a 3D hand model interacting with a virtual system.
3D hand model interaction with a virtual system [ 83 ].
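The model-fitting loop at the heart of these 3D approaches can be caricatured as an optimization: choose pose parameters so that the surface predicted by a kinematic model best explains the observed data. The toy sketch below fits a two-sphere "hand" to a point cloud in the spirit of the sphere approximation in [ 81 ]; the kinematic function, radius and data are all stand-ins.

```python
# Toy model-based pose estimation: fit sphere-model parameters to a point cloud.
import numpy as np
from scipy.optimize import minimize

def forward_kinematics(pose):
    """Stand-in kinematic model: pose -> sphere centers (a fake 2-sphere 'hand')."""
    x, y, z, spread = pose
    return np.array([[x, y, z], [x + spread, y, z]])

def fitting_energy(pose, points, radius=10.0):
    centers = forward_kinematics(pose)
    d = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
    return np.mean((d.min(axis=1) - radius) ** 2)  # distance to nearest sphere surface

points = np.random.randn(200, 3) * 5 + np.array([50.0, 0.0, 400.0])  # fake point cloud
fit = minimize(fitting_energy, x0=[0.0, 0.0, 300.0, 20.0], args=(points,),
               method="Nelder-Mead")
print(fit.x)  # estimated pose parameters
```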
Some limitations have been reported: the 3D model requires a large image dataset to formulate the characteristic shapes of the hand in the multi-view case; the matching process is time-consuming and computationally costly; and the approach is less able to handle unclear views.
Artificial intelligence offers reliable techniques used in a wide range of modern applications because of its learning principle. Deep learning uses multiple layers to learn from data and gives good predictions. The main challenge facing this technique is the dataset required to train the algorithm, which may affect processing time. Table 7 presents a set of research papers that use different techniques based on deep-learning recognition to detect the ROI.
Set of research papers that have used deep-learning-based recognition for hand gesture application.
Author | Type of Camera | Resolution | Techniques/ Methods for Segmentation | Feature Extract Type | Classify Algorithm | Recognition Rate | No. of Gestures | Application Area | Dataset Type | Hardware Run |
---|---|---|---|---|---|---|---|---|---|---|
[ ] | Different mobile cameras | HD and 4K | feature extraction by CNN | hand gestures | adapted deep convolutional neural network (ADCNN) | training set 100% test set 99% | 7 hand gestures | HCI communication for stroke-injured people | created from recorded video frames | Core™ i7-6700 CPU @ 3.40 GHz |
[ ] | webcam | – | skin color detection and morphology & background subtraction | hand gestures | deep convolutional neural network (CNN) | training set 99.9% test set 95.61% | 6 hand gestures | Home appliance control (smart homes) | 4800 image collect for train and 300 for test | – |
[ ] | RGB image | 640 × 480 pixels | No segment stage Image direct fed to CNN after resizing | hand gestures | deep convolutional neural network | simple backgrounds 97.1% complex background 85.3% | 7 hand gestures | Command consumer electronics device such as mobiles phones and TVs | Mantecón et al.* dataset for direct testing | GPU with 1664 cores, base clock of 1050 MHz |
[ ] | Kinect | – | skin color modeling combined with convolution neural network image feature | hand gestures | convolution neural network & support vector machine | 98.52% | 8 hand gestures | – | image information collected by Kinect | CPU E5-1620 v4, 3.50 GHz |
[ ] | Kinect | Image size 200 × 200 | skin color -Y–Cb–Cr color space & Gaussian Mixture model | hand gestures | convolution neural network | Average 95.96% | 7 hand gestures | human hand gesture recognition system | image information collected by Kinect | – |
[ ] | video sequences recorded | – | Semantic segmentation based deconvolution neural network | hand gesture motion | convolution network (LRCN) deep | 95% | 9 hand gestures | intelligent vehicle applications | Cambridge gesture recognition dataset | Nvidia Geforce GTX 980 graphics |
[ ] | image | Original images in the database 248 × 256 or 128 × 128 pixels | Canny operator edge detection | hand gesture | double channel convolutional neural network (DC-CNN) & softmax classifier | 98.02% | 10 hand gestures | man–machine interaction | Jochen Triesch Database (JTD) & NAO Camera hand posture Database (NCD) | Core i5 processor |
[ ] | Kinect | – | – | Skeleton-based hand gesture recognition. | neural network based on SPD | 85.39% | 14 hand gestures | – | Dynamic Hand Gesture (DHG) dataset & First-Person Hand Action (FPHA) dataset | non-optimized CPU 3.4 GHz |
The authors of [ 90 ] proposed seven popular hand gestures captured by a mobile camera, generating 24,698 image frames. Feature extraction and an adapted deep convolutional neural network (ADCNN) were utilized for hand classification; the experiment achieved 100% on the training data and 99% on the testing data, with an execution time of 15,598 s. Another proposed system used a webcam to track the hand, then applied a skin color technique (Y–Cb–Cr color space) and morphology to remove the background, with kernel correlation filters (KCF) used to track the ROI. The resulting image is fed into a deep convolutional neural network (CNN), used to compare the performance of two networks modified from AlexNet and VGGNet; the recognition rates for training and testing data were 99.90% and 95.61%, respectively, in [ 91 ]. A method based on a deep CNN, where the resized image is fed directly into the network, skipping the segmentation and detection stages, classifies hand gestures directly; the system works in real time and achieves 97.1% with a simple background and 85.3% with a complex background in [ 92 ]. The depth image produced by the Kinect sensor was used to segment the color image, then skin color modeling was combined with CNN image features, with the error back-propagation algorithm applied to modify the thresholds and weights of the network; an SVM classifier was added to the network to enhance the result in [ 93 ]. Another research study used a Gaussian mixture model (GMM) to filter out non-skin colors from images used to train a CNN to recognize seven hand gestures, with an average recognition rate of 95.96% in [ 94 ]. The next proposed system used a long-term recurrent convolutional network (LRCN)-based action classifier, where multiple frames sampled from recorded video sequences are fed to the network; to extract representative frames, a semantic-segmentation-based deconvolutional neural network is used, trained with tiled image patterns and tiled binary patterns in [ 95 ]. A double-channel convolutional neural network (DC-CNN) was proposed in [ 96 ], where the original image is preprocessed to detect the edges of the hand before being fed to the network; each of the two CNN channels has separate weights, and a softmax classifier is used to classify the output, giving a recognition rate of 98.02%. Finally, a new neural network based on SPD manifold learning for skeleton-based hand gesture recognition was proposed in [ 97 ]. Figure 12 below shows an example of a deep-learning convolutional neural network.
Simple example of a deep-learning convolutional neural network architecture.
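A minimal PyTorch sketch of the kind of CNN classifier these studies build on is shown below; the layer sizes, input resolution and seven-class output are illustrative assumptions, not the architecture of any specific cited system.

```python
# Minimal CNN gesture classifier (illustrative architecture).
import torch
import torch.nn as nn

class GestureCNN(nn.Module):
    def __init__(self, n_gestures=7):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),   # 64 -> 32
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),  # 32 -> 16
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),  # 16 -> 8
        )
        self.classifier = nn.Linear(64 * 8 * 8, n_gestures)

    def forward(self, x):                     # x: (batch, 3, 64, 64) hand crops
        return self.classifier(self.features(x).flatten(1))

logits = GestureCNN()(torch.randn(1, 3, 64, 64))   # -> (1, 7) class scores
```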
Research into hand gestures has become an exciting and relevant field; it offers a means of natural interaction and reduces the cost of sensor hardware such as data gloves. Conventional interactive methods depend on devices such as a mouse, keyboard, touch screen, joystick for gaming and consoles for machine control. The following sections describe some popular applications of hand gestures. Figure 13 shows the most common application areas addressed by hand gesture recognition techniques.
Most common application areas of hand gesture interaction systems (the image in Figure 13 is adapted from [ 12 , 14 , 42 , 76 , 83 , 98 , 99 ]).
During clinical operations, a surgeon may need details about the patient’s entire body structure or a detailed organ model in order to shorten the operating time or increase the accuracy of the result. This is achieved by using a medical imaging system such as MRI, CT or X-ray system [ 10 , 99 ], which collects data from the patient’s body and displays them on the screen as a detailed image. The surgeon can facilitate interaction with the viewed images by performing hand gestures in front of the camera using a computer vision technique. These gestures can enable some operations such as zooming, rotating, image cropping and going to the next or previous slide without using any peripheral device such as a mouse, keyboard or touch screen. Any additional equipment requires sterilization, which can be difficult in the case of keyboards and touch screen. In addition, hand gestures can be used for assistive purpose such as wheelchair control [ 43 ].
Sign language is an alternative method used by people who are unable to communicate with others by speech. It consists of a set of gestures wherein every gesture represents one letter, number or expression. Many research papers have proposed recognition of sign language for deaf-mute people, using a glove-attached sensor worn on the hand that gives responses according to hand movement. Alternatively, it may involve uncovered hand interaction with the camera, using computer vision techniques to identify the gesture. For both approaches mentioned above, the dataset used for classification of gestures matches a real-time gesture made by the user [ 11 , 42 , 50 ].
Robot technology is used in many application fields such as industry, assistive services [ 100 ], stores, sports and entertainment. Robotic control systems use machine learning techniques, artificial intelligence and complex algorithms to execute specific tasks, which lets the robotic system interact naturally with the environment and make independent decisions. Some research proposes computer vision technology with a robot to build assistive systems for elderly people. Other research uses computer vision to enable a robot to ask a human for a proper path inside a specific building [ 12 ].
Virtual environments are based on a 3D model that needs a 3D gesture recognition system in order to interact in real time as an HCI. These gestures may be used for modification and viewing or for recreational purposes, such as playing a virtual piano. The gesture recognition system utilizes a dataset to match against gestures acquired in real time [ 13 , 78 , 83 ].
Hand gestures can be used efficiently for home automation. Shaking a hand or performing some gesture can easily enable control of lighting, fans, television, radio, etc. They can be used to improve older people’s quality of life [ 14 ].
Hand gestures can be used as an alternative input device that enables interaction with a computer without a mouse or keyboard, such as dragging, dropping and moving files through the desktop environment, as well as cut and paste operations [ 19 , 69 , 76 ]. Moreover, they can be used to control slide show presentations [ 15 ]. In addition, they are used with a tablet to permit deaf-mute people to interact with other people by moving their hand in front of tablet’s camera. This requires the installation of an application that translates sign language to text, which is displayed on the screen. This is analogous to the conversion of acquired voice to text.
The best example of gesture interaction for gaming purposes is the Microsoft Kinect Xbox, which has a camera placed over the screen and connects with the Xbox device through the cable port. The user can interact with the game by using hand motions and body movements that are tracked by the Kinect camera sensor [ 16 , 98 ].
From the previous sections, the research gap is easy to identify: most research studies focus on computer applications, sign language and interaction with 3D objects in virtual environments, and many papers deal with enhancing frameworks for hand gesture recognition or developing new algorithms rather than executing practical applications in health care. The biggest challenge facing researchers is designing a robust framework that overcomes the most common issues with few limitations and gives accurate, reliable results. Most proposed hand gesture systems fall into two categories of computer vision technique. The first, simpler approach uses image processing techniques via the OpenNI or OpenCV libraries, possibly with other tools, to provide real-time interaction; it is time-consuming because of real-time processing and has limitations such as background issues, illumination variation, distance limits and multi-object or multi-gesture problems. The second approach matches input gestures against a gesture dataset, where considerably more complex patterns require complex algorithms: deep learning and other artificial intelligence techniques match the interaction gesture in real time against a dataset of predefined postures or gestures. Although this approach can identify a large number of gestures, it has drawbacks: some gestures may be missed because classification accuracy varies, it takes longer than the first approach when matching against a large dataset, and the gesture dataset often cannot be reused by other frameworks.
Hand gesture recognition addresses a shortcoming in interaction systems. Controlling things by hand is more natural, easier, more flexible and cheaper, and there is no need to fix problems caused by hardware devices, since none are required. The previous sections make clear the need for substantial effort in developing reliable and robust algorithms, helped by camera sensors whose characteristics counter the common issues and achieve reliable results. Each technique mentioned above, however, has its advantages and disadvantages, and may perform well against some challenges while being inferior against others.
The authors would like to thank the staff in Electrical Engineering Technical College, Middle Technical University, Baghdad, Iraq and the participants for their support to conduct the experiments.
Conceptualization, A.A.-N. & M.O.; funding acquisition, A.A.-N. & J.C.; investigation, M.O.; methodology, M.O. & A.A.-N.; project administration, A.A.-N. and J.C.; supervision, A.A.-N. & J.C.; writing – original draft, M.O.; writing – review & editing, M.O., A.A.-N. & J.C. All authors have read and agreed to the published version of the manuscript.
This research received no external funding.
The authors of this manuscript have no conflicts of interest relevant to this work.
Contributed equally to this work with: Jiawei Wu, Peng Ren
Affiliations: School of Medical Information and Engineering, Xuzhou Medical University, Xuzhou, China; Engineering Research Center of Medical and Health Sensing Technology, Xuzhou Medical University, Xuzhou, China
* E-mail: [email protected]
As a novel form of human machine interaction (HMI), hand gesture recognition (HGR) has garnered extensive attention and research. The majority of HGR studies are based on visual systems, inevitably encountering challenges such as depth and occlusion. On the contrary, data gloves can facilitate data collection with minimal interference in complex environments, thus becoming a research focus in fields such as medical simulation and virtual reality. To explore the application of data gloves in dynamic gesture recognition, this paper proposes a data glove-based dynamic gesture recognition model called the Attention-based CNN-BiLSTM Network (A-CBLN). In A-CBLN, the convolutional neural network (CNN) is employed to capture local features, while the bidirectional long short-term memory (BiLSTM) is used to extract contextual temporal features of gesture data. By utilizing attention mechanisms to allocate weights to gesture features, the model enhances its understanding of different gesture meanings, thereby improving recognition accuracy. We selected seven dynamic gestures as research targets and recruited 32 subjects for participation. Experimental results demonstrate that A-CBLN effectively addresses the challenge of dynamic gesture recognition, outperforming existing models and achieving optimal gesture recognition performance, with the accuracy of 95.05% and precision of 95.43% on the test dataset.
Citation: Wu J, Ren P, Song B, Zhang R, Zhao C, Zhang X (2023) Data glove-based gesture recognition using CNN-BiLSTM model with attention mechanism. PLoS ONE 18(11): e0294174. https://doi.org/10.1371/journal.pone.0294174
Editor: Muhammad Bilal, University of Southampton - Malaysia Campus, MALAYSIA
Received: July 20, 2023; Accepted: October 26, 2023; Published: November 17, 2023
Copyright: © 2023 Wu et al. This is an open access article distributed under the terms of the Creative Commons Attribution License , which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Data Availability: The experimental data used in this article are subject to access restrictions, and the corresponding author does not have permission to make them public. If you have any questions, or if you would like to request access to the data set, please contact Heng Wan, Director of the Information Security Department at Xuzhou Medical University, at the following email: [email protected] .
Funding: This research was funded by The Unveiling & Leading Project of XZHMU, grant number No. JBGS202204. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
Competing interests: The authors have declared that no competing interests exist.
With the rapid development of computer technology and artificial intelligence, Human Machine Interaction (HMI) has emerged as one of the most prominent research fields in contemporary times. The driving force behind HMI is our expectation that machines will become intelligent and perceptive like humans [ 1 ]. HMI refers to the process of exchanging information between humans and machines through effective dialogue. HMI systems can collect human-intended information and transform it into a format understandable by machines, enabling machines to operate based on human intent [ 2 ]. Traditional HMI primarily relies on tools such as joysticks, keyboards, and mice to control terminals, which usually require fixed operational spaces. This severely restricts the range of human expressive actions and diminishes work efficiency. Consequently, to enhance the naturalness of HMI, the next generation of HMI technology needs to be human-centric, diversified, and intelligent [ 3 ]. In real-life situations, besides verbal communication, gestures serve as one of the most significant means for humans to convey information, enabling direct and effective expression of user needs. Research conducted by Liu et al. pointed out that hand gestures constitute a significant part of human communication, with advantages including high flexibility and rich meaning, making them an important modality in HMI [ 4 ]. Consequently, Hand Gesture Recognition (HGR) has emerged as a new type of HMI technology and has become a research hotspot with enormous potential in various domains. For instance, in the healthcare domain, capturing and analyzing physiological characteristics related to finger movements can significantly assist in studying and developing appropriate rehabilitation postures [ 5 ]. In the field of mechanical automation, interaction between fingers and machines can be achieved by detecting finger motion trajectories [ 6 ]. In the field of virtual reality, defining different gesture commands allows users to control the movements of virtual characters from a first-person perspective [ 7 ].
Research on HGR can be classified into two categories based on the methods of acquiring gesture data: vision-based HGR and wearable device-based HGR. Vision-based HGR relies on cameras as the primary tools for capturing gesture data. They offer advantages such as low cost and no direct contact with the human hands. However, despite the success of high-quality cameras, vision-based systems still have some inherent limitations, including a restricted field of view and high computational costs [ 8 , 9 ]. In certain scenarios, robust results may require the combined data acquisition from multiple cameras due to issues like depth and occlusion [ 10 , 11 ]. Consequently, the presence of these aforementioned challenges often hinders vision-based HGR methods from achieving optimal performance. In recent years, wearable device-based HGR has witnessed rapid development due to advancements in sensing technology and widespread sensor applications. Compared to vision-based approaches, wearable device-based HGR eliminates the need to consider camera distribution and is less susceptible to external environmental factors such as lighting, occlusion, and background interference. Data gloves represent a typical example of wearable devices used in HGR. These gloves are equipped with position tracking sensors that enable real-time capture of spatial motion trajectory information of users’ hand postures. Based on predefined algorithms, gesture actions can be recognized, mapped to corresponding response modules, and thus complete the HMI process. HGR systems based on data gloves have become a research hotspot in the relevant field. These systems offer several advantages, including stable acquisition of gesture data, reduced interference from complex environments and satisfactory modeling and recognition results, especially when dealing with large-scale gesture data [ 12 ].
In the field of HGR, researchers primarily focus on two types of gestures: static gestures and dynamic gestures. Static HGR systems analyze hand posture data at a specific moment to determine its corresponding meaning. However, static gesture data only provide spatial information of hand postures at each moment, while temporal information of hand movements is disregarded. As a result, the actual semantic information conveyed is limited, making it challenging to extend to complex real-world applications. Dynamic HGR systems, on the other hand, deal with information regarding the changes in hand movement postures over a period of time. These systems require a comprehensive consideration of both spatial and temporal aspects of hand postures. Clearly, compared to static gestures, dynamic gestures can convey richer semantic information and better align with people’s actual needs in real-life scenarios. Although numerous research efforts have been dedicated to dynamic HGR algorithms, most are based on vision systems, and the challenge of dynamic HGR using data gloves remains.
The dynamic gesture investigated in this study is the seven-step handwashing, which is a crucial step in the healthcare field. Proper handwashing procedures can effectively reduce the probability of disease transmission. Our work applies the seven-step handwashing to medical simulation training, where users wear data gloves to perform the handwashing process. Additionally, we design an automated dynamic gesture recognition algorithm to assess whether users correctly execute the specified hand gesture steps. Specifically, we developed a data glove-based dynamic HGR algorithm in this paper by incorporating deep learning techniques. This algorithm considers both spatial and temporal information of gesture data. Firstly, the Convolutional Neural Network (CNN) is utilized to extract local features of gesture data at each moment. Subsequently, these features are incorporated into the Bidirectional Long Short-Term Memory (BiLSTM) structure to model the temporal relationships. Finally, an attention mechanism is employed to enhance the gesture features and output the recognition results of dynamic gestures. In summary, this paper makes three main contributions:
The remaining sections of this paper are organized as follows. In Section 2, we review recent works related to HGR, with a particular focus on data glove-based HGR methods. Section 3 provides a detailed description of the proposed algorithm for dynamic gesture recognition. Section 4 encompasses the data collection methodology for gestures and provides implementation details of the conducted experiments. The relevant experimental results and analysis are presented in Section 5, followed by a concise summary of this paper in Section 6.
In recent years, research in the HGR field has focused on two main aspects: the type of gesture data (static or dynamic) and the sensors used for data collection (visual systems or wearable devices). This section provides an overview of relevant studies in HGR, emphasizing research involving wearable devices like data gloves.
Static hand gesture recognition research primarily focuses on analyzing the spatial features of gesture data without considering its temporal variations. This type of research is primarily applied in sign language recognition scenarios. A static hand gesture recognition system based on wavelet transform and neural networks was proposed by Karami et al. [ 13 ]. The system operated by taking hand gesture images acquired by a camera as input and extracting image features using Discrete Wavelet Transform (DWT). These features were fed into a neural network for classification. In the experimental section, 32 Persian sign language (PSL) letter symbols were selected for investigation. The training was conducted on 416 images, while testing was performed on 224 images, resulting in a test accuracy of 83.03%. Thalange et al. [ 14 ] introduced two novel feature extraction techniques, Combined Orientation Histogram and Statistical (COHST) Features and Wavelet Features, to address the recognition of static symbols representing numbers 0 to 9 in American Sign Language. Hand gesture data was collected using a 5-megapixel network camera and processed with different feature extraction methods before input into a neural network for training. The proposed approach achieved an outstanding average recognition rate of 98.17%. Moreover, a novel data glove with 14 sensor units was proposed by Wu et al. [ 15 ], who explored its performance in static hand gesture recognition. They defined 10 static hand gestures representing digits 0–9 and collected data from 10 subjects, with 50% of the data used for training and the remaining 50% for testing. By employing a neural network for classification experiments, they achieved an impressive overall recognition accuracy of 98.8%. Lee et al. [ 16 ] introduced a knitted glove capable of pattern recognition for hand poses and designed a novel CNN model for hand gesture classification experiments. The experimental results demonstrated that the proposed CNN structure effectively recognized 10 static hand gestures, with classification accuracies ranging from 79% to 97% for different gestures and an average accuracy of 89.5%. However, they only recruited 10 subjects for the experiments. Antillon et al. [ 17 ] developed an intelligent diving glove capable of recognizing 13 static hand gestures for underwater communication. They employed five classical machine learning classification algorithms and conducted training on hand gesture data from 24 subjects, with testing performed on an independent group of 10 subjects. The experimental results indicated that all classification algorithms achieved satisfactory hand gesture recognition performance in dry environments, with accuracies ranging from 95% to 98%. The performance slightly declined in underwater experimental conditions, with accuracies ranging from 81% to 94%. Yuan et al. [ 18 ] developed a wearable gesture recognition system that can simultaneously recognize ten types of numeric gestures and nine types of complex gestures. They utilized the Multilayer Perceptron (MLP) algorithm to recognize 19 static gestures with 100% accuracy, showcasing the strong capabilities of deep learning technology in the field of HGR. However, it is worth noting that the sample data in their experimental section was derived solely from four male volunteers. Moreover, a data glove based on flexible sensors was utilized by Ge et al. [ 19 ] to accurately predict the final hand gesture before the completion of the user’s hand movement in real time. 
They constructed a gesture dataset called Flex-Gesture, which consisted of 16 common gestures, each comprising 3000 six-dimensional flexion data points. Additionally, they proposed a multimodal data feature fusion approach and employed a combination of neural networks and support vector machines (SVM) as classifiers. The system achieved a remarkable prediction accuracy of 98.29% with a prediction time of only 0.2329 ms. However, the data glove-based system had a notable limitation in that it did not consider temporal information in the hand gestures. It is worth mentioning that the authors believe that incorporating deep learning algorithms with temporal feature analysis could potentially yield more effective results.
Unlike static gesture recognition, dynamic gesture recognition requires considering the spatial information of hand movements and their temporal variations. With the rapid advancement of deep learning techniques, researchers have extensively investigated structures such as Convolutional Neural Network (CNN) and Recurrent Neural Network (RNN) and applied them to real-time dynamic gesture recognition problems. Nguyen et al. [ 20 ] presented a novel approach for continuous dynamic gesture recognition using RGB video input. Their method comprises two main components: a gesture localization module and a gesture classification module. The former aims to separate gestures using a BiLSTM network to segment continuous gesture sequences. The latter aims to classify gestures and efficiently combine data from multiple channels, including RGB, optical flow, and 3D key pose positions, using two 3D CNNs and a Long Short-Term Memory (LSTM). The method was evaluated on three publicly available datasets, achieving an average Jaccard index of 0.5535. Furthermore, Paweł et al. [ 21 ] developed a system capable of rapidly and effectively recognizing hand gestures in hand-body language using a dedicated glove with ten sensors. Their experiments defined 22 hand-body language gestures and recorded 2200 gesture data samples (10 participants, each gesture action repeated 10 times). Three machine learning classifiers were employed for training and testing, resulting in a high sensitivity rate of 98.32%. The pioneering work of Emmanuel et al. [ 22 ] introduced the use of CNN for grasp classification using piezoelectric data gloves. Experimental data were collected from five participants, each performing 30 object grasps following Schlesinger’s classification method. The results demonstrated that the CNN architecture achieved the highest classification accuracy (88.27%). It is worth mentioning that the authors plan to leverage the strengths of both CNN and RNN in future work to improve gesture prediction accuracy. Lee et al. [ 23 ] developed a real-time dynamic gesture recognition data glove. They employed neural network structures such as LSTM, fully connected layers, and novel gesture localization and recognition algorithms. This allowed the successful classification of 11 dynamic finger gestures with a gesture recognition time of less than 12 ms. Yuan et al. [ 24 ] designed a data glove equipped with 3D flexible sensors and two wristbands and proposed a novel deep feature fusion network to capture fine-grained gesture information. They first fused multi-sensor data using a CNN structure with residual connections and then modeled long-range dependencies of complex gestures using LSTM. Experimental results demonstrated the effectiveness of this approach in classifying complex hand movements, achieving a maximum precision of 99.3% on the American Sign Language dataset. Wang et al. [ 25 ] combined attention mechanism with BiLSTM and designed a deep learning algorithm capable of effectively recognizing 10 types of dynamic gestures. Their proposed method achieved an accuracy of 98.3% on the test dataset, showing a 14.5% improvement compared to a standalone LSTM model. This indicates that incorporating attention mechanism can effectively enhance the model’s understanding of gesture semantics. Dong et al. [ 12 ] introduced a novel dynamic gesture recognition algorithm called DGDL-GR. 
Built upon deep learning, this algorithm combined CNN and temporal convolutional networks (TCN) to simultaneously extract temporal and spatial features of hand movements. They defined 10 gestures according to relevant standards and recruited 20 participants for testing. The experimental results demonstrated that DGDL-GR achieved the highest recognition accuracy (0.9869), surpassing state-of-the-art algorithms such as CNN and LSTM. Hu et al. [ 26 ] explored deep learning-based gesture recognition using surface electromyography (sEMG) signals and proposed a hybrid CNN and RNN structure with attention mechanism. In this framework, CNN was employed for feature extraction from sEMG signals, while RNN was utilized for modeling the temporal sequence of the signals. Experimental results on multiple publicly available datasets revealed that the performance of the hybrid CNN-RNN structure was superior to individual CNN and RNN modules.
Despite the existence of a large body of research on HGR, research on dynamic gesture recognition using data gloves is still limited, especially in exploring the feasibility of applying deep learning in this field. Therefore, this study focused on the intelligent recognition of handwashing steps in the context of medical simulation. We utilized data gloves as the medium for dynamic gesture data collection and selected the seven-step handwashing series of dynamic gestures as the research target. Specifically, we considered the characteristics of dynamic gestures, including local feature variations in spatial positions and temporal changes in sequences. We systematically combined structures such as CNN, BiLSTM, and attention mechanism and designed a deep learning algorithm for dynamic gesture recognition based on data gloves. The next section will provide a detailed introduction to the proposed algorithm framework.
3.1. Convolutional Neural Network (CNN)
A classic CNN architecture was designed by LeCun et al. in 1998 [ 27 ], which achieved remarkable performance in handwritten digit recognition tasks. Compared to traditional neural network structures, CNN exhibits characteristics of local connectivity and weight sharing [ 28 ]. Consequently, CNN can improve the learning efficiency of neural networks and effectively avoid overfitting issues caused by excessive parameters. The classic CNN architecture consists of three components: the convolutional layer, the pooling layer, and the fully connected layer.
The convolutional layer's core component is the convolutional kernel (or weight matrix). Each convolutional kernel multiplies and sums the corresponding receptive-field elements in the input data. This operation is repeated by sliding the kernel with a certain stride over the input data until the entire input has been processed for feature extraction. The resulting feature maps are typically passed through a non-linear activation function to form the output of the convolutional layer. It is worth mentioning that multiple convolutional kernels are usually used to extract more diverse features, since each kernel extracts different feature information. ReLU [29] is the most popular activation function in CNNs; it retains the segments of the input features that are greater than 0 and rectifies the remaining segments to 0.
The pooling layer, also known as the down-sampling layer, condenses the salient features of the input data using pooling kernels. Similar to convolutional kernels, each pooling kernel slides over the input data with a certain stride, preserving either the maximum or the average value of the elements within the corresponding receptive field. This process continues until the feature extraction of the entire input is completed. The pooling layer is typically placed after the convolutional layers to reduce the dimensionality of the feature maps, thereby reducing the computational complexity of the entire network.
In classification tasks, the input data undergoes feature extraction by passing through multiple convolutional and pooling layers, and the resulting feature maps are flattened and fed into the fully connected layer. The fully connected layer usually consists of a few hidden layers and a softmax classifier, which further extracts features from the data and outputs the probability distribution of each class.
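For concreteness, the following minimal Keras sketch wires these three components together in the order described (convolution with ReLU, pooling, then fully connected layers ending in softmax). The input shape, filter counts, and layer sizes are illustrative placeholders, not the settings used in this paper.

```python
import tensorflow as tf

# A minimal classic CNN: convolution -> ReLU -> pooling -> fully connected softmax.
cnn = tf.keras.Sequential([
    # Convolutional layer: 16 kernels slide over the input to produce feature maps.
    tf.keras.layers.Conv2D(16, kernel_size=3, activation="relu",
                           input_shape=(28, 28, 1)),      # e.g., a grayscale image
    tf.keras.layers.MaxPooling2D(pool_size=2),            # down-sample the feature maps
    tf.keras.layers.Flatten(),                            # flatten maps for dense layers
    tf.keras.layers.Dense(64, activation="relu"),         # hidden fully connected layer
    tf.keras.layers.Dense(10, activation="softmax"),      # class probability distribution
])
cnn.summary()
```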
The RNN is a recursively connected neural network with short-term memory that has been widely applied in the analysis and prediction of time series data [30]. However, due to limitations in memory and information storage, RNNs struggle to learn long-term dependencies in time sequences, and gradient vanishing is often encountered during training [31]. To overcome these challenges, the LSTM network structure, which exhibits long-range memory capabilities, was introduced [32]. LSTM achieves this by introducing memory cells to retain long-term historical information and employing different gate mechanisms to regulate the flow of information. In fact, gate mechanisms can be understood as a multi-level feature selection approach. Consequently, compared to the RNN, the LSTM offers more advantages in handling time series problems.
The classical LSTM unit is equipped with three gate functions that control the state of the memory cell: the forget gate f_t, the input gate i_t, and the output gate o_t. The forget gate f_t determines which information is retained from the previous cell state c_{t-1} in the current cell state c_t. The input gate i_t regulates how much information from the current input x_t is stored in the current cell state c_t. The output gate o_t governs how much information from the current cell state c_t is transmitted to the current hidden state h_t. Fig 1 illustrates the internal structure of an LSTM unit.
https://doi.org/10.1371/journal.pone.0294174.g001
The LSTM unit has three inputs at time t: the current input x_t, the previous hidden state h_{t-1}, and the previous cell state c_{t-1}. After regulation by the gate functions, two outputs are obtained: the current hidden state h_t and the current cell state c_t. Specifically, the output of f_t is obtained by linearly transforming the current input x_t and the previous hidden state h_{t-1} and then applying the sigmoid activation function. This process can be expressed by Formula 1:

f_t = σ(w_f · [h_{t-1}, x_t] + b_f)    (1)

Here, the weight matrix and bias vector of f_t are represented by w_f and b_f, respectively, and σ denotes the sigmoid activation function. The value of f_t ranges from 0 to 1: a value closer to 0 indicates that information will be discarded, while a value closer to 1 implies that more information will be preserved. The computation of the input gate i_t is similar to that of f_t:

i_t = σ(w_i · [h_{t-1}, x_t] + b_i)
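As a toy numerical illustration of Formula 1 (the dimensions and random values below are placeholders, not the paper's parameters), the forget gate is simply a sigmoid applied to a linear transform of the concatenated previous hidden state and current input:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Toy sizes: 4 input features, 3 hidden units (placeholders only).
x_t = np.random.randn(4)            # current input x_t
h_prev = np.random.randn(3)         # previous hidden state h_{t-1}
w_f = np.random.randn(3, 7)         # forget-gate weights over [h_{t-1}, x_t] (7 = 3 + 4)
b_f = np.zeros(3)                   # forget-gate bias

# Forget gate (Formula 1): sigmoid of a linear transform of [h_{t-1}, x_t].
f_t = sigmoid(w_f @ np.concatenate([h_prev, x_t]) + b_f)
print(f_t)                          # each element lies in (0, 1)
```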
LSTM addresses the issue of vanishing gradients during training by incorporating a series of gate mechanisms. However, as LSTM only propagates information in one direction, it can only learn forward features and not capture backward features. To overcome this limitation, Graves et al. introduced BiLSTM based on LSTM [ 33 ]. BiLSTM effectively combines a pair of forward and backward LSTM sequences, inheriting the advantages of LSTM while addressing the unidirectional learning problem. This integration allows BiLSTM to effectively capture contextual information in sequential data. From a temporal perspective, BiLSTM analyzes both the "past-to-future" and "future-to-past" directions of data flow, enabling better exploration of temporal features in the data and improving the utilization efficiency of the data and the predictive accuracy of the model.
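As a brief sketch (shapes and unit counts are illustrative, not the paper's settings), a BiLSTM in Keras wraps one forward and one backward LSTM and concatenates their hidden-state sequences:

```python
import tensorflow as tf

# Bidirectional wrapper: one LSTM reads the sequence "past-to-future", a second
# reads it "future-to-past"; their outputs are concatenated per time step.
bilstm = tf.keras.layers.Bidirectional(
    tf.keras.layers.LSTM(8, return_sequences=True)  # 8 units per direction (illustrative)
)
x = tf.random.normal((1, 180, 64))   # (batch, time steps, features), placeholder shape
h = bilstm(x)                        # shape: (1, 180, 16) after concatenation
```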
https://doi.org/10.1371/journal.pone.0294174.g002
Following the structure in Fig 2, the attention mechanism assigns a weight a_t to the BiLSTM hidden state h_t at each time step. Finally, the weighted sum of a_t and h_t is computed to obtain the final output enhanced by the attention mechanism.
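Since the intervening attention formulas are not reproduced above, the following is only a generic soft-attention sketch consistent with the weighted-sum description: a score is computed per time step, normalized with softmax to give the weights a_t, and used to pool the hidden states h_t.

```python
import tensorflow as tf

def soft_attention(h):
    """Soft attention over BiLSTM outputs h of shape (batch, T, d).
    Returns an attention-weighted summary of the sequence."""
    scores = tf.keras.layers.Dense(1)(h)      # unnormalized score per time step
    a = tf.nn.softmax(scores, axis=1)         # attention weights a_t, summing to 1 over T
    return tf.reduce_sum(a * h, axis=1)       # weighted sum of hidden states

h = tf.random.normal((2, 180, 16))            # placeholder BiLSTM output
context = soft_attention(h)                   # shape: (2, 16)
```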
This study aims to recognize the meaning conveyed by dynamic gesture data over time, which can be understood as a classification task for time series data. Building on the previous discussions, CNN can effectively extract local features from time series data but may fail to capture the long-range dependencies present in the data. BiLSTM can overcome this limitation by learning from both the forward and backward processes of dynamic gesture data, allowing the model to effectively capture the underlying long-term dependencies. Furthermore, incorporating an attention mechanism can enhance the model's semantic understanding of the various gestures, thereby boosting the accuracy of gesture recognition. Therefore, in this paper, we combine CNN, BiLSTM, and the attention mechanism into a novel framework for dynamic gesture recognition called the Attention-based CNN-BiLSTM Network (A-CBLN). A-CBLN effectively integrates the advantages of these different types of neural networks, thereby improving the predictive accuracy of dynamic gesture recognition. Fig 3 illustrates the pipeline of dynamic gesture recognition based on A-CBLN.
https://doi.org/10.1371/journal.pone.0294174.g003
Specifically, as shown in Fig 3, A-CBLN consists of five main components. The input layer transforms the data collected by the data glove into the model's input format T × L × 1, where T is the number of gesture data samples collected within the specified time range, L is the feature dimension of the gesture data returned by the data glove, and 1 is the number of channels. The CNN layer performs feature extraction and dimensionality reduction using two convolutional operations and one max-pooling operation. It is worth noting that rather than employing a conventional 1D convolution or a square 2D kernel, we use a 2D convolution with a kernel size of 1×3, which extracts spatial features from the gesture data at each time step without mixing information across the temporal dimension. The BiLSTM layer additionally models the long-term dependencies of the gesture features. Both the CNN layer and the BiLSTM layer use the ReLU activation function. The AM (attention mechanism) layer helps the network better capture the specific meaning of the gesture features. Finally, the FC layer flattens the features with fully connected layers, further reduces their dimensionality, and outputs the probability prediction for the current dynamic gesture through the softmax function. Table 1 presents the specific parameter settings for each network layer in A-CBLN.
https://doi.org/10.1371/journal.pone.0294174.t001
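Table 1's exact parameters are not reproduced here, so the following Keras sketch only illustrates one plausible assembly of the five components under the stated design choices (two 1×3 convolutions, one max-pooling, BiLSTM layers, soft attention, and a softmax output). The filter counts, dense sizes, and pooling width are assumptions; the BiLSTM width of 8 follows the tuning result reported in Section 5.

```python
import tensorflow as tf

T, L, NUM_CLASSES = 180, 128, 7   # input shape and class count used in this paper

inputs = tf.keras.Input(shape=(T, L, 1))
# CNN layer: two 1x3 convolutions extract per-time-step spatial features,
# followed by max-pooling over the feature axis (filter counts are assumptions).
x = tf.keras.layers.Conv2D(16, (1, 3), activation="relu")(inputs)
x = tf.keras.layers.Conv2D(32, (1, 3), activation="relu")(x)
x = tf.keras.layers.MaxPooling2D(pool_size=(1, 2))(x)
# Collapse the feature and channel axes into one vector per time step.
x = tf.keras.layers.Reshape((T, -1))(x)
# BiLSTM layers model long-term temporal dependencies (8 units per direction).
x = tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(8, return_sequences=True))(x)
x = tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(8, return_sequences=True))(x)
# AM layer: soft attention pools the sequence into one enhanced feature vector.
scores = tf.keras.layers.Dense(1)(x)
weights = tf.keras.layers.Softmax(axis=1)(scores)
x = tf.reduce_sum(weights * x, axis=1)
# FC layer: further dimensionality reduction and softmax class probabilities.
x = tf.keras.layers.Dense(32, activation="relu")(x)
outputs = tf.keras.layers.Dense(NUM_CLASSES, activation="softmax")(x)

a_cbln = tf.keras.Model(inputs, outputs)
a_cbln.summary()
```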
Algorithm 1. The pseudocode of A-CBLN.
Input: gesture dataset X, gesture labels y
Output: trained model weights w*
Parameters:
Batch size: 64
Best validation accuracy: 0
1. Load the training and validation datasets from X and y;
2. Randomly initialize the weights w;
3. Start training and validation;
4. For each epoch do:
5.   For each batch (X_train, y_train) in the training dataset do:
6.     F_1 is obtained by applying two convolution layers to X_train;
7.     F_2 is obtained by applying a max-pooling layer to F_1;
8.     F_3 is obtained by applying two BiLSTM layers to F_2;
9.     The prediction is obtained by applying the attention mechanism and the fully connected layers to F_3;
10.    Update the model weights using the categorical cross-entropy loss function with the Adam optimizer;
11.  Calculate the accuracy of the model on the validation dataset, denoted V_acc;
12.  If V_acc > best validation accuracy:
13.    Save the trained model weights w*;
14.    Update the best validation accuracy to the value of V_acc;
15. End training and validation
4.1. Data glove
The wearable gesture data acquisition device used in this study is the VRTRIX™ data glove (http://www.vrtrix.com.cn/). The core component of this glove is a 9-axis MEMS (Micro-Electro-Mechanical System) inertial sensor, which captures real-time motion data of the finger joints and enables reproduction of the hand postures assumed by the operator during motion execution. Data captured by the sensors on both hands are transmitted wirelessly to a personal computer (PC) through the wireless transmission module on the back of each hand for real-time rendering. In addition, the VRTRIX™ data glove provides a low-level Python API that gives users access to the joint pose data of the glove, facilitating secondary development. It has been widely used in fields such as industrial simulation, mechanical control, and scientific data acquisition.
Once the data glove is properly worn, the left hand carries a total of 11 inertial sensors for capturing finger gestures: each finger is assigned 2 sensors, and 1 sensor is allocated to the back of the hand. The number and distribution of sensors on the right hand are identical to those on the left hand. Table 2 presents the key parameter information of the data glove used in this study.
https://doi.org/10.1371/journal.pone.0294174.t002
This study sought to explore the application of dynamic gesture recognition in the field of medical virtual simulation based on wearable devices (data gloves). We first comprehensively reviewed the existing literature on dynamic gesture recognition. As mentioned in Section 2, most publicly available dynamic gesture datasets are based on visual systems, with only a few studies utilizing wearable devices. Therefore, using the data gloves, we created a new dynamic gesture dataset based on the seven-step handwashing commonly used in medical virtual simulation systems. We followed the handwashing method recommended by the World Health Organization (WHO) [36] and established a complete handwashing procedure comprising seven steps. More details on these steps are presented in Table 3.
https://doi.org/10.1371/journal.pone.0294174.t003
With the approval of the Medical Ethics Committee of the Affiliated Hospital of Xuzhou Medical University, 32 healthy subjects were recruited for this study. Data acquisition was organized and conducted between January 5, 2023 and March 25, 2023. Prior to gesture data collection, each subject was required to sign a consent form granting permission for their data to be used in the study and was informed of the specific steps involved in data collection. To ensure precise expression of the gesture actions while wearing the data gloves, participants who were initially unacquainted with the seven-step handwashing received training sessions conducted by healthcare professionals until all subjects could correctly perform the hand gestures while wearing the gloves. Additionally, a timekeeper was assigned to prompt the start and end of each gesture action and record the corresponding time information.
Once the subject had correctly worn the data gloves as instructed, gesture data collection proceeded according to the following detailed steps:
Fig 4 illustrates the specific flow of gesture data acquisition. The data gloves used in this study provide a Python API, which facilitated recording the gesture data with Python scripts. The data for each subject were stored in an individual folder named after that subject. Additionally, subjects were asked to repeat the gesture collection process five times to increase the dataset size. Once data collection from all subjects was completed, the data were exported for further processing and analysis.
https://doi.org/10.1371/journal.pone.0294174.g004
Specifically, the archival structure for each subject comprised a set of five folders, each containing seven dynamic gesture data files in text format. The data sampling frequency was set to 60 Hz. We used a 3 s time window to slide over and segment the 15 s of data in each sample without overlap, since the actions within 3 s already contain the specific semantics of the current gesture. In summary, the sample size used for dynamic gesture modeling and analysis in this study was 5600, with each sample having a data dimension of 180×128×1. Here, 180 is the number of gesture samples within 3 seconds, 128 is the dimension of the joint data returned by the data glove sensors, and 1 denotes the number of channels. Finally, to facilitate the training of the gesture recognition model, min-max scaling was applied to rescale the data intensity of all samples to the range [0, 1] using Formula 13:

f_norm = (f - f_min) / (f_max - f_min)    (13)

Here, f represents the input data, f_norm refers to the normalized data, and f_min and f_max represent the minimum and maximum values of the input data, respectively.
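A minimal numpy sketch of this preprocessing is given below. It assumes, as one plausible reading, that min-max scaling is applied per recording; the paper does not state the scaling scope explicitly, and the recording array is a random placeholder.

```python
import numpy as np

FS, WIN_S, REC_S = 60, 3, 15                  # sampling rate (Hz), window (s), recording (s)
WIN = FS * WIN_S                              # 180 samples per 3 s window

def segment_and_scale(recording):
    """Split one (900, 128) recording into five non-overlapping (180, 128, 1)
    windows after min-max scaling to [0, 1] (Formula 13)."""
    f_min, f_max = recording.min(), recording.max()
    norm = (recording - f_min) / (f_max - f_min)          # min-max scaling
    n_windows = REC_S // WIN_S                            # 5 windows per 15 s recording
    windows = norm[: n_windows * WIN].reshape(n_windows, WIN, -1)
    return windows[..., np.newaxis]                       # add the channel axis

rec = np.random.rand(FS * REC_S, 128)                     # placeholder glove recording
x = segment_and_scale(rec)                                # shape: (5, 180, 128, 1)
```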
To evaluate the performance of the proposed gesture recognition model, we divided the data into training, validation, and test datasets in an approximate ratio of 8:1:1. Accordingly, the data from 26 subjects were used for training, while the remaining 6 subjects were evenly split between the validation and test datasets (3 subjects each).
To validate the effectiveness of the proposed dynamic gesture recognition algorithm, we selected three deep learning algorithms related to gesture recognition research for comparison: a standalone LSTM, an Attention-BiLSTM, and a hybrid CNN-LSTM.
All the experimental code in this study was written in Python (version 3.8). The deep learning algorithms were implemented using the TensorFlow framework (version 2.9.0). To ensure a fair comparison of the performance of each deep learning algorithm, we used the same training parameters throughout: 50 training epochs and a batch size of 64. Since this paper addresses a typical multi-class classification task, we employed the cross-entropy loss to measure the error between the model's predictions and the true labels. We used the Adam optimizer [37] to update the model parameters, with the initial learning rate set to 0.001, beta1 set to 0.9, and beta2 set to 0.999. After each training epoch, validation was performed, and the model with the lowest validation loss was saved for subsequent testing.
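A minimal sketch of this training configuration in TensorFlow/Keras is shown below. It reuses the hypothetical a_cbln model sketched in Section 3; the data arrays are random placeholders standing in for the glove recordings, and checkpointing on the lowest validation loss follows the procedure described above.

```python
import numpy as np
import tensorflow as tf

# Placeholder data with the paper's shapes; real data come from the glove recordings.
x_train = np.random.rand(64, 180, 128, 1).astype("float32")
y_train = tf.keras.utils.to_categorical(np.random.randint(0, 7, 64), 7)
x_val = np.random.rand(16, 180, 128, 1).astype("float32")
y_val = tf.keras.utils.to_categorical(np.random.randint(0, 7, 16), 7)

model = a_cbln    # assumption: the A-CBLN sketch from Section 3 is defined
model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=0.001, beta_1=0.9, beta_2=0.999),
    loss="categorical_crossentropy",          # multi-class cross-entropy loss
    metrics=["accuracy"],
)
# Keep only the weights of the model with the lowest validation loss.
checkpoint = tf.keras.callbacks.ModelCheckpoint(
    "best_model.h5", monitor="val_loss", save_best_only=True
)
model.fit(x_train, y_train, epochs=50, batch_size=64,
          validation_data=(x_val, y_val), callbacks=[checkpoint])
```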
The classification performance was evaluated with accuracy, precision, recall, and F1-score, computed from the confusion matrix counts:

Accuracy = (TP + TN) / (TP + TN + FP + FN)
Precision = TP / (TP + FP)
Recall = TP / (TP + FN)
F1-score = 2 × Precision × Recall / (Precision + Recall)

Here, TP (true positives) is the number of samples correctly predicted as positive, TN (true negatives) the number of samples correctly predicted as negative, FP (false positives) the number of samples that are actually negative but predicted as positive, and FN (false negatives) the number of samples that are actually positive but predicted as negative.
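For reference, these metrics can be computed with scikit-learn as sketched below. Macro averaging over the seven classes is an assumption here, since the paper does not state its averaging scheme, and the label lists are toy placeholders.

```python
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

# Toy labels for the seven handwashing steps (0-6); macro averaging is assumed.
y_true = [0, 1, 2, 3, 4, 5, 6, 2, 3]
y_pred = [0, 1, 2, 2, 4, 5, 6, 2, 4]

acc = accuracy_score(y_true, y_pred)
prec, rec, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average="macro", zero_division=0)
print(f"accuracy={acc:.4f} precision={prec:.4f} recall={rec:.4f} f1={f1:.4f}")
```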
This section presents and analyzes the effectiveness of all models for dynamic gesture recognition from multiple perspectives. It includes a comparative analysis of the learning capabilities of the different models and their predictive performance on the test dataset. Additionally, we conducted experiments to examine the impact of key parameters in A-CBLN, including the kernel size of the convolutional layers and the number of neurons in the BiLSTM layer; the findings provide valuable insights into the optimal configuration of A-CBLN for enhanced gesture recognition performance. Finally, we further analyzed and discussed the confusion matrix produced by A-CBLN on the test dataset.
We first analyzed the learning progress of the models during training. Fig 5 shows that, as the number of training epochs increases, the validation accuracy gradually improves and stabilizes for all models. This indicates that all models possess a certain learning capability and that no overfitting occurs during training. Further analysis reveals that the standalone LSTM structure exhibits the lowest learning capability, reaching its highest validation accuracy of 88.95% at 50 epochs. This may be because the pure LSTM structure fails to focus on the local features within the dynamic handwashing steps; for instance, actions such as rubbing or rotating are of utmost importance for understanding the semantic meaning conveyed by the gestures. In contrast, the best validation accuracy of the Attention-BiLSTM structure improves, peaking at 45 epochs (92.77%); nevertheless, its training progress is unstable, a limitation also attributable to its restricted ability to capture local features. By combining CNN and LSTM, the model can perceive not only the local features of dynamic gestures in spatial changes but also their temporal variations; as a result, recognition ability improves significantly, with the CNN-LSTM model achieving an accuracy of 93.71% at 48 epochs. Finally, our proposed A-CBLN adds the attention mechanism, further enhancing the model's understanding of different gesture semantics. Consequently, it exhibits the strongest learning capability during training: its validation accuracy stabilizes and consistently outperforms the other models after 18 epochs, peaking at 32 epochs (93.62%).
https://doi.org/10.1371/journal.pone.0294174.g005
The model with the best performance on the validation dataset was preserved for further analysis of its performance on the test dataset. As shown in Table 4, all models perform well on the test dataset, with prediction accuracy exceeding 87%. Further observation reveals that the pure LSTM and Attention-BiLSTM models have relatively lower prediction accuracy (87.43% and 91.43%, respectively), while the hybrid CNN-LSTM structure improves the prediction accuracy substantially, to 93.38%. This is consistent with our previous analysis, indicating that the hybrid CNN-LSTM structure possesses stronger feature extraction capability for dynamic gesture data. Finally, our proposed A-CBLN model demonstrates the best predictive performance for dynamic gestures, achieving the optimal values in all evaluation metrics, with an accuracy of 95.05%, precision of 95.43%, recall of 95.25%, and F1-score of 95.22%. Compared to the pure LSTM structure, it improves accuracy, precision, recall, and F1-score by 7.62, 5.84, 7.32, and 7.78 percentage points, respectively.
https://doi.org/10.1371/journal.pone.0294174.t004
The choice of kernel size in the convolutional layers determines the receptive field used to extract local features, so selecting an appropriate kernel size is crucial for model performance. We conducted a comparative analysis of the impact of four kernel sizes (1×2, 1×3, 1×5, and 1×7) on the recognition performance of the A-CBLN algorithm. Fig 6 shows that the recognition performance initially improves and then declines as the kernel size increases. Closer observation indicates that large convolutional kernels can decrease the overall recognition performance of the model: while they enlarge the receptive field, they also extract redundant features. When the kernel size is set to 1×3, the A-CBLN algorithm achieves the best performance in terms of accuracy, precision, recall, and F1-score, with the corresponding metrics peaking at 93.94%, 94.60%, 94.02%, and 93.98%, respectively.
https://doi.org/10.1371/journal.pone.0294174.g006
The number of neurons in the BiLSTM layer also influences the recognition performance of the A-CBLN algorithm. In this section, we discussed four different neuron quantities: 2, 4, 8, and 16. As shown in Fig 7 , the recognition performance of the A-CBLN algorithm initially improves and then declines with an increase in the number of neurons in the BiLSTM layer. When the neuron quantity is set to 8, the A-CBLN algorithm achieves the optimal performance in terms of accuracy, precision, recall, and F1-score. The corresponding performance metrics reach their peak values of 92.56%, 93.63%, 92.64%, and 92.54%, respectively.
https://doi.org/10.1371/journal.pone.0294174.g007
Finally, we separately discussed and analyzed the prediction results of the A-CBLN algorithm on the test dataset. As shown in Fig 8, the values on the main diagonal of the confusion matrix represent the percentage of correctly predicted samples in each gesture category, while the remaining positions indicate cases where the model incorrectly predicts a given gesture as another category. Further observation shows that A-CBLN achieves recognition accuracy higher than 85% for all seven handwashing steps. Specifically, the model achieves perfect recognition for the gestures in steps 1, 5, and 7, as these gestures exhibit distinct spatial features. However, the recognition performance for the step 3 actions is poorer, with approximately 15% of the samples incorrectly classified as step 2. This may be attributed to the similarity between the hand gestures in these two steps, which involve actions such as "finger crossing" and "mutual friction" that the two convolutional layers in A-CBLN may struggle to differentiate. Additionally, there are some recognition errors for the handwashing actions in steps 4 and 6, likely due to the presence of similar actions such as "finger bending" and "rotational friction", leading to misjudgment by the model. Overall, A-CBLN demonstrates good recognition performance across the seven dynamic gestures, with an average accuracy exceeding 95%.
https://doi.org/10.1371/journal.pone.0294174.g008
This paper investigated the problem of dynamic gesture recognition based on data gloves. Building on deep learning techniques, we proposed a dynamic gesture recognition algorithm called A-CBLN, which combines CNN, BiLSTM, and an attention mechanism to capture the spatiotemporal features of dynamic gestures to the greatest extent possible. We selected the seven-step handwashing method commonly used in the medical simulation domain as the research subject and validated the performance of the proposed model in recognizing the seven dynamic gestures. The experimental results demonstrated that our approach effectively addresses the task of dynamic gesture recognition and achieves superior prediction results compared to similar models, with an accuracy of 95.05%, precision of 95.43%, recall of 95.25%, and F1-score of 95.22% on the test dataset. In the future, we plan to improve our approach in the following respects: (1) design more efficient feature extraction modules to enhance the discriminability of gestures with similar action sequences; (2) recruit more subjects to increase the dataset size and improve the model's generalization ability; (3) explore the fusion of multimodal data captured by infrared cameras to enhance the recognition performance of the model.
Wang, P., Li, W., Gao, Z., et al.: Depth pooling based large-scale 3-D action recognition with convolutional neural networks. IEEE Trans. Multimed. 20 (5), 1051–1061 (2018)
Wang, P., Li, W., Wan, J., et al.: Cooperative training of deep aggregation networks for RGB-D action recognition. In: Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence, pp. 1–8 (2018)
Yang, R., Sarkar, S., Loeding, B.: Enhanced level building algorithm for the movement epenthesis problem in sign language recognition. In: 2007 IEEE Conference on Computer Vision and Pattern Recognition, pp. 1–8 (2007)
Yang, R., Sarkar, S., Loeding, B.: Handling movement epenthesis and hand segmentation ambiguities in continuous sign language recognition using nested dynamic programming. IEEE Trans. Pattern Anal. Mach. Intell. 32 (3), 462–477 (2010)
Yuan, Q., Geo, W., Yao, H., et al.: Recognition of strong and weak connection models in continuous sign language. In: 2002 International Conference on Pattern Recognition, pp. 75–78 (2002)
Zhang, L., Zhu, G., Shen, P., et al.: Learning spatiotemporal features using 3DCNN and convolutional LSTM for gesture recognition. In: 2017 IEEE International Conference on Computer Vision Workshops, pp. 3120–3128 (2017)
Zhu, G., Zhang, L., Mei, L., et al.: Large-scale isolated gesture recognition using pyramidal 3D convolutional networks. In: 2016 23rd International Conference on Pattern Recognition, pp. 19–24 (2016)
Zhu, G., Zhang, L., Shen, P., et al.: Continuous gesture segmentation and recognition using 3DCNN and convolutional LSTM. IEEE Trans. Multimed. 21 (4), 1011–1021 (2019)
Download references
This year, from more than 11,500 paper submissions, the CVPR 2024 Awards Committee selected the following 10 winners for the honor of Best Papers during the Awards Program at CVPR 2024, taking place now through 21 June at the Seattle Convention Center in Seattle, Wash., U.S.A.
Best Papers
Honorable mention papers included “EventPS: Real-Time Photometric Stereo Using an Event Camera” and “pixelSplat: 3D Gaussian Splats from Image Pairs for Scalable Generalizable 3D Reconstruction.”
Best Student Papers
There also were four honorable mentions in this category this year: “SpiderMatch: 3D Shape Matching with Global Optimality and Geometric Consistency”; “Image Processing GNN: Breaking Rigidity in Super-Resolution”; “Objects as Volumes: A Stochastic Geometry View of Opaque Solids”; and “Comparing the Decision-Making Mechanisms by Transformers and CNNs via Explanation Methods.”
“We are honored to recognize the CVPR 2024 Best Paper Awards winners,” said David Crandall, Professor of Computer Science at Indiana University, Bloomington, Ind., U.S.A., and CVPR 2024 Program Co-Chair. “The 10 papers selected this year – double the number awarded in 2023 – are a testament to the continued growth of CVPR and the field, and to all of the advances that await.”
Additionally, the IEEE Computer Society (CS), a CVPR organizing sponsor, announced the Technical Community on Pattern Analysis and Machine Intelligence (TCPAMI) Awards at this year’s conference. The following were recognized for their achievements:
“The TCPAMI Awards demonstrate the lasting impact and influence of CVPR research and researchers,” said Walter J. Scheirer, University of Notre Dame, Notre Dame, Ind., U.S.A., and CVPR 2024 General Chair. “The contributions of these leaders have helped to shape and drive forward continued advancements in the field. We are proud to recognize these achievements and congratulate them on their success.”
About CVPR 2024
The Computer Vision and Pattern Recognition Conference (CVPR) is the preeminent computer vision event for new research in support of artificial intelligence (AI), machine learning (ML), augmented, virtual and mixed reality (AR/VR/MR), deep learning, and much more. Sponsored by the IEEE Computer Society (CS) and the Computer Vision Foundation (CVF), CVPR delivers the important advances in all areas of computer vision and pattern recognition and the various fields and industries they impact. With a first-in-class technical program, including tutorials and workshops, a leading-edge expo, and robust networking opportunities, CVPR, which is annually attended by more than 10,000 scientists and engineers, creates a one-of-a-kind opportunity for networking, recruiting, inspiration, and motivation.
CVPR 2024 takes place 17-21 June at the Seattle Convention Center in Seattle, Wash., U.S.A., and participants may also access sessions virtually. For more information about CVPR 2024, visit cvpr.thecvf.com .
About the Computer Vision Foundation
The Computer Vision Foundation (CVF) is a non-profit organization whose purpose is to foster and support research on all aspects of computer vision. Together with the IEEE Computer Society, it co-sponsors the two largest computer vision conferences, CVPR and the International Conference on Computer Vision (ICCV). Visit thecvf.com for more information.
About the IEEE Computer Society
Engaging computer engineers, scientists, academia, and industry professionals from all areas and levels of computing, the IEEE Computer Society (CS) serves as the world’s largest and most established professional organization of its type. IEEE CS sets the standard for the education and engagement that fuels continued global technological advancement. Through conferences, publications, and programs that inspire dialogue, debate, and collaboration, IEEE CS empowers, shapes, and guides the future of not only its 375,000+ community members, but the greater industry, enabling new opportunities to better serve our world. Visit computer.org for more information.
As large language models (LLMs) appear to behave increasingly human-like in text-based interactions, more and more researchers become interested in investigating personality in LLMs. However, the diversity of psychological personality research and the rapid development of LLMs have led to a broad yet fragmented landscape of studies in this interdisciplinary field. Extensive studies across different research focuses, different personality psychometrics, and different LLMs make it challenging to have a holistic overview and further pose difficulties in applying findings to real-world applications. In this paper, we present a comprehensive review by categorizing current studies into three research problems: self-assessment, exhibition, and recognition, based on the intrinsic characteristics and external manifestations of personality in LLMs. For each problem, we provide a thorough analysis and conduct in-depth comparisons of their corresponding solutions. Besides, we summarize research findings and open challenges from current studies and further discuss their underlying causes. We also collect extensive publicly available resources to facilitate interested researchers and developers. Lastly, we discuss the potential future research directions and application scenarios. Our paper is the first comprehensive survey of up-to-date literature on personality in LLMs. By presenting a clear taxonomy, in-depth analysis, promising future directions, and extensive resource collections, we aim to provide a better understanding and facilitate further advancements in this emerging field.
Title: Burst Image Super-Resolution with Base Frame Selection
Abstract: Burst image super-resolution has been a topic of active research in recent years due to its ability to obtain a high-resolution image by using complementary information between multiple frames in the burst. In this work, we explore using burst shots with non-uniform exposures to confront real-world practical scenarios by introducing a new benchmark dataset, dubbed Non-uniformly Exposed Burst Image (NEBI), that includes burst frames at varying exposure times to obtain a broader range of irradiance and motion characteristics within a scene. As burst shots with non-uniform exposures exhibit varying levels of degradation, fusing information from the burst shots into the first frame as a base frame may not result in optimal image quality. To address this limitation, we propose a Frame Selection Network (FSN) for non-uniform scenarios. This network seamlessly integrates into existing super-resolution methods in a plug-and-play manner with low computational cost. The comparative analysis reveals the effectiveness of the non-uniform setting for the practical scenario and of our FSN on synthetic and real NEBI datasets.
Comments: CVPR2024W NTIRE accepted
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Cite as: [cs.CV]
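The base-frame selection idea lends itself to a compact illustration. The sketch below scores each frame of a burst with a small network and reorders the burst so the best-scoring frame leads, which is one way a plug-and-play selector could sit in front of an unmodified burst-SR model. Everything here (FrameScorer, select_base_frame, the layer sizes) is hypothetical; the abstract does not describe the actual FSN architecture.

```python
# Minimal sketch of base-frame selection for burst super-resolution.
# Hypothetical illustration only: names and architecture are invented,
# not taken from the paper. An untrained scorer gives meaningless scores;
# in practice it would be trained jointly with the SR model.
import torch
import torch.nn as nn

class FrameScorer(nn.Module):
    """Assigns a scalar quality score to each frame of a burst."""
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.head = nn.Linear(32, 1)

    def forward(self, burst):                 # burst: (N, 3, H, W)
        f = self.features(burst).flatten(1)   # (N, 32)
        return self.head(f).squeeze(1)        # (N,) one score per frame

def select_base_frame(burst, scorer):
    """Reorders the burst so the highest-scoring frame comes first,
    letting an unmodified burst-SR model treat it as the base frame."""
    scores = scorer(burst)
    base = int(scores.argmax())
    order = [base] + [i for i in range(burst.shape[0]) if i != base]
    return burst[order]

# reordered = select_base_frame(torch.randn(8, 3, 64, 64), FrameScorer())
```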
This paper reviewed sign language research on vision-based hand gesture recognition systems from 2014 to 2020. Its objective is to identify the progress made and what needs more attention. We extracted a total of 98 articles from well-known online databases using selected keywords. The review shows that vision-based hand gesture recognition is an active field of research ...
The paper will discuss the gesture acquisition methods, the feature extraction process, the classification of hand gestures, the applications that were recently proposed, the challenges that face researchers in the hand gesture recognition process, and the future of hand gesture recognition. ... The research work in hand gesture recognition has ...
Currently, gesture recognition is treated as a problem of feature extraction and pattern recognition, in which a movement is labeled as belonging to a given class. A gesture recognition system's response could solve different problems in various fields, such as medicine, robotics, sign language, human-computer interfaces, virtual reality, augmented reality, and security. In this context, this ...
Hand gestures, one of the essential ways for humans to convey information and express intention, offer a significant degree of differentiation, substantial flexibility, and highly robust information transmission, making hand gesture recognition (HGR) one of the research hotspots in the fields of human-human and human-computer or human-machine interaction.
124 papers with code • 13 benchmarks • 14 datasets. Gesture Recognition is an active field of research with applications such as automatic recognition of sign language, interaction of humans and robots or for new ways of controlling video games. Source: Gesture Recognition in RGB Videos Using Human Body Keypoints and Dynamic Time Warping.
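As a concrete illustration of the keypoints-plus-dynamic-time-warping approach named in the snippet above, the sketch below aligns two variable-length sequences of 2D body keypoints with textbook DTW and classifies a query gesture by nearest template. This is generic DTW, not the cited system's implementation.

```python
import numpy as np

def dtw_distance(seq_a, seq_b):
    """Classic DTW between two keypoint sequences.
    seq_a: (Ta, K, 2), seq_b: (Tb, K, 2) arrays of K 2D keypoints per frame."""
    ta, tb = len(seq_a), len(seq_b)
    cost = np.full((ta + 1, tb + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, ta + 1):
        for j in range(1, tb + 1):
            # Frame-to-frame distance: mean Euclidean distance over keypoints.
            d = np.linalg.norm(seq_a[i - 1] - seq_b[j - 1], axis=-1).mean()
            cost[i, j] = d + min(cost[i - 1, j],      # insertion
                                 cost[i, j - 1],      # deletion
                                 cost[i - 1, j - 1])  # match
    return cost[ta, tb]

def classify(query, templates):
    """Nearest-neighbor gesture classification by DTW distance.
    templates: list of (label, sequence) pairs."""
    return min(templates, key=lambda t: dtw_distance(query, t[1]))[0]
```

A k-nearest-neighbor vote over the DTW distances, rather than the single nearest template, is a common robustness refinement.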
Hand gesture recognition is one of the most widely explored areas under the human-computer interaction domain. Although various modalities of hand gesture recognition have been explored in the ...
Gesture recognition methods can be divided into two categories: static gesture recognition and dynamic gesture recognition. Static gesture recognition methods have significant ...
The fundamental objective of gesture recognition research is to develop a technology capable of recognizing distinct human gestures and utilizing them to communicate information or control devices. As a result, it incorporates monitoring hand movement and translating such motion into crucial instructions. ... In this paper, we worked on 5 one ...
45 papers with code • 18 benchmarks • 14 datasets. Hand gesture recognition (HGR) is a subarea of computer vision where the focus is on classifying a video (dynamic gesture) or an image (static gesture, also generally called a pose) containing a hand gesture. HGR can also be performed with point cloud or joint ...
Gesture recognition using machine-learning methods is valuable in the development of advanced cybernetics, robotics and healthcare systems, and typically relies on images or videos. To improve ...
HAND GESTURE RECOGNITION: A LITERATURE REVIEW. Rafiqul Zaman Khan and Noor Adnan Ibraheem, Department of Computer Science, A.M.U., Aligarh, India ...
However, many research papers deal with enhancing frameworks for hand gesture recognition or developing new algorithms rather than executing a practical application with regard to health care. The biggest challenge encountered by researchers is designing a robust framework that overcomes the most common issues with fewer limitations and ...
Abstract. This paper introduces a real-time system for recognizing hand gestures using Python and OpenCV, centred on a Convolutional Neural Network (CNN) model. The primary objective of this study ...
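The Python/OpenCV-plus-CNN pipeline described above typically reduces to a capture-preprocess-predict loop. The sketch below shows such a loop; the model itself is left as a stand-in (the model file, input size, and LABELS list are assumptions for illustration, not details from the paper).

```python
import cv2
import numpy as np

# Hypothetical stand-ins: any trained CNN taking a (1, 64, 64, 1) batch works.
# model = tf.keras.models.load_model("hand_gesture_cnn.h5")  # assumed file
LABELS = ["fist", "palm", "ok", "peace"]  # illustrative label set

def preprocess(frame, roi=(50, 50, 250, 250)):
    """Crop a fixed region of interest, grayscale, resize, and normalize."""
    x0, y0, x1, y1 = roi
    patch = cv2.cvtColor(frame[y0:y1, x0:x1], cv2.COLOR_BGR2GRAY)
    patch = cv2.resize(patch, (64, 64)).astype(np.float32) / 255.0
    return patch[None, :, :, None]          # shape (1, 64, 64, 1)

cap = cv2.VideoCapture(0)
while True:
    ok, frame = cap.read()
    if not ok:
        break
    batch = preprocess(frame)
    # probs = model.predict(batch, verbose=0)[0]
    # label = LABELS[int(probs.argmax())]
    # cv2.putText(frame, label, (50, 40), cv2.FONT_HERSHEY_SIMPLEX,
    #             1, (0, 255, 0), 2)
    cv2.imshow("gesture", frame)
    if cv2.waitKey(1) & 0xFF == ord("q"):
        break
cap.release()
cv2.destroyAllWindows()
```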
Static gesture recognition employs a gesture image acquired at a specific point in time, with the recognition result based on location, shape, and texture (Yuanyuan et al., 2021). Dynamic gestures, however, refer to the variation of hand movement over a period of time (De Smedt et al., 2016, Lupinetti et al., 2020, Shi et al., 2021). Thus, for ...
As a novel form of human-machine interaction (HMI), hand gesture recognition (HGR) has garnered extensive attention and research. The majority of HGR studies are based on visual systems, inevitably encountering challenges such as depth and occlusion. In contrast, data gloves can facilitate data collection with minimal interference in complex environments, thus becoming a research focus in ...
Human gesture recognition, one of the most challenging problems in computer vision, strives to analyze human gestures by machine. However, most of the literature on gesture recognition utilizes isolated data, with only one gesture per image or video, for classifying gestures. This work targets the identification of human gestures from a continuous stream of input data taken from a live ...
However, to focus the scope of the study, 465 papers have been excluded. Only the hand gesture recognition works most relevant to the research questions, and the well-organized papers, have been ...
A Review on Vision-Based Hand Gesture Recognition and Applications, ResearchGate, pp. 261–286; Tao Liu, Wen-gang Zhou, and Houqiang Li. 2016. ... Gesture recognition using data glove: an extreme learning machine method. In International Conference on Robotics and Biomimetics (ROBIO). ... Many research papers have been ...
We present an on-device real-time hand gesture recognition (HGR) system, which detects a set of predefined static gestures from a single RGB camera. The system consists of two parts: a hand skeleton tracker and a gesture classifier. We use MediaPipe Hands as the basis of the hand skeleton tracker, improve the keypoint accuracy, and add the estimation of 3D keypoints in a world metric space. We ...
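The two-stage design described above (a hand skeleton tracker feeding a pose classifier) can be sketched with the public MediaPipe Hands Python API. The feature encoding and the classifier hook below are deliberately naive placeholders, not the paper's improved tracker or its classifier.

```python
import cv2
import mediapipe as mp
import numpy as np

mp_hands = mp.solutions.hands

def landmarks_to_vector(hand_landmarks):
    """Flatten 21 (x, y, z) keypoints, translated so the wrist is the origin."""
    pts = np.array([(lm.x, lm.y, lm.z) for lm in hand_landmarks.landmark])
    return (pts - pts[0]).flatten()          # wrist-relative, 63-dim

cap = cv2.VideoCapture(0)
with mp_hands.Hands(max_num_hands=1, min_detection_confidence=0.7) as hands:
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        # MediaPipe expects RGB input.
        results = hands.process(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
        if results.multi_hand_landmarks:
            vec = landmarks_to_vector(results.multi_hand_landmarks[0])
            # Feed `vec` to any trained static-pose classifier (SVM, MLP, k-NN).
        cv2.imshow("hands", frame)
        if cv2.waitKey(1) & 0xFF == ord("q"):
            break
cap.release()
```

Wrist-relative coordinates give a cheap form of translation invariance; scale normalization (dividing by a reference bone length) is the usual next step.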
Many research papers have proposed recognition of sign language for deaf-mute people, using a glove-attached sensor worn on the hand that gives responses according to hand movement. Alternatively, it may involve uncovered hand interaction with the camera, using computer vision techniques to identify the gesture.
Researchers have recently focused their attention on vision-based hand gesture recognition. However, due to several constraints, achieving an effective vision-driven hand gesture recognition system in real time has remained a challenge. This paper aims to uncover the limitations faced in image acquisition through the use of cameras, image segmentation and tracking, feature extraction, and ...
In this paper we present a literature survey on Hand Gesture Recognition (HGR). Data acquisition methods such as cameras, wrist sensors, and hand gloves have reached maturity and are now of less concern; the greater emphasis is on feature extraction from the available data and on the algorithms used to improve it. These processes have also been tested, and in recent papers ...
With the rapid development of computer vision, the demand for interaction between humans and machines is becoming more and more extensive. Since hand gestures can express rich information, hand gesture recognition is widely used in robot control, intelligent furniture, and other applications. The paper realizes the segmentation of hand gestures by establishing a skin color model and ...
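A common way to realize a skin color model like the one referenced above is fixed-range thresholding in the YCbCr space, which decouples luminance from chrominance. The bounds below are widely used defaults rather than the cited paper's calibration, and real systems tune them per camera and lighting setup.

```python
import cv2
import numpy as np

def skin_mask(frame_bgr):
    """Segment skin-colored pixels via fixed Cr/Cb chrominance thresholds.
    The bounds are common defaults, not values from the cited paper."""
    ycrcb = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2YCrCb)
    lower = np.array([0, 133, 77], dtype=np.uint8)     # Y, Cr, Cb minimums
    upper = np.array([255, 173, 127], dtype=np.uint8)  # Y, Cr, Cb maximums
    mask = cv2.inRange(ycrcb, lower, upper)
    # Morphological open/close to drop speckle noise and fill small holes.
    kernel = cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (5, 5))
    mask = cv2.morphologyEx(mask, cv2.MORPH_OPEN, kernel)
    mask = cv2.morphologyEx(mask, cv2.MORPH_CLOSE, kernel)
    return mask
```

Ignoring the Y channel is what makes this tolerant of moderate illumination change; it fails when the background contains skin-toned regions, which is exactly the limitation discussed throughout this literature.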
SID Symposium Digest of Technical Papers is an information display journal publishing short papers and poster session content from SID's annual symposium, Display Week. In recent years, gesture recognition technology has been increasingly used in the field of virtual reality. ... The Research on Virtual Reality Field Based on Gesture Recognition.
arXiv:2406.19217 (cs.CV) ... which utilizes transformer and attention architectures for gesture prompting, while the second, a Multi-Scale Temporal Reasoning module, employs a multi-stage temporal convolutional network with both slow and fast paths for temporal information extraction ...
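The slow/fast two-path idea can be sketched as two parallel 1D temporal convolutions over a per-frame feature sequence: one at full temporal resolution with a small receptive field, one dilated to cover a wider temporal window. This is an interpretation of the one-line description above, not the authors' actual module.

```python
import torch
import torch.nn as nn

class TwoPathTemporal(nn.Module):
    """Sketch of slow/fast temporal reasoning over per-frame features.
    Input: (B, C, T) feature sequence; output: fused (B, C, T)."""
    def __init__(self, channels=256):
        super().__init__()
        # Fast path: full frame rate, small receptive field.
        self.fast = nn.Conv1d(channels, channels, kernel_size=3, padding=1)
        # Slow path: dilation widens the temporal window at the same length.
        self.slow = nn.Conv1d(channels, channels, kernel_size=3,
                              padding=4, dilation=4)
        self.fuse = nn.Conv1d(2 * channels, channels, kernel_size=1)

    def forward(self, x):
        f = torch.relu(self.fast(x))
        s = torch.relu(self.slow(x))
        return self.fuse(torch.cat([f, s], dim=1))

# feats = torch.randn(2, 256, 100)   # batch of 100-frame feature sequences
# out = TwoPathTemporal()(feats)     # (2, 256, 100)
```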
Gesture recognition, having multitudinous applications in the real world, is one of the core areas of research in the field of human-computer interaction. In this paper, we propose a novel method for isolated and continuous hand gesture recognition utilizing movement epenthesis detection and removal. For this purpose, the present work detects and removes the movement epenthesis frames from ...
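As a minimal illustration of detecting and dropping transition frames, the sketch below uses dense optical flow as a motion cue and flags frames whose mean flow magnitude crosses a threshold as candidate movement epenthesis. This is a deliberately crude heuristic, not the method proposed in the paper; the threshold and the magnitude criterion are assumptions.

```python
import cv2
import numpy as np

def epenthesis_mask(frames, thresh=2.0):
    """Crude movement-epenthesis detector: flags frames whose mean dense
    optical-flow magnitude exceeds `thresh` as candidate transition frames.
    frames: list of BGR images; returns a boolean list (True = epenthesis).
    Threshold and heuristic are illustrative only."""
    gray = [cv2.cvtColor(f, cv2.COLOR_BGR2GRAY) for f in frames]
    mask = [False]                       # first frame has no predecessor
    for prev, curr in zip(gray, gray[1:]):
        flow = cv2.calcOpticalFlowFarneback(prev, curr, None,
                                            0.5, 3, 15, 3, 5, 1.2, 0)
        mag = np.linalg.norm(flow, axis=2).mean()
        mask.append(mag > thresh)
    return mask

def remove_epenthesis(frames):
    """Keeps only frames judged to belong to meaningful gestures."""
    return [f for f, is_me in zip(frames, epenthesis_mask(frames))
            if not is_me]
```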