Hand Gesture Recognition

45 papers with code • 18 benchmarks • 14 datasets

Hand gesture recognition (HGR) is a subarea of computer vision focused on classifying a video (for dynamic gestures) or an image (for static gestures) containing a hand gesture. In the static case, gestures are also generally called poses. HGR can also be performed on point cloud or hand joint data.

Benchmarks

Best-performing models reported across the benchmarks include e2eET, De+Recouple, ResNeXt-101, 3DCNN_VIVA_4, 8-MFFs-3f1c (5 crop), MTUT, Prototypical Networks + CNN, DRX3D, Key Frames + Feature Fusion, DenseNet, F-BLSTM and F-BGRU.


Most implemented papers

Real-time hand gesture detection and classification using convolutional neural networks.


We evaluate our architecture on two publicly available datasets - EgoGesture and NVIDIA Dynamic Hand Gesture Datasets - which require temporal detection and classification of the performed hand gestures.

Make Skeleton-based Action Recognition Model Smaller, Faster and Better

Although skeleton-based action recognition has achieved great success in recent years, most of the existing methods may suffer from a large model size and slow execution speed.

HGR-Net: A Fusion Network for Hand Gesture Segmentation and Recognition

We propose a two-stage convolutional neural network (CNN) architecture for robust recognition of hand gestures, called HGR-Net, where the first stage performs accurate semantic segmentation to determine hand regions, and the second stage identifies the gesture.

Word-level Deep Sign Language Recognition from Video: A New Large-scale Dataset and Methods Comparison

Based on this new large-scale dataset, we are able to experiment with several deep learning methods for word-level sign recognition and evaluate their performances in large scale scenarios.

Human Computer Interaction Using Marker Based Hand Gesture Recognition

siam1251/HandGestureRecognition • 23 Jun 2016

Human Computer Interaction (HCI) has been redefined in this era.

First-Person Hand Action Benchmark with RGB-D Videos and 3D Hand Pose Annotations

guiggh/hand_pose_action • CVPR 2018

Our dataset and experiments can be of interest to communities of 3D hand pose estimation, 6D object pose, and robotics as well as action recognition.

Deep Fisher Discriminant Learning for Mobile Hand Gesture Recognition

chriswegmann/drone_steering • 12 Jul 2017

Gesture recognition is a challenging problem in the field of biometrics.

A Study of Convolutional Architectures for Handshape Recognition applied to Sign Language

Using the LSA16 and RWTH-PHOENIX-Weather handshape datasets, we performed experiments with the LeNet, VGG16, ResNet-34 and All Convolutional architectures, as well as Inception with normal training and via transfer learning, and compared them to the state of the art in these datasets.

Motion Fused Frames: Data Level Fusion Strategy for Hand Gesture Recognition

Acquiring spatio-temporal states of an action is the most crucial step for action classification.

Deep Learning for Hand Gesture Recognition on Skeletal Data

In this paper, we introduce a new 3D hand gesture recognition approach based on a deep learning model.


Gesture recognition using a bioinspired learning architecture that integrates visual data with somatosensory data from stretchable sensors

Ming Wang, Zheng Yan, Ting Wang, Pingqiang Cai, Siyu Gao, Yi Zeng, Changjin Wan, Hong Wang, Liang Pan, Jiancan Yu, Shaowu Pan, Jie Lu & Xiaodong Chen

Nature Electronics 3, 563–570 (2020)


Subjects: Electrical and electronic engineering; Materials for devices

Gesture recognition using machine-learning methods is valuable in the development of advanced cybernetics, robotics and healthcare systems, and typically relies on images or videos. To improve recognition accuracy, such visual data can be combined with data from other sensors, but this approach, which is termed data fusion, is limited by the quality of the sensor data and the incompatibility of the datasets. Here, we report a bioinspired data fusion architecture that can perform human gesture recognition by integrating visual data with somatosensory data from skin-like stretchable strain sensors made from single-walled carbon nanotubes. The learning architecture uses a convolutional neural network for visual processing and then implements a sparse neural network for sensor data fusion and recognition at the feature level. Our approach can achieve a recognition accuracy of 100% and maintain recognition accuracy in non-ideal conditions where images are noisy and under- or over-exposed. We also show that our architecture can be used for robot navigation via hand gestures, with an error of 1.7% under normal illumination and 3.3% in the dark.
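As a rough illustration of the feature-level fusion step described above, the sketch below concatenates a visual feature vector with strain-sensor readings and passes them through a tiny dense classifier. The feature sizes, layer widths and random weights are illustrative assumptions only; this does not reproduce the CNN or the sparse fusion network reported in the paper.

```python
import numpy as np

# Hypothetical feature-level fusion: visual features from an image encoder and
# somatosensory features from stretchable strain sensors are concatenated and
# passed through a small fully connected classifier. All shapes and weights
# are illustrative assumptions, not the architecture reported in the paper.

rng = np.random.default_rng(0)

def relu(x):
    return np.maximum(x, 0.0)

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

visual_feat = rng.normal(size=(1, 128))   # e.g. output of a CNN image encoder
strain_feat = rng.normal(size=(1, 5))     # e.g. five strain-sensor channels

fused = np.concatenate([visual_feat, strain_feat], axis=-1)  # feature-level fusion

# Tiny dense "fusion" network with random (untrained) weights for illustration.
W1 = rng.normal(scale=0.1, size=(fused.shape[-1], 32))
b1 = np.zeros(32)
W2 = rng.normal(scale=0.1, size=(32, 10))  # assume 10 gesture classes
b2 = np.zeros(10)

hidden = relu(fused @ W1 + b1)
probs = softmax(hidden @ W2 + b2)
print("predicted gesture class:", int(probs.argmax()))
```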


Data availability

The data that support the plots within this paper and other findings of this study are available from the corresponding author upon reasonable request. The SV datasets used in this study are available at https://github.com/mirwang666-ime/Somato-visual-SV-dataset .

Code availability

The code that supports the plots within this paper and other findings of this study is available at https://github.com/mirwang666-ime/Somato-visual-SV-dataset . The code that supports the human–machine interaction experiment is available from the corresponding author upon reasonable request.


Acknowledgements

The project was supported by the Agency for Science, Technology and Research (A*STAR) under its Advanced Manufacturing and Engineering (AME) Programmatic Scheme (no. A18A1b0045), the National Research Foundation (NRF), Prime Minister’s office, Singapore, under its NRF Investigatorship (NRF-NRFI2017-07), Singapore Ministry of Education (MOE2017-T2-2-107) and the Australian Research Council (ARC) under Discovery Grant DP200100700. We thank all the volunteers for collecting data and also A.L. Chun for critical reading and editing of the manuscript.

Author information

These authors contributed equally: Ming Wang, Zheng Yan.

Authors and Affiliations

Innovative Centre for Flexible Devices (iFLEX), Max Planck–NTU Joint Lab for Artificial Senses, School of Materials Science and Engineering, Nanyang Technological University, Singapore, Singapore

Ming Wang, Ting Wang, Pingqiang Cai, Siyu Gao, Yi Zeng, Changjin Wan, Hong Wang, Liang Pan, Jiancan Yu, Shaowu Pan, Ke He & Xiaodong Chen

Centre for Artificial Intelligence, Faculty of Engineering and Information Technology, University of Technology Sydney, Sydney, New South Wales, Australia

Zheng Yan & Jie Lu


Contributions

M.W. and X.C. designed the study. M.W. designed and characterized the strain sensor. M.W., T.W. and P.C. fabricated the PAA hydrogels. Z.Y. and M.W. carried out the machine learning algorithms and analysed the results. M.W., S.G. and Y.Z. collected the SV data. M.W. performed the human–machine interaction experiment. M.W. and X.C. wrote the paper and all authors provided feedback.

Corresponding author

Correspondence to Xiaodong Chen .

Ethics declarations

Competing interests

The authors declare no competing interests.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary information

Supplementary information.

Supplementary Notes 1–3, Figs. 1–11, Tables 1–3 and refs. 1–7.

Reporting Summary

Supplementary Video 1

Conformable and adhesive stretchable strain sensor.

Supplementary Video 2

Comparison of robot navigation using BSV learning-based and visual-based hand gesture recognition under a normal illuminance of 431 lux.

Supplementary Video 3

Comparison of robot navigation using BSV learning-based and visual-based hand gesture recognition in the dark, under an illuminance of 10 lux.


About this article

Cite this article.

Wang, M., Yan, Z., Wang, T. et al. Gesture recognition using a bioinspired learning architecture that integrates visual data with somatosensory data from stretchable sensors. Nat Electron 3 , 563–570 (2020). https://doi.org/10.1038/s41928-020-0422-z


Received: 30 May 2019

Accepted: 05 May 2020

Published: 08 June 2020

Issue date: September 2020

DOI: https://doi.org/10.1038/s41928-020-0422-z



Hand Gesture Recognition Based on Computer Vision: A Review of Techniques

Munir Oudah

1 Electrical Engineering Technical College, Middle Technical University, Baghdad 10022, Iraq; Munir_aliraqi@yahoo.com

Ali Al-Naji

2 School of Engineering, University of South Australia, Mawson Lakes SA 5095, Australia

Javaan Chahl
Hand gestures are a form of nonverbal communication that can be used in several fields such as communication between deaf-mute people, robot control, human–computer interaction (HCI), home automation and medical applications. Research papers based on hand gestures have adopted many different techniques, including those based on instrumented sensor technology and computer vision. In other words, hand signs can be classified under many headings, such as posture and gesture, dynamic and static, or a hybrid of the two. This paper reviews the literature on hand gesture techniques and introduces their merits and limitations under different circumstances. In addition, it tabulates the performance of these methods, focusing on computer vision techniques, in terms of similarities and differences, the hand segmentation technique used, classification algorithms and their drawbacks, the number and types of gestures, the dataset used, the detection range (distance) and the type of camera used. This paper is a thorough general overview of hand gesture methods with a brief discussion of some possible applications.

1. Introduction

Hand gestures are an aspect of body language that can be conveyed through the center of the palm, the finger position and the shape constructed by the hand. Hand gestures can be classified into static and dynamic. As its name implies, the static gesture refers to the stable shape of the hand, whereas the dynamic gesture comprises a series of hand movements such as waving. There are a variety of hand movements within a gesture; for example, a handshake varies from one person to another and changes according to time and place. The main difference between posture and gesture is that posture focuses more on the shape of the hand whereas gesture focuses on the hand movement. The main approaches to hand gesture research can be classified into the wearable glove-based sensor approach and the camera vision-based sensor approach [ 1 , 2 ].

Hand gestures offer an inspiring field of research because they can facilitate communication and provide a natural means of interaction across a variety of applications. Previously, hand gesture recognition was achieved with wearable sensors attached directly to the hand by gloves. These sensors detected a physical response according to hand movements or finger bending. The collected data were then processed on a computer connected to the glove by wire. This glove-based sensor system could be made portable by attaching the sensors to a microcontroller.

As illustrated in Figure 1 , hand gestures for human–computer interaction (HCI) started with the invention of the data glove sensor. It offered simple commands for a computer interface. The gloves used different sensor types to capture hand motion and position by detecting the correct coordinates of the location of the palm and fingers [ 3 ]. Various sensors using the same technique based on the angle of bending were the curvature sensor [ 4 ], angular displacement sensor [ 5 ], optical fiber transducer [ 6 ], flex sensors [ 7 ] and accelerometer sensor [ 8 ]. These sensors exploit different physical principles according to their type.

Figure 1. Different techniques for hand gestures. (a) Glove-based attached sensor, either connected to the computer or portable; (b) computer vision-based camera using a marked glove or just a naked hand.

Although the techniques mentioned above have provided good outcomes, they have various limitations that make them unsuitable for the elderly, who may experience discomfort and confusion due to wire connection problems. In addition, elderly people suffering from chronic disease conditions that result in loss of muscle function may be unable to wear and take off gloves, causing them discomfort and constraining them if used for long periods. These sensors may also cause skin damage, infection or adverse reactions in people with sensitive skin or those suffering burns. Moreover, some sensors are quite expensive. Some of these problems were addressed in a study by Lamberti and Camastra [ 9 ], who developed a computer vision system based on colored marked gloves. Although this study did not require the attachment of sensors, it still required colored gloves to be worn.

These drawbacks led to the development of promising and cost-effective techniques that do not require cumbersome gloves to be worn. These techniques are called camera vision-based sensor technologies. With the evolution of open-source software libraries, it is easier than ever to detect hand gestures in a wide range of applications such as clinical operations [ 10 ], sign language [ 11 ], robot control [ 12 ], virtual environments [ 13 ], home automation [ 14 ], personal computers and tablets [ 15 ] and gaming [ 16 ]. These techniques essentially replace the instrumented glove with a camera. Different types of camera are used for this purpose, such as RGB cameras, time-of-flight (TOF) cameras, thermal cameras and night vision cameras.

Algorithms have been developed based on computer vision methods to detect hands using these different types of cameras. The algorithms attempt to segment and detect hand features such as skin color, appearance, motion, skeleton, depth, 3D models and deep learning-based detection. These methods involve several challenges, which are discussed in the following sections.
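A minimal sketch of this generic capture–segment–classify loop is shown below; the background-subtraction segmenter and the contour-area decision rule are placeholders standing in for the skin color, appearance, motion, skeleton, depth or deep-learning stages surveyed in the following sections.

```python
import cv2

# Minimal sketch of the generic capture -> segment -> feature -> classify loop.
# The MOG2 background subtractor and the area-based "classifier" are placeholders;
# a real system would substitute one of the techniques reviewed below.

segmenter = cv2.createBackgroundSubtractorMOG2(detectShadows=False)

def classify(mask):
    contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    if not contours:
        return "no hand"
    area = cv2.contourArea(max(contours, key=cv2.contourArea))
    return "gesture A" if area > 5000 else "gesture B"   # toy decision rule

cap = cv2.VideoCapture(0)          # default webcam
for _ in range(100):               # process a short burst of frames
    ok, frame = cap.read()
    if not ok:
        break
    mask = segmenter.apply(frame)  # foreground (moving) pixels
    print(classify(mask))
cap.release()
```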

Several studies based on computer vision techniques were published in the past decade. A study by Murthy et al. [ 17 ] covered the role and fundamental techniques of HCI in terms of recognition approaches, classification and applications, describing the limitations of computer vision under various conditions. Another study by Khan et al. [ 18 ] presented a recognition system addressing feature extraction and gesture classification, and considered the application areas of the studies reviewed. Suriya et al. [ 19 ] provided a survey specifically on hand gesture recognition for mouse control applications, including the methodologies and algorithms used for human–machine interaction, together with a brief review of the hidden Markov model (HMM). A study by Sonkusare et al. [ 20 ] reported various techniques and compared them according to hand segmentation methodology, tracking, feature extraction and recognition technique, and concluded that recognition rate trades off against temporal rate, limited by computing power. Finally, Kaur et al. [ 16 ] reviewed several methods, both sensor-based and vision-based, for hand gesture recognition, with the aim of improving algorithm precision by integrating current techniques.

The studies above give insight into gesture recognition systems under various scenarios and address issues such as scene background limitations, illumination conditions, algorithm accuracy for feature extraction, dataset type, classification algorithm used and application. However, no review paper mentions camera type, distance limitations or recognition rate. Therefore, the objective of this study is to provide a comparative review of recent studies on computer vision techniques for hand gesture detection and classification supported by different technologies. The current paper discusses the seven most reported approaches to the problem: skin color, appearance, motion, skeleton, depth, 3D model and deep learning. It examines these approaches in detail and summarizes recent research under different considerations (type of camera used, resolution of the processed image or video, segmentation technique, classification algorithm, recognition rate, type of region-of-interest processing, number of gestures, application area, limitations or invariant factors, detection range achieved and, in some cases, the dataset used, runtime speed, hardware and type of error). In addition, the review presents the most popular applications associated with this topic.

The remainder of this paper is organized as follows. Section 2 explains hand gesture methods, focusing on computer vision techniques; it describes the seven most common approaches (skin color, appearance, motion, skeleton, depth, 3D model and deep learning) and supports them with tables. Section 3 illustrates in detail seven application areas that use hand gesture recognition systems. Section 4 briefly discusses research gaps and challenges. Finally, Section 5 presents our conclusions. Figure 2 below illustrates the classification of methods covered by this review.

Figure 2. Classification of methods covered by this review.

2. Hand Gesture Methods

The primary goal in studying gesture recognition is to introduce a system that can detect specific human gestures and use them to convey information or for command and control purposes. Therefore, it includes not only tracking of human movement, but also the interpretation of that movement as significant commands. Two approaches are generally used to interpret gestures for HCI applications. The first approach is based on data gloves (wearable or direct contact) and the second approach is based on computer vision without the need to wear any sensors.

2.1. Hand Gestures Based on Instrumented Glove Approach

Wearable glove-based sensors can be used to capture hand motion and position. In addition, they can easily provide the exact coordinates of palm and finger locations, orientations and configurations through sensors attached to the gloves [ 21 , 22 , 23 ]. However, this approach requires the user to be physically connected to the computer [ 23 ], which hinders the ease of interaction between user and computer. In addition, the price of these devices is quite high [ 23 , 24 ]. More modern glove-based approaches use industrial-grade haptic technology, in which the glove provides haptic feedback that lets the user sense the shape, texture, movement and weight of a virtual object through microfluidic technology. Figure 3 shows an example of a sensor glove used in sign language.

Figure 3. Sensor-based data glove (adapted from https://physicsworld.com/a/smart-glove-translates-sign-language-into-digital-text/ ).

2.2. Hand Gestures Based on Computer Vision Approach

The camera vision based sensor is a common, suitable and applicable technique because it provides contactless communication between humans and computers [ 16 ]. Different configurations of cameras can be utilized, such as monocular, fisheye, TOF and IR [ 20 ]. However, this technique involves several challenges, including lighting variation, background issues, the effect of occlusions, complex background, processing time traded against resolution and frame rate and foreground or background objects presenting the same skin color tone or otherwise appearing as hands [ 17 , 21 ]. These challenges will be discussed in the following sections. A simple diagram of the camera vision-based sensor for extracting and identifying hand gestures is presented in Figure 4 .

Figure 4. Using computer vision techniques to identify gestures: the user performs a specific gesture with one or both hands in front of a camera connected to a system framework, which applies various techniques to extract features and classify the hand gesture so that it can control an application.

2.2.1. Color-Based Recognition

Color-Based Recognition Using Glove Marker

This method uses a camera to track the movement of the hand using a glove with different color marks, as shown in Figure 5. It has been used for interaction with 3D models, permitting operations such as zooming, moving, drawing and writing with a virtual keyboard with good flexibility [ 9 ]. The colors on the glove enable the camera sensor to track and detect the location of the palm and fingers, which allows a geometric model of the hand shape to be extracted [ 13 , 25 ]. The advantages of this method are its simplicity of use and low price compared with the sensor data glove [ 9 ]. However, it still requires colored gloves to be worn and limits the degree of natural, spontaneous interaction in HCI [ 25 ]. The color-based glove marker is shown in Figure 5 [ 13 ].

Figure 5. Color-based recognition using a glove marker [ 13 ].

Color-Based Recognition of Skin Color

Skin color detection is one of the most popular methods for hand segmentation and is used in a wide range of applications, such as object classification, degraded photograph recovery, person movement tracking, video surveillance, HCI, facial recognition, hand segmentation and gesture identification. Skin color detection has been achieved using two methods. The first is pixel-based skin detection, in which each pixel in an image is classified as skin or non-skin individually, independently of its neighbors. The second is region-based skin detection, in which skin pixels are processed spatially based on information such as intensity and texture.

Color space can be used as a mathematical model to represent image color information. Several color spaces can be used according to the application type, such as digital graphics, image processing, TV transmission and computer vision applications [ 26 , 27 ]. Figure 6 shows an example of skin color detection using YUV color space.

Figure 6. Example of skin color detection. (a) Thresholds are applied to the channels of YUV color space to extract only skin color, assigning 1 to skin and 0 to non-skin pixels; (b) the hand is detected and tracked using the resulting binary image.

Several color space formats are used for skin segmentation, as itemized below:

  • red, green, blue (R–G–B and RGB-normalized);
  • hue and saturation (H–S–V, H–S–I and H–S–L);
  • luminance (YIQ, Y–Cb–Cr and YUV).

A more detailed discussion of skin color detection based on RGB channels can be found in [ 28 , 29 ]. However, RGB is not preferred for skin segmentation because the mixing of color and intensity information in an image has irregular characteristics [ 26 ]. Skin color can be detected by thresholding the three channels (red, green and blue). In normalized RGB, the color information is simply separated from the luminance; however, under lighting variation it cannot be relied on for segmentation or detection purposes, as shown in [ 30 , 31 ].

Color spaces in the hue/saturation and luminance families behave well under lighting variations. Transforming RGB to HSI or HSV takes time when there is substantial variation in color value (hue and saturation), so pixels within a chosen intensity range are selected. The RGB-to-HSV transformation can also be time-consuming because it converts from Cartesian to polar coordinates; thus, HSV space is most useful for detection in simple images.

Transforming and splitting the channels of Y–Cb–Cr color space is simple compared with the HSV color family with regard to skin color detection and segmentation, as illustrated in [ 32 , 33 ]. Skin tone detection based on Y–Cb–Cr is demonstrated in detail in [ 34 , 35 ].
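A minimal OpenCV sketch of this kind of Y–Cb–Cr thresholding is shown below; the input file name and the numeric Cr/Cb ranges are illustrative assumptions and would need tuning for specific lighting conditions and skin tones.

```python
import cv2

# Skin segmentation by thresholding the Cr and Cb channels.
# The numeric ranges are commonly used approximations, not values taken from
# the studies cited above, and typically need tuning per setup.

img = cv2.imread("hand.jpg")                      # hypothetical input image
assert img is not None, "hand.jpg not found"

ycrcb = cv2.cvtColor(img, cv2.COLOR_BGR2YCrCb)    # OpenCV orders channels Y, Cr, Cb
skin_mask = cv2.inRange(ycrcb, (0, 135, 85), (255, 180, 135))
skin_mask = cv2.medianBlur(skin_mask, 5)          # suppress speckle noise
hand_only = cv2.bitwise_and(img, img, mask=skin_mask)

cv2.imwrite("hand_mask.png", skin_mask)
cv2.imwrite("hand_segmented.png", hand_only)
```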

The image is converted from RGB to another color space in order to detect the region of interest, normally a hand. This method can detect the region through the range of possible skin colors, such as red, orange, pink and brown. Training samples of skin regions are studied to obtain the likely range of skin pixel values in the R, G and B channels. To detect skin regions, each pixel color is compared with the predetermined sample colors; if similar, the region can be labeled as skin [ 36 ]. Table 1 presents a set of research papers that use different techniques to detect skin color.

Table 1. Set of research papers that have used skin color detection for hand gesture and finger counting applications.

Author | Type of camera | Resolution | Segmentation technique/method | Feature extracted | Classification algorithm | Recognition rate | No. of gestures | Application area | Invariant factor | Distance from camera
[ ] | off-the-shelf HD webcam | 16 Mp | Y–Cb–Cr | finger count | maximum distance of centroid of two fingers | 70% to 100% | 14 gestures | HCI | light intensity, size, noise | 150 to 200 mm
[ ] | computer camera | 320 × 250 pixels | Y–Cb–Cr | finger count | expert system | 98% | 6 gestures | deaf-mute people | heavy light during capturing | –
[ ] | Fron-Tech E-cam (web camera) | 10 Mp | RGB threshold & Sobel edge detection | A–Z alphabet hand gestures | feature matching (Euclidean distance) | 90.19% | 26 static gestures | American Sign Language (ASL) | – | 1000 mm
[ ] | webcam | 640 × 480 pixels | HSI & distance transform | finger count | distance transform & circular profiling | 100% (subject to limitations) | 6 gestures | slide control during a presentation | location of hand | –
[ ] | webcam | – | HSI & frame difference & Haar classifier | dynamic hand gestures | contour matching difference with the previous frame | – | hand segment | HCI | sensitive to moving background | –
[ ] | webcam | 640 × 480 pixels | HSV & motion detection (hybrid technique) | hand gestures | SPM classification technique | 98.75% | hand segment | HCI | – | –
[ ] | video camera | 640 × 480 pixels | HSV & cross-correlation | hand gestures | Euclidean distance | 82.67% | 15 gestures | man–machine interface (MMI) | – | –
[ ] | digital or cellphone camera | 768 × 576 pixels | HSV | hand gestures | division by shape | – | hand segment | Malaysian sign language | objects with the same skin color & hard edges | –
[ ] | web camera | 320 × 240 pixels | red-channel threshold segmentation | hand postures | combined information from multiple cues of motion, color and shape | 100% | 5 hand postures | HCI (wheelchair control) | – | –
[ ] | Logitech portable webcam C905 | 320 × 240 pixels | normalized R, G, original red | hand gestures | Haar-like directional patterns & motion history image | 93.13% static, 95.07% dynamic | 2 static, 4 dynamic gestures | man–machine interface (MMI) | – | < 1000 mm, 1000–1500 mm, 1500–2000 mm
[ ] | high-resolution cameras | 640 × 480 pixels | HSI & Gaussian mixture model (GMM) & second histogram | hand postures | Haarlet-based hand gesture | 98.24% correct classification rate | 10 postures | manipulating 3D objects & navigating through a 3D model | changes in illumination | –
[ ] | TOF camera & AVT Marlin color camera | 176 × 144 & 640 × 480 pixels | histogram-based skin color probability & depth threshold | hand gestures | 2D Haarlets | 99.54% | hand segment | real-time hand gesture interaction system | – | 1000 mm

–: none.

The skin color method involves various challenges, such as illumination variation, background issues and other types of noise. A study by Perimal et al. [ 37 ] provided 14 gestures under controlled room lighting using an HD camera at short distance (0.15 to 0.20 m); the gestures were tested against three parameters, noise, light intensity and hand size, which directly affect recognition rate. Another study by Sulyman et al. [ 38 ] observed that using Y–Cb–Cr color space is beneficial for eliminating illumination effects, although bright light during capture reduces accuracy. A study by Pansare et al. [ 11 ] used normalized RGB to detect skin and applied a median filter to the red channel to reduce noise in the captured image; the Euclidean distance algorithm was used for feature matching against a comprehensive dataset. A study by Rajesh et al. [ 15 ] used HSI to segment the skin color region under controlled environmental conditions, to ensure proper illumination and reduce error.

Another challenge with the skin color method is that the background must not contain any elements that match skin color. Choudhury et al. [ 39 ] suggested a novel hand segmentation approach combining the frame differencing technique with skin color segmentation, which recorded good results, but the method is still sensitive to scenes that contain moving objects in the background, such as moving curtains and waving trees. Stergiopoulou et al. [ 40 ] combined motion-based segmentation (a hybrid of image differencing and background subtraction) with skin color and morphology features to obtain a robust result that overcomes illumination and complex-background problems. Another study, by Khandade et al. [ 41 ], used a cross-correlation method to match the segmented hand against a dataset to achieve better recognition. Karabasi et al. [ 42 ] proposed hand gestures for deaf-mute communication based on mobile phones, which can translate sign language using HSV color space. Zeng et al. [ 43 ] presented a hand gesture method to assist wheelchair users indoors and outdoors using red-channel thresholding with a fixed background to overcome illumination changes. A study by Hsieh et al. [ 44 ] used face skin detection to define skin color; this system can correctly detect skin pixels under low lighting conditions, even when the face color is not in the normal range of skin chromaticity. Another study, by Bergh et al. [ 45 ], proposed a hybrid method based on a combination of a histogram and a pre-trained Gaussian mixture model to overcome lighting conditions. The authors of [ 46 ] aligned two cameras (RGB and TOF) to improve skin color detection, using the depth property of the TOF camera to enhance detection and cope with background limitations.
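In the spirit of the hybrid approaches above, the following sketch combines frame differencing with an approximate skin-color mask so that only pixels that both moved and look like skin are kept; the file names and thresholds are illustrative assumptions, not parameters from the cited studies.

```python
import cv2

# Hybrid segmentation sketch: a pixel is kept only if it changed between two
# consecutive frames AND falls in an approximate skin-colour range.
# Frame file names and all numeric thresholds are placeholders.

prev = cv2.imread("frame_0.png")
curr = cv2.imread("frame_1.png")
assert prev is not None and curr is not None

diff = cv2.absdiff(cv2.cvtColor(prev, cv2.COLOR_BGR2GRAY),
                   cv2.cvtColor(curr, cv2.COLOR_BGR2GRAY))
_, motion_mask = cv2.threshold(diff, 25, 255, cv2.THRESH_BINARY)

ycrcb = cv2.cvtColor(curr, cv2.COLOR_BGR2YCrCb)
skin_mask = cv2.inRange(ycrcb, (0, 135, 85), (255, 180, 135))

hand_mask = cv2.bitwise_and(motion_mask, skin_mask)   # moving AND skin-coloured
hand_mask = cv2.morphologyEx(hand_mask, cv2.MORPH_OPEN,
                             cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (5, 5)))
cv2.imwrite("hand_mask.png", hand_mask)
```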

2.2.2. Appearance-Based Recognition

This method depends on extracting image features in order to model the visual appearance of the hand and comparing these parameters with features extracted from the input image frames. The features are calculated directly from pixel intensities without a prior segmentation step. The method runs in real time because the 2D image features are easy to extract, and it is considered easier to implement than the 3D model method. In addition, this method can handle various skin tones. Using the AdaBoost learning algorithm, which maintains fixed features such as key points for a portion of a hand and can thereby address the occlusion issue [ 47 , 48 ], the approach can be separated into two models: a motion model and a 2D static model. Table 2 presents a set of research papers that use different segmentation techniques based on appearance recognition to detect the region of interest (ROI).

Table 2. Set of research papers that have used appearance-based detection for hand gesture applications.

Author | Type of camera | Resolution | Segmentation technique/method | Feature extracted | Classification algorithm | Recognition rate | No. of gestures | Application area | Dataset type | Invariant factor | Distance from camera
[ ] | Logitech QuickCam web camera | 320 × 240 pixels | Haar-like features & AdaBoost learning algorithm | hand posture | parallel cascade structure | above 90% | 4 hand postures | real-time vision-based hand gesture classification | positive and negative hand samples collected by authors | – | –
[ ] | webcam (1.3 Mp) | images resized to 80 × 64 for training | Otsu & Canny edge detection on grayscale images | hand sign | feed-forward back-propagation neural network | 92.33% | 26 static signs | American Sign Language | dataset created by authors | low differentiation | different distances
[ ] | video camera | 320 × 240 pixels | Gaussian model of hand color in HSV & AdaBoost algorithm | hand gesture | palm–finger configuration | 93% | 6 hand gestures | real-time hand gesture recognition method | – | – | –
[ ] | camera–projector system | 384 × 288 pixels | background subtraction | hand gesture | Fourier-based classification | 87.7% | 9 hand gestures | user-independent application | ground-truth dataset collected manually | geometrically distorted point coordinates & skin color | –
[ ] | monocular web camera | 320 × 240 pixels | combined Y–Cb–Cr & edge extraction & parallel finger edge appearance | hand posture based on finger gesture | finger model | – | 14 static gestures | substantial applications | test data collected from videos captured by web camera | variation in lightness causes edge extraction failure | ≤ 500 mm
A study by Chen et al. [ 49 ] proposed two approaches for hand recognition. The first approach focused on posture recognition using Haar-like features, which can describe the hand posture pattern effectively, and used the AdaBoost learning algorithm to speed up performance and thus the classification rate. The second approach focused on gesture recognition using a context-free grammar to analyze the syntactic structure based on the detected postures. Another study, by Kulkarni and Lokhande [ 50 ], used three feature extraction methods, including a histogram technique, to segment and observe images containing a large number of gestures, and then suggested using edge detectors such as the Canny, Sobel and Prewitt operators with different thresholds. Gesture classification was performed using a feed-forward back-propagation artificial neural network with supervised learning. One limitation reported by the authors is that the histogram technique produces misclassifications because it can only be used for a small number of gestures that are completely different from each other. Fang et al. [ 51 ] used an extended AdaBoost method for hand detection and combined optical flow with a color cue for tracking. They also collected hand color from the neighborhood of the features' mean position, using a single Gaussian model to describe hand color in HSV color space. Multiple features were extracted and gestures recognized using palm and finger decomposition, and scale-space feature detection was integrated into gesture recognition to counter the aspect-ratio limitation faced by most learning-based hand gesture methods. Licsár et al. [ 52 ] used a simple background subtraction method for hand segmentation and extended it to handle background changes, in order to face challenges such as skin-like colors and complex, dynamic backgrounds, and then used a boundary-based method to classify hand gestures. Finally, Zhou et al. [ 53 ] proposed a novel method to directly extract the fingers: edges are extracted from the gesture images, the finger central area is obtained from these edges, and the fingers are then obtained from the parallel-edge characteristics. The proposed system cannot recognize the side view of a hand pose. Figure 7 below shows a simple example of appearance-based recognition.

Figure 7. Example of appearance-based recognition using foreground extraction to segment only the ROI; object features can be extracted using techniques such as pattern or image subtraction and foreground/background segmentation algorithms.

According to the information in Table 2, the first row uses Haar-like features, which are well suited to analyzing ROI patterns efficiently. Haar-like features can efficiently analyze the contrast between dark and bright regions within a kernel and operate faster than pixel-based systems. In addition, they are relatively immune to noise and lighting variation because they compute the gray-value difference between the white and black rectangles. The result in the first row is 90%, whereas the single Gaussian model used to describe hand color in HSV color space in the third row achieves a recognition rate of 93%, even though both proposed systems use the AdaBoost algorithm to speed up the system and the classification.
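Haar-like features with a boosted cascade are what OpenCV's cascade classifiers implement; the sketch below assumes a hypothetical pre-trained hand cascade file ("palm.xml"), which is not bundled with OpenCV and would have to be trained or obtained separately.

```python
import cv2

# Detection with Haar-like features and a boosted cascade.
# "palm.xml" is a hypothetical pre-trained hand cascade; OpenCV does not ship one.

cascade = cv2.CascadeClassifier("palm.xml")
assert not cascade.empty(), "palm.xml not found or invalid"

frame = cv2.imread("scene.jpg")                # hypothetical input image
gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
gray = cv2.equalizeHist(gray)                  # reduce sensitivity to lighting

hands = cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5,
                                 minSize=(40, 40))
for (x, y, w, h) in hands:
    cv2.rectangle(frame, (x, y), (x + w, y + h), (0, 255, 0), 2)
cv2.imwrite("detections.png", frame)
```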

2.2.3. Motion-Based Recognition

Motion-based recognition can be utilized for detection purposes; it extracts the object across a series of image frames. The AdaBoost algorithm is used for object detection and characterization, while movement modeling and pattern recognition are needed to recognize the gesture [ 16 ]. The main issues facing motion-based recognition occur when more than one gesture is active during the recognition process, and a dynamic background also has a negative effect. In addition, a gesture may be lost because of occlusion among tracked hand gestures, errors in region extraction from the tracked gesture, or the effect of long distance on the region's appearance. Table 3 presents a set of research papers that used different segmentation techniques based on motion recognition to detect the ROI.

Table 3. Set of research papers that have used motion-based detection for hand gesture applications.

Author | Type of camera | Resolution | Segmentation technique/method | Feature extracted | Classification algorithm | Recognition rate | No. of gestures | Application area | Dataset type | Invariant factor | Distance from camera
[ ] | off-the-shelf cameras | – | RGB, HSV, Y–Cb–Cr & motion tracking | hand gesture | histogram distribution model | 97.33% | 10 gestures | human–computer interface | dataset created by authors | other moving objects and background issues | –
[ ] | Canon GL2 camera | 720 × 480 pixels | face detection & optical flow | motion gesture | leave-one-out cross-validation | – | 7 gestures | gesture recognition system | dataset created by authors | – | –
[ ] | time-of-flight (TOF) SR4000 | 176 × 144 pixels | depth information, motion patterns | motion gesture | motion pattern comparison | 95% | 26 gestures | interaction with virtual environments | cardinal directions dataset | depth range limitation | 3000 mm
[ ] | digital camera | – | YUV & CAMShift algorithm | hand gesture | naïve Bayes classifier | high | unlimited | human–machine system | dataset created by authors | changing illumination, rotation problem, position problem | –
Two stages for efficient hand detection were proposed in [ 54 ]. First, the hand is detected in each frame and its center point is used for tracking. In the second stage, a matching model is applied to each type of gesture using a set of features extracted from the motion tracking in order to provide better classification; the main drawback is that skin color is affected by lighting variations, which leads to the detection of non-skin colors. A standard face detection algorithm and optical flow computation were used in [ 55 ] to obtain a user-centric coordinate frame, in which motion features were used to recognize gestures for classification with a multiclass boosting algorithm. A real-time dynamic hand gesture recognition system based on TOF was offered in [ 56 ], in which motion patterns were detected from hand gestures received as input depth images. These motion patterns were compared with hand motion classifications computed from real dataset videos, which does not require a segmentation algorithm; the system provides good results except for the depth-range limitation of the TOF camera. In [ 57 ], YUV color space was used, with the help of the CAMShift algorithm, to distinguish between background and skin color, and a naïve Bayes classifier was implemented to assist with gesture recognition. The proposed system faces several challenges: illumination variation, where light changes affect the skin segmentation result; the degrees of freedom of the gesture, where rotation directly affects the output; hand position capture problems, where a hand appearing in the corner of the frame may not be covered by the tracking points, causing the user's gesture to be missed; and hand size, which differs considerably between users and may cause problems for the interaction system. However, the major remaining challenge is skin-like background color, which affects the overall system and can invalidate the result. Figure 8 gives a simple example of hand motion recognition.

Figure 8. Example of motion-based recognition using frame-difference subtraction to extract hand features; the moving object (the hand) is extracted from the fixed background.

According to the information in Table 3, the system in the first row has a recognition rate of 97%; the hybrid system based on skin detection and motion detection is more reliable for gesture recognition, since the moving hand can be tracked using multiple track candidates based on standard-deviation calculations for both the skin and motion cues. Each gesture is encoded as a chain code, which is a simple model compared with an HMM, and gestures are classified using a model of the histogram distribution. The proposed system in the third row uses a TOF depth camera, where a motion-pattern model of the human arm is used to define motion patterns; the authors confirm that using depth information for hand trajectory estimation improves the gesture recognition rate. Moreover, the proposed system needs no segmentation algorithm; it was examined using 2D and 2.5D approaches, with 2.5D performing better than 2D and giving a recognition rate of 95%.
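The motion cue itself can be computed with dense optical flow, as in the sketch below, which summarizes the flow between two consecutive frames as a coarse motion magnitude and direction; the frame file names are placeholders, and a real system would accumulate such features over a gesture sequence before classification.

```python
import cv2
import numpy as np

# Dense optical flow between two consecutive frames, reduced to a single
# dominant motion vector as a toy motion feature. File names are placeholders.

prev = cv2.cvtColor(cv2.imread("frame_0.png"), cv2.COLOR_BGR2GRAY)
curr = cv2.cvtColor(cv2.imread("frame_1.png"), cv2.COLOR_BGR2GRAY)

# Farneback parameters (positional): pyr_scale, levels, winsize, iterations,
# poly_n, poly_sigma, flags.
flow = cv2.calcOpticalFlowFarneback(prev, curr, None, 0.5, 3, 15, 3, 5, 1.2, 0)

dx, dy = flow[..., 0].mean(), flow[..., 1].mean()   # average motion over the frame
angle = np.degrees(np.arctan2(dy, dx))
print(f"dominant motion: {np.hypot(dx, dy):.2f} px/frame at {angle:.0f} degrees")
```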

2.2.4. Skeleton-Based Recognition

Skeleton-based recognition specifies model parameters that can improve the detection of complex features [ 16 ]. Various representations of skeleton data for the hand model can be used for classification; they describe geometric attributes and constraints and translate easily into features and correlations of the data, with the focus on geometric and statistical features. The most commonly used features are joint orientation, the spacing between joints, skeletal joint locations, the angles between joints, and the trajectories and curvature of the joints. Table 4 presents a set of research papers that use different segmentation techniques based on skeletal recognition to detect the ROI.

Table 4. Set of research papers that have used skeleton-based recognition for hand gesture applications.

Author | Type of camera | Resolution | Segmentation technique/method | Feature extracted | Classification algorithm | Recognition rate | No. of gestures | Application area | Dataset type | Invariant factor | Distance from camera
[ ] | Kinect camera depth sensor | 512 × 424 pixels | Euclidean distance & geodesic distance | fingertip | skeleton pixels extracted | – | hand tracking | real-time hand tracking method | – | – | –
[ ] | Intel RealSense depth camera | – | skeleton data | hand-skeletal joints' positions | convolutional neural network (CNN) | 91.28% (14 gestures), 84.35% (28 gestures) | 14 / 28 gestures | classification method | Dynamic Hand Gesture-14/28 (DHG) dataset | only works on complete sequences | –
[ ] | Kinect camera | 240 × 320 pixels | Laplacian-based contraction | skeleton point clouds | Hungarian algorithm | 80% | 12 gestures | hand gesture recognition method | ChaLearn Gesture Dataset (CGD2011) | lower performance in the 0° viewpoint condition | –
[ ] | recorded RGB video sequence | – | vision-based approach & skeletal data | hand and body skeletal features | skeleton classification network | – | hand gesture | sign language recognition | LSA64 dataset | difficulties extracting skeletal data because of occlusions | –
[ ] | Intel RealSense depth camera | 640 × 480 pixels | depth and skeletal dataset | hand gesture | supervised support vector machine (SVM) with a linear kernel | 88.24% (14 gestures), 81.90% (28 gestures) | 14 / 28 gestures | hand gesture application | SHREC 2017 track "3D Hand Skeletal Dataset" (created by authors) | – | –
[ ] | Kinect V2 camera sensor | 512 × 424 pixels | depth metadata | dynamic hand gesture | SVM | 95.42% | 10 / 26 gestures | Arabic numbers (0–9) and letters (26) | authors' own dataset | low recognition rate for "O", "T" and "2" | –
[ ] | Kinect RGB camera & depth sensor | 640 × 480 pixels | skeleton data | hand blob | hand gesture | – | – | Malaysian sign language | – | – | –
Hand segmentation using the depth sensor of the Kinect camera, followed by locating the fingertips using 3D connections, Euclidean distance and geodesic distance over hand skeleton pixels to increase accuracy, was proposed in [ 58 ]. A new 3D hand gesture recognition approach based on a deep learning model using parallel convolutional neural networks (CNNs) to process hand-skeleton joint positions was introduced in [ 59 ]; its limitation is that it works only with complete sequences. In [ 60 ], the optimal viewpoint was estimated and the gesture point cloud was transformed using a curve skeleton to specify the topology, then Laplacian-based contraction was applied to specify the skeleton points; the Hungarian algorithm was applied to calculate match scores of the skeleton point sets, but the joint tracking information acquired by the Kinect is not accurate enough, resulting in constant vibration. A novel method based on skeletal features extracted from recorded RGB video of sign language, in which occlusions make it difficult to extract accurate skeletal data, was offered in [ 61 ]. A dynamic hand gesture approach using a depth and skeletal dataset for skeleton-based recognition was presented in [ 62 ], where a supervised support vector machine (SVM) with a linear kernel was used for classification. Another dynamic hand gesture recognition system, proposed in [ 63 ], used Kinect sensor depth metadata for acquisition and segmentation to extract orientation features; both SVM and HMM were utilized for classification and recognition to evaluate system performance, and the SVM gave better results than the HMM on criteria such as elapsed time and average recognition rate. A hybrid method for hand segmentation based on depth and color data acquired by the Kinect sensor with the help of skeletal data was proposed in [ 64 ]; in this method, an image threshold is applied to the depth frame, the super-pixel segmentation method is used to extract the hand from the color frame, and the two results are combined for robust segmentation. Figure 9 shows an example of skeleton-based recognition.

Figure 9. Example of skeleton recognition using a depth and skeleton dataset to represent the hand skeleton model [ 62 ].
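As a rough illustration of the skeleton-based pipeline discussed above (not the implementation of any cited paper), the sketch below builds a pose descriptor from pairwise distances between hand joints and classifies it with a linear-kernel SVM, similar in spirit to the approach in [ 62 ]; joint counts, class counts and data are toy placeholders.

# Minimal sketch (assumed pipeline, toy data): pairwise joint distances + linear SVM.
import numpy as np
from itertools import combinations
from sklearn.svm import SVC

def joint_distance_features(joints):
    """joints: (n_joints, 3) array of 3D positions; returns all pairwise distances."""
    return np.array([np.linalg.norm(joints[i] - joints[j])
                     for i, j in combinations(range(len(joints)), 2)])

# Toy data: 100 gesture samples, 22 hand joints each, 4 gesture classes.
rng = np.random.default_rng(0)
X = np.stack([joint_distance_features(rng.normal(size=(22, 3))) for _ in range(100)])
y = rng.integers(0, 4, size=100)

clf = SVC(kernel="linear").fit(X[:80], y[:80])   # linear-kernel SVM, as in the skeleton-based studies
print("toy accuracy:", clf.score(X[80:], y[80:]))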

According to the information summarized in Table 4 , the depth camera provides good segmentation accuracy because it is not affected by lighting variations or cluttered backgrounds. The main issue, however, is the detection range. The Kinect V1 sensor has an embedded system that returns the information received by the depth sensor as metadata describing human body joint coordinates. The Kinect V1 can track up to 20 skeletal joints, which helps to model the hand skeleton, while the Kinect V2 sensor can track 25 joints and up to six people at the same time with full joint tracking, with a detection range of 0.5–4.5 m.

2.2.5. Depth-Based Recognition

Approaches have been proposed for hand gesture recognition using different types of cameras. A depth camera provides 3D geometric information about the object [ 65 ]. Two major principles have been used to obtain depth: time-of-flight (TOF) and light coding. The 3D data from a depth camera directly reflect the depth field, whereas a color image contains only a projection [ 66 ]. With this approach, lighting, shadow, and color do not affect the resulting image; however, the cost, size and availability of depth cameras limit their use [ 67 ]. Table 5 presents a set of research papers that use different segmentation techniques based on depth recognition to detect the ROI.

Set of research papers that have used depth-based detection for hand gesture and finger counting application.

[ ] Kinect V1; RGB 640 × 480, depth 320 × 240; segmentation: threshold & near-convex shape; feature: finger gesture; classifier: finger–earth mover's distance (FEMD); recognition rate: 93.9%; 10 gestures; application: human–computer interaction (HCI)

[ ] Kinect V2; RGB 1920 × 1080, depth 512 × 424; segmentation: local neighbor method & threshold segmentation; feature: fingertip; classifier: convex hull detection algorithm; recognition rate: 96%; 6 gestures; application: natural human–robot interaction; distance: 500–2000 mm

[ ] Kinect V2; infrared sensor, depth 512 × 424; segmentation: operation on depth and infrared images; feature: finger counting & hand gesture; classifier: number of separate areas; gestures: finger count & two hand gestures; application: mouse-movement control; distance: < 500 mm

[ ] Kinect V1; RGB 640 × 480, depth 320 × 240; segmentation: depth thresholds; feature: finger gesture; classifier: finger counting classifier, finger name collection & vector matching; recognition rate: 84% (one hand), 90% (two hands); 9 gestures; application: chatting with speech; distance: 500–800 mm

[ ] Kinect V1; RGB 640 × 480, depth 320 × 240; segmentation: frame difference algorithm; feature: hand gesture; classifier: automatic state machine (ASM); recognition rate: 94%; gestures: hand gesture; application: human–computer interaction

[ ] Kinect V1; RGB 640 × 480, depth 320 × 240; segmentation: skin & motion detection, Hu moments and orientation; feature: hand gesture; classifier: discrete hidden Markov model (DHMM); 10 gestures; application: human–computer interfacing

[ ] Kinect V1; depth 640 × 480; segmentation: range of depth image; feature: hand gestures 1–5; classifier: kNN & Euclidean distance; recognition rate: 88%; 5 gestures; application: electronic home appliances; distance: 250–650 mm

[ ] Kinect V1; depth 640 × 480; segmentation: distance method; feature: hand gesture; gestures: hand gesture; application: human–computer interaction (HCI)

[ ] Kinect V1; depth 640 × 480; segmentation: threshold range; feature: hand gesture; gestures: hand gesture; application: hand rehabilitation system; distance: 400–1500 mm

[ ] Kinect V2; RGB 1920 × 1080, depth 512 × 424; segmentation: Otsu's global threshold; feature: finger gesture; classifier: kNN & Euclidean distance; recognition rate: 90%; gestures: finger count; application: human–computer interaction (HCI); note: hand not identified if it is not connected with the boundary; distance: 250–650 mm

[ ] Kinect V1; RGB 640 × 480, depth 640 × 480; segmentation: depth-based data and RGB data together; feature: finger gesture; classifier: distance from the device and shape-based matching; recognition rate: 91%; 6 gestures; application: finger mouse interface; distance: 500–800 mm

[ ] Kinect V1; depth 640 × 480; segmentation: depth threshold and K-curvature; feature: finger counting; classifier: depth threshold and K-curvature; recognition rate: 73.7%; 5 gestures; application: picture selection application; note: fingertips must be detected even while the hand is moving or rotating

[ ] Kinect V1; RGB 640 × 480, depth 320 × 240; segmentation: integration of RGB and depth information; feature: hand gesture; classifier: forward recursion & SURF; recognition rate: 90%; gestures: hand gesture; application: virtual environment

[ ] Kinect V2; depth 512 × 424; segmentation: skeletal data stream, depth and color data streams; feature: hand gesture; classifier: support vector machine (SVM) & artificial neural networks (ANN); recognition rate: 93.4% (SVM), 98.2% (ANN); gestures: 24 alphabet hand gestures; application: American Sign Language; distance: 500–800 mm

The finger–earth mover's distance (FEMD) approach was evaluated in terms of speed and precision and compared with a shape-matching algorithm using the depth map and color image acquired by the Kinect camera [ 65 ]. Improved depth threshold segmentation was offered in [ 68 ] by combining depth and color information using a hierarchical scan method, followed by hand segmentation with the local neighbor method; this approach works over a range of up to two meters. A new method was proposed in [ 69 ] for a near depth range of less than 0.5 m, where skeletal data are not provided by the Kinect; it was implemented using two image frames, depth and infrared. In [ 70 ], a depth threshold was used to segment the hand, then a K-means algorithm was applied to obtain the pixels of both of the user's hands; next, Graham's scan algorithm was used to detect the convex hulls of the hands, which were merged with the result of a contour tracing algorithm to detect the fingertips. The depth image frame was analyzed to extract 3D hand gestures in real time using frame differences to detect moving objects [ 71 ]; the foremost region was then classified using an automatic state machine algorithm. A skin–motion detection technique was used to detect the hand, Hu moments were applied for feature extraction, and HMM was then used for gesture recognition [ 72 ]. Depth range was used for hand segmentation, and Otsu's method was used to threshold the color frame after converting it to grayscale [ 14 ]; a kNN classifier was then used to classify the gestures. In [ 73 ], where the hand was segmented from depth information using a distance method, background subtraction and iterative techniques were applied to remove the depth image shadow and reduce noise. In [ 74 ], segmentation used 3D depth data selected with a threshold range. In [ 75 ], the proposed algorithm converted an RGB color frame to a binary frame using Otsu's global threshold; a depth range was then selected for hand segmentation, the two results were aligned, and finally the kNN algorithm with Euclidean distance was used for finger classification. Depth data and an RGB frame were used together for robust hand segmentation, and the segmented hand was matched with a dataset classifier to identify the fingertip [ 76 ]; this framework is based on the distance from the device and shape-based matching. Fingertips selected using a depth threshold and the K-curvature algorithm on depth data were presented in [ 77 ]. A novel segmentation method was implemented in [ 78 ] by integrating RGB and depth data, with classification performed using speeded-up robust features (SURF). Depth information with skeletal and color data were used in [ 79 ] to detect the hand; the segmented hand was then matched with the dataset using SVM and artificial neural networks (ANN) for recognition, and the authors concluded that ANN was more accurate than SVM. Figure 10 shows an example of segmentation using the Kinect depth sensor.

Figure 10. Depth-based recognition: (a) hand joint distance from the camera; (b) different feature extraction using the Kinect depth sensor.
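The depth-threshold and convex-hull ideas that recur in Table 5 can be sketched in a few lines. The snippet below is a simplified illustration (not any cited paper's code); it assumes a depth frame in millimetres, OpenCV 4, and uses a toy synthetic frame.

# Minimal sketch: depth-range segmentation followed by convexity-defect counting.
import numpy as np
import cv2

def segment_hand(depth_mm, near=250, far=650):
    """Keep only pixels whose depth lies in the hand's working range (mm)."""
    mask = ((depth_mm > near) & (depth_mm < far)).astype(np.uint8) * 255
    contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)  # OpenCV >= 4
    if not contours:
        return mask, None
    hand = max(contours, key=cv2.contourArea)        # assume the largest blob is the hand
    return mask, hand

def convexity_defect_count(contour, min_depth_px=20):
    """Deep convexity defects between extended fingers give a rough finger count."""
    hull = cv2.convexHull(contour, returnPoints=False)
    if hull is None or len(hull) < 4:
        return 0
    defects = cv2.convexityDefects(contour, hull)
    if defects is None:
        return 0
    return sum(1 for d in defects[:, 0] if d[3] / 256.0 > min_depth_px)

# Toy frame: background at 2000 mm, a rectangular "hand" at 400 mm.
depth = np.full((240, 320), 2000, dtype=np.uint16)
depth[60:180, 100:220] = 400
mask, hand = segment_hand(depth)
print("hand found:", hand is not None,
      "| defects:", convexity_defect_count(hand) if hand is not None else 0)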

2.2.6. 3D Model-Based Recognition

The 3D model approach essentially depends on a 3D kinematic hand model with a large number of degrees of freedom, where the hand parameters are estimated by comparing the input image with the two-dimensional appearance projected by the three-dimensional hand model. The 3D model represents the human hand for pose estimation by forming a volumetric, skeletal, or 3D model that matches the user's hand; the model parameters are updated through the matching process, and a depth parameter can be added to the model to increase accuracy. Table 6 presents a set of research papers based on 3D models.

Set of research papers that have used 3D model-based recognition for HCI, VR and human behavior application.

[ ] RGB camera; segmentation: network directly predicts the control points in 3D; feature: 3D hand poses, 6D object poses, object classes and action categories; classifier: PnP algorithm & single-shot neural network; error: fingertips 48.4 mm, object coordinates 23.7 mm; hardware: real-time speed of 25 fps on an NVIDIA Tesla M40; application: framework for understanding human behavior through 3D hand and object interactions; dataset: First-Person Hand Action (FPHA); runtime: 25 fps

[ ] PrimeSense depth cameras; segmentation: depth maps; feature: 3D hand pose estimation & sphere model renderings; classifier: pose estimation neural network; error: mean joint error 12.6 mm (stack = 1), 12.3 mm (stack = 2); application: hand pose estimation using a self-supervision method; dataset: NYU Hand Pose Dataset

[ ] RGB-D camera; segmentation: single RGB image fed directly to the network; feature: 3D hand shape and pose; classifier: networks trained with full supervision; error: mesh error 7.95 mm, pose error 8.03 mm; hardware: Nvidia GTX 1080 GPU; application: model for estimating 3D hand shape from a monocular RGB image; datasets: Stereo Hand Pose Tracking Benchmark (STB) & Rendered Hand Pose Dataset (RHD); runtime: 50 fps

[ ] Kinect V2 camera; segmentation: mask from the Kinect body tracker; feature: hand; classifier: machine learning; error: marker error on a 5% subset of the frames in each sequence & pixel classification error; hardware: CPU only; application: interactions with virtual and augmented worlds; datasets: FingerPaint dataset, NYU dataset used for comparison; runtime: high frame rate

[ ] Raw depth image; segmentation: CNN-based hand segmentation; feature: 3D hand pose regression pipeline; classifier: CNN-based algorithm; error: 3D joint location error 12.9 mm; hardware: Nvidia GeForce GTX 1080 Ti GPU; application: virtual reality (VR); dataset: 8000 original depth images created by the authors

[ ] Kinect V2 camera; segmentation: bounding box around the hand & hand mask; feature: hand; classifier: appearance and kinematics of the hand; error: percentage of template vertices over all frames; application: interaction with deformable objects & tracking; dataset: synthetic dataset generated with the Blender modeling software

[ ] RGB-D data from 3 Kinect devices; segmentation: regression-based method & hierarchical feature extraction; feature: 3D hand pose estimation; classifier: 3D hand pose estimation via semi-supervised learning; error: mean error 7.7 mm; hardware: NVIDIA TITAN Xp GPU; application: human–computer interaction (HCI), computer graphics and virtual/augmented reality; datasets (evaluation): ICVL, MSRA and NYU; runtime: 58 fps

[ ] Single depth images; segmentation: depth image; feature: 3D hand pose; classifier: 3D point cloud of the hand as network input, outputting heat-maps; error: mean error distances; hardware: Nvidia TITAN Xp GPU; application: HCI, computer graphics and virtual/augmented reality; datasets (evaluation): NYU, ICVL and MSRA; runtime: 41.8 fps

[ ] Depth images; segmentation: predicting heat maps of hand joints in detection-based methods; feature: hand pose estimation; classifier: dense feature maps through intermediate supervision in a regression-based framework; error: mean error 6.68 mm, maximal per-joint error 8.73 mm; hardware: GeForce GTX 1080 Ti; application: HCI, virtual and mixed reality; datasets (evaluation): 'HANDS 2017' challenge dataset & First-Person Hand Action

[ ] RGB-D cameras; feature: 3D hand pose estimation; classifier: weakly supervised method; error: mean error 0.6 mm; hardware: GeForce GTX 1080 GPU with CUDA 8.0; application: HCI, virtual and mixed reality; dataset: Rendered Hand Pose (RHD)

A study by Tekin et al. [ 80 ] proposed a new model for understanding interactions between 3D hands and objects from a single RGB image; the network is trained end-to-end and jointly estimates the hand and object poses in 3D. Wan et al. [ 81 ] proposed 3D hand pose estimation from a single depth map using a self-supervised neural network that approximates the hand surface with a set of spheres. A novel approach for estimating full 3D hand shape and pose from a single RGB image was presented by Ge et al. [ 82 ], in which a Graph Convolutional Neural Network (Graph CNN) is used to reconstruct a full 3D mesh of the hand surface. Another study, by Taylor et al. [ 83 ], proposed a new human hand tracking system that combines a surface model with a new energy function optimized continuously and jointly over pose and correspondences, and which can track the hand from several meters away from the camera. Malik et al. [ 84 ] proposed a novel CNN-based algorithm that automatically learns to segment the hand from a raw depth image and estimates the 3D hand pose, including the structural constraints of the hand skeleton. Tsoli et al. [ 85 ] presented a novel method to track a complex deformable object in interaction with a hand. Chen et al. [ 86 ] proposed the self-organizing hand network (SO-HandNet), which achieves 3D hand pose estimation via semi-supervised learning, using an end-to-end regression method on a single depth image. Another study by Ge et al. [ 87 ] proposed a point-to-point regression method for 3D hand pose estimation from single depth images. Wu et al. [ 88 ] proposed hand pose estimation from a single depth image that combines a detection-based method and a regression-based method to improve accuracy. Cai et al. [ 89 ] presented a way to adapt a weakly labeled real-world dataset from a fully annotated synthetic dataset with the aid of low-cost depth images, taking only RGB inputs for 3D joint prediction. Figure 11 shows an example of a 3D hand model interacting with a virtual system.

Figure 11. 3D hand model interaction with a virtual system [ 83 ].
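Most entries in Table 6 report a mean 3D joint position error in millimetres; a minimal sketch of how such a metric is typically computed is shown below (toy data, not taken from any cited work).

# Mean Euclidean distance between predicted and ground-truth joints, averaged over joints and frames.
import numpy as np

def mean_joint_error(pred, gt):
    """pred, gt: (n_frames, n_joints, 3) arrays of 3D joint positions in mm."""
    return float(np.linalg.norm(pred - gt, axis=-1).mean())

rng = np.random.default_rng(1)
gt = rng.uniform(-100, 100, size=(50, 21, 3))        # toy ground truth, 21 joints
pred = gt + rng.normal(scale=5.0, size=gt.shape)     # toy predictions, roughly 5 mm off
print(f"mean joint error: {mean_joint_error(pred, gt):.2f} mm")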

Some limitations have been reported: the 3D model approach requires a large dataset of images to capture the characteristic shapes of the hand in the multi-view case; the matching process is time-consuming and computationally costly; and the approach is less able to handle unclear views.

2.2.7. Deep-Learning Based Recognition

Artificial intelligence offers reliable techniques for a wide range of modern applications because it is based on learning. Deep learning uses multiple layers to learn from data and gives good predictions. The main challenge facing this technique is the dataset required to train the algorithm, which can affect processing time. Table 7 presents a set of research papers that use different techniques based on deep-learning recognition to detect the ROI.

Set of research papers that have used deep-learning-based recognition for hand gesture application.

[ ] Different mobile cameras; HD and 4K; segmentation: feature extraction by CNN; feature: hand gestures; classifier: adapted deep convolutional neural network (ADCNN); recognition rate: 100% (training set), 99% (test set); 7 hand gestures; application: HCI communication for people injured by stroke; dataset: created from recorded video frames; hardware: Core i7-6700 CPU @ 3.40 GHz

[ ] Webcam; segmentation: skin color detection and morphology & background subtraction; feature: hand gestures; classifier: deep convolutional neural network (CNN); recognition rate: 99.9% (training set), 95.61% (test set); 6 hand gestures; application: home appliance control (smart homes); dataset: 4800 images collected for training and 300 for testing

[ ] RGB image; 640 × 480 pixels; segmentation: none, image fed directly to the CNN after resizing; feature: hand gestures; classifier: deep convolutional neural network; recognition rate: 97.1% (simple backgrounds), 85.3% (complex backgrounds); 7 hand gestures; application: commanding consumer electronics devices such as mobile phones and TVs; dataset: Mantecón et al. dataset for direct testing; hardware: GPU with 1664 cores, base clock of 1050 MHz

[ ] Kinect; segmentation: skin color modeling combined with convolutional neural network image features; feature: hand gestures; classifier: convolutional neural network & support vector machine; recognition rate: 98.52%; 8 hand gestures; dataset: image information collected by Kinect; hardware: CPU E5-1620 v4, 3.50 GHz

[ ] Kinect; image size 200 × 200; segmentation: skin color (Y–Cb–Cr color space) & Gaussian mixture model; feature: hand gestures; classifier: convolutional neural network; recognition rate: 95.96% average; 7 hand gestures; application: human hand gesture recognition system; dataset: image information collected by Kinect

[ ] Recorded video sequences; segmentation: semantic-segmentation-based deconvolutional neural network; feature: hand gesture motion; classifier: long-term recurrent convolutional network (LRCN); recognition rate: 95%; 9 hand gestures; application: intelligent vehicle applications; dataset: Cambridge gesture recognition dataset; hardware: Nvidia GeForce GTX 980 graphics

[ ] Image; original images in the database 248 × 256 or 128 × 128 pixels; segmentation: Canny operator edge detection; feature: hand gesture; classifier: double-channel convolutional neural network (DC-CNN) & softmax classifier; recognition rate: 98.02%; 10 hand gestures; application: man–machine interaction; datasets: Jochen Triesch Database (JTD) & NAO Camera hand posture Database (NCD); hardware: Core i5 processor

[ ] Kinect; segmentation: skeleton-based hand gesture recognition; classifier: neural network based on SPD manifold learning; recognition rate: 85.39%; 14 hand gestures; datasets: Dynamic Hand Gesture (DHG) & First-Person Hand Action (FPHA); hardware: non-optimized CPU 3.4 GHz

In [ 90 ], the authors considered seven popular hand gestures captured by a mobile camera, generating 24,698 image frames; feature extraction and an adapted deep convolutional neural network (ADCNN) were used for hand classification, achieving 100% on the training data and 99% on the testing data, with an execution time of 15,598 s. Another proposed system used a webcam to track the hand, applied a skin color (Y–Cb–Cr color space) technique and morphology to remove the background, and used kernel correlation filters (KCF) to track the ROI; the resulting image was fed into a deep convolutional neural network (CNN), and two modified models based on AlexNet and VGGNet were compared, with recognition rates of 99.90% on training data and 95.61% on testing data [ 91 ]. A method based on a deep convolutional neural network, in which the resized image is fed directly into the network and the segmentation and detection stages are skipped, classifies hand gestures directly; the system works in real time and achieves 97.1% with simple backgrounds and 85.3% with complex backgrounds [ 92 ]. In [ 93 ], the depth image produced by the Kinect sensor was used to segment the color image, and skin color modeling was combined with a convolutional neural network, where the error back-propagation algorithm was applied to adjust the thresholds and weights of the network; an SVM classifier was added to the network to enhance the results. Another study used a Gaussian mixture model (GMM) to filter out non-skin colors from the images used to train a CNN to recognize seven hand gestures, achieving an average recognition rate of 95.96% [ 94 ]. A further system used a long-term recurrent convolutional network-based action classifier, in which multiple frames sampled from the recorded video sequence are fed to the network; to extract representative frames, a semantic-segmentation-based deconvolutional neural network is used, trained with tiled image patterns and tiled binary patterns [ 95 ]. A double-channel convolutional neural network (DC-CNN) was proposed in [ 96 ], where the original image is preprocessed to detect the edge of the hand before being fed to the network; each of the two CNN channels has separate weights, and a softmax classifier is used to classify the outputs, giving a recognition rate of 98.02%. Finally, a new neural network based on SPD manifold learning for skeleton-based hand gesture recognition was proposed in [ 97 ]. Figure 12 below shows an example of a deep learning convolutional neural network.

Figure 12. Simple example of a deep learning convolutional neural network architecture.
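For readers unfamiliar with the CNN classifiers surveyed above, the following is a minimal, illustrative Keras sketch of a gesture image classifier; the layer sizes and the seven-class output are placeholders rather than the architecture of any specific paper, and TensorFlow/Keras is assumed to be available.

# Minimal illustrative CNN gesture classifier (toy architecture, not a cited model).
import tensorflow as tf
from tensorflow.keras import layers, models

def build_gesture_cnn(input_shape=(64, 64, 3), n_gestures=7):
    return models.Sequential([
        layers.Input(shape=input_shape),
        layers.Conv2D(32, 3, activation="relu"),
        layers.MaxPooling2D(),
        layers.Conv2D(64, 3, activation="relu"),
        layers.MaxPooling2D(),
        layers.Flatten(),
        layers.Dense(128, activation="relu"),
        layers.Dense(n_gestures, activation="softmax"),   # one probability per gesture class
    ])

model = build_gesture_cnn()
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy", metrics=["accuracy"])
model.summary()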

3. Application Areas of Hand Gesture Recognition Systems

Research into hand gestures has become an exciting and relevant field; it offers a means of natural interaction and avoids the cost of wearable sensors such as data gloves. Conventional interactive methods depend on devices such as a mouse, keyboard, touch screen, joystick for gaming and consoles for machine control. The following sections describe some popular applications of hand gestures. Figure 13 shows the most common application areas addressed by hand gesture recognition techniques.

Figure 13. Most common application areas of hand gesture interaction systems (the image of Figure 13 is adapted from [ 12 , 14 , 42 , 76 , 83 , 98 , 99 ]).

3.1. Clinical and Health

During clinical operations, a surgeon may need details about the patient’s entire body structure or a detailed organ model in order to shorten the operating time or increase the accuracy of the result. This is achieved by using a medical imaging system such as MRI, CT or X-ray system [ 10 , 99 ], which collects data from the patient’s body and displays them on the screen as a detailed image. The surgeon can facilitate interaction with the viewed images by performing hand gestures in front of the camera using a computer vision technique. These gestures can enable some operations such as zooming, rotating, image cropping and going to the next or previous slide without using any peripheral device such as a mouse, keyboard or touch screen. Any additional equipment requires sterilization, which can be difficult in the case of keyboards and touch screen. In addition, hand gestures can be used for assistive purpose such as wheelchair control [ 43 ].

3.2. Sign Language Recognition

Sign language is an alternative method used by people who are unable to communicate with others by speech. It consists of a set of gestures wherein every gesture represents one letter, number or expression. Many research papers have proposed recognition of sign language for deaf-mute people, using a glove-attached sensor worn on the hand that gives responses according to hand movement. Alternatively, it may involve uncovered hand interaction with the camera, using computer vision techniques to identify the gesture. For both approaches mentioned above, the dataset used for classification of gestures matches a real-time gesture made by the user [ 11 , 42 , 50 ].

3.3. Robot Control

Robot technology is used in many application fields such as industry, assistive services [ 100 ], stores, sports and entertainment. Robotic control systems use machine learning techniques, artificial intelligence and complex algorithms to execute specific tasks, which lets the robotic system interact naturally with the environment and make independent decisions. Some research combines computer vision technology with a robot to build assistive systems for elderly people. Other research uses computer vision to enable a robot to ask a human for the proper path inside a specific building [ 12 ].

3.4. Virtual Environment

Virtual environments are based on a 3D model that needs a 3D gesture recognition system in order to interact in real time as a HCI. These gestures may be used for modification and viewing or for recreational purposes, such as playing a virtual piano. The gesture recognition system utilizes a dataset to match it with an acquired gesture in real time [ 13 , 78 , 83 ].

3.5. Home Automation

Hand gestures can be used efficiently for home automation. Shaking a hand or performing some gesture can easily enable control of lighting, fans, television, radio, etc. They can be used to improve older people’s quality of life [ 14 ].

3.6. Personal Computer and Tablet

Hand gestures can be used as an alternative input device that enables interaction with a computer without a mouse or keyboard, such as dragging, dropping and moving files through the desktop environment, as well as cut and paste operations [ 19 , 69 , 76 ]. Moreover, they can be used to control slide show presentations [ 15 ]. In addition, they are used with a tablet to permit deaf-mute people to interact with other people by moving their hand in front of tablet’s camera. This requires the installation of an application that translates sign language to text, which is displayed on the screen. This is analogous to the conversion of acquired voice to text.

3.7. Gestures for Gaming

The best example of gesture interaction for gaming purposes is the Microsoft Kinect Xbox, which has a camera placed over the screen and connects with the Xbox device through the cable port. The user can interact with the game by using hand motions and body movements that are tracked by the Kinect camera sensor [ 16 , 98 ].

4. Research Gaps and Challenges

From the previous sections, it is easy to identify the research gap: most research studies focus on computer applications, sign language and interaction with 3D objects in virtual environments, while many papers deal with enhancing frameworks for hand gesture recognition or developing new algorithms rather than executing practical applications in health care. The biggest challenge encountered by researchers is designing a robust framework that overcomes the most common issues with fewer limitations and gives accurate and reliable results. Most proposed hand gesture systems can be divided into two categories of computer vision techniques. The first, simpler, approach uses image processing techniques via the OpenNI or OpenCV libraries, possibly with other tools, to provide interaction in real time; this is computationally demanding because of the real-time processing and has limitations such as background issues, illumination variation, distance limits and multi-object or multi-gesture problems. The second approach matches the input gesture against a dataset of gestures, where considerably more complex patterns require complex algorithms; deep learning and other artificial intelligence techniques are used to match the interaction gesture in real time with dataset gestures that already contain specific postures or gestures. Although this approach can identify a large number of gestures, it has some drawbacks, such as missing gestures because of variations in classification accuracy. It also takes more time than the first approach because of dataset matching when the dataset is large, and the gesture dataset usually cannot be reused by other frameworks.

5. Conclusions

Hand gesture recognition addresses a shortcoming of existing interaction systems. Controlling things by hand is more natural, easier, more flexible and cheaper, and there are no hardware device problems to fix, since no extra hardware is required. The previous sections make clear that considerable effort is needed to develop reliable and robust algorithms, aided by camera sensors with suitable characteristics, in order to overcome common issues and achieve reliable results. Each technique mentioned above, however, has its advantages and disadvantages and may perform well on some challenges while being inferior on others.

Acknowledgments

The authors would like to thank the staff in Electrical Engineering Technical College, Middle Technical University, Baghdad, Iraq and the participants for their support to conduct the experiments.

Author Contributions

Conceptualization, A.A.-N. & M.O.; funding acquisition, A.A.-N. & J.C.; investigation, M.O.; methodology, M.O. & A.A.-N.; project administration, A.A.-N. and J.C.; supervision, A.A.-N. & J.C.; writing – original draft, M.O.; writing – review & editing, M.O., A.A.-N. & J.C. All authors have read and agreed to the published version of the manuscript.

This research received no external funding.

Conflicts of Interest

The authors of this manuscript have no conflicts of interest relevant to this work.

Research Article

Data glove-based gesture recognition using CNN-BiLSTM model with attention mechanism

Contributed equally to this work with: Jiawei Wu, Peng Ren

Authors: Jiawei Wu, Peng Ren, Boming Song, Ran Zhang, Chen Zhao

Affiliations: School of Medical Information and Engineering, Xuzhou Medical University, Xuzhou, China; Engineering Research Center of Medical and Health Sensing Technology, Xuzhou Medical University, Xuzhou, China

* E-mail: [email protected]


  • Published: November 17, 2023
  • https://doi.org/10.1371/journal.pone.0294174


As a novel form of human machine interaction (HMI), hand gesture recognition (HGR) has garnered extensive attention and research. The majority of HGR studies are based on visual systems, inevitably encountering challenges such as depth and occlusion. On the contrary, data gloves can facilitate data collection with minimal interference in complex environments, thus becoming a research focus in fields such as medical simulation and virtual reality. To explore the application of data gloves in dynamic gesture recognition, this paper proposes a data glove-based dynamic gesture recognition model called the Attention-based CNN-BiLSTM Network (A-CBLN). In A-CBLN, the convolutional neural network (CNN) is employed to capture local features, while the bidirectional long short-term memory (BiLSTM) is used to extract contextual temporal features of gesture data. By utilizing attention mechanisms to allocate weights to gesture features, the model enhances its understanding of different gesture meanings, thereby improving recognition accuracy. We selected seven dynamic gestures as research targets and recruited 32 subjects for participation. Experimental results demonstrate that A-CBLN effectively addresses the challenge of dynamic gesture recognition, outperforming existing models and achieving optimal gesture recognition performance, with the accuracy of 95.05% and precision of 95.43% on the test dataset.

Citation: Wu J, Ren P, Song B, Zhang R, Zhao C, Zhang X (2023) Data glove-based gesture recognition using CNN-BiLSTM model with attention mechanism. PLoS ONE 18(11): e0294174. https://doi.org/10.1371/journal.pone.0294174

Editor: Muhammad Bilal, University of Southampton - Malaysia Campus, MALAYSIA

Received: July 20, 2023; Accepted: October 26, 2023; Published: November 17, 2023

Copyright: © 2023 Wu et al. This is an open access article distributed under the terms of the Creative Commons Attribution License , which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Data Availability: The experimental data used in this article are subject to access restrictions, and the corresponding author does not have permission to make them public. If you have any questions, or if you would like to request access to the data set, please contact Heng Wan, Director of the Information Security Department at Xuzhou Medical University, at the following email: [email protected] .

Funding: This research was funded by The Unveiling & Leading Project of XZHMU, grant number No. JBGS202204. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

Competing interests: The authors have declared that no competing interests exist.

1. Introduction

With the rapid development of computer technology and artificial intelligence, Human Machine Interaction (HMI) has emerged as one of the most prominent research fields in contemporary times. The driving force behind HMI is our expectation that machines will become intelligent and perceptive like humans [ 1 ]. HMI refers to the process of exchanging information between humans and machines through effective dialogue. HMI systems can collect human-intended information and transform it into a format understandable by machines, enabling machines to operate based on human intent [ 2 ]. Traditional HMI primarily relies on tools such as joysticks, keyboards, and mice to control terminals, which usually require fixed operational spaces. This severely restricts the range of human expressive actions and diminishes work efficiency. Consequently, to enhance the naturalness of HMI, the next generation of HMI technology needs to be human-centric, diversified, and intelligent [ 3 ]. In real-life situations, besides verbal communication, gestures serve as one of the most significant means for humans to convey information, enabling direct and effective expression of user needs. Research conducted by Liu et al. pointed out that hand gestures constitute a significant part of human communication, with advantages including high flexibility and rich meaning, making them an important modality in HMI [ 4 ]. Consequently, Hand Gesture Recognition (HGR) has emerged as a new type of HMI technology and has become a research hotspot with enormous potential in various domains. For instance, in the healthcare domain, capturing and analyzing physiological characteristics related to finger movements can significantly assist in studying and developing appropriate rehabilitation postures [ 5 ]. In the field of mechanical automation, interaction between fingers and machines can be achieved by detecting finger motion trajectories [ 6 ]. In the field of virtual reality, defining different gesture commands allows users to control the movements of virtual characters from a first-person perspective [ 7 ].

Research on HGR can be classified into two categories based on the methods of acquiring gesture data: vision-based HGR and wearable device-based HGR. Vision-based HGR relies on cameras as the primary tools for capturing gesture data. They offer advantages such as low cost and no direct contact with the human hands. However, despite the success of high-quality cameras, vision-based systems still have some inherent limitations, including a restricted field of view and high computational costs [ 8 , 9 ]. In certain scenarios, robust results may require the combined data acquisition from multiple cameras due to issues like depth and occlusion [ 10 , 11 ]. Consequently, the presence of these aforementioned challenges often hinders vision-based HGR methods from achieving optimal performance. In recent years, wearable device-based HGR has witnessed rapid development due to advancements in sensing technology and widespread sensor applications. Compared to vision-based approaches, wearable device-based HGR eliminates the need to consider camera distribution and is less susceptible to external environmental factors such as lighting, occlusion, and background interference. Data gloves represent a typical example of wearable devices used in HGR. These gloves are equipped with position tracking sensors that enable real-time capture of spatial motion trajectory information of users’ hand postures. Based on predefined algorithms, gesture actions can be recognized, mapped to corresponding response modules, and thus complete the HMI process. HGR systems based on data gloves have become a research hotspot in the relevant field. These systems offer several advantages, including stable acquisition of gesture data, reduced interference from complex environments and satisfactory modeling and recognition results, especially when dealing with large-scale gesture data [ 12 ].

In the field of HGR, researchers primarily focus on two types of gestures: static gestures and dynamic gestures. Static HGR systems analyze hand posture data at a specific moment to determine its corresponding meaning. However, static gesture data only provide spatial information of hand postures at each moment, while temporal information of hand movements is disregarded. As a result, the actual semantic information conveyed is limited, making it challenging to extend to complex real-world applications. Dynamic HGR systems, on the other hand, deal with information regarding the changes in hand movement postures over a period of time. These systems require a comprehensive consideration of both spatial and temporal aspects of hand postures. Clearly, compared to static gestures, dynamic gestures can convey richer semantic information and better align with people’s actual needs in real-life scenarios. Although numerous research efforts have been dedicated to dynamic HGR algorithms, most are based on vision systems, and the challenge of dynamic HGR using data gloves remains.

The dynamic gesture investigated in this study is the seven-step handwashing, which is a crucial step in the healthcare field. Proper handwashing procedures can effectively reduce the probability of disease transmission. Our work applies the seven-step handwashing to medical simulation training, where users wear data gloves to perform the handwashing process. Additionally, we design an automated dynamic gesture recognition algorithm to assess whether users correctly execute the specified hand gesture steps. Specifically, we developed a data glove-based dynamic HGR algorithm in this paper by incorporating deep learning techniques. This algorithm considers both spatial and temporal information of gesture data. Firstly, the Convolutional Neural Network (CNN) is utilized to extract local features of gesture data at each moment. Subsequently, these features are incorporated into the Bidirectional Long Short-Term Memory (BiLSTM) structure to model the temporal relationships. Finally, an attention mechanism is employed to enhance the gesture features and output the recognition results of dynamic gestures. In summary, this paper makes three main contributions:

  • Within the context of medical simulation, a data glove-based seven-step handwashing dynamic hand gesture data collection process was defined, and dynamic hand gesture data from 32 subjects were collected following this procedure.
  • A novel data glove-based dynamic HGR algorithm, called Attention-based CNN-BiLSTM Network (A-CBLN), was designed by combining deep learning techniques with the characteristics of dynamic gesture data. A-CBLN integrates the advantages of CNN and BiLSTM, effectively capturing the spatiotemporal features of gesture data, and further enhancing the features using an attention mechanism, resulting in precise recognition of dynamic gestures.
  • Extensive experiments were conducted to verify the effectiveness of the A-CBLN algorithm for dynamic gesture recognition, and key parameter settings within A-CBLN were thoroughly discussed. The results obtained from the test dataset demonstrated that our proposed method outperformed other comparative algorithms in terms of accuracy, precision, recall and F1-score.

The remaining sections of this paper are organized as follows. In Section 2, we review recent works related to HGR, with a particular focus on data glove-based HGR methods. Section 3 provides a detailed description of the proposed algorithm for dynamic gesture recognition. Section 4 encompasses the data collection methodology for gestures and provides implementation details of the conducted experiments. The relevant experimental results and analysis are presented in Section 5, followed by a concise summary of this paper in Section 6.

2. Related works

In recent years, research in the HGR field has focused on two main aspects: the type of gesture data (static or dynamic) and the sensors used for data collection (visual systems or wearable devices). This section provides an overview of relevant studies in HGR, emphasizing research involving wearable devices like data gloves.

Static hand gesture recognition research primarily focuses on analyzing the spatial features of gesture data without considering its temporal variations. This type of research is primarily applied in sign language recognition scenarios. A static hand gesture recognition system based on wavelet transform and neural networks was proposed by Karami et al. [ 13 ]. The system operated by taking hand gesture images acquired by a camera as input and extracting image features using Discrete Wavelet Transform (DWT). These features were fed into a neural network for classification. In the experimental section, 32 Persian sign language (PSL) letter symbols were selected for investigation. The training was conducted on 416 images, while testing was performed on 224 images, resulting in a test accuracy of 83.03%. Thalange et al. [ 14 ] introduced two novel feature extraction techniques, Combined Orientation Histogram and Statistical (COHST) Features and Wavelet Features, to address the recognition of static symbols representing numbers 0 to 9 in American Sign Language. Hand gesture data was collected using a 5-megapixel network camera and processed with different feature extraction methods before input into a neural network for training. The proposed approach achieved an outstanding average recognition rate of 98.17%. Moreover, a novel data glove with 14 sensor units was proposed by Wu et al. [ 15 ], who explored its performance in static hand gesture recognition. They defined 10 static hand gestures representing digits 0–9 and collected data from 10 subjects, with 50% of the data used for training and the remaining 50% for testing. By employing a neural network for classification experiments, they achieved an impressive overall recognition accuracy of 98.8%. Lee et al. [ 16 ] introduced a knitted glove capable of pattern recognition for hand poses and designed a novel CNN model for hand gesture classification experiments. The experimental results demonstrated that the proposed CNN structure effectively recognized 10 static hand gestures, with classification accuracies ranging from 79% to 97% for different gestures and an average accuracy of 89.5%. However, they only recruited 10 subjects for the experiments. Antillon et al. [ 17 ] developed an intelligent diving glove capable of recognizing 13 static hand gestures for underwater communication. They employed five classical machine learning classification algorithms and conducted training on hand gesture data from 24 subjects, with testing performed on an independent group of 10 subjects. The experimental results indicated that all classification algorithms achieved satisfactory hand gesture recognition performance in dry environments, with accuracies ranging from 95% to 98%. The performance slightly declined in underwater experimental conditions, with accuracies ranging from 81% to 94%. Yuan et al. [ 18 ] developed a wearable gesture recognition system that can simultaneously recognize ten types of numeric gestures and nine types of complex gestures. They utilized the Multilayer Perceptron (MLP) algorithm to recognize 19 static gestures with 100% accuracy, showcasing the strong capabilities of deep learning technology in the field of HGR. However, it is worth noting that the sample data in their experimental section was derived solely from four male volunteers. Moreover, a data glove based on flexible sensors was utilized by Ge et al. [ 19 ] to accurately predict the final hand gesture before the completion of the user’s hand movement in real time. 
They constructed a gesture dataset called Flex-Gesture, which consisted of 16 common gestures, each comprising 3000 six-dimensional flexion data points. Additionally, they proposed a multimodal data feature fusion approach and employed a combination of neural networks and support vector machines (SVM) as classifiers. The system achieved a remarkable prediction accuracy of 98.29% with a prediction time of only 0.2329 ms. However, it should be noted that the data glove-based system had certain limitations as it did not consider temporal information in the hand gestures. It is worth mentioning that the authors believe that incorporating deep learning algorithms with temporal features analysis could potentially yield more effective results.

Unlike static gesture recognition, dynamic gesture recognition requires considering the spatial information of hand movements and their temporal variations. With the rapid advancement of deep learning techniques, researchers have extensively investigated structures such as Convolutional Neural Network (CNN) and Recurrent Neural Network (RNN) and applied them to real-time dynamic gesture recognition problems. Nguyen et al. [ 20 ] presented a novel approach for continuous dynamic gesture recognition using RGB video input. Their method comprises two main components: a gesture localization module and a gesture classification module. The former aims to separate gestures using a BiLSTM network to segment continuous gesture sequences. The latter aims to classify gestures and efficiently combine data from multiple channels, including RGB, optical flow, and 3D key pose positions, using two 3D CNNs and a Long Short-Term Memory (LSTM). The method was evaluated on three publicly available datasets, achieving an average Jaccard index of 0.5535. Furthermore, Paweł et al. [ 21 ] developed a system capable of rapidly and effectively recognizing hand gestures in hand-body language using a dedicated glove with ten sensors. Their experiments defined 22 hand-body language gestures and recorded 2200 gesture data samples (10 participants, each gesture action repeated 10 times). Three machine learning classifiers were employed for training and testing, resulting in a high sensitivity rate of 98.32%. The pioneering work of Emmanuel et al. [ 22 ] introduced the use of CNN for grasp classification using piezoelectric data gloves. Experimental data were collected from five participants, each performing 30 object grasps following Schlesinger’s classification method. The results demonstrated that the CNN architecture achieved the highest classification accuracy (88.27%). It is worth mentioning that the authors plan to leverage the strengths of both CNN and RNN in future work to improve gesture prediction accuracy. Lee et al. [ 23 ] developed a real-time dynamic gesture recognition data glove. They employed neural network structures such as LSTM, fully connected layers, and novel gesture localization and recognition algorithms. This allowed the successful classification of 11 dynamic finger gestures with a gesture recognition time of less than 12 ms. Yuan et al. [ 24 ] designed a data glove equipped with 3D flexible sensors and two wristbands and proposed a novel deep feature fusion network to capture fine-grained gesture information. They first fused multi-sensor data using a CNN structure with residual connections and then modeled long-range dependencies of complex gestures using LSTM. Experimental results demonstrated the effectiveness of this approach in classifying complex hand movements, achieving a maximum precision of 99.3% on the American Sign Language dataset. Wang et al. [ 25 ] combined attention mechanism with BiLSTM and designed a deep learning algorithm capable of effectively recognizing 10 types of dynamic gestures. Their proposed method achieved an accuracy of 98.3% on the test dataset, showing a 14.5% improvement compared to a standalone LSTM model. This indicates that incorporating attention mechanism can effectively enhance the model’s understanding of gesture semantics. Dong et al. [ 12 ] introduced a novel dynamic gesture recognition algorithm called DGDL-GR. 
Built upon deep learning, this algorithm combined CNN and temporal convolutional networks (TCN) to simultaneously extract temporal and spatial features of hand movements. They defined 10 gestures according to relevant standards and recruited 20 participants for testing. The experimental results demonstrated that DGDL-GR achieved the highest recognition accuracy (0.9869), surpassing state-of-the-art algorithms such as CNN and LSTM. Hu et al. [ 26 ] explored deep learning-based gesture recognition using surface electromyography (sEMG) signals and proposed a hybrid CNN and RNN structure with attention mechanism. In this framework, CNN was employed for feature extraction from sEMG signals, while RNN was utilized for modeling the temporal sequence of the signals. Experimental results on multiple publicly available datasets revealed that the performance of the hybrid CNN-RNN structure was superior to individual CNN and RNN modules.

Despite the existence of a large body of research on HGR, research on dynamic gesture recognition using data gloves is still limited, especially in exploring the feasibility of applying deep learning in this field. Therefore, this study focused on the intelligent recognition of handwashing steps in the context of medical simulation. We utilized data gloves as the medium for dynamic gesture data collection and selected the seven-step handwashing series of dynamic gestures as the research target. Specifically, we considered the characteristics of dynamic gestures, including local feature variations in spatial positions and temporal changes in sequences. We systematically combined structures such as CNN, BiLSTM, and attention mechanism and designed a deep learning algorithm for dynamic gesture recognition based on data gloves. The next section will provide a detailed introduction to the proposed algorithm framework.

3. Methodology

3.1. Convolutional neural network (CNN)

A classic CNN architecture was designed by LeCun et al. in 1998 [ 27 ], which achieved remarkable performance in handwritten digit recognition tasks. Compared to traditional neural network structures, CNN exhibits characteristics of local connectivity and weight sharing [ 28 ]. Consequently, CNN can improve the learning efficiency of neural networks and effectively avoid overfitting issues caused by excessive parameters. The classic CNN architecture consists of three components: the convolutional layer, the pooling layer, and the fully connected layer.

The convolutional layer's core component is the convolutional kernel (or weight matrix). Each convolutional kernel multiplies and sums the corresponding receptive field elements in the input data. This operation is repeated by sliding the kernel with a certain stride on the input data until the entire data has been processed for feature extraction. Finally, these feature maps are typically generated as the output of the convolutional layer through a non-linear activation function. It is worth mentioning that multiple convolutional kernels are usually chosen to extract more diverse features since each kernel extracts different feature information. ReLU [ 29 ] is the most popular activation function in CNNs; it retains the segments of input features that are greater than 0 and rectifies the remaining segments to 0.

The pooling layer, also known as the down-sampling layer, extracts the minor features of the input data using pooling kernels. Similar to the convolutional kernels, each pooling kernel slides over the input data with a certain stride, preserving either the maximum value or the average value of the elements within the corresponding receptive field. This process continues until the feature extraction of the entire data is completed. The pooling layer is typically placed after the convolutional layers to reduce the dimensionality of the feature maps, thereby reducing the computational complexity of the entire network.

In classification tasks, the input data undergoes feature extraction by passing through multiple convolutional and pooling layers, and the resulting feature maps are flattened and fed into the fully connected layer. The fully connected layer usually consists of a few hidden layers and a softmax classifier, which further extracts features from the data and outputs the probability distribution of each class.
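The convolution, ReLU and max-pooling operations described above can be illustrated with a toy one-dimensional NumPy example (not taken from the paper): a single shared kernel slides over the input, ReLU zeroes the negative responses, and max pooling keeps the largest value in each window.

# Toy illustration of weight sharing, ReLU and pooling on a 1-D signal.
import numpy as np

x = np.array([0.0, 1.0, 2.0, 3.0, 2.0, 1.0, 0.0, -1.0])    # toy 1-D input
kernel = np.array([-1.0, 0.0, 1.0])                         # same weights reused at every position

conv = np.array([np.dot(x[i:i + 3], kernel) for i in range(len(x) - 2)])
relu = np.maximum(conv, 0.0)                                 # keep positives, zero out the rest
pooled = relu.reshape(-1, 2).max(axis=1)                     # non-overlapping max pooling, width 2

print(conv, relu, pooled, sep="\n")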

3.2. Bidirectional long short-term memory (BiLSTM)

The RNN is a recursively connected neural network with short-term memory capability that has been widely applied in the analysis and prediction of time series data [ 30 ]. However, due to memory and information storage limitations, RNN faces challenges in effectively learning long-term dependencies in time sequences, and gradient vanishing is often encountered during training [ 31 ]. To overcome these challenges, Greff et al. proposed the LSTM network structure that exhibits long-range memory capabilities [ 32 ]. The LSTM structure achieves this by introducing memory cells to retain long-term historical information and employing different gate mechanisms to regulate the flow of information. In fact, gate mechanisms can be understood as a multi-level feature selection approach. Consequently, compared to RNN, LSTM offers more advantages in handling time series problems.

The classical LSTM unit is equipped with three gate functions to control the state of the memory cell, denoted as the forget gate f t , input gate i t and output gate o t . The forget gate f t determines which information should be retained from the previous cell state c t −1 to the current cell state c t . The input gate i t regulates the amount of information from the current input x t that should be stored in the current cell state c t . The output gate o t governs the amount of information from the current cell state c t that should be transmitted to the current hidden state h t . Fig 1 illustrates the internal structure of a LSTM unit.


https://doi.org/10.1371/journal.pone.0294174.g001

The LSTM unit has three inputs at time t: the current input x t , the previous hidden state h t −1 , and the previous cell state c t −1 . After being regulated by the gate functions, two outputs are obtained: the current hidden state h t and the current cell state c t . Specifically, the output of f t is obtained by linearly transforming the current input x t and the previous hidden state h t −1 , followed by the application of the sigmoid activation function. This process can be expressed by Formula 1 .

f t = σ ( w f · [ h t −1 , x t ] + b f )        (Formula 1)

Here, the weight matrix and bias vector of f t are represented by w f and b f , respectively. The sigmoid activation function, denoted by σ , is applied. The value of f t ranges from 0 to 1, where a value closer to 0 indicates that information will be discarded, and a value closer to 1 implies more information will be preserved. The computation process of the input gate i t is similar to that of f t , and the specific formula is as follows.

i t = σ ( w i · [ h t −1 , x t ] + b i )        (Formula 2)

LSTM addresses the issue of vanishing gradients during training by incorporating a series of gate mechanisms. However, as LSTM only propagates information in one direction, it can only learn forward features and not capture backward features. To overcome this limitation, Graves et al. introduced BiLSTM based on LSTM [ 33 ]. BiLSTM effectively combines a pair of forward and backward LSTM sequences, inheriting the advantages of LSTM while addressing the unidirectional learning problem. This integration allows BiLSTM to effectively capture contextual information in sequential data. From a temporal perspective, BiLSTM analyzes both the "past-to-future" and "future-to-past" directions of data flow, enabling better exploration of temporal features in the data and improving the utilization efficiency of the data and the predictive accuracy of the model.
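For reference, the standard LSTM update equations consistent with the gate descriptions above are reproduced below, together with the BiLSTM concatenation of the forward and backward hidden states; the paper's exact notation may differ slightly.

% Standard LSTM cell updates and BiLSTM concatenation (reference formulation):
\begin{aligned}
f_t &= \sigma\!\left(W_f\,[h_{t-1}, x_t] + b_f\right) \\
i_t &= \sigma\!\left(W_i\,[h_{t-1}, x_t] + b_i\right) \\
o_t &= \sigma\!\left(W_o\,[h_{t-1}, x_t] + b_o\right) \\
\tilde{c}_t &= \tanh\!\left(W_c\,[h_{t-1}, x_t] + b_c\right) \\
c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t \\
h_t &= o_t \odot \tanh(c_t) \\
h_t^{\mathrm{BiLSTM}} &= \left[\overrightarrow{h_t};\, \overleftarrow{h_t}\right]
\end{aligned}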


3.3. Attention mechanism (AM)


Finally, the weighted sum of a t and h t is computed to obtain the final output enhanced by the attention mechanism.

output = Σ t ( a t · h t )
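A minimal NumPy sketch of this attention pooling is given below; the tanh scoring function is an assumption for illustration, since the paper's exact formulas are not reproduced here.

# Score each time step, normalize with softmax to get a_t, then weighted-sum the hidden states h_t.
import numpy as np

def attention_pool(h, w, b):
    """h: (T, d) hidden states; w: (d,) and b: scalar scoring parameters (assumed form)."""
    scores = np.tanh(h @ w + b)                      # one score per time step
    a = np.exp(scores) / np.exp(scores).sum()        # softmax -> attention weights a_t
    return (a[:, None] * h).sum(axis=0), a           # weighted sum of h_t, plus the weights

rng = np.random.default_rng(2)
h = rng.normal(size=(30, 64))                        # e.g. 30 time steps, 64-dim BiLSTM output
context, weights = attention_pool(h, rng.normal(size=64), 0.0)
print(context.shape, weights.sum())                  # (64,) 1.0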

3.4. Attention-based CNN-BiLSTM network (A-CBLN)

This study aims to recognize the meaning conveyed by dynamic gesture data over time, which can be understood as a classification task for time series data. Building upon the previous discussions, it is highly conceivable that CNN can effectively extract local features from time series data, but may not capture long-range dependencies present in the data. The advantages of BiLSTM can overcome this limitation by learning from the forward and backward processes of dynamic gesture data, allowing the model to effectively capture the underlying long-term dependencies. Furthermore, the incorporation of attention mechanism can enhance the model’s semantic understanding of various gestures, thereby boosting the accuracy of gesture recognition. Therefore, in this paper, we proposed to combine CNN, BiLSTM, and the attention mechanism, presenting a novel framework for dynamic gesture recognition called Attention-based CNN-BiLSTM Network (A-CBLN). A-CBLN effectively integrates the advantages of different types of neural networks, thereby improving the predictive accuracy of dynamic gesture recognition. Fig 3 illustrates the pipeline of dynamic gesture recognition based on A-CBLN.

Fig 3: https://doi.org/10.1371/journal.pone.0294174.g003

Specifically, as shown in Fig 3, A-CBLN consists of five main components. The input layer transforms the data collected by the data glove into the model's input format T × L × 1, where T is the number of gesture-data samples collected within the specified time range, L is the feature dimension of the gesture data returned by the data glove, and 1 is the number of channels. The CNN layer performs feature extraction and dimensionality reduction using two convolution operations and one max-pooling operation. It is worth noting that we did not use 1D convolutions or standard square 2D kernels for feature extraction; instead, we used 2D convolutions with a kernel size of 1×3, which extract spatial features from the gesture data without mixing in the temporal dimension. The BiLSTM layer additionally models the long-term dependencies among the gesture features. Both the CNN layer and the BiLSTM layer use the ReLU activation function. The AM layer helps the network better understand the specific meaning of the gesture features. The FC layer uses fully connected layers to flatten the features and further reduce their dimensionality, and finally outputs the probability prediction for the current dynamic gesture through the softmax function. Table 1 presents the specific parameter settings for each network layer in A-CBLN.

Table 1: https://doi.org/10.1371/journal.pone.0294174.t001
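Since Table 1 is not reproduced here, the exact layer widths are unknown; the following Keras sketch therefore only illustrates the described layer ordering (two 1×3 convolutions, max-pooling, two BiLSTM layers, the attention layer sketched in Section 3.3, and fully connected layers with softmax), using the window dimensions described later in Section 4.3. All filter counts and unit counts are placeholder assumptions.

```python
import tensorflow as tf

T, L, NUM_CLASSES = 180, 128, 7        # window length, glove features, gesture classes

def build_a_cbln_sketch():
    inputs = tf.keras.Input(shape=(T, L, 1))
    # CNN block: two 1x3 convolutions (spatial axis only) plus max-pooling, ReLU throughout.
    x = tf.keras.layers.Conv2D(16, (1, 3), padding="same", activation="relu")(inputs)
    x = tf.keras.layers.Conv2D(16, (1, 3), padding="same", activation="relu")(x)
    x = tf.keras.layers.MaxPooling2D(pool_size=(1, 2))(x)
    # Flatten the feature/channel axes so each time step becomes one vector.
    x = tf.keras.layers.Reshape((T, -1))(x)
    # Two stacked BiLSTM layers model long-term temporal dependencies.
    x = tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(8, return_sequences=True))(x)
    x = tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(8, return_sequences=True))(x)
    # Attention layer (sketched in Section 3.3) collapses the time axis.
    x = TemporalAttention()(x)
    # Fully connected layers and softmax classifier.
    x = tf.keras.layers.Dense(32, activation="relu")(x)
    outputs = tf.keras.layers.Dense(NUM_CLASSES, activation="softmax")(x)
    return tf.keras.Model(inputs, outputs)
```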

Algorithm 1. The pseudocode of A-CBLN .

Input : Gesture Dataset X , Gesture Labels y

Output : Trained model weights: w *

Parameter :

Batch size: 64

Best validation accuracy: 0

1. Load training dataset and validation dataset from X and y ;

2. Randomly initialize weight w ;

3. Start training and validation;

4. For each epoch do :

5.    For each batch ( X train , y train ) in training dataset do :

6.      F 1 is obtained by using two convolution layers on X train ;

7.      F 2 is obtained by using a max-pooling layer on F 1 ;

8.      F 3 is obtained by using two BiLSTM layers on F 2 ;

9.      The prediction ŷ is obtained by applying the attention mechanism layer, the fully connected layers, and the softmax function to F 3 ;

10.     Update weights of the model using the categorical cross-entropy loss function with the Adam optimizer;

11. Calculate the accuracy of the model on the validation dataset denoted as V acc ;

12. If V acc > Best Validation accuracy:

13.     Save Trained model weights w *

14.     Update the value of Best Validation accuracy to the value of V acc

15. End training and validation
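Assuming the data are already windowed into NumPy arrays, one compact way to realize this training and validation loop in Keras is to let model.fit handle the batching while a ModelCheckpoint callback keeps the weights with the best validation accuracy, mirroring steps 11–14. The sketch below reuses the model sketch from Section 3.4 and generates random placeholder data purely so that it runs end to end; the file name and array shapes are assumptions.

```python
import numpy as np
import tensorflow as tf

# Placeholder data with the layout described in Section 4.3 (real data would be
# loaded from the recorded glove files): N windows of shape (180, 128, 1), 7 classes.
X_train = np.random.rand(128, 180, 128, 1).astype("float32")
y_train = tf.keras.utils.to_categorical(np.random.randint(0, 7, 128), 7)
X_val = np.random.rand(32, 180, 128, 1).astype("float32")
y_val = tf.keras.utils.to_categorical(np.random.randint(0, 7, 32), 7)

model = build_a_cbln_sketch()                  # model sketch from Section 3.4
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=1e-3),
              loss="categorical_crossentropy", metrics=["accuracy"])

# Keep the weights with the best validation accuracy (steps 11-14 of Algorithm 1).
checkpoint = tf.keras.callbacks.ModelCheckpoint(
    "a_cbln_best.h5", monitor="val_accuracy",
    save_best_only=True, save_weights_only=True)

model.fit(X_train, y_train, validation_data=(X_val, y_val),
          epochs=50, batch_size=64, callbacks=[checkpoint])
```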

4. Experiments

4.1. Data glove

The wearable sensor gesture data extraction device used in this study is provided by the VRTRIXTM Data Glove( http://www.vrtrix.com.cn/ ). The core component of this glove is a 9-axis MEMS (Micro Electro Mechanical System) inertial sensor, which can capture real-time motion data related to finger joints and enable the reproduction of the hand postures assumed by the operator during motion execution. The transmission of data from the glove employs wireless transmission technology, where the data captured by the sensors on both hands can be wirelessly transmitted to a personal computer (PC) through the wireless transmission module on the back of the hand for real-time rendering. In addition, the VRTRIXTM Data Glove provides a low-level Python API interface, allowing users to access joint pose data of the data glove, facilitating secondary development. It has been widely used in fields such as industrial simulation, mechanical control, and scientific research data acquisition.

Once the data glove is properly worn, the left hand has a total of 11 inertial sensors for capturing finger gestures. Specifically, each finger is assigned 2 sensors, while 1 sensor is allocated to the back of the hand. The number and distribution of sensors on the right hand are identical to those on the left hand. Table 2 presents the key parameters information of the data glove used in this study.

Table 2: https://doi.org/10.1371/journal.pone.0294174.t002

4.2. Gesture definition

This study sought to explore the applications of dynamic gesture recognition in the field of medical virtual simulation based on wearable devices (data gloves). We first comprehensively reviewed the existing literature on dynamic gesture recognition. As mentioned in Section 2, most publicly available dynamic gesture datasets are based on visual systems, with only a few studies utilizing wearable devices. Therefore, we created a new dynamic gesture dataset based on the common seven-step handwashing in medical virtual simulation systems in conjunction with the data gloves. We followed the handwashing method recommended by the World Health Organization (WHO) [ 36 ] and established a complete handwashing procedure comprising seven steps. More details on these steps are presented in Table 3 .

Table 3: https://doi.org/10.1371/journal.pone.0294174.t003

4.3. Data acquisition and preprocessing

With the approval of the Medical Ethics Committee of the Affiliated Hospital of Xuzhou Medical University, 32 healthy subjects were recruited for this study. Data acquisition was organized and conducted between January 5, 2023 and March 25, 2023. Prior to gesture data collection, each subject was required to sign a consent form granting permission for their data to be used in the study and was informed of the specific steps involved in data collection. To ensure precise expression of the gesture actions while wearing the data gloves, participants initially unacquainted with the seven-step handwashing received training sessions conducted by healthcare professionals until all subjects could correctly perform the hand gestures while wearing the gloves. Additionally, a timekeeper was assigned to prompt the start and end of each gesture action and record the corresponding time information.

Once the subject had correctly put on the data gloves as instructed, gesture data collection proceeded as follows:

  • Subjects kept their hands in the initial position, with both hands on the same horizontal plane, palms facing upward, and not more than 20cm apart.
  • The timekeeper issued the instruction to start the action, recorded the current time, and subjects began repeatedly performing the current gesture action within a 15-second interval. After 15 seconds, the timekeeper instructed to end the action and recorded the end time.
  • Subjects returned their hands to the initial position and prepared to collect data for the next gesture action following the same procedure as in step 2.

Fig 4 illustrates the specific flow of gesture data acquisition. The data gloves used in this study provided a Python API interface, facilitating the recording of gesture data using Python scripts. The data for each subject were stored in individual folders named after their respective names. Additionally, subjects were requested to repeat the gesture collection process five times to increase the dataset size. Once data collection from all subjects was completed, the data were exported for further processing and analysis.

Fig 4: https://doi.org/10.1371/journal.pone.0294174.g004

Specifically, the archival structure for each subject encompassed a set of five folders, and each folder consisted of seven dynamic gesture data files in text format. The data sampling frequency was set at 60 Hz. We used a 3 s time window to slide over and segment the 15 s data of each sample without overlap, since the actions within 3 s already contained the specific semantics of the current gesture. In summary, the sample size used for dynamic gesture modeling analysis in this study was 5,600, with each sample having a data dimension of (180×128×1). Here, 180 represents the number of gesture samples within 3 seconds, 128 represents the joint data returned by the data glove sensors, and 1 denotes the number of channels. Finally, to enhance the training of the gesture recognition model, a min-max scaling technique was applied to rescale the data intensity of all samples to the range [0, 1] using Formula 13.

$$f_{\text{norm}} = \frac{f - f_{\min}}{f_{\max} - f_{\min}} \quad (13)$$

Here, f represents the input data, f norm refers to the normalized data, f min and f max represent the minimum and maximum values of the input data, respectively.
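Under the stated settings (60 Hz sampling, non-overlapping 3 s windows, per-sample min–max scaling), the segmentation and normalization can be sketched as follows; the array layout of a raw recording (frames × 128 joint features) is an assumption.

```python
import numpy as np

def segment_and_normalize(recording, fs=60, window_s=3):
    """Split one 15 s recording (shape: frames x 128 joint features) into
    non-overlapping 3 s windows and min-max scale each window to [0, 1]."""
    win = fs * window_s                                   # 180 frames per window
    n_windows = recording.shape[0] // win
    samples = []
    for k in range(n_windows):
        w = recording[k * win:(k + 1) * win].astype(np.float32)
        f_min, f_max = w.min(), w.max()
        w_norm = (w - f_min) / (f_max - f_min + 1e-8)     # Formula 13
        samples.append(w_norm[..., np.newaxis])           # add channel dim -> (180, 128, 1)
    return np.stack(samples)

# e.g. a 15 s recording at 60 Hz: 900 frames x 128 features -> 5 windows
windows = segment_and_normalize(np.random.rand(900, 128))
print(windows.shape)  # (5, 180, 128, 1)
```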

To evaluate the performance of the proposed gesture recognition model, we divided the data into training, validation, and test datasets at a ratio of 8:1:1. Accordingly, the data from 26 subjects were used for training, while the remaining 6 subjects were evenly split between the validation and test datasets.
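Because the split is made at the subject level (26/3/3 subjects) rather than over individual windows, the partitioning can be sketched as below; the seed and the use of integer subject identifiers are assumptions for illustration.

```python
import numpy as np

def split_subjects(subject_ids, seed=42):
    """Shuffle the subjects and split them 26 / 3 / 3 into train / val / test."""
    rng = np.random.default_rng(seed)
    ids = rng.permutation(subject_ids)
    return ids[:26], ids[26:29], ids[29:32]

train_ids, val_ids, test_ids = split_subjects(np.arange(32))
```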

4.4. Implementation details

To validate the effectiveness of the proposed dynamic gesture recognition algorithm, we selected three deep learning algorithms related to gesture recognition research for comparison:

  • LSTM [ 37 ]: The model consists of 1 LSTM structure with 3 fully connected layers.
  • Attention-BiLSTM [ 25 ]: The model consists of two BiLSTM layers, an attention mechanism layer and a softmax classifier.
  • CNN-LSTM [ 38 ]: The model consists of a mixture of 2D convolutional layers, LSTM layers and fully connected layers.

All the experimental code in this study was written in Python (version 3.8). The deep learning algorithms were implemented using the TensorFlow framework (version 2.9.0). To ensure a fair comparison of the performance of each deep learning algorithm, we used the same training parameters throughout model training: 50 training epochs and a batch size of 64. Since this work involves a typical multi-class classification task, we employed the cross-entropy loss to measure the error between the model's predictions and the true labels. We used the Adam optimizer [ 37 ] to update the model parameters, with the initial learning rate set to 0.001, beta1 set to 0.9, and beta2 set to 0.999. Validation was performed after each training epoch, and the model with the lowest validation loss was saved for subsequent testing.
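These hyperparameters map directly onto the TensorFlow 2 API; for reference, the corresponding optimizer and loss objects would be configured as follows (this mirrors the stated settings and is not the authors' released code).

```python
import tensorflow as tf

# Adam with lr=0.001, beta1=0.9, beta2=0.999, combined with categorical
# cross-entropy for the 7-class problem; epochs=50 and batch_size=64 are
# then passed to model.fit(), as in the training sketch of Section 3.4.
optimizer = tf.keras.optimizers.Adam(learning_rate=0.001, beta_1=0.9, beta_2=0.999)
loss_fn = tf.keras.losses.CategoricalCrossentropy()
```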

4.5. Evaluation metrics

We evaluated all models with four commonly used classification metrics, computed from the confusion counts defined below:

$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}, \qquad \text{Precision} = \frac{TP}{TP + FP}$$

$$\text{Recall} = \frac{TP}{TP + FN}, \qquad \text{F1-score} = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}$$

Here, TP represents the number of true positive samples, which are the samples that are correctly predicted as positive. TN represents the number of true negative samples, which are the samples that are correctly predicted as negative. FP represents the number of false positive samples, which are the samples that are actually negative but predicted as positive. FN represents the number of false negative samples, which are the samples that are actually positive but predicted as negative.
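For a seven-class task these counts are typically accumulated per class and the derived metrics averaged across classes; a scikit-learn sketch is shown below, where the macro averaging mode is an assumption since the paper's averaging scheme is not stated here.

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

def evaluate(y_true, y_pred):
    """Accuracy plus macro-averaged precision, recall, and F1 for a multi-class task."""
    return {
        "accuracy": accuracy_score(y_true, y_pred),
        "precision": precision_score(y_true, y_pred, average="macro"),
        "recall": recall_score(y_true, y_pred, average="macro"),
        "f1": f1_score(y_true, y_pred, average="macro"),
    }

print(evaluate([0, 1, 2, 2, 3], [0, 1, 2, 1, 3]))  # toy labels for illustration
```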

5. Results and analysis

This section presents and analyzes the effectiveness of all models for dynamic gesture recognition from multiple perspectives. It includes a comparative analysis of the learning capabilities of the different models and of their predictive performance on the test dataset. Additionally, we conducted experiments to examine the impact of key parameters of A-CBLN, namely the kernel size in the convolutional layers and the number of neurons in the BiLSTM layer; the findings provide insight into the optimal configuration of A-CBLN for gesture recognition. Finally, we analyze and discuss the confusion matrix produced by A-CBLN on the test dataset.

5.1. Comparative analysis between different models

We first analyzed the learning progress of the models during training. Fig 5 shows that, as the number of training epochs increases, the validation accuracy of all models gradually improves and then stabilizes. This indicates that all models possess a certain learning capability and that overfitting does not occur during training. Further analysis revealed that the single LSTM structure exhibits the lowest learning capability, reaching its highest validation accuracy of 88.95% at 50 epochs. This may be because the pure LSTM structure fails to focus on the local features within the dynamic handwashing steps; actions such as rubbing or rotating are crucial for understanding the semantic meaning conveyed by the gestures. In contrast, the best validation accuracy of the Attention-BiLSTM structure is higher, peaking at 45 epochs (92.77%); nevertheless, its training progress is unstable, a limitation also attributable to its limited ability to capture local features. By combining CNN and LSTM, the model can perceive both the local spatial features of dynamic gestures and their temporal variations; as a result, recognition ability improves markedly, and the model achieves an accuracy of 93.71% at 48 epochs. Finally, our proposed A-CBLN adds attention mechanisms, further enhancing the model's understanding of the different gesture semantics. Consequently, it exhibits the strongest learning capability during training: its validation accuracy stabilizes and consistently outperforms the other models after 18 epochs, peaking at 32 epochs (93.62%).

Fig 5: https://doi.org/10.1371/journal.pone.0294174.g005

The model with the best performance on the validation dataset was preserved for further analysis of their performance on the test dataset. As shown in Table 4 , all models perform well on the test dataset, with prediction accuracy exceeding 87%. Further observation reveals that the pure LSTM and Attention-BiLSTM models have relatively lower prediction accuracy (87.43% and 91.43% respectively), while the hybrid CNN-LSTM structure significantly improves the prediction accuracy to 93.38%. This is consistent with our previous analysis, indicating that the hybrid CNN-LSTM structure possesses stronger feature extraction capability for dynamic gesture data. Finally, our proposed A-CBLN model demonstrates the best predictive performance for dynamic gestures, achieving optimal values in all evaluation metrics, with an accuracy of 95.05%, precision of 95.43%, recall of 95.25%, and F1-score of 95.22%. Compared to the pure LSTM structure, it improves by 7.62%, 5.84%, 7.32%, and 7.78% in accuracy, precision, recall, and F1-score, respectively.

Table 4: https://doi.org/10.1371/journal.pone.0294174.t004

5.2. Different size of convolution kernels of the A-CBLN

The choice of kernel size in the convolutional layers determines the receptive field used to extract local features, so selecting an appropriate kernel size is crucial for model performance. We conducted a comparative analysis of four kernel sizes (1×2, 1×3, 1×5, and 1×7) and their impact on the recognition performance of the A-CBLN algorithm. Fig 6 shows that recognition performance first improves and then declines as the kernel size increases. Larger convolutional kernels reduce the overall recognition performance of the model because, while enlarging the receptive field, they also extract redundant features. When the kernel size is set to 1×3, A-CBLN achieves its best accuracy, precision, recall, and F1-score, with peak values of 93.94%, 94.60%, 94.02%, and 93.98%, respectively.

Fig 6: https://doi.org/10.1371/journal.pone.0294174.g006

5.3. Number of neurons in BiLSTM of the A-CBLN

The number of neurons in the BiLSTM layer also influences the recognition performance of the A-CBLN algorithm. In this section, we discussed four different neuron quantities: 2, 4, 8, and 16. As shown in Fig 7 , the recognition performance of the A-CBLN algorithm initially improves and then declines with an increase in the number of neurons in the BiLSTM layer. When the neuron quantity is set to 8, the A-CBLN algorithm achieves the optimal performance in terms of accuracy, precision, recall, and F1-score. The corresponding performance metrics reach their peak values of 92.56%, 93.63%, 92.64%, and 92.54%, respectively.

Fig 7: https://doi.org/10.1371/journal.pone.0294174.g007

5.4. Confusion matrix of the A-CBLN on the test dataset

Finally, we separately discuss and analyze the prediction results of the A-CBLN algorithm on the test dataset. As shown in Fig 8, the values on the main diagonal of the confusion matrix represent the percentage of correctly predicted samples in each gesture category, while the remaining positions indicate cases where the model predicts a given gesture as another category. A-CBLN achieves recognition accuracy above 85% for all seven handwashing steps. Specifically, the model achieves perfect recognition for the gestures in steps 1, 5, and 7, as these gestures exhibit distinct spatial features. However, recognition of the step 3 actions is poorer, with approximately 15% of the samples incorrectly classified as step 2. This may be attributed to the similarity between the hand gestures in these two steps, which involve actions such as "finger crossing" and "mutual friction" that the two convolutional layers in A-CBLN may struggle to differentiate. There are also some recognition errors for the handwashing actions in steps 4 and 6, likely due to similar actions such as "finger bending" and "rotational friction" that lead to misjudgment by the model. Overall, A-CBLN demonstrates good recognition performance across the seven dynamic gestures, with an average accuracy exceeding 95%.

Fig 8: https://doi.org/10.1371/journal.pone.0294174.g008
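A row-normalized confusion matrix of the kind analyzed above can be computed directly from the test predictions; the sketch below assumes integer class labels 0–6 for the seven handwashing steps.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

def normalized_confusion(y_true, y_pred, n_classes=7):
    """Row-normalized confusion matrix: entry (i, j) is the percentage of
    true class i samples that were predicted as class j."""
    cm = confusion_matrix(y_true, y_pred, labels=np.arange(n_classes)).astype(float)
    row_sums = cm.sum(axis=1, keepdims=True)
    return 100.0 * cm / np.where(row_sums == 0, 1, row_sums)

# Toy example with 7 gesture classes.
print(normalized_confusion(np.random.randint(0, 7, 50), np.random.randint(0, 7, 50)))
```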

6. Conclusion

This paper aims to investigate the problem of dynamic gesture recognition based on data gloves. Based on deep learning techniques, we proposed a dynamic gesture recognition algorithm called A-CBLN, which combines structures such as CNN, BiLSTM, and attention mechanism to capture the spatiotemporal features of dynamic gestures to the maximum extent. We selected the commonly used seven-step handwashing method in the medical simulation domain as the research subject and validated the performance of the proposed model in recognizing the seven dynamic gestures. The experimental results demonstrated that our proposed approach effectively addresses the task of dynamic gesture recognition and achieved superior prediction results compared to similar models, with the accuracy of 95.05%, precision of 95.43%, recall of 95.25%, and F1-score of 95.22% on the test dataset. In the future, we plan to further improve our approach in the following aspects: (1) design more efficient feature extraction modules to enhance the discriminability of gestures with similar action sequences; (2) recruit more subjects to increase the dataset size and improve the model’s generalization ability; (3) explore the fusion of multimodal data captured by infrared cameras to enhance the recognition performance of the model.

  • 33. Graves A, Fernández S, Schmidhuber J. Bidirectional LSTM networks for improved phoneme classification and recognition. International conference on artificial neural networks. Berlin, Heidelberg: Springer Berlin Heidelberg, 2005: 799–804. https://doi.org/10.1007/11550907_126


Hand Gesture Recognition Methods and Applications: A Literature Survey


Cited by:

  • Dayananda Kumar, N.; Suresh, K.; Dinesh, R. CNN based Static Hand Gesture Recognition using RGB-D Data. In Proceedings of the 2022 2nd International Conference on Artificial Intelligence and Signal Processing (AISP), 12 February 2022; pp. 1–6. https://doi.org/10.1109/AISP53593.2022.9760658
  • Xia, C.; Saito, A.; Sugiura, Y. Using the virtual data-driven measurement to support the prototyping of hand gesture recognition interface with distance sensor. Sensors and Actuators A: Physical 2022, 338, 113463. https://doi.org/10.1016/j.sna.2022.113463
  • Zhou, X.; Guo, Y.; Jia, L.; Jin, Y.; Li, H.; Xue, C. A study of button size for virtual hand interaction in virtual environments based on clicking performance. Multimedia Tools and Applications 2022. https://doi.org/10.1007/s11042-022-14038-w


Recommendations

Robust Hand Gesture Recognition with Kinect Sensor

Hand gesture based Human-Computer-Interaction (HCI) is one of the most natural and intuitive ways to communicate between people and machines, since it closely mimics how human interact with each other. In this demo, we present a hand gesture recognition ...

A Survey on Hand Gesture Recognition

Hand gesture recognition has become one of the key techniques of human-computer interaction (HCI). Many researchers are devoted in this field. In this paper, firstly the history of hand gesture recognition is discussed and the technical difficulties are ...

Finger identification and hand gesture recognition techniques for natural user interface

The natural user interface using hand gesture have been popular field in Human-Computer-Interaction(HCI). Many research papers have been proposed in this field. They proposed vision-based, glove-based and depth-based approach for hand gesture ...


Author Tags

  • computer vision
  • deep learning
  • feature extraction
  • hand gesture recognition
  • sign language



A Structured and Methodological Review on Vision-Based Hand Gesture Recognition System


1. Introduction

1.1. Background
1.2. Survey Methodology
1.3. Research Gaps and New Research Challenges
1.4. Contribution
1.5. Research Questions

  • What are the main difficulties faced in gesture recognition?
  • What are some challenges faced with gesture recognition?
  • What are the major algorithms involved in gesture recognition?

1.6. Organization of the Work

2. Hand Gesture Types
3. Recognition Technologies of Hand Gesture
3.1. Technology Based on Sensor
3.1.1. Techniques for Recognizing Hand Gestures Using Impulse Radio Signals
3.1.2. Ultrasonic Hand Gesture Recognition Techniques
3.2. Technology Based on Vision
4. Significant Research Works on Hand Gesture Recognition
4.1. Data Augmentation
4.2. Deep Learning for Gesture Recognition
4.3. Summary
5. Conclusions
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest

  • Gupta, H.P.; Chudgar, H.S.; Mukherjee, S.; Dutta, T.; Sharma, K. A continuous hand gestures recognition technique for human-machine interaction using accelerometer and gyroscope sensors. IEEE Sens. J. 2016 , 16 , 6425–6432. [ Google Scholar ] [ CrossRef ]
  • Xie, R.; Cao, J. Accelerometer-based hand gesture recognition by neural network and similarity matching. IEEE Sens. J. 2016 , 16 , 4537–4545. [ Google Scholar ] [ CrossRef ]
  • Rautaray, S.S.; Agrawal, A. Vision based hand gesture recognition for human computer interaction: A survey. Artif. Intell. Rev. 2015 , 43 , 1–54. [ Google Scholar ] [ CrossRef ]
  • Zhang, Q.-Y.; Lu, J.-C.; Zhang, M.-Y.; Duan, H.-X. Hand gesture segmentation method based on YCbCr color space and K-means clustering. Int. J. Signal Process. Image Process. Pattern Recognit. 2015 , 8 , 105–116. [ Google Scholar ] [ CrossRef ]
  • Lai, H.Y.; Lai, H.J. Real-time dynamic hand gesture recognition. In Proceedings of the 2014 International Symposium on Computer, Consumer and Control, Taichung, Taiwan, 10–12 June 2014; pp. 658–661. [ Google Scholar ]
  • Hasan, M.M.; Mishra, P.K. Features fitting using multivariate gaussian distribution for hand gesture recognition. Int. J. Comput. Sci. Emerg. Technol. Ijcset 2012 , 3 , 73–80. [ Google Scholar ]
  • Bargellesi, N.; Carletti, M.; Cenedese, A.; Susto, G.A.; Terzi, M. A random forest-based approach for hand gesture recognition with wireless wearable motion capture sensors. IFAC-PapersOnLine 2019 , 52 , 128–133. [ Google Scholar ] [ CrossRef ]
  • Cho, Y.; Lee, A.; Park, J.; Ko, B.; Kim, N. Enhancement of gesture recognition for contactless interface using a personalized classifier in the operating room. Comput. Methods Programs Biomed. 2018 , 161 , 39–44. [ Google Scholar ] [ CrossRef ]
  • Zhao, H.; Ma, Y.; Wang, S.; Watson, A.; Zhou, G. MobiGesture: Mobility-aware hand gesture recognition for healthcare. Smart Health 2018 , 9 , 129–143. [ Google Scholar ] [ CrossRef ]
  • Tavakoli, M.; Benussi, C.; Lopes, P.A.; Osorio, L.B.; de Almeida, A.T. Robust hand gesture recognition with a double channel surface EMG wearable armband and SVM classifier. Biomed. Signal Process. Control. 2018 , 46 , 121–130. [ Google Scholar ] [ CrossRef ]
  • Zhang, Y.; Chen, Y.; Yu, H.; Yang, X.; Lu, W.; Liu, H. Wearing-independent hand gesture recognition method based on EMG armband. Pers. Ubiquitous Comput. 2018 , 22 , 511–524. [ Google Scholar ] [ CrossRef ]
  • Li, Y.; He, Z.; Ye, X.; He, Z.; Han, K. Spatial temporal graph convolutional networks for skeleton-based dynamic hand gesture recognition. Eurasip J. Image Video Process. 2019 , 2019 , 78. [ Google Scholar ] [ CrossRef ]
  • Alonso, D.G.; Teyseyre, A.; Soria, A.; Berdun, L. Hand gesture recognition in real world scenarios using approximate string matching. Multimed. Tools Appl. 2020 , 79 , 20773–20794. [ Google Scholar ] [ CrossRef ]
  • Zhang, T.; Lin, H.; Ju, Z.; Yang, C. Hand Gesture recognition in complex background based on convolutional pose machine and fuzzy Gaussian mixture models. Int. J. Fuzzy Syst. 2020 , 22 , 1330–1341. [ Google Scholar ] [ CrossRef ] [ Green Version ]
  • Tam, S.; Boukadoum, M.; Campeau-Lecours, A.; Gosselin, B. A fully embedded adaptive real-time hand gesture classifier leveraging HD-sEMG and deep learning. IEEE Trans. Biomed. Circuits Syst. 2019 , 14 , 232–243. [ Google Scholar ] [ CrossRef ]
  • Li, H.; Wu, L.; Wang, H.; Han, C.; Quan, W.; Zhao, J. Hand gesture recognition enhancement based on spatial fuzzy matching in leap motion. IEEE Trans. Ind. Inform. 2019 , 16 , 1885–1894. [ Google Scholar ] [ CrossRef ]
  • Köpüklü, O.; Gunduz, A.; Kose, N.; Rigoll, G. Online dynamic hand gesture recognition including efficiency analysis. IEEE Trans. Biom. Behav. Identity Sci. 2020 , 2 , 85–97. [ Google Scholar ] [ CrossRef ]
  • Tai, T.M.; Jhang, Y.J.; Liao, Z.W.; Teng, K.C.; Hwang, W.J. Sensor-based continuous hand gesture recognition by long short-term memory. IEEE Sens. Lett. 2018 , 2 , 1–4. [ Google Scholar ] [ CrossRef ]
  • Ram Rajesh, J.; Sudharshan, R.; Nagarjunan, D.; Aarthi, R. Remotely controlled PowerPoint presentation navigation using hand gestures. In Proceedings of the International conference on Advances in Computer, Electronics and Electrical Engineering, Vijayawada, India, 22 July 2012. [ Google Scholar ]
  • Czupryna, M.; Kawulok, M. Real-time vision pointer interface. In Proceedings of the ELMAR-2012, Zadar, Croatia, 12–14 September 2012; pp. 49–52. [ Google Scholar ]
  • Gupta, A.; Sehrawat, V.K.; Khosla, M. FPGA based real time human hand gesture recognition system. Procedia Technol. 2012 , 6 , 98–107. [ Google Scholar ] [ CrossRef ] [ Green Version ]
  • Chen, L.; Wang, F.; Deng, H.; Ji, K. A survey on hand gesture recognition. In Proceedings of the 2013 International Conference on Computer Sciences and Applications, Wuhan, China, 14–15 December 2013; pp. 313–316. [ Google Scholar ]
  • Jalab, H.A.; Omer, H.K. Human computer interface using hand gesture recognition based on neural network. In Proceedings of the 2015 5th National Symposium on Information Technology: Towards New Smart World (NSITNSW), Riyadh, Saudi Arabia, 17–19 February 2015; pp. 1–6. [ Google Scholar ]
  • Pisharady, P.K.; Saerbeck, M. Recent methods and databases in vision-based hand gesture recognition: A review. Comput. Vis. Image Underst. 2015 , 141 , 152–165. [ Google Scholar ] [ CrossRef ]
  • Plouffe, G.; Cretu, A.M. Static and dynamic hand gesture recognition in depth data using dynamic time warping. IEEE Trans. Instrum. Meas. 2015 , 65 , 305–316. [ Google Scholar ] [ CrossRef ]
  • Rios-Soria, D.J.; Schaeffer, S.E.; Garza-Villarreal, S.E. Hand-gesture recognition using computer-vision techniques. In Proceedings of the 21st International Conference on Computer Graphics, Visualization and Computer Vision, Plzen, Czech Republic, 24–27 June 2013. [ Google Scholar ]
  • Cheng, H.; Yang, L.; Liu, Z. Survey on 3D hand gesture recognition. IEEE Trans. Circuits Syst. Video Technol. 2015 , 26 , 1659–1673. [ Google Scholar ] [ CrossRef ]
  • Ahuja, M.K.; Singh, A. Static vision based Hand Gesture recognition using principal component analysis. In Proceedings of the 2015 IEEE 3rd International Conference on MOOCs, Innovation and Technology in Education (MITE), Amritsar, India, 1–2 October 2015; pp. 402–406. [ Google Scholar ]
  • Kaur, H.; Rani, J. A review: Study of various techniques of Hand gesture recognition. In Proceedings of the 2016 IEEE 1st International Conference on Power Electronics, Intelligent Control and Energy Systems (ICPEICES), Delhi, India, 4–6 July 2016; pp. 1–5. [ Google Scholar ]
  • Sonkusare, J.S.; Chopade, N.B.; Sor, R.; Tade, S.L. A review on hand gesture recognition system. In Proceedings of the 2015 International Conference on Computing Communication Control and Automation, Pune, India, 26–27 February 2015; pp. 790–794. [ Google Scholar ]
  • Shimada, A.; Yamashita, T.; Taniguchi, R.I. Hand gesture based TV control system—Towards both user-& machine-friendly gesture applications. In Proceedings of the 19th Korea-Japan Joint Workshop on Frontiers of Computer Vision, Incheon, Korea, 30 January–1 February 2013; pp. 121–126. [ Google Scholar ]
  • Palacios, J.M.; Sagüés, C.; Montijano, E.; Llorente, S. Human-computer interaction based on hand gestures using RGB-D sensors. Sensors 2013 , 13 , 11842–11860. [ Google Scholar ] [ CrossRef ] [ PubMed ]
  • Trigueiros, P.; Ribeiro, F.; Reis, L.P. Generic system for human-computer gesture interaction. In Proceedings of the 2014 IEEE International Conference on Autonomous Robot Systems and Competitions (ICARSC), Espinho, Portugal, 14–15 May 2014; pp. 175–180. [ Google Scholar ]
  • Dhule, C.; Nagrare, T. Computer vision based human-computer interaction using color detection techniques. In Proceedings of the 2014 Fourth International Conference on Communication Systems and Network Technologies, Washington, DC, USA, 7–9 April 2014; pp. 934–938. [ Google Scholar ]
  • Poularakis, S.; Katsavounidis, I. Finger detection and hand posture recognition based on depth information. In Proceedings of the 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Florence, Italy, 4–9 May 2014; pp. 4329–4333. [ Google Scholar ]
  • Dinh, D.L.; Kim, J.T.; Kim, T.S. Hand gesture recognition and interface via a depth imaging sensor for smart home appliances. Energy Procedia 2014 , 62 , 576–582. [ Google Scholar ] [ CrossRef ] [ Green Version ]
  • Panwar, M. Hand gesture recognition based on shape parameters. In Proceedings of the 2012 International Conference on Computing, Communication and Applications, Dindigul, India, 22–24 February 2012; pp. 1–6. [ Google Scholar ]
  • Wang, W.; Pan, J. Hand segmentation using skin color and background information. In Proceedings of the 2012 International Conference on Machine Learning and Cybernetics, Xi’an, China, 15–17 July 2012; Volume 4, pp. 1487–1492. [ Google Scholar ]
  • Doğan, R.Ö.; Köse, C. Computer monitoring and control with hand movements. In Proceedings of the 2014 22nd Signal Processing and Communications Applications Conference (SIU), Trabzon, Turkey, 23–25 April 2014; pp. 2110–2113. [ Google Scholar ]
  • Suarez, J.; Murphy, R.R. Hand gesture recognition with depth images: A review. In Proceedings of the 2012 IEEE RO-MAN: The 21st IEEE International Symposium on Robot and Human Interactive Communication, Paris, France, 9–13 September 2012; pp. 411–417. [ Google Scholar ]
  • Puri, R. Gesture recognition based mouse events. arXiv 2014 , arXiv:1401.2058. [ Google Scholar ] [ CrossRef ]
  • Wang, C.; Liu, Z.; Chan, S.C. Superpixel-based hand gesture recognition with kinect depth camera. IEEE Trans. Multimed. 2014 , 17 , 29–39. [ Google Scholar ] [ CrossRef ]
  • Garg, P.; Aggarwal, N.; Sofat, S. Vision based hand gesture recognition. World Acad. Sci. Eng. Technol. 2009 , 49 , 972–977. [ Google Scholar ]
  • Chastine, J.; Kosoris, N.; Skelton, J. A study of gesture-based first person control. In Proceedings of the CGAMES’2013 USA, Louisville, KY, USA, 30 July–1 August 2013; pp. 79–86. [ Google Scholar ]
  • Dominio, F.; Donadeo, M.; Marin, G.; Zanuttigh, P.; Cortelazzo, G.M. Hand gesture recognition with depth data. In Proceedings of the 4th ACM/IEEE International Workshop on Analysis and Retrieval of Tracked Events and Motion in Imagery Stream, Barcelona, Spain, 21 October 2013; pp. 9–16. [ Google Scholar ]
  • Xu, Y.; Wang, Q.; Bai, X.; Chen, Y.L.; Wu, X. A novel feature extracting method for dynamic gesture recognition based on support vector machine. In Proceedings of the 2014 IEEE International Conference on Information and Automation (ICIA), Hailar, China, 28–30 July 2014; pp. 437–441. [ Google Scholar ]
  • Jais, H.M.; Mahayuddin, Z.R.; Arshad, H. A review on gesture recognition using Kinect. In Proceedings of the 2015 International Conference on Electrical Engineering and Informatics (ICEEI), Bali, Indonesia, 10–11 August 2015; pp. 594–599. [ Google Scholar ]
  • Czuszynski, K.; Ruminski, J.; Wtorek, J. Pose classification in the gesture recognition using the linear optical sensor. In Proceedings of the 2017 10th International Conference on Human System Interactions (HSI), Ulsan, Korea, 17–19 July 2017; pp. 18–24. [ Google Scholar ]
  • Park, S.; Ryu, M.; Chang, J.Y.; Park, J. A hand posture recognition system utilizing frequency difference of infrared light. In Proceedings of the 20th ACM Symposium on Virtual Reality Software and Technology, Edinburgh, Scotland, 11–13 November 2014; pp. 65–68. [ Google Scholar ]
  • Jangyodsuk, P.; Conly, C.; Athitsos, V. Sign language recognition using dynamic time warping and hand shape distance based on histogram of oriented gradient features. In Proceedings of the 7th International Conference on PErvasive Technologies Related to Assistive Environments, Rhodes, Greece, 27–30 May 2014; pp. 1–6. [ Google Scholar ]
  • Sahoo, J.P.; Prakash, A.J.; Pławiak, P.; Samantray, S. Real-Time Hand Gesture Recognition Using Fine-Tuned Convolutional Neural Network. Sensors 2022 , 22 , 706. [ Google Scholar ] [ CrossRef ] [ PubMed ]
  • Gadekallu, T.R.; Srivastava, G.; Liyanage, M.; Iyapparaja, M.; Chowdhary, C.L.; Koppu, S.; Maddikunta, P.K.R. Hand gesture recognition based on a Harris hawks optimized convolution neural network. Comput. Electr. Eng. 2022 , 100 , 107836. [ Google Scholar ] [ CrossRef ]
  • Amin, M.S.; Rizvi, S.T.H. Sign Gesture Classification and Recognition Using Machine Learning. Cybern. Syst. 2022 . [ Google Scholar ] [ CrossRef ]
  • Kong, F.; Deng, J.; Fan, Z. Gesture recognition system based on ultrasonic FMCW and ConvLSTM model. Measurement 2022 , 190 , 110743. [ Google Scholar ] [ CrossRef ]
  • Saboo, S.; Singha, J.; Laskar, R.H. Dynamic hand gesture recognition using combination of two-level tracker and trajectory-guided features. Multimed. Syst. 2022 , 28 , 183–194. [ Google Scholar ] [ CrossRef ]
  • Alnaim, N. Hand Gesture Recognition Using Deep Learning Neural Networks. Ph.D. Thesis, Brunel University, London, UK, 2020. [ Google Scholar ]
  • Oudah, M.; Al-Naji, A.; Chahl, J. Computer Vision for Elderly Care Based on Hand Gestures. Computers 2021 , 10 , 5. [ Google Scholar ] [ CrossRef ]
  • Joseph, P. Recent Trends and Technologies in Hand Gesture Recognition. Int. J. Adv. Res. Comput. Sci. 2017 , 8 . [ Google Scholar ]
  • Zhang, Y.; Liu, B.; Liu, Z. Recognizing hand gestures with pressure-sensor-based motion sensing. IEEE Trans. Biomed. Circuits Syst. 2019 , 13 , 1425–1436. [ Google Scholar ] [ CrossRef ] [ PubMed ]
  • Mujahid, A.; Awan, M.J.; Yasin, A.; Mohammed, M.A.; Damaševičius, R.; Maskeliūnas, R.; Abdulkareem, K.H. Real-Time Hand Gesture Recognition Based on Deep Learning YOLOv3 Model. Appl. Sci. 2021 , 11 , 4164. [ Google Scholar ] [ CrossRef ]
  • Min, Y.; Zhang, Y.; Chai, X.; Chen, X. An efficient pointlstm for point clouds based gesture recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 5761–5770. [ Google Scholar ]
  • Al-Hammadi, M.; Muhammad, G.; Abdul, W.; Alsulaiman, M.; Bencherif, M.A.; Alrayes, T.S.; Mathkour, H.; Mekhtiche, M.A. Deep learning-based approach for sign language gesture recognition with efficient hand gesture representation. IEEE Access 2020 , 8 , 192527–192542. [ Google Scholar ] [ CrossRef ]
  • Neethu, P.; Suguna, R.; Sathish, D. An efficient method for human hand gesture detection and recognition using deep learning convolutional neural networks. Soft Comput. 2020 , 24 , 15239–15248. [ Google Scholar ] [ CrossRef ]
  • Asadi-Aghbolaghi, M.; Clapes, A.; Bellantonio, M.; Escalante, H.J.; Ponce-López, V.; Baró, X.; Guyon, I.; Kasaei, S.; Escalera, S. A survey on deep learning based approaches for action and gesture recognition in image sequences. In Proceedings of the 2017 12th IEEE International Conference on Automatic Face & Gesture Recognition (FG 2017), Washington, DC, USA, 30 May–3 June 2017; pp. 476–483. [ Google Scholar ]
  • Cao, C.; Zhang, Y.; Wu, Y.; Lu, H.; Cheng, J. Egocentric gesture recognition using recurrent 3d convolutional neural networks with spatiotemporal transformer modules. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 3763–3771. [ Google Scholar ]
  • John, V.; Boyali, A.; Mita, S.; Imanishi, M.; Sanma, N. Deep learning-based fast hand gesture recognition using representative frames. In Proceedings of the 2016 International Conference on Digital Image Computing: Techniques and Applications (DICTA), Gold Coast, QLD, Australia, 30 November–2 December 2016; pp. 1–8. [ Google Scholar ]
  • Zhang, X.; Li, X. Dynamic gesture recognition based on MEMP network. Future Internet 2019 , 11 , 91. [ Google Scholar ] [ CrossRef ] [ Green Version ]
  • Wang, S.; Song, J.; Lien, J.; Poupyrev, I.; Hilliges, O. Interacting with soli: Exploring fine-grained dynamic gesture recognition in the radio-frequency spectrum. In Proceedings of the 29th Annual Symposium on User Interface Software and Technology, Tokyo, Japan, 16–19 October 2016; pp. 851–860. [ Google Scholar ]
  • Funke, I.; Bodenstedt, S.; Oehme, F.; von Bechtolsheim, F.; Weitz, J.; Speidel, S. Using 3D convolutional neural networks to learn spatiotemporal features for automatic surgical gesture recognition in video. In Proceedings of the International Conference on Medical Image Computing and Computer-Assisted Intervention, Shenzhen, China, 13–17 October 2019; Springer: Berlin/Heidelberg, Germany, 2019; pp. 467–475. [ Google Scholar ]
  • Al Farid, F.; Hashim, N.; Abdullah, J. Vision Based Gesture Recognition from RGB Video Frames Using Morphological Image Processing Techniques. Int. J. Adv. Sci. Technol. 2019 , 28 , 321–332. [ Google Scholar ]
  • Al Farid, F.; Hashim, N.; Abdullah, J. Vision-based hand gesture recognition from RGB video data using SVM. In Proceedings of the International Workshop on Advanced Image Technology (IWAIT) 2019, International Society for Optics and Photonics, NTU, Singapore, 22 March 2019; Volume 11049, p. 110491E. [ Google Scholar ]
  • Bhuiyan, M.R.; Abdullah, D.; Hashim, D.; Farid, F.; Uddin, D.; Abdullah, N.; Samsudin, D. Crowd density estimation using deep learning for Hajj pilgrimage video analytics. F1000Research 2021 , 10 , 1190. [ Google Scholar ] [ CrossRef ]
  • Bhuiyan, M.R.; Abdullah, J.; Hashim, N.; Al Farid, F.; Samsudin, M.A.; Abdullah, N.; Uddin, J. Hajj pilgrimage video analytics using CNN. Bull. Electr. Eng. Inform. 2021 , 10 , 2598–2606. [ Google Scholar ] [ CrossRef ]
  • Zamri, M.N.H.B.; Abdullah, J.; Bhuiyan, R.; Hashim, N.; Farid, F.A.; Uddin, J.; Husen, M.N.; Abdullah, N. A Comparison of ML and DL Approaches for Crowd Analysis on the Hajj Pilgrimage. In Proceedings of the International Visual Informatics Conference; Springer: Berlin/Heidelberg, Germany, 2021; pp. 552–561. [ Google Scholar ]
  • Bari, B.S.; Islam, M.N.; Rashid, M.; Hasan, M.J.; Razman, M.A.M.; Musa, R.M.; Ab Nasir, A.F.; Majeed, A.P.A. A real-time approach of diagnosing rice leaf disease using deep learning-based faster R-CNN framework. Peerj Comput. Sci. 2021 , 7 , e432. [ Google Scholar ] [ CrossRef ] [ PubMed ]
  • Zoph, B.; Cubuk, E.D.; Ghiasi, G.; Lin, T.Y.; Shlens, J.; Le, Q.V. Learning data augmentation strategies for object detection. In Proceedings of the European Conference on Computer Vision, Glasgow, UK, 23–28 August 2020; Springer: Berlin/Heidelberg, Germany, 2020; pp. 566–583. [ Google Scholar ]
  • Xie, Q.; Dai, Z.; Hovy, E.; Luong, M.T.; Le, Q.V. Unsupervised data augmentation for consistency training. arXiv 2019 , arXiv:1904.12848. [ Google Scholar ]
  • Islam, M.Z.; Hossain, M.S.; ul Islam, R.; Andersson, K. Static hand gesture recognition using convolutional neural network with data augmentation. In Proceedings of the 2019 Joint 8th International Conference on Informatics, Electronics & Vision (ICIEV) and 2019 3rd International Conference on Imaging, Vision & Pattern Recognition (icIVPR), Spokane, WA, USA, 30 May–2 June 2019; pp. 324–329. [ Google Scholar ]
  • Mungra, D.; Agrawal, A.; Sharma, P.; Tanwar, S.; Obaidat, M.S. PRATIT: A CNN-based emotion recognition system using histogram equalization and data augmentation. Multimed. Tools Appl. 2020 , 79 , 2285–2307. [ Google Scholar ] [ CrossRef ]
  • Rashid, M.; Bari, B.S.; Yusup, Y.; Kamaruddin, M.A.; Khan, N. A Comprehensive Review of Crop Yield Prediction Using Machine Learning Approaches With Special Emphasis on Palm Oil Yield Prediction. IEEE Access 2021 , 9 , 63406–63439. [ Google Scholar ] [ CrossRef ]
  • Rashid, M.; Sulaiman, N.; PP Abdul Majeed, A.; Musa, R.M.; Bari, B.S.; Khatun, S. Current status, challenges, and possible solutions of EEG-based brain-computer interface: A comprehensive review. Front. Neurorobotics 2020 , 14 , 25. [ Google Scholar ] [ CrossRef ]
  • Mathew, A.; Amudha, P.; Sivakumari, S. Deep Learning Techniques: An Overview. In Proceedings of the International Conference on Advanced Machine Learning Technologies and Applications, Manipal, India, 13–15 February 2020; Springer: Berlin/Heidelberg, Germany, 2020; pp. 599–608. [ Google Scholar ]
  • Goodfellow, I.; Bengio, Y.; Courville, A. Deep Learning ; MIT Press: Cambridge, MA, USA, 2016. [ Google Scholar ]
  • Ji, S.; Xu, W.; Yang, M.; Yu, K. 3D convolutional neural networks for human action recognition. IEEE Trans. Pattern Anal. Mach. Intell. 2012 , 35 , 221–231. [ Google Scholar ] [ CrossRef ] [ Green Version ]
  • Liu, Z.; Zhang, C.; Tian, Y. 3D-based deep convolutional neural network for action recognition with depth sequences. Image Vis. Comput. 2016 , 55 , 93–100. [ Google Scholar ] [ CrossRef ] [ Green Version ]
  • Sun, L.; Jia, K.; Yeung, D.Y.; Shi, B.E. Human action recognition using factorized spatio-temporal convolutional networks. In Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile, 7–13 December 2015; pp. 4597–4605. [ Google Scholar ]
  • Escorcia, V.; Heilbron, F.C.; Niebles, J.C.; Ghanem, B. Daps: Deep action proposals for action understanding. In Proceedings of the European Conference on Computer Vision, Amsterdam, The Netherlands, 11–14 October 2016; Springer: Berlin/Heidelberg, Germany, 2016; pp. 768–784. [ Google Scholar ]
  • Mansimov, E.; Srivastava, N.; Salakhutdinov, R. Initialization strategies of spatio-temporal convolutional neural networks. arXiv 2015 , arXiv:1503.07274. [ Google Scholar ]
  • Baccouche, M.; Mamalet, F.; Wolf, C.; Garcia, C.; Baskurt, A. Sequential deep learning for human action recognition. In Proceedings of the International Workshop on Human Behavior Understanding, Amsterdam, The Netherlands, 16 November 2011; Springer: Berlin/Heidelberg, Germany, 2011; pp. 29–39. [ Google Scholar ]
  • Feichtenhofer, C.; Pinz, A.; Zisserman, A. Convolutional two-stream network fusion for video action recognition. In Proceedings of the IEEE conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 1933–1941. [ Google Scholar ]
  • Shou, Z.; Wang, D.; Chang, S.F. Temporal action localization in untrimmed videos via multi-stage cnns. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 1049–1058. [ Google Scholar ]
  • Varol, G.; Laptev, I.; Schmid, C. Long-term temporal convolutions for action recognition. IEEE Trans. Pattern Anal. Mach. Intell. 2017 , 40 , 1510–1517. [ Google Scholar ] [ CrossRef ] [ Green Version ]
  • Neverova, N.; Wolf, C.; Taylor, G.W.; Nebout, F. Multi-scale deep learning for gesture detection and localization. In Proceedings of the European Conference on Computer Vision, Zurich, Switzerland, 6–12 September 2014; Springer: Berlin/Heidelberg, Germany, 2014; pp. 474–490. [ Google Scholar ]
  • Wang, L.; Qiao, Y.; Tang, X. Action recognition with trajectory-pooled deep-convolutional descriptors. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 4305–4314. [ Google Scholar ]
  • Han, S.; Mao, H.; Dally, W.J. Deep compression: Compressing deep neural networks with pruning, trained quantization and huffman coding. arXiv 2015 , arXiv:1510.00149. [ Google Scholar ]
  • Zhang, B.; Wang, L.; Wang, Z.; Qiao, Y.; Wang, H. Real-time action recognition with enhanced motion vector CNNs. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 2718–2726. [ Google Scholar ]
  • Xu, Z.; Zhu, L.; Yang, Y.; Hauptmann, A.G. Uts-cmu at thumos 2015. Thumos Chall. 2015 , 2015 , 2. [ Google Scholar ]
  • Gkioxari, G.; Malik, J. Finding action tubes. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 759–768. [ Google Scholar ]
  • Escalante, H.J.; Morales, E.F.; Sucar, L.E. A naive bayes baseline for early gesture recognition. Pattern Recognit. Lett. 2016 , 73 , 91–99. [ Google Scholar ] [ CrossRef ]
  • Xu, X.; Hospedales, T.M.; Gong, S. Multi-task zero-shot action recognition with prioritised data augmentation. In Proceedings of the European Conference on Computer Vision, Amsterdam, The Netherlands, 11–14 October 2016; Springer: Berlin/Heidelberg, Germany, 2016; pp. 343–359. [ Google Scholar ]
  • Montes, A.; Salvador, A.; Pascual, S.; Giro-i Nieto, X. Temporal activity detection in untrimmed videos with recurrent neural networks. arXiv 2016 , arXiv:1608.08128. [ Google Scholar ]
  • Nasrollahi, K.; Escalera, S.; Rasti, P.; Anbarjafari, G.; Baro, X.; Escalante, H.J.; Moeslund, T.B. Deep learning based super-resolution for improved action recognition. In Proceedings of the 2015 International Conference on Image Processing Theory, Tools and Applications (IPTA), Orleans, France, 10–13 November 2015; pp. 67–72. [ Google Scholar ]


AuthorFindingsChallenges
[ ]In this study, image processing techniques such as wavelets and empirical mode decomposition were suggested to extract picture functionalities in order to identify 2D or 3D manual motions. Classification of artificial neural networks (ANN), which was utilized for the training and classification of data in addition to the CNN (CNN).Three-dimensional gesture disparities were measured utilizing the left and right 3D gesture videos.
[ ]Deaf–mute elderly folk use five distinct hand signals to seek a particular item, such as drink, food, toilet, assistance, and medication. Since older individuals cannot do anything independently, their requests were delivered to their smartphone.Microsoft Kinect v2 sensor’s capability to extract hand movements in real time keeps this study in a restricted area.
[ ]The physical closeness of gestures and voices may be loosened slightly and utilized by individuals with unique abilities. It was always important to explore efficient human computer interaction (HCI) in developing new approaches and methodologies.Many of the methods encounter difficulties like occlusions, changes in lighting, low resolution and a high frame rate.
[ ]A working prototype is created to perform gestures based on real-time interactions, comprising a wearable gesture detecting device with four pressure sensors and the appropriate computational framework.The hardware design of the system has to be further simplified to make it more feasible. More research on the balance between system resilience and sensitivity is required.
[ ]This article offers a lightweight model based on the YOLO (You Look Only Once) v3 and the DarkNet-53 neural networks for gesture detection without further preprocessing, filtration of pictures and image improvement. Even in a complicated context the suggested model was very accurate, and even in low resolution image mode motions were effectively identified. Rate of high frame.The primary challenge of this application for identification of gestures in real time is the classification and recognition of gestures. Hand recognition is a method used by several algorithms and ideas of diverse approaches for understanding the movement of a hand, such as picture and neural networks.
[ ]This work formulates the recognition of gestures as an irregular issue of sequence identification and aims to capture long-run spatial correlations in points of the cloud. In order to spread information from past to future while maintaining its spatial structure, a new and effective PointLSTM is suggested.The underlying geometric structure and distance information for the object surfaces are accurately described in dot clouds as compared with RGB data, which offer additional indicators of gesture identification.
[ ]A new system is presented for a dynamic recognition of hand gestures utilizing various architectures to learn how to partition hands, local and global features and globalization and recognition features of the sequence.To create an efficient system for recognition, hand segmentation, local representation of hand forms, global corporate configuration, and gesture sequence modeling need to be addressed.
[ ]This article detects and recognizes the gestures of the human hand using the method to classification for neural networks (CNN). This process flow includes hand area segmentation using mask image, finger segmentation, segmented finger image normalization and CNN classification finger identification.SVM and the naive Bayes classification were used to recognize the conventional gesture technique and needed a large number of data for the identification of gesture patterns.
[ ]They provided a study of existing deep learning methodologies for action and gesture detection in picture sequences, as well as a taxonomy that outlines key components of deep learning for both tasks.They looked through the suggested architectures, fusion methodologies, primary datasets, and competitions in depth. They described and analyzed the key works presented so far, focusing on how they deal with the temporal component of data and suggesting potential and challenges for future study.
[ ]They solve the problems by employing an end-to-end learning recurrent 3D convolutional neural network. They created a spatiotemporal transformer module with recurrent connections between surrounding time slices that can dynamically change a 3D feature map into a canonical view in both space and time.The main challenge in egocentric vision gesture detection is the global camera motion created by the device wearer’s spontaneous head movement.
[ ] A long-term recurrent convolutional network is used to classify video sequences of hand motions: multiple frames sampled from a video sequence are fed into the network-based action classifier. Including many frames, however, increases the computational complexity of the system and can also lower classifier accuracy.
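The sketch below illustrates the long-term recurrent convolution idea: a 2D CNN encodes each sampled frame and an LSTM aggregates the per-frame features before a final classification layer. The ResNet-18 backbone, clip length, and hidden size are assumptions made for illustration, not the configuration of the cited work.

```python
# Sketch of a CNN+LSTM (LRCN-style) video gesture classifier.
# Backbone, hidden size, and clip length are illustrative assumptions.
import torch
import torch.nn as nn
from torchvision import models

class LRCNClassifier(nn.Module):
    def __init__(self, num_classes, hidden=256):
        super().__init__()
        backbone = models.resnet18(weights=None)
        feat_dim = backbone.fc.in_features      # 512 for ResNet-18
        backbone.fc = nn.Identity()             # keep the pooled features only
        self.backbone = backbone
        self.lstm = nn.LSTM(feat_dim, hidden, batch_first=True)
        self.head = nn.Linear(hidden, num_classes)

    def forward(self, clips):
        # clips: (batch, time, 3, H, W) -- a fixed number of frames per video.
        b, t = clips.shape[:2]
        frames = clips.flatten(0, 1)            # (batch*time, 3, H, W)
        feats = self.backbone(frames).reshape(b, t, -1)
        out, _ = self.lstm(feats)               # temporal aggregation
        return self.head(out[:, -1])

model = LRCNClassifier(num_classes=25)
logits = model(torch.randn(2, 16, 3, 112, 112))
```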
[ ] Because each type of network structure has its own limitations, they developed MEMP (multiple extraction and multiple prediction), a neural network that alternately fuses 3D CNN and ConvLSTM layers. Its key characteristic is that it extracts and predicts the temporal and spatial feature information of the gesture video multiple times, which yields high accuracy.
[ ] This research introduces a new machine learning architecture built specifically for radio-frequency-based gesture recognition, focusing on high-frequency (60 GHz) short-range radar sensing such as Google's Soli sensor. The signal has unique characteristics, such as resolving motion at a very fine level and segmenting in range and velocity space rather than image space. This enables new kinds of input but makes the design of recognition algorithms considerably more challenging.
[ ] They propose learning spatio-temporal features from successive video frames using a 3D convolutional neural network (CNN) and evaluate the method on recordings of robot-assisted suturing on a bench-top model from the freely available JIGSAWS dataset. Automatic recognition of surgical gestures is an important step towards a complete understanding of surgical skill, with possible applications in automatic skill assessment, intra-operative monitoring of critical surgical steps, and semi-automation of surgical tasks.
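As a rough illustration of learning spatio-temporal features directly from short clips, here is a minimal 3D CNN classifier. Clip length, channel widths, and kernel sizes are assumptions for the sketch and do not reproduce the cited surgical-gesture model.

```python
# Minimal sketch of a 3D CNN over short clips of consecutive frames.
# Clip length and layer widths are illustrative, not the cited configuration.
import torch
import torch.nn as nn

class Simple3DCNN(nn.Module):
    def __init__(self, num_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv3d(3, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool3d((1, 2, 2)),                     # pool space only at first
            nn.Conv3d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool3d(2),                             # pool time and space
            nn.Conv3d(64, 128, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool3d(1),                     # global spatio-temporal pooling
        )
        self.classifier = nn.Linear(128, num_classes)

    def forward(self, x):
        # x: (batch, 3, frames, height, width), e.g. 16-frame RGB clips.
        return self.classifier(self.features(x).flatten(1))

model = Simple3DCNN(num_classes=10)
logits = model(torch.randn(2, 3, 16, 112, 112))
```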
[ , ] They blur the video frames to suppress background noise, convert the images to the HSV color space, and obtain a binary (black-and-white) image through thresholding, filtering, erosion, and dilation; hand movements are then identified with an SVM. Gesture-based technology can help people with disabilities, as well as the general public, stay safe and meet their needs, but detecting gestures from video streams remains difficult because the properties of each motion vary considerably from person to person.
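The preprocessing chain described above translates almost directly into OpenCV calls; a hedged sketch follows. The HSV skin-color range, kernel size, and SVM parameters are illustrative guesses that would need tuning, and the training data names are placeholders.

```python
# Sketch of the described pipeline: blur, HSV conversion, skin thresholding,
# morphological clean-up, then an SVM on the flattened binary mask.
# The HSV range and SVM settings are illustrative guesses, not from the paper.
import cv2
import numpy as np
from sklearn.svm import SVC

def preprocess(frame, size=(64, 64)):
    blurred = cv2.GaussianBlur(frame, (5, 5), 0)
    hsv = cv2.cvtColor(blurred, cv2.COLOR_BGR2HSV)
    # Rough skin-tone range in HSV; needs tuning per camera and lighting.
    mask = cv2.inRange(hsv, np.array([0, 30, 60]), np.array([20, 150, 255]))
    kernel = np.ones((5, 5), np.uint8)
    mask = cv2.dilate(cv2.erode(mask, kernel), kernel)   # erosion then dilation
    mask = cv2.medianBlur(mask, 5)
    return cv2.resize(mask, size).flatten() / 255.0      # feature vector

# Training on pre-collected frames and labels (placeholder names):
# X = np.stack([preprocess(f) for f in training_frames]); y = training_labels
# clf = SVC(kernel="rbf", C=10).fit(X, y)
# prediction = clf.predict([preprocess(test_frame)])
```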
[ , ] This study proposes a convolutional neural network based method for Hajj applications, together with a technique for counting people and assessing crowd density. The model uses an architecture that detects each individual in the crowd, marks the head position with a bounding box, and counts people in the authors' own dataset (HAJJ-Crowd). Interest in improving video analytics and visual monitoring to enhance the safety and security of pilgrims in Makkah has grown, largely because Hajj is a unique event that gathers hundreds of thousands of people in a small area.
[ ] This study presents crowd density analysis using machine learning, with the primary goal of identifying the best-performing machine learning method for crowd density classification. Crowd control is essential for ensuring crowd safety, and crowd monitoring is an efficient way of observing, controlling, and understanding crowd behavior.

Share and Cite

Al Farid, F.; Hashim, N.; Abdullah, J.; Bhuiyan, M.R.; Shahida Mohd Isa, W.N.; Uddin, J.; Haque, M.A.; Husen, M.N. A Structured and Methodological Review on Vision-Based Hand Gesture Recognition System. J. Imaging 2022, 8, 153. https://doi.org/10.3390/jimaging8060153



A multi-modal framework for continuous and isolated hand gesture recognition utilizing movement epenthesis detection

  • Published: 27 June 2024
  • Volume 35, article number 86 (2024)


  • Navneet Nayan 1 ,
  • Debashis Ghosh 1 &
  • Pyari Mohan Pradhan 1  


Gesture recognition, having multitudinous applications in the real world, is one of the core areas of research in the field of human-computer interaction. In this paper, we propose a novel method for isolated and continuous hand gesture recognition utilizing movement epenthesis detection and removal. For this purpose, the present work detects and removes the movement epenthesis frames from isolated and continuous hand gesture videos. We also propose a novel modality based on the temporal difference that extracts hand regions, removes gesture-irrelevant factors, and captures the temporal information contained in the hand gesture videos. Using the proposed modality together with other modalities such as the RGB, depth, and segmented hand modalities, features are extracted using the GoogLeNet Caffe model. Next, we derive a set of discriminative features by fusing the acquired features into a feature vector representing the sign gesture in question. We have designed and used a Bidirectional Long Short-Term Memory (Bi-LSTM) network for classification. To test the efficacy of our proposed work, we applied our method to various publicly available continuous and isolated hand gesture datasets, namely ChaLearn LAP IsoGD, ChaLearn LAP ConGD, IPN Hand, and NVGesture. We observe in our experiments that the proposed method performs exceptionally well with several individual modalities as well as with combinations of modalities from these datasets. The combined effect of the proposed modality and movement epenthesis frame removal leads to a significant improvement in gesture recognition accuracy and a considerable reduction in computational burden. The obtained results thus show our approach to be on par with existing state-of-the-art methods.
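To make the classification stage of such a pipeline concrete, here is a hedged sketch in which per-frame feature vectors from several modalities are concatenated and classified with a bidirectional LSTM. The feature dimensions, hidden size, and the 249-class output (the ChaLearn IsoGD gesture vocabulary) are illustrative choices, and the GoogLeNet feature extraction is assumed to have been done beforehand; this is not the authors' exact network.

```python
# Sketch of a multi-modality fusion + Bi-LSTM classification stage.
# Feature dims and hyperparameters are illustrative; per-frame CNN features
# (e.g. GoogLeNet) are assumed to be precomputed for each modality.
import torch
import torch.nn as nn

class FusedBiLSTMClassifier(nn.Module):
    def __init__(self, feat_dims=(1024, 1024, 1024, 1024), hidden=512, num_classes=249):
        super().__init__()
        self.lstm = nn.LSTM(sum(feat_dims), hidden, batch_first=True, bidirectional=True)
        self.head = nn.Linear(2 * hidden, num_classes)

    def forward(self, modality_feats):
        # modality_feats: list of (batch, time, feat_dim) tensors, one per modality.
        fused = torch.cat(modality_feats, dim=-1)       # frame-level feature fusion
        out, _ = self.lstm(fused)                       # (batch, time, 2*hidden)
        half = out.size(-1) // 2
        # Concatenate the final forward state with the first backward state.
        summary = torch.cat([out[:, -1, :half], out[:, 0, half:]], dim=-1)
        return self.head(summary)

model = FusedBiLSTMClassifier()
feats = [torch.randn(2, 40, 1024) for _ in range(4)]    # 4 modalities, 40 frames
logits = model(feats)
```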


Data availability

Datasets ChaLearn LAP ConGD and ChaLearn LAP IsoGD can be accessed at: https://gesture.chalearn.org/2016-looking-at-people-cvpr-challenge/isogd-and-congd-datasets
Dataset IPN Hand can be accessed at: https://gibranbenitez.github.io/IPN_Hand/
Dataset NVGesture can be accessed at: https://research.nvidia.com/publication/2016-06_online-detection-and-classification-dynamic-hand-gestures-recurrent-3d
All other declarations are not applicable.


Author information

Authors and Affiliations

Department of Electronics and Communication Engineering, Indian Institute of Technology (IIT) Roorkee, Roorkee, Uttarakhand, 247667, India

Navneet Nayan, Debashis Ghosh & Pyari Mohan Pradhan


Corresponding author

Correspondence to Debashis Ghosh.


About this article

Nayan, N., Ghosh, D. & Pradhan, P.M. A multi-modal framework for continuous and isolated hand gesture recognition utilizing movement epenthesis detection. Machine Vision and Applications 35, 86 (2024). https://doi.org/10.1007/s00138-024-01565-9

Download citation

Received : 07 December 2023

Revised : 24 May 2024

Accepted : 07 June 2024

Published : 27 June 2024

DOI : https://doi.org/10.1007/s00138-024-01565-9


  • Bidirectional long short-term memory
  • Continuous gesture recognition
  • Isolated gesture recognition
  • Movement epenthesis
  • EHTD modality


CVPR 2024 Announces Best Paper Award Winners


This year, from more than 11,500 paper submissions, the CVPR 2024 Awards Committee selected the following 10 winners for the honor of Best Papers during the Awards Program at CVPR 2024, taking place now through 21 June at the Seattle Convention Center in Seattle, Wash., U.S.A.

Best Papers

  • “ Generative Image Dynamics ” Authors: Zhengqi Li, Richard Tucker, Noah Snavely, Aleksander Holynski The paper presents a new approach for modeling natural oscillation dynamics from a single still picture. This approach produces photo-realistic animations from a single picture and significantly outperforms prior baselines. It also demonstrates potential to enable several downstream applications such as creating seamlessly looping or interactive image dynamics.
  • “ Rich Human Feedback for Text-to-Image Generation ” Authors: Youwei Liang, Junfeng He, Gang Li, Peizhao Li, Arseniy Klimovskiy, Nicholas Carolan, Jiao Sun, Jordi Pont-Tuset, Sarah Young, Feng Yang, Junjie Ke, Krishnamurthy Dj Dvijotham, Katherine M. Collins, Yiwen Luo, Yang Li, Kai J. Kohlhoff, Deepak Ramachandran, and Vidhya Navalpakkam This paper highlights the first rich human feedback dataset for image generation. Authors designed and trained a multimodal Transformer to predict the rich human feedback and demonstrated some instances to improve image generation.

Honorable mention papers included, “ EventPS: Real-Time Photometric Stereo Using an Event Camera ” and “ pixelSplat: 3D Gaussian Splats from Image Pairs for Scalable Generalizable 3D Reconstruction. ”

Best Student Papers

  • “ Mip-Splatting: Alias-free 3D Gaussian Splatting ” Authors: Zehao Yu, Anpei Chen, Binbin Huang, Torsten Sattler, Andreas Geiger This paper introduces Mip-Splatting, a technique improving 3D Gaussian Splatting (3DGS) with a 3D smoothing filter and a 2D Mip filter for alias-free rendering at any scale. This approach significantly outperforms state-of-the-art methods in out-of-distribution scenarios, when testing at sampling rates different from training, resulting in better generalization to out-of-distribution camera poses and zoom factors.
  • “ BioCLIP: A Vision Foundation Model for the Tree of Life ” Authors: Samuel Stevens, Jiaman Wu, Matthew J. Thompson, Elizabeth G. Campolongo, Chan Hee Song, David Edward Carlyn, Li Dong, Wasila M. Dahdul, Charles Stewart, Tanya Berger-Wolf, Wei-Lun Chao, and Yu Su This paper offers TREEOFLIFE-10M and BIOCLIP, a large-scale diverse biology image dataset and a foundation model for the tree of life, respectively. This work shows BIOCLIP is a strong fine-grained classifier for biology in both zero- and few-shot settings.

There were also four honorable mentions in this category this year: "SpiderMatch: 3D Shape Matching with Global Optimality and Geometric Consistency"; "Image Processing GNN: Breaking Rigidity in Super-Resolution"; "Objects as Volumes: A Stochastic Geometry View of Opaque Solids"; and "Comparing the Decision-Making Mechanisms by Transformers and CNNs via Explanation Methods."

“We are honored to recognize the CVPR 2024 Best Paper Awards winners,” said David Crandall, Professor of Computer Science at Indiana University, Bloomington, Ind., U.S.A., and CVPR 2024 Program Co-Chair. “The 10 papers selected this year – double the number awarded in 2023 – are a testament to the continued growth of CVPR and the field, and to all of the advances that await.”

Additionally, the IEEE Computer Society (CS), a CVPR organizing sponsor, announced the Technical Community on Pattern Analysis and Machine Intelligence (TCPAMI) Awards at this year’s conference. The following were recognized for their achievements:

  • 2024 Recipient : “ Rich Feature Hierarchies for Accurate Object Detection and Semantic Segmentation ” Authors: Ross Girshick, Jeff Donahue, Trevor Darrell, Jitendra Malik
  • 2024 Recipient : Angjoo Kanazawa, Carl Vondrick
  • 2024 Recipient : Andrea Vedaldi

“The TCPAMI Awards demonstrate the lasting impact and influence of CVPR research and researchers,” said Walter J. Scheirer, University of Notre Dame, Notre Dame, Ind., U.S.A., and CVPR 2024 General Chair. “The contributions of these leaders have helped to shape and drive forward continued advancements in the field. We are proud to recognize these achievements and congratulate them on their success.”

About CVPR 2024

The Computer Vision and Pattern Recognition Conference (CVPR) is the preeminent computer vision event for new research in support of artificial intelligence (AI), machine learning (ML), augmented, virtual and mixed reality (AR/VR/MR), deep learning, and much more. Sponsored by the IEEE Computer Society (CS) and the Computer Vision Foundation (CVF), CVPR delivers the important advances in all areas of computer vision and pattern recognition and the various fields and industries they impact. With a first-in-class technical program, including tutorials and workshops, a leading-edge expo, and robust networking opportunities, CVPR, which is annually attended by more than 10,000 scientists and engineers, creates a one-of-a-kind opportunity for networking, recruiting, inspiration, and motivation.

CVPR 2024 takes place 17-21 June at the Seattle Convention Center in Seattle, Wash., U.S.A., and participants may also access sessions virtually. For more information about CVPR 2024, visit cvpr.thecvf.com .

About the Computer Vision Foundation

The Computer Vision Foundation (CVF) is a non-profit organization whose purpose is to foster and support research on all aspects of computer vision. Together with the IEEE Computer Society, it co-sponsors the two largest computer vision conferences, CVPR and the International Conference on Computer Vision (ICCV). Visit thecvf.com for more information.

About the IEEE Computer Society

Engaging computer engineers, scientists, academia, and industry professionals from all areas and levels of computing, the IEEE Computer Society (CS) serves as the world's largest and most established professional organization of its type. IEEE CS sets the standard for the education and engagement that fuels continued global technological advancement. Through conferences, publications, and programs that inspire dialogue, debate, and collaboration, IEEE CS empowers, shapes, and guides the future of not only its 375,000+ community members, but the greater industry, enabling new opportunities to better serve our world. Visit computer.org for more information.


Self-assessment, Exhibition, and Recognition: a Review of Personality in Large Language Models

  • Wen, Zhiyuan
  • Cao, Jiannong
  • Sun, Haoming
  • Yang, Ruosong
  • Liu, Shuaiqi

As large language models (LLMs) appear to behave increasingly human-like in text-based interactions, more and more researchers become interested in investigating personality in LLMs. However, the diversity of psychological personality research and the rapid development of LLMs have led to a broad yet fragmented landscape of studies in this interdisciplinary field. Extensive studies across different research focuses, different personality psychometrics, and different LLMs make it challenging to have a holistic overview and further pose difficulties in applying findings to real-world applications. In this paper, we present a comprehensive review by categorizing current studies into three research problems: self-assessment, exhibition, and recognition, based on the intrinsic characteristics and external manifestations of personality in LLMs. For each problem, we provide a thorough analysis and conduct in-depth comparisons of their corresponding solutions. Besides, we summarize research findings and open challenges from current studies and further discuss their underlying causes. We also collect extensive publicly available resources to facilitate interested researchers and developers. Lastly, we discuss the potential future research directions and application scenarios. Our paper is the first comprehensive survey of up-to-date literature on personality in LLMs. By presenting a clear taxonomy, in-depth analysis, promising future directions, and extensive resource collections, we aim to provide a better understanding and facilitate further advancements in this emerging field.

  • Computer Science - Computation and Language;
  • Computer Science - Artificial Intelligence


Title: Burst Image Super-Resolution with Base Frame Selection

Abstract: Burst image super-resolution has been a topic of active research in recent years due to its ability to obtain a high-resolution image by using complementary information between multiple frames in the burst. In this work, we explore using burst shots with non-uniform exposures to confront real-world practical scenarios by introducing a new benchmark dataset, dubbed Non-uniformly Exposed Burst Image (NEBI), that includes burst frames at varying exposure times to obtain a broader range of irradiance and motion characteristics within a scene. As burst shots with non-uniform exposures exhibit varying levels of degradation, fusing information of the burst shots into the first frame as a base frame may not result in optimal image quality. To address this limitation, we propose a Frame Selection Network (FSN) for non-uniform scenarios. This network seamlessly integrates into existing super-resolution methods in a plug-and-play manner with low computational cost. The comparative analysis reveals the effectiveness of the non-uniform setting for the practical scenario and of our FSN on the synthetic and real NEBI datasets.
Comments: CVPR2024W NTIRE accepted
Subjects: Computer Vision and Pattern Recognition (cs.CV)



COMMENTS

  1. A Review of the Hand Gesture Recognition System: Current Progress and

    This paper reviewed the sign language research in the vision-based hand gesture recognition system from 2014 to 2020. Its objective is to identify the progress and what needs more attention. We have extracted a total of 98 articles from well-known online databases using selected keywords. The review shows that the vision-based hand gesture recognition research is an active field of research ...

  2. A systematic review on hand gesture recognition techniques, challenges

    The paper will discuss the gesture acquisition methods, the feature extraction process, the classification of hand gestures, the applications that were recently proposed, the challenges that face researchers in the hand gesture recognition process, and the future of hand gesture recognition. ... The research work in hand gesture recognition has ...

  3. Hand gesture recognition using machine learning and infrared

Currently, gesture recognition is treated as a problem of feature extraction and pattern recognition, in which a movement is labeled as belonging to a given class. A gesture recognition system's response could solve different problems in various fields, such as medicine, robotics, sign language, human-computer interfaces, virtual reality, augmented reality, and security. In this context, this ...

  4. A Review of Hand Gesture Recognition Systems Based on Noninvasive

    Hand gesture, one of the essential ways for a human to convey information and express intuitive intention, has a significant degree of differentiation, substantial flexibility, and high robustness of information transmission to make hand gesture recognition (HGR) one of the research hotspots in the fields of human-human and human-computer or human-machine interactions.

  5. Gesture Recognition

    124 papers with code • 13 benchmarks • 14 datasets. Gesture Recognition is an active field of research with applications such as automatic recognition of sign language, interaction of humans and robots or for new ways of controlling video games. Source: Gesture Recognition in RGB Videos Using Human Body Keypoints and Dynamic Time Warping.

  6. Exploiting domain transformation and deep learning for hand gesture

    Hand gesture recognition is one of the most widely explored areas under the human-computer interaction domain. Although various modalities of hand gesture recognition have been explored in the ...

  7. Dynamic gesture recognition based on 2D convolutional neural network

Gesture recognition methods can be divided into two categories: static gesture recognition and dynamic gesture recognition. Static gesture recognition methods have significant ...

  8. An Exploration into Human-Computer Interaction: Hand Gesture

The fundamental objective of gesture recognition research is to develop a technology capable of recognizing distinct human gestures and utilizing them to communicate information or control devices. As a result, it involves monitoring hand movement and translating that motion into instructions. ... In this paper, we worked on 5 one ...

  9. Hand Gesture Recognition

    45 papers with code • 18 benchmarks • 14 datasets. Hand gesture recognition (HGR) is a subarea of Computer Vision where the focus is on classifying a video or image containing a dynamic or static, respectively, hand gesture. In the static case, gestures are also generally called poses. HGR can also be performed with point cloud or joint ...

  10. Gesture recognition using a bioinspired learning architecture that

    Gesture recognition using machine-learning methods is valuable in the development of advanced cybernetics, robotics and healthcare systems, and typically relies on images or videos. To improve ...

  11. Hand Gesture Recognition: A Literature Review

Hand Gesture Recognition: A Literature Review. Rafiqul Zaman Khan and Noor Adnan Ibraheem, Department of Computer Science, A.M.U. Aligarh, India.

  12. Hand Gesture Recognition Based on Computer Vision: A Review of

    However, many research papers deal with enhancing frameworks for hand gesture recognition or developing new algorithms rather than executing a practical application with regard to health care. The biggest challenge encountered by the researcher is in designing a robust framework that overcomes the most common issues with fewer limitations and ...

  13. (PDF) Hand Gesture Recognition and Control for Human ...

    Abstract. This paper introduces a real-time system for recognizing hand gestures using Python and OpenCV, centred on a Convolutional Neural Network (CNN) model. The primary objective of this study ...

  14. Hand gesture recognition with focus on leap motion: An overview, real

Static gesture recognition employs a gesture image acquired at a specific point in time, with the recognition result based on location, shape, and texture (Yuanyuan et al., 2021). Dynamic gestures, by contrast, refer to the variation of hand movement over a period of time (De Smedt et al., 2016, Lupinetti et al., 2020, Shi et al., 2021). Thus, for ...

  15. Data glove-based gesture recognition using CNN-BiLSTM model with

    As a novel form of human machine interaction (HMI), hand gesture recognition (HGR) has garnered extensive attention and research. The majority of HGR studies are based on visual systems, inevitably encountering challenges such as depth and occlusion. On the contrary, data gloves can facilitate data collection with minimal interference in complex environments, thus becoming a research focus in ...

  16. Real-time hand gesture recognition using multiple deep learning

    Human gesture recognition is one of the most challenging problems in computer vision, striving to analyze human gestures by machine. However, most of the literature on gesture recognition utilizes isolated data with only one gesture in one image or a video for classifying gestures. This work targets the identification of human gestures from the continuous stream of data input taken from a live ...

  17. (PDF) A systematic review on hand gesture recognition techniques

However, to focus the scope of the study, 465 papers were excluded. Only the hand gesture recognition works most relevant to the research questions, and the most well-organized papers, have been ...

  18. Hand Gesture Recognition Methods and Applications: A Literature Survey

A Review on Vision-Based Hand Gesture Recognition and Applications, ResearchGate, pp. 261-286. Tao Liu, Wen-gang Zhou, and Houqiang Li. 2016. ... Gesture recognition using data glove: an extreme learning machine method. In International Conference on Robotics and Biomimetics (ROBIO). ... Many research papers have been ...

  19. [2111.00038] On-device Real-time Hand Gesture Recognition

    We present an on-device real-time hand gesture recognition (HGR) system, which detects a set of predefined static gestures from a single RGB camera. The system consists of two parts: a hand skeleton tracker and a gesture classifier. We use MediaPipe Hands as the basis of the hand skeleton tracker, improve the keypoint accuracy, and add the estimation of 3D keypoints in a world metric space. We ...

  20. J. Imaging

    Many research papers have proposed recognition of sign language for deaf-mute people, using a glove-attached sensor worn on the hand that gives responses according to hand movement. Alternatively, it may involve uncovered hand interaction with the camera, using computer vision techniques to identify the gesture.

  21. A Structured and Methodological Review on Vision-Based Hand Gesture

    Researchers have recently focused their attention on vision-based hand gesture recognition. However, due to several constraints, achieving an effective vision-driven hand gesture recognition system in real time has remained a challenge. This paper aims to uncover the limitations faced in image acquisition through the use of cameras, image segmentation and tracking, feature extraction, and ...

  22. Hand Gesture Recognition: A Survey

In this paper we present a literature survey on Hand Gesture Recognition (HGR). Data acquisition methods such as cameras, wrist sensors, and hand gloves are now well established and of less concern; the emphasis has shifted to feature extraction from the available data and to the algorithms used to improve it. These processes have also been tested and in recent papers ...

  23. Research on the Hand Gesture Recognition Based on Deep Learning

    With the rapid development of computer vision, the demand for interaction between human and machine is becoming more and more extensive. Since hand gestures are able to express enriched information, the hand gesture recognition is widely used in robot control, intelligent furniture and other aspects. The paper realizes the segmentation of hand gestures by establishing the skin color model and ...

  24. P‐4.29: The Research on Virtual Reality Field Based on Gesture Recognition

    SID Symposium Digest of Technical Papers is an information display journal publishing short papers and poster session content from SID's annual symposium, Display Week. In recent years, gesture recognition technology has been increasingly used in the field of virtual reality. ... The Research on Virtual Reality Field Based on Gesture Recognition.

  25. [2406.19217] Think Step by Step: Chain-of-Gesture Prompting for Error

    Computer Science > Computer Vision and Pattern Recognition. arXiv:2406.19217 (cs) ... which utilizes transformer and attention architectures for gesture prompting, while the second, a Multi-Scale Temporal Reasoning module, employs a multi-stage temporal convolutional network with both slow and fast paths for temporal information extraction ...

  26. A multi-modal framework for continuous and isolated hand gesture

    Gesture recognition, having multitudinous applications in the real world, is one of the core areas of research in the field of human-computer interaction. In this paper, we propose a novel method for isolated and continuous hand gesture recognition utilizing the movement epenthesis detection and removal. For this purpose, the present work detects and removes the movement epenthesis frames from ...

  27. CVPR 2024 Announces Best Paper Award Winners

    SEATTLE, 19 June 2024 - Today, during the 2024 Computer Vision and Pattern Recognition (CVPR) Conference opening session, the CVPR Awards Committee announced the winners of its prestigious Best Paper Awards, which annually recognize top research in computer vision, artificial intelligence (AI), machine learning (ML), augmented, virtual and mixed reality (AR/VR/MR), deep learning, and much more.

  28. Self-assessment, Exhibition, and Recognition: a Review of ...

    As large language models (LLMs) appear to behave increasingly human-like in text-based interactions, more and more researchers become interested in investigating personality in LLMs. However, the diversity of psychological personality research and the rapid development of LLMs have led to a broad yet fragmented landscape of studies in this interdisciplinary field. Extensive studies across ...

  29. Burst Image Super-Resolution with Base Frame Selection

    Burst image super-resolution has been a topic of active research in recent years due to its ability to obtain a high-resolution image by using complementary information between multiple frames in the burst. In this work, we explore using burst shots with non-uniform exposures to confront real-world practical scenarios by introducing a new benchmark dataset, dubbed Non-uniformly Exposed Burst ...