Data Mining: Recently Published Documents


Distance Based Pattern Driven Mining for Outlier Detection in High Dimensional Big Dataset

Detection of outliers or anomalies is one of the vital issues in pattern-driven data mining. Outlier detection identifies the inconsistent behavior of individual objects. It is an important area of the data mining field, with several applications such as detecting credit card fraud, uncovering hacking, and discovering criminal activities. Tools are needed to uncover the critical information hidden in extensive data. This paper investigates a novel method for detecting cluster outliers in a multidimensional dataset, capable of identifying both the clusters and the outliers in datasets containing noise. The proposed method can detect the groups and outliers left by the clustering process, such as irregular sets of clusters (C) and outliers (O), to boost the results. Applying the algorithm to the dataset improved the results in terms of several parameters. For the comparative analysis, average accuracy and average recall were computed. The average accuracy of the existing COID algorithm is 74.05%, compared with 77.21% for the proposed algorithm; average recall is 81.19% and 89.51% for the existing and proposed algorithms, respectively, showing that the proposed method outperforms the existing COID algorithm.
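
The paper's COID specifics are not reproduced here, but the general distance-based idea — scoring each object by its distance to its nearest neighbors — can be sketched in a few lines (a minimal illustration with made-up points, not the authors' algorithm):

```python
import math

def knn_outlier_scores(points, k=2):
    """Score each point by its mean distance to its k nearest neighbors.
    Higher scores suggest likelier outliers (a generic distance-based
    sketch, not the paper's COID algorithm)."""
    scores = []
    for i, p in enumerate(points):
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if i != j)
        scores.append(sum(dists[:k]) / k)
    return scores

# Four tightly grouped points plus one isolated point (illustrative data).
points = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10)]
scores = knn_outlier_scores(points)
outlier = points[max(range(len(points)), key=scores.__getitem__)]  # (10, 10)
```

Real methods of this family add a cutoff (e.g., a score threshold or top-n selection) to turn the scores into an outlier set O.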

Implementation of Data Mining Technology in Bonded Warehouse Inbound and Outbound Goods Trade

For taxed goods, the actual freight is generally determined by multiplying the allocated freight per kilogram by the actual outgoing weight recorded against the order number on the outgoing bill. Because conventional logistics cannot keep pace with the rapid response that e-commerce orders demand, this work discusses the implementation of data-mining technology in bonded-warehouse inbound and outbound goods trade. Specifically, a bonded-warehouse decision-making system was developed, comprising a data warehouse, a conceptual model, an online analytical processing system, a human-computer interaction module, and a web data-sharing platform. The statistical query module can be used to perform statistics and queries on warehousing operations. After optimization of the entire warehousing business process, obtaining the actual freight takes only 19.1 hours, nearly one third less than before optimization. This study could create a better environment for the development of China's processing trade.
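
The freight rule described above is a simple product; a minimal sketch (the per-kilogram rate and weight are illustrative values, not figures from the paper):

```python
def actual_freight(allocated_freight_per_kg, outgoing_weight_kg):
    """Actual freight for taxed goods: the allocated freight per kg
    multiplied by the actual outgoing weight on the outgoing bill."""
    return allocated_freight_per_kg * outgoing_weight_kg

freight = actual_freight(2.5, 120)  # 2.5 per kg * 120 kg = 300.0
```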

Multi-objective economic load dispatch method based on data mining technology for large coal-fired power plants

User activity classification and domain-wise ranking through social interactions.

Twitter has gained significant prevalence among users across numerous domains, in the majority of countries, and among different age groups. It serves as a real-time micro-blogging service for communication and opinion sharing. Twitter shares its data for research and study purposes through open APIs, which makes it the most suitable source of data for social media analytics. Applying data mining and machine learning techniques to tweets is gaining increasing interest. The most prominent enigma in social media analytics is automatically identifying and ranking influencers. This research aims to detect users' topics of interest in social media and rank them by specific topics, domains, etc. A few hybrid parameters are also distinguished in this research, based on a post's content and metadata and a user's profile and network features, to capture different aspects of being influential for use in the ranking algorithm. The results show that the proposed approach is effective in both the classification and the ranking of individuals in a cluster.

A data mining analysis of COVID-19 cases in states of United States of America

Epidemic diseases can be extremely dangerous because of their hazardous influence. They may have negative effects on economies, businesses, the environment, humans, and the workforce. In this paper, some of the factors interrelated with the COVID-19 pandemic are examined using data-mining methodologies and approaches. The analysis uncovered rules and insights, and the performance of the data-mining algorithms was evaluated. According to the results, the JRip algorithm achieved the highest correct classification rate and the lowest root mean squared error (RMSE). Considering both the classification rate and the RMSE measure, JRip can be considered an effective method for understanding the factors related to coronavirus-caused deaths.
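
The two evaluation measures named above are straightforward to compute; a minimal sketch with made-up labels (JRip itself is a Weka rule learner and is not reimplemented here):

```python
def classification_rate(y_true, y_pred):
    """Fraction of correctly classified instances."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def rmse(y_true, y_pred):
    """Root mean squared error between true and predicted values."""
    n = len(y_true)
    return (sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / n) ** 0.5

# Illustrative binary labels: one misclassification out of four.
y_true = [1, 0, 1, 1]
y_pred = [1, 0, 0, 1]
rate = classification_rate(y_true, y_pred)  # 0.75
error = rmse(y_true, y_pred)                # 0.5
```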

Exploring distributed energy generation for sustainable development: A data mining approach

A comprehensive guideline for bengali sentiment annotation.

Sentiment Analysis (SA) is a Natural Language Processing (NLP) and Information Extraction (IE) task that primarily aims to obtain a writer's feelings, expressed positively or negatively, by analyzing a large number of documents. SA is also widely studied in the fields of data mining, web mining, text mining, and information retrieval. The fundamental task in sentiment analysis is to classify the polarity of given content as positive, negative, or neutral. Although extensive research has been conducted in this area of computational linguistics, most of it has been carried out in the context of the English language. Bengali sentiment expression, however, has varying degrees of sentiment labels, which can be plausibly distinct from English. Therefore, sentiment assessment for the Bengali language is undeniably important to develop and execute properly. In sentiment analysis, the predictive potential of an automatic model depends entirely on the quality of dataset annotation. Bengali sentiment annotation is a challenging task owing to the language's diversified structures (syntax) and its different degrees of innate sentiment (i.e., weakly and strongly positive/negative sentiments). Thus, in this article, we propose a novel and precise guideline for researchers, linguistic experts, and referees to annotate Bengali sentences immaculately, with a view to building effective datasets for automatic sentiment prediction.

Capturing Dynamics of Information Diffusion in SNS: A Survey of Methodology and Techniques

Studying information diffusion in SNS (Social Networking Services) has remarkable significance in both academia and industry. Theoretically, it boosts the development of other subjects such as statistics, sociology, and data mining. Practically, diffusion modeling provides fundamental support for many downstream applications (e.g., public opinion monitoring, rumor source identification, and viral marketing). Tremendous efforts have been devoted to this area to understand and quantify information diffusion dynamics. This survey investigates and summarizes the emerging distinguished works in diffusion modeling. We first put forward a unified information diffusion concept in terms of three components: information, user decision, and social vectors, followed by a detailed introduction of the methodologies for diffusion modeling. We then propose a new taxonomy adopting a hybrid philosophy (i.e., granularity and techniques) and present a series of comparative studies on elementary diffusion models under this taxonomy, covering their assumptions, methods, and pros and cons. We further summarize representative diffusion modeling in special scenarios and significant downstream tasks based on these elementary models. Finally, we discuss open issues in this field following the methodology of diffusion modeling.
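
One classic elementary diffusion model of the kind such surveys compare is the independent cascade model, in which each newly activated user gets a single chance to activate each neighbor. A minimal simulation (the toy follower graph is illustrative, not from the survey):

```python
import random

def independent_cascade(graph, seeds, p=0.3, seed=0):
    """One run of the independent cascade model: each newly activated
    user gets one chance to activate each neighbor with probability p."""
    rng = random.Random(seed)
    active = set(seeds)
    frontier = list(seeds)
    while frontier:
        newly_active = []
        for u in frontier:
            for v in graph.get(u, []):
                if v not in active and rng.random() < p:
                    active.add(v)
                    newly_active.append(v)
        frontier = newly_active
    return active

# Toy follower graph; with p = 1.0 every reachable user activates.
graph = {"a": ["b", "c"], "b": ["d"], "c": ["d"], "d": []}
spread = independent_cascade(graph, ["a"], p=1.0)
```

Running many such simulations and averaging the spread is the usual way to estimate a seed set's influence, e.g., for viral marketing.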

The Influence of E-book Teaching on the Motivation and Effectiveness of Learning Law by Using Data Mining Analysis

This paper studies the motivation for learning law, compares the teaching effectiveness of two different methods, e-book teaching and traditional teaching, and analyses the influence of e-book teaching on the effectiveness of learning law by using big-data analysis. From the perspective of law students' psychology, e-book teaching can attract students' attention, stimulate their interest in learning, deepen knowledge impressions while learning, expand knowledge, and ultimately improve performance on practical assessments. Given the small sample size, the representativeness of the results may be limited. The study has particular referential significance for stimulating the motivation to learn law, as well as other theoretical disciplines in colleges and universities, and provides ideas for the reform of teaching modes. This paper uses a decision tree algorithm from data mining for the analysis and identifies, from the students' perspective, the factors influencing law students' learning motivation and effectiveness.

Intelligent Data Mining based Method for Efficient English Teaching and Cultural Analysis

The emergence of online education has greatly improved traditional English teaching quality. However, it only moves the teaching process from offline to online and does not really change the essence of traditional English teaching. In this work, we study an intelligent English teaching method to further improve the quality of English teaching. Specifically, a random forest is first used to analyze and mine the grammatical and syntactic features of English text. Then, a decision tree based method is proposed to predict the grammar or syntax issues of the English text. The evaluation results indicate that the proposed method can effectively improve the accuracy of English grammar and syntax recognition.


50 selected papers in Data Mining and Machine Learning

Here is the list of 50 selected papers in Data Mining and Machine Learning. You can download them for your detailed reading and research. Enjoy!

Data Mining and Statistics: What’s the Connection?

Data Mining: Statistics and More?, D. Hand, American Statistician, 52(2):112-118.

Data Mining , G. Weiss and B. Davison, in Handbook of Technology Management, John Wiley and Sons, expected 2010.

From Data Mining to Knowledge Discovery in Databases, U. Fayyad, G. Piatetsky-Shapiro & P. Smyth, AI Magazine, 17(3):37-54, Fall 1996.

Mining Business Databases , Communications of the ACM, 39(11): 42-48.

10 Challenging Problems in Data Mining Research, Q. Yang and X. Wu, International Journal of Information Technology & Decision Making, Vol. 5, No. 4, 2006, 597-604.

The Long Tail , by Anderson, C., Wired magazine.

AOL’s Disturbing Glimpse Into Users’ Lives , by McCullagh, D., News.com, August 9, 2006

General Data Mining Methods and Algorithms

Top 10 Algorithms in Data Mining, X. Wu, V. Kumar, J.R. Quinlan, J. Ghosh, Q. Yang, H. Motoda, G.J. McLachlan, A. Ng, B. Liu, P.S. Yu, Z.-H. Zhou, M. Steinbach, D.J. Hand, D. Steinberg, Knowledge and Information Systems, 14:1-37, 2008.

Induction of Decision Trees , R. Quinlan, Machine Learning, 1(1):81-106, 1986.

Web and Link Mining

The Pagerank Citation Ranking: Bringing Order to the Web , L. Page, S. Brin, R. Motwani, T. Winograd, Technical Report, Stanford University, 1999.

The Structure and Function of Complex Networks , M. E. J. Newman, SIAM Review, 2003, 45, 167-256.

Link Mining: A New Data Mining Challenge , L. Getoor, SIGKDD Explorations, 2003, 5(1), 84-89.

Link Mining: A Survey , L. Getoor, SIGKDD Explorations, 2005, 7(2), 3-12.

Semi-supervised Learning

Semi-Supervised Learning Literature Survey , X. Zhu, Computer Sciences TR 1530, University of Wisconsin — Madison.

Introduction to Semi-Supervised Learning, in Semi-Supervised Learning (Chapter 1) O. Chapelle, B. Scholkopf, A. Zien (eds.), MIT Press, 2006. (Fordham’s library has online access to the entire text)

Learning with Labeled and Unlabeled Data , M. Seeger, University of Edinburgh (unpublished), 2002.

Person Identification in Webcam Images: An Application of Semi-Supervised Learning, M. Balcan, A. Blum, P. Choi, J. Lafferty, B. Pantano, M. Rwebangira, X. Zhu, Proceedings of the 22nd ICML Workshop on Learning with Partially Classified Training Data, 2005.

Learning from Labeled and Unlabeled Data: An Empirical Study across Techniques and Domains , N. Chawla, G. Karakoulas, Journal of Artificial Intelligence Research , 23:331-366, 2005.

Text Classification from Labeled and Unlabeled Documents using EM , K. Nigam, A. McCallum, S. Thrun, T. Mitchell, Machine Learning , 39, 103-134, 2000.

Self-taught Learning: Transfer Learning from Unlabeled Data , R. Raina, A. Battle, H. Lee, B. Packer, A. Ng, in Proceedings of the 24th International Conference on Machine Learning , 2007.

An iterative algorithm for extending learners to a semisupervised setting , M. Culp, G. Michailidis, 2007 Joint Statistical Meetings (JSM), 2007

Partially-Supervised Learning / Learning with Uncertain Class Labels

Get Another Label? Improving Data Quality and Data Mining Using Multiple, Noisy Labelers , V. Sheng, F. Provost, P. Ipeirotis, in Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining , 2008.

Logistic Regression for Partial Labels , in 9th International Conference on Information Processing and Management of Uncertainty in Knowledge-Based Systems , Volume III, pp. 1935-1941, 2002.

Classification with Partial labels , N. Nguyen, R. Caruana, in Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining , 2008.

Imprecise and Uncertain Labelling: A Solution based on Mixture Model and Belief Functions, E. Come, 2008 (powerpoint slides).

Induction of Decision Trees from Partially Classified Data Using Belief Functions, M. Bjanger, Norwegian University of Science and Technology, 2000.

Knowledge Discovery in Large Image Databases: Dealing with Uncertainties in Ground Truth , P. Smyth, M. Burl, U. Fayyad, P. Perona, KDD Workshop 1994, AAAI Technical Report WS-94-03, pp. 109-120, 1994.

Recommender Systems

Trust No One: Evaluating Trust-based Filtering for Recommenders , J. O’Donovan and B. Smyth, In Proceedings of the 19th International Joint Conference on Artificial Intelligence (IJCAI-05), 2005, 1663-1665.

Trust in Recommender Systems, J. O'Donovan and B. Smyth, In Proceedings of the 10th International Conference on Intelligent User Interfaces (IUI-05), 2005, 167-174.

Learning from Imbalanced Data

General resources available on this topic:

ICML 2003 Workshop: Learning from Imbalanced Data Sets II

AAAI ‘2000 Workshop on Learning from Imbalanced Data Sets

A Study of the Behavior of Several Methods for Balancing Machine Learning Training Data , G. Batista, R. Prati, and M. Monard, SIGKDD Explorations , 6(1):20-29, 2004.

Class Imbalance versus Small Disjuncts , T. Jo and N. Japkowicz, SIGKDD Explorations , 6(1): 40-49, 2004.

Extreme Re-balancing for SVMs: a Case Study , B. Raskutti and A. Kowalczyk, SIGKDD Explorations , 6(1):60-69, 2004.

A Multiple Resampling Method for Learning from Imbalanced Data Sets , A. Estabrooks, T. Jo, and N. Japkowicz, in Computational Intelligence , 20(1), 2004.

SMOTE: Synthetic Minority Over-sampling Technique, N. Chawla, K. Bowyer, L. Hall, and W. Kegelmeyer, Journal of Artificial Intelligence Research, 16:321-357, 2002.

Generative Oversampling for Mining Imbalanced Datasets, A. Liu, J. Ghosh, and C. Martin, Third International Conference on Data Mining (DMIN-07), 66-72.

Learning from Little: Comparison of Classifiers Given Little Training, G. Forman and I. Cohen, in 8th European Conference on Principles and Practice of Knowledge Discovery in Databases, 161-172, 2004.

Issues in Mining Imbalanced Data Sets – A Review Paper, S. Visa and A. Ralescu, in Proceedings of the Sixteenth Midwest Artificial Intelligence and Cognitive Science Conference, pp. 67-73, 2005.

Wrapper-based Computation and Evaluation of Sampling Methods for Imbalanced Datasets , N. Chawla, L. Hall, and A. Joshi, in Proceedings of the 1st International Workshop on Utility-based Data Mining , 24-33, 2005.

C4.5, Class Imbalance, and Cost Sensitivity: Why Under-Sampling beats Over-Sampling, C. Drummond and R. Holte, in ICML Workshop on Learning from Imbalanced Datasets II, 2003.

C4.5 and Imbalanced Data sets: Investigating the effect of sampling method, probabilistic estimate, and decision tree structure , N. Chawla, in ICML Workshop on Learning from Imbalanced Datasets II , 2003.

Class Imbalances: Are we Focusing on the Right Issue?, N. Japkowicz, in ICML Workshop on Learning from Imbalanced Datasets II , 2003.

Learning when Data Sets are Imbalanced and When Costs are Unequal and Unknown , M. Maloof, in ICML Workshop on Learning from Imbalanced Datasets II , 2003.

Uncertainty Sampling Methods for One-class Classifiers , P. Juszcak and R. Duin, in ICML Workshop on Learning from Imbalanced Datasets II , 2003.

Active Learning

Improving Generalization with Active Learning, D. Cohn, L. Atlas, and R. Ladner, Machine Learning, 15(2):201-221, May 1994.

On Active Learning for Data Acquisition , Z. Zheng and B. Padmanabhan, In Proc. of IEEE Intl. Conf. on Data Mining, 2002.

Active Sampling for Class Probability Estimation and Ranking , M. Saar-Tsechansky and F. Provost, Machine Learning 54:2 2004, 153-178.

The Learning-Curve Sampling Method Applied to Model-Based Clustering , C. Meek, B. Thiesson, and D. Heckerman, Journal of Machine Learning Research 2:397-418, 2002.

Active Sampling for Feature Selection , S. Veeramachaneni and P. Avesani, Third IEEE Conference on Data Mining, 2003.

Heterogeneous Uncertainty Sampling for Supervised Learning , D. Lewis and J. Catlett, In Proceedings of the 11th International Conference on Machine Learning, 148-156, 1994.

Learning When Training Data are Costly: The Effect of Class Distribution on Tree Induction , G. Weiss and F. Provost, Journal of Artificial Intelligence Research, 19:315-354, 2003.

Active Learning using Adaptive Resampling , KDD 2000, 91-98.

Cost-Sensitive Learning

Types of Cost in Inductive Concept Learning , P. Turney, In Proceedings Workshop on Cost-Sensitive Learning at the Seventeenth International Conference on Machine Learning.

Toward Scalable Learning with Non-Uniform Class and Cost Distributions: A Case Study in Credit Card Fraud Detection , P. Chan and S. Stolfo, KDD 1998.


  • Open access
  • Published: 11 August 2021

Data mining in clinical big data: the frequently used databases, steps, and methodological models

  • Wen-Tao Wu,
  • Yuan-Jie Li,
  • Ao-Zi Feng,
  • Tao Huang,
  • An-Ding Xu &
  • Jun Lyu (ORCID: orcid.org/0000-0002-2237-8771)

Military Medical Research, volume 8, Article number: 44 (2021)

43k Accesses, 218 Citations, 2 Altmetric

Many high-quality studies have emerged from public databases, such as the Surveillance, Epidemiology, and End Results (SEER) program, the National Health and Nutrition Examination Survey (NHANES), The Cancer Genome Atlas (TCGA), and the Medical Information Mart for Intensive Care (MIMIC); however, these data are often characterized by a high degree of dimensional heterogeneity, timeliness, scarcity, and irregularity, with the result that their value is not fully exploited. Data-mining technology has become a frontier field in medical research, as it shows excellent performance in evaluating patient risk and assisting clinical decision-making when building disease-prediction models. Data mining therefore has unique advantages in clinical big-data research, especially in large-scale medical public databases. This article introduces the main public medical databases and describes the steps, tasks, and models of data mining in simple language. Additionally, we describe data-mining methods along with their practical applications. The goal of this work is to help clinical researchers gain a clear and intuitive understanding of the application of data-mining technology to clinical big data, in order to promote the production of research results that benefit doctors and patients.

With the rapid development of computer software/hardware and internet technology, the amount of data has increased at an amazing speed. "Big data" as an abstract concept currently affects all walks of life [1], and although its importance has been recognized, its definition varies slightly from field to field. In computer science, big data refers to a dataset that cannot be perceived, acquired, managed, processed, or served within a tolerable time using traditional IT and software/hardware tools. More generally, big data refers to a dataset that exceeds the scope of the simple databases and data-processing architectures used in the early days of computing; it is characterized by high-volume, high-dimensional, rapidly updated data and represents a phenomenon that has emerged in the digital age. Across the medical industry, various types of medical data are generated at high speed, and trends indicate that applying big data in the medical field helps improve the quality of medical care and optimizes medical processes and management strategies [2, 3]. Currently, this trend is shifting from civilian medicine to military medicine. For example, the United States is exploring the potential of using one of its largest healthcare systems (the Military Healthcare System) to provide healthcare to eligible veterans, which could benefit more than 9 million eligible personnel [4]. Another data-management system has been developed to assess the physical and mental health of active-duty personnel and is expected to yield significant economic benefits to the military medical system [5]. However, in medical research, the wide variety of clinical data and the differences between medical concepts across classification standards result in a high degree of dimensional heterogeneity, timeliness, scarcity, and irregularity in existing clinical data [6, 7]. Furthermore, new data-analysis techniques have yet to be popularized in medical research [8]. These factors hinder the full realization of the value of existing data, and the intensive exploration of the value of clinical data remains a challenging problem.

Computer scientists have made outstanding contributions to the application of big data and introduced the concept of data mining to solve difficulties associated with such applications. Data mining (also known as knowledge discovery in databases) refers to the process of extracting potentially useful information and knowledge hidden in large amounts of incomplete, noisy, fuzzy, and random practical application data [9]. Unlike traditional research methods, several data-mining technologies mine information to discover knowledge without clear prior assumptions (i.e., they are applied directly, without a prior research design). The information obtained should be previously unknown, valid, and practical [9]. Data-mining technology does not aim to replace traditional statistical analysis techniques, but rather to extend and expand statistical analysis methodologies. From a practical point of view, machine learning (ML) is the main analytical method in data mining: models are trained on data and then used to predict outcomes. Given the rapid progress of data-mining technology and its excellent performance in other industries and fields, it has introduced new opportunities and prospects to clinical big-data research [10]. Large amounts of high-quality medical data are available to researchers in the form of public databases, which enable more researchers to participate in medical data mining in the hope that the generated results can further guide clinical practice.

This article provides an overview for medical researchers interested in applying data mining to clinical big data. To allow a clearer understanding of this application, the second part of this paper introduces the concept of public databases and summarizes those commonly used in medical research. The third part offers an overview of data mining, introducing the relevant models, tasks, and processes, and summarizes specific data-mining methods. The fourth and fifth parts introduce data-mining algorithms commonly used in clinical practice, along with specific cases, to help clinical researchers clearly and intuitively understand the application of data-mining technology to clinical big data. Finally, we discuss the advantages and disadvantages of data mining in clinical analysis and offer insight into possible future applications.

Overview of common public medical databases

A public database is a data repository dedicated to housing scientific research data on an open platform. Such databases collect and store heterogeneous, multi-dimensional health and medical research data in a structured form, and are characterized by massive scale, multiple ownership, complexity, and security requirements. These databases cover a wide range of data, including those related to cancer research, disease burden, nutrition and health, and genetics and the environment. Table 1 summarizes the main public medical databases [11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26]. Researchers can apply for access to data based on the scope of the database and the required application procedures in order to perform relevant medical research.

Data mining: an overview

Data mining is a multidisciplinary field at the intersection of database technology, statistics, ML, and pattern recognition that profits from all these disciplines [ 27 ]. Although this approach is not yet widespread in the field of medical research, several studies have demonstrated the promise of data mining in building disease-prediction models, assessing patient risk, and helping physicians make clinical decisions [ 28 , 29 , 30 , 31 ].

Data-mining models

Data mining has two kinds of models: descriptive and predictive. Predictive models are used to predict unknown or future values of variables of interest, whereas descriptive models are used to find human-interpretable patterns that describe the data [32].

Data-mining tasks

A model is usually implemented through a task. The goal of a descriptive task is to generalize the patterns of potential associations in the data; using a descriptive model therefore usually yields a few collections with the same or similar attributes. Prediction mainly refers to estimating the value of a specific attribute based on the values of other attributes, and includes classification and regression [33].

Data-mining methods

After the data-mining model and task are defined, the data-mining methods required to build the approach are chosen based on the discipline involved. The choice of method depends on whether dependent variables (labels) are present in the analysis. Predictions with dependent variables (labels) are generated through supervised learning, which can be performed using linear regression, generalized linear regression, a proportional hazards model (the Cox regression model), a competing risks model, decision trees, the random forest (RF) algorithm, and support vector machines (SVMs). In contrast, unsupervised learning involves no labels: the learning model infers some internal structure from the data. Common unsupervised learning methods include principal component analysis (PCA), association analysis, and clustering analysis.
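
The distinction can be made concrete with two toy examples — a least-squares fit where labels are available, and a two-group clustering step where they are not (illustrative data, plain Python rather than any particular library):

```python
# Supervised: labels ys are available; fit y = a*x + b by least squares.
xs = [1, 2, 3, 4]
ys = [2.1, 3.9, 6.2, 7.8]
n = len(xs)
mx, my = sum(xs) / n, sum(ys) / n
a = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
b = my - a * mx  # a is about 1.94, b about 0.15

# Unsupervised: no labels; assign unlabeled points to the nearer of two
# centers (a single k-means-style step from fixed initial centers).
data = [0.1, 0.2, 0.15, 5.0, 5.2]
c1, c2 = min(data), max(data)
clusters = [0 if abs(v - c1) <= abs(v - c2) else 1 for v in data]
```

The first block learns a mapping from inputs to a known outcome; the second merely uncovers structure (two groups) with no outcome variable at all.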

Data-mining algorithms for clinical big data

Data mining based on clinical big data can produce effective and valuable knowledge, which is essential for accurate clinical decision-making and risk assessment [ 34 ]. Data-mining algorithms enable realization of these goals.

Supervised learning

A concept often mentioned in supervised learning is the partitioning of datasets. To prevent overfitting, a dataset can generally be divided into two or three parts: a training set, a validation set, and a test set. Ripley [35] defined these parts as a set of examples used for learning and fitting the parameters (i.e., weights) of the classifier; a set of examples used to tune the hyperparameters (i.e., architecture) of the classifier; and a set of examples used only to assess the generalization performance of the fully specified classifier, respectively. Briefly, the training set is used to train the model or determine its parameters, the validation set is used for model selection, and the test set is used to verify model performance. In practice, data are often divided into only training and test sets, with no separate validation set. It should be emphasized that good results on the test set do not guarantee model correctness; they only show that the model obtains similar results on similar data. Therefore, the applicability of a model should be analysed in combination with the specific problems of the research. Classical statistical methods, such as linear regression, generalized linear regression, and proportional hazards models, have been widely used in medical research. Notably, most of these classical statistical methods impose certain data requirements or assumptions; however, when faced with complicated clinical data, assumptions about the data distribution are difficult to make. In contrast, some ML methods (algorithmic models) make no assumptions about the data and cross-validate their results; thus, they are likely to be favoured by clinical researchers [36]. For these reasons, this chapter focuses on ML methods that do not require assumptions about the data distribution, together with classical statistical methods used in specific situations.
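
The three-way partition described above can be sketched as a shuffle followed by slicing (the 60/20/20 proportions are illustrative defaults, not a prescription):

```python
import random

def train_val_test_split(records, val_frac=0.2, test_frac=0.2, seed=0):
    """Shuffle a dataset and partition it into training, validation,
    and test sets, as described above."""
    rng = random.Random(seed)
    shuffled = records[:]
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_frac)
    n_val = int(len(shuffled) * val_frac)
    test = shuffled[:n_test]
    val = shuffled[n_test:n_test + n_val]
    train = shuffled[n_test + n_val:]
    return train, val, test

# 100 illustrative patient record IDs -> 60 train, 20 validation, 20 test.
train, val, test = train_val_test_split(list(range(100)))
```

Setting `val_frac=0` reproduces the simpler two-way split that, as noted, is more common in practice.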

Decision tree

A decision tree is a basic classification and regression method that generates a result resembling the tree structure of a flowchart, where each internal node represents a test on an attribute, each branch represents an outcome of that test, each leaf node (decision node) represents a class or class distribution, and the topmost node is the root [ 37 ]. The decision tree model is called a classification tree when used for classification and a regression tree when used for regression. Studies have demonstrated the utility of the decision tree model in clinical applications. In a study on the prognosis of breast cancer patients, a decision tree model and a classical logistic regression model were compared, with the predictive performance of the two models indicating that the decision tree showed stronger predictive power on real clinical data [ 38 ]. Similarly, the decision tree model has been applied to other areas of clinical medicine, including the diagnosis of kidney stones [ 39 ], prediction of the risk of sudden cardiac arrest [ 40 ], and exploration of the risk factors of type II diabetes [ 41 ]. A common feature of these studies is the use of a decision tree model to explore the interactions between variables and classify subjects into homogeneous categories based on their observed characteristics. In fact, because the decision tree accounts for strong interactions between variables, it is well suited to decision algorithms that follow the same structure [ 42 ]. In the construction of clinical prediction models and the exploration of disease risk factors and patient prognosis, the decision tree model might offer more advantages and practical value than some classical algorithms. Despite these advantages, the decision tree recursively separates observations into branches to construct the tree; therefore, when the data are imbalanced, the precision of decision tree models needs improvement.
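As an illustration of how a tree node "tests an attribute", the sketch below finds the single threshold on one numeric attribute that minimizes Gini impurity, a common splitting criterion in CART-style trees. The tumour-size data are hypothetical:

```python
def gini(labels):
    """Gini impurity of a set of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    counts = {}
    for y in labels:
        counts[y] = counts.get(y, 0) + 1
    return 1.0 - sum((c / n) ** 2 for c in counts.values())

def best_split(xs, ys):
    """Exhaustively choose the threshold that minimizes the weighted
    Gini impurity of the two resulting branches."""
    best = (None, float("inf"))
    for t in sorted(set(xs)):
        left = [y for x, y in zip(xs, ys) if x <= t]
        right = [y for x, y in zip(xs, ys) if x > t]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(ys)
        if score < best[1]:
            best = (t, score)
    return best

# Tumour size (cm) vs. a binary outcome -- hypothetical toy data.
sizes = [1.0, 1.5, 2.0, 3.5, 4.0, 5.0]
labels = [0, 0, 0, 1, 1, 1]
threshold, impurity = best_split(sizes, labels)
```

A full tree-growing algorithm would apply this search recursively to each branch until a stopping rule is met.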

The RF method

The RF algorithm was developed as an application of ensemble learning based on a collection of decision trees. The bootstrap method [ 43 ] is used to randomly draw sample sets from the training set; the decision trees generated from these bootstrap samples constitute a “random forest”, and predictions are derived from an ensemble average or majority vote. The biggest advantage of the RF method is that random sampling of predictor variables at each decision tree node decreases the correlation among the trees in the forest, thereby improving the precision of the ensemble predictions [ 44 ]. Given that a single decision tree model might encounter the problem of overfitting [ 45 ], the initial application of RF minimized overfitting in classification and regression and improved predictive accuracy [ 44 ]. Taylor et al. [ 46 ] highlighted the potential of RF in correctly differentiating in-hospital mortality in patients experiencing sepsis after admission to the emergency department. Nowhere in the healthcare system is the need to find methods that reduce uncertainty more pressing than in the fast, chaotic environment of the emergency department. The authors demonstrated that the predictive performance of the RF method was superior to that of traditional emergency medicine methods, and that the method enabled evaluation of more clinical variables than traditional modelling, which subsequently allowed the discovery of clinical variables that were not expected to be of predictive value or that would otherwise have been omitted as rare predictors [ 46 ]. Another study based on the Medical Information Mart for Intensive Care (MIMIC) II database [ 47 ] found that RF had excellent predictive power regarding intensive care unit (ICU) mortality [ 48 ]. These studies showed that applying RF to big data stored in hospital healthcare systems provides a new data-driven method for predictive analysis in critical care.
Additionally, random survival forests have recently been developed to analyse survival data, especially right-censored survival data [ 49 , 50 ], which can help researchers conduct survival analyses in clinical oncology and help develop personalized treatment regimens that benefit patients [ 51 ].
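The two ingredients described above, bootstrap resampling and majority voting, can be sketched as follows. The one-level "stump" learner and the toy lab-value data are simplifications standing in for full decision trees and real clinical variables:

```python
import random

def train_stump(sample):
    """One-level 'tree': pick the (feature, threshold) pair with the
    fewest errors on the bootstrap sample. Stands in for a full tree."""
    best = None
    for f in range(len(sample[0][0])):
        for t in sorted({x[f] for x, _ in sample}):
            err = sum(int(x[f] > t) != y for x, y in sample)
            if best is None or err < best[0]:
                best = (err, f, t)
    _, f, t = best
    return lambda x: int(x[f] > t)

def random_forest(data, n_trees=25, seed=0):
    """Bootstrap-resample the training set, fit one tree per sample,
    and predict by majority vote over the ensemble."""
    rng = random.Random(seed)
    trees = []
    for _ in range(n_trees):
        sample = [rng.choice(data) for _ in data]  # draw n with replacement
        trees.append(train_stump(sample))
    def predict(x):
        votes = sum(tree(x) for tree in trees)
        return int(votes * 2 >= len(trees))
    return predict

# Hypothetical toy data: two lab values, binary outcome.
data = [((1.0, 0.2), 0), ((1.2, 0.3), 0), ((0.9, 0.4), 0),
        ((2.5, 1.1), 1), ((2.8, 1.3), 1), ((3.0, 0.9), 1)]
model = random_forest(data)
```

A real RF additionally samples a random subset of predictor variables at each node, which is what decorrelates the trees.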

The SVM method

The SVM is a classification and prediction method developed by Cortes and Vapnik that represents a data-driven approach requiring no assumptions about data distribution [ 52 ]. The core purpose of an SVM is to identify a separation boundary (called a hyperplane) to help classify cases; thus, the advantages of SVMs are most obvious when classifying and predicting cases based on high dimensional data or data with a small sample size [ 53 , 54 ].

In a study of drug compliance in patients with heart failure, researchers used an SVM to build a predictive model for patient compliance in order to overcome the problem of a large number of input variables relative to the number of available observations [ 55 ]. Additionally, the mechanisms of certain chronic and complex diseases observed in clinical practice remain unclear, and many risk factors, including gene–gene and gene–environment interactions, must be considered in research on such diseases [ 55 , 56 ]. SVMs are capable of addressing these issues. Yu et al. [ 54 ] applied an SVM to predict diabetes onset based on data from the National Health and Nutrition Examination Survey (NHANES). Such models also have strong discrimination ability, making SVMs a promising classification approach for detecting individuals with chronic and complex diseases. A disadvantage of SVMs, however, is that when the number of observations is large, training becomes time- and resource-intensive.
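A minimal sketch of the idea behind a linear SVM, fitted here by subgradient descent on the regularized hinge loss (one of several ways to train such a model, not the method used in the studies above). The biomarker data, learning rate, and regularization strength are illustrative assumptions:

```python
def train_linear_svm(data, lam=0.01, epochs=200, lr=0.1):
    """Search for a separating hyperplane w.x + b = 0 with a wide margin
    by taking subgradient steps on the regularized hinge loss."""
    dim = len(data[0][0])
    w = [0.0] * dim
    b = 0.0
    for _ in range(epochs):
        for x, y in data:                      # y must be +1 or -1
            margin = y * (sum(wi * xi for wi, xi in zip(w, x)) + b)
            if margin < 1:                     # inside margin: hinge gradient
                w = [wi + lr * (y * xi - lam * wi) for wi, xi in zip(w, x)]
                b += lr * y
            else:                              # correct side: only shrink w
                w = [wi * (1 - lr * lam) for wi in w]
    return w, b

def svm_predict(w, b, x):
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b >= 0 else -1

# Hypothetical toy data: two biomarkers, outcome coded +1/-1.
data = [((2.0, 2.5), 1), ((2.2, 3.0), 1), ((3.0, 2.8), 1),
        ((0.5, 0.4), -1), ((1.0, 0.2), -1), ((0.8, 0.9), -1)]
w, b = train_linear_svm(data)
```

Kernel SVMs extend the same margin idea to nonlinear boundaries, which is what makes the method attractive for high dimensional data.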

Competitive risk model

Kaplan–Meier estimation and the Cox proportional hazards model are widely used in survival analysis in clinical studies. Classical survival analysis usually considers only one endpoint, such as patient survival time. In clinical medical research, however, multiple endpoints often coexist, and these endpoints compete with one another to generate competing-risk data [ 57 ]. In the case of multiple endpoint events, using a single-endpoint analysis method can lead to a biased estimation of the probability of endpoint events because of the existence of competing risks [ 58 ]. The competitive risk model is a classical statistical model based on hypotheses about the data distribution. Its main advantage is its accurate estimation of the cumulative incidence of outcomes for right-censored survival data with multiple endpoints [ 59 ]. In data analysis, the cumulative risk rate is estimated using the cumulative incidence function in single-factor analysis, and Gray’s test is used for between-group comparisons [ 60 ].
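The cumulative incidence function mentioned above can be estimated nonparametrically, in the style of the Aalen–Johansen estimator. The follow-up data below (cause 1 vs. a competing cause, with censoring) are hypothetical:

```python
def cumulative_incidence(times, events, cause):
    """Estimate the cumulative incidence function (CIF) for one competing
    cause. `events[i]` is 0 for censoring, otherwise a cause code.
    CIF_k(t) = sum over event times t_i <= t of S(t_i-) * d_k(t_i) / n(t_i),
    where S is all-cause Kaplan-Meier survival just before t_i."""
    order = sorted(range(len(times)), key=lambda i: times[i])
    n_at_risk = len(times)
    surv = 1.0          # all-cause survival S(t-)
    cif = 0.0
    curve = []
    i = 0
    while i < len(order):
        t = times[order[i]]
        tied = [j for j in order[i:] if times[j] == t]   # handle tied times
        d_all = sum(1 for j in tied if events[j] != 0)
        d_k = sum(1 for j in tied if events[j] == cause)
        cif += surv * d_k / n_at_risk
        surv *= 1 - d_all / n_at_risk
        n_at_risk -= len(tied)
        i += len(tied)
        curve.append((t, cif))
    return curve

# Hypothetical follow-up: cause 1 = cancer death, cause 2 = other death.
times = [2, 3, 5, 7, 8, 10]
events = [1, 2, 1, 0, 1, 2]   # 0 = censored
cif_cancer = cumulative_incidence(times, events, cause=1)
```

Note that, unlike 1 minus a cause-specific Kaplan–Meier curve, this estimate never overstates the incidence of the cause of interest when competing events remove subjects from risk.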

Multifactor analysis uses the Fine-Gray and cause-specific (CS) risk models to explore the cumulative risk rate [ 61 ]. The difference between the Fine-Gray and CS models is that the former is applicable to establishing a clinical prediction model and predicting the risk of a single endpoint of interest [ 62 ], whereas the latter is suitable for answering etiological questions, where the regression coefficient reflects the relative effect of covariates on the increased incidence of the main endpoint in the target event-free risk set [ 63 ]. Currently, in databases with CS records, such as Surveillance, Epidemiology, and End Results (SEER), competitive risk models exhibit good performance in exploring disease-risk factors and prognosis [ 64 ]. A study of prognosis in patients with oesophageal cancer from SEER showed that Cox proportional risk models might misestimate the effects of age and disease location on patient prognosis, whereas competitive risk models provide more accurate estimates of factors affecting patient prognosis [ 65 ]. In another study of the prognosis of penile cancer patients, researchers found that using a competitive risk model was more helpful in developing personalized treatment plans [ 66 ].

Unsupervised learning

In many data-analysis settings, the amount of usable labelled data is small, and labelling data is a tedious process [ 67 ]. Unsupervised learning is needed to judge and categorize data according to similarities, characteristics, and correlations, and it has three main applications: data clustering, association analysis, and dimensionality reduction. Accordingly, the unsupervised learning methods introduced in this section are clustering analysis, association rules, and PCA.

Clustering analysis

A classification algorithm needs to “know” information concerning each category in advance, with all of the data to be classified having corresponding categories. When these conditions cannot be met, cluster analysis can be applied [ 68 ]. Clustering places similar objects into the same categories or subsets through a process of static classification, so that objects in the same subset have similar properties. Many kinds of clustering techniques exist; here, we introduce the four most commonly used.

Partition clustering

The core idea of this method is to treat the centre of the data points as the centre of the corresponding cluster. The k-means method [ 69 ] is a representative example: it takes n observations and an integer k and outputs a partition of the n observations into k sets such that each observation belongs to the cluster with the nearest mean [ 70 ]. The k-means method exhibits low time complexity and high computing efficiency, but it handles high dimensional data poorly and cannot identify nonspherical clusters.
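A bare-bones k-means sketch on hypothetical two-dimensional data; the seed, initialization scheme, and iteration count are arbitrary choices:

```python
import random

def kmeans(points, k, iters=50, seed=1):
    """Plain k-means: assign each point to its nearest centre, move each
    centre to the mean of its assigned points, and repeat."""
    rng = random.Random(seed)
    centres = rng.sample(points, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            idx = min(range(k),
                      key=lambda c: sum((a - b) ** 2
                                        for a, b in zip(p, centres[c])))
            clusters[idx].append(p)
        centres = [tuple(sum(dim) / len(cl) for dim in zip(*cl)) if cl
                   else centres[i]          # keep centre if cluster empties
                   for i, cl in enumerate(clusters)]
    return centres, clusters

# Two well-separated hypothetical patient groups in a 2-D feature space.
points = [(0.0, 0.1), (0.2, 0.0), (0.1, 0.3),
          (5.0, 5.2), (5.1, 4.9), (4.8, 5.0)]
centres, clusters = kmeans(points, k=2)
```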

Hierarchical clustering

The hierarchical clustering algorithm decomposes a dataset hierarchically to facilitate subsequent clustering [ 71 ]. Common algorithms for hierarchical clustering include BIRCH [ 72 ], CURE [ 73 ], and ROCK [ 74 ]. The algorithm starts by treating every point as a cluster and then merges clusters according to their closeness. The grouping process ends when further merging would produce undesirable results or when only one cluster remains. This method has wide applicability, and the relationships between clusters are easy to detect; however, its time complexity is high [ 75 ].
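The merge-closest-clusters idea can be sketched with single linkage, one of several common linkage rules (BIRCH, CURE, and ROCK use more elaborate cluster representations to scale to large data). The points are hypothetical:

```python
def single_linkage(points, n_clusters):
    """Agglomerative clustering: start with every point as its own cluster
    and repeatedly merge the two closest clusters (single linkage)."""
    clusters = [[p] for p in points]

    def dist(a, b):
        # single linkage: distance between the closest pair of members
        return min(sum((x - y) ** 2 for x, y in zip(p, q))
                   for p in a for q in b)

    while len(clusters) > n_clusters:
        i, j = min(((i, j)
                    for i in range(len(clusters))
                    for j in range(i + 1, len(clusters))),
                   key=lambda ij: dist(clusters[ij[0]], clusters[ij[1]]))
        clusters[i] += clusters.pop(j)
    return clusters

points = [(0.0, 0.0), (0.1, 0.2), (0.3, 0.1),
          (4.0, 4.1), (4.2, 3.9)]
clusters = single_linkage(points, n_clusters=2)
```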

Clustering according to density

The density algorithm defines areas presenting a high degree of data density as belonging to the same cluster [ 76 ]. This approach aims to find arbitrarily shaped clusters, with the most representative algorithm being DBSCAN [ 77 ]. In practice, DBSCAN does not need the number of clusters as an input and can handle clusters of various shapes; however, its time complexity is high, the quality of the clusters decreases when data density is irregular, and it does not handle high dimensional data well [ 75 ].
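A minimal DBSCAN sketch: note that the number of clusters is never supplied, only a neighbourhood radius and a density threshold. The points are hypothetical:

```python
def dbscan(points, eps, min_pts):
    """Minimal DBSCAN: points with at least `min_pts` neighbours (self
    included) within `eps` are core points; clusters grow outward from
    cores, and points reachable from no core are labelled noise (-1)."""
    def neighbours(i):
        return [j for j in range(len(points))
                if sum((a - b) ** 2
                       for a, b in zip(points[i], points[j])) <= eps ** 2]

    labels = [None] * len(points)
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        nbrs = neighbours(i)
        if len(nbrs) < min_pts:
            labels[i] = -1                 # noise (may be claimed later)
            continue
        cluster += 1
        labels[i] = cluster
        queue = list(nbrs)
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster        # border point joins the cluster
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_nbrs = neighbours(j)
            if len(j_nbrs) >= min_pts:     # j is also a core point: expand
                queue.extend(j_nbrs)
    return labels

points = [(0.0, 0.0), (0.1, 0.1), (0.2, 0.0), (0.1, 0.2),
          (3.0, 3.0), (3.1, 3.1), (3.0, 3.2), (9.0, 9.0)]
labels = dbscan(points, eps=0.5, min_pts=3)
```

The isolated point at (9.0, 9.0) is labelled noise, which a partition method like k-means would instead force into one of the clusters.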

Clustering according to a grid

Neither partition nor hierarchical clustering can identify clusters with nonconvex shapes. Although a density-based algorithm can accomplish this task, its time complexity is high. To address this problem, data-mining researchers proposed grid-based algorithms that change the original data space into a grid structure of a certain size. A representative algorithm is STING, which divides the data space into several square cells according to different resolutions and clusters the data at different structural levels [ 78 ]. The main advantage of this method is its high processing speed, which depends only on the number of units in each dimension of the quantized space.
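A simplified grid-based sketch in the spirit of STING (the multi-resolution cell statistics of the real algorithm are omitted): quantize the space into square cells, keep the dense ones, and merge adjacent dense cells into clusters. The data are hypothetical:

```python
from collections import defaultdict

def grid_cluster(points, cell_size, min_count):
    """Quantize 2-D points into square cells, mark cells holding at least
    `min_count` points as dense, then flood-fill over 8-connected dense
    cells to form clusters."""
    cells = defaultdict(list)
    for p in points:
        cells[(int(p[0] // cell_size), int(p[1] // cell_size))].append(p)
    dense = {c for c, pts in cells.items() if len(pts) >= min_count}

    clusters, seen = [], set()
    for cell in dense:
        if cell in seen:
            continue
        group, stack = [], [cell]
        seen.add(cell)
        while stack:
            cx, cy = stack.pop()
            group.extend(cells[(cx, cy)])
            for dx in (-1, 0, 1):          # visit the 8 neighbouring cells
                for dy in (-1, 0, 1):
                    nb = (cx + dx, cy + dy)
                    if nb in dense and nb not in seen:
                        seen.add(nb)
                        stack.append(nb)
        clusters.append(group)
    return clusters

points = [(0.1, 0.1), (0.2, 0.3), (0.4, 0.2),
          (5.1, 5.2), (5.3, 5.1), (5.2, 5.4)]
clusters = grid_cluster(points, cell_size=1.0, min_count=2)
```

Because all work after quantization happens on cells rather than points, the cost depends only on the number of cells, which is the speed advantage the text describes.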

In clinical studies, subjects tend to be actual patients. Although researchers adopt complex inclusion and exclusion criteria before determining the subjects to be included in the analyses, heterogeneity among different patients cannot be avoided [ 79 , 80 ]. The most common application of cluster analysis in clinical big data is in classifying heterogeneous mixed groups into homogeneous groups according to the characteristics of existing data (i.e., “subgroups” of patients or observed objects are identified) [ 81 , 82 ]. This new information can then be used in the future to develop patient-oriented medical-management strategies. Docampo et al. [ 81 ] used hierarchical clustering to reduce heterogeneity and identify subgroups of clinical fibromyalgia, which aided the evaluation and management of fibromyalgia. Additionally, Guo et al. [ 83 ] used k-means clustering to divide patients with essential hypertension into four subgroups, which revealed that the potential risk of coronary heart disease differed between different subgroups. On the other hand, density- and grid-based clustering algorithms have mostly been used to process large numbers of images generated in basic research and clinical practice, with current studies focused on developing new tools to help clinical research and practices based on these technologies [ 84 , 85 ]. Cluster analysis will continue to have extensive application prospects along with the increasing emphasis on personalized treatment.

Association rules

Association rules discover interesting associations and correlations between item sets in large amounts of data. These rules were first proposed by Agrawal et al. [ 86 ] and applied to analyse customer buying habits to help retailers create sales plans. Data-mining based on association rules identifies association rules in a two-step process: 1) all high frequency items in the collection are listed and 2) frequent association rules are generated based on the high frequency items [ 87 ]. Therefore, before association rules can be obtained, sets of frequent items must be calculated using certain algorithms. The Apriori algorithm is based on the a priori principle of finding all relevant adjustment items in a database transaction that meet a minimum set of rules and restrictions or other restrictions [ 88 ]. Other algorithms are mostly variants of the Apriori algorithm [ 64 ]. The Apriori algorithm must scan the entire database every time it scans the transaction; therefore, algorithm performance deteriorates as database size increases [ 89 ], making it potentially unsuitable for analysing large databases. The frequent pattern (FP) growth algorithm was proposed to improve efficiency. After the first scan, the FP algorithm compresses the frequency set in the database into a FP tree while retaining the associated information and then mines the conditional libraries separately [ 90 ]. Association-rule technology is often used in medical research to identify association rules between disease risk factors (i.e., exploration of the joint effects of disease risk factors and combinations of other risk factors). For example, Li et al. [ 91 ] used the association-rule algorithm to identify the most important stroke risk factor as atrial fibrillation, followed by diabetes and a family history of stroke. Based on the same principle, association rules can also be used to evaluate treatment effects and other aspects. For example, Guo et al. 
[ 92 ] used the FP algorithm to generate association rules and evaluate the individual characteristics and treatment effects of patients with diabetes, thereby reducing the readmission rate of patients with diabetes. Association rules reveal a connection between premises and conclusions; however, reasonable and reliable application of this information can only be achieved through validation by experienced medical professionals and through extensive causal research [ 92 ].
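The two-step process described above begins with frequent-itemset mining; a compact Apriori sketch on hypothetical co-occurring risk factors per patient record (the factor names and the support threshold are illustrative):

```python
from itertools import combinations

def apriori(transactions, min_support):
    """Level-wise Apriori: count 1-itemsets, keep the frequent ones, and
    extend only candidates whose every subset is already frequent."""
    n = len(transactions)
    tsets = [frozenset(t) for t in transactions]

    def support(itemset):
        return sum(1 for t in tsets if itemset <= t) / n

    items = sorted({i for t in tsets for i in t})
    current = {frozenset([i]) for i in items
               if support(frozenset([i])) >= min_support}
    result = {}
    k = 1
    while current:
        result.update({s: support(s) for s in current})
        k += 1
        candidates = {a | b for a in current for b in current
                      if len(a | b) == k}
        # a priori pruning: every (k-1)-subset must already be frequent
        current = {c for c in candidates
                   if all(frozenset(sub) in result
                          for sub in combinations(c, k - 1))
                   and support(c) >= min_support}
    return result

# Hypothetical risk factors recorded per patient.
transactions = [{"hypertension", "diabetes", "smoking"},
                {"hypertension", "diabetes"},
                {"hypertension", "smoking"},
                {"diabetes", "smoking"},
                {"hypertension", "diabetes", "obesity"}]
frequent = apriori(transactions, min_support=0.6)
```

Rules such as "hypertension → diabetes" would then be derived from the frequent itemsets by comparing supports; the FP-growth algorithm reaches the same itemsets without the repeated database scans.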

PCA

PCA is a widely used data-mining method that aims to reduce data dimensionality in an interpretable way while retaining most of the information present in the data [ 93 , 94 ]. The main purpose of PCA is descriptive: it requires no assumptions about data distribution and is, therefore, an adaptive and exploratory method. The main steps of PCA are standardization of the original data, calculation of the correlation coefficient matrix, calculation of its eigenvalues and eigenvectors, selection of principal components, and calculation of a comprehensive evaluation value. PCA does not often appear as a standalone method, as it is usually combined with other statistical methods [ 95 ]. In practical clinical studies, the existence of multicollinearity often biases multivariate analysis. A feasible solution is to construct a regression model by PCA, which replaces the original independent variables with the principal components as new independent variables for regression analysis; this is most commonly seen in the analysis of dietary patterns in nutritional epidemiology [ 96 ]. In a study of socioeconomic status and child-developmental delays, PCA was used to derive a new variable (the household wealth index) from a series of household property reports and incorporate it as the main analytical variable in a logistic regression model [ 97 ]. Additionally, PCA can be combined with cluster analysis. Burgel et al. [ 98 ] used PCA to transform clinical data to address the lack of independence among the variables used to explore the heterogeneity of subtypes of chronic obstructive pulmonary disease. Therefore, in the study of subtypes and heterogeneity of clinical diseases, PCA can eliminate noisy variables that might otherwise corrupt the cluster structure, thereby increasing the accuracy of clustering results [ 98 , 99 ].
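The standardization and eigenvector steps at the heart of PCA can be sketched with power iteration, which recovers the first principal component without a full eigendecomposition. The two correlated variables below are hypothetical:

```python
def first_principal_component(rows, iters=200):
    """Standardize the data, build the correlation matrix, and run power
    iteration: repeatedly multiplying a vector by the matrix makes it
    converge to the leading eigenvector (the first principal component)."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[c] for r in rows) / n for c in range(d)]
    sds = [max((sum((r[c] - means[c]) ** 2 for r in rows) / n) ** 0.5, 1e-12)
           for c in range(d)]
    z = [[(r[c] - means[c]) / sds[c] for c in range(d)] for r in rows]
    corr = [[sum(z[i][a] * z[i][b] for i in range(n)) / n for b in range(d)]
            for a in range(d)]
    v = [1.0] * d
    for _ in range(iters):
        w = [sum(corr[a][b] * v[b] for b in range(d)) for a in range(d)]
        norm = sum(x * x for x in w) ** 0.5
        v = [x / norm for x in w]
    return v

# Two strongly correlated hypothetical variables.
rows = [(1.0, 2.1), (2.0, 3.9), (3.0, 6.2), (4.0, 8.1), (5.0, 9.8)]
pc1 = first_principal_component(rows)
```

For two positively correlated standardized variables the leading eigenvector loads both equally, so the component score is essentially their shared variation; subsequent components would be found the same way after removing (deflating) this one.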

The data-mining process and examples of its application using common public databases

Open-access databases have the advantages of large volumes of data, wide data coverage, rich data information, and a cost-efficient method of research, making them beneficial to medical researchers. In this chapter, we introduced the data-mining process and methods and their application in research based on examples of utilizing public databases and data-mining algorithms.

The data-mining process

Figure  1 shows a series of research concepts. The data-mining process is divided into several steps: (1) database selection according to the research purpose; (2) data extraction and integration, including downloading the required data and combining data from multiple sources; (3) data cleaning and transformation, including removal of incorrect data, filling in missing data, generating new variables, converting data format, and ensuring data consistency; (4) data mining, involving extraction of implicit relational patterns through traditional statistics or ML; (5) pattern evaluation, which focuses on the validity parameters and values of the relationship patterns of the extracted data; and (6) assessment of the results, involving translation of the extracted data-relationship model into comprehensible knowledge made available to the public.
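Steps (3) and (4) of the process can be made concrete with a toy sketch; the records and the derived BMI variable are hypothetical illustrations, not drawn from any of the databases discussed:

```python
# Hypothetical extracted records (step 2 output); one has a missing value.
raw = [
    {"id": 1, "weight_kg": 70, "height_m": 1.75, "outcome": 0},
    {"id": 2, "weight_kg": 95, "height_m": 1.70, "outcome": 1},
    {"id": 3, "weight_kg": None, "height_m": 1.80, "outcome": 0},
]

# Step 3a, cleaning: remove records with missing measurements.
clean = [r for r in raw if all(v is not None for v in r.values())]

# Step 3b, transformation: generate a new variable from existing ones.
for r in clean:
    r["bmi"] = round(r["weight_kg"] / r["height_m"] ** 2, 1)

# Step 4, mining: extract a simple pattern (mean BMI per outcome group),
# standing in for the statistical or ML modelling described in the text.
groups = {}
for r in clean:
    groups.setdefault(r["outcome"], []).append(r["bmi"])
pattern = {k: sum(v) / len(v) for k, v in groups.items()}
```

Steps 5 and 6 would then assess whether such a pattern is valid (e.g., on held-out data) and translate it into usable knowledge.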

Figure 1. The steps of data mining in medical public databases

Examples of data-mining applied using public databases

Establishment of warning models for the early prediction of disease.

A previous study identified sepsis as a major cause of death in ICU patients [ 100 ]. The authors noted that previously developed predictive models used a limited number of variables and that model performance required improvement. The data-mining process applied to address these issues was as follows: (1) data selection using the MIMIC III database; (2) extraction and integration of three types of data, including multivariate features (demographic information and clinical biochemical indicators), time-series data (temperature, blood pressure, and heart rate), and clinical latent features (various disease-related scores); (3) data cleaning and transformation, including fixing irregular time-series measurements, imputing missing values, deleting outliers, and addressing data imbalance; (4) data mining using logistic regression, a decision tree, the RF algorithm, an SVM, and an ensemble algorithm (a combination of multiple classifiers) to establish the prediction model; (5) pattern evaluation using sensitivity, precision, and the area under the receiver operating characteristic curve to evaluate model performance; and (6) evaluation of the results, in this case the potential to predict the prognosis of patients with sepsis and whether the model outperformed current scoring systems.

Exploring prognostic risk factors in cancer patients

Wu et al. [ 101 ] noted that traditional survival-analysis methods often ignore the influence of competing risk events, such as suicide and car accidents, on outcomes, leading to deviations and misjudgements in estimating the effects of risk factors. They used the SEER database, which offers cause-of-death data for cancer patients, and a competitive risk model to address this problem according to the following process: (1) data were obtained from the SEER database; (2) the demography, clinical characteristics, treatment modality, and cause of death of cecum cancer patients were extracted from the database; (3) patient records missing demographic, clinical, therapeutic, or cause-of-death variables were deleted; (4) Cox regression and two kinds of competitive risk models were applied for survival analysis; (5) the results were compared among the three models; and (6) the results revealed that for survival data with multiple endpoints, the competitive risk model was more favourable.

Derivation of dietary patterns

A study by Martínez Steele et al. [ 102 ] applied PCA for nutritional epidemiological analysis to determine dietary patterns and evaluate the overall nutritional quality of the population based on those patterns. Their process involved the following: (1) data were extracted from the NHANES database covering the years 2009–2010; (2) demographic characteristics and two 24 h dietary recall interviews were obtained; (3) data were weighted and excluded based on subjects not meeting specific criteria; (4) PCA was used to determine dietary patterns in the United States population, and Gaussian regression and restricted cubic splines were used to assess associations between ultra-processed foods and nutritional balance; (5) eigenvalues, scree plots, and the interpretability of the principal components were reviewed to screen and evaluate the results; and (6) the results revealed a negative association between ultra-processed food intake and overall dietary quality. Their findings indicated that a nutritionally balanced eating pattern was characterized by a diet high in fibre, potassium, magnesium, and vitamin C intake along with low sugar and saturated fat consumption.

The use of “big data” has changed multiple aspects of modern life, with its use combined with data-mining methods capable of improving the status quo [ 86 ]. The aim of this study was to aid clinical researchers in understanding the application of data-mining technology on clinical big data and public medical databases to further their research goals in order to benefit clinicians and patients. The examples provided offer insight into the data-mining process applied for the purposes of clinical research. Notably, researchers have raised concerns that big data and data-mining methods were not a perfect fit for adequately replicating actual clinical conditions, with the results potentially capable of misleading doctors and patients [ 86 ]. Therefore, given the rate at which new technologies and trends progress, it is necessary to maintain a positive attitude concerning their potential impact while remaining cautious in examining the results provided by their application.

In the future, the healthcare system will need to utilize increasingly larger volumes of big data with higher dimensionality. The tasks and objectives of data analysis will also have higher demands, including higher degrees of visualization, results with increased accuracy, and stronger real-time performance. As a result, the methods used to mine and process big data will continue to improve. Furthermore, to increase the formality and standardization of data-mining methods, it is possible that a new programming language specifically for this purpose will need to be developed, as well as novel methods capable of addressing unstructured data, such as graphics, audio, and text represented by handwriting. In terms of application, the development of data-management and disease-screening systems for large-scale populations, such as the military, will help determine the best interventions and formulation of auxiliary standards capable of benefitting both cost-efficiency and personnel. Data-mining technology can also be applied to hospital management in order to improve patient satisfaction, detect medical-insurance fraud and abuse, and reduce costs and losses while improving management efficiency. Currently, this technology is being applied for predicting patient disease, with further improvements resulting in the increased accuracy and speed of these predictions. Moreover, it is worth noting that technological development will concomitantly require higher quality data, which will be a prerequisite for accurate application of the technology.

Finally, the ultimate goal of this study was to explain the methods associated with data mining and commonly used to process clinical big data. This review will potentially promote further study and aid doctors and patients.

Abbreviations

BioLINCC: Biologic Specimen and Data Repositories Information Coordinating Center

CHARLS: China Health and Retirement Longitudinal Study

CHNS: China Health and Nutrition Survey

CKB: China Kadoorie Biobank

CS: Cause-specific risk

CTD: Comparative Toxicogenomics Database

eICU-CRD: eICU Collaborative Research Database

FP: Frequent pattern

GBD: Global burden of disease

GEO: Gene Expression Omnibus

HRS: Health and Retirement Study

ICGC: International Cancer Genome Consortium

MIMIC: Medical Information Mart for Intensive Care

ML: Machine learning

NHANES: National Health and Nutrition Examination Survey

PCA: Principal component analysis

PIC: Paediatric intensive care

RF: Random forest

SEER: Surveillance, Epidemiology, and End Results

SVM: Support vector machine

TCGA: The Cancer Genome Atlas

Herland M, Khoshgoftaar TM, Wald R. A review of data mining using big data in health informatics. J Big Data. 2014;1(1):1–35.


Wang F, Zhang P, Wang X, Hu J. Clinical risk prediction by exploring high-order feature correlations. AMIA Annu Symp Proc. 2014;2014:1170–9.


Xu R, Li L, Wang Q. dRiskKB: a large-scale disease-disease risk relationship knowledge base constructed from biomedical text. BMC Bioinform. 2014;15:105. https://doi.org/10.1186/1471-2105-15-105 .


Ramachandran S, Erraguntla M, Mayer R, Benjamin P, Editors. Data mining in military health systems-clinical and administrative applications. In: 2007 IEEE international conference on automation science and engineering; 2007. https://doi.org/10.1109/COASE.2007.4341764 .

Vie LL, Scheier LM, Lester PB, Ho TE, Labarthe DR, Seligman MEP. The US army person-event data environment: a military-civilian big data enterprise. Big Data. 2015;3(2):67–79. https://doi.org/10.1089/big.2014.0055 .


Mohan A, Blough DM, Kurc T, Post A, Saltz J. Detection of conflicts and inconsistencies in taxonomy-based authorization policies. IEEE Int Conf Bioinform Biomed. 2012;2011:590–4. https://doi.org/10.1109/BIBM.2011.79 .

Luo J, Wu M, Gopukumar D, Zhao Y. Big data application in biomedical research and health care: a literature review. Biomed Inform Insights. 2016;8:1–10. https://doi.org/10.4137/BII.S31559 .


Bellazzi R, Zupan B. Predictive data mining in clinical medicine: current issues and guidelines. Int J Med Inform. 2008;77(2):81–97.

Sahu H, Shrma S, Gondhalakar S. A brief overview on data mining survey. Int J Comput Technol Electron Eng. 2011;1(3):114–21.


Obermeyer Z, Emanuel EJ. Predicting the future - big data, machine learning, and clinical medicine. N Engl J Med. 2016;375(13):1216–9.


Doll KM, Rademaker A, Sosa JA. Practical guide to surgical data sets: surveillance, epidemiology, and end results (SEER) database. JAMA Surg. 2018;153(6):588–9.

Johnson AE, Pollard TJ, Shen L, Lehman LW, Feng M, Ghassemi M, et al. MIMIC-III, a freely accessible critical care database. Sci Data. 2016;3: 160035. https://doi.org/10.1038/sdata.2016.35 .

Ahluwalia N, Dwyer J, Terry A, Moshfegh A, Johnson C. Update on NHANES dietary data: focus on collection, release, analytical considerations, and uses to inform public policy. Adv Nutr. 2016;7(1):121–34.

Vos T, Lim SS, Abbafati C, Abbas KM, Abbasi M, Abbasifard M, et al. Global burden of 369 diseases and injuries in 204 countries and territories, 1990–2019: a systematic analysis for the Global Burden of Disease Study 2019. Lancet. 2020;396(10258):1204–22. https://doi.org/10.1016/S0140-6736(20)30925-9 .

Palmer LJ. UK Biobank: Bank on it. Lancet. 2007;369(9578):1980–2. https://doi.org/10.1016/S0140-6736(07)60924-6 .

Cancer Genome Atlas Research Network, Weinstein JN, Collisson EA, Mills GB, Shaw KR, Ozenberger BA, et al. The cancer genome atlas pan-cancer analysis project. Nat Genet. 2013;45(10):1113–20. https://doi.org/10.1038/ng.2764 .

Davis S, Meltzer PS. GEOquery: a bridge between the Gene Expression Omnibus (GEO) and BioConductor. Bioinformatics. 2007;23(14):1846–7.


Zhang J, Bajari R, Andric D, Gerthoffert F, Lepsa A, Nahal-Bose H, et al. The international cancer genome consortium data portal. Nat Biotechnol. 2019;37(4):367–9.




This study was supported by the National Social Science Foundation of China (No. 16BGL183).

Author information

Wen-Tao Wu and Yuan-Jie Li have contributed equally to this work

Authors and Affiliations

Department of Clinical Research, The First Affiliated Hospital of Jinan University, Tianhe District, 613 W. Huangpu Avenue, Guangzhou, 510632, Guangdong, China

Wen-Tao Wu, Ao-Zi Feng, Li Li, Tao Huang & Jun Lyu

School of Public Health, Xi’an Jiaotong University Health Science Center, Xi’an, 710061, Shaanxi, China

Department of Human Anatomy, Histology and Embryology, School of Basic Medical Sciences, Xi’an Jiaotong University Health Science Center, Xi’an, 710061, Shaanxi, China

Yuan-Jie Li

Department of Neurology, The First Affiliated Hospital of Jinan University, Tianhe District, 613 W. Huangpu Avenue, Guangzhou, 510632, Guangdong, China


Contributions

WTW, YJL and JL designed the review. JL, AZF, TH, LL and ADX reviewed and criticized the original paper. All authors read and approved the final manuscript.

Corresponding authors

Correspondence to An-Ding Xu or Jun Lyu .

Ethics declarations

Ethics approval and consent to participate

Not applicable.

Consent for publication

Competing interests

The authors declare that they have no competing interests.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/ . The Creative Commons Public Domain Dedication waiver ( http://creativecommons.org/publicdomain/zero/1.0/ ) applies to the data made available in this article, unless otherwise stated in a credit line to the data.


About this article

Cite this article.

Wu, WT., Li, YJ., Feng, AZ. et al. Data mining in clinical big data: the frequently used databases, steps, and methodological models. Military Med Res 8, 44 (2021). https://doi.org/10.1186/s40779-021-00338-z

Download citation

Received : 24 January 2020

Accepted : 03 August 2021

Published : 11 August 2021

DOI : https://doi.org/10.1186/s40779-021-00338-z


  • Clinical big data
  • Data mining
  • Medical public database

Military Medical Research

ISSN: 2054-9369



A comprehensive survey of data mining

  • Original Research
  • Published: 06 February 2020
  • Volume 12 , pages 1243–1257, ( 2020 )

Cite this article


  • Manoj Kumar Gupta   ORCID: orcid.org/0000-0002-4481-8432 1 &
  • Pravin Chandra 1  


Data mining plays an important role in many human activities because it extracts previously unknown, useful patterns (or knowledge) from data. Owing to these capabilities, data mining has become an essential task in a large number of application domains, such as banking, retail, medicine, insurance and bioinformatics. To take a holistic view of research trends in the area, this paper presents a systematic and comprehensive survey of data mining tasks and techniques, describes various real-life applications of data mining, and discusses the open challenges and issues in data mining research.
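Among the tasks such a survey covers, frequent-itemset mining (the basis of association rule mining) is one of the most canonical. The sketch below is illustrative only and is not taken from the paper: it uses brute-force candidate enumeration rather than the pruning of a full Apriori implementation, and the transaction data, threshold and function name are invented for the example.

```python
from itertools import combinations

# Toy transaction database: each transaction is a set of purchased items.
transactions = [
    {"bread", "milk"},
    {"bread", "diapers", "beer", "eggs"},
    {"milk", "diapers", "beer", "cola"},
    {"bread", "milk", "diapers", "beer"},
    {"bread", "milk", "diapers", "cola"},
]

def frequent_itemsets(transactions, min_support=0.6, max_size=2):
    """Return itemsets of up to max_size items whose support
    (fraction of transactions containing the itemset) meets min_support."""
    items = sorted({item for t in transactions for item in t})
    n = len(transactions)
    result = {}
    for size in range(1, max_size + 1):
        for candidate in combinations(items, size):
            support = sum(set(candidate) <= t for t in transactions) / n
            if support >= min_support:
                result[candidate] = support
    return result

freq = frequent_itemsets(transactions)
# e.g. {"beer", "diapers"} co-occurs in 3 of 5 transactions (support 0.6)
```

A production implementation would prune candidates whose subsets are already infrequent (the Apriori property) instead of enumerating all combinations, which is what makes the algorithm scale to large databases.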





Author information

Authors and Affiliations

University School of Information, Communication and Technology, Guru Gobind Singh Indraprastha University, Sector-16C, Dwarka, Delhi, 110078, India

Manoj Kumar Gupta & Pravin Chandra


Corresponding author

Correspondence to Manoj Kumar Gupta .


About this article

Gupta, M.K., Chandra, P. A comprehensive survey of data mining. Int. J. Inf. Technol. 12, 1243–1257 (2020). https://doi.org/10.1007/s41870-020-00427-7


Received : 29 June 2019

Accepted : 20 January 2020

Published : 06 February 2020

Issue Date : December 2020



  • Data mining techniques
  • Data mining tasks
  • Data mining applications
  • Classification


Healthcare (Basel)

A Systematic Review on Healthcare Analytics: Application and Theoretical Perspective of Data Mining

Md Saiful Islam

1 Mechanical and Industrial Engineering, Northeastern University, Boston, MA 02115, USA; [email protected] (M.S.I.); [email protected] (M.M.H.); [email protected] (X.W.); [email protected] (H.D.G.)

Md Mahmudul Hasan

Xiaoyi Wang

Hayley D. Germack

2 National Clinician Scholars Program, Yale University School of Medicine, New Haven, CT 06511, USA

3 Bouvé College of Health Sciences, Northeastern University, Boston, MA 02115, USA

Md Noor-E-Alam

Associated Data

The growing healthcare industry is generating a large volume of useful data on patient demographics, treatment plans, payment, and insurance coverage—attracting the attention of clinicians and scientists alike. In recent years, a number of peer-reviewed articles have addressed different dimensions of data mining application in healthcare. However, the lack of a comprehensive and systematic narrative motivated us to construct a literature review on this topic. In this paper, we present a review of the literature on healthcare analytics using data mining and big data. Following Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) guidelines, we conducted a database search between 2005 and 2016. Critical elements of the selected studies—healthcare sub-areas, data mining techniques, types of analytics, data, and data sources—were extracted to provide a systematic view of development in this field and possible future directions. We found that the existing literature mostly examines analytics in clinical and administrative decision-making. Use of human-generated data is predominant considering the wide adoption of Electronic Medical Record in clinical care. However, analytics based on website and social media data has been increasing in recent years. Lack of prescriptive analytics in practice and integration of domain expert knowledge in the decision-making process emphasizes the necessity of future research.

1. Introduction

Healthcare is a booming sector of the economy in many countries [ 1 ]. With this growth come challenges, including rising costs, inefficiencies, poor quality, and increasing complexity [ 2 ]. U.S. healthcare expenditures increased by 23% between 2010 and 2015, from $2.6 trillion to $3.2 trillion [ 3 ]. Inefficiencies, i.e., non-value-added tasks (e.g., readmissions, inappropriate use of antibiotics, and fraud), constitute 21–47% of this enormous expenditure [ 4 ]. Some of these costs were associated with low-quality care: researchers found that approximately 251,454 patients in the U.S. die each year due to medical errors [ 5 ]. Better decision-making based on available information could mitigate these challenges and facilitate the transition to a value-based healthcare industry [ 4 ]. Healthcare institutions are adopting information technology in their management systems [ 6 ], and a large volume of data is collected through these systems on a regular basis. Analytics provides tools and techniques to extract information from this complex and voluminous data [ 2 ] and translate it into actionable insight for decision-making in healthcare.

Analytics is the way of developing insights through the efficient use of data and application of quantitative and qualitative analysis [ 7 ]. It can generate fact-based decisions for “planning, management, measurement, and learning” purposes [ 2 ]. For instance, the Centers for Medicare and Medicaid Services (CMS) used analytics to reduce hospital readmission rates and avert $115 million in fraudulent payment [ 8 ]. Use of analytics—including data mining, text mining, and big data analytics—is assisting healthcare professionals in disease prediction, diagnosis, and treatment, resulting in an improvement in service quality and reduction in cost [ 9 ]. According to some estimates, application of data mining can save $450 billion each year from the U.S. healthcare system [ 10 ]. In the past ten years, researchers have studied data mining and big data analytics from both applied (e.g., applied to pharmacovigilance or mental health) and theoretical (e.g., reflecting on the methodological or philosophical challenges of data mining) perspectives.

In this review, we systematically organize and summarize the published peer-reviewed literature related to the applied and theoretical perspectives of data mining. We classify the literature by type of analytics (e.g., descriptive, predictive, prescriptive), healthcare application area (e.g., clinical decision support, mental health), and data mining technique (e.g., classification, sequential pattern mining); and we report the data source used in each reviewed paper, which, to the best of our knowledge, has never been done before.

Motivation and Scope

There is a large body of recently published review/conceptual studies on healthcare and data mining. We outline the characteristics of these studies (e.g., scope/healthcare sub-area, timeframe, and number of papers reviewed) in Table 1 . For example, one study reviewed the awareness effect in type 2 diabetes in papers published between 2001 and 2005, identifying 18 papers [ 11 ]. The existing review literature is limited: most of the papers listed in Table 1 did not report the timeframe and/or the number of papers reviewed (expressed as N/A).

Characteristics of existing review/conceptual studies on the related topics.

N/A represents Not Reported.

There is no comprehensive review available which presents the complete picture of data mining application in the healthcare industry. The existing reviews (16 out of 21) are either focused on a specific area of healthcare, such as clinical medicine (three reviews) [ 16 , 17 , 19 ], adverse drug reaction signal detection (two reviews) [ 25 , 26 ], big data analytics (four reviews) [ 8 , 10 , 22 , 24 ], or the application and performance of data mining algorithms (five reviews) [ 9 , 13 , 14 , 20 , 21 ]. Two studies focused on specific diseases (diabetes [ 11 ], skin diseases [ 18 ]). To the best of our knowledge, none of these studies present the universe of research that has been done in this field. These studies are also limited in the rigor of their methodology except for four articles [ 11 , 16 , 22 , 25 ], which provide key insights including the timeframe covered in the study, database search, and literature inclusion or exclusion criteria, but they are limited in their scope of topics covered (see Table 1 ).

Beyond condensing the applied literature, our review also adds to the body of theoretical reviews in the analytics literature. Current theoretical reviews are limited to methodological challenges and techniques to overcome them [ 15 , 16 , 27 ] and to the application and impact of big data analytics in healthcare [ 23 ]. In summary, the current reviews listed in Table 1 lack (1) breadth of coverage in terms of application areas, (2) range of data mining techniques considered, (3) assessment of literature quality, and (4) systematic selection and analysis of papers. In this review, we aim to fill these gaps. We add to this literature by covering the applied and theoretical perspectives of data mining and big data analytics in healthcare with a more comprehensive and systematic approach.

2. Methodology

The methodology of our review followed the checklist proposed by the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) [ 28 ]. We assessed the quality of the selected articles using JBI Critical Appraisal Checklist for analytical cross sectional studies [ 29 ] and Critical Appraisal Skills Programme (CASP) qualitative research checklist [ 30 ].

2.1. Input Literature

Selected literature and the selection process are described in this section. Initially, a two-phase advanced keyword search was conducted in the Web of Science database, and a one-phase (Phase 2) search in PubMed and Google Scholar, with a time filter of 1 January 2005 to 31 December 2016 applied to “All Fields”. Restriction to journal articles written in English was added as an additional filter. The keywords listed in Table 2 were used in the different phases. The search proceeded as follows:

Figure 1. Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) flow chart [ 28 ] illustrating the literature search process.

  • Exclusion criteria: This included articles reporting on results of: qualitative studies, surveys, focus group studies, feasibility studies, monitoring devices, team relationship measurement, job satisfaction, work environment, “what-if” analysis, data collection techniques, editorials or short reports, articles that merely mention data mining, and articles not published in international journals. Duplicates were removed (33 articles). Finally, 117 articles were retained for the review. Figure 1 provides a PRISMA [ 28 ] flow diagram of the review process and Supplementary Information File S1 (Table S1) provides the PRISMA checklist.

Keywords for database search.

1 A logical operator used between the keywords during the database search. 2 Cancer was listed independently because other dominant associations have the word “disease” attached to them (e.g., heart disease, skin disease, mental disease, etc.).

2.2. Quality Assessment and Processing Steps

The full text of each of the 117 articles was reviewed separately by two researchers to eliminate bias [ 28 ]. To assess the quality of the cross-sectional studies, we applied the JBI Critical Appraisal Checklist for Analytical Cross Sectional Studies [ 29 ]. For theoretical papers, we applied the Critical Appraisal Skills Programme (CASP) qualitative research checklist [ 30 ]. We modified the checklist items, as not all items specified in the JBI or CASP checklists were applicable to studies on healthcare analytics ( Supplementary Materials Table S2 ). For the cross-sectional studies, we evaluated each article’s quality based on inclusion of: (1) clear objective and inclusion criteria; (2) detailed description of sample population and variables; (3) data source (e.g., hospital, database, survey) and format (e.g., structured Electronic Medical Record (EMR), International Classification of Diseases code, unstructured text, survey response); (4) valid and reliable data collection; (5) consideration of ethical issues; (6) detailed discussion of findings and implications; (7) valid and reliable measurement of outcomes; and (8) use of an appropriate data mining tool. For the theoretical papers, we evaluated: (1) clear statement of aims; (2) appropriateness of the qualitative methodology; (3) appropriateness of the research design; (4) clearly stated findings; and (5) value of the research. Summary characteristics from any study fulfilling these criteria were included in the final data aggregation ( Supplementary Materials Table S3 ).

To summarize the body of knowledge, we adopted the three-step processing methodology outlined by Levy and Ellis [ 31 ] and Webster and Watson [ 32 ] ( Figure 2 ). During the review process, information was extracted by identifying and defining the problem, understanding the solution process, and listing the important findings (“Know the literature”). We summarized and compared each article with the articles addressing similar problems (“Comprehend the literature”). This also ensured that irrelevant information was not carried into the analysis. The summarized information was stored in a spreadsheet in the form of a concept matrix as described by Webster and Watson [ 32 ]. We updated the concept matrix periodically, after completing every 20% of the articles (approximately 23 articles), to include new findings (“Apply”). Based on the concept matrix, we developed a classification scheme (see Figure 3 ) for further comparison and contrast. We established an operational definition (see Table 3 ) for each class, and articles of the same class were separated from the pool (“Analyze and Synthesize”). We compared classifications between researchers and resolved disagreements (on six articles) by discussion. The final classification provided distinct groups of articles with summaries, facts, and remarks made by the reviewers (“Evaluate”).
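The concept matrix mentioned above is, in essence, an article-by-concept incidence table. A minimal sketch in Python (the article titles and concept labels here are invented placeholders, not the review's actual entries):

```python
# A concept matrix in the Webster-and-Watson sense: rows are articles,
# columns are concepts, entries flag whether an article covers a concept.
concepts = ["classification", "clustering", "EMR data", "prescriptive"]
matrix = {
    "Article A": {"classification", "EMR data"},
    "Article B": {"clustering"},
    "Article C": {"classification", "prescriptive"},
}

# Row view: which concepts each article covers, as 0/1 flags.
rows = {art: [int(c in covered) for c in concepts]
        for art, covered in matrix.items()}

# Column view: how many articles touch each concept.
counts = {c: sum(c in covered for covered in matrix.values())
          for c in concepts}

print(rows["Article A"])        # [1, 0, 1, 0]
print(counts["classification"]) # 2
```

Sorting the column counts then surfaces which concepts are under-studied, which is how such a matrix supports gap analysis.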

Figure 2. Three stages of the effective literature review process, adapted from Levy and Ellis [ 31 ].

Figure 3. Classification scheme of the literature.

Operational definition of the classes.

* Most of the definitions listed in this table are well established in the literature and well known; therefore, we did not use any specific reference. However, for some classes, specifically for types of analytics and data, varying definitions are available in the literature. We cite the sources of those definitions.

2.3. Results

The network diagram of selected articles and the keywords listed by authors in Figure 4 represents the outcome of the methodological review process. We elaborate on the resulting output in the subsequent sections using the structure of the developed classification scheme ( Figure 3 ). We also report the potential future research areas.

Figure 4. Visualization of high-frequency keywords of the reviewed papers. The white circles symbolize the articles and the blue circles represent keywords. Keywords that occurred only once were eliminated, along with the corresponding articles. The size of a blue circle and its text represents how often that keyword is found. The size of a white circle is proportional to the number of keywords used in that article. The links represent the connections between the keywords and the articles. For example, if a blue circle has three links (e.g., Decision-Making), that keyword was used in three articles. The diagram was created with the open-source software Gephi [ 34 ].

2.3.1. Methodological Quality of the Studies

Out of the 117 papers included in this review, 92 applied analytics and 25 were qualitative/conceptual. The methodological quality of the analytical studies (92 out of 117) was evaluated with a modified version of the 8 yes/no questions suggested in the JBI Critical Appraisal Checklist for Analytical Cross Sectional Studies [ 29 ]. Each question carries 1 point (1 if the answer is Yes, 0 for No). The score achieved by each paper is provided in the final column of Supplementary Materials Table S3 . On average, each paper applying analytics scored 7.6 out of 8, with a range of 6–8 points. The major drawbacks were the absence of a data source and of performance measures for the data mining algorithms: out of 92 papers, 23 did not evaluate or mention the performance of the applied algorithms, and eight did not mention the source of the data. However, all the papers in healthcare analytics had a clear objective and a detailed discussion of the sample population and variables. The data used in each paper was either de-identified/anonymized or approved by the institute’s ethical committee to ensure patient confidentiality.

We applied the Critical Appraisal Skills Programme (CASP) qualitative research checklist [ 30 ] to evaluate the quality of the 25 theoretical papers. Five of the ten questions in that checklist were not applicable to the theoretical studies; therefore, we evaluated these papers on a five-point scale (1 if the answer is Yes, 0 for No). The papers showed high methodological quality, as 21 of the 25 scored 5. The last column in Supplementary Materials Table S3 provides the score achieved by each paper.

2.3.2. Distribution by Publication Year

The distribution of articles related to data mining and big data analytics in healthcare across the study timeline (2005–2016) is presented in Figure 5 . The distribution shows an upward trend, with at least two articles in each year and more than ten articles in each of the last four years. This trend reflects the growing interest of government agencies, healthcare practitioners, and academics in this interdisciplinary field of research. We anticipate that the use of analytics will continue to grow in the coming years to address rising healthcare costs and the need for improved quality of care.

Figure 5. Distribution of publications by year (117 articles).

2.3.3. Distribution by Journal

Articles published in 74 different journals were included in this study. Table 4 lists the top ten journals by number of papers published. Expert Systems with Applications was the dominant source of literature on data mining application in healthcare, with 7 of the 117 articles. The journals were interdisciplinary in nature, spanning computational journals such as IEEE Transactions on Information Technology in Biomedicine and policy-focused journals such as Health Affairs . Articles published in Expert Systems with Applications, Journal of Medical Systems, Journal of the American Medical Informatics Association, and Healthcare Informatics Research were mostly related to analytics applied in clinical decision-making and healthcare administration. In contrast, articles published in Health Affairs were predominantly conceptual in nature, addressing policy issues, challenges, and the potential of this field.

Top 10 journals on application of data mining in healthcare.

3. Healthcare Analytics

Out of 117 articles, 92 applied analytics for decision-making in healthcare. We discuss the types of analytics, the application area, the data, and the data mining techniques used in these articles and summarize them in Supplementary Materials Table S4 .

3.1. Types of Analytics

We identified three types of analytics in the literature: descriptive (i.e., exploration and discovery of information in the dataset), predictive (i.e., prediction of upcoming events based on historical data), and prescriptive (i.e., utilization of scenarios to provide decision support). Five of the 92 studies employed both descriptive and predictive analytics. In Figure 6 , which displays the percentage of healthcare articles using each analytics type, we show that descriptive analytics is the most commonly used in healthcare (48%). Descriptive analytics was dominant in all the application areas except clinical decision support. Among the application areas, pharmacovigilance studies used only descriptive analytics, as this application area is focused on identifying associations between adverse drug effects and medications. Predictive analytics was used in 43% of articles. Among application areas, clinical decision support had the highest use of predictive analytics, as many studies in this area involve risk and morbidity prediction for chest pain, heart attack, and other diseases. In contrast, use of prescriptive analytics was very uncommon (only 9%), as most of these studies focused on either a specific population base or a specific disease scenario. However, some evidence of prescriptive analytics was found in public healthcare, administration, and mental health (see Supplementary Materials Table S4 ). These studies create a data repository and/or analytical platform to facilitate decision-making for different scenarios.

Figure 6. Types of analytics used in the literature. (a) Percentage of analytics type; (b) analytics type by application area.

3.2. Types of Data

To identify types of data, we adopted the classification scheme of Raghupathi and Raghupathi [ 23 ], which takes into account the nature (i.e., text, image, number, electronic signal), source, and collection method of data together. Table 3 provides the operational definitions of the taxonomy adopted in this paper. Figure 7 a presents the percentage of each data type used and Figure 7 b the number of usages by application area. As expected, human-generated (HG) data, including the Electronic Medical Record (EMR), Electronic Health Record (EHR), and Electronic Patient Record (EPR), is the most commonly used form (77%). Web or social media (WS) data is the second most common type (11%), reflecting the growing use of social media and the ongoing digital revolution in the healthcare sector [ 35 ]. In addition, recent developments in Natural Language Processing (NLP) techniques are making the use of WS data easier than before [ 36 ]. The other three types of data (SD, BT, and BM) together constitute only about 12% of total data usage, but the popularity and market growth of wearable personal health-tracking devices [ 37 ] may increase the use of SD and BM data.

Figure 7. Percentage of data type used (a) and type of data used by application area (b).

3.3. Data Mining Techniques

Data mining techniques used in the reviewed articles include classification, clustering, association, anomaly detection, sequential pattern mining, regression, and data warehousing. While an elaborate description of each technique and the available algorithms is beyond the scope of this review, we report the frequency of each technique and its sector-wise distribution in Figure 8 a,b, respectively. Among the articles included in the review, 57 used classification techniques to analyze data. Association and clustering were used in 21 and 18 articles, respectively. Use of other techniques was less frequent.

Figure 8. Utilization of data mining techniques, (a) by percentage and (b) by application area.

A high proportion (8 out of 9) of pharmacovigilance papers used association. Use of classification was dominant in every sector except pharmacovigilance ( Figure 8 b). Data warehousing was mostly used in healthcare administration ( Figure 8 b).

We delved deeper into classification as it was utilized in the majority (57 out of 92) of the papers. A number of algorithms are used for classification, which we present in a word cloud in Figure 9 . Support Vector Machine (SVM), Artificial Neural Network (ANN), Logistic Regression (LR), Decision Tree (DT), and DT-based algorithms were the most commonly used. Random Forest (RF), Bayesian Network, and fuzzy-based algorithms were also often used. A few papers (three) introduced novel algorithms for specific applications. For example, Yeh et al. [ 38 ] developed a discrete particle swarm optimization-based classification algorithm to distinguish breast cancer patients from the general population. Self-organizing maps and K-means were the most commonly used clustering algorithms in healthcare. Performance (e.g., accuracy, sensitivity, specificity, area under the ROC curve, positive predictive value, negative predictive value) of each of these algorithms varied by application and data type. We recommend applying multiple algorithms and choosing the one that achieves the best accuracy.
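The "apply multiple algorithms and keep the most accurate" recommendation can be sketched with scikit-learn's cross-validation utilities. The data below is synthetic and merely stands in for a clinical dataset; this is an illustration of the selection procedure, not the protocol of any reviewed study:

```python
# Compare several of the commonly used classifiers by cross-validated
# accuracy and keep the best one, on a synthetic stand-in dataset.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, n_features=10, random_state=0)

candidates = {
    "LR": LogisticRegression(max_iter=1000),
    "SVM": SVC(),
    "DT": DecisionTreeClassifier(random_state=0),
    "RF": RandomForestClassifier(random_state=0),
}

# Mean 5-fold cross-validated accuracy per candidate.
scores = {name: cross_val_score(clf, X, y, cv=5).mean()
          for name, clf in candidates.items()}
best = max(scores, key=scores.get)
print(best, round(scores[best], 3))
```

In practice the metric would be chosen to match the clinical question (e.g., sensitivity for screening tasks), not accuracy alone.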

Figure 9. Word cloud [ 39 ] with classification algorithms.

4. Application of Analytics in Healthcare

Table 3 provides the operational definitions of the six application areas (i.e., clinical decision support, healthcare administration, privacy and fraud detection, mental health, public health, and pharmacovigilance) identified in this review. Figure 10 shows the percentage of articles in each area. Among the different classes of healthcare analytics, data mining is mostly applied to clinical decision support (42%) and administrative purposes (32%). This section discusses the application of data mining in these areas and identifies the main aims of these studies, performance gaps, and key features.

Figure 10. Percentage of papers that utilized healthcare analytics, by application area (92 of 117 articles).

4.1. Clinical Decision Support

Clinical decision support consists of descriptive and/or predictive analysis mostly related to cardiovascular disease (CVD), cancer, diabetes, and emergency/critical care unit patients. Some studies developed novel data mining algorithms which we review. Table 5 describes the topics investigated and data sources used by papers using clinical decision-making, organized by major diseases category.

Topics and data sources of papers using clinical decision-making, organized by major disease category.

4.1.1. Cardiovascular Disease (CVD)

CVD is one of the most common causes of death globally [ 45 , 77 ]. Its public health relevance is reflected in the literature—it was addressed by seven articles (18% of articles in clinical decision support).

Researchers [ 40 ] distilled risk factors related to Coronary Heart Disease (CHD) into a decision-tree-based classification system. The authors investigated three events: Coronary Artery Bypass Graft Surgery (CABG), Percutaneous Coronary Intervention (PCI), and Myocardial Infarction (MI), and developed three models: CABG vs. non-CABG, PCI vs. non-PCI, and MI vs. non-MI. The risk factors for each event were divided into four groups in two stages: they were separated into before and after the event at the first stage, and into modifiable (e.g., smoking habit or blood pressure) and non-modifiable (e.g., age or sex) at the second stage for each group. After classification, the most important risk factors were identified by extracting the classification rules. The Framingham equation [ 78 ], which is widely used to calculate global risk for CHD, was used to calculate the risk for each event. The most important risk factors identified were age, smoking habit, history of hypertension, family history, and history of diabetes; other studies on CHD show similar results [ 79 , 80 , 81 ]. This study had implications for healthcare providers and patients by identifying risk factors to specifically target and, in the case of modifiable factors, reduce CHD risk [ 40 ].
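The classify-then-extract-rules idea can be illustrated with scikit-learn's `export_text`. The feature names and toy data below are hypothetical stand-ins for the study's risk factors, not its actual data:

```python
# Fit a decision tree on (hypothetical) risk-factor data, then read the
# classification rules and feature importances off the fitted tree.
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

rng = np.random.default_rng(0)
features = ["age", "smoker", "hypertension", "family_history", "diabetes"]

# Binary risk factors plus an age column; the "event" label is a toy rule.
X = rng.integers(0, 2, size=(200, len(features))).astype(float)
X[:, 0] = rng.integers(35, 80, size=200)           # age in years
y = ((X[:, 0] > 55) & (X[:, 1] == 1)).astype(int)  # toy event label

tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)

# Human-readable if/else rules, analogous to the extracted classification
# rules used to rank risk factors in the study.
rules = export_text(tree, feature_names=features)
print(rules)
print(dict(zip(features, tree.feature_importances_)))
```

Reading the printed rules shows which thresholds (here on `age` and `smoker`) drive the predicted class, which is the sense in which rule extraction identifies important risk factors.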

Data mining has also been applied to diagnose Coronary Artery Disease (CAD) [ 41 ]. Researchers showed that, as an alternative to existing diagnostic methods (i.e., Coronary Angiography (CA)), which are costly and require high technical skill, data mining using existing data such as demographics, medical history, simple physical examinations, blood tests, and simple noninvasive investigations (e.g., heart rate, glucose level, body mass index, creatinine level, cholesterol level, arterial stiffness) is simpler, less costly, and can achieve a similar level of accuracy. The researchers used a four-step classification process: (1) a decision tree was used to classify the data; (2) crisp classification rules were generated; (3) a fuzzy model was created by fuzzifying the crisp classifier rules; and (4) the fuzzy model parameters were optimized and the final classification was made. The proposed optimized fuzzy model achieved 73% prediction accuracy and improved upon an existing Artificial Neural Network (ANN) by providing better interpretability.

Traditional data mining and machine learning algorithms (e.g., probabilistic neural networks and SVM) may not be advanced enough to handle the data used for CVD diagnosis, which is often uncertain and highly dimensional in nature. To tackle this issue, researchers [ 42 ] proposed a fuzzy Standard Additive Model (SAM) for classification. They used adaptive vector quantization clustering to generate unsupervised fuzzy rules, which were then optimized (the number of rules minimized) by a Genetic Algorithm (GA). They then used the incremental form of a supervised technique, gradient descent, to fine-tune the rules. Because training the fuzzy system is highly time-consuming given the large number of features in the data, the number of features was reduced with a wavelet transformation. The proposed algorithm achieved better accuracy (78.78%) than the probabilistic neural network (73.80%), SVM (74.27%), fuzzy ARTMAP (63.46%), and the adaptive neuro-fuzzy inference system (74.90%). Another common issue in cardiovascular event risk prediction is the censorship of data (i.e., the patient's condition is not followed up after they leave hospital and until a new event occurs; the available data becomes right-censored). Elimination and exclusion of the censored data create bias in prediction results. To address censorship in CVD event risk prediction, two studies [ 43 , 44 ] used Inverse Probability of Censoring Weighting (IPCW). IPCW is a pre-processing step that calculates weights on the data, which are later classified using a Bayesian Network. One of these studies [ 43 ] provided an IPCW-based system that is compatible with any machine learning algorithm.
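As a rough illustration of the IPCW idea (not the cited studies' implementation), the sketch below estimates the censoring survival curve G(t) with a Kaplan-Meier estimator, treating censoring as the "event", and assigns each uncensored subject the weight 1/G(t). The data is a toy example; a real analysis would pass these weights to the downstream classifier as sample weights:

```python
# Inverse Probability of Censoring Weighting (IPCW) on toy survival data.
import numpy as np

time = np.array([2.0, 3.0, 4.0, 5.0, 6.0, 7.0])  # observed times
event = np.array([1, 0, 1, 0, 1, 1])             # 1 = event, 0 = censored

def censoring_survival(t, time, event):
    """Kaplan-Meier estimate of G(t) = P(censoring time > t)."""
    g = 1.0
    for u in np.unique(time[event == 0]):  # distinct censoring times
        if u <= t:
            at_risk = np.sum(time >= u)
            censored_here = np.sum((time == u) & (event == 0))
            g *= 1.0 - censored_here / at_risk
    return g

# Weight = event indicator / G(observed time); censored rows get weight 0,
# and uncensored rows are up-weighted to compensate for lost follow-up.
weights = np.array([e / censoring_survival(t, time, event)
                    for t, e in zip(time, event)])
print(weights.round(3))
```

With these weights, late uncensored events (which "survived" more censoring opportunities) count for more, which is what removes the bias that simply dropping censored records would introduce.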

Electrocardiography (ECG), a non-invasive measurement of the electrical activity of the heart, is among the most commonly used medical tests in the assessment of CVD. Machine learning offers a potential optimization of traditional ECG assessment, in which compressed signals must be decompressed before any diagnosis can be made, a process that takes time and considerable computer storage. In one study, researchers [ 45 ] developed a framework for real-time diagnosis of cardiovascular abnormalities based directly on compressed ECG. To reduce diagnosis time, which is critical for appropriate and timely treatment decisions, they proposed and tested a mobile-based framework and applied it to wireless monitoring of patients. The ECG was sent to the hospital server, where the signals were divided into normal and abnormal clusters. The system detected cardiac abnormality with 97% accuracy. The cluster information was sent to the patient's mobile phone, and if any life-threatening abnormality was detected, the phone alerted the hospital or emergency personnel.

Data analytics have also been applied to rarer CVDs. One study [ 46 ] developed an intervention prediction model for Hypoplastic Left Heart Syndrome (HLHS), a rare and fatal form of heart disease in infants that requires surgery. Post-surgical evaluation is critical, as patient condition can shift very quickly. Indicators of patient wellness are not easily or directly measurable, but inferences can be made from measurable physiological parameters including pulse, heart rhythm, systemic blood pressure, common atrial filling pressure, urine output, physical exam, and systemic and mixed venous oxygen saturations. A subtle physiological shift can cause death if not noticed and intervened upon. To help healthcare providers in decision-making, the researchers developed a prediction model by identifying the correlation between physiological parameters and interventions. They collected 19,134 records of 17 patients in Pediatric Intensive Care Units (PICU). Each record contained physiological parameters measured by devices and noted by nurses, and for each record a wellness score was assigned by domain experts. After classifying the data using a rough set algorithm, decision rules were extracted for each wellness score to aid in making intervention plans. The authors also developed a new feature-selection measure, Combined Classification Quality (CCQ), which accounts for both the variation in a feature's values and the distinct outcomes each value leads to. They showed that a higher CCQ value leads to higher classification accuracy, which is not always true for the commonly used classification quality (CQ) measure: two features with a CQ value of 1 produced very different classification accuracies (35.5% and 75%), whereas the same two features had CCQ values of 0.25 and 0.40, and the feature with CCQ 0.40 was the one that produced 75% accuracy. Using CCQ instead of CQ avoids such inconsistency.

4.1.2. Diabetes

The disease burden related to diabetes is high and rising in every country. According to the World Health Organization (WHO), diabetes will become the seventh leading cause of death by 2030 [ 82 ]. Data mining has been applied to identify rare forms of diabetes, identify the important factors for controlling diabetes, and explore patient history to extract knowledge. We reviewed seven studies that applied healthcare analytics to diabetes.

Researchers extracted knowledge about diabetes treatment pathways and identified rare forms and complications of diabetes using a three-level clustering framework applied to the examination history of diabetic patients [ 48 ]. The first level clustered patients who went through regular tests for monitoring purposes (e.g., checkup visit, glucose level, urine test) or to diagnose diabetes-related complications (e.g., eye tests for diabetic retinopathy). The second level explored patients who went through diagnosis for specific or multiple diabetic complications only (e.g., cardiovascular, eye, liver, and kidney related complications). These two levels produced 2939 outliers out of 6380 patients. At the third level, the authors clustered these outlier patients to gain insight into rare forms of diabetes or rare complications. A density-based clustering algorithm, DBSCAN, was used because it does not require specifying the number of clusters a priori and is less sensitive to noise and outliers. This framework for grouping patients by treatment pathway can be used to evaluate treatment plans and costs. Another group of researchers [ 49 ] investigated the important factors related to type 2 diabetes control. They used feature selection via supervised model construction (FSSMC) to select and rank the important factors. They then applied the naïve Bayes, IB1, and C4.5 algorithms with the FSSMC technique to classify patients as having poor or good diabetes control and to evaluate classification efficiency for different subsets of features. Experiments on physiological and laboratory information collected from 3857 patients showed that the classifiers performed best (a 1–3% increase in accuracy) with the features selected by FSSMC. Age, diagnosis duration, and insulin treatment were the top three important factors.
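The appeal of DBSCAN for this kind of outlier-driven analysis is that it labels sparse points as noise rather than forcing them into a cluster. A small sketch (synthetic patient feature vectors; not the study's data or parameters):

```python
# DBSCAN groups dense regions and assigns label -1 to points in sparse
# regions (outliers), with no need to fix the number of clusters up front.
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(0)
regular = rng.normal(loc=0.0, scale=0.3, size=(80, 2))       # common pathway
complication = rng.normal(loc=5.0, scale=0.3, size=(40, 2))  # second pathway
rare = np.array([[10.0, 10.0], [-8.0, 9.0]])                 # rare patients

X = np.vstack([regular, complication, rare])
labels = DBSCAN(eps=1.0, min_samples=5).fit_predict(X)

n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
n_outliers = int((labels == -1).sum())
```

In the framework above, the points labeled -1 would correspond to the outlier patients passed on to the third clustering level.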

Data analytics have also been applied to identify patients with type 2 diabetes. In one study [ 52 ], using fragmented data from two different healthcare centers, researchers evaluated the effect of data fragmentation on a high-throughput clinical phenotyping (HTCP) algorithm for identifying patients at risk of developing type 2 diabetes. When a patient visits multiple healthcare centers during a study period, his/her data is stored in different EMRs and is called fragmented; in such cases, applying an HTCP algorithm can lead to improper classification. An experiment performed in a rural setting showed that using data from two healthcare centers instead of one decreased the false negative rate from 32.9% to 0%. In another study, researchers [ 51 ] utilized sparse logistic regression to predict type 2 diabetes risk from insurance claims data containing more than 500 features, including demography, specific medical conditions, and comorbidity. Their model outperformed traditional risk prediction methods for large data sets and data sets with missing values, increasing the AUC from 0.75 to 0.80. In a third study, researchers [ 53 ] developed a prediction and risk diagnosis model using a hybrid system with SVM. Using features like blood pressure, fasting blood sugar, two-hour post-glucose tolerance, and cholesterol level along with other demographic and anthropometric features, the SVM was able to predict diabetes risk with 97% accuracy. One reason for the high accuracy compared to the insurance-claims study [ 51 ] is the structured nature of the data, which came from a cross-sectional survey on diabetes.
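Sparse logistic regression of the kind used in [ 51 ] typically means an L1 penalty, which drives most coefficients to exactly zero so that only a handful of the hundreds of claim features survive. An illustrative sketch on synthetic data (all settings are assumptions, not the study's):

```python
# L1-penalized logistic regression on a wide, mostly-noisy feature matrix:
# the penalty zeroes out uninformative coefficients.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=200, n_informative=10,
                           n_redundant=0, random_state=0)
Xtr, Xte, ytr, yte = train_test_split(X, y, test_size=0.3, random_state=0)

model = LogisticRegression(penalty="l1", solver="liblinear", C=0.1)
model.fit(Xtr, ytr)

auc = roc_auc_score(yte, model.decision_function(Xte))
n_nonzero = int(np.sum(model.coef_ != 0))  # far fewer than 200 features kept
```

The retained coefficients double as a feature-importance list, which is part of why sparse models are attractive for claims data.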

Different statistical and machine learning algorithms are available for classification. Researchers [ 50 ] compared the performance of two statistical methods (LR and Fisher linear discriminant analysis) and four machine learning algorithms (SVM with a radial basis function kernel, ANN, Random Forest, and Fuzzy C-means) for predicting diabetes diagnosis. Ten features (age, gender, BMI, waist circumference, smoking, job, hypertension, residential region (rural/urban), physical activity, and family history of diabetes) were used to test the classification performance (diabetes or no diabetes). Parameters for ANN and SVM were optimized through greedy search. SVM showed the best performance on all measures and was at least 5% more accurate than the other classification techniques, while the statistical methods performed similarly to the remaining machine learning algorithms. The study was limited, however, by a low prevalence of diabetes in the dataset, which can cause poor classification performance. Other researchers [ 47 ] proposed a novel pattern recognition algorithm using convolutional nonnegative matrix factorization. They treated each patient as an entity, and each doctor visit, prescription, test result, and diagnosis as an event over time. Finding such temporal patterns can help group similar patients, identify their treatment pathways, and support patient management. Though they did not compare pattern recognition accuracy with existing methods like singular value decomposition (SVD), the matrix-like representation makes the approach intuitive.
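Head-to-head comparisons like the one in [ 50 ] usually amount to running each candidate model through the same cross-validation protocol. A compact sketch (synthetic data, a reduced set of models, and default-ish hyperparameters, all assumptions):

```python
# Compare several classifiers under the same 5-fold cross-validation.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

X, y = make_classification(n_samples=600, n_features=10, random_state=1)

models = {
    "LR": LogisticRegression(max_iter=1000),
    "SVM-RBF": SVC(kernel="rbf", gamma="scale"),
    "RF": RandomForestClassifier(n_estimators=100, random_state=1),
}
scores = {name: cross_val_score(m, X, y, cv=5).mean()
          for name, m in models.items()}
```

Keeping the folds identical across models is what makes the resulting accuracy differences (such as the 5% SVM margin reported in [ 50 ]) meaningful.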

4.1.3. Cancer

Cancer is another major threat to public health [ 83 ]. Machine learning has been applied to cancer patients to predict survival and to aid diagnosis. We reviewed five studies that applied healthcare analytics to cancer.

Despite many advances in treatment, accurate prediction of survival in patients with cancer remains challenging given the heterogeneity of cancer complexity, treatment options, and patient populations. Survival of prostate cancer patients has been predicted using a classification model [ 54 ]. The model used a public database, SEER (Surveillance, Epidemiology, and End Results), and applied a stratified ten-fold sampling approach. Survival predictions among prostate cancer patients were made using the DT, ANN, and SVM algorithms. SVM outperformed the others with 92.85% classification accuracy, whereas DT and ANN achieved 90% and 91.07% accuracy, respectively. The same database has been used to predict survival of lung cancer patients [ 56 ]. After preprocessing the 11 features available in the data set, the authors identified the two features with the strongest predictive power: (1) the count of regional lymph nodes removed and examined and (2) the malignant/in-situ tumor count. They applied several supervised classification methods to the preprocessed data; ensemble voting of five decision-tree-based classifiers and meta-classifiers (J48 DT, RF, LogitBoost, Random Subspace, and Alternating DT) provided the best performance: 74% for 6-month, 75% for 9-month, 77% for 1-year, 86% for 2-year, and 92% for 5-year survival. Using this technique, they developed an online lung cancer outcome calculator to estimate the risk of mortality at 6 months, 9 months, 1 year, 2 years, and 5 years after diagnosis.
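The ensemble-voting mechanism is straightforward: each base classifier casts a vote and the majority wins. A sketch with a smaller, assumed trio of tree-based learners (the study's actual five classifiers include boosting and random-subspace variants not reproduced here):

```python
# Hard-voting ensemble of tree-based classifiers on synthetic data.
from sklearn.datasets import make_classification
from sklearn.ensemble import (AdaBoostClassifier, RandomForestClassifier,
                              VotingClassifier)
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=800, n_features=11, random_state=7)
Xtr, Xte, ytr, yte = train_test_split(X, y, test_size=0.25, random_state=7)

vote = VotingClassifier(
    estimators=[("dt", DecisionTreeClassifier(max_depth=5, random_state=7)),
                ("rf", RandomForestClassifier(n_estimators=100, random_state=7)),
                ("ab", AdaBoostClassifier(random_state=7))],
    voting="hard")  # majority vote over the three base learners
vote.fit(Xtr, ytr)
acc = vote.score(Xte, yte)
```

Voting tends to help when the base learners make partially uncorrelated errors, which is the rationale for mixing plain trees with boosted and subspace variants.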

In addition to predicting survival, machine learning techniques have been used to identify patients with cancer. Researchers [ 38 ] proposed a new hybrid algorithm to distinguish patients with breast cancer from those without. In the first stage, they used correlation and regression to select the significant features; in the second stage, they used discrete Particle Swarm Optimization (PSO) to classify the data. Applied to the Wisconsin Breast Cancer Data Set from the UCI machine learning repository, the hybrid algorithm achieved better accuracy (98.71%) than a genetic algorithm (GA) (96.14%) [ 84 ] and another PSO-based algorithm (93.4%) [ 85 ].

Machine learning has also been used to identify the nature of cancer (benign or malignant) and to understand demographics related to cancer. Among patients with breast cancer, researchers [ 42 ] applied the Fuzzy Standard Additive Model (SAM) with GA (discussed earlier in relation to CVD) to predict whether a tumor was benign or malignant. Using a data set from the UCI machine learning repository, the model was able to classify uncertain and high dimensional data with greater accuracy (by 1–2%). Researchers have also used big data [ 55 ] to create a visualization tool that provides a dynamic view of cancer statistics (e.g., trends, associations with other diseases) and how they relate to demographic variables (e.g., age, sex) and other diseases (e.g., diabetes, kidney infection). Data mining thus provided a better understanding of cancer patients at both the demographic and the outcome level, which in turn provides an opportunity for early identification and intervention.

4.1.4. Emergency Care

The emergency department (ED) is the primary route to hospital admission [ 58 ]. In 2011, 20% of the US population had at least one visit to the ED [ 86 ]. EDs are experiencing significant financial pressure to increase efficiency and patient throughput. Discrete event simulation (i.e., modeling system operations as a sequence of discrete events) is a useful tool to understand and improve ED operations by simulating the behavior and performance of EDs. Certain features of the ED (e.g., different types of patients, treatments, urgency, and uncertainty) can complicate simulation. One way to handle the complexity is to group patients according to the treatment they require. Previously, the "casemix" principle, developed by expert clinicians to group similar patients in case-specific settings (e.g., telemetry or nephrology units), was used, but it has limitations in the ED setting [ 58 ]. Researchers [ 58 ] applied data mining (clustering) in the ED setting to group patients based on treatment pattern (e.g., full ward test, head injury observation, ECG, blood glucose, CT scan, X-ray). The clustering model was verified and validated by ED clinicians. The resulting groups were then used in discrete event simulation to understand and improve ED operations (mainly length of stay) and process flows for each group.

Chest pain admissions to the ED have also been examined using a decision-making framework. Researchers [ 57 ] proposed a three-stage framework for classifying the severity of chest pain as AMI, angina pectoris, or other. In the first stage, lab tests and diagnoses were collected and the associations between them extracted. In the second stage, experts developed association rules between lab tests and diagnoses to help physicians make quick diagnostic decisions based on the tests already performed and avoid further unnecessary lab tests. In the third stage, the authors built a classification tree to classify the chest pain diagnosis based on selected lab tests, diagnoses, and medical records. This hybrid model was applied in the emergency department of one hospital. The classification system, built from 327 association rules and selected lab tests, was trained using C5.0, Neural Network (NN), and SVM; the C5.0 algorithm achieved 94.18% accuracy whereas NN and SVM achieved 88.89% and 85.19%, respectively.
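Association rules of the kind mined in the second stage reduce to two quantities: support (how often antecedent and consequent co-occur) and confidence (how often the consequent holds given the antecedent). A self-contained sketch with made-up transactions and item names:

```python
# Minimal support/confidence computation for association rules.
# support(A -> B) = P(A and B); confidence(A -> B) = P(A and B) / P(A).
transactions = [
    {"troponin_high", "ecg_abnormal", "AMI"},
    {"troponin_high", "ecg_abnormal", "AMI"},
    {"troponin_high", "AMI"},
    {"ecg_abnormal", "angina"},
    {"troponin_normal", "other"},
]

def support(itemset):
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent):
    return support(antecedent | consequent) / support(antecedent)

rule_supp = support({"troponin_high", "AMI"})          # 3 of 5 transactions
rule_conf = confidence({"troponin_high"}, {"AMI"})     # AMI in every such case
```

In practice an algorithm such as Apriori enumerates only frequent itemsets before computing these measures, but the measures themselves are exactly the ones above.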

4.1.5. Intensive Care

Intensive care units (ICUs) cater to patients with severe and life-threatening illness and injury requiring constant, close monitoring and support to maintain normal bodily function. Death is much more common in an ICU than in a general medical unit; one study showed that 22.4% of all hospital deaths occurred in the ICU [ 87 ]. Survival predictions and identification of important factors related to mortality can help healthcare providers plan care. We identified two papers [ 59 , 60 ] that developed models for ICU mortality prediction. Using a large amount of ICU patient data (specifically from the first 24 h of the stay) collected at the University of Kentucky Hospital from 1998 to 2007 (38,474 admissions), one group of researchers identified 15 of 40 features as significant using Pearson's chi-square test (for categorical variables) and Student's t-test (for continuous variables) [ 59 ]. The mortality rate was predicted with DT, ANN, SVM, and APACHE III, a logistic regression based approach; DT's AUC was higher than the other methods' by 0.02. The study was limited, however, by considering only the first 24 h of ICU admission, which may not be enough to predict mortality. Another team of researchers [ 60 ] applied a similarity metric to predict 30-day mortality in 17,152 ICU admissions extracted from the MIMIC-II database [ 88 ]. Their analysis concluded that training on a large group of similar patients (e.g., by vital signs and laboratory test results) rather than on all patients leads to slightly better prediction accuracy: the logistic regression model achieved an AUC of 0.83 when trained on the 5000 most similar patients, but its performance declined to an AUC of 0.81 when all available patient data were used.
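The screening step in [ 59 ], univariate tests that pass only significant features to the classifier, is commonly wrapped in a pipeline. An illustrative sketch (synthetic continuous data, so an ANOVA F-test is swapped in where the study used chi-square for its categorical variables; the counts 40 and 15 are borrowed from the study, everything else is an assumption):

```python
# Filter-style feature selection feeding a classifier: keep the 15
# highest-scoring of 40 candidate features, then fit and cross-validate.
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

X, y = make_classification(n_samples=1000, n_features=40, n_informative=8,
                           random_state=3)
pipe = make_pipeline(SelectKBest(f_classif, k=15),
                     LogisticRegression(max_iter=1000))
score = cross_val_score(pipe, X, y, cv=5).mean()
```

Putting selection inside the pipeline keeps the test folds untouched by the selection step, avoiding optimistic bias.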

4.1.6. Other Applications

In addition to CVD, diabetes, cancer, emergency care, and ICU care, data mining has been applied to various clinical decision-making problems such as pressure ulcer risk prediction, patient problem lists, and personalized medical care. To predict pressure ulcer formation (localized skin and tissue damage caused by shear, friction, pressure, or any combination of these factors), researchers [ 62 ] developed two classification-based predictive models: one included all 14 features (including age, sex, course, anesthesia, body position during operation, and skin status) and the other, a reduced model, included significant features only (5 in the DT model; 7 in the SVM, LR, and Mahalanobis Taguchi System models). The Mahalanobis Taguchi System (MTS), SVM, DT, and LR were used for classification and, in the reduced model, for feature selection. LR and SVM performed slightly better when all features were included, but MTS achieved better sensitivity and specificity in the reduced model (+10% to +15%). These machine learning techniques can provide better assistance in pressure ulcer risk prediction than the traditional Norton and Braden medical scales [ 62 ]. Though the study demonstrates the advantages of data mining algorithms, the data set was imbalanced, with only 8 cases of pressure ulcer among 168 patients. Also addressing pressure ulcers, another team of researchers [ 63 ] recommended a data mining based alternative to the Braden scale for prediction. They applied data mining algorithms to four years of longitudinal patient data to identify the most important predictors of pressure ulcers (i.e., days of stay in the hospital, serum albumin, and age). In terms of the C-statistic, RF (0.83) provided the highest predictive accuracy over DT (0.63), LR (0.82), and multivariate adaptive regression splines (0.78).

For data mining algorithms, which often perform poorly on imbalanced data (i.e., where one class occurs rarely compared to the others), researchers [ 70 ] developed a sub-sampling technique. They designed two experiments, one with the sub-sampling technique and one without. On a highly imbalanced data set, Random Forest (RF), SVM, and Bagging and Boosting achieved better classification accuracy with sub-sampling when classifying eight diseases (male genital disease, testis cancer, encephalitis, aneurysm, breast cancer, peripheral atherosclerosis, and diabetes mellitus) that each had less than 5% occurrence in the National Inpatient Sample (NIS) data of the Healthcare Cost and Utilization Project (HCUP). Possibly because sub-sampling balanced the dataset, RF slightly outperformed (+0.01 AUC) the other two methods.
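The simplest form of the sub-sampling idea, randomly drawing the majority class down to the minority size before training, can be sketched as follows (illustrative only; the authors' exact scheme is not specified here, and the data are synthetic):

```python
# Random under-sampling of the majority class to a 50/50 training set.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(42)
X = rng.normal(size=(1000, 6))
y = (rng.random(1000) < 0.05).astype(int)   # ~5% positive class, as in NIS

pos = np.flatnonzero(y == 1)
neg = np.flatnonzero(y == 0)
neg_sample = rng.choice(neg, size=len(pos), replace=False)
idx = np.concatenate([pos, neg_sample])

X_bal, y_bal = X[idx], y[idx]
clf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_bal, y_bal)
balance = y_bal.mean()   # exactly 0.5 after sub-sampling
```

Discarding majority examples trades some information for a classifier that is no longer dominated by the majority class; repeating the draw and averaging (as bagging-style methods do implicitly) recovers much of the lost information.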

The patient problem list is a vital component of clinical medicine: it enables decision support and quality measurement, but it is often incomplete. Researchers [ 64 ] have suggested that a complete problem list leads to better quality treatment in terms of final outcome. Complete problem lists give clinicians a better understanding of the issue and influence diagnostic reasoning. One group of researchers proposed a data mining model to find associations between patient problems and prescribed medications and laboratory tests, which can support clinical decision-making [ 64 ]. Domain experts currently spend a large amount of time on this task, and association rule mining can save both time and other resources. Additionally, unstructured data such as doctors' and nurses' written comments and notes can provide further information. These association rules can help clinicians prevent diagnostic errors and reduce treatment complexity: for example, if a set of problems and medications frequently co-occur and a clinician knows this relation, he/she can prescribe similar medications when faced with a similar set of problems. One group of researchers [ 61 ] developed an approach that achieved 90% accuracy in finding associations between medications and problems, and 55% accuracy between laboratory tests and problems. Among outpatients diagnosed with respiratory infection, 92.79% were treated with drugs, and physicians could choose any of the 100,013 drugs available in the inventory. In an attempt to examine treatment plan patterns, the researchers identified the 78 most commonly prescribed drugs, regardless of patients' complaints and demography. The classification model used to identify the most common drugs achieved 74.73% accuracy and, most importantly, found that variables like age, race, gender, and patient complaints were insignificant.

Personalized medicine, treatment tailored to a patient's predicted response or risk of disease, is another venue for data mining algorithms. One group of researchers [ 66 ] used a big data framework to create a personalized care system: a patient's medical history is compared with other available patient data, the likelihood of each disease for that individual is calculated, and all candidate diseases are ranked from high to low risk. This approach is very similar to how online giants Netflix and Amazon suggest movies and books to customers [ 66 ]. Another group of researchers [ 67 ] used Electronic Patient Records (EPR), which contain structured data (e.g., disease codes) and unstructured data (e.g., notes and comments made by doctors and nurses at different stages of treatment), to develop personalized care. From the unstructured text, the researchers extracted clinical terms and mapped them to an ontology. Using these mapped codes and the existing structured data (disease codes), they created a phenotypic profile for each patient. Patients were divided into clusters (with 87.78% precision) based on the similarity of their phenotypic profiles. Correlations between diseases were captured by counting the co-occurrences of two or more diseases in patient phenotypes. The protein/gene structures associated with the diseases were then identified and a protein network created; correlations were identified from diseases sharing specific protein structures.

Among patients with asthma, researchers [ 65 ] used environmental and patient physiological data to develop a prediction model for asthma attacks, giving doctors and patients a chance at prevention. They combined data from a home-care institute, where patients entered their physical condition online, with environmental data (air pollutant and weather data). Their data mining model involved feature selection through sequential pattern mining and risk prediction using DT and association rule mining, and predicted asthma attack risk with 86.89% accuracy. Real-world implementation showed that patients found the risk predictions helpful for avoiding severe asthma attacks.

Among patients with Parkinson's disease, researchers [ 73 ] introduced a comprehensive end-to-end protocol for characterization, manipulation, processing, cleaning, analysis, and validation of complex and heterogeneous data. Specifically, the researchers used the Synthetic Minority Over-sampling Technique (SMOTE) to rebalance the data set, which improved SVM's classification accuracy from 76% to 96% and AdaBoost's from 96% to 99%. Moreover, the study found that traditional statistical classification approaches (e.g., the generalized linear model) failed to generate reliable predictions, whereas machine learning based classification methods performed very well in terms of predictive precision and reliability.
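Unlike sub-sampling, SMOTE adds synthetic minority examples by interpolating between a minority point and one of its minority-class nearest neighbors. A pure-NumPy sketch of that idea (illustrative, not the protocol's implementation; all parameters are assumptions):

```python
# SMOTE-style oversampling: each synthetic point lies on the segment
# between a minority sample and one of its k minority-class neighbors.
import numpy as np
from sklearn.neighbors import NearestNeighbors

def smote_like(X_min, n_new, k=5, seed=0):
    rng = np.random.default_rng(seed)
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X_min)
    _, idx = nn.kneighbors(X_min)                 # idx[:, 0] is the point itself
    base = rng.integers(0, len(X_min), n_new)     # pick a minority sample
    nbr = idx[base, rng.integers(1, k + 1, n_new)]  # pick one of its neighbors
    gap = rng.random((n_new, 1))                  # interpolation fraction
    return X_min[base] + gap * (X_min[nbr] - X_min[base])

X_min = np.random.default_rng(1).normal(size=(20, 3))
X_new = smote_like(X_min, n_new=40)
```

Because the synthetic points are interpolations, they stay inside the region occupied by the minority class rather than merely duplicating existing samples, which is what gives SMOTE its edge over naive oversampling.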

Among patients with kidney disease, researchers [ 71 ] developed a model to predict survival. Data collected from four facilities of the University of Iowa Hospitals and Clinics contained 188 patients with over 707 visits and features such as blood pressure measures, demographic variables, and dialysis solution contents. The data were transformed using functional relations between features (i.e., when two features take the same values for a set of patients, they are combined into a single feature). The data set was randomly divided into eight sub-sets, and sixteen classification rules were generated for them using two classification algorithms, Rough Set (RS) and DT. The classes represented survival beyond three years, survival less than three years, and undetermined. To make predictions, each of the 16 classification rules had one vote and the majority vote decided the final predicted class. The transformed data increased predictive accuracy by 11% over the raw data, and DT (67% accuracy) performed better than RS (56% accuracy). The researchers suggested that this type of predictive analysis can be helpful for personalized treatment selection, resource allocation for patients, and clinical study design. Among patients on kidney dialysis, another group of researchers [ 74 ] applied temporal pattern mining to biochemical data to predict hospitalization. Their results showed that the amount of albumin, a protein circulating in the blood, is the most important predictor of hospitalization due to kidney disease.

Among patients over 50 years of age, researchers [ 75 ] developed a data mining model to predict five-year mortality using the EHRs of 7463 patients. They used an Ensemble Rotating Forest algorithm with alternating decision trees to classify the patients into two life-expectancy classes: (1) less than five years and (2) five years or more. Age, comorbidity count, previous hospitalization record, and blood urea nitrogen were among the significant features selected by correlation feature selection with a greedy stepwise search. The accuracy achieved by this approach (AUC 0.86) was greater than that of the standard modified Charlson Index (AUC 0.81) and the modified Walter Index (AUC 0.78). The study showed that age, hospitalization prior to the visit, and highest blood urea nitrogen were the most important factors for predicting five-year mortality. Such a model can help allocate resources such as cancer screening to the patients most likely to benefit from them.

Another group of researchers [ 76 ] addressed the limitations of existing software technologies for disease diagnosis and prognosis, such as the inability to handle data streams (DT), impracticality for complex and large systems (Bayesian Network), and exhaustive training processes (NN). To overcome these restrictions, the authors proposed a decision tree based algorithm called the Very Fast Decision Tree (VFDT). Comparison with a similar system developed by IBM showed that VFDT uses fewer system resources and can perform real-time classification.
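The quantity that lets VFDT-style Hoeffding trees learn from a stream is the Hoeffding bound: after n samples, the observed mean of a statistic with range R is within eps of its true mean with probability 1 - delta, so a split can be committed as soon as the top two candidate attributes differ by more than eps. The standard formula (not code from [ 76 ]):

```python
# Hoeffding bound: eps = sqrt(R^2 * ln(1/delta) / (2n)).
# The bound shrinks as more stream examples arrive, so split decisions
# made on a finite prefix converge to those of a batch learner.
import math

def hoeffding_bound(R, delta, n):
    return math.sqrt((R * R * math.log(1.0 / delta)) / (2.0 * n))

# e.g., information gain with two classes has range R = 1
eps_small_n = hoeffding_bound(R=1.0, delta=1e-7, n=100)
eps_large_n = hoeffding_bound(R=1.0, delta=1e-7, n=100000)
```

This is why VFDT needs so few resources: it never stores or revisits past examples, only sufficient statistics per leaf, and splits once the bound is satisfied.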

Researchers have also used data mining to optimize the glaucoma diagnosis process [ 68 ]. Traditional approaches, including Optical Coherence Tomography, Scanning Laser Polarimetry (SLP), and Heidelberg Retina Tomography (HRT) scanning, are costly. This group instead used less costly fundus image data and classified each patient as normal or glaucomatous using an SVM classifier. Before classification, the authors selected significant features using Higher Order Spectra (HOS) and the Discrete Wavelet Transform (DWT), both combined and separately. Several kernel functions for the SVM, all delivering similar levels of accuracy, were tried, and the approach produced 95% accuracy in glaucoma prediction. For diagnostic evaluation of chest imaging with suspicion of malignancy, researchers [ 69 ] designed trigger criteria to identify potential follow-up delays. The developed trigger filtered out patients who did not require follow-up evaluation, and experimental results indicated that the algorithm identifies delays in following up abnormal imaging effectively, with 99% sensitivity and 38% specificity.

Data mining has also been applied [ 72 ] to compare three surveillance approaches for identifying healthcare-associated infections (catheter-associated bloodstream infections, catheter-associated urinary tract infections, and ventilator-associated pneumonia). Researchers compared traditional surveillance using National Healthcare Safety Network methodology, data mining using MedMined Data Mining Surveillance (CareFusion Corporation, San Diego, CA, USA), and administrative coding using ICD-9-CM. Traditional surveillance proved superior to data mining in terms of sensitivity, positive predictive value, and rate estimation.

Data mining has been used in 38 studies of clinical decision-making: CVD (seven articles), diabetes (seven articles), cancer (five articles), emergency care (two articles), intensive care (two articles), and other applications (16 articles). Most of the studies developed predictive models to facilitate decision-making, and some developed decision support systems or tools. Authors often tested their models with multiple algorithms; SVM was at the top of that list and often outperformed other algorithms. However, 15 of the studies [ 38 , 40 , 42 , 45 , 47 , 51 , 54 , 56 , 58 , 60 , 61 , 66 , 73 , 74 , 76 ] did not incorporate expert opinion from doctors, clinicians, or other appropriate healthcare personnel when building models and interpreting results (see the study characteristics in Supplementary Materials Table S3 ). We also noted an absence of follow-up studies on the predictive models: how the models performed in dynamic decision-making situations, whether doctors and healthcare professionals are comfortable using them, and what challenges, if any, arise in implementing them. The existing literature does not address these salient issues.

4.2. Healthcare Administration

Data mining was applied to administrative purposes in healthcare in 32% (29 articles) of the articles reviewed. Researchers have applied data mining to data warehousing and cloud computing; quality improvement; cost reduction; resource utilization; patient management; and other areas. Table 6 lists these articles with their major focus areas, the problems analyzed, and the data sources.

Problems analyzed and data sources in healthcare administration.

4.2.1. Data Warehousing and Cloud Computing

Data warehousing [ 90 ] and cloud computing are used to securely and cost-effectively store the growing volume of electronic patient data [ 1 ] and to improve hospital outcomes, including readmissions. To identify causes of readmission, researchers [ 89 ] developed an open source software package, the Analytic Information Warehouse (AIW). Users can design a virtual data model (VDM) with this software, extract the data required to test the model from the warehouse in terms of a temporal ontology, and perform the analysis with any standard tool. Another group of researchers took a similar approach to develop a Clinical Data Warehouse (CDW) for traditional Chinese medicine (TCM). The warehouse contains clinical information (e.g., symptoms, diseases, and treatments) for 20,000 inpatients and 20,000 outpatients, collected in structured electronic form using a pre-specified ontology. The CDW provides an interface for online data mining, online analytical processing (OLAP), and network analysis to discover knowledge and provide clinical decision support. Using these tools, classification, association, and network analysis between symptoms, diseases, and medications (i.e., herbs) can be performed.

Apart from clinical purposes, data warehouses can be used for research, training, education, and quality control. One such data repository was created using the basic idea of the Google search engine [ 92 ]: following a predefined patient privacy protocol, users can pull radiology report files by searching keywords, much like a simple Google search. Another data repository was created as part of a collaborative study between IBM, the University of Virginia, and its partner, the Virginia Commonwealth University Health System [ 93 ]. The repository contains 667,000 patient records with 208 attributes. HealthMiner, a data mining package for healthcare created by IBM, was used to perform unsupervised analyses such as association finding, pattern discovery, and knowledge discovery, and the study showed the research benefits of this type of large data repository. Researchers [ 91 ] also proposed a framework based on cloud computing and big data to unify data collected from different sources such as public databases and personal health devices. The architecture was divided into three layers: the first unified heterogeneous data from different sources, the second provided storage support and facilitated data processing and analytics access, and the third presented analysis results and a platform for professionals to develop analytical tools. Other researchers [ 94 ] used mobile devices to collect personal health data: users took part in a survey on their mobile devices and received a diagnosis report based on the health parameters they entered. Each survey record was saved in a cloud-based interface for effective storage and management, and interactive geo-spatial maps were developed from the stored input to provide effective data visualization.

4.2.2. Healthcare Cost, Quality and Resource Utilization

Ten articles applied data mining to cost reduction, quality improvement, and resource utilization. One group of researchers predicted healthcare costs using an algorithmic approach [ 96 ]. They used medical claims data for 800,000 people collected by an insurance company over the period 2004–2007; the data included diagnoses, procedures, and drugs. Using classification and clustering algorithms, they found that these data mining methods reduced the absolute prediction error by more than 16%. Two prediction models were developed, one using both cost and medical information and the other using only cost information. Both models had similar accuracy in predicting healthcare costs and performed better than traditional regression methods; notably, including medical information did not improve cost prediction accuracy. Risk-adjusted healthcare cost predictions, with diagnostic groups and demographic variables as inputs, have also been assessed using regression tree boosting [ 100 ]. Boosted regression trees and main-effects linear models were fitted to predict current (2001) and prospective (2002) total healthcare costs per patient. The authors concluded that the combination of regression tree boosting and a diagnostic grouping scheme is a competitive alternative to commonly used risk-adjustment systems.
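Regression tree boosting, as used in [ 100 ], repeatedly fits small trees to the residuals of the current ensemble. The toy sketch below illustrates the idea with one-split "stumps" in plain Python; the feature names, patient data, and hyperparameters are illustrative assumptions, not details from the study.

```python
# Toy gradient-boosting sketch: each round fits a one-split regression
# "stump" to the residuals of the current ensemble, then adds a shrunken
# copy of it. Data are hypothetical [age, diagnosis count] -> annual cost.

def fit_stump(X, resid):
    """Find the single-feature threshold split minimizing squared error."""
    best = None
    n = len(X)
    for j in range(len(X[0])):
        for t in sorted({row[j] for row in X}):
            left = [resid[i] for i in range(n) if X[i][j] <= t]
            right = [resid[i] for i in range(n) if X[i][j] > t]
            if not left or not right:
                continue
            lm, rm = sum(left) / len(left), sum(right) / len(right)
            err = (sum((v - lm) ** 2 for v in left)
                   + sum((v - rm) ** 2 for v in right))
            if best is None or err < best[0]:
                best = (err, j, t, lm, rm)
    return best[1:]  # (feature index, threshold, left mean, right mean)

def boost(X, y, rounds=200, lr=0.1):
    base = sum(y) / len(y)          # start from the mean cost
    pred = [base] * len(X)
    stumps = []
    for _ in range(rounds):
        resid = [y[i] - pred[i] for i in range(len(y))]
        j, t, lm, rm = fit_stump(X, resid)
        stumps.append((j, t, lm, rm))
        for i in range(len(X)):     # shrink each stump's contribution
            pred[i] += lr * (lm if X[i][j] <= t else rm)
    return base, stumps

def predict(base, stumps, x, lr=0.1):
    p = base
    for j, t, lm, rm in stumps:
        p += lr * (lm if x[j] <= t else rm)
    return p

X = [[45, 1], [60, 3], [70, 5], [50, 2], [80, 6], [30, 0]]
y = [1200.0, 4500.0, 9800.0, 2100.0, 15000.0, 600.0]
base, stumps = boost(X, y)
```

After enough rounds the ensemble reproduces the training costs closely; in practice a library implementation (and a held-out set) would be used instead of this exhaustive search.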

A sizable amount ($37.6 billion) of healthcare costs is attributable to medical errors, 45% of which stems from preventable errors [ 95 ]. To aid physician decision-making and reduce medical errors, researchers [ 95 ] proposed a data mining-based framework, the Sequential Clustering Algorithm. They identified patterns in treatment plans, tests, medication types and dosages prescribed for specific diseases, and other services provided to treat a patient throughout his/her hospital stay. The framework was based on cloud computing so that the knowledge extracted from the data could be shared among hospitals without sharing the actual records; the authors proposed sharing models as Virtual Machine (VM) images to facilitate collaboration among international institutions and prevent the threat of data leakage. The model was implemented in two hospitals, one in Taiwan and one in Mongolia. To identify best practices for specific diseases and prevent medical errors, another group of researchers [ 101 ] proposed a decision support system that uses text and data mining to extract information from online documents, focusing on evidence-based management, quality control, and best-practice recommendations for medical prescriptions.

Length of Stay (LOS) is another important indicator of cost and quality of care, and accurate prediction of LOS can lead to efficient management of hospital beds and resources. To predict LOS for CAD patients, researchers [ 98 ] compared multiple models: SVM, ANN, DT, and an ensemble algorithm combining SVM, C5.0, and ANN. The ensemble algorithm and SVM produced the highest accuracies, 95.9% and 96.4% respectively; DT achieved 83.5%, while ANN was least accurate at 53.9%. Anticoagulant drugs, nitrate drugs, and diagnosis were the top three predictors, along with diastolic blood pressure, marital status, sex, presence of comorbidity, and insurance status.

To predict healthcare quality, researchers [ 104 ] used sentiment analysis (computationally categorizing opinions as positive, negative, or neutral) on patients’ online comments about their experience. They found more than 80% agreement on quality measures (e.g., cleanliness, good behavior, recommendation) between sentiment analysis of online forums and traditional paper-based surveys. The proposed approach can be an inexpensive alternative to traditional surveys and reports for measuring healthcare quality.
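The study [ 104 ] does not publish its classifier, but the core idea of categorizing comments can be sketched with a simple lexicon-based scorer; the tiny word lists and example comments below are purely illustrative assumptions.

```python
# Minimal lexicon-based sentiment sketch: count positive vs. negative cue
# words in a patient comment and label it. Real systems use far richer
# lexicons or trained models; this only illustrates the categorization step.

POSITIVE = {"clean", "friendly", "helpful", "recommend", "excellent", "caring"}
NEGATIVE = {"dirty", "rude", "slow", "painful", "avoid", "unhelpful"}

def classify_comment(text):
    words = text.lower().replace(".", " ").replace(",", " ").split()
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

comments = [
    "The ward was clean and the nurses were friendly. I would recommend it.",
    "Rude staff and a dirty waiting room. Avoid this place.",
    "I had an appointment on Tuesday.",
]
labels = [classify_comment(c) for c in comments]
```

Aggregating such labels per hospital is what allows comparison against survey-based quality scores.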

Identifying influential factors in insurance coverage through data mining can help insurance providers and regulators design targeted or additional services and allocate resources properly to increase coverage rates. Researchers [ 103 ] used data mining techniques to develop a classification model for health insurance coverage. Based on 23 socio-economic, lifestyle, and demographic factors, they built a two-class model (insured and uninsured) using ANN and DT; ANN was 4% more accurate than DT in predicting coverage. Among the factors, income, employment status, education, and marital status were the most important predictors of insurance coverage.

Among patients with lung cancer, researchers [ 97 ] investigated healthcare resource utilization (i.e., the number of visits to medical oncologists). They used DT, ANN, and LR separately, and an ensemble algorithm combining DT and ANN yielded the greatest accuracy (60% predictive accuracy); DT was employed to identify the important predictive features (among demographics, diagnosis, and other medical information) and ANN for classification. Data mining revealed that lung cancer patients’ utilization of healthcare resources is “supply-sensitive and patient sensitive”, where supply represents the availability of resources in a region and patient represents patient preference and comorbidity. A resource allocation monitoring model for better management of a primary healthcare network has also been developed [ 99 ]. The researchers modeled the primary-care network as a collection of hierarchically connected modules, given that patients could visit multiple physicians and physicians could have multiple care locations, an indication of imbalanced resource distribution (e.g., number of physicians, care locations). The first level of the hierarchy consisted of three modules: health activities, population, and health resources. The second level monitored healthcare provider availability and dispersion. The third level considered actual visits, physicians and their availability and accessibility, and unlisted patients (i.e., those without an assigned physician). The top level conducted an overall assessment of the network and made allocations accordingly. This hierarchical model was developed for a specific region in Slovenia; however, it could easily be adapted to any other region.

Overuse of screening and tests by physicians also contributes to inefficiencies and excess costs [ 102 ], and current practice in pathology diagnosis is limited by its disease focus. As an alternative to the disease-based system, researchers [ 102 ] combined data mining with case-based reasoning to develop an evidence-based decision support system that decreases the use of unnecessary tests and reduces costs.

4.2.3. Patient Management

Patient management involves activities related to efficiently scheduling and providing care to patients during their stay in a healthcare institution. Researchers [ 105 ] developed an efficient scheduling system for a rural free clinic in the United States. They proposed a hybrid system in which data mining was used to classify patients and association rule mining was used to assign each patient a “no-show” probability; the results were then used to simulate and evaluate different scheduling techniques. Clinic visits can also be divided into administrative and medical visits. Researchers [ 108 ] observed that patients who visit a health center for administrative purposes take less time than patients with medical needs, and proposed a predictive model to forecast the number of administrative visits. Their model improved the scheduling system, saving 21.73% of scheduled time (660,538 min). In contrast to administrative information/task-seeking patients, some patients seek medical care very frequently and consume a large share of the clinical workload [ 107 ]. Identifying risk factors for frequent visits to health centers can help reduce costs and resource utilization. A study among 85 working-age “frequent attenders” identified the primary risk factors using a Bayesian classification technique: “high body mass index, alcohol abstinence, irritable bowel syndrome, low patient satisfaction, and fear of death” [ 107 ].
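The "no-show" probability in [ 105 ] comes from association rule mining over past visit records. A minimal sketch of that idea, under the assumption that the rule's confidence is read as the no-show probability for a patient class (the attributes and records below are invented for illustration):

```python
# Support/confidence of an association rule antecedent -> consequent over
# past visit records, e.g. {weekday=Mon, new_patient} -> no_show.

def rule_stats(records, antecedent, consequent):
    """Return (support, confidence) of the rule antecedent -> consequent."""
    matches = [r for r in records
               if all(r.get(k) == v for k, v in antecedent.items())]
    hits = [r for r in matches
            if all(r.get(k) == v for k, v in consequent.items())]
    support = len(hits) / len(records)
    confidence = len(hits) / len(matches) if matches else 0.0
    return support, confidence

visits = [
    {"weekday": "Mon", "new_patient": True,  "no_show": True},
    {"weekday": "Mon", "new_patient": True,  "no_show": True},
    {"weekday": "Mon", "new_patient": True,  "no_show": False},
    {"weekday": "Tue", "new_patient": False, "no_show": False},
    {"weekday": "Wed", "new_patient": False, "no_show": False},
    {"weekday": "Mon", "new_patient": False, "no_show": False},
]
sup, conf = rule_stats(visits,
                       {"weekday": "Mon", "new_patient": True},
                       {"no_show": True})
# conf acts as the estimated no-show probability for that patient class
```

A scheduler can then overbook slots whose expected no-show rate (confidence) exceeds a threshold.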

Improving publicly reported patient safety outcomes is also critical to healthcare institutions. Falls are one such outcome and are the most common and costly source of injury during hospitalization [ 110 ]. Researchers [ 109 ] analyzed the important factors related to patient falls during hospitalization. The authors first selected significant features with a Chi-square test (10 of 72 fall-related variables) and then applied ANN to develop a predictive model, achieving an AUC of 0.77; stepwise logistic regression with 3 important variables achieved an AUC of 0.42. Both models showed that fall assessment by nurses and the use of anti-psychotic medication are associated with a lower risk of falls, while the use of diuretics is associated with an increased risk of falls. Another group of researchers [ 110 ] used fall-related injury data to validate the structured information in the EMR against clinical notes with the help of text mining. A group of nurses manually reviewed the electronic records to separate correct documents from erroneous ones, and this labeling served as the basis of comparison. The authors employed both a supervised technique (using a portion of the manually labeled files as the training set) and an unsupervised technique (ignoring the file labels) to classify and cluster the records. The unsupervised technique failed to separate the correct documents from the erroneous ones, whereas the supervised technique performed better, placing 86% of the correct documents in one cluster. This method could be applied to semi-automate the EMR entry system.
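The first stage of the fall study [ 109 ], Chi-square screening of candidate variables, can be sketched as follows. The feature names, toy records, and the decision to treat every feature as binary are illustrative assumptions; the study used 72 variables and standard statistical software.

```python
# Chi-square screening of binary features against a binary outcome:
# build the 2x2 contingency table per feature and rank by the statistic.

def chi2_2x2(a, b, c, d):
    """Chi-square statistic for table [[a, b], [c, d]], no continuity correction."""
    n = a + b + c + d
    num = n * (a * d - b * c) ** 2
    den = (a + b) * (c + d) * (a + c) * (b + d)
    return num / den if den else 0.0

def screen_features(rows, features, label):
    scores = {}
    for f in features:
        a = sum(1 for r in rows if r[f] and r[label])         # feature present, fell
        b = sum(1 for r in rows if r[f] and not r[label])     # feature present, no fall
        c = sum(1 for r in rows if not r[f] and r[label])     # feature absent, fell
        d = sum(1 for r in rows if not r[f] and not r[label]) # feature absent, no fall
        scores[f] = chi2_2x2(a, b, c, d)
    return sorted(scores, key=scores.get, reverse=True)

patients = [
    {"antipsychotic": 1, "diuretic": 1, "fell": 1},
    {"antipsychotic": 0, "diuretic": 1, "fell": 1},
    {"antipsychotic": 0, "diuretic": 1, "fell": 1},
    {"antipsychotic": 1, "diuretic": 0, "fell": 0},
    {"antipsychotic": 1, "diuretic": 0, "fell": 0},
    {"antipsychotic": 0, "diuretic": 0, "fell": 0},
]
ranked = screen_features(patients, ["antipsychotic", "diuretic"], "fell")
```

Only the top-ranked features would then be fed into the ANN, which is exactly the two-stage pipeline the paper describes.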

4.2.4. Other Applications

Data mining has been applied [ 111 ] to investigate the relationship between physicians’ training at specific schools, the procedures they perform, and the costs of those procedures. Researchers explored this relationship at three levels: (1) the distribution of procedures performed; (2) the relationship between the procedures a physician performs and their alma mater (the institution the physician attended or graduated from); and (3) the geographic distribution of amounts billed and payments received. The study suggested that medical school training does relate to practice in terms of procedures performed and bills charged. Patients can also provide useful information about physicians and their performance: another group of researchers [ 112 ] used a topic modeling algorithm, Latent Dirichlet Allocation (LDA), to understand patients’ reviews of physicians and their concerns.

Data mining has also been applied [ 115 ] to analyze the information-seeking behavior of healthcare professionals and to assess the feasibility of measuring drug safety alert response from the usage logs of online medical information resources. Researchers analyzed two years of user log-in data on the UpToDate website to measure the volume of searches associated with medical conditions and the seasonal distribution of those searches. In addition, they used a large collection of online media articles and web log posts to characterize food and drug alerts through changes in UpToDate search activity relative to general media activity. Some researchers [ 113 ] examined changes in key performance indicators (KPIs) and clinical workload indicators in Greek National Health System (NHS) hospitals with the help of data mining. They found significant changes in KPIs when necessary adjustments (e.g., workload) were made according to diagnostic related groups. The results held for specialized institutions such as cancer and cardiac surgery hospitals as well as for small health centers and regional hospitals. Their findings suggested that the assessment methodology for Greek NHS hospitals should be re-evaluated to identify weaknesses in the system and improve overall performance. In home healthcare, another group of researchers [ 116 ] reviewed why traditional statistical analysis fails to evaluate the performance of home healthcare agencies. The authors proposed using data mining, with length of stay and discharge destination as outcomes, to identify the drivers of home healthcare service among patients with heart failure, hip replacement, and chronic obstructive pulmonary disease.

The relationship between epidemiological and genetic evidence and post-market medical device performance has been evaluated using HCUPNet data [ 114 ]. This feasibility study explored the potential of using publicly accessible data to identify genetic evidence (e.g., comorbidity of genetic factors such as race, sex, body structure, and pneumothorax or fibrosis) related to devices, focusing on the ventilation-associated iatrogenic pneumothorax outcome in discharges involving mechanical ventilation and continuous positive airway pressure (CPAP). The results demonstrated that genetic evidence-based epidemiologic analysis could lead to cost- and time-efficient identification of predictive features. Overall, the literature on data mining applications in healthcare administration encompasses efficient patient management, healthcare cost reduction, quality of care, and data warehousing to facilitate analytics. We identified four studies that used cloud-based computing and analytical platforms. Most of the research proposed promising ideas; however, few studies report results and/or challenges during and after implementation. An ideal example of implementation could be the study of efficient appointment scheduling of patients [ 108 ].

4.3. Healthcare Privacy and Fraud Detection

Health data privacy and medical fraud are issues of prominent importance [ 118 ]. We reviewed four articles—displayed and described in Table 7—that discussed healthcare privacy and fraud detection.

List of papers in healthcare privacy and fraud detection.

The challenges of privacy protection have been addressed by a group of researchers [ 122 ] who proposed a new anonymization algorithm for both distributed and centralized anonymization. Their model retained more data utility than the K-anonymization model without losing much data privacy (for K = 20, the discernibility ratio, a normalized measure of data quality, was 0.1 for the proposed approach versus 0.4 for traditional K-anonymization), and the algorithm could handle large-scale, high-dimensional datasets. To address the limitations of today’s healthcare information systems (EHR data systems limited by lack of interoperability, data size, and security), a mobile cloud computing-based big data framework has been proposed [ 119 ]. This framework proposed storing EHR data from different healthcare providers in an Internet provider’s facility, offering providers and patients different levels of access and authority. Security would be ensured by encryption algorithms, one-time passwords, or two-factor authentication, and big data analytics would be handled using Google BigQuery or MapReduce software. This framework could reduce cost, increase efficiency, and ensure security compared to the traditional technique of de-identification or anonymization, which leaves healthcare data vulnerable to re-identification. In a case study, researchers demonstrated that hackers can make associations between small pieces of information and thereby identify patients [ 120 ]. The case study used personal information provided on two Medicare social networking sites, MedHelp and Mp and Th1, to identify an individual.
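The discernibility ratio reported for [ 122 ] can be sketched concretely. The common definition charges each equivalence class (records sharing the same generalized quasi-identifiers) a penalty of its size squared and normalizes by n²; lower values mean finer, higher-utility data. The exact normalization in the paper may differ, and the records below are invented for illustration.

```python
# Discernibility ratio: penalize each equivalence class of size s by s^2,
# normalized by n^2 so that 1.0 means "all records indistinguishable".
from collections import Counter

def discernibility_ratio(records, quasi_identifiers):
    classes = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
    n = len(records)
    return sum(s * s for s in classes.values()) / (n * n)

# Illustrative records after generalizing age to a range and ZIP to a prefix
anonymized = [
    {"age": "40-50", "zip": "021**", "diagnosis": "flu"},
    {"age": "40-50", "zip": "021**", "diagnosis": "asthma"},
    {"age": "50-60", "zip": "022**", "diagnosis": "flu"},
    {"age": "50-60", "zip": "022**", "diagnosis": "diabetes"},
]
ratio = discernibility_ratio(anonymized, ["age", "zip"])
```

Here two classes of size 2 give (4 + 4) / 16 = 0.5; a coarser generalization that merged all four records would score 1.0, showing how the metric trades privacy against utility.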

Detection of fraud and abuse (i.e., suspicious care activity, intentional misrepresentation of information, and unnecessary repetitive visits) also uses big data analytics. Using gynecological hospital data, researchers [ 121 ] developed a framework in which two domain experts manually identified features of fraudulent cases from a pool of treatment plans that doctors frequently follow. Applied to Bureau of National Health Insurance (BNHI) data from Taiwan, the proposed framework detected 69% of the fraudulent cases, improving on an existing model that detected 63%.

In summary, patient data privacy and fraud detection are of major concern given the increasing use of social media and people’s tendency to post personal information there. Existing data anonymization or de-identification techniques can become less effective if they are not designed with the fact in mind that a large portion of our personal information is now available on social media.

4.4. Mental Health

Mental illness is a global and national concern [ 123 ]. According to National Survey on Drug Use and Health (NSDUH) data from 2010 to 2012, 52.2% of the U.S. population had either a mental illness or substance abuse/dependence [ 124 ]. Additionally, nearly 30 million people in the U.S. suffer from anxiety disorders [ 125 ]. Table 8 summarizes the four articles we reviewed that apply data mining to analyzing, diagnosing, and treating mental health issues.

List of data mining application in mental health with data sources.

To classify developmental delays of children based on illness, researchers [ 126 ] examined the association between illness diagnoses and delays by building a decision tree and finding associations among cognitive, language, motor, and social-emotional developmental delays. This study has implications for healthcare professionals seeking to identify and intervene on delays at an early stage. To assist physicians in monitoring anxiety disorder, another group of researchers [ 125 ] developed data mining-based personalized treatment. The researchers used context awareness information, both static (personal information such as age, sex, and family status) and dynamic (stress, environmental, and symptom context), to build static and dynamic user models. The static model contained personal information, and the dynamic model contained four treatment-supportive services (i.e., lifestyle and habits pattern detection, context and stress level pattern detection, symptoms and stress level pattern detection, and stress level prediction). Relations between dynamic parameters were identified in the first three services, and the last service was used to predict stress level under different scenarios. The model was validated using data from 27 volunteers selected by an anxiety-measuring test.

To enable early diagnosis of mental disorders (e.g., insomnia, dementia), researchers developed a model that detects abnormal physical activity recorded by a wearable device [ 127 ]. They performed two experiments that differed in how the reference model of a user’s physical movement was built. In the first experiment, users wore the watch for one day, and a reference behavior model was developed from that day; after 22 days, each user wore it again for a day, and an abnormality was flagged if the user’s activities differed significantly from the reference model. In the second experiment, users wore the watch regularly for one month, and abnormality was detected with a fuzzy valuation function. In both experiments, users manually reported their activity levels, which served as the validation reference; only two out of 26 abnormal events went undetected. Based on these two experiments, the researchers claimed that their model, which detected 92% of the unusual events, could be useful for both online and offline abnormal behavior detection.

To classify schizophrenia, another study [ 128 ] used free speech (transcribed text) written or verbalized by psychiatric patients. In a pool of patients with schizophrenia and control subjects, supervised algorithms (SVM and DT) discriminated between the two groups, with SVM achieving 77% classification accuracy and DT 78%. However, when patients with mania were added to the pool, the models could no longer distinguish the patients with schizophrenia.

Using data analytics to diagnose, analyze, or treat mental health patients is quite different from applying analytics to predict cancer or diabetes. The context of the data (static, dynamic, or unobservable environment) appears more important than volume in this case [ 125 ]; however, this is not always reflected in the literature. A model without situational awareness (a context-independent model) may lose predictive accuracy due to the confounding effect of the surrounding environment [ 129 ].

4.5. Public Health

Seven articles addressed issues that were not limited to any specific disease or demographic group, which we classified as public health problems. Table 9 contains the list of papers considering public health problems with data sources.

List of data mining application in public health with data sources.

To make data mining accessible to non-expert users, specifically public health decision makers who manage public cancer treatment programs in Brazil, researchers [ 134 ] developed a framework for an automated data mining system. The system performs descriptive analysis (i.e., identifying relationships between demography, expenditure, and tumor or cancer type) for public decision makers with little or no technical knowledge. Automation was achieved by creating a pre-processed database, an ontology, an analytical platform, and a user interface.

Data analytics has also been applied to the analysis of disease outbreaks [ 131 , 133 ]. Influenza, a highly contagious disease, is associated with seasonal outbreaks, and the ability to predict peak outbreaks in advance would allow anticipatory public health planning and interventions to lessen their effect. To predict peak influenza visits to U.S. military health centers, researchers [ 131 ] developed a method for building models from environmental and epidemiological data. They compared six classification algorithms: One-Classifier 1, One-Classifier 2 [ 137 ], a fusion of the One-Classifiers, DT, RF, and SVM. Among them, One-Classifier 1 was the most effective with an F-score of 0.672, and SVM was second best with an F-score of 0.652. To examine the factors that drive public and professional search patterns during infectious disease outbreaks, another group of researchers [ 133 ] used online behavior records and media coverage. They identified distinct factors driving professional and layperson search patterns, with implications for how public health agencies tailor messaging during outbreaks and emergencies.

Researchers [ 130 ] proposed an intelligent information management framework to store and integrate multidimensional and heterogeneous data (e.g., diabetes, food, and nutrient data); it was applied to diabetes management but is generalizable to other diseases. Their methodology serves as a robust back-end application for web-based patient-doctor consultation and e-health care management systems, with implications for cost savings.

A real-time medical emergency response system based on Internet of Things (networked devices that facilitate data flow) body area networks (BANs), wireless networks of wearable computing devices, was proposed by researchers [ 136 ]. The system includes “Intelligent Building”, a data analysis model that processes the data collected from the sensors for analysis and decision-making. Though the authors claimed that the proposed system could efficiently process wireless BAN data from millions of users to provide real-time emergency response, they did not provide any comparison with state-of-the-art methods.

Decision support tools for regional health institutes in Slovenia [ 135 ] have been developed using descriptive data mining methods and visualization techniques. These visualizations can show resource availability and utilization and assist in future planning of public health services.

To build better customer relationship management at an Iranian hospital, researchers [ 132 ] applied data mining techniques to demographic and transaction information. The authors extended the traditional Recency, Frequency, and Monetary (RFM) model with a new parameter, “Length”, to estimate the customer lifetime value (CLV) of each patient. Patients were separated into classes according to estimated CLV using a combination of clustering and classification algorithms; DT and ANN performed similarly, with approximately 90% classification accuracy. This type of stratification of patient groups by CLV can help hospitals introduce new marketing strategies to attract new customers and retain existing ones.
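The extended "LRFM" model scores each patient on Length, Recency, Frequency, and Monetary value before clustering. A minimal sketch of the scoring step, with invented thresholds, day indices, and patients (the study's actual segmentation used clustering rather than fixed cut-offs):

```python
# LRFM sketch: score each patient on Length, Recency, Frequency, and
# Monetary value, then map the score to a coarse value segment.

def lrfm_score(p, today):
    length = today - p["first_visit"]        # days since first visit
    recency = today - p["last_visit"]        # days since last visit
    score = 0
    score += 1 if length >= 365 else 0       # long-standing patient
    score += 1 if recency <= 90 else 0       # recently active
    score += 1 if p["visits"] >= 10 else 0   # frequent visitor
    score += 1 if p["spend"] >= 5000 else 0  # high monetary value
    return score

def segment(score):
    return {4: "high", 3: "high", 2: "medium"}.get(score, "low")

today = 10_000  # day index, e.g. days since some epoch
patients = {
    "A": {"first_visit": 9000, "last_visit": 9950, "visits": 14, "spend": 8000},
    "B": {"first_visit": 9900, "last_visit": 9990, "visits": 2,  "spend": 300},
}
segments = {k: segment(lrfm_score(p, today)) for k, p in patients.items()}
```

In the study, these four dimensions feed a clustering step (and then DT/ANN classification) instead of hand-picked thresholds, but the per-patient feature construction is the same.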

The application of data mining to public health decision-making has become increasingly common. Researchers have used data mining to design healthcare programs and emergency response, to analyze resource utilization and patient satisfaction, and to develop automated analytics tools for non-expert users. Continuing this effort could lead to a patient-centered, robust healthcare system.

4.6. Pharmacovigilance

Pharmacovigilance involves post-marketing monitoring and detection of adverse drug reactions (ADRs) to ensure patient safety [ 138 ]. The estimated annual social cost of ADR events exceeds one billion dollars, making pharmacovigilance an important part of the healthcare system [ 139 ]. Characteristics of the nine papers addressing pharmacovigilance are displayed in Table 10 .

List of data mining application in pharmacovigilance with data sources.

Researchers considered muscular and renal AEs caused by pravastatin, simvastatin, atorvastatin, and rosuvastatin by applying data mining techniques to the FDA’s Adverse Event Reporting System (FAERS) database reports from 2004 to 2009 [ 143 ]. They found that all statins except simvastatin were associated with muscular AE; rosuvastatin had the strongest association. All statins, besides atorvastatin, were associated with acute renal failure. The criteria used to identify significant association were: proportional reporting ratio (PRR), reporting odds ratio (ROR), information component (IC), and empirical Bayes geometric mean (EBGM). In another study of AEs related to statin family, researchers used a Korean claims database [ 145 ] and showed that a relative risk-based data-mining approach successfully detected signals for rosuvastatin.
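Two of the disproportionality measures named above, PRR and ROR, are simple ratios over a 2×2 table of spontaneous reports. The sketch below shows both; the counts are illustrative, not FAERS values, and real analyses also compute confidence intervals around these point estimates.

```python
# Disproportionality measures from a 2x2 table of spontaneous reports:
#   a = reports with the drug and the event   b = drug, other events
#   c = other drugs with the event            d = other drugs, other events

def prr(a, b, c, d):
    """Proportional reporting ratio: P(event | drug) / P(event | other drugs)."""
    return (a / (a + b)) / (c / (c + d))

def ror(a, b, c, d):
    """Reporting odds ratio: odds of the event on the drug vs. other drugs."""
    return (a / b) / (c / d)

a, b, c, d = 30, 970, 100, 98_900  # illustrative counts
print(f"PRR = {prr(a, b, c, d):.2f}, ROR = {ror(a, b, c, d):.2f}")
```

A commonly cited screening criterion is PRR ≥ 2 with at least 3 cases, which the illustrative counts above would satisfy by a wide margin.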

Three more studies used the FDA’s AERS report database. In an examination of the ADR “hypersensitivity” to six anticancer agents, data mining showed that paclitaxel is associated with mild to lethal reactions, whereas docetaxel is associated with lethal reactions; the other four drugs were not associated with hypersensitivity [ 142 ]. Another researcher [ 139 ] argued that AEs can be caused not only by a single drug but also by a combination of drugs [ 140 ], showing that 84% of the AERS reports contain an association between at least one drug and two AEs or two drugs and one AE. Another group [ 138 ] increased precision in detecting ADRs by considering multiple data sources together, achieving an average improvement of 31% in identification by combining publicly available EHRs with the FDA’s AERS reports.

Furthermore, dose-dependent ADRs have been identified by researchers using models developed from structured and unstructured EHR data [ 141 ]. Among the top five drugs associated with ADRs, four were found to be related to dose [ 141 ]. Pharmacovigilance activity has also been prioritized using unstructured text data in EHRs [ 144 ]. In traditional pharmacovigilance, ADRs are unknown. While looking for association between a drug and any possible ADR, it is possible to get false signals. Such false signals can be avoided if a list of possible ADRs is already known. Researchers [ 144 ] developed an ordered list of 23 ADRs which can be very helpful for future pharmacovigilance activities. To detect unexpected and rare ADRs in real-world healthcare administrative databases, another group of researchers [ 146 ] designed an algorithm—Unexpected Temporal Association Rules (UTARs)—that performs more effectively than existing techniques.

We identified one study that used data outside of adverse event reports or EHR data. For early detection of ADRs, one group of researchers used online forums [ 140 ] to identify the side effects of a specific lung cancer drug, “Erlotinib”. Sentiment analysis (a technique for categorizing opinions) on data collected from different cancer discussion forums showed that 70% of users had a positive experience with the drug; the most frequently reported side effects were acne and rash. Beyond pharmacovigilance, this type of analysis can help pharmaceutical companies analyze customer feedback. Researchers can take advantage of the popularity of social media and online forums for identifying adverse events: these sources can provide AE signals more quickly than the FDA database, which takes time to update, and by the time AE reports appear in the FDA database there may already be significant damage to patients and society. Moreover, such sources can help avoid limitations of the FDA AERS database such as biased reporting and underreporting [ 141 ].

5. Theoretical Study

Twenty-five of the articles we reviewed focus on theoretical aspects of applying data mining in healthcare, ranging from database framework design, data collection, and data management to algorithm development. These contributions extend beyond the analytical perspective of data (descriptive, predictive, or prescriptive analytics) to the sectors and problems highlighted in Table 11 .

Problem analyzed in theoretical studies.

The existing theoretical literature on disease control highlights the current state of epidemics, cancer, and mental health. To help physicians make real-time decisions about patient care, one group of researchers [ 147 ] proposed a real-time EMR data mining-based clinical decision support system. They emphasized the need for an anonymized EMR database that can be explored with a search engine similar to a web search engine. In addition, they focused on designing a framework for a next-generation EMR database that facilitates clinical decision-making and can update a central population database as patients’ new clinical records become available. Another researcher [ 148 ] forecasted future challenges in infection control, underscoring the importance of having timely surveillance systems and prevention programs in place. To that end, they called for the creation, control, and utilization of fully computerized patient records and data-mining-derived epidemiology. Finally, they recommended performance feedback to caregivers, wide accessibility of infection prevention tools, and access to documents such as lessons learned and evidence-based best practices to strengthen infection control, surveillance, and prevention. Authors in [ 150 ] addressed the activities executed by the National Institute of Mental Health (NIMH) in collaboration with other state organizations (e.g., the Substance Abuse and Mental Health Services Administration (SAMHSA) and the Center for Mental Health Services (CMHS)) to promote optimal collection, pooling/aggregation, and use of big data to support ongoing and future research on mental health practices.
Their summary showed that effective pooling/aggregation of state-level data from different sources can serve as a dashboard for setting priorities to improve service quality, measuring system performance, and gaining context-specific insights that are generalizable and scalable across other systems, leading to a successful learning-based mental health care system. Another group of researchers [ 150 ] outlined the barriers and potential benefits of using big data from CancerLinQ, a quality and measurement reporting system initiated by the American Society of Clinical Oncology (ASCO) that collects information from the EHRs of cancer patients so that oncologists can improve the outcomes and quality of the care they provide. The authors noted, however, that these benefits are contingent on patients’ confidence: patients must be encouraged to share their data in the belief that their health records will be used appropriately as a knowledge base to improve the quality of healthcare for others as well as for themselves. This motivated ASCO to ensure that proper policies and procedures are in place for data quality, data security, and data access, and to adopt a comprehensive regulatory framework to protect patients’ data privacy and security.

Another group of researchers [ 151 ] examined data quality and database management to quantify, and consequently understand, the inherent uncertainty originating from radiology reporting systems. They discussed the necessity of a structured reporting system and emphasized the use of standardized language, leading to Natural Language Processing (NLP). Furthermore, they indicated the need to create a consistent Knowledge Discovery in Databases (KDD) process to facilitate data-driven and automated decision support technologies that help improve patient care through enhanced diagnostic quality and clinical outcomes. A group of authors in [ 152 ] pointed out that the success of the current trend of big-data analytics largely depends on how well the quality of the data collected from a variety of sources is ensured. Their findings imply that data quality should be assessed across the entire lifecycle of health data by considering the errors and inaccuracies stemming from multiple sources, and that the impact of the data collection purpose on the knowledge and insights derived from big data analytics should also be quantified. To ensure this, they recommend that enterprises dealing with healthcare big data develop a systematic framework, including custom software or data-quality rule engines, leading to effective management of specific data-quality problems. Researchers in [ 155 ] uncovered the lack of connection between phenomenological and mechanistic models in computational biomedicine. They emphasized the importance of big data which, when successfully extracted and analyzed, and combined with the Virtual Physiological Human (VPH), an initiative to encourage personalized healthcare, can provide effective and robust medical solutions.
To make that happen, they identified several challenges (e.g., confidentiality, volume, and complexity of big data; integration of bioinformatics, systems biology, and phenomics data; efficient storage of partial or complete data within the organization to maximize the performance of overall predictive analytics) and concluded that these must be addressed for the successful development of big data technologies in computational medicine, enabling their adoption in clinical settings. Even though big data can generate significant value in a modern healthcare system, researchers in [ 154 ] stated that without a proper set of IT infrastructure, analytical and visualization tools, and interactive interfaces to represent workflows, the insights generated from big data will not reach their full potential. To overcome this, they recommended that health care organizations engaging in data sharing devise new policies to protect patients' data against potential breaches.

Three papers [ 155 , 156 , 157 ] considered health care policies and ethical and legal issues. One [ 155 ] outlined a national action plan to incorporate sharable and comparable nursing data, beyond documentation of care, into quality reporting and translational research. The plan advocates for standardized nursing terminologies, common data models, and information structures within EHRs. Another paper [ 157 ] analyzed the major policy, ethical, and legal challenges of performing predictive analytics on health care big data. Their recommendations for overcoming challenges across the four-phase life cycle of a predictive analytics model (i.e., data acquisition; model formulation and validation; testing in real-world settings; and implementation and use at broader scale) included developing a governance structure at the earliest phase of model development to guide patients and participating stakeholders throughout the process, from data acquisition to model implementation. They also recommended that model developers strictly comply with the federal laws and regulations governing human subject research and patient information privacy when using patients' data. The third paper [ 156 ] explored four central questions: (i) which aspects of big data are most relevant to health care, (ii) what the policy implications are, (iii) what obstacles stand in the way of achieving policy objectives, and (iv) what policy levers are available, particularly for policy makers developing public policy on the use of big data in healthcare. They discussed barriers to achieving policy objectives (including ensuring transparency among patients and health care providers during data collection) based on a recent UK policy experiment, and argued for providing real-life examples of ways in which data sharing can improve healthcare.

Three papers [ 158 , 159 , 160 ] offered examples of realistic approaches, such as establishing policy leadership and a risk management framework combining commercial and health care entities, to recognize existing privacy-related problems and devise pragmatic, actionable strategies for maintaining patient privacy in big data analytics. One paper [ 158 ] provided a policy overview of health care and data analytics: it outlined the utility of health care data from a policy perspective, reviewed a variety of methods for data collection from public and private sources, mobile devices, and social media, examined the laws and regulations that protect data and patients' privacy, and discussed the dynamic interplay among the policy goals of today's big-data-driven personal health care: tackling cost and population health problems and eliminating disparities in patient care while maintaining privacy. Another study [ 159 ] proposed a Secure and Privacy-Preserving Opportunistic Computing (SPOC) framework for healthcare emergencies, focused on collecting intensive personal health information (through mobile devices such as smartphones or wireless sensors) with minimal privacy disclosure. The premise of this framework is that when a user of the system (called a medical user) faces an emergency, other users in the vicinity with a similar disease or symptom (if available) can come to help before professional help arrives. It is assumed that two persons with a similar disease are skilled enough to help each other, and the threshold of similarity is controlled by the user. The third paper [ 160 ] identified strategies for mining data from physicians' prescriptions while maintaining patient privacy.
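The user-controlled similarity threshold at the heart of SPOC can be illustrated with a small sketch. The Jaccard measure, field names, and values below are our own illustrative choices, not the framework's actual protocol:

```python
def jaccard(a, b):
    """Similarity between two symptom sets: |A ∩ B| / |A ∪ B|."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0

def find_helpers(requester, nearby_users, threshold):
    """Return nearby users whose symptom overlap with the requester
    meets the requester-controlled similarity threshold."""
    return [u["id"] for u in nearby_users
            if jaccard(requester["symptoms"], u["symptoms"]) >= threshold]

requester = {"id": "u0", "symptoms": {"chest pain", "dyspnea", "fatigue"}}
nearby = [
    {"id": "u1", "symptoms": {"chest pain", "dyspnea"}},         # similarity 2/3
    {"id": "u2", "symptoms": {"headache", "fever"}},             # similarity 0
    {"id": "u3", "symptoms": {"fatigue", "dyspnea", "nausea"}},  # similarity 2/4
]
print(find_helpers(requester, nearby, threshold=0.6))  # a strict threshold keeps only u1
```

A lower threshold would admit more potential helpers at the cost of weaker matches, which is exactly the privacy-versus-utility dial the framework leaves in the user's hands.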

Theoretical research on personalized health care services, that is, treatment plans designed for a person based on the susceptibility of his or her genomic structure to a disease, also emerged from the literature review. One study [ 161 ] highlighted the potential of powerful analytical tools to open an avenue for predictive, preventive, participatory, and personalized (P4) medicine. They suggested that a more nuanced understanding of human systems is needed to design an accurate computational model for P4 medicine. Reviewing the research paradigms of current person-centered approaches and traditions, another study [ 162 ] advocated a transdisciplinary and complex-systems approach to improve the field. They synthesized the emerging approaches and methodologies and highlighted the gaps between academic research and the accessibility of evaluation, informatics, and big data from health information systems. Another paper [ 163 ] reviewed the availability of big data and the role of biomedical informatics in personalized medicine, emphasizing the ethical concerns related to personalized medicine and health equity. Personalized medicine has the potential to reduce healthcare costs; however, the researchers argue that it can create racial, income, and educational disparities. Certain socioeconomic and demographic groups currently have little or no access to healthcare, and data-driven personalized medicine would exclude those groups, increasing disparities. They also highlighted the impact of EHRs and CDWs on the field of personalized medicine through accelerated research and decreased delivery time of new technologies.

A myriad of other theoretical points has also been identified in the literature. These topics range from exploiting big data to study the paradigm shift in healthcare policy and management from prioritizing volume to prioritizing value [ 164 , 167 ]; to aiding medical device consumers in their decision-making [ 166 ]; improving emergency departments [ 169 ]; performing command surveillance and policy analysis for Army leadership [ 170 ]; comparing different simulation methods (i.e., systems dynamics, discrete event simulation, and agent-based modeling) for specific health care system problems such as resource allocation and length of stay [ 165 ]; and examining the ethical challenges of security, management, and ownership [ 170 ]. Another researcher outlined the challenges the E.U. faces in data mining given numerous historical, technical, legal, and political barriers [ 168 ].

6. Future Research and Challenges

Data mining has been applied in many fields, including finance, marketing, and manufacturing [ 172 ]. Its application in healthcare is becoming increasingly popular [ 173 ]. A growing literature addresses the challenges of data mining, including noisy data, heterogeneity, high dimensionality, dynamic behavior, and computational time. In this section, we focus on future research directions, including personalized care, information loss in preprocessing, collecting healthcare data for research purposes, automation for non-experts, the interdisciplinary nature of the field and domain expert knowledge, integration into the healthcare system, and prediction error in data mining applications in healthcare.

  • Personalized care

The EMR is increasingly used to document demographic and clinical patient information [ 1 ]. EMR data can be utilized to develop personalized care plans, enhancing patient experience [ 162 ] and improving care quality.

  • Loss of information in pre-processing

Pre-processing of data, including handling missing data, is the most time-consuming and costly part of data mining. The most common method in the papers reviewed was deletion (elimination) of records or features with missing data. In one study, approximately 46.5% of the data and 363 of 410 features were eliminated due to missing values [ 49 ]. In another, researchers were only able to use 2064 of 4948 observations (42%) [ 98 ]. By eliminating cases with missing values and outliers, we lose a significant amount of information. Future research should focus on finding better methods of missing value estimation than elimination. Moreover, data collection techniques should be developed or modified to avoid this issue.
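The trade-off between deletion and estimation can be made concrete with a small sketch; the numbers and the mean-imputation choice below are purely illustrative, not drawn from the reviewed studies:

```python
# Toy lab measurements with missing entries recorded as None.
records = [4.2, None, 3.8, 5.1, None, 4.6, None, 4.0]

# Listwise deletion: drop every record with a missing value.
observed = [x for x in records if x is not None]
lost = 1 - len(observed) / len(records)
print(f"deletion keeps {len(observed)}/{len(records)} records; {lost:.1%} of the data is lost")

# Mean imputation: fill missing entries with the observed mean,
# keeping every record at the cost of some bias.
mean = sum(observed) / len(observed)
imputed = [x if x is not None else mean for x in records]
print(f"imputation keeps {len(imputed)}/{len(records)} records")
```

More principled estimators (regression imputation, k-nearest-neighbour imputation, multiple imputation) follow the same pattern: estimate the missing entry and keep the record, rather than discarding it.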

As with missing data, deletion or elimination is a common way to handle outliers [ 174 ]. However, as illustrated in one of the studies we reviewed [ 48 ], outliers can be used to gain information about rare forms of diseases. Instead of neglecting outliers, future research should analyze them to gain insight.
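A minimal way to act on this advice is to flag outliers for analysis rather than dropping them. The sketch below uses the standard interquartile-range rule; the readings and the 1.5 multiplier are illustrative choices, not a prescription from the reviewed studies:

```python
import statistics

def flag_outliers(values, k=1.5):
    """Flag values outside [Q1 - k*IQR, Q3 + k*IQR] for review
    instead of silently deleting them."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    lo, hi = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < lo or v > hi]

# A cluster of typical lab values plus one extreme case that may
# signal a rare condition rather than a data-entry error.
readings = [5.1, 4.9, 5.3, 5.0, 4.8, 5.2, 5.1, 12.7]
print(flag_outliers(readings))  # [12.7] is set aside for analysis, not discarded
```

The flagged values can then be routed to a clinician or a secondary analysis pipeline, preserving exactly the rare-disease signal that blanket deletion throws away.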

  • Collecting healthcare data for research purposes

Traditionally, the primary objective of data collection in healthcare is documentation of patient condition and care planning [ 109 ]. Including research objectives in the data collection process through structured fields could yield more structured data with fewer errors and missing values [ 64 ]. A successful example of data collection for research purposes is the Study of Health in Pomerania (SHIP) [ 175 ]. The objective of SHIP was to identify common diseases, population-level risk factors, and the overall health of people living in the north-east region of Germany. The study suffered only one “mistake” for every 1000 data entries [ 175 ], yielding structured data with high reliability, less noise, and fewer missing values. We can take advantage of current documentation processes (EMR or EHR) by modifying them to collect more reliable and structured data. Long-term vision and planning are required to introduce research purposes into healthcare data collection.
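Building research objectives into routine documentation can be as simple as validating structured fields at entry time, so errors and omissions are caught when the record is created rather than during preprocessing. A sketch, with entirely hypothetical field names and rules:

```python
# Hypothetical required fields and capture-time rules for a structured entry form.
RULES = {
    "patient_id": lambda v: isinstance(v, str) and v != "",
    "systolic_bp": lambda v: isinstance(v, (int, float)) and 50 <= v <= 300,
    "smoker": lambda v: v in {"yes", "no", "former"},
}

def validate(record):
    """Return the fields that are missing or fail their rule."""
    return [field for field, ok in RULES.items()
            if field not in record or not ok(record[field])]

good = {"patient_id": "P-001", "systolic_bp": 128, "smoker": "no"}
bad = {"patient_id": "P-002", "systolic_bp": 1280}  # entry typo; 'smoker' missing
print(validate(good))  # []
print(validate(bad))   # ['systolic_bp', 'smoker']
```

Rejecting or querying a record at the point of entry is what kept SHIP's error rate near one per 1000 entries; the same idea can be retrofitted onto EMR/EHR forms.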

  • Automation of data mining process for non-expert users

The end users of data mining in healthcare are doctors, nurses, and other healthcare professionals with limited training in analytics. One solution to this problem is to develop an automated (i.e., without human supervision) system for end users [ 134 ]. A cloud-based automated structure to prevent medical errors could also be developed [ 95 ]; but the task would be challenging, as it involves different application areas and no single algorithm will achieve similar accuracy across all applications [ 134 ].
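One hedged way to picture such automation is a routine that evaluates several candidate models and returns the best performer on held-out data, with no analyst in the loop. The models and data below are toy stand-ins, not the systems proposed in the cited studies:

```python
import random

def majority_predict(train, x):
    """Baseline candidate: always predict the most common training label."""
    labels = [label for _, label in train]
    return max(set(labels), key=labels.count)

def knn_predict(train, x, k=3):
    """1-D k-nearest-neighbour vote: a stand-in for any learned candidate model."""
    nearest = sorted(train, key=lambda point: abs(point[0] - x))[:k]
    votes = [label for _, label in nearest]
    return max(set(votes), key=votes.count)

def auto_select(data, candidates, holdout=0.3, seed=0):
    """Evaluate every candidate and return the name of the one with the
    best held-out accuracy, with no analyst intervention required."""
    random.seed(seed)
    shuffled = random.sample(data, len(data))
    cut = int(len(data) * (1 - holdout))
    train, test = shuffled[:cut], shuffled[cut:]
    def accuracy(model):
        return sum(model(train, x) == y for x, y in test) / len(test)
    return max(candidates, key=lambda name: accuracy(candidates[name]))

# Toy 1-D dataset in which the label truly depends on the feature,
# so the k-NN candidate should beat the majority baseline.
data = [(x / 10, "high" if x > 50 else "low") for x in range(100)]
candidates = {"majority": majority_predict, "knn": knn_predict}
print(auto_select(data, candidates))
```

This is the skeleton that automated machine-learning tools elaborate; the caveat from [ 134 ] applies directly, since the winning candidate will differ from one application area to the next.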

  • Interdisciplinary nature of study and domain expert knowledge

Healthcare analytics is an interdisciplinary research field [ 134 ]. As a form of analytics, data mining should be used in combination with expert opinion from the specific domain, both healthcare-wide and problem-specific (i.e., an oncologist for a cancer study, a cardiologist for CVD) [ 106 ]. Approximately 32% of the articles in analytics did not utilize expert opinion in any form. Future research should include members from different disciplines, including healthcare.

  • Integration in healthcare system

Very few of the articles reviewed made an effort to integrate the data mining process into the actual decision-making framework. The impact of knowledge discovery through data mining on healthcare professionals' workload and time is unclear. Future studies should consider the integration of the developed system and explore its effect on work environments.

  • Prediction error and “The Black Swan” effect

In healthcare, it is better not to predict than to make an erroneous prediction [ 46 ]. A little under half of the literature we identified in analytics is dedicated to prediction, but none of the articles discussed the consequences of a prediction error. High prediction accuracy for cancer or any other disease does not ensure an accurate application to decision-making.

Moreover, prediction models may be better at predicting commonplace events than rare ones [ 176 ]. Researchers should develop more sophisticated models to address the unpredictable, “The Black Swan” [ 176 ]. One study [ 101 ] addressed a similar issue in evidence-based recommendations for medical prescriptions; their concern was how much evidence should be sufficient to make a recommendation. Many of the studies in this review do not address these salient issues. Future research should address the implementation challenges of predictive models, especially how the decision-making process should adapt in case of errors and unpredictable incidents.
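One concrete way to embody "better not to predict" is a reject option: the system abstains whenever the model's confidence falls below a threshold and defers to a clinician. A minimal sketch, with an arbitrary illustrative threshold and scores:

```python
def predict_with_reject(prob_positive, threshold=0.8):
    """Return a diagnosis label only when the model is confident enough;
    otherwise abstain and defer to a clinician."""
    confidence = max(prob_positive, 1 - prob_positive)
    if confidence < threshold:
        return "abstain"
    return "positive" if prob_positive >= 0.5 else "negative"

# Illustrative probability scores from some upstream classifier.
for p in (0.97, 0.55, 0.10):
    print(p, "->", predict_with_reject(p))
# 0.97 -> positive; 0.55 -> abstain (confidence below 0.8); 0.10 -> negative
```

The threshold trades coverage against error rate: raising it produces more abstentions but fewer wrong calls reaching the decision-maker, which is precisely the adaptation to prediction error the reviewed literature leaves unaddressed.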

7. Conclusions

The development of an informed decision-making framework stems from the growing concern for ensuring a high-value, patient-focused health care system. Concurrently, the availability of big data has created a promising research avenue for academics and practitioners. As highlighted in our review, the increased number of publications in recent years corroborates the importance of health care analytics in building improved health care systems worldwide. The ultimate goal is to facilitate coordinated and well-informed health care systems capable of ensuring maximum patient satisfaction.

This paper adds to the literature on healthcare and data mining ( Table 1 ), as it is, to our knowledge, the first to take a comprehensive review approach and offer a holistic picture of health care analytics and data mining. Our comprehensive and methodologically rigorous approach covers both the application and the theoretical perspectives of analytics and data mining in healthcare. Our systematic approach, starting with the review process and categorizing the output as analytics or theoretical, provides readers with a broad review with reference to specific fields.

We also shed light on some promising areas for future research, including integration of domain-expert knowledge, approaches to decrease prediction error, and integration of predictive models into actual work environments. Future research should recommend ways in which analytic decisions can effectively adapt when predictive models are subject to errors and unpredictable incidents. Notwithstanding these insightful outcomes, we must mention some limitations of our review approach. The sole consideration of academic journals and the exclusion of conference papers, which may have good coverage of this sector, is the prime limitation of this review. In addition, the search span was narrowed to three databases over 12 years, which may have missed some prior work in this area, although the increasing publication trend since 2005 and the small number of publications before 2008 minimize this limitation. The omission of articles published in languages other than English can also restrict the scope of this review, as related papers written in other languages might be evident in the literature. Moreover, we did not conduct forward (reviewing the papers that cited a selected paper) and backward (reviewing the references in a selected paper and the authors' prior works) searches as suggested by Levy and Ellis [ 31 ].

Despite these limitations, the systematic methodology followed in this review can be applied across the full range of healthcare areas.

Supplementary Materials

The following are available online at http://www.mdpi.com/2227-9032/6/2/54/s1 , Table S1: PRISMA checklist, Table S2: Modified checklists and comparison, Table S3: Study characteristics, Table S4: Classification of reviewed papers by analytics type, application area, data type, and data mining techniques.

Author Contributions

The contributions of the authors can be summarized in the following manner. Conceptualization: M.S.I., M.N.-E.-A.; Formal analysis: M.S.I., M.M.H., X.W.; Investigation: M.S.I., M.M.H., X.W.; Methodology: M.S.I.; Project administration: M.S.I., M.N.-E.-A.; Supervision: M.N.-E.-A.; Visualization: M.S.I., X.W.; Writing—draft: M.S.I., M.M.H., H.D.G.; Writing—review and editing: M.S.I., M.M.H., H.D.G., M.N.-E.-A.

Germack is supported by CTSA Grant Number TL1 TR001864 from the National Center for Advancing Translational Science (NCATS), a component of the National Institutes of Health (NIH). The content is solely the responsibility of the authors and does not necessarily represent the official views of this organization.

Conflicts of Interest

The authors declare no conflict of interest.

  • NEWS FEATURE
  • 17 July 2019
  • Correction 19 July 2019

The plan to mine the world’s research papers

  • Priyanka Pulla

Priyanka Pulla is a freelance journalist based in Bengaluru, India.

Carl Malamud in front of the data store of 73 million articles that he plans to let scientists text mine. Credit: Smita Sharma for Nature

Carl Malamud is on a crusade to liberate information locked up behind paywalls — and his campaigns have scored many victories. He has spent decades publishing copyrighted legal documents, from building codes to court records, and then arguing that such texts represent public-domain law that ought to be available to any citizen online. Sometimes, he has won those arguments in court. Now, the 60-year-old American technologist is turning his sights on a new objective: freeing paywalled scientific literature. And he thinks he has a legal way to do it.

Nature 571 , 316-318 (2019)

doi: https://doi.org/10.1038/d41586-019-02142-1

Updates & Corrections

Correction 19 July 2019 : An earlier version of this feature used the term ‘fair use’ inappropriately — the term isn’t relevant under Indian law.


