Data Mining: Recently Published Documents


Distance Based Pattern Driven Mining for Outlier Detection in High Dimensional Big Dataset

Detection of outliers or anomalies is one of the vital issues in pattern-driven data mining. Outlier detection identifies the inconsistent behavior of individual objects. It is an important area of the data mining field, with several applications such as detecting credit card fraud, uncovering hacking, and discovering criminal activities. Tools are needed to uncover the critical information hidden in extensive data. This paper investigates a novel method for detecting cluster outliers in a multidimensional dataset, capable of identifying both the clusters and the outliers in datasets containing noise. The proposed method can detect the groups and outliers left by the clustering process, such as irregular sets of clusters (C) and outliers (O), to boost the results. Applying the algorithm to the dataset improved the results in terms of several parameters. For the comparative analysis, average accuracy and average recall were computed. The average accuracy of the existing COID algorithm is 74.05%, compared with 77.21% for the proposed algorithm; average recall is 81.19% and 89.51% for the existing and proposed algorithms, respectively, showing that the proposed method outperforms the existing COID algorithm.
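
The paper's COID specifics are not reproduced here, but the general distance-based idea — scoring each object by its distance to its nearest neighbors — can be sketched in a few lines (a minimal illustration with made-up points, not the authors' algorithm):

```python
import math

def knn_outlier_scores(points, k=2):
    """Score each point by its mean distance to its k nearest neighbors.
    Higher scores suggest likelier outliers (a generic distance-based
    sketch, not the paper's COID algorithm)."""
    scores = []
    for i, p in enumerate(points):
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if i != j)
        scores.append(sum(dists[:k]) / k)
    return scores

# Four tightly grouped points plus one isolated point (illustrative data).
points = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10)]
scores = knn_outlier_scores(points)
outlier = points[max(range(len(points)), key=scores.__getitem__)]  # (10, 10)
```

Real methods of this family add a cutoff (e.g., a score threshold or top-n selection) to turn the scores into an outlier set O.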

Implementation of Data Mining Technology in Bonded Warehouse Inbound and Outbound Goods Trade

For taxed goods, the actual freight is generally determined by multiplying the allocated freight per kilogram by the actual outgoing weight recorded against the order number on the outgoing bill. Because conventional logistics cannot keep pace with the rapid response that e-commerce orders demand, this work discusses the implementation of data-mining technology in bonded-warehouse inbound and outbound goods trade. Specifically, a bonded-warehouse decision-making system was developed, comprising a data warehouse, a conceptual model, an online analytical processing system, a human-computer interaction module, and a web data-sharing platform. The statistical query module can be used to perform statistics and queries on warehousing operations. After optimization of the entire warehousing business process, obtaining the actual freight takes only 19.1 hours, nearly one third less than before optimization. This study could create a better environment for the development of China's processing trade.
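
The freight rule described above is a simple product; a minimal sketch (the per-kilogram rate and weight are illustrative values, not figures from the paper):

```python
def actual_freight(allocated_freight_per_kg, outgoing_weight_kg):
    """Actual freight for taxed goods: the allocated freight per kg
    multiplied by the actual outgoing weight on the outgoing bill."""
    return allocated_freight_per_kg * outgoing_weight_kg

freight = actual_freight(2.5, 120)  # 2.5 per kg * 120 kg = 300.0
```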

Multi-objective economic load dispatch method based on data mining technology for large coal-fired power plants

User activity classification and domain-wise ranking through social interactions.

Twitter has gained significant prevalence among users across numerous domains, in the majority of countries, and among different age groups. It serves as a real-time micro-blogging service for communication and opinion sharing. Twitter shares its data for research and study purposes through open APIs, which makes it the most suitable source of data for social media analytics. Applying data mining and machine learning techniques to tweets is gaining increasing interest. The most prominent enigma in social media analytics is automatically identifying and ranking influencers. This research aims to detect users' topics of interest in social media and rank them by specific topics, domains, etc. A few hybrid parameters are also distinguished in this research, based on a post's content and metadata and a user's profile and network features, to capture different aspects of being influential for use in the ranking algorithm. The results show that the proposed approach is effective in both the classification and the ranking of individuals in a cluster.

A data mining analysis of COVID-19 cases in states of United States of America

Epidemic diseases can be extremely dangerous because of their hazardous influence. They may have negative effects on economies, businesses, the environment, humans, and the workforce. In this paper, some of the factors interrelated with the COVID-19 pandemic are examined using data-mining methodologies and approaches. The analysis uncovered rules and insights, and the performance of the data-mining algorithms was evaluated. According to the results, the JRip algorithm achieved the highest correct classification rate and the lowest root mean squared error (RMSE). Considering both the classification rate and the RMSE measure, JRip can be considered an effective method for understanding the factors related to coronavirus-caused deaths.
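
The two evaluation measures named above are straightforward to compute; a minimal sketch with made-up labels (JRip itself is a Weka rule learner and is not reimplemented here):

```python
def classification_rate(y_true, y_pred):
    """Fraction of correctly classified instances."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def rmse(y_true, y_pred):
    """Root mean squared error between true and predicted values."""
    n = len(y_true)
    return (sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / n) ** 0.5

# Illustrative binary labels: one misclassification out of four.
y_true = [1, 0, 1, 1]
y_pred = [1, 0, 0, 1]
rate = classification_rate(y_true, y_pred)  # 0.75
error = rmse(y_true, y_pred)                # 0.5
```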

Exploring distributed energy generation for sustainable development: A data mining approach

A comprehensive guideline for bengali sentiment annotation.

Sentiment Analysis (SA) is a Natural Language Processing (NLP) and Information Extraction (IE) task that primarily aims to obtain a writer's feelings, expressed positively or negatively, by analyzing a large number of documents. SA is also widely studied in the fields of data mining, web mining, text mining, and information retrieval. The fundamental task in sentiment analysis is to classify the polarity of given content as positive, negative, or neutral. Although extensive research has been conducted in this area of computational linguistics, most of it has been carried out in the context of the English language. Bengali sentiment expression, however, has varying degrees of sentiment labels, which can be plausibly distinct from English. Therefore, sentiment assessment for the Bengali language is undeniably important to develop and execute properly. In sentiment analysis, the predictive potential of an automatic model depends entirely on the quality of dataset annotation. Bengali sentiment annotation is a challenging task owing to the language's diversified structures (syntax) and its different degrees of innate sentiment (i.e., weakly and strongly positive/negative sentiments). Thus, in this article, we propose a novel and precise guideline for researchers, linguistic experts, and referees to annotate Bengali sentences immaculately, with a view to building effective datasets for automatic sentiment prediction.

Capturing Dynamics of Information Diffusion in SNS: A Survey of Methodology and Techniques

Studying information diffusion in SNS (Social Networking Services) has remarkable significance in both academia and industry. Theoretically, it boosts the development of other subjects such as statistics, sociology, and data mining. Practically, diffusion modeling provides fundamental support for many downstream applications (e.g., public opinion monitoring, rumor source identification, and viral marketing). Tremendous efforts have been devoted to this area to understand and quantify information diffusion dynamics. This survey investigates and summarizes the emerging distinguished works in diffusion modeling. We first put forward a unified information diffusion concept in terms of three components: information, user decision, and social vectors, followed by a detailed introduction of the methodologies for diffusion modeling. We then propose a new taxonomy adopting a hybrid philosophy (i.e., granularity and techniques) and present a series of comparative studies on elementary diffusion models under this taxonomy, covering their assumptions, methods, and pros and cons. We further summarize representative diffusion modeling in special scenarios and significant downstream tasks based on these elementary models. Finally, we discuss open issues in this field following the methodology of diffusion modeling.
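
One classic elementary diffusion model of the kind such surveys compare is the independent cascade model, in which each newly activated user gets a single chance to activate each neighbor. A minimal simulation (the toy follower graph is illustrative, not from the survey):

```python
import random

def independent_cascade(graph, seeds, p=0.3, seed=0):
    """One run of the independent cascade model: each newly activated
    user gets one chance to activate each neighbor with probability p."""
    rng = random.Random(seed)
    active = set(seeds)
    frontier = list(seeds)
    while frontier:
        newly_active = []
        for u in frontier:
            for v in graph.get(u, []):
                if v not in active and rng.random() < p:
                    active.add(v)
                    newly_active.append(v)
        frontier = newly_active
    return active

# Toy follower graph; with p = 1.0 every reachable user activates.
graph = {"a": ["b", "c"], "b": ["d"], "c": ["d"], "d": []}
spread = independent_cascade(graph, ["a"], p=1.0)
```

Running many such simulations and averaging the spread is the usual way to estimate a seed set's influence, e.g., for viral marketing.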

The Influence of E-book Teaching on the Motivation and Effectiveness of Learning Law by Using Data Mining Analysis

This paper studies the motivation for learning law, compares the teaching effectiveness of two different methods, e-book teaching and traditional teaching, and analyses the influence of e-book teaching on the effectiveness of learning law by using big-data analysis. From the perspective of law students' psychology, e-book teaching can attract students' attention, stimulate their interest in learning, deepen knowledge impressions while learning, expand knowledge, and ultimately improve performance on practical assessments. Given the small sample size, the representativeness of the results may be limited. The study has particular referential significance for stimulating the motivation to learn law, as well as other theoretical disciplines in colleges and universities, and provides ideas for the reform of teaching modes. This paper uses a decision tree algorithm from data mining for the analysis and identifies, from the students' perspective, the factors influencing law students' learning motivation and effectiveness.

Intelligent Data Mining based Method for Efficient English Teaching and Cultural Analysis

The emergence of online education has greatly improved traditional English teaching quality. However, it only moves the teaching process from offline to online and does not really change the essence of traditional English teaching. In this work, we study an intelligent English teaching method to further improve the quality of English teaching. Specifically, a random forest is first used to analyze and mine the grammatical and syntactic features of English text. Then, a decision tree based method is proposed to predict the grammar or syntax issues of the English text. The evaluation results indicate that the proposed method can effectively improve the accuracy of English grammar and syntax recognition.


50 selected papers in Data Mining and Machine Learning

Here is the list of 50 selected papers in Data Mining and Machine Learning. You can download them for your detailed reading and research. Enjoy!

Data Mining and Statistics: What’s the Connection?

Data Mining: Statistics and More?, D. Hand, American Statistician, 52(2):112-118.

Data Mining , G. Weiss and B. Davison, in Handbook of Technology Management, John Wiley and Sons, expected 2010.

From Data Mining to Knowledge Discovery in Databases, U. Fayyad, G. Piatetsky-Shapiro & P. Smyth, AI Magazine, 17(3):37-54, Fall 1996.

Mining Business Databases , Communications of the ACM, 39(11): 42-48.

10 Challenging Problems in Data Mining Research, Q. Yang and X. Wu, International Journal of Information Technology & Decision Making, Vol. 5, No. 4, 2006, 597-604.

The Long Tail , by Anderson, C., Wired magazine.

AOL’s Disturbing Glimpse Into Users’ Lives , by McCullagh, D., News.com, August 9, 2006

General Data Mining Methods and Algorithms

Top 10 Algorithms in Data Mining, X. Wu, V. Kumar, J.R. Quinlan, J. Ghosh, Q. Yang, H. Motoda, G.J. McLachlan, A. Ng, B. Liu, P.S. Yu, Z.-H. Zhou, M. Steinbach, D.J. Hand, D. Steinberg, Knowledge and Information Systems, 14:1-37, 2008.

Induction of Decision Trees , R. Quinlan, Machine Learning, 1(1):81-106, 1986.

Web and Link Mining

The Pagerank Citation Ranking: Bringing Order to the Web , L. Page, S. Brin, R. Motwani, T. Winograd, Technical Report, Stanford University, 1999.

The Structure and Function of Complex Networks , M. E. J. Newman, SIAM Review, 2003, 45, 167-256.

Link Mining: A New Data Mining Challenge , L. Getoor, SIGKDD Explorations, 2003, 5(1), 84-89.

Link Mining: A Survey , L. Getoor, SIGKDD Explorations, 2005, 7(2), 3-12.

Semi-supervised Learning

Semi-Supervised Learning Literature Survey , X. Zhu, Computer Sciences TR 1530, University of Wisconsin — Madison.

Introduction to Semi-Supervised Learning, in Semi-Supervised Learning (Chapter 1) O. Chapelle, B. Scholkopf, A. Zien (eds.), MIT Press, 2006. (Fordham’s library has online access to the entire text)

Learning with Labeled and Unlabeled Data , M. Seeger, University of Edinburgh (unpublished), 2002.

Person Identification in Webcam Images: An Application of Semi-Supervised Learning, M. Balcan, A. Blum, P. Choi, J. Lafferty, B. Pantano, M. Rwebangira, X. Zhu, Proceedings of the 22nd ICML Workshop on Learning with Partially Classified Training Data, 2005.

Learning from Labeled and Unlabeled Data: An Empirical Study across Techniques and Domains , N. Chawla, G. Karakoulas, Journal of Artificial Intelligence Research , 23:331-366, 2005.

Text Classification from Labeled and Unlabeled Documents using EM , K. Nigam, A. McCallum, S. Thrun, T. Mitchell, Machine Learning , 39, 103-134, 2000.

Self-taught Learning: Transfer Learning from Unlabeled Data , R. Raina, A. Battle, H. Lee, B. Packer, A. Ng, in Proceedings of the 24th International Conference on Machine Learning , 2007.

An iterative algorithm for extending learners to a semisupervised setting , M. Culp, G. Michailidis, 2007 Joint Statistical Meetings (JSM), 2007

Partially-Supervised Learning / Learning with Uncertain Class Labels

Get Another Label? Improving Data Quality and Data Mining Using Multiple, Noisy Labelers , V. Sheng, F. Provost, P. Ipeirotis, in Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining , 2008.

Logistic Regression for Partial Labels , in 9th International Conference on Information Processing and Management of Uncertainty in Knowledge-Based Systems , Volume III, pp. 1935-1941, 2002.

Classification with Partial labels , N. Nguyen, R. Caruana, in Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining , 2008.

Imprecise and Uncertain Labelling: A Solution based on Mixture Model and Belief Functions, E. Come, 2008 (powerpoint slides).

Induction of Decision Trees from Partially Classified Data Using Belief Functions, M. Bjanger, Norwegian University of Science and Technology, 2000.

Knowledge Discovery in Large Image Databases: Dealing with Uncertainties in Ground Truth , P. Smyth, M. Burl, U. Fayyad, P. Perona, KDD Workshop 1994, AAAI Technical Report WS-94-03, pp. 109-120, 1994.

Recommender Systems

Trust No One: Evaluating Trust-based Filtering for Recommenders , J. O’Donovan and B. Smyth, In Proceedings of the 19th International Joint Conference on Artificial Intelligence (IJCAI-05), 2005, 1663-1665.

Trust in Recommender Systems, J. O'Donovan and B. Smyth, In Proceedings of the 10th International Conference on Intelligent User Interfaces (IUI-05), 2005, 167-174.

Learning from Imbalanced Data

General resources available on this topic:

ICML 2003 Workshop: Learning from Imbalanced Data Sets II

AAAI ‘2000 Workshop on Learning from Imbalanced Data Sets

A Study of the Behavior of Several Methods for Balancing Machine Learning Training Data , G. Batista, R. Prati, and M. Monard, SIGKDD Explorations , 6(1):20-29, 2004.

Class Imbalance versus Small Disjuncts , T. Jo and N. Japkowicz, SIGKDD Explorations , 6(1): 40-49, 2004.

Extreme Re-balancing for SVMs: a Case Study , B. Raskutti and A. Kowalczyk, SIGKDD Explorations , 6(1):60-69, 2004.

A Multiple Resampling Method for Learning from Imbalanced Data Sets , A. Estabrooks, T. Jo, and N. Japkowicz, in Computational Intelligence , 20(1), 2004.

SMOTE: Synthetic Minority Over-sampling Technique, N. Chawla, K. Bowyer, L. Hall, and W. Kegelmeyer, Journal of Artificial Intelligence Research, 16:321-357, 2002.

Generative Oversampling for Mining Imbalanced Datasets, A. Liu, J. Ghosh, and C. Martin, Third International Conference on Data Mining (DMIN-07), 66-72.

Learning from Little: Comparison of Classifiers Given Little Training, G. Forman and I. Cohen, in 8th European Conference on Principles and Practice of Knowledge Discovery in Databases, 161-172, 2004.

Issues in Mining Imbalanced Data Sets – A Review Paper, S. Visa and A. Ralescu, in Proceedings of the Sixteenth Midwest Artificial Intelligence and Cognitive Science Conference, pp. 67-73, 2005.

Wrapper-based Computation and Evaluation of Sampling Methods for Imbalanced Datasets , N. Chawla, L. Hall, and A. Joshi, in Proceedings of the 1st International Workshop on Utility-based Data Mining , 24-33, 2005.

C4.5, Class Imbalance, and Cost Sensitivity: Why Under-Sampling beats Over-Sampling, C. Drummond and R. Holte, in ICML Workshop on Learning from Imbalanced Datasets II, 2003.

C4.5 and Imbalanced Data sets: Investigating the effect of sampling method, probabilistic estimate, and decision tree structure , N. Chawla, in ICML Workshop on Learning from Imbalanced Datasets II , 2003.

Class Imbalances: Are we Focusing on the Right Issue?, N. Japkowicz, in ICML Workshop on Learning from Imbalanced Datasets II , 2003.

Learning when Data Sets are Imbalanced and When Costs are Unequal and Unknown , M. Maloof, in ICML Workshop on Learning from Imbalanced Datasets II , 2003.

Uncertainty Sampling Methods for One-class Classifiers , P. Juszcak and R. Duin, in ICML Workshop on Learning from Imbalanced Datasets II , 2003.

Active Learning

Improving Generalization with Active Learning, D. Cohn, L. Atlas, and R. Ladner, Machine Learning, 15(2):201-221, May 1994.

On Active Learning for Data Acquisition , Z. Zheng and B. Padmanabhan, In Proc. of IEEE Intl. Conf. on Data Mining, 2002.

Active Sampling for Class Probability Estimation and Ranking , M. Saar-Tsechansky and F. Provost, Machine Learning 54:2 2004, 153-178.

The Learning-Curve Sampling Method Applied to Model-Based Clustering , C. Meek, B. Thiesson, and D. Heckerman, Journal of Machine Learning Research 2:397-418, 2002.

Active Sampling for Feature Selection , S. Veeramachaneni and P. Avesani, Third IEEE Conference on Data Mining, 2003.

Heterogeneous Uncertainty Sampling for Supervised Learning , D. Lewis and J. Catlett, In Proceedings of the 11th International Conference on Machine Learning, 148-156, 1994.

Learning When Training Data are Costly: The Effect of Class Distribution on Tree Induction , G. Weiss and F. Provost, Journal of Artificial Intelligence Research, 19:315-354, 2003.

Active Learning using Adaptive Resampling , KDD 2000, 91-98.

Cost-Sensitive Learning

Types of Cost in Inductive Concept Learning , P. Turney, In Proceedings Workshop on Cost-Sensitive Learning at the Seventeenth International Conference on Machine Learning.

Toward Scalable Learning with Non-Uniform Class and Cost Distributions: A Case Study in Credit Card Fraud Detection , P. Chan and S. Stolfo, KDD 1998.


  • Open access
  • Published: 11 August 2021

Data mining in clinical big data: the frequently used databases, steps, and methodological models

  • Wen-Tao Wu,
  • Yuan-Jie Li,
  • Ao-Zi Feng,
  • Tao Huang,
  • An-Ding Xu &
  • Jun Lyu (ORCID: orcid.org/0000-0002-2237-8771)

Military Medical Research, volume 8, Article number: 44 (2021)

43k Accesses, 218 Citations, 2 Altmetric

Many high-quality studies have emerged from public databases, such as the Surveillance, Epidemiology, and End Results (SEER) program, the National Health and Nutrition Examination Survey (NHANES), The Cancer Genome Atlas (TCGA), and the Medical Information Mart for Intensive Care (MIMIC); however, these data are often characterized by a high degree of dimensional heterogeneity, timeliness, scarcity, and irregularity, with the result that their value is not fully exploited. Data-mining technology has become a frontier field in medical research, as it shows excellent performance in evaluating patient risk and assisting clinical decision-making when building disease-prediction models. Data mining therefore has unique advantages in clinical big-data research, especially in large-scale medical public databases. This article introduces the main public medical databases and describes the steps, tasks, and models of data mining in simple language. Additionally, we describe data-mining methods along with their practical applications. The goal of this work is to help clinical researchers gain a clear and intuitive understanding of the application of data-mining technology to clinical big data, in order to promote the production of research results that benefit doctors and patients.

With the rapid development of computer software/hardware and internet technology, the amount of data has increased at an amazing speed. "Big data" as an abstract concept currently affects all walks of life [1], and although its importance has been recognized, its definition varies slightly from field to field. In computer science, big data refers to a dataset that cannot be perceived, acquired, managed, processed, or served within a tolerable time using traditional IT and software/hardware tools. More generally, big data refers to a dataset that exceeds the scope of the simple databases and data-processing architectures used in the early days of computing; it is characterized by high-volume, high-dimensional, rapidly updated data and represents a phenomenon that has emerged in the digital age. Across the medical industry, various types of medical data are generated at high speed, and trends indicate that applying big data in the medical field helps improve the quality of medical care and optimizes medical processes and management strategies [2, 3]. Currently, this trend is shifting from civilian medicine to military medicine. For example, the United States is exploring the potential of using one of its largest healthcare systems (the Military Healthcare System) to provide healthcare to eligible veterans, which could benefit more than 9 million eligible personnel [4]. Another data-management system has been developed to assess the physical and mental health of active-duty personnel and is expected to yield significant economic benefits to the military medical system [5]. However, in medical research, the wide variety of clinical data and the differences between medical concepts across classification standards result in a high degree of dimensional heterogeneity, timeliness, scarcity, and irregularity in existing clinical data [6, 7]. Furthermore, new data-analysis techniques have yet to be popularized in medical research [8]. These factors hinder the full realization of the value of existing data, and the intensive exploration of the value of clinical data remains a challenging problem.

Computer scientists have made outstanding contributions to the application of big data and introduced the concept of data mining to solve difficulties associated with such applications. Data mining (also known as knowledge discovery in databases) refers to the process of extracting potentially useful information and knowledge hidden in large amounts of incomplete, noisy, fuzzy, and random practical application data [9]. Unlike traditional research methods, several data-mining technologies mine information to discover knowledge without clear prior assumptions (i.e., they are applied directly, without a prior research design). The information obtained should be previously unknown, valid, and practical [9]. Data-mining technology does not aim to replace traditional statistical analysis techniques, but rather to extend and expand statistical analysis methodologies. From a practical point of view, machine learning (ML) is the main analytical method in data mining: models are trained on data and then used to predict outcomes. Given the rapid progress of data-mining technology and its excellent performance in other industries and fields, it has introduced new opportunities and prospects to clinical big-data research [10]. Large amounts of high-quality medical data are available to researchers in the form of public databases, which enable more researchers to participate in medical data mining in the hope that the generated results can further guide clinical practice.

This article provides an overview for medical researchers interested in applying data mining to clinical big data. To allow a clearer understanding of this application, the second part of this paper introduces the concept of public databases and summarizes those commonly used in medical research. The third part offers an overview of data mining, introducing the relevant models, tasks, and processes, and summarizes specific data-mining methods. The fourth and fifth parts introduce data-mining algorithms commonly used in clinical practice, along with specific cases, to help clinical researchers clearly and intuitively understand the application of data-mining technology to clinical big data. Finally, we discuss the advantages and disadvantages of data mining in clinical analysis and offer insight into possible future applications.

Overview of common public medical databases

A public database is a data repository dedicated to housing scientific research data on an open platform. Such databases collect and store heterogeneous, multi-dimensional health and medical research data in a structured form, and are characterized by massive scale, multiple ownership, complexity, and security requirements. These databases cover a wide range of data, including those related to cancer research, disease burden, nutrition and health, and genetics and the environment. Table 1 summarizes the main public medical databases [11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26]. Researchers can apply for access to data based on the scope of the database and the required application procedures in order to perform relevant medical research.

Data mining: an overview

Data mining is a multidisciplinary field at the intersection of database technology, statistics, ML, and pattern recognition that profits from all these disciplines [ 27 ]. Although this approach is not yet widespread in the field of medical research, several studies have demonstrated the promise of data mining in building disease-prediction models, assessing patient risk, and helping physicians make clinical decisions [ 28 , 29 , 30 , 31 ].

Data-mining models

Data mining has two kinds of models: descriptive and predictive. Predictive models are used to predict unknown or future values of variables of interest, whereas descriptive models are used to find human-interpretable patterns that describe the data [32].

Data-mining tasks

A model is usually implemented through a task. The goal of a descriptive task is to generalize the patterns of potential associations in the data; using a descriptive model therefore usually yields a few collections with the same or similar attributes. Prediction mainly refers to estimating the value of a specific attribute based on the values of other attributes, and includes classification and regression [33].

Data-mining methods

After the data-mining model and task are defined, the data-mining methods required to build the approach are chosen based on the discipline involved. The choice of method depends on whether dependent variables (labels) are present in the analysis. Predictions with dependent variables (labels) are generated through supervised learning, which can be performed using linear regression, generalized linear regression, a proportional hazards model (the Cox regression model), a competing risks model, decision trees, the random forest (RF) algorithm, and support vector machines (SVMs). In contrast, unsupervised learning involves no labels: the learning model infers some internal structure from the data. Common unsupervised learning methods include principal component analysis (PCA), association analysis, and clustering analysis.
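
The distinction can be made concrete with two toy examples — a least-squares fit where labels are available, and a two-group clustering step where they are not (illustrative data, plain Python rather than any particular library):

```python
# Supervised: labels ys are available; fit y = a*x + b by least squares.
xs = [1, 2, 3, 4]
ys = [2.1, 3.9, 6.2, 7.8]
n = len(xs)
mx, my = sum(xs) / n, sum(ys) / n
a = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
b = my - a * mx  # a is about 1.94, b about 0.15

# Unsupervised: no labels; assign unlabeled points to the nearer of two
# centers (a single k-means-style step from fixed initial centers).
data = [0.1, 0.2, 0.15, 5.0, 5.2]
c1, c2 = min(data), max(data)
clusters = [0 if abs(v - c1) <= abs(v - c2) else 1 for v in data]
```

The first block learns a mapping from inputs to a known outcome; the second merely uncovers structure (two groups) with no outcome variable at all.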

Data-mining algorithms for clinical big data

Data mining based on clinical big data can produce effective and valuable knowledge, which is essential for accurate clinical decision-making and risk assessment [ 34 ]. Data-mining algorithms enable realization of these goals.

Supervised learning

A concept often mentioned in supervised learning is the partitioning of datasets. To prevent overfitting, a dataset can generally be divided into two or three parts: a training set, a validation set, and a test set. Ripley [35] defined these parts as a set of examples used for learning and fitting the parameters (i.e., weights) of the classifier; a set of examples used to tune the hyperparameters (i.e., architecture) of the classifier; and a set of examples used only to assess the generalization performance of the fully specified classifier, respectively. Briefly, the training set is used to train the model or determine its parameters, the validation set is used for model selection, and the test set is used to verify model performance. In practice, data are often divided into only training and test sets, with no separate validation set. It should be emphasized that good results on the test set do not guarantee model correctness; they only show that the model obtains similar results on similar data. Therefore, the applicability of a model should be analysed in combination with the specific problems of the research. Classical statistical methods, such as linear regression, generalized linear regression, and proportional hazards models, have been widely used in medical research. Notably, most of these classical statistical methods impose certain data requirements or assumptions; however, when faced with complicated clinical data, assumptions about the data distribution are difficult to make. In contrast, some ML methods (algorithmic models) make no assumptions about the data and cross-validate their results; thus, they are likely to be favoured by clinical researchers [36]. For these reasons, this chapter focuses on ML methods that do not require assumptions about the data distribution, together with classical statistical methods used in specific situations.
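
The three-way partition described above can be sketched as a shuffle followed by slicing (the 60/20/20 proportions are illustrative defaults, not a prescription):

```python
import random

def train_val_test_split(records, val_frac=0.2, test_frac=0.2, seed=0):
    """Shuffle a dataset and partition it into training, validation,
    and test sets, as described above."""
    rng = random.Random(seed)
    shuffled = records[:]
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_frac)
    n_val = int(len(shuffled) * val_frac)
    test = shuffled[:n_test]
    val = shuffled[n_test:n_test + n_val]
    train = shuffled[n_test + n_val:]
    return train, val, test

# 100 illustrative patient record IDs -> 60 train, 20 validation, 20 test.
train, val, test = train_val_test_split(list(range(100)))
```

Setting `val_frac=0` reproduces the simpler two-way split that, as noted, is more common in practice.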

Decision tree

A decision tree is a basic classification and regression method that generates a result resembling the tree structure of a flowchart, where each internal node represents a test on an attribute, each branch represents an outcome of that test, each leaf node (decision node) represents a class or class distribution, and the topmost node is the root [ 37 ]. The decision tree model is called a classification tree when used for classification and a regression tree when used for regression. Studies have demonstrated the utility of the decision tree model in clinical applications. In a study on the prognosis of breast cancer patients, a decision tree model and a classical logistic regression model were compared, with the predictive performance of the two models indicating that the decision tree showed stronger predictive power on real clinical data [ 38 ]. Similarly, the decision tree model has been applied to other areas of clinical medicine, including the diagnosis of kidney stones [ 39 ], prediction of the risk of sudden cardiac arrest [ 40 ], and exploration of the risk factors of type II diabetes [ 41 ]. A common feature of these studies is the use of a decision tree model to explore the interactions between variables and classify subjects into homogeneous categories based on their observed characteristics. In fact, because the decision tree accounts for strong interactions between variables, it is well suited to decision algorithms that follow the same structure [ 42 ]. In the construction of clinical prediction models and the exploration of disease risk factors and patient prognosis, the decision tree model might offer more advantages and practical value than some classical algorithms. Despite these advantages, the decision tree recursively separates observations into branches to construct the tree; therefore, when the data are imbalanced, the precision of decision tree models needs improvement.
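As an illustration of how a tree node "tests an attribute", the sketch below finds the single threshold on one numeric attribute that minimizes Gini impurity, a common splitting criterion in CART-style trees. The tumour-size data are hypothetical:

```python
def gini(labels):
    """Gini impurity of a set of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    counts = {}
    for y in labels:
        counts[y] = counts.get(y, 0) + 1
    return 1.0 - sum((c / n) ** 2 for c in counts.values())

def best_split(xs, ys):
    """Exhaustively choose the threshold that minimizes the weighted
    Gini impurity of the two resulting branches."""
    best = (None, float("inf"))
    for t in sorted(set(xs)):
        left = [y for x, y in zip(xs, ys) if x <= t]
        right = [y for x, y in zip(xs, ys) if x > t]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(ys)
        if score < best[1]:
            best = (t, score)
    return best

# Tumour size (cm) vs. a binary outcome -- hypothetical toy data.
sizes = [1.0, 1.5, 2.0, 3.5, 4.0, 5.0]
labels = [0, 0, 0, 1, 1, 1]
threshold, impurity = best_split(sizes, labels)
```

A full tree-growing algorithm would apply this search recursively to each branch until a stopping rule is met.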

The RF method

The RF algorithm was developed as an application of ensemble learning based on a collection of decision trees. The bootstrap method [ 43 ] is used to randomly draw sample sets from the training set; the decision trees generated from these bootstrap samples constitute a “random forest”, and predictions are derived from an ensemble average or majority vote. The biggest advantage of the RF method is that random sampling of predictor variables at each decision tree node decreases the correlation among the trees in the forest, thereby improving the precision of the ensemble predictions [ 44 ]. Given that a single decision tree model might encounter the problem of overfitting [ 45 ], the initial application of RF minimized overfitting in classification and regression and improved predictive accuracy [ 44 ]. Taylor et al. [ 46 ] highlighted the potential of RF in correctly differentiating in-hospital mortality in patients experiencing sepsis after admission to the emergency department. Nowhere in the healthcare system is the need to find methods that reduce uncertainty more pressing than in the fast, chaotic environment of the emergency department. The authors demonstrated that the predictive performance of the RF method was superior to that of traditional emergency medicine methods, and that the method enabled evaluation of more clinical variables than traditional modelling, which subsequently allowed the discovery of clinical variables that were not expected to be of predictive value or that would otherwise have been omitted as rare predictors [ 46 ]. Another study based on the Medical Information Mart for Intensive Care (MIMIC) II database [ 47 ] found that RF had excellent predictive power regarding intensive care unit (ICU) mortality [ 48 ]. These studies showed that applying RF to big data stored in hospital healthcare systems provides a new data-driven method for predictive analysis in critical care.
Additionally, random survival forests have recently been developed to analyse survival data, especially right-censored survival data [ 49 , 50 ], which can help researchers conduct survival analyses in clinical oncology and help develop personalized treatment regimens that benefit patients [ 51 ].
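The two ingredients described above, bootstrap resampling and majority voting, can be sketched as follows. The one-level "stump" learner and the toy lab-value data are simplifications standing in for full decision trees and real clinical variables:

```python
import random

def train_stump(sample):
    """One-level 'tree': pick the (feature, threshold) pair with the
    fewest errors on the bootstrap sample. Stands in for a full tree."""
    best = None
    for f in range(len(sample[0][0])):
        for t in sorted({x[f] for x, _ in sample}):
            err = sum(int(x[f] > t) != y for x, y in sample)
            if best is None or err < best[0]:
                best = (err, f, t)
    _, f, t = best
    return lambda x: int(x[f] > t)

def random_forest(data, n_trees=25, seed=0):
    """Bootstrap-resample the training set, fit one tree per sample,
    and predict by majority vote over the ensemble."""
    rng = random.Random(seed)
    trees = []
    for _ in range(n_trees):
        sample = [rng.choice(data) for _ in data]  # draw n with replacement
        trees.append(train_stump(sample))
    def predict(x):
        votes = sum(tree(x) for tree in trees)
        return int(votes * 2 >= len(trees))
    return predict

# Hypothetical toy data: two lab values, binary outcome.
data = [((1.0, 0.2), 0), ((1.2, 0.3), 0), ((0.9, 0.4), 0),
        ((2.5, 1.1), 1), ((2.8, 1.3), 1), ((3.0, 0.9), 1)]
model = random_forest(data)
```

A real RF additionally samples a random subset of predictor variables at each node, which is what decorrelates the trees.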

The SVM method

The SVM is a classification and prediction method developed by Cortes and Vapnik that represents a data-driven approach requiring no assumptions about data distribution [ 52 ]. The core purpose of an SVM is to identify a separation boundary (called a hyperplane) to help classify cases; thus, the advantages of SVMs are most obvious when classifying and predicting cases based on high dimensional data or data with a small sample size [ 53 , 54 ].

In a study of drug compliance in patients with heart failure, researchers used an SVM to build a predictive model for patient compliance in order to overcome the problem of a large number of input variables relative to the number of available observations [ 55 ]. Additionally, the mechanisms of certain chronic and complex diseases observed in clinical practice remain unclear, and many risk factors, including gene–gene and gene–environment interactions, must be considered in research on such diseases [ 55 , 56 ]. SVMs are capable of addressing these issues. Yu et al. [ 54 ] applied an SVM to predict diabetes onset based on data from the National Health and Nutrition Examination Survey (NHANES). Such models also have strong discrimination ability, making SVMs a promising classification approach for detecting individuals with chronic and complex diseases. A disadvantage of SVMs, however, is that when the number of observations is large, training becomes time- and resource-intensive.
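A minimal sketch of the idea behind a linear SVM, fitted here by subgradient descent on the regularized hinge loss (one of several ways to train such a model, not the method used in the studies above). The biomarker data, learning rate, and regularization strength are illustrative assumptions:

```python
def train_linear_svm(data, lam=0.01, epochs=200, lr=0.1):
    """Search for a separating hyperplane w.x + b = 0 with a wide margin
    by taking subgradient steps on the regularized hinge loss."""
    dim = len(data[0][0])
    w = [0.0] * dim
    b = 0.0
    for _ in range(epochs):
        for x, y in data:                      # y must be +1 or -1
            margin = y * (sum(wi * xi for wi, xi in zip(w, x)) + b)
            if margin < 1:                     # inside margin: hinge gradient
                w = [wi + lr * (y * xi - lam * wi) for wi, xi in zip(w, x)]
                b += lr * y
            else:                              # correct side: only shrink w
                w = [wi * (1 - lr * lam) for wi in w]
    return w, b

def svm_predict(w, b, x):
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b >= 0 else -1

# Hypothetical toy data: two biomarkers, outcome coded +1/-1.
data = [((2.0, 2.5), 1), ((2.2, 3.0), 1), ((3.0, 2.8), 1),
        ((0.5, 0.4), -1), ((1.0, 0.2), -1), ((0.8, 0.9), -1)]
w, b = train_linear_svm(data)
```

Kernel SVMs extend the same margin idea to nonlinear boundaries, which is what makes the method attractive for high dimensional data.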

Competitive risk model

Kaplan–Meier estimation and the Cox proportional hazards model are widely used in survival analysis in clinical studies. Classical survival analysis usually considers only one endpoint, such as patient survival time. In clinical medical research, however, multiple endpoints often coexist, and these endpoints compete with one another to generate competing-risk data [ 57 ]. In the case of multiple endpoint events, using a single-endpoint analysis method can lead to a biased estimation of the probability of endpoint events because of the existence of competing risks [ 58 ]. The competitive risk model is a classical statistical model based on hypotheses about the data distribution. Its main advantage is its accurate estimation of the cumulative incidence of outcomes for right-censored survival data with multiple endpoints [ 59 ]. In data analysis, the cumulative risk rate is estimated using the cumulative incidence function in single-factor analysis, and Gray’s test is used for between-group comparisons [ 60 ].
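The cumulative incidence function mentioned above can be estimated nonparametrically, in the style of the Aalen–Johansen estimator. The follow-up data below (cause 1 vs. a competing cause, with censoring) are hypothetical:

```python
def cumulative_incidence(times, events, cause):
    """Estimate the cumulative incidence function (CIF) for one competing
    cause. `events[i]` is 0 for censoring, otherwise a cause code.
    CIF_k(t) = sum over event times t_i <= t of S(t_i-) * d_k(t_i) / n(t_i),
    where S is all-cause Kaplan-Meier survival just before t_i."""
    order = sorted(range(len(times)), key=lambda i: times[i])
    n_at_risk = len(times)
    surv = 1.0          # all-cause survival S(t-)
    cif = 0.0
    curve = []
    i = 0
    while i < len(order):
        t = times[order[i]]
        tied = [j for j in order[i:] if times[j] == t]   # handle tied times
        d_all = sum(1 for j in tied if events[j] != 0)
        d_k = sum(1 for j in tied if events[j] == cause)
        cif += surv * d_k / n_at_risk
        surv *= 1 - d_all / n_at_risk
        n_at_risk -= len(tied)
        i += len(tied)
        curve.append((t, cif))
    return curve

# Hypothetical follow-up: cause 1 = cancer death, cause 2 = other death.
times = [2, 3, 5, 7, 8, 10]
events = [1, 2, 1, 0, 1, 2]   # 0 = censored
cif_cancer = cumulative_incidence(times, events, cause=1)
```

Note that, unlike 1 minus a cause-specific Kaplan–Meier curve, this estimate never overstates the incidence of the cause of interest when competing events remove subjects from risk.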

Multifactor analysis uses the Fine-Gray and cause-specific (CS) risk models to explore the cumulative risk rate [ 61 ]. The difference between the Fine-Gray and CS models is that the former is applicable to establishing a clinical prediction model and predicting the risk of a single endpoint of interest [ 62 ], whereas the latter is suitable for answering etiological questions, where the regression coefficient reflects the relative effect of covariates on the increased incidence of the main endpoint in the target event-free risk set [ 63 ]. Currently, in databases with CS records, such as Surveillance, Epidemiology, and End Results (SEER), competitive risk models exhibit good performance in exploring disease-risk factors and prognosis [ 64 ]. A study of prognosis in patients with oesophageal cancer from SEER showed that Cox proportional risk models might misestimate the effects of age and disease location on patient prognosis, whereas competitive risk models provide more accurate estimates of factors affecting patient prognosis [ 65 ]. In another study of the prognosis of penile cancer patients, researchers found that using a competitive risk model was more helpful in developing personalized treatment plans [ 66 ].

Unsupervised learning

In many data-analysis settings, the amount of usable labelled data is small, and labelling data is a tedious process [ 67 ]. Unsupervised learning is needed to judge and categorize data according to similarities, characteristics, and correlations, and it has three main applications: data clustering, association analysis, and dimensionality reduction. Accordingly, the unsupervised learning methods introduced in this section are clustering analysis, association rules, and PCA.

Clustering analysis

A classification algorithm needs to “know” information concerning each category in advance, with all of the data to be classified having corresponding categories. When these conditions cannot be met, cluster analysis can be applied [ 68 ]. Clustering places similar objects into the same categories or subsets through a process of static classification, so that objects in the same subset have similar properties. Many kinds of clustering techniques exist; here, we introduce the four most commonly used.

Partition clustering

The core idea of this method is to treat the centre of the data points as the centre of the corresponding cluster. The k-means method [ 69 ] is a representative example: it takes n observations and an integer k and outputs a partition of the n observations into k sets such that each observation belongs to the cluster with the nearest mean [ 70 ]. The k-means method exhibits low time complexity and high computing efficiency, but it handles high dimensional data poorly and cannot identify nonspherical clusters.
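A bare-bones k-means sketch on hypothetical two-dimensional data; the seed, initialization scheme, and iteration count are arbitrary choices:

```python
import random

def kmeans(points, k, iters=50, seed=1):
    """Plain k-means: assign each point to its nearest centre, move each
    centre to the mean of its assigned points, and repeat."""
    rng = random.Random(seed)
    centres = rng.sample(points, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            idx = min(range(k),
                      key=lambda c: sum((a - b) ** 2
                                        for a, b in zip(p, centres[c])))
            clusters[idx].append(p)
        centres = [tuple(sum(dim) / len(cl) for dim in zip(*cl)) if cl
                   else centres[i]          # keep centre if cluster empties
                   for i, cl in enumerate(clusters)]
    return centres, clusters

# Two well-separated hypothetical patient groups in a 2-D feature space.
points = [(0.0, 0.1), (0.2, 0.0), (0.1, 0.3),
          (5.0, 5.2), (5.1, 4.9), (4.8, 5.0)]
centres, clusters = kmeans(points, k=2)
```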

Hierarchical clustering

The hierarchical clustering algorithm decomposes a dataset hierarchically to facilitate subsequent clustering [ 71 ]. Common algorithms for hierarchical clustering include BIRCH [ 72 ], CURE [ 73 ], and ROCK [ 74 ]. The algorithm starts by treating every point as a cluster and then merges clusters according to their closeness. The grouping process ends when further merging would produce undesirable results or when only one cluster remains. This method has wide applicability, and the relationships between clusters are easy to detect; however, its time complexity is high [ 75 ].
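The merge-closest-clusters idea can be sketched with single linkage, one of several common linkage rules (BIRCH, CURE, and ROCK use more elaborate cluster representations to scale to large data). The points are hypothetical:

```python
def single_linkage(points, n_clusters):
    """Agglomerative clustering: start with every point as its own cluster
    and repeatedly merge the two closest clusters (single linkage)."""
    clusters = [[p] for p in points]

    def dist(a, b):
        # single linkage: distance between the closest pair of members
        return min(sum((x - y) ** 2 for x, y in zip(p, q))
                   for p in a for q in b)

    while len(clusters) > n_clusters:
        i, j = min(((i, j)
                    for i in range(len(clusters))
                    for j in range(i + 1, len(clusters))),
                   key=lambda ij: dist(clusters[ij[0]], clusters[ij[1]]))
        clusters[i] += clusters.pop(j)
    return clusters

points = [(0.0, 0.0), (0.1, 0.2), (0.3, 0.1),
          (4.0, 4.1), (4.2, 3.9)]
clusters = single_linkage(points, n_clusters=2)
```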

Clustering according to density

The density algorithm defines areas presenting a high degree of data density as belonging to the same cluster [ 76 ]. This approach aims to find arbitrarily shaped clusters, with the most representative algorithm being DBSCAN [ 77 ]. In practice, DBSCAN does not need the number of clusters as an input and can handle clusters of various shapes; however, its time complexity is high, the quality of the clusters decreases when data density is irregular, and it does not handle high dimensional data well [ 75 ].
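A minimal DBSCAN sketch: note that the number of clusters is never supplied, only a neighbourhood radius and a density threshold. The points are hypothetical:

```python
def dbscan(points, eps, min_pts):
    """Minimal DBSCAN: points with at least `min_pts` neighbours (self
    included) within `eps` are core points; clusters grow outward from
    cores, and points reachable from no core are labelled noise (-1)."""
    def neighbours(i):
        return [j for j in range(len(points))
                if sum((a - b) ** 2
                       for a, b in zip(points[i], points[j])) <= eps ** 2]

    labels = [None] * len(points)
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        nbrs = neighbours(i)
        if len(nbrs) < min_pts:
            labels[i] = -1                 # noise (may be claimed later)
            continue
        cluster += 1
        labels[i] = cluster
        queue = list(nbrs)
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster        # border point joins the cluster
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_nbrs = neighbours(j)
            if len(j_nbrs) >= min_pts:     # j is also a core point: expand
                queue.extend(j_nbrs)
    return labels

points = [(0.0, 0.0), (0.1, 0.1), (0.2, 0.0), (0.1, 0.2),
          (3.0, 3.0), (3.1, 3.1), (3.0, 3.2), (9.0, 9.0)]
labels = dbscan(points, eps=0.5, min_pts=3)
```

The isolated point at (9.0, 9.0) is labelled noise, which a partition method like k-means would instead force into one of the clusters.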

Clustering according to a grid

Neither partition nor hierarchical clustering can identify clusters with nonconvex shapes. Although a density-based algorithm can accomplish this task, its time complexity is high. To address this problem, data-mining researchers proposed grid-based algorithms that change the original data space into a grid structure of a certain size. A representative algorithm is STING, which divides the data space into several square cells according to different resolutions and clusters the data at different structural levels [ 78 ]. The main advantage of this method is its high processing speed, which depends only on the number of units in each dimension of the quantized space.
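A simplified grid-based sketch in the spirit of STING (the multi-resolution cell statistics of the real algorithm are omitted): quantize the space into square cells, keep the dense ones, and merge adjacent dense cells into clusters. The data are hypothetical:

```python
from collections import defaultdict

def grid_cluster(points, cell_size, min_count):
    """Quantize 2-D points into square cells, mark cells holding at least
    `min_count` points as dense, then flood-fill over 8-connected dense
    cells to form clusters."""
    cells = defaultdict(list)
    for p in points:
        cells[(int(p[0] // cell_size), int(p[1] // cell_size))].append(p)
    dense = {c for c, pts in cells.items() if len(pts) >= min_count}

    clusters, seen = [], set()
    for cell in dense:
        if cell in seen:
            continue
        group, stack = [], [cell]
        seen.add(cell)
        while stack:
            cx, cy = stack.pop()
            group.extend(cells[(cx, cy)])
            for dx in (-1, 0, 1):          # visit the 8 neighbouring cells
                for dy in (-1, 0, 1):
                    nb = (cx + dx, cy + dy)
                    if nb in dense and nb not in seen:
                        seen.add(nb)
                        stack.append(nb)
        clusters.append(group)
    return clusters

points = [(0.1, 0.1), (0.2, 0.3), (0.4, 0.2),
          (5.1, 5.2), (5.3, 5.1), (5.2, 5.4)]
clusters = grid_cluster(points, cell_size=1.0, min_count=2)
```

Because all work after quantization happens on cells rather than points, the cost depends only on the number of cells, which is the speed advantage the text describes.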

In clinical studies, subjects tend to be actual patients. Although researchers adopt complex inclusion and exclusion criteria before determining the subjects to be included in the analyses, heterogeneity among different patients cannot be avoided [ 79 , 80 ]. The most common application of cluster analysis in clinical big data is in classifying heterogeneous mixed groups into homogeneous groups according to the characteristics of existing data (i.e., “subgroups” of patients or observed objects are identified) [ 81 , 82 ]. This new information can then be used in the future to develop patient-oriented medical-management strategies. Docampo et al. [ 81 ] used hierarchical clustering to reduce heterogeneity and identify subgroups of clinical fibromyalgia, which aided the evaluation and management of fibromyalgia. Additionally, Guo et al. [ 83 ] used k-means clustering to divide patients with essential hypertension into four subgroups, which revealed that the potential risk of coronary heart disease differed between different subgroups. On the other hand, density- and grid-based clustering algorithms have mostly been used to process large numbers of images generated in basic research and clinical practice, with current studies focused on developing new tools to help clinical research and practices based on these technologies [ 84 , 85 ]. Cluster analysis will continue to have extensive application prospects along with the increasing emphasis on personalized treatment.

Association rules

Association rules discover interesting associations and correlations between item sets in large amounts of data. These rules were first proposed by Agrawal et al. [ 86 ] and applied to analyse customer buying habits to help retailers create sales plans. Data-mining based on association rules identifies association rules in a two-step process: 1) all high frequency items in the collection are listed and 2) frequent association rules are generated based on the high frequency items [ 87 ]. Therefore, before association rules can be obtained, sets of frequent items must be calculated using certain algorithms. The Apriori algorithm is based on the a priori principle of finding all relevant adjustment items in a database transaction that meet a minimum set of rules and restrictions or other restrictions [ 88 ]. Other algorithms are mostly variants of the Apriori algorithm [ 64 ]. The Apriori algorithm must scan the entire database every time it scans the transaction; therefore, algorithm performance deteriorates as database size increases [ 89 ], making it potentially unsuitable for analysing large databases. The frequent pattern (FP) growth algorithm was proposed to improve efficiency. After the first scan, the FP algorithm compresses the frequency set in the database into a FP tree while retaining the associated information and then mines the conditional libraries separately [ 90 ]. Association-rule technology is often used in medical research to identify association rules between disease risk factors (i.e., exploration of the joint effects of disease risk factors and combinations of other risk factors). For example, Li et al. [ 91 ] used the association-rule algorithm to identify the most important stroke risk factor as atrial fibrillation, followed by diabetes and a family history of stroke. Based on the same principle, association rules can also be used to evaluate treatment effects and other aspects. For example, Guo et al. 
[ 92 ] used the FP algorithm to generate association rules and evaluate the individual characteristics and treatment effects of patients with diabetes, thereby reducing the readmission rate of patients with diabetes. Association rules reveal a connection between premises and conclusions; however, reasonable and reliable application of this information can only be achieved through validation by experienced medical professionals and through extensive causal research [ 92 ].
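The two-step process described above begins with frequent-itemset mining; a compact Apriori sketch on hypothetical co-occurring risk factors per patient record (the factor names and the support threshold are illustrative):

```python
from itertools import combinations

def apriori(transactions, min_support):
    """Level-wise Apriori: count 1-itemsets, keep the frequent ones, and
    extend only candidates whose every subset is already frequent."""
    n = len(transactions)
    tsets = [frozenset(t) for t in transactions]

    def support(itemset):
        return sum(1 for t in tsets if itemset <= t) / n

    items = sorted({i for t in tsets for i in t})
    current = {frozenset([i]) for i in items
               if support(frozenset([i])) >= min_support}
    result = {}
    k = 1
    while current:
        result.update({s: support(s) for s in current})
        k += 1
        candidates = {a | b for a in current for b in current
                      if len(a | b) == k}
        # a priori pruning: every (k-1)-subset must already be frequent
        current = {c for c in candidates
                   if all(frozenset(sub) in result
                          for sub in combinations(c, k - 1))
                   and support(c) >= min_support}
    return result

# Hypothetical risk factors recorded per patient.
transactions = [{"hypertension", "diabetes", "smoking"},
                {"hypertension", "diabetes"},
                {"hypertension", "smoking"},
                {"diabetes", "smoking"},
                {"hypertension", "diabetes", "obesity"}]
frequent = apriori(transactions, min_support=0.6)
```

Rules such as "hypertension → diabetes" would then be derived from the frequent itemsets by comparing supports; the FP-growth algorithm reaches the same itemsets without the repeated database scans.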

PCA

PCA is a widely used data-mining method that aims to reduce data dimensionality in an interpretable way while retaining most of the information present in the data [ 93 , 94 ]. The main purpose of PCA is descriptive: it requires no assumptions about data distribution and is, therefore, an adaptive and exploratory method. The main steps of PCA are standardization of the original data, calculation of the correlation coefficient matrix, calculation of its eigenvalues and eigenvectors, selection of principal components, and calculation of a comprehensive evaluation value. PCA does not often appear as a standalone method, as it is usually combined with other statistical methods [ 95 ]. In practical clinical studies, the existence of multicollinearity often biases multivariate analysis. A feasible solution is to construct a regression model by PCA, which replaces the original independent variables with the principal components as new independent variables for regression analysis; this is most commonly seen in the analysis of dietary patterns in nutritional epidemiology [ 96 ]. In a study of socioeconomic status and child-developmental delays, PCA was used to derive a new variable (the household wealth index) from a series of household property reports and incorporate it as the main analytical variable in a logistic regression model [ 97 ]. Additionally, PCA can be combined with cluster analysis. Burgel et al. [ 98 ] used PCA to transform clinical data to address the lack of independence among the variables used to explore the heterogeneity of subtypes of chronic obstructive pulmonary disease. Therefore, in the study of subtypes and heterogeneity of clinical diseases, PCA can eliminate noisy variables that might otherwise corrupt the cluster structure, thereby increasing the accuracy of clustering results [ 98 , 99 ].
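The standardization and eigenvector steps at the heart of PCA can be sketched with power iteration, which recovers the first principal component without a full eigendecomposition. The two correlated variables below are hypothetical:

```python
def first_principal_component(rows, iters=200):
    """Standardize the data, build the correlation matrix, and run power
    iteration: repeatedly multiplying a vector by the matrix makes it
    converge to the leading eigenvector (the first principal component)."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[c] for r in rows) / n for c in range(d)]
    sds = [max((sum((r[c] - means[c]) ** 2 for r in rows) / n) ** 0.5, 1e-12)
           for c in range(d)]
    z = [[(r[c] - means[c]) / sds[c] for c in range(d)] for r in rows]
    corr = [[sum(z[i][a] * z[i][b] for i in range(n)) / n for b in range(d)]
            for a in range(d)]
    v = [1.0] * d
    for _ in range(iters):
        w = [sum(corr[a][b] * v[b] for b in range(d)) for a in range(d)]
        norm = sum(x * x for x in w) ** 0.5
        v = [x / norm for x in w]
    return v

# Two strongly correlated hypothetical variables.
rows = [(1.0, 2.1), (2.0, 3.9), (3.0, 6.2), (4.0, 8.1), (5.0, 9.8)]
pc1 = first_principal_component(rows)
```

For two positively correlated standardized variables the leading eigenvector loads both equally, so the component score is essentially their shared variation; subsequent components would be found the same way after removing (deflating) this one.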

The data-mining process and examples of its application using common public databases

Open-access databases have the advantages of large volumes of data, wide data coverage, rich data information, and a cost-efficient method of research, making them beneficial to medical researchers. In this chapter, we introduced the data-mining process and methods and their application in research based on examples of utilizing public databases and data-mining algorithms.

The data-mining process

Figure  1 shows a series of research concepts. The data-mining process is divided into several steps: (1) database selection according to the research purpose; (2) data extraction and integration, including downloading the required data and combining data from multiple sources; (3) data cleaning and transformation, including removal of incorrect data, filling in missing data, generating new variables, converting data format, and ensuring data consistency; (4) data mining, involving extraction of implicit relational patterns through traditional statistics or ML; (5) pattern evaluation, which focuses on the validity parameters and values of the relationship patterns of the extracted data; and (6) assessment of the results, involving translation of the extracted data-relationship model into comprehensible knowledge made available to the public.
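Steps (3) and (4) of the process can be made concrete with a toy sketch; the records and the derived BMI variable are hypothetical illustrations, not drawn from any of the databases discussed:

```python
# Hypothetical extracted records (step 2 output); one has a missing value.
raw = [
    {"id": 1, "weight_kg": 70, "height_m": 1.75, "outcome": 0},
    {"id": 2, "weight_kg": 95, "height_m": 1.70, "outcome": 1},
    {"id": 3, "weight_kg": None, "height_m": 1.80, "outcome": 0},
]

# Step 3a, cleaning: remove records with missing measurements.
clean = [r for r in raw if all(v is not None for v in r.values())]

# Step 3b, transformation: generate a new variable from existing ones.
for r in clean:
    r["bmi"] = round(r["weight_kg"] / r["height_m"] ** 2, 1)

# Step 4, mining: extract a simple pattern (mean BMI per outcome group),
# standing in for the statistical or ML modelling described in the text.
groups = {}
for r in clean:
    groups.setdefault(r["outcome"], []).append(r["bmi"])
pattern = {k: sum(v) / len(v) for k, v in groups.items()}
```

Steps 5 and 6 would then assess whether such a pattern is valid (e.g., on held-out data) and translate it into usable knowledge.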

Figure 1. The steps of data mining in medical public databases

Examples of data-mining applied using public databases

Establishment of warning models for the early prediction of disease.

A previous study identified sepsis as a major cause of death in ICU patients [ 100 ]. The authors noted that previously developed predictive models used a limited number of variables and that model performance required improvement. The data-mining process applied to address these issues was as follows: (1) data selection using the MIMIC III database; (2) extraction and integration of three types of data, including multivariate features (demographic information and clinical biochemical indicators), time-series data (temperature, blood pressure, and heart rate), and clinical latent features (various disease-related scores); (3) data cleaning and transformation, including fixing irregular time-series measurements, imputing missing values, deleting outliers, and addressing data imbalance; (4) data mining using logistic regression, a decision tree, the RF algorithm, an SVM, and an ensemble algorithm (a combination of multiple classifiers) to establish the prediction model; (5) pattern evaluation using sensitivity, precision, and the area under the receiver operating characteristic curve to evaluate model performance; and (6) evaluation of the results, in this case the potential to predict the prognosis of patients with sepsis and whether the model outperformed current scoring systems.

Exploring prognostic risk factors in cancer patients

Wu et al. [ 101 ] noted that traditional survival-analysis methods often ignore the influence of competing risk events, such as suicide and car accidents, on outcomes, leading to deviations and misjudgements in estimating the effects of risk factors. They used the SEER database, which offers cause-of-death data for cancer patients, and a competitive risk model to address this problem according to the following process: (1) data were obtained from the SEER database; (2) the demography, clinical characteristics, treatment modality, and cause of death of cecum cancer patients were extracted from the database; (3) patient records missing demographic, clinical, therapeutic, or cause-of-death variables were deleted; (4) Cox regression and two kinds of competitive risk models were applied for survival analysis; (5) the results were compared among the three models; and (6) the results revealed that for survival data with multiple endpoints, the competitive risk model was more favourable.

Derivation of dietary patterns

A study by Martínez Steele et al. [ 102 ] applied PCA for nutritional epidemiological analysis to determine dietary patterns and evaluate the overall nutritional quality of the population based on those patterns. Their process involved the following: (1) data were extracted from the NHANES database covering the years 2009–2010; (2) demographic characteristics and two 24 h dietary recall interviews were obtained; (3) data were weighted and excluded based on subjects not meeting specific criteria; (4) PCA was used to determine dietary patterns in the United States population, and Gaussian regression and restricted cubic splines were used to assess associations between ultra-processed foods and nutritional balance; (5) eigenvalues, scree plots, and the interpretability of the principal components were reviewed to screen and evaluate the results; and (6) the results revealed a negative association between ultra-processed food intake and overall dietary quality. Their findings indicated that a nutritionally balanced eating pattern was characterized by a diet high in fibre, potassium, magnesium, and vitamin C intake along with low sugar and saturated fat consumption.

The use of “big data” has changed multiple aspects of modern life, with its use combined with data-mining methods capable of improving the status quo [ 86 ]. The aim of this study was to aid clinical researchers in understanding the application of data-mining technology on clinical big data and public medical databases to further their research goals in order to benefit clinicians and patients. The examples provided offer insight into the data-mining process applied for the purposes of clinical research. Notably, researchers have raised concerns that big data and data-mining methods were not a perfect fit for adequately replicating actual clinical conditions, with the results potentially capable of misleading doctors and patients [ 86 ]. Therefore, given the rate at which new technologies and trends progress, it is necessary to maintain a positive attitude concerning their potential impact while remaining cautious in examining the results provided by their application.

In the future, the healthcare system will need to utilize increasingly larger volumes of big data with higher dimensionality. The tasks and objectives of data analysis will also have higher demands, including higher degrees of visualization, results with increased accuracy, and stronger real-time performance. As a result, the methods used to mine and process big data will continue to improve. Furthermore, to increase the formality and standardization of data-mining methods, it is possible that a new programming language specifically for this purpose will need to be developed, as well as novel methods capable of addressing unstructured data, such as graphics, audio, and text represented by handwriting. In terms of application, the development of data-management and disease-screening systems for large-scale populations, such as the military, will help determine the best interventions and formulation of auxiliary standards capable of benefitting both cost-efficiency and personnel. Data-mining technology can also be applied to hospital management in order to improve patient satisfaction, detect medical-insurance fraud and abuse, and reduce costs and losses while improving management efficiency. Currently, this technology is being applied for predicting patient disease, with further improvements resulting in the increased accuracy and speed of these predictions. Moreover, it is worth noting that technological development will concomitantly require higher quality data, which will be a prerequisite for accurate application of the technology.

Finally, the ultimate goal of this study was to explain the methods associated with data mining and commonly used to process clinical big data. This review will potentially promote further study and aid doctors and patients.

Abbreviations

BioLINCC: Biologic Specimen and Data Repositories Information Coordinating Center

CHARLS: China Health and Retirement Longitudinal Study

CHNS: China Health and Nutrition Survey

CKB: China Kadoorie Biobank

CS: Cause-specific risk

CTD: Comparative Toxicogenomics Database

eICU-CRD: eICU Collaborative Research Database

FP: Frequent pattern

GBD: Global burden of disease

GEO: Gene Expression Omnibus

HRS: Health and Retirement Study

ICGC: International Cancer Genome Consortium

MIMIC: Medical Information Mart for Intensive Care

ML: Machine learning

NHANES: National Health and Nutrition Examination Survey

PCA: Principal component analysis

PIC: Paediatric intensive care

RF: Random forest

SEER: Surveillance, Epidemiology, and End Results

SVM: Support vector machine

TCGA: The Cancer Genome Atlas

Herland M, Khoshgoftaar TM, Wald R. A review of data mining using big data in health informatics. J Big Data. 2014;1(1):1–35.


Wang F, Zhang P, Wang X, Hu J. Clinical risk prediction by exploring high-order feature correlations. AMIA Annu Symp Proc. 2014;2014:1170–9.


Xu R, Li L, Wang Q. dRiskKB: a large-scale disease-disease risk relationship knowledge base constructed from biomedical text. BMC Bioinform. 2014;15:105. https://doi.org/10.1186/1471-2105-15-105 .


Ramachandran S, Erraguntla M, Mayer R, Benjamin P, Editors. Data mining in military health systems-clinical and administrative applications. In: 2007 IEEE international conference on automation science and engineering; 2007. https://doi.org/10.1109/COASE.2007.4341764 .

Vie LL, Scheier LM, Lester PB, Ho TE, Labarthe DR, Seligman MEP. The US army person-event data environment: a military-civilian big data enterprise. Big Data. 2015;3(2):67–79. https://doi.org/10.1089/big.2014.0055 .


Mohan A, Blough DM, Kurc T, Post A, Saltz J. Detection of conflicts and inconsistencies in taxonomy-based authorization policies. IEEE Int Conf Bioinform Biomed. 2012;2011:590–4. https://doi.org/10.1109/BIBM.2011.79 .

Luo J, Wu M, Gopukumar D, Zhao Y. Big data application in biomedical research and health care: a literature review. Biomed Inform Insights. 2016;8:1–10. https://doi.org/10.4137/BII.S31559 .


Bellazzi R, Zupan B. Predictive data mining in clinical medicine: current issues and guidelines. Int J Med Inform. 2008;77(2):81–97.

Sahu H, Shrma S, Gondhalakar S. A brief overview on data mining survey. Int J Comput Technol Electron Eng. 2011;1(3):114–21.


Obermeyer Z, Emanuel EJ. Predicting the future - big data, machine learning, and clinical medicine. N Engl J Med. 2016;375(13):1216–9.


Doll KM, Rademaker A, Sosa JA. Practical guide to surgical data sets: surveillance, epidemiology, and end results (SEER) database. JAMA Surg. 2018;153(6):588–9.

Johnson AE, Pollard TJ, Shen L, Lehman LW, Feng M, Ghassemi M, et al. MIMIC-III, a freely accessible critical care database. Sci Data. 2016;3: 160035. https://doi.org/10.1038/sdata.2016.35 .

Ahluwalia N, Dwyer J, Terry A, Moshfegh A, Johnson C. Update on NHANES dietary data: focus on collection, release, analytical considerations, and uses to inform public policy. Adv Nutr. 2016;7(1):121–34.

Vos T, Lim SS, Abbafati C, Abbas KM, Abbasi M, Abbasifard M, et al. Global burden of 369 diseases and injuries in 204 countries and territories, 1990–2019: a systematic analysis for the Global Burden of Disease Study 2019. Lancet. 2020;396(10258):1204–22. https://doi.org/10.1016/S0140-6736(20)30925-9 .

Palmer LJ. UK Biobank: Bank on it. Lancet. 2007;369(9578):1980–2. https://doi.org/10.1016/S0140-6736(07)60924-6 .

Cancer Genome Atlas Research Network, Weinstein JN, Collisson EA, Mills GB, Shaw KR, Ozenberger BA, et al. The cancer genome atlas pan-cancer analysis project. Nat Genet. 2013;45(10):1113–20. https://doi.org/10.1038/ng.2764 .

Davis S, Meltzer PS. GEOquery: a bridge between the Gene Expression Omnibus (GEO) and BioConductor. Bioinformatics. 2007;23(14):1846–7.


Zhang J, Bajari R, Andric D, Gerthoffert F, Lepsa A, Nahal-Bose H, et al. The international cancer genome consortium data portal. Nat Biotechnol. 2019;37(4):367–9.




This study was supported by the National Social Science Foundation of China (No. 16BGL183).

Author information

Wen-Tao Wu and Yuan-Jie Li have contributed equally to this work

Authors and Affiliations

Department of Clinical Research, The First Affiliated Hospital of Jinan University, Tianhe District, 613 W. Huangpu Avenue, Guangzhou, 510632, Guangdong, China

Wen-Tao Wu, Ao-Zi Feng, Li Li, Tao Huang & Jun Lyu

School of Public Health, Xi’an Jiaotong University Health Science Center, Xi’an, 710061, Shaanxi, China

Department of Human Anatomy, Histology and Embryology, School of Basic Medical Sciences, Xi’an Jiaotong University Health Science Center, Xi’an, 710061, Shaanxi, China

Yuan-Jie Li

Department of Neurology, The First Affiliated Hospital of Jinan University, Tianhe District, 613 W. Huangpu Avenue, Guangzhou, 510632, Guangdong, China


Contributions

WTW, YJL and JL designed the review. JL, AZF, TH, LL and ADX reviewed and criticized the original paper. All authors read and approved the final manuscript.

Corresponding authors

Correspondence to An-Ding Xu or Jun Lyu .

Ethics declarations

Ethics approval and consent to participate

Not applicable.

Consent for publication

Competing interests

The authors declare that they have no competing interests.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/ . The Creative Commons Public Domain Dedication waiver ( http://creativecommons.org/publicdomain/zero/1.0/ ) applies to the data made available in this article, unless otherwise stated in a credit line to the data.


About this article

Cite this article.

Wu, WT., Li, YJ., Feng, AZ. et al. Data mining in clinical big data: the frequently used databases, steps, and methodological models. Military Med Res 8, 44 (2021). https://doi.org/10.1186/s40779-021-00338-z

Download citation

Received : 24 January 2020

Accepted : 03 August 2021

Published : 11 August 2021

DOI : https://doi.org/10.1186/s40779-021-00338-z


  • Clinical big data
  • Data mining
  • Medical public database

Military Medical Research

ISSN: 2054-9369



A comprehensive survey of data mining

  • Original Research
  • Published: 06 February 2020
  • Volume 12 , pages 1243–1257, ( 2020 )

Cite this article


  • Manoj Kumar Gupta   ORCID: orcid.org/0000-0002-4481-8432 1 &
  • Pravin Chandra 1  


Data mining plays an important role in many human activities because it extracts previously unknown, useful patterns (or knowledge) from data. Owing to these capabilities, data mining has become an essential task in a large number of application domains, such as banking, retail, medicine, insurance and bioinformatics. To take a holistic view of research trends in the area, this paper presents a systematic and comprehensive survey of data mining tasks and techniques, describes various real-life applications of data mining, and discusses the open challenges and issues in data mining research.
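Among the tasks such a survey covers, frequent-itemset mining (the basis of association rule mining) is one of the most canonical. The sketch below is illustrative only and is not taken from the paper: it uses brute-force candidate enumeration rather than the pruning of a full Apriori implementation, and the transaction data, threshold and function name are invented for the example.

```python
from itertools import combinations

# Toy transaction database: each transaction is a set of purchased items.
transactions = [
    {"bread", "milk"},
    {"bread", "diapers", "beer", "eggs"},
    {"milk", "diapers", "beer", "cola"},
    {"bread", "milk", "diapers", "beer"},
    {"bread", "milk", "diapers", "cola"},
]

def frequent_itemsets(transactions, min_support=0.6, max_size=2):
    """Return itemsets of up to max_size items whose support
    (fraction of transactions containing the itemset) meets min_support."""
    items = sorted({item for t in transactions for item in t})
    n = len(transactions)
    result = {}
    for size in range(1, max_size + 1):
        for candidate in combinations(items, size):
            support = sum(set(candidate) <= t for t in transactions) / n
            if support >= min_support:
                result[candidate] = support
    return result

freq = frequent_itemsets(transactions)
# e.g. {"beer", "diapers"} co-occurs in 3 of 5 transactions (support 0.6)
```

A production implementation would prune candidates whose subsets are already infrequent (the Apriori property) instead of enumerating all combinations, which is what makes the algorithm scale to large databases.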





Author information

Authors and Affiliations

University School of Information, Communication and Technology, Guru Gobind Singh Indraprastha University, Sector-16C, Dwarka, Delhi, 110078, India

Manoj Kumar Gupta & Pravin Chandra


Corresponding author

Correspondence to Manoj Kumar Gupta .


About this article

Gupta, M.K., Chandra, P. A comprehensive survey of data mining. Int. J. Inf. Technol. 12, 1243–1257 (2020). https://doi.org/10.1007/s41870-020-00427-7


Received : 29 June 2019

Accepted : 20 January 2020

Published : 06 February 2020

Issue Date : December 2020



  • Data mining techniques
  • Data mining tasks
  • Data mining applications
  • Classification


Healthcare (Basel)

A Systematic Review on Healthcare Analytics: Application and Theoretical Perspective of Data Mining

Md Saiful Islam

1 Mechanical and Industrial Engineering, Northeastern University, Boston, MA 02115, USA; [email protected] (M.S.I.); [email protected] (M.M.H.); [email protected] (X.W.); [email protected] (H.D.G.)

Md Mahmudul Hasan

Xiaoyi Wang

Hayley D. Germack

2 National Clinician Scholars Program, Yale University School of Medicine, New Haven, CT 06511, USA

3 Bouvé College of Health Sciences, Northeastern University, Boston, MA 02115, USA

Md Noor-E-Alam

Associated Data

The growing healthcare industry is generating a large volume of useful data on patient demographics, treatment plans, payment, and insurance coverage—attracting the attention of clinicians and scientists alike. In recent years, a number of peer-reviewed articles have addressed different dimensions of data mining application in healthcare. However, the lack of a comprehensive and systematic narrative motivated us to construct a literature review on this topic. In this paper, we present a review of the literature on healthcare analytics using data mining and big data. Following Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) guidelines, we conducted a database search between 2005 and 2016. Critical elements of the selected studies—healthcare sub-areas, data mining techniques, types of analytics, data, and data sources—were extracted to provide a systematic view of development in this field and possible future directions. We found that the existing literature mostly examines analytics in clinical and administrative decision-making. Use of human-generated data is predominant considering the wide adoption of Electronic Medical Record in clinical care. However, analytics based on website and social media data has been increasing in recent years. Lack of prescriptive analytics in practice and integration of domain expert knowledge in the decision-making process emphasizes the necessity of future research.

1. Introduction

Healthcare is a booming sector of the economy in many countries [ 1 ]. With this growth come challenges, including rising costs, inefficiencies, poor quality, and increasing complexity [ 2 ]. U.S. healthcare expenditures increased by 23% between 2010 and 2015, from $2.6 trillion to $3.2 trillion [ 3 ]. Inefficiencies, i.e., non-value-added tasks (e.g., readmissions, inappropriate use of antibiotics, and fraud), constitute 21–47% of this enormous expenditure [ 4 ]. Some of these costs were associated with low-quality care: researchers found that approximately 251,454 patients in the U.S. die each year due to medical errors [ 5 ]. Better decision-making based on available information could mitigate these challenges and facilitate the transition to a value-based healthcare industry [ 4 ]. Healthcare institutions are adopting information technology in their management systems [ 6 ], and a large volume of data is collected through these systems on a regular basis. Analytics provides tools and techniques to extract information from this complex and voluminous data [ 2 ] and translate it into actionable insight for decision-making in healthcare.

Analytics is the way of developing insights through the efficient use of data and application of quantitative and qualitative analysis [ 7 ]. It can generate fact-based decisions for “planning, management, measurement, and learning” purposes [ 2 ]. For instance, the Centers for Medicare and Medicaid Services (CMS) used analytics to reduce hospital readmission rates and avert $115 million in fraudulent payment [ 8 ]. Use of analytics—including data mining, text mining, and big data analytics—is assisting healthcare professionals in disease prediction, diagnosis, and treatment, resulting in an improvement in service quality and reduction in cost [ 9 ]. According to some estimates, application of data mining can save $450 billion each year from the U.S. healthcare system [ 10 ]. In the past ten years, researchers have studied data mining and big data analytics from both applied (e.g., applied to pharmacovigilance or mental health) and theoretical (e.g., reflecting on the methodological or philosophical challenges of data mining) perspectives.

In this review, we systematically organize and summarize the published peer-reviewed literature related to the applied and theoretical perspectives of data mining. We classify the literature by type of analytics (e.g., descriptive, predictive, prescriptive), healthcare application area (e.g., clinical decision support, mental health), and data mining technique (e.g., classification, sequential pattern mining); and we report the data source used in each reviewed paper, which, to the best of our knowledge, has never been done before.

Motivation and Scope

There is a large body of recently published review/conceptual studies on healthcare and data mining. We outline the characteristics of these studies (e.g., scope/healthcare sub-area, timeframe, and number of papers reviewed) in Table 1 . For example, one study reviewed the awareness effect in type 2 diabetes in papers published between 2001 and 2005, identifying 18 papers [ 11 ]. The existing review literature is limited: most of the papers listed in Table 1 did not report the timeframe and/or the number of papers reviewed (expressed as N/A).

Characteristics of existing review/conceptual studies on the related topics.

N/A represents Not Reported.

There is no comprehensive review available which presents the complete picture of data mining application in the healthcare industry. The existing reviews (16 out of 21) are either focused on a specific area of healthcare, such as clinical medicine (three reviews) [ 16 , 17 , 19 ], adverse drug reaction signal detection (two reviews) [ 25 , 26 ], big data analytics (four reviews) [ 8 , 10 , 22 , 24 ], or the application and performance of data mining algorithms (five reviews) [ 9 , 13 , 14 , 20 , 21 ]. Two studies focused on specific diseases (diabetes [ 11 ], skin diseases [ 18 ]). To the best of our knowledge, none of these studies present the universe of research that has been done in this field. These studies are also limited in the rigor of their methodology except for four articles [ 11 , 16 , 22 , 25 ], which provide key insights including the timeframe covered in the study, database search, and literature inclusion or exclusion criteria, but they are limited in their scope of topics covered (see Table 1 ).

Beyond condensing the applied literature, our review also adds to the body of theoretical reviews in the analytics literature. Current theoretical reviews are limited to methodological challenges and techniques to overcome them [ 15 , 16 , 27 ] and to the application and impact of big data analytics in healthcare [ 23 ]. In summary, the current reviews listed in Table 1 lack (1) breadth of coverage in terms of application areas, (2) range of data mining techniques considered, (3) assessment of literature quality, and (4) systematic selection and analysis of papers. In this review, we aim to fill these gaps. We add to this literature by covering the applied and theoretical perspectives of data mining and big data analytics in healthcare with a more comprehensive and systematic approach.

2. Methodology

The methodology of our review followed the checklist proposed by the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) [ 28 ]. We assessed the quality of the selected articles using JBI Critical Appraisal Checklist for analytical cross sectional studies [ 29 ] and Critical Appraisal Skills Programme (CASP) qualitative research checklist [ 30 ].

2.1. Input Literature

Selected literature and the selection process are described in this section. Initially, a two-phase advanced keyword search was conducted in the Web of Science database, and a one-phase (Phase 2) search in PubMed and Google Scholar, with a time filter of 1 January 2005 to 31 December 2016 applied to “All Fields”. Restriction to journal articles written in English was added as an additional filter. The keywords listed in Table 2 were used in the different phases. The search proceeded as follows:

Figure 1. Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) flow chart [ 28 ] illustrating the literature search process.

  • Exclusion criteria: This included articles reporting on results of: qualitative studies, surveys, focus group studies, feasibility studies, monitoring devices, team relationship measurement, job satisfaction, work environment, “what-if” analysis, data collection techniques, editorials or short reports, articles that merely mention data mining, and articles not published in international journals. Duplicates were removed (33 articles). Finally, 117 articles were retained for the review. Figure 1 provides a PRISMA [ 28 ] flow diagram of the review process and Supplementary Information File S1 (Table S1) provides the PRISMA checklist.

Keywords for database search.

1 A logical operator used between the keywords during the database search. 2 Cancer was listed independently because other dominant associations have the word “disease” attached to them (e.g., heart disease, skin disease, mental disease, etc.).

2.2. Quality Assessment and Processing Steps

The full text of each of the 117 articles was reviewed separately by two researchers to eliminate bias [ 28 ]. To assess the quality of the cross-sectional studies, we applied the JBI Critical Appraisal Checklist for Analytical Cross Sectional Studies [ 29 ]. For theoretical papers, we applied the Critical Appraisal Skills Programme (CASP) qualitative research checklist [ 30 ]. We modified the checklist items, as not all items specified in the JBI or CASP checklists were applicable to studies on healthcare analytics ( Supplementary Materials Table S2 ). For the cross-sectional studies, we evaluated each article’s quality based on inclusion of: (1) clear objective and inclusion criteria; (2) detailed description of sample population and variables; (3) data source (e.g., hospital, database, survey) and format (e.g., structured Electronic Medical Record (EMR), International Classification of Diseases code, unstructured text, survey response); (4) valid and reliable data collection; (5) consideration of ethical issues; (6) detailed discussion of findings and implications; (7) valid and reliable measurement of outcomes; and (8) use of an appropriate data mining tool. For the theoretical papers, we evaluated: (1) clear statement of aims; (2) appropriateness of the qualitative methodology; (3) appropriateness of the research design; (4) clearly stated findings; and (5) value of the research. Summary characteristics from any study fulfilling these criteria were included in the final data aggregation ( Supplementary Materials Table S3 ).

To summarize the body of knowledge, we adopted the three-step processing methodology outlined by Levy and Ellis [ 31 ] and Webster and Watson [ 32 ] ( Figure 2 ). During the review process, information was extracted by identifying and defining the problem, understanding the solution process, and listing the important findings (“Know the literature”). We summarized and compared each article with the articles addressing similar problems (“Comprehend the literature”). This also ensured that irrelevant information was not carried into the analysis. The summarized information was stored in a spreadsheet in the form of a concept matrix as described by Webster and Watson [ 32 ]. We updated the concept matrix periodically, after completing every 20% of the articles (approximately 23 articles), to include new findings (“Apply”). Based on the concept matrix, we developed a classification scheme (see Figure 3 ) for further comparison and contrast. We established an operational definition (see Table 3 ) for each class, and articles of the same class were separated from the pool (“Analyze and Synthesize”). We compared classifications between researchers and resolved disagreements (on six articles) by discussion. The final classification provided distinct groups of articles with summaries, facts, and remarks made by the reviewers (“Evaluate”).
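The concept matrix mentioned above is, in essence, an article-by-concept incidence table. A minimal sketch in Python (the article titles and concept labels here are invented placeholders, not the review's actual entries):

```python
# A concept matrix in the Webster-and-Watson sense: rows are articles,
# columns are concepts, entries flag whether an article covers a concept.
concepts = ["classification", "clustering", "EMR data", "prescriptive"]
matrix = {
    "Article A": {"classification", "EMR data"},
    "Article B": {"clustering"},
    "Article C": {"classification", "prescriptive"},
}

# Row view: which concepts each article covers, as 0/1 flags.
rows = {art: [int(c in covered) for c in concepts]
        for art, covered in matrix.items()}

# Column view: how many articles touch each concept.
counts = {c: sum(c in covered for covered in matrix.values())
          for c in concepts}

print(rows["Article A"])        # [1, 0, 1, 0]
print(counts["classification"]) # 2
```

Sorting the column counts then surfaces which concepts are under-studied, which is how such a matrix supports gap analysis.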

Figure 2. Three stages of the effective literature review process, adapted from Levy and Ellis [ 31 ].

Figure 3. Classification scheme of the literature.

Operational definition of the classes.

* Most of the definitions listed in this table are well established in the literature and well known; therefore, we did not use any specific reference. However, for some classes, specifically for types of analytics and data, varying definitions are available in the literature. We cite the sources of those definitions.

2.3. Results

The network diagram of selected articles and the keywords listed by authors in Figure 4 represents the outcome of the methodological review process. We elaborate on the resulting output in the subsequent sections using the structure of the developed classification scheme ( Figure 3 ). We also report the potential future research areas.

Figure 4. Visualization of high-frequency keywords of the reviewed papers. The white circles symbolize the articles and the blue circles represent keywords. Keywords that occurred only once were eliminated, along with the corresponding articles. The size of a blue circle and its text represents how often that keyword is found. The size of a white circle is proportional to the number of keywords used in that article. The links represent the connections between the keywords and the articles. For example, if a blue circle has three links (e.g., Decision-Making), that keyword was used in three articles. The diagram was created with the open-source software Gephi [ 34 ].

2.3.1. Methodological Quality of the Studies

Out of the 117 papers included in this review, 92 applied analytics and 25 were qualitative/conceptual. The methodological quality of the analytical studies (92 out of 117) was evaluated with a modified version of the 8 yes/no questions suggested in the JBI Critical Appraisal Checklist for Analytical Cross Sectional Studies [ 29 ]. Each question carries 1 point (1 if the answer is Yes, 0 for No). The score achieved by each paper is provided in the final column of Supplementary Materials Table S3 . On average, each paper applying analytics scored 7.6 out of 8, with a range of 6–8 points. The major drawbacks were the absence of a data source and of performance measures for the data mining algorithms: out of 92 papers, 23 did not evaluate or mention the performance of the applied algorithms, and eight did not mention the source of the data. However, all the papers in healthcare analytics had a clear objective and a detailed discussion of the sample population and variables. The data used in each paper was either de-identified/anonymized or approved by the institute’s ethical committee to ensure patient confidentiality.

We applied the Critical Appraisal Skills Programme (CASP) qualitative research checklist [ 30 ] to evaluate the quality of the 25 theoretical papers. Five of the ten questions in that checklist were not applicable to the theoretical studies; therefore, we evaluated these papers on a five-point scale (1 if the answer is Yes, 0 for No). The papers showed high methodological quality, as 21 of the 25 scored 5. The last column in Supplementary Materials Table S3 provides the score achieved by each paper.

2.3.2. Distribution by Publication Year

The distribution of articles related to data mining and big data analytics in healthcare across the study timeline (2005–2016) is presented in Figure 5 . The distribution shows an upward trend, with at least two articles in each year and more than ten articles in each of the last four years. This trend reflects the growing interest of government agencies, healthcare practitioners, and academics in this interdisciplinary field of research. We anticipate that the use of analytics will continue to grow in the coming years to address rising healthcare costs and the need for improved quality of care.

Figure 5. Distribution of publications by year (117 articles).

2.3.3. Distribution by Journal

Articles published in 74 different journals were included in this study. Table 4 lists the top ten journals by number of papers published. Expert Systems with Applications was the dominant source of literature on data mining application in healthcare, with 7 of the 117 articles. The journals were interdisciplinary in nature, spanning computational journals such as IEEE Transactions on Information Technology in Biomedicine and policy-focused journals such as Health Affairs . Articles published in Expert Systems with Applications, Journal of Medical Systems, Journal of the American Medical Informatics Association, and Healthcare Informatics Research were mostly related to analytics applied in clinical decision-making and healthcare administration. In contrast, articles published in Health Affairs were predominantly conceptual in nature, addressing policy issues, challenges, and the potential of this field.

Top 10 journals on application of data mining in healthcare.

3. Healthcare Analytics

Out of 117 articles, 92 applied analytics for decision-making in healthcare. We discuss the types of analytics, the application area, the data, and the data mining techniques used in these articles and summarize them in Supplementary Materials Table S4 .

3.1. Types of Analytics

We identified three types of analytics in the literature: descriptive (i.e., exploration and discovery of information in the dataset), predictive (i.e., prediction of upcoming events based on historical data), and prescriptive (i.e., utilization of scenarios to provide decision support). Five of the 92 studies employed both descriptive and predictive analytics. In Figure 6 , which displays the percentage of healthcare articles using each analytics type, we show that descriptive analytics is the most commonly used in healthcare (48%). Descriptive analytics was dominant in all the application areas except clinical decision support. Among the application areas, pharmacovigilance studies used only descriptive analytics, as this application area is focused on identifying associations between adverse drug effects and medications. Predictive analytics was used in 43% of articles. Among application areas, clinical decision support had the highest use of predictive analytics, as many studies in this area involve risk and morbidity prediction for chest pain, heart attack, and other diseases. In contrast, use of prescriptive analytics was very uncommon (only 9%), as most of these studies focused on either a specific population base or a specific disease scenario. However, some evidence of prescriptive analytics was found in public healthcare, administration, and mental health (see Supplementary Materials Table S4 ). These studies create a data repository and/or analytical platform to facilitate decision-making for different scenarios.

Figure 6. Types of analytics used in the literature. (a) Percentage of analytics type; (b) analytics type by application area.

3.2. Types of Data

To identify types of data, we adopted the classification scheme of Raghupathi and Raghupathi [ 23 ], which takes into account the nature (i.e., text, image, number, electronic signal), source, and collection method of data together. Table 3 provides the operational definitions of the taxonomy adopted in this paper. Figure 7 a presents the percentage of each data type used and Figure 7 b the number of usages by application area. As expected, human-generated (HG) data, including the Electronic Medical Record (EMR), Electronic Health Record (EHR), and Electronic Patient Record (EPR), is the most commonly used form (77%). Web or social media (WS) data is the second most common type (11%), reflecting the growing use of social media and the ongoing digital revolution in the healthcare sector [ 35 ]. In addition, recent developments in Natural Language Processing (NLP) techniques are making the use of WS data easier than before [ 36 ]. The other three types of data (SD, BT, and BM) together constitute only about 12% of total data usage, but the popularity and market growth of wearable personal health-tracking devices [ 37 ] may increase the use of SD and BM data.

Figure 7. Percentage of data type used (a) and type of data used by application area (b).

3.3. Data Mining Techniques

Data mining techniques used in the reviewed articles include classification, clustering, association, anomaly detection, sequential pattern mining, regression, and data warehousing. While an elaborate description of each technique and the available algorithms is beyond the scope of this review, we report the frequency of each technique and its sector-wise distribution in Figure 8 a,b, respectively. Among the articles included in the review, 57 used classification techniques to analyze data. Association and clustering were used in 21 and 18 articles, respectively. Use of other techniques was less frequent.

Figure 8. Utilization of data mining techniques, (a) by percentage and (b) by application area.

A high proportion (8 out of 9) of pharmacovigilance papers used association. Use of classification was dominant in every sector except pharmacovigilance ( Figure 8 b). Data warehousing was mostly used in healthcare administration ( Figure 8 b).

We delved deeper into classification as it was utilized in the majority (57 out of 92) of the papers. A number of algorithms are used for classification, which we present in a word cloud in Figure 9 . Support Vector Machine (SVM), Artificial Neural Network (ANN), Logistic Regression (LR), Decision Tree (DT), and DT-based algorithms were the most commonly used. Random Forest (RF), Bayesian Network, and fuzzy-based algorithms were also often used. A few papers (three) introduced novel algorithms for specific applications. For example, Yeh et al. [ 38 ] developed a discrete particle swarm optimization-based classification algorithm to distinguish breast cancer patients from the general population. Self-organizing maps and K-means were the most commonly used clustering algorithms in healthcare. Performance (e.g., accuracy, sensitivity, specificity, area under the ROC curve, positive predictive value, negative predictive value) of each of these algorithms varied by application and data type. We recommend applying multiple algorithms and choosing the one that achieves the best accuracy.
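The "apply multiple algorithms and keep the most accurate" recommendation can be sketched with scikit-learn's cross-validation utilities. The data below is synthetic and merely stands in for a clinical dataset; this is an illustration of the selection procedure, not the protocol of any reviewed study:

```python
# Compare several of the commonly used classifiers by cross-validated
# accuracy and keep the best one, on a synthetic stand-in dataset.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, n_features=10, random_state=0)

candidates = {
    "LR": LogisticRegression(max_iter=1000),
    "SVM": SVC(),
    "DT": DecisionTreeClassifier(random_state=0),
    "RF": RandomForestClassifier(random_state=0),
}

# Mean 5-fold cross-validated accuracy per candidate.
scores = {name: cross_val_score(clf, X, y, cv=5).mean()
          for name, clf in candidates.items()}
best = max(scores, key=scores.get)
print(best, round(scores[best], 3))
```

In practice the metric would be chosen to match the clinical question (e.g., sensitivity for screening tasks), not accuracy alone.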

Figure 9. Word cloud [ 39 ] with classification algorithms.

4. Application of Analytics in Healthcare

Table 3 provides the operational definitions of the six application areas (i.e., clinical decision support, healthcare administration, privacy and fraud detection, mental health, public health, and pharmacovigilance) identified in this review. Figure 10 shows the percentage of articles in each area. Among the different classes of healthcare analytics, data mining is mostly applied to clinical decision support (42%) and administrative purposes (32%). This section discusses the application of data mining in these areas and identifies the main aims of these studies, performance gaps, and key features.

Figure 10. Percentage of papers that utilized healthcare analytics, by application area (92 of 117 articles).

4.1. Clinical Decision Support

Clinical decision support consists of descriptive and/or predictive analysis mostly related to cardiovascular disease (CVD), cancer, diabetes, and emergency/critical care unit patients. Some studies developed novel data mining algorithms which we review. Table 5 describes the topics investigated and data sources used by papers using clinical decision-making, organized by major diseases category.

Topics and data sources of papers using clinical decision-making, organized by major disease category.

4.1.1. Cardiovascular Disease (CVD)

CVD is one of the most common causes of death globally [ 45 , 77 ]. Its public health relevance is reflected in the literature—it was addressed by seven articles (18% of articles in clinical decision support).

Researchers [ 40 ] distilled risk factors related to Coronary Heart Disease (CHD) into a decision-tree-based classification system. The authors investigated three events: Coronary Artery Bypass Graft Surgery (CABG), Percutaneous Coronary Intervention (PCI), and Myocardial Infarction (MI), and developed three models: CABG vs. non-CABG, PCI vs. non-PCI, and MI vs. non-MI. The risk factors for each event were divided into four groups in two stages: they were separated into before and after the event at the first stage, and into modifiable (e.g., smoking habit or blood pressure) and non-modifiable (e.g., age or sex) at the second stage for each group. After classification, the most important risk factors were identified by extracting the classification rules. The Framingham equation [ 78 ], which is widely used to calculate global risk for CHD, was used to calculate the risk for each event. The most important risk factors identified were age, smoking habit, history of hypertension, family history, and history of diabetes; other studies on CHD show similar results [ 79 , 80 , 81 ]. This study had implications for healthcare providers and patients by identifying risk factors to specifically target and, in the case of modifiable factors, reduce CHD risk [ 40 ].
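The classify-then-extract-rules idea can be illustrated with scikit-learn's `export_text`. The feature names and toy data below are hypothetical stand-ins for the study's risk factors, not its actual data:

```python
# Fit a decision tree on (hypothetical) risk-factor data, then read the
# classification rules and feature importances off the fitted tree.
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

rng = np.random.default_rng(0)
features = ["age", "smoker", "hypertension", "family_history", "diabetes"]

# Binary risk factors plus an age column; the "event" label is a toy rule.
X = rng.integers(0, 2, size=(200, len(features))).astype(float)
X[:, 0] = rng.integers(35, 80, size=200)           # age in years
y = ((X[:, 0] > 55) & (X[:, 1] == 1)).astype(int)  # toy event label

tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)

# Human-readable if/else rules, analogous to the extracted classification
# rules used to rank risk factors in the study.
rules = export_text(tree, feature_names=features)
print(rules)
print(dict(zip(features, tree.feature_importances_)))
```

Reading the printed rules shows which thresholds (here on `age` and `smoker`) drive the predicted class, which is the sense in which rule extraction identifies important risk factors.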

Data mining has also been applied to diagnose Coronary Artery Disease (CAD) [ 41 ]. Researchers showed that, as an alternative to existing diagnostic methods (i.e., Coronary Angiography (CA)), which are costly and require high technical skill, data mining using existing data such as demographics, medical history, simple physical examinations, blood tests, and simple noninvasive investigations (e.g., heart rate, glucose level, body mass index, creatinine level, cholesterol level, arterial stiffness) is simpler, less costly, and can achieve a similar level of accuracy. The researchers used a four-step classification process: (1) a decision tree was used to classify the data; (2) crisp classification rules were generated; (3) a fuzzy model was created by fuzzifying the crisp classifier rules; and (4) the fuzzy model parameters were optimized and the final classification was made. The proposed optimized fuzzy model achieved 73% prediction accuracy and improved upon an existing Artificial Neural Network (ANN) by providing better interpretability.

Traditional data mining and machine learning algorithms (e.g., probabilistic neural networks and SVM) may not be advanced enough to handle the data used for CVD diagnosis, which is often uncertain and highly dimensional in nature. To tackle this issue, researchers [ 42 ] proposed a fuzzy Standard Additive Model (SAM) for classification. They used adaptive vector quantization clustering to generate unsupervised fuzzy rules, which were then optimized (the number of rules minimized) by a Genetic Algorithm (GA). They then used the incremental form of a supervised technique, gradient descent, to fine-tune the rules. Because training the fuzzy system is highly time-consuming given the large number of features in the data, the number of features was reduced with a wavelet transformation. The proposed algorithm achieved better accuracy (78.78%) than the probabilistic neural network (73.80%), SVM (74.27%), fuzzy ARTMAP (63.46%), and the adaptive neuro-fuzzy inference system (74.90%). Another common issue in cardiovascular event risk prediction is the censorship of data (i.e., the patient's condition is not followed up after they leave hospital and until a new event occurs; the available data becomes right-censored). Elimination and exclusion of the censored data create bias in prediction results. To address censorship in CVD event risk prediction, two studies [ 43 , 44 ] used Inverse Probability of Censoring Weighting (IPCW). IPCW is a pre-processing step that calculates weights on the data, which are later classified using a Bayesian Network. One of these studies [ 43 ] provided an IPCW-based system that is compatible with any machine learning algorithm.
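As a rough illustration of the IPCW idea (not the cited studies' implementation), the sketch below estimates the censoring survival curve G(t) with a Kaplan-Meier estimator, treating censoring as the "event", and assigns each uncensored subject the weight 1/G(t). The data is a toy example; a real analysis would pass these weights to the downstream classifier as sample weights:

```python
# Inverse Probability of Censoring Weighting (IPCW) on toy survival data.
import numpy as np

time = np.array([2.0, 3.0, 4.0, 5.0, 6.0, 7.0])  # observed times
event = np.array([1, 0, 1, 0, 1, 1])             # 1 = event, 0 = censored

def censoring_survival(t, time, event):
    """Kaplan-Meier estimate of G(t) = P(censoring time > t)."""
    g = 1.0
    for u in np.unique(time[event == 0]):  # distinct censoring times
        if u <= t:
            at_risk = np.sum(time >= u)
            censored_here = np.sum((time == u) & (event == 0))
            g *= 1.0 - censored_here / at_risk
    return g

# Weight = event indicator / G(observed time); censored rows get weight 0,
# and uncensored rows are up-weighted to compensate for lost follow-up.
weights = np.array([e / censoring_survival(t, time, event)
                    for t, e in zip(time, event)])
print(weights.round(3))
```

With these weights, late uncensored events (which "survived" more censoring opportunities) count for more, which is what removes the bias that simply dropping censored records would introduce.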

Electrocardiography (ECG), a non-invasive measurement of the electrical activity of the heart, is among the most commonly used medical tests in the assessment of CVD. Machine learning offers a potential optimization of traditional ECG assessment, in which compressed signals must be decompressed before any diagnosis can be made, a process that takes time and considerable computer storage. In one study, researchers [ 45 ] developed a framework for real-time diagnosis of cardiovascular abnormalities based directly on compressed ECG. To reduce diagnosis time, which is critical for appropriate and timely treatment decisions, they proposed and tested a mobile-based framework and applied it to wireless monitoring of patients. The ECG was sent to the hospital server, where the signals were divided into normal and abnormal clusters. The system detected cardiac abnormality with 97% accuracy. The cluster information was sent to the patient's mobile phone, and if any life-threatening abnormality was detected, the phone alerted the hospital or emergency personnel.

Data analytics have also been applied to rarer CVDs. One study [ 46 ] developed an intervention prediction model for Hypoplastic Left Heart Syndrome (HLHS), a rare and fatal form of heart disease in infants that requires surgery. Post-surgical evaluation is critical, as patient condition can shift very quickly. Indicators of patient wellness are not easily or directly measurable, but inferences can be made from measurable physiological parameters including pulse, heart rhythm, systemic blood pressure, common atrial filling pressure, urine output, physical exam, and systemic and mixed venous oxygen saturations. A subtle physiological shift can cause death if not noticed and intervened upon. To help healthcare providers in decision-making, the researchers developed a prediction model by identifying the correlation between physiological parameters and interventions. They collected 19,134 records of 17 patients in Pediatric Intensive Care Units (PICU). Each record contained physiological parameters measured by devices and noted by nurses, and for each record a wellness score was assigned by domain experts. After classifying the data using a rough set algorithm, decision rules were extracted for each wellness score to aid in making intervention plans. The authors also developed a new feature-selection measure, Combined Classification Quality (CCQ), which accounts for both the variation in a feature's values and the distinct outcomes each value leads to. They showed that a higher CCQ value leads to higher classification accuracy, which is not always true for the commonly used classification quality (CQ) measure: two features with a CQ value of 1 produced very different classification accuracies (35.5% and 75%), whereas the same two features had CCQ values of 0.25 and 0.40, and the feature with CCQ 0.40 was the one that produced 75% accuracy. Using CCQ instead of CQ avoids such inconsistency.

4.1.2. Diabetes

The disease burden related to diabetes is high and rising in every country. According to the World Health Organization (WHO), diabetes will become the seventh leading cause of death by 2030 [ 82 ]. Data mining has been applied to identify rare forms of diabetes, identify the important factors for controlling diabetes, and explore patient history to extract knowledge. We reviewed seven studies that applied healthcare analytics to diabetes.

Researchers extracted knowledge about diabetes treatment pathways and identified rare forms and complications of diabetes using a three-level clustering framework applied to the examination history of diabetic patients [ 48 ]. The first level clustered patients who went through regular tests for monitoring purposes (e.g., checkup visit, glucose level, urine test) or to diagnose diabetes-related complications (e.g., eye tests for diabetic retinopathy). The second level explored patients who went through diagnosis for specific or multiple diabetic complications only (e.g., cardiovascular, eye, liver, and kidney related complications). These two levels produced 2939 outliers out of 6380 patients. At the third level, the authors clustered these outlier patients to gain insight into rare forms of diabetes or rare complications. A density-based clustering algorithm, DBSCAN, was used because it does not require specifying the number of clusters a priori and is less sensitive to noise and outliers. This framework for grouping patients by treatment pathway can be used to evaluate treatment plans and costs. Another group of researchers [ 49 ] investigated the important factors related to type 2 diabetes control. They used feature selection via supervised model construction (FSSMC) to select and rank the important factors. They then applied the naïve Bayes, IB1, and C4.5 algorithms with the FSSMC technique to classify patients as having poor or good diabetes control and to evaluate classification efficiency for different subsets of features. Experiments on physiological and laboratory information collected from 3857 patients showed that the classifiers performed best (a 1–3% increase in accuracy) with the features selected by FSSMC. Age, diagnosis duration, and insulin treatment were the top three important factors.
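The appeal of DBSCAN for this kind of outlier-driven analysis is that it labels sparse points as noise rather than forcing them into a cluster. A small sketch (synthetic patient feature vectors; not the study's data or parameters):

```python
# DBSCAN groups dense regions and assigns label -1 to points in sparse
# regions (outliers), with no need to fix the number of clusters up front.
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(0)
regular = rng.normal(loc=0.0, scale=0.3, size=(80, 2))       # common pathway
complication = rng.normal(loc=5.0, scale=0.3, size=(40, 2))  # second pathway
rare = np.array([[10.0, 10.0], [-8.0, 9.0]])                 # rare patients

X = np.vstack([regular, complication, rare])
labels = DBSCAN(eps=1.0, min_samples=5).fit_predict(X)

n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
n_outliers = int((labels == -1).sum())
```

In the framework above, the points labeled -1 would correspond to the outlier patients passed on to the third clustering level.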

Data analytics have also been applied to identify patients with type 2 diabetes. In one study [ 52 ], using fragmented data from two different healthcare centers, researchers evaluated the effect of data fragmentation on a high-throughput clinical phenotyping (HTCP) algorithm for identifying patients at risk of developing type 2 diabetes. When a patient visits multiple healthcare centers during a study period, his/her data is stored in different EMRs and is called fragmented; in such cases, applying an HTCP algorithm can lead to improper classification. An experiment performed in a rural setting showed that using data from two healthcare centers instead of one decreased the false negative rate from 32.9% to 0%. In another study, researchers [ 51 ] utilized sparse logistic regression to predict type 2 diabetes risk from insurance claims data containing more than 500 features, including demography, specific medical conditions, and comorbidity. Their model outperformed traditional risk prediction methods for large data sets and data sets with missing values, increasing the AUC from 0.75 to 0.80. In a third study, researchers [ 53 ] developed a prediction and risk diagnosis model using a hybrid system with SVM. Using features like blood pressure, fasting blood sugar, two-hour post-glucose tolerance, and cholesterol level along with other demographic and anthropometric features, the SVM was able to predict diabetes risk with 97% accuracy. One reason for the high accuracy compared to the insurance-claims study [ 51 ] is the structured nature of the data, which came from a cross-sectional survey on diabetes.
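Sparse logistic regression of the kind used in [ 51 ] typically means an L1 penalty, which drives most coefficients to exactly zero so that only a handful of the hundreds of claim features survive. An illustrative sketch on synthetic data (all settings are assumptions, not the study's):

```python
# L1-penalized logistic regression on a wide, mostly-noisy feature matrix:
# the penalty zeroes out uninformative coefficients.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=200, n_informative=10,
                           n_redundant=0, random_state=0)
Xtr, Xte, ytr, yte = train_test_split(X, y, test_size=0.3, random_state=0)

model = LogisticRegression(penalty="l1", solver="liblinear", C=0.1)
model.fit(Xtr, ytr)

auc = roc_auc_score(yte, model.decision_function(Xte))
n_nonzero = int(np.sum(model.coef_ != 0))  # far fewer than 200 features kept
```

The retained coefficients double as a feature-importance list, which is part of why sparse models are attractive for claims data.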

Different statistical and machine learning algorithms are available for classification. Researchers [ 50 ] compared the performance of two statistical methods (LR and Fisher linear discriminant analysis) and four machine learning algorithms (SVM with a radial basis function kernel, ANN, Random Forest, and Fuzzy C-means) for predicting diabetes diagnosis. Ten features (age, gender, BMI, waist circumference, smoking, job, hypertension, residential region (rural/urban), physical activity, and family history of diabetes) were used to test the classification performance (diabetes or no diabetes). Parameters for ANN and SVM were optimized through greedy search. SVM showed the best performance on all measures and was at least 5% more accurate than the other classification techniques, while the statistical methods performed similarly to the remaining machine learning algorithms. The study was limited, however, by a low prevalence of diabetes in the dataset, which can cause poor classification performance. Other researchers [ 47 ] proposed a novel pattern recognition algorithm using convolutional nonnegative matrix factorization. They treated each patient as an entity, and each doctor visit, prescription, test result, and diagnosis as an event over time. Finding such temporal patterns can help group similar patients, identify their treatment pathways, and support patient management. Though they did not compare pattern recognition accuracy with existing methods like singular value decomposition (SVD), the matrix-like representation makes the approach intuitive.
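Head-to-head comparisons like the one in [ 50 ] usually amount to running each candidate model through the same cross-validation protocol. A compact sketch (synthetic data, a reduced set of models, and default-ish hyperparameters, all assumptions):

```python
# Compare several classifiers under the same 5-fold cross-validation.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

X, y = make_classification(n_samples=600, n_features=10, random_state=1)

models = {
    "LR": LogisticRegression(max_iter=1000),
    "SVM-RBF": SVC(kernel="rbf", gamma="scale"),
    "RF": RandomForestClassifier(n_estimators=100, random_state=1),
}
scores = {name: cross_val_score(m, X, y, cv=5).mean()
          for name, m in models.items()}
```

Keeping the folds identical across models is what makes the resulting accuracy differences (such as the 5% SVM margin reported in [ 50 ]) meaningful.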

4.1.3. Cancer

Cancer is another major threat to public health [ 83 ]. Machine learning has been applied to cancer patients to predict survival and to aid diagnosis. We reviewed five studies that applied healthcare analytics to cancer.

Despite many advances in treatment, accurate prediction of survival in patients with cancer remains challenging given the heterogeneity of cancer complexity, treatment options, and patient populations. Survival of prostate cancer patients has been predicted using a classification model [ 54 ]. The model used a public database, SEER (Surveillance, Epidemiology, and End Results), and applied a stratified ten-fold sampling approach. Survival predictions among prostate cancer patients were made using the DT, ANN, and SVM algorithms. SVM outperformed the others with 92.85% classification accuracy, whereas DT and ANN achieved 90% and 91.07% accuracy, respectively. The same database has been used to predict survival of lung cancer patients [ 56 ]. After preprocessing the 11 features available in the data set, the authors identified the two features with the strongest predictive power: (1) the count of regional lymph nodes removed and examined and (2) the malignant/in-situ tumor count. They applied several supervised classification methods to the preprocessed data; ensemble voting of five decision-tree-based classifiers and meta-classifiers (J48 DT, RF, LogitBoost, Random Subspace, and Alternating DT) provided the best performance: 74% for 6-month, 75% for 9-month, 77% for 1-year, 86% for 2-year, and 92% for 5-year survival. Using this technique, they developed an online lung cancer outcome calculator to estimate the risk of mortality at 6 months, 9 months, 1 year, 2 years, and 5 years after diagnosis.
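The ensemble-voting mechanism is straightforward: each base classifier casts a vote and the majority wins. A sketch with a smaller, assumed trio of tree-based learners (the study's actual five classifiers include boosting and random-subspace variants not reproduced here):

```python
# Hard-voting ensemble of tree-based classifiers on synthetic data.
from sklearn.datasets import make_classification
from sklearn.ensemble import (AdaBoostClassifier, RandomForestClassifier,
                              VotingClassifier)
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=800, n_features=11, random_state=7)
Xtr, Xte, ytr, yte = train_test_split(X, y, test_size=0.25, random_state=7)

vote = VotingClassifier(
    estimators=[("dt", DecisionTreeClassifier(max_depth=5, random_state=7)),
                ("rf", RandomForestClassifier(n_estimators=100, random_state=7)),
                ("ab", AdaBoostClassifier(random_state=7))],
    voting="hard")  # majority vote over the three base learners
vote.fit(Xtr, ytr)
acc = vote.score(Xte, yte)
```

Voting tends to help when the base learners make partially uncorrelated errors, which is the rationale for mixing plain trees with boosted and subspace variants.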

In addition to predicting survival, machine learning techniques have been used to identify patients with cancer. Researchers [ 38 ] proposed a new hybrid algorithm to distinguish patients with breast cancer from those without. In the first stage, they used correlation and regression to select the significant features; in the second stage, they used discrete Particle Swarm Optimization (PSO) to classify the data. Applied to the Wisconsin Breast Cancer Data Set from the UCI machine learning repository, the hybrid algorithm achieved better accuracy (98.71%) than a genetic algorithm (GA) (96.14%) [ 84 ] and another PSO-based algorithm (93.4%) [ 85 ].

Machine learning has also been used to identify the nature of cancer (benign or malignant) and to understand demographics related to cancer. Among patients with breast cancer, researchers [ 42 ] applied the Fuzzy Standard Additive Model (SAM) with GA (discussed earlier in relation to CVD) to predict whether a tumor was benign or malignant. Using a data set from the UCI machine learning repository, the model was able to classify uncertain and high dimensional data with greater accuracy (by 1–2%). Researchers have also used big data [ 55 ] to create a visualization tool that provides a dynamic view of cancer statistics (e.g., trends, associations with other diseases) and how they relate to demographic variables (e.g., age, sex) and other diseases (e.g., diabetes, kidney infection). Data mining thus provided a better understanding of cancer patients at both the demographic and the outcome level, which in turn provides an opportunity for early identification and intervention.

4.1.4. Emergency Care

The emergency department (ED) is the primary route to hospital admission [ 58 ]. In 2011, 20% of the US population had at least one visit to the ED [ 86 ]. EDs are experiencing significant financial pressure to increase efficiency and patient throughput. Discrete event simulation (i.e., modeling system operations as a sequence of discrete events) is a useful tool to understand and improve ED operations by simulating the behavior and performance of EDs. Certain features of the ED (e.g., different types of patients, treatments, urgency, and uncertainty) can complicate simulation. One way to handle the complexity is to group patients according to the treatment they require. Previously, the "casemix" principle, developed by expert clinicians to group similar patients in case-specific settings (e.g., telemetry or nephrology units), was used, but it has limitations in the ED setting [ 58 ]. Researchers [ 58 ] applied data mining (clustering) in the ED setting to group patients based on treatment pattern (e.g., full ward test, head injury observation, ECG, blood glucose, CT scan, X-ray). The clustering model was verified and validated by ED clinicians. The resulting groups were then used in discrete event simulation to understand and improve ED operations (mainly length of stay) and process flows for each group.

Chest pain admissions to the ED have also been examined using a decision-making framework. Researchers [ 57 ] proposed a three-stage framework for classifying the severity of chest pain as AMI, angina pectoris, or other. In the first stage, lab tests and diagnoses were collected and the associations between them extracted. In the second stage, experts developed association rules between lab tests and diagnoses to help physicians make quick diagnostic decisions based on the tests already performed and avoid further unnecessary lab tests. In the third stage, the authors built a classification tree to classify the chest pain diagnosis based on selected lab tests, diagnoses, and medical records. This hybrid model was applied in the emergency department of one hospital. The classification system, built from 327 association rules and selected lab tests, was trained using C5.0, Neural Network (NN), and SVM; the C5.0 algorithm achieved 94.18% accuracy whereas NN and SVM achieved 88.89% and 85.19%, respectively.
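Association rules of the kind mined in the second stage reduce to two quantities: support (how often antecedent and consequent co-occur) and confidence (how often the consequent holds given the antecedent). A self-contained sketch with made-up transactions and item names:

```python
# Minimal support/confidence computation for association rules.
# support(A -> B) = P(A and B); confidence(A -> B) = P(A and B) / P(A).
transactions = [
    {"troponin_high", "ecg_abnormal", "AMI"},
    {"troponin_high", "ecg_abnormal", "AMI"},
    {"troponin_high", "AMI"},
    {"ecg_abnormal", "angina"},
    {"troponin_normal", "other"},
]

def support(itemset):
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent):
    return support(antecedent | consequent) / support(antecedent)

rule_supp = support({"troponin_high", "AMI"})          # 3 of 5 transactions
rule_conf = confidence({"troponin_high"}, {"AMI"})     # AMI in every such case
```

In practice an algorithm such as Apriori enumerates only frequent itemsets before computing these measures, but the measures themselves are exactly the ones above.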

4.1.5. Intensive Care

Intensive care units (ICUs) cater to patients with severe and life-threatening illness and injury requiring constant, close monitoring and support to maintain normal bodily function. Death is much more common in an ICU than in a general medical unit; one study showed that 22.4% of all hospital deaths occurred in the ICU [ 87 ]. Survival predictions and identification of important factors related to mortality can help healthcare providers plan care. We identified two papers [ 59 , 60 ] that developed models for ICU mortality prediction. Using a large amount of ICU patient data (specifically from the first 24 h of the stay) collected at the University of Kentucky Hospital from 1998 to 2007 (38,474 admissions), one group of researchers identified 15 of 40 features as significant using Pearson's chi-square test (for categorical variables) and Student's t-test (for continuous variables) [ 59 ]. The mortality rate was predicted with DT, ANN, SVM, and APACHE III, a logistic regression based approach; DT's AUC was higher than the other methods' by 0.02. The study was limited, however, by considering only the first 24 h of ICU admission, which may not be enough to predict mortality. Another team of researchers [ 60 ] applied a similarity metric to predict 30-day mortality in 17,152 ICU admissions extracted from the MIMIC-II database [ 88 ]. Their analysis concluded that training on a large group of similar patients (e.g., by vital signs and laboratory test results) rather than on all patients leads to slightly better prediction accuracy: the logistic regression model achieved an AUC of 0.83 when trained on the 5000 most similar patients, but its performance declined to an AUC of 0.81 when all available patient data were used.
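The screening step in [ 59 ], univariate tests that pass only significant features to the classifier, is commonly wrapped in a pipeline. An illustrative sketch (synthetic continuous data, so an ANOVA F-test is swapped in where the study used chi-square for its categorical variables; the counts 40 and 15 are borrowed from the study, everything else is an assumption):

```python
# Filter-style feature selection feeding a classifier: keep the 15
# highest-scoring of 40 candidate features, then fit and cross-validate.
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

X, y = make_classification(n_samples=1000, n_features=40, n_informative=8,
                           random_state=3)
pipe = make_pipeline(SelectKBest(f_classif, k=15),
                     LogisticRegression(max_iter=1000))
score = cross_val_score(pipe, X, y, cv=5).mean()
```

Putting selection inside the pipeline keeps the test folds untouched by the selection step, avoiding optimistic bias.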

4.1.6. Other Applications

In addition to CVD, diabetes, cancer, emergency care, and ICU care, data mining has been applied to various clinical decision-making problems such as pressure ulcer risk prediction, patient problem lists, and personalized medical care. To predict pressure ulcer formation (localized skin and tissue damage caused by shear, friction, pressure, or any combination of these factors), researchers [ 62 ] developed two classification-based predictive models: one included all 14 features (including age, sex, course, anesthesia, body position during operation, and skin status) and the other, a reduced model, included significant features only (5 in the DT model; 7 in the SVM, LR, and Mahalanobis Taguchi System models). The Mahalanobis Taguchi System (MTS), SVM, DT, and LR were used for classification and, in the reduced model, for feature selection. LR and SVM performed slightly better when all features were included, but MTS achieved better sensitivity and specificity in the reduced model (+10% to +15%). These machine learning techniques can provide better assistance in pressure ulcer risk prediction than the traditional Norton and Braden medical scales [ 62 ]. Though the study demonstrates the advantages of data mining algorithms, the data set was imbalanced, with only 8 cases of pressure ulcer among 168 patients. Also addressing pressure ulcers, another team of researchers [ 63 ] recommended a data mining based alternative to the Braden scale for prediction. They applied data mining algorithms to four years of longitudinal patient data to identify the most important predictors of pressure ulcers (i.e., days of stay in the hospital, serum albumin, and age). In terms of the C-statistic, RF (0.83) provided the highest predictive accuracy over DT (0.63), LR (0.82), and multivariate adaptive regression splines (0.78).

For data mining algorithms, which often perform poorly on imbalanced data (i.e., where one class occurs rarely compared to the others), researchers [ 70 ] developed a sub-sampling technique. They designed two experiments, one with the sub-sampling technique and one without. On a highly imbalanced data set, Random Forest (RF), SVM, and Bagging and Boosting achieved better classification accuracy with sub-sampling when classifying eight diseases (male genital disease, testis cancer, encephalitis, aneurysm, breast cancer, peripheral atherosclerosis, and diabetes mellitus) that each had less than 5% occurrence in the National Inpatient Sample (NIS) data of the Healthcare Cost and Utilization Project (HCUP). Possibly because sub-sampling balanced the dataset, RF slightly outperformed (+0.01 AUC) the other two methods.
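The simplest form of the sub-sampling idea, randomly drawing the majority class down to the minority size before training, can be sketched as follows (illustrative only; the authors' exact scheme is not specified here, and the data are synthetic):

```python
# Random under-sampling of the majority class to a 50/50 training set.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(42)
X = rng.normal(size=(1000, 6))
y = (rng.random(1000) < 0.05).astype(int)   # ~5% positive class, as in NIS

pos = np.flatnonzero(y == 1)
neg = np.flatnonzero(y == 0)
neg_sample = rng.choice(neg, size=len(pos), replace=False)
idx = np.concatenate([pos, neg_sample])

X_bal, y_bal = X[idx], y[idx]
clf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_bal, y_bal)
balance = y_bal.mean()   # exactly 0.5 after sub-sampling
```

Discarding majority examples trades some information for a classifier that is no longer dominated by the majority class; repeating the draw and averaging (as bagging-style methods do implicitly) recovers much of the lost information.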

The patient problem list is a vital component of clinical medicine: it enables decision support and quality measurement, but it is often incomplete. Researchers [ 64 ] have suggested that a complete problem list leads to better quality treatment in terms of final outcome. Complete problem lists give clinicians a better understanding of the issue and influence diagnostic reasoning. One group of researchers proposed a data mining model to find associations between patient problems and prescribed medications and laboratory tests, which can support clinical decision-making [ 64 ]. Domain experts currently spend a large amount of time on this task, and association rule mining can save both time and other resources. Additionally, unstructured data such as doctors' and nurses' written comments and notes can provide further information. These association rules can help clinicians prevent diagnostic errors and reduce treatment complexity: for example, if a set of problems and medications frequently co-occur and a clinician knows this relation, he/she can prescribe similar medications when faced with a similar set of problems. One group of researchers [ 61 ] developed an approach that achieved 90% accuracy in finding associations between medications and problems, and 55% accuracy between laboratory tests and problems. Among outpatients diagnosed with respiratory infection, 92.79% were treated with drugs, and physicians could choose any of the 100,013 drugs available in the inventory. In an attempt to examine treatment plan patterns, the researchers identified the 78 most commonly prescribed drugs, regardless of patients' complaints and demography. The classification model used to identify the most common drugs achieved 74.73% accuracy and, most importantly, found that variables like age, race, gender, and patient complaints were insignificant.

Personalized medicine, treatment tailored to a patient's predicted response or risk of disease, is another venue for data mining algorithms. One group of researchers [ 66 ] used a big data framework to create a personalized care system: a patient's medical history is compared with other available patient data, the likelihood of each disease for that individual is calculated, and all candidate diseases are ranked from high to low risk. This approach is very similar to how online giants Netflix and Amazon suggest movies and books to customers [ 66 ]. Another group of researchers [ 67 ] used Electronic Patient Records (EPR), which contain structured data (e.g., disease codes) and unstructured data (e.g., notes and comments made by doctors and nurses at different stages of treatment), to develop personalized care. From the unstructured text, the researchers extracted clinical terms and mapped them to an ontology. Using these mapped codes and the existing structured data (disease codes), they created a phenotypic profile for each patient. Patients were divided into clusters (with 87.78% precision) based on the similarity of their phenotypic profiles. Correlations between diseases were captured by counting the co-occurrences of two or more diseases in patient phenotypes. The protein/gene structures associated with the diseases were then identified and a protein network created; correlations were identified from diseases sharing specific protein structures.

Among patients with asthma, researchers [ 65 ] used environmental and patient physiological data to develop a prediction model for asthma attacks, giving doctors and patients a chance at prevention. They combined data from a home-care institute, where patients entered their physical condition online, with environmental data (air pollutant and weather data). Their data mining model involved feature selection through sequential pattern mining and risk prediction using DT and association rule mining, and predicted asthma attack risk with 86.89% accuracy. Real-world implementation showed that patients found the risk predictions helpful for avoiding severe asthma attacks.

Among patients with Parkinson's disease, researchers [ 73 ] introduced a comprehensive end-to-end protocol for characterization, manipulation, processing, cleaning, analysis, and validation of complex and heterogeneous data. Specifically, the researchers used the Synthetic Minority Over-sampling Technique (SMOTE) to rebalance the data set, which improved SVM's classification accuracy from 76% to 96% and AdaBoost's from 96% to 99%. Moreover, the study found that traditional statistical classification approaches (e.g., the generalized linear model) failed to generate reliable predictions, whereas machine learning based classification methods performed very well in terms of predictive precision and reliability.
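Unlike sub-sampling, SMOTE adds synthetic minority examples by interpolating between a minority point and one of its minority-class nearest neighbors. A pure-NumPy sketch of that idea (illustrative, not the protocol's implementation; all parameters are assumptions):

```python
# SMOTE-style oversampling: each synthetic point lies on the segment
# between a minority sample and one of its k minority-class neighbors.
import numpy as np
from sklearn.neighbors import NearestNeighbors

def smote_like(X_min, n_new, k=5, seed=0):
    rng = np.random.default_rng(seed)
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X_min)
    _, idx = nn.kneighbors(X_min)                 # idx[:, 0] is the point itself
    base = rng.integers(0, len(X_min), n_new)     # pick a minority sample
    nbr = idx[base, rng.integers(1, k + 1, n_new)]  # pick one of its neighbors
    gap = rng.random((n_new, 1))                  # interpolation fraction
    return X_min[base] + gap * (X_min[nbr] - X_min[base])

X_min = np.random.default_rng(1).normal(size=(20, 3))
X_new = smote_like(X_min, n_new=40)
```

Because the synthetic points are interpolations, they stay inside the region occupied by the minority class rather than merely duplicating existing samples, which is what gives SMOTE its edge over naive oversampling.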

Among patients with kidney disease, researchers [ 71 ] developed a model to predict survival. Data collected from four facilities of the University of Iowa Hospitals and Clinics contained 188 patients with over 707 visits and features such as blood pressure measures, demographic variables, and dialysis solution contents. The data were transformed using functional relations between features (i.e., when two features take the same values for a set of patients, they are combined into a single feature). The data set was randomly divided into eight sub-sets, and sixteen classification rules were generated for them using two classification algorithms, Rough Set (RS) and DT. The classes represented survival beyond three years, survival less than three years, and undetermined. To make predictions, each of the 16 classification rules had one vote and the majority vote decided the final predicted class. The transformed data increased predictive accuracy by 11% over the raw data, and DT (67% accuracy) performed better than RS (56% accuracy). The researchers suggested that this type of predictive analysis can be helpful for personalized treatment selection, resource allocation for patients, and clinical study design. Among patients on kidney dialysis, another group of researchers [ 74 ] applied temporal pattern mining to biochemical data to predict hospitalization. Their results showed that the amount of albumin, a protein circulating in the blood, is the most important predictor of hospitalization due to kidney disease.

Among patients over 50 years of age, researchers [ 75 ] developed a data mining model to predict five-year mortality using the EHRs of 7463 patients. They used an Ensemble Rotating Forest algorithm with alternating decision trees to classify the patients into two life-expectancy classes: (1) less than five years and (2) five years or more. Age, comorbidity count, previous hospitalization record, and blood urea nitrogen were among the significant features selected by correlation feature selection with a greedy stepwise search. The accuracy achieved by this approach (AUC 0.86) was greater than that of the standard modified Charlson Index (AUC 0.81) and the modified Walter Index (AUC 0.78). The study showed that age, hospitalization prior to the visit, and highest blood urea nitrogen were the most important factors for predicting five-year mortality. Such a model can help allocate resources such as cancer screening to the patients most likely to benefit from them.

Another group of researchers [ 76 ] addressed the limitations of existing software technologies for disease diagnosis and prognosis, such as the inability to handle data streams (DT), impracticality for complex and large systems (Bayesian Network), and exhaustive training processes (NN). To overcome these restrictions, the authors proposed a decision tree based algorithm called the Very Fast Decision Tree (VFDT). Comparison with a similar system developed by IBM showed that VFDT uses fewer system resources and can perform real-time classification.
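The quantity that lets VFDT-style Hoeffding trees learn from a stream is the Hoeffding bound: after n samples, the observed mean of a statistic with range R is within eps of its true mean with probability 1 - delta, so a split can be committed as soon as the top two candidate attributes differ by more than eps. The standard formula (not code from [ 76 ]):

```python
# Hoeffding bound: eps = sqrt(R^2 * ln(1/delta) / (2n)).
# The bound shrinks as more stream examples arrive, so split decisions
# made on a finite prefix converge to those of a batch learner.
import math

def hoeffding_bound(R, delta, n):
    return math.sqrt((R * R * math.log(1.0 / delta)) / (2.0 * n))

# e.g., information gain with two classes has range R = 1
eps_small_n = hoeffding_bound(R=1.0, delta=1e-7, n=100)
eps_large_n = hoeffding_bound(R=1.0, delta=1e-7, n=100000)
```

This is why VFDT needs so few resources: it never stores or revisits past examples, only sufficient statistics per leaf, and splits once the bound is satisfied.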

Researchers have also used data mining to optimize the glaucoma diagnosis process [ 68 ]. Traditional approaches, including Optical Coherence Tomography, Scanning Laser Polarimetry (SLP), and Heidelberg Retina Tomography (HRT) scanning, are costly. This group instead used less costly fundus image data and classified each patient as normal or glaucomatous using an SVM classifier. Before classification, the authors selected significant features using Higher Order Spectra (HOS) and the Discrete Wavelet Transform (DWT), both combined and separately. Several kernel functions for the SVM, all delivering similar levels of accuracy, were tried, and the approach produced 95% accuracy in glaucoma prediction. For diagnostic evaluation of chest imaging with suspicion of malignancy, researchers [ 69 ] designed trigger criteria to identify potential follow-up delays. The developed trigger filtered out patients who did not require follow-up evaluation, and experimental results indicated that the algorithm identifies delays in following up abnormal imaging effectively, with 99% sensitivity and 38% specificity.

Data mining has also been applied [ 72 ] to compare three surveillance approaches for identifying healthcare-associated infections (catheter-associated bloodstream infections, catheter-associated urinary tract infections, and ventilator-associated pneumonia). Researchers compared traditional surveillance using National Healthcare Safety Network methodology, data mining using MedMined Data Mining Surveillance (CareFusion Corporation, San Diego, CA, USA), and administrative coding using ICD-9-CM. Traditional surveillance proved superior to data mining in terms of sensitivity, positive predictive value, and rate estimation.

Data mining has been used in 38 studies of clinical decision-making: CVD (seven articles), diabetes (seven articles), cancer (five articles), emergency care (two articles), intensive care (two articles), and other applications (16 articles). Most of the studies developed predictive models to facilitate decision-making, and some developed decision support systems or tools. Authors often tested their models with multiple algorithms; SVM was at the top of that list and often outperformed other algorithms. However, 15 of the studies [ 38 , 40 , 42 , 45 , 47 , 51 , 54 , 56 , 58 , 60 , 61 , 66 , 73 , 74 , 76 ] did not incorporate expert opinion from doctors, clinicians, or other appropriate healthcare personnel when building models and interpreting results (see the study characteristics in Supplementary Materials Table S3 ). We also noted an absence of follow-up studies on the predictive models: how the models performed in dynamic decision-making situations, whether doctors and healthcare professionals are comfortable using them, and what challenges, if any, arise in implementing them. The existing literature does not address these salient issues.

4.2. Healthcare Administration

Data mining was applied to administrative purposes in healthcare in 32% (29 articles) of the articles reviewed. Researchers have applied data mining to data warehousing and cloud computing; quality improvement; cost reduction; resource utilization; patient management; and other areas. Table 6 lists these articles with their major focus areas, the problems analyzed, and the data sources.

Problems analyzed and data sources in healthcare administration.

4.2.1. Data Warehousing and Cloud Computing

Data warehousing [ 90 ] and cloud computing are used to securely and cost-effectively store the growing volume of electronic patient data [ 1 ] and to improve hospital outcomes, including readmissions. To identify causes of readmission, researchers [ 89 ] developed an open source software package, the Analytic Information Warehouse (AIW). Users can design a virtual data model (VDM) with this software, extract the data required to test the model from the warehouse in terms of a temporal ontology, and perform the analysis with any standard tool. Another group of researchers took a similar approach to develop a Clinical Data Warehouse (CDW) for traditional Chinese medicine (TCM). The warehouse contains clinical information (e.g., symptoms, diseases, and treatments) for 20,000 inpatients and 20,000 outpatients, collected in structured electronic form using a pre-specified ontology. The CDW provides an interface for online data mining, online analytical processing (OLAP), and network analysis to discover knowledge and provide clinical decision support. Using these tools, classification, association, and network analysis between symptoms, diseases, and medications (i.e., herbs) can be performed.

Apart from clinical purposes, data warehouses can be used for research, training, education, and quality control. One such data repository was created using the basic idea of the Google search engine [ 92 ]: following a predefined patient privacy protocol, users can pull radiology report files by searching keywords, much like a simple Google search. Another data repository was created as part of a collaborative study between IBM, the University of Virginia, and its partner, the Virginia Commonwealth University Health System [ 93 ]. The repository contains 667,000 patient records with 208 attributes. HealthMiner, a data mining package for healthcare created by IBM, was used to perform unsupervised analyses such as association finding, pattern discovery, and knowledge discovery, and the study showed the research benefits of this type of large data repository. Researchers [ 91 ] also proposed a framework based on cloud computing and big data to unify data collected from different sources such as public databases and personal health devices. The architecture was divided into three layers: the first unified heterogeneous data from different sources, the second provided storage support and facilitated data processing and analytics access, and the third presented analysis results and a platform for professionals to develop analytical tools. Other researchers [ 94 ] used mobile devices to collect personal health data: users took part in a survey on their mobile devices and received a diagnosis report based on the health parameters they entered. Each survey record was saved in a cloud-based interface for effective storage and management, and interactive geo-spatial maps were developed from the stored input to provide effective data visualization.

4.2.2. Healthcare Cost, Quality and Resource Utilization

Ten articles applied data mining to cost reduction, quality improvement, and resource utilization. One group of researchers predicted healthcare costs using an algorithmic approach [ 96 ]. They used medical claims data for 800,000 people collected by an insurance company over the period 2004–2007; the data included diagnoses, procedures, and drugs. Using classification and clustering algorithms, they found that these data mining methods reduced the absolute prediction error by more than 16%. Two prediction models were developed, one using both cost and medical information and the other using only cost information. Both models had similar accuracy in predicting healthcare costs and performed better than traditional regression methods; notably, including medical information did not improve cost prediction accuracy. Risk-adjusted healthcare cost predictions, with diagnostic groups and demographic variables as inputs, have also been assessed using regression tree boosting [ 100 ]. Boosted regression trees and main-effects linear models were fitted to predict current (2001) and prospective (2002) total healthcare costs per patient. The authors concluded that the combination of regression tree boosting and a diagnostic grouping scheme is a competitive alternative to commonly used risk-adjustment systems.
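Regression tree boosting, as used in [ 100 ], repeatedly fits small trees to the residuals of the current ensemble. The toy sketch below illustrates the idea with one-split "stumps" in plain Python; the feature names, patient data, and hyperparameters are illustrative assumptions, not details from the study.

```python
# Toy gradient-boosting sketch: each round fits a one-split regression
# "stump" to the residuals of the current ensemble, then adds a shrunken
# copy of it. Data are hypothetical [age, diagnosis count] -> annual cost.

def fit_stump(X, resid):
    """Find the single-feature threshold split minimizing squared error."""
    best = None
    n = len(X)
    for j in range(len(X[0])):
        for t in sorted({row[j] for row in X}):
            left = [resid[i] for i in range(n) if X[i][j] <= t]
            right = [resid[i] for i in range(n) if X[i][j] > t]
            if not left or not right:
                continue
            lm, rm = sum(left) / len(left), sum(right) / len(right)
            err = (sum((v - lm) ** 2 for v in left)
                   + sum((v - rm) ** 2 for v in right))
            if best is None or err < best[0]:
                best = (err, j, t, lm, rm)
    return best[1:]  # (feature index, threshold, left mean, right mean)

def boost(X, y, rounds=200, lr=0.1):
    base = sum(y) / len(y)          # start from the mean cost
    pred = [base] * len(X)
    stumps = []
    for _ in range(rounds):
        resid = [y[i] - pred[i] for i in range(len(y))]
        j, t, lm, rm = fit_stump(X, resid)
        stumps.append((j, t, lm, rm))
        for i in range(len(X)):     # shrink each stump's contribution
            pred[i] += lr * (lm if X[i][j] <= t else rm)
    return base, stumps

def predict(base, stumps, x, lr=0.1):
    p = base
    for j, t, lm, rm in stumps:
        p += lr * (lm if x[j] <= t else rm)
    return p

X = [[45, 1], [60, 3], [70, 5], [50, 2], [80, 6], [30, 0]]
y = [1200.0, 4500.0, 9800.0, 2100.0, 15000.0, 600.0]
base, stumps = boost(X, y)
```

After enough rounds the ensemble reproduces the training costs closely; in practice a library implementation (and a held-out set) would be used instead of this exhaustive search.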

A sizable amount ($37.6 billion) of healthcare costs is attributable to medical errors, 45% of which stems from preventable errors [ 95 ]. To aid physician decision-making and reduce medical errors, researchers [ 95 ] proposed a data mining-based framework, the Sequential Clustering Algorithm. They identified patterns in treatment plans, tests, medication types and dosages prescribed for specific diseases, and other services provided to treat a patient throughout his/her hospital stay. The framework was based on cloud computing so that the knowledge extracted from the data could be shared among hospitals without sharing the actual records; the authors proposed sharing models as Virtual Machine (VM) images to facilitate collaboration among international institutions and prevent the threat of data leakage. The model was implemented in two hospitals, one in Taiwan and one in Mongolia. To identify best practices for specific diseases and prevent medical errors, another group of researchers [ 101 ] proposed a decision support system that uses text and data mining to extract information from online documents, focusing on evidence-based management, quality control, and best-practice recommendations for medical prescriptions.

Length of Stay (LOS) is another important indicator of cost and quality of care, and accurate prediction of LOS can lead to efficient management of hospital beds and resources. To predict LOS for CAD patients, researchers [ 98 ] compared multiple models: SVM, ANN, DT, and an ensemble algorithm combining SVM, C5.0, and ANN. The ensemble algorithm and SVM produced the highest accuracies, 95.9% and 96.4% respectively; DT achieved 83.5%, while ANN was least accurate at 53.9%. Anticoagulant drugs, nitrate drugs, and diagnosis were the top three predictors, along with diastolic blood pressure, marital status, sex, presence of comorbidity, and insurance status.

To predict healthcare quality, researchers [ 104 ] used sentiment analysis (computationally categorizing opinions as positive, negative, or neutral) on patients’ online comments about their experience. They found more than 80% agreement on quality measures (e.g., cleanliness, good behavior, recommendation) between sentiment analysis of online forums and traditional paper-based surveys. The proposed approach can be an inexpensive alternative to traditional surveys and reports for measuring healthcare quality.
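The study [ 104 ] does not publish its classifier, but the core idea of categorizing comments can be sketched with a simple lexicon-based scorer; the tiny word lists and example comments below are purely illustrative assumptions.

```python
# Minimal lexicon-based sentiment sketch: count positive vs. negative cue
# words in a patient comment and label it. Real systems use far richer
# lexicons or trained models; this only illustrates the categorization step.

POSITIVE = {"clean", "friendly", "helpful", "recommend", "excellent", "caring"}
NEGATIVE = {"dirty", "rude", "slow", "painful", "avoid", "unhelpful"}

def classify_comment(text):
    words = text.lower().replace(".", " ").replace(",", " ").split()
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

comments = [
    "The ward was clean and the nurses were friendly. I would recommend it.",
    "Rude staff and a dirty waiting room. Avoid this place.",
    "I had an appointment on Tuesday.",
]
labels = [classify_comment(c) for c in comments]
```

Aggregating such labels per hospital is what allows comparison against survey-based quality scores.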

Identifying influential factors in insurance coverage through data mining can help insurance providers and regulators design targeted or additional services and allocate resources properly to increase coverage rates. Researchers [ 103 ] used data mining techniques to develop a classification model for health insurance coverage. Based on 23 socio-economic, lifestyle, and demographic factors, they built a two-class model (insured and uninsured) using ANN and DT; ANN was 4% more accurate than DT in predicting coverage. Among the factors, income, employment status, education, and marital status were the most important predictors of insurance coverage.

Among patients with lung cancer, researchers [ 97 ] investigated healthcare resource utilization (i.e., the number of visits to medical oncologists). They used DT, ANN, and LR separately, and an ensemble algorithm combining DT and ANN yielded the greatest accuracy (60% predictive accuracy); DT was employed to identify the important predictive features (among demographics, diagnosis, and other medical information) and ANN for classification. Data mining revealed that lung cancer patients’ utilization of healthcare resources is “supply-sensitive and patient sensitive”, where supply represents the availability of resources in a region and patient represents patient preference and comorbidity. A resource allocation monitoring model for better management of a primary healthcare network has also been developed [ 99 ]. The researchers modeled the primary-care network as a collection of hierarchically connected modules, given that patients could visit multiple physicians and physicians could have multiple care locations, an indication of imbalanced resource distribution (e.g., number of physicians, care locations). The first level of the hierarchy consisted of three modules: health activities, population, and health resources. The second level monitored healthcare provider availability and dispersion. The third level considered actual visits, physicians and their availability and accessibility, and unlisted patients (i.e., those without an assigned physician). The top level conducted an overall assessment of the network and made allocations accordingly. This hierarchical model was developed for a specific region in Slovenia; however, it could easily be adapted to any other region.

Overuse of screening and tests by physicians also contributes to inefficiencies and excess costs [ 102 ], and current practice in pathology diagnosis is limited by its disease focus. As an alternative to the disease-based system, researchers [ 102 ] combined data mining with case-based reasoning to develop an evidence-based decision support system that decreases the use of unnecessary tests and reduces costs.

4.2.3. Patient Management

Patient management involves activities related to efficiently scheduling and providing care to patients during their stay in a healthcare institution. Researchers [ 105 ] developed an efficient scheduling system for a rural free clinic in the United States. They proposed a hybrid system in which data mining was used to classify patients and association rule mining was used to assign each patient a “no-show” probability; the results were then used to simulate and evaluate different scheduling techniques. Clinic visits can also be divided into administrative and medical visits. Researchers [ 108 ] observed that patients who visit a health center for administrative purposes take less time than patients with medical needs, and proposed a predictive model to forecast the number of administrative visits. Their model improved the scheduling system, saving 21.73% of scheduled time (660,538 min). In contrast to administrative information/task-seeking patients, some patients seek medical care very frequently and consume a large share of the clinical workload [ 107 ]. Identifying risk factors for frequent visits to health centers can help reduce costs and resource utilization. A study among 85 working-age “frequent attenders” identified the primary risk factors using a Bayesian classification technique: “high body mass index, alcohol abstinence, irritable bowel syndrome, low patient satisfaction, and fear of death” [ 107 ].
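The "no-show" probability in [ 105 ] comes from association rule mining over past visit records. A minimal sketch of that idea, under the assumption that the rule's confidence is read as the no-show probability for a patient class (the attributes and records below are invented for illustration):

```python
# Support/confidence of an association rule antecedent -> consequent over
# past visit records, e.g. {weekday=Mon, new_patient} -> no_show.

def rule_stats(records, antecedent, consequent):
    """Return (support, confidence) of the rule antecedent -> consequent."""
    matches = [r for r in records
               if all(r.get(k) == v for k, v in antecedent.items())]
    hits = [r for r in matches
            if all(r.get(k) == v for k, v in consequent.items())]
    support = len(hits) / len(records)
    confidence = len(hits) / len(matches) if matches else 0.0
    return support, confidence

visits = [
    {"weekday": "Mon", "new_patient": True,  "no_show": True},
    {"weekday": "Mon", "new_patient": True,  "no_show": True},
    {"weekday": "Mon", "new_patient": True,  "no_show": False},
    {"weekday": "Tue", "new_patient": False, "no_show": False},
    {"weekday": "Wed", "new_patient": False, "no_show": False},
    {"weekday": "Mon", "new_patient": False, "no_show": False},
]
sup, conf = rule_stats(visits,
                       {"weekday": "Mon", "new_patient": True},
                       {"no_show": True})
# conf acts as the estimated no-show probability for that patient class
```

A scheduler can then overbook slots whose expected no-show rate (confidence) exceeds a threshold.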

Improving publicly reported patient safety outcomes is also critical to healthcare institutions. Falls are one such outcome and are the most common and costly source of injury during hospitalization [ 110 ]. Researchers [ 109 ] analyzed the important factors related to patient falls during hospitalization. The authors first selected significant features with a Chi-square test (10 of 72 fall-related variables) and then applied ANN to develop a predictive model, achieving an AUC of 0.77; stepwise logistic regression with 3 important variables achieved an AUC of 0.42. Both models showed that fall assessment by nurses and the use of anti-psychotic medication are associated with a lower risk of falls, while the use of diuretics is associated with an increased risk of falls. Another group of researchers [ 110 ] used fall-related injury data to validate the structured information in the EMR against clinical notes with the help of text mining. A group of nurses manually reviewed the electronic records to separate correct documents from erroneous ones, and this labeling served as the basis of comparison. The authors employed both a supervised technique (using a portion of the manually labeled files as the training set) and an unsupervised technique (ignoring the file labels) to classify and cluster the records. The unsupervised technique failed to separate the correct documents from the erroneous ones, whereas the supervised technique performed better, placing 86% of the correct documents in one cluster. This method could be applied to semi-automate the EMR entry system.
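The first stage of the fall study [ 109 ], Chi-square screening of candidate variables, can be sketched as follows. The feature names, toy records, and the decision to treat every feature as binary are illustrative assumptions; the study used 72 variables and standard statistical software.

```python
# Chi-square screening of binary features against a binary outcome:
# build the 2x2 contingency table per feature and rank by the statistic.

def chi2_2x2(a, b, c, d):
    """Chi-square statistic for table [[a, b], [c, d]], no continuity correction."""
    n = a + b + c + d
    num = n * (a * d - b * c) ** 2
    den = (a + b) * (c + d) * (a + c) * (b + d)
    return num / den if den else 0.0

def screen_features(rows, features, label):
    scores = {}
    for f in features:
        a = sum(1 for r in rows if r[f] and r[label])         # feature present, fell
        b = sum(1 for r in rows if r[f] and not r[label])     # feature present, no fall
        c = sum(1 for r in rows if not r[f] and r[label])     # feature absent, fell
        d = sum(1 for r in rows if not r[f] and not r[label]) # feature absent, no fall
        scores[f] = chi2_2x2(a, b, c, d)
    return sorted(scores, key=scores.get, reverse=True)

patients = [
    {"antipsychotic": 1, "diuretic": 1, "fell": 1},
    {"antipsychotic": 0, "diuretic": 1, "fell": 1},
    {"antipsychotic": 0, "diuretic": 1, "fell": 1},
    {"antipsychotic": 1, "diuretic": 0, "fell": 0},
    {"antipsychotic": 1, "diuretic": 0, "fell": 0},
    {"antipsychotic": 0, "diuretic": 0, "fell": 0},
]
ranked = screen_features(patients, ["antipsychotic", "diuretic"], "fell")
```

Only the top-ranked features would then be fed into the ANN, which is exactly the two-stage pipeline the paper describes.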

4.2.4. Other Applications

Data mining has been applied [ 111 ] to investigate the relationship between physicians’ training at specific schools, the procedures they perform, and the costs of those procedures. Researchers explored this relationship at three levels: (1) the distribution of procedures performed; (2) the relationship between the procedures a physician performs and their alma mater (the institution the physician attended or graduated from); and (3) the geographic distribution of amounts billed and payments received. The study suggested that medical school training does relate to practice in terms of procedures performed and bills charged. Patients can also provide useful information about physicians and their performance: another group of researchers [ 112 ] used a topic modeling algorithm, Latent Dirichlet Allocation (LDA), to understand patients’ reviews of physicians and their concerns.

Data mining has also been applied [ 115 ] to analyze the information-seeking behavior of healthcare professionals and to assess the feasibility of measuring drug safety alert response from the usage logs of online medical information resources. Researchers analyzed two years of user log-in data on the UpToDate website to measure the volume of searches associated with medical conditions and the seasonal distribution of those searches. In addition, they used a large collection of online media articles and web log posts to characterize food and drug alerts through changes in UpToDate search activity relative to general media activity. Some researchers [ 113 ] examined changes in key performance indicators (KPIs) and clinical workload indicators in Greek National Health System (NHS) hospitals with the help of data mining. They found significant changes in KPIs when necessary adjustments (e.g., workload) were made according to diagnostic related groups. The results held for specialized institutions such as cancer and cardiac surgery hospitals as well as for small health centers and regional hospitals. Their findings suggested that the assessment methodology for Greek NHS hospitals should be re-evaluated to identify weaknesses in the system and improve overall performance. In home healthcare, another group of researchers [ 116 ] reviewed why traditional statistical analysis fails to evaluate the performance of home healthcare agencies. The authors proposed using data mining, with length of stay and discharge destination as outcomes, to identify the drivers of home healthcare service among patients with heart failure, hip replacement, and chronic obstructive pulmonary disease.

The relationship between epidemiological and genetic evidence and post-market medical device performance has been evaluated using HCUPNet data [ 114 ]. This feasibility study explored the potential of using publicly accessible data to identify genetic evidence (e.g., comorbidity of genetic factors such as race, sex, body structure, and pneumothorax or fibrosis) related to devices, focusing on the ventilation-associated iatrogenic pneumothorax outcome in discharges involving mechanical ventilation and continuous positive airway pressure (CPAP). The results demonstrated that genetic evidence-based epidemiologic analysis could lead to cost- and time-efficient identification of predictive features. Overall, the literature on data mining applications in healthcare administration encompasses efficient patient management, healthcare cost reduction, quality of care, and data warehousing to facilitate analytics. We identified four studies that used cloud-based computing and analytical platforms. Most of the research proposed promising ideas; however, few studies report results and/or challenges during and after implementation. An ideal example of implementation could be the study of efficient appointment scheduling of patients [ 108 ].

4.3. Healthcare Privacy and Fraud Detection

Health data privacy and medical fraud are issues of prominent importance [ 118 ]. We reviewed four articles—displayed and described in Table 7—that discussed healthcare privacy and fraud detection.

List of papers in healthcare privacy and fraud detection.

The challenges of privacy protection have been addressed by a group of researchers [ 122 ] who proposed a new anonymization algorithm for both distributed and centralized anonymization. Their model retained more data utility than the K-anonymization model without losing much data privacy (for K = 20, the discernibility ratio, a normalized measure of data quality, was 0.1 for the proposed approach versus 0.4 for traditional K-anonymization), and the algorithm could handle large-scale, high-dimensional datasets. To address the limitations of today’s healthcare information systems (EHR data systems limited by lack of interoperability, data size, and security), a mobile cloud computing-based big data framework has been proposed [ 119 ]. This framework proposed storing EHR data from different healthcare providers in an Internet provider’s facility, offering providers and patients different levels of access and authority. Security would be ensured by encryption algorithms, one-time passwords, or two-factor authentication, and big data analytics would be handled using Google BigQuery or MapReduce software. This framework could reduce cost, increase efficiency, and ensure security compared to the traditional technique of de-identification or anonymization, which leaves healthcare data vulnerable to re-identification. In a case study, researchers demonstrated that hackers can make associations between small pieces of information and thereby identify patients [ 120 ]. The case study used personal information provided on two Medicare social networking sites, MedHelp and Mp and Th1, to identify an individual.
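The discernibility ratio reported for [ 122 ] can be sketched concretely. The common definition charges each equivalence class (records sharing the same generalized quasi-identifiers) a penalty of its size squared and normalizes by n²; lower values mean finer, higher-utility data. The exact normalization in the paper may differ, and the records below are invented for illustration.

```python
# Discernibility ratio: penalize each equivalence class of size s by s^2,
# normalized by n^2 so that 1.0 means "all records indistinguishable".
from collections import Counter

def discernibility_ratio(records, quasi_identifiers):
    classes = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
    n = len(records)
    return sum(s * s for s in classes.values()) / (n * n)

# Illustrative records after generalizing age to a range and ZIP to a prefix
anonymized = [
    {"age": "40-50", "zip": "021**", "diagnosis": "flu"},
    {"age": "40-50", "zip": "021**", "diagnosis": "asthma"},
    {"age": "50-60", "zip": "022**", "diagnosis": "flu"},
    {"age": "50-60", "zip": "022**", "diagnosis": "diabetes"},
]
ratio = discernibility_ratio(anonymized, ["age", "zip"])
```

Here two classes of size 2 give (4 + 4) / 16 = 0.5; a coarser generalization that merged all four records would score 1.0, showing how the metric trades privacy against utility.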

Detection of fraud and abuse (i.e., suspicious care activity, intentional misrepresentation of information, and unnecessary repetitive visits) also uses big data analytics. Using gynecological hospital data, researchers [ 121 ] developed a framework in which two domain experts manually identified features of fraudulent cases from a pool of treatment plans that doctors frequently follow. Applied to Bureau of National Health Insurance (BNHI) data from Taiwan, the proposed framework detected 69% of the fraudulent cases, improving on an existing model that detected 63%.

In summary, patient data privacy and fraud detection are of major concern given the increasing use of social media and people’s tendency to post personal information there. Existing data anonymization or de-identification techniques can become less effective if they are not designed with the fact in mind that a large portion of our personal information is now available on social media.

4.4. Mental Health

Mental illness is a global and national concern [ 123 ]. According to National Survey on Drug Use and Health (NSDUH) data from 2010 to 2012, 52.2% of the U.S. population had either a mental illness or substance abuse/dependence [ 124 ]. Additionally, nearly 30 million people in the U.S. suffer from anxiety disorders [ 125 ]. Table 8 summarizes the four articles we reviewed that apply data mining to analyzing, diagnosing, and treating mental health issues.

List of data mining application in mental health with data sources.

To classify developmental delays of children based on illness, researchers [ 126 ] examined the association between illness diagnoses and delays by building a decision tree and finding associations among cognitive, language, motor, and social-emotional developmental delays. This study has implications for healthcare professionals seeking to identify and intervene on delays at an early stage. To assist physicians in monitoring anxiety disorder, another group of researchers [ 125 ] developed data mining-based personalized treatment. The researchers used context awareness information, both static (personal information such as age, sex, and family status) and dynamic (stress, environmental, and symptom context), to build static and dynamic user models. The static model contained personal information, and the dynamic model contained four treatment-supportive services (i.e., lifestyle and habits pattern detection, context and stress level pattern detection, symptoms and stress level pattern detection, and stress level prediction). Relations between dynamic parameters were identified in the first three services, and the last service was used to predict stress level under different scenarios. The model was validated using data from 27 volunteers selected by an anxiety-measuring test.

To enable early diagnosis of mental disorders (e.g., insomnia, dementia), researchers developed a model that detects abnormal physical activity recorded by a wearable device [ 127 ]. They performed two experiments that differed in how the reference model of a user’s physical movement was built. In the first experiment, users wore the watch for one day, and a reference behavior model was developed from that day; after 22 days, each user wore it again for a day, and an abnormality was flagged if the user’s activities differed significantly from the reference model. In the second experiment, users wore the watch regularly for one month, and abnormality was detected with a fuzzy valuation function. In both experiments, users manually reported their activity levels, which served as the validation reference; only two out of 26 abnormal events went undetected. Based on these two experiments, the researchers claimed that their model, which detected 92% of the unusual events, could be useful for both online and offline abnormal behavior detection.

To classify schizophrenia, another study [ 128 ] used free speech (transcribed text) written or verbalized by psychiatric patients. In a pool of patients with schizophrenia and control subjects, supervised algorithms (SVM and DT) discriminated between the two groups, with SVM achieving 77% classification accuracy and DT 78%. However, when patients with mania were added to the pool, the models could no longer distinguish the patients with schizophrenia.

Using data analytics to diagnose, analyze, or treat mental health patients is quite different from applying analytics to predict cancer or diabetes. The context of the data (static, dynamic, or unobservable environment) appears more important than volume in this case [ 125 ]; however, this is not always reflected in the literature. A model without situational awareness (a context-independent model) may lose predictive accuracy due to the confounding effect of the surrounding environment [ 129 ].

4.5. Public Health

Seven articles addressed issues that were not limited to any specific disease or demographic group, which we classified as public health problems. Table 9 contains the list of papers considering public health problems with data sources.

List of data mining application in public health with data sources.

To make data mining accessible to non-expert users, specifically public health decision makers who manage public cancer treatment programs in Brazil, researchers [ 134 ] developed a framework for an automated data mining system. The system performs descriptive analysis (i.e., identifying relationships between demography, expenditure, and tumor or cancer type) for public decision makers with little or no technical knowledge. Automation was achieved by creating a pre-processed database, an ontology, an analytical platform, and a user interface.

Data analytics has also been applied to the analysis of disease outbreaks [ 131 , 133 ]. Influenza, a highly contagious disease, is associated with seasonal outbreaks, and the ability to predict peak outbreaks in advance would allow anticipatory public health planning and interventions to lessen their effect. To predict peak influenza visits to U.S. military health centers, researchers [ 131 ] developed a method for building models from environmental and epidemiological data. They compared six classification algorithms: One-Classifier 1, One-Classifier 2 [ 137 ], a fusion of the One-Classifiers, DT, RF, and SVM. Among them, One-Classifier 1 was the most effective with an F-score of 0.672, and SVM was second best with an F-score of 0.652. To examine the factors that drive public and professional search patterns during infectious disease outbreaks, another group of researchers [ 133 ] used online behavior records and media coverage. They identified distinct factors driving professional and layperson search patterns, with implications for how public health agencies tailor messaging during outbreaks and emergencies.

Researchers [ 130 ] proposed an intelligent information management framework to store and integrate multidimensional and heterogeneous data (e.g., diabetes, food, and nutrient data); it was applied to diabetes management but is generalizable to other diseases. Their methodology serves as a robust back-end application for web-based patient-doctor consultation and e-health care management systems, with implications for cost savings.

A real-time medical emergency response system based on Internet of Things (networked devices that facilitate data flow) body area networks (BANs), wireless networks of wearable computing devices, was proposed by researchers [ 136 ]. The system includes “Intelligent Building”, a data analysis model that processes the data collected from the sensors for analysis and decision-making. Though the authors claimed that the proposed system could efficiently process wireless BAN data from millions of users to provide real-time emergency response, they did not provide any comparison with state-of-the-art methods.

Decision support tools for regional health institutes in Slovenia [ 135 ] have been developed using descriptive data mining methods and visualization techniques. These visualizations can show resource availability and utilization and assist in future planning of public health services.

To build better customer relationship management at an Iranian hospital, researchers [ 132 ] applied data mining techniques to demographic and transaction information. The authors extended the traditional Recency, Frequency, and Monetary (RFM) model with a new parameter, “Length”, to estimate the customer lifetime value (CLV) of each patient. Patients were separated into classes according to estimated CLV using a combination of clustering and classification algorithms; DT and ANN performed similarly, with approximately 90% classification accuracy. This type of stratification of patient groups by CLV can help hospitals introduce new marketing strategies to attract new customers and retain existing ones.
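The extended "LRFM" model scores each patient on Length, Recency, Frequency, and Monetary value before clustering. A minimal sketch of the scoring step, with invented thresholds, day indices, and patients (the study's actual segmentation used clustering rather than fixed cut-offs):

```python
# LRFM sketch: score each patient on Length, Recency, Frequency, and
# Monetary value, then map the score to a coarse value segment.

def lrfm_score(p, today):
    length = today - p["first_visit"]        # days since first visit
    recency = today - p["last_visit"]        # days since last visit
    score = 0
    score += 1 if length >= 365 else 0       # long-standing patient
    score += 1 if recency <= 90 else 0       # recently active
    score += 1 if p["visits"] >= 10 else 0   # frequent visitor
    score += 1 if p["spend"] >= 5000 else 0  # high monetary value
    return score

def segment(score):
    return {4: "high", 3: "high", 2: "medium"}.get(score, "low")

today = 10_000  # day index, e.g. days since some epoch
patients = {
    "A": {"first_visit": 9000, "last_visit": 9950, "visits": 14, "spend": 8000},
    "B": {"first_visit": 9900, "last_visit": 9990, "visits": 2,  "spend": 300},
}
segments = {k: segment(lrfm_score(p, today)) for k, p in patients.items()}
```

In the study, these four dimensions feed a clustering step (and then DT/ANN classification) instead of hand-picked thresholds, but the per-patient feature construction is the same.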

The application of data mining to public health decision-making has become increasingly common. Researchers have used data mining to design healthcare programs and emergency response, to analyze resource utilization and patient satisfaction, and to develop automated analytics tools for non-expert users. Continuing this effort could lead to a patient-centered, robust healthcare system.

4.6. Pharmacovigilance

Pharmacovigilance involves post-marketing monitoring and detection of adverse drug reactions (ADRs) to ensure patient safety [ 138 ]. The estimated annual social cost of ADR events exceeds one billion dollars, making pharmacovigilance an important part of the healthcare system [ 139 ]. Characteristics of the nine papers addressing pharmacovigilance are displayed in Table 10 .

List of data mining application in pharmacovigilance with data sources.

Researchers considered muscular and renal AEs caused by pravastatin, simvastatin, atorvastatin, and rosuvastatin by applying data mining techniques to the FDA’s Adverse Event Reporting System (FAERS) database reports from 2004 to 2009 [ 143 ]. They found that all statins except simvastatin were associated with muscular AE; rosuvastatin had the strongest association. All statins, besides atorvastatin, were associated with acute renal failure. The criteria used to identify significant association were: proportional reporting ratio (PRR), reporting odds ratio (ROR), information component (IC), and empirical Bayes geometric mean (EBGM). In another study of AEs related to statin family, researchers used a Korean claims database [ 145 ] and showed that a relative risk-based data-mining approach successfully detected signals for rosuvastatin.
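Two of the disproportionality measures named above, PRR and ROR, are simple ratios over a 2×2 table of spontaneous reports. The sketch below shows both; the counts are illustrative, not FAERS values, and real analyses also compute confidence intervals around these point estimates.

```python
# Disproportionality measures from a 2x2 table of spontaneous reports:
#   a = reports with the drug and the event   b = drug, other events
#   c = other drugs with the event            d = other drugs, other events

def prr(a, b, c, d):
    """Proportional reporting ratio: P(event | drug) / P(event | other drugs)."""
    return (a / (a + b)) / (c / (c + d))

def ror(a, b, c, d):
    """Reporting odds ratio: odds of the event on the drug vs. other drugs."""
    return (a / b) / (c / d)

a, b, c, d = 30, 970, 100, 98_900  # illustrative counts
print(f"PRR = {prr(a, b, c, d):.2f}, ROR = {ror(a, b, c, d):.2f}")
```

A commonly cited screening criterion is PRR ≥ 2 with at least 3 cases, which the illustrative counts above would satisfy by a wide margin.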

Three more studies used the FDA’s AERS report database. In an examination of the ADR “hypersensitivity” to six anticancer agents, data mining showed that paclitaxel is associated with mild to lethal reactions, whereas docetaxel is associated with lethal reactions; the other four drugs were not associated with hypersensitivity [ 142 ]. Another researcher [ 139 ] argued that AEs can be caused not only by a single drug but also by a combination of drugs [ 140 ], showing that 84% of the AERS reports contain an association between at least one drug and two AEs or two drugs and one AE. Another group [ 138 ] increased precision in detecting ADRs by considering multiple data sources together, achieving an average improvement of 31% in identification by combining publicly available EHRs with the FDA’s AERS reports.

Furthermore, dose-dependent ADRs have been identified by researchers using models developed from structured and unstructured EHR data [ 141 ]. Among the top five drugs associated with ADRs, four were found to be related to dose [ 141 ]. Pharmacovigilance activity has also been prioritized using unstructured text data in EHRs [ 144 ]. In traditional pharmacovigilance, ADRs are unknown. While looking for association between a drug and any possible ADR, it is possible to get false signals. Such false signals can be avoided if a list of possible ADRs is already known. Researchers [ 144 ] developed an ordered list of 23 ADRs which can be very helpful for future pharmacovigilance activities. To detect unexpected and rare ADRs in real-world healthcare administrative databases, another group of researchers [ 146 ] designed an algorithm—Unexpected Temporal Association Rules (UTARs)—that performs more effectively than existing techniques.

We identified one study that used data outside of adverse event reports or EHR data. For early detection of ADRs, one group of researchers used online forums [ 140 ] to identify the side effects of a specific lung cancer drug, “Erlotinib”. Sentiment analysis (a technique for categorizing opinions) on data collected from different cancer discussion forums showed that 70% of users had a positive experience with the drug; the most frequently reported side effects were acne and rash. Beyond pharmacovigilance, this type of analysis can help pharmaceutical companies analyze customer feedback. Researchers can take advantage of the popularity of social media and online forums for identifying adverse events: these sources can provide AE signals more quickly than the FDA database, which takes time to update, and by the time AE reports appear in the FDA database there may already be significant damage to patients and society. Moreover, such sources can help avoid limitations of the FDA AERS database such as biased reporting and underreporting [ 141 ].

5. Theoretical Study

Twenty-five of the articles we reviewed focus on theoretical aspects of applying data mining in healthcare, ranging from database framework design, data collection, and data management to algorithm development. These contributions extend beyond the analytical perspective of data (descriptive, predictive, or prescriptive analytics) to the sectors and problems highlighted in Table 11 .

Problem analyzed in theoretical studies.

The existing theoretical literature on disease control highlights the current state of epidemics, cancer, and mental health. To help physicians make real-time decisions about patient care, one group of researchers [ 147 ] proposed a real-time EMR data mining-based clinical decision support system. They emphasized the need for an anonymized EMR database that can be explored with a search engine similar to a web search engine. In addition, they focused on designing a framework for a next-generation EMR database that facilitates clinical decision-making and can update a central population database as patients’ new clinical records become available. Another researcher [ 148 ] forecasted future challenges in infection control, underscoring the importance of having timely surveillance systems and prevention programs in place. To that end, they called for the creation, control, and utilization of fully computerized patient records and data-mining-derived epidemiology. Finally, they recommended performance feedback to caregivers, wide accessibility of infection prevention tools, and access to documents such as lessons learned and evidence-based best practices to strengthen infection control, surveillance, and prevention. Authors in [ 150 ] addressed the activities executed by the National Institute of Mental Health (NIMH) in collaboration with other state organizations (e.g., the Substance Abuse and Mental Health Services Administration (SAMHSA) and the Center for Mental Health Services (CMHS)) to promote optimal collection, pooling/aggregation, and use of big data to support ongoing and future research on mental health practices.
Their summary showed that effective pooling/aggregation of state-level data from different sources can serve as a dashboard for setting priorities to improve service quality, measuring system performance, and gaining context-specific insights that are generalizable and scalable across other systems, leading to a successful learning-based mental health care system. Another group of researchers [ 150 ] outlined the barriers and potential benefits of using big data from CancerLinQ, a quality and measurement reporting system initiated by the American Society of Clinical Oncology (ASCO) that collects information from the EHRs of cancer patients so that oncologists can improve the outcomes and quality of the care they provide. The authors noted, however, that these benefits are contingent on patients’ confidence: patients must be encouraged to share their data in the belief that their health records will be used appropriately as a knowledge base to improve the quality of healthcare for others as well as for themselves. This motivated ASCO to ensure that proper policies and procedures are in place for data quality, data security, and data access, and to adopt a comprehensive regulatory framework to protect patients’ data privacy and security.

Another group of researchers [ 151 ] examined data quality and database management to quantify, and consequently understand, the inherent uncertainty originating from radiology reporting systems. They discussed the necessity of a structured reporting system and emphasized the use of standardized language, leading to Natural Language Processing (NLP). Furthermore, they indicated the need to create a consistent Knowledge Discovery in Databases (KDD) process to facilitate data-driven and automated decision support technologies that help improve patient care through enhanced diagnostic quality and clinical outcomes. A group of authors in [ 152 ] pointed out that the success of the current trend of big-data analytics largely depends on how well the quality of the data collected from a variety of sources is ensured. Their findings imply that data quality should be assessed across the entire lifecycle of health data by considering the errors and inaccuracies stemming from multiple sources, and that the impact of the data collection purpose on the knowledge and insights derived from big data analytics should also be quantified. To ensure this, they recommend that enterprises dealing with healthcare big data develop a systematic framework, including custom software or data-quality rule engines, leading to effective management of specific data-quality problems. Researchers in [ 155 ] uncovered the lack of connection between phenomenological and mechanistic models in computational biomedicine. They emphasized the importance of big data which, when successfully extracted and analyzed, and combined with the Virtual Physiological Human (VPH), an initiative to encourage personalized healthcare, can provide effective and robust medical solutions.
To make that happen, they identified several challenges (e.g., confidentiality, volume, and complexity of big data; integration of bioinformatics, systems biology, and phenomics data; efficient storage of partial or complete data within the organization to maximize the performance of overall predictive analytics) and concluded that these must be addressed for the successful development of big data technologies in computational medicine, enabling their adoption in clinical settings. Even though big data can generate significant value in a modern healthcare system, researchers in [ 154 ] stated that without a proper set of IT infrastructure, analytical and visualization tools, and interactive interfaces to represent workflows, the insights generated from big data will not reach their full potential. To overcome this, they recommended that health care organizations engaging in data sharing devise new policies to protect patients' data against potential breaches.

Three papers [ 155 , 156 , 157 ] considered health care policies and ethical and legal issues. One [ 155 ] outlined a national action plan to incorporate sharable and comparable nursing data, beyond documentation of care, into quality reporting and translational research. The plan advocates for standardized nursing terminologies, common data models, and information structures within EHRs. Another paper [ 157 ] analyzed the major policy, ethical, and legal challenges of performing predictive analytics on health care big data. Their recommendations for overcoming challenges across the four-phase life cycle of a predictive analytics model (i.e., data acquisition; model formulation and validation; testing in real-world settings; and implementation and use at broader scale) included developing a governance structure at the earliest phase of model development to guide patients and participating stakeholders throughout the process, from data acquisition to model implementation. They also recommended that model developers strictly comply with the federal laws and regulations governing human subject research and patient information privacy when using patients' data. The third paper [ 156 ] explored four central questions: (i) which aspects of big data are most relevant to health care, (ii) what the policy implications are, (iii) what obstacles stand in the way of achieving policy objectives, and (iv) what policy levers are available, particularly for policy makers developing public policy on the use of big data in healthcare. They discussed barriers to achieving policy objectives (including ensuring transparency among patients and health care providers during data collection) based on a recent UK policy experiment, and argued for providing real-life examples of ways in which data sharing can improve healthcare.

Three papers [ 158 , 159 , 160 ] offered examples of realistic approaches, such as establishing policy leadership and a risk management framework combining commercial and health care entities, to recognize existing privacy-related problems and devise pragmatic, actionable strategies for maintaining patient privacy in big data analytics. One paper [ 158 ] provided a policy overview of health care and data analytics: it outlined the utility of health care data from a policy perspective, reviewed a variety of methods for data collection from public and private sources, mobile devices, and social media, examined the laws and regulations that protect data and patients' privacy, and discussed the dynamic interplay among the policy goals of today's big-data-driven personal health care: tackling cost and population health problems and eliminating disparities in patient care while maintaining privacy. Another study [ 159 ] proposed a Secure and Privacy-Preserving Opportunistic Computing (SPOC) framework for healthcare emergencies, focused on collecting intensive personal health information (through mobile devices such as smartphones or wireless sensors) with minimal privacy disclosure. The premise of this framework is that when a user of the system (called a medical user) faces an emergency, other users in the vicinity with a similar disease or symptom (if available) can come to help before professional help arrives. It is assumed that two persons with a similar disease are skilled enough to help each other, and the threshold of similarity is controlled by the user. The third paper [ 160 ] identified strategies for mining data from physicians' prescriptions while maintaining patient privacy.
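The user-controlled similarity threshold at the heart of SPOC can be illustrated with a small sketch. The Jaccard measure, field names, and values below are our own illustrative choices, not the framework's actual protocol:

```python
def jaccard(a, b):
    """Similarity between two symptom sets: |A ∩ B| / |A ∪ B|."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0

def find_helpers(requester, nearby_users, threshold):
    """Return nearby users whose symptom overlap with the requester
    meets the requester-controlled similarity threshold."""
    return [u["id"] for u in nearby_users
            if jaccard(requester["symptoms"], u["symptoms"]) >= threshold]

requester = {"id": "u0", "symptoms": {"chest pain", "dyspnea", "fatigue"}}
nearby = [
    {"id": "u1", "symptoms": {"chest pain", "dyspnea"}},         # similarity 2/3
    {"id": "u2", "symptoms": {"headache", "fever"}},             # similarity 0
    {"id": "u3", "symptoms": {"fatigue", "dyspnea", "nausea"}},  # similarity 2/4
]
print(find_helpers(requester, nearby, threshold=0.6))  # a strict threshold keeps only u1
```

A lower threshold would admit more potential helpers at the cost of weaker matches, which is exactly the privacy-versus-utility dial the framework leaves in the user's hands.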

Theoretical research on personalized health care services, that is, treatment plans designed for a person based on the susceptibility of his or her genomic structure to a disease, also emerged from the literature review. One study [ 161 ] highlighted the potential of powerful analytical tools to open an avenue for predictive, preventive, participatory, and personalized (P4) medicine. They suggested that a more nuanced understanding of human systems is needed to design an accurate computational model for P4 medicine. Reviewing the research paradigms of current person-centered approaches and traditions, another study [ 162 ] advocated a transdisciplinary and complex-systems approach to improve the field. They synthesized the emerging approaches and methodologies and highlighted the gaps between academic research and the accessibility of evaluation, informatics, and big data from health information systems. Another paper [ 163 ] reviewed the availability of big data and the role of biomedical informatics in personalized medicine, emphasizing the ethical concerns related to personalized medicine and health equity. Personalized medicine has the potential to reduce healthcare costs; however, the researchers argue that it can create racial, income, and educational disparities. Certain socioeconomic and demographic groups currently have little or no access to healthcare, and data-driven personalized medicine would exclude those groups, increasing disparities. They also highlighted the impact of EHRs and CDWs on the field of personalized medicine through accelerated research and decreased delivery time of new technologies.

A myriad of other theoretical points has also been identified in the literature. These topics range from exploiting big data to study the paradigm shift in healthcare policy and management from prioritizing volume to prioritizing value [ 164 , 167 ]; to aiding medical device consumers in their decision-making [ 166 ]; improving emergency departments [ 169 ]; performing command surveillance and policy analysis for Army leadership [ 170 ]; comparing different simulation methods (i.e., systems dynamics, discrete event simulation, and agent-based modeling) for specific health care system problems such as resource allocation and length of stay [ 165 ]; and examining the ethical challenges of security, management, and ownership [ 170 ]. Another researcher outlined the challenges the E.U. faces in data mining given numerous historical, technical, legal, and political barriers [ 168 ].

6. Future Research and Challenges

Data mining has been applied in many fields, including finance, marketing, and manufacturing [ 172 ]. Its application in healthcare is becoming increasingly popular [ 173 ]. A growing literature addresses the challenges of data mining, including noisy data, heterogeneity, high dimensionality, dynamic behavior, and computational time. In this section, we focus on future research directions, including personalized care, information loss in preprocessing, collecting healthcare data for research purposes, automation for non-experts, the interdisciplinary nature of the field and domain expert knowledge, integration into the healthcare system, and prediction error in data mining applications in healthcare.

  • Personalized care

The EMR is increasingly used to document demographic and clinical patient information [ 1 ]. EMR data can be utilized to develop personalized care plans, enhancing patient experience [ 162 ] and improving care quality.

  • Loss of information in pre-processing

Pre-processing of data, including handling missing data, is the most time-consuming and costly part of data mining. The most common method in the papers reviewed was deletion (elimination) of records or features with missing data. In one study, approximately 46.5% of the data and 363 of 410 features were eliminated due to missing values [ 49 ]. In another, researchers were only able to use 2064 of 4948 observations (42%) [ 98 ]. By eliminating cases with missing values and outliers, we lose a significant amount of information. Future research should focus on finding better methods of missing value estimation than elimination. Moreover, data collection techniques should be developed or modified to avoid this issue.
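The trade-off between deletion and estimation can be made concrete with a small sketch; the numbers and the mean-imputation choice below are purely illustrative, not drawn from the reviewed studies:

```python
# Toy lab measurements with missing entries recorded as None.
records = [4.2, None, 3.8, 5.1, None, 4.6, None, 4.0]

# Listwise deletion: drop every record with a missing value.
observed = [x for x in records if x is not None]
lost = 1 - len(observed) / len(records)
print(f"deletion keeps {len(observed)}/{len(records)} records; {lost:.1%} of the data is lost")

# Mean imputation: fill missing entries with the observed mean,
# keeping every record at the cost of some bias.
mean = sum(observed) / len(observed)
imputed = [x if x is not None else mean for x in records]
print(f"imputation keeps {len(imputed)}/{len(records)} records")
```

More principled estimators (regression imputation, k-nearest-neighbour imputation, multiple imputation) follow the same pattern: estimate the missing entry and keep the record, rather than discarding it.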

As with missing data, deletion or elimination is a common way to handle outliers [ 174 ]. However, as illustrated in one of the studies we reviewed [ 48 ], outliers can be used to gain information about rare forms of diseases. Instead of neglecting outliers, future research should analyze them to gain insight.
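A minimal way to act on this advice is to flag outliers for analysis rather than dropping them. The sketch below uses the standard interquartile-range rule; the readings and the 1.5 multiplier are illustrative choices, not a prescription from the reviewed studies:

```python
import statistics

def flag_outliers(values, k=1.5):
    """Flag values outside [Q1 - k*IQR, Q3 + k*IQR] for review
    instead of silently deleting them."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    lo, hi = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < lo or v > hi]

# A cluster of typical lab values plus one extreme case that may
# signal a rare condition rather than a data-entry error.
readings = [5.1, 4.9, 5.3, 5.0, 4.8, 5.2, 5.1, 12.7]
print(flag_outliers(readings))  # [12.7] is set aside for analysis, not discarded
```

The flagged values can then be routed to a clinician or a secondary analysis pipeline, preserving exactly the rare-disease signal that blanket deletion throws away.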

  • Collecting healthcare data for research purposes

Traditionally, the primary objective of data collection in healthcare is documentation of patient condition and care planning [ 109 ]. Including research objectives in the data collection process through structured fields could yield more structured data with fewer errors and missing values [ 64 ]. A successful example of data collection for research purposes is the Study of Health in Pomerania (SHIP) [ 175 ]. The objective of SHIP was to identify common diseases, population-level risk factors, and the overall health of people living in the north-east region of Germany. The study suffered only one “mistake” for every 1000 data entries [ 175 ], yielding structured data with high reliability, less noise, and fewer missing values. We can take advantage of current documentation processes (EMR or EHR) by modifying them to collect more reliable and structured data. Long-term vision and planning are required to introduce research purposes into healthcare data collection.
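Building research objectives into routine documentation can be as simple as validating structured fields at entry time, so errors and omissions are caught when the record is created rather than during preprocessing. A sketch, with entirely hypothetical field names and rules:

```python
# Hypothetical required fields and capture-time rules for a structured entry form.
RULES = {
    "patient_id": lambda v: isinstance(v, str) and v != "",
    "systolic_bp": lambda v: isinstance(v, (int, float)) and 50 <= v <= 300,
    "smoker": lambda v: v in {"yes", "no", "former"},
}

def validate(record):
    """Return the fields that are missing or fail their rule."""
    return [field for field, ok in RULES.items()
            if field not in record or not ok(record[field])]

good = {"patient_id": "P-001", "systolic_bp": 128, "smoker": "no"}
bad = {"patient_id": "P-002", "systolic_bp": 1280}  # entry typo; 'smoker' missing
print(validate(good))  # []
print(validate(bad))   # ['systolic_bp', 'smoker']
```

Rejecting or querying a record at the point of entry is what kept SHIP's error rate near one per 1000 entries; the same idea can be retrofitted onto EMR/EHR forms.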

  • Automation of data mining process for non-expert users

The end users of data mining in healthcare are doctors, nurses, and other healthcare professionals with limited training in analytics. One solution to this problem is to develop an automated (i.e., without human supervision) system for end users [ 134 ]. A cloud-based automated structure to prevent medical errors could also be developed [ 95 ]; but the task would be challenging, as it involves different application areas and no single algorithm will achieve similar accuracy across all applications [ 134 ].
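One hedged way to picture such automation is a routine that evaluates several candidate models and returns the best performer on held-out data, with no analyst in the loop. The models and data below are toy stand-ins, not the systems proposed in the cited studies:

```python
import random

def majority_predict(train, x):
    """Baseline candidate: always predict the most common training label."""
    labels = [label for _, label in train]
    return max(set(labels), key=labels.count)

def knn_predict(train, x, k=3):
    """1-D k-nearest-neighbour vote: a stand-in for any learned candidate model."""
    nearest = sorted(train, key=lambda point: abs(point[0] - x))[:k]
    votes = [label for _, label in nearest]
    return max(set(votes), key=votes.count)

def auto_select(data, candidates, holdout=0.3, seed=0):
    """Evaluate every candidate and return the name of the one with the
    best held-out accuracy, with no analyst intervention required."""
    random.seed(seed)
    shuffled = random.sample(data, len(data))
    cut = int(len(data) * (1 - holdout))
    train, test = shuffled[:cut], shuffled[cut:]
    def accuracy(model):
        return sum(model(train, x) == y for x, y in test) / len(test)
    return max(candidates, key=lambda name: accuracy(candidates[name]))

# Toy 1-D dataset in which the label truly depends on the feature,
# so the k-NN candidate should beat the majority baseline.
data = [(x / 10, "high" if x > 50 else "low") for x in range(100)]
candidates = {"majority": majority_predict, "knn": knn_predict}
print(auto_select(data, candidates))
```

This is the skeleton that automated machine-learning tools elaborate; the caveat from [ 134 ] applies directly, since the winning candidate will differ from one application area to the next.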

  • Interdisciplinary nature of study and domain expert knowledge

Healthcare analytics is an interdisciplinary research field [ 134 ]. As a form of analytics, data mining should be used in combination with expert opinion from the specific domain, both healthcare-wide and problem-specific (i.e., an oncologist for a cancer study, a cardiologist for CVD) [ 106 ]. Approximately 32% of the articles in analytics did not utilize expert opinion in any form. Future research should include members from different disciplines, including healthcare.

  • Integration in healthcare system

Very few of the articles reviewed made an effort to integrate the data mining process into the actual decision-making framework. The impact of knowledge discovery through data mining on healthcare professionals' workload and time is unclear. Future studies should consider the integration of the developed system and explore its effect on work environments.

  • Prediction error and “The Black Swan” effect

In healthcare, it is better not to predict than to make an erroneous prediction [ 46 ]. A little under half of the literature we identified in analytics is dedicated to prediction, but none of the articles discussed the consequences of a prediction error. High prediction accuracy for cancer or any other disease does not ensure an accurate application to decision-making.

Moreover, prediction models may be better at predicting commonplace events than rare ones [ 176 ]. Researchers should develop more sophisticated models to address the unpredictable, “The Black Swan” [ 176 ]. One study [ 101 ] addressed a similar issue in evidence-based recommendations for medical prescriptions; their concern was how much evidence should be sufficient to make a recommendation. Many of the studies in this review do not address these salient issues. Future research should address the implementation challenges of predictive models, especially how the decision-making process should adapt in case of errors and unpredictable incidents.
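One concrete way to embody "better not to predict" is a reject option: the system abstains whenever the model's confidence falls below a threshold and defers to a clinician. A minimal sketch, with an arbitrary illustrative threshold and scores:

```python
def predict_with_reject(prob_positive, threshold=0.8):
    """Return a diagnosis label only when the model is confident enough;
    otherwise abstain and defer to a clinician."""
    confidence = max(prob_positive, 1 - prob_positive)
    if confidence < threshold:
        return "abstain"
    return "positive" if prob_positive >= 0.5 else "negative"

# Illustrative probability scores from some upstream classifier.
for p in (0.97, 0.55, 0.10):
    print(p, "->", predict_with_reject(p))
# 0.97 -> positive; 0.55 -> abstain (confidence below 0.8); 0.10 -> negative
```

The threshold trades coverage against error rate: raising it produces more abstentions but fewer wrong calls reaching the decision-maker, which is precisely the adaptation to prediction error the reviewed literature leaves unaddressed.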

7. Conclusions

The development of an informed decision-making framework stems from the growing concern for ensuring a high-value, patient-focused health care system. Concurrently, the availability of big data has created a promising research avenue for academics and practitioners. As highlighted in our review, the increased number of publications in recent years corroborates the importance of health care analytics in building improved health care systems worldwide. The ultimate goal is to facilitate coordinated and well-informed health care systems capable of ensuring maximum patient satisfaction.

This paper adds to the literature on healthcare and data mining ( Table 1 ), as it is, to our knowledge, the first to take a comprehensive review approach and offer a holistic picture of health care analytics and data mining. Our comprehensive and methodologically rigorous approach covers both the application and the theoretical perspectives of analytics and data mining in healthcare. Our systematic approach, starting with the review process and categorizing the output as analytics or theoretical, provides readers with a broad review with reference to specific fields.

We also shed light on some promising areas for future research, including integration of domain-expert knowledge, approaches to decrease prediction error, and integration of predictive models into actual work environments. Future research should recommend ways in which analytic decisions can effectively adapt when predictive models are subject to errors and unpredictable incidents. Notwithstanding these insightful outcomes, we must mention some limitations of our review approach. The sole consideration of academic journals and the exclusion of conference papers, which may have good coverage of this sector, is the prime limitation of this review. In addition, the search span was narrowed to three databases over 12 years, which may have missed some prior work in this area, although the increasing publication trend since 2005 and the small number of publications before 2008 minimize this limitation. The omission of articles published in languages other than English can also restrict the scope of this review, as related papers written in other languages might be evident in the literature. Moreover, we did not conduct forward (reviewing the papers that cited a selected paper) and backward (reviewing the references in a selected paper and the authors' prior works) searches as suggested by Levy and Ellis [ 31 ].

Despite these limitations, the systematic methodology followed in this review can be applied across the full range of healthcare areas.

Supplementary Materials

The following are available online at http://www.mdpi.com/2227-9032/6/2/54/s1 , Table S1: PRISMA checklist, Table S2: Modified checklists and comparison, Table S3: Study characteristics, Table S4: Classification of reviewed papers by analytics type, application area, data type, and data mining techniques.

Author Contributions

The contributions of the authors can be summarized in the following manner. Conceptualization: M.S.I., M.N.-E.-A.; Formal analysis: M.S.I., M.M.H., X.W.; Investigation: M.S.I., M.M.H., X.W.; Methodology: M.S.I.; Project administration: M.S.I., M.N.-E.-A.; Supervision: M.N.-E.-A.; Visualization: M.S.I., X.W.; Writing—draft: M.S.I., M.M.H., H.D.G.; Writing—review and editing: M.S.I., M.M.H., H.D.G., M.N.-E.-A.

Germack is supported by CTSA Grant Number TL1 TR001864 from the National Center for Advancing Translational Science (NCATS), a component of the National Institutes of Health (NIH). The content is solely the responsibility of the authors and does not necessarily represent the official views of this organization.

Conflicts of Interest

The authors declare no conflict of interest.

  • NEWS FEATURE
  • 17 July 2019
  • Correction 19 July 2019

The plan to mine the world’s research papers

  • Priyanka Pulla

Priyanka Pulla is a freelance journalist based in Bengaluru, India.

Carl Malamud in front of the data store of 73 million articles that he plans to let scientists text mine. Credit: Smita Sharma for Nature

Carl Malamud is on a crusade to liberate information locked up behind paywalls — and his campaigns have scored many victories. He has spent decades publishing copyrighted legal documents, from building codes to court records, and then arguing that such texts represent public-domain law that ought to be available to any citizen online. Sometimes, he has won those arguments in court. Now, the 60-year-old American technologist is turning his sights on a new objective: freeing paywalled scientific literature. And he thinks he has a legal way to do it.

Nature 571 , 316-318 (2019)

doi: https://doi.org/10.1038/d41586-019-02142-1

Updates & Corrections

Correction 19 July 2019 : An earlier version of this feature used the term ‘fair use’ inappropriately — the term isn’t relevant under Indian law.


