
Open Access • Peer-reviewed • Research Article

Clustering algorithms: A comparative approach

Roles Data curation, Formal analysis, Investigation, Methodology, Software, Validation, Visualization, Writing – original draft, Writing – review & editing

Affiliation Institute of Mathematics and Computer Science, University of São Paulo, São Carlos, São Paulo, Brazil

Roles Conceptualization, Data curation, Formal analysis, Investigation, Methodology, Validation, Writing – original draft, Writing – review & editing

* E-mail: [email protected]

Affiliation Department of Computer Science, Federal University of São Carlos, São Carlos, São Paulo, Brazil


Roles Validation, Writing – original draft, Writing – review & editing

Affiliation Federal University of Technology, Paraná, Paraná, Brazil

Roles Funding acquisition, Project administration, Supervision, Validation, Writing – review & editing

Affiliation São Carlos Institute of Physics, University of São Paulo, São Carlos, São Paulo, Brazil

Roles Funding acquisition, Project administration, Supervision, Validation, Writing – original draft, Writing – review & editing

Roles Conceptualization, Formal analysis, Funding acquisition, Methodology, Project administration, Resources, Supervision, Validation, Writing – original draft, Writing – review & editing

  • Mayra Z. Rodriguez, 
  • Cesar H. Comin, 
  • Dalcimar Casanova, 
  • Odemir M. Bruno, 
  • Diego R. Amancio, 
  • Luciano da F. Costa, 
  • Francisco A. Rodrigues


  • Published: January 15, 2019
  • https://doi.org/10.1371/journal.pone.0210236

Abstract

Many real-world systems can be studied in terms of pattern recognition tasks, so that the proper use (and understanding) of machine learning methods in practical applications becomes essential. While many classification methods have been proposed, there is no consensus on which methods are most suitable for a given dataset. As a consequence, it is important to comprehensively compare methods in many possible scenarios. In this context, we performed a systematic comparison of 9 well-known clustering methods available in the R language, assuming normally distributed data. In order to account for the many possible variations of data, we considered artificial datasets with several tunable properties (number of classes, separation between classes, etc.). In addition, we also evaluated the sensitivity of the clustering methods to their parameter configurations. The results revealed that, when considering the default configurations of the adopted methods, the spectral approach tended to present particularly good performance. We also found that the default configuration of the adopted implementations was not always accurate. In these cases, a simple approach based on the random selection of parameter values proved to be a good alternative for improving performance. All in all, the reported approach provides guidance for the choice of clustering algorithms.

Citation: Rodriguez MZ, Comin CH, Casanova D, Bruno OM, Amancio DR, Costa LdF, et al. (2019) Clustering algorithms: A comparative approach. PLoS ONE 14(1): e0210236. https://doi.org/10.1371/journal.pone.0210236

Editor: Hans A. Kestler, University of Ulm, GERMANY

Received: December 26, 2016; Accepted: December 19, 2018; Published: January 15, 2019

Copyright: © 2019 Rodriguez et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Data Availability: All datasets used for evaluating the algorithms can be obtained from Figshare: https://figshare.com/s/29005b491a418a667b22 .

Funding: This work has been supported by FAPESP - Fundação de Amparo à Pesquisa do Estado de São Paulo (grant nos. 15/18942-8 and 18/09125-4 for CHC, 14/20830-0 and 16/19069-9 for DRA, 14/08026-1 for OMB and 11/50761-2 and 15/22308-2 for LdFC), CNPq - Conselho Nacional de Desenvolvimento Científico e Tecnológico (grant nos. 307797/2014-7 for OMB and 307333/2013-2 for LdFC), Núcleo de Apoio à Pesquisa (LdFC) and CAPES - Coordenação de Aperfeiçoamento de Pessoal de Nível Superior (Finance Code 001).

Competing interests: The authors have declared that no competing interests exist.

Introduction

In recent years, the automation of data collection and recording has led to a deluge of information about many different kinds of systems [ 1 – 8 ]. As a consequence, many methodologies aimed at organizing and modeling data have been developed [ 9 ]. Such methodologies are motivated by their widespread application in diagnosis [ 10 ], education [ 11 ], forecasting [ 12 ], and many other domains [ 13 ]. The definition, evaluation and application of these methodologies are all part of the machine learning field [ 14 ], which has become a major subarea of computer science and statistics due to its crucial role in the modern world.

Machine learning encompasses different topics such as regression analysis [ 15 ], feature selection methods [ 16 ], and classification [ 14 ]. The latter involves assigning classes to the objects in a dataset. Three main approaches can be considered for classification: supervised, semi-supervised and unsupervised classification. In supervised classification, the classes, or labels, of some objects are known beforehand, defining the training set, and an algorithm is used to obtain the classification criteria. Semi-supervised classification deals with training the algorithm using both labeled and unlabeled data. It is commonly used when manually labeling a dataset becomes costly. Lastly, unsupervised classification, henceforth referred to as clustering, deals with defining classes from the data without prior knowledge of the class labels. The purpose of clustering algorithms is to identify groups of objects, or clusters, that are more similar to each other than to objects in other clusters. Such an approach to data analysis is closely related to the task of creating a model of the data, that is, defining a simplified set of properties that can provide an intuitive explanation of relevant aspects of a dataset. Clustering methods are generally more demanding than supervised approaches, but provide more insight about complex data. This type of classifier constitutes the main object of the current work.

Because clustering algorithms involve several parameters, often operate in high-dimensional spaces, and have to cope with noisy, incomplete and sampled data, their performance can vary substantially for different applications and types of data. For such reasons, several different approaches to clustering have been proposed in the literature (e.g. [ 17 – 19 ]). In practice, it becomes a difficult endeavor, given a dataset or problem, to choose a suitable clustering approach. Nevertheless, much can be learned by comparing different clustering methods. Several previous efforts to compare clustering algorithms have been reported in the literature [ 20 – 29 ]. Here, we focus on generating a diversified and comprehensive set of artificial, normally distributed data containing not only distinct numbers of classes, features and objects and different separations between classes, but also a varied structure of the involved groups (e.g. possessing predefined correlation distributions between features). The purpose of using artificial data is the possibility of obtaining an unlimited number of samples and of systematically changing any of the aforementioned properties of a dataset. Such features allow the clustering algorithms to be comprehensively and strictly evaluated in a vast number of circumstances, and also grant the possibility of quantifying the sensitivity of the performance with respect to small changes in the data. It should be observed, nevertheless, that the performance results reported in this work are therefore specific to, and limited by, normally distributed data, and other results could be expected for data following other statistical behavior. Here we associate performance with the similarity between the known labels of the objects and those found by the algorithm. Many measurements have been defined for quantifying such similarity [ 30 ]; we compare the Jaccard index [ 31 ], adjusted Rand index [ 32 ], Fowlkes-Mallows index [ 33 ] and normalized mutual information [ 34 ]. A modified version of the procedure developed by [ 35 ] was used to create 400 distinct datasets, which were used to quantify the performance of the clustering algorithms. We describe the adopted procedure and the respective parameters used for data generation. Related approaches include [ 36 ].

Each clustering algorithm relies on a set of parameters that needs to be adjusted in order to achieve viable performance, which is an important point to be addressed when comparing clustering algorithms. A long-standing problem in machine learning is the definition of a proper procedure for setting the parameter values [ 37 ]. In principle, one can apply an optimization procedure (e.g., simulated annealing [ 38 ] or genetic algorithms [ 39 ]) to find the parameter configuration providing the best performance of a given algorithm. Nevertheless, there are two major problems with such an approach. First, adjusting parameters to a given dataset may lead to overfitting [ 40 ]. That is, the specific values found to provide good performance may lead to lower performance when new data are considered. Second, parameter optimization can be infeasible in some cases, given the time complexity of many algorithms combined with their typically large number of parameters. Ultimately, many researchers resort to applying classification or clustering algorithms using the default parameters provided by the software. Therefore, efforts are required for evaluating and comparing the performance of clustering algorithms in both the optimized and default situations. In the following, we consider some representative examples of algorithms applied in the literature [ 37 , 41 ].

Clustering algorithms have been implemented in several programming languages and packages. During the development and implementation of such code, it is common to introduce changes or optimizations, leading to new versions of the original methods. The current work focuses on the comparative analysis of several clustering algorithms found in popular packages available in the R programming language [ 42 ]. This choice was motivated by the popularity of the R language in the data mining field, and by virtue of the well-established clustering packages it contains. This study is intended to assist researchers who have programming skills in the R language, but little experience in data clustering.

The algorithms are evaluated in three distinct situations. First, we consider their performance when using the default parameters provided by the packages. Then, we consider the performance variation when single parameters of the algorithms are changed, while the rest are kept at their default values. Finally, we consider the simultaneous variation of all parameters by means of a random sampling procedure. We compare the results obtained in the latter two situations with those achieved by the default parameters, so as to investigate the possible improvements in performance that could be achieved by adjusting the parameters.

The algorithms were evaluated on 400 artificial, normally distributed, datasets generated by a robust methodology previously described in [ 36 ]. The number of features, number of classes, number of objects for each class and average distance between classes can be systematically changed among the datasets.

The text is organized as follows. We start by reviewing some of the main approaches to comparing clustering algorithms. Next, we describe the clustering methods considered in the analysis, together with the R packages implementing them. The data generation method and the performance measurements used to compare the algorithms are then presented, followed by the performance results obtained for the default parameters, for single parameter variation and for random parameter sampling.

Related works

Previous approaches for comparing the performance of clustering algorithms can be divided according to the nature of the datasets used. While some studies use either real-world or artificial data, others employ both types of datasets to compare the performance of several clustering methods.

A comparative analysis using real-world datasets is presented in several works [ 20 , 21 , 24 , 25 , 43 , 44 ]. Some of these works are reviewed briefly in the following. In [ 43 ], the authors propose an evaluation approach based on multiple criteria decision making in the domain of financial risk analysis, over three real-world credit risk and bankruptcy risk datasets. More specifically, clustering algorithms are evaluated in terms of a combination of clustering measurements, which include a collection of external and internal validity indexes. Their results show that no algorithm can achieve the best performance on all measurements for any dataset and, for this reason, it is mandatory to use more than one performance measure to evaluate clustering algorithms.

In [ 21 ], a comparative analysis of clustering methods was performed in the context of a text-independent speaker verification task, using three datasets. Two approaches were considered: clustering algorithms focused on minimizing a distance-based objective function and a Gaussian model-based approach. The following algorithms were compared: k-means, random swap, expectation-maximization, hierarchical clustering, self-organizing maps (SOM) and fuzzy c-means. The authors found that the most important factor for the success of the algorithms is the model order, which represents the number of centroids or Gaussian components (for Gaussian model-based approaches) considered. Overall, the recognition accuracy was similar for the clustering algorithms focused on minimizing a distance-based objective function. When the number of clusters was small, SOM and hierarchical methods provided lower accuracy than the other methods. Finally, a comparison of the computational efficiency of the methods revealed that the split hierarchical method was the fastest clustering algorithm on the considered datasets.

In [ 25 ], five clustering methods were studied: k-means, multivariate Gaussian mixture, hierarchical clustering, spectral and nearest-neighbor methods. Four proximity measures were used in the experiments: the Pearson and Spearman correlation coefficients, cosine similarity and the Euclidean distance. The algorithms were evaluated on 35 gene expression datasets from either Affymetrix or cDNA chip platforms, using the adjusted Rand index for performance evaluation. The multivariate Gaussian mixture method provided the best performance in recovering the actual number of clusters of the datasets. The k-means method displayed similar performance. In this same analysis, the hierarchical method led to limited performance, while the spectral method proved to be particularly sensitive to the proximity measure employed.

In [ 24 ], experiments were performed to compare five different types of clustering algorithms: CLICK, a self-organizing map (SOM)-based method, k-means, hierarchical and dynamical clustering. Datasets of gene expression time series of the yeast Saccharomyces cerevisiae were used. A k-fold cross-validation procedure was considered to compare the different algorithms. The authors found that k-means, dynamical clustering and SOM tended to yield high accuracy in all experiments. On the other hand, hierarchical clustering presented more limited performance when clustering larger datasets, yielding low accuracy in some experiments.

A comparative analysis using artificial data is presented in [ 45 – 47 ]. In [ 47 ], two subspace clustering methods were compared: MAFIA (Adaptive Grids for Clustering Massive Data Sets) [ 48 ] and FINDIT (A Fast and Intelligent Subspace Clustering Algorithm Using Dimension Voting) [ 49 ]. The artificial data, modeled according to a normal distribution, allowed the control of the number of dimensions and instances. The methods were evaluated in terms of both scalability and accuracy. For the former, the running times of both algorithms were compared for different numbers of instances and features. In addition, the authors assessed the ability of the methods to find adequate subspaces for each cluster. They found that MAFIA discovered all relevant clusters, but one significant dimension was left out in most cases. Conversely, the FINDIT method performed better in the task of identifying the most relevant dimensions. Both algorithms were found to scale linearly with the number of instances; however, MAFIA outperformed FINDIT in most of the tests.

Another common approach for comparing clustering algorithms considers using a mixture of real-world and artificial data (e.g. [ 23 , 26 – 28 , 50 ]). In [ 28 ], the performance of k-means, single linkage and simulated annealing (SA) was evaluated, considering different partitions obtained by validation indexes. The authors used two real-world datasets obtained from [ 51 ] and three artificial datasets (having two dimensions and 10 clusters). The authors proposed a new validation index, called the I index, that measures separation based on the maximum distance between clusters and compactness based on the sum of distances between objects and their respective centroids. They found that such an index was the most reliable among the considered indices, reaching its maximum value when the number of clusters is properly chosen.

A systematic quantitative evaluation of four graph-based clustering methods was performed in [ 27 ]. The compared methods were: Markov clustering (MCL), restricted neighborhood search clustering (RNSC), super-paramagnetic clustering (SPC), and molecular complex detection (MCODE). Six datasets modeling protein interactions in Saccharomyces cerevisiae and 84 random graphs were used for the comparison. The robustness of each method was measured in a twofold fashion: the variation of performance was quantified in terms of changes in (i) the method's parameters and (ii) the dataset properties. In the latter case, connections were included and removed to reflect uncertainties in the relationships between proteins. The restricted neighborhood search clustering method turned out to be particularly robust to variations in the choice of method parameters, whereas the other algorithms were found to be more robust to dataset alterations. In [ 52 ], the authors report a brief comparison of clustering algorithms using the Fundamental Clustering Problems Suite (FCPS) as dataset. The FCPS contains artificial and real datasets for testing clustering algorithms. Each dataset represents a particular challenge that the clustering algorithm has to handle; for example, in the Hepta and LSun datasets the clusters can be separated by a linear decision boundary, but have different densities and variances. On the other hand, the ChainLink and Atom datasets cannot be separated by linear decision boundaries. Likewise, the Target dataset contains outliers. Lower performance was obtained by the single linkage clustering algorithm for the Tetra, EngyTime, TwoDiamonds and WingNut datasets. Although the datasets are quite versatile, it is not possible to control and evaluate how some of their characteristics, such as dimensions or number of features, affect the clustering accuracy.

Clustering methods

Many different types of clustering methods have been proposed in the literature [ 53 – 56 ]. Despite such diversity, some methods are more frequently used than others [ 57 ]. Also, many of the commonly employed methods are defined in terms of similar assumptions about the data (e.g., k-means and k-medoids) or consider analogous mathematical concepts (e.g., similarity matrices for spectral or graph clustering) and, consequently, should provide similar performance in typical usage scenarios. Therefore, in the following we consider a choice of clustering algorithms from different families of methods. Several taxonomies have been proposed to organize the many different types of clustering algorithms into families [ 29 , 58 ]. While some taxonomies categorize the algorithms based on their objective functions [ 58 ], others are based on the specific structures desired for the obtained clusters (e.g. hierarchical) [ 29 ]. Here we consider the algorithms indicated in Table 1 as examples of the categories indicated in the same table. The algorithms represent some of the main types of methods in the literature. Note that some algorithms belong to the same family, but in these cases they possess notable differences in their applications (e.g., treating very large datasets using clara). A short description of the parameters of each considered algorithm is provided in S1 File of the supplementary material.

Table 1. The first column shows the name of the algorithms used throughout the text. The second column indicates the category of the algorithms. The third and fourth columns contain, respectively, the function name and R library of each algorithm.

https://doi.org/10.1371/journal.pone.0210236.t001

Regarding partitional approaches, the k-means [ 68 ] algorithm has been widely used by researchers [ 57 ]. This method requires as input parameters the number of groups ( k ) and a distance metric. Initially, each data point is associated with one of the k clusters according to its distance to the centroids (cluster centers). An example is shown in Fig 1(a) , where black points correspond to centroids and the remaining points have the same color if their closest centroid is the same. Then, new centroids are calculated, and the classification of the data points is repeated for the new centroids, as indicated in Fig 1(b) , where gray points indicate the position of the centroids in the previous iteration. The process is repeated until no significant change of the centroid positions is observed at each new step, as shown in Fig 1(c) and 1(d) .

Fig 1. Each plot shows the partition obtained after specific iterations of the algorithm. The centroids of the clusters are shown as black markers. Points are colored according to their assigned clusters. Gray markers indicate the position of the centroids in the previous iteration. The dataset contains 2 clusters, but k = 4 seeds were used in the algorithm.

https://doi.org/10.1371/journal.pone.0210236.g001
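To make the iteration shown in Fig 1 concrete, the following minimal R sketch implements the assign-and-update loop just described. It is only an illustration of the procedure, not the implementation used in this work (which relies on the stats package); the function name simple_kmeans and the convergence tolerance are arbitrary choices, and empty clusters are not handled.

```r
# Illustrative sketch of the k-means iteration (not the stats::kmeans
# implementation used in this work).
simple_kmeans <- function(x, k, max_iter = 100) {
  centroids <- x[sample(nrow(x), k), , drop = FALSE]   # random initial seeds
  labels <- rep(1L, nrow(x))
  for (iter in seq_len(max_iter)) {
    # assign each point to its nearest centroid (Euclidean distance)
    d <- as.matrix(dist(rbind(centroids, x)))[-(1:k), 1:k, drop = FALSE]
    labels <- apply(d, 1, which.min)
    # recompute each centroid as the mean of its assigned points
    new_centroids <- t(sapply(seq_len(k), function(j)
      colMeans(x[labels == j, , drop = FALSE])))
    if (max(abs(new_centroids - centroids)) < 1e-8) break  # centroids stabilized
    centroids <- new_centroids
  }
  list(cluster = labels, centers = centroids)
}
```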

The a priori setting of the number of clusters is the main limitation of the k-means algorithm. This is so because the final classification can strongly depend on the choice of the number of centroids [ 68 ]. In addition, k-means is not particularly recommended in cases where the clusters do not show a convex distribution or have very different sizes [ 59 , 60 ]. Moreover, the k-means algorithm is sensitive to the initial seed selection [ 41 ]. Given these limitations, many modifications of this algorithm have been proposed [ 61 – 63 ], such as the k-medoid [ 64 ] and k-means++ [ 65 ]. Nevertheless, this algorithm, besides having low computational cost, can provide good results in many practical situations such as anomaly detection [ 66 ] and data segmentation [ 67 ]. The R routine used for k-means clustering was the kmeans function from the stats package, which contains the implementations of the algorithms proposed by MacQueen [ 68 ] and by Hartigan and Wong [ 69 ]. The algorithm of Hartigan and Wong is employed by the stats package when setting the parameters to their default values, while the algorithm proposed by MacQueen is used in all other cases. Another interesting example of a partitional clustering algorithm is clustering for large applications (clara) [ 70 ]. This method takes into account multiple fixed samples of the dataset to minimize sampling bias and, subsequently, selects the best medoids among the chosen samples, where a medoid is defined as the object i for which the average dissimilarity to all other objects in its cluster is minimal. This method tends to be efficient for large amounts of data because it does not explore the whole neighborhood of the data points [ 71 ], although the quality of the results has been found to strongly depend on the number of objects in the sample data [ 62 ]. The clara algorithm employed in our analysis was provided by the clara function contained in the cluster package. This function implements the method developed by Kaufman and Rousseeuw [ 70 ].
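As a usage illustration of the two partitional routines just mentioned, the following sketch applies stats::kmeans and cluster::clara to a toy two-feature dataset. The data, the choice k = 2 and the samples value are arbitrary; only the function and package names come from the text above.

```r
library(cluster)  # provides clara()

set.seed(42)
x <- rbind(matrix(rnorm(100, mean = 0), ncol = 2),
           matrix(rnorm(100, mean = 4), ncol = 2))

# k-means from the stats package (Hartigan-Wong algorithm by default)
km <- kmeans(x, centers = 2)

# clara from the cluster package; 'samples' sets how many sub-samples are
# drawn when searching for the best medoids
cl <- clara(x, k = 2, samples = 5)

table(km$cluster, cl$clustering)  # compare the two partitions
```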

Ordering Points To Identify the Clustering Structure (OPTICS) [ 72 , 73 ] is a density-based cluster ordering based on the concept of maximal density-reachability [ 72 ]. The algorithm starts with a data point and expands its neighborhood using a procedure similar to that of the dbscan algorithm [ 74 ], with the difference that the neighborhood is first expanded to points with low core-distance. The core-distance of an object p is defined as the m -th smallest distance between p and the objects in its ϵ -neighborhood (i.e., objects having distance less than or equal to ϵ from p ), where m is a parameter of the algorithm indicating the smallest number of points that can form a cluster. The optics algorithm can detect clusters having large density variations and irregular shapes. The R routine used for optics clustering was the optics function from the dbscan package. This function implements the original algorithm developed by Ankerst et al. [ 72 ]. A hierarchical clustering structure can be constructed from the output of the optics algorithm using the function extractXi from the dbscan package. We note that the function extractDBSCAN , from the same package, provides a clustering from an optics ordering that is similar to what the dbscan algorithm would generate.
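A minimal usage sketch of the dbscan package functions cited above is shown below. The toy data and the eps, minPts and eps_cl values are arbitrary illustrative choices, not the defaults used in the evaluation.

```r
library(dbscan)

set.seed(1)
x <- rbind(matrix(rnorm(200, mean = 0), ncol = 2),
           matrix(rnorm(200, mean = 5), ncol = 2))

opt <- optics(x, eps = 10, minPts = 5)   # compute the reachability ordering
res <- extractDBSCAN(opt, eps_cl = 1.0)  # dbscan-like clusters from the ordering
# res <- extractXi(opt, xi = 0.05)       # alternative: hierarchical Xi extraction
res$cluster                              # cluster labels; 0 denotes noise points
```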

Clustering methods that take into account the linkage between data points, traditionally known as hierarchical methods, can be subdivided into two groups: agglomerative and divisive [ 59 ]. In an agglomerative hierarchical clustering algorithm, each object initially belongs to its own individual cluster. Then, after successive iterations, groups are merged until a stop condition is reached. On the other hand, a divisive hierarchical clustering method starts with all objects in a single cluster and, after successive iterations, objects are separated into clusters. There are two main packages in the R language that provide routines for performing hierarchical clustering: stats and cluster . Here we consider the agnes routine from the cluster package, which implements the algorithm proposed by Kaufman and Rousseeuw [ 70 ]. Four well-known linkage criteria are available in agnes , namely single linkage, complete linkage, Ward’s method, and weighted average linkage [ 75 ].
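The following sketch shows one way to obtain a flat partition from agnes; the toy data, the choice of average linkage and the use of cutree with k = 2 are illustrative assumptions.

```r
library(cluster)

x <- rbind(matrix(rnorm(100, mean = 0), ncol = 2),
           matrix(rnorm(100, mean = 4), ncol = 2))

# agglomerative clustering; 'method' selects the linkage criterion
ag <- agnes(x, method = "average")   # other options: "single", "complete", "ward", "weighted"

# cut the resulting dendrogram into a fixed number of groups
labels <- cutree(as.hclust(ag), k = 2)
```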

Model-based methods can be regarded as a general framework for maximum likelihood estimation of the parameters of an underlying distribution assumed for a given dataset. A well-known instance of model-based methods is the expectation-maximization (EM) algorithm. Most commonly, one considers that the data from each class can be modeled by a multivariate normal distribution, and, therefore, the distribution observed for the whole data can be seen as a mixture of such normal distributions. A maximum likelihood approach is then applied to find the most probable parameters of the normal distribution of each class. The EM approach to clustering is particularly suitable when the dataset is incomplete [ 76 , 77 ]. On the other hand, the clusters obtained from the method may strongly depend on the initial conditions [ 54 ]. In addition, the algorithm may fail to find very small clusters [ 29 , 78 ]. In the R language, the package mclust [ 79 , 80 ] provides iterative EM (expectation-maximization) methods for maximum likelihood estimation using parameterized Gaussian mixture models. The functions estep and mstep implement the individual steps of an EM iteration. A related algorithm that is also analyzed in the current study is the hcmodel, which can be found in the hc function of the mclust package. The hcmodel algorithm, which is also based on Gaussian mixtures, was proposed by Fraley [ 81 ]. The algorithm contains many additional steps compared to traditional EM methods, such as an agglomerative procedure and the adjustment of model parameters through Bayes factor selection using the BIC approximation [ 82 ].
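As an illustration of the mclust routines mentioned above, the sketch below fits a Gaussian mixture with Mclust and runs the model-based agglomeration with hc. The toy data, the number of components G = 2 and the "VVV" model name are illustrative choices.

```r
library(mclust)

x <- rbind(matrix(rnorm(200, mean = 0), ncol = 2),
           matrix(rnorm(200, mean = 4), ncol = 2))

# EM fit of a parameterized Gaussian mixture model
fit <- Mclust(x, G = 2)
fit$classification                 # hard cluster labels

# model-based agglomerative hierarchical clustering (basis of the hcmodel approach)
htree <- hc(x, modelName = "VVV")
hclass(htree, G = 2)               # cut the agglomeration into 2 groups
```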


In recent years, the efficient handling of high-dimensional data has become of paramount importance and, for this reason, this capability has become a desired feature when choosing the most appropriate method for obtaining accurate partitions. To tackle high-dimensional data, subspace clustering was proposed [ 49 ]. This method works by considering the similarity between objects with respect to distinct subsets of the attributes [ 88 ]. The motivation for doing so is that different subsets of the attributes might define distinct separations between the data. Therefore, the algorithm can identify clusters that exist in multiple, possibly overlapping, subspaces [ 49 ]. Subspace algorithms can be categorized into four main families [ 89 ], namely: lattice, statistical, approximation and hybrid. The hddc function from the package HDclassif implements the subspace clustering method of Bouveyron [ 90 ] in the R language. The algorithm is based on statistical models, with the assumption that all attributes may be relevant for clustering [ 91 ]. Some parameters of the algorithm, such as the number of clusters or the model to be used, are estimated using an EM procedure.
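A minimal usage sketch of HDclassif::hddc is shown below. The toy data, the number of clusters K = 2 and model = "ALL" (which asks the function to fit several model parameterizations and keep the best one) are illustrative assumptions.

```r
library(HDclassif)

set.seed(3)
x <- rbind(matrix(rnorm(500, mean = 0), ncol = 10),
           matrix(rnorm(500, mean = 2), ncol = 10))

# subspace (high-dimensional data) clustering; several models are fitted
# and the best one is selected internally
fit <- hddc(x, K = 2, model = "ALL")
fit$class   # cluster labels
```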

So far, we have discussed the application of clustering algorithms to static data. Nevertheless, when analyzing data, it is important to take into account whether the data are dynamic or static. Dynamic data, unlike static data, undergo changes over time. Some kinds of data, like the network packets received by a router and credit card transaction streams, are transient in nature, and they are known as data streams . Another example of dynamic data are time series, whose values change over time [ 92 ]. Dynamic data usually include a large number of features, and the number of objects is potentially unbounded [ 59 ]. This requires the application of novel approaches to quickly process the entire volume of continuously incoming data [ 93 ], to detect new clusters that are formed and to identify outliers [ 94 ].

Materials and methods

Artificial datasets.

The proper comparison of clustering algorithms requires a robust artificial data generation method to produce a variety of datasets. For such a task, we apply a methodology based on a previous work by Hirschberger et al. [ 35 ]. The procedure can be used to generate normally distributed samples characterized by F features and separated into C classes. In addition, the method can control both the variance and correlation distributions among the features for each class. The artificial dataset can also be generated by varying the number of objects per class, N e , and the expected separation, α , between the classes.


For each class i in the dataset, a covariance matrix R i of size F × F is created, and this matrix is used for generating the N e objects of the class. This means that pairs of features can have distinct correlations for each generated class. Then, the generated class values are divided by α and translated by s i , where s i is a random variable described by a uniform distribution defined in the interval [−1, 1]. Parameter α is associated with the expected distances between classes. Such distances can have different impacts on the clustering depending on the number of objects and features used in the dataset. The features in the generated data have a multivariate normal distribution. In addition, the covariances among the features also follow a normal distribution. Notice that such a procedure for the generation of artificial datasets was previously used in [ 36 ].
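The sketch below gives a simplified picture of this generation scheme; it is not the exact procedure of Hirschberger et al. [ 35 ], which also controls the variance and correlation distributions of the covariance matrices, but it illustrates the roles of F, C, Ne, α and s_i. Each class receives a random positive semi-definite covariance matrix built as A Aᵀ, its samples are scaled by 1/α and translated by a uniform offset; the function name and the way R_i is drawn are assumptions of this illustration.

```r
library(MASS)  # mvrnorm()

generate_dataset <- function(C = 2, F = 2, Ne = 100, alpha = 3) {
  data <- NULL
  labels <- NULL
  for (i in seq_len(C)) {
    A  <- matrix(rnorm(F * F), ncol = F)
    Ri <- A %*% t(A)                            # random covariance matrix for class i
    xi <- mvrnorm(Ne, mu = rep(0, F), Sigma = Ri)
    si <- runif(F, min = -1, max = 1)           # class translation, uniform in [-1, 1]
    xi <- sweep(xi / alpha, 2, si, "+")         # divide by alpha, translate by s_i
    data   <- rbind(data, xi)
    labels <- c(labels, rep(i, Ne))
  }
  list(data = data, labels = labels)
}

db <- generate_dataset(C = 2, F = 2, Ne = 100, alpha = 3.3)
```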

In Fig 2 , we show some examples of artificially generated data. For visualization purposes, all considered cases contain F = 2 features. The parameters used for each case are described in the caption of the figure. Note that the methodology can generate a variety of dataset configurations, including variations in the feature correlations of each class.

Fig 2. The parameters used for each case are (a) C = 2, Ne = 100 and α = 3.3; (b) C = 2, Ne = 100 and α = 2.3; (c) C = 10, Ne = 50 and α = 4.3; (d) C = 10, Ne = 50 and α = 6.3. Note that each class can present highly distinct properties due to differences in correlation between their features.

https://doi.org/10.1371/journal.pone.0210236.g002

  • Number of classes ( C ): The generated datasets are divided into C = {2, 10, 50} classes.
  • Number of features ( F ): The number of features to characterize the objects is F = {2, 5, 10, 50, 200}.
  • Number of objects per class ( N e ): We considered Ne = {5, 50, 100, 500, 5000} objects per class. In our experiments, in a given generated dataset, the number of instances for each class is constant.
  • Mixing parameter ( α ): This parameter has a non-trivial dependence on the number of classes and features. Therefore, for each dataset, the value of this parameter was tuned so that no algorithm would achieve an accuracy of 0% or 100%.

We refer to datasets containing 2, 10, 50 and 200 features as DB2F, DB10F, DB50F, DB200F respectively. Such datasets are composed of all considered number of classes, C = {2, 10, 50}, and 50 elements for each class (i.e., Ne = 50). In some cases, we also indicate the number of classes considered for the dataset. For example, dataset DB2C10F contains 2 classes, 10 features and 50 elements per class.

For each case, we consider 10 realizations of the dataset. Therefore, 400 datasets were generated in total.

Evaluating the performance of clustering algorithms

The evaluation of the quality of the generated partitions is one of the most important issues in cluster analysis [ 30 ]. Indices used for measuring the quality of a partition can be categorized into two classes, internal and external indices. Internal validation indices are based on information intrinsic to the data, and evaluate the goodness of a clustering structure without external information. When the correct partition is not available, it is possible to estimate the quality of a partition by measuring how closely each instance is related to its cluster and how well separated the clusters are from each other. Internal indices are mainly used for choosing an optimal clustering algorithm to be applied on a specific dataset [ 96 ]. On the other hand, external validation indices measure the similarity between the output of the clustering algorithm and the correct partitioning of the dataset. The Jaccard, Fowlkes-Mallows and adjusted Rand indices belong to the same pair-counting category and are therefore closely related. Some differences include the fact that they can exhibit bias with respect to the number of clusters or the distribution of class sizes in a partition. Normalization helps prevent this unwanted effect. In [ 97 ], the authors discuss several types of bias that may affect external cluster validity indices. A total of 26 pair-counting based external cluster validity indices were used to identify the bias generated by the number of clusters. It was shown that the Fowlkes-Mallows and Jaccard indices monotonically decrease as the number of clusters increases, favoring partitions with a smaller number of clusters, while the adjusted Rand index tends to be indifferent to the number of clusters.


Note that when the two sets of labels have a perfect one-to-one correspondence, the quality measures are all equal to unity.
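As a concrete reference for the pair-counting indices discussed above, the following sketch computes the Jaccard, Fowlkes-Mallows and adjusted Rand indices from the contingency table of two labelings (the normalized mutual information, being entropy-based rather than pair-counting, is omitted here). The function name is arbitrary; the formulas are the standard pair-counting definitions.

```r
pair_counting_indices <- function(true, pred) {
  tab   <- table(true, pred)
  comb2 <- function(x) x * (x - 1) / 2
  a  <- sum(comb2(tab))            # pairs placed together in both partitions
  ai <- sum(comb2(rowSums(tab)))   # pairs together in the reference partition
  bj <- sum(comb2(colSums(tab)))   # pairs together in the obtained partition
  n2 <- comb2(sum(tab))            # total number of object pairs
  c(Jaccard        = a / (ai + bj - a),
    FowlkesMallows = a / sqrt(ai * bj),
    ARI            = (a - ai * bj / n2) / (0.5 * (ai + bj) - ai * bj / n2))
}

# Labelings with a perfect one-to-one correspondence give all indices equal to 1
pair_counting_indices(c(1, 1, 2, 2), c(2, 2, 1, 1))
```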

Previous works have shown that there is no single internal cluster validation index that outperforms the other indices [ 100 , 101 ]. In [ 101 ] the authors compare a set of internal cluster validation indices in many distinct scenarios, indicating that the Silhouette index yielded the best results in most cases.


Results and discussion

The accuracy of each considered clustering algorithm was evaluated using three methodologies. In the first methodology, we consider the default parameters of the algorithms provided by the R packages. The reason for measuring performance using the default parameters is to consider the case where a researcher applies the classifier to a dataset without any parameter adjustment. This is a common scenario when the researcher is not a machine learning expert. In the second methodology, we quantify the influence of the algorithms' parameters on the accuracy. This is done by varying a single parameter of an algorithm while keeping the others at their default values. The third methodology consists of analyzing the performance obtained when randomly varying all parameters of a classifier. This procedure allows the quantification of properties such as the maximum accuracy attained and the sensitivity of the algorithm to parameter variation.

Performance when using default parameters

In this experiment, we evaluated the performance of the classifiers on all datasets described in Section Artificial datasets . All unsupervised algorithms were set with their default configuration of parameters. For each algorithm, we divide the results according to the number of features contained in the dataset. In other words, for a given number of features, F , we used datasets with C = {2, 10, 50} classes and N e = {5, 50, 100} objects for each class. Thus, the performance results obtained for each F correspond to the performance averaged over distinct numbers of classes and objects per class. We note that the algorithm based on subspaces cannot be applied to datasets containing 2 features, and therefore its accuracy was not quantified for such datasets.

In Fig 3 , we show the obtained values for the four considered performance metrics. The results indicate that all performance metrics provide similar results. Also, the hierarchical method seems to be strongly affected by the number of features in the dataset. In fact, when using 50 and 200 features the hierarchical method provided lower accuracy. The k-means, spectral, optics and dbscan methods benefit from an increment in the number of features. Interestingly, the hcmodel has a better performance on the datasets containing 10 features than on those containing 2, 50 and 200 features, which suggests an optimum performance of this algorithm for datasets containing around 10 features. It is also clear that for 2 features the performance of the algorithms tends to be similar, with the exception of the optics and dbscan methods. On the other hand, a larger number of features induces marked differences in performance. In particular, for 200 features, the spectral algorithm provides the best results among all classifiers.

Fig 3. All artificial datasets were used for evaluation. The averages were calculated separately for datasets containing 2, 10 and 50 features. The considered performance indexes are (a) adjusted Rand, (b) Jaccard, (c) normalized mutual information and (d) Fowlkes-Mallows.

https://doi.org/10.1371/journal.pone.0210236.g003

We use the Kruskal-Wallis test [ 102 ], a nonparametric test, to assess whether the differences in performance among the clustering methods are statistically significant for each number of features. First, we test whether the difference in performance is significant for 2 features. For this case, the Kruskal-Wallis test returns a p-value of p = 6.48 × 10 −7 , with a chi-squared statistic of χ 2 = 41.50. Therefore, the difference in performance is statistically significant when considering all algorithms. For datasets containing 10 features, a p-value of p = 1.53 × 10 −8 is returned by the Kruskal-Wallis test, with a chi-squared statistic of χ 2 = 52.20. For 50 features, the test returns a p-value of p = 1.56 × 10 −6 , with a chi-squared statistic of χ 2 = 41.67. For 200 features, the test returns a p-value of p = 2.49 × 10 −6 , with a chi-squared statistic of χ 2 = 40.58. Therefore, the null hypothesis of the Kruskal-Wallis test is rejected. This means that the algorithms indeed have significant differences in performance for 2, 10, 50 and 200 features, as indicated in Fig 3 .
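The sketch below illustrates how such a test can be run in R with stats::kruskal.test, grouping ARI values by algorithm. The results data frame is a toy construction; the algorithm names and values are not from the study.

```r
set.seed(7)
# toy table: one ARI value per (algorithm, dataset) pair
results <- data.frame(
  algorithm = rep(c("kmeans", "spectral", "hierarchical"), each = 10),
  ari       = c(runif(10, 0.5, 0.7), runif(10, 0.6, 0.8), runif(10, 0.3, 0.5))
)

# nonparametric test for differences among the algorithms' ARI distributions
kruskal.test(ari ~ algorithm, data = results)
# a small p-value rejects the hypothesis that all algorithms perform equally
```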

In order to verify the influence of the number of objects used for classification, we also calculated the average accuracy for datasets separated according to the number of objects, N e . The result is shown in Fig 4 . We observe that the impact that changing N e has on the accuracy depends on the algorithm. Surprisingly, the hierarchical, k-means and clara methods attain lower accuracy when more data are used. This result indicates that these algorithms tend to be less robust with respect to the larger overlap between the clusters caused by an increase in the number of objects. We also observe that a larger N e enhances the performance of the hcmodel, optics and dbscan algorithms. This result is in agreement with [ 90 ].

Fig 4. All artificial datasets were used for evaluation. The averages were calculated separately for datasets containing 5, 50 and 100 objects per class. The considered performance indexes are (a) adjusted Rand, (b) Jaccard, (c) normalized mutual information and (d) Fowlkes-Mallows.

https://doi.org/10.1371/journal.pone.0210236.g004

In most clustering algorithms, the size of the data has an effect on the clustering quality. In order to quantify this effect, we considered a scenario where the data have a large number of instances. Datasets with F = 5, C = 10 and Ne = {5, 50, 500, 5000} instances per class were created. These datasets are referred to as DB10C5F. In Fig 5 we can observe that the subspace and spectral methods lead to improved accuracy when the number of instances increases. On the other hand, the size of the dataset does not seem to influence the accuracy of the kmeans, clara, hcmodel and EM algorithms. For the spectral, hierarchical and hcmodel algorithms, the accuracy could not be calculated when 5000 instances per class were used, due to the amount of memory required by these methods. For example, in the case of the spectral method, considerable processing power and memory are required to compute and store the kernel matrix when the algorithm is executed. When the size of the dataset is too small, we see that the subspace algorithm results in low accuracy.

Fig 5. The plots correspond to the ARI, Jaccard and FM indexes averaged for all datasets containing 10 classes and 5 features (DB10C5F).

https://doi.org/10.1371/journal.pone.0210236.g005

It is also interesting to verify the performance of the clustering algorithms when setting distinct values for the expected number of classes K in the dataset. Such a value is usually not known beforehand in real datasets. For instance, one might expect the data to contain 10 classes and, as a consequence, set K = 10 in the algorithm, but the objects may actually be better accommodated into 12 classes. An accurate algorithm should still provide reasonable results even when a wrong number of classes is assumed. Thus, we varied K for each algorithm and verified the resulting variation in accuracy. Observe that the optics and dbscan methods were not considered in this analysis, as they do not have a parameter for setting the number of classes. In order to simplify the analysis, we only considered datasets comprising objects described by 10 features and divided into 10 classes (DB10C10F). The results are shown in Fig 6 . The top plots correspond to the average ARI and Jaccard indexes calculated for DB10C10F, while the Silhouette and Dunn indexes are shown at the bottom of the figure. The results indicate that setting K < 10 leads to worse performance than the cases where K > 10, which suggests that a slight overestimation of the number of classes has a smaller effect on the performance. Therefore, a good strategy for choosing K seems to be setting it to values slightly larger than the expected number of classes. An interesting behavior is observed for hierarchical clustering: the accuracy improves as the expected number of classes increases. This behavior is due to the default value of the method parameter, which is set to “average”. The “average” value means that the unweighted pair group method with arithmetic mean (UPGMA) is used to agglomerate the points. In UPGMA, the distance between two clusters is the average of the dissimilarities between the points in one cluster and the points in the other cluster. The moderate performance of UPGMA in recovering the original groups, even with high subgroup differentiation, is probably a consequence of the fact that UPGMA tends to produce unbalanced clusters, that is, the majority of the objects are assigned to a few clusters while many other clusters contain only one or two objects.

Fig 6. The upper plots correspond to the ARI and Jaccard indices averaged for all datasets containing 10 classes and 10 features (DB10C10F). The lower plots correspond to the Silhouette and Dunn indices for the same dataset. The red line indicates the actual number of clusters in the dataset.

https://doi.org/10.1371/journal.pone.0210236.g006

The external validation indices show that most of the clustering algorithms correctly identify the 10 main clusters in the dataset. Naturally, this knowledge would not be available in a real-life cluster analysis. For this reason, we also consider internal validation indices, which provide feedback on the partition quality. Two internal validation indices were considered, the Silhouette index (defined in the range [−1, 1]) and the Dunn index (defined in the range [0, ∞)). These indices were applied to the DB10C10F and DB10C2F datasets while varying the expected number of clusters K . The results are presented in Figs 6 and 7 . In Fig 6 we can see that the results obtained for the different algorithms are mostly similar. The results for the Silhouette index indicate high accuracy around K = 10. The Dunn index displays a slightly lower performance, misestimating the correct number of clusters for the hierarchical algorithm. In Fig 7 , the Silhouette and Dunn indices show similar behavior.

Fig 7. The upper plots correspond to the ARI and Jaccard indices averaged for all datasets containing 10 classes and 2 features (DB10C2F). The lower plots correspond to the Silhouette and Dunn indices for the same dataset. The red line indicates the actual number of clusters in the dataset.

https://doi.org/10.1371/journal.pone.0210236.g007
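The following sketch shows how these two internal indices can be computed in R; silhouette() comes from the cluster package and dunn() from the clValid package (the use of clValid is an assumption of this illustration, not necessarily the implementation used in the study). The data and the k-means partition are toy examples.

```r
library(cluster)   # silhouette()
library(clValid)   # dunn()

x <- rbind(matrix(rnorm(100, mean = 0), ncol = 2),
           matrix(rnorm(100, mean = 4), ncol = 2))
labels <- kmeans(x, centers = 2)$cluster
d <- dist(x)

sil <- silhouette(labels, d)
mean(sil[, "sil_width"])                            # average Silhouette index, in [-1, 1]
dunn(distance = as.matrix(d), clusters = labels)    # Dunn index, in [0, Inf)
```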

The results obtained for the default parameters are summarized in Table 2 . The table is divided into four parts, each corresponding to a performance metric. For each performance metric, the value in row i and column j of the table represents the average performance of the method in row i minus the average performance of the method in column j . The last column of the table indicates the average performance of each algorithm. We note that the averages were taken over all generated datasets.

Table 2. In general, the spectral algorithm provides the highest accuracy rate among all evaluated methods.

https://doi.org/10.1371/journal.pone.0210236.t002

The results shown in Table 2 indicate that the spectral algorithm tends to outperform the other algorithms by at least 10%. On the other hand, the hierarchical method attained lower performance in most of the considered cases. Another interesting result is that k-means and clara provided equivalent performance when considering all datasets. In the light of these results, the spectral method could be preferred when no optimization of parameter values is performed.

One-dimensional analysis

In this analysis, each parameter of a clustering algorithm is varied in turn while the others are kept at their default values. For each case, we compute the difference S between the ARI obtained with the modified parameter and the ARI obtained with the default configuration, and summarize it through its average 〈 S 〉, standard deviation Δ S and maximum value max S .

In addition to the aforementioned quantities, we also measured, for each dataset, the maximum accuracy obtained when varying each single parameter of the algorithm. We then calculate the average of the maximum accuracies, 〈max Acc〉, obtained over all considered datasets. In Table 3 , we show the values of 〈 S 〉, max S , Δ S and 〈max Acc〉 for datasets containing two features. When considering a two-class problem (DB2C2F), a significant improvement in performance (〈 S 〉 = 10.75% and 〈 S 〉 = 13.35%) was observed when varying parameters modelName , minPts and kpar of, respectively, the EM, optics and spectral methods. For all other cases, only a minor average gain in performance was observed. For the 10-class problem, we notice that an inadequate value of parameter method of the hierarchical algorithm can lead to a substantial loss of accuracy (16.15% on average). In most cases, however, the average variation in performance was small.

Table 3. This analysis is based on the performance (measured through the ARI index) obtained when varying a single parameter of the clustering algorithm, while maintaining the others in their default configuration. 〈 S 〉, Δ S and max S are associated with the average, standard deviation and maximum difference between the performance obtained when varying a single parameter and the performance obtained for the default parameter values. We also measure 〈max Acc〉, the average of the best ARI values obtained when varying each parameter, where the average is calculated over all considered datasets.

https://doi.org/10.1371/journal.pone.0210236.t003
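A sketch of this one-dimensional procedure is given below for a single, illustrative case: the nstart parameter of kmeans is varied while all other parameters keep their defaults, and the difference in ARI with respect to the default run is recorded. The data, the parameter values and the use of mclust::adjustedRandIndex are assumptions of the illustration.

```r
library(mclust)   # adjustedRandIndex()

set.seed(11)
x <- rbind(matrix(rnorm(200, mean = 0), ncol = 2),
           matrix(rnorm(200, mean = 3), ncol = 2))
true_labels <- rep(1:2, each = 100)

# performance with the default configuration
default_ari <- adjustedRandIndex(kmeans(x, centers = 2)$cluster, true_labels)

# vary a single parameter (nstart), keep everything else at the default
nstart_values <- c(1, 5, 10, 25, 50)
S <- sapply(nstart_values, function(v) {
  ari <- adjustedRandIndex(kmeans(x, centers = 2, nstart = v)$cluster, true_labels)
  ari - default_ari   # difference with respect to the default configuration
})
```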

In Table 4 , we show the values of 〈 S 〉, max S , Δ S and 〈max Acc〉 for datasets described by 10 features. For the two-class clustering problem, a moderate improvement can be observed for the k-means, hierarchical and optics algorithms through the variation of, respectively, parameters nstart , method and minPts . A large increase in accuracy was observed when varying parameter modelName of the EM method. Changing the modelName used by the algorithm led, on average, to an improvement of 18.8%. A similar behavior was obtained when the number of classes was set to C = 10. For 10 classes, the variation of method in the hierarchical algorithm provided an average improvement of 6.72%. A large improvement was also observed when varying parameter modelName of the EM algorithm, with an average improvement of 13.63%.

Table 4. This analysis is based on the performance obtained when varying a single parameter, while maintaining the others in their default configuration. 〈 S 〉, Δ S and max S are associated with the average, standard deviation and maximum difference between the performance obtained when varying a single parameter and the performance obtained for the default parameter values. We also measure 〈max Acc〉, the average of the best ARI values obtained when varying each parameter, where the average is calculated over all considered datasets.

https://doi.org/10.1371/journal.pone.0210236.t004

Differently from the parameters discussed so far, the variation of some parameters plays a minor role in the discriminative power of the clustering algorithms. This is the case, for instance, of parameters kernel and iter of the spectral clustering algorithm and parameter iter.max of the kmeans clustering. In some cases, the unidimensional variation of a parameter even resulted in a reduction of performance. For instance, the variation of min.individuals and models of the subspace algorithm provided an average loss of accuracy on the order of 〈 S 〉 = 20%, depending on the dataset. Similar behavior is observed for the dbscan method, for which the variation of minPts causes an average loss of accuracy of 20.32%. Parameters metric and rngR of the clara algorithm also led to a marked decrease in performance.

In Table 5 , we show the values of 〈 S 〉, max S , Δ S and 〈max Acc〉 for datasets described by 200 features. For the two-class clustering problem, a significant improvement in performance was observed when varying nstart in the k-means method, method in the hierarchical algorithm, modelName in the hcmodel method and modelName in the EM method. On the other hand, when varying metric , min.individuals and use in, respectively, the clara, subspace and hcmodel methods, an average loss of accuracy larger than 10% was observed. The largest loss of accuracy occurs for parameter minPts (49.47%) of the dbscan method. For the 10-class problem, similar results were observed, with the exception of the clara method, for which any parameter change resulted in a large loss of accuracy.

Table 5. This analysis is based on the performance obtained when varying a single parameter, while maintaining the others in their default configuration. 〈 S 〉, Δ S and max S are associated with the average, standard deviation and maximum difference between the performance obtained when varying a single parameter and the performance obtained for the default parameter values. We also measure 〈max Acc〉, the average of the best ARI values obtained when varying each parameter.

https://doi.org/10.1371/journal.pone.0210236.t005

Multi-dimensional analysis

A complete analysis of the performance of a clustering algorithm requires the simultaneous variation of all of its parameters. Nevertheless, such a task is difficult to accomplish in practice, given the large number of parameter combinations that would need to be taken into account. Therefore, here we consider a random variation of parameters aimed at obtaining a sampling of each algorithm's performance over its complete multi-dimensional parameter space.


The performance of the algorithms for the different sets of parameters was evaluated according to the following procedure. Consider the histogram of ARI values obtained for the random sampling of parameters of the k-means algorithm, shown in Fig 8 . The red dashed line indicates the ARI value obtained for the default parameters of the algorithm. The light blue shaded region indicates the parameter configurations for which the performance of the algorithm improved. From this result we calculate four main measures. The first, which we call the p-value, is given by the area of the blue region divided by the total histogram area, multiplied by 100 so as to result in a percentage. The p-value represents the percentage of parameter configurations for which the algorithm performance improved when compared to the default parameter configuration. The second, third and fourth measures are given by the mean, 〈 R 〉, standard deviation, Δ R , and maximum value, max R , of the relative performance over all cases where the performance improved (i.e., the blue shaded region in Fig 8 ). The relative performance is calculated as the difference in performance between a given realization of parameter values and the default parameters. The mean indicates the expected improvement of the algorithm under random variation of its parameters. The standard deviation represents the stability of such improvement, that is, how certain one can be that the performance will improve when performing such a random variation. The maximum value indicates the largest improvement obtained when random parameters are considered. We also measured the average of the maximum accuracies, 〈max ARI〉, obtained for each dataset when randomly selecting the parameters. In the S2 File of the supplementary material we show the distribution of ARI values obtained for the random sampling of parameters for all clustering algorithms considered in our analysis.

Fig 8. The algorithm was applied to dataset DB10C10F, and 500 sets of parameters were drawn.

https://doi.org/10.1371/journal.pone.0210236.g008
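The sketch below illustrates this procedure for kmeans: parameters are drawn at random (the ranges and the parameters chosen to vary are arbitrary assumptions), the ARI is computed for each draw, and the p-value, 〈 R 〉, Δ R and max R are derived from the draws that outperform the default configuration.

```r
library(mclust)   # adjustedRandIndex()

set.seed(21)
x <- rbind(matrix(rnorm(200, mean = 0), ncol = 2),
           matrix(rnorm(200, mean = 3), ncol = 2))
true_labels <- rep(1:2, each = 100)

default_ari <- adjustedRandIndex(kmeans(x, centers = 2)$cluster, true_labels)

n_draws <- 500
sampled_ari <- replicate(n_draws, {
  fit <- kmeans(x, centers = 2,
                iter.max  = sample(10:100, 1),   # random parameter draws
                nstart    = sample(1:50, 1),
                algorithm = sample(c("Hartigan-Wong", "Lloyd", "MacQueen"), 1))
  adjustedRandIndex(fit$cluster, true_labels)
})

# summaries over the draws that outperform the default configuration
# (undefined if no draw improves on the default)
improved <- sampled_ari[sampled_ari > default_ari] - default_ari
p_value  <- 100 * length(improved) / n_draws   # % of draws beating the default
R_mean   <- mean(improved)                     # <R>
R_sd     <- sd(improved)                       # Delta R
R_max    <- max(improved)                      # max R
```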

In Table 6 we show the performance (ARI) of the algorithms for dataset DB2C2F when applying the aforementioned random selection of parameters. The optics and EM methods are the only algorithms with a p-value larger than 50%. Also, a high average gain in performance was observed for the EM (22.1%) and hierarchical (30.6%) methods. Moderate improvement was observed for the hcmodel, kmeans, spectral, optics and dbscan algorithms.

Table 6. The p-value represents the probability that the classifier set with a random configuration of parameters outperforms the same classifier set with its default parameters. 〈 R 〉, Δ R and max R represent the average, standard deviation and maximum value of the improvement obtained when random parameters are considered. Column 〈max ARI〉 indicates the average of the best accuracies obtained for each dataset.

https://doi.org/10.1371/journal.pone.0210236.t006

The performance of the algorithms for dataset DB10C2F is presented in Table 7 . A high p-value was obtained for the optics (96.6%), EM (76.5%) and k-means (77.7%) methods. Nevertheless, the average improvement in performance was relatively low for most algorithms, with the exception of the optics method, which led to an average improvement of 15.9%.


https://doi.org/10.1371/journal.pone.0210236.t007

A more marked variation in performance was observed for dataset DB2C10F, with results shown in Table 8 . The EM, kmeans, hierarchical and optics clustering algorithms resulted in a p-value larger than 50%. In such cases, when the performance was improved, the average gain in performance was, respectively, 30.1%, 18.0%, 25.9% and 15.5%. This means that the random variation of parameters might represent a valid approach for improving these algorithms. In fact, with the exception of clara and dbscan, all methods displayed a significant average improvement in performance for this dataset. The results also show that a maximum accuracy of 100% can be achieved for the EM and subspace algorithms.


https://doi.org/10.1371/journal.pone.0210236.t008

In Table 9 we show the performance of the algorithms for dataset DB10C10F. The p-values for the EM, clara, k-means and optics indicate that the random selection of parameters usually improves the performance of these algorithms. The hierarchical algorithm can be significantly improved by the considered random selection of parameters. This is a consequence of the default value of parameter method , which, as discussed in the previous section, is not appropriate for this dataset.


https://doi.org/10.1371/journal.pone.0210236.t009

The performance of the algorithms for the dataset DB2C200F is presented in Table 10 . A high p-value was obtained for the EM (65.1%) and k-means (65.6%) algorithms. The average gain in performance in such cases was 39.1% and 35.4%, respectively. On the other hand, the spectral and subspace methods resulted in an improved ARI in only approximately 16% of the cases. Interestingly, the random variation of parameters led to, on average, large performance improvements for all algorithms.


https://doi.org/10.1371/journal.pone.0210236.t010

In Table 11 we show the performance of the algorithms for dataset DB10C200F. A high p-value was obtained for all methods. On the other hand, the average improvement in accuracy tended to be lower than in the case of the dataset DB2C200F.


https://doi.org/10.1371/journal.pone.0210236.t011

Conclusions

Clustering data is a complex task involving the choice between many different methods, parameters and performance metrics, with implications in many real-world problems [ 63 , 103 – 108 ]. Consequently, the analysis of the advantages and pitfalls of clustering algorithms is also a difficult task that has received much attention. Here, we approached this task focusing on a comprehensive methodology for generating a large diversity of heterogeneous datasets with precisely defined properties, such as the distances between classes and the correlations between features. Using packages in the R language, we compared the performance of nine popular clustering methods applied to 400 artificial datasets. Three situations were considered: default parameters, single parameter variation and random variation of parameters. It should nevertheless be borne in mind that all results reported in this work refer to specific configurations of normally distributed data and specific algorithmic implementations, so that different performances may be obtained in other situations. Besides serving as practical guidance for the application of clustering methods when the researcher is not an expert in data mining techniques, a number of interesting results regarding the considered clustering methods were obtained.

Regarding the default parameters, the difference in performance of the clustering methods was statistically significant even for low-dimensional datasets. Specifically, the Kruskal-Wallis test on the differences in performance when 2 features were considered resulted in a p-value of p = 6.48 × 10 −7 (with a chi-squared statistic of χ 2 = 41.50). For 10 features, a p-value of p = 1.53 × 10 −8 ( χ 2 = 52.20) was obtained. Considering 50 features resulted in a p-value of p = 1.56 × 10 −6 for the Kruskal-Wallis test ( χ 2 = 41.67). For 200 features, the obtained p-value was p = 2.49 × 10 −6 ( χ 2 = 40.58).
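For readers wishing to reproduce this kind of test, the Kruskal-Wallis statistic and p-value can be obtained with SciPy. The snippet below uses randomly generated ARI samples for three hypothetical methods purely for illustration; it does not reproduce the numbers reported above.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Illustrative ARI samples for three hypothetical clustering methods
ari_by_method = [rng.uniform(0.4, 0.9, size=25) for _ in range(3)]

chi2, p = stats.kruskal(*ari_by_method)  # Kruskal-Wallis H-test
print(f"chi-squared = {chi2:.2f}, p = {p:.2e}")
```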

The spectral method provided the best performance when using default parameters, with an Adjusted Rand Index (ARI) of 68.16%, as indicated in Table 2 . In contrast, the hierarchical method yielded an ARI of 21.34%. It is also interesting that underestimating the number of classes in the dataset led to worse performance than overestimating it. This was observed for all algorithms and is in accordance with previous results [ 44 ].

Regarding single parameter variations, for datasets containing 2 features, the hierarchical, optics and EM methods showed significant performance variation. On the other hand, for datasets containing 10 or more features, most methods could be readily improved through changes to selected parameters.

With respect to the multidimensional analysis, for datasets containing ten classes and two features the performance of the algorithms under the multidimensional selection of parameters was similar to that obtained with the default parameters. This suggests that the algorithms are not sensitive to parameter variations for this dataset. For datasets containing two classes and ten features, the EM, hcmodel, subspace and hierarchical algorithms showed significant gains in performance. The EM algorithm also resulted in a high p-value (70.8%), which indicates that many parameter values for this algorithm can provide better results than the default configuration. For datasets containing ten classes and ten features, the improvement was significantly lower for almost all the algorithms, with the exception of the hierarchical clustering. When a large number of features was considered, such as in the case of the datasets containing 200 features, large gains in performance were observed for all methods.

In Tables 12 , 13 and 14 we show a summary of the best accuracies obtained during our analysis. The tables contain the best performance, measured as the ARI of the resulting partitions, achieved by each algorithm in the three considered situations (default, one- and multi-dimensional adjustment of parameters). The results correspond to datasets DB2C2F, DB10C2F, DB2C10F, DB10C10F, DB2C200F and DB10C200F. We observe that, for datasets containing 2 features, the algorithms tend to show similar performance, especially when the number of classes is increased. For datasets containing 10 features or more, the spectral algorithm seems to consistently provide the best performance, although the EM, hierarchical, k-means and subspace algorithms can also achieve similar performance with some parameter tuning. It should be observed that several clustering algorithms, such as optics and dbscan, are designed for other data distributions, such as elongated or S-shaped clusters [ 72 , 74 ]. Therefore, different results could be obtained for non-normally distributed data.

https://doi.org/10.1371/journal.pone.0210236.t012

https://doi.org/10.1371/journal.pone.0210236.t013

https://doi.org/10.1371/journal.pone.0210236.t014

Other algorithms could be compared in future extensions of this work. An important aspect that could also be explored is to consider other statistical distributions for modeling the data. In addition, an analogous approach could be applied to semi-supervised classification.

Supporting information

S1 File. Description of the clustering algorithms' parameters.

We provide a brief description of the parameters of the clustering algorithms considered in the main text.

https://doi.org/10.1371/journal.pone.0210236.s001

S2 File. Clustering performance obtained for the random selection of parameters.

The file contains figures showing the histograms of ARI values obtained for identifying the clusters of, respectively, datasets DB10C10F and DB2C10F using a random selection of parameters. Each plot corresponds to a clustering method considered in the main text.

https://doi.org/10.1371/journal.pone.0210236.s002

  • 7. Aggarwal CC, Zhai C. In: Aggarwal CC, Zhai C, editors. A Survey of Text Clustering Algorithms. Boston, MA: Springer US; 2012. p. 77–128.
  • 13. Joachims T. Text categorization with support vector machines: Learning with many relevant features. In: European conference on machine learning. Springer; 1998. p. 137–142.
  • 14. Witten IH, Frank E. Data Mining: Practical Machine Learning Tools and Techniques. 2nd ed. San Francisco: Morgan Kaufmann; 2005.
  • 37. Berkhin P. In: Kogan J, Nicholas C, Teboulle M, editors. A Survey of Clustering Data Mining Techniques. Berlin, Heidelberg: Springer Berlin Heidelberg; 2006. p. 25–71.
  • 42. R Development Core Team. R: A Language and Environment for Statistical Computing; 2006. Available from: http://www.R-project.org .
  • 44. Erman J, Arlitt M, Mahanti A. Traffic classification using clustering algorithms. In: Proceedings of the 2006 SIGCOMM workshop on mining network data. ACM; 2006. p. 281–286.
  • 47. Parsons L, Haque E, Liu H. Evaluating subspace clustering algorithms. In: Workshop on Clustering High Dimensional Data and its Applications, SIAM Int. Conf. on Data Mining. Citeseer; 2004. p. 48–56.
  • 48. Burdick D, Calimlim M, Gehrke J. MAFIA: A Maximal Frequent Itemset Algorithm for Transactional Databases. In: Proceedings of the 17th International Conference on Data Engineering. Washington, DC, USA: IEEE Computer Society; 2001. p. 443–452.
  • 51. UCI Machine Learning Repository. Breast Cancer Wisconsin dataset. Available from: https://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/ .
  • 52. Ultsch A. Clustering with SOM: U*C. In: Proceedings of the 5th Workshop on Self-Organizing Maps. vol. 2; 2005. p. 75–82.
  • 54. Aggarwal CC, Reddy CK. Data Clustering: Algorithms and Applications. vol. 2. 1st ed. Chapman & Hall/CRC; 2013.
  • 58. Jain AK, Topchy A, Law MH, Buhmann JM. Landscape of clustering algorithms. In: Pattern Recognition, 2004. ICPR 2004. Proceedings of the 17th International Conference on. vol. 1. IEEE; 2004. p. 260–263.
  • 64. Kaufman L, Rousseeuw PJ. Finding groups in data: an introduction to cluster analysis. Series in Probability & Mathematical Statistics; 2009.
  • 65. Arthur D, Vassilvitskii S. k-means++: The advantages of careful seeding. In: Proceedings of the eighteenth annual ACM-SIAM symposium on Discrete algorithms. Society for Industrial and Applied Mathematics; 2007. p. 1027–1035.
  • 66. Sequeira K, Zaki M. ADMIT: anomaly-based data mining for intrusions. In: Proceedings of the eighth ACM SIGKDD international conference on Knowledge discovery and data mining. ACM; 2002. p. 386–395.
  • 67. Williams GJ, Huang Z. Mining the knowledge mine. In: Australian Joint Conference on Artificial Intelligence. Springer; 1997. p. 340–348.
  • 68. MacQueen J. Some methods for classification and analysis of multivariate observations. In: Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, Volume 1: Statistics. Berkeley, Calif: University of California Press; 1967. p. 281–297.
  • 70. Kaufman L, Rousseeuw PJ. Finding Groups in Data: an introduction to cluster analysis. John Wiley & Sons; 1990.
  • 71. Han J, Kamber M. Data Mining: Concepts and Techniques. vol. 2. 2nd ed. Morgan Kaufmann; 2006.
  • 73. Ankerst M, Breunig MM, Kriegel HP, Sander J. OPTICS: Ordering Points To Identify the Clustering Structure. ACM Press; 1999. p. 49–60.
  • 74. Ester M, Kriegel HP, Sander J, Xu X. A Density-based Algorithm for Discovering Clusters a Density-based Algorithm for Discovering Clusters in Large Spatial Databases with Noise. In: Proceedings of the Second International Conference on Knowledge Discovery and Data Mining. KDD’96. AAAI Press; 1996. p. 226–231.
  • 86. Ng AY, Jordan MI, Weiss Y. On spectral clustering: Analysis and an algorithm. In: Advances in Neural Information Processing Systems 14. MIT Press; 2001. p. 849–856.
  • 95. Horn RA, Johnson CR. Matrix Analysis. 2nd ed. New York, NY, USA: Cambridge University Press; 2012.
  • 96. Liu Y, Li Z, Xiong H, Gao X, Wu J. Understanding of internal clustering validation measures. In: Data Mining (ICDM), 2010 IEEE 10th International Conference on. IEEE; 2010. p. 911–916.
  • 98. Cover TM, Thomas JA. Elements of Information Theory. vol. 2. Wiley; 2012.
  • 102. McKight PE, Najab J. Kruskal-Wallis Test. Corsini Encyclopedia of Psychology. 2010;.

  • Open access
  • Published: 17 September 2024

Quantum-inspired clustering with light

  • Miguel Varga 1 ,
  • Pablo Bermejo 2 ,
  • Ruben Pellicer-Guridi 1 , 2 ,
  • Román Orús 2 , 3 , 4 &
  • Gabriel Molina-Terriza 1 , 2 , 3  

Scientific Reports volume 14, Article number: 21726 (2024)


  • Information technology
  • Optics and photonics
  • Quantum physics

This article introduces a novel approach to perform the simulation of a single qubit quantum-inspired algorithm using laser beams. Leveraging the polarization states of photonic qubits, and inspired by variational quantum eigensolvers, we develop a variational quantum-inspired algorithm implementing a clustering procedure following the approach proposed by some of us in SciRep 13, 13284 (2023). A key aspect of our research involves the utilization of non-orthogonal states within the photonic domain, harnessing the potential of polarization schemes to reproduce unitary circuits. By mapping these non-orthogonal states into polarization states, we achieve an efficient and versatile quantum information processing unit which serves as a clustering device for a diverse set of datasets.


Introduction

In recent years, the field of quantum computing has witnessed a surge in novel proposals for quantum algorithms, many of which are strategically tailored to thrive in the Noisy Intermediate-Scale Quantum (NISQ) era, a period characterized by the presence of error-prone quantum hardware 1 , 2 . While quantum technology continues to evolve, the quest for more robust and reliable quantum algorithms remains paramount. Amidst this dynamic landscape, the ability to simulate quantum algorithms using classical computers for practical applications has not lost its relevance. As quantum algorithms diversify across different quantum computing architectures, a parallel trend is emerging: the development of increasingly efficient classical simulation methods. A good example of these methods is Tensor Networks 3 , which have recently proven capable of simulating the complex dynamics of many-qubit systems 4 , 5 . In addition, classical platforms are continually being innovated to replicate qubit behaviors, adding to the repertoire of simulation tools 6 , 7 .

In this paper, we introduce an alternative approach to simulate variational quantum algorithms 8 by leveraging the capabilities of a photonic device as a dedicated processing unit. By using light, we can replicate single-qubit rotations, and therefore implement tasks such as Variational Quantum Eigensolvers (VQE) 9 and Variational Quantum Clustering (VQC) 10 . Photonic quantum devices offer some inherent characteristics which make them compatible with such a purpose: (i) they present a processing unit which can be entirely mapped to the logical gates of a single-qubit circuit, (ii) they provide accurate control over noise, and (iii) they can eventually be scaled up by introducing actual quantum regimes for the input state. Using such an architecture, we implement a photonic version of the quantum clustering algorithm of Ref. 10 for a single qubit. Clustering methods represent an important research avenue in classical ML, since they can be applied to many practical problems 11 , 12 , 13 . This particular algorithm has a simple structure that allows a clustering variant to be implemented on NISQ devices, so that it can be run on a single-qubit-like device, as in this case. Therefore, our photonic classical device mimics the behavior of a quantum circuit for a single qubit. Our results are a first test of the capabilities of such a classical platform to simulate quantum algorithms.

The algorithm at the core of this experiment is an unsupervised quantum clustering algorithm, designed for the classification of data points within a given dataset where no prior information about the system is available 14 . In such a scenario there is no training stage, and the implementation follows that of Ref. 10 , where a variational quantum circuit is optimized self-consistently so as to minimize the distance between points and cluster centroids. As explained in that reference, the procedure iteratively optimizes a set of variational parameters based on a reference cost function to determine the optimal configuration for the dataset. Notably, this approach relies solely on the intrinsic features of the dataset without any prior labeling.

To achieve this, the first step is to design a cost function, built in such a way that its minimization provides the configuration of the optimal classification. The reference Hamiltonian, specifically constructed for this purpose, can be written in the following form (we refer the reader to Ref. 10 for its precise construction):
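$$H \;=\; \sum_{i<j}^{N} \delta_{a,b}\; d(\vec{x}_i,\vec{x}_j)\left( 1-f_i^{a}\right) \left( 1-f_j^{a}\right) \;+\; \lambda \sum_{i=1}^{N} d(\vec{x}_i,\vec{c}_i) \qquad (1)$$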

In this equation, N is the number of data points, subscripts i ,  j represent each data point within the dataset, and a ,  b denote the family labels associated with points i and j , respectively. The maximum number of families, k , is fixed beforehand. \(\delta _{a,b}\) is the Kronecker delta, retaining only contributions when points i and j belong to the same cluster. In our coordinate system, each state \(\vec {x}_i\) is represented by a point \((\psi ,\chi )\) on the surface of the Poincaré polarization sphere, as shown in Fig.  1 (b). Family labels a ,  b also represent points defined on the Poincaré sphere, belonging to the set of maximally orthogonal states, whose number of elements depends on the number of clusters, k , involved in the classification 10 . In each iteration, before evaluating the cost function, we have to determine which cluster each data point belongs to. This is done by calculating the fidelity \(f_i^a\) of each state \(\vec {x}_i\) with respect to every cluster label a and assigning \(\vec {x}_i\) to the closest label. The cost function takes into account the distance between data points \(d(\vec {x}_i, \vec {x}_j)\) , given by the l \(^2\) -norm, the distance between a data point and the centroid of its corresponding cluster \(d(\vec {x}_i, \vec {c}_i)\) , also given by the l \(^2\) -norm, as well as the fidelity \(f_i^a \equiv |\langle \psi _i \vert \psi ^a\rangle |^2\) between a variational quantum state \(\vert \psi _i\rangle\) for data point \(\vec {x}_i\) and a reference state \(\vert \psi ^a\rangle\) for a given cluster \(a \in k\) . Point i is assigned to the cluster a , among the k possible families, for which the fidelity \(f_i^c\) , \(\forall {c}\in k\) , is maximal. Therefore, all fidelities \(f_i^c\) , \(\forall {c}\in k\) , are precomputed for each data point before the cost function is evaluated (one could alleviate this calculation by incorporating information concerning, for example, k , so that \(f_i^a\) can be bounded). This fidelity is in turn determined by the position of each data point on the Poincaré polarization sphere, which depends on a set of variational parameters. In addition, \(\lambda\) serves as a regularization parameter, allowing for different penalizations of the distances between dataset points and weighting the relative importance of the distances between data points and their respective cluster centroids. Indeed, the variational nature of the algorithm falls entirely within the function \(f_i^a\) .

This minimization is built so that points i and j originally belonging to the same family (and thus having a small \(d(\vec {x}_i,\vec {x}_j)\) ) are constrained to stay in the same cluster by diminishing the value \(\left( 1-f_i^a\right) \left( 1-f_j^a\right)\) . If points belonging to the same family, with such a small distance \(d(\vec {x}_i,\vec {x}_j)\) , were assigned to different families, this contribution would not exist in the cost function; however, points that do not belong to the same family could then be pushed to stay in the same cluster by proximity, adding a larger contribution (larger \(d(\vec {x}_i,\vec {x}_j)\) and larger \(\left( 1-f_i^a\right) \left( 1-f_j^a\right)\) ) to the cost than the one with the correct label assignment.
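To make the procedure more concrete, the following sketch illustrates the assignment of labels by maximal fidelity and the evaluation of a cost of the form of Eq. ( 1 ). It is a schematic NumPy illustration, not the implementation used in the experiment; names such as clustering_cost and lam are ours.

```python
import numpy as np

def fidelity(psi_i, psi_a):
    """Fidelity |<psi_i|psi_a>|^2 between two normalized single-qubit states."""
    return np.abs(np.vdot(psi_i, psi_a)) ** 2

def assign_labels(states, reference_states):
    """Assign each point to the cluster whose reference state has maximal fidelity."""
    return np.array([np.argmax([fidelity(s, r) for r in reference_states])
                     for s in states])

def clustering_cost(points, states, reference_states, labels, lam=1.0):
    """Cost of the form of Eq. (1): pairwise term over points assigned to the same
    cluster plus a lambda-weighted point-to-centroid term (illustrative only)."""
    points = np.asarray(points, dtype=float)
    labels = np.asarray(labels)
    cost = 0.0
    for i in range(len(points)):
        f_i = fidelity(states[i], reference_states[labels[i]])
        for j in range(i + 1, len(points)):
            if labels[i] == labels[j]:                          # delta_{a,b}
                d_ij = np.linalg.norm(points[i] - points[j])    # l2 distance
                f_j = fidelity(states[j], reference_states[labels[j]])
                cost += d_ij * (1.0 - f_i) * (1.0 - f_j)
    for a in np.unique(labels):                                 # centroid term
        members = points[labels == a]
        centroid = members.mean(axis=0)
        cost += lam * np.linalg.norm(members - centroid, axis=1).sum()
    return cost
```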

Our photonic implementation simulates the optimization of the above cost function for a single qubit, using a variational quantum-inspired circuit. Our method consists of two distinct working units: a classical one, and a quantum-inspired one, which we simulate in our case with a diode laser. The classical computer takes care of updating the variational parameters driving the quantum circuit by means of a classical optimizer that aims at minimizing the cost function, as in a regular optimization problem. The quantum-inspired circuit is then modulated according to the updated variational parameters, which enter the circuit as rotations of the polarization of the light.

Experimental setup

To implement our quantum-inspired circuit, we have set up a series of waveplates that modulate the polarization of an 808 nm diode laser. These waveplates adjust the laser beam's polarization based on the variational parameters. Starting from a specific initial polarization state, the combination of waveplates gradually transforms it. The ultimate goal is to measure the polarization of a specific quantum state using a polarimeter, effectively allowing us to position polarized states throughout the Poincaré sphere, akin to the Bloch sphere of a single qubit. In essence, the system serves as a versatile tool for transforming the positions of dataset points. It begins with an initial configuration and is manipulated to achieve an optimal configuration in which a particular cost function, like the one in Eq. ( 1 ), is minimized. The underlying motivation for using an 808 nm diode laser stems from its simplicity and its capacity to map a single-qubit quantum circuit onto the laser's operational principles. By replicating quantum logical gates through a combination of waveplates and facilitating the measurement process with a polarimeter, we can recreate the dynamics of the qubit. In our case, we use this setting to implement the photonic simulation of variational quantum clustering.

Figure 1

[Color online] ( a ) Scheme of the optical setup. The experimental stages are shown in different colors: in yellow, the state initialization part; in magenta, the state manipulation section; and in cyan, the measurement stage. SMF: single mode optical fiber. FC: fiber collimator. P: linear polarizer. \(\text {H}_\text {x}\) : half wave plate. \(\text {Q}_\text {x}\) : quarter wave plate. Pol.: polarimeter. ( b ) Example of cluster classification for a configuration of 2 clusters of 200 points defined by \(\frac{d}{\sigma }=8\) , where d is the distance between the centers and \(\sigma\) the width of the Gaussian blobs.

The optical setup used to initialize, manipulate and read the states is shown in Fig.  1 . In the state preparation sequence (yellow part of Fig.  1 ), after half wave plate \(H_0\) and polarizer P are applied to the incoming beam, the initial state of the system can be represented as
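$$\vert \psi _{in}\rangle \;=\; \begin{pmatrix} 1 \\ 0 \end{pmatrix} ,$$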

which corresponds to horizontal polarization. This configuration, \(H_0 + P\) , is fixed in order to allow maximum intensity and a stable polarization.

The first two plates, \(Q_{in}\) and \(H_{in}\) , are used to transform the states from the data feature space to the Poincaré sphere. They are mounted on separate rotary stages (RSW60C-T3A from Zaber). These motors have a maximum accuracy of \(\hbox {0.08}^{\circ }\) and a maximum speed of \(\hbox {450}^{\circ }\) /s. The state \(|\phi _o\rangle\) , defined by the angles \((\psi ,\chi )\) , is determined by the angles \((\alpha , \beta )\) corresponding to the fast axis orientations of \(Q_{in}\) and \(H_{in}\) , respectively. This bijection was implemented by means of a look-up table.

After initialization, the state \(\vert \phi _o\rangle\) is modified by m subsequent sets of half and quarter wave plates \(\{\hat{H}_k,\hat{Q}_k\}\) (magenta part of Fig.  1 ). These wave plates work in the same way as the initial ones; that is, for ideal waveplates their Jones matrices are given by
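$$\hat{H}_k(\beta _k) \;=\; \begin{pmatrix} \cos 2\beta _k & \sin 2\beta _k \\ \sin 2\beta _k & -\cos 2\beta _k \end{pmatrix}$$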

for the half wave plates, and
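$$\hat{Q}_k(\alpha _k) \;=\; \begin{pmatrix} \cos ^2\alpha _k + i\sin ^2\alpha _k & (1-i)\sin \alpha _k\cos \alpha _k \\ (1-i)\sin \alpha _k\cos \alpha _k & \sin ^2\alpha _k + i\cos ^2\alpha _k \end{pmatrix}$$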

for the quarter wave plates (global phase factors are omitted in these ideal expressions). Although real waveplates can depart from this idealized model, these expressions allow us to simulate the behaviour of our experimental system. Here, \(\beta _k\) and \(\alpha _k\) are the rotation angles of the waveplates. They act as the variational parameters to be optimized in the variational circuit minimizing the cost function of Eq. ( 1 ). Therefore, the final state \(\vert \phi _f\rangle\) of the system is given by
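$$\vert \phi _f\rangle \;=\; \prod _{k=1}^{m} \hat{H}_k(\beta _k)\,\hat{Q}_k(\alpha _k)\,\vert \phi _o\rangle ,$$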

or, in terms of the initial horizontal polarization state \(\vert \psi _{in}\rangle\) ,
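$$\vert \phi _f\rangle \;=\; \left[ \prod _{k=1}^{m} \hat{H}_k(\beta _k)\,\hat{Q}_k(\alpha _k)\right] \hat{H}_{in}\,\hat{Q}_{in}\,\vert \psi _{in}\rangle .$$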

These output states \(\vert \phi _f\rangle\) are directly measured by the polarimeter (cyan part of Fig.  1 ). In the case of upgrading this experiment to more qubits, the polarimeter should be replaced by a complete tomography of the output state. In our case, this is simplified, as the polarimeter provides the values of the Stokes parameters \(s_0, s_1, s_2, s_3\) corresponding to a point on the Poincaré polarization sphere. The readout of this point on the sphere allows for the self-consistent optimization of the variational parameters in the circuit, allowing in turn the minimization of the clustering cost function and therefore the implementation of an unsupervised classification of the points in the dataset.
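The classical simulation of this readout chain is straightforward. The sketch below applies ideal-waveplate Jones matrices (global phases omitted) to a horizontally polarized input and returns the Stokes parameters; the wave-plate angles are arbitrary illustrative values standing in for the variational parameters, and sign conventions for \(s_2\) and \(s_3\) vary between references.

```python
import numpy as np

def hwp(beta):
    """Jones matrix of an ideal half wave plate with fast axis at angle beta."""
    return np.array([[np.cos(2 * beta),  np.sin(2 * beta)],
                     [np.sin(2 * beta), -np.cos(2 * beta)]], dtype=complex)

def qwp(alpha):
    """Jones matrix of an ideal quarter wave plate with fast axis at angle alpha."""
    c, s = np.cos(alpha), np.sin(alpha)
    return np.array([[c**2 + 1j * s**2, (1 - 1j) * s * c],
                     [(1 - 1j) * s * c, s**2 + 1j * c**2]], dtype=complex)

def stokes(state):
    """Stokes parameters (s0, s1, s2, s3) of a pure polarization state (Jones vector)."""
    ex, ey = state
    return np.array([abs(ex) ** 2 + abs(ey) ** 2,
                     abs(ex) ** 2 - abs(ey) ** 2,
                     2 * np.real(np.conj(ex) * ey),
                     2 * np.imag(np.conj(ex) * ey)])

# Horizontal input polarization, followed by m = 2 variational layers of QWP + HWP
psi = np.array([1.0, 0.0], dtype=complex)
for alpha_k, beta_k in [(0.3, 1.1), (0.7, 0.2)]:   # illustrative variational angles
    psi = hwp(beta_k) @ qwp(alpha_k) @ psi
print(stokes(psi))   # point on the Poincare sphere, as read out by the polarimeter
```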

The results for the case of two clusters with \(\sim 100\) points are shown in Fig.  1 (b). One can observe that the algorithm successfully and automatically classifies the points into the two different clusters. The success ratio of the algorithm for this particular case is \(100\%\) . In this case, the classification task is conducted on top of a random exploration of the initialization step, in order to favour the exploration of the parameter space.

Numerical analysis

To better understand the phase space of the variational parameters and provide better intuition for more complex quantum optimization processes, we performed a numerical analysis for a dataset of 4 Gaussian blobs, classified with 2 single variational layers, generating 3 different optimization paths in hyperparameter space (as shown in Fig.  2 a–c). We present in the figure the landscape of the cost function, which, as expected, has a complex shape with many local minima. The initialization of the algorithm starts at a random point in the landscape; subsequently, the optimization algorithm provides the rotations of the Poincaré sphere that optimize the cost function, tracing a path in the landscape.

Notice that while there is a single absolute minimum, most of the local minima also provide a good classification. The three examples shown fall into local minima with success ratios of \(92.5\%\) , \(95\%\) and \(100\%\) . The corresponding evolution of the cost function is shown in Fig.  3 . One can observe that, after just 10 iterations, the cost function typically arrives at a stable solution, while it may need up to 30 iterations to reach the minimum. Only one of the optimization paths ends up displaying perfect classification, corresponding to the one with the smallest cost. This result is a consequence of how sensitive variational algorithms are to initialization 15 , 16 , which has become one of the main features to look at in the search for non-classical simulability of variational algorithms 17 .

Figure 2

[Color online] Numerical example for the clustering of a distribution of 4 clusters. Top: colormap of the cost function value with respect to the hyperparameters \(\alpha\) and \(\beta\) , corresponding to the rotation angles of a half and a quarter wave-plate, respectively. Bottom: detail of the insets showing the trajectories taken to reach the local minima of the cost function.

Figure 3

Numerical cost function evolution for 4 Gaussian blobs: 3 different trajectories corresponding to the insets of Fig.  2 .

As mentioned earlier, one could use more complex minimization algorithms, but in our case a combination of Monte Carlo and steepest descent has provided good results. In this work, we focus on the capabilities of the method to efficiently reproduce a single-qubit algorithm, showing the main features of this clustering scheme and opening the way for further tuning strategies.
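A generic version of such a hybrid loop, with random (Monte Carlo) restarts followed by finite-difference steepest descent, could look as follows. This is only a sketch of the idea and not the exact optimizer used in the experiment; all parameter values are illustrative.

```python
import numpy as np

def optimize(cost_fn, n_params, n_starts=20, n_steps=30, lr=0.1, eps=1e-2, seed=0):
    """Random (Monte Carlo) restarts followed by finite-difference steepest descent."""
    rng = np.random.default_rng(seed)
    best_theta, best_cost = None, np.inf
    for _ in range(n_starts):
        theta = rng.uniform(0.0, 2.0 * np.pi, size=n_params)   # random initialization
        for _ in range(n_steps):
            grad = np.zeros(n_params)
            for p in range(n_params):                          # finite-difference gradient
                shift = np.zeros(n_params)
                shift[p] = eps
                grad[p] = (cost_fn(theta + shift) - cost_fn(theta - shift)) / (2 * eps)
            theta -= lr * grad                                 # steepest-descent step
        final_cost = cost_fn(theta)
        if final_cost < best_cost:
            best_theta, best_cost = theta.copy(), final_cost
    return best_theta, best_cost

# Example on a toy cost landscape with many local minima:
toy_cost = lambda t: float(np.sum(np.sin(3 * t) + 0.1 * t**2))
print(optimize(toy_cost, n_params=4))
```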

Experimental results

The methods described above were implemented in our clustering experiment using more clusters, in order to test the experimental limitations of the system. The main results are summarized in the plots in Fig.  4 . As in Fig.  1 (b), the figure shows the capability of the implemented photonic clustering to automatically identify configurations with different numbers of clusters. Additionally, in Fig.  5 we provide the experimental evolution of the cost function for the four different configurations tested. These results can be compared with the numerical results provided earlier. It can be observed that the experimental errors do not significantly affect the expected behaviour of the optimization process.

Figure 4

[Color online] Experimental clustering results: ( a ) 3 Gaussian blobs, ( b ) 4 Gaussian blobs and ( c ) 5 Gaussian blobs. Colors correspond to the different clusters identified by the experiment.

Figure 5

[Color online] Experimental cost function evolution corresponding to the distributions of Figs.  1 and 4 . The cost values were normalized between 0 and 1.

The results presented above constitute a first proof of principle of the validity of using classical photonic systems to simulate quantum clustering. The experiment can be further expanded in many different directions. For instance, it should be possible to classify more complex datasets. However, an increasing number of clusters would considerably lengthen the experimental time, since we have to perform a complete tomographic reconstruction of the states. A potential solution to this limitation could be training on a few data points in our dataset and inferring the position of the remaining points after training by their relative positions within the original dataset. Another alternative approach could be to sequentially add and evaluate points in order to obtain a proper classification minimizing the data required. In addition, one may also explore the possibility of simulating a multi-qubit quantum circuit, by generating and manipulating quantum states consisting of more than one photon.

Conclusions and outlook

In this paper we have introduced a novel quantum-inspired clustering method, based on the photonic simulation of single-qubit dynamics. We have shown how it is possible to optimize the polarization degree of freedom of a laser beam, in the same way that one can optimize the quantum state of a single qubit, so as to minimize a clustering cost function. We have implemented the experiment and shown the validity of our ideas, performing automatic clustering of random data, with no prior information, scattered in up to five zones, with perfect accuracy.

Our experiment is based on encoding the information in the polarization of light, akin to using a single qubit in a quantum algorithm. The rotation matrices introduced by the wave plates in our experiment can be cast in the quantum circuit in terms of general Pauli rotations around the main axes of a qubit. This can be useful in order to build a bridge between this specific implementation and the virtual environments commonly used in academia and industry with access to quantum hardware (Pennylane, Qiskit, etc.) 18 , 19 . There exist several restrictions on the amount and type of logical gates allowed in actual quantum hardware, so a mapping between the functioning of this optical circuit and universal sets of gates is advisable.

Our work can be further expanded in many directions, as discussed previously. We believe that a promising path is the simulation of multi-qubit systems. In addition, the flexibility of the variational circuit allows for a wide variety of applications. With this in mind, one could for instance build up diverse cost functions for different purposes using the same experimental arrangement, so that we could indeed use a laser beam to implement different types of quantum-inspired machine learning strategies. Indeed, when writing up this paper, we noticed the proposal of a very similar set-up for the realization of VQE algorithms using photonic devices 20 . Last but not least, our scheme allows the introduction of other features, such as data reuploading 21 , 22 . Data reuploading was already proposed in similar contexts, such as in Ref. 23 , where this was used to outperform kernel methods using very few quantum resources, as low as one single qubit. Even without reuploading, introducing several layers of gates in the circuit helps in practice in the convergence of the variational optimization, allowing for small changes of the different parameters at each iteration. All these topics will be the subject of future investigations.

Data availability

The data that support the findings of this study are available from the corresponding author upon reasonable request.

References

Bharti, K. et al. Noisy intermediate-scale quantum algorithms. Rev. Mod. Phys. 94, 015004 (2022). https://doi.org/10.1103/RevModPhys.94.015004

Preskill, J. Quantum Computing in the NISQ era and beyond. Quantum 2, 79 (2018). https://doi.org/10.22331/q-2018-08-06-79

Orús, R. A practical introduction to tensor networks: Matrix product states and projected entangled pair states. Annals of Physics (2014). https://doi.org/10.1016/j.aop.2014.06.013

Tindall, J., Fishman, M., Stoudenmire, E. M. & Sels, D. Efficient Tensor Network Simulation of IBM's Eagle Kicked Ising Experiment. PRX Quantum (2024).

Patra, S., Jahromi, S. S., Singh, S. & Orús, R. Efficient tensor network simulation of IBM's largest quantum processors. Phys. Rev. Res. (2024). https://doi.org/10.1103/PhysRevResearch.6.013326

Perez-Garcia, B. et al. Quantum computation with classical light: The Deutsch Algorithm. Physics Letters A 379, 1675 (2015).

Sun, Y., Li, Q., Kong, L.-J., Shang, J. & Zhang, X. Universal classical optical computing inspired by quantum information process. Annalen der Physik 534, 2200360 (2022).

Cerezo, M. et al. Variational quantum algorithms. Nature Reviews Physics 3, 625 (2021).

Peruzzo, A. et al. A variational eigenvalue solver on a photonic quantum processor. Nature Communications 5, 4213 (2014). https://doi.org/10.1038/ncomms5213

Bermejo, P. & Orús, R. Variational quantum and quantum-inspired clustering. Scientific Reports 13, 13284 (2023). https://doi.org/10.1038/s41598-023-39771-6

Xu, R. & Wunsch, D. Survey of clustering algorithms. IEEE Transactions on Neural Networks 16, 645 (2005).

Zait, M. & Messatfa, H. A comparative study of clustering methods. Future Generation Computer Systems 13, 149 (1997).

Lee, R. C. Clustering analysis and its applications. In: Advances in Information Systems Science, Volume 8, pp. 169–292 (Springer, 1981).

Berry, M. W., Mohamed, A. & Yap, B. W. Supervised and Unsupervised Learning for Data Science (Springer, 2019).

Truger, F., Barzen, J., Bechtold, M., Beisel, M., Leymann, F., Mandl, A. & Yussupov, V. Warm-starting and quantum computing: A systematic mapping study. ACM Computing Surveys (2023).

Egger, D. J., Mareček, J. & Woerner, S. Warm-starting quantum optimization. Quantum 5, 479 (2021).

Cerezo, M., Larocca, M., García-Martín, D., Diaz, N. L., Braccia, P., Fontana, E., Rudolph, M. S., Bermejo, P., Ijaz, A., Thanasilp, S., Anschuetz, E. R. & Holmes, Z. Does provable absence of barren plateaus imply classical simulability? Or, why we need to rethink variational quantum computing (2023). arXiv:2312.09121 [quant-ph].

Bergholm, V., Izaac, J., Schuld, M., Gogolin, C., Ahmed, S., Ajith, V., Alam, M. S., Alonso-Linaje, G., AkashNarayanan, B., Asadi, A. et al. PennyLane: Automatic differentiation of hybrid quantum-classical computations. arXiv preprint arXiv:1811.04968 (2018).

Qiskit contributors. Qiskit: An open-source framework for quantum computing (2023). https://doi.org/10.5281/zenodo.2573505

Stornati, P., Acin, A., Chabaud, U., Dauphin, A., Parigi, V. & Centrone, F. Variational quantum simulation using non-Gaussian continuous-variable systems (2023). arXiv:2310.15919 [quant-ph].

Pérez-Salinas, A., Cervera-Lierta, A., Gil-Fuster, E. & Latorre, J. I. Data re-uploading for a universal quantum classifier. Quantum 4, 226 (2020). https://doi.org/10.22331/q-2020-02-06-226

Schuld, M., Sweke, R. & Meyer, J. J. Effect of data encoding on the expressive power of variational quantum-machine-learning models. Physical Review A 103, 032430 (2021).

Jerbi, S., Fiderer, L. J., Poulsen Nautrup, H., Kübler, J. M., Briegel, H. J. & Dunjko, V. Quantum machine learning beyond kernel methods. Nature Communications 14, 517 (2023). https://doi.org/10.1038/s41467-023-36159-y


Acknowledgements

We acknowledge Donostia International Physics Center (DIPC), Centro de Física de Materiales, Ikerbasque, Basque Government, Diputación de Gipuzkoa, Spanish Ministry of Science and Innovation and European Innovation Council (EIC) for constant support, as well as insightful discussions with the teams from Multiverse Computing, DIPC and CFM on the algorithms and technical implementations. This work was supported by the Spanish Ministry of Science and Innovation through the PLEC2021-008251 project. We also acknowledge support from PTI-001 from CSIC and the Ministry of Science through project PID2022-143268NB-I00.

Author information

Authors and affiliations.

Centro de Física de Materiales, UPV-EHU/CSIC, Paseo Manuel de Lardizabal 5, San Sebastián, E-20018, Spain

Miguel Varga, Ruben Pellicer-Guridi & Gabriel Molina-Terriza

Donostia International Physics Center, Paseo Manuel de Lardizabal 4, San Sebastián, E-20018, Spain

Pablo Bermejo, Ruben Pellicer-Guridi, Román Orús & Gabriel Molina-Terriza

Ikerbasque Foundation for Science, Maria Diaz de Haro 3, Bilbao, E-48013, Spain

Román Orús & Gabriel Molina-Terriza

Multiverse Computing, Paseo de Miramón 170, San Sebastián, E-20014, Spain


Contributions

M.V. set up the experiment. M.V and P.B. wrote the clustering code. M.V. and P.B. analyzed the results and wrote the manuscript. All authors reviewed the manuscript.

Corresponding author

Correspondence to Miguel Varga .

Ethics declarations

Competing interests.

The authors declare no competing interests.

Additional information

Publisher’s note.

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/ .


About this article

Cite this article.

Varga, M., Bermejo, P., Pellicer-Guridi, R. et al. Quantum-inspired clustering with light. Sci Rep 14 , 21726 (2024). https://doi.org/10.1038/s41598-024-73053-z


Received : 28 May 2024

Accepted : 12 September 2024

Published : 17 September 2024

DOI : https://doi.org/10.1038/s41598-024-73053-z




  • Open access
  • Published: 16 September 2024

ESCHR: a hyperparameter-randomized ensemble approach for robust clustering across diverse datasets

  • Sarah M. Goggin 1 &
  • Eli R. Zunder   ORCID: orcid.org/0000-0002-0356-1685 1 , 2  

Genome Biology volume 25, Article number: 242 (2024)


Clustering is widely used for single-cell analysis, but current methods are limited in accuracy, robustness, ease of use, and interpretability. To address these limitations, we developed an ensemble clustering method that outperforms other methods at hard clustering without the need for hyperparameter tuning. It also performs soft clustering to characterize continuum-like regions and quantify clustering uncertainty, demonstrated here by mapping the connectivity and intermediate transitions between MNIST handwritten digits and between hypothalamic tanycyte subpopulations. This hyperparameter-randomized ensemble approach improves the accuracy, robustness, ease of use, and interpretability of single-cell clustering, and may prove useful in other fields as well.

Clustering is widely used for exploratory data analysis across diverse fields, where it is applied to identify dataset grouping structures in an unsupervised manner. In particular, clustering has become a workhorse tool for single-cell analysis, enabling the identification and characterization of cell populations that share similar molecular profiles within heterogeneous biological samples [ 1 ]. The output of clustering analysis is often used for direct comparison of biological samples, to identify changes in the abundance or molecular state of specific cell populations. Furthermore, clustering output is frequently carried forward into additional downstream analyses such as cell type classification or trajectory analysis [ 2 , 3 , 4 ]. Therefore, the accuracy and reproducibility of clustering partitions is important for the quality of single-cell analysis. This importance has motivated the development of hundreds [ 5 ] of clustering methods with a variety of algorithmic strategies, but there are still important shortcomings in all of these methods which reduce their effectiveness.

An ideal clustering method for single-cell analysis would satisfy the following requirements:

Operate without the need for human input such as hyperparameter tuning. The vast majority of existing methods require selection and optimization of hyperparameters, which can significantly impact clustering quality [ 6 , 7 , 8 , 9 ]. Manual hyperparameter tuning is time-consuming and relies subjectively on human intuition about which groupings appear correct [ 10 ]. Automated methods have been proposed to overcome this limitation, but many are computationally inefficient, and all are biased by the criteria used for optimization [ 9 , 11 , 12 , 13 ].

Perform well across diverse single-cell datasets from different tissues and across multiple measurement modalities such as single-cell/single-nucleus RNA sequencing (scRNA-seq and snRNA-seq), single-cell assay for transposase-accessible chromatin sequencing (scATAC-seq), flow cytometry, mass cytometry, and multiplexed imaging analysis such as high-content fluorescence imaging, imaging mass cytometry (IMC), multiplexed ion beam imaging (MIBI), and multiplexed error-robust fluorescence in situ hybridization (MERSCOPE). Generalizability is a concern in existing methods; many clustering methods perform well on gold-standard single-cell datasets, but do not generalize well to datasets from other tissue types or from other single-cell analysis modalities which may have different or more complex distributions or structural properties [ 7 , 8 , 9 , 10 , 14 ].

Produce stable and consistent partitions that are robust to random sampling and minor perturbations. Existing methods do not reliably produce robust partitions when applied to complex, high-dimensional single-cell datasets. Meaningfully different results can be produced with different hyperparameter combinations [ 8 ], slight perturbations of a dataset [ 10 , 14 ], or even when an identical dataset and hyperparameters are run multiple times due to randomization steps in most clustering algorithms (Additional File 1: Fig. S1a, b).

Capture and describe the wide variety of discrete and continuous grouping structures present in single-cell datasets [ 15 , 16 ]. Most existing methods implement hard clustering, which assumes a data structure with discrete, well-separated groups, but is unable to characterize overlap or continuity between groups. Alternative computational methods for trajectory inference can better capture specific types of continuum-like processes such as cell differentiation in single-cell datasets, but these methods make a different set of assumptions about data structure that can be equally restrictive.

Quantify uncertainty at the levels of individual data points and clusters. There are many scenarios where clustering can provide useful information, but a single optimal solution to the clustering task either does not exist or cannot be determined [ 17 ]. In many cases, there is additionally no known ground truth that could define what a correct solution might look like. Therefore, measures of uncertainty are crucial to assess the reliability and aid interpretability of clustering results before using them as inputs for downstream analytical methods or for purposes such as hypothesis development or orthogonal validation of results.

Scale to analyze large single-cell datasets with millions of cells. While many of the most commonly used methods are scalable, several that have been developed to address these key challenges for clustering have done so at the expense of scalability. Methods that improve on these other challenges can only be realistically impactful if they can produce results for the large dataset sizes that are becoming increasingly commonplace.

Recently developed clustering methods have made progress towards some of these goals. Ensemble and consensus methods represent a promising approach to improve clustering robustness by combining information from multiple diverse partitions [ 18 , 19 , 20 , 21 , 22 , 23 , 24 , 25 ]. Fuzzy and soft clustering methods allow data points to belong to multiple clusters, and can therefore be used to provide a more complete description of both continuous and discrete data structures [ 26 , 27 ]. There are several methods that provide measures of stability or uncertainty at the cluster level [ 9 , 11 , 24 , 28 ], but cell-level measures of uncertainty are rarely provided in single-cell methods [ 29 , 30 ]. Additionally, deep learning methods have shown promise in generating informative lower-dimensional representations of diverse types of high-dimensional biological data [ 31 ]. However, none of these approaches have been able to incorporate all of the six key features described above.

To address this need for a single method that performs robustly across diverse datasets with no hyperparameter tuning and transparently communicates uncertainty, we developed a clustering algorithm that applies E n S emble C lustering with H yperparameter R andomization (ESCHR). This algorithm requires no human input due to hyperparameter randomization, which explores a wide range of data subspaces that contribute to the final consensus clustering step. Our implementation of ESCHR in Python ( https://github.com/zunderlab/eschr ) [ 32 ] can be used as a self-contained framework for clustering, or it can be integrated into commonly used single-cell analysis pipelines such as the scverse ecosystem [ 33 ]. To evaluate this new method, we performed extensive benchmarking tests, which demonstrated that ESCHR outperforms both general clustering methods and clustering methods specifically developed for single-cell analysis [ 24 , 25 , 34 , 35 , 36 , 37 , 38 , 39 , 40 , 41 ] in terms of accuracy on synthetic datasets with a known “ground truth” and in terms of robustness on real single-cell datasets encompassing diverse tissues (bone marrow, pancreas, developing and adult brain), organisms (mouse, human), cell numbers (from hundreds to millions), and measurement techniques (single-cell RNA sequencing, mass cytometry, flow cytometry).

After benchmarking for accuracy and robustness, we applied ESCHR clustering to two complex real-world datasets—first to the MNIST dataset [ 42 ], a commonly used example for machine learning image analysis, and then in the single cell context to investigate the relationships between tanycyte populations in the hypothalamus, which have been previously shown to display spatial and molecular-level continuity between subtypes [ 43 , 44 , 45 , 46 , 47 ]. In both of these exploratory analyses, the soft cluster assignments and uncertainty scoring from ESCHR were used to identify regions of low confidence cluster assignments corresponding to transitional overlap between clusters and map the key feature transitions that define these regions.

Overview of ESCHR clustering

To develop a robust and scalable clustering method for the analysis of single-cell datasets, we employed an ensemble and consensus approach, which has been shown to improve robustness across many domains of machine learning [ 21 , 48 , 49 , 50 , 51 , 52 , 53 , 54 ]. This approach consists of two main steps: (1) generate a set of base partitions, referred to as the ensemble, and (2) use this ensemble to generate a final consensus partition. The graph-based Leiden community detection method [ 55 ] was selected as a base algorithm to generate the clustering ensemble, because it is widely used for single-cell analysis, and is efficiently implemented to be scalable for large datasets [ 3 ].

A key element of successful consensus approaches is generating sufficient diversity in the ensemble [ 21 , 49 , 50 , 56 ]. To generate this diversity, ESCHR randomizes four hyperparameters for each base partition: subsampling percentage, number of nearest neighbors, distance metric, and Leiden resolution. Within a given base partition, ESCHR first selects a subsampling percentage by random sampling from a Gaussian distribution with μ scaled to dataset size (within 30–90%) and then extracts the specified subset of data from the full dataset. Next, ESCHR randomly selects values for the number of nearest neighbors (15–150) and the distance metric (euclidean or cosine) and uses these to build a k-nearest neighbors (kNN) graph for the extracted subset of data. Finally, ESCHR performs Leiden community detection on this kNN graph using a randomly selected value for the required resolution-determining hyperparameter (0.25–1.75). The ranges for randomization of these hyperparameters were optimized empirically (Additional File 1: Fig. S2a–f and Methods). This subsampling and randomization scheme is used to produce diversity among each of the different base partitions (Fig.  1 a). This diversity provides many different views of the dataset, and the full ensemble of these views provides a more comprehensive picture of the dataset grouping structure (Additional File 1: Fig. S3), which is less likely to be influenced by the stochastic variations present in any single view, including the full unsampled dataset. In addition to generating ensemble diversity, this hyperparameter randomization approach is what enables ESCHR to operate without the need for hyperparameter tuning at this first stage of the algorithm.
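The randomization scheme can be illustrated with a short sketch built on the scanpy API. This is not the ESCHR implementation itself (available at https://github.com/zunderlab/eschr ); the mean of the Gaussian used for the subsampling fraction is a placeholder for the dataset-size-dependent value described above, and the helper name random_base_partition is ours.

```python
import numpy as np
import scanpy as sc

def random_base_partition(adata, rng):
    """One base partition with randomized hyperparameters (illustrative sketch)."""
    # Subsampling fraction drawn from a Gaussian and clipped to the 30-90% range;
    # the mean (0.6 here) stands in for the dataset-size-dependent value.
    fraction = float(np.clip(rng.normal(loc=0.6, scale=0.15), 0.30, 0.90))
    sub = sc.pp.subsample(adata, fraction=fraction, copy=True)

    # Randomized kNN graph: number of neighbors in 15-150, euclidean or cosine metric
    sc.pp.neighbors(sub,
                    n_neighbors=int(rng.integers(15, 151)),
                    metric=str(rng.choice(["euclidean", "cosine"])))

    # Randomized Leiden resolution in 0.25-1.75
    sc.tl.leiden(sub, resolution=float(rng.uniform(0.25, 1.75)),
                 key_added="base_partition")
    return sub.obs["base_partition"]

# rng = np.random.default_rng(0)
# ensemble = [random_base_partition(adata, rng) for _ in range(30)]
```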

Figure 1

ESCHR framework overview. a Starting from a preprocessed input dataset, ESCHR performs ensemble clustering using randomized hyperparameters to obtain a set of base partitions. This set of base partitions is represented using a bipartite graph where one type of node consists of all data points and one type of node consists of all clusters from all base partitions and edges exist between data points and each base cluster they were assigned to throughout the ensemble. b Leiden bipartite clustering is performed on the ensemble bipartite graph. Base clusters are collapsed into their assigned consensus clusters obtained through the bipartite clustering and edge weights are summed such that each data point now has a weighted edge to each consensus cluster representing the number of base clusters it had been assigned to that were then collapsed into that consensus cluster. c Soft cluster memberships are obtained by scaling edge weights between 0 and 1, and can then be visualized directly in heatmap form and used to generate hard cluster assignments, per-data point uncertainty scores, and cluster connectivity maps

After generating a diverse ensemble of base partitions, ESCHR applies a bipartite graph clustering approach to obtain the final consensus partition. First, the base partitions are assembled into a bipartite graph, where cells are represented by one set of vertices, base clusters are represented as a second set of vertices, and each cell is connected by an edge to each of the base clusters it was assigned to throughout the ensemble (Fig.  1 b). Next, ESCHR applies bipartite community detection to obtain the final consensus partition (Fig.  1 b) [ 57 ]. Bipartite community detection is applied here instead of more common consensus approaches that suffer from information loss [ 58 ]. To remain hyperparameter-free without the need for human intervention in this consensus stage of the algorithm, ESCHR performs internal hyperparameter selection to determine the optimal resolution for the final consensus clustering step by selecting the medoid from a range of resolutions (Additional File 1: Fig. S4). After obtaining the final consensus partition, ESCHR converts the ensemble bipartite graph to a final weighted bipartite graph by collapsing all base partition cluster nodes assigned to the same consensus cluster into a single node. Cells are then connected to these consensus cluster nodes by edges with weights representing the number of times each cell was assigned to any of the base partition clusters that were collapsed into a given consensus cluster (Fig.  1 b). These raw membership values are then normalized to obtain proportional soft cluster memberships, and hard cluster labels are assigned as the consensus cluster in which a cell has the highest proportional membership (Fig.  1 c).

While many analysis strategies for single-cell datasets require hard clustering labels, these by definition cannot convey whether a cell is at the borderline between multiple clusters or located firmly in the center of a single cluster. Hard clusters also do not provide any insight into potential continuity between clusters. Using the soft cluster memberships derived from the weighted consensus bipartite graph, ESCHR provides several additional outputs beyond hard cluster assignments that enable a more comprehensive characterization of the grouping structures within a dataset. Firstly, soft cluster memberships can be directly visualized in heatmap form to identify areas of cluster overlap at the single-cell level (Fig.  1 c). Importantly, these soft membership heatmap visualizations can serve as complements or even alternatives to the widely used but also widely misinterpreted [ 59 ] stochastic embedding methods (i.e. UMAP [ 60 ], t-SNE [ 61 ]) for visualizing the complex relational structures within single-cell datasets. ESCHR also produces an Uncertainty Score for every object, derived from its soft cluster membership, which quantifies regions of higher and lower certainty in hard cluster assignment (Fig.  1 c). Finally, ESCHR produces a cluster-level map of the continuity structure within a dataset by using the soft cluster memberships to calculate a corrected-for-chance measure of the connectivity between each pair of hard clusters (Fig.  1 c and Methods).

ESCHR soft clustering and uncertainty scores capture diverse structural characteristics and quantify uncertainty in cluster assignments

We first sought to examine how ESCHR uncertainty scores and soft clustering could enable effective and informative analysis for datasets containing complex combinations of continuity and discreteness, and how these results compared to a wide range of alternative clustering methods used for single-cell analysis or general purpose clustering (Additional File 2: Table S1 and Methods). For this analysis, we generated a synthetic scRNA-seq dataset containing 1000 cells and 1000 features using the DynToy package [ 62 ]. This dataset is generated by sampling “cells” from a complex trajectory model, with library size and transcript distributions per cell modeled on a real scRNA-seq dataset. Specifically, “cells” are sampled from prototypical “cell states”, where each cell has a varying probability of belonging to multiple neighboring states, and the ground truth hard cluster labels are assigned as the state in which the cell has the highest percent membership. This process generates a dataset which is similar to real single-cell data but provides known ground truth grouping structure and known ground truth continuity structure (Fig.  2 a–b, Additional File 1: Fig. S7a), which is not generally available for real datasets (Additional File 1: Supplementary Note 1).

figure 2

Visualization of ESCHR clustering and uncertainty scores compared to other clustering methods. UMAP visualizations of a ground truth cluster labels, b ground truth cell state membership, c ESCHR hard clusters, and d ESCHR uncertainty scores. e Heatmap visualization of ESCHR soft cluster memberships. f UMAP visualizations of cluster assignments from selected comparison methods. Points are colored by cluster ID. g Box and whisker plot comparing uncertainty scores of data points from ESCHR hard clustering that were accurately assigned versus not accurately assigned. The box shows the quartiles of the dataset, whiskers extend to 1.5*IQR, plotted points are outliers. Two-sided Mann–Whitney U test was used for statistical analysis. N  = 126,545, 750,955 for inaccurate and accurate groups respectively. h Comparison of ESCHR uncertainty scores versus method agreement per each individual data point. Primary box and whisker plot x -axis is binned ESCHR uncertainty scores and y -axis is the average method agreement across all pairs of methods; inset scatterplot shows raw data (i.e., not binned) with a red line of best fit and Pearson correlation statistic

We first compared the ESCHR hard clustering results (Fig.  2 c, Additional File 1: Fig. S7b) and uncertainty scores (Fig.  2 d) with the true hard cluster labels and the true membership percentage for those labels. While ESCHR successfully captures all of the ground truth cell states, it also adds two additional clusters (ESCHR clusters 9 and 6) between true clusters M2 and M3 and between M1 and M7. However, the ground truth membership percentages show that these regions are highly transitional, with low percentages for the maximum membership (Fig.  2 b). ESCHR uncertainty scores correspond closely to this observed ground truth continuity in Fig.  2 b, indicating that the uncertainty scores can identify regions of uncertainty in cluster assignment due to ground truth continuity and cluster overlap. In addition to quantifying this level of uncertainty per “cell,” ESCHR also provides information at the cluster level about which clusters overlap, and to what extent, through direct visualization of the soft cluster memberships. This reveals an overlap structure that corresponds to the ground truth patterns of transitional membership between groups, such as between ESCHR clusters 7, 9, and 1 (corresponding to true labels M2 and M3) and ESCHR clusters 1, 8, and 2 (corresponding to true labels M3, M6, and M5) (Fig.  2 e).

We next evaluated the results from multiple different clustering methods and found wide disagreement between them (Fig.  2 f, Additional File 1: Fig. S7c). Seurat, Scanpy, and Phenograph, which all use either Leiden or Louvain as their base clustering method, identify approximately the same clusters as ESCHR, but importantly each of these methods selects different boundaries between these clusters. While the results from the remaining methods exhibit more diversity, it is notable that none place cluster boundaries within regions of high ground truth single-state membership; rather, they over-cluster transitional regions or under-cluster by grouping multiple true clusters together. The regions of disagreement between the different clustering methods highlight areas that are challenging for traditional hard clustering and perhaps not well suited to its discreteness assumptions. High ESCHR uncertainty scores and overlapping soft cluster memberships correspond to these regions of disagreement between other clustering methods, providing further evidence that these metrics can help identify regions that are challenging for traditional clustering methods due to continuous data structures such as overlap between ground truth clusters.

To assess whether ESCHR uncertainty scores were similarly informative across diverse datasets, we generated an additional 4 simulated datasets using DynToy and 16 additional structurally diverse synthetic datasets consisting of randomly generated Gaussian distributions varying in number of objects (5000 or 10,000), number of features (20, 40, 50, 60), number of clusters (3, 8, 15, 20), cluster sizes, cluster standard deviations, cluster overlap, and feature anisotropy (Additional File 1: Figs. S5–S6, Additional File 2: Tables S2–S3). To quantitatively evaluate the utility of ESCHR uncertainty scores across our full set of 21 structurally diverse synthetic datasets with ground truth cluster labels, we first compared ESCHR uncertainty scores with per-data-point assignment accuracy relative to the ground truth labels across all datasets, and found that uncertainty scores were significantly higher for inaccurately assigned cells (Fig.  2 g). We then quantified the level of agreement between the clustering assignments from all the different clustering algorithms we tested (in Fig.  2 f) and used this as an alternative external indicator of per-data-point uncertainty and difficulty of clustering (Methods). This analysis revealed that ESCHR uncertainty scores were significantly negatively correlated with method agreement (Fig.  2 h). Taken together, these comparisons demonstrate that ESCHR uncertainty scores identify meaningful uncertainty and that, when used in combination with the soft clustering results, they enable more in-depth interpretation of dataset structure than other methods which produce only hard cluster assignments. Furthermore, ESCHR is able to provide these high-quality insights for datasets with diverse structural characteristics without the need for human intervention such as hyperparameter tuning.

ESCHR outperforms other methods across measures of accuracy and robustness

To systematically evaluate the performance of ESCHR on real as well as synthetic datasets, we benchmarked it against other clustering algorithms (Additional File 2: Table S1) using a collection of 45 published real datasets in addition to the 21 synthetic datasets described above. This collection of 45 published datasets varies widely in size (300–2,000,000 cells), source tissue (e.g. blood, bone marrow, brain), measurement type (sc/nRNA-seq, mass cytometry, flow cytometry, non-single-cell datasets), and data structure (varying degrees of discreteness and continuity) (Additional File 2: Table S4). For our evaluation criteria, we selected two extrinsic evaluation metrics, Adjusted Rand Index (ARI) and Adjusted Mutual Information (AMI), to assess two aspects of the clustering results: (1) accuracy and (2) robustness. Extrinsic evaluation metrics measure the distance of a clustering result to some external set of labels, and our two selected metrics ARI and AMI represent different approaches to this problem, with divergent biases. ARI tends to yield higher scores in cases of similarly sized clusters and similar numbers of clusters within and between the partitions being compared, while AMI is biased towards purity and yields higher scores when there are shared pure clusters between the two partitions (Methods) [ 63 ]. Using ARI and AMI together should therefore provide a more complete comparison of clustering performance [ 12 , 13 ].

When we applied these extrinsic metrics ARI and AMI to assess clustering accuracy for our collection of synthetic datasets, ESCHR outperformed all other clustering algorithms across both metrics, and this superior performance was statistically significant in all cases (Fig.  3 a, Additional File 1: Table S5, Additional File 1: Fig. S8b–c). We also applied ARI and AMI to benchmark clustering accuracy on real datasets, although it is important to note that a priori known class labels do not generally exist for real-world single-cell datasets, and the various proxies accepted as ground truth labels should be interpreted with skepticism (discussed further in Additional File 1: Supplementary Note 1). Keeping these caveats in mind, ESCHR still clustered real datasets more accurately by ARI and AMI than all other methods, significantly so in all comparisons except for scCAN and Agglomerative clustering by ARI and only scCAN by AMI (Fig.  3 b, Additional File 1: Table S6, Additional File 1: Fig. S8b–c). Many of the ground truth labels that are widely accepted for real single-cell datasets are based on a hierarchical framework of clustering or manual labeling, which could explain why agglomerative clustering performs better relative to the other methods in this particular comparison.

figure 3

Systematic analysis of ESCHR clustering performance compared to competing methods on synthetic and real datasets. a – d Box and whisker plots comparing accuracy ( a and b ) and robustness ( c and d ) of results from ESCHR and all comparison methods across all synthetic ( a and c ) and real ( b and d ) benchmark datasets as measured by ARI (left) and AMI (right). Boxes show the quartiles of the dataset, and whiskers extend to 1.5*IQR. Data points used in the creation of box and whisker plots and shown in overlaid scatterplots are the means across 5 replicates for each dataset. Two-sided Wilcoxon signed-rank test with Bonferroni correction was used for statistical analysis comparing ESCHR to each method. N  = 21 for comparisons using synthetic datasets and N  = 45 for comparisons using real datasets. e Mean rank across all metrics shown in box-and-whisker plots for different cluster numbers of the synthetic datasets. Error bars show 1 standard deviation. f Mean rank across all metrics shown in box-and-whisker plots for different modalities of the real datasets. Points represent means across all replicates of all datasets in a given category and error bars show 1 standard deviation. g Mean rank across all metrics shown in box-and-whisker plots for different sample number bins for all real and synthetic datasets. Points represent means across all replicates of all datasets in a given category and error bars show 1 standard deviation. h Box plots of difference from the true cluster number for each synthetic dataset for each method. Values below zero reflect calculated cluster numbers being lower than true cluster numbers and higher than zero indicates more clusters than the true cluster number. SINCERA is shown separately due to the scale of values being 2 orders of magnitude different from all other methods. i Scalability comparison between ESCHR and other methods on synthetic datasets with an increasing number of data points. X -axis is log scaled but labels show the unscaled values for easier interpretation. Each dot represents 5 replicates and error bars show 1 standard deviation

After benchmarking for accuracy, we next used ARI and AMI to evaluate clustering robustness, by comparing results from repeated runs with different random subsamples of a given dataset (Methods). Due to its ensemble and consensus clustering approach, we expected ESCHR to perform well in these tests of robustness, and indeed it demonstrated superior performance to all other clustering algorithms on both synthetic and real data across both ARI and AMI metrics (Fig.  3 c–d, Additional File 1: Fig. S8d–e). These results were significant for all comparisons except against scCAN on the real datasets by ARI (Additional File 1: Tables S5–S6). To gain insight into the generalizability of ESCHR versus the other methods for specific dataset types, we calculated the mean rank of each clustering algorithm across all metrics for major subcategories of our collection of datasets: cluster number for synthetic datasets (for which we have reliable ground truth cluster numbers), data modality for real datasets, and sample number across all datasets. Different clustering algorithms perform better or worse for different subsets, but ESCHR is consistently ranked first or tied for first across these subcategories of both synthetic (Fig.  3 e, g) and real datasets (Fig.  3 f, g), indicating that its performance is more generalizable to diverse datasets than that of the other tested clustering algorithms.

We next evaluated the scalability of each method over a range of dataset sizes. While ESCHR generally takes the longest, this does not present a practical limitation for typical usage, as it is able to successfully complete analyses on millions of data points and its runtime scales linearly (Fig.  3 i). This analysis also revealed that several of the alternative clustering algorithms we tested could not successfully run to completion for larger datasets. The dataset size limit for ESCHR is effectively the size limit of its underlying base clustering method, the Leiden algorithm implemented in Python [ 55 ]. While it is true that our method has longer runtimes than some of the commonly used methods we compare to here, we believe it is worth the wait due to the demonstrated superior accuracy and robustness of our results, and perhaps even more importantly due to the additional insights afforded by the uncertainty scores and soft cluster membership information highlighted in Fig.  2 . Additionally, the manual guess-and-check hyperparameter tuning required to achieve desired results with other methods can be very time-consuming (not to mention highly subjective), so in practical usage ESCHR may end up providing useful results more quickly than other methods. Taken together, these quantitative evaluations demonstrate that ESCHR performs favorably compared to the other methods tested here and achieves our desired goals of providing accurate and robust results, being generalizable to a broad range of diverse datasets, and being scalable to large datasets.

ESCHR soft clustering and uncertainty scores provide increased interpretability in exploratory data analysis of the MNIST dataset

To illustrate how ESCHR can identify regions of continuity and provide insight into cluster overlap and dataset structure, we selected the MNIST dataset for further analysis. This dataset, consisting of 70,000 handwritten digits with ground truth labels, is often used for machine learning demonstrations because the images can be visualized for intuitive interpretation [ 42 ]. Other clustering algorithms set to default hyperparameters do not recapitulate the ground truth labels with high accuracy (Additional File 1: Fig. S9a), explained in part by the real variation that exists within the ground truth sets. For example, there are two common variations of the handwritten digit 1, and most of the clustering algorithms capture this difference. Of all the clustering algorithms tested, ESCHR clusters the MNIST dataset with the highest robustness and accuracy (Additional File 1: Fig. S9b), but it consistently splits the 1 and 9 digits into separate subsets (Fig.  4 a–b), and in some cases, it splits the digit 4 as well (Additional File 1: Fig. S9a). ESCHR usually produces highly consistent results from run to run thanks to its consensus clustering step, but this inconsistency around the digits 4 and 9 is suggestive of a high degree of continuity within and between these two classes (Additional File 1: Fig. S9c), which is highlighted by elevated ESCHR uncertainty scores in this region (Fig.  4 c). The soft cluster membership heatmap also draws attention to the visual similarities between digits 3, 5, and 8, as well as the two types of handwritten 1 digits (Fig.  4 d). These subset-level differences and connections between related digits motivated further investigation of the ESCHR outputs for the MNIST dataset.

figure 4

ESCHR-guided exploration of the benchmarking dataset MNIST. a UMAP visualization with points colored by true class labels. b UMAP visualization with points colored by ESCHR hard cluster labels. c UMAP visualization with points colored by ESCHR uncertainty score. d Heatmap visualization of ESCHR soft cluster memberships. e Nodes represent ESCHR hard clusters and are located on the centroid of the UMAP coordinates for all data points assigned to that hard cluster. Node size is scaled to the number of data points in a given cluster. Edges exist between nodes that were determined to have significant connectivity by ESCHR cluster connectivity analysis, and edge thickness is scaled to the connectivity score. f Stacked bar plot showing the soft membership of datapoints in clusters 1b and 1a, ordered by increasing ESCHR soft cluster membership (SCM) rank ordering score (middle); kernel density estimation across the ordering score (bottom); dashed lines indicate boundaries between ordering score density peaks to separate “core” and “transitional” datapoints (middle and bottom); smaller images show individual representative images and larger images show summed pixel intensities for all datapoints contained within each dashed partition (top). g Visualization of data points from ESCHR clusters 4, 9a, and 9b projected onto the first two principal components resulting from PCA performed on the soft membership matrix of these three clusters. The primary scatterplot shows points colored by their ESCHR hard cluster assignment, and the inset scatterplot shows points colored by the ESCHR uncertainty score. Images are real examples from the MNIST dataset. h Scatterplot points in the first two rows of plots show the pixel locations of the 30 features with the largest positive (first row, red) and 30 largest negative (second row, blue) Pearson correlation to each of the PCs. Example digit images are underlaid in light gray to aid interpretation. The final row contains heatmaps with each pixel colored according to its Pearson correlation with PC1 (left) or PC2 (right), with bright red indicating a large positive correlation and dark blue indicating a large negative correlation

To further investigate the continuity and overlap structure that was indicated by the uncertainty scores and soft cluster membership heatmap, cluster connectivity mapping was applied to identify significant overlap beyond what would be expected by random chance for the ESCHR clusters (Fig.  4 e) (Methods). This revealed significant overlap between clusters “3”–“5”–“8,” “1a”–“1b,” and “4”–“9a”–“9b.” To explore the nature of the continuity structure underlying the significantly overlapping clusters “1a” and “1b,” we devised a simple rank ordering scheme based on the soft membership values for the datapoints in these two clusters and then used this ordering score to examine both the continuous progression of soft membership values across the rank-ordered datapoints and their density along this ordering score (Methods). This revealed that each cluster had a high-density peak of “core” datapoints with a secondary smaller “transitional” peak (Fig.  4 f, bottom). Individual representative MNIST digit images (Fig.  4 f, top row) and summed pixel intensities (Fig.  4 f, second row) from the images within each of these regions indicate that the core “1b” images are heavily slanted whereas the core “1a” images are vertically straight, with the images from the lower-density transitional peaks falling in between these extremes. The two high-density peaks consisting of images with distinctly different styles of 1s explain why ESCHR and many of the other clustering methods tested identified two clusters corresponding to this single digit (Additional File 1: Fig. S9a), while the high degree of pixel overlap between the two styles and the presence of images with intermediate slantedness explain the high degree of continuity and significant overlap detected by ESCHR.

We next examined the more complex relationship between subsets of the digits 4 and 9. Cluster connectivity mapping indicated that there was significant overlap among all three of the ESCHR clusters “4,” “9a,” and “9b” (Fig.  4 e). Additionally, in the soft membership heatmap there appear to be some data points that overlap all three clusters, and some from clusters “9a” and “9b” that overlap with cluster “4” but not with each other (Fig.  4 d). Unlike the simpler relationship between ESCHR clusters “1a” and “1b,” which could be analyzed by linear one-dimensional reduction, the more complex relationship around digits 4 and 9 could not be adequately captured or described along a single dimension, so principal components analysis (PCA) was applied to the ESCHR soft cluster memberships corresponding to these three clusters in order to reduce these relationships into two dimensions (Methods). Representative images selected from throughout the resulting PC space reveal that the between-cluster continuity indeed reflects a continuous progression through different conformations of the two digits 4 and 9 (Fig.  4 g). Specifically, we can see that there is a continuous progression through the 9’s based on how slanted they are, with two areas of higher density at either extreme. This explains why a clustering algorithm would be likely to split this into two clusters, albeit with a high amount of uncertainty about precisely where to make the split. The images also illustrate how the more slanted closed 4’s form a continuous transition primarily with cluster 9a and the more vertically oriented closed 4’s form a continuous transition primarily with cluster 9b. This approach also allows us to identify features that are most correlated with the top two principal components. The top PC-correlated features lend further insight by identifying the specific pixels that are primarily capturing these changes in slantedness and upper loop closure (Fig.  4 h). These analyses illustrate how structures within the MNIST dataset are not ideally suited for hard clustering assignment, but also how ESCHR is able to identify these structures and provide deeper insights than could be obtained by other hard clustering methods, or even beyond what is available from the ground truth class assignments.

ESCHR captures cell types and continuity in static adult tissue

To illustrate how ESCHR can provide additional interpretability and insight for single-cell datasets, we selected an integrated scRNA-seq dataset of hypothalamic tanycytes [ 46 ] for further analysis. Tanycytes are elongated ependymoglial cells that form the ventricular layer of the third ventricle and median eminence and have historically been classified into four subtypes (α1, α2, β1, β2) based on the hypothalamic nuclei to which they project, their spatial localization along the third ventricle, and their morphological, structural, genetic, and functional properties (Fig.  5 a) [ 64 ]. More recent studies have suggested that many of these properties may exhibit substantial continuity between and within each of these subtypes [ 43 , 44 , 45 , 46 , 47 , 65 ]. However, individual tanycyte scRNA-seq studies and an integrated analysis of these datasets all reported discrete groupings of tanycytes defined by hard clustering approaches [ 44 , 46 , 66 , 67 , 68 ], with no insight into the robustness of these assignments or whether there is overlap or continuity between them.

figure 5

ESCHR identifies continuity between and within canonical cell subtypes in static adult tissue. a Schematic illustration of canonical tanycyte subtypes in their anatomical context surrounding the third ventricle. b UMAP visualization with points colored by ESCHR hard cluster labels. c UMAP visualization with points colored by ESCHR uncertainty score. d Heatmap dotplot showing expression of marker genes for the canonical tanycyte subtypes across the ESCHR hard clusters. e Heatmap visualization of ESCHR soft cluster memberships. f Nodes represent ESCHR hard clusters and are located on the centroid of the UMAP coordinates for all data points assigned to that hard cluster. Node size is scaled to the number of data points in a given cluster. Edges exist between nodes that were determined to have significant connectivity by ESCHR cluster connectivity analysis, and edge thickness is scaled to the connectivity score. Node colors map to their ESCHR hard cluster colors from panel b (left) and to the color from panel a of the canonical subtype to which they primarily belong (right). g UMAP visualization with the subset of points that were included in the ordering analysis colored by ESCHR soft cluster membership (SCM) rank ordering score, and all others colored gray. h – j UMAP visualizations where points included in the ordering analysis are colored by their expression level and all others are colored gray. k – m Scatterplots showing normalized mRNA abundance on the y -axis and SCM rank order on the x -axis. Expression is bounded between the 2nd and 98th percentiles. Lines show Gaussian-smoothed B-splines fit to the data. n Schematic illustration of the anatomical region being shown in o – q . o – q In situ hybridization (ISH) of coronal brain sections, using probes specific for Igfbp5, Tgfb2, and Crym (Allen Mouse Brain Atlas). Red arrowheads indicate the areas of expression in the region of interest

Initial ESCHR analysis produced hard clustering outputs that match canonical tanycyte subtypes by their RNA expression profiles (Fig.  5 b–d) [ 46 ]. Subtypes β1 (expressing Frzb, Penk, Rlbp1, and Ptn) and α2 (expressing Vcan, Nr2e1, Fabp5, and Slc7a11) are represented by multiple hard clusters, while the subtypes β2 (expressing Scn7a, Col25a1, Meat, and Lrrtm3) and α1 (expressing Mafb, Necab2, Agt, Slc17a8, and Lyz2) each correspond to a single hard cluster, indicating that there is more transcriptional diversity within the β1 and α2 populations. On top of this, however, ESCHR uncertainty scores identify substantial heterogeneity within each hard cluster, including the β2 and α1 clusters (Fig.  5 c), and the soft cluster memberships reveal additional levels of overlap and continuity between these canonical tanycyte subtypes (Fig.  5 e). ESCHR cluster connectivity mapping (Methods) revealed significant overlap between the β1 clusters (2, 3, and 5) and each of the other three canonical subtypes (Fig.  5 f). This result was somewhat unexpected, because transcriptional continuity was previously thought to exist only between spatially neighboring tanycyte subtypes [ 45 , 65 ]. A more recent study provided evidence that β1 tanycytes exhibit some transcriptional continuity with both α1 and α2 tanycytes, but also indicated that β2 tanycytes were non-overlapping and transcriptionally distinct [ 43 ]. Our analysis with ESCHR soft clustering memberships and cluster connectivity provides additional corroborating evidence for the transcriptional continuity between β1 and α1/α2 tanycytes, but also reveals a previously uncharacterized relationship of transcriptional continuity between β1 and β2 tanycytes.

To further investigate this previously uncharacterized transcriptional overlap between β1 and β2 tanycytes, specifically between ESCHR clusters 1 and 2, we selected the subset of cells comprising the transitional zone between clusters and rank-ordered these based on whether their soft cluster membership was closer to β1 (ESCHR cluster 1) or β2 (ESCHR cluster 2) (Fig.  5 g and Methods). Using this rank ordering scheme, we identified genes with expression patterns that correlate with progression through the transition zone from β2 to β1 tanycytes, either decreasing across the transition like Igfbp5 (Fig.  5 h, k), peaking during the transition like Tgfb2 (Fig.  5 i, l), or increasing across the transition like Crym (Fig.  5 j, m). We next sought to determine whether these gene expression patterns in the transitional zone between ESCHR clusters were also observed in the spatial distribution of β2 and β1 tanycytes along the median eminence and third ventricle where these subtypes are thought to reside (Fig.  5 n). To investigate this possibility, we examined the in situ hybridization (ISH) database from the Allen Mouse Brain Atlas (ABA; http://mouse.brain-map.org ) [ 69 ] and observed that the overlapping expression for these three genes did in fact manifest as progressive spatial overlap spanning the anatomical regions canonically associated with β2 and β1 populations (Fig.  5 o–q). Altogether, this analysis of tanycyte subtypes demonstrates the utility of ESCHR for (1) identifying robust and biologically meaningful hard cluster assignments, (2) providing insight into the overlap and continuity between cell type clusters, and (3) providing a springboard for further analysis of expression level transitions via soft cluster membership ordering.

Discussion and conclusions

Clustering is a fundamental tool for single-cell analysis, used to identify groupings of cell types or cell states that serve as the basis for direct comparisons between biological samples or between specific cell types within a biological sample, as well as for numerous further downstream applications. However, it has proven challenging to generate appropriate and consistent cell groupings when using previously available clustering methods on single-cell datasets, due to (1) continuity and overlap between cell types, (2) stochasticity built into the clustering algorithms, and (3) non-generalizable hyperparameter settings that were optimized for a specific dataset or data type. To overcome these limitations, we developed ESCHR, a user-friendly method for ensemble clustering that captures both discrete and continuous structures within a dataset and transparently communicates the level of uncertainty in cluster assignment. Using a large collection of datasets representing a variety of measurement techniques, tissues of origin, species of origin, and dataset sizes, we benchmarked ESCHR’s performance against several other clustering algorithms, demonstrating that ESCHR consistently provides the highest robustness and accuracy across all categories of this diverse dataset collection.

One of the key design features of ESCHR is our approach using hyperparameter randomization during the ensemble generation step. While this was a deliberate design choice to generate diversity among the base clusterings to enhance the robustness and generalizability of clustering, an additional benefit is that it removes the need to manually test and select an optimized set of hyperparameters for each dataset. This design also affords several avenues for potential future improvements to the ESCHR algorithm, such as expanding the number of hyperparameters randomized in order to generate an even more diverse clustering ensemble. For example, we currently use k-nearest neighbor (kNN) graphs for the base Leiden clustering steps, but mutual nearest neighbor (mNN) or shared nearest neighbor (sNN) have shown good performance in other frameworks [ 3 , 41 , 70 ], and may improve ESCHR performance if incorporated as an additional hyperparameter to vary. ESCHR may also benefit from expanding the set of distance metrics utilized. We currently restrict our analysis to euclidean and cosine distances due to their efficient implementations within our chosen fast approximate nearest neighbor (ANN) package [ 71 ]. However, recent research has demonstrated the efficacy of a broader range of distance metrics for capturing diverse data structural properties [ 72 ]. While not all of these metrics may be applicable in an ANN context, several may hold the potential for enhancing the quality of our clustering outcomes. Additionally, the current version of ESCHR uses only Leiden community detection for clustering in the ensemble stage, but additional base clustering methods could be explored and potentially incorporated in future versions. Finally, our empirical identification of optimal ranges for ESCHR’s numeric hyperparameters was somewhat limited by the time and memory required for running these experiments with many, sometimes large, datasets and very wide search spaces. It is therefore possible that there may be more optimal default ranges or more sophisticated regimes for hyperparameter randomization and selection that could improve ESCHR’s performance.

Another key design feature of ESCHR is our soft clustering approach for generating the final consensus results. Single-cell data is inherently complex and heterogeneous, and clustering methods often make assumptions about the structure of the data that may not hold in practice. For example, hard clustering methods assume discrete groups of single cells, which rarely exist in biological data [ 15 ]. Many clustering algorithms make further assumptions about the shapes and other properties of these discrete groups. In the opposite direction, toward continuity rather than discreteness, numerous methods have been developed for trajectory inference in single-cell datasets [ 4 ], but these methods also make assumptions about dataset structure, for example, many force a branched tree structure. ESCHR’s soft cluster outputs enable unified mapping of both discrete and continuous grouping structures, without the need for assumptions about the shape and properties of the dataset. To illustrate this concept, we used ESCHR to identify tanycyte subtypes and reveal the transitional continuity between them (Fig.  5 a–q, Additional File 1: Fig. S10), which is notable because assumptions about lineage relationships or dynamic developmental processes in this static adult tissue would be inappropriate and could lead to inaccuracies and distortion. Instead, ESCHR can identify and characterize discrete and continuous patterns simultaneously, even in the same dataset, without relying on assumptions about data shape and properties.

One of ESCHR’s most useful outputs is the per-cell uncertainty score, which enables users to estimate clustering uncertainty and interpret hard clustering results more effectively. The Impossibility Theorem for clustering states that no clustering method can satisfy all three proposed axioms of good clustering, and therefore all clustering algorithms must make trade-offs among these desirable features, and no clustering result can be perfect [ 17 ]. Because of this, it is critical to evaluate the inherent uncertainty in a clustering result before using it for direct comparisons, downstream analyses, or hypothesis generation. ESCHR uncertainty scores, which are derived from the degree of cluster overlap for each datapoint as indicated by their soft cluster assignments, provide a useful proxy for this uncertainty and difficulty in cluster assignment. These scores can be visualized alongside hard cluster assignments to facilitate a more discerning interpretation of clustering results. We have validated the utility of these uncertainty scores by demonstrating that (1) they identify areas of ground truth continuity due to cells transitioning between cell states in simulated scRNA-seq data (Fig.  2 b, d), (2) they are significantly higher for inaccurately assigned data points (Fig.  2 g), and (3) they are significantly negatively correlated with the level of agreement between clustering algorithms (Fig.  2 h). Altogether, these findings demonstrate that ESCHR uncertainty scores provide meaningful insights into clustering uncertainty.

To make the advantages of ESCHR clustering easily accessible to the research community, we have made ESCHR available as a Python module on GitHub ( https://github.com/zunderlab/eschr ), packaged as an extensible software framework that is compatible with the scverse suite of single-cell analysis tools [ 33 ]. We have provided tutorials for how to incorporate it into existing single-cell analysis workflows as well as for how to use it as a standalone analysis framework. In conclusion, our results demonstrate that ESCHR is a useful method for single-cell analysis, offering robust and reproducible clustering results with the added benefits of per-cell uncertainty scores and soft clustering outputs for improved interpretability. By emphasizing ease of adoption, clustering robustness and accuracy, generalizability across a wide variety of datasets, and improved interpretability through soft clustering outputs and the quantification of uncertainty, we aim to support the responsible and informed use of clustering results in the single-cell research community.

ESCHR Framework

ESCHR takes as input a matrix, M , with n instances (e.g., cells) as rows and d features (e.g., genes/proteins) as columns. It does not perform internal normalization or correction, so input data are expected to have already been preprocessed appropriately. ESCHR proceeds in three primary steps: base clustering to generate the ensemble, consensus determination, and output/visualization.

Consistent with other published manuscripts in this domain, we will use the following notation. Let \(X=\{x_1, x_2, \ldots, x_n\}\) denote a set of objects to be clustered, where each \(x_i\), \(i = 1 \ldots n\), is a point in a \(d\)-dimensional feature space. Let \(X_s = \{x_{i_1}, x_{i_2}, \ldots, x_{i_r}\}\) denote a random subset of \(X\), where the indices \(i_1, \ldots, i_r\) lie between 1 and \(n\). \(\mathbb{P} = \{P_1, P_2, \ldots, P_m\}\) is a set of partitions, where each \(P_i = \{C_1^i, C_2^i, \ldots, C_{q_i}^i\}\) is a partition of an independent instantiation of \(X_s\) and contains \(q_i\) clusters. \(C_j^i\) is the \(j\)th cluster of the \(i\)th partition, for all \(i = 1 \ldots m\). \(t = \sum_{i=1}^{m} q_i\) is the total number of clusters from all ensemble members. Where \(\mathbb{P}_X\) is the set of all possible partitions of the set of objects \(X\) and \(\mathbb{P} \subset \mathbb{P}_X\), the goal of clustering ensemble methods is to find a consensus partition \(P^* \in \mathbb{P}_X\) which best represents the properties of each partition in \(\mathbb{P}\). In what follows, the more general terms “instance” and “feature” will generally be used rather than domain-specific terms such as cells and genes/proteins.

Hyperparameter-randomized ensemble clustering

The ESCHR ensemble is generated with Leiden community detection as the base clustering algorithm [ 55 ]. Leiden is applied using Reichardt and Bornholdt’s Potts model with a configuration null model [ 73 ]. Diversity is generated among ensemble members through a combination of data subsampling and Leiden hyperparameter randomization. The subsampling percentage varies for each ensemble member and is selected from a Gaussian distribution with the mean μ scaled to dataset size, constrained to the range of 30% to 90%. After subsampling a random subset \({X}_{s}\) from \(X\), principal components analysis (PCA) is applied to generate the most informative features for this data subspace. The number of PCs defaults to 30, or to one less than the number of features when fewer than 30 features are available. In the subsequent clustering step, three hyperparameters are randomized for each ensemble member: (1) \(k\), the number of neighbors for building a k-nearest neighbors (kNN) graph; (2) the choice of distance metric for building the kNN graph; and (3) \(r\), a resolution parameter for the modularity optimization function used in Leiden community detection. The numerical hyperparameters \(k\) and \(r\) are randomly selected from within empirically established ranges (Additional File 1: Fig. S2). The distance metric is selected as either euclidean or cosine, because these choices are efficiently implemented for fast calculation of approximate nearest neighbors (ANN) in our chosen implementation, nmslib [ 71 ]. Since each ensemble member is independent, we implemented parallelization via multiprocessing for this stage of the algorithm. Ensemble size is set at a default of 150 based on experiments demonstrating that this was sufficient to reach convergence to a stable solution (Additional File 1: Fig. S2).
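For illustration, the following sketch shows how one such ensemble member could be generated using Scanpy's PCA/neighbors/Leiden wrappers; ESCHR itself uses nmslib for approximate nearest neighbors and leidenalg directly, so this is a simplified stand-in rather than the actual implementation, and the function name is hypothetical.

```python
import numpy as np
import scanpy as sc

def one_base_partition(adata, rng):
    """Sketch of a single ensemble member: subsample, PCA, kNN graph, Leiden."""
    frac = float(np.clip(rng.normal(0.6, 0.1), 0.3, 0.9))  # subsampling fraction (illustrative mean)
    k = int(rng.integers(15, 151))                          # number of neighbors
    metric = str(rng.choice(["euclidean", "cosine"]))       # distance metric
    res = float(rng.uniform(0.25, 1.75))                    # Leiden resolution

    idx = rng.choice(adata.n_obs, size=int(frac * adata.n_obs), replace=False)
    sub = adata[idx].copy()

    n_pcs = min(30, sub.n_vars - 1)   # default 30 PCs, or fewer if few features
    sc.pp.pca(sub, n_comps=n_pcs)
    sc.pp.neighbors(sub, n_neighbors=k, metric=metric, use_rep="X_pca")
    sc.tl.leiden(sub, resolution=res, key_added="base")
    return idx, sub.obs["base"].to_numpy()
```

Repeating a call like this (e.g., 150 times, in parallel) and recording which subsampled instances fall into which base clusters yields the ensemble that is then assembled into the bipartite graph described below.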

Bipartite graph clustering and consensus determination

Bipartite graph clustering was used to obtain consensus clusters from the ESCHR ensemble. This approach was selected because methods that compute consensus using unipartite projection graphs of either instance or cluster pairwise relations suffer from information loss [ 58 ]. For these calculations, the biadjacency matrix is defined as: \(B=\left[\begin{array}{cc}0& {A}^{T}\\ A& 0\end{array}\right]\) where \(A\) is an \(n\times t\) connectivity matrix whose rows correspond to instances {1... n} and columns correspond to the ensemble clusters {1... t}. \({A}_{i,j}\) is an indicator that takes value 1 if instance \(i\) belongs to the \(j\) th cluster and 0 otherwise. Using this, we then create a bipartite graph \(G=(V,W)\) . The weights matrix \(W = B,\) and \(V={V}_{1}\cup {V}_{2}\) , where \({V}_{1}\) contains \(n\) vertices each representing an instance of the data set \(X\) ; \({V}_{2}\) contains \(t\) vertices each representing a cluster of the ensemble (see Fig.  1 a “Ensemble bipartite graph”). Given our bipartite graph G , we can define a community structure on G as a partition \({P}_{1}=\left\{{\text{C}}_{1},{C}_{2},...,{C}_{{k}_{1}}\right\}\) containing pairwise disjoint subsets of \({V}_{1}\) and \({P}_{2}=\left\{{\text{D}}_{1},{D}_{2},...,{D}_{{k}_{2}}\right\}\) containing pairwise disjoint subsets of \({V}_{2}\) , such that all \({V}_{1}\) nodes in a specific \({C}_{i}\) are more connected to a particular subset of \({V}_{2}\) than the rest of the nodes in \({V}_{1}\) are, and likewise (but opposite) for a given \({D}_{j}\) of \({V}_{2}\) . Optimal \({P}_{1}\) and \({P}_{2}\) are computed with the Leiden algorithm for bipartite community detection with the Constant Potts Model quality function [ 57 , 74 ]. This approach was designed to overcome the resolution limit of previous bipartite community detection approaches [ 75 , 76 ]. There is one hyperparameter for this approach, the resolution \(\gamma\) , which indirectly influences the number of clusters for \({P}_{1}\) and \({P}_{2}\) by modulating the density of connections within and between communities [ 74 ]. To avoid the need for external hyperparameter tuning, we implemented an internal hyperparameter selection strategy at this stage. First, ESCHR generates a set of potential consensus labelings across an internally-specified range of \(\gamma\) values. Since ARI can be used as a similarity measure between two different clustering results, ESCHR then calculates the pairwise ARI between each of the final consensus labelings generated using each different \(\gamma\) value. Finally, ESCHR selects the result that has the highest sum of similarity to all other results from the set of potential consensus labelings (the medoid) to return as the final consensus result. In experiments to validate this approach, we found that the number of and the memberships in the final consensus hard clusters is robust to the setting of this resolution parameter, indicating that more extensive optimization is not required (Additional File 1: Fig. S4d–e). To obtain the final consensus result, we collapse the base ensemble clusters contained in \({V}_{2}\) into the \({P}_{2}\) meta-cluster to which they were assigned. This results in each vertex of \({V}_{1}\) having a weighted edge to each meta-cluster equal to the sum of its edges with constituent base clusters of \({V}_{2}\) . 
The resulting weighted bipartite graph \({G}^{*}\) therefore represents the final consensus clustering \({P}^{*}\) , with \(n\) vertices representing the instances, \({q}^{*}\) vertices representing the final consensus clusters, and weighted edges representing the membership of instance \(i\) in each of the \({q}^{*}\) clusters of \({P}^{*}\) .
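The internal resolution selection can be sketched as a medoid choice over candidate consensus labelings, using the same sklearn ARI implementation cited in the evaluation-metrics section; the function below is a hypothetical illustration of that idea, not ESCHR code.

```python
import numpy as np
from sklearn.metrics import adjusted_rand_score

def select_medoid_labeling(candidate_labelings):
    """Return the index of the candidate labeling with the highest summed
    pairwise ARI to all other candidates (the medoid)."""
    n = len(candidate_labelings)
    sim = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            s = adjusted_rand_score(candidate_labelings[i], candidate_labelings[j])
            sim[i, j] = sim[j, i] = s
    return int(np.argmax(sim.sum(axis=1)))
```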

Hard and soft clustering outputs

Let \(\Theta \in \mathbb{R}^{n \times q^*}\) be a nonnegative matrix where each row, \(\Theta_i := (\Theta_{i1}, \ldots, \Theta_{iq^*})\), contains nonnegative numbers that sum to at most one, representing the membership of instance \(i\) in each of the \(q^*\) clusters of \(P^*\). \(\Theta_{ij}\) is calculated by dividing the weight of the edge between instance \(i\) and consensus cluster \(D_j\) by the sum of all edge weights for instance \(i\). We refer to this matrix as the soft membership matrix and to each row as the association vector \(v\) for that instance. To determine hard clustering assignments, each instance is assigned to the meta-cluster with the highest entry in its association vector \(v\), with ties broken randomly. A “core” instance of a given cluster \(j\) will have \(\Theta_{ij}=1\) and zeros elsewhere, while a “transitional” instance may have up to \(q^*\) non-zero membership values. To describe the degree to which a given instance is “core” versus “transitional,” we define an “uncertainty score,” \(\Omega\), for each instance as the highest membership value in its association vector (\(\Omega = \max(v)\)). We can additionally calculate the mean membership of all instances in a given cluster \(j\) to yield a measure of that cluster’s discreteness, which we call the “cluster stability score,” \(s_j = \frac{1}{n}\sum_{i=1}^{n}\Theta_{ij}\).
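The quantities defined above can be computed directly from the per-instance edge weights of the collapsed bipartite graph; the sketch below assumes those weights are available as an \(n \times q^*\) count matrix and uses illustrative names rather than the ESCHR API.

```python
import numpy as np

def soft_and_hard_outputs(counts):
    """From an (n x q*) matrix of instance-to-consensus-cluster edge weights,
    derive the soft membership matrix, hard labels, per-instance maximum
    membership, and per-cluster mean membership."""
    theta = counts / counts.sum(axis=1, keepdims=True)  # soft membership matrix
    hard = theta.argmax(axis=1)                         # hard labels (ties to first index here)
    omega = theta.max(axis=1)                           # highest membership per instance
    stability = theta.mean(axis=0)                      # mean membership per cluster
    return theta, hard, omega, stability
```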

Cluster connectivity mapping

To map the connectivity structure of clusters, we first calculate the sum-of-squares-and-cross-products matrix (SSCP) of the soft membership matrix \(\Theta\), \(S=\Theta^{\top}\Theta\), and consider \({S}_{i,j}\) to be an uncorrected measure of connectivity between consensus clusters \(i\) and \(j\). To correct for connectivity that may result from random chance, we first estimate a null distribution of connectivity scores accounting for the following attributes of \(\Theta\): (1) the entries of the association vector \(v\) for a given instance are proportions that sum to no more than 1 (with instances summing to less than one being potential outliers), and (2) the values are not uniformly distributed and will be skewed differently for different datasets depending on overall levels of continuity or discreteness. In practice, we achieve this by independently shuffling the association vector, \(v\), of each instance to generate a random sample. We then calculate the SSCP for 500 iterations of this randomization procedure. Using this empirical null distribution, we calculate a p-value for each observed edge and prune edges that do not meet a default alpha value cutoff of 0.05. The final corrected connectivity between a given pair of clusters is thus defined as the ratio of the observed cross-product of instance memberships to the cross-product expected under this constrained randomization.
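A minimal sketch of this permutation-based correction, assuming the soft membership matrix \(\Theta\) is available as a NumPy array; the add-one p-value correction and the function name are illustrative choices, not the exact ESCHR procedure.

```python
import numpy as np

def connectivity_pvalues(theta, n_iter=500, rng=None):
    """Compare observed between-cluster cross-products (S = theta.T @ theta)
    against a null built by independently shuffling each instance's
    association vector, and return empirical p-values per cluster pair."""
    rng = np.random.default_rng() if rng is None else rng
    observed = theta.T @ theta
    exceed = np.zeros_like(observed)
    for _ in range(n_iter):
        shuffled = np.apply_along_axis(rng.permutation, 1, theta)  # row-wise shuffle
        exceed += (shuffled.T @ shuffled) >= observed
    return (exceed + 1) / (n_iter + 1)  # add-one-corrected empirical p-values
```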

Exploring soft cluster continuity for the MNIST dataset

To visualize the transition between clusters 1a and 1b (Fig.  4 f) that was identified by connectivity mapping of ESCHR clustering results, we devised a simple approach for creating a one-dimensional ordering of the instances in a transitional zone based on their membership in the connected clusters of interest. Specifically, the ordering score, \({\delta }_{i}\), of instance \(i\) having membership \({m}_{j}\) in the cluster at position \(j\) along the cluster path of interest was calculated as \({\delta }_{i}={\sum }_{j=1}^{N}{m}_{j}\cdot j\), where \(N\) is the number of clusters in the path. To obtain the relevant instances of interest for ordering, we used the following criteria: data points were included if they had (1) > 90% membership in either one of the 2 clusters of interest or (2) > 5% membership in both clusters and a combined membership of > 80% in the 2 clusters. We then visualized the progression of cluster memberships using a stacked bar plot of rank-ordered data points. The ordered data points were then partitioned into “core” and “transitional” datapoints for each cluster based on the bimodality observed in the ordering scores of the data points assigned to each hard cluster.
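In code, this ordering could look like the sketch below, where `theta` is the soft membership matrix and `path` lists the column indices of the two connected clusters; the inclusion thresholds follow the criteria above, and the names are illustrative rather than part of the ESCHR package.

```python
import numpy as np

def ordering_scores(theta, path):
    """Compute the ordering score delta_i = sum_j m_j * j for instances in the
    transitional zone between the clusters listed (in order) in `path`."""
    m = theta[:, path]
    keep = (m.max(axis=1) > 0.9) | ((m.min(axis=1) > 0.05) & (m.sum(axis=1) > 0.8))
    positions = np.arange(1, len(path) + 1)     # cluster positions 1..N along the path
    delta = m[keep] @ positions                 # one-dimensional ordering score
    order = np.argsort(delta)                   # rank ordering of the included instances
    return np.flatnonzero(keep)[order], delta[order]
```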

While a linear ordering approach could in principle be used to create an ordering across a path of more than two connected clusters, it would likely only be effective in cases where connectivity mapping identifies a linear path of successively connected clusters. In cases such as the example in Fig.  4 where connectivity mapping identified a ring of 3 connected clusters (9a, 9b, 4), this approach will generally not work as well since a group of more than two clusters with nonlinear connectivity may exhibit more complex continuity structures than could be captured with a simple linear ordering. We therefore devised another method for distilling the core continuity structure for cases of greater than two clusters and nonlinear connectivity paths. We first performed principal components analysis (PCA) on the columns of the soft membership matrix \(\Theta\) that correspond to the hard clusters selected for analysis, thereby capturing the primary axes of variation contained within these soft memberships. We then projected the data onto the first two PCs and used this to gain insight into the continuity structure by (1) visualizing the data points belonging to the relevant hard clusters projected into the space of these first two PCs and (2) identifying and exploring the features most highly correlated with these PCs.
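The PCA-based alternative for nonlinearly connected clusters can be sketched as follows; rows are restricted to instances hard-assigned to the clusters of interest, and the feature-correlation step mirrors the description above (names and the exact correlation bookkeeping are illustrative, not the ESCHR API).

```python
import numpy as np
from sklearn.decomposition import PCA

def soft_membership_pca(theta, clusters, features):
    """Project instances from the selected hard clusters onto the first two PCs
    of their soft memberships and correlate each feature with those PCs."""
    mask = np.isin(theta.argmax(axis=1), clusters)
    pcs = PCA(n_components=2).fit_transform(theta[np.ix_(mask, clusters)])
    f = features[mask]
    fz = (f - f.mean(axis=0)) / (f.std(axis=0) + 1e-12)   # z-scored features
    pz = (pcs - pcs.mean(axis=0)) / pcs.std(axis=0)        # z-scored PC coordinates
    corr = fz.T @ pz / mask.sum()                          # (n_features, 2) Pearson correlations
    top_pc1 = np.argsort(corr[:, 0])[::-1][:30]            # features most positively correlated with PC1
    return pcs, corr, top_pc1
```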

Exploring soft cluster continuity for the Tany-Seq dataset

To visualize transitional zones between connected clusters in the Tany-Seq dataset [ 46 ], we applied the same one-dimensional linear ordering approach that was used to examine clusters 1a and 1b of the MNIST dataset in Fig.  4 f. To identify marker features associated with the one-dimensional soft cluster transition paths, we calculated the Pearson correlation between each feature and the vector of cluster memberships for each cluster in the path. Features were then selected based on their correlation with each of the clusters individually and based on the sum of their correlations across the clusters. The three genes in Fig.  5 h–j were selected from the top ten features identified through each of these methods based on their expression patterns and the availability of in situ hybridization images of sufficient quality in the Allen Mouse Brain Atlas [ 69 ]. To handle outliers in the expression heatmap UMAP plots and the expression scatterplots in Fig.  5 h–j, values were thresholded to fall between the 2nd and 98th percentiles. The curves overlaid on the expression scatterplots in these panels were generated by first fitting cubic (degree 3) B-splines to the points included in the scatterplot. To generate a smoothed curve, a Gaussian kernel with a sigma of 10 was then applied to the spline evaluated at 100 evenly spaced points across the range of the ordering. This approximates the behavior, for large data sizes, of the “geom_smooth” function from the R package ggplot2 [ 77 ].
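A sketch of this curve-fitting step using SciPy is shown below; it assumes unique, sortable ordering scores and uses `make_interp_spline` plus `gaussian_filter1d` as stand-ins for the described cubic B-spline fit and Gaussian smoothing, so it approximates rather than reproduces the exact procedure.

```python
import numpy as np
from scipy.interpolate import make_interp_spline
from scipy.ndimage import gaussian_filter1d

def smoothed_expression_curve(order, expression, sigma=10, n_eval=100):
    """Fit a cubic B-spline to expression values along the rank ordering,
    evaluate it at evenly spaced points, and smooth with a Gaussian kernel."""
    x = np.asarray(order, dtype=float)
    y = np.asarray(expression, dtype=float)
    idx = np.argsort(x)                                # spline requires increasing x
    spline = make_interp_spline(x[idx], y[idx], k=3)   # cubic (degree 3) B-spline
    grid = np.linspace(x.min(), x.max(), n_eval)       # 100 evenly spaced points
    return grid, gaussian_filter1d(spline(grid), sigma=sigma)
```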

Clustering evaluation metrics

Extrinsic evaluation metrics measure the distance of a clustering result to some external set of labels. When these labels are ground truth class labels, we can consider these to be measures of accuracy. However, they can also be used in other contexts, such as with an “external” set of clustering labels. There are numerous metrics that can be used to measure this distance between a given set of predicted cluster labels and a ground truth or other set of external labels. Each of these metrics introduces some type of bias in evaluating the accuracy and robustness of clustering results, as discussed further below. To diversify these biases, we selected two metrics from different categories: Adjusted Rand Index (ARI) from the category of methods that employ peer-to-peer correlation, and Adjusted Mutual Information (AMI) from the information theoretic measures [ 63 ]. We use both ARI and AMI to evaluate accuracy and robustness in our systematic benchmarking (Fig.  3 ; Additional File 1: Figs. S2 and S7).

The ARI is the corrected-for-chance version of the Rand index, which measures the agreement between two sets of partition labels \(U\) and \(V\) [ 78 ]. The ARI is defined as:

\(ARI=\frac{2\left(ab-cd\right)}{\left(a+d\right)\left(d+b\right)+\left(a+c\right)\left(c+b\right)}\)

where \(a\) is the number of pairs of two objects in the same group in both \(U\) and \(V\) ; \(b\) is the number of pairs of two objects in different groups in both \(U\) and \(V\) ; \(c\) is the number of pairs of two objects in the same group in \(U\) but in different groups in \(V\) ; and \(d\) is the number of pairs of two objects in different groups in \(U\) but in the same group in \(V\) . Random clusterings have an expected score of zero and identical partitions have a score of 1. ARI is biased towards solutions containing (1) balanced clusters (i.e., similar size clusters within each partition) and (2) similar cluster numbers and sizes between the two partitions [ 13 ]. ARI was calculated using the implementation in sklearn (v 1.0.1).

AMI is the corrected-for-chance version of Mutual Information, which quantifies the amount of information that can be obtained about one random variable (in this application, a list of cluster labels) by observing the other random variable (another list of cluster labels) [ 79 ]. Let \(C=\left\{{C}_{1},{C}_{2},\ldots,{C}_{tc}\right\}\) and \(G=\left\{{G}_{1},{G}_{2},\ldots,{G}_{tg}\right\}\) be the predicted and ground truth labels on a dataset with \(n\) cells. AMI is then defined as:

\[AMI\left(C,G\right) = \frac{I\left(C,G\right) - E\left\{I\left(C,G\right)\right\}}{\max\left\{H\left(C\right),H\left(G\right)\right\} - E\left\{I\left(C,G\right)\right\}}\]

Here, \(I(C,G)\) represents the mutual information between \(C\) and \(G\) and is defined as:

\[I\left(C,G\right) = \sum_{p=1}^{tc}\sum_{q=1}^{tg}\frac{\left|{C}_{p}\cap {G}_{q}\right|}{n}\,\text{log}\frac{n\left|{C}_{p}\cap {G}_{q}\right|}{\left|{C}_{p}\right|\left|{G}_{q}\right|}\]

\(H(C)\) and \(H(G)\) are the entropies: \(H(C)=-{\sum }_{p=1}^{tc}\frac{\left|{C}_{p}\right|}{n}\text{log}\frac{\left|{C}_{p}\right|}{n}\) and \(H(G)=-{\sum }_{q=1}^{tg}\frac{\left|{G}_{q}\right|}{n}\text{log}\frac{\left|{G}_{q}\right|}{n}\). \(E\left\{I\left(C,G\right)\right\}\) is the expected mutual information between two random clusterings. Random clusterings have an expected score of zero and identical partitions have a score of 1. AMI is biased towards solutions containing pure clusters, with a “pure cluster” being defined as a cluster in one set of labels that contains instances from only one cluster of the other set of labels to which it is being compared [ 13 ]. AMI was calculated using the implementation in sklearn (v 1.0.1).
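Both metrics can be computed with sklearn's public API; for example (the call sites in the actual benchmarking code may differ):

```python
from sklearn.metrics import adjusted_rand_score, adjusted_mutual_info_score

# `true_labels` and `predicted_labels` are hypothetical integer label vectors of equal length.
true_labels = [0, 0, 1, 1, 2, 2]
predicted_labels = [0, 0, 1, 2, 2, 2]

ari = adjusted_rand_score(true_labels, predicted_labels)
ami = adjusted_mutual_info_score(true_labels, predicted_labels)
print(f"ARI = {ari:.3f}, AMI = {ami:.3f}")
```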

Systematic benchmarking

For benchmarking ESCHR, we selected the following clustering algorithms for comparison: (1&2) K-means and agglomerative hierarchical clustering (from scikit-learn version 1.0.1) [ 80 ], (3) SC3 (version 1.10.1 from Bioconductor) [ 24 ], (4) SC3s (version 0.1.1 through Scanpy) [ 25 ], (5) Seurat (version 4.1.1 from CRAN) [ 40 ], (6) SINCERA (version 1.0 from https://github.com/xu-lab/SINCERA ) [ 39 ], (7) Scanpy (version 1.8.2 from Anaconda) [ 35 ], (8) Phenograph (version 1.5.7) [ 41 ], (9) scCAN (version 1.0 from https://github.com/bangtran365/scCAN ) [ 38 ], and (10) scAIDE (version 1.0 from https://github.com/tinglabs/scAIDE ) [ 37 ].

Clustering algorithms were excluded from our benchmarking comparison if they did not meet the following selection criteria: (1) software freely available; (2) code publicly available; (3) can run on multiple data modalities (e.g. not scRNA-seq-specific); (4) no unresolved errors during install or implementation; (5) does not require additional user input during the algorithm (other than prior information); and (6) able to complete analysis of datasets with ≥ 100,000 data points and 2000 features.

For the included methods, we followed the instructions and tutorials provided by the authors of each software package. For the K-means, SC3s, and Agglomerative methods, which require pre-specification of cluster number, we calculated distortion scores over a range of cluster numbers for K-means clustering and used the elbow method to select the optimal cluster number for use across all three methods. Default values were used for all other hyperparameters for each tool, as is common practice for most realistic use cases [ 6 , 81 ]. ESCHR was also run with all default settings, which is the intended usage. For all benchmarking analyses, the memory was set to 100 GB of RAM on the University of Virginia (UVA) Rivanna High Performance Computing (HPC) cluster.
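For the elbow-based selection of the cluster number, a simple sketch is shown below; the k range and the second-difference heuristic for locating the elbow are illustrative assumptions, not necessarily the exact procedure used:

```python
import numpy as np
from sklearn.cluster import KMeans

def elbow_select_k(X, k_range=range(2, 21), random_state=0):
    """Distortion (inertia) scores over a range of k, with the elbow located by a
    simple maximum-curvature (second difference) heuristic."""
    ks = list(k_range)
    inertias = [KMeans(n_clusters=k, random_state=random_state).fit(X).inertia_ for k in ks]
    curvature = np.diff(inertias, 2)          # second difference of the distortion curve
    return ks[int(np.argmax(curvature)) + 1], dict(zip(ks, inertias))
```

In practice the distortion curve can also simply be plotted and the elbow chosen by inspection.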

No random seeds were intentionally fixed, but from inspecting the respective codebases, we believe it is likely that internally fixed random seeds remained in some functions within some of the tested methods. Many common methods have internally fixed random seeds and/or default hyperparameters that fix random seeds. This practice can mask a lack of robustness in these methods; fixed seeds should legitimately serve only to replicate an exact analysis when that is desired by the end user.

To assess the accuracy of methods in clustering our synthetic and image datasets, which have ground truth labels, we used the two extrinsic evaluation metrics defined above (ARI, AMI). For these purposes, each of the five independent runs of a given method was scored against the ground truth labels. Since it is nearly universal in papers describing new single-cell analysis methods, we also applied this analysis to evaluate the accuracy of each of the methods for our collection of “real” datasets using available published labels. However, we stress that we do not think this is a reliable or effective measure for evaluating clustering methods, as we detail further in Supplementary Note 1 (Additional File 1: Supplementary Note 1).

We also used the extrinsic metrics ARI and AMI to evaluate the stability and reproducibility of hard clustering results. In line with standard practice for benchmarking stability/robustness [ 82 ], we performed repeated runs with 5 random subsamples (90%) of each dataset for every method. This simulates slight differences in data collection and/or preprocessing; if the clustering captures true underlying structure rather than overfitting to noise, that structure should be detected regardless of the exact set of cells sampled for the analysis. We then calculated pairwise scores for each metric between the 5 independent runs of a given method and took the mean across replicate pairs to obtain the final score per dataset-method pair.
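A minimal sketch of this subsampling scheme is shown below; restricting each pairwise comparison to the cells shared by the two subsamples is an assumption on our part, and `cluster_fn` is a placeholder for any of the benchmarked methods:

```python
import itertools
import numpy as np
from sklearn.metrics import adjusted_rand_score, adjusted_mutual_info_score

def stability_scores(X, cluster_fn, n_runs=5, frac=0.9, seed=0):
    """Cluster n_runs random subsamples (90% of cells) and average pairwise
    ARI/AMI between runs, scored on the cells shared by each pair of subsamples."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    runs = []
    for _ in range(n_runs):
        idx = rng.choice(n, size=int(frac * n), replace=False)
        labels = cluster_fn(X[idx])                   # labels for the subsampled cells
        runs.append(dict(zip(idx.tolist(), labels)))  # map original cell index -> label
    ari, ami = [], []
    for a, b in itertools.combinations(runs, 2):
        shared = sorted(set(a) & set(b))
        la = [a[i] for i in shared]
        lb = [b[i] for i in shared]
        ari.append(adjusted_rand_score(la, lb))
        ami.append(adjusted_mutual_info_score(la, lb))
    return float(np.mean(ari)), float(np.mean(ami))
```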

To calculate both the method and replicate “agreement scores” for comparison with ESCHR uncertainty scores (Fig.  2 j), we first constructed contingency matrices between all pairs of replicates and methods and mapped the cluster labels from the result with more clusters to the result with fewer clusters. Using the shared labels between a given pair of clustering results we could then calculate per-instance agreement (binary) within the pair of results. The final per-instance score was calculated as the mean agreement across all possible combinations.
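The mapping step can be sketched as follows; this is one plausible reading of the procedure, assuming all label vectors cover the same cells and that each cluster of the finer result is assigned to the coarse cluster with which it overlaps most (both assumptions, not stated above):

```python
import itertools
import numpy as np
from sklearn.metrics.cluster import contingency_matrix

def per_instance_agreement(label_sets):
    """Mean per-cell agreement across all pairs of label vectors (same cells in each).
    For each pair, clusters of the finer result are mapped onto the coarser result
    via the contingency matrix, then agreement is scored per cell as a binary match."""
    label_sets = [np.asarray(l) for l in label_sets]
    agree = np.zeros(len(label_sets[0]))
    pairs = list(itertools.combinations(label_sets, 2))
    for a, b in pairs:
        # Treat the result with more clusters as "fine" and the other as "coarse".
        fine, coarse = (a, b) if len(np.unique(a)) >= len(np.unique(b)) else (b, a)
        cm = contingency_matrix(fine, coarse)   # rows: fine clusters, cols: coarse clusters
        fine_ids, coarse_ids = np.unique(fine), np.unique(coarse)
        mapping = {f: coarse_ids[np.argmax(cm[i])] for i, f in enumerate(fine_ids)}
        mapped = np.array([mapping[f] for f in fine])
        agree += (mapped == coarse)
    return agree / len(pairs)
```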

Statistical analyses

Statistical comparisons were performed using the “scipy.stats” and “statannotations” Python packages [ 83 , 84 ]. The two-sided Wilcoxon signed-rank test with Bonferroni correction was used to compare the performance of ESCHR against each alternative method in the systematic benchmarking panels shown in Fig. 3. Comparisons were calculated using dataset means across replicates for all tested datasets, with N = 21 for comparisons using synthetic datasets and N = 45 for comparisons using real datasets. The two-sided Mann-Whitney U test was used to compare uncertainty scores between accurately and inaccurately assigned cells in Fig. 2g, with N = 126,545 and N = 750,955 for the inaccurate and accurate groups, respectively. The resulting p-value was smaller than the smallest value distinguishable from zero in a standard Python computing environment and was returned as zero, so we report it as p < 0.00001 in the figure.
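As an illustration, the corresponding scipy calls look roughly like this, with made-up example arrays standing in for the real benchmark scores and an illustrative number of compared methods:

```python
import numpy as np
from scipy.stats import wilcoxon, mannwhitneyu

rng = np.random.default_rng(0)

# Hypothetical per-dataset mean scores for ESCHR and one alternative method (N datasets).
eschr_scores = rng.uniform(0.5, 1.0, size=21)
other_scores = eschr_scores - rng.uniform(0.0, 0.2, size=21)

# Two-sided Wilcoxon signed-rank test on the paired per-dataset means,
# with a Bonferroni correction over the number of compared methods (illustrative).
n_methods_compared = 10
stat, p = wilcoxon(eschr_scores, other_scores, alternative="two-sided")
p_bonferroni = min(1.0, p * n_methods_compared)

# Two-sided Mann-Whitney U test comparing uncertainty scores between
# inaccurately and accurately assigned cells (unpaired groups).
inaccurate = rng.normal(0.6, 0.1, size=1000)
accurate = rng.normal(0.3, 0.1, size=5000)
u_stat, p_mwu = mannwhitneyu(inaccurate, accurate, alternative="two-sided")
```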

Availability of data and materials

All “real” datasets are publicly available. The 45 datasets used in our study are described in Supplementary Table 2 (Additional File 2: Table S2), including information and links to download preprocessed datasets. The only processing step we performed was log2 transformation for scRNA-seq datasets and arcsinH transformation for mass cytometry datasets if a given dataset was not yet scaled. The MNIST dataset was downloaded from keras datasets [ 85 ] and was preprocessed as recommended by the accompanying documentation.

Synthetic Gaussian datasets were generated using sklearn “make_blobs” [ 80 ] with various combinations of object number, feature number, cluster number, cluster size, and cluster standard deviation. Anisotropic transformations were then applied to the resulting datasets by multiplying pairs of features (an n × 2 subset of the full data matrix) by different 2 × 2 matrices filled with random values between −2 and 2. These datasets are available at: https://doi.org/10.5281/zenodo.12746558 [ 86 ].
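A minimal sketch of this generation procedure (the parameter values are illustrative, not the exact grid used for the synthetic datasets):

```python
import numpy as np
from sklearn.datasets import make_blobs

rng = np.random.default_rng(0)

# Sample isotropic Gaussian blobs, then apply a random anisotropic transformation
# to each consecutive pair of features.
X, y = make_blobs(n_samples=5000, n_features=10, centers=8, cluster_std=1.5, random_state=0)
for j in range(0, X.shape[1] - 1, 2):
    T = rng.uniform(-2, 2, size=(2, 2))   # random 2x2 transformation with entries in [-2, 2]
    X[:, j:j + 2] = X[:, j:j + 2] @ T
```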

Simulated scRNA-seq datasets with 1000 “cells” and 1000 “genes” were generated using the DynToy package, which simulates different complex trajectory models based on real single-cell gene expression data [ 62 ]. These datasets are available at: https://doi.org/10.5281/zenodo.12786322 [ 87 ].

ISH images were downloaded from the ABA portal ( http://mouse.brain-map.org ) [ 88 ] and are freely available.

ESCHR is available under the MIT License and can be installed via PyPi ( https://pypi.org/project/eschr/ ) [ 89 ] or GitHub ( https://github.com/zunderlab/eschr ) [ 32 ]. The source code of the ESCHR version used for making the figures in this manuscript is available on zenodo ( https://doi.org/10.5281/zenodo.13380410 ) [ 90 ]. ESCHR is compatible with standard Python single-cell data structures and can be easily incorporated into scverse workflows or used as a standalone framework.

Further information and requests for resources should be directed to and will be fulfilled by Eli Zunder ([email protected]).

Svensson V, da Veiga Beltrame E, Pachter L. A curated database reveals trends in single-cell transcriptomics. Database (Oxford). 2020;2020:baaa073.

Xie B, Jiang Q, Mora A, Li X. Automatic cell type identification methods for single-cell RNA sequencing. Comput Struct Biotechnol J. 2021;19:5874–87.

Kiselev VY, Andrews TS, Hemberg M. Challenges in unsupervised clustering of single-cell RNA-seq data. Nat Rev Genet. 2019;20:273–82.

Saelens W, Cannoodt R, Todorov H, Saeys Y. A comparison of single-cell trajectory inference methods. Nat Biotechnol. 2019;37:547–54.

Zappia L, Theis FJ. Over 1000 tools reveal trends in the single-cell RNA-seq analysis landscape. Genome Biol. 2021;22:301.

Schneider I, Cepela J, Shetty M, Wang J, Nelson AC, Winterhoff B, et al. Use of “default” parameter settings when analyzing single cell RNA sequencing data using Seurat: a biologist’s perspective. JTGG. 2020;5:37–49.

Lu Y, Phillips CA, Langston MA. A robustness metric for biological data clustering algorithms. BMC Bioinformatics. 2019;20(Suppl 15):503.

Krzak M, Raykov Y, Boukouvalas A, Cutillo L, Angelini C. Benchmark and parameter sensitivity analysis of single-cell RNA sequencing clustering methods. Front Genet. 2019;10:1253.

Tang M, Kaymaz Y, Logeman BL, Eichhorn S, Liang ZS, Dulac C, et al. Evaluating single-cell cluster stability using the Jaccard similarity index. Bioinformatics. 2021;37:2212–4.

Gibson G. Perspectives on rigor and reproducibility in single cell genomics. PLoS Genet. 2022;18:e1010210.

Patterson-Cross RB, Levine AJ, Menon V. Selecting single cell clustering parameter values using subsampling-based robustness metrics. BMC Bioinformatics. 2021;22:39.

Renedo-Mirambell M, Arratia A. Identifying bias in network clustering quality metrics. PeerJ Comput Sci. 2023;9:e1523.

Amigó E, Gonzalo J, Artiles J, Verdejo F. A comparison of extrinsic clustering evaluation metrics based on formal constraints. Inf Retr Boston. 2009;12:461–86.

Freytag S, Tian L, Lönnstedt I, Ng M, Bahlo M. Comparison of clustering tools in R for medium-sized 10x Genomics single-cell RNA-sequencing data. [version 2; peer review: 3 approved]. F1000Res. 2018;7:1297.

Cembrowski MS, Menon V. Continuous Variation within Cell Types of the Nervous System. Trends Neurosci. 2018;41:337–48.

Tanay A, Regev A. Scaling single-cell genomics from phenomenology to mechanism. Nature. 2017;541:331–8.

Kleinberg J. An Impossibility Theorem for Clustering. Advances in neural information processing systems. 2002;15.

Huh R, Yang Y, Jiang Y, Shen Y, Li Y. SAME-clustering: Single-cell Aggregated Clustering via Mixture Model Ensemble. Nucleic Acids Res. 2020;48:86–95.

Burton RJ, Cuff SM, Morgan MP, Artemiou A, Eberl M. GeoWaVe: geometric median clustering with weighted voting for ensemble clustering of cytometry data. Bioinformatics. 2023;39:btac751.

Tsoucas D, Yuan G-C. GiniClust2: a cluster-aware, weighted ensemble clustering method for cell-type detection. Genome Biol. 2018;19:58.

Sagi O, Rokach L. Ensemble learning: A survey. WIREs Data Mining Knowl Discov. 2018;8:e1249.

Wan S, Kim J, Won KJ. SHARP: hyperfast and accurate processing of single-cell RNA-seq data via ensemble random projection. Genome Res. 2020;30:205–13.

Risso D, Purvis L, Fletcher RB, Das D, Ngai J, Dudoit S, et al. clusterExperiment and RSEC: a Bioconductor package and framework for clustering of single-cell and other large gene expression datasets. PLoS Comput Biol. 2018;14:e1006378.

Kiselev VY, Kirschner K, Schaub MT, Andrews T, Yiu A, Chandra T, et al. SC3: consensus clustering of single-cell RNA-seq data. Nat Methods. 2017;14:483–6.

Quah FX, Hemberg M. SC3s: efficient scaling of single cell consensus clustering to millions of cells. BMC Bioinformatics. 2022;23:536.

Zhu L, Lei J, Klei L, Devlin B, Roeder K. Semisoft clustering of single-cell data. Proc Natl Acad Sci USA. 2019;116:466–71.

Peters G, Crespo F, Lingras P, Weber R. Soft clustering – Fuzzy and rough approaches and their extensions and derivatives. Int J Approximate Reasoning. 2013;54:307–22.

Kanter I, Dalerba P, Kalisky T. A cluster robustness score for identifying cell subpopulations in single cell gene expression datasets from heterogeneous tissues and tumors. Bioinformatics. 2019;35:962–71.

Chen Z, Goldwasser J, Tuckman P, Liu J, Zhang J, Gerstein M. Forest Fire Clustering for single-cell sequencing combines iterative label propagation with parallelized Monte Carlo simulations. Nat Commun. 2022;13:3538.

Tasic B, Menon V, Nguyen TN, Kim TK, Jarsky T, Yao Z, et al. Adult mouse cortical cell taxonomy revealed by single cell transcriptomics. Nat Neurosci. 2016;19:335–46.

Karim MR, Beyan O, Zappa A, Costa IG, Rebholz-Schuhmann D, Cochez M, et al. Deep learning-based clustering approaches for bioinformatics. Brief Bioinformatics. 2021;22:393–415.

Goggin SM, Zunder ER. ESCHR. Computer software. Github; 2024. https://github.com/zunderlab/eschr .

Virshup I, Bredikhin D, Heumos L, Palla G, Sturm G, Gayoso A, et al. The scverse project provides a computational ecosystem for single-cell omics data analysis. Nat Biotechnol. 2023;41:604–6.

Stuart T, Butler A, Hoffman P, Hafemeister C, Papalexi E, Mauck WM, et al. Comprehensive Integration of Single-Cell Data. Cell. 2019;177:1888-1902.e21.

Wolf FA, Angerer P, Theis FJ. SCANPY: large-scale single-cell gene expression data analysis. Genome Biol. 2018;19:15.

Guo M, Xu Y. Single-Cell Transcriptome Analysis Using SINCERA Pipeline. Methods Mol Biol. 2018;1751:209–22.

Xie K, Huang Y, Zeng F, Liu Z, Chen T. scAIDE: clustering of large-scale single-cell RNA-seq data reveals putative and rare cell types. NAR Genom Bioinform. 2020;2:lqaa082.

Tran B, Tran D, Nguyen H, Ro S, Nguyen T. scCAN: single-cell clustering using autoencoder and network fusion. Sci Rep. 2022;12:10267.

Guo M, Wang H, Potter SS, Whitsett JA, Xu Y. SINCERA: A Pipeline for Single-Cell RNA-Seq Profiling Analysis. PLoS Comput Biol. 2015;11:e1004575.

Butler A, Hoffman P, Smibert P, Papalexi E, Satija R. Integrating single-cell transcriptomic data across different conditions, technologies, and species. Nat Biotechnol. 2018;36:411–20.

Levine JH, Simonds EF, Bendall SC, Davis KL, Amir ED, Tadmor MD, et al. Data-Driven Phenotypic Dissection of AML Reveals Progenitor-like Cells that Correlate with Prognosis. Cell. 2015;162:184–97.

The MNIST Database of Handwritten Digits. https://yann.lecun.com/exdb/mnist/ . Accessed 11 Feb 2023.

Brunner M, Lopez-Rodriguez D, Messina A, Thorens B, Santoni F, Langlet F. Pseudospatial transcriptional gradient analysis of hypothalamic ependymal cells: towards a new tanycyte classification. BioRxiv. 2023.

Campbell JN, Macosko EZ, Fenselau H, Pers TH, Lyubetskaya A, Tenen D, et al. A molecular census of arcuate hypothalamus and median eminence cell types. Nat Neurosci. 2017;20:484–96.

Langlet F. Tanycyte gene expression dynamics in the regulation of energy homeostasis. Front Endocrinol (Lausanne). 2019;10:286.

Sullivan AI, Potthoff MJ, Flippo KH. Tany-Seq: Integrated Analysis of the Mouse Tanycyte Transcriptome. Cells. 2022;11:1565.

Fong H, Kurrasch DM. Developmental and functional relationships between hypothalamic tanycytes and embryonic radial glia. Front Neurosci. 2022;16:1129414.

Dietterich TG. Ensemble Methods in Machine Learning. In: Multiple Classifier Systems. Berlin. Heidelberg: Springer Berlin Heidelberg; 2000. p. 1–15.

Ghosh J, Acharya A. Cluster ensembles. WIREs Data Mining Knowl Discov. 2011;1:305–15.

Strehl A, Ghosh J. Cluster ensembles - a knowledge reuse framework for combining multiple partitions. JMLR (J Mach Learn Res). 2002;3:583–617.

Ben-Hur A, Elisseeff A, Guyon I. A stability based method for discovering structure in clustered data. Pac Symp Biocomput. 2002:6–17.

Monti S, Tamayo P, Mesirov J, Golub T. Consensus clustering: a resampling-based method for class discovery and visualization of gene expression microarray data. Mach Learn. 2003;52:91–118.

Fred A. Finding consistent clusters in data partitions. In: Kittler J, Roli F, editors. Multiple Classifier Systems. Springer, Berlin Heidelberg: Berlin, Heidelberg; 2001. p. 309–18.

Naegle KM, Welsch RE, Yaffe MB, White FM, Lauffenburger DA. MCAM: multiple clustering analysis methodology for deriving hypotheses and insights from high-throughput proteomic datasets. PLoS Comput Biol. 2011;7:e1002119.

Traag VA, Waltman L, van Eck NJ. From Louvain to Leiden: guaranteeing well-connected communities. Sci Rep. 2019;9:5233.

Topchy A, Jain AK, Punch W. Combining multiple weak clusterings. In: Third IEEE International Conference on Data Mining. Melbourne, FL: IEEE Comput. Soc; 2003. p. 331–8.

Traag VA. leidenalg Documentation. Release 0102.dev9+gdc8ec1a.d20230927. Section 4.1.2 Bipartite:15–6.

Fern XZ, Brodley CE. Solving cluster ensemble problems by bipartite graph partitioning. In: Twenty-first international conference on Machine learning - ICML ’04. New York: ACM Press; 2004. p. 36.

Kobak D, Berens P. The art of using t-SNE for single-cell transcriptomics. Nat Commun. 2019;10:5416.

McInnes L, Healy J, Saul N, Großberger L. UMAP: uniform manifold approximation and projection. JOSS. 2018;3:861.

van der Maaten L, Hinton G. Visualizing Data using t-SNE. JMLR. 2008;9:2579–605.

Cannoodt R, Saelens W, Deconinck L, Saeys Y. Spearheading future omics analyses using dyngen, a multi-modal simulator of single cells. Nat Commun. 2021;12:3942.

Palacio-Niño J-O, Berzal F. Evaluation Metrics for Unsupervised Learning Algorithms. arXiv. 2019.

Rodríguez EM, Blázquez JL, Pastor FE, Peláez B, Peña P, Peruzzo B, et al. Hypothalamic tanycytes: a key component of brain-endocrine interaction. Int Rev Cytol. 2005;247:89–164.

Langlet F. Targeting Tanycytes: Balance between Efficiency and Specificity. Neuroendocrinology. 2020;110:574–81.

Yoo S, Kim J, Lyu P, Hoang TV, Ma A, Trinh V, et al. Control of neurogenic competence in mammalian hypothalamic tanycytes. Sci Adv. 2021;7:eabg3777.

Chen R, Wu X, Jiang L, Zhang Y. Single-cell RNA-Seq reveals hypothalamic cell diversity. Cell Rep. 2017;18:3227–41.

Deng G, Morselli LL, Wagner VA, Balapattabi K, Sapouckey SA, Knudtson KL, et al. Single-Nucleus RNA sequencing of the hypothalamic arcuate nucleus of C57BL/6J Mice After Prolonged Diet-Induced Obesity. Hypertension. 2020;76:589–97.

Lein ES, Hawrylycz MJ, Ao N, Ayres M, Bensinger A, Bernard A, et al. Genome-wide atlas of gene expression in the adult mouse brain. Nature. 2007;445:168–76.

Xu C, Su Z. Identification of cell types from single-cell transcriptomes using a novel clustering method. Bioinformatics. 2015;31:1974–80.

Boytsov L, Naidan B. Engineering Efficient and Effective Non-metric Space Library. In: Brisaboa N, Pedreira O, Zezula P, editors. Similarity search and applications. Springer, Berlin Heidelberg: Berlin, Heidelberg; 2013. p. 280–93.

Watson ER, Mora A, Taherian Fard A, Mar JC. How does the structure of data impact cell-cell similarity? Evaluating how structural properties influence the performance of proximity metrics in single cell RNA-seq data. Brief Bioinformatics. 2022;23:bbac387.

Reichardt J, Bornholdt S. Statistical mechanics of community detection. Phys Rev E. 2006;74:016110.

Traag VA, Van Dooren P, Nesterov Y. Narrow scope for resolution-limit-free community detection. Phys Rev E. 2011;84:016114.

Calderer G, Kuijjer ML. Community detection in large-scale bipartite biological networks. Front Genet. 2021;12:649440.

Xu Y, Chen L, Li B, Liu W. Density-based modularity for evaluating community structure in bipartite networks. Inf Sci (Ny). 2015;317:278–94.

Wickham H. ggplot2: Elegant Graphics for Data Analysis. 2nd ed. Cham: Springer; 2016.

Hubert L, Arabie P. Comparing partitions. J Classif. 1985;2:193–218.

Vinh N, Epps J, Bailey J. Information Theoretic Measures for Clusterings Comparison: Variants, Properties, Normalization and Correction for Chance. J Mach Learn Res. 2010;11:2837–54.

Pedregosa F, Varoquaux G, Gramfort A, Michel V, Thirion B, Grisel O, et al. Scikit-learn: machine learning in python. J Mach Learn Res. 2011;12:2825–30.

Rodriguez MZ, Comin CH, Casanova D, Bruno OM, Amancio DR, da Costa L F, et al. Clustering algorithms: a comparative approach. PLoS One. 2019;14:e0210236.

Weber LM, Saelens W, Cannoodt R, Soneson C, Hapfelmeier A, Gardner PP, et al. Essential guidelines for computational method benchmarking. Genome Biol. 2019;20:125.

Jones E, Oliphant T, Peterson P, others. SciPy: Open source scientific tools for Python. Version 1.31. Computer software. 2024. https://github.com/scipy/scipy/tree/v1.13.0 .

Charlier F, Weber M, Izak D, Harkin E, Magnus M, Lalli J, et al. Statannotations. Version 0.6. Computer software. 2023. https://github.com/trevismd/statannotations/tree/v0.6.0 .

Chollet F, et al. Keras. Version 3.0. Computer software. 2023. https://keras.io/api/datasets/mnist .

Goggin S. Synthetic gaussian datasets. Zenodo. 2024. https://doi.org/10.5281/zenodo.12746558

Goggin S. DynToy simulated scRNA-seq datasets. Zenodo. 2024. https://doi.org/10.5281/zenodo.12786322

Allen Mouse Brain Atlas. https://mouse.brain-map.org/ . Accessed 2 Mar 2023.

Goggin SM, Zunder ER. ESCHR. Computer software. PyPi. 2024. https://pypi.org/project/eschr/ .

Goggin S. ESCHR v0.2.0. Zenodo. 2024. https://doi.org/10.5281/zenodo.13380410 .

Yanai I, Lercher M. A hypothesis is a liability. Genome Biol. 2020;21:231.

Tyler SR, Bunyavanich S, Schadt EE. PMD Uncovers Widespread Cell-State Erasure by scRNAseq Batch Correction Methods. BioRxiv. 2021.

Acknowledgements

The authors acknowledge Research Computing at The University of Virginia for providing computational resources and technical support that have contributed to the results reported within this publication. URL: https://rc.virginia.edu . The authors would also like to thank Kristen Naegle and Sean Chadwick for critical feedback on the manuscript.

Peer review information

Andrew Cosgrove was the primary editor of this article and managed its editorial process and peer review in collaboration with the rest of the editorial team.

Review history

The review history is available as Additional file 3.

Research reported in this publication was supported by the National Institute of Neurological Disorders and Stroke of the National Institutes of Health under award number R01NS111220 to E.R.Z. The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health. S.M.G. was further supported by the Transdisciplinary Big Data Science Training Grant, award number T32LM012416.

Author information

Authors and affiliations.

Neuroscience Graduate Program, School of Medicine, University of Virginia, Charlottesville, VA, 22902, USA

Sarah M. Goggin & Eli R. Zunder

Department of Biomedical Engineering, School of Engineering, University of Virginia, Charlottesville, VA, 22902, USA

Eli R. Zunder

Contributions

S.M.G. and E.R.Z. conceptualized the ESCHR clustering approach. S.M.G. developed the ESCHR method and Python package. S.M.G. and E.R.Z. planned all analysis and benchmarking experiments. S.M.G. performed all analysis and benchmarking experiments, and prepared figures. S.M.G. and E.R.Z. wrote the manuscript.

Authors’ X handles

X handles: @Sarah_M_Goggin (Sarah M. Goggin), @zunderlab (Eli R. Zunder).

Corresponding author

Correspondence to Eli R. Zunder .

Ethics declarations

Ethics approval and consent to participate.

Not applicable.

Consent for publication

Competing interests.

The authors declare that they have no competing interests.

Additional information

Publisher’s note.

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary Information

Additional file 1. All supplementary notes and figures, tables S5-S6 [ 91 , 92 ].

Additional file 2. Tables S1-S4.

Additional file 3. Review history.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/ .

About this article

Cite this article.

Goggin, S.M., Zunder, E.R. ESCHR: a hyperparameter-randomized ensemble approach for robust clustering across diverse datasets. Genome Biol 25 , 242 (2024). https://doi.org/10.1186/s13059-024-03386-5

Received : 01 February 2024

Accepted : 30 August 2024

Published : 16 September 2024

DOI : https://doi.org/10.1186/s13059-024-03386-5


Keywords

  • Single-cell RNA-seq
  • Mass cytometry
  • Consensus clustering
  • Soft clustering



An Enhanced K-Means Clustering Algorithm for Phishing Attack Detections

1. Introduction

  • A novel approach combining K-Means Clustering with the CfsSubsetEval attribute evaluator for improved phishing detection.
  • Comprehensive comparative analysis of the proposed method against kernel K-Means, demonstrating improvements in accuracy across various sample sizes.
  • Insights into the scalability and adaptability of unsupervised learning techniques for phishing detection, particularly in handling large datasets.
  • Exploration of the potential for real-time phishing detection through the proposed method, opening avenues for practical applications such as browser extensions.
  • Critical analysis of the strengths and limitations of the proposed approach, providing a balanced view of its potential impact on cybersecurity practices.

2. Background

3. Proposed Approach for Enhanced Phishing Website Classification

Proposed Algorithm for Phishing Website Classification

Dataset

Feature Importance Analysis

Performance Comparison and Insights

5. Limitations of the Proposed Approach

6. Conclusions

Author Contributions

Data Availability Statement

Acknowledgments

Conflicts of Interest

  • Athulya, A.A.; Praveen, K. Towards the Detection of Phishing Attacks. In Proceedings of the 2020 4th International Conference on Trends in Electronics and Informatics (ICOEI) (48184), Tirunelveli, India, 15–17 June 2020; pp. 337–343. [ Google Scholar ] [ CrossRef ]
  • Sadiq, A.; Anwar, M.; Butt, R.A.; Masud, F.; Shahzad, M.K.; Naseem, S.; Younas, M. A review of phishing attacks and countermeasures for internet of things-based smart business applications in industry 4.0. Hum. Behav. Emerg. Technol. 2021 , 3 , 854–864. [ Google Scholar ] [ CrossRef ]
  • Aleroud, A.; Zhou, L. Phishing environments, techniques, and countermeasures: A survey. Comput. Secur. 2017 , 68 , 160–196. [ Google Scholar ] [ CrossRef ]
  • Dolan, K. Quarters. In Safe Places: Stories ; University of Massachusetts Press: Amherst, MA, USA, 2022; pp. 80–88. [ Google Scholar ] [ CrossRef ]
  • Alkhalil, Z.; Hewage, C.; Nawaf, L.; Khan, I. Phishing attacks: A recent comprehensive study and a new anatomy. Front. Comput. Sci. 2021 , 3 , 563060. [ Google Scholar ] [ CrossRef ]
  • Patel, J. Phishing URL detection using artificial neural network. Int. J. Res. Eng. Sci. Manag. 2022 , 5 , 47–51. [ Google Scholar ]
  • Frauenstein, E.D.; Flowerday, S. Susceptibility to phishing on social network sites: A personality information processing model. Comput. Secur. 2020 , 94 , 101862. [ Google Scholar ] [ CrossRef ]
  • Jain, A.K.; Gupta, B.B. A survey of phishing attack techniques, defence mechanisms and open research challenges. Enterp. Inf. Syst. 2021 , 16 , 527–565. [ Google Scholar ] [ CrossRef ]
  • Alghenaim, M.F.; Bakar, N.A.A.; Rahim, F.A.; Vanduhe, V.Z.; Alkawsi, G. Phishing Attack Types and Mitigation: A Survey. In Proceedings of the International Conference on Data Science and Emerging Technologies, Singapore, 20–21 December 2022; pp. 131–153. [ Google Scholar ] [ CrossRef ]
  • Aljabri, M.; Mirza, S. Phishing Attacks Detection using Machine Learning and Deep Learning Models. In Proceedings of the 2022 7th International Conference on Data Science and Machine Learning Applications (CDMA), Riyadh, Saudi Arabia, 1–3 March 2022; pp. 175–180. [ Google Scholar ] [ CrossRef ]
  • Abedin, N.F.; Bawm, R.; Sarwar, T.; Saifuddin, M.; Rahman, M.A.; Hossain, S. Phishing Attack Detection using Machine Learning Classification Techniques. In Proceedings of the 2020 3rd International Conference on Intelligent Sustainable Systems (ICISS), Thoothukudi, India, 3–5 December 2020; pp. 1125–1130. [ Google Scholar ] [ CrossRef ]
  • Basit, A.; Zafar, M.; Liu, X.; Javed, A.R.; Jalil, Z.; Kifayat, K. A comprehensive survey of AI-enabled phishing attacks detection techniques. Telecommun. Syst. 2021 , 76 , 139–154. [ Google Scholar ] [ CrossRef ]
  • Lain, D.; Kostiainen, K.; Capkun, S. Phishing in Organizations: Findings from a Large-Scale and Long-Term Study. In Proceedings of the 2022 IEEE Symposium on Security and Privacy (SP), San Francisco, CA, USA, 22–26 May 2022; pp. 842–859. [ Google Scholar ] [ CrossRef ]
  • Alam, M.N.; Sarma, D.; Lima, F.F.; Saha, I.; Ulfath, R.E.; Hossain, S. Phishing attacks detection using machine learning approach. In Proceedings of the 3rd International Conference on Smart Systems and Inventive Technology (ICSSIT), Tirunelveli, India, 20–22 August 2020; pp. 1173–1179. [ Google Scholar ] [ CrossRef ]
  • Kamalam, G.K.; Suresh, P.; Nivash, R.; Ramya, A.; Raviprasath, G. Detection of Phishing Websites Using Machine Learning. In Proceedings of the 2022 International Conference on Computing, Communication and Informatics (ICCCI), Coimbatore, India, 25–27 January 2022; pp. 1–4. [ Google Scholar ] [ CrossRef ]
  • Basit, A.; Zafar, M.; Javed, A.R.; Jalil, Z. A Novel Ensemble Machine Learning Method to Detect Phishing Attack. In Proceedings of the 2020 23rd IEEE International Multi-Topic Conference (INMIC), Bahawalpur, Pakistan, 25–27 January 2020; pp. 1–5. [ Google Scholar ] [ CrossRef ]
  • Mao, J.; Bian, J.; Tian, W.; Zhu, S.; Wei, T.; Li, A.; Liang, Z. Phishing page detection via learning classifiers from page layout feature. EURASIP J. Wirel. Commun. Netw. 2019 , 43 , 43. [ Google Scholar ] [ CrossRef ]
  • Shahrivari, V.; Darabi, M.M.; Izadi, M. Phishing detection using machine learning techniques. arXiv 2020 , arXiv:2009.11116. [ Google Scholar ] [ CrossRef ]
  • Adebowale, M.A.; Lwin, K.T.; Hossain, M.A. Intelligent phishing detection scheme using deep learning algorithms. J. Enterp. Inf. Manag. 2023 , 36 , 747–766. [ Google Scholar ] [ CrossRef ]
  • Sahu, K.; Shrivastava, S.K. Kernel K-Means Clustering for Phishing Website and Malware Categorization. Int. J. Comput. Appl. 2015 , 111 , 20–25. [ Google Scholar ] [ CrossRef ]
  • Hossain, S.; Sarma, D.; Chakma, R.J. Machine learning-based phishing attack detection. Int. J. Adv. Comput. Sci. Appl. 2020 , 11 , 378–388. [ Google Scholar ] [ CrossRef ]
  • Muhammad, B.A.; Iqbal, R.; James, A.; Nkantah, D. Comparative Performance of Machine Learning Methods for Text Classification. In Proceedings of the 2020 International Conference on Computing and Information Technology (ICCIT-1441), Tabuk, Saudi Arabia, 9–10 September 2020; pp. 1–5. [ Google Scholar ] [ CrossRef ]
  • Barlow, L.; Bendiab, G.; Shiaeles, S.; Savage, N. A Novel Approach to Detect Phishing Attacks using Binary Visualisation and Machine Learning. In Proceedings of the 2020 IEEE World Congress on Services (SERVICES), Beijing, China, 18–23 October 2020; pp. 177–182. [ Google Scholar ] [ CrossRef ]
  • Salahdine, F.; Mrabet, Z.E.; Kaabouch, N. Phishing Attacks Detection—A Machine Learning-Based Approach. In Proceedings of the 2021 IEEE 12th Annual Ubiquitous Computing, Electronics & Mobile Communication Conference (UEMCON), New York, NY, USA, 1–4 December 2020; pp. 250–255. [ Google Scholar ] [ CrossRef ]
  • Gupta, B.B.; Jain, A.K. Phishing attack detection using a search engine and heuristics-based technique. J. Inf. Technol. Res. 2020 , 13 , 94–109. [ Google Scholar ] [ CrossRef ]
  • Salihu, S.A.; Oladipo, I.D.; Wojuade, A.A.; Abdulraheem, M.; Babatunde, A.O.; Ajiboye, A.R.; Balogun, G.B. Detection of Phishing URLs Using Heuristics-Based Approach. In Proceedings of the 2022 5th Information Technology for Education and Development (ITED), Abuja, Nigeria, 1–3 November 2022; pp. 1–7. [ Google Scholar ] [ CrossRef ]
  • Baki, S.; Verma, R.M. Sixteen Years of Phishing User Studies: What Have We Learned? IEEE Trans. Dependable Secur. Comput. 2023 , 20 , 1200–1212. [ Google Scholar ] [ CrossRef ]
  • Salloum, S.; Gaber, T.; Vadera, S.; Shaalan, K. Phishing Email Detection Using Natural Language Processing Techniques: A Literature Survey. Procedia Comput. Sci. 2021 , 189 , 19–28. [ Google Scholar ] [ CrossRef ]
  • Salloum, S.; Gaber, T.; Vadera, S.; Shaalan, K. A Systematic Literature Review on Phishing Email Detection Using Natural Language Processing Techniques. IEEE Access 2022 , 10 , 65703–65727. [ Google Scholar ] [ CrossRef ]
  • Gualberto, E.S.; De Sousa, R.T.; De Brito Vieira, T.P.; Da Costa, J.P.C.L.; Duque, C.G. The Answer is in the Text: Multi-Stage Methods for Phishing Detection Based on Feature Engineering. IEEE Access 2020 , 8 , 223529–223547. [ Google Scholar ] [ CrossRef ]
  • Goud, N.S.; Mathur, A. Feature Engineering Framework to detect Phishing Websites using URL Analysis. Int. J. Adv. Comput. Sci. Appl. 2021 , 12 , 295–303. [ Google Scholar ] [ CrossRef ]
  • Mourtaji, Y.; Bouhorma, M.; Alghazzawi, D.; Aldabbagh, G.; Alghamdi, A. Hybrid Rule-Based Solution for Phishing URL Detection Using Convolutional Neural Network. Wirel. Commun. Mob. Comput. 2021 , 2021 , 8241104. [ Google Scholar ] [ CrossRef ]
  • Abidoye, A.P.; Kabaso, B. Hybrid machine learning: A tool to detect phishing attacks in communication networks. ECTI Trans. Comput. Inf. Technol. ECTI-CIT 2021 , 15 , 374–389. [ Google Scholar ] [ CrossRef ]
  • Asiri, S.; Xiao, Y.; Alzahrani, S.; Li, T. PhishingRTDS: A real-time detection system for phishing attacks using a Deep Learning model. Comput. Secur. 2024 , 141 , 103843. [ Google Scholar ] [ CrossRef ]
  • Linh, D.M.; Hung, H.; Chau, H.M.; Vu, Q.S.; Tran, T.N. Real-time phishing detection using deep learning methods by extensions. Int. J. Electr. Comput. Eng. 2024 , 14 , 3021–3035. [ Google Scholar ] [ CrossRef ]
  • Abdelali Elkouay, N.M.; Madani, A. Graph-based phishing detection: URLGBM model driven by machine learning. Int. J. Comput. Appl. 2024 , 46 , 481–495. [ Google Scholar ] [ CrossRef ]
  • Balaji, S.; Sathishkumar, R.; Sharmila, G.; Arikaran, N.; Nivas, K.; Dhamotharan, S. Machine Learning Based Improved Phishing Detection using Adversarial Auto Encoder. In Proceedings of the 2023 International Conference on System, Computation, Automation and Networking (ICSCAN), Puducherry, India, 17–18 November 2023; pp. 1–4. [ Google Scholar ] [ CrossRef ]
  • Shirazi, H.; Muramudalige, S.R.; Ray, I.; Jayasumana, A.P.; Wang, H. Adversarial Autoencoder Data Synthesis for Enhancing Machine Learning-Based Phishing Detection Algorithms. IEEE Trans. Serv. Comput. 2023 , 16 , 2411–2422. [ Google Scholar ] [ CrossRef ]
  • Berens, B.; Dimitrova, K.; Mossano, M.; Volkamer, M. Phishing awareness and education–When to best remind. In Proceedings of the Workshop on Usable Security and Privacy (USEC), San Diego, CA, USA, 28 April 2022. [ Google Scholar ] [ CrossRef ]
  • Sarker, O.; Jayatilaka, A.; Haggag, S.; Liu, C.; Babar, M.A. A Multi-vocal Literature Review on challenges and critical success factors of phishing education, training and awareness. J. Syst. Softw. 2024 , 208 , 111899. [ Google Scholar ] [ CrossRef ]
  • Zhang, X.; Miao, X.; Xue, M. A Reputation-Based Approach Using Consortium Blockchain for Cyber Threat Intelligence Sharing. Secur. Commun. Netw. 2022 , 2022 , 7760509. [ Google Scholar ] [ CrossRef ]
  • Fuxen, P.; Hachani, M.; Hackenberg, R.; Ross, M. MANTRA: Towards a Conceptual Framework for Elevating Cybersecurity Applications Through Privacy-Preserving Cyber Threat Intelligence Sharing. Cloud Comput. 2024 , 2024 , 43. [ Google Scholar ]
  • Hannousse, A.; Yahiouche, S. Towards benchmark datasets for machine learning based website phishing detection: An experimental study. Eng. Appl. Artif. Intell. 2021 , 104 , 104347. [ Google Scholar ] [ CrossRef ]

Type of Phishing | Method | Target | Typical Characteristics | Typical Solutions | Remaining Challenges
Website Phishing | Creation of fake websites mimicking legitimate ones | General internet users | Fake websites often mimic legitimate sites closely, can be linked from emails or search engines. Examples: fake banking websites, counterfeit e-commerce sites. | Secure browsing habits, browser security features, website verification tools | Difficult for users to identify fake sites, continuous creation of new phishing sites, sophisticated design and domain spoofing.
Mobile Website Phishing | Fake websites designed for mobile devices | Mobile internet users | Optimised for mobile viewing, often linked from SMS or mobile apps. Examples: fake mobile banking sites, fraudulent mobile service portals. | Mobile security software, cautious browsing on mobile devices, awareness campaigns | Smaller screens make it harder to identify phishing cues, increased mobile browsing, sophisticated mobile site designs.
Social Media Phishing | Fake social media pages or malicious links shared via social platforms | Social media users | Fake profiles, fraudulent links, impersonation of trusted contacts. Examples: phishing links shared in posts or messages, fake customer service accounts. | Social media platform security measures, user education, link verification tools | High volume of social media activity, sophisticated fake profiles, rapid spread of malicious links.

Type of Phishing | Method | Target | Typical Characteristics | Typical Solutions | Remaining Challenges
Email Phishing | Mass distribution of fraudulent emails | General public | Spoofed sender address, urgent language, generic greetings. Examples: fake bank alerts, lottery scams. | Spam filters, email authentication protocols (DMARC, DKIM, SPF) | Sophisticated spoofing techniques can bypass filters, human error, and lack of awareness.
Spear Phishing | Targeted emails to specific individuals or organisations | Specific individuals or organisations | Personalised content, research on targets, often mimics trusted contacts. Examples: fake job offers, custom emails to company employees. | Employee training, multi-factor authentication (MFA), advanced email filtering | High customisation makes detection difficult, and social engineering exploits human trust.
Whaling | Highly targeted attacks on high-profile individuals | High-profile individuals (e.g., C-suite executives) | Sophisticated content, often involves significant research. Examples: CEO fraud, business email compromise. | Executive training, secure email gateways, incident response plans | High-value targets attract persistent attackers, sophisticated social engineering.
Smishing | SMS messages with malicious links or requests | Mobile users | Short, urgent messages with malicious links. Examples: fake delivery notifications, fraudulent alerts from service providers. | Mobile security apps, awareness campaigns, SMS filtering | Limited SMS filtering capabilities, increased mobile usage, human susceptibility to urgency.
Vishing | Voice calls impersonating authority figures or entities | Phone users | Impersonation of authority figures, creation of urgency. Examples: fake IRS calls, tech support scams. | Caller ID verification, awareness training, call-blocking apps | Spoofed caller IDs, convincing social engineering tactics, real-time interaction pressure.
Clone Phishing | Replication of legitimate emails with malicious content | Previous email recipients | Nearly identical to legitimate emails but with malicious attachments/links. Examples: duplicated event invitations, fake service updates. | Digital signatures, employee training, secure email practices | Difficulty in distinguishing cloned emails, reliance on recipient vigilance, advanced spoofing techniques.
Search Engine Phishing | Manipulation of search engine results to appear legitimate | Internet searchers | Appears in search results, often mimics popular websites. Examples: fake tech support websites, phishing sites mimicking popular services. | Search engine algorithms, user education, browser warnings | Constantly evolving phishing sites, reliance on user caution, search engine algorithm limitations.
Study | Techniques Used | Dataset | Key Findings
Alam et al. [ ] | Decision Trees (DT) and Random Forest (RF) | Kaggle phishing dataset | RF achieved high accuracy (96.9%) despite missing data, effectively addressing overfitting issues.
Kamalam et al. [ ] | Decision Trees (DT) and Random Forest (RF) | Phishing URLs Dataset | Developed a Chrome extension for phishing detection, achieving 97.31% accuracy with RF.
Basit et al. [ ] | Artificial Neural Networks (ANN), K-Nearest Neighbours (KNN), Decision Trees (DT), and Random Forest Classifier (RFC) | UCI ML repository | KNN combined with RFC provided the lowest FP rate (0.038) and highest TP rate (0.983).
Mao et al. [ ] | Support Vector Machine (SVM) | Phishing website dataset | SVM achieved an accuracy rate of approximately 96%.
Shahrivari et al. [ ] | XGBoost and other classifiers (LR, DT, SVM, Ada Boost, RF, Neural Networks, KNN, Gradient Boosting) | Phishing website dataset | XGBoost demonstrated exceptional performance, achieving an accuracy of approximately 98%.
Adebowale et al. [ ] | Various deep learning techniques | Multiple phishing datasets | Achieved accuracy ranging from 91% to 94.8% using deep learning methods.
Sahu et al. [ ] | K-Means Clustering | Large phishing and malware dataset | Improved error rate in classifying phishing and malware websites from 10% to 20%.
Hossain et al. [ ] | Random Forest (RF), Bagging, Stochastic Gradient Descent (SGD), Logistic Regression (LR) | Website attribute dataset | RF achieved the best performance with an F1 score of 0.99.
Abedin et al. [ ] | K-Nearest Neighbours (KNN), Logistic Regression (LR), Random Forest (RF) | Phishing URLs Dataset | RF demonstrated the highest precision rate (97%) and AUC score of 1.0.
Muhammad et al. [ ] | Decision Trees (DT), K-Nearest Neighbours (KNN), Naive Bayes (NB), Random Forest (RF), Support Vector Machine (SVM) | SMS and email phishing datasets | RF showed exceptional average accuracy but was time-consuming in classification.
Barlow et al. [ ] | Neural Networks with Binary Visualisation Techniques | Phishing website dataset | Effective in rapidly and accurately detecting phishing attackers, improving through learning from incorrect classifications.
Salahdine et al. [ ] | Artificial Neural Networks (ANN), Logistic Regression (LR), Support Vector Machine (SVM) | Phishing website dataset | ANN achieved the highest accuracy using two hidden layers and the Relu activation function.
Approach | Description | Pros | Cons
Heuristic-Based Detection | Use of predefined rules and heuristics to identify phishing characteristics, such as URL analysis and content inspection [ , ]. | Simple and fast implementation, low computational cost | Limited detection accuracy, prone to false positives and false negatives
Behavioural Analysis | Monitoring and analysing user behaviour patterns to detect anomalies indicative of phishing attacks [ ]. | Can detect new and unknown phishing techniques, adaptive to user behaviour | High computational cost, privacy concerns, and potential for false positives
Natural Language Processing (NLP) | Applying NLP techniques to analyse textual content in emails and websites for phishing indicators [ , ]. | Effective in analysing textual content, can detect sophisticated phishing attempts | Requires large datasets for training, language-specific limitations
Feature Engineering | Designing and extracting specific features from data to improve the accuracy of phishing detection models [ , ]. | Can significantly enhance model performance, adaptable to different datasets | Time-consuming and requires domain expertise, may not generalise well
Hybrid Approaches | Combining multiple techniques, such as machine learning and rule-based methods, to enhance detection performance [ , ]. | Higher detection accuracy, robust against various attack types | Increased complexity, higher computational cost
Real-Time Detection Systems | Implementing systems capable of detecting phishing attempts in real-time, often integrated into browsers or email clients [ , ]. | Immediate protection, high user convenience | High computational and resource requirements, potential latency issues
Graph-Based Approaches | Using graph theory to model relationships between entities (e.g., email senders, domains) and detect phishing patterns [ ]. | Effective in identifying complex relationships, scalable to large datasets | Complex implementation, requires significant computational resources
Adversarial Training | Training models to recognise and resist adversarial attacks designed to evade detection mechanisms [ , ]. | Improved model robustness, can handle sophisticated attacks | Requires continuous updates, high computational cost
User Education and Awareness | Developing training programs and awareness campaigns to educate users about phishing threats and prevention strategies [ , ]. | Enhances user ability to recognise phishing attempts, cost-effective | Relies on user compliance, not a technical solution
Threat Intelligence Sharing | Collaborating and sharing threat intelligence data among organisations to improve detection capabilities [ , ]. | Collective defence, access to broader threat data | Data privacy concerns, potential for information overload
Number of Samples | Error Rate (%), Kernel K-Means | Error Rate (%), Proposed K-Means
2000 | 48.5 | 10.8
7000 | 47.07 | 13.38
10,000 | 43.02 | 13.75

Number of Samples | Accuracy (%), Kernel K-Means | Accuracy (%), Proposed K-Means
2000 | 51.5 | 89.2
7000 | 52.93 | 86.62
10,000 | 56.98 | 86.25

Number of Samples | Accuracy (%) | Precision (%) | Recall (%) | F1-Score (%)
2000 | 89.2 | 88.5 | 90.1 | 89.3
7000 | 86.62 | 85.9 | 87.6 | 86.7
10,000 | 86.25 | 85.7 | 87.0 | 86.3
Share and Cite

Al-Sabbagh, A.; Hamze, K.; Khan, S.; Elkhodr, M. An Enhanced K-Means Clustering Algorithm for Phishing Attack Detections. Electronics 2024 , 13 , 3677. https://doi.org/10.3390/electronics13183677




Modern hierarchical, agglomerative clustering algorithms

Abstract: This paper presents algorithms for hierarchical, agglomerative clustering which perform most efficiently in the general-purpose setup that is given in modern standard software. Requirements are: (1) the input data is given by pairwise dissimilarities between data points, but extensions to vector data are also discussed (2) the output is a "stepwise dendrogram", a data structure which is shared by all implementations in current standard software. We present algorithms (old and new) which perform clustering in this setting efficiently, both in an asymptotic worst-case analysis and from a practical point of view. The main contributions of this paper are: (1) We present a new algorithm which is suitable for any distance update scheme and performs significantly better than the existing algorithms. (2) We prove the correctness of two algorithms by Rohlf and Murtagh, which is necessary in each case for different reasons. (3) We give well-founded recommendations for the best current algorithms for the various agglomerative clustering schemes.
Comments: 29 pages
Subjects: Machine Learning (stat.ML); Data Structures and Algorithms (cs.DS)
MSC classes: 62H30
ACM classes: I.5.3


A hybrid metaheuristic optimised ensemble classifier with self organizing map clustering for credit scoring

  • Original Paper
  • Published: 18 September 2024
  • Volume 24, article number 57 (2024)

  • Indu Singh   ORCID: orcid.org/0000-0002-3635-173X 1 ,
  • D. P. Kothari 2 ,
  • S. Aditya 1 ,
  • Mihir Rajora 1 ,
  • Charu Agarwal 3 &
  • Vibhor Gautam 4  

Credit scoring is a mathematical and statistical tool that aids financial institutions in deciding suitable candidates for the issuance of loans, based on the analysis of the borrower’s financial history. Distinct groups of borrowers have unique characteristics that must be identified and trained on to increase the accuracy of classification models for all credit borrowers that financial institutions serve. Numerous studies have shown that models based on diverse base-classifier models outperform other statistical and AI-based techniques for related classification problems. This paper proposes a novel multi-layer clustering and soft-voting-based ensemble classification model, aptly named Self Organizing Map Clustering with Metaheuristic Voting Ensembles (SCMVE) which uses a self-organizing map for clustering the data into distinct clusters with their unique characteristics and then trains a sailfish optimizer powered ensemble of SVM-KNN base classifiers for classification of each distinct identified cluster. We train and evaluate our model on the standard public credit scoring datasets—namely the German, Australian and Taiwan datasets and use multiple evaluation scores such as precision, F1 score, recall to compare the results of our model with other prominent works in the field. On evaluation, SCMVE shows outstanding results (95% accuracy on standard datasets) when compared with popular works in the field of credit scoring.



Data Availability

The datasets generated and/or analysed during the current study are available in https://archive.ics.uci.edu/dataset/143/statlog+australian+credit+approval , https://archive.ics.uci.edu/dataset/144/statlog+german+credit+data and https://archive.ics.uci.edu/dataset/350/default+of+credit+card+clients .


Funding

The authors declare that no funds, grants, or other support were received during the preparation of this manuscript. This research did not receive any specific grant from funding agencies in the public, commercial, or not-for-profit sectors.

Author information

Authors and Affiliations

Department of Computer Science and Engineering, Delhi Technological University, Delhi, 110042, India

Indu Singh, S. Aditya & Mihir Rajora

Department of Electrical Engineering, Visvesvaraya National Institute of Technology, Nagpur, Maharashtra, 440010, India

D. P. Kothari

Department of Electronics and Communication Engineering, National Institute of Technology, Hamirpur, Himachal Pradesh, 177005, India

Charu Agarwal

Department of Electronics and Communication Engineering, Delhi Technological University, Delhi, 110042, India

Vibhor Gautam


Corresponding author

Correspondence to Indu Singh .

Ethics declarations

Conflict of interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Ethical approval

This article does not contain any studies with human participants performed by any of the authors.

Informed consent

This article does not contain any studies with human participants or animals performed by any of the authors.

Additional information

Publisher's note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article

Singh, I., Kothari, D.P., Aditya, S. et al. A hybrid metaheuristic optimised ensemble classifier with self organizing map clustering for credit scoring. Oper Res Int J 24, 57 (2024). https://doi.org/10.1007/s12351-024-00864-3


Received: 23 June 2023

Revised: 18 August 2024

Accepted: 23 August 2024

Published: 18 September 2024

DOI: https://doi.org/10.1007/s12351-024-00864-3


Keywords

  • Machine learning
  • Credit scoring
  • Self organizing maps
  • Ensemble classification
  • Sailfish optimizer


Optimized Application of the Decision Tree ID3 Algorithm Based on Big Data in Sports Performance Management


Index terms: Applied computing; Operations research; Decision analysis; Information systems; Data management systems; Information storage systems; Record storage systems; Directory structures; Information systems applications; Data mining; Decision support systems; Data analytics; Mathematics of computing; Discrete mathematics; Graph theory

Recommendations

An Improved Decision Tree Classification Algorithm Based on ID3 and the Application in Score Analysis

The decision tree is an important classification method in data mining. Aiming at the deficiencies of the ID3 algorithm, a new improved classification algorithm is proposed in this paper. The new algorithm combines the principle of the Taylor formula with ... (a generic sketch of the baseline ID3 information-gain criterion follows this recommendation list).

Improved Decision Tree Method for Imbalanced Data Sets in Digital Forensics

An improved decision tree ID3 algorithm suited to digital forensics is presented in the study. Forensic data are imbalanced, inconsistent, noisy and dispersive. Based on these characteristics, we improve the ID3 algorithm by adopting a correction factor and two ...

Sports Big Data: Management, Analysis, Applications, and Challenges

With the rapid growth of information technology and sports, analyzing sports information has become an increasingly challenging issue. Sports big data come from the Internet and show a rapid growth trend. Sports big data contain rich information such as ...
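
For context on the baseline these recommendations build on: ID3 grows a decision tree by repeatedly choosing the attribute with the highest information gain. The sketch below shows that generic criterion only; it is not the improved variants described above, and the helper names are invented for illustration.

```python
# Entropy and information gain, the attribute-selection criterion of classic ID3.
# A generic illustration only, not the improved algorithms from the papers above.
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def information_gain(rows, labels, attribute_index):
    """Entropy reduction obtained by splitting `rows` on one attribute."""
    by_value = {}
    for row, label in zip(rows, labels):
        by_value.setdefault(row[attribute_index], []).append(label)
    remainder = sum(len(subset) / len(labels) * entropy(subset)
                    for subset in by_value.values())
    return entropy(labels) - remainder
```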


Author tags

  • Sports Performance Management
  • Big Data Function
  • Decision Tree
  • User Interest


IMAGES

  1. A Research on Different Clustering Algorithms and Techniques
  2. Classification of clustering algorithms
  3. Compares the clustering algorithms surveyed in this paper including
  4. Top 50 Papers in Clustering Algorithms for Machine Learning
  5. (PDF) Top Papers on Clustering Algorithms
  6. Top 10 Research Papers in Clustering Algorithms

VIDEO

  1. Introduction to Clustering Algorithms

  2. Types of Methods of Clustering Algorithms

  3. Cluster Analysis

  4. The DBSCAN Clustering Algorithm Explained

  5. Multi-feature trajectory clustering using mean shift

  6. Bibliometric Analysis Using CitNetExplorer and VOSviewer

COMMENTS

  1. A comprehensive survey of clustering algorithms: State-of-the-art

    Defined possible future research trends and directions regarding the implementation and application of clustering algorithms in different research domains. 2. ... Evolutionary algorithms: This paper provides an up-to-date overview of evolutionary algorithms for clustering, including advanced topics such as multiobjective and ensemble-based ...

  2. [2401.07389] A Rapid Review of Clustering Algorithms

    A Rapid Review of Clustering Algorithms. Clustering algorithms aim to organize data into groups or clusters based on the inherent patterns and similarities within the data. They play an important role in today's life, such as in marketing and e-commerce, healthcare, data organization and analysis, and social media.

  3. Clustering algorithms: A comparative approach

    Two approaches were considered: clustering algorithms focused in minimizing a distance based objective function and a Gaussian models-based approach. The following algorithms were compared: k-means, random swap, expectation-maximization, hierarchical clustering, self-organized maps (SOM) and fuzzy c-means.

  4. K-means clustering algorithms: A comprehensive review, variants

    Despite these limitations, the K-means clustering algorithm is credited with flexibility, efficiency, and ease of implementation. It is also among the top ten clustering algorithms in data mining [59], [217], [105], [94].The simplicity and low computational complexity have given the K-means clustering algorithm a wide acceptance in many domains for solving clustering problems.

  5. A Rapid Review of Clustering Algorithms

    comprehensive and highly cited research on clustering algorithms since 2000. Published in 2005, 2015, and 2022, these works have made significant contributions to the ... we removed duplicate papers not directly related to clustering algorithms to ensure that the remaining content comprises algorithm introductions, articles on algorithm ...

  6. A Comprehensive Survey of Clustering Algorithms

    4.1 Clustering Algorithm Based on Partition. The basic idea of this kind of clustering algorithm is to regard the center of the data points as the center of the corresponding cluster. K-means and K-medoids are the two most famous algorithms of this kind. The core idea of K-means is to update the center of a cluster, represented by the center of its data points (a minimal sketch of this assign/update loop appears after this list), by ...

  7. (PDF) A Review of Clustering Algorithms

    The review concludes with a reflection on the challenges faced by existing clustering algorithms and proposes avenues for future research. This paper aims to provide a valuable resource for ...

  8. A detailed study of clustering algorithms

    Various clustering algorithms have been developed under different paradigms for grouping scattered data points and forming efficient cluster shapes with minimal outliers. This paper attempts to address the problem of creating evenly shaped clusters in detail and aims to study, review and analyze few clustering algorithms falling under different ...

  9. K-means clustering algorithms: A comprehensive review, variants

    In this paper, the following focal research question was proposed to reflect the purpose of this comprehensive review work: ... This study aims at conducting a review of the K-means clustering algorithm variants. The research methodology adopted for the study is presented in this section. Kitchenham et al.'s [113] guidelines for a systematic ...

  10. A Systematic Literature Review on Identifying Patterns Using ...

    It investigates the evolution measures for clustering algorithms with an emphasis on metrics used to gauge clustering quality, such as the F-measure and the Rand Index. ... A key point in this SLR is unsupervised clustering algorithms. As a result, the research papers should include only the role of data mining in unsupervised clustering ...

  11. Data Clustering: Algorithms and Its Applications

    Data is useless if information or knowledge that can be used for further reasoning cannot be inferred from it. Cluster analysis, based on some criteria, shares data into important, practical or both categories (clusters) based on shared common characteristics. In research, clustering and classification have been used to analyze data, in the field of machine learning, bioinformatics, statistics ...

  12. Comprehensive survey on hierarchical clustering algorithms and the

    The algorithm is a two-stage parallel clustering method that integrates a subspace clustering algorithm with a conventional agglomerative hierarchical clustering algorithm. In the paper by Cheung and Zhang, the authors proposed the growing multilayer topology training (GMTT) algorithm, which trains a collection of seed points that are linked in ...

  13. (PDF) Hierarchical Clustering: A Survey

    This paper focuses on document clustering algorithms that build such hierarchical solutions and (i) presents a comprehensive study of partitional and agglomerative algorithms that use different ...

  14. Quantum-inspired clustering with light

    The algorithm at the core of this experiment is an unsupervised quantum clustering algorithm, designed for the classification of data points within a given dataset where no prior information about ...

  15. A study of hierarchical clustering algorithms

    This paper focuses on hierarchical agglomerative clustering. In this paper, we also explain some agglomerative algorithms and their comparison. Date of Conference: 11-13 March 2015. Date Added to IEEE Xplore: 04 May 2015. Electronic ISBN: 978-9-3805-4416-8. Print ISBN: 978-9-3805-4415-1.

  16. Clustering Algorithms Research

    The research actuality and new progress in clustering algorithms in recent years are summarized in this paper, giving a valuable reference for data clustering and data mining. First, the analysis and induction of some representative clustering algorithms have been made from several ...

  17. [2210.04142] Deep Clustering: A Comprehensive Survey

    Deep Clustering: A Comprehensive Survey. Cluster analysis plays an indispensable role in machine learning and data mining. Learning a good data representation is crucial for clustering algorithms. Recently, deep clustering, which can learn clustering-friendly representations using deep neural networks, has been broadly applied in a wide range ...

  18. ESCHR: a hyperparameter-randomized ensemble approach for robust

    Clustering is widely used for single-cell analysis, but current methods are limited in accuracy, robustness, ease of use, and interpretability. To address these limitations, we developed an ensemble clustering method that outperforms other methods at hard clustering without the need for hyperparameter tuning. It also performs soft clustering to characterize continuum-like regions and quantify ...

  19. A review of clustering techniques and developments

    A very popular neural algorithm for clustering is the self-organizing map (SOM) [104], [105]. SOM is commonly used for vector quantization, feature extraction and data visualization along with clustering analysis. This algorithm constructs a single-layered network as shown in Fig. 9. The learning process takes place in a "winner-takes-all ...

  20. Research and Application of Clustering Algorithm for Text Big Data

    1. Introduction. Text detection is the fundamental step in many computer vision applications. This paper introduces a novel text detection technique, and we verify the utility of the method in the YouTube video text dataset and find that the method runs more than 2 times [] of the classical method.This paper expounds the existing research work on the decomposition and inference of clustering ...

  21. An Enhanced K-Means Clustering Algorithm for Phishing Attack ...

    This paper presents a comprehensive study on the application of Machine Learning (ML) techniques for identifying phishing websites, with a focus on enhancing detection accuracy and efficiency. We propose an approach that integrates the CfsSubsetEval attribute evaluator with the K-Means Clustering algorithm to improve phishing detection ...

  22. Title: Modern hierarchical, agglomerative clustering algorithms

    This paper presents algorithms for hierarchical, agglomerative clustering which perform most efficiently in the general-purpose setup that is given in modern standard software. Requirements are: (1) the input data is given by pairwise dissimilarities between data points, but extensions to vector data are also discussed (2) the output is a "stepwise dendrogram", a data structure which is shared ...

  23. A hybrid metaheuristic optimised ensemble classifier with self

    By undersampling the majority class with various clustering algorithms and evaluating performance using different supervised and ensemble learning methods. ... Learning Repository has various real-world public datasets for credit scoring that are commonly utilized in many research papers to evaluate the performance of the proposed algorithms ...

  24. Research on k-means Clustering Algorithm: An Improved k-means

    Abstract: Clustering analysis method is one of the main analytical methods in data mining, the method of clustering algorithm will influence the clustering results directly. This paper discusses the standard k-means clustering algorithm and analyzes the shortcomings of standard k-means algorithm, such as the k-means clustering algorithm has to calculate the distance between each data object ...

  25. Improved Security for Multimedia Data Visualization using Hierarchical

    In this paper, a realization technique is designed with a unique analytical model for transmitting multimedia data to appropriate end users. ... In the proposed method one best type of clustering termed as Hierarchical Clustering Algorithm (HCA) ... Research on the application of multimedia elements in visual communication art under the ...

  26. Optimized Application of the Decision Tree ID3 Algorithm Based on Big

    The research results show that the data query and data display functions are widely used in the system, and the utilization rate of each function of big data in the system can be clearly seen. ... Zhou, Y., Zhou, L., & Li, H. (2019). Information classification algorithm based on decision tree optimization. Cluster Computing, 22(S3), 7559-7568 ...

  27. A review of the development of YOLO object detection algorithm

    This paper mainly discusses the development processes of the YOLO algorithm series, focuses on the changes and innovations in network structure, training strategies, and performance optimization.
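
Several of the entries above (items 4, 6, 9 and 24) describe the same core K-means loop: assign every point to its nearest centre, then move each centre to the mean of the points assigned to it. A minimal NumPy sketch of that loop, for illustration only, is given below.

```python
# Minimal Lloyd-style k-means illustrating the assign/update loop described in
# the snippets above; for illustration only, not a production implementation.
import numpy as np


def kmeans(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    labels = np.zeros(len(X), dtype=int)
    for _ in range(n_iter):
        # Assignment step: each point goes to its nearest centre.
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: each centre moves to the mean of its assigned points.
        new_centers = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return centers, labels
```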
