During my PhD studies, I explored the following research questions:
- Is it possible to obtain an accurate clustering solution using only a small fraction of the available points in a big dataset?
- What would be a more precise and practically useful definition of big data?
- Is it necessary to employ more complex hybrid algorithms in the search for more accurate clustering solutions? Can we follow the "less is more" principle in designing big data clustering algorithms instead?
- Can a decomposition principle be used to achieve global optimization properties in the big data clustering problem?
- Is it possible to develop a simple, scalable, and parallelizable big data clustering algorithm that is more effective and efficient than the existing state-of-the-art hybrid approaches?
References
Journal Articles
2024
- High-Performance Hybrid Algorithm for Minimum Sum-of-Squares Clustering of Infinitely Tall Data
Ravil Mussabayev and Rustam Mussabayev
This paper introduces a novel formulation of the clustering problem, namely, the minimum sum-of-squares clustering of infinitely tall data (MSSC-ITD), and presents HPClust, an innovative set of hybrid parallel approaches for its effective solution. By utilizing modern high-performance computing techniques, HPClust enhances key clustering metrics: effectiveness, computational efficiency, and scalability. In contrast to vanilla data parallelism, which only accelerates processing time through the MapReduce framework, our approach unlocks superior performance by leveraging multi-strategy competitive-cooperative parallelism and the intricate properties of the objective function landscape. Unlike other available algorithms that struggle to scale, our algorithm is inherently parallel: its solution quality improves with increased scalability and parallelism, and it outperforms even advanced algorithms designed for small- and medium-sized datasets. Our evaluation of HPClust, featuring four parallel strategies, demonstrates its superiority over traditional and cutting-edge methods on the key metrics. These results also show that parallel processing enhances not only clustering efficiency but also accuracy. Additionally, we explore the balance between computational efficiency and clustering quality, providing insights into optimal parallel strategies based on dataset specifics and resource availability. This research advances our understanding of parallelism in clustering algorithms, demonstrating that a judicious hybridization of advanced parallel approaches yields optimal results for MSSC-ITD. Experiments on synthetic data further confirm HPClust’s exceptional scalability and robustness to noise.
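For context, the MSSC objective optimized throughout this line of work is the total squared Euclidean distance from every point to its nearest centroid. The short helper below makes that definition concrete; the function name is illustrative and not taken from the paper's code.

```python
import numpy as np
from scipy.spatial.distance import cdist

def mssc_objective(X, centers):
    """Minimum sum-of-squares clustering (MSSC) objective: the sum over all
    points of the squared Euclidean distance to the nearest center."""
    X = np.asarray(X, dtype=float)
    centers = np.asarray(centers, dtype=float)
    return cdist(X, centers, "sqeuclidean").min(axis=1).sum()
```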
- Comparative Analysis of Optimization Strategies for K-means Clustering in Big Data Contexts: A Review
Ravil Mussabayev and Rustam Mussabayev
This paper presents a comparative analysis of different optimization techniques for the K-means algorithm in the context of big data. K-means is a widely used clustering algorithm, but it can suffer from scalability issues when dealing with large datasets. The paper explores different approaches to overcome these issues, including parallelization, approximation, and sampling methods. The authors evaluate the performance of various clustering techniques on a large number of benchmark datasets, comparing them according to the dominance criterion provided by the "less is more" approach (LIMA), i.e., simultaneously along the dimensions of speed, clustering quality, and simplicity. The results show that different techniques are more suitable for different types of datasets and provide insights into the trade-offs between speed and accuracy in K-means clustering for big data. Overall, the paper offers a comprehensive guide for practitioners and researchers on how to optimize K-means for big data applications.
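As a rough sketch of the LIMA-style dominance comparison mentioned in this review, the function below assumes each algorithm is summarized by three lower-is-better scores for runtime, clustering objective, and complexity; the dimension names and encoding are illustrative assumptions, not the paper's exact protocol.

```python
def lima_dominates(a, b, dims=("runtime", "objective", "complexity")):
    """Return True if algorithm summary `a` dominates `b`: no worse on every
    dimension and strictly better on at least one (all lower-is-better)."""
    return all(a[d] <= b[d] for d in dims) and any(a[d] < b[d] for d in dims)

# Example (hypothetical scores): the first algorithm dominates the second.
# lima_dominates({"runtime": 3.1, "objective": 120.5, "complexity": 1},
#                {"runtime": 5.0, "objective": 121.0, "complexity": 2})  # True
```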
2023
- How to Use K-means for Big Data Clustering?
Rustam Mussabayev, Nenad Mladenovic, Bassem Jarboui, and Ravil Mussabayev
K-means plays a vital role in data mining and is the simplest and most widely used algorithm under the Euclidean Minimum Sum-of-Squares Clustering (MSSC) model. However, its performance drops drastically when applied to vast amounts of data. Therefore, it is crucial to improve K-means by scaling it to big data while using as little as possible of the following computational resources: data, time, and algorithmic ingredients. We propose a new parallel scheme of using the K-means and K-means++ algorithms for big data clustering that satisfies the properties of a “true big data” algorithm and outperforms the classical and recent state-of-the-art MSSC approaches in terms of solution quality and runtime. The new approach naturally implements global search by decomposing the MSSC problem without using additional metaheuristics. This work shows that data decomposition is the basic approach to solving the big data clustering problem. The empirical success of the new algorithm allowed us to challenge the common belief that more data is required to obtain a good clustering solution. Moreover, the present work questions the established trend that more sophisticated hybrid approaches and algorithms are required to obtain a better clustering solution.
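The listing below is a minimal, single-process sketch of the sample-based decomposition scheme described in this paper, written with scikit-learn's KMeans. The function name, parameters, and defaults are illustrative, and the authors' parallel implementation is not reproduced here.

```python
import numpy as np
from sklearn.cluster import KMeans

def sample_based_kmeans(X, k, sample_size=10_000, n_samples=30, seed=0):
    """Repeatedly solve K-means on small uniform random samples of a big dataset.

    The first sample is clustered with K-means++ initialization; every later
    sample is clustered starting from the best centers found so far, and the
    incumbent centers are replaced whenever the objective on the new sample
    improves the best sample objective seen so far.
    """
    rng = np.random.default_rng(seed)
    best_centers, best_obj = None, np.inf
    for _ in range(n_samples):
        idx = rng.choice(len(X), size=min(sample_size, len(X)), replace=False)
        sample = X[idx]
        init = "k-means++" if best_centers is None else best_centers
        km = KMeans(n_clusters=k, init=init, n_init=1).fit(sample)
        if km.inertia_ < best_obj:
            best_centers, best_obj = km.cluster_centers_, km.inertia_
    return best_centers
```

The returned centers can then label the full dataset in one final pass, e.g. `scipy.spatial.distance.cdist(X, centers, "sqeuclidean").argmin(axis=1)`, chunked over X if it does not fit in memory.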
Books & Conferences
2024
- Superior Parallel Big Data Clustering Through Competitive Stochastic Sample Size Optimization in Big-Means
Rustam Mussabayev and Ravil Mussabayev
This paper introduces a novel K-means clustering algorithm that advances the conventional Big-means methodology. The proposed method efficiently integrates parallel processing, stochastic sampling, and competitive optimization to create a scalable variant designed for big data applications, addressing the scalability and computation-time challenges typically faced by traditional techniques. The algorithm dynamically adjusts the sample size used by each worker during execution, and the performance observed for these sample sizes is continually analyzed to identify the most efficient configuration. By incorporating a competitive element among workers using different sample sizes, the efficiency of the Big-means algorithm is further improved. In essence, the algorithm balances computational time and clustering quality by employing a stochastic, competitive sampling strategy in a parallel computing setting.
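A simplified, sequential sketch of the competitive sample-size idea follows; the dynamic sample-size adjustment and the parallel workers of the actual algorithm are omitted, and the function name, candidate sizes, and other defaults are illustrative assumptions rather than the paper's settings.

```python
import numpy as np
from scipy.spatial.distance import cdist
from sklearn.cluster import KMeans

def competitive_sample_sizes(X, k, candidate_sizes=(2_000, 8_000, 32_000),
                             rounds=10, eval_size=50_000, seed=0):
    """Let several sample sizes compete for the best clustering of a big dataset.

    In every round, each candidate sample size clusters its own uniform random
    sample (warm-started from the incumbent centers), and all resulting center
    sets are scored on one shared evaluation sample; the best-scoring centers
    become the new incumbent.
    """
    rng = np.random.default_rng(seed)
    eval_idx = rng.choice(len(X), size=min(eval_size, len(X)), replace=False)
    eval_sample = X[eval_idx]
    best_centers, best_score = None, np.inf
    for _ in range(rounds):
        for size in candidate_sizes:
            idx = rng.choice(len(X), size=min(size, len(X)), replace=False)
            init = "k-means++" if best_centers is None else best_centers
            km = KMeans(n_clusters=k, init=init, n_init=1).fit(X[idx])
            score = cdist(eval_sample, km.cluster_centers_,
                          "sqeuclidean").min(axis=1).sum()
            if score < best_score:
                best_centers, best_score = km.cluster_centers_, score
    return best_centers, best_score
```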
- Optimizing Parallelization Strategies for the Big-Means Clustering Algorithm
Ravil Mussabayev and Rustam Mussabayev
This study focuses on the optimization of the Big-means algorithm for clustering large-scale datasets, exploring three distinct parallelization strategies. We conducted extensive experiments to assess the computational efficiency, scalability, and clustering performance of each approach, revealing their benefits and limitations. The paper also delves into the trade-offs between computational efficiency and clustering quality, examining the impacts of various factors. Our insights provide practical guidance on selecting the best parallelization strategy based on available resources and dataset characteristics, contributing to a deeper understanding of parallelization techniques for the Big-means algorithm.