Publications | Ravil Mussabayev

Journal Articles

2025

ApplSciences

BiModalClust: Fused Data and Neighborhood Variation for Advanced K-Means Big Data Clustering

Ravil Mussabayev, and Rustam Mussabayev

Abs HTML Preprint Code

K-means clustering is a fundamental tool in data mining, yet its scalability and efficacy decline when faced with massive datasets. In this work, we introduce BiModalClust, a novel clustering algorithm that leverages a bimodal optimization paradigm to overcome these challenges. Our approach simultaneously optimizes two interdependent modalities: the input data stream and the neighborhood structure of the solution landscape, which emerges from iterative restrictions of the Minimum Sum-of-Squares Clustering (MSSC) objective function to sampled subsets of the data. By integrating the Variable Neighborhood Search (VNS) metaheuristic, we systematically explore and refine these landscapes through dynamic reinitialization of degenerate centroids and adaptive exploration of expanding neighborhoods. This dual-stream optimization not only transforms traditional local search into a more global and robust process but also ensures computational scalability and precision. Extensive experimentation on diverse real-world datasets demonstrates that BiModalClust achieves superior clustering performance among K-means-based methods in big data environments.
IEEEAccess

Finetuning Large Language Models for Vulnerability Detection

Aleksei Shestov, Rodion Levichev, Ravil Mussabayev, Evgeny Maslov, Pavel Zadorozhny, Anton Cheshkov, Rustam Mussabayev, Alymzhan Toleu, Gulmira Tolegen, and Alexander Krassovitskiy

Abs HTML Preprint Code

This paper presents the results of finetuning large language models (LLMs) for the task of detecting vulnerabilities in source code. We leverage WizardCoder, a recent improvement of the state-of-the-art LLM StarCoder, and adapt it for vulnerability detection through further finetuning. To accelerate training, we modify WizardCoder’s training procedure, also we investigate optimal training regimes. For the imbalanced dataset with many more negative examples than positive, we also explore different techniques to improve classification performance. The finetuned WizardCoder model achieves improvement in ROC AUC and F1 measures on balanced and imbalanced vulnerability datasets over CodeBERT-like model, demonstrating the effectiveness of adapting pretrained LLMs for vulnerability detection in source code. The key contributions are finetuning the state-of-the-art code LLM, WizardCoder, increasing its training speed without the performance harm, optimizing the training procedure and regimes, handling class imbalance, and improving performance on difficult vulnerability detection datasets. This demonstrates the potential for transfer learning by finetuning large pretrained language models for specialized source code analysis tasks.

2024

Mathematics

High-Performance Hybrid Algorithm for Minimum Sum-of-Squares Clustering of Infinitely Tall Data

Ravil Mussabayev, and Rustam Mussabayev

Abs HTML Preprint Code

This paper introduces a novel formulation of the clustering problem, namely, the minimum sum-of-squares clustering of infinitely tall data (MSSC-ITD), and presents HPClust, an innovative set of hybrid parallel approaches for its effective solution. By utilizing modern high-performance computing techniques, HPClust enhances key clustering metrics: effectiveness, computational efficiency, and scalability. In contrast to vanilla data parallelism, which only accelerates processing time through the MapReduce framework, our approach unlocks superior performance by leveraging the multi-strategy competitive–cooperative parallelism and intricate properties of the objective function landscape. Unlike other available algorithms that struggle to scale, our algorithm is inherently parallel in nature, improving solution quality through increased scalability and parallelism and outperforming even advanced algorithms designed for small- and medium-sized datasets. Our evaluation of HPClust, featuring four parallel strategies, demonstrates its superiority over traditional and cutting-edge methods by offering better performance in the key metrics. These results also show that parallel processing not only enhances the clustering efficiency, but the accuracy as well. Additionally, we explore the balance between computational efficiency and clustering quality, providing insights into optimal parallel strategies based on dataset specifics and resource availability. This research advances our understanding of parallelism in clustering algorithms, demonstrating that a judicious hybridization of advanced parallel approaches yields optimal results for MSSC-ITD. Experiments on the synthetic data further confirm HPClust’s exceptional scalability and robustness to noise.
ApplSoftComp

Comparative Analysis of Optimization Strategies for K-means Clustering in Big Data Contexts: A Review

Ravil Mussabayev, and Rustam Mussabayev

Abs Preprint

This paper presents a comparative analysis of different optimization techniques for the K-means algorithm in the context of big data. K-means is a widely used clustering algorithm, but it can suffer from scalability issues when dealing with large datasets. The paper explores different approaches to overcome these issues, including parallelization, approximation, and sampling methods. The authors evaluate the performance of various clustering techniques on a large number of benchmark datasets, comparing them according to the dominance criterion provided by the "less is more" approach (LIMA), i.e., simultaneously along the dimensions of speed, clustering quality, and simplicity. The results show that different techniques are more suitable for different types of datasets and provide insights into the trade-offs between speed and accuracy in K-means clustering for big data. Overall, the paper offers a comprehensive guide for practitioners and researchers on how to optimize K-means for big data applications.
AnnOpRes

Variable Landscape Search: A Novel Metaheuristic Paradigm for Unlocking Hidden Dimensions in Global Optimization

Rustam Mussabayev, and Ravil Mussabayev

Abs Preprint

This paper presents the Variable Landscape Search (VLS), a novel metaheuristic designed to globally optimize complex problems by dynamically altering the objective function landscape. Unlike traditional methods that operate within a static search space, VLS introduces an additional level of flexibility and diversity to the global optimization process. It does this by continuously and iteratively varying the objective function landscape through slight modifications to the problem formulation, the input data, or both. The innovation of the VLS metaheuristic stems from its unique capability to seamlessly fuse dynamic adaptations in problem formulation with modifications in input data. This dual-modality approach enables continuous exploration of interconnected and evolving search spaces, significantly enhancing the potential for discovering optimal solutions in complex, multi-faceted optimization scenarios, making it adaptable across various domains. In this paper, one of the theoretical results is obtained in the form of a generalization of the following three alternative metaheuristics, which have been reduced to special cases of VLS: Variable Formulation Search (VFS), Formulation Space Search (FSS), and Variable Search Space (VSS). As a practical application, the paper demonstrates the superior efficiency of a recent big data clustering algorithm through its conceptualization using the VLS paradigm.

2023

PattRecog

How to Use K-means for Big Data Clustering?

Rustam Mussabayev, Nenad Mladenovic, Bassem Jarboui, and Ravil Mussabayev

Abs HTML Preprint Code

K-means plays a vital role in data mining and is the simplest and most widely used algorithm under the Euclidean Minimum Sum-of-Squares Clustering (MSSC) model. However, its performance drastically drops when applied to vast amounts of data. Therefore, it is crucial to improve K-means by scaling it to big data using as few of the following computational resources as possible: data, time, and algorithmic ingredients. We propose a new parallel scheme of using K-means and K-means++ algorithms for big data clustering that satisfies the properties of a “true big data” algorithm and outperforms the classical and recent state-of-the-art MSSC approaches in terms of solution quality and runtime. The new approach naturally implements global search by decomposing the MSSC problem without using additional metaheuristics. This work shows that data decomposition is the basic approach to solve the big data clustering problem. The empirical success of the new algorithm allowed us to challenge the common belief that more data is required to obtain a good clustering solution. Moreover, the present work questions the established trend that more sophisticated hybrid approaches and algorithms are required to obtain a better clustering solution.

Books & Conferences

2025

SpringerNature

Boosting K-means for Big Data by Fusing Data Streaming with Global Optimization

Ravil Mussabayev, and Rustam Mussabayev

Abs HTML Preprint Code

K-means clustering is a cornerstone of data mining, but its efficiency deteriorates when confronted with massive datasets. To address this limitation, we propose a novel heuristic algorithm that leverages the Variable Neighborhood Search (VNS) metaheuristic and the data streaming approach to optimize K-means clustering for big data. Our approach is based on the sequential optimization of the partial objective function landscapes obtained by restricting the Minimum Sum-of-Squares Clustering (MSSC) formulation to random samples from the original big dataset. Within each landscape, systematically expanding neighborhoods of the currently best (incumbent) solution are explored by reinitializing all degenerate and a varying number of additional centroids. Extensive and rigorous experimentation on a large number of real-world datasets reveals that by transforming the traditional local search into a global one, our algorithm significantly enhances the accuracy and efficiency of K-means clustering in big data environments, becoming the new state of the art in the field.

2024

SpringerNature

Superior Parallel Big Data Clustering Through Competitive Stochastic Sample Size Optimization in Big-Means

Rustam Mussabayev, and Ravil Mussabayev

Abs HTML Preprint

This paper introduces a novel K-means clustering algorithm, an advancement on the conventional Big-means methodology. The proposed method efficiently integrates parallel processing, stochastic sampling, and competitive optimization to create a scalable variant designed for big data applications. It addresses scalability and computation time challenges typically faced with traditional techniques. The algorithm adjusts sample sizes dynamically for each worker during execution, optimizing performance. Data from these sample sizes are continually analyzed, facilitating the identification of the most efficient configuration. By incorporating a competitive element among workers using different sample sizes, efficiency within the Big-means algorithm is further stimulated. In essence, the algorithm balances computational time and clustering quality by employing a stochastic, competitive sampling strategy in a parallel computing setting.
SpringerNature

Optimizing Parallelization Strategies for the Big-Means Clustering Algorithm

Ravil Mussabayev, and Rustam Mussabayev

Abs HTML Preprint

This study focuses on the optimization of the Big-means algorithm for clustering large-scale datasets, exploring three distinct parallelization strategies. We conducted extensive experiments to assess the computational efficiency, scalability, and clustering performance of each approach, revealing their benefits and limitations. The paper also delves into the trade-offs between computational efficiency and clustering quality, examining the impacts of various factors. Our insights provide practical guidance on selecting the best parallelization strategy based on available resources and dataset characteristics, contributing to a deeper understanding of parallelization techniques for the Big-means algorithm.
IEEEXplore

WRDScore: New Metric for Evaluation of Natural Language Generation Models

Ravil Mussabayev

Abs HTML Preprint Code

Evaluating natural language generation models, particularly for method name prediction, poses significant challenges. A robust metric must account for the versatility of method naming, considering both semantic and syntactic variations. Traditional overlap-based metrics, such as ROUGE, fail to capture these nuances. Existing embedding-based metrics often suffer from imbalanced precision and recall, lack normalized scores, or make unrealistic assumptions about sequences. To address these limitations, we leverage the theory of optimal transport and construct WRDScore, a novel metric that strikes a balance between simplicity and effectiveness. In the WRDScore framework, we define precision as the maximum degree to which the predicted sequence’s tokens are included in the reference sequence, token by token. Recall is calculated as the total cost of the optimal transport plan that maps the reference sequence to the predicted one. Finally, WRDScore is computed as the harmonic mean of precision and recall, balancing these two complementary metrics. Our metric is lightweight, normalized, and precision-recall-oriented, avoiding unrealistic assumptions while aligning well with human judgments. Experiments on a human-curated dataset confirm the superiority of WRDScore over other available text metrics.

2018

IEEEXplore

Reinforcement Learning Intersection Controller

Gulnur Tolebi, Nurlan S. Dairbekov, Daniyar Kurmankhojayev, and Ravil Mussabayev

Abs HTML Code

This paper presents an online model-free adaptive traffic signal controller for an isolated intersection using a Reinforcement Learning (RL) approach. We base our solution on the Q-learning algorithm with action-value approximation. In contrast with other studies in the field, we use the queue length in adddition to the average delay as a measure of performance. Also, the number of queuing vehicles and the green phase duration in four directions are aggregated to represent a state. The duration of phases is a precise value for the nonconflicting directions. Therefore, cycle length is non-fixed. Finally, we analyze and update the equilibrium and queue reduction terms in our previous equation of an immediate reward. Also, the delay based reward is tested in the given control system. The performance of the proposed method is compared with an optimal symmetric fixed signal plan.

2015

IEEEXplore

Colour-based object detection, inverse kinematics algorithms and pinhole camera model for controlling robotic arm movement system

Ravil Mussabayev

Abs HTML Code

Colour-based computer vision algorithm allows to rather precisely distinguish a number of objects disposed on an input camera frame provided that their colours differ from that of a background. Once the border of an object is detected the position of its approximate mass centre may be easily obtained. That coordinate is considered within 2D frame image coming directly from the camera and is subsequently projected to the reference frame associated with the real world by making use of the pinhole camera model and homogeneous coordinates. Given appropriate real world coordinates of the object (which is believed to be put onto a flat plane) iterative Jacobian inverse method provides efficient solution to the problem of inverse kinematics. Thus derived solution is sent to the Arduino microcontroller which in its turn rotates servo joints of the arm by means of the pulse-width modulation (PWM) technique.

Technical Reports

2023

arXiv

Structure-Aware Code Vulnerability Analysis With Graph Neural Networks

Ravil Mussabayev

Abs Preprint

This study explores the effectiveness of graph neural networks (GNNs) for vulnerability detection in software code, utilizing a real-world dataset of Java vulnerability-fixing commits. The dataset’s structure, based on the number of modified methods in each commit, offers a natural partition that facilitates diverse investigative scenarios. The primary focus is to evaluate the general applicability of GNNs in identifying vulnerable code segments and distinguishing these from their fixed versions, as well as from random non-vulnerable code. Through a series of experiments, the research addresses key questions about the suitability of different configurations and subsets of data in enhancing the prediction accuracy of GNN models. Experiments indicate that certain model configurations, such as the pruning of specific graph elements and the exclusion of certain types of code representation, significantly improve performance. Additionally, the study highlights the importance of including random data in training to optimize the detection capabilities of GNNs.