Using machine learning methods, my work at Huawei revolved around the following broad research questions:
- What is the minimal context size necessary to represent and unambiguously classify a vulnerability in a static piece of source code (written in Java or C++)?
- How can we build a tool that automatically locates this minimal context within a large body of source code (such as the entire codebase of a GitHub project) and ranks the relevance of its different parts?
- How can we train this vulnerability-locating tool so that it maximizes the coverage of potential vulnerabilities in the input code?
- What more powerful model could we use to analyze the extracted context more accurately? Would a Large Language Model (LLM), such as GPT-4, suffice for that purpose? If so, what combination of generation regimes would be most effective (e.g., data flow analysis with chain-of-thought reasoning and self-reflection)? A sketch of one such regime follows this list.
- How can we collect a real-world, high-quality vulnerability dataset, free of mislabeled examples, to train and test all of the above hypotheses?
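
To make the last two questions concrete, below is a minimal sketch of one possible generation regime that combines data-flow-oriented chain-of-thought prompting with a self-reflection pass. The `complete` function and both prompt templates are hypothetical placeholders for illustration, not the actual setup used in this work.

```python
# Sketch: chain-of-thought + self-reflection regime for LLM-based
# vulnerability analysis. `complete` is a hypothetical stand-in for
# an arbitrary LLM completion API.

def complete(prompt: str) -> str:
    """Placeholder: send `prompt` to an LLM and return its answer."""
    raise NotImplementedError("wire up your LLM client here")

COT_PROMPT = """You are a security auditor.
Trace the data flow of every untrusted input in the code below,
reasoning step by step, then state whether the code is vulnerable
and name the CWE if it is.

Code:
{code}
"""

REFLECT_PROMPT = """Here is a vulnerability analysis of a code snippet.
Critically re-examine it: check each reasoning step, look for missed
data flows, and output a corrected final verdict (vulnerable / safe).

Code:
{code}

Initial analysis:
{analysis}
"""

def analyze(code: str) -> str:
    # Pass 1: data-flow-oriented chain-of-thought analysis.
    analysis = complete(COT_PROMPT.format(code=code))
    # Pass 2: self-reflection over the first pass's reasoning.
    verdict = complete(REFLECT_PROMPT.format(code=code, analysis=analysis))
    return verdict
```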
References
Working Papers
2024
- Finetuning Large Language Models for Vulnerability Detection
Alexey Shestov, Rodion Levichev, Ravil Mussabayev, Evgeny Maslov, Anton Cheshkov, and Pavel Zadorozhny
Working paper
This paper presents the results of finetuning large language models (LLMs) for the task of detecting vulnerabilities in source code. We leverage WizardCoder, a recent improvement over the state-of-the-art code LLM StarCoder, and adapt it for vulnerability detection through further finetuning. To accelerate training, we modify WizardCoder's training procedure, and we investigate optimal training regimes. For an imbalanced dataset with far more negative examples than positive ones, we also explore techniques to improve classification performance. The finetuned WizardCoder model achieves improvements in ROC AUC and F1 over a CodeBERT-like model on both balanced and imbalanced vulnerability datasets, demonstrating the effectiveness of adapting pretrained LLMs for vulnerability detection in source code. The key contributions are finetuning the state-of-the-art code LLM WizardCoder, increasing its training speed without harming performance, optimizing the training procedure and regimes, handling class imbalance, and improving performance on difficult vulnerability detection datasets. This demonstrates the potential of transfer learning by finetuning large pretrained language models for specialized source code analysis tasks.
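
One standard way to handle the class imbalance mentioned in the abstract is a class-weighted cross-entropy loss. The paper's exact technique is not specified here, so the following PyTorch sketch is purely illustrative.

```python
import torch
import torch.nn.functional as F

# Illustrative only: class-weighted cross-entropy for a binary
# vulnerable/non-vulnerable classifier trained on imbalanced data.
# Weights are inversely proportional to class frequency.

def class_weights(labels: torch.Tensor) -> torch.Tensor:
    """labels: 1-D tensor of 0/1 targets for the whole training set."""
    counts = torch.bincount(labels, minlength=2).float()
    # Rarer class receives a proportionally larger weight.
    return counts.sum() / (2.0 * counts)

def weighted_loss(logits: torch.Tensor, targets: torch.Tensor,
                  weights: torch.Tensor) -> torch.Tensor:
    """logits: (batch, 2); targets: (batch,) with values in {0, 1}."""
    return F.cross_entropy(logits, targets, weight=weights)

# Usage sketch (names are hypothetical):
# w = class_weights(torch.tensor(train_labels))   # e.g., mostly zeros
# loss = weighted_loss(model(batch).logits, batch_targets, w)
```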
Technical Reports
2023
- Structure-Aware Code Vulnerability Analysis With Graph Neural Networks
Ravil Mussabayev
This study explores the effectiveness of graph neural networks (GNNs) for vulnerability detection in software code, utilizing a real-world dataset of Java vulnerability-fixing commits. The dataset’s structure, based on the number of modified methods in each commit, offers a natural partition that facilitates diverse investigative scenarios. The primary focus is to evaluate the general applicability of GNNs in identifying vulnerable code segments and distinguishing these from their fixed versions, as well as from random non-vulnerable code. Through a series of experiments, the research addresses key questions about the suitability of different configurations and subsets of data in enhancing the prediction accuracy of GNN models. Experiments indicate that certain model configurations, such as the pruning of specific graph elements and the exclusion of certain types of code representation, significantly improve performance. Additionally, the study highlights the importance of including random data in training to optimize the detection capabilities of GNNs.
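
For context, here is a minimal sketch of a graph-level classifier of the kind such a study might employ. The report does not specify the architecture; this example uses PyTorch Geometric's GCN layers and mean pooling as generic stand-ins, and the choice of node features is an assumption.

```python
import torch
import torch.nn.functional as F
from torch_geometric.nn import GCNConv, global_mean_pool

class CodeGraphClassifier(torch.nn.Module):
    """Generic two-layer GCN for binary classification of code graphs
    (vulnerable vs. non-vulnerable). Node features could be, e.g.,
    embeddings of AST/CFG node types; that detail is illustrative."""

    def __init__(self, in_dim: int, hidden_dim: int = 64):
        super().__init__()
        self.conv1 = GCNConv(in_dim, hidden_dim)
        self.conv2 = GCNConv(hidden_dim, hidden_dim)
        self.head = torch.nn.Linear(hidden_dim, 2)

    def forward(self, x, edge_index, batch):
        # Two rounds of message passing over the code graph.
        x = F.relu(self.conv1(x, edge_index))
        x = F.relu(self.conv2(x, edge_index))
        # Pool node embeddings into one vector per graph, then classify.
        return self.head(global_mean_pool(x, batch))
```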