VulBERTa: simplified source code pre-training for vulnerability detection

Document Type

Conference Item

Publication Date

1-1-2022

Abstract

This paper presents VulBERTa, a deep learning approach to detect security vulnerabilities in source code. Our approach pre-trains a RoBERTa model with a custom tokenisation pipeline on real-world code from open-source C/C++ projects. The model learns a deep knowledge representation of the code syntax and semantics, which we leverage to train vulnerability detection classifiers. We evaluate our approach on binary and multi-class vulnerability detection tasks across several datasets (Vuldeepecker, Draper, REVEAL and muVuldeepecker) and benchmarks (CodeXGLUE and D2A). The evaluation results show that VulBERTa achieves state-of-the-art performance and outperforms existing approaches across different datasets, despite its conceptual simplicity and its limited cost in terms of training data size and number of model parameters.

Keywords

Vulnerability detection, Software vulnerabilities, Pre-training, Deep learning, Representation learning

Divisions

Software

Funders

Google [Grant no. GCP19980904]

Publisher

IEEE

Publisher Location

345 E 47TH ST, NEW YORK, NY 10017 USA

Event Title

2022 International Joint Conference on Neural Networks, IJCNN 2022

Event Location

Padua

Event Dates

18-23 July 2022

Event Type

conference
