CodeExtract: Enhancing Binary Code Similarity Detection with Code Extraction Techniques (LCTES 2024 - Languages, Compilers, Tools and Theory of Embedded Systems)

Who

Lichen Jia, Chenggang Wu, Zhe Wang, Peihua Zhang

Track

LCTES 2024

Time Zone

Times are shown in (GMT+02:00) Windhoek, the conference time zone.

The GMT offsets shown reflect the offsets at the moment of the conference.

When

Mon 24 Jun 2024 16:15 - 16:30 at Iceland - Analysis and Testing. Chair(s): Jason Xue

Abstract

Given a target function in binary form, Binary Code Similarity Detection (BCSD) conventionally identifies the set of functions most similar to it. These similar functions often originate from the same source code but differ because of variations in compilation settings. Such analysis is crucial for security applications, including vulnerability discovery and malware identification. Function inlining, a compiler optimization, embeds the code of a called function directly into the calling function. Because different compilation options (such as O1 and O3) lead to different levels of function inlining, binary functions derived from the same source code can differ significantly across compilation settings, which challenges the accuracy of state-of-the-art (SOTA) learning-based binary code similarity detection (LB-BCSD) methods. In contrast to function inlining, code extraction identifies and separates duplicate code within a program and replaces it with corresponding function calls. To overcome the impact of function inlining, this paper introduces a novel approach, CodeExtract. The method first uses code extraction to transform code introduced by function inlining back into function calls. It then actively inlines functions that cannot undergo code extraction, effectively eliminating the differences introduced by function inlining. Experimental validation shows that CodeExtract improves the performance of LB-BCSD models by 20% in addressing the challenges posed by function inlining.
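
The relationship between function inlining and code extraction described in the abstract can be illustrated with a toy sketch. This is not the paper's implementation: functions are modeled as plain lists of made-up instruction strings, the names `inline_calls`, `extract_calls`, and `clamp` are hypothetical, and extraction is done by exact sequence matching, which real optimized binaries would rarely permit. It only shows why a low-inlining build and a high-inlining build of the same source look different, and how turning inlined bodies back into calls makes them comparable again.

```python
def inline_calls(body, funcs):
    """Replace each 'call f' with the body of f (one level, no recursion)."""
    out = []
    for ins in body:
        if ins.startswith("call ") and ins[5:] in funcs:
            out.extend(funcs[ins[5:]])  # embed the callee's code in the caller
        else:
            out.append(ins)
    return out


def extract_calls(body, funcs):
    """Replace each exact copy of a known function body with 'call f'."""
    out = []
    i = 0
    while i < len(body):
        for name, callee in funcs.items():
            n = len(callee)
            if n and body[i:i + n] == callee:
                out.append("call " + name)  # fold the copy back into a call
                i += n
                break
        else:
            # No known body starts here; keep the instruction unchanged.
            out.append(body[i])
            i += 1
    return out


# Hypothetical helper and caller, written as pseudo-assembly strings.
helpers = {"clamp": ["cmp r0, 0", "movlt r0, 0", "cmp r0, 255", "movgt r0, 255"]}
low_opt = ["ldr r0, [r1]", "call clamp", "str r0, [r2]"]  # little inlining
high_opt = inline_calls(low_opt, helpers)                 # aggressive inlining
restored = extract_calls(high_opt, helpers)               # extraction undoes it
```

In this toy setting `restored` equals `low_opt`, so the two builds become directly comparable. The abstract's second step, actively inlining the functions that cannot be extracted, would correspond to applying `inline_calls` on both sides whenever no exact match is available.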

Lichen Jia

Institute of Computing Technology, Chinese Academy of Sciences

Chenggang Wu

Institute of Computing Technology at Chinese Academy of Sciences; University of Chinese Academy of Sciences; Zhongguancun Laboratory

China

Zhe Wang

Institute of Computing Technology at Chinese Academy of Sciences; Zhongguancun Laboratory