A Natural Language Processing Approach for Instruction Set Architecture Identification

Paper Title

A Natural Language Processing Approach for Instruction Set Architecture Identification

Authors

Dinuka Sahabandu, Sukarno Mertoguno, Radha Poovendran

Abstract

Binary analysis of software is a critical step in cyber forensics applications such as program vulnerability assessment and malware detection. This involves interpreting instructions executed by software and often necessitates converting the software's binary file data to assembly language. The conversion process requires information about the binary file's target instruction set architecture (ISA). However, ISA information might not be included in binary files due to compilation errors, partial downloads, or adversarial corruption of file metadata. Machine learning (ML) is a promising methodology that can be used to identify the target ISA using binary data in the object code section of binary files. In this paper we propose a binary code feature extraction model to improve the accuracy and scalability of ML-based ISA identification methods. Our feature extraction model can be used in the absence of domain knowledge about the ISAs. Specifically, we adapt models from natural language processing (NLP) to i) identify successive byte patterns commonly observed in binary codes, ii) estimate the significance of each byte pattern to a binary file, and iii) estimate the relevance of each byte pattern in distinguishing between ISAs. We introduce character-level features of encoded binaries to identify fine-grained bit patterns inherent to each ISA. We use a dataset with binaries from 12 different ISAs to evaluate our approach. Empirical evaluations show that using our byte-level features in ML-based ISA identification results in an 8% higher accuracy than the state-of-the-art features based on byte-histograms and byte pattern signatures. We observe that character-level features allow reducing the size of the feature set by up to 16x while maintaining accuracy above 97%.
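
The three feature-extraction steps named in the abstract can be illustrated with a small sketch. It assumes a TF-IDF-style weighting over byte n-grams: term frequency stands in for the significance of a byte pattern to a file, and inverse document frequency stands in for its relevance in telling files apart. The abstract does not give the paper's formulas, so the weighting, the function names (`byte_ngrams`, `tfidf_features`, `char_ngrams`), and the choice of hex encoding for the character-level features are all assumptions made for illustration.

```python
import math
from collections import Counter


def byte_ngrams(data: bytes, n: int = 2):
    """Return every run of n consecutive bytes in data (step i)."""
    return [data[i:i + n] for i in range(len(data) - n + 1)]


def tfidf_features(binaries, n: int = 2):
    """Return one {ngram: weight} dict per binary.

    Assumed weighting, not taken from the paper:
      tf  = count of the n-gram in the file / total n-grams in the file (step ii)
      idf = log(number of files / number of files containing the n-gram) (step iii)
    """
    counts = [Counter(byte_ngrams(b, n)) for b in binaries]

    # Number of files in which each n-gram appears at least once.
    doc_freq = Counter()
    for c in counts:
        doc_freq.update(c.keys())

    num_docs = len(binaries)
    features = []
    for c in counts:
        total = sum(c.values()) or 1  # guard against files shorter than n
        features.append({
            gram: (k / total) * math.log(num_docs / doc_freq[gram])
            for gram, k in c.items()
        })
    return features


def char_ngrams(data: bytes, n: int = 3):
    """Character-level n-grams over the hex encoding of data (assumed encoding)."""
    text = data.hex()
    return [text[i:i + n] for i in range(len(text) - n + 1)]


if __name__ == "__main__":
    # Toy byte strings, not real object code.
    samples = [b"\x90\x90\xc3", b"\x90\x90\x00"]
    for weights in tfidf_features(samples, n=2):
        print(weights)
    print(char_ngrams(b"\x90\xc3", 3))
```

In this sketch, an n-gram that occurs in every file gets weight zero, which is the intended effect of the third step: patterns common to all inputs carry no discriminating information. A real pipeline would compute the relevance term per ISA class and feed the resulting vectors to a classifier; both are left out here.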
