August 4, 2016
Conference Paper

A Biosequence-based Approach to Software Characterization

Abstract

For many applications, it is desirable to recognize when software binaries are closely related without requiring them to be identical or to share identical segments. Examples include monitoring utilization of high performance computing centers or service clouds, detecting freeware in licensed code, and enforcing application whitelists. Doing so in a dynamic environment is nontrivial, because most approaches to software similarity either require extensive and time-consuming analysis of a binary or fail to recognize executables that are similar but nonidentical. Presented herein is a novel biosequence-based method for quantifying the similarity of executable binaries. In an example application to large-scale multi-author codes, it is shown that 1) the biosequence-based method achieves statistical performance better than 90% of ideal in recognizing and distinguishing between a collection of real-world high performance computing applications; and 2) using family tree analysis to tune identification for a code subfamily can achieve better than 99% of ideal performance.

Revised: January 5, 2017 | Published: August 4, 2016

Citation

Oehmen C.S., E.S. Peterson, A.R. Phillips, and D.S. Curtis. 2016. A Biosequence-based Approach to Software Characterization. In IEEE Security and Privacy Workshops (SPW 2016), May 22-26, 2016, San Jose, California, 118-125. Palo Alto, California: IEEE Computer Society. PNNL-SA-115860. doi:10.1109/SPW.2016.43