September 21, 2022
Toward the Detection of Polyglot Files
Abstract
Standardized file types play a key role in the development and use of computer software. However, standardized file type processing can be confounded by a file that is valid under multiple file types. The resulting polyglot (many languages) file can confuse file type identification, allowing elements of the file to evade analysis. This is especially problematic for malware detection systems that rely on file type identification for feature extraction. Although work has been done to identify file types using more comprehensive methods than file signatures, accurate identification of polyglot files remains an open problem. Because malware detection systems routinely perform file type-specific feature extraction, polyglot files need to be filtered out before these systems ingest them; otherwise, malicious content could pass through undetected. To address the problem of polyglot detection, we assembled a data set using the mitra tool. We then evaluated the performance of the most commonly used file identification tools, including file, polydet, binwalk, and TrID. Our analysis demonstrates that existing file type detection tools fail to provide reliable polyglot detection. Next, we evaluated the ability of a range of machine learning and deep learning models to detect polyglot files. The best-performing models were MalConv2 and CatBoost, which achieved the highest recall on our data set, 95.16% and 95.45%, respectively. These models outperformed existing methods and could be incorporated into a malware detector's file processing pipeline to filter out potentially malicious polyglots before file type-dependent feature extraction takes place.
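To make the failure mode concrete, the following minimal sketch contrasts header-only identification with a scan for signatures at any offset. It is an illustration only and not the detection method evaluated in this work: the signature table, the function names, and the toy sample are all assumptions introduced here, and a plain substring scan like this would produce many false positives on real files.

```python
from typing import Optional, Set

# Illustrative only: a naive signature scan, NOT the models evaluated in the paper.
# A small, hand-picked table of well-known magic bytes (an assumption for this sketch).
SIGNATURES = {
    "pdf": b"%PDF-",
    "zip": b"PK\x03\x04",
    "png": b"\x89PNG\r\n\x1a\n",
    "gif": b"GIF89a",
    "jpeg": b"\xff\xd8\xff",
}


def type_at_start(data: bytes) -> Optional[str]:
    """Header-only identification: report the first signature found at offset 0."""
    for name, magic in SIGNATURES.items():
        if data.startswith(magic):
            return name
    return None


def types_anywhere(data: bytes) -> Set[str]:
    """Report every type whose signature occurs at any offset in the data."""
    return {name for name, magic in SIGNATURES.items() if magic in data}


def looks_like_polyglot(data: bytes) -> bool:
    """Flag data that carries signatures of more than one file type."""
    return len(types_anywhere(data)) > 1


# Hypothetical toy sample: a PDF header followed by a ZIP local file header.
sample = b"%PDF-1.4\n% toy body\n" + b"PK\x03\x04" + b"\x00" * 26

print(type_at_start(sample))        # header-only view of the sample
print(sorted(types_anywhere(sample)))  # every signature present
print(looks_like_polyglot(sample))
```

A header-only check sees just one type in the sample, so any type-specific feature extractor chosen on that basis would ignore the second format embedded later in the file. The crude full scan catches this toy case but cannot tell a real embedded file from coincidental bytes, which is one reason more robust approaches are needed.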
Published: September 21, 2022