Status: Ongoing

FuncTracker: Discovering Shared Code to Aid Malware Forensics


Malware research is currently undergoing a shift in focus. With the rise of Advanced Persistent Threats (APT), it is no longer sufficient to simply know what happened. There is a growing need to understand the who and the why as well. Essential to this new focus of malware research is the discovery of relationships between various pieces of malware. Every attack, every instance of malware provides a tiny piece of the larger puzzle. It is only when these pieces are put together that the important questions can be answered. Relationships between malware help the researcher and analyst determine where each piece of the puzzle fits.

One of the key indicators of the existence of relationships is shared code. Code share between two malware can imply shared authorship, behaviors, purpose, lineage, tool chains, etc. For example, shared code was a major contributor in a report by Gostev and Soumenkov (2011) that postulated close ties between the authors of Duqu and Stuxnet. Malware forensics, then, would greatly benefit from a systematic way of discovering, encoding, and searching these relationships based on shared code.

There have been efforts in clustering related malware, but the application of these efforts in the general case of shared code is limited. Clustering algorithms are generally inefficient, taking on the order of O(n2). This can be an expensive prospect when dealing with millions of malware; it is an unaffordable cost when these millions of malware are each broken into hundreds or thousands of pieces. In addition, these clustering applications generally focus on the whole binary. Binaries that share only a single function will have very low similarity between them, and a single shared function will be missed.

We present FuncTracker, a system for discovering, exploring, and visualizing relationships between malware. Relationships are determined based on shared functions. This level of comparison was shown to provide a reasonable level of granularity by Cohen and Havrilla (2009) and Jin et al. (2012). Functions are determined to be similar based on matching semantic hash. These semantic hashes are computed using features at multiple levels of abstraction, both syntactic and semantic, designed to be robust against common obfuscations. The use of hashes instead of a similarity function, such as Jaccard Index, allows for scalability as the comparison is constant time.

The discovered relationships between malware are represented in a graph with annotated edges and nodes, thus allowing for efficient searching and navigation. In addition to shared functions, this graph based representation allows for any type of additional relationship to be represented, such as URLs contacted, APIs called, etc. Different types of searches are enabled by either traversing the graph or by searching for nodes with a specific or matching annotations.

Discovering Shared Functions

Our central problem in FuncTracker is locating small, but non-trivial pieces of shared code in otherwise unrelated binaries. Computing whole binary similarity is too coarse as we expect the overall similarity to be low. Using a block level is too fine grain as it is likely there will be a large number of small, trivial pieces of code matching causing much noise. Functions, however, should provide a nice middle ground.

Functions are represented by the set of blocks (from the control flow graph) they contain. Blocks are represented by four features created using BinJuice Lakhotia, Preda, and Giacobazzi (2013): Code, Semantics, GenSemantics (termed “Juice”), and GenCode. Each of these features is a single string representing a different layer of abstraction. An illustrative example is given below.

mov  eax,0x5
add  ebx,0x4
imul eax,ebx
eax = 5
ebx = def(ebx) x 5 + 20
(a) Code (c) Semantics
mov  A, N1
add  B, N2
imul A, B
A = N1
B = def(B) x N1 + N2
(b) GenCode (d) GenSemantics

Code is, as the name implies, the disassembled code of the function. This includes the actual instruction (its mnemonic) and the operands. A common abstraction from code is to use only the instruction mnemonics and not the operands (Khoo, Mycroft, and Anderson 2013; Lakhotia et al. 2013; Zynamics 2013). However, by ignoring operands any data flow information is lost. BinJuice defines the feature GenCode that abstracts away from Code any specific registers and memory locations, using logic variable to capture the data flow relationships. The abstraction is performed such that two Code features that differ only in the names of registers and addresses used will map to the same GenCode feature.

The two syntax based features described above allow for strong matches, but they are also easy to break. Simply reordering instructions or substituting one instruction for a semantically equivalent one will result in a different Code and GenCode feature. Thus, BinJuice provides the Semantics feature using symbolic interpretation and algebraic simplification to represent the operational semantics of a block. The Semantics feature is designed to be the same for two blocks that are functionally equivalent, regardless of what the code looks like. GenSemantics is analogous to GenCode in both motivation and creation. Both Semantics and GenSemantics have a canonical form that allows for constant time comparison.

A major concern in searching for similar code is scale. Malware collections often number in the millions of binaries and each individual malware additionally contains many functions, multiplying the required scale by a hundred or thousand fold. In order to efficiently handle this scale of comparisons, we use hashes instead of classic distance functions. Four different kinds of hashes are created, one corresponding to each type of BinJuice feature. The hashes are created by sorting the features representing the blocks in the function, concatenating them, and taking a cryptographic hash. Lest this method appear naïve, work by Cohen and Havrilla (2009) showed that even a hash of raw bytes of a function, abstracted simply by zero-ing bytes representing memory addresses, is very effecting in finding similar code. Using hashes of semantics ought to provide a larger set of functions allowing a constant time lookup of similar functions.

Exploring Relationships

Discovered relationships must still be presented to an analyst in a usable and comprehensible format. We represent these relationships using a graph structure. Nodes in the graph represent objects such as a binary or a function; edges indicate a relationship between objects. Properties can additionally be attached to a node. Such a representation allows for intuitive visualizations and an easy method of constructing queries to explore the relationships. Searches can be performed by either querying for nodes with similar properties or by traversing the graph.

In our selected graph structure, the permitted types of nodes are binaries, functions, and blocks. A binary contains (has edges between) functions and a function contains blocks. Blocks have their respective BinJuice features attached as properties and functions have the various hashes as properties. An example query is to start from a binary of interest, traverse the graph to find all contained functions, query for functions with similar GenCode hash, and then traverse from the returned nodes to binaries. This results in returning binaries that share some function based on GenCode hash, along with the matching functions.

Graph representation offers the additional benefit of extensibility. The graph may be populated with additional types of nodes, edges, and properties; and all of the connections explored using graph traversal algorithms. For example, if known, we may associate binaries with their authors. We can then identify binaries likely written by the same author by traversing connections through similar functions. Similarly, if a set of functions in a particular binary are known to implement a certain behavior, this information may be associated with other functions with similar semantics.


We have built an initial prototype titled FuncTracker on top of the intelligence platform Palantir1. New binaries are first disassembled and split into functions using IDA Pro2. The resulting functions are then passed to BinJuice (Lakhotia, Preda, and Giacobazzi 2013), ignoring functions that IDA has marked as library code. FuncTracker next computes the hashes as described in Section [sec:discovering]. The graph structure described in Section [sec:exploring] is created and loaded into Palantir. The built in capabilities of Palantir can then be used to explore the newly discovered relationships, draw additional connections, and collaborate with other analysts.


This work was supported by Air Force Research Laboratory, Rome, NY and the Defense Advanced Research Projects Agency under contract FA8750-10-C-0171, with thanks to Mr. Timothy Fraser. The views expressed are those of the authors and do not reflect the official policy or position of the US Department of Defense or the U.S. Government.


Bayer, Ulrich, Paolo Milani Comparetti, Clemens Hlauschek, Christopher Kruegel, and Engin Kirda. 2009. “Scalable, Behavior-Based Malware Clustering.” In 16th Annual Network & Distributed Security Symposium (NDSS’2009), 9:8–11. Internet Society.

Christodorescu, M., and S. Jha. 2003. “Static Analysis of Executables to Detect Malicious Patterns.” In Proc. of the 12th USENIX Security Symposium.

Cohen, Cory, and Jeffrey S. Havrilla. 2009. “Function Hashing for Malicious Code Analysis.” In CERT Research Annual Report, 26–29. Software Engineering Institute, Carnegie Mellon University.

Gao, Debin, Michael K. Reiter, and Dawn Song. 2008. “Binhunt: Automatically finding semantic differences in binary programs.” In Information and Communications Security, 238–255. Springer.

Gostev, Alexander, and Igor Soumenkov. 2011. “Stuxnet/Duqu: The Evolution of Drivers.”

Jang, Jiyong, David Brumley, and Shobha Venkataraman. 2011. “BitShred: Feature hashing malware for scalable triage and semantic analysis.” In Proc. of the 18th ACM Conf. on Computer and Communications Security (CCS), 309–320. ACM.

Jin, W., S. Chaki, C. Cohen, A. Gurfinkel, J. Havrilla, C. Hines, and P. Narasimhan. 2012. “Binary Function Clustering Using Semantic Hashes.” In 11th International Conf. on Machine Learning and Applications (ICMLA), 1:386–391. doi:10.1109/ICMLA.2012.70.

Khoo, Wei Ming, Alan Mycroft, and Ross Anderson. 2013. “Rendezvous: A Search Engine for Binary Code.” In Proc. of the 10th Working Conf. on Mining Software Repositories (MSR), 329–338. IEEE Press.

Kruegel, Christopher, Engin Kirda, Darren Mutz, William Robertson, and Giovanni Vigna. 2006. “Polymorphic worm detection using structural information of executables.” In Recent Advances in Intrusion Detection, 207–226. Springer.

Lakhotia, Arun, Mila Dalla Preda, and Roberto Giacobazzi. 2013. “Fast location of similar code fragments using semantic ’juice’.” In Proc. of the 2nd ACM SIGPLAN Program Protection and Reverse Engineering Workshop (PPREW). ACM.

Lakhotia, Arun, Andrew Walenstein, Craig Miles, and Anshuman Singh. 2013. “VILO: A rapid learning nearest-neighbor classifier for malware triage.” Journal of Computer Virology and Hacking Techniques (mar): 1–15. doi:10.1007/s11416-013-0178-3.

Roussev, Vassil. 2010. “Data fingerprinting with similarity digests.” In Advances in Digital Forensics VI, 207–226. Springer.

Saebjornsen, Andreas, Jeremiah Willcock, Thomas Panas, Daniel Quinlan, and Zhendong Su. 2009. “Detecting code clones in binary executables.” In Proc. of the 18th International Symposium on Software Testing and Analysis (ISSTA), 117–128. ACM.

Zynamics. “BinDiff 3.2 Manual.”