Static analysis extracts as much information as possible without actually executing the application. In many ways it is similar to reverse engineering. Analysts want to determine the nature of a suspicious application, and this requires understanding what it does and how it works. Even without source code there are several ways to extract relevant information from compiled Android applications. This article will discuss static analysis techniques, comprehensive tools for static analysis, and which problems remain unsolved.
Features
The goal in static analysis of applications is often to classify an application as malicious or benign. In classification, features are used to make informed decisions. For example, an automatic system for classifying rats and mice may use weight and length as features. Given a weight and a length, the system could determine with some accuracy whether the measurements came from a rat or a mouse. With applications the features are more complex, but they serve the same purpose.
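The rat and mouse example can be made concrete with a very small classifier. The sketch below is only an illustration: the training measurements are invented for the example, and nearest-centroid classification is one simple technique chosen here, not a method taken from any of the cited papers.

```python
from math import dist

# Hypothetical training samples: (weight in grams, body length in cm).
# These numbers are illustrative only and are not measured data.
TRAINING = {
    "mouse": [(20.0, 8.0), (25.0, 9.0), (30.0, 10.0)],
    "rat": [(250.0, 22.0), (300.0, 24.0), (350.0, 26.0)],
}


def centroid(samples):
    """Average each feature over all samples of one class."""
    count = len(samples)
    width = len(samples[0])
    return tuple(sum(sample[i] for sample in samples) / count for i in range(width))


def classify(features, training=TRAINING):
    """Return the label whose centroid is nearest to the feature vector."""
    centroids = {label: centroid(samples) for label, samples in training.items()}
    return min(centroids, key=lambda label: dist(features, centroids[label]))


if __name__ == "__main__":
    print(classify((28.0, 9.5)))
    print(classify((280.0, 23.0)))
```

A malware classifier follows the same pattern, except that the feature vector holds values such as permission counts or component counts instead of weight and length.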
A significant part of Android’s built-in security is the permissions system. Permissions allow an application to access potentially dangerous API functionality. Many applications require several permissions to function properly. These permissions must be listed explicitly in the application’s AndroidManifest.xml file and accepted by the user during installation. Figure 1 presents a screenshot of a human-readable AndroidManifest.xml file. Analysts have observed that malicious applications request significantly more permissions than benign ones (Wu, Mao, Wei, Lee, Wu, 2012). This is expected, since permissions allow applications to perform actions that can potentially harm the user. Malicious applications also tend to request permissions that are unusual for their genre (Barrera, Kayacik, van Oorschot, Somayaji, 2010). For example, mobile games do not normally request permission to send SMS messages. The permissions an application requests and its genre are often used as features in static analysis.
Figure 1: AndroidManifest.xml. Note the permissions listed near the bottom.
Image Source: http://dj-android.blogspot.com/2011/11/how-use-google-map-in-android-apps-part.html
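Extracting permissions as features can be sketched in a few lines. The example assumes the manifest has already been decoded to plain-text XML (inside an APK it is stored in a binary format), and the sample manifest, the package name, and the per-genre baseline are all hypothetical values made up for illustration.

```python
import xml.etree.ElementTree as ET

# Namespace used by android:* attributes in AndroidManifest.xml.
ANDROID_NS = "http://schemas.android.com/apk/res/android"
NAME_ATTR = "{%s}name" % ANDROID_NS

# Hypothetical baseline of permissions assumed typical for one genre.
TYPICAL_BY_GENRE = {
    "game": {"android.permission.INTERNET", "android.permission.VIBRATE"},
}

# Hypothetical decoded manifest used only for demonstration.
SAMPLE_MANIFEST = """
<manifest xmlns:android="http://schemas.android.com/apk/res/android"
    package="com.example.game">
    <uses-permission android:name="android.permission.INTERNET" />
    <uses-permission android:name="android.permission.SEND_SMS" />
    <application />
</manifest>
"""


def requested_permissions(manifest_xml):
    """Return the sorted permission names declared in a decoded manifest."""
    root = ET.fromstring(manifest_xml.strip())
    names = (elem.get(NAME_ATTR) for elem in root.iter("uses-permission"))
    return sorted(name for name in names if name)


def unusual_permissions(permissions, genre):
    """Return permissions that fall outside the assumed baseline for a genre."""
    return sorted(set(permissions) - TYPICAL_BY_GENRE.get(genre, set()))


if __name__ == "__main__":
    perms = requested_permissions(SAMPLE_MANIFEST)
    print(len(perms), unusual_permissions(perms, "game"))
```

Both the permission count and the list of out-of-genre permissions could then be passed to a classifier as features.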
Android applications are largely a collection of components. There are four types of components in Android applications: activities, services, broadcast receivers, and content providers. Activities act as user interfaces and control what is currently visible to the user. Services perform operations that are not visible to the user. Broadcast receivers act as listeners and can act when a predefined broadcast is received. These three components can communicate with other components via messages called intents. Intents sent to services or activities contain data on operations to be performed. Intents sent to broadcast receivers describe an event that has happened and are called broadcasts. A broadcast could signal that the phone has been powered on, that a phone call is incoming, or many other events. Finally, content providers act as databases and usually use provider client objects to communicate with other applications. Researchers have noted that malicious applications tend to use more services and broadcast receivers than benign applications (Wu, Mao, Wei, Lee, Wu, 2012). Using broadcast receivers and services instead of activities helps malware hide its behavior from users. Some applications have vulnerabilities in the way their components communicate. For example, a vulnerable service may respond to messages from a malicious application in a confused deputy attack. Static analysis of inter-component communications can mitigate this risk (Chin, Felt, Greenwood, Wagner, 2011).
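Component counts can be read from the same manifest. The following sketch assumes a decoded, plain-text manifest; the sample manifest and its class names are hypothetical, and the function only counts declarations without attempting to judge them.

```python
import xml.etree.ElementTree as ET

# Manifest tags for the four component types.
COMPONENT_TAGS = ("activity", "service", "receiver", "provider")

# Hypothetical decoded manifest used only for demonstration.
SAMPLE_MANIFEST = """
<manifest xmlns:android="http://schemas.android.com/apk/res/android"
    package="com.example.app">
    <application>
        <activity android:name=".MainActivity" />
        <service android:name=".SyncService" />
        <service android:name=".UploadService" />
        <receiver android:name=".BootReceiver" />
    </application>
</manifest>
"""


def count_components(manifest_xml):
    """Count declared components of each type under <application>."""
    counts = {tag: 0 for tag in COMPONENT_TAGS}
    root = ET.fromstring(manifest_xml.strip())
    application = root.find("application")
    if application is not None:
        for child in application:
            if child.tag in counts:
                counts[child.tag] += 1
    return counts


if __name__ == "__main__":
    print(count_components(SAMPLE_MANIFEST))
```

An application with many services and receivers but few activities would stand out in these counts, which is the pattern the cited research associates with malware.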
Many static analysis techniques construct graphs based on code structure. One such graph is the data flow graph. It illustrates the flow of data by tracking which parts of a program define data and connecting them with the parts that use the data. Elish et al. (Elish, Yao, Ryder, ) use data flow as a source of features by determining whether risky API calls are connected in some way to data input by a user. For example, if an application sends a text message containing information provided by the user, then it is probably safe. If a text message is sent containing information generated without the user’s input, it is probably malicious.
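The user-input check can be sketched as a reachability query on a data flow graph. This is a simplified illustration and not the method of Elish et al.: the graphs are hand-written dictionaries, and the node labels (API names and variable names) are hypothetical stand-ins for what a real analysis would extract from bytecode.

```python
from collections import deque


def reaches(flow_edges, sources, sink):
    """Return True if data defined at any source node can flow to the sink."""
    queue = deque(sources)
    seen = set(sources)
    while queue:
        node = queue.popleft()
        if node == sink:
            return True
        for successor in flow_edges.get(node, ()):
            if successor not in seen:
                seen.add(successor)
                queue.append(successor)
    return False


# Hypothetical data flow graphs: each edge points from a definition to a use.
USER_DRIVEN = {
    "EditText.getText": ["message"],
    "message": ["SmsManager.sendTextMessage"],
}
NOT_USER_DRIVEN = {
    "EditText.getText": ["title"],
    "hardcoded_string": ["message"],
    "message": ["SmsManager.sendTextMessage"],
}

if __name__ == "__main__":
    sink = "SmsManager.sendTextMessage"
    print(reaches(USER_DRIVEN, ["EditText.getText"], sink))
    print(reaches(NOT_USER_DRIVEN, ["EditText.getText"], sink))
```

In the second graph the message text never depends on user input, so the risky call would be flagged for closer inspection.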
A second useful graph is the control flow graph. Its nodes are blocks of a program and its edges are the possible paths of execution between them. Figure 2 provides an example control flow graph. Woodpecker (Grace, Zhou, Wang, Jiang, 2012) is a tool for detecting security leaks in Android devices. It uses control flow graphs to search for paths of execution that allow the use of dangerous API calls without requesting permissions. Of the 13 permissions examined by the creators of Woodpecker, 11 could be bypassed in this fashion.
Figure 2: Simple control flow graph. Note that edges represent possible paths of execution and can split into multiple paths.
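A search for unguarded paths in a control flow graph can be sketched as a depth-first traversal that refuses to pass through permission checks. This is a toy illustration and not Woodpecker's implementation; the block names and the example graphs are hypothetical.

```python
def unguarded_path(cfg, entry, target, guards):
    """Return one path from entry to target that avoids every guard block.

    Returns None when every path to the target passes through a guard.
    """
    stack = [(entry, [entry])]
    visited = set()
    while stack:
        node, path = stack.pop()
        if node == target:
            return path
        if node in visited or node in guards:
            continue
        visited.add(node)
        for successor in cfg.get(node, ()):
            stack.append((successor, path + [successor]))
    return None


# Hypothetical control flow graphs: block name -> successor blocks.
LEAKY = {
    "entry": ["check_permission", "legacy_handler"],
    "check_permission": ["send_sms"],
    "legacy_handler": ["send_sms"],
    "send_sms": [],
}
GUARDED = {
    "entry": ["check_permission"],
    "check_permission": ["send_sms"],
    "send_sms": [],
}

if __name__ == "__main__":
    print(unguarded_path(LEAKY, "entry", "send_sms", {"check_permission"}))
    print(unguarded_path(GUARDED, "entry", "send_sms", {"check_permission"}))
```

A non-empty result means the dangerous call can be reached without the check, which is the kind of path such an analysis looks for.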
The last graph I will discuss is the dependence graph. In this graph the nodes represent components of the program and the edges represent some sort of dependence. For example, a class dependence graph represents which classes use methods from other classes. Walenstein et al. (Walenstein, Deshotels, Lakhotia, 2012) use such a class dependence graph to detect isolated classes within repackaged applications. Such classes were hypothesized to be more likely to have been injected by the repackager. It is possible to detect isolated nodes in a graph with two centrality metrics: betweenness and closeness. The closeness centrality of a node refers here to the average distance from that node to all other nodes in the graph (Bastian, Heymann, Jacomy, 2009), so a high value means the node is far from the rest of the graph. Betweenness centrality measures how often a node appears on shortest paths between nodes in the graph (Brandes, 2001). A node with high closeness and low betweenness is considered isolated. Figure 3 demonstrates a graph with nodes of varying betweenness and closeness.
Figure 3: The example graph above is colored with closeness centrality (from black=low to white=high). Nodes are sized with betweenness centrality (from small=low to large=high). Our treatment seeks out nodes with high closeness and low betweenness. Such targets would appear as white and small in this figure.
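Both metrics can be computed with breadth-first search on a small graph. The sketch below uses the definition of closeness given above (average distance, so larger means more remote) and Brandes' algorithm for betweenness on an unweighted, undirected graph. The class names and the example graph are hypothetical and are not taken from the cited study.

```python
from collections import deque


def closeness(graph):
    """Average shortest-path distance from each node to the nodes it can reach."""
    result = {}
    for source in graph:
        dist = {source: 0}
        queue = deque([source])
        while queue:
            node = queue.popleft()
            for nbr in graph[node]:
                if nbr not in dist:
                    dist[nbr] = dist[node] + 1
                    queue.append(nbr)
        others = [d for node, d in dist.items() if node != source]
        result[source] = sum(others) / len(others) if others else 0.0
    return result


def betweenness(graph):
    """Brandes' algorithm for an unweighted, undirected graph."""
    score = {node: 0.0 for node in graph}
    for source in graph:
        order = []                               # nodes in BFS visiting order
        preds = {node: [] for node in graph}     # shortest-path predecessors
        sigma = {node: 0 for node in graph}      # number of shortest paths
        sigma[source] = 1
        dist = {source: 0}
        queue = deque([source])
        while queue:
            node = queue.popleft()
            order.append(node)
            for nbr in graph[node]:
                if nbr not in dist:
                    dist[nbr] = dist[node] + 1
                    queue.append(nbr)
                if dist[nbr] == dist[node] + 1:
                    sigma[nbr] += sigma[node]
                    preds[nbr].append(node)
        delta = {node: 0.0 for node in graph}
        for node in reversed(order):             # accumulate dependencies
            for pred in preds[node]:
                delta[pred] += sigma[pred] / sigma[node] * (1 + delta[node])
            if node != source:
                score[node] += delta[node]
    # Each undirected pair was counted once from each endpoint.
    return {node: value / 2 for node, value in score.items()}


# Hypothetical class dependence graph as an undirected adjacency list.
CLASSES = {
    "Main": ["Ui", "Net", "Db"],
    "Ui": ["Main"],
    "Net": ["Main"],
    "Db": ["Main", "Injected"],
    "Injected": ["Db"],
}

if __name__ == "__main__":
    print(closeness(CLASSES))
    print(betweenness(CLASSES))
```

In this toy graph the class named Injected has the largest average distance and lies on no shortest paths, so it matches the high-closeness, low-betweenness profile described above.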
Opcodes are often used as features in static analysis. An opcode is the part of a machine language instruction that determines the type of operation to be performed. Many malicious activities have characteristic opcode patterns that can be detected by classifiers. Opcodes are also useful in detecting similarities between parts of programs. DroidMOSS (Zhou, Zhou, Jiang, Ning, 2012) is a tool for detecting repackaged applications. DroidMOSS uses a sliding window to efficiently compare large sections of opcodes in an unverified application to those of other applications. Even if malicious code has been injected into a benign application, DroidMOSS is likely to detect plagiarism as it scans the benign sections of the code.
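The idea of comparing opcode sequences through a sliding window can be illustrated with a set-based similarity score. This is a simplified stand-in and should not be read as the DroidMOSS algorithm; the window size and the two opcode sequences are hypothetical, although the mnemonics are ordinary Dalvik instruction names.

```python
def opcode_windows(opcodes, size=3):
    """Return the set of all contiguous opcode windows of the given size."""
    return {tuple(opcodes[i:i + size]) for i in range(len(opcodes) - size + 1)}


def window_similarity(first, second, size=3):
    """Jaccard similarity between the window sets of two opcode sequences."""
    a = opcode_windows(first, size)
    b = opcode_windows(second, size)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


# Hypothetical opcode sequence of an original application.
ORIGINAL = [
    "const/4", "invoke-virtual", "move-result",
    "if-eqz", "invoke-static", "return-void",
]
# The same sequence with two hypothetical injected instructions appended.
REPACKAGED = ORIGINAL + ["new-instance", "invoke-direct"]

if __name__ == "__main__":
    print(window_similarity(ORIGINAL, ORIGINAL))
    print(window_similarity(ORIGINAL, REPACKAGED))
```

Because the untouched part of the repackaged sequence still produces the same windows, the score stays well above zero even after code is injected.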
Each application can import packages in order to reuse code. Some packages are more likely to be used by malware than others, which makes imported packages a significant feature for static analysis. DroidRanger (Zhou, Wang, Zhou, Jiang, 2012) uses imported packages and other features extracted from the structural layout of the application in malware classification.
Teufl et al. (Teufl, Kraxberger, Orthacker, Lackner, Gissing, Marsalek, Leibetseder, Prevenhueber, 2012) use metadata from market descriptions of applications as features for static analysis. They extract the permissions required, description, download count, price, and category of each application analyzed. Their system uses this data to answer several security-related questions. One such question, “Is there a difference in the typical permissions when comparing free and payed apps with the terms ‘hot’ and ‘girl’ in their description”, prompted significant results. According to their data, free applications with mature content request significantly more permissions and are more likely to contain malware than similar paid applications.
Android allows developers to program in C or C++, and this process is referred to as native development. It is called native because the code is compiled into machine code instead of Dalvik bytecode. This technique is normally used to increase the speed of certain types of CPU-intensive code and to reuse existing C or C++ code. Static analysis can use the contents of native code, and how that code is used, as features. DroidRanger (Zhou, Wang, Zhou, Jiang, 2012) extracts system calls made by native code and looks for attempts to hide native code. It is considered suspicious if an application stores native code in non-default locations. Such behavior could be an attempt to thwart static analysis for malware detection.
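A check for native code in non-default locations can be sketched by scanning the APK, which is a ZIP archive, for ELF binaries outside the lib/ directory where native libraries are normally packaged. This is a minimal illustration and not DroidRanger's implementation; the demo archive and its file names are hypothetical.

```python
import io
import zipfile

# Every ELF binary starts with these four bytes.
ELF_MAGIC = b"\x7fELF"


def misplaced_native_code(apk_file):
    """List ELF files stored outside the APK's lib/ directory."""
    suspicious = []
    with zipfile.ZipFile(apk_file) as apk:
        for info in apk.infolist():
            if info.is_dir() or info.filename.startswith("lib/"):
                continue
            with apk.open(info) as member:
                if member.read(4) == ELF_MAGIC:
                    suspicious.append(info.filename)
    return sorted(suspicious)


def build_demo_apk():
    """Build a hypothetical in-memory archive that mimics an APK layout."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as apk:
        apk.writestr("classes.dex", b"dex\n035\x00")
        apk.writestr("lib/armeabi/libgame.so", ELF_MAGIC + b"normal library")
        apk.writestr("assets/logo.png", ELF_MAGIC + b"disguised binary")
    buffer.seek(0)
    return buffer


if __name__ == "__main__":
    print(misplaced_native_code(build_demo_apk()))
```

Checking the file header rather than the extension matters here, because a binary that is being hidden is unlikely to keep a .so suffix.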
Comprehensive Tools
While many of the features listed previously are discussed in academic papers, the experiments validating their usefulness often focus only on the feature being proposed. However, there are tools that attempt to combine as many relevant features as possible in order to perform static analysis most effectively.
Androguard is an open source tool for detecting malware and reverse engineering Android applications. It provides several useful features such as detection of repackaging and visualization through Gephi (Bastian, Heymann, Jacomy, 2009). Developers can even create new plugins to expand Androguard’s functionality.
DroidMat (Wu, Mao, Wei, Lee, Wu, 2012) claims to outperform Androguard in accuracy and efficiency. Wu et al. present an incredibly thorough overview of both static and dynamic analysis techniques. DroidMat utilizes several features including permissions, deployment of components, intent messages, and API calls. It also performs different types of clustering to classify applications as benign or malicious.
Open Problems
DroidMat and Androguard are very effective at detecting malware, but to compete with the adaptations of attackers, new tools will be necessary. Malware classification systems can always be more accurate, robust, or efficient. Users must be aware of their existence and able to easily use such tools.
When analyzing a malicious program it is sometimes difficult to determine which part of the program is malicious. Many classification systems would be more accurate and efficient if they only analyzed the malicious code in an executable. A means of separating malicious code from benign would be a great boon to malware analysis.
Overcoming code obfuscation is a difficult task. Obfuscation can often prevent static analysis, especially if the obfuscator is aware of the techniques analysts will use. For example, an obfuscation technique that modifies the control flow of a program might make control flow analysis less effective. To combat obfuscation, more robust analysis systems that use a wide range of features are necessary.
Applications can download code to execute during run time. This code is impossible to analyze statically because it is not present before the program has been executed. However, it can still be helpful to detect whether a program will dynamically load code or not. The location it will download from is significant as well. If this location is a well known source of malware then the program can be considered malicious.
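Detecting whether an application is likely to load code dynamically can be approximated by searching its dex file for references to Android's class loaders and for embedded URLs. The sketch below is a rough heuristic under stated assumptions: it scans raw bytes instead of parsing the dex format, it can be evaded by obfuscation or reflection, and the sample bytes and the blocklisted host are hypothetical.

```python
import re

# Type descriptors of Android class loaders that can load external code.
LOADER_DESCRIPTORS = (
    b"Ldalvik/system/DexClassLoader;",
    b"Ldalvik/system/PathClassLoader;",
)

# Loose pattern for http(s) URLs embedded as ASCII strings.
URL_PATTERN = re.compile(rb"https?://[A-Za-z0-9._~:/?#@!$&'()*+,;=%-]+")

# Hypothetical blocklist of hosts known to distribute malware.
KNOWN_BAD_HOSTS = {"malware.example"}


def loader_references(dex_bytes):
    """Return the class-loader descriptors that appear in the raw dex data."""
    return [d.decode("ascii") for d in LOADER_DESCRIPTORS if d in dex_bytes]


def embedded_urls(dex_bytes):
    """Return the sorted, de-duplicated URLs found in the raw dex data."""
    return sorted({m.decode("ascii") for m in URL_PATTERN.findall(dex_bytes)})


def blocklisted_urls(dex_bytes, bad_hosts=KNOWN_BAD_HOSTS):
    """Return embedded URLs whose host is on the blocklist."""
    hits = []
    for url in embedded_urls(dex_bytes):
        host = url.split("://", 1)[1].split("/", 1)[0].split(":", 1)[0]
        if host in bad_hosts:
            hits.append(url)
    return hits


# Hypothetical fragment standing in for the string section of a dex file.
SAMPLE = (
    b"\x00Ldalvik/system/DexClassLoader;\x00"
    b"http://malware.example/payload.dex\x00"
    b"https://cdn.example.org/icons.zip\x00"
)

if __name__ == "__main__":
    print(loader_references(SAMPLE))
    print(blocklisted_urls(SAMPLE))
```

A hit from either function does not prove malicious intent, but it tells the analyst that some of the program's behavior lies outside the code that was available for static analysis.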
Many malware analysts are interested in tracing malware back to its authors or being able to show how a malicious executable is related to others. Such information is useful when prosecuting cyber criminals or building robust security systems. The Android Malware Genome Project (Zhou, Jiang, 2012) attempts to solve such problems. However, more tools are needed that can detect similarities in malware and help analysts share malware and relevant data associated with it.
Android struggles with application verification. Several partial solutions have been implemented, but researchers have shown these to be insufficient. Android 4.2 includes a default application scanning service that checks applications for malware during installation. Jiang tested this service and found that it detected only 15.32% of the malware in the Android Malware Genome Project. Google has also acquired VirusTotal, a malware analysis service. This acquisition could improve Android’s application verification in the future.
References
Joint Conference On, 62–69. http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=6298136.