Semantic Representation of Assembly Architectures

It is a very common need in areas such as binary translation, program analysis, and compilers to represent the low level assembly instructions in a more abstract semantic representation. This semantic representation is often referred to as an Intermediate Language or Representation (IL/IR). These IRs provide a standardized method of performing the same operations over many different types of architectures without having to write separate operations for every supported architecture. For example, an optimizing compiler can contain a single optimization procedure that operates on an IR rather than many procedures, each having to handle the quirks of the architecture they were targeting.

The pie in the sky is thus a semantic representation that all architectures could be translated to and from upon which all types of analysis could be performed. However, like most pies in the sky, this dream does not conform to reality. What does exists is a large number of custom solutions: semantic representations created to solve a specific problem on specific architectures. Custom solutions, however, can often be adapted to solve another problem, as long as you know the custom solution exists and how its problem differs from yours. Thus, we have taken it upon ourselves to catalog and describe the various semantic representations that already exist and the problem they seek to solve. Our intention is that this information be used to determine if there currently exists a solution close enough to the problem to be solved that it can be adapted, or if a completely new solution is warranted.

The semantic representations we will cover are the following:

  • The Semantic Specification Language (SSL) and the Jakstab platform
  • The static analysis component of the BitBlaze project (Vine)
  • The Binary Analysis Platform (BAP)
  • The ROSE Compiler
  • The Transformer Specification Language (TSL)
  • A GCC-based infrastructure for binary analysis

Background

So what is a semantic representation of an instruction set architecture (ISA)? Semantics of an ISA describe how the instructions operate on a high level, removed from any specific architecture.

What do people use semantic representations for? Binary translation: we need to translate blocks of assembly from one architecture to another. We want to write only one piece of software to analyze executables created from any popular architecture. Developing a semantic representation of the same operations across different languages allows for this.

When undergoing translation (usually referred to as “lifting”) the specific assembly instructions are transformed into a higher-level semantic language that describes the operations of each instruction, sometimes including any side-effects included with the instruction operations: stack changes, memory changes, status flag changes, etc.

For the most part, the frameworks for semantic representation all attempt to solve the problem in similar ways. The main difference between tools are the architectures supported, formatting, and explicit listing of side effects of instructions.

SSL / Jakstab

Overview

The Semantic Specification Language (SSL), (Cifuentes, Sendall, 1998) is a framework for creating a high-level semantic description of an architecture. It was specifically developed for integration into a retargetable binary analysis framework. SSL describes the semantics of each assembly instruction by means of multiple statements. Each instruction is broken down to its simple expressions (arithmetic, bitwise, logical, and ternary), values, and statements. SSL also allows for register transfers for a group of instructions to be grouped into a table.

As an example of a framework using SSL, we look at the Jakstab project Boomerang Decompiler to describe the multiple assembly architectures it supports, including x86, PowerPC, 68K, and SPARC. SSL maps each assembly instruction to its semantic specification, which creates a representation of the program that Jakstab uses to reconstruct the flow of control.

The problem SSL sets out to solve is to create a framework that can easily be utilized by any analysis framework in order to handle multiple architectures well. As such, SSL specifications lose very little information when mapping from an ISA. Each and every instruction in an ISA maps to a unique instruction in the specification, thus preserving side effects. However, SSL is only meant to specify one language at a time. To introduce a new language, a new specification is needed for the tool to process the new language. This is seen with the Boomerang Decompiler and the Jakstab tool.

Example

Here is an example of the IR (intermediate representation) Jakstab produces from a chunk of assembly, using an SSL definition for x86. The original assembly is first, followed by the IL.

mov esi, [0x38498] jmp 0x1fae2 push [ebp – 4] call[0x38588] lea eax, [ebp – 4] push eax push[ebp + 8] call esi cmp[ebp – 4],0 jne 0x1fad9 esi := mem_32[0x38498]; goto L2; L1: mem_32[esp – 4] := mem_32[ebp – 4]; esp := esp – 4; esp := esp – 4; mem_32[esp] := 0x1FAE2; goto mem_32[0x38588]; L2: eax := ebp – 4; mem_32[esp – 4] := eax; esp := esp – 4; mem_32[esp – 4] := mem_32[ebp + 8]; esp := esp – 4; esp := esp – 4; mem_32[esp] := 0x1FAEB; goto esi; tmp := mem_32[ebp – 4] – 0; CF := tmp@31 & (!mem_32[ebp – 4]@31); OF := mem_32[ebp – 4]@31 & (!tmp@31); NF := tmp@31; if (tmp = 0) then ZF := 1 else ZF := 0; if (ZF = 0) then goto L1;

BAP

Overview

BAP, the Binary Analysis Platform, (Brumley, Jager, Avgerinos, Schwartz, 2011) is an infrastructure for binary analysis developed by a team from Carnegie Mellon. BAP performs binary analysis in three steps: disassemble the binary into assembly, lift assembly to an intermediate language, and perform analysis on the IL. Originally Vine, the previous iteration of the BAP project, lacked formal semantics for its IL. Vine was eventually redesigned into BAP, with two intersting goals in mind: develop a simple IL with formally-defined semantics, and explicitly represent all side-effects of assembly to allow for semantic analysis.

Currently, the BAP package uses an internal linear sweep disassembler built on top of GNU-libopcodes (the disassembler library distributed as part of GNU binutils), but it may also interface with IDA Pro. Finally, the assembly is passed to VEX, Valgrind’s RISC-like IR, to translate the assembly into the VEX intermediate language, which is then translated again into BIL. We’re mostly interested with this last part, the BIL translation.

Example

The following is an assembly program given as input to BAP’s toil tool on the left, with un-simplified IL output on the right.

mov %eax, 0 mov %ebx, 0 add %eax, %ebx jz target jmp target2 target: nop target2: nop addr 0×0 @asm “mov %eax,0×0″ label pc_0x0 mem_45:?u32 = mem_45:?u32 with [0:u32, e_little]:u32 = R_EAX_5:u32 addr 0×5 @asm “mov %ebx,0×0″ label pc_0x5 mem_45:?u32 = mem_45:?u32 with [0:u32, e_little]:u32 = R_EBX_6:u32 addr 0xb @asm “add %eax,%ebx” label pc_0xb t_66:u32 = R_EBX_6:u32 R_EBX_6:u32 = R_EBX_6:u32 + R_EAX_5:u32 R_CF_10:bool = R_EBX_6:u32 < t_66:u32 R_AF_12:bool = 0×20:u32 == (0×20:u32 & (R_EBX_6:u32 ^ t_66:u32 ^ R_EAX_5:u32)) R_OF_15:bool = high:bool((t_66:u32 ^ ~R_EAX_5:u32) & (t_66:u32 ^ R_EBX_6:u32)) R_PF_11:bool = ~low:bool(R_EBX_6:u32 >> 7:u32 ^ R_EBX_6:u32 >> 6:u32 ^ R_EBX_6:u32 >> 5:u32 ^ R_EBX_6:u32 >> 4:u32 ^ R_EBX_6:u32 >> 3:u32 ^ R_EBX_6:u32 >> 2:u32 ^ R_EBX_6:u32 >> 1:u32 ^ R_EBX_6:u32) R_SF_14:bool = high:bool(R_EBX_6:u32) R_ZF_13:bool = 0:u32 == R_EBX_6:u32 addr 0xd @asm “je 0×0000000000000011″ label pc_0xd cjmp R_ZF_13:bool, 0×11:u32, “nocjmp0″ label nocjmp0 addr 0xf @asm “jmp 0×0000000000000012″ label pc_0xf jmp 0×12:u32 addr 0×11 @asm “nop” label pc_0x11 addr 0×12 @asm “nop” label pc_0x12

ROSE Compiler

Overview

The ROSE Compiler is a framework for constructing source-to-source translators. Currently, it supports C, C++, and Fortran as source code input and output, and x86, Power-PC, and ARM instruction sets for binary analysis.

When given source code or a binary, ROSE generates an AST, representing the explicit structure of the code or executable. The nodes on this AST graph are the IR

ROSE uses SAGE III for its intermediate representation language. SAGE III is an improvement of SAGE II, which was based on SAGE++. SAGE III is automatically generated, using ROSETTA, a tool packaged with ROSE. ROSETTA is an IR generator that, for the most part, generates SAGE III. More information on the SAGE projects and ROSETTA can possibly be found here..

The ROSE Compiler uses a set of templated classes to translate the x86 instructions, using hard-coded instruction semantics. From their documentation, the templated
X86InstructionSemantics class does all the translation work, by translating x86 instructions to RISC-like instructions, and using them within their semantic policies.

BDD-Based Infrastructure for Binary Analysis

Overview

Niranjan Hasabnis’ research proficieny exam (Hasabnis, 2011) describes a method by which GCC’s register transfer language (RTL) may be used as a low-level IR. There are two forms of GCC RTL: strict and non-strict. Strict RTL is machine-specific and is tied to specific registers and instructions. Non-strict RTL, however, is not tied to any specific machine semantics. In the method described, the program is first disassembled, then converted to strict RTL, then lifted to non-strict RTL. Two plugins to GCC were developed: one that managed the instrumentation of the strict RTL, and another that lifted strict RTL to non-strict RTL. To get access to the strict RTL from GCC, the latter plugin is inserted into the normal GCC passes right before assembly is generated. The strict RTL is treated as if it was generated from assembly to strict RTL, and then lifted to non-strict RTL.

TSL

Overview

TSL (Transformer Specification Language) generates tools for use in machine-code interpreting; precisely, TSL is an “abstract-transformer- generator generator.” In creating tools, the user specifies the semantics of a language, and from that semantic specification, the TSL compiler creates an intermediate representation of that language. This is very useful, as TSL can technically support any architecture, given the user provides the semantic specification. From the IR, TSL creates multiple implementations of abstract interpreters for the original language.

Example

Here’s a fragment of the TSL specification of x86′s semantics, including MOV and ADD


// User-defined abstract syntax
reg: EAX() | EBX() | . . . ;
flag: ZF() | SF() | . . . ;
operand: Indirect(reg reg INT8 INT32)
    | DirectReg(reg)
    | Immediate(INT32) | ...;
instruction
    : MOV(operand operand)
    | ADD(operand operand| . . . ;
state: State(MAP[INT32,INT8] // memory-map
    MAP[reg32,INT32] // register-map
    MAP[flag,BOOL]); // flag-map
// User-defined functions
INT32 interpOp(state S, operand op) { . . . };
state updateFlag(state S, . . . ) { . . . };
state updateState(state S, . . . ) { . . . };
state interpInstr(instruction I, state S) {
    with(I) (
        MOV(dstOp, srcOp):
            let srcVal = interpOp(S, srcOp);
            in ( updateState( S, dstOp, srcVal ) ),
        ADD(dstOp, srcOp):
            let dstVal = interpOp(S, dstOp);
                srcVal = interpOp(S, srcOp);
                res = dstVal + srcVal;
                S2 = updateFlag(S, dstVal, srcVal, res);
           in ( updateState( S2, dstOp, res ) ),
        . . .
    );
};

Here’s an example of a simplified snippet of the common intermediate representation (CIR) of x86 generated from the semantic specification


template  class CIR {
class reg { . . . };
class EAX : public reg { . . . }; . . .
class flag { . . . };
class ZF : public flag { . . . }; . . .
class operand { . . . };
class Indirect: public operand { . . . }; . . .
class instruction { . . . };
class MOV : public instruction { . . .
    operand op1; operand op2; . . .
};
class ADD : public instruction { . . . }; . . .
class state { . . . };
class State: public state { . . . };
INTERP::INT32 interpOp(state S, operand op) { . . . };
state updateFlag(state S, . . . ) { . . . };
state updateState(state S, . . .) { . . . };
state interpInstr(instruction I, state S) {
    switch(I.id) {
        case ID MOV: . . .
        case ID ADD:
            operand dstOp = I.get child1();
            operand srcOp = I.get child2();
            INTERP::INT32 dstVal = interpOp(S, dstOp);
            INTERP::INT32 srcVal = interpOp(S, srcOp);
            INTERP::INT32 res = INTERP::Plus(dstVal, srcVal);
            state S2 = updateFlag(S, dstVal, srcVal, res);
            ans = updateState( S2, dstOp, res );
            break;
        . . . }
}
};

References

Cifuentes, Cristina, and Shane Sendall. 1998. “Specifying the Semantics of Machine Instructions.” In 6th International Workshop on Program Comprehension, 126–133. doi:10.1109/WPC.1998.693332.
Kinder, Johannes, and Helmut Veith. 2008. “Jakstab: A Static Analysis Platform for Binaries.” In International Conference on Computer Aided Verification.
Brumley, David, Ivan Jager, Thanassis Avgerinos, and Edward J. Schwartz. 2011. “BAP: A Binary Analysis Platform.” In Computer Aided Verification, edited by Ganesh Gopalakrishnan and Shaz Qadeer, 463–469. Lecture Notes in Computer Science 6806. Springer Berlin Heidelberg. http://link.springer.com/chapter/10.1007/978-3-642-22110-1_37.
Hasabnis, Niranjan. 2011. “Infrastructure for Architecture-Independent Binary Analysis and Transformation.”

Leave a Reply

Your email address will not be published. Required fields are marked *

CAPTCHA *