The first thing needed in order to disassemble a program is (obviously) a place to start. For Windows executables, this is usually discovered by parsing the PE header of the executable. This well-known, well-documented data structure contains the information that the program loader needs to execute the file. That information includes a mapping of the program’s file contents to memory, the libraries that need to be loaded into memory, and the thing we are looking for: the entry point of the program.
As widely documented as the PE format is, most PE parsers can still fail on a great number of valid programs. Many of the fields in the header are not necessary beyond alignment requirements, and severely malformed programs can still sometimes execute while crashing common disassemblers. Many PE parsers attempt to fully map out every single data field in a PE file, but this is unnecessary in a disassembler.
To counteract these deficiencies in most parsers, we needed to define which header values must be correct. The disassembler has its own custom-written PE parser, which relies only on the information that absolutely must be correct for a PE file to be disassembled:
E_MAGIC (From the old DOS header)
E_LFANEW (Pointer to the new header)
Image Base (Beginning of the program in virtual memory)
PE Signature (Like E_MAGIC)
Number of Sections (This would likely crash the Windows loader if incorrect)
Optional Header Type (Needed to specify which loading structure to use)
Optional Header Entry Point Address (Although not exactly necessary)
Section Header Virtual Address (So the loader doesn’t write kernel memory)
Section Header Characteristics (Only for later Windows versions)
TLS Information (If this is present and invalid, the program cannot finish loading)
*Note: these fields differ slightly from what is absolutely necessary for a program to execute, because we don’t actually care about physical execution.
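To make the list above a little more concrete, here is a minimal sketch in Python of a parser that reads only those header fields and ignores everything else. It is not the disassembler’s actual parser: the function name parse_minimal_pe and the shape of the returned dictionary are made up for illustration, and the offsets are the standard ones from the published PE layout. TLS handling is left out here.

```python
import struct

def parse_minimal_pe(data: bytes) -> dict:
    """Read only the header fields the list above treats as essential.

    Hypothetical helper for illustration; not the disassembler's real parser.
    """
    # E_MAGIC: the old DOS header must start with "MZ".
    if data[:2] != b"MZ":
        raise ValueError("bad E_MAGIC")
    # E_LFANEW: file offset of the PE header, stored at 0x3C.
    (e_lfanew,) = struct.unpack_from("<I", data, 0x3C)
    # PE signature: "PE" followed by two zero bytes.
    if data[e_lfanew:e_lfanew + 4] != b"PE\0\0":
        raise ValueError("bad PE signature")

    coff = e_lfanew + 4  # 20-byte COFF file header
    (num_sections,) = struct.unpack_from("<H", data, coff + 2)
    (size_of_optional,) = struct.unpack_from("<H", data, coff + 16)

    opt = coff + 20  # optional header
    (magic,) = struct.unpack_from("<H", data, opt)
    (entry_rva,) = struct.unpack_from("<I", data, opt + 16)
    if magic == 0x10B:    # PE32: 32-bit image base
        (image_base,) = struct.unpack_from("<I", data, opt + 28)
    elif magic == 0x20B:  # PE32+: 64-bit image base
        (image_base,) = struct.unpack_from("<Q", data, opt + 24)
    else:
        raise ValueError("unknown optional header type")

    # Section headers follow the optional header; each one is 40 bytes.
    sections = []
    first_section = opt + size_of_optional
    for i in range(num_sections):
        off = first_section + 40 * i
        (virtual_address,) = struct.unpack_from("<I", data, off + 12)
        (characteristics,) = struct.unpack_from("<I", data, off + 36)
        sections.append((virtual_address, characteristics))

    return {
        "e_lfanew": e_lfanew,
        "optional_magic": magic,
        "entry_point_rva": entry_rva,
        "image_base": image_base,
        "sections": sections,
    }
```

Everything this sketch does not read (timestamps, checksums, linker versions and so on) can hold garbage without affecting the result, which is the point of relying on so few fields.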
The only important information is which parts of the program might contain code, and which parts are definitely code. We can discard almost all of the data in a PE file except the parts that must be absolutely correct to achieve some useful function. Information that can be discarded includes the function import table, function export table, any debug symbols, and resource tables. We do still try to take advantage of the export table when it is present, but we perform a bit of sanity checking on its data, since malformed export data is one of the more common anti-disassembler attacks.
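One simple way to decide which sections “might contain code” is to look at the section Characteristics flags. The sketch below is my own illustration, not necessarily the rule the disassembler uses: the function name may_contain_code is hypothetical, while the two flag values are the standard IMAGE_SCN_CNT_CODE and IMAGE_SCN_MEM_EXECUTE constants.

```python
# Standard section characteristic flags from the PE format.
IMAGE_SCN_CNT_CODE = 0x00000020     # section is declared to contain code
IMAGE_SCN_MEM_EXECUTE = 0x20000000  # section is mapped as executable

def may_contain_code(characteristics: int) -> bool:
    """Hypothetical heuristic: treat a section as a code candidate
    if it is marked as code or as executable."""
    return bool(characteristics & (IMAGE_SCN_CNT_CODE | IMAGE_SCN_MEM_EXECUTE))
```

Since the list above notes that Characteristics only matter on later Windows versions, a cautious disassembler would probably treat this as a hint rather than a guarantee.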
I noted in the list above that the Entry Point address is not exactly necessary. This is because an entire program can be encoded in a program’s Thread Local Storage (TLS) initializers, which run in the order listed in the headers, before the entry point is ever reached. When we detect TLS functions from the headers, we immediately assume the entry point may not be valid (hiding code this way is a not-uncommon anti-debugging tactic). I’ll talk about the reasoning for and consequences of that assumption in a later post.
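As a rough illustration of the first step of that check, the sketch below looks up the TLS entry (index 9) in the optional header’s data directory. It is a simplified assumption of how such a check could look, not our actual implementation: the function name tls_directory is hypothetical, it treats a non-zero RVA as “TLS present”, and it does not follow the RVA to read the callback list itself.

```python
import struct
from typing import Optional, Tuple

TLS_DIRECTORY_INDEX = 9  # position of the TLS entry in the data directory

def tls_directory(data: bytes) -> Optional[Tuple[int, int]]:
    """Return (rva, size) of the TLS data directory entry, or None.

    Hypothetical helper; assumes the DOS and PE signatures were
    already validated by the caller.
    """
    (e_lfanew,) = struct.unpack_from("<I", data, 0x3C)
    opt = e_lfanew + 4 + 20  # skip PE signature and COFF header
    (magic,) = struct.unpack_from("<H", data, opt)
    if magic == 0x10B:    # PE32
        count_off, dirs_off = opt + 92, opt + 96
    elif magic == 0x20B:  # PE32+
        count_off, dirs_off = opt + 108, opt + 112
    else:
        raise ValueError("unknown optional header type")

    (num_dirs,) = struct.unpack_from("<I", data, count_off)
    if num_dirs <= TLS_DIRECTORY_INDEX:
        return None  # directory table is too short to hold a TLS entry
    rva, size = struct.unpack_from("<II", data, dirs_off + 8 * TLS_DIRECTORY_INDEX)
    if rva == 0:
        return None  # assumption: a zero RVA means no TLS directory
    return rva, size
```

A caller could then do something like `entry_point_trusted = tls_directory(data) is None` and fall back to treating the TLS callbacks as additional starting points when the result is not None.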
If you want to look more into PE Header parsing, check out this link.