Reverse engineer’s toolkit

February 13th, 2007

If you’re planning on reverse engineering something (or debugging unfamiliar code), then it’s important to have the right tools for the job. This is a short list of some of the various tools that I find useful for this line of work (some are free, others are not; they are good tools, though, so I would encourage you to support the authors); if I am going to be doing extensive reversing (or debugging) work on a system, I’ll typically install most (or all) of these tools:

  • WinDbg (Debugging Tools for Windows); the de facto debugger for Windows NT-based systems (and it’s freely available). Although there are other good debuggers out there (such as OllyDbg), it’s hard to beat the fact that Microsoft develops the debugger, and the designers have a great deal of information about and experience with the internals of the operating system. WinDbg comes with a plethora of plugin modules (extensions) that automate and improve many common tasks, and it has a well-documented and powerful interface to allow third parties (such as yourself) to program new extension modules for customized tasks. Additionally, WinDbg supports a (relatively crude, but effective nevertheless) scripting language. WinDbg can be used on 32-bit and 64-bit (IA64 and x64) targets in both user mode and kernel mode, and is the only supported facility for debugging problems at customer sites or debugging certain specialized scenarios (like certain infrastructure processes, or fullscreen processes). The Debugging Tools for Windows distribution also includes ntsd, cdb, and kd, which are different (command-line) front-ends to the same debugger engine used by WinDbg. It also includes various other useful utilities, such as Logger (an API call spy program).
  • IDA Pro, or the Interactive Disassembler. IDA is the most full-featured disassembler (by a huge margin) out there that I know of, and it supports a wide variety of target architectures and operating systems to boot (although its support for x86 is probably the most full-featured of all the target architectures that you can use it on). IDA automates many of the aspects of disassembly and reverse engineering that have historically been tedious and error-prone, such as stack variable tracking. The full version of IDA is relatively pricey (on the order of $500..$900 or so, depending on whether you need x64 support), but if you’re serious about reverse engineering, it’s a good investment. For the casual reverse engineer, DataRescue has released a previous version, IDA 4.30, for free. If you are just working with x86 or simply want to get your feet wet with reversing, the free version is a good place to start. Although tools like WinDbg include a built-in disassembler, IDA’s advanced code analysis and interactive aspects (such as allowing the user to describe and annotate types, names, and variables in a structured fashion) make it far superior for non-trivial reverse engineering projects.
  • HIEW (Hacker’s vIEW) is a powerful (console-based) hex editor / disassembler / assembler for 16-bit, 32-bit, and 64-bit (x64) targets. It understands the various Windows image formats, but it can also be used on raw binary files as well. In addition to serving all of your hex editor needs, HIEW is just the perfect tool to use when you need to make quick on-disk patches to executables (and the integrated disassembler and assembler makes creating and applying such patches on-the-fly a walk in the park compared to traditional hex editors, which would require you to manually build the opcodes yourself, a pain for non-trivial patches). HIEW includes some additional power-features as well, such as a way to create and run simple programs to decrypt sections in a file (very useful if you’re working on something that is self-decrypting, and you know how to decrypt it but don’t have a program to do so already). It also includes a fairly simple plugin extension interface to allow custom actions to be integrated with the HIEW user interface. HIEW isn’t free, although it is fairly reasonably priced (and there is a (limited) demo that you can play around with).
  • The Windows Vista SDK is an essential tool for many reverse engineering tasks. It includes extensive documentation (as well as headers) for all public Win32 APIs, and it also includes several useful utilities as well (such as link.exe /dump, otherwise known as dumpbin.exe, which can be used to quickly extract various bits of information from a binary (like a list of imports) without having to load it up into a full-featured disassembler tool). The Vista SDK also includes OleView, which can be useful for inspecting a third-party COM library, as it has integrated support for turning a type library into IDL (which can be trivially converted to a C header as needed).
  • Process Monitor, from SysInternals, is a great utility for quickly inspecting what file and registry operations a program is making. Depending on what you are trying to figure out when analyzing a program, simply looking at its file and registry activity can often save you hours of wading through disassembly or working with a debugger. Process Monitor allows you to perform (potentially filtered) in-depth logging of low-level file and registry activity, including operations that fail and the data returned by successful operations in some circumstances. Process Monitor is freely available from Microsoft’s website.
  • Process Explorer, formerly known as HandleEx, is another freely available tool from SysInternals. It allows quick and easy searching of open handles and loaded modules within a process, which is handy for initial information gathering (such as finding which process uses a DLL that you might be interested in). The ability to quickly check active processes for a DLL is also quite handy if you are trying to track down where a shared DLL (like a Windows hook DLL) is getting loaded in as a precursor to in-depth analysis.
  • SDbgExt is a WinDbg extension that I’ve developed (shameless plug). It provides a number of features that are useful for analyzing unfamiliar targets, such as the ability to import map files and create symbolic names out of them (particularly useful in conjunction with IDA’s export map file feature, if you want to synchronize your process between IDA and WinDbg), support for identifying writable function pointers within a process address space, and support for locating and displaying x64 exception/unwind handlers. SDbgExt presently only supports the 32-bit version of WinDbg. It is freely available (see the link above), and requires that WinDbg and the VC2005 CRT be installed.

There are other useful utilities out there (this is hardly an all-inclusive list), but these are the ones that I use the most often in a wide variety of situations.

Compiler tricks in x86 assembly, part 2

February 12th, 2007

A common operation in many C programs is division by a constant quantity. This sort of operation occurs all over the place: in explicit mathematical calculations, and in other, more innocuous situations, such as where the programmer takes a size in bytes and converts it into a count of structures.

This fundamental operation (division by a constant) is also one that the compiler will tend to optimize in ways that may seem a bit strange until you know what is going on.

Consider this simple C program:

int
__cdecl
wmain(
	int ac,
	wchar_t** av)
{
	unsigned long x = rand();

	return x / 1518;
}

In assembly, one would expect to see the following operations emitted by the compiler:

  1. A call to rand, to populate the ‘x’ variable.
  2. A div instruction to perform the requested division of ‘x’ by 1518.

However, if optimizations are turned on and the compiler is not configured to prioritize small code over performance exclusively, then the result is not what we might expect:

; int __cdecl main(int argc,
  const char **argv,
  const char *envp)
_main proc near
call    ds:__imp__rand
mov     ecx, eax
mov     eax, 596179C3h
mul     ecx
mov     eax, ecx
sub     eax, edx
shr     eax, 1
add     eax, edx
shr     eax, 0Ah ; return value in eax
retn

The first time you encounter something like this, you’ll probably be wondering something along the lines of “what was the compiler thinking when it did that, and how does that set of instructions end up dividing out ‘x’ by 1518?”.

What has happened here is that the compiler has used a trick sometimes known as magic number division to improve performance. In x86 (and in most processor architectures, actually), it is typically much more expensive to perform a division than a multiplication (or most any other basic mathematical primitive, for that matter). While it is ordinarily not possible to make division any faster than what the ‘div’ instruction gives you performance-wise, under special circumstances it is possible to do a bit better. Specifically, if the compiler knows that one of the operands to the division operation is a constant (specifically, the divisor), then it can engage some clever tricks in order to eliminate the expensive ‘div’ instruction and turn it into a sequence of other instructions that, while initially longer and more complicated on the surface, are actually faster to execute.

In this instance, the compiler has taken one ‘div’ instruction and converted it into a multiply, subtract, add, and two right shift operations. The basic idea behind how the code the compiler produced arrives at the correct result is as follows: the input is multiplied by a large constant, the low 32 bits of the product are discarded, and the remaining high 32 bits (recall that the ‘mul’ instruction returns the result in edx:eax, so the high 32 bits of the result are stored in edx) have some transform applied, typically one that includes fast division by a power of 2 via a right shift.

Sure enough, we can demonstrate that the above code works by manually comparing the result to a known value, and then manually performing each of the individual operations. For example, assume that in the above program, the call to rand returns 10626 (otherwise known as 7 * 1518, for simplicity, although the optimized code above works equally well for non-multiples of 1518).

Taking the above set of instructions, we perform the following operations:

  1. Multiply 10626 by the magic constant 0x596179C3. This yields 0x00000E7E00001006.
  2. Discard the low 32-bits (eax), saving only the high 32-bits of the multiplication (edx; 0xE7E, or 3710 in decimal).
  3. Subtract 3710 from the original input value, 10626, yielding 6916.
  4. Divide 6916 by 2^1 (bit shift right by 1), yielding 3458.
  5. Add 3710 to 3458, yielding 7168.
  6. Divide 7168 by 2^10 (bit shift right by 10), yielding 7, our final answer. (10626 / 1518 == 7)

Depending on the values being divided, it is possible that there may only be a right shift after the large multiply. For instance, if we change the above program to divide by 1517 instead of 1518, then we might see the following code generated instead:

mov     eax, 5666F0A5h
mul     ecx
shr     edx, 9

Coming up with the “magical divide” constants in the first place involves a bit of math; interested readers can read up about it on DevDev.

If you’re reverse engineering something that has these characteristics (large multiply that discards the low 32-bits and performs some transforms on the high 32-bits), then I would recommend breaking up each instruction into the corresponding mathematical operation (be aware of tricks like shift right to divide by a power of 2, or shift left to multiply by a power of 2). As a quick “litmus test”, you can always just pick an input and do the operations, and then simply compare the result to the input and see if it makes sense in the context of a division operation.

More on other optimizer tricks in a future posting…

Compiler optimizer tricks in x86 assembly, part 1

February 10th, 2007

The compiler is often very clever about speeding up some common operations in C (in terms of how they appear in assembly), in ways that might at first appear a bit non-obvious. With a bit of practice, you can train yourself to quickly identify these optimizations and see what they really do. These kinds of optimizations are very common with constant values.

With many of these little “assembly tricks”, the compiler is not simply taking into account instruction speed, but also instruction size, in order to reduce the total amount of opcode bytes required to do an optimization.

This post series attempts to provide a basic overview (and rationale) of several of the most common compiler optimizations that you might see while analyzing x86 assembly code. Knowing what these “assembly tricks” do is a great benefit if you are debugging or reverse engineering something, as it allows one to quickly recognize certain compiler patterns and gain clues as to what a program is doing.

Without further ado, here’s a brief overview of several of these optimizations:

  1. Clearing a register value by xor reg, reg is a very common sequence in x86 code generated by a compiler. You will almost never see an instruction of the form mov reg, 0. Instead, the compiler will virtually always use the above-mentioned xor construct.

    The reasoning behind this is that the xor reg, reg construct is a very small instruction in terms of opcode bytes (2 bytes), whereas assigning a constant 32-bit value is typically much more expensive in terms of opcode length (say, 5 bytes).

    The gain with this optimization is reduced code size. Reducing code size is always good, and can lead to improved performance in that it improves the cacheability of a particular chunk of code (remember, most processor cache sizes are still very small compared to main system memory or hard drives). Also, if the compiler can shrink down the total image size by even one page (e.g. 4096 bytes) with optimizations like this, that’s one less page fault that needs to be taken when loading the program. This can be noticeable even in lengths less than a page, if a function can be kept within one page instead of spilling over to a neighboring page (which might not be used for the current code path), which can eliminate “extraneous” page faults, where most of the data brought in is unnecessary.

    (It’s worth noting that this sort of optimization has come a long way recently, in the form of profile-guided BBT-style optimizations that reorder “hot” code paths to be on the same page in an attempt to make every byte that is in-paged from disk be as relevant as possible to the current execution path.)

  2. The constant zero register is another very common optimization technique used by the compiler. Since the value zero appears so frequently in many functions (e.g. default parameter values are typically zero, and it is very common to compare values against zero or assign values to zero), the compiler will sometimes dedicate an entire register to contain the value zero throughout most of the function. Here’s an example, taken from nt!MmZeroPageThread:

    xor     esi, esi
    cmp     [ebp+var_14], esi
    [...]
    push    esi             ; WaitBlockArray
    push    esi             ; Timeout
    push    esi             ; Alertable
    push    esi             ; WaitMode
    push    8               ; WaitReason
    xor     ebx, ebx
    inc     ebx
    push    ebx             ; WaitType
    lea     eax, [ebp+Object]
    push    eax             ; Object
    push    2               ; Count
    call    _KeWaitForMultipleObjects@32

    Here, the compiler has dedicated the “esi” register to be the constant zero value for this function. It is used in many cases; in just a small part of nt!MmZeroPageThread, for instance, we can see that it is being used as both an argument to the “cmp” instruction for a test against constant zero, and we can also see it being used to fill many constant zero parameter values for a call to KeWaitForMultipleObjects.

    The gain from using a constant zero register is typically reduced code size. In x86, in many cases, it takes fewer opcode bytes to assign or compare a value against a register than against an immediate constant operand. The compiler takes advantage of this fact by only “paying the price” for referencing the value zero in an immediate operand for an instruction once, by storing it into a register. Then, the compiler can simply refer to that register for smaller overall code size if references to constant zero are common enough. For example, in the above code fragment, the compiler saves one opcode byte per push esi instruction over doing a push 0 instruction.

  3. Fast multiplication or addition with the lea instruction is another fairly common optimization that one is likely to run into frequently. The lea instruction (load effective address) was originally intended for use with pointer math, but it also turns out to be quite useful for doing fast constant math on non-pointer values as well, in part due to the wide range of operations that can be done with this instruction.

    Consider the following code fragment:

    mov     eax, [ebp+some_local]
    movzx   ecx, bx
    lea     eax, [eax+eax*4]
    lea     eax, [ecx+eax*2-30h]
    mov     [ebp+other_local], eax

    This instruction sequence may seem a bit convoluted at first, but it’s not too bad if we break it down into its constituent parts (and then combine it into one operation).

    We have the following operations:

    lea     eax, [eax+eax*4]

    This operation is equivalent to the following in C:

    other_local = some_local;
    other_local *= 5;

    Then, we’ve got the second lea operation:

    lea     eax, [ecx+eax*2-30h]

    In C, this might look like so:

    other_local = other_local * 2 + X - 0x30;

    …(where X corresponds to bx (and then ecx)).

    If we combine the two together, we get the following expression in C:

    other_local = some_local * 10 + X - 0x30;

    Now, the compiler could have used a combination of mul, add, and sub instructions to achieve the same effect. However, this would be more expensive in terms of instruction size, as those instructions are designed to work with values that are not known until runtime. By using the lea instruction, the compiler can take advantage of the fact that lea can perform multiple operations with one instruction in order to reduce code size.

  4. The lea instruction is also useful for scaling array indices to their native type. Recall that in C, if you subscript an array, the subscript is performed on whole array elements; in the x86 instruction set, however, the processor has no way of magically knowing the size of an array element when the programmer subscripts an array.

    For example, consider the following structure:

    0:000> dt nt!_MMPFN
       +0x000 u1               : __unnamed
       +0x004 PteAddress       : Ptr32 _MMPTE
       +0x008 u2               : __unnamed
       +0x00c u3               : __unnamed
       +0x010 OriginalPte      : _MMPTE
       +0x010 AweReferenceCount : Int4B
       +0x014 u4               : __unnamed ; 4 bytes
    

    In this case, sizeof(nt!_MMPFN) == 24 (0x18). Consider an array of _MMPFN structures like so:

    _MMPFN MmPfnDatabase[ array_size ];

    If the programmer wants to index MmPfnDatabase (i.e. retrieve a pointer to a particular _MMPFN element within the array), then the compiler needs to convert an index into a pointer to an _MMPFN structure contained within the array.

    For example, the programmer might write the following C code:

    _MMPFN* Pfn = &MmPfnDatabase[ PfnIndex ];

    At the x86 instruction set level, though, the compiler needs to convert PfnIndex into a pointer. This requires two operations: First, PfnIndex needs to be scaled by the size of an array element (i.e. multiplied by sizeof(_MMPFN)). Second, the resultant value needs to be added to the base of MmPfnDatabase to form a complete pointer value to the requested _MMPFN element.

    In order to accomplish this, the compiler might emit the following instruction sequence:

    mov     eax, [ebp+PfnIndex]
    mov     ecx, _MmPfnDatabase
    push    ebx
    mov     ebx, [ebp+arg_4]
    lea     eax, [eax+eax*2]
    push    esi
    lea     esi, [ecx+eax*8]
    

    Here, the lea instruction is used to take the PfnIndex and MmPfnDatabase values and combine them into an _MMPFN pointer (stored in “esi”). If we break down the individual operations performed, what’s going on here isn’t all that difficult to understand:

    1. The initial LEA operation is equivalent to multiplying PfnIndex by 3 (PfnIndex is stored into “eax”).
    2. The final LEA operation multiplies the result of the first LEA operation by 8 (which can be simplified to say that PfnIndex has been multiplied by 24, which is conveniently equal to sizeof(_MMPFN)).
    3. Finally (also part of the last LEA operation), “ecx” (which was loaded with the base address of MmPfnDatabase) is added to the intermediate result, which is then placed into “esi”, forming a completed _MMPFN pointer.

    Like with performing constant math for non-array indexing, the advantage of using lea over a series of mul and add operations is primarily code size, as lea allows for several distinct operations (e.g. multiply/add, or multiply/subtract) to be combined into a single compact operation. Most processors also provide very fast implementations of the lea instruction (as compared to the mul instruction for constant multiply operations).

    In order to be able to differentiate between an array indexing operation and plain constant math using lea, I would recommend checking to see whether any of the input values are treated as pointers or if the output value is treated as a pointer. Usually, it’s fairly easy to make this determination, though.

    As an aside, if you are reverse engineering something, constructs like array index operations are very handy as they will definitively tell you the size of the structure comprising an array.

The next post in this series will continue this discussion and examine several more common (and more complicated) optimizations and assembly tricks that the compiler may emit on x86. Stay tuned…

Enabling the local kernel debugger on Vista RTM

February 9th, 2007

If you’re a kernel developer, and you’ve upgraded to Vista, then one of the changes that you may have noticed is that you can’t perform local kernel debugging anymore.

This is true even if you elevate WinDbg. If you try, you’ll get an error message stating that the debugger failed to get KD version information (error 5), which corresponds to the Win32 ERROR_ACCESS_DENIED error code.

This is due to a change from Vista RC2 to Vista RTM, where the kernel function responsible for supporting much of the local KD functionality in WinDbg (KdSystemDebugControl) was altered to require the system to be booted with /DEBUG. This is apparent if we compare RC2 to RTM.

RC2 has the following check in KdSystemDebugControl (one comparison against KdpBootedNodebug):

nt!KdSystemDebugControl:
push    0F4h
push    81841938
call    nt!_SEH_prolog4
xor     ebx,ebx
mov     dword ptr [ebp-28h],ebx
mov     dword ptr [ebp-20h],ebx
mov     dword ptr [ebp-24h],ebx
cmp     byte ptr [nt!KdpBootedNodebug],bl
je      nt!KdSystemDebugControl+0x2c ; Success
mov     eax,0C0000022h ; STATUS_ACCESS_DENIED

On Vista RTM, two additional checks were added against nt!KdPitchDebugger and nt!KdDebuggerEnabled (disregard the fact that the RTM disassembly is from the x64 version; both the x86 and x64 Vista versions have the same checks):

nt!KdSystemDebugControl:
mov     qword ptr [rsp+8],rbx
mov     qword ptr [rsp+10h],rdi
push    r12
sub     rsp,170h
mov     r10,rdx
and     dword ptr [rsp+44h],0
and     qword ptr [rsp+48h],0
and     qword ptr [rsp+50h],0
cmp     byte ptr [nt!KdpBootedNodebug],0
jne     nt!KdSystemDebugControl+0x8b7 ; Fail
cmp     byte ptr [nt!KdPitchDebugger],0
jne     nt!KdSystemDebugControl+0x8b7 ; Fail
cmp     byte ptr [nt!KdDebuggerEnabled],0
je      nt!KdSystemDebugControl+0x8b7 ; Fail

The essence of these checks is that you need to be booted with /DEBUG enabled in order for local kernel debugging to work.

There is a simple way to accomplish this, however, without the usual painful aspects of having a kernel debugger attached (e.g. breaking on user mode exceptions or breakpoints).

All you have to do is enable kernel debugging, and then disable user mode exception handling. This requires the following options to be set via BCDEdit.exe, the Vista boot configuration database manager:

  1. bcdedit /debug on. This enables kernel debugging for the booted OS configuration.
  2. bcdedit /dbgsettings <type> /start disable /noumex (where type corresponds to a usable KD type on your computer, such as 1394). This disables user mode exception handling for the kernel debugger. You should still be able to boot the system without a kernel debugger attached.

After setting these options, reboot, and you should be set. You’ll now be able to use local KD (you must still remember to elevate the debugger, though), but you won’t have user mode programs try to break into the kernel debugger when they crash without a user mode debugger attached.

Note, however, that you’ll still be able to break into the system with a kernel debugger after boot if you choose these options (and if the box crashes in kernel mode, it’ll freeze waiting for a debugger to attach). However, at least you will not have to contend with errant user mode programs causing the system to break into the kernel debugger.

Programming against the x64 exception handling support, part 7: Putting it all together, or building a stack walk routine

February 5th, 2007

This is the final post in the x64 exception handling series, comprised of the following articles:

  1. Programming against the x64 exception handling support, part 1: Definitions for x64 versions of exception handling support
  2. Programming against the x64 exception handling support, part 2: A description of the new unwind APIs
  3. Programming against the x64 exception handling support, part 3: Unwind internals (RtlUnwindEx interface)
  4. Programming against the x64 exception handling support, part 4: Unwind internals (RtlUnwindEx implementation)
  5. Programming against the x64 exception handling support, part 5: Collided unwinds
  6. Programming against the x64 exception handling support, part 6: Frame consolidation unwinds
  7. Programming against the x64 exception handling support, part 7: Putting it all together, or building a stack walk routine

Armed with all of the information in the previous articles in this series, we can now do some fairly interesting things. There are a whole lot of possible applications for the information described here (some common, some esoteric); however, for simplicity’s sake, I am choosing to demonstrate a classical stack walk (albeit one that takes advantage of some of the newly available information in the unwind metadata).

Like unwinding on x86 where FPO is disabled, we are able to do simple tasks like determine frame return addresses and stack pointers throughout the call stack. However, we can expand on this a great deal on x64. Not only are our stack traces guaranteed to be accurate (due to the strict calling convention requirements on unwind metadata), but we can retrieve parts of the nonvolatile context of each caller with perfect reliability, without having to manually disassemble every function in the call stack. Furthermore, we can see (at a glance) which functions modify which non-volatile registers.

For the purpose of implementing a stack walk, it is best to use RtlVirtualUnwind instead of RtlUnwindEx, as RtlUnwindEx will make irreversible changes to the current execution state (while RtlVirtualUnwind operates on a virtual copy of the execution state that we can modify to our heart’s content, without disturbing normal program flow). With RtlVirtualUnwind, it’s fairly trivial to implement an unwind (as we’ve previously seen based on the internal workings of RtlUnwindEx). The basic algorithm is simply to retrieve the current unwind metadata for the active function, and execute a virtual unwind. If no unwind metadata is present, then we can simply set the virtual Rip to the virtual *Rsp, and increment the virtual *Rsp by 8 (as opposed to invoking RtlVirtualUnwind). Since RtlVirtualUnwind does most of the hard work for us, the only thing left after that is to interpret and display the output (or save it away, if we are logging stack traces).
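The core loop of that algorithm might look like the following sketch (a rough outline only, using documented APIs; `WalkStack` is a hypothetical name, and error handling, the ContextPointers bookkeeping, and the nonvolatile context display are omitted for brevity):

```c
#include <windows.h>
#include <stdio.h>

/* Sketch of a virtual-unwind-based stack walk over the current thread. */
void WalkStack(void)
{
    CONTEXT Context;

    RtlCaptureContext(&Context);

    while (Context.Rip) {
        DWORD64           ImageBase;
        PRUNTIME_FUNCTION Function;
        PVOID             HandlerData;
        DWORD64           EstablisherFrame;

        printf("Rip=%016I64X Rsp=%016I64X\n", Context.Rip, Context.Rsp);

        Function = RtlLookupFunctionEntry(Context.Rip, &ImageBase, NULL);

        if (!Function) {
            /* No unwind metadata: a leaf function.  The return address
               is at [rsp], and rsp has not been adjusted further. */
            Context.Rip  = *(PDWORD64)Context.Rsp;
            Context.Rsp += 8;
            continue;
        }

        /* Virtually unwind one frame.  A KNONVOLATILE_CONTEXT_POINTERS
           structure could be passed as the last argument to track where
           each saved nonvolatile register lives on the stack. */
        RtlVirtualUnwind(UNW_FLAG_NHANDLER,
                         ImageBase,
                         Context.Rip,
                         Function,
                         &Context,
                         &HandlerData,
                         &EstablisherFrame,
                         NULL);
    }
}
```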

With the above information in mind, I have written an example x64 stack walking routine that implements a basic x64 stack walk, and displays the nonvolatile context as viewed by each successive frame in the stack. The example routine also makes use of the little-used ContextPointers argument to RtlVirtualUnwind in order to detect functions that have used a particular non-volatile register (other than the stack pointer, which is immediately obvious). If a function in the call stack writes to a non-volatile register, the stack walk routine takes note of this and displays the modified register, its original value, and the backing store location on the stack where the original value is stored. The example stack walk routine should work “as-is” in both user mode and kernel mode on x64.

As an aside, there is a whole lot of information that is being captured and displayed by the stack walk routine. Much of this information could be used to do very interesting things (like augmenting disassembly and code analysis by definitively identifying saved registers and parameter usage). Other possible uses are more esoteric, such as skipping function calls at run-time in a safe fashion, or altering the non-volatile execution context of called functions via modification of the backing store pointers returned by RtlVirtualUnwind in ContextPointers. The stack walk use case, as such, really only begins to scratch the surface of the many very interesting things that x64 unwind metadata allows you to do.

Comparing the output of the example stack walk routine to the debugger’s built-in stack walk, we can see that it is accurate (and in fact, even includes more information; the debugger does not, presently, have support for displaying non-volatile context for frames using unwind metadata (Update: Pavel Lebedinsky points out that the debugger does, in fact, have this capability with the “.frame /r” command)):

StackTrace64: Executing stack trace...
FRAME 00: Rip=00000000010022E9 Rsp=000000000012FEA0
 Rbp=000000000012FEA0
r12=0000000000000000 r13=0000000000000000
 r14=0000000000000000
rdi=0000000000000000 rsi=0000000000130000
 rbx=0000000000130000
rbp=0000000000000000 rsp=000000000012FEA0
 -> Saved register 'Rbx' on stack at 000000000012FEB8
   (=> 0000000000130000)
 -> Saved register 'Rbp' on stack at 000000000012FE90
   (=> 0000000000000000)
 -> Saved register 'Rsi' on stack at 000000000012FE88
   (=> 0000000000130000)
 -> Saved register 'Rdi' on stack at 000000000012FE80
   (=> 0000000000000000)

FRAME 01: Rip=0000000001002357 Rsp=000000000012FED0
[...]

FRAME 02: Rip=0000000001002452 Rsp=000000000012FF00
[...]

FRAME 03: Rip=0000000001002990 Rsp=000000000012FF30
[...]

FRAME 04: Rip=00000000777DCDCD Rsp=000000000012FF60
[...]
 -> Saved register 'Rbx' on stack at 000000000012FF60
   (=> 0000000000000000)
 -> Saved register 'Rsi' on stack at 000000000012FF68
   (=> 0000000000000000)
 -> Saved register 'Rdi' on stack at 000000000012FF50
   (=> 0000000000000000)

FRAME 05: Rip=000000007792C6E1 Rsp=000000000012FF90
[...]


0:000> k
Child-SP          RetAddr           Call Site
00000000`0012f778 00000000`010022c6 DbgBreakPoint
00000000`0012f780 00000000`010022e9 StackTrace64+0x1d6
00000000`0012fea0 00000000`01002357 FaultingSubfunction3+0x9
00000000`0012fed0 00000000`01002452 FaultingFunction3+0x17
00000000`0012ff00 00000000`01002990 wmain+0x82
00000000`0012ff30 00000000`777dcdcd __tmainCRTStartup+0x120
00000000`0012ff60 00000000`7792c6e1 BaseThreadInitThunk+0xd
00000000`0012ff90 00000000`00000000 RtlUserThreadStart+0x1d

(Note that the stack walk routine doesn’t include DbgBreakPoint or the StackTrace64 frame itself. Also, for brevity, I have snipped the verbose, unchanging parts of the nonvolatile context from all but the first frame.)

Other interesting uses for the unwind metadata include logging periodic stack traces at choice locations in your program for later analysis when a debugger is active. This is even more powerful on x64, especially when you are dealing with third party code, as even without special compiler settings, you are guaranteed to get good data with no invasive use of symbols. And, of course, a good knowledge of the fundamentals of how the exception/unwind metadata works is helpful for debugging failure reporting code (such as a custom unhandled exception filter) on x64.

Hopefully, you’ve found this series both interesting and enlightening. In a future series, I’m planning on running through how exception dispatching works (as opposed to unwind dispatching, which has been relatively thoroughly covered in this series). That’s a topic for another day, though.

Hooray for bad hardware… (and upcoming blog maintenance)

January 31st, 2007

Well, I just spent the past hour and a half trying to resurrect the box that hosts the blog after it froze again. I finally succeeded in getting it working (cross my fingers…). At this point, though, I think that it’s time to start working on moving the blog to a more reliable location. The current box has locked up a couple of times this month already (to the point where I can’t break in with either KD or SAC), which is worrisome. Furthermore, as the current box also happens to be routing for my apartment LAN, its locking up also means that the primary Internet connection for my entire apartment goes down as well – ick. The fact that it took about an hour and a half of disassembling it and putting things back together to get it running is definitely a strong sign that it’s going to take a turn for the worse rather soon. Granted, this box was donated to me, so I suppose I can’t complain too much, but it’s time to get something more reliable.

Since I’m going to be switching boxes for the blog anyway (as the current box looks as if it is going to die, permanently, any day now), I am going to move it to a more capable hosting location (i.e. not my apartment) while I’m at it. Furthermore, I’ll probably be going ahead and upgrading to the next major WordPress version soon as well now that it’s out of beta. Assuming that goes well on the new blog server, it’ll be running 2.1 when I cut it over from the old box.

Hopefully, this should end up with the blog being a bit more responsive (the burst data rate at my apartment is a bit less than I would like for hosting the site), and (finally!) drive a stake through the periodic but short outages I’ve had due to flaky hardware over the past few months.

Thoughts on PatchGuard (otherwise known as Kernel Patch Protection)

January 29th, 2007

Recently, there has been a fair bit of press about PatchGuard. I’d like to clarify a couple of things (and clear up some common misconceptions that appear to be floating around out there).

First of all, these opinions are my own, based on the information that I have available to me at this time, and are not sponsored by either Microsoft or my employer. Furthermore, these views are based on PatchGuard as it is implemented today, and do not relate to any theoretical extensions to PatchGuard that may occur sometime in the future. That being said, here’s some of what I think about PatchGuard:

PatchGuard is (mostly) not a security system.

Although some people out there might try to tell you this, I don’t buy it. The thing about PatchGuard is that it protects the kernel (and a couple of other Microsoft-supplied core kernel libraries) from being patched. PatchGuard also protects a couple of kernel-related processor registers (MSRs) that are used in conjunction with functionality like making system calls. However, this doesn’t really equate to improving computer security directly. Some persons out there would like to claim that PatchGuard is the next great thing in the anti-malware/anti-rootkit war, but they’re pretty much just wrong, and here’s why:

  1. Malware doesn’t need to patch the kernel in the vast majority of cases. Virtually all of the “interesting” things out there that malware does on compromised computers (create botnets, blast out spam emails, and the like) don’t require anything that is even remotely related to what PatchGuard blocks in terms of kernel patch prevention. Even in the rootkit case, most of the hiding of nefarious happenings could be done with clever uses of documented APIs (such as creating threads in other processes, or filesystem filters) without having to patch the kernel at all. I will admit that many rootkits out there now do simply patch the kernel, but I attribute this mostly to 1) a lack of knowledge about how the kernel works on the part of rootkit authors, and 2) the fact that it might be “easier” to simply patch the kernel and introduce race conditions that rarely crash rootkit’d computers than to do things the right way. Once things like system call hooks in rootkits start to run afoul of PatchGuard, though, rootkit authors have innumerable other choices that are completely unguarded by PatchGuard. As a result, it’s not really correct (in my opinion) to call PatchGuard an anti-malware technology.
  2. Malware authors are more agile than Microsoft. What I mean when I say “more agile” is that malware vendors can much more easily release new software versions without the “burdens” of regression testing, quality assurance, and the like. After all, if you’re already in the malicious software business, it probably doesn’t matter if 5% of your customers (sorry, victims) have systems that crash with your software installed because you didn’t fully test your releases and fix a corner-case bug. On the other hand, Microsoft does have to worry about this sort of thing (and Microsoft’s install base for Windows is huge), which means that Microsoft needs to be very, very careful about releasing updates to something as “dangerous” as PatchGuard. I say “dangerous” in the sense that if a PatchGuard version were released with a bug, it could very well cause some number of Microsoft’s customers to bluescreen on boot, which would clearly not fly very well. Given that Microsoft can’t really keep up with malware authors (many of whom are dedicated to writing malicious code, with financial incentives to keep doing so) when it comes to the “cat and mouse” game with PatchGuard, it doesn’t make sense to try to use PatchGuard to stop malware.
  3. PatchGuard is targeted at vendors that are accountable to their customers. This point seems to be often overlooked, but PatchGuard works by making it painful for vendors to patch the kernel. This pain comes from the fact that ISVs who choose to bypass PatchGuard are at risk of causing customer computers to bluescreen on boot en masse the next time that Microsoft releases a PatchGuard update. For malware authors, this is really much less of an issue; one compromised computer is the same as another, and many flavors of malware out there will try to do things like break automatic updates anyway. Furthermore, it is generally easier for malware authors to push out new versions of software to victims (that might counter a PatchGuard update) than it is for most ISVs to deploy updates to their paying customers, who often tend to be stubborn on the issue of software updates, typically insisting on internal testing before mass deployment.
  4. By the time malware is in a position to have to deal with PatchGuard as a potential blocking point, the victim has already lost. In order for PatchGuard to matter to malware, said malware must be able to run kernel level code on the victim’s computer. At this point, guarding the computer is really somewhat of a moot point, as the victim’s passwords, personal information, saved credit card numbers, secret documents, and whatnot are already toast in that the malware already has complete access to anything on the system. Now, there are a couple of scenarios (like trojaning a library computer kiosk) where the information desired by an attacker is not yet available, and the attacker has to hide his or her malicious code until a user comes around to type it in on a keyboard, but for the typical home or business computer scenario, the game is over the moment malicious code gets kernel level access to the box in question.

Now, for a bit of clarification. I said that PatchGuard was mostly not a security system. There is one real security benefit that PatchGuard confers on customers: it keeps out ill-written software that does things like patch system calls and then fails to validate parameters correctly, resulting in new security issues being introduced. This is actually a real issue, believe it or not. Now, not everything that hooks the kernel introduces race conditions, security problems, or the like; it is possible (though difficult) to do so in a safe and correct manner in many circumstances, depending on what exactly it is that you’re trying to do. However, most of the software out there does tend to do things incorrectly, and in the process often inadvertently introduces security holes (typically local privilege escalation issues).

PatchGuard is not a DRM mechanism.

People who try to tell you that PatchGuard is DRM-related likely don’t really understand what it does. Even with PatchGuard installed, it’s still possible to extend the kernel with custom drivers. Additionally, the things that PatchGuard protects are only related to the new HD DRM mechanisms in Vista in a very loose sense. Many of the DRM provisions in Windows Vista are implemented in the form of drivers and not the kernel image itself, and are thus not protected by PatchGuard. If nothing else, third party drivers are responsible for the negotiation and lowest-level I/O relating to most of the new HD DRM schemes, and present an attractive target for DRM-bypass attempts over which PatchGuard has no jurisdiction. Unless Microsoft supplies all multimedia-output-related drivers in the system, this is how it will have to stay for the foreseeable future, and it would be extremely difficult for Microsoft to protect just DRM-related drivers with PatchGuard in a non-trivially-bypassable fashion.

Preventing poorly written code from hooking the kernel is good for Windows customers.

There is a whole lot of bad code out there that hooks things incorrectly, and typically introduces race conditions and crash conditions. I have a great deal of first-hand experience with this, as our primary product here at Positive has run afoul of third party software that hooks things and causes hard-to-debug random breakage that we get the blame for from a customer support perspective. The most common offenders that cause us pain are third party hooks that inject themselves into networking-related code and cause issues by subtly breaking API semantics. For instance, we’ve run into certain versions of McAfee’s LSPs where, if you call select in a certain way (yes, I know that select is hardly optimal on Windows), the LSP will close a garbage handle value, sometimes nuking something important in the process. We’ve also run into a ton of cases where poorly behaved Internet Explorer add-ons cause various corruption and other issues that we tend to get the blame for, when our (rather complicated, code-wise, being a full-fledged VPN client) browser-based product is used in a partially corrupted iexplore process and eventually falls over and crashes. Another common trouble case relates to third party software that attempts to hook the CreateProcess* family of routines in order to propagate hooked code to child processes, and in the process manages to break legitimate usage of CreateProcess in certain obscure scenarios.

In our case, though, this is all just third-party “junkware” that breaks stuff in usermode. When you get to poorly written code that breaks the kernel, things get just that much worse; instead of a process blowing up, now the entire system hangs, crashes (i.e. bluescreens), experiences subtle filesystem corruption, or has local guest-to-kernel privilege escalation vulnerabilities introduced. Not only are the consequences of failure that much worse with hooking in kernel mode, but you can guess who gets the blame when a Windows box bluescreens: Microsoft, even though that sleazy anti-virus (or whatnot) software you just installed was really to blame. Microsoft claims that a significant number of crashes on x86 Windows come from third parties hooking the kernel incorrectly, and based on my own analysis of certain anti-virus software (and my own experience with hooking things blowing up our code in usermode), I don’t have any basis for disagreeing with Microsoft’s claim. From this perspective, PatchGuard is a good thing for consumers, as it represents a non-trivial attempt to force the industry to clean house and fix code that is at best questionable.

PatchGuard makes it significantly more difficult for ISVs that provide value on top of Windows through the use of careful hooking (or other non-blessed means) of extending the kernel.

Despite the dangers involved in kernel hooking, it is in fact possible to do it right in many circumstances, and there really are products out there that require kernel-level alterations which simply can’t be done without hooking (or other approaches that are blocked by PatchGuard). Note that, for the most part, I consider AV software as not in this class of software; if nothing else, Microsoft’s Windows Live OneCare demonstrates that it is perfectly feasible to deploy an AV solution without kernel hacks. Nonetheless, there are a number of niche solutions that revolve around things like kernel-level patching, which are completely shut out by PatchGuard. This presents an unfortunate, negative atmosphere for any ISV that falls into this zone of having deployed technology that is now blocked in principle by PatchGuard, just because XYZ large anti-virus vendor (anti-virus vendors being the most common offenders, though there are certainly plenty of non-AV programs out there that are similarly “kernel-challenged”) couldn’t be bothered to clean up their code and do things the correct way.

From this perspective, PatchGuard is damaging to customers (and affected ISVs), as they are essentially forced to stay away from x64 (or take the risky road of trying to play cat-and-mouse with Microsoft and hack PatchGuard). This is unfortunate, and until recently, Microsoft presented an outward appearance of taking a hard-line, stonewall stance against anything that ran afoul of PatchGuard, regardless of whether the program in question really had a “legitimate” need to alter the kernel. Fortunately for ISVs and customers, Microsoft appears to have very recently warmed up to the fact that completely shutting out a subset of ISVs and Windows customers isn’t such a great idea (or at least, has very recently started saying so in public), and has reversed its previous statements regarding its willingness to cooperate with ISVs that run into problems with PatchGuard. I see this as a completely positive thing myself (keeping in mind that at work here, we don’t ship anything that conflicts with PatchGuard), as it signals that Microsoft is willing to work with ISVs that have “legitimate” needs that are being blocked by PatchGuard.

There is no giant conspiracy in Microsoft to shut out the ISVs of the world.

While I may not completely agree with the new reality that PatchGuard presents to ISVs, I have no illusions that Microsoft is trying to take over the world with PatchGuard. Like it or not, Microsoft is quite aware that the success of the Windows platform depends on the applications (and drivers) that run on top of it. As a result, it would be grossly stupid of Microsoft to try and leverage PatchGuard as a way to shut out other vendors entirely; customers don’t like changing vendors and ripping out all the experience and training that they have with an existing install base, and so you can bet that Microsoft trying to take over the software market “by force” with the use of PatchGuard wouldn’t go over well with Windows customers (or help Microsoft in its sales case for future Windows upgrades featuring PatchGuard), not to mention the legal minefield that Microsoft would be waltzing into should it attempt to do such a thing. Now, that being said, I do believe that somebody “dropped the ball”, so to speak, as far as cooperating with ISVs when PatchGuard was initially deployed. Things do appear to be improving from the perspective of Microsoft’s willingness to work with ISVs on PatchGuard, however, which is a great first step in the right direction (though it remains to be seen how well Microsoft’s current stance will work out with the industry).

PatchGuard does represent, on some level, Microsoft exerting control over what users do with their hardware.

This is the main aspect of PatchGuard that I am most uncomfortable with, as I am of the opinion that when I buy a computer and pay for software, I should absolutely be permitted to do what I want with it, even if that involves “dangerous” things like patching my own box’s kernel. In this regard, I think that Microsoft is attempting to play the “benevolent dictator” with respect to kernel software; drivers that are, on average, dangerous to the reliability of Windows computers are being blocked by Microsoft. Now, I do trust Microsoft and its code a whole lot more than most ISVs out there (I know first-hand how much more seriously Microsoft considers issues like reliability and security at the present day (I’m not talking about Windows 95 here…) than a whole lot of other software vendors do). Still, I don’t particularly like the fact that PatchGuard is akin to Microsoft telling me that “sorry, that new x64 computer and Windows license you bought won’t let you do things that we have classified as dangerous to system stability”. For example, I can’t write a program on Windows x64 that patches things blocked by PatchGuard, even for research and education purposes. I can understand that Microsoft finds itself up against the wall here against an uncooperative industry that doesn’t want to clean up its act, but still, it’s my computer, so I had damn well better be able to use it how I like. (No, I don’t consider the requirement of having a kernel debugger attached at boot time an acceptable one, especially in automated scenarios.)

PatchGuard could make it more difficult to prove that a system is uncompromised, or to analyze a known compromised system for clues about an attack, if the malware involved is clever enough.

If you are trying to analyze a compromised (or suspected compromised) system for information about an attack, PatchGuard represents a dangerous unknown: a deliberately obfuscated chunk of code with sophisticated anti-analysis/anti-debugging/anti-reverse-engineering measures that ships with the operating system. This makes it very difficult to look at a system and determine if, say, it’s really PatchGuard that is running every so often, or perhaps some malware that has hijacked PatchGuard for nefarious purposes. Without digging through layer upon layer of obfuscation and anti-debugging code, definitively saying that PatchGuard on a system is uncompromised is just plain not doable. In this respect, PatchGuard’s obfuscation presents a potentially attractive place for malicious code to hide and avoid being picked up in the course of post-compromise forensics on a running system.

PatchGuard will not be obfuscation-based forever.

Eventually, I would expect that it will be replaced by hardware-enforced (hypervisor-based) systems that utilize the hardware-supported virtualization technology in new processors to provide a “ring -1” for code presently guarded by PatchGuard to execute in. This approach would move the burden of guarding kernel code to the processor itself, instead of the current “cat and mouse” game in software that exists with PatchGuard, as PatchGuard executes at the same privilege isolation level as code that might try to subvert it. Note that, in a hypervisor-based system, hardware drivers would ideally be unable to cause damage (in terms of things like memory corruption and the like) to the kernel itself, which might eventually allow the system to continue functioning even if a driver fails. Of course, if drivers rely on being able to rewrite the kernel, this goal is clearly unattainable, and PatchGuard helps to ensure that in the future, there won’t be a backwards compatibility nightmare caused by a plethora of third-party drivers that rely on being able to directly alter the behavior of the kernel. (I suspect that x64 will supplant x86 in terms of being the operating system execution environment in the not-too-distant future.) In usermode, x86 will likely live on for a very long time, but as x64 processors can execute x86 code at native speed, this is likely to not be an issue.

When PatchGuard is hypervisor-backed, it won’t be feasible to simply patch it out of existence, which means that ISVs will either have to comply with Microsoft’s requirements or find a way to evade PatchGuard entirely.

Overall, I think that there are both significant positive and negative aspects to PatchGuard. Whether it will turn out to be the best business decision (or the best experience for customers) remains to be seen; in an ideal world, though, only developers who really understand the full implications of what they are doing would patch the kernel (and only in safe ways), and things like PatchGuard would be unnecessary. I fear that expecting every programmer to do the right thing has become too much to ask, though; one need only look at the myriad security vulnerabilities in software all over to see how little so many programmers care about correctness.

Programming against the x64 exception handling support, part 6: Frame consolidation unwinds

January 14th, 2007

In the last post in the programming x64 exception handling series, I described how collided unwinds were implemented, and just how they operate. That just about wraps up the guts of unwinding (finally), except for one last corner case: So-called frame consolidation unwinds.

Consolidation unwinds are a special form of unwind that is indicated to RtlUnwindEx with a special exception code, STATUS_UNWIND_CONSOLIDATE. This exception code changes RtlUnwindEx’s behavior slightly; it suppresses the behavior of substituting the TargetIp argument to RtlUnwindEx with the unwound context’s Rip value.

As far as RtlUnwindEx goes, that’s all there is to consolidation unwinds. There’s a bit more that goes on with this special form of unwind, though. Specifically, as with longjump-style unwinds, there is special logic contained within RtlRestoreContext (used by RtlUnwindEx to realize the final, unwound execution context) that detects the consolidation unwind case (by virtue of the ExceptionCode member of the ExceptionRecord argument) and enables a special code path. As it is currently implemented, some of the logic relating to unwind consolidation is also resident within the pseudofunction RcFrameConsolidation. This function is tightly coupled with RtlRestoreContext; it is only separated into another logical function for purposes of describing RtlRestoreContext and RcFrameConsolidation in the unwind metadata for ntdll (or ntoskrnl).

The gist of what RtlRestoreContext/RcFrameConsolidation does in this case is essentially the following:

  1. Make a local copy of the passed-in context record.
  2. Treat ExceptionRecord->ExceptionInformation[0] as a callback function and call it, passing a single argument that points to the ExceptionRecord provided to RtlRestoreContext.
  3. The return value of the callback function is treated as a new Rip value to place in the context that is about to be restored.
  4. The context is restored as normal after the Rip value is updated based on the callback’s decision.

The callback routine pointed to by ExceptionRecord->ExceptionInformation[0] has the following signature:

// Returns a new RIP value to be used in the restored context.
typedef ULONG64 (* PFRAME_CONSOLIDATION_ROUTINE)(
   __in PEXCEPTION_RECORD ExceptionRecord
   );

After the confusing interplay between multiple concurrent unwind dispatches that characterizes collided unwinds, frame consolidation unwinds may seem a little bit anti-climactic.

Frame consolidation unwinds are typically used in conjunction with C++ try/catch/throw support. Specifically, when a C++ exception is thrown, a consolidation unwind is executed within a function that contains a catch handler. The frame consolidation callback is used by the CRT to actually invoke the various catch filters and handlers (these do not necessarily directly correspond to standard C SEH scope levels). The C++ exception handling routines use the additional ExceptionInformation fields available in a consolidation unwind to pass information about the exception object itself; this usage of the ExceptionInformation fields does not have any special support in the OS-level unwind routines, however. When a C++ exception is going to cross a function-level boundary, in my observation, it is converted into a normal exception for purposes of unwinding and exception handling; consolidation unwinds are then used again to invoke any catch handlers encountered within the next function (if applicable).

Essentially, consolidation unwinds can be thought of as a normal unwind with a conditionally assigned TargetIp, whose value is not determined until after all unwind handlers have been called and the specified context has been unwound. In most circumstances, this functionality is not particularly useful (or critical) when speaking in terms of programming the raw OS-level exception support directly. Nonetheless, knowing how it works is still useful, if only for debugging purposes.

I’m not covering the details of how all of the C++ exception handling framework is built on top of the unwind handling routines in this post series; there is no further OS-level “awareness” of C++ exceptions within the OS’s unwind related support routines, however.

Next up: Tying it all together, or using x64’s improved exception handling support to our gain in the real world.

Things that don’t quite work right in Windows Vista

January 12th, 2007

Having switched to Windows Vista full-time, I’ve now had an opportunity to run into most of the little “daily annoyances” that detract from the general experience (at least, my experience, anyway – your mileage may vary). Many of these problems existed in Windows XP, but that doesn’t change the fact that they’re annoying and/or frustrating. Some of them are new to Vista. Following is a list of some of the big annoyances that bother me on a regular basis:

  1. Vista’s 1394 support is mostly broken. Specifically, the 1394 storage support (a la sbp2port.sys) in Vista is pretty much horrific. This sad state of affairs is carried over from Windows XP and Windows Server 2003, and isn’t really new to Vista, but it’s highly frustrating nonetheless. It seems that the 1394 storage support is prone to randomly breaking (in my experience), and when it breaks, it breaks badly; usually, you end up with either a bluescreen in sbp2port.sys due to it corrupting one of its internal data structures, or the I/O system grinds to a halt as all IRPs that make their way to a 1394 storage device get “stuck” forever, essentially permanently pinning (and freezing) any process that touches a 1394 storage device. Since there are an amazing number of things that do this indirectly by asking for information about storage volumes, this typically manifests itself as an inability to do practically anything (logon, logoff, start programs, and so forth).

    This problem is particularly prone to happening when you disconnect a 1394 device, but I’ve also seen it spontaneously happen without user interaction (and without cables becoming loose, as far as I can tell). I’ve experienced these problems on Windows XP, Windows Server 2003, and Windows Vista (x64 and x86), across a wide range of 1394 chipsets and 1394 storage devices.

    In order to recover from this breakage, a hard reboot is usually necessary, since shutting down is virtually impossible due to any I/Os hitting a 1394 device getting frozen indefinitely.

    This is really a shame, as 1394 is a pretty nice bus for storage devices. Other uses of 1394 on Windows are fairly stable; kernel debugging works fine, and networking used to work fine, although Vista removes support for IP/1394 (this negatively impacted me, as it was fairly handy for transferring large amounts of data between laptops, which typically have 1394 but not gigabit ethernet. Not the end of the world, but it is a feature that I used which disappeared out from under me).

  2. Hybrid sleep takes forever to complete the low power transition on a laptop with 2GB (or 1.5GB) of RAM. Microsoft advertises it as powering the system down in a few seconds for a laptop, but as far as I can tell, this is pure marketing fiction. It regularly takes 30-45 seconds (sometimes more) for hybrid sleep to finish doing its thing and transition the system to a low power state. I’ve observed this on at least two respectable laptops, so I’m fairly sure this isn’t just bad hardware but rather a limitation of the hibernate-based technology. Still, despite the annoyance factor, I think hybrid sleep is an overall improvement (especially with being able to swap out batteries without fear if your system went into suspend due to low power).
  3. The Windows Vista RDP client has an ill-advised default behavior relating to the selection of the account domain name sent to the remote system when logging on. Unlike previous RDP client versions, the Vista RDP client requires you to enter credentials before connecting. The problem here is that if you don’t explicitly specify a domain name in the username part of your name, and you aren’t connecting to the target system by its NetBIOS name, then your logon will virtually always fail. This is because the RDP client will prepend “hostname\” to the username sent to the remote system, where “hostname” is the hostname you tell the RDP client to connect to. This results in all sorts of stupidly broken scenarios, where mstsc provides an IP address as your logon domain, or a FQDN for a computer that isn’t a member of that domain, or a different FQDN that leads to a computer but doesn’t match the computer name. All of these will result in a failed logon, for no particularly good reason. The workaround is fairly simple: specify “\username” as your username in mstsc, and the local (SAM) account database of the remote system is used to log you on. Still, there is almost no reason not to default to this behavior…
  4. Credential saving in IE7 is unintuitive at best. Specifically, even if you check the “save my credentials” box, the credentials are mysteriously not used unless you explicitly put the target site in your Intranet zone. This is highly annoying until you figure out what’s going on, as the credential manager UI in the control panel shows your credentials as being saved, but they’re never actually used. At the very least, there needs to be some kind of UI feedback on the save credentials dialog if your current settings dictate that your saved credentials will never actually be used.
  5. Switching users will kill RAS (dial-up) connections. There used to be a nicely documented way to suppress this behavior on pre-Vista systems, but from disassembling and debugging Vista, it’s abundantly clear that there is no possible way to suppress this behavior in Vista (at least without patching rasmans.dll to not kill connections on logon/logoff). Vista does have an explicit exemption for a connection shared by Internet Connection Sharing (ICS), which prevents it from being killed on logon/logout, but even this is stupidly buggy: there are no checks done to make sure that dependent connections of the ICS RAS connection aren’t killed on logout. This is especially ironic, given all of the work that has been put into developing routing compartments for Vista and “Longhorn” (culminating in the nice advantages of them being completely ignored on Vista). For me, this is a real problem, as I often use a dependent RAS connection setup for remote Internet access: a dial-up connection through Bluetooth to my cell phone (EVDO) for physical Internet access, and then an L2TP/IPSec link on top of that for security. Unfortunately, this means that no matter what I do, each time I switch users in Vista, my mobile Internet access gets nuked. I’ve been pondering writing something to patch out this stupidness in rasmans.dll…
  6. Something seems to cause Vista to hold on to IP addresses after their associated adapter has gone down. I’ve noticed this with my Broadcom NetXtreme 57xx gigabit NIC; my previous laptop had a similar problem with its Broadcom 100mbit NIC as well. I don’t know if this is a Broadcom problem or an OS-level problem, but what ends up happening is that tcpip still claims ownership of the IP address that was assigned to you while that interface is disconnected (media unplugged). Normally, this isn’t a big problem, but if you have a setup where VPN’ing in and plugging into Ethernet will net you the same static IP on a network, you’ll occasionally run into bizarre problems where the VPN connection doesn’t get an IP address. This tends to be a result of Vista claiming that same IP address that you would be assigned via the VPN on an ethernet interface that is now media-disconnected (but was previously connected to the network in question and had the IP address in question). This problem isn’t new to Vista; I’ve seen it on Windows XP as well. I’ve pretty much given up on retaining the same IP for remoting and being plugged directly into my network as a result of this problem…
  7. You can’t use runas on explorer or IE anymore. This means you’re forced to use Fast User Switching for things that UAC doesn’t have a convenient UI for, which also means that you’re in trouble if you’re using RAS for Internet access. Doh.
  8. Managing RAS connections is a royal pain. In order to make a copy of a RAS connection, rename it from the default name (“original name – Copy”), and edit the properties (presumably, you wanted to actually change the connection without just duplicating it for no reason, right?), you need to go through no less than three elevation prompts in rapid succession. This isn’t so terrible if you’re logged on as an admin, but having to type your admin password three different times (if you’re logged on as a limited user, the “recommended” configuration) is just stupidly redundant and annoying.
  9. There is no way that I can tell to change the default domain name used in UAC elevation prompts (if you’re a non-admin user providing over-the-shoulder (OTS) credentials). This sucks if you’re on a domain, as the typical “recommended” scenario is that you would be logged on with a domain account (as a limited user) on the local system. If you need to perform an administrative task, then you elevate using a local admin account (presumably, you don’t give all your users domain admin accounts). Unfortunately, Vista always defaults to your domain as the logon domain for elevation prompts, instead of the local domain. This means that in the “recommended”, “secure” configuration, you have to type computername\ before your account name every single time you get an elevation prompt. This is a basic convenience thing; there really needs to be a way to change the default logon domain, as you pretty much always use one or the other, and if you aren’t using the network domain for elevation, you’re stuck doing completely redundant extra work each time you elevate.
  10. You can’t scroll down in the bottom pane of the “Networking” tab in Task Manager anymore, if you have enough adapters to cause a scroll bar to appear. Each time Task Manager refreshes, the listbox scrolls to the top and blows away whatever you were looking at. This (minor, but still annoying) issue is new to Vista.
  11. There is no way to modify Terminal Server connection ACLs. The tool to do this (tscc.msc) isn’t shipped on “workstation” systems. It’s there on Windows Server 2003, but not Windows XP. I suppose this is a “business decision” by Microsoft, but I happen to want to grant myself the ability to perform session connections from the command line or Task Manager without going through the switch user GUI. This is a pretty minor gripe, but it’s still something that wastes a bit of my time each time I need to switch sessions.

Okay, enough ranting about things that aren’t quite the way I would like in Vista. Despite these shortcomings (which you may or may not agree about), I still think that Vista’s worth having over XP (there are a whole lot of things that *do* work right in Vista, and plenty of useful new additions to boot). Nothing’s perfect, though, and improving awareness of issues can only improve the chances of them getting fixed, someday.

Oh, and if anyone’s got any suggestions or information about how to work around some of the problems I’ve talked about here, I’d be happy to hear them.

Don’t always trust the compiler… (or when reverse engineering comes in handy even when you’ve got source code)

January 10th, 2007

Usually, when a program breaks, you look for a bug in the program. On rare occasions, however, compilers have been known to malfunction.

I ran into such a problem recently. At my apartment, I have a video streaming system set up, wherein I have a TV tuner plugged into a dedicated desktop box. That desktop box has been set up to run VLC noninteractively in order to stream (broadcast) TV from the TV tuner onto my apartment LAN. Then, if I want to watch TV, all I have to do is pull up VLC at a computer and tell it to display the MPEG stream I have configured to be broadcast on my local network.

This works fairly well (although VLC isn’t without its quirks), and it’s got the nice side effect that I have a bit more flexibility as to where I want to watch TV without having to invest in extra hardware (beyond a TV tuner). Furthermore, I can even do silly things like put TV up on multiple monitors if I really wanted to, something not normally doable if you just use a “plain” TV set (the old fashioned way!).

Recently, though, one of my computers ceased being able to play the MPEG stream I was running over my network. Investigation showed that other computers on the LAN weren’t having problems displaying the stream; only this one system in particular wouldn’t play the stream correctly. When I connected VLC to the stream, I’d get a blank black screen with no audio or video. I checked out the VLC debug message log and found numerous instances of this log message:

warning: received buffer in the future

Hmm. It seemed like VLC was having timing-related problems that were causing it to drop frames. My first reaction was that VLC had some broken-ness relating to the handling of large uptimes (the system in question had recently exceeded the “49.7 day boundary”, wherein the value returned by GetTickCount, a count in milliseconds of time elapsed since the system booted, wraps around to zero). I set out to prove this assumption by setting a breakpoint on kernel32!GetTickCount in the debugger and attaching VLC to the stream. While GetTickCount was occasionally called, it turned out that it wasn’t being used in the critical code path in question.

So, I set out to find that log message in the VLC source code (VLC is open source). It turned out to be coming from a function relating to audio decoding (aout_DecPlay). The relevant code turned out to be as follows (reformatting by me):

[...]
if ( p_buffer->start_date > mdate() +
     p_input->i_pts_delay +
     AOUT_MAX_ADVANCE_TIME )
{
    msg_Warn( p_aout,
        "received buffer in the future ("I64Fd")",
        p_buffer->start_date - mdate() );

[...]

After logging this warning, the function in question drops the frame with the assumption that it is probably bogus due to bad timing information.

Clearly, there was nothing wrong with the stream itself, as I could still play it fine on other computers. In fact, restarting VLC on the affected computer, and even restarting the stream on the computer hosting it, did nothing to resolve the problem; every computer could play the stream, except for this one system (with a high uptime) that would always fail due to bad timing information.

In this case, it turns out that the mdate function is an internal VLC function used all over the place for high resolution timing. It returns a microsecond-precision counter that is monotonically increasing since VLC started (or, in the case of Win32 VLC, since Windows was started). I continued to suspect that something was wrong here (as the only system that was failing to play the stream had a fairly high uptime). Looking into the source for mdate, there were two code paths that could be taken on Win32: one that used GetTickCount for timing (though this code path does handle tick count wraparound), and another that utilizes QueryPerformanceCounter and QueryPerformanceFrequency for high resolution timing, if VLC thinks that the performance counter is slaved to the system timer clock. (Whether the latter is really a good thing to do on Windows at all is debatable; I would say no, but it appears to work for VLC.)

As I had already ruled out GetTickCount as being used in the timing-critical parts of VLC, I ignored the GetTickCount-related code path in mdate. This left the following segment of code in the Win32 version of mdate:

/**
 * Return high precision date
 *
 * Uses the gettimeofday() function when
 *  possible (1 MHz resolution) or the
 * ftime() function (1 kHz resolution).
 */
mtime_t mdate( void )
{
 /* We don't need the real date, just the value of
    a high precision timer */
 static mtime_t freq = I64C(-1);

 if( freq == I64C(-1) )
 {
  /* Extract from the Tcl source code:
   * (http://www.cs.man.ac.uk/fellowsd-bin/TIP/7.html)
   *
   * Some hardware abstraction layers use the CPU clock
   * in place of the real-time clock as a performance counter
   * reference.  This results in:
   * - inconsistent results among the processors on
   *   multi-processor systems.
   * - unpredictable changes in performance counter frequency
   *   on "gearshift" processors such as Transmeta and
   *   SpeedStep.
   * There seems to be no way to test whether the performance
   * counter is reliable, but a useful heuristic is that
   * if its frequency is 1.193182 MHz or 3.579545 MHz, it's
   * derived from a colorburst crystal and is therefore
   * the RTC rather than the TSC.  If it's anything else, we
   * presume that the performance counter is unreliable.
   */

  freq = ( QueryPerformanceFrequency( (LARGE_INTEGER *)&freq )
      && (freq == I64C(1193182) || freq == I64C(3579545) ) )
      ? freq : 0;
 }

 if( freq != 0 )
 {
  LARGE_INTEGER counter;
  QueryPerformanceCounter (&counter);

  /* Convert from (1/freq) to microsecond resolution */
  /* We need to split the division to avoid 63-bits
       overflow */
  lldiv_t d = lldiv (counter.QuadPart, freq);

  return (d.quot * 1000000)
    + ((d.rem * 1000000) / freq);
 }
[...]
}

This code isn’t all that hard to follow. The idea is that the first time around, mdate will check the performance counter frequency for the current system. If it is one of two magical values, then mdate will be configured to use the performance counter for timing. Otherwise, it is configured to use an alternate method (not shown here), which is based on GetTickCount. On the system in question, mdate was being set to use the performance counter and not GetTickCount.

Assuming that mdate has decided on using the system performance counter for timing purposes (which, again, I do not believe is a particularly good (portable) choice, though it does happen to work on my system), then mdate simply divides the counter value by the frequency (count of counter units per second), adjusted to return a microsecond value (hence the constant 1000000 values). The reason why the original author split up the division into two parts is evident from the comment; it is an effort to avoid an integer overflow when performing math on large quantities (it avoids multiplying an already very large (64-bit) value by 1000000 before the division, which might then exceed 64 bits in the resultant quantity). (In case you were wondering, lldiv is a 64-bit version of the standard C runtime function ldiv; that is, it performs an integral 64-bit division with remainder.)

Given this code, it would appear that mdate should be working fine. Just to be sure, though, I decided to double check what was going on in the debugger. Although VLC was built with gcc (and thus doesn’t ship with WinDbg-compatible symbol files), mdate is a function exported by one of the core VLC DLLs (libvlc.dll), so there wasn’t any great difficulty in setting a breakpoint on it with the debugger.

What I found was that mdate was in fact returning a strange value (to be precise, a large negative value – mtime_t is a signed 64-bit integer). Given the expression used in the audio decoding function snippet I listed above, it’s no surprise why that would break if mdate returned a negative value (and it’s a good assumption that other code in VLC would similarly break).

The relevant code portions for the actual implementation of mdate that gcc built were as so:

libvlc!mdate+0xe0:
62e20aa0 8d442428     lea     eax,[esp+0x28]
62e20aa4 890424       mov     [esp],eax
;
; QueryPerformanceCounter(&counter)
;
62e20aa7 e874640800   call    QueryPerformanceCounter
62e20aac 83ec04       sub     esp,0x4
62e20aaf b940420f00   mov     ecx,0xf4240 ; 1000000
62e20ab4 8b742428     mov     esi,[esp+0x28]
62e20ab8 8b7c242c     mov     edi,[esp+0x2c]
62e20abc 89f0         mov     eax,esi
62e20abe f7e1         mul     ecx
62e20ac0 89c1         mov     ecx,eax
62e20ac2 69c740420f00 imul    eax,edi,0xf4240 ; 1000000
62e20ac8 890c24       mov     [esp],ecx
62e20acb 8b3dcc7a2763 mov     edi,[freq.HighPart]
62e20ad1 8d3402       lea     esi,[edx+eax]
62e20ad4 897c240c     mov     [esp+0xc],edi
62e20ad8 8b15c87a2763 mov     edx,[freq.LowPart]
62e20ade 89742404     mov     [esp+0x4],esi
62e20ae2 89542408     mov     [esp+0x8],edx
;
; lldiv(...)
;
62e20ae6 e815983e00   call    lldiv
62e20aeb 8b5c2430     mov     ebx,[esp+0x30]
62e20aef 8b742434     mov     esi,[esp+0x34]
62e20af3 8b7c2438     mov     edi,[esp+0x38]
62e20af7 83c43c       add     esp,0x3c
62e20afa c3           ret

This bit of code might look a bit daunting at first, but it’s not too bad. Translated into C, it looks approximately like so:

LARGE_INTEGER counter, tmp;
lldiv_t d;

QueryPerformanceCounter(&counter);

tmp.LowPart  = counter.LowPart  * 1000000;
tmp.HighPart = counter.HighPart * 1000000 +
    (((unsigned __int64)counter.LowPart * 1000000) >> 32);

d = lldiv(tmp.QuadPart, freq);

return d.quot;

This code looks a little bit weird, though. It’s not exactly the same thing that we see in the VLC source code, even accounting for differences that might arise between the original C source code and reverse engineered C source code; in the compiled code, the expression in the return statement has been moved before the call to lldiv.

In fact, the code has been heavily optimized. The compiler (gcc, in this case) apparently assumed some knowledge about the inner workings of lldiv, and decided that it would be safe to pre-calculate an input value instead of performing post-calculations on the result of lldiv. At first glance, the calculations do appear to be equivalent; the compiler simply moved a multiplication around relative to a division with remainder, and basic algebra tells us that there isn’t anything wrong with doing that.

However, there’s one little complication: computers don’t really do “basic algebra”. Normally, in math, you typically assume an unlimited space for variables and intermediate values, but in computer-land, this isn’t really the case. Computers approximate the set of all integer values in a 32-bit (or 64-bit) number-space, and as a result, there is a cap on how large (or small) of an integer you can represent natively, at least without going to a lot of extra work to support truly arbitrarily large integers (as is often done in public key cryptography implementations).

Taking a closer look at this case, there is a problem; the optimizations done by gcc cause some of the intermediate values of this calculation to grow to be very large. While the end result might be equivalent in pure math, when dealing with computers, the rules change a bit due to the fact that we are dealing with an integer with a maximum size of 64 bits. Specifically, this ends up being a problem because the gcc-optimized version of mdate multiplies the raw value of “counter” by 1000000 (as opposed to multiplying the result of the first division by 1000000). Presumably, gcc has performed this optimization as multiply is fairly cheap as far as computers go (and division is fairly expensive in comparison).

Now, while one might naively assume that the original version of mdate and the instructions emitted by gcc are equivalent, with the above information in mind, it’s clear that this isn’t really the case for the entire range of values that might be returned by QueryPerformanceCounter. Specifically, if the counter value multiplied by 1000000 exceeds the range of a 64-bit integer, then the two versions of mdate will not return the same value, as in the second version, one of the intermediate values of this calculation will “wrap around” (and in fact, to make matters worse, mdate is dealing with signed 64-bit values here, which limits the size of an integer to 63 significant bits, with one bit reserved for the representation of the integer’s sign).

This can be experimentally confirmed in the debugger, as I previously alluded to. Stepping through mdate, around the call to lldiv specifically, we can see that the intermediate value has exceeded the limits of a 63-bit integer with sign bit:

0:007>
eax=d5ebcb40 ebx=00369e99 ecx=6f602e40
edx=00369e99 esi=d5f158ce edi=00000000
eip=62e20ae6 esp=01d0fbfc ebp=00b73068
iopl=0         ov up ei ng nz na pe cy
cs=001b  ss=0023  ds=0023  es=0023
fs=003b  gs=0000  efl=00000a83
libvlc!mdate+0x126:
62e20ae6 e815983e00       call    lldiv
0:007> dd @esp
01d0fbfc  6f602e40 d5f158ce 00369e99 00000000
01d0fc0c  00000060 01d0ffa8 77e6b7d0 77e6bb00
01d0fc1c  ffffffff 77e6bafd 5d29c231 00000e05
01d0fc2c  00000000 00b73068 00000000 62e2a544
01d0fc3c  00000f08 ffffffff 01d0fd50 00b72fd8
01d0fc4c  03acc670 00a49530 01d0fd50 6b941bd2
01d0fc5c  00a49530 00000f30 00000000 00000003
01d0fc6c  00b24130 00b53e20 0000000f 00b244a8
0:007> p
eax=e1083396 ebx=00369e99 ecx=ffffffff
edx=ffffff3a esi=d5f158ce edi=00000000
eip=62e20aeb esp=01d0fbfc ebp=00b73068
iopl=0         nv up ei pl nz na po nc
cs=001b  ss=0023  ds=0023  es=0023
fs=003b  gs=0000  efl=00000206
libvlc!mdate+0x12b:
62e20aeb 8b5c2430         mov     ebx,[esp+0x30]

Using our knowledge of calling conventions, it’s easy to retrieve the arguments to lldiv from the stack: tmp.QuadPart is 0xd5f158ce6f602e40, and freq is 0x0000000000369e99.

It’s clear that the intermediate value (tmp.QuadPart) has overflowed here; considering the sign bit is set, it now holds a (very large) negative quantity. Since the remainder of the function does nothing that would influence the sign bit of the result, after the division, we get another (large, but closer to zero) negative value back (stored in edx:eax, the value 0xffffff3ae1083396). This is the final return value of mdate, which explains the problems I was experiencing with playing video streams; large negative values were being returned, causing sign-sensitive inequality tests on the return value of mdate (or a derivative thereof) to behave unexpectedly.

In this case, it turned out that VLC failing to play my video stream wasn’t really the fault of VLC; it ended up being a bug in gcc’s optimizer that caused it to make unsafe optimizations that introduce calculation errors. What’s particularly insidious about this mis-optimization is that it is invisible until the input values for the operations involved grow to a certain size, after which calculation results are wildly off. This explains why nobody else has run into this problem in VLC enough to get it fixed by now; unless you run VLC on a Windows system with a high uptime, where VLC is convinced that it can use the performance counter for timing, you would never know that the compiler had introduced a subtle (but serious) bug due to optimizations.

As to fixing this problem, there are a couple of approaches the VLC team could take. The first is to update to a more recent version of gcc (if newer gcc versions fix this problem; I don’t have a build environment that would let me compile all of VLC, and I haven’t really had much luck in generating a minimal repro for this problem, unfortunately). Alternatively, the function could be rewritten until gcc’s optimizer decides to stop trying to optimize the division (and thus stops introducing calculation errors).

A better solution would be to just drop the usage of QueryPerformanceCounter entirely, though. For VLC’s usage, GetTickCount should be close enough timing-wise, and you can even increase the resolution of GetTickCount to around 1ms (with good hardware) using timeBeginPeriod. GetTickCount does have the infamous 49.7-day wraparound problem, but VLC already has a workaround for that which works. Furthermore, on Windows Vista and later, GetTickCount64 could be used, turning the 49.7-day limit into a virtual non-issue (at least in our lifetimes, anyway).

(Oh, and in case you’re wondering why I didn’t just fix this myself and submit a patch to VLC (after all, it’s open source, so why can’t I just “fix it myself”?): VLC’s source code distribution is ~100mb uncompressed, and I don’t really want to go spending a great deal of time to find a cygwin version that works correctly on Vista x64 with ASLR and NX enabled (cygwin’s fault, not Vista’s) so that I can get a build environment for VLC up and running so that I could test any potential fix I make (after debugging the inevitable build environment difficulties along the way for such a large project). I might still do this at some point, perhaps to see if recent gcc versions fix this optimizer bug, though.)

For now, I just patched my running VLC instance in-memory to use GetTickCount instead, using the debugger. Until I restart VLC, that will have to do.