Archive for the ‘Windows’ Category

Programming against the x64 exception handling support, part 5: Collided unwinds

Tuesday, January 9th, 2007

Previously, I discussed the internal workings of RtlUnwindEx. While that posting covered most of the inner details regarding unwind support, I didn’t fully cover some of the corner cases.

Specifically, I haven’t yet discussed just what a “collided unwind” really is, other than providing vague hints as to its existence. A collided unwind occurs when an unwind handler initiates a secondary unwind operation in the context of an unwind notification callback. In other words, a collided unwind is what occurs when, in the process of a stack unwind, one of the call frames changes the target of an unwind. This has several implications and requirements in order to operate as one might expect:

  1. Some unwind handlers that were on the original unwind path might no longer be called, depending on the new unwind target.
  2. The current unwind call stack leading into RtlUnwindEx will need to be interrupted.
  3. The new unwind operation should pick up where the old unwind operation left off. That is, the new unwind operation shouldn’t start unwinding the exception handler stack; instead, it must unwind the original stack, starting from the call frame after the unwind handler which initiated the new unwind operation.

Because of these conditions, the implementation of collided unwinds is a bit more complicated than one might expect. The main difficulty here is that the second unwind operation is initiated within the call stack of an existing unwind operation, but what the unwind handler “wants” to do is to unwind the stack that was already being unwound, except to a different target and with different parameters.

From an unwind handler’s perspective, all that needs to be done to accomplish this is to make a call to RtlUnwindEx in the context of an unwind handler callback for an unwind operation, and RtlUnwindEx magically takes care of all of the work necessary to make the collided unwind “just work”.

Allowing this sort of unwind operation to “just work” requires a bit of creative thinking from the perspective of RtlUnwindEx, however. The main difficulty here is that RtlUnwindEx, when called from the unwind handler, somehow needs a way to recover the original context that was being unwound in order to “pick up” where the original call to RtlUnwindEx “left off” (when it called an unwind handler that initiated a collided unwind). Because there is no provision for passing a context record to RtlUnwindEx and indicating that RtlUnwindEx should use it as a starting point for an unwind operation (RtlUnwindEx always initiates the unwind in the current call stack), this poses a problem; how is RtlUnwindEx to recover the original unwind parameters from where it should initiate the “real” unwind?

The way that Microsoft decided to solve this problem is an elegant little hack of sorts. The solution all comes down to that mysterious exception handler around RtlpExecuteHandlerForUnwind: RtlpUnwindHandler. Recall from the previous article that RtlUnwindEx calls RtlpExecuteHandlerForUnwind in order to invoke an exception handler for unwind purposes, and that RtlpExecuteHandlerForUnwind sets up an exception handler (RtlpUnwindHandler) before calling the requested exception handler for unwind. At the time, these extra steps (the use of RtlpExecuteHandlerForUnwind, and its exception handler) probably looked a bit redundant, and in the process of a “conventional” unwind operation, the extra work that RtlUnwindEx goes through before calling an unwind handler doesn’t even come into play as adding any value.

That all changes when a collided unwind occurs, however. In the collided unwind case, RtlpExecuteHandlerForUnwind and RtlpUnwindHandler are critical to solving the problem of how to recover the original unwind parameters so that RtlUnwindEx can perform an unwind operation on the correct call stack. In order to understand just how RtlpUnwindHandler and friends come into play with a collided unwind, it’s necessary to take a closer look at just what RtlUnwindEx will do when it is called from the context of an unwind handler.

Since RtlUnwindEx always begins a call frame unwind from the currently active call stack, the second call to RtlUnwindEx will start unwinding the call stack of the unwind handler that called RtlUnwindEx. But wait, you might say – this isn’t what is supposed to happen! It turns out that unwinding the unwind handler’s call stack will actually lead up to the “right thing” happening, through a bit of clever use of how “conventional” unwind operations work. To better understand what I mean, it’s helpful to look at the stack of a secondary call to RtlUnwindEx (initiating a collided unwind operation). For this purpose, I’ve put together a small program that initiates a collided unwind (more on how and why you might see a collided unwind in the “real world” later). I’ve set a breakpoint on RtlUnwindEx, and skipped forward until I encountered the nested call to RtlUnwindEx that was initiating a collided unwind operation:

0:000> k
Child-SP          Call Site
00000000`0012e058 ntdll!RtlUnwindEx
00000000`0012e060 ntdll!local_unwind+0x1c
00000000`0012e540 TestApp!`FaultingFunction2'::`1'::fin$2+0x34
00000000`0012e570 ntdll!_C_specific_handler+0x140
00000000`0012e5e0 ntdll!RtlpExecuteHandlerForUnwind+0xd
00000000`0012e610 ntdll!RtlUnwindEx+0x236
00000000`0012ec90 TestApp!UnwindExceptionHandler2+0xf8
00000000`0012f1b0 TestApp!`FaultingFunction2'::`1'::filt$1+0xe
[...]

At this point, given what we know about RtlUnwindEx, it will start unwinding the stack downward. Since the target of the collided unwind will by definition be lower in the stack than the unwind handler’s stack pointer itself, RtlUnwindEx will continue unwinding downward, calling unwind handlers (if any) for each successive frame. Taking a look at the call stack, we can determine that there are no frames with an exception handler marked for unwind (denoted by a [ U ] in the !fnseh output):

0:000> !fnseh ntdll!RtlUnwindEx
ntdll!RtlUnwindEx L295 22,0A [   ]  (none)
0:000> !fnseh ntdll!local_unwind
ntdll!local_unwind L24 07,02 [   ]  (none)
0:000> !fnseh 00000000`01001f04 
1001ed0 L3a 06,02 [   ]  (none)
0:000> !fnseh ntdll!_C_specific_handler+0x140
ntdll!_C_specific_handler L16a 20,0C [   ]  (none)

(Here, 00000000`01001f04 corresponds to TestApp!`FaultingFunction2'::`1'::fin$2+0x34).

Because none of these call frames have an exception handler marked for unwind callbacks, we can surmise that RtlUnwindEx will blissfully unwind past all of these call frames just as one might expect. At this point, RtlUnwindEx is still unwinding the “wrong” stack though; we’d like it to be unwinding the stack passed to the original call to RtlUnwindEx, and not the unwind/exception handler call stack.

Something that one might not immediately expect happens when RtlUnwindEx reaches the next frame, however. Remember that the current call frame is now _C_specific_handler – the C-language exception handler for the current function that was originally being unwound after an exception occurred. This means that the next call frame will be the original RtlUnwindEx, or more precisely, RtlpExecuteHandlerForUnwind.

This is where RtlpExecuteHandlerForUnwind and RtlpUnwindHandler get to shine. If we take a look at the next call frame in the debugger, we see that it is indeed RtlpExecuteHandlerForUnwind, and that it also has (as expected) an exception handler marked for unwind support: RtlpUnwindHandler.

0:000> !fnseh ntdll!RtlpExecuteHandlerForUnwind+0xd
ntdll!RtlpExecuteHandlerForUnwind L13 04,01 [EU ]
  ntdll!RtlpUnwindHandler (assembler/unknown)
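
As an aside, the [EU ] marker in the output above presumably mirrors the handler flags stored in the function’s UNWIND_INFO. The following is a minimal sketch of decoding those flags, assuming the documented x64 layout in which the first byte of UNWIND_INFO packs a 3-bit version and 5 bits of flags. FrameHasUnwindHandler is a hypothetical helper name used only for illustration; it is not part of any Windows API.

```c
#include <assert.h>
#include <stdint.h>

/* Flag values from the documented x64 UNWIND_INFO format. */
#define UNW_FLAG_EHANDLER  0x1u  /* handler is called during exception dispatch */
#define UNW_FLAG_UHANDLER  0x2u  /* handler is called during unwind */
#define UNW_FLAG_CHAININFO 0x4u  /* entry chains to another RUNTIME_FUNCTION */

/* The first byte of UNWIND_INFO holds Version (low 3 bits) and Flags (high 5 bits). */
static unsigned UnwindInfoVersion(uint8_t first_byte) { return first_byte & 0x7u; }
static unsigned UnwindInfoFlags(uint8_t first_byte)   { return (unsigned)first_byte >> 3; }

/* Hypothetical helper: would an unwind pass have a handler to call for this frame? */
static int FrameHasUnwindHandler(uint8_t first_byte)
{
    unsigned version = UnwindInfoVersion(first_byte);
    assert(version == 1 || version == 2);  /* the versions documented so far */
    return (UnwindInfoFlags(first_byte) & UNW_FLAG_UHANDLER) != 0;
}
```

Under this layout, a first byte of 0x19 (version 1 with both handler flags set) would correspond to the [EU ] case, while 0x01 would correspond to the [   ] frames listed earlier.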

Because this call frame does have an exception handler that supports unwind callouts, it will be returned to RtlUnwindEx by RtlVirtualUnwind. This, in turn, will lead to RtlUnwindEx calling RtlpUnwindHandler, as registered by RtlpExecuteHandlerForUnwind in the original call stack (by RtlUnwindEx). We can verify this in the debugger:

0:000> bp ntdll!RtlpUnwindHandler
0:000> g
Breakpoint 1 hit
ntdll!RtlpUnwindHandler:
00000000`779507e0 488b4220  mov rax,qword ptr [rdx+20h]
0:000> k
Child-SP          Call Site
00000000`0012d9a8 ntdll!RtlpUnwindHandler
00000000`0012d9b0 ntdll!RtlpExecuteHandlerForUnwind+0xd
00000000`0012d9e0 ntdll!RtlUnwindEx+0x236
00000000`0012e060 ntdll!local_unwind+0x1c
00000000`0012e540 TestApp!`FaultingFunction2'::`1'::fin$2+0x34
00000000`0012e570 ntdll!_C_specific_handler+0x140
00000000`0012e5e0 ntdll!RtlpExecuteHandlerForUnwind+0xd
00000000`0012e610 ntdll!RtlUnwindEx+0x236
00000000`0012ec90 TestApp!UnwindExceptionHandler2+0xf8
00000000`0012f1b0 TestApp!`FaultingFunction2'::`1'::filt$1+0xe
[...]

This is where things start to get a little interesting. From the discussion in the previous article, we know that RtlpUnwindHandler essentially does the following:

  1. Retrieve the PDISPATCHER_CONTEXT argument that RtlpExecuteHandlerForUnwind (the original instance, from the original unwind operation initiated by the first call to RtlUnwindEx) saved on its stack. This is done via the use of the EstablisherFrame argument to RtlpUnwindHandler.
  2. Copy the contents of RtlpExecuteHandlerForUnwind’s DISPATCHER_CONTEXT over the DISPATCHER_CONTEXT of the current RtlUnwindEx instance, through the PDISPATCHER_CONTEXT argument provided to RtlpUnwindHandler. Note that the TargetIp member of the DISPATCHER_CONTEXT is not copied from RtlpExecuteHandlerForUnwind’s DISPATCHER_CONTEXT.
  3. Return the manifest ExceptionCollidedUnwind constant to the caller (RtlpExecuteHandlerForUnwind, which will in turn return this value to RtlUnwindEx).
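
The three steps above can be sketched in portable C. This is only a model under stated assumptions: ModelDispatcherContext is a simplified stand-in for DISPATCHER_CONTEXT, the establisher frame is modeled as an array of pointer-sized slots, and the slot index is chosen to match the [rdx+20h] load seen at the start of RtlpUnwindHandler in the debugger output. The text only says that TargetIp is omitted from the copy, so this sketch copies every other field.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Mirrors the ordering of EXCEPTION_DISPOSITION. */
typedef enum {
    ExceptionContinueExecution,
    ExceptionContinueSearch,
    ExceptionNestedException,
    ExceptionCollidedUnwind
} ModelDisposition;

/* Simplified stand-in for DISPATCHER_CONTEXT (illustration only). */
typedef struct ModelDispatcherContext {
    uint64_t ControlPc;
    uint64_t ImageBase;
    const void *FunctionEntry;
    uint64_t EstablisherFrame;
    uint64_t TargetIp;
    void *ContextRecord;
    void *LanguageHandler;
    void *HandlerData;
    void *HistoryTable;
    uint32_t ScopeIndex;
} ModelDispatcherContext;

/* Slot 4 of 8-byte slots corresponds to byte offset 0x20 on x64. */
#define SAVED_DC_SLOT 4

/* Model of the handler: establisher_frame points at the helper's frame,
   current is the DISPATCHER_CONTEXT of the nested RtlUnwindEx call. */
static ModelDisposition ModelUnwindHandler(const uintptr_t *establisher_frame,
                                           ModelDispatcherContext *current)
{
    /* Step 1: recover the pointer the original helper saved on its stack. */
    const ModelDispatcherContext *original =
        (const ModelDispatcherContext *)establisher_frame[SAVED_DC_SLOT];
    assert(original != NULL && current != NULL);

    /* Step 2: overwrite the nested call's state, but keep its own TargetIp. */
    uint64_t new_target_ip = current->TargetIp;
    *current = *original;
    current->TargetIp = new_target_ip;

    /* Step 3: tell RtlUnwindEx that the unwinds have collided. */
    return ExceptionCollidedUnwind;
}
```

Keeping the nested call’s TargetIp is what lets the second unwind resume on the original stack while still landing at the new target.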

After all this is done, control returns to RtlUnwindEx. Because RtlpExecuteHandlerForUnwind returned ExceptionCollidedUnwind, though, a previously unused code path is activated. This code path (as described previously) copies the contents of the DISPATCHER_CONTEXT structure whose address was passed to RtlpExecuteHandlerForUnwind back into the internal state of RtlUnwindEx (including the context record), and then attempts to re-start unwinding of the current stack frame.

If you’ve been paying attention so far, then you probably understand what is going to happen next.

Because RtlpUnwindHandler copied the DISPATCHER_CONTEXT from the original call to RtlUnwindEx over the DISPATCHER_CONTEXT from the current (collided unwind) call to RtlUnwindEx, the current instance of RtlUnwindEx now has access to all of the state information that the original RtlUnwindEx instance had placed into the PDISPATCHER_CONTEXT passed to RtlpExecuteHandlerForUnwind. Most importantly, this includes access to the original context record describing the call frame that the original instance of RtlUnwindEx was in the process of unwinding.

Since all of this information has now been copied over the current RtlUnwindEx instance’s internal state, in effect, the current instance of RtlUnwindEx will (for the next unwind iteration) start unwinding the stack where the original RtlUnwindEx instance stopped; in other words, the stack being unwound “jumps” from the currently active call stack to the exception (or other) call stack that was originally being unwound.

At this point, the second instance of RtlUnwindEx is all set up to unwind the call stack to the new unwind target frame (and target instruction pointer; remember that TargetIp was omitted from the copying performed on the PDISPATCHER_CONTEXT in RtlpUnwindHandler) like a “conventional” unwind. The rest is, as they say, history.

Now that we know how collided unwinds work, it is important to know when one would ever see such a thing (after all, interrupting an unwind in-progress is a fairly invasive and atypical operation).

It turns out that collided unwinds are not quite as far-fetched as they might seem; the easiest way to cause such an event is to do something sleazy like execute a return/goto/continue/break to transfer control out of a __finally block. This, in effect, requires that the compiler stop the current unwind operation and transfer control to the target location (which is usually within the function that contained the __finally that the programmer jumped out of). Nevertheless, the compiler still has to deal with the fact that it has been called in the context of an unwind operation, and as such it needs a way to “break out” of the unwind call stack. This is done by executing a “local unwind”, or an unwind to a location within the current function. In order to do this, the compiler calls a small, runtime-supplied helper function known as local_unwind. This function is described below, and is essentially an extremely thin wrapper around RtlUnwindEx that, in practice, adds no value other than providing some default argument values (and scratch space on the stack for RtlUnwindEx to use to store a CONTEXT structure):

0:000> uf ntdll!local_unwind
ntdll!local_unwind:
00000000`7796f580 4881ecd8040000 sub  rsp,4D8h
00000000`7796f587 4d33c0         xor  r8,r8
00000000`7796f58a 4d33c9         xor  r9,r9
00000000`7796f58d 4889642420     mov  qword ptr [rsp+20h],rsp
00000000`7796f592 4c89442428     mov  qword ptr [rsp+28h],r8
00000000`7796f597 e844b0fdff     call ntdll!RtlUnwindEx
00000000`7796f59c 4881c4d8040000 add  rsp,4D8h
00000000`7796f5a3 c3             ret
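
Read as C, the disassembly above amounts to roughly the following. This is a sketch rather than the real routine: StubRtlUnwindEx is a hypothetical recording stub that stands in for RtlUnwindEx (which would not return on success), and the scratch buffer size simply reuses the 0x4D8 bytes reserved by the prologue.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Arguments captured by the stub, so the defaults can be inspected. */
typedef struct UnwindCall {
    uint64_t target_frame;
    uint64_t target_ip;
    const void *exception_record;
    const void *return_value;
    int has_context;
    const void *history_table;
} UnwindCall;

static UnwindCall g_last_call;

/* Hypothetical stand-in for RtlUnwindEx: records its arguments and returns. */
static void StubRtlUnwindEx(uint64_t target_frame, uint64_t target_ip,
                            const void *exception_record, const void *return_value,
                            void *original_context, const void *history_table)
{
    assert(original_context != NULL);  /* scratch space for a CONTEXT is required */
    g_last_call.target_frame = target_frame;
    g_last_call.target_ip = target_ip;
    g_last_call.exception_record = exception_record;
    g_last_call.return_value = return_value;
    g_last_call.has_context = 1;
    g_last_call.history_table = history_table;
}

/* Model of local_unwind: rcx and rdx pass straight through, everything else defaults. */
static void ModelLocalUnwind(uint64_t target_frame, uint64_t target_ip)
{
    unsigned char context_scratch[0x4D8];  /* stack space reserved by "sub rsp,4D8h" */
    StubRtlUnwindEx(target_frame, target_ip,
                    NULL,             /* r8  = 0: no ExceptionRecord */
                    NULL,             /* r9  = 0: no ReturnValue */
                    context_scratch,  /* [rsp+20h]: OriginalContext scratch space */
                    NULL);            /* [rsp+28h] = 0: no HistoryTable */
}
```

In other words, the caller only ever chooses the target frame and target instruction pointer; every other argument takes its default.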

When the compiler calls local_unwind as a result of the programmer breaking out of a __finally block in some fashion, then execution will eventually end up in RtlUnwindEx. From there, RtlUnwindEx eventually detects the operation as a collided unwind, once it unwinds past the original call to the original unwind handler that started the new unwind operation via local_unwind.

As a result, breaking out of a __finally block instead of allowing it to run to completion (which may result in control being transferred out of the “current function”, from the programmer’s perspective, and “into” the next function in the call stack for unwind processing) is how every-day programs can end up causing a collided unwind.

Next time: More unwind esoterica, including details on how RtlUnwindEx and RtlRestoreContext lay the groundwork used to build C++ exception handling support.

Programming against the x64 exception handling support, part 4: Unwind internals (RtlUnwindEx implementation)

Monday, January 8th, 2007

In the previous article in this series, I discussed the external interface exposed by RtlUnwindEx (and some of how unwinding works at a high level). This posting continues that discussion, and aims to provide insight into the internal workings of RtlUnwindEx (and as such, the inner details of all of the different aspects of unwind support on x64 Windows).

As previously described, the main behavior of RtlUnwindEx is to systematically unwind call frames (with the help of RtlVirtualUnwind) until a specific call frame, which is identified by the TargetFrame argument, is reached. RtlUnwindEx is also responsible for all interactions with language exception handlers for purposes of unwind operations. Additionally, RtlUnwindEx also imposes various validations and restrictions on execution contexts being unwound, and on the behavior of exception handlers being called for an unwind operation.

The first order of business within RtlUnwindEx is to capture the execution context at the time of the call to RtlUnwindEx (specifically, the execution context inside RtlUnwindEx, not of the caller of RtlUnwindEx). This is done with the aid of two helper functions, RtlpGetStackLimits (which retrieves the bounds of the stack for the current thread from the NT_TIB region of the current thread’s TEB), and RtlCaptureContext (which records the complete execution context of its caller within a standard CONTEXT structure). Additionally, if an unwind history table is supplied, a special flag is set in it that optimizes the behavior of subsequent calls to RtlLookupFunctionEntry for lookups that are unwind-driven (this is a behavior new to Windows Vista, and is a further attempt to improve the performance of unwind support on x64).

If the caller did not supply an EXCEPTION_RECORD argument, RtlUnwindEx will create the default STATUS_UNWIND exception record at this time and substitute it for what would have otherwise been a caller-supplied EXCEPTION_RECORD block. The exception record is initialized with an ExceptionAddress pointing to the Rip value captured previously by RtlCaptureContext, and with no parameters. Additionally, an initial ExceptionFlags value of EXCEPTION_UNWINDING is set, to later indicate to any exception handlers that might be called that an unwind operation is in progress (the EXCEPTION_RECORD pointer, either caller supplied or locally allocated by RtlUnwindEx in the absence of a caller-supplied value, corresponds exactly to the EXCEPTION_RECORD argument passed to any LanguageHandler that is called during unwind processing).

In the event that the caller of RtlUnwindEx did not supply a TargetFrame argument (indicating that the requested unwind operation is an exit unwind), then the EXCEPTION_EXIT_UNWIND flag is set within RtlUnwindEx’s internal ExceptionFlags value. An exit unwind is a special form of unwind where the “target” of the unwind is unknown; in other words, the caller does not have a valid target frame pointer to supply to RtlUnwindEx. Initiating an exit unwind is normally dangerous unless the caller has special knowledge of an unwind handler in the call stack that will halt the unwind operation prematurely (either by initiating a secondary unwind, which leads to what is called a collided unwind, or by exiting the thread entirely). The reason for this restriction is that as RtlUnwindEx doesn’t have a clear “stopping point” to halt the unwind cycle at, it will happily unwind past the end of the stack (typically resulting in an access violation) unless an unwind handler along the way does something to halt the unwind. Most unwind operations are not exit unwinds.
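
The setup described in the last two paragraphs can be sketched as follows. ModelExceptionRecord is a trimmed stand-in for EXCEPTION_RECORD, and ChooseUnwindRecord and InitialUnwindFlags are hypothetical names used only for this illustration; the constant values are the ones defined in the Windows headers.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define STATUS_UNWIND          0xC0000027u
#define EXCEPTION_UNWINDING    0x02u
#define EXCEPTION_EXIT_UNWIND  0x04u

/* Trimmed stand-in for EXCEPTION_RECORD (illustration only). */
typedef struct ModelExceptionRecord {
    uint32_t ExceptionCode;
    uint32_t ExceptionFlags;   /* filled in from the internal flags before each handler call */
    uint64_t ExceptionAddress;
    uint32_t NumberParameters;
} ModelExceptionRecord;

/* Use the caller's record if there is one; otherwise build the default
   STATUS_UNWIND record in local storage. */
static const ModelExceptionRecord *ChooseUnwindRecord(
    const ModelExceptionRecord *caller_record,
    ModelExceptionRecord *local_record,
    uint64_t captured_rip)
{
    assert(local_record != NULL);
    if (caller_record != NULL)
        return caller_record;

    local_record->ExceptionCode = STATUS_UNWIND;
    local_record->ExceptionFlags = 0;
    local_record->ExceptionAddress = captured_rip;  /* Rip from RtlCaptureContext */
    local_record->NumberParameters = 0;
    return local_record;
}

/* Internal flags value: always unwinding, plus exit unwind when no target frame is given. */
static uint32_t InitialUnwindFlags(uint64_t target_frame)
{
    uint32_t flags = EXCEPTION_UNWINDING;
    if (target_frame == 0)
        flags |= EXCEPTION_EXIT_UNWIND;
    return flags;
}
```

A call such as InitialUnwindFlags(0) models the exit unwind case, where both flags end up set.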

At this point, RtlUnwindEx is set up to enter the main loop of the unwind algorithm, which essentially involves repeated calls to RtlVirtualUnwind, and then to unwind handlers (if present). This main loop involves multiple steps:

  1. The RUNTIME_FUNCTION entry for the current frame (given by the Rip member of the context record captured above, and later updated in this loop) is located via RtlLookupFunctionEntry. If no function entry is present, then RtlUnwindEx will load Context->Rip with a ULONG64 value located at Context->Rsp, and then increment Context->Rsp by 8. The behavior when there is no RUNTIME_FUNCTION entry present accounts for leaf functions, for which unwind metadata is optional. If the current frame is a leaf function, then control skips forward to step 8.
  2. Assuming that a RUNTIME_FUNCTION was found, RtlUnwindEx makes a copy of the current execution context that will be unwound – something I call the “unwind context”. After duplicating the context (via the RtlpCopyContext helper function, which only duplicates the non-volatile context), RtlVirtualUnwind is called (with the unwind context), and requested to return the address of any associated language handler that is marked for unwind support. RtlVirtualUnwind thus returns several useful pieces of information; a language handler supporting unwind (if any), an updated context describing the caller of the requested call frame, a language-handler-specific (i.e. C scope table) data pointer associated with the requested call frame (if any), and the stack pointer of the call frame being unwound (the establisher frame). These pieces of information are used later in communication with a returned exception handler with unwind support, if one exists.
  3. After calling RtlVirtualUnwind to establish the context of the next location on the stack frame (now contained within the “unwind context” location), RtlUnwindEx performs some validation of the returned EstablisherFrame value. Specifically, the EstablisherFrame value is ensured to be 8-byte aligned and within the stack limits of the current thread (in kernel mode, there is also special support for handling the case of an unwind occurring within the context of a DPC, which may operate under a secondary stack). If either of these conditions does not hold true, a STATUS_BAD_STACK exception is raised, indicating that the stack pointer in the requested call frame is damaged or corrupted. Additionally, if a TargetFrame value is specified (that is, the unwind operation is not an exit unwind), then the TargetFrame value is validated to be greater than or equal to the EstablisherFrame value returned by RtlVirtualUnwind. This is, in effect, a sanity check designed to ensure that the unwind target actually refers to a previous call frame and not one that has already been unwound. If this check fails, then a STATUS_BAD_STACK exception is raised.
  4. If a language handler was returned by RtlVirtualUnwind, then RtlUnwindEx sets up for a call to the language handler. This involves the initial setup of a DISPATCHER_CONTEXT structure created on the stack of RtlUnwindEx. The DISPATCHER_CONTEXT structure describes some internal state that RtlUnwindEx shares with all participants in the unwind process, such as language handlers being called for unwind. It contains all of the state information necessary to coordinate operation between RtlUnwindEx and any language handler. Furthermore, it is also instrumental in the processing of collided unwinds; more on that later. The newly initialized DISPATCHER_CONTEXT contains two fields of significance, initially; the TargetIp field (which is simply a copy of the TargetIp argument to RtlUnwindEx), and the ScopeIndex field (which is zero initialized). Both of these fields are unused by RtlUnwindEx itself, and are simply available for the convenience of language handlers being called for an unwind operation. If no language handler was present for the requested call frame, then control skips forward to step 8.
  5. At this point, RtlUnwindEx is ready to make a call to an unwind handler. This first involves a quick check to determine whether the end of the unwind chain has been reached, through comparing the current frame’s EstablisherFrame value with the TargetFrame argument to RtlUnwindEx. If the two frame pointers match exactly, then the ExceptionFlags value passed in to the unwind handler has an additional bit set, EXCEPTION_TARGET_UNWIND. This flag bit lets the unwind handler know that it is the “last stop” in the unwind process (in other words, that there will be no further frame unwinds after the unwind handler finishes processing). At this point, the ReturnValue argument passed to RtlUnwindEx is copied into the Rax register image in the active context for the current frame (not the unwound context, which refers to the previous frame). Then, the last remaining fields of the DISPATCHER_CONTEXT structure are initialized based on the internal state of RtlUnwindEx; the image base, handler data, instruction pointer (ControlPc), function entry, establisher frame, and language handler values previously returned by RtlLookupFunctionEntry and RtlVirtualUnwind are copied into the DISPATCHER_CONTEXT structure, along with a pointer to the context record describing the execution state at the current frame. After the ExceptionFlags member of RtlUnwindEx’s EXCEPTION_RECORD structure is set, the stack-based exception flags image (from which the copy in the EXCEPTION_RECORD was made) has the EXCEPTION_TARGET_UNWIND and EXCEPTION_COLLIDED_UNWIND flags cleared, to ensure that these flags are not inadvertently passed to an exception routine in a future loop iteration.
  6. After preparing the DISPATCHER_CONTEXT for the unwind handler call, RtlUnwindEx makes a call to a small helper function, RtlpExecuteHandlerForUnwind. RtlpExecuteHandlerForUnwind is an assembly-language routine whose prototype matches that of the language specific handler, given below:
    typedef EXCEPTION_DISPOSITION (*PEXCEPTION_ROUTINE) (
        IN PEXCEPTION_RECORD               ExceptionRecord,
        IN ULONG64                         EstablisherFrame,
        IN OUT PCONTEXT                    ContextRecord,
        IN OUT struct _DISPATCHER_CONTEXT* DispatcherContext
    );

    RtlpExecuteHandlerForUnwind is fairly straightforward. All it does is store the DispatcherContext argument on the stack, and then make a call to the LanguageHandler member in the DISPATCHER_CONTEXT structure. RtlpExecuteHandlerForUnwind then returns the return value of the LanguageHandler itself.

    While this may seem like a rather useless helper routine at first, RtlpExecuteHandlerForUnwind actually does add some value, although it might not be immediately apparent unless one looks closely. Specifically, RtlpExecuteHandlerForUnwind registers an exception/unwind handler for itself (RtlpUnwindHandler). RtlpUnwindHandler does not go through _C_specific_handler; in other words, it is a raw exception handler registration. Like RtlpExecuteHandlerForUnwind, RtlpUnwindHandler is a raw assembly language routine. It, too, is fairly simple (and as a language-level exception handler routine, RtlpUnwindHandler is compatible with the LanguageHandler prototype described above); RtlpUnwindHandler uses the EstablisherFrame argument given to a LanguageHandler routine to locate the saved pointer to the DISPATCHER_CONTEXT on the stack of RtlpExecuteHandlerForUnwind, and then copies most of the DISPATCHER_CONTEXT structure passed to RtlpExecuteHandlerForUnwind over the DISPATCHER_CONTEXT structure that was passed to RtlpUnwindHandler itself (conspicuously omitted from the copy is the TargetIp member of the DISPATCHER_CONTEXT structure, for reasons that will become clear later). After performing the copy of the DISPATCHER_CONTEXT structure, RtlpUnwindHandler returns the manifest ExceptionCollidedUnwind constant. Although one might naively assume that all of this just leads up to protecting against the case of an unwind handler throwing an exception, it actually has a much more common (and significant) use; more on that later.

  7. After RtlpExecuteHandlerForUnwind returns, RtlUnwindEx decides what course of action to pursue based on the return value. There are two legal return values from an exception handler called for unwind, ExceptionContinueSearch (the general “success” return), and ExceptionCollidedUnwind. If any other value is returned, then RtlUnwindEx raises a STATUS_INVALID_DISPOSITION exception, indicating that an unwind handler has returned an illegal value (this is rarely seen in practice, as most unwind handlers are compiler generated, and therefore always get the return value correct). If ExceptionContinueSearch is returned, and the current EstablisherFrame doesn’t match the TargetFrame argument, then the unwind context and the context for the “current frame” are swapped (this positions the current frame context as referring to the context of the next function in the call chain, which will then be duplicated and unwound in the next loop iteration). If ExceptionCollidedUnwind is returned, then the execution path is a little bit more complicated. In the collided unwind case, all of the internal state information that RtlUnwindEx had previously copied into the DISPATCHER_CONTEXT structure passed to RtlpExecuteHandlerForUnwind is copied back out of the DISPATCHER_CONTEXT structure. RtlVirtualUnwind is then executed to determine the next lowest call frame using the context copied out of the DISPATCHER_CONTEXT structure, the EXCEPTION_COLLIDED_UNWIND flag is set, and control is transferred to step 5. This step may initially seem strange, but it will become clear after it is explained later.
  8. If control reaches this point, then a frame has been successfully unwound, and any applicable unwind handler has been notified of the unwind operation. The next step is a re-validation of the EstablisherFrame value (as it may have changed in the collided unwind case). Assuming that EstablisherFrame is valid, if its value does not match the TargetFrame argument, then control is transferred to step 1. Otherwise, if there is a match, then the loop terminates. (If the EstablisherFrame is not valid, and is not the expected TargetFrame value, then either the unwind exception record is raised as an exception, or a STATUS_BAD_FUNCTION_TABLE exception is raised.)
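
To make the control flow of the loop above easier to follow, here is a toy model of steps 3 through 8 in portable C. It is a sketch under heavy simplifying assumptions: frames come from a pre-built array rather than from RtlVirtualUnwind, raised exceptions are modeled as return codes, leaf functions and the collided unwind path are left out, and all of the type and function names (ModelFrame, ModelUnwind, RecordingHandler) are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define EXCEPTION_UNWINDING      0x02u
#define EXCEPTION_EXIT_UNWIND    0x04u
#define EXCEPTION_TARGET_UNWIND  0x20u

typedef enum { ExceptionContinueExecution, ExceptionContinueSearch,
               ExceptionNestedException, ExceptionCollidedUnwind } Disposition;
typedef enum { REACHED_TARGET, BAD_STACK, INVALID_DISPOSITION, RAN_OFF_STACK } UnwindResult;

typedef Disposition (*ModelHandler)(uint64_t establisher_frame, uint32_t flags);

typedef struct ModelFrame {
    uint64_t establisher_frame;   /* stack pointer of this call frame */
    ModelHandler unwind_handler;  /* NULL when no handler is marked for unwind */
} ModelFrame;

/* Sample handler for demonstration: remembers the flags it was called with. */
static uint32_t g_seen_flags[8];
static size_t g_seen_count;

static Disposition RecordingHandler(uint64_t establisher_frame, uint32_t flags)
{
    (void)establisher_frame;
    if (g_seen_count < 8)
        g_seen_flags[g_seen_count++] = flags;
    return ExceptionContinueSearch;
}

/* frames[0] is the innermost frame; a target_frame of 0 requests an exit unwind. */
static UnwindResult ModelUnwind(const ModelFrame *frames, size_t count,
                                uint64_t stack_low, uint64_t stack_high,
                                uint64_t target_frame)
{
    assert(frames != NULL || count == 0);
    uint32_t base_flags = EXCEPTION_UNWINDING;
    if (target_frame == 0)
        base_flags |= EXCEPTION_EXIT_UNWIND;

    for (size_t i = 0; i < count; i++) {
        uint64_t frame = frames[i].establisher_frame;
        int is_target = (target_frame != 0 && frame == target_frame);

        /* Step 3: alignment, stack limits, and "target not already unwound". */
        if ((frame & 7) != 0 || frame < stack_low || frame >= stack_high)
            return BAD_STACK;
        if (target_frame != 0 && target_frame < frame)
            return BAD_STACK;

        /* Steps 4-7: call the handler, flagging the last stop. */
        if (frames[i].unwind_handler != NULL) {
            uint32_t flags = base_flags | (is_target ? EXCEPTION_TARGET_UNWIND : 0u);
            if (frames[i].unwind_handler(frame, flags) != ExceptionContinueSearch)
                return INVALID_DISPOSITION;  /* collided unwind is not modeled here */
        }

        /* Step 8: stop at the target, otherwise move on to the next frame. */
        if (is_target)
            return REACHED_TARGET;
    }
    return RAN_OFF_STACK;  /* an exit unwind that nothing stopped */
}
```

Running the model over a few frames with RecordingHandler attached shows EXCEPTION_TARGET_UNWIND appearing only on the final handler call, and an exit unwind walking every frame until it runs out of stack.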

At this point, RtlUnwindEx has arrived at its target frame, and all intermediary unwind handlers have been called. It is now time to transfer control to the unwind point. The ReturnValue argument is again loaded into the current frame’s context (Rax register), and if the exception code supplied by the RtlUnwindEx caller via the ExceptionRecord argument does not match STATUS_UNWIND_CONSOLIDATE, the Rip value in the current frame’s context is replaced with the TargetIp argument.
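
A minimal sketch of this final fix-up, assuming a three-register stand-in for CONTEXT; ModelContext and FinalizeTargetContext are hypothetical names, and the status value is the one defined in the Windows headers.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define STATUS_UNWIND_CONSOLIDATE 0x80000029u

/* Tiny stand-in for CONTEXT, holding only the registers touched here. */
typedef struct ModelContext {
    uint64_t Rax;
    uint64_t Rip;
    uint64_t Rsp;
} ModelContext;

/* Prepare the target frame's context just before it is restored. */
static void FinalizeTargetContext(ModelContext *ctx, uint32_t exception_code,
                                  uint64_t return_value, uint64_t target_ip)
{
    assert(ctx != NULL);
    ctx->Rax = return_value;          /* ReturnValue is always loaded into Rax */
    if (exception_code != STATUS_UNWIND_CONSOLIDATE)
        ctx->Rip = target_ip;         /* normal case: resume at TargetIp */
}
```

In the consolidate case the model leaves Rip untouched, matching the description above.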

The final task is to realize the finalized context; this is done by calling RtlRestoreContext, passing it the current frame’s context and the ExceptionRecord argument (or the default exception record constructed if no ExceptionRecord argument was supplied). RtlRestoreContext will in most cases simply copy the given context into the currently active register set, although in two special cases (if a STATUS_LONGJUMP or STATUS_UNWIND_CONSOLIDATE exception code is set in the optional ExceptionRecord argument), this behavior deviates from the norm. In the long jump case (as previously documented), the ExceptionRecord argument is assumed to contain a pointer to a jmp_buf, which contains a nonvolatile register set to restore on top of the unwound context supplied by RtlUnwindEx. The unwind consolidate case is rather more complicated, and will be discussed in a future posting.
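
The three-way behavior of RtlRestoreContext described above can be summarized with a small classifier. This is only an illustration of the decision, not of the restore itself; RestoreKind and ClassifyRestore are hypothetical names, and the status values are the ones defined in the Windows headers.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define STATUS_LONGJUMP            0x80000026u
#define STATUS_UNWIND_CONSOLIDATE  0x80000029u

typedef enum { RESTORE_PLAIN, RESTORE_LONGJUMP, RESTORE_CONSOLIDATE } RestoreKind;

/* exception_code is NULL when no exception record was supplied. */
static RestoreKind ClassifyRestore(const uint32_t *exception_code)
{
    if (exception_code == NULL)
        return RESTORE_PLAIN;            /* just load the context into the registers */
    switch (*exception_code) {
    case STATUS_LONGJUMP:
        return RESTORE_LONGJUMP;         /* overlay nonvolatile registers from a jmp_buf */
    case STATUS_UNWIND_CONSOLIDATE:
        return RESTORE_CONSOLIDATE;      /* special case, discussed in a later posting */
    default:
        return RESTORE_PLAIN;
    }
}

/* Convenience wrapper that checks the classification is one of the known kinds. */
static int IsSpecialRestore(const uint32_t *exception_code)
{
    RestoreKind kind = ClassifyRestore(exception_code);
    assert(kind == RESTORE_PLAIN || kind == RESTORE_LONGJUMP || kind == RESTORE_CONSOLIDATE);
    return kind != RESTORE_PLAIN;
}
```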

For reference, I have posted some annotated, reverse engineered C and assembler code describing the internal operations of RtlUnwindEx and several of its helper functions (such as RtlpUnwindHandler). This C code is based off of the Windows Vista implementation of RtlUnwindEx, and as such takes advantage of new Windows Vista-specific optimizations to unwind handling. Specifically, the “Unwind” flag in the UNWIND_HISTORY_TABLE structure is new in Windows Vista (although the size of the structure has not changed; there used to be empty alignment padding at that offset in previous Windows versions). This flag is used as a hint to RtlLookupFunctionEntry, in order to expedite lookup of function entries for some commonly referenced functions in the unwind path. Between the provided comments and the above description of the overall functionality of RtlUnwindEx, the inner workings of it should start to become clear. There are some aspects (in particular, collided unwind) that are a bit more complicated than one might initially imagine; I’ll discuss collided unwinds (and more) in the next posting in this series.

It would be best to call the system version of RtlUnwindEx instead of reimplementing it for general purpose use (which I have done here primarily to illustrate how unwind works on x64 Windows). There have been improvements made to RtlUnwindEx between Windows Server 2003 SP1 x64 and Windows Vista x64, so it would be unwise to assume that RtlUnwindEx will remain devoid of new performance or feature additions forever.

Next up: Collided unwinds, and other things that go “bump” in the dark when you use compiler exception handling and unwind support.

Programming against the x64 exception handling support, part 3: Unwind internals (RtlUnwindEx interface)

Sunday, January 7th, 2007

Previously, I provided a brief overview of what each of the core APIs relating to x64’s extensive data-driven unwind support were, and when you might find them useful.

This post focuses on discussing the interface-level details of RtlUnwindEx, and how they relate to procedure unwinding on Windows (x64 versions, specifically, though most of the concepts apply to other architectures in principle).

The main workhorse of unwind support on x64 Windows is RtlUnwindEx. As previously described, this routine encapsulates all of the work necessary to restore execution context to a prior point in the call stack (relying on RtlVirtualUnwind for this task). RtlUnwindEx also implements all of the logic relating to interactions with unwind/exception handlers during the unwind process (which is essentially the value added by RtlUnwindEx on top of what RtlVirtualUnwind implements).

In order to understand the inner workings of how unwinding works, it is first necessary to understand the high level theory behind how RtlUnwindEx is used (as RtlUnwindEx is at the heart of unwind support on Windows). Although there have been previously posted articles that touch briefly on how unwind is implemented, none that I have seen include all of the details, which is something that this segment of the x64 exception handling series shall attempt to correct.

For the moment, it is simpler to just consider the unwind half of exception handling. The nitty-gritty, exhaustive details of how exceptions are handled and dispatched will be discussed in a future posting; for now, assume that we are only interested in the unwind code path.

When a procedure unwind is requested, by any place within the system, the first order of business is a call to RtlUnwindEx. The prototype for RtlUnwindEx was provided in a previous posting, but in an effort to ensure that everyone is on the same page with this discussion, here’s what it looks like for x64:

VOID
NTAPI
RtlUnwindEx(
   __in_opt ULONG64               TargetFrame,
   __in_opt ULONG64               TargetIp,
   __in_opt PEXCEPTION_RECORD     ExceptionRecord,
   __in     PVOID                 ReturnValue,
   __out    PCONTEXT              OriginalContext,
   __in_opt PUNWIND_HISTORY_TABLE HistoryTable
   );

These parameters deserve perhaps a bit more explanation.

  1. TargetFrame describes the stack pointer (rsp) value for the target of the unwind operation. In normal circumstances, this is always the EstablisherFrame argument to an exception handler that is handling an exception. In the context of an exception handler, EstablisherFrame refers to the stack pointer of the caller of the function that caused the exception being inspected. Likewise, in this context, TargetFrame refers to the stack pointer of the function that the call stack should be unwound to. Given data-driven unwind semantics, one might initially think that this argument is unnecessary (after all, one might assume that RtlUnwindEx could simply invoke RtlVirtualUnwind in order to determine the expected stack pointer value for the next function on the call stack). The argument is actually required, however, because RtlUnwindEx supports unwinding past multiple procedure frames; that is, RtlUnwindEx can be used to unwind to a function that is several levels down in the call stack, instead of the immediately lower function in the call stack. Note that the TargetFrame argument must match exactly the expected stack pointer value of the target function in the call stack.

    Observant readers may pick up on the SAL annotation describing the TargetFrame argument and notice that it is marked as optional. In general, TargetFrame is always supplied; it can be omitted in one specific circumstance, which is known as an exit unwind; more on that later.

  2. TargetIp serves a similar purpose to TargetFrame; it describes the instruction pointer value that execution should be unwound to. TargetIp must be an instruction in the same function on the call stack that corresponds to the target stack frame described by TargetFrame. This argument is supplied because a particular function may have multiple points that could be resumed in response to an exception (this is typically the case if there are multiple try/except clauses).

    Like TargetFrame, the TargetIp argument is also optional (though in most cases, it will be present). Specifically, if a frame consolidation unwind operation is being executed, then the TargetIp argument will be ignored by RtlUnwindEx and may be set to zero if desired (it will, however, still be passed to unwind handlers for use as they see fit). This specialized unwind operation will be discussed later, along with C++ exception support.

  3. ExceptionRecord is an optional argument describing the reason for an unwind operation. This is typically the same exception record that was indicated as the cause of an exception (if the caller is an exception handler), although it does not strictly have to be as such. If no exception record is supplied, RtlUnwindEx constructs a default exception record to pass on to unwind handlers, with an exception code of STATUS_UNWIND and an exception address referring to an instruction within RtlUnwindEx itself.
  4. ReturnValue describes a pointer-sized value that is to be placed in the return value register at the completion of an unwind operation, just before control is transferred to the newly unwound context. The interpretation of this value is entirely up to the routine being unwound into. In practice, the Microsoft C/C++ compiler does not use the return value at all in typical cases. Usually, the Microsoft C/C++ compiler will indicate the exception code that caused the exception as the return value, but due to how unwinding across functions works with try/except, there is no language-level support for retrieving the return value of a function that has been unwound due to an exception. As a result, in most circumstances, the return value placed in the unwound execution context based on this argument is ignored.
  5. OriginalContext describes an out-only pointer to a context record that is updated with the execution context as procedure call frames are unwound. In practice, as RtlUnwindEx does not ever “return” to its caller, this value is typically only provided as a way for a caller to supply its own storage to be used as scratch space by RtlUnwindEx during the intermediate unwind operations comprising an unwind to the target call frame. Typically, the context record passed in to an exception handler from the exception dispatcher is supplied. Because the initial contents of the OriginalContext argument are not used, however, this argument need not necessarily be the context record passed in from the exception dispatcher.
  6. HistoryTable describes a cache used to improve the performance of repeated function entry lookups via RtlLookupFunctionEntry. Under normal circumstances, this is the same history table passed in from the exception dispatcher to an exception handler, although it could also be a caller-allocated structure. This argument can also be safely omitted entirely, although if a non-trivial set of call frames is being unwound, passing in even a newly-initialized history table may improve performance.

Given all of the above information, RtlUnwindEx performs a procedure call unwind by performing a successive sequence of RtlVirtualUnwind calls (to determine the execution context of the next call frame in the call stack), followed by a call to the registered language handler for the call frame (if one exists and is marked for unwinding support). In most cases where there is a language unwind handler, it will point to _C_specific_handler, which internally searches all of the internal exception handling scopes (e.g. try/except or try/finally constructs), calling “finally” handlers as need be. There may also be internal unwind handlers that are present in the scope table for a particular function, such as for C++ destructor support (assuming asynchronous C++ exception handling has been enabled). Most users will thus interact with unwind handlers in the form of a “finally” handler in a try/finally construct in a function whose language handler refers to _C_specific_handler.

If RtlUnwindEx encounters a “leaf function” during the unwind process (a leaf function is a function that does not use the stack and calls no subfunctions), then it is possible that there will be no matching RUNTIME_FUNCTION entry for the current call frame returned by RtlLookupFunctionEntry. In this case, RtlUnwindEx assumes that the return address of the current call frame is at the current value of Rsp (and that the current call frame has no unwind or exception handlers). Because the x64 calling convention enforces hard rules as to what functions without RUNTIME_FUNCTION registrations can do with the stack, this is a valid assumption for RtlUnwindEx to make (and a necessary assumption, as there is no way to call RtlVirtualUnwind on a function with no matching RUNTIME_FUNCTION entry). The current call frame’s value of Rsp (in the context record describing the current call frame, not the register value of rsp itself within RtlUnwindEx) is dereferenced to locate the call frame’s return address (Rip value), and the saved Rsp value is then adjusted accordingly (increased by 8 bytes).
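The leaf function step described above is small enough to sketch directly. The following is a minimal illustration, not the actual RtlUnwindEx code: MINI_CONTEXT and UnwindLeafFrame are hypothetical names, and the structure models only the two CONTEXT fields that the step touches.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for the two CONTEXT fields that matter here. */
typedef struct MINI_CONTEXT {
    uint64_t Rip;
    uint64_t Rsp;
} MINI_CONTEXT;

/*
 * Step past a leaf frame: the return address is stored at [Rsp], and
 * "popping" it moves Rsp up by 8 bytes. Nothing else in the context
 * changes, because a leaf function may not alter the stack pointer or
 * any nonvolatile register.
 */
void UnwindLeafFrame(MINI_CONTEXT *Context)
{
    uint64_t ReturnAddress;

    assert(Context != NULL);

    /* Read the return address through the frame's saved stack pointer,
       not through the stack pointer of the code doing the unwinding. */
    memcpy(&ReturnAddress, (const void *)(uintptr_t)Context->Rsp,
           sizeof(ReturnAddress));

    Context->Rip = ReturnAddress;
    Context->Rsp += 8;
}
```

Note that the context being adjusted is the record describing the frame being unwound; the unwinder's own registers are never involved.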

When RtlUnwindEx locates the endpoint frame of the unwind, a special flag (EXCEPTION_TARGET_UNWIND) is set in the ExceptionFlags member of the EXCEPTION_RECORD passed to the language handler. This flag indicates to the language handler (and possibly any C-language scope handlers) that the handler is being called as the “final destination” of the unwind operation. The Microsoft C/C++ compiler does not expose functionality to detect whether a “finally” handler is being called in the context of a target unwind or if the “finally” handler is simply being called as an intermediate step towards the unwind target.

After the last unwind handler (if applicable) has been called, RtlUnwindEx restores the execution context that has been continually updated by successive calls to RtlVirtualUnwind. This restoration is performed by a call to RtlRestoreContext (a documented, exported function), which simply transfers a given context record to the thread’s execution context (thus “realizing” it).
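To tie the last few paragraphs together, here is a toy model of the frame walk: visit frames from innermost to outermost, call each registered unwind handler, and set EXCEPTION_TARGET_UNWIND only for the frame that matches the target. This is a sketch under simplifying assumptions, not the real implementation. The frame list is precomputed instead of being derived through RtlVirtualUnwind, no context is restored at the end, and TOY_FRAME, TOY_HANDLER, TOY_LOG, CountingHandler and ToyUnwind are all hypothetical names. The two flag values are the ones defined in winnt.h.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Flag values as defined in winnt.h. */
#define EXCEPTION_UNWINDING     0x02
#define EXCEPTION_TARGET_UNWIND 0x20

/* Hypothetical handler signature for this toy model. */
typedef void (*TOY_HANDLER)(uint64_t EstablisherFrame,
                            uint32_t ExceptionFlags, void *State);

/* Hypothetical description of one call frame. */
typedef struct TOY_FRAME {
    uint64_t    EstablisherFrame; /* stack pointer identifying the frame */
    TOY_HANDLER Handler;          /* NULL if the frame has no handler */
} TOY_FRAME;

/* Hypothetical bookkeeping used by the sample handler below. */
typedef struct TOY_LOG {
    int Calls;       /* number of handler invocations */
    int TargetCalls; /* invocations that saw EXCEPTION_TARGET_UNWIND */
} TOY_LOG;

/* Sample handler: records how it was called. */
void CountingHandler(uint64_t EstablisherFrame, uint32_t ExceptionFlags,
                     void *State)
{
    TOY_LOG *Log = State;

    (void)EstablisherFrame;
    Log->Calls++;
    if (ExceptionFlags & EXCEPTION_TARGET_UNWIND)
        Log->TargetCalls++;
}

/*
 * Walk Frames (innermost first) until TargetFrame is reached, calling
 * each handler on the way. Returns the index of the target frame, or -1
 * if it was never found.
 */
int ToyUnwind(const TOY_FRAME *Frames, size_t Count, uint64_t TargetFrame,
              void *State)
{
    assert(Frames != NULL || Count == 0);

    for (size_t i = 0; i < Count; i++) {
        uint32_t Flags = EXCEPTION_UNWINDING;

        /* Only the final frame is told that it is the unwind target. */
        if (Frames[i].EstablisherFrame == TargetFrame)
            Flags |= EXCEPTION_TARGET_UNWIND;

        if (Frames[i].Handler != NULL)
            Frames[i].Handler(Frames[i].EstablisherFrame, Flags, State);

        if (Flags & EXCEPTION_TARGET_UNWIND)
            return (int)i;
    }

    return -1;
}
```

Frames beyond the target are never visited, which is why handlers registered further up the stack do not run during the unwind.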

RtlUnwindEx does not return a value to its caller. In fact, it typically does not return to its caller at all; the only “return” path for RtlUnwindEx is in the case where the passed-in execution context is corrupted (typically due to a bogus stack pointer), or if an exception handler does something illegal (such as returning an unrecognized EXCEPTION_DISPOSITION value). In these cases, RtlUnwindEx will raise a noncontinuable exception describing the problem (via RtlRaiseStatus). These error conditions are usually fatal (and are indicative of something being seriously corrupted in the process), and virtually always result in the process being terminated. As a result, it is atypical for a caller of RtlUnwindEx to attempt to handle these error cases with an exception handler block.

In the case where RtlUnwindEx performs the requested unwind successfully, a new execution context describing the state at the requested (unwound) call frame is directly realized, and as such RtlUnwindEx does not ever truly return in the success case.

Although RtlUnwindEx is principally used in conjunction with exception handling, there are other use cases implemented by the Microsoft C/C++ compiler which internally rely upon RtlUnwindEx in unrelated capacities. Specifically, RtlUnwindEx implements the core of the standard setjmp and longjmp routines (assuming the exception safe versions of these are enabled by use of the <setjmpex.h> header file) provided by the C runtime library in the Microsoft CRT.

In the exception-safe setjmp/longjmp case, the jmp_buf argument essentially contains an abridged version of the execution context (specifically, volatile register values are omitted). When longjmp is called, the routine constructs an EXCEPTION_RECORD with STATUS_LONGJUMP as the exception code, sets up one exception information parameter (which is a pointer to the jmp_buf), and passes control to RtlUnwindEx (for the curious, the x64 version of the jmp_buf structure is described as _JUMP_BUFFER in setjmp.h under the _M_AMD64_ section). In this particular instance, the ReturnValue argument of RtlUnwindEx is significant; it corresponds to the value that is seemingly returned by setjmp when control is being transferred to the saved setjmp context as part of a longjmp call (somewhat similar in principle to how the UNIX fork system call indicates whether it is returning to the child process or the parent process). The internal operations of RtlUnwindEx are identical whether it is being used for the implementation of setjmp/longjmp, or for conventional exception-handler-based triggering of procedure call frame unwinding.
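The "seemingly returned by setjmp" behavior is easiest to see from the caller's side. The following sketch uses only standard C, so it says nothing about how a given runtime implements the jump (RtlUnwindEx-based or otherwise); it only shows the observable contract that the ReturnValue argument exists to satisfy. FailWithCode and TryOperation are hypothetical names.

```c
#include <assert.h>
#include <setjmp.h>

/* Hypothetical helper: abandons the current call chain with a code. */
void FailWithCode(jmp_buf JumpBuffer, int Code)
{
    assert(Code != 0); /* longjmp would turn a value of 0 into 1 */
    longjmp(JumpBuffer, Code);
}

/*
 * Runs a pretend operation. setjmp "returns" twice: once with 0 when it
 * is called directly, and once more with the value given to longjmp
 * when control is transferred back to the saved context.
 */
int TryOperation(int FailureCode)
{
    jmp_buf JumpBuffer;

    switch (setjmp(JumpBuffer)) {
    case 0:
        /* Direct return from setjmp: do the work. */
        if (FailureCode != 0)
            FailWithCode(JumpBuffer, FailureCode); /* does not return */
        return 0;
    case 1:
        return 100; /* resumed via longjmp(JumpBuffer, 1) */
    case 2:
        return 200; /* resumed via longjmp(JumpBuffer, 2) */
    default:
        return -1;  /* resumed with some other value */
    }
}
```

Here TryOperation(0) completes normally and returns 0, while TryOperation(2) re-enters the switch through longjmp and returns 200.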

However, there are differences that appear when RtlUnwindEx restores the execution context via RtlRestoreContext. There is special support inside RtlRestoreContext for STATUS_LONGJUMP exceptions with one exception information parameter; if this situation is detected, then RtlRestoreContext internally reinitializes portions of the passed-in context record based on the jmp_buf pointer stored in the exception information parameter block of the exception record provided to RtlRestoreContext by RtlUnwindEx. After this special-case partial reinitialization of the context record is complete, RtlRestoreContext realizes the context record as normal (causing execution control to be transferred to the stored Rip value). This can be seen as a hack (and a violation of abstraction layers; there is intended to be a logical separation between operating system level SEH support, and language level SEH support; this special support in RtlRestoreContext blurs the distinction between the two for C language support with the Microsoft C/C++ compiler). This layering violation is not the most egregious in the x64 exception handling scene, however.

This concludes the basic overview of the interface provided by RtlUnwindEx. There are some things that I have not yet covered, such as exit unwinds, collided unwinds, or the deep integration and support for C++ try/catch, and some of the highly unsavory things done in the name of C++ exception support. Next time: A walkthrough of the complete internal implementation of RtlUnwindEx, including undocumented, never-before-seen (or barely documented) corner cases like exit unwinds or collided unwinds (the internals of C++ exception support from the perspective of RtlUnwindEx are reserved for a future posting, due to size considerations).

Think before you optimize

Friday, December 29th, 2006

“Premature optimization is the root of all evil” is a famous quote in computer science, and it absolutely holds true. Before optimizing a problem, you must make sure that you are optimizing the bottleneck, and that your optimization doesn’t actually make things worse.

These rules may seem obvious, but not everyone adheres to them; you’d be surprised how many newsgroup postings I see where people are asking how to solve the wrong problem because they didn’t take the time to profile their program and locate their real bottleneck.

One example of this kind of premature (or perhaps just not all the way thought-through) optimization that bothers me on a daily basis is in the Microsoft Terminal Server client (mstsc.exe). Terminal Server is a remote windowing protocol, and as such it is designed to take great pains to improve responsiveness to users. In most cases, improving responsiveness over the network involves minimizing the amount of data sent from the server to the client. In this spirit, the designers of Terminal Server implemented an innocent-seeming optimization, wherein the Terminal Server client detects when it has been minimized. If this occurs, the Terminal Server client sends a special message to the server asking that it stop sending window updates to the client. When the user restores the Terminal Server client window, the server will resync with the client.

This may seem like a clever little optimization with no downsides at first, but it turns out that it actually worsens the user experience (at least in my opinion) when you look at things a little bit closer. First, there’s how Terminal Server resynchronizes with the client when the client requests that it again wants to receive windowing data. Windows follows the model of not saving data that can be recalculated on demand in its user interface design. In many ways, this is a perfectly valid model, and there are a number of valid reasons for it (especially given that as you open more windows, it starts to become non-trivially expensive to cache bitmap data for every window on the screen, even more so on the very low end systems that 16-bit Windows had to work on). As a result, when Windows wants to retrieve the contents of a window on screen, the typical course of action is that a WM_PAINT message is sent to the window. This message asks the window to draw itself into a device context, or a storage area where the bits can then be transferred to the screen, a printer, or any other visual display device.

If you’ve been paying attention, then you might be seeing where this starts to go wrong with Terminal Server. When you restore a minimized Terminal Server client window, the client asks the server to resynchronize. This is necessary because the server has since stopped sending updates to the client, which means that the client has to assume that its display data is now stale. In order to do this resynchronization, Terminal Server has to figure out what contents have changed on the overall desktop bitmap that describes the entire screen. Terminal Server is free to cache the entire contents of a session’s interactive desktop as a whole (and indeed this is necessary so that during resynchronization, the entire desktop doesn’t have to be transferred as a bitmap to the client). However, it still needs to compare the last copy of the bitmap that was sent to a client with the “current” view of the desktop. In order to do that, Terminal Server essentially does something along the lines of asking each visible window on the desktop to paint itself. Then, Terminal Server can update the client with new display data for each window.

The problem here is that many programs don’t repaint themselves in a very graceful fashion. Many programs have unpleasant tendencies like triggering multiple draw operations over the same region before the end result is achieved, something that manifests itself as a very slightly annoying flicker when a window repaints. Even Microsoft programs exhibit this problem; for instance, Visual Studio 2005 tends to do this, as does Internet Explorer when drawing pages with foreground images overlaid on background images.

Now, while this may be a minor annoyance when working locally, it turns out to be a big problem when a program is running over Terminal Server. What would otherwise be an innocuous flicker over the course of a couple of milliseconds on a “glass terminal” display turns into multiple draw commands being sent and realized over the network to the Terminal Server client. This translates to bandwidth waste as redundant draw commands are transmitted, and even worse, a lack of responsiveness when restoring a minimized Terminal Server client window (due to having to wait on the programs on Terminal Server to finish updating themselves in the resynchronization process). If you have several programs running on the Terminal Server, this can correspond to three or four seconds of waiting before the Terminal Server session is responsive to input from the client.

While this is annoying in and of itself, it may still not seem all that bad. After all, this problem only happens if you minimize and restore a window, and you generally don’t just minimize and restore windows all the time, right? It turns out that with the Terminal Server client, most people do just that, if they are working in fullscreen mode. Remember that fullscreen Terminal Server obscures the task bar on the physical client computer, and in many cases, results in task switching keystrokes such as Alt+Tab or the Windows key being sent to the remote Terminal Server session and not the physical client system. In order to switch to another program on the physical client computer, then, one needs to either minimize the Terminal Server window or (perhaps temporarily) take it out of fullscreen mode. At least for me, if I want to switch to a program on the physical client system, the logical choice is to hit the minimize button on the Terminal Server client info bar at the top of the fullscreen Terminal Server client window. Unfortunately, that little minimize button invokes the clever redraw optimization that stops the server from updating the client. This means that when I switch back to the Terminal Server session, I need to wait several seconds while programs running in the Terminal Server session finish redrawing themselves and transmitting the draw operations to the client (which is especially painful if you are dealing with bitmaps, such as Internet Explorer on a page with foreground images overlaying a background image).

As a result, thanks to somebody’s “clever optimization”, my Terminal Server sessions now take several seconds to come back when I switch away from them to work on something locally (perhaps to copy and paste some text from the remote system to my computer) and then switch back.

Now, Terminal Server is a great example of a highly optimized program on the whole (and it’s absolutely usable, don’t get me wrong about that). It beats the pants off of VNC and any of the other remote windowing systems that I have ever used any day of the week, for one. However, this just goes to show that even with the best of intentions, one little optimization can blow up in unintended (negative) ways if you are not careful.

Oh, and if you run into this little annoyance as frequently as I do, there is one thing that you may be able to do that alleviates it (at least looking to the future, anyway). When using the Windows Vista (or later) Terminal Server client to connect to a Windows Vista or Windows Server “Longhorn” Terminal Server (or Remote Desktop), you can prevent this lack of responsiveness when restoring minimized Terminal Server windows by enabling desktop composition on the Terminal Server connection. This may seem a bit counter-intuitive at first (enabling 3D transparent windows would sure make you think that a lot more data would need to be transferred, thus slowing down the experience as a whole), but if you are on a high-bandwidth, low-latency link to the target computer, it turns out that desktop composition improves responsiveness when restoring minimized Terminal Server windows. This is because with desktop composition enabled, Windows breaks from the traditional model of not saving data that you can recalculate. Instead, with desktop composition enabled, Windows will save the contents of all windows on the screen for future reference, so that if Windows needs to access the bits of a window, it doesn’t need to ask that window to redraw. (This allows all sorts of neat tricks, such as how you can have a window appearing to be drawn twice with the new Alt+Tab window on Windows Vista, with the live preview, without a major performance hit – try it out with a 3D game in windowed mode to see what I mean.) Because of this caching of window data, when resynchronizing with the client after a minimize and restore operation, the server end of Terminal Server doesn’t need to ask every program to redraw itself; all it needs to do is fetch the bits out of the cache that is created for each window by desktop composition (and thus the differences sent to the client will only show “real” differences, not multiple layers of a redraw operation. Try this with an Internet Explorer window open on a page with foreground images overlaying background images, and the difference is immediately visible between Terminal Server with desktop composition enabled and Terminal Server without desktop composition.) This means that there are no more painful multi-step-redraw operations that are visible in real time on the client, at least when it comes to pathological bitmap drawing cases, such as Internet Explorer (and no annoying flicker in the less severe cases).

Programming against the x64 exception handling support, part 2: A description of the new unwind APIs

Tuesday, December 19th, 2006

Last time, I described many of the structures and prototypes necessary to program against the new x64 exception handling (EH) support. This posting continues that series, and describes how to manually initiate an unwind of procedure frames (and when and why you might want to do this).

Because x64 has built-in support for data-driven unwinding, there are a great many interesting things that you can do with unwinding functions at arbitrary points in execution. Unlike x86, you don’t have to assume that all functions use a frame pointer (which is not the case in many programs), and you don’t need to call code with a certain register context set up in the correct way (with the right local variables at the right displacements from the stack pointer) in order to initiate an unwind of a function that had registered an unwind or exception handler.

If you’ve been reading some of my recent postings about performing stack traces on x86, then one of the first things that might come to mind is designing an approach that can create a “perfect” call stack in all situations without symbols. There are other benefits to this data-driven unwind approach, however, than simply being able to take accurate call stacks at arbitrary points in the execution process. For instance, there are particularly interesting benefits as far as instrumentation and code analysis go (such as an improved ability to detect most functions in an image programmatically with a great deal of certainty based on unwind data), and there are interesting implications for techniques such as function patching and modification on the fly as well.

First things first, however. The initial step is to get familiar with the new unwinding APIs that Win64 exposes on x64. Although these APIs can be manually duplicated by explicit parsing of all unwind information, I would recommend calling the APIs directly instead of doing all of the work to manually emulate unwinds yourself. The reason that I make that recommendation is that while the unwind metadata is documented, there is still a significant amount of work involved in reimplementing them from scratch, and the unwind APIs themselves are (mostly) documented on MSDN and thus unlikely to change.

There are several APIs in particular that you’ll frequently find yourself using for unwind support on x64. These APIs are available in both user mode and kernel mode, and (aside from a lack of support for dynamically generated unwind data) the two operating environments use exactly the same semantics for unwinding. Thus, for the most part, you can interact with unwind metadata in the same fashion for both user mode and kernel mode.

  1. RtlLookupFunctionEntry: The first API that you’ll likely end up having to call for any unwind-related operation is RtlLookupFunctionEntry. This routine is the basis of all unwind operations in that it allows the caller to translate a raw 64-bit RIP value into two important values: An image base for any associated image in the address space of the caller, and a pointer to the RUNTIME_FUNCTION structure associated with the RIP value passed in. For virtually all cases on x64, you’ll be able to retrieve a valid RUNTIME_FUNCTION structure for the current RIP value. The exception to this rule relates to what are known as leaf functions, or functions that both make no direct modifications to the stack pointer (or any nonvolatile registers), and do not call any subfunctions. For these leaf functions only, the emission of unwind metadata is optional by the compiler. To handle this case, it is typical to read the first ULONG64 from the current RSP value (i.e. the return address of the current leaf function). This address can then be passed to RtlLookupFunctionEntry. Because leaf functions do not touch any nonvolatile registers or alter the stack pointer or call any subfunctions, they can be safely skipped in the unwind process in this fashion. (Virtually all functions in a given x64 binary are non-leaf functions (otherwise known as frame functions), or functions that do not meet the previously described three criteria. In either case, however, the restrictions on leaf functions mean that they do not impact the ability to perform complete unwinds despite the lack of unwind metadata associated with them.)

    The typical usage case for RtlLookupFunctionEntry is simply to retrieve the function entry for the currently executing function. (For leaf functions, it may be necessary to retrieve the function entry for the caller, if there is no unwind metadata for the current function, as described above.) Then, the PRUNTIME_FUNCTION returned is typically passed to one of the “high level” unwind support routines, although if necessary, it can be manually interpreted directly (this is typically not required, however).

  2. RtlVirtualUnwind: The RtlVirtualUnwind API is the core of the Win64 x64 unwind support. This API implements the lowest level interface exposed for interacting with unwind metadata through a RUNTIME_FUNCTION. In particular, it implements all of the code necessary to interpret UNWIND_CODEs and adjust the stack and nonvolatile register context according to the unwind information specified via a RUNTIME_FUNCTION. It also has logic to locate the exception or unwind handler (if any) registered for a given function and return it to the caller.

    RtlVirtualUnwind provides the infrastructure upon which higher level exception and unwind handling support is implemented. It exposes the concept of a virtual unwind (as one might guess, given the routine’s name). The virtual unwind concept is one that is entirely new to x64 (and IA64), and does not exist in any form on x86. This is due entirely to the fact that IA64 and x64 have data-driven unwind support, while x86 has code-driven unwind support.

    The distinction is important in that on x64 and IA64, it is possible to simulate an unwind, at an arbitrary point in time, without running code with potentially unknown side effects (or unknown entry conditions, as with x86 exception or unwind handlers that utilize local variables). This is accomplished by interpreting the unwind codes described by a RUNTIME_FUNCTION and associated UNWIND_INFO blocks. This is the essence of what a virtual unwind is; a simulated unwind operation that can operate on an arbitrary, isolated register context without affecting (or otherwise impacting) the actual realized state of the program. In its purest form, a virtual unwind can be accomplished by invoking RtlVirtualUnwind with a register context that you wish to have the unwind applied to, and the UNW_FLAG_NHANDLER flag value for the HandlerType parameter (which indicates that the caller is not interested in any unwind or exception handlers registered by the function).

    This is a very powerful capability indeed, as it allows for a much more complete and thorough traversal of call frames than ever possible on x86. With the ability to describe and undo the changes to nonvolatile registers given an initial register context and stack, virtual unwinding allows programmatic, completely-reliable access to not only the return address, stack frame, and arguments of arbitrary functions at any point in an active call stack, but also access to nonvolatile register values at any point in a call stack. If you have ever debugged optimized code where parameter values and intermediate values are frequently only present in registers, then you can immediately see how valuable this particular benefit of virtual unwinding is to debugging (it is important to note that as volatile registers are not saved anywhere, it is not necessarily possible to reconstruct their values at any point in the call frame).

    It is also possible to use RtlVirtualUnwind to effect a “realized” unwind, and indeed, RtlVirtualUnwind is the cornerstone on which the rest of the unwinding architecture in Win64 x64 is built. By directing RtlVirtualUnwind to call unwind (or exception) handlers, as appropriate, and then further altering the returned context (such as by specifying a return value), it is possible to perform a complete “realized” unwind from a procedure at an arbitrary point in execution.

  3. RtlUnwindEx: RtlUnwindEx supplants the RtlUnwind API that exists on x86 for purposes of implementing a “hard unwind” that alters the realized execution state of the program. RtlUnwindEx is a natural extension of RtlUnwind that includes support for features new to 64-bit exception handling support. Unlike RtlUnwind, it can operate on a register context other than the current register context.

    RtlUnwindEx implements an unwind that calls all of the unwind handlers necessary to unwind to a particular point. It also adjusts the register context based on the unwind metadata at the given procedure frame being unwound. Internally, RtlUnwindEx is essentially implemented as a wrapper that calls RtlVirtualUnwind and registered unwind handlers as necessary for each frame in between the active frame and the target frame. It also houses all of the logic necessary to deal with some of the other subtleties of unwinding, such as detection of a bogus stack pointer value in the passed-in register context.

    RtlUnwindEx is useful if you need to execute a complete unwind (and only a complete unwind) of a particular procedure frame or set of procedure frames. In most cases where you would be doing this, it is usually sufficient to rely on the language-level exception handling support, so I consider RtlUnwindEx relatively uninteresting (at least when compared to RtlVirtualUnwind). Many of the more interesting use cases for directly calling the x64 exception handling support thus require the use of RtlVirtualUnwind directly (although selectively unwinding past certain procedure frames with complete support for calling unwind handlers is made easier by direct usage of RtlUnwindEx).

  4. RtlCaptureStackBackTrace: The RtlCaptureStackBackTrace routine is essentially a high-level implementation of a stack walking routine that utilizes the lower level unwind support (in particular, RtlVirtualUnwind). Unlike StackWalk64, RtlCaptureStackBackTrace is very light-weight and does not use symbols (it is implemented entirely with the unwind metadata present on x64). As such, it does not exist on x86. It is, however, handy for quickly capturing stack traces (and can be used in both user mode and kernel mode in the same fashion). RtlCaptureStackBackTrace does not return non-volatile register contexts for each frame being traced, however, so if you require this functionality, then you would need to implement your own stack trace mechanism on top of RtlVirtualUnwind. (It is worth noting that this sort of mechanism is essentially what functionality like handle tracing and page heap tracing is built on top of, to give you an idea of how useful it can be.) If you only need return addresses for each frame, however, then RtlCaptureStackBackTrace is an excellent API to consider if you need to log stack traces at periodic locations in your own programs for later analysis (especially since it doesn’t require anything as invasive as loading symbols).
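
To make the “data-driven” idea a bit more tangible, here is a toy sketch of a virtual unwind over a simulated stack. It is not how RtlVirtualUnwind is implemented; the names (ToyUnwindCode, ToyContext, toy_virtual_unwind) are invented for this example, and only two unwind operations are handled. The real unwind codes also carry a prolog offset, and the real routine deals with epilogues, frame pointers, chained unwind information, and more. The point is only that the caller’s register state is recovered by interpreting a table against an isolated context, without running any code that belongs to the function being unwound.

```c
#include <stddef.h>
#include <stdint.h>

/* Two of the x64 unwind operation codes (numeric values as documented). */
enum { UWOP_PUSH_NONVOL = 0, UWOP_ALLOC_SMALL = 2 };

/* Simplified unwind code: op is a UWOP_* value; info is the register
   number for PUSH_NONVOL, or (size / 8) - 1 for ALLOC_SMALL. */
typedef struct {
    uint8_t op;
    uint8_t info;
} ToyUnwindCode;

/* Isolated register context; nothing here touches real CPU state. */
typedef struct {
    uint64_t rip;
    uint64_t rsp;      /* byte offset into the simulated stack */
    uint64_t reg[16];  /* integer registers by number (3 = rbx) */
} ToyContext;

/* Read one 64-bit slot of the simulated stack at byte offset off. */
uint64_t toy_read64(const uint64_t *stack, uint64_t off)
{
    return stack[off / 8];
}

/* Undo one frame's prolog by interpreting its unwind codes in array
   order (the codes are assumed to be listed last prolog step first),
   then pop the return address into rip. */
void toy_virtual_unwind(const ToyUnwindCode *codes, size_t count,
                        const uint64_t *stack, ToyContext *ctx)
{
    for (size_t i = 0; i < count; i++) {
        switch (codes[i].op) {
        case UWOP_PUSH_NONVOL:
            /* Restore a saved nonvolatile register and pop its slot. */
            ctx->reg[codes[i].info & 15] = toy_read64(stack, ctx->rsp);
            ctx->rsp += 8;
            break;
        case UWOP_ALLOC_SMALL:
            /* Release a small stack allocation of info * 8 + 8 bytes. */
            ctx->rsp += (uint64_t)codes[i].info * 8 + 8;
            break;
        default:
            break;  /* other operations are out of scope for this sketch */
        }
    }
    ctx->rip = toy_read64(stack, ctx->rsp);
    ctx->rsp += 8;
}
```

For a function whose prolog is “push rbx; sub rsp, 20h”, the codes would be an ALLOC_SMALL with info 3 followed by a PUSH_NONVOL with info 3, and one call to toy_virtual_unwind yields the caller’s rbx, rsp, and rip.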

That’s all there is in this posting. More details on how to use the new unwind support next time…

Vista ASLR is not on by default for image base addresses

Saturday, December 16th, 2006

This little tidbit seems to be missed in all of the press about Vista’s ASLR implementation: Vista ASLR (when speaking of randomizing image base addresses) does not apply to image bases by default. This is a sacrifice for application compatibility’s sake, in an effort to make fewer programs break “out of the box” on Vista. Most notably, this is the case even for images with base relocations.

Unfortunately, the mechanism to mark an executable image as “ASLR aware” (such that it can be freely rebased by Vista’s ASLR) is not at present documented. Furthermore, the linker version that is included with Visual Studio 2005 and the Windows Vista Platform SDK does not support the option necessary to mark an image as ASLR aware (though you could technically modify the image by hand with a hex editor or the like to enable it).

The WDK linker does support the new ASLR-enabling linker option, however (though it too does not appear to document it anywhere). You can find references to this new linker option in makefile.new:

!if defined(NO_DYNAMICBASE)
DYNAMICBASE_FLAG=
!else
! if $(_NT_TOOLS_VERSION) >= 0x800
DYNAMICBASE_FLAG=/dynamicbase
! else
DYNAMICBASE_FLAG=
! endif
!endif

Passing /dynamicbase to the WDK version of link.exe (8.00.50727.215) or later will set the 0x40 DllCharacteristics value in the PE header of the output binary. This corresponds to a newly-defined constant which is at present only documented briefly in the WDK version of ntimage.h:

#define IMAGE_DLLCHARACTERISTICS_DYNAMIC_BASE 0x0040
     // DLL can move.

If this flag is set, then the base address of an image can be randomized by Vista’s ASLR; if the flag is clear, however, then no ASLR-style randomizations are performed to the image base address of a particular image (in this case, however, it is important to note that heap and stack allocations are still randomized – it is only the image base address that does not become randomized).
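
If you want to check whether a given binary has opted in, the bit can be read straight out of the file. The following is a minimal sketch in plain C with no Windows headers; the helper names (pe_read16, pe_read32, pe_has_dynamic_base) are made up for this example, and the offsets it relies on (e_lfanew at 0x3C of the DOS header, DllCharacteristics at offset 70 of the optional header for both PE32 and PE32+) come from the published PE layout rather than from anything in this post. It only validates the headers enough to avoid reading out of bounds.

```c
#include <stddef.h>
#include <stdint.h>

#define IMAGE_DLLCHARACTERISTICS_DYNAMIC_BASE 0x0040

/* Read little-endian values without assuming host alignment. */
uint16_t pe_read16(const uint8_t *p)
{
    return (uint16_t)(p[0] | (p[1] << 8));
}

uint32_t pe_read32(const uint8_t *p)
{
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
           ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

/* Returns 1 if the image has the dynamic base bit set, 0 if it does
   not, and -1 if the buffer does not look like a PE image. */
int pe_has_dynamic_base(const uint8_t *image, size_t size)
{
    if (size < 0x40 || image[0] != 'M' || image[1] != 'Z')
        return -1;

    uint32_t pe_offset = pe_read32(image + 0x3C);  /* e_lfanew */

    /* Need the signature (4), the file header (20), and the optional
       header up to and including DllCharacteristics (72 bytes). */
    if (pe_offset > size || size - pe_offset < 4 + 20 + 72)
        return -1;
    if (image[pe_offset] != 'P' || image[pe_offset + 1] != 'E' ||
        image[pe_offset + 2] != 0 || image[pe_offset + 3] != 0)
        return -1;

    /* SizeOfOptionalHeader lives at offset 16 of the file header. */
    if (pe_read16(image + pe_offset + 4 + 16) < 72)
        return -1;

    const uint8_t *optional = image + pe_offset + 4 + 20;
    uint16_t magic = pe_read16(optional);
    if (magic != 0x10B && magic != 0x20B)  /* PE32 or PE32+ */
        return -1;

    uint16_t characteristics = pe_read16(optional + 70);
    return (characteristics & IMAGE_DLLCHARACTERISTICS_DYNAMIC_BASE) ? 1 : 0;
}
```

Reading the first few hundred bytes of a file into a buffer and passing it to pe_has_dynamic_base is enough for typical images, since the headers sit at the start of the file.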

Now, virtually all of the Microsoft PE images that ship with the operating system are built with /dynamicbase, so they will take full advantage of Vista’s ASLR with respect to image base randomization. However, third party (ISV)-built programs will not, by default, gain all the benefits of ASLR due to this application compatibility sacrifice. This is where the potential problem is, as effectively all existing third party PE images will need to be recompiled to enable ASLR on image base addresses. (Technically, you could use link /edit with the WDK linker to do this without a rebuild, or hex edit binaries, but this is not a real solution in my mind. In Microsoft’s defense, many third-party .exe files are often built without base relocations, which means that even if Microsoft had enabled ASLR by default, many third party programs would still not be getting the full benefit. This does not, however, mean that I fully agree with their decision…)

I can understand where Microsoft is coming from, from an application compatibility perspective, as far as ASLR’s impact on poorly written programs goes (of which there is an abundance in the Windows world), but it is a bit unfortunate that there is no real way to administratively enable ASLR globally, or at least administratively make it an opt-out instead of opt-in setting.

So, if you are an ISV, here’s a heads up to be on the lookout for a link.exe version shipping with Visual Studio that supports /dynamicbase. When such becomes available, I would highly recommend enabling /dynamicbase for all of your projects (so long as you aren’t doing anything terribly stupid in your programs, enabling image base randomizations should be fairly harmless in most cases). You should also mark your .exe files as /FIXED:NO such that they contain a relocation section. This, when combined with /dynamicbase, will allow your .exe files to be randomized by ASLR (just the same as with DLLs that have relocation information and are built with /dynamicbase).

Update: Visual Studio 2005 SP1 has shipped. This update to Visual Studio includes a newer version of the linker, which supports the /dynamicbase option described above. So, be sure to rebuild your programs with /dynamicbase and /fixed:no with VS 2005 SP1 in order to take full advantage of ASLR on Vista.

Programming against the x64 exception handling support, part 1: Definitions for x64 versions of exception handling support

Wednesday, December 13th, 2006

This is a series dealing with how to use the new x64 exception handling support from a programmatic perspective (that is, how to write programs that take advantage of the new support, rather than how to understand it while reverse engineering or disassembling something; those topics have been covered in the past on this site already).

To get started with programming against the new x64 EH support, you’ll need to have the structure and prototype definitions for the standard x64 EH related functions and structures. One’s first instinct here is to go to MSDN. Be warned that if you are dealing with the low-level SEH routines (such as RtlUnwindEx), the documentation on MSDN is still missing / wrong for x64. For the most part, excepting RtlVirtualUnwind (which is actually correctly documented now), the exception handler support is only properly documented for IA64 (so don’t be surprised if things don’t work out how you would hope when calling RtlUnwindEx with the MSDN prototype).

For a recent project, I had to do some in-depth work with the inner workings of exception handling support on x64. So, if you’ve ever had to deal with the low-level EH internals on x64 and have been frustrated by documentation on MSDN that is either incomplete or just plain wrong, here are some of the things that I have run into along the way that are either missing or incorrect on MSDN with respect to x64 EH support:

  1. When processing an UNWIND_INFO structure, if the UNW_FLAG_CHAININFO flag is set, then there is an additional undocumented possibility for how unwind information can be chained. Specifically, if the low bit is set in the UnwindInfoAddress of the IMAGE_RUNTIME_FUNCTION_ENTRY structure referred to by the parent UNWIND_INFO structure, UnwindInfoAddress is actually the RVA of another IMAGE_RUNTIME_FUNCTION_ENTRY structure after zeroing the low bit (instead of the RVA of an UNWIND_INFO structure). This is used to help more efficiently chain exception data across a binary with minimal waste of space (credits go to skape for telling me about this).
  2. The prototype on MSDN for RtlUnwindEx is only for IA64 and does not apply to x64. The correct prototype is something more along the lines of this:
    VOID
    NTAPI
    RtlUnwindEx(
       __in_opt ULONG64               TargetFrame,
       __in_opt ULONG64               TargetIp,
       __in_opt PEXCEPTION_RECORD     ExceptionRecord,
       __in     PVOID                 ReturnValue,
       __out    PCONTEXT              OriginalContext,
       __in_opt PUNWIND_HISTORY_TABLE HistoryTable
       );
  3. MSDN’s definition of DISPATCHER_CONTEXT (a structure that is passed to the language specific handler) is incomplete. There are some additional fields beyond HandlerData, which is the last field documented in MSDN. You can see this if you disassemble _C_specific_handler, which uses the undocumented ScopeIndex field. Additional credits go to Alex Ionescu for information on a couple of the undocumented DISPATCHER_CONTEXT fields. Here’s the correct definition of this structure for x64:
    typedef struct _DISPATCHER_CONTEXT {
        ULONG64               ControlPc;
        ULONG64               ImageBase;
        PRUNTIME_FUNCTION     FunctionEntry;
        ULONG64               EstablisherFrame;
        ULONG64               TargetIp;
        PCONTEXT              ContextRecord;
        PEXCEPTION_ROUTINE    LanguageHandler;
        PVOID                 HandlerData;
        PUNWIND_HISTORY_TABLE HistoryTable;
        ULONG                 ScopeIndex;
        ULONG                 Fill0;
    } DISPATCHER_CONTEXT, *PDISPATCHER_CONTEXT;
  4. Not all of the flags passed to an exception handler (primarily relating to unwinding) are properly documented on MSDN. These additional flags are included in winnt.h, however, and are actually the same for both x86 and x64. Here’s a listing of the missing flags that apply to the ExceptionFlags member of the EXCEPTION_RECORD structure (only the EXCEPTION_NONCONTINUABLE flag value is documented on MSDN):
    #define EXCEPTION_NONCONTINUABLE   0x0001
    #define EXCEPTION_UNWINDING        0x0002
    #define EXCEPTION_EXIT_UNWIND      0x0004
    #define EXCEPTION_STACK_INVALID    0x0008
    #define EXCEPTION_NESTED_CALL      0x0010
    #define EXCEPTION_TARGET_UNWIND    0x0020
    #define EXCEPTION_COLLIDED_UNWIND  0x0040
    #define EXCEPTION_UNWIND           0x0066

    In particular, EXCEPTION_UNWIND is a bitmask of other flags that indicates all possible flags that are used to signify an unwind operation. This is probably the most interesting bitmask/flag to you, as you’ll need it if you are distinguishing between an exception and an unwind operation from the perspective of an exception handler.

  5. The definition for the C scope-table information emitted by CL for __try/__except/__finally and implicit exception handlers is not documented. Here’s the definition of the scope table used for C exception handling support:
    typedef struct _SCOPE_TABLE {
        ULONG Count;
        struct
        {
            ULONG BeginAddress;
            ULONG EndAddress;
            ULONG HandlerAddress;
            ULONG JumpTarget;
        } ScopeRecord[ 1 ];
    } SCOPE_TABLE, *PSCOPE_TABLE;

    This structure was briefly documented in a beta release of the WDK, although it has since disappeared from the RTM build. The ScopeRecord field describes a variable-sized array whose length is given by the Count field.
    You’ll need this structure definition if you are interacting with _C_specific_handler, or implementing assembler routines that are intended to use _C_specific_handler as their language specific handler.
    All of the above addresses are RVAs. BeginAddress and EndAddress are the RVAs that bound the code range for which the current scope record is effective. HandlerAddress is the RVA of a C-specific exception handler (more on that below) that implements the __except filter routine in C exception support, or the hardcoded value 0x1 to indicate that the __except filter unconditionally accepts the exception (this is also set to 0x1 for a __finally block). The JumpTarget member is the RVA to which control is transferred when the C exception handler accepts the exception, that is, the address of the body of an __except block (or a __finally block).

  6. The C exception handler routine whose RVA is given by the HandlerAddress of the C scope table for a code block is defined as follows:
    typedef
    LONG
    (NTAPI * PC_LANGUAGE_EXCEPTION_HANDLER)(
       __in    PEXCEPTION_POINTERS    ExceptionPointers,
       __in    ULONG64                EstablisherFrame
       );

    The ExceptionPointers argument is the familiar EXCEPTION_POINTERS structure that the GetExceptionInformation macro returns. The EstablisherFrame argument contains the stack pointer value for the routine associated with the C exception handler in question at the point at which the exception occurred. (If the exception occurred in a subfunction called by the function that the exception is now being inspected at, then the stack pointer should be relative to the point just after the call to the faulting function was made.) The EstablisherFrame argument is typically used to allow transparent access to the local variables of the current function from within the exception filter, even though technically the exception filter is not part of the current function but actually a completely different function itself. This is the mechanism by which you can access local variables within an __except expression.
    The function definition deserves a bit more explanation than just the parameter value meanings, however, as it is really dual-purpose. There are two modes for this routine, exception handling mode and unwind handling mode. If the low byte of the ExceptionPointers argument is set to the hardcoded value 0x1, then the handler is being called for an unwind operation. In this case, the rest of the ExceptionPointers argument is meaningless, and only the EstablisherFrame argument holds a meaningful value. In addition, when operating in unwind mode, the return value of the exception handler routine is ignored (the compiler often doesn’t even initialize it for that code path). In exception handling mode (where the ExceptionPointers argument’s low byte is not equal to the hardcoded value 0x1), both arguments are significant, and the return value is also used. In this case, the return value is one of the familiar EXCEPTION_EXECUTE_HANDLER, EXCEPTION_CONTINUE_SEARCH, and EXCEPTION_CONTINUE_EXECUTION constants that are returned by an __except filter expression. If EXCEPTION_EXECUTE_HANDLER is returned, then control will eventually be transferred to the JumpTarget member of the current scope table entry.

  7. The definition of the UNWIND_HISTORY_TABLE structure (and associated substructures) for x64 is as follows (this structure is used as a cache to speed up repeated exception handling lookups, and is typically optional as far as usage with RtlUnwindEx goes – though certainly recommended from a performance perspective):
    #define UNWIND_HISTORY_TABLE_SIZE 12
    
    typedef struct _UNWIND_HISTORY_TABLE_ENTRY {
            ULONG64           ImageBase;
            PRUNTIME_FUNCTION FunctionEntry;
    } UNWIND_HISTORY_TABLE_ENTRY,
    *PUNWIND_HISTORY_TABLE_ENTRY;
    
    #define UNWIND_HISTORY_TABLE_NONE 0
    #define UNWIND_HISTORY_TABLE_GLOBAL 1
    #define UNWIND_HISTORY_TABLE_LOCAL 2
    
    typedef struct _UNWIND_HISTORY_TABLE {
            ULONG                      Count;
            UCHAR                      Search;
            ULONG64                    LowAddress;
            ULONG64                    HighAddress;
            UNWIND_HISTORY_TABLE_ENTRY
               Entry[ UNWIND_HISTORY_TABLE_SIZE ];
    } UNWIND_HISTORY_TABLE, *PUNWIND_HISTORY_TABLE;
  8. There are inconsistencies regarding the usage of RUNTIME_FUNCTION and IMAGE_RUNTIME_FUNCTION_ENTRY in various places in the documentation. These two structures are synonymous for x64 and may be used interchangeably.
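
As a small illustration of how the scope table from item 5 might be consumed, here is a sketch of looking up the scope record that covers a given RVA. It is written as portable C with its own ScopeRecord type so that it builds without Windows headers; the function names are hypothetical, and treating EndAddress as exclusive is an assumption on my part rather than something stated above.

```c
#include <stdint.h>

/* Portable stand-in for one ScopeRecord entry of SCOPE_TABLE. */
typedef struct {
    uint32_t BeginAddress;
    uint32_t EndAddress;
    uint32_t HandlerAddress;
    uint32_t JumpTarget;
} ScopeRecord;

/* Returns the index of the first record whose range covers rva, or -1
   if no record applies. The range is assumed to be half-open, that is,
   BeginAddress <= rva < EndAddress. */
int find_scope_record(const ScopeRecord *records, uint32_t count,
                      uint32_t rva)
{
    for (uint32_t i = 0; i < count; i++) {
        if (rva >= records[i].BeginAddress && rva < records[i].EndAddress)
            return (int)i;
    }
    return -1;
}

/* Nonzero if HandlerAddress holds the hardcoded value 0x1 instead of
   the RVA of a filter routine. */
int scope_handler_is_constant(const ScopeRecord *record)
{
    return record->HandlerAddress == 1;
}
```

A caller would convert the faulting instruction pointer to an RVA by subtracting the image base, find the matching record, and then either call the routine at HandlerAddress (with the signature from item 6) or, for the 0x1 case, go straight to JumpTarget.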

Most of the other x64 exception handling information on the latest version of MSDN is correct (specifically, the parts dealing with function tables, such as RtlLookupFunctionEntry). Remember that the MSDN documentation also includes IA64 definitions on the same page, though (and the IA64 definition is typically the one presented at the top with all of the arguments explained, where you would expect it). You’ll typically need to scroll through the remarks section to find information on the x64 versions of these routines. Be wary of using your locally installed Platform SDK help with the functions that are correctly documented on MSDN, though, as to my knowledge only the very latest SDK version (e.g. the Vista SDK) actually has correct information for any of the x64 exception handling information; older versions, such as the Platform SDK that shipped with Visual Studio 2005, only include IA64 information for routines like RtlVirtualUnwind or RtlLookupFunctionEntry. In general, anywhere you see a reference to a FRAME_POINTERS or Gp structure or value in the documentation, this is a good hint that the documentation is talking exclusively about IA64 and does not directly apply to x64.
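
Returning briefly to the flag list in item 4 above, the check that a handler makes to tell an unwind pass from an exception dispatch pass can be sketched as follows. The flag values are the ones listed in item 4; the function name is_unwind_pass is invented for this example, and the flags are redefined locally only so that the snippet builds without winnt.h.

```c
#include <stdint.h>

#define EXCEPTION_UNWINDING        0x0002
#define EXCEPTION_EXIT_UNWIND      0x0004
#define EXCEPTION_TARGET_UNWIND    0x0020
#define EXCEPTION_COLLIDED_UNWIND  0x0040
#define EXCEPTION_UNWIND           0x0066

/* Nonzero if any of the unwind-related bits is set in the
   ExceptionFlags value of an exception record. */
int is_unwind_pass(uint32_t exception_flags)
{
    return (exception_flags & EXCEPTION_UNWIND) != 0;
}
```

Note that EXCEPTION_NONCONTINUABLE (0x0001) is not part of the mask, so a noncontinuable exception that is still being dispatched is not mistaken for an unwind.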

That’s all for this installment. More on how to use this information from a programmatic perspective next time…

Debugger internals: How loaded module names are communicated to the debugger

Monday, December 11th, 2006

If you’ve ever used the Win32 debugging API, you’ll notice that the WaitForDebugEvent routine, when returning a LOAD_DLL_DEBUG_EVENT style of event, gives you the address of an optional debuggee-relative string pointer containing the name of the DLL that is being loaded. In case you’ve ever wondered just where that string comes from, you’ll be comforted to know that this mechanism for communicating module name strings to the remote debugger is built upon a giant hack.

To give a bit of background information on how loading of DLLs works, most of the heavy-lifting with respect to loading DLLs (referred to as “mapping an image”) is done by the memory manager subsystem in kernel mode – specifically, in the “MiMapViewOfImageSection” internal routine. This routine is responsible for taking a section object (known as a file mapping object in the Win32 world) that represents a PE image on disk, and setting up the in-memory layout of the PE image in the specified process address space (in the case of Win32, always the address space of the caller). This includes setting up PE image subsections with the correct alignment, zero-filling “bss”-style sections, and setting up the protections of each PE image subsection. It is also responsible for supplying the “magic” necessary to allow shared PE subsections to work. All of this behavior is selected by the SEC_IMAGE flag, which is specified when the section object is created and takes effect when the section is mapped with NtMapViewOfSection (this behavior is visible to Win32 by passing SEC_IMAGE to CreateFileMapping and then mapping the result with MapViewOfFile, and can be used to achieve the same result of “just” mapping an image in-memory without going through the loader). Internally, the loader routine in NTDLL (LdrLoadDll and its associated subfunctions, which are called by the LoadLibrary family of routines in kernel32) utilizes NtMapViewOfSection to create the in-memory layout of the DLL being requested. After performing this task, the user-mode NTDLL-based loader then performs tasks such as applying base relocations, resolving imports to other modules (and loading dependent modules if necessary), allocating TLS data slots, making DLL initializer callouts, and so forth.

Now, the way that the debugger is notified of module load events is via a kernel mode hook that is called by NtMapViewOfSection (DbgkMapViewOfSection). This hook is responsible for detecting if a debugger (user mode or kernel mode) is present, and if so, forwarding the event to the debugger.

This is all well and good, but there’s a catch here. Both the user mode and kernel mode debuggers display the full path name to the DLL being loaded, but we’re now at the wrong level of abstraction, so to speak, to retrieve this information. All MiMapViewOfSection has is a handle to a section object (in actuality, a PSECTION_OBJECT and not a handle at this point). Now, the section object *does* have a reference to the PFILE_OBJECT associated with the file backing the section object (the reference is stored in the CONTROL_AREA of the section object), but there isn’t necessarily a good way to get the original filename that was passed to LoadLibrary out of the FILE_OBJECT (for starters, at this point, that path has already been converted to a native path instead of a Win32 path, and there is some potential ambiguity when trying to convert native paths back to Win32 paths).

To work around this little conundrum, the solution the developers chose is to temporarily borrow a field of the NT_TIB portion of the TEB of the calling thread for use as a way to signal the name of a DLL that is being loaded (if an image section, that is, one created with SEC_IMAGE, is being mapped by NtMapViewOfSection). Specifically, NT_TIB.ArbitraryUserPointer is temporarily replaced with a string pointer (in Windows NT, this is always a Unicode string) to the original filename passed to LdrLoadDll. Normally, the ArbitraryUserPointer field is reserved exclusively for use by user mode as a sort of “free TLS slot” that is available at a known location for every thread. Although this particular value is rarely used in Windows, the loader does make the effort to preserve its value across calls to LdrLoadDll. This works (since the loader knows that none of the code that it is calling will use NT_TIB.ArbitraryUserPointer), so long as you don’t have cross-thread accesses to a different thread’s NT_TIB.ArbitraryUserPointer (to date, I have never seen a program that tries to do this – and a good thing too, or it would randomly fail when DLLs are being loaded). Because the original value of NT_TIB.ArbitraryUserPointer is restored, the calling thread is typically none the wiser that this substitution has been performed.

Disassembling the part of the NTDLL loader responsible for mapping the DLL into the address space via NtMapViewOfSection (a subroutine named “LdrpMapViewOfDllSection” on Windows Vista), we can see this behavior in action:

ntdll!LdrpMapViewOfDllSection:
[...]
;
; Find the TEB address for the current thread.
; esi = NtCurrentTeb()->NtTib.Self
;
77f0e2ee 648b3518000000  mov     esi,dword ptr fs:[18h]
77f0e2f5 8365fc00        and     dword ptr [ebp-4],0
77f0e2f9 57              push    edi
77f0e2fa bf00000020      mov     edi,20000000h
77f0e2ff 857d18          test    dword ptr [ebp+18h],edi
77f0e302 c745f804000000  mov     dword ptr [ebp-8],4
77f0e309 0f85ce700400    jne     LdrpMapViewOfDllSection+0x26

ntdll!LdrpMapViewOfDllSection+0x42:
77f0e30f 8b4514          mov     eax,dword ptr [ebp+14h]
;
; Save away the previous ArbitraryUserPointer value.
;
; ebx = Teb->NtTib.ArbitraryUserPointer
77f0e312 8b5e14          mov     ebx,dword ptr [esi+14h]
77f0e315 6a04            push    4
77f0e317 ff7518          push    dword ptr [ebp+18h]
;
; Set the ArbitraryUserPointer value to the string pointer
; referring to the DLL name passed to LdrLoadDll.
; Teb->NtTib.ArbitraryUserPointer = (PVOID)DllNameString;
; 
77f0e31a 894614          mov     dword ptr [esi+14h],eax
77f0e31d 6a01            push    1
77f0e31f ff7510          push    dword ptr [ebp+10h]
77f0e322 33c0            xor     eax,eax
77f0e324 50              push    eax
77f0e325 50              push    eax
77f0e326 50              push    eax
77f0e327 ff750c          push    dword ptr [ebp+0Ch]
77f0e32a 6aff            push    0FFFFFFFFh
77f0e32c ff7508          push    dword ptr [ebp+8]
;
; Call NtMapViewOfSection to map the image and perform the
; debugger notification.
;
77f0e32f e830180300      call    NtMapViewOfSection
77f0e334 857d18          test    dword ptr [ebp+18h],edi
77f0e337 5f              pop     edi
;
; Restore the previous value of
; Teb->NtTib.ArbitraryUserPointer.
;
77f0e338 895e14          mov     dword ptr [esi+14h],ebx
77f0e33b 5e              pop     esi
77f0e33c 894514          mov     dword ptr [ebp+14h],eax
77f0e33f 5b              pop     ebx
77f0e340 0f85bc700400    jne     LdrpMapViewOfDllSection+0x75
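
In C-like terms, the save, substitute, call, and restore sequence in the disassembly above might be sketched as follows. This is a portable toy rather than loader source code: ToyTib stands in for the NT_TIB portion of the TEB, the map callback stands in for NtMapViewOfSection, and toy_debug_hook plays the part of the kernel-side code that reads the borrowed field. All of these names are invented for the example.

```c
#include <stddef.h>

/* Hypothetical stand-in for the one NT_TIB field of interest. */
typedef struct {
    void *ArbitraryUserPointer;
} ToyTib;

/* Hypothetical stand-in for the system call that maps the image and,
   as a side effect, lets a hook observe the thread's TIB. */
typedef int (*ToyMapRoutine)(ToyTib *tib, void *context);

/* Borrow ArbitraryUserPointer for the duration of the map call, then
   put the caller's original value back. */
int toy_map_view_of_dll_section(ToyTib *tib, const void *dll_name,
                                ToyMapRoutine map, void *context)
{
    void *saved = tib->ArbitraryUserPointer;     /* mov ebx, [esi+14h] */
    tib->ArbitraryUserPointer = (void *)dll_name; /* mov [esi+14h], eax */
    int status = map(tib, context);               /* call NtMapViewOfSection */
    tib->ArbitraryUserPointer = saved;            /* mov [esi+14h], ebx */
    return status;
}

/* Toy "debugger notification": record whatever pointer is currently
   stored in the borrowed field into the location given by context. */
int toy_debug_hook(ToyTib *tib, void *context)
{
    *(const void **)context = tib->ArbitraryUserPointer;
    return 0;
}
```

The hook sees the DLL name pointer only while the map call is in progress; once toy_map_view_of_dll_section returns, the field holds whatever the thread had stored there before.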

Sure enough, the user mode loader uses the current thread’s NT_TIB.ArbitraryUserPointer to communicate the DLL name string pointer (in this context, the “eax” value loaded into NT_TIB.ArbitraryUserPointer is the DLL name string). We can easily verify this in the debugger:

Breakpoint 0 hit
eax=0017ecfc ebx=00000000 ecx=0017ecd8
edx=774951b4 esi=c0000135 edi=0017ed80
eip=773fe2e5 esp=0017ec10 ebp=0017ed18
iopl=0         nv up ei pl zr na pe nc
cs=001b  ss=0023  ds=0023  es=0023  fs=003b
gs=0000             efl=00000246
ntdll!LdrpMapViewOfDllSection:
773fe2e5 8bff            mov     edi,edi
0:000> g 773fe31a 
eax=001db560 ebx=00000000 ecx=0017ecd8
edx=774951b4 esi=7ffdf000 edi=20000000
eip=773fe31a esp=0017ebf0 ebp=0017ec0c
iopl=0         nv up ei pl zr na pe nc
cs=001b  ss=0023  ds=0023  es=0023  fs=003b
gs=0000             efl=00000246
ntdll!LdrpMapViewOfDllSection+0x4d:
773fe31a 894614          mov     dword ptr [esi+14h],eax
0:000> du @eax
001db560  "C:\Windows\system32\CLBCatQ.DLL"

Looking in the kernel, we can clearly see the call to DbgkMapViewOfSection:

ntoskrnl!NtMapViewOfSection+0x21a:
0060a9b6 50              push    eax
0060a9b7 8b55e0          mov     edx,dword ptr [ebp-20h]
0060a9ba 8b4dd8          mov     ecx,dword ptr [ebp-28h]
0060a9bd e86e1c0100      call    ntoskrnl!DbgkMapViewOfSection

Additionally, we can see the references to NT_TIB in DbgkMapViewOfSection:

ntoskrnl!DbgkMapViewOfSection+0x65:
;
; Load eax with the address of the current thread's
; KTHREAD object.
;
; Here, fs refers to the KPCR.
;    +0x120 PrcbData         : _KPRCB
;  (in KPRCB)
;    +0x004 CurrentThread    : Ptr32 _KTHREAD
;
0061c695 64a124010000    mov     eax,dword ptr fs:[00000124h]
;
; Load esi with the address of the current thread's
; user mode PTEB.
;
; Here, we have the following layout in KTHREAD:
;    +0x084 Teb              : Ptr32 Void
;
0061c69b 8bb084000000    mov     esi,dword ptr [eax+84h]
0061c6a1 eb02            jmp     DbgkMapViewOfSection+0x75
ntoskrnl!DbgkMapViewOfSection+0x75:
0061c6a5 3bf3            cmp     esi,ebx
0061c6a7 7421            je      DbgkMapViewOfSection+0x9a
0061c6a9 3b8a44010000    cmp     ecx,dword ptr [edx+144h]
0061c6af 7519            jne     DbgkMapViewOfSection+0x9a
0061c6b1 56              push    esi
0061c6b2 e82c060200      call    DbgkpSuppressDbgMsg
0061c6b7 85c0            test    eax,eax
0061c6b9 0f85bf000000    jne     DbgkMapViewOfSection+0x144
0:000> u
ntoskrnl!DbgkMapViewOfSection+0x8f:
;
; Recall that 14 is the offset of the
; ArbitraryUserPointer member in NT_TIB,
; and that NT_TIB is the first member of TEB.
;
;    +0x000 NtTib            : _NT_TIB
;  (in NT_TIB)
;    +0x014 ArbitraryUserPointer : Ptr32 Void
;
0061c6bf 83c614          add     esi,14h
;
; [ebp-90h] is now the current thread's value of
; NtCurrentTeb()->NtTib.ArbitraryUserPointer
;
0061c6c2 89b570ffffff    mov     dword ptr [ebp-90h],esi

Thus is the story of how the filename that you pass to LoadLibrary ends up being communicated to the debugger, in a rather round-about and hackish way.

It is also worth noting that the kernel cannot trust the user mode supplied filename for use with opening the file handle to the DLL passed to the debugger process. This is because the kernel uses ZwOpenFile which bypasses normal security checks. As a result, the kernel needs to retrieve the filename via querying the section’s associated PFILE_OBJECT anyway, although for different purposes than providing the filename to the debugger.

An introduction to kernrate (the Windows kernel profiler)

Thursday, December 7th, 2006

One useful utility for tracking down performance problems that you might not have heard of until now is kernrate, the Windows kernel profiler. This utility currently ships with the Windows Server 2003 Resource Kit Tools package (though you can use kernrate on Windows XP as well) and is freely downloadable. Currently, you’ll have to match the version of kernrate you want to use with your processor architecture, so if you are using your processor in x64 mode with an x64 Windows edition, then you’ll have to dig up an x64 version of kernrate (the one that ships with the Srv03 resource kit tools is x86); KrView (see below) ships with an x64 compatible version of kernrate.

Kernrate requires that you have the SeProfilePrivilege assigned (which is typically only granted to administrators), so in most cases you will need to be a local administrator on your system in order to use it. This privilege allows access to the (undocumented) profile object system services. These APIs allow programmatic access to sample the instruction pointer at certain intervals (typically, a profiler program selects the timer interrupt for use with instruction pointer sampling). This allows you to get a feel for what the system is doing over time, which is in turn useful for identifying the cause of performance issues where a particular operation appears to be processor bound and taking longer than you would like.

There are a multitude of options that you can give kernrate (and you are probably best served by experimenting with them a bit on your own), so I’ll just cover the common ones that you’ll need to get started (use “kernrate -?” to get a list of all supported options).

Kernrate can be used to profile both user mode and kernel mode performance issues. By default, it operates only on kernel mode code, but you can override this via the -a (and -av) options, which cause kernrate to include user mode code in its profiling operations in addition to kernel mode code. Additionally, by default, kernrate operates over the entire system at once; to get meaningful results with profiling user mode code, you’ll want to specify a process (or group of processes) to profile, with the “-p pid” and/or “-n process-name” arguments. (The process name is the first 8 characters of a process’s main executable filename.)

To terminate collection of profiling data, use Ctrl-C. (On pre-Windows-Vista systems where you might be running kernrate.exe via runas, remember that Ctrl-C does not work on console processes started via runas.) Additionally, you can use the “-s seconds” argument to specify that profiling should be automagically stopped after a given number of seconds has elapsed.

If you run kernrate on kernel mode code only, or just specify a process (or group of processes) as described above, you’ll notice that you get a whole lot of general system-wide output (information about interrupt counts, global processor time usage, context switch counts, I/O operation counts) in addition to output about which modules used a noteworthy amount of processor time. Here’s an example output of running kernrate on just the kernel on my system, as described above (including just the module totals):

D:\Programs\Utilities>kernrate
Kernrate User-Specified Command Line:
kernrate


Kernel Profile (PID = 0): Source= Time,
Using Kernrate Default Rate of 25000 events/hit
Starting to collect profile data

***> Press ctrl-c to finish collecting profile data
===> Finished Collecting Data, Starting to Process Results

------------Overall Summary:--------------

[...]

OutputResults: KernelModuleCount = 153
Percentage in the following table is based on
the Total Hits for the Kernel

Time   197 hits, 25000 events per hit --------
 Module    Hits   msec  %Total  Events/Sec
intelppm     67        980    34 %     1709183
ntkrnlpa     52        981    26 %     1325178
win32k       35        981    17 %      891946
hal          19        981     9 %      484199
dxgkrnl       6        980     3 %      153061
nvlddmkm      6        980     3 %      153061
fanio         3        981     1 %       76452
bcm4sbxp      2        981     1 %       50968
portcls       2        980     1 %       51020
STAC97        2        980     1 %       51020
bthport       1        981     0 %       25484
BTHUSB        1        981     0 %       25484
Ntfs          1        980     0 %       25510

Using kernrate in this fashion is a good first step towards profiling a performance problem (especially if you are working with someone else’s program), as it quickly allows you to narrow down a processor hog to a particular module. While this is useful as a first step, however, it doesn’t really give you a whole lot of information about what specific code in a particular module is taking a lot of processor time.
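
As an aside, the numeric columns in the module table above can be reproduced with simple arithmetic. The sketch below is inferred from the figures in that table and not from any kernrate documentation: Events/Sec appears to be hits times events per hit divided by the elapsed time, and %Total appears to be a truncated percentage of the total hit count. The function names are my own.

```c
#include <stdint.h>

/* Events/Sec: hits * events per hit / elapsed seconds, truncated.
   The caller must pass a nonzero msec value. */
uint64_t events_per_sec(uint64_t hits, uint64_t events_per_hit, uint64_t msec)
{
    return hits * events_per_hit * 1000u / msec;
}

/* %Total: this module's share of all hits, truncated to a whole percent.
   The caller must pass a nonzero total_hits value. */
unsigned percent_of_total(unsigned hits, unsigned total_hits)
{
    return hits * 100u / total_hits;
}
```

For the intelppm row (67 hits out of 197, 25000 events per hit, 980 msec), this gives 1709183 events per second and 34 percent, which matches the table.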

To dig in deeper as to the cause of the problem (beyond just tracing it to a particular module), you’ll need to use the “-z module-name” option. This option tells kernrate to “zoom in” on a particular module; that is, for the given module, kernrate will track instruction pointer locations within the module to individual functions. This level of granularity is often what you’ll need for tracking down a performance issue (at least as far as profiling is concerned). You can repeat the “-z” option multiple times to “zoom in” to multiple modules (useful if the problem you are tracking down involves high processor usage across multiple DLLs or binaries).
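
The zoomed output shown later in this post reports “Bucket size = 16 bytes, Rounding Down”. One plausible reading (an assumption on my part, not something taken from kernrate's source) is that each sampled address inside the zoomed module is mapped to a fixed-size bucket, and the bucket's hits are then attributed to whichever symbol covers it. A minimal sketch of that bucket arithmetic, with hypothetical function names:

```c
#include <stddef.h>
#include <stdint.h>

#define BUCKET_SIZE 16u   /* taken from the zoomed output's header line */

/* Map a sampled address inside the module to its bucket, rounding down.
   The caller must ensure ip is not below module_base. */
size_t bucket_index(uintptr_t module_base, uintptr_t ip)
{
    return (size_t)((ip - module_base) / BUCKET_SIZE);
}

/* First address covered by a given bucket. */
uintptr_t bucket_start(uintptr_t module_base, size_t index)
{
    return module_base + (uintptr_t)index * BUCKET_SIZE;
}
```

A smaller bucket size gives finer attribution at the cost of more counters per module, which is presumably why zooming is something you opt into per module.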

Because kernrate is resolving instruction pointer sampling down to a more granular level than modules (with the “-z” option), you’ll need to tell it how to load symbols for all affected modules (otherwise, the granularity for profiler output will typically be very poor, often restricted to just exported functions). There are two ways to do this. First, you can use the “-j symbol-path” command line option; this option tells kernrate to pass a particular symbol path to DbgHelp for use with loading symbols. I recommend the second option, however, which is to configure your _NT_SYMBOL_PATH before-hand so that it points to a valid DbgHelp symbol path. This relieves you of having to manually tell kernrate a symbol path every time you execute it.

Continuing with the example I gave above, we might be interested in just what the “win32k” (the Win32 kernel mode support driver for USER/GDI) module is doing that was taking up 17% of the processor time spent in kernel mode on my system (for the interval that I was profiling). To do that, we can use the following command line (the output has been truncated to only include information that we are interested in):

D:\Programs\Utilities>kernrate -z win32k

Kernrate User-Specified Command Line:
kernrate -z win32k


Kernel Profile (PID = 0): Source= Time,
Using Kernrate Default Rate of 25000 events/hit
CallBack: Finished Attempt to Load symbols for
90a00000 \SystemRoot\System32\win32k.sys

Starting to collect profile data

***> Press ctrl-c to finish collecting profile data
===> Finished Collecting Data, Starting to Process Results

------------Overall Summary:--------------

[...]

OutputResults: KernelModuleCount = 153
Percentage in the following table is based on the
Total Hits for the Kernel

Time   2465 hits, 25000 events per hit --------
 Module      Hits   msec  %Total  Events/Sec
ntkrnlpa     1273      14799    51 %     2150483
win32k        388      14799    15 %      655449
intelppm      263      14799    10 %      444286
hal           236      14799     9 %      398675
bcm4sbxp       66      14799     2 %      111494
spsys          55      14799     2 %       92911
nvlddmkm       48      14799     1 %       81086
STAC97         31      14799     1 %       52368

[...]


===> Processing Zoomed Module win32k.sys...


----- Zoomed module win32k.sys (Bucket size = 16 bytes,
Rounding Down) --------
Percentage in the following table is based on the
Total Hits for this Zoom Module

Time   388 hits, 25000 events per hit --------
 Module                  Hits   msec  %Total  Events/Sec
xxxInternalDoPaint         44      14799    10 %       74329
XDCOBJ::bSaveAttributes    20      14799     4 %       33786
DelayedDestroyCacheDC      20      14799     4 %       33786
HANDLELOCK::vLockHandle    15      14799     3 %       25339
mmxAlphaPerPixelOnly       15      14799     3 %       25339
XDCOBJ::RestoreAttributes  13      14799     2 %       21960
DoTimer                    12      14799     2 %       20271
_SEH_prolog4               11      14799     2 %       18582
memmove                     9      14799     2 %       15203
_GetDCEx                    6      14799     1 %       10135
HmgLockEx                   6      14799     1 %       10135
XDCOBJ::bCleanDC            5      14799     1 %        8446
XEPALOBJ::ulIndexToRGB      5      14799     1 %        8446
HmgShareCheckLock           4      14799     0 %        6757
RGNOBJ::bMerge              4      14799     0 %        6757

[...]

This should give you a feel for the kind of information that you’ll get from kernrate. Although the examples I gave were profiling kernel mode code, the whole process works the same way for user mode if you use the “-p” or “-n” options as I mentioned earlier. In conjunction with a debugger, the information that kernrate gives you can often be a great help in narrowing down CPU usage performance problems (or at the very least point you in the general direction as to where you’ll need to do further research).

There are also a variety of other options that are available in kernrate, such as features for gathering information about “hot” locks that have a high degree of contention, and support for launching new processes under the profiler. There is also support for outputting the raw sampled profile data, which can be used to graph the output (such as you might see used with tools like KrView).

Although kernrate doesn’t have all the “bells and whistles” of some of the high-end profiling tools (like Intel’s VTune), it’s often enough to get the job done, and it’s also available to you at no extra cost (and can be quickly and easily deployed to help find the source of a problem). I’d highly recommend giving it a shot if you are trying to analyze a performance problem and don’t already have a profiling solution that you are using.

Frame pointer omission (FPO) optimization and consequences when debugging, part 2

Wednesday, December 6th, 2006

This series is about frame pointer omission (FPO) optimization and how it impacts the debugging experience.

  1. Frame pointer omission (FPO) and consequences when debugging, part 1.
  2. Frame pointer omission (FPO) and consequences when debugging, part 2.

Last time, I outlined the basics as to just what FPO does, and what it means in terms of generated code when you compile programs with or without FPO enabled. This article builds on the last, and lays out just what the impacts of having FPO enabled (or disabled) are when you end up having to debug a program.

For the purposes of this article, consider the following example program with several do-nothing functions that shuffle stack arguments around and call each other. (For the purposes of this posting, I have disabled global optimizations and function inlining.)

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <intrin.h>

__declspec(noinline)
void
f3(
   int* c,
   char* b,
   int a
   )
{
   *c = a * 3 + (int)strlen(b);

   __debugbreak();
}

__declspec(noinline)
int
f2(
   char* b,
   int a
   )
{
   int c;

   f3(
      &c,
      b + 1,
      a - 3);

   return c;
}

__declspec(noinline)
int
f1(
   int a,
   char* b
   )
{
   int c;
   
   c = f2(
      b,
      a + 10);

   c ^= (int)rand();

   return c + 2 * a;
}

int
__cdecl
wmain(
   int ac,
   wchar_t** av
   )
{
   int c;

   c = f1(
      (int)rand(),
      "test");

   printf("%d\n",
      c);

   return 0;
}

If we run the program and break in to the debugger at the hardcoded breakpoint, with symbols loaded, everything is as one might expect:

0:000> k
ChildEBP RetAddr  
0012ff3c 010015ef TestApp!f3+0x19
0012ff4c 010015fe TestApp!f2+0x15
0012ff54 0100161b TestApp!f1+0x9
0012ff5c 01001896 TestApp!wmain+0xe
0012ffa0 77573833 TestApp!__tmainCRTStartup+0x10f
0012ffac 7740a9bd kernel32!BaseThreadInitThunk+0xe
0012ffec 00000000 ntdll!_RtlUserThreadStart+0x23

Regardless of whether FPO optimization is turned on or off, since we have symbols loaded, we’ll get a reasonable call stack either way. The story is different, however, if we do not have symbols loaded. Looking at the same program, with FPO optimizations enabled and symbols not loaded, we get somewhat of a mess if we ask for a call stack:

0:000> k
ChildEBP RetAddr  
WARNING: Stack unwind information not available.
Following frames may be wrong.
0012ff4c 010015fe TestApp+0x15d8
0012ffa0 77573833 TestApp+0x15fe
0012ffac 7740a9bd kernel32!BaseThreadInitThunk+0xe
0012ffec 00000000 ntdll!_RtlUserThreadStart+0x23

Comparing the two call stacks, we lost three of the call frames entirely in the output. The only reason we got anything slightly reasonable at all is that WinDbg’s stack trace mechanism has some intelligent heuristics to guess the location of call frames in a stack where frame pointers are used.

If we look back to how call stacks are set up with frame pointers (from the previous article), the way a program trying to walk the stack on x86 without symbols works is by treating the stack as a sort of linked list of call frames. Recall that I mentioned the layout of the stack when a frame pointer is used:

[ebp-01]   Last byte of the last local variable
[ebp+00]   Old ebp value
[ebp+04]   Return address
[ebp+08]   First argument...

This means that if we are trying to perform a stack walk without symbols, the way to go is to assume that ebp points to a “structure” that looks something like this:

typedef struct _CALL_FRAME
{
   struct _CALL_FRAME* Next;
   void*               ReturnAddress;
} CALL_FRAME, * PCALL_FRAME;

Note how this corresponds to the stack layout relative to ebp that I described above.

A very simple stack walk function designed to walk frames that are compiled with frame pointer usage might then look like so (using the _AddressOfReturnAddress intrinsic to find “ebp”, assuming that the old ebp is 4 bytes before the address of the return address):

#include <windows.h>
#include <stdio.h>
#include <intrin.h>

LONG
StackwalkExceptionHandler(
   PEXCEPTION_POINTERS ExceptionPointers
   )
{
   if (ExceptionPointers->ExceptionRecord->ExceptionCode
      == EXCEPTION_ACCESS_VIOLATION)
      return EXCEPTION_EXECUTE_HANDLER;

   return EXCEPTION_CONTINUE_SEARCH;
}

void
stackwalk(
   void* ebp
   )
{
   PCALL_FRAME frame = (PCALL_FRAME)ebp;

   printf("Trying ebp %p\n",
      ebp);

   __try
   {
      for (unsigned i = 0;
          i < 100;
          i++)
      {
         if ((ULONG_PTR)frame & 0x3)
         {
            printf("Misaligned frame\n");
            break;
         }

         printf("#%02u %p  [@ %p]\n",
            i,
            frame,
            frame->ReturnAddress);

         frame = frame->Next;
      }
   }
   __except(StackwalkExceptionHandler(
      GetExceptionInformation()))
   {
      printf("Caught exception\n");
   }
}

#pragma optimize("y", off)
__declspec(noinline)
void printstack(
   )
{
   void* ebp = (ULONG*)_AddressOfReturnAddress()
     - 1;

   stackwalk(
      ebp);
}
#pragma optimize("", on)

If we recompile the program, disable FPO optimizations, and insert a call to printstack inside the f3 function, the console output is something like so:

Trying ebp 0012FEB0
#00 0012FEB0  [@ 0100185C]
#01 0012FED0  [@ 010018B4]
#02 0012FEF8  [@ 0100190B]
#03 0012FF2C  [@ 01001965]
#04 0012FF5C  [@ 01001E5D]
#05 0012FFA0  [@ 77573833]
#06 0012FFAC  [@ 7740A9BD]
#07 0012FFEC  [@ 00000000]
Caught exception

In other words, without using any symbols, we have successfully performed a stack walk on x86.

However, this all breaks down when a function somewhere in the call stack does not use a frame pointer (i.e. was compiled with FPO optimizations enabled). In this case, the assumption that ebp always points to a CALL_FRAME structure is no longer valid, and the call stack is either cut short or is completely wrong (especially if the function in question repurposed ebp for some other use besides as a frame pointer). Although it is possible to use heuristics to try and guess what is really a call/return address record on the stack, this is really nothing more than an educated guess, and tends to be at least slightly wrong (and typically missing one or more frames entirely).
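
To see why a single FPO function causes a frame to go missing, here is a small simulation that runs anywhere (no x86 or Windows required). It models the stack as an array of words in which “addresses” are array indices, builds a toy wmain, f1, f2, f3 chain, and walks it the same way the stackwalk function does. Everything here (the layout, the index values, the return address tags, and the function names) is made up for illustration, and it models only the simplest case, where the FPO function leaves ebp untouched.

```c
#include <stddef.h>
#include <stdint.h>

/* Made-up return address tags for the toy stack below. */
enum { RET_TO_CRT = 0x1896, RET_TO_WMAIN = 0x161b,
       RET_TO_F1 = 0x15fe, RET_TO_F2 = 0x15ef };

#define DEMO_WORDS 24

/* Fill mem (DEMO_WORDS words) with a toy wmain -> f1 -> f2 -> f3 stack.
   A frame is two words: [saved ebp][return address].
   Returns the index of f3's frame, i.e. the ebp to start walking from. */
uint32_t build_demo_stack(uint32_t *mem, int f2_uses_fpo)
{
    for (size_t i = 0; i < DEMO_WORDS; i++)
        mem[i] = 0;

    mem[20] = 0;  mem[21] = RET_TO_CRT;    /* wmain's frame (end of chain) */
    mem[16] = 20; mem[17] = RET_TO_WMAIN;  /* f1's frame */
    mem[13] = RET_TO_F1;                   /* pushed by f1's call to f2 */
    if (!f2_uses_fpo)
        mem[12] = 16;                      /* f2 saves ebp only without FPO */

    /* f3 saves whatever ebp held on entry: f2's frame normally, but
       f1's frame if f2 never set up a frame pointer of its own. */
    mem[8] = f2_uses_fpo ? 16 : 12;
    mem[9] = RET_TO_F2;
    return 8;
}

/* Follow the saved-ebp chain, collecting return addresses into out.
   Returns the number of frames found. */
size_t walk_frames(const uint32_t *mem, size_t words, uint32_t ebp,
                   uint32_t *out, size_t max_out)
{
    size_t n = 0;

    while (ebp != 0 && (size_t)ebp + 1 < words && n < max_out) {
        out[n++] = mem[ebp + 1];   /* return address */
        ebp = mem[ebp];            /* follow the link to the caller's frame */
    }
    return n;
}
```

With f2_uses_fpo set to 0, the walk finds four return addresses. With it set to 1, the walk finds only three: the return address into f1 is still sitting on the stack, but nothing in the saved-ebp chain points at it, so the walker steps right over it.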

Now, you might be wondering why you might care about doing stack walk operations without symbols. After all, you have symbols for the Microsoft binaries that your program will be calling (such as kernel32) available from the Microsoft symbol server, and you (presumably) have private symbols corresponding to your own program for use when you are debugging a problem.

Well, the answer to that is that you will end up needing to record stack traces without symbols in the course of normal debugging for a wide variety of problems. The reason for this is that there is a lot of support baked into NTDLL (and NTOSKRNL) to assist in debugging a class of particularly insidious problems: handle leaks (and other problems where the wrong handle value is getting closed somewhere and you need to find out why), memory leaks, and heap corruption.

These (very useful!) debugging features offer options that allow you to configure the system to log a stack trace on each heap allocation, heap free, or each time a handle is opened or closed. Now the way these features work is that they will capture the stack trace in real time as the heap operation or handle operation happens, but instead of trying to break into the debugger to display the results of this output (which is undesirable for a number of reasons), they save a copy of the current stack trace in-memory and then continue execution normally. To display these saved stack traces, the !htrace, !heap -p, and !avrf commands have functionality that locates these saved traces in-memory and prints them out to the debugger for you to inspect.

However, NTDLL/NTOSKRNL needs a way to create these stack traces in the first place, so that it can save them for later inspection. There are a couple of requirements here:

  1. The functionality to capture stack traces must not rely on anything layered above NTDLL or NTOSKRNL. This already means that anything as complicated as downloading and loading symbols via DbgHelp is instantly out of the picture, as those functions are layered far above NTDLL / NTOSKRNL (and indeed, they must make calls into the same functions that would be logging stack traces in the first place in order to find symbols).
  2. The functionality must work when symbols for everything on the call stack are not even available to the local machine. For instance, these pieces of functionality must be deployable on a customer computer without giving that computer access to your private symbols in some fashion. As a result, even if there was a good way to locate symbols where the stack trace is being captured (which there isn’t), you couldn’t even find the symbols if you wanted to.
  3. The functionality must work in kernel mode (for saving handle traces), as handle tracing is partially managed by the kernel itself and not just NTDLL.
  4. The functionality must use a minimum amount of memory to store each stack trace, as operations like heap allocation, heap deallocation, handle creation, and handle closure are extremely frequent operations throughout the lifetime of the process. As a result, options like just saving the entire thread stack for later inspection when symbols are available cannot be used, since that would be prohibitively expensive in terms of memory usage for each saved stack trace.

Given all of these restrictions, the code responsible for saving stack traces needs to operate without symbols, and it must furthermore be able to save stack traces in a very concise manner (without using a great deal of memory for each trace).
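
To give a feel for what “concise” could mean here, the following is a minimal sketch of a trace store that keeps only raw return addresses, caps the depth of each trace, and reuses an existing record when the same trace is logged again. This is an illustration under my own assumptions, not the actual NTDLL or NTOSKRNL implementation; the names, the depth cap, and the table size are all hypothetical.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define MAX_DEPTH   16   /* hypothetical cap on frames kept per trace */
#define MAX_TRACES  64   /* hypothetical table capacity */

/* One saved trace: just a depth and the raw return addresses. */
typedef struct {
    uint16_t  depth;
    uintptr_t frames[MAX_DEPTH];
} trace_record;

typedef struct {
    size_t       count;
    trace_record records[MAX_TRACES];
} trace_table;

/* Store a trace, reusing an identical existing record if there is one.
   Returns the record's index, or -1 if the table is full. */
int log_trace(trace_table *t, const uintptr_t *frames, size_t depth)
{
    if (depth > MAX_DEPTH)
        depth = MAX_DEPTH;   /* keep only the innermost frames */

    for (size_t i = 0; i < t->count; i++) {
        if (t->records[i].depth == depth &&
            memcmp(t->records[i].frames, frames,
                   depth * sizeof(uintptr_t)) == 0)
            return (int)i;
    }

    if (t->count == MAX_TRACES)
        return -1;

    trace_record *r = &t->records[t->count];
    r->depth = (uint16_t)depth;
    memcpy(r->frames, frames, depth * sizeof(uintptr_t));
    return (int)t->count++;
}
```

Because only addresses are stored, each heap or handle operation needs to remember nothing more than a small index, and the addresses can be resolved to symbols much later, in a debugger that does have symbols available.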

As a result, on x86, the stack trace saving code in NTDLL and NTOSKRNL assumes that all functions in the call stack use frame pointers. This is the only realistic option for saving stack traces on x86 without symbols, as there is insufficient information baked into each individual compiled binary to reliably perform stack traces without assuming the use of a frame pointer at each call site. (The 64-bit platforms that Windows supports solve this problem with the use of extensive unwind metadata, as I have covered in a number of past articles.)

So, the functionality exposed by pageheap’s stack trace logging and by handle tracing is how stack traces without symbols end up mattering to you, the developer with symbols for all of your binaries, when you are trying to debug a problem. If you make sure to disable FPO optimization on all of your code, then you’ll be able to use tools like pageheap’s stack tracing on heap operations, UMDH (User-Mode Dump Heap), and handle tracing to track down heap-related problems and handle-related problems. The best part of these features is that you can even deploy them on a customer site without having to install a full debugger (or run your program under a debugger), only later taking a minidump of your process for examination in the lab. All of them rely on FPO optimizations being disabled (at least on x86), though, so remember to turn FPO optimizations off on your release builds for the increased debuggability of these tough-to-find problems in the field.