Most data references in x64 are RIP-relative

November 5th, 2007

One of the larger (but often overlooked) changes to x64 with respect to x86 is that most instructions that previously only referenced data via absolute addressing can now reference data via RIP-relative addressing.

RIP-relative addressing is a mode where an address reference is provided as a (signed) 32-bit displacement from the current instruction pointer. While this was typically only used on x86 for control transfer instructions (call, jmp, and so forth), x64 expands the use of instruction pointer relative addressing to cover a much larger set of instructions.
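As a rough illustration of the arithmetic involved, here is a minimal sketch in portable C. The function names are made up for this example. It relies on the fact that the processor applies the displacement to the address of the next instruction, that is, the address of the current instruction plus its length.

```c
#include <stdint.h>

/* Address referenced by a RIP-relative operand. The signed displacement is
   added to the address of the *next* instruction (address + length). */
static uint64_t rip_relative_target(uint64_t instruction_address,
                                    uint32_t instruction_length,
                                    int32_t displacement)
{
    uint64_t next_instruction = instruction_address + instruction_length;
    return next_instruction + (uint64_t)(int64_t)displacement;
}

/* Displacement that would be encoded to reach `target` from the given
   instruction. The caller must ensure the distance fits in 32 bits. */
static int32_t rip_relative_displacement(uint64_t instruction_address,
                                         uint32_t instruction_length,
                                         uint64_t target)
{
    uint64_t next_instruction = instruction_address + instruction_length;
    return (int32_t)(int64_t)(target - next_instruction);
}
```

Note that if the instruction and its target both move by the same amount, the displacement computed by the second function does not change.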

What’s the advantage of using RIP-relative addressing? Well, the main benefit is that it becomes much easier to generate position independent code, or code that does not depend on where it is loaded in memory. This is especially useful in today’s world of (relatively) self-contained modules (such as DLLs or EXEs) that contain both data (global variables) and the code that goes along with it. If one used flat addressing on x86, references to global variables typically required hardcoding the absolute address of the global in question, assuming the module loads at its preferred base address. If the module then could not be loaded at the preferred base address at runtime, the loader had to perform a set of base relocations that essentially rewrite all instructions that had an absolute address operand component to take into account the new address of the module.

The loader is hardly capable of figuring out what instructions would need to be rewritten in such a form, instead requiring assistance from the compiler and linker (in terms of the base relocation section of a PE image, for Windows) to provide it with a list of addresses that correspond to instruction operands that need to be modified to reflect the new image base after an image has been relocated.
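To give a feel for what applying such a list involves, here is a deliberately simplified sketch in portable C. It assumes a flat array of offsets to 32-bit absolute address operands; the real PE base relocation section is organized differently (entries are grouped into per-page blocks and carry a type), and the function name is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Simplified model of base relocation processing: every listed offset holds
   a 32-bit absolute address that was computed for the preferred base, so
   each one is adjusted by the difference between the two bases. */
static void apply_fixups(uint8_t *image,
                         const uint32_t *fixup_offsets,
                         size_t fixup_count,
                         uint32_t preferred_base,
                         uint32_t actual_base)
{
    uint32_t delta = actual_base - preferred_base;

    assert(image != NULL);
    for (size_t i = 0; i < fixup_count; i++) {
        uint32_t value;

        /* memcpy avoids unaligned access; operands are rarely aligned. */
        memcpy(&value, image + fixup_offsets[i], sizeof(value));
        value += delta;
        memcpy(image + fixup_offsets[i], &value, sizeof(value));
    }
}
```

Every location touched by this loop lives on a page that has to be written to, which is where the memory cost of relocation comes from.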

An instruction that uses RIP-relative addressing, however, typically does not require any base relocations (otherwise known as “fixups”) at load time if the module containing it is relocated. This is because as long as portions of the module are not internally re-arranged in memory (something not supported by the PE format), any address reference that is both relative to the current instruction pointer and refers to a location within the confines of the current image will continue to refer to the correct location, no matter where the image is placed at load time.

As a result, many x64 images have a greatly reduced number of fixups, due to the fact that most operations can be performed in an RIP-relative fashion. For example, the base relocation information (not including alignment padding) on the 64-bit ntdll.dll (for Windows Vista) is a mere 560 bytes total, compared to 18092 bytes in the Wow64 (x86) version.

Fewer fixups also means better memory usage when a binary is relocated, as there is a higher probability that a particular page will not need to be modified by the base relocation process, and thus can still remain shared even if a particular process needs to relocate a particular DLL.

I tend to prefer debugging with release builds instead of debug builds.

November 2nd, 2007

One of the things that I find myself espousing both at work and outside of work from time to time is the value of debugging using release builds of programs (for Windows applications, anyways). This may seem contradictory to some at first glance, as one would tend to believe that the debug build is in fact better for debugging (it is named the “debug build”, after all).

However, I tend to disagree with this sentiment, on several grounds:

  1. Debugging on debug builds only is an unrealistic situation. Most of the “interesting” problems that crop up in real life tend to be with release builds on customer sites or production environments. Much of the time, we do not have the luxury of being able to ship out a debug build to a customer or production environment.

    There is no doubt that debugging using the debug build can be easier, but I am of the opinion that it is disadvantageous to be unable to effectively debug release builds. Debugging with release builds all the time ensures that you can do this when you’ve really got no choice, or when it is not feasible to try and repro a problem using a debug build.

  2. Debug builds sometimes interfere with debugging. This is a highly counterintuitive concept initially, one that many people seem to be surprised at. To see what I mean, consider the scenario where one has a random memory corruption bug.

    This sort of problem is typically difficult and time consuming to track down, so one would want to use all available tools to help in this process. One of the most useful tools in the toolkit of any competent Windows debugger should be page heap, which is a special mode of the RTL heap (which implements the Win32 heap as exposed by APIs such as HeapAlloc).

    Page heap places a guard page at the end (or before, depending on its configuration) of every allocation. This guard page is marked inaccessible, such that any attempt to write to an allocation that exceeds the bounds of the allocated memory region will immediately fault with an access violation, instead of leaving the corruption to cause random failures at a later time. In effect, page heap allows one to catch the guilty party “red handed” in many classes of heap corruption scenarios.

    Unfortunately, the debug build greatly diminishes the ability of page heap to operate. This is because when the debug version of the C runtime is used, any memory allocations that go through the CRT (such as new, malloc, and so forth) have special check and fill patterns placed before and after the allocation. These fill patterns are intended to be used to help detect memory corruption problems. When a memory block is returned using an API such as free, the CRT first checks the fill patterns to ensure that they are intact. If a discrepancy is found, the CRT will break into the debugger and notify the user that memory corruption has occurred.

    If one has been following along thus far, it should not be too difficult to see how this conflicts with page heap. The problem lies in the fact that from the heap’s perspective, the debug CRT per-allocation metadata (including the check and fill patterns) is part of the user allocation, and so the special guard page is placed after (or before, if underrun protection is enabled) the fill patterns. This means that some classes of memory corruption bugs will overwrite the debug CRT metadata, but won’t trip page heap up, meaning that the only indication of memory corruption will be when the allocation is released, instead of when the corruption actually occurred.

  3. Local variable and source line stepping are unreliable in release builds. Again, as with the first point, it is dangerous to get into a pattern of relying on these conveniences as they simply do not work correctly (or in the expected fashion) in release builds, after the optimizer has had its way with the program. If you get used to always relying on local variable and source line support, when used in conjunction with debug builds, then you’re going to be in for a rude awakening when you have to debug a release build. More than once at work I’ve been pulled in to help somebody out after they had gone down a wrong path when debugging something because the local variable display showed the wrong contents for a variable in a release build.

    The moral of the story here is to not rely on this information from the debugger, as it is only reliable for debug builds. Even then, local variable display will not work correctly unless you are stepping in source line mode, as within a source line (while stepping in assembly mode), local variables may not be initialized in the way that the debugger expects given the debug information.
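To make the second point above (debug CRT fill patterns versus page heap) a little more concrete, here is a toy model in portable C. It is only a sketch: it assumes the guard page begins immediately after whatever the heap was asked to allocate, ignores alignment padding, and treats the size of the debug CRT trailer as a parameter rather than claiming a particular value. The names are made up for this example.

```c
#include <stddef.h>

/* What happens when code writes `overrun` bytes past the end of its buffer. */
typedef enum {
    WRITE_IN_BOUNDS,           /* no corruption at all */
    CORRUPTION_FOUND_AT_FREE,  /* lands in the CRT fill pattern; seen later */
    FAULT_AT_WRITE             /* touches the guard page; faults right away */
} overrun_result;

/* crt_trailer is the size of the debug CRT fill pattern that follows the
   user data (zero models a release build with no such pattern). */
static overrun_result classify_overrun(size_t crt_trailer, size_t overrun)
{
    if (overrun == 0)
        return WRITE_IN_BOUNDS;
    if (overrun <= crt_trailer)
        return CORRUPTION_FOUND_AT_FREE;  /* page heap never sees this write */
    return FAULT_AT_WRITE;
}
```

With a trailer of zero, even a one byte overrun is caught at the point of the write; with a nonzero trailer, small overruns are only noticed when the block is freed.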

Now, just to be clear, I’m not saying that anyone should abandon debug builds completely. There are a lot of valuable checks added by debug builds (assertions, the enhanced iterator validation in the VS2005 CRT, and stack variable corruption checks, just to name a few). However, it is important to be able to debug problems with release builds, and it seems to me that always relying on debug builds is detrimental to being able to do this. (Obviously, this can vary, but this is simply speaking on my personal experience.)

When I am debugging something, I typically only use assembly mode and line number information, if available (for manually matching up instructions with source code). Source code is still of course a useful time saver in many instances (if you have it), but I prefer not relying on the debugger to “get it right” with respect to such things, having been burned too many times in the past with incorrect results being returned in non-debug builds.

With a little bit of practice, you can get the same information that you would out of local variable display and the like with some basic reading of disassembly text and examination of the stack and register contents. As an added bonus, if you can do this in debug builds, you should by definition be able to do so in release builds as well, even when the debugger is unable to track locals correctly due to limitations in the debug information format.

How does one retrieve the 32-bit context of a Wow64 program from a 64-bit process on Windows Server 2003 x64?

November 1st, 2007

Recently, Jimmy asked me what the recommended way to retrieve the 32-bit context of a Wow64 application on Windows XP x64 / Windows Server 2003 x64 was.

I originally responded that the best way to do this was to use Wow64GetThreadContext, but Jimmy mentioned that this doesn’t exist on Windows XP x64 / Windows Server 2003 x64. Sure enough, I checked and it’s really not there, which is rather a bummer if one is trying to implement a 64-bit debugger process capable of debugging 32-bit processes on pre-Vista operating systems.

Normally, I don’t recommend using undocumented implementation details in production code, but in this case, there seems to be little choice as there’s no documented mechanism to perform this operation prior to Vista. Because Vista introduces a documented way to perform this task, going an undocumented route is at least slightly less questionable, as there’s an upper bound on what operating systems need to be supported, and major changes to the implementation of things on downlevel operating systems are rarer than with new operating system releases.

Clearly, this is not always the case; Windows XP Service Pack 2 changed an enormous number of things, for instance. However, as a general rule, service packs tend to be relatively conservative with this sort of thing. That’s not to say that one has carte blanche with using undocumented implementation details on downlevel platforms, but perhaps one can sleep a bit easier at night knowing that things are less likely to break than in the next Windows release.

I had previously mentioned that the Wow64 layer takes a rather unexpected approach to how to implement GetThreadContext and SetThreadContext. While I mentioned at a high level what was going on, I didn’t really go into the details all that much.

The basic implementation of these routines is to determine whether the thread is running in 64-bit mode or not (determined by examining the SegCs value of the 64-bit context record for the thread as returned by NtGetContextThread). If the thread is running in 64-bit mode, and the thread is a Wow64 thread, then an assumption can be made that the thread is in the middle of a callout to the Wow64 layer (say, a system call).

In this case, the 32-bit context is saved at a well-known location by the process that translates from running in 32-bit mode to running in 64-bit mode for system calls and other voluntary, user mode “32-bit break out” events. Specifically, the Wow64 layer repurposes the second TLS slot of each 64-bit thread (that is, Teb->TlsSlots[ 1 ]) to point to a structure of the following layout:

typedef struct _WOW64_THREAD_INFO
{
   ULONG UnknownPrefix;
   WOW64_CONTEXT Wow64Context;
   ULONG UnknownSuffix;
} WOW64_THREAD_INFO, * PWOW64_THREAD_INFO;

(The real structure name is not known.)

Normally, system components do not use the TLS array, but the Wow64 layer is an exception. Because there is not normally any third party 64-bit code running in a Wow64 process, the Wow64 layer is free to do what it wants with the TlsSlots array of the 64-bit TEB for a Wow64 thread. (Each Wow64 thread has its own, separate 32-bit TEB, so this does not interfere with the operation of TLS by the 32-bit program that is currently executing.)

In the case where the requested Wow64 thread is in a 64-bit Wow64 callout, all one needs to do is to retrieve the base address of the 64-bit TEB of the thread in question, read the second entry in the TlsSlots array, and then read the WOW64_CONTEXT structure out of the memory block referred to by the second 64-bit TLS slot.

The other case that is significant is that where the Wow64 thread is running 32-bit code and is not in a Wow64 callout. In this case, because Wow64 runs x86 code natively, one simply needs to capture the 64-bit context of the desired thread and truncate all of the 64-bit registers to their 32-bit counterparts.
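The two cases described above can be sketched as follows. This is a portable model rather than working debugger code: every MODEL_ name is hypothetical, the context structures carry only a few registers, and the 0x33 code segment selector used to recognize 64-bit mode is an assumption of this sketch. A real debugger would also have to read the TEB and the saved context out of the target process (for example with ReadProcessMemory) rather than following pointers directly.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define MODEL_CS_64BIT 0x33  /* assumed selector for 64-bit user mode code */

typedef struct { uint32_t Eip, Esp, Eax; } MODEL_CONTEXT32;  /* tiny subset */
typedef struct { uint16_t SegCs; uint64_t Rip, Rsp, Rax; } MODEL_CONTEXT64;

typedef struct {
    uint32_t        UnknownPrefix;
    MODEL_CONTEXT32 Wow64Context;
    uint32_t        UnknownSuffix;
} MODEL_WOW64_THREAD_INFO;

typedef struct { void *TlsSlots[64]; } MODEL_TEB64;

static void model_get_wow64_context(const MODEL_CONTEXT64 *ctx64,
                                    const MODEL_TEB64 *teb64,
                                    MODEL_CONTEXT32 *out)
{
    assert(ctx64 != NULL && teb64 != NULL && out != NULL);

    if (ctx64->SegCs == MODEL_CS_64BIT) {
        /* In a Wow64 callout: the 32-bit context was saved in the block
           that the second 64-bit TLS slot points to. */
        const MODEL_WOW64_THREAD_INFO *info = teb64->TlsSlots[1];

        assert(info != NULL);
        memcpy(out, &info->Wow64Context, sizeof(*out));
    } else {
        /* Running 32-bit code natively: truncate the 64-bit registers. */
        out->Eip = (uint32_t)ctx64->Rip;
        out->Esp = (uint32_t)ctx64->Rsp;
        out->Eax = (uint32_t)ctx64->Rax;
    }
}
```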

Setting the context of a Wow64 thread works exactly like retrieving the context of a Wow64 thread, except in reverse; one either modifies the 64-bit thread context if the thread is running 32-bit code, or one modifies the saved context record based off of the 64-bit TEB of the desired thread (which will be restored when the thread resumes execution).

I have posted a basic implementation of a version of Wow64GetThreadContext that operates on pre-Windows-Vista platforms. Note that this implementation is incomplete; it does not translate floating point registers, nor does it only act on the subset of registers requested by the caller in CONTEXT::ContextFlags. The provided code also does not implement Wow64SetThreadContext; implementing the “set” operation and extending the “get” operation to fully conform to GetThreadContext semantics are left as an exercise for the reader.

This code will operate on Vista x64 as well, although I would strongly recommend using the documented API on Vista and later platforms instead.

Note that the operation of Wow64 on IA64 platforms is completely different from that on x64. This information does not apply in any way to the IA64 version of Wow64.

Thread Local Storage, part 8: Wrap-up

October 31st, 2007

This is the final post in the Thread Local Storage series, which is comprised of the following articles:

  1. Thread Local Storage, part 1: Overview
  2. Thread Local Storage, part 2: Explicit TLS
  3. Thread Local Storage, part 3: Compiler and linker support for implicit TLS
  4. Thread Local Storage, part 4: Accessing __declspec(thread) data
  5. Thread Local Storage, part 5: Loader support for __declspec(thread) variables (process initialization time)
  6. Thread Local Storage, part 6: Design problems with the Windows Server 2003 (and earlier) approach to implicit TLS
  7. Thread Local Storage, part 7: Windows Vista support for __declspec(thread) in demand loaded DLLs
  8. Thread Local Storage, part 8: Wrap-up

By now, much of the inner workings of TLS (both implicit and explicit) on Windows should appear less mysterious, as should a number of the seemingly arbitrary restrictions and limitations (maximum counts of explicit TLS slots on various operating systems, and limitations with respect to the usage of __declspec(thread) on demand loaded DLLs). Although many of these things can be (and should be) considered implementation details that are subject to change, knowing how things work “under the hood” comes in useful from time to time. For example, with an understanding of why there’s a hard limit to the number of available explicit TLS slots, the importance of reusing one TLS slot for many variables (by placing them into a structure that is pointed to by the contents of a TLS slot) should become clear.
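The slot reuse idea can be sketched with a small portable model. This does not call the real explicit TLS APIs; the model_ names, the per_thread_state layout, and the slot count of 64 are all assumptions made for illustration. The point is simply that a component claims one slot and hangs all of its per-thread variables off of it.

```c
#include <assert.h>
#include <stddef.h>

#define MODEL_TLS_SLOTS 64  /* illustrative fixed-size slot pool */

/* All of one component's per-thread variables, packed into one structure. */
typedef struct {
    int           last_error;
    char         *scratch_buffer;
    unsigned long request_count;
} per_thread_state;

/* Stand-in for the per-thread slot array. */
typedef struct { void *slots[MODEL_TLS_SLOTS]; } model_thread;

static unsigned char model_slot_in_use[MODEL_TLS_SLOTS];

/* Claims a free slot index for the whole process, or returns -1 if the
   pool is exhausted. */
static int model_tls_alloc(void)
{
    for (int i = 0; i < MODEL_TLS_SLOTS; i++) {
        if (!model_slot_in_use[i]) {
            model_slot_in_use[i] = 1;
            return i;
        }
    }
    return -1;
}

/* One slot lookup yields every per-thread variable of the component. */
static per_thread_state *model_state(model_thread *thread, int slot)
{
    assert(slot >= 0 && slot < MODEL_TLS_SLOTS);
    return (per_thread_state *)thread->slots[slot];
}
```

Had each of the three variables claimed its own slot, the pool would run out three times as quickly.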

Many of the details of implicit TLS are actually rather set in stone at this point, due to the fact that the compiler has been emitting code to directly access the ThreadLocalStoragePointer field in the TEB. Interestingly enough, this makes ThreadLocalStoragePointer a “guaranteed portable” part of the TEB, along with the NT_TIB header, despite the fact that the contents between the two are not defined to be portable (and are certainly not across, say, Windows 95).

Most of the inner workings of TLS are fairly straightforward, although there are some clever tricks employed to deal with scenarios such as TLS slots being released while threads are active. Many of the operational details of day to day TLS operation, such as how explicit TLS operates, are significantly different on Windows 95 and other operating systems of the 16-bit Windows lineage, so I would not recommend relying on the details of the implementation of TLS for non-NT-based systems.

Incidentally, most of the operating system itself does not use TLS in the way that it is exposed to third party programs. Instead, many operating system components either have their own dedicated fields in the TEB, or for larger amounts of data that may not need to be allocated for every thread in the system, a pointer field that can be filled with a pointer to a memory block at runtime if desired. For instance, there’s a ReservedForNtRpc field, a number of fields set aside for OpenGL ICDs (so much for Microsoft not supporting OpenGL), a WinSockData field for ws2_32, and many other similar fields for various operating system components.

This doesn’t mean that these components are really getting preferential treatment, as for the most part, an access to such a field in the TEB is in practice not really slower than an access through the documented TLS APIs. The benefit from providing these components with their own dedicated storage in the TEB is that in many cases, these components are already going to be active. If said operating system components used conventional TLS, then this would significantly detract from the already limited number of TLS slots available for use by third party components.

Some components do actually use standard TLS, or at least the space allocated in the TEB for standard TLS slots (though in special circumstances and without going through the standard explicit TLS APIs). For example, the 64-bit portion of the Wow64 layer in a 32-bit process repurposes some of the 64-bit TLS slots (which would normally be completely unused in such a process) for its own internal usage, thereby avoiding the need for dedicated storage in the TEB. That, however, is a story for another day.

Thread Local Storage, part 7: Windows Vista support for __declspec(thread) in demand loaded DLLs

October 30th, 2007

Yesterday, I outlined some of the pitfalls behind the approach that the loader has traditionally taken to implicit TLS, in Windows Server 2003 and earlier releases of the operating system.

With Windows Vista, Microsoft has taken a stab at alleviating some of the issues that make __declspec(thread) unusable for demand loaded DLLs. Although solving the problem may appear simple at first (one would tend to think that all that would need to be done would be to track and process TLS data for new modules as they’re loaded), the reality of the situation is unfortunately a fair amount more complicated than that.

At heart is the fact that implicit TLS is really only designed from the get-go to support operation at process initialization time. For example, this becomes evident when one considers what would need to be done to allocate a TLS slot for a new module. This is in and of itself problematic, as the per-module TLS array is allocated at process initialization time, with only enough space for the modules that were present (and using TLS) at that time. Expanding the array is in this case a difficult thing to safely do, considering the code that the compiler generates for accessing TLS data.

The problem resides in the fact that the compiler reads the address of the current thread’s ThreadLocalStoragePointer and then later on dereferences the returned TLS array with the current module’s TLS index. Because all of this is done without synchronization, it is not in general safe to just switch out the old ThreadLocalStoragePointer with a new array and then release the old array from another thread context, as there is no way to ensure that the thread whose TLS array is being modified was not in the middle of accessing the TLS array.

A further difficulty presents itself in that there now needs to be a mechanism to proactively go out and place a new TLS module block into each running thread’s TLS array, as there may be multiple threads active when a module is demand-loaded. This is further complicated by the fact that said modifications are required to be performed before DllMain is called for the incoming module, and while the loader lock is still held by the current thread. This implies that, once again, the alterations to the TLS arrays of other threads will need to be performed by the current thread, without the cooperation of additional threads that are active in the process at the time of the DLL load.

These constraints are responsible for the bulk of the complexity of the new loader code in Windows Vista for TLS-related operations. The general concept behind how the new TLS support operates is as follows:

First, a new module is loaded via LdrLoadDll (which is used to implement LoadLibrary and similar Win32 functions). The loader examines the module to determine if it makes use of implicit TLS. If not, then no TLS-specific handling is performed and the typical loaded module processing occurs.

If an incoming module does make use of TLS, however, then LdrpHandleTlsData (an internal helper routine) is called to initialize support for the new module’s implicit TLS usage. LdrpHandleTlsData determines whether there is room in the ThreadLocalStoragePointer arrays of currently loaded threads for the new module’s TLS slot (with Windows Vista, the array can initially be larger than the total number of modules using TLS at process initialization time, for cheaper expansion of TLS data when a new module using TLS is demand-loaded). Because all running threads will at any given time have the same amount of space in their ThreadLocalStoragePointer, this is easily accomplished by a global variable to keep track of the array length. This variable is the SizeOfBitMap member of LdrpTlsBitmap, an RTL_BITMAP structure.

Depending on whether the existing ThreadLocalStoragePointer arrays are sufficient to contain the new module, LdrpHandleTlsData allocates room for the TLS variable block for the new module and possibly new TLS arrays to store in the TEB of running threads. After the new data is allocated for each thread for the incoming module, a new process information class (ProcessTlsInformation) is utilized with an NtSetInformationProcess call to ask the kernel for help in switching out TLS data for any threads that are currently running in the process. Conceptually, this behavior is similar to ThreadZeroTlsCell, although its implementation is significantly more complicated. This step does not really appear to need to occur in kernel mode and does introduce significant (arguably unnecessary) complexity, so it is unclear why the designers elected to go this route.

In response to the ProcessTlsInformation request, the kernel enumerates threads in the current process and either swaps out one member of the ThreadLocalStoragePointer array for all threads, or swaps out the entire pointer to the ThreadLocalStoragePointer array itself in the TEB for all threads. The previous values for either the requested TLS index or the entire array pointer are then returned to user mode.

LdrpHandleTlsData then inspects the data that was returned to it by the kernel. Generally, this data represents either a TLS data block for a module that has been since unloaded (which is always safe to immediately free), or it represents an old TLS array for an already running thread. In the latter case, it is not safe to release the memory backing the array, as without the cooperation of the thread in question, there is no way to determine when the thread has released all possible references to the old memory block. Since the code to access the TLS array is hardcoded into every program using implicit TLS by the compiler, for practical purposes there is no particularly elegant way to make this determination.

Because it is not easily possible to determine (prove) when the old TLS array pointer will never again be referenced, the loader enqueues the pointer into a list of heap blocks to be released at thread exit time when the thread that owns the old TLS array performs a clean exit. Thus, the old TLS array pointer (if the TLS array was expanded) is essentially intentionally leaked until the thread exits. This is a fairly minor memory loss in practice, as the array itself is an array of pointers only. Furthermore, the array is expanded in such a way that most of the time, a new module will take an unused slot in the array instead of requiring the TLS array to be reallocated each time. This sort of intentional leak is, once again, necessary due to the design of implicit TLS not being particularly conducive to supporting demand loaded modules.
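The "retire now, free at thread exit" idea can be modeled in a few lines of portable C. This is a sketch of the concept only, not a reconstruction of the loader: the structure and function names are invented, there is no kernel involvement, and synchronization is left out entirely.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

typedef struct retired_array {
    void **array;
    struct retired_array *next;
} retired_array;

typedef struct {
    void **tls_array;          /* stands in for ThreadLocalStoragePointer */
    size_t tls_array_length;
    retired_array *retired;    /* old arrays, released only at thread exit */
} model_thread;

/* Swaps in a larger array. The old array is kept alive on the retired list,
   since the owning thread may still hold a pointer to it. Returns 0 on
   success, or -1 on allocation failure (the thread is left unchanged). */
static int model_expand_tls_array(model_thread *t, size_t new_length)
{
    void **new_array;
    retired_array *node;

    assert(new_length >= t->tls_array_length);  /* this model never shrinks */
    new_array = calloc(new_length, sizeof(void *));
    node = malloc(sizeof(*node));
    if (new_array == NULL || node == NULL) {
        free(new_array);
        free(node);
        return -1;
    }
    if (t->tls_array_length > 0)
        memcpy(new_array, t->tls_array, t->tls_array_length * sizeof(void *));

    node->array = t->tls_array;   /* intentionally not freed here */
    node->next = t->retired;
    t->retired = node;
    t->tls_array = new_array;
    t->tls_array_length = new_length;
    return 0;
}

/* At thread exit nothing can reference the old arrays any longer. */
static void model_thread_exit(model_thread *t)
{
    while (t->retired != NULL) {
        retired_array *node = t->retired;

        t->retired = node->next;
        free(node->array);
        free(node);
    }
    free(t->tls_array);
    t->tls_array = NULL;
    t->tls_array_length = 0;
}
```

A pointer to the old array that was captured before the swap remains valid to read until model_thread_exit runs, which is exactly the property the unsynchronized compiler-generated access sequence needs.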

The loader lock itself is used for synchronization with respect to switching out TLS pointers in other threads in the current process. While a thread owns the loader lock, it is guaranteed that no other thread will attempt to modify its TLS array (or that of any other thread). Because the old TLS array pointers are kept if the TLS array is reallocated, there is no risk of touching deallocated memory when the swap is made, even though the threads whose TLS pointers are being swapped have no synchronization with respect to reading the TLS array in their TEBs.

When a module is unloaded, the TLS slot occupied by the module is released back into the TLS slot pool, but the module’s TLS variable space is not freed until either the individual threads for which TLS variable space was allocated exit, or a new module is loaded and happens to claim the outgoing module’s previous TLS slot.

For those interested, I have posted my interpretation of the new implicit TLS support in Vista. This code has not been completely tested, though it is expected to be correct enough for purposes of understanding the details of the TLS implementation. In particular, I have not verified every SEH scope in the ProcessTlsInformation implementation; the SEH scope statements (handlers in particular) are in many cases logical extrapolations of what the expected behavior should be in such cases. As always, it should be considered implementation details and subject to change without notice in future operating system releases.

(There also appear to be several unfortunate bugs in the Vista implementation of TLS, mostly related to inconsistent states and potential corruption if heap allocations fail at “bad” points in time. These are commented in the above code.)

The handler for the ProcessTlsInformation process set information class does not appear to be a subfunction in reality, but instead a (rather large) case statement in the implementation of NtSetInformationProcess. It is presented as a subfunction for purposes of clarity. For reference, a control flow graph of NtSetInformationProcess is provided, with the basic blocks relevant to the ProcessTlsInformation case statement shaded. I suspect that this information class holds the record for the most convoluted usage of SEH scopes due to its heavy use of dual input/output parameters.

The information class implementation also appears to take many unconventional shortcuts that while technically workable for the use cases, would appear to be rather inconsistent with the general way that most other system calls and information classes are architected. The reasoning behind these inconsistencies is not known (perhaps as a time saver). For example, unlike most other process information classes, the only valid handle that can be used with this information class is NtCurrentProcess(). In other words, the information class handler implementation assumes the caller is the process to be modified.

Thread Local Storage, part 6: Design problems with the Windows Server 2003 (and earlier) approach to implicit TLS

October 29th, 2007

Last week, I described how the loader handles implicit TLS (as of Windows Server 2003). Although the loader’s support for implicit TLS works out well enough for what it was originally designed for, there are some cases where things do not turn out so happily. If you’ve been following along closely so far, you’ve probably already noticed some of the deficiencies relating to the design of implicit TLS. These defects in the design and implementation of TLS eventually spurred Microsoft to significantly revamp the loader’s implicit TLS support in Windows Vista.

The primary problem with respect to how Windows Server 2003 and earlier Windows versions support implicit TLS is that it just plain doesn’t work at all with DLLs that are dynamically loaded (via LoadLibrary, or LdrLoadDll). In fact, the way that implicit TLS fails if you try to dynamically load a DLL written to utilize it is actually rather spectacularly catastrophic.

What ends up happening is that the new DLL will have no TLS processing by the loader happen whatsoever. With our knowledge of how implicit TLS works at this point, the unfortunate consequences of this behavior should be readily apparent.

When a DLL using implicit TLS is loaded, because the loader doesn’t process the TLS directory, the _tls_index value is not initialized by the loader, nor is there space allocated for the module’s TLS data in the ThreadLocalStoragePointer arrays of running threads. The DLL continues to load, however, and things will appear to work… until the first access to a __declspec(thread) variable occurs, that is.

The compiler typically initializes _tls_index to zero by default, so the value retains the value zero in the case where an implicit TLS using DLL is loaded after process initialization time. When an access to a __declspec(thread) variable occurs, the typical implicit TLS variable resolution process occurs. That is, ThreadLocalStoragePointer is fetched from the TEB and is indexed by _tls_index (which will always be zero), and the resultant pointer is assumed to be a pointer to the current thread’s thread local variables. Unfortunately, because the loader didn’t actually set _tls_index to a valid value, the DLL will reference the thread local variable storage of whichever module was legitimately assigned TLS index zero. This is typically going to be the main process executable, although it could be a DLL if the main process executable doesn’t use TLS but is static linked to a DLL that does use TLS.

This results in one of the absolute worst possible kinds of problems to debug. Now you’ve got one module trampling all over another module’s state, with the guilty module under the (mistaken) belief that the state that it’s referencing is really the guilty module’s own state. If you’re lucky, the process has no users of implicit TLS at all (at process initialization time), and the ThreadLocalStoragePointer will not be allocated for the current thread and the initial access to a __declspec(thread) variable will simply result in an immediate null pointer dereference. More common, however, is the case that there is somebody in the process already using implicit TLS, in which case the module owning TLS index zero will have its thread local variables corrupted by the newly loaded module.
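The aliasing described above is easy to reproduce in a toy model. The sketch below is portable C with made-up names; it mimics the lookup sequence (fetch the per-thread array, index it by the module’s _tls_index, add the variable’s offset) rather than showing actual compiler output.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Models the implicit TLS lookup: take the thread's array of per-module
   blocks, index it by the module's TLS index, then add the offset of the
   variable within that module's block. */
static unsigned char *model_tls_variable(void **tls_array,
                                         unsigned long tls_index,
                                         size_t offset)
{
    assert(tls_array != NULL);
    return (unsigned char *)tls_array[tls_index] + offset;
}

static void model_tls_write_int(void **tls_array, unsigned long tls_index,
                                size_t offset, int value)
{
    memcpy(model_tls_variable(tls_array, tls_index, offset),
           &value, sizeof(value));
}

static int model_tls_read_int(void **tls_array, unsigned long tls_index,
                              size_t offset)
{
    int value;

    memcpy(&value, model_tls_variable(tls_array, tls_index, offset),
           sizeof(value));
    return value;
}
```

If the main executable legitimately owns index zero and a demand loaded DLL is left with an index of zero that the loader never assigned, a write made by the DLL through model_tls_write_int lands in the executable’s block, and the executable later reads back the DLL’s value.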

In this situation, the actual crash is typically long delayed until the first module finally gets around to using its thread local variable storage and fails due to the fact that it’s been overwritten, far after the fact. It is also possible that you’ll get lucky and the newly loaded module’s TLS variables will be much larger in size than those of the module with TLS index zero, in which case the initial access to the __declspec(thread) variable might immediately fault if it is sufficiently beyond the length of the heap allocation used for the already loaded module’s TLS variable storage. Of course, the offset of the variable accessed might be somewhere in between the end of the allocation used for the original module’s TLS variable storage and the edge of the current heap segment (page), in which case heap corruption will occur instead of corruption of the original module’s TLS variables for the current thread. (The loader uses the process heap to satisfy module TLS variable block allocations.)

Perhaps the only saving grace of the loader’s limitation with respect to implicit TLS and demand loaded DLLs is that due to the fact that the loader’s support for this situation has (not) operated correctly for so long now, many programmers know well enough to stay away from implicit TLS when used in conjunction with DLLs (or so I would hope).

These dire consequences of demand loading a module using __declspec(thread) variables are the reason for the seemingly after-the-fact warning about using implicit TLS with demand loaded DLLs in the LoadLibrary documentation on MSDN:

Windows Server 2003 and Windows XP: The Visual C++ compiler supports a syntax that enables you to declare thread-local variables: _declspec(thread). If you use this syntax in a DLL, you will not be able to load the DLL explicitly using LoadLibrary on versions of Windows prior to Windows Vista. If your DLL will be loaded explicitly, you must use the thread local storage functions instead of _declspec(thread). For an example, see Using Thread Local Storage in a Dynamic Link Library.

Clearly, the failure mode of demand loaded DLLs using implicit TLS is far from acceptable from a debugging perspective. Furthermore, this restriction puts a serious crimp in the practical usefulness of the otherwise highly useful __declspec(thread) support that has been baked into the compiler and linker, at least with respect to its usage in DLLs.

Fortunately, the Windows Vista loader takes some steps to address this problem, such that it becomes possible to use __declspec(thread) safely on Windows Vista and future operating system versions. The new loader support for implicit TLS in demand loaded DLLs is fairly complicated, though, due to some unfortunate design consequences of how implicit TLS works.

Next time, I’ll go into some more details on just how the Windows Vista loader supports this scenario, as well as some of the caveats behind the implementation that is used in the loader going forward with Vista.

VMKD 1.1.1.7 released

October 28th, 2007

I have posted an update to VMKD (1.1.1.7). Since the last release (1.1.1.4), the following things have changed (there is a changelog included with the package):

  1. Fixed an assert that kdvmware.sys was tripping on checked builds of the kernel (whoops). There was a bug in the code that was reprotecting kdcom.dll as a part of assuming control over the KD I/O routines.
  2. Added a potential fix for occasional difficulties resynchronizing with the guest across a reboot if DbgEng is not restarted. If you are still seeing synchronization problems from time to time, I’d be interested to see debug output from vmxpatch (available by attaching a debugger to vmware-vmx.exe) and DbgEng itself (available with CTRL-D/CTRL-ALT-D in kd.exe or WinDbg.exe, respectively).
  3. Added support for partial checked builds in a rather limited fashion. Any checked kernel that is used with VMKD ought to be named “krnltest.exe” in the guest. This seemingly arbitrary limitation is present because the file name specified via /KERNEL= is the actual name that appears in the loaded module list, and VMKD uses string comparisons on loaded module list file names to find the kernel image in-memory. There are certainly “better” ways to do this, but the current approach is fairly simple and aside from checked builds, tends to be the most reliable and officially supported way across a wide range of OS versions. Any file name may be specified for the checked HAL module in a partial checked build configuration.

    In the future, I may update the check to be more clever about finding the kernel so as to not rely on string comparisons, but it does not really appear to be worth the time for most purposes at this point.

Additionally, it has been confirmed that VMKD works with VMware Server 1.0.4 (no changes were required on VMKD’s end, and previous releases will work with VMware Server 1.0.4 as well). I still have not gotten around to verifying the operation on VMware Workstation, as for most purposes I have moved my VMware usage almost completely over to VMware Server.

Now, back to your regularly scheduled coverage on the depths of thread local storage on Windows…

Thread Local Storage, part 5: Loader support for __declspec(thread) variables (process initialization time)

October 26th, 2007

Last time, I described the mechanism by which the compiler and linker generate code to access a variable that has been instanced per-thread via the __declspec(thread) extended storage class. Although the compiler and linker have essentially “set the stage” with respect to implicit TLS at this point, the loader is the component that “fills in the dots” and supplies the necessary run-time infrastructure to allow everything to operate.

Specifically, the loader is responsible for managing the allocation of per-module TLS index values, as well as the allocation and management of the memory for the ThreadLocalStoragePointer array referred to by the TEB of every thread. Additionally, the loader is also responsible for managing the memory for each module’s thread-instanced (that is, __declspec(thread)-decorated) variables.

The loader’s TLS-related allocation and management duties can conceptually be split up into four distinct areas (Note that this represents the Windows Server 2003 and earlier view of things; I will go over some of the changes that Windows Vista makes to this model in a future posting in the TLS series.):

  1. At process initialization time, allocate _tls_index values, determine the extent of memory required for each module’s TLS block, and call TLS and DLL initializers (in that order).
  2. At thread initialization time, allocate and initialize TLS memory blocks for each module utilizing TLS, allocate the ThreadLocalStoragePointer array for the current thread, and link the TLS memory blocks in to the ThreadLocalStoragePointer array. Additionally, TLS initializers and then DLL initializers (in that order) are invoked for the current thread.
  3. At thread deinitialization time, call TLS deinitializers and then DLL deinitializers (in that order), and release the current thread’s TLS memory blocks for each module using TLS, and release the ThreadLocalStoragePointer array.
  4. At process deinitialization time, call TLS and DLL deinitializers (in that order).

Of course, the loader performs a number of other tasks when these events occur; this is simply a list of those that have some bearing on TLS support.

Most of these operations are fairly straightforward, with the arguable exception of process initialization. Process initialization of TLS is primarily handled in two subroutines inside ntdll, LdrpInitializeTls and LdrpAllocateTls.

LdrpInitializeTls is invoked during process initialization after all DLLs have been loaded, but before any initializer (or TLS) routines have been called. It essentially walks the loaded module list and sums the length of TLS data for each module that contains a valid TLS directory. For each module that contains TLS, a data structure is allocated that contains the length of the module’s TLS data and the TLS index that has been assigned to that module. (The TlsIndex field in the LDR_DATA_TABLE_ENTRY structure appears to be unused except as a flag that the module has TLS (being always set to -1), at least as far back as Windows XP. It is worth mentioning that the WINE implementation of implicit TLS incorrectly uses TlsIndex as the real module TLS index, so it may be unreliable to assume that it is always -1 if you care about working on WINE.)
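The walk that LdrpInitializeTls performs can be sketched roughly as follows. This is a simplified model under stated assumptions, not the actual ntdll code: the tls_module structure and its fields are hypothetical stand-ins for the loader’s real bookkeeping, and index assignment is assumed to be strictly sequential in list order.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified stand-in for a loaded module list entry. */
struct tls_module {
    const char *name;
    size_t      tls_size;   /* 0 means the module has no TLS directory */
    int         has_tls;    /* set by assign_tls_indexes */
    unsigned    tls_index;  /* meaningful only when has_tls is nonzero */
};

/* Walk the module list in order, hand out sequential TLS indexes to the
   modules that carry TLS data, and return the summed TLS data length.
   The number of modules using TLS is stored in *modules_with_tls. */
static size_t assign_tls_indexes(struct tls_module *mods, size_t count,
                                 unsigned *modules_with_tls)
{
    size_t   total = 0;
    unsigned next_index = 0;

    assert(mods != NULL && modules_with_tls != NULL);

    for (size_t i = 0; i < count; i++) {
        mods[i].has_tls = (mods[i].tls_size != 0);
        if (!mods[i].has_tls)
            continue;
        mods[i].tls_index = next_index++;
        total += mods[i].tls_size;
    }

    *modules_with_tls = next_index;
    return total;
}
```

Because indexes are handed out only to modules present during this walk, a module that shows up later has no index assigned to it in this model.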

Modules that use implicit TLS and which are present at initialization time are additionally marked as pinned in memory for the lifetime of the process by LdrpInitializeProcess (the LoadCount of any such module is fixed to 0xFFFF). In practice, this is typically unlikely to matter, as for such modules to be present at process initialization time, they must also by definition be static linked by either the main process image or a dependency of the main process image.

After LdrpInitializeTls has determined which modules use TLS in the current process and has assigned those modules TLS index values, LdrpAllocateTls is called to allocate and initialize module TLS values for the initial thread.
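Conceptually, the per-thread allocation step amounts to copying each module’s TLS template into a fresh block and recording that block in the thread’s slot array. The sketch below models this with malloc in portable C; the tls_template structure and both function names are hypothetical, and the real loader uses the process heap and its own data structures.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical description of one module's TLS template (the bytes between
   _tls_start and _tls_end in the image). */
struct tls_template {
    const void *start;
    size_t      size;
};

/* Build a per-thread slot array: one freshly allocated block per module,
   initialized by copying that module's template. Returns NULL on failure. */
static void **allocate_thread_tls(const struct tls_template *templates,
                                  size_t count)
{
    void **slots;

    assert(templates != NULL && count > 0);

    slots = calloc(count, sizeof *slots);
    if (slots == NULL)
        return NULL;

    for (size_t i = 0; i < count; i++) {
        assert(templates[i].size > 0);
        slots[i] = malloc(templates[i].size);
        if (slots[i] == NULL) {
            while (i-- > 0)          /* release the blocks already made */
                free(slots[i]);
            free(slots);
            return NULL;
        }
        memcpy(slots[i], templates[i].start, templates[i].size);
    }
    return slots;
}

/* Release everything allocate_thread_tls created for one thread. */
static void free_thread_tls(void **slots, size_t count)
{
    for (size_t i = 0; i < count; i++)
        free(slots[i]);
    free(slots);
}
```

Each thread gets its own copy, so a write through one thread’s slot array touches neither the template nor any other thread’s block.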

At this point, process initialization continues, eventually resulting in TLS initializers and then DLL initializers (DllMain) being called for loaded modules. (Note that the main process image can have one or more TLS callbacks, even though it cannot have a DLL initializer routine.)

One interesting fact about TLS initializers is that they are always called before DLL initializers for their corresponding DLL. (The process occurs in sequence, such that DLL A’s TLS and DLL initializers are called, then DLL B’s TLS and DLL initializers, and so forth.) This means that TLS initializers need to be careful about making, say, CRT calls (as the C runtime is initialized before the user’s DllMain routine is called, by the actual DLL initializer entrypoint, such that the CRT will not be initialized when a TLS initializer for the module is invoked). This can be dangerous, as global objects will not have been constructed yet; the module will be in a completely uninitialized state except that imports have been snapped.
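The ordering can be made concrete with a small model. The following sketch only records the order in which a loader-like loop would invoke initializers; the init_module structure and the log format are invented for the illustration.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical module description for the ordering model. */
struct init_module {
    const char *name;
    int has_tls_callback;
    int has_dll_main;     /* zero for the main process image */
};

static char call_log[128];

/* Append "module:what " to the log, truncating if the log is full. */
static void log_call(const char *module, const char *what)
{
    size_t used = strlen(call_log);
    snprintf(call_log + used, sizeof call_log - used, "%s:%s ", module, what);
}

/* For each module in turn, run its TLS initializer and then its DLL
   initializer, before moving on to the next module. */
static const char *run_initializers(const struct init_module *mods,
                                    size_t count)
{
    assert(mods != NULL);
    call_log[0] = '\0';
    for (size_t i = 0; i < count; i++) {
        if (mods[i].has_tls_callback)
            log_call(mods[i].name, "tls");
        if (mods[i].has_dll_main)
            log_call(mods[i].name, "dllmain");
    }
    return call_log;
}
```

In this model, a module’s TLS initializer always appears in the log before its own DllMain entry, which is why the CRT cannot be assumed to be ready inside a TLS initializer.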

Another point worth mentioning about the loader’s TLS support is that contrary to the Portable Executable specification, the SizeOfZeroFill member of the IMAGE_TLS_DIRECTORY structure is not used (or supported) by the linker or the loader. This means that in practice, all TLS template data is initialized, and the size of the memory block allocated for per-module implicit TLS does not include the SizeOfZeroFill member as the PE documentation (or certain other publications that appear to be based on said documentation) would seem to state. (It seems that the WINE folks happened to get it wrong as well, thanks to the implication in the PE specification that the field is actually used.)

Some programs abuse TLS callbacks for anti-debugging purposes (gaining code execution before the normal process entrypoint routine is executed by creating a TLS callback for the main process image), although this is, in practice, quite obvious as almost all PE images do not use TLS callbacks at all.

Up through Windows Server 2003, the above is really all the loader needs to do with respect to supporting __declspec(thread). While this approach would appear to work quite well, it turns out that there are, in fact, some problems with it (if you’ve been following along thus far, you can probably figure out what they are). More on some of the limitations of the Windows Server 2003 approach to implicit TLS next week.

Thread Local Storage, part 4: Accessing __declspec(thread) data

October 25th, 2007

Yesterday, I outlined how the compiler and linker cooperate to support TLS. However, I didn’t mention just what exactly goes on under the hood when one declares a __declspec(thread) variable and accesses it.

Before the inner workings of a __declspec(thread) variable access can be explained, however, it is necessary to discuss several more special variables in tlssup.c. These special variables are referenced by _tls_used to create the TLS directory for the image.

The first variable of interest is _tls_index, which is implicitly referenced by the compiler in the per-thread storage resolution mechanism any time a thread local variable is referenced (well, almost every time; there’s an exception to this, which I’ll mention later on). _tls_index is also the only variable declared in tlssup.c that uses the default allocation storage class. Internally, it represents the current module’s TLS index. The per-module TLS index is, in principle, similar to a TLS index returned by TlsAlloc. However, the two are not compatible, and there exists significantly more work behind the per-module TLS index and its supporting code. I’ll cover all of that later as well; for now, just bear with me.

The definitions of _tls_start and _tls_end appear as so in tlssup.c:

#pragma data_seg(".tls")

#if defined (_M_IA64) || defined (_M_AMD64)
_CRTALLOC(".tls")
#endif
char _tls_start = 0;

#pragma data_seg(".tls$ZZZ")

#if defined (_M_IA64) || defined (_M_AMD64)
_CRTALLOC(".tls$ZZZ")
#endif
char _tls_end = 0;

This code creates the two variables and places them at the start and end of the “.tls” section. The compiler and linker will automatically assume a default allocation section of “.tls” for all __declspec(thread) variables, such that they will be placed between _tls_start and _tls_end in the final image. The two variables are used to tell the linker the bounds of the TLS storage template section, via the image’s TLS directory (_tls_used).

Now that we know how __declspec(thread) works from a language level, it is necessary to understand the supporting code the compiler generates for an access to a __declspec(thread) variable. This supporting code is, fortunately, fairly straightforward. Consider the following test program:

__declspec(thread) int threadedint = 0;

int __cdecl wmain(int ac,
   wchar_t **av)
{
   threadedint = 42;

   return 0;
}

For x64, the compiler generated the following code:

mov	 ecx, DWORD PTR _tls_index
mov	 rax, QWORD PTR gs:88
mov	 edx, OFFSET FLAT:threadedint
mov	 rax, QWORD PTR [rax+rcx*8]
mov	 DWORD PTR [rdx+rax], 42

Recall that the gs segment register refers to the base address of the TEB on x64. 88 (0x58) is the offset in the TEB for the ThreadLocalStoragePointer member on x64 (more on that later):

   +0x058 ThreadLocalStoragePointer : Ptr64 Void

If we examine the code after the linker has run, however, we’ll notice something strange:

mov     ecx, cs:_tls_index
mov     rax, gs:58h
mov     edx, 4
mov     rax, [rax+rcx*8]
mov     dword ptr [rdx+rax], 2Ah ; 42
xor     eax, eax

If you haven’t noticed it already, the offset of the “threadedint” variable was resolved to a small value (4). Recall that in the pre-link disassembly, the “mov edx, 4” instruction was “mov edx, OFFSET FLAT:threadedint”.

Now, 4 isn’t a very flat address (one would expect an address within the confines of the executable image to be used). What happened?

Well, it turns out that the linker has some tricks up its sleeve that were put into play here. The “offset” of a __declspec(thread) variable is assumed to be relative to the base of the “.tls” section by the linker when it is resolving address references. If one examines the “.tls” section of the image, things begin to make a bit more sense:

0000000001007000 _tls segment para public 'DATA' use64
0000000001007000      assume cs:_tls
0000000001007000     ;org 1007000h
0000000001007000 _tls_start        dd 0
0000000001007004 ; int threadedint
0000000001007004 ?threadedint@@3HA dd 0
0000000001007008 _tls_end          dd 0

The offset of “threadedint” from the start of the “.tls” section is indeed 4 bytes. But all of this still doesn’t explain how the instructions the compiler generated access a variable that is instanced per thread.

The “secret sauce” here lies in the following three instructions:

mov     ecx, cs:_tls_index
mov     rax, gs:58h
mov     rax, [rax+rcx*8]

These instructions fetch ThreadLocalStoragePointer out of the TEB and index it by _tls_index. The resulting pointer is then indexed again with the offset of threadedint from the start of the “.tls” section to form a complete pointer to this thread’s instance of the threadedint variable.

In C, the code that the compiler generated could be visualized as follows:

// This represents the ".tls" section
typedef struct _MODULE_TLS_DATA
{
   int tls_start;
   int threadedint;
   int tls_end;
} MODULE_TLS_DATA, * PMODULE_TLS_DATA;

PTEB Teb;
PMODULE_TLS_DATA TlsData;

Teb     = NtCurrentTeb();
TlsData = ((PVOID *)Teb->ThreadLocalStoragePointer)[ _tls_index ];

TlsData->threadedint = 42;

This should look familiar if you’ve used explicit TLS before. The typical paradigm for explicit TLS is to place a structure pointer in a TLS slot, and then to access your thread local state, the per thread instance of the structure is retrieved and the appropriate variable is then referenced off of the structure pointer. The difference here is that the compiler and linker (and loader, more on that later) cooperated to save you (the programmer) from having to do all of that explicitly; all you had to do was declare a __declspec(thread) variable and all of this happens magically behind the scenes.

There’s actually an additional curve that the compiler will sometimes throw with respect to how implicit TLS variables work from a code generation perspective. You may have noticed how I showed the x64 version of an access to a __declspec(thread) variable; this is because, by default, x86 builds of a .exe involve a special optimization (/GA (Optimize for Windows Application, quite possibly the worst name for a compiler flag ever)) that eliminates the step of referencing the special _tls_index variable by assuming that it is zero.

This optimization is only possible with a .exe that will run as the main process image. The assumption works in this case because the loader assigns per-module TLS index values on a sequential basis (based on the loaded module list), and the main process image should be the second thing in the loaded module list, after NTDLL (which, now that this optimization is being used, can never have any __declspec(thread) variables, or it would get TLS index zero instead of the main process image). It’s worth noting that in the (extremely rare) case that a .exe exports functions and is imported by another .exe, this optimization will cause random corruption if the imported .exe happens to use __declspec(thread).

For reference, with /GA enabled, the x86 build of the above code results in the following instructions:

mov     eax, large fs:2Ch
mov     ecx, [eax]
mov     dword ptr [ecx+4], 2Ah ; 42

Remember that on x86, fs points to the base address of the TEB, and that ThreadLocalStoragePointer is at offset +0x2C from the base of the x86 TEB.

Notice that there is no reference to _tls_index; the compiler assumes that it will take on the value zero. If one examines a .dll built with the x86 compiler, the /GA optimization is always disabled, and _tls_index is used as expected.
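The difference between the two access sequences can be written out in portable C. This is a model only: fake_teb is a hypothetical stand-in for the TEB, and module_tls_data mirrors the “.tls” layout from the example above.

```c
#include <assert.h>
#include <stddef.h>

/* Layout of one module's ".tls" block, as in the example above. */
struct module_tls_data {
    int tls_start;
    int threadedint;
    int tls_end;
};

/* Hypothetical stand-in for the TEB; only the TLS slot array is modeled. */
struct fake_teb {
    void   **thread_local_storage_pointer;
    unsigned slot_count;
};

/* General form: the slot is chosen by the module's _tls_index. */
static int *threadedint_via_index(struct fake_teb *teb, unsigned tls_index)
{
    char *block;

    assert(tls_index < teb->slot_count);
    block = teb->thread_local_storage_pointer[tls_index];
    return (int *)(block + offsetof(struct module_tls_data, threadedint));
}

/* /GA form: the _tls_index load is dropped and slot zero is assumed. */
static int *threadedint_ga(struct fake_teb *teb)
{
    char *block;

    assert(teb->slot_count > 0);
    block = teb->thread_local_storage_pointer[0];
    return (int *)(block + offsetof(struct module_tls_data, threadedint));
}
```

The two forms agree only for the module that really owns slot zero; for any other module, the /GA form would resolve to somebody else’s block.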

The magic behind __declspec(thread) extends beyond just the compiler and linker, however. Something still has to set up the storage for each module’s per-thread state, and that something is the loader. More on how the loader plays a part in this complex process next time.

Thread Local Storage, part 3: Compiler and linker support for implicit TLS

October 24th, 2007

Last time, I discussed the mechanisms by which so-called explicit TLS operates (the TlsGetValue, TlsSetValue and other associated supporting routines).

Although explicit TLS is certainly fairly heavily used, many of the more “interesting” pieces about how TLS works in fact relate to the work that the loader does to support implicit TLS, or __declspec(thread) variables (in CL). While both TLS mechanisms are designed to provide a similar effect, namely the capability to store information on a per-thread basis, many aspects of the implementations of the two different mechanisms are very different.

When you declare a variable with the __declspec(thread) extended storage class, the compiler and linker cooperate to allocate storage for the variable in a special region in the executable image. By convention, all variables with the __declspec(thread) storage class are placed in the .tls section of a PE image, although this is not technically required (in fact, the thread local variables do not even really need to be in their own section, merely contiguous in memory, at least from the loader’s perspective). On disk, this region of memory contains the initializer data for all thread local variables in a particular image. However, this data is never actually modified and references to a particular thread local variable will never refer to an address within this section of the PE image; the data is merely a “template” to be used when allocating storage for thread local variables after a thread has been created.

The compiler and linker also make use of several special variables in the context of implicit TLS support. Specifically, a variable by the name of _tls_used (of the type IMAGE_TLS_DIRECTORY) is created by a portion of the C runtime that is static linked into every program to represent the TLS directory that will be used in the final image (references to this variable should be extern “C” in C++ code for name decoration purposes, and storage for the variable need not be allocated as the supporting CRT stub code already creates the variable). The TLS directory is a part of the PE header of an executable image which describes to the loader how the image’s thread local variables are to be managed. The linker looks for a variable by the name of _tls_used and ensures that in the on-disk image, it overlaps with the actual TLS directory in the final image.

The source code for the particular section of C runtime logic that declares _tls_used lives in the tlssup.c file (which comes with Visual Studio), making the variable pseudo-documented. The standard declaration for _tls_used is as so:

_CRTALLOC(".rdata$T")
const IMAGE_TLS_DIRECTORY _tls_used =
{
 (ULONG)(ULONG_PTR) &_tls_start, // start of tls data
 (ULONG)(ULONG_PTR) &_tls_end,   // end of tls data
 (ULONG)(ULONG_PTR) &_tls_index, // address of tls_index
 (ULONG)(ULONG_PTR) (&__xl_a+1), // pointer to callbacks
 (ULONG) 0,                      // size of tls zero fill
 (ULONG) 0                       // characteristics
};

The CRT code also provides a mechanism to allow a program to register a set of TLS callbacks, which are functions with a similar prototype to DllMain that are called when a thread starts or exits (cleanly) in the current process. (These callbacks can even be registered for a main process image, where there is no DllMain routine.) The callbacks are typed as PIMAGE_TLS_CALLBACK, and the TLS directory points to a null-terminated array of callbacks (called in sequence).
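Walking a null-terminated callback array is straightforward. The sketch below uses a hypothetical tls_callback type that mirrors the shape of PIMAGE_TLS_CALLBACK (module handle, reason, reserved) so that it compiles without the Windows headers; record_reason is an example callback invented for the illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical portable stand-in for PIMAGE_TLS_CALLBACK. */
typedef void (*tls_callback)(void *module_handle, unsigned long reason,
                             void *reserved);

/* Invoke each callback in sequence until the null terminator is reached.
   Returns the number of callbacks that were invoked. */
static size_t call_tls_callbacks(const tls_callback *callbacks,
                                 void *module_handle, unsigned long reason)
{
    size_t called = 0;

    if (callbacks == NULL)      /* image has no callback array at all */
        return 0;

    for (; *callbacks != NULL; callbacks++) {
        (*callbacks)(module_handle, reason, NULL);
        called++;
    }
    return called;
}

/* Example callback: remembers the reason codes it was invoked with. */
static unsigned long seen_reasons[8];
static size_t seen_count;

static void record_reason(void *module_handle, unsigned long reason,
                          void *reserved)
{
    (void)module_handle;
    (void)reserved;
    assert(seen_count < sizeof seen_reasons / sizeof seen_reasons[0]);
    seen_reasons[seen_count++] = reason;
}
```

A table such as { record_reason, record_reason, NULL } results in two calls with the same reason code, in array order.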

For a typical image, there will not exist any TLS callbacks (in practice, almost everything uses DllMain to perform per-thread initialization tasks). However, the support is retained and is fully functional. To use the support that the CRT provides for TLS callbacks, one needs to declare a variable that is stored in the specially named “.CRT$XLx” section, where x is a value between A and Z. For example, one might write the following code:

#pragma section(".CRT$XLY",long,read)

extern "C" __declspec(allocate(".CRT$XLY"))
  PIMAGE_TLS_CALLBACK _xl_y  = MyTlsCallback;

The strange business with the special section names is required because the in-memory ordering of the TLS callback pointers is significant. To understand what is happening with this peculiar looking declaration, it is first necessary to understand a bit about how the compiler and linker organize data in the final PE image that is produced.

Non-header data in a PE image is placed into one or more sections, which are regions of memory with a common set of attributes (such as page protection). The __declspec(allocate("section-name")) keyword (CL-specific) tells the compiler that a particular variable is to be placed in a specific section in the final executable. The compiler additionally has support for concatenating similarly-named sections into one larger section. This support is activated by suffixing a section name with a $ character followed by any other text. The compiler concatenates the resulting section with the section of the same name, truncated at the $ character (inclusive).

The compiler alphabetically orders individual sections when concatenating them (due to the usage of the $ character in the section name). This means that in-memory (in the final executable image), a variable in the “.CRT$XLB” section will be after a variable in the “.CRT$XLA” section but before a variable in the “.CRT$XLZ” section. The C runtime uses this quirk of the compiler to create a null terminated array of function pointers to TLS callbacks (with the pointer stored in the “.CRT$XLZ” section being the null terminator). Thus, in order to ensure that the declared function pointer resides within the confines of the TLS callback array being referenced by _tls_used, it is necessary to place it in a section of the form “.CRT$XLx”.
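The effect of the ordering rule can be imitated with a sort over section names. This is a rough model and not how the toolchain is actually implemented: the contribution structure is hypothetical, the symbol names are used only as labels, and a plain strcmp over the full name is assumed to be a good enough approximation of the alphabetical ordering.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical record of one contribution to a grouped section. */
struct contribution {
    const char *section;   /* full name, such as ".CRT$XLY" */
    const char *symbol;    /* label for the variable placed there */
};

/* Compare two contributions by their full section name. */
static int by_section_name(const void *a, const void *b)
{
    const struct contribution *ca = a;
    const struct contribution *cb = b;
    return strcmp(ca->section, cb->section);
}

/* Order contributions so that ".CRT$XLA" sorts before ".CRT$XLB", which
   in turn sorts before ".CRT$XLZ". */
static void order_contributions(struct contribution *items, size_t count)
{
    assert(items != NULL);
    qsort(items, count, sizeof *items, by_section_name);
}

/* Length of the merged section's name: everything before the '$'. */
static size_t merged_name_length(const char *section)
{
    const char *dollar = strchr(section, '$');
    return dollar != NULL ? (size_t)(dollar - section) : strlen(section);
}
```

Sorting contributions from “.CRT$XLZ”, “.CRT$XLY” and “.CRT$XLA” places a user callback pointer in “.CRT$XLY” between the array start and the null terminator, and all three collapse into a section named “.CRT”.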

The creation of the TLS directory is, however, only one portion of how the compiler and linker work together to support __declspec(thread) variables. Next time, I’ll discuss just how the compiler and linker manage accesses to such variables.

Update: Phil mentions that this support for TLS callbacks does not work before the Visual Studio 2005 release. Be warned if you are still using an old compiler package.