Yesterday, I outlined some of the pitfalls behind the approach that the loader has traditionally taken to implicit TLS, in Windows Server 2003 and earlier releases of the operating system.
With Windows Vista, Microsoft has taken a stab at alleviating some of the issues that make __declspec(thread) unusable for demand loaded DLLs. Although solving the problem may appear simple at first (one would tend to think that all that would need to be done would be to track and process TLS data for new modules as they’re loaded), the reality of the situation is unfortunately a fair amount more complicated than that.
At heart is the fact that implicit TLS is really only designed from the get-go to support operation at process initialization time. For example, this becomes evident when one considers what would need to be done to allocate a TLS slot for a new module. This is in and of itself problematic, as the per-module TLS array is allocated at process initialization time, with only enough space for the modules that were present (and using TLS) at that time. Expanding the array is in this case a difficult thing to safely do, considering the code that the compiler generates for accessing TLS data.
The problem resides in the fact that the compiler reads the address of the current thread’s ThreadLocalStoragePointer and then later on dereferences the returned TLS array with the current module’s TLS index. Because all of this is done without synchronization, it is not in general safe to just switch out the old ThreadLocalStoragePointer with a new array and then release the old array from another thread context, as there is no way to ensure that the thread whose TLS array is being modified was not in the middle of accessing the TLS array.
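To make the race concrete, here is a minimal sketch of that access sequence written out as plain C. It is a model only: MODEL_TEB and model_tls_address are hypothetical names invented for illustration, and real compiler output reads the TEB directly rather than taking it as a parameter.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the one TEB field this sequence reads. */
typedef struct MODEL_TEB {
    void **ThreadLocalStoragePointer; /* per-thread array of TLS block pointers */
} MODEL_TEB;

/* Models the unsynchronized steps taken to reach a __declspec(thread) variable. */
static void *model_tls_address(const MODEL_TEB *teb, unsigned long tls_index,
                               size_t offset)
{
    assert(teb != NULL && teb->ThreadLocalStoragePointer != NULL);

    /* Step 1: read the array pointer out of the TEB. */
    void **tls_array = teb->ThreadLocalStoragePointer;

    /* If another thread replaced and freed the array right here, tls_array
       would dangle; nothing in this sequence can detect that. */

    /* Step 2: index the array with the module's TLS index. */
    char *block = (char *)tls_array[tls_index];

    /* Step 3: add the variable's offset within the module's TLS block. */
    return block + offset;
}
```

The window between step 1 and step 2 is the reason an old array cannot simply be freed once a replacement has been installed.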
A further difficulty presents itself in that there now needs to be a mechanism to proactively go out and place a new TLS module block into each running thread’s TLS array, as there may be multiple threads active when a module is demand-loaded. This is further complicated by the fact that said modifications are required to be performed before DllMain is called for the incoming module, and while the loader lock is still held by the current thread. This implies that, once again, the alterations to the TLS arrays of other threads will need to be performed by the current thread, without the cooperation of additional threads that are active in the process at the time of the DLL load.
These constraints are responsible for the bulk of the complexity of the new loader code in Windows Vista for TLS-related operations. The general concept behind how the new TLS support operates is as follows:
First, a new module is loaded via LdrLoadDll (which is used to implement LoadLibrary and similar Win32 functions). The loader examines the module to determine if it makes use of implicit TLS. If not, then no TLS-specific handling is performed and the typical loaded module processing occurs.
If an incoming module does make use of TLS, however, then LdrpHandleTlsData (an internal helper routine) is called to initialize support for the new module’s implicit TLS usage. LdrpHandleTlsData determines whether there is room in the ThreadLocalStoragePointer arrays of currently loaded threads for the new module’s TLS slot (with Windows Vista, the array can initially be larger than the total number of modules using TLS at process initialization time, for cheaper expansion of TLS data when a new module using TLS is demand-loaded). Because all running threads will at any given time have the same amount of space in their ThreadLocalStoragePointer, this is easily accomplished by a global variable to keep track of the array length. This variable is the SizeOfBitMap member of LdrpTlsBitmap, an RTL_BITMAP structure.
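The capacity check can be pictured with a small bitmap model. This is a sketch under assumptions, not the loader's code: MODEL_BITMAP only mirrors the idea that the bitmap's size doubles as the current TLS array length, and model_claim_tls_slot is a hypothetical helper that does not use the real RTL_BITMAP routines.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

#define BITS_PER_WORD (sizeof(unsigned long) * CHAR_BIT)

/* Hypothetical model: one bit per TLS slot, set when a module owns the slot. */
typedef struct MODEL_BITMAP {
    unsigned long size_of_bitmap; /* valid bits, i.e. current TLS array length */
    unsigned long *buffer;
} MODEL_BITMAP;

/* Claims the first free slot and returns its index, or returns -1 when every
   slot is taken and the per-thread arrays would have to be reallocated. */
static long model_claim_tls_slot(MODEL_BITMAP *bm)
{
    assert(bm != NULL && bm->buffer != NULL);

    for (unsigned long i = 0; i < bm->size_of_bitmap; i++) {
        unsigned long *word = &bm->buffer[i / BITS_PER_WORD];
        unsigned long mask = 1UL << (i % BITS_PER_WORD);

        if ((*word & mask) == 0) {
            *word |= mask; /* mark the slot as in use */
            return (long)i;
        }
    }
    return -1; /* no room: the expansion path is needed */
}
```

A return value of -1 corresponds to the case where new, larger arrays must be allocated for every running thread.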
Depending on whether the existing ThreadLocalStoragePointer arrays are sufficient to contain the new module, LdrpHandleTlsData allocates room for the TLS variable block for the new module and possibly new TLS arrays to store in the TEB of running threads. After the new data is allocated for each thread for the incoming module, a new process information class (ProcessTlsInformation) is utilized with an NtSetInformationProcess call to ask the kernel for help in switching out TLS data for any threads that are currently running in the process. Conceptually, this behavior is similar to ThreadZeroTlsCell, although its implementation is significantly more complicated. This step does not really appear to need to occur in kernel mode and does introduce significant (arguably unnecessary) complexity, so it is unclear why the designers elected to go this route.
In response to the ProcessTlsInformation request, the kernel enumerates threads in the current process and either swaps out one member of the ThreadLocalStoragePointer array for all threads, or swaps out the entire pointer to the ThreadLocalStoragePointer array itself in the TEB for all threads. The previous values for either the requested TLS index or the entire array pointer are then returned to user mode.
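The two swap modes can be sketched as follows. This models only the observable effect described above; it is not the kernel's interface. MODEL_TEB, model_swap_tls_slot and model_swap_tls_array are hypothetical names, and the real request structure passed with ProcessTlsInformation is not reproduced here.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the TEB field being modified. */
typedef struct MODEL_TEB {
    void **ThreadLocalStoragePointer;
} MODEL_TEB;

/* Mode 1: replace one slot in every thread's existing array.
   previous[i] receives the block that thread i had in that slot. */
static void model_swap_tls_slot(MODEL_TEB *threads, size_t count,
                                unsigned long index,
                                void **new_blocks, void **previous)
{
    assert(threads != NULL && new_blocks != NULL && previous != NULL);

    for (size_t i = 0; i < count; i++) {
        previous[i] = threads[i].ThreadLocalStoragePointer[index];
        threads[i].ThreadLocalStoragePointer[index] = new_blocks[i];
    }
}

/* Mode 2: replace every thread's array pointer with a new (larger) array.
   previous[i] receives the array that thread i was using before. */
static void model_swap_tls_array(MODEL_TEB *threads, size_t count,
                                 void ***new_arrays, void ***previous)
{
    assert(threads != NULL && new_arrays != NULL && previous != NULL);

    for (size_t i = 0; i < count; i++) {
        previous[i] = threads[i].ThreadLocalStoragePointer;
        threads[i].ThreadLocalStoragePointer = new_arrays[i];
    }
}
```

In both modes the caller gets back whatever was displaced, which is what allows user mode to decide afterwards what may be freed and what may not.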
LdrpHandleTlsData then inspects the data that was returned to it by the kernel. Generally, this data represents either a TLS data block for a module that has been since unloaded (which is always safe to immediately free), or it represents an old TLS array for an already running thread. In the latter case, it is not safe to release the memory backing the array, as without the cooperation of the thread in question, there is no way to determine when the thread has released all possible references to the old memory block. Since the code to access the TLS array is hardcoded into every program using implicit TLS by the compiler, for practical purposes there is no particularly elegant way to make this determination.
Because it is not easily possible to determine (prove) when the old TLS array pointer will never again be referenced, the loader enqueues the pointer into a list of heap blocks to be released at thread exit time when the thread that owns the old TLS array performs a clean exit. Thus, the old TLS array pointer (if the TLS array was expanded) is essentially intentionally leaked until the thread exits. This is a fairly minor memory loss in practice, as the array itself is an array of pointers only. Furthermore, the array is expanded in such a way that most of the time, a new module will take an unused slot in the array instead of requiring the TLS array to be reallocated each time. This sort of intentional leak is, once again, necessary due to the design of implicit TLS not being particularly conducive to supporting demand loaded modules.
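The deferred release can be illustrated with a simple per-thread list. This is a minimal sketch of the idea only, assuming an ordinary singly linked list and malloc/free; MODEL_THREAD, model_defer_free and model_thread_exit are hypothetical names, and the loader's actual list layout and heap routines are not shown.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical list node recording one retired TLS array. */
typedef struct MODEL_DEFERRED {
    struct MODEL_DEFERRED *next;
    void *block; /* old TLS array that may still be referenced */
} MODEL_DEFERRED;

/* Hypothetical per-thread bookkeeping. */
typedef struct MODEL_THREAD {
    MODEL_DEFERRED *deferred;
} MODEL_THREAD;

/* Records old_array for release at thread exit. Returns 1 on success and 0 if
   the node could not be allocated; in that case the old array is simply
   leaked rather than freed early. */
static int model_defer_free(MODEL_THREAD *thread, void *old_array)
{
    assert(thread != NULL);

    MODEL_DEFERRED *entry = malloc(sizeof *entry);
    if (entry == NULL)
        return 0;

    entry->block = old_array;
    entry->next = thread->deferred;
    thread->deferred = entry;
    return 1;
}

/* Runs on clean thread exit: the thread can no longer touch its old arrays,
   so they are finally safe to free. Returns the number of arrays released. */
static size_t model_thread_exit(MODEL_THREAD *thread)
{
    assert(thread != NULL);

    size_t released = 0;
    while (thread->deferred != NULL) {
        MODEL_DEFERRED *entry = thread->deferred;
        thread->deferred = entry->next;
        free(entry->block);
        free(entry);
        released++;
    }
    return released;
}
```

Thread exit is the only point at which the owning thread is known to hold no stale reference, which is why release is tied to it.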
The loader lock itself is used for synchronization with respect to switching out TLS pointers in other threads in the current process. While a thread owns the loader lock, it is guaranteed that no other thread will attempt to modify its TLS array (or that of any other thread). Because the old TLS array pointers are kept if the TLS array is reallocated, there is no risk of touching deallocated memory when the swap is made, even though the threads whose TLS pointers are being swapped have no synchronization with respect to reading the TLS array in their TEBs.
When a module is unloaded, the TLS slot occupied by the module is released back into the TLS slot pool, but the module’s TLS variable space is not freed until either the individual threads for which TLS variable space was allocated exit, or a new module is loaded and happens to claim the outgoing module’s previous TLS slot.
For those interested, I have posted my interpretation of the new implicit TLS support in Vista. This code has not been completely tested, though it is expected to be correct enough for purposes of understanding the details of the TLS implementation. In particular, I have not verified every SEH scope in the ProcessTlsInformation implementation; the SEH scope statements (handlers in particular) are in many cases logical extrapolations of what the expected behavior should be in such cases. As always, these should be considered implementation details that are subject to change without notice in future operating system releases.
(There also appear to be several unfortunate bugs in the Vista implementation of TLS, mostly related to inconsistent states and potential corruption if heap allocations fail at “bad” points in time. These are commented in the above code.)
The handler for the ProcessTlsInformation process set information class does not appear to be a subfunction in reality, but instead a (rather large) case statement in the implementation of NtSetInformationProcess. It is presented as a subfunction for purposes of clarity. For reference, a control flow graph of NtSetInformationProcess is provided, with the basic blocks relevant to the ProcessTlsInformation case statement shaded. I suspect that this information class holds the record for the most convoluted usage of SEH scopes due to its heavy use of dual input/output parameters.
The information class implementation also appears to take many unconventional shortcuts that, while technically workable for the use cases, are rather inconsistent with the way that most other system calls and information classes are architected. The reasoning behind these inconsistencies is not known (perhaps they were a time saver). For example, unlike most other process information classes, the only valid handle that can be used with this information class is NtCurrentProcess(). In other words, the information class handler implementation assumes the caller is the process to be modified.