Reversing the V740, part 1: Rationale

July 2nd, 2007

Recently, I got a V740 EVDO Rev.A ExpressCard for mobile Internet access. Aside from a couple of shortcomings, the thing has actually worked fairly well. However, there are some problems with it which prove fairly annoying in practice.

Aside from blinky lights gadget syndrome, the other major deficiency I’ve encountered with the V740 (really a rebranded Novatel Merlin X720) so far is a problem with the 1xRTT<->EVDO handoff logic.

(There’s a little bit of important background information about 1xRTT/EVDO networks here that is relevant for understanding the rest of the series, so bear with me for a minute. Most of this information is from Wikipedia, and I’ve tried to include links to relevant parts along the way.)

You see, the way the card works is that it supports both 1xRTT-based networks (which are fairly common as long as you’re not completely in the middle of nowhere) for rougly dialup modem speeds, and 1xEVDO-based networks, for low(er) latency, high(er) bandwidth. EVDO is still limited to populous areas – e.g. medium to large size cities and the surrounding areas, and isn’t as ubiqitous as 1xRTT coverage is, at least in the US.

Because EVDO is hardly universal in the US, most CDMA data devices supporting EVDO can actually use both 1xRTT or EVDO, depending the local coverage. Beyond that, most such devices also support (mostly seamless) handoff between EVDO and 1xRTT, such that when you move out of an EVDO coverage area and into a 1xRTT-only coverage area, the data session is transferred without any interruption other than a momentary blip in latency as the air interface is changed from EVDO to 1xRTT. This can happen even while data is being actively transferred on the link without dropping TCP connections or anything as intrusive as that, which is very nice; aside from an increase in latency and decrease in bandwidth as you move to 1xRTT, everything otherwise continues to magically “just work”.

However, there is a minor problem with this logic: Sometimes, if you happen to pass through an area with spotty cell coverage (bridge, elevator, whatnot), you’ll lose one or both of the EVDO or 1xRTT signals temporarily. Depending on whether the device reacquires the EVDO signal fast enough, this sometimes results in the call being handed over to 1xRTT even though you’re really still in an EVDO coverage area. The result is that all of a sudden, everything is high latency and low bandwidth, as the device has decided to use the lower quality network. Unfortunately, there doesn’t seem to be support in the V740 to automatically “fail upwards” from 1xRTT back to EVDO if the device re-enters an area with EVDO signal. This means that your data session is now stuck in “slow mode” due to a minor coverage blip – not so good.

The typical solution that people take to this problem is to instruct their device to disable 1xRTT support and only use EVDO, which does work fine if you are really ever only in an EVDO coverage area. However, I really didn’t want to take this approach, as it means manually futzing around with the card’s configuration when you travel. Worse, it makes you less robust against situations where there really is better connectivity to the 1xRTT network than the EVDO network, even if you’re in an EVDO coverage region.

It turns out that there is actually a way to cause the device to re-select the best network after it has failed over to 1xRTT (other than re-dialing the connection), but it is unfortunately not all that seamless: When the device enters “dormant mode” and then transitions back to an active state, it will rescan available networks and pick the best quality network with which to re-engage the link. (Dormant mode is a mode where the device essentially relinquishes its over-the-air slot, freeing it up for another caller, until either end detects activity on the link. Then, the link is seamlessly re-established, again without TCP connections or the like being dropped. Having the device let go of its over-the-air slot is good for network operators, as it allows the system to essentially “borrow back” the capacity used for an otherwise idle session until the session needs it again.)

Normally, dormant mode is only activated when both ends of the link have been idle for some pre-set period of time, usually several seconds. While this means that if you have a fairly idle connection, you’ll eventually get switched back over to EVDO, if your connection is active, it will not enter dormant mode and will thus continue to run on 1xRTT.

Now, if you’re using a conventional handset in tethered mode (Bluetooth or via USB cable), you can often forcibly enter dormant mode on the device with the “end call” button, and as soon as either end of the link tries to send data, the device will pick up the call on the best network, essentially offering a manual way to force the device to consider switching back to EVDO. (At least, that’s how it works on my LG VX8100 – your mileage my vary based on your device, of course.) While that’s not ideal (it’s a completely manual process to hit the end call button, of course), it does get the job done.

With an ExpressCard, however, the problem is a bit more severe, simply because there is no “end call” button. (Sure, you can disconnect the session, but that kind of defeats the whole point – the goal is to re-establish the high speed network connection without dropping your active TCP connections). Unfortunately, as an end user here, you’re pretty much out of luck without the device vendor (or the network operator) providing a tool to coax the device into reconsidering which data network to use (and as far as I know, neither Verizon nor Novatel provide such a utility).

Not satisified with this, I decided to take matters into my own hands, so to speak, and see if there might be a way to programmatically tell the card to switch back to EVDO, or at least enter dormant mode on-demand without idling all data traffic, so that it can automatically select EVDO when resuming from dormant mode. The end desired result would be a program that I could either manually run or that could run continually while a session is active, and try to re-establish the EVDO link where appropriate. While Novatel does appear to make some mention of an SDK for their devices available on their website, it doesn’t appear to be readily available to individual users, which unforunately rules it out. That leaves reverse engineering the way the V740 is controlled from the computer in the hopes of finding something to achieve the desired results.

Fortunately, reversing things is something that I do happen to have a little bit of knowledge of…

Next time: Popping the hood on Verizon’s connection manager app (“VZAccess Manager”), and how it talks to the V740 ExpressCard.

The No-Execute hall of shame…

June 15th, 2007

One of the things that I do when I set up Windows on a new (modern) box that I am going to use for more than just temporary testing is to enable no-execute (“NX”) support by default. Preferable, I use NX in “always on” mode, but sometimes I use “opt-out” mode if I have to. It is possible to bypass NX in opt-out (or opt-in) mode with a “ret2libc”-style attack, which diminishes the security gain in an unfortunately non-trivial way in many cases. For those unclear on what “alwayson”, “optin”, “optout”, and “alwaysoff” mean in terms of Windows NX support, they describe how NX is applied across processes. Alwayson and alwaysoff are fairly straightforward; they mean that all processes either have NX forced or forced off, unconditionally. Opt-in mode means that only programs that mark themselves as NX-aware have NX enabled (or those that the administrator has configured in Control Panel), and opt-out mode means that all programs have NX applied unless the administrator has explicitly excluded them in Control Panel.

(Incidentally, the reason why the Windows implementation of NX was vulnerable to a ret2libc style attack in the first place was for support for extremely poorly written copy protection schemes, of all things. Specifically, there is code baked into the loader in NTDLL to detect SecuROM and SafeDisc modules being loaded on-the-fly, for purposes of automagically turning off NX for these security-challenged copy protection mechanisms. As a result of this, exploit code can effectively do the same thing that the user mode loader does in order to disable NX, before returning to the actual exploit code (see the Uninformed article for more details on how this works “under the hood”). I really just love how end user security is compromised for DRM/copy protection mechanisms as a result of that. Such is the price of backwards compatibility with shoddy code, and the myriad of games using such poorly written copy protection systems…)

As a result of this consequence of “opt-out” mode, I would recommend forcing NX to being always on, if possible (as previously mentioned). To do this, you typically need to edit boot.ini to enable always on mode (/noexecute=alwayson). On systems that use BCD boot databases, e.g. Vista or later, you can use this command to achieve the same effect as editing boot.ini on downlevel systems:

bcdedit /set {current} nx alwayson

However, I’ve found that you can’t always do that, especially on client systems, depending on what applications you run. There are still a lot of things out there which break with NX enabled, unfortunately, and “alwayson” mode means that you can’t exempt applications from NX. Unfortunately, even relatively new programs sometimes fall into this category. Here’s a list of some of the things I’ve ran into that blow up with NX turned on by default; the “No-Execute Hall of Shame”, as I like to call it, since not supporting NX is definitely very lame from a security perspective – even more so, since it requires you to switch from “Alwayson” to “Optout” (so you can use the NX exclusion list in Control Panel) which is in an of itself a security risk even for unrelated programs that do have NX enabled by default). This is hardly an exclusive list, but just some things that I have had to work around in my own personal experience.

  1. Sun’s Java plug-in for Internet Explorer. (Note that to enable NX for IE if you are in “Optout” mode, you need to go grovel around in IE’s security options, as IE disables NX for itself by default if it can – it opts out, in other words). Even with the latest version of Java (“Java Platform 6 update 1”, which also seems to be be known as 1.6.0), it’s “bombs away” for IE as soon as you try to go to a Java-enabled webpage if you have Sun’s Java plugin installed.

    Personally, I think that is pretty inexcusable; it’s not like Sun is new to JIT code generation or anything (or that Java is anything like the “new kid on the block”), and it’s been the recommended practice for a very long time to ensure that code you are going to execute is marked executable. It’s even been enforced by hardware for quite some time now on x86. Furthermore, IE (and Java) are absolutely “high priority” targets for exploits on end users (arguably, IE exploits on unpatched systems are one of the most common delivery mechanisms for malware), and preventing IE from running with NX is clearly not a good thing at all from a security standpoint. Here’s what you’ll see if you try to use Java for IE with NX enabled (which takes a bit of work, as previously noted, in the default configuration on client systems):

    (106c.1074): Access violation - code c0000005 (first chance)
    First chance exceptions are reported before any
    exception handling.
    This exception may be expected and handled.
    eax=6d4a4a79 ebx=00070c5c ecx=0358ea98
    edx=76f10f34 esi=0ab65b0c edi=04da3100
    eip=04da3100 esp=0358ead4 ebp=0358eb1c
    iopl=0         nv up ei pl nz na pe nc
    cs=001b  ss=0023  ds=0023  es=0023  fs=003b
    gs=0000             efl=00210206
    04da3100 c74424040c5bb60a mov dword ptr [esp+4],0AB65B0Ch
    0:005> k
    ChildEBP RetAddr  
    WARNING: Frame IP not in any known module. Following
    frames may be wrong.
    0358ead0 6d4a4abd 0x4da3100
    0358eb1c 76e31ae8 jpiexp+0x4abd
    0358eb94 76e31c03 USER32!UserCallWinProcCheckWow+0x14b
    [...]
    0:005> !vprot @eip
    BaseAddress:       04da3000
    AllocationBase:    04c90000
    AllocationProtect: 00000004  PAGE_READWRITE
    RegionSize:        000ed000
    State:             00001000  MEM_COMMIT
    Protect:           00000004  PAGE_READWRITE
    Type:              00020000  MEM_PRIVATE
    0:005> .exr -1
    ExceptionAddress: 04da3100
       ExceptionCode: c0000005 (Access violation)
      ExceptionFlags: 00000000
    NumberParameters: 2
       Parameter[0]: 00000008
       Parameter[1]: 04da3100
    Attempt to execute non-executable address 04da3100
    0:005> lmvm jpiexp
    start    end        module name
    [...]
        CompanyName:  JavaSoft / Sun Microsystems
        ProductName:  JavaSoft / Sun Microsystems
                      -- Java(TM) Plug-in
        InternalName: Java Plug-in für Internet Explorer

    (Yes, the internal name is in German. No, I didn’t install a German-localized build, either. Perhaps whoever makes the public Java for IE releases lives in Germany.)

  2. NVIDIA’s Vista 3D drivers. From looking around in a process doing Direct3D work in the debugger on an NVIDIA system, you can find sections of memory laying around that are PAGE_EXECUTE_READWRITE, or writable and executable, and appear to contain dynamically generated SSE code. Leaving memory around like this diminishes the value of NX, as it provides avenues of attack for exploits that need to store some code somewhere and then execute it. (Fortunately, technologies like ASLR may make it difficult for exploit code to land in such dynamically allocated executable and writable zones in one shot. An example of this sort of code is as follows (observed in a World of Warcraft process):
    0:000> !vprot 15a21a42 
    BaseAddress:       0000000015a20000
    AllocationBase:    0000000015a20000
    AllocationProtect: 00000040  PAGE_EXECUTE_READWRITE
    RegionSize:        0000000000002000
    State:             00001000  MEM_COMMIT
    Protect:           00000040  PAGE_EXECUTE_READWRITE
    Type:              00020000  MEM_PRIVATE
    0:000> u 15a21a42
    15a21a42 8d6c9500       lea     ebp,[rbp+rdx*4]
    15a21a46 0f284d00       movaps  xmm1,xmmword ptr [rbp]
    15a21a4a 660f72f108     pslld   xmm1,8
    15a21a4f 660f72d118     psrld   xmm1,18h
    15a21a54 0f5bc9         cvtdq2ps xmm1,xmm1
    15a21a57 0f590dd0400b05 mulps   xmm1,xmmword ptr
               [nvd3dum!NvDiagUmdCommand+0xd6f30 (050b40d0)]
    15a21a5e 0f59c1         mulps   xmm0,xmm1
    15a21a61 0f58d0         addps   xmm2,xmm0
    0:000> lmvm nvd3dum
    start             end                 module name
    04de0000 0529f000   nvd3dum
        Loaded symbol image file: nvd3dum.dll
    [...]
        CompanyName:      NVIDIA Corporation
        ProductName:      NVIDIA Windows Vista WDDM driver
        InternalName:     NVD3DUM
        OriginalFilename: NVD3DUM.DLL
        ProductVersion:   7.15.11.5828
        FileVersion:      7.15.11.5828
        FileDescription:  NVIDIA Compatible Vista WDDM
              D3D Driver, Version 158.28 

    Ideally, this sort of JIT’d code would be reprotected to be readonly after it is generated. While not as severe as leaving the entire process without NX, any “NX holes” like this are to be avoided when possible. (Simply slapping a “PAGE_EXECUTE_READWRITE” on all of your allocations and calling it done is not the proper solution, and compromises security to a certain extent, albeit significantly less so than disabling NX entirely.)

  3. id Software’s Quake 3 crashes out immediately if you have NX enabled. I suppose it can be given a bit of slack given its age, but the SDK documentation has always said that you need to protect executable regions as executable.

Programs that have been “recently rescued” from the No-Execute hall of shame include:

  1. DosBox, the DOS emulator (great for running those old games on new 64-bit computers, or even on 32-bit computers where DosBox has superior support for hardware, such as sound cards, compared to NTVDM. The current release version (0.70) of DosBox doesn’t work with NX if you have the dynamic code generation core enabled. However, after speaking to one of the developers, it turns out that a version that properly protects executable memory was already in the works (in fact, the DosBox team was kind enough to give me a prerelease build with the fix in it, in leui of my trying to get a DosBox build environment working). Now I’m free to indulge in those oldie-but-goodie classic DOS titles like “Master of Magic” from time to time without having to mess around with excluding DosBox.exe from the NX policy.
  2. Blizzard Entertainment’s World of Warcraft used to crash immediately on logging on to the game world if you had NX enabled. It seems as if this has been fixed for the most recent patch level, though.
  3. Another of Blizzard’s games, Starcraft, will crash with NX enabled unless you’ve got a recent patch level installed. The reason is that Starcraft’s rendering engine JITs various rendering operations into native code on the fly for increased performance. Until relatively recently in Starcraft’s lifetime, none of this code was NX-aware and blithely outputted non-executable rendering code which it then tried to run. In fact, Blizzard’s Warcraft II: Battle.net Edition has not been patched to repair this deficiency and is thus completely unusable with NX enabled as a result of the same underlying problem.

If you’re writing a program that generates code on the fly, please use the proper protection attributes for it – PAGE_READWRITE during generation, and PAGE_EXECUTE_READONLY after the code is ready to use. Windows Server already ships with NX enabled by default (although in “opt-out” mode), and it’s already a matter of time before client SKUs of Windows do the same.

Much ado about signed drivers, and yet…

June 13th, 2007

Today, I was installing a Srv03 x64 VM for testing purposes. This is something I’ve done countless times before – install/reinstall/blow away Windows VMs for testing purposes with the standard Windows setup, nothing all that interesting. However, I ran into something extremely bizzare this time around:

Windows Server 2003 x64 Setup - Unsigned Driver Warning

The really weird thing here is that this VM was completely clean; brand new, empty hard disk which had only been just now connected to standard installation media (no third party OEM drivers slipstreamed onto the media or anything either).

Apparently, there’s an unsigned “in-box” driver lurking on the Srv03 x64 installation discs. Nice. The only thing that I can think of that was different about this VM from all the others I’ve made is that I created a serial port (redirected to a named pipe as usual) in the VM before setup, instead of after, and that somehow Windows might have thought that I had one of those old-style serial-port-controlled UPS batteries connected to the box. Still, I’ve been around enough Srv03 installs on “real metal” to know that it’s weird for this to happen; I’ve never seen it on another install, VM or physical hardware.

Given the fuss that Microsoft makes about signed drivers, seeing what appears to be an in-box driver that isn’t signed is, to say the least, amusing. (Note that there are two types of signing; cat file signing, to sign the installation package – this is what suppresses the unsigned driver popup – and driver binary signing, which allows a driver to load on an x64 system but still generates the unsigned driver popup if you install the driver via an INF. I can only assume that for whatever reason, this driver doesn’t have a valid catalog signature, or it wouldn’t load at all. I’ll see if I can confirm this when the VM finishes installing, though…)

Selectively suppress Wow64 filesystem redirection on Vista x64 with ‘Sysnative’

June 8th, 2007

One of the features implemented by the Wow64 layer on x64 editions of Windows for 32-bit programs is filesystem redirection, which operates similarly to registry redirection in that it provides a Wow64 “sandbox”, for file I/O relating to the System32 directory.

The reasoning behind this is that many programs have the “System32” directory name hardcoded, and more importantly, unlike the transition between 16-bit Windows and Win32, virtually all of the DLLs implementing the Windows API retained their same name (and path) across the 32-bit to 64-bit transition.

Clearly, this creates a conflict, as there can’t be both 32-bit and 64-bit DLLs with the same name and path on the same system at the same time. The solution that Microsoft devised, filesystem redirection, is a layer that intercepts accesses to the system32 directory made by 32-bit applications, and re-points the accesses to the SysWOW64 directory, which functions as the Wow64 version of the system32 directory (where most of the core Wow64 system DLLs reside).

Normally, this works fairly well, but there are times when a program needs to access the real System32 directory. For this purpose, Microsoft has provided a set of APIs to enable and disable the filesystem redirection on a per-thread basis. This is all well and dandy, but sometimes you may find yourself needing to turn off Wow64 filesystem redirection in a place where you can’t modify a program easily.

Until Vista, there was no easy way to do this. Fortunately, Vista adds a “backdoor” to Wow64 filesystem redirection, in the form of a pseudo-directory named “Sysnative” that is present under the Windows directory (e.g. C:\Windows\Sysnative). This pseudo-directory is not visible in directory listings, and does not exist for native 64-bit processes. For Wow64 processes, however, it can be used to access the 64-bit system32 directory.

Why would you need to do that in the first place? Well, there are lots of reasons, most of them dealing with things like where there is only a 64-bit version of a program that you need to launch (or the 64-bit version is necessary, such as if you need to communicate between a 32-bit process and a 64-bit process by launching another 64-bit process or something of the sort). Prior to Vista, situations like this were a pain, as there was no built-in way to access the 64-bit system32 directory. The “sysnative” pseudo-directory allows the problem to be addressed in a general fashion without requiring code changes, as Wow64 programs can address the 64-bit system32 via the special path (of course, this does require the user supplying such a path, but there is little working around this given the fact that 64-bit and 32-bit system DLLs have identical names).

Sysnative is not a universal solution, however. Some functionality does not work well with it, as sysnative does not appear in directory listings; for instance, the common controls open/save file dialogs won’t allow you to navigate through sysnative due to this issue.

Why doesn’t the publicly available kernrate work on Windows x64? (and how to fix it)

June 4th, 2007

Previously, I wrote up an introduction to kernrate (the Windows kernel profiler). In that post, I erroneously stated that the KrView distribution includes a version of kernrate that works on x64. Actually, it supports IA64 and x86, but not x64. A week or two ago, I had a problem on an x64 box of mine that I wanted to help track down using kernrate. Unfortunately, I couldn’t actually find a working kernrate for Windows x64 (Srv03 / Vista).

It turns out that nowhere is there a published kernrate that works on any production version of Windows x64. KrView ships with an IA64 version, and the Srv03 resource kit only ships with an x86 version only. (While you can run the x86 version of kernrate in Wow64, you can’t use it to profile kernel mode; for that, you need a native x64 version.)

A bit of digging turned up a version of kernrate labeled ‘AMD64’ in the Srv03 SP0 DDK (the 3790.1830 / Srv03 SP1 DDK doesn’t include kernrate at all, only documentation for it, and the 6000 WDK omits even the documentation, which is really quite a shame). Unfortunately, that version of kernrate, while compiled as native x64 and theoretically capable of profiling the x64 kernel, doesn’t actually work. If you try to run it, you’ll get the following error:

NtQuerySystemInformation failed status c0000004
KERNRATE: Failed to get SYSTEM_BASIC_INFORMATION

So, no dice even with the Srv03 SP0 DDK version of kernrate, or so I thought. Ironically, the various flavors of x86 (including 3790.0) kernrate I could find all worked on Srv03 x64 SP1 (3790.1830) / Vista x64 SP0 (6000.0), but as I mentioned earlier, you can’t use the profiling APIs to profile kernel mode from a Wow64 program. So, I had a kernrate that worked but couldn’t profile kernel mode, and a kernrate that should have worked and been able to profile kernel mode except that it bombed out right away.

After running into that wall, I decided to try and get ahold of someone at Microsoft who I suspected might be able to help, Andrew Rogers from the Windows Serviceability group. After trading mails (and speculation) back and forth a bit, we eventually determined just what was going on with that version of kernrate.

To understand what the deal is with the 3790.0 DDK x64 version of kernrate, a little history lesson is in order. When the production (“RTM”) version of Windows Server 2003 was released, it was supported on two platforms: x86, and IA64 (“Windows Server 2003 64-bit”). This continued until the Srv03 Service Pack 1 timeframe, when support for another platform “went gold” – x64 (or AMD64). Now, this means that the production code base for Srv03 x64 RTM (3790.1830) is essentially comparable to Srv03 x86 SP1 (3790.1830). While normally, there aren’t “breaking changes” (or at least, these are tried to kept to a minimum) from service pack to service pack, Srv03 x64 is kind of a special case.

You see, there was no production / RTM release of Srv03 x64 3790.0, or a “Service Pack 0” for the x64 platform. As a result, in a special case, it was acceptable for there to be breaking changes from 3790.0/x64 to 3790.1830/x64, as pre-3790.1830 builds could essentially be considered beta/RC builds and not full production releases (and indeed, they were not generally publicly available as in a normal production release).

If you’re following me this far, you’re might be thinking “wait a minute, didn’t he say that the 3790.0 x86 build of kernrate worked on Srv03 3790.1830?” – and in fact, I did imply that (it does work). The breaking change in this case only relates to 64-bit-specific parts, which for the most part excludes things visible to 32-bit programs (such as the 3790.0 x86 kernrate).

In this particular case, it turns out that part of the SYSTEM_BASIC_INFORMATION structure was changed from the 3790.0 timeframe to the 3790.1830 timeframe, with respect to x64 platforms.

The 3790.0 structure, as viewed from x64 builds, is approximately like this:

typedef struct _SYSTEM_BASIC_INFORMATION_3790 {
    ULONG Reserved;
    ULONG TimerResolution;
    ULONG PageSize;
    ULONG_PTR NumberOfPhysicalPages;
    ULONG_PTR LowestPhysicalPageNumber;
    ULONG_PTR HighestPhysicalPageNumber;
    ULONG AllocationGranularity;
    ULONG_PTR MinimumUserModeAddress;
    ULONG_PTR MaximumUserModeAddress;
    KAFFINITY ActiveProcessorsAffinityMask;
    CCHAR NumberOfProcessors;
} SYSTEM_BASIC_INFORMATION_3790, *PSYSTEM_BASIC_INFORMATION_3790;

However, by the time 3790.1830 (the “RTM” version of Srv03 x64 / XP x64) was released, a subtle change had been made:

typedef struct _SYSTEM_BASIC_INFORMATION {
	ULONG Reserved;
	ULONG TimerResolution;
	ULONG PageSize;
	ULONG NumberOfPhysicalPages;
	ULONG LowestPhysicalPageNumber;
	ULONG HighestPhysicalPageNumber;
	ULONG AllocationGranularity;
	ULONG_PTR MinimumUserModeAddress;
	ULONG_PTR MaximumUserModeAddress;
	KAFFINITY ActiveProcessorsAffinityMask;
	CCHAR NumberOfProcessors;
} SYSTEM_BASIC_INFORMATION, *PSYSTEM_BASIC_INFORMATION;

Essentially, three ULONG_PTR fields were “contracted” from 64-bits to 32-bits (by changing the type to a fixed-length type, such as ULONG). The cause for this relates again to the fact that the first 64-bit RTM build of Srv03 was for IA64, and not x64 (in other words, “64-bit Windows” originally just meant “Windows for Itanium”, something that is to this day still unfortunately propagated in certain MSDN documentation).

According to Andrew, the reasoning behind the difference in the two structure versions is that those three fields were originally defined as either a pointer-sized data type (such as SIZE_T or ULONG_PTR – for 32-bit builds), or a 32-bit data type (such as ULONG – for 64-bit builds) in terms of the _IA64_ preprocessor constant, which indicates an Itanium build. However, 64-bit builds for x64 do not use the _IA64_ constant, which resulted in the structure fields being erroneously expanded to 64-bits for the 3790.0/x64 builds. By the time Srv03 x64 RTM (3790.1830) had been released, the page counts had been fixed to be defined in terms of the _WIN64 preprocessor constant (indicating any 64-bit build, not just an Itanium build). As a result, the fields that were originally 64-bits long for the prerelease 3790.0/x64 build became only 32-bits long for the RTM 3790.1830/x64 production build. Note that due to the fact the page counts are expressed in terms of a count of physical pages, which are at least 4096 bytes for x86 and x64 (and significantly more for large physical pages on those platforms), keeping them as a 32-bit quantity is not a terribly limiting factor on total physical memory given today’s technology, and that of the foreseeable future, at least relating to systems that NT-based kernels will operate on).

The end result of this change is that the SYSTEM_BASIC_INFORMATION structure that kernrate 3790.0/x64 tries to retrieve is incompatible with the 3790.1830/x64 (RTM) version of Srv03, hence the call failing with c0000004 (otherwise known as STATUS_INFO_LENGTH_MISMATCH). The culimination of this is that the 3790.0/x64 version of kernrate will abort on production builds of Srv03, as the SYSTEM_BASIC_INFORMATION structure format is incompatible with the version kernrate is expecting.

Normally, this would be pretty bad news; the program was compiled against a different layout of an OS-supplied structure. Nonetheless, I decided to crack open kernrate.exe with IDA and HIEW to see what I could find, in the hopes of fixing the binary to work on RTM builds of Srv03 x64. It turned out that I was in luck, and there was only one place that actually retrieved the SYSTEM_BASIC_INFORMATION structure, storing it in a global variable for future reference. Unfortunately, there were a great many references to that global; too many to be practical to fix to reflect the new structure layout.

However, since there was only one place where the SYSTEM_BASIC_INFORMATION structure was actually retrieved (and even more, it was in an isolated, dedicated function), there was another option: Patch in some code to retrieve the 3790.1830/x64 version of SYSTEM_BASIC_INFORMATION, and then convert it to appear to kernrate as if it were actually the 3790.0/x64 layout. Unfortunately, a change like this involves adding new code to an existing binary, which means that there needs to be a place to put it. Normally, the way you do this when patching a binary on-disk is to find some padding between functions, or at the end of a PE section that is marked as executable but otherwise unused, and place your code there. For large patches, this may involve “spreading” the patch code out among various different “slices” of padding, if there is no one contiguous block that is long enough to contain the entire patch.

In this case, however, due to the fact that the routine to retrieve the SYSTEM_BASIC_INFORMATION structure was a dedicated, isolated routine, and that it had some error handling code for “unlikely” situations (such as failing a memory allocation, or failing the NtQuerySystemInformation(…SystemBasicInformation…) call (although, it would appear the latter is not quite so “unlikely” in this case), a different option presented itself: Dispense with the error checking, most of which would rarely be used in a realistic situation, and use the extra space to write in some code to convert the structure layout to the version expected by kernrate 3790.0. Obviously, while not a completely “clean” solution per se, the idea does have its merits when you’re already into patch-the-binary-on-disk-land (which pretty much rules out the idea of a “clean” solution altogether at that point).

The structure conversion is fairly straightforward code, and after a bit of poking around, I had a version to try out. Lo and behold, it actually worked; it turns out that the only thing that was preventing the prerelease 3790.0/x64 kernrate from doing basic kernel profiling on production x64 kernels was the layout of the SYSTEM_BASIC_INFORMATION structure.

For those so inclined, the patch I created effectively rewrites the original version of the function, as shown below after translated to C:

PSYSTEM_BASIC_INFORMATION_3790
GetSystemBasicInformation(
 VOID
 )
{
 PSYSTEM_BASIC_INFORMATION_3790 Sbi;
 NTSTATUS                       Status;

 Sbi = (PSYSTEM_BASIC_INFORMATION_3790)malloc(
  sizeof(SYSTEM_BASIC_INFORMATION_3790));

 if (!Sbi)
 {
  fprintf(stderr, "Buffer allocation failed"
   " for SystemInformation in "
   "GetSystemBasicInformation\\n");
  exit(1);
 }

 if (!NT_SUCCESS((Status = NtQuerySystemInformation(
  SystemBasicInformation,
  Sbi,
  sizeof(SYSTEM_BASIC_INFORMATION_3790),
  0))))
 {
  fprintf(stderr, "NtQuerySystemInformation failed"
   " status %08lx\\n",
   Status);
  free(Sbi);

  Sbi = 0;
 }

 return Sbi;
}

Conceptually represented in C, the modified (patched) function would appear something like the following (note that the error checking code has been removed to make room for the structure conversion logic, as previously mentioned; the substantial new changes are colored in red):

(Note that the patch code was not compiler-generated and is not truly a function written in C; below is simply how it would look if it were translated from assembler to C.)

PSYSTEM_BASIC_INFORMATION_3790
GetSystemBasicInformationFixed(
 VOID
 )
{
 PSYSTEM_BASIC_INFORMATION_3790 Sbi;
 PSYSTEM_BASIC_INFORMATION      SbiReal;

 Sbi     = (PSYSTEM_BASIC_INFORMATION_3790)malloc(
  sizeof(SYSTEM_BASIC_INFORMATION_3790));
 SbiReal = (PSYSTEM_BASIC_INFORMATION)Sbi;

 NtQuerySystemInformation(
  SystemBasicInformation,
  SbiReal,
  sizeof(SYSTEM_BASIC_INFORMATION),
  0);

 Sbi->NumberOfProcessors           =
  SbiReal->NumberOfProcessors;
 Sbi->ActiveProcessorsAffinityMask =
  SbiReal->ActiveProcessorsAffinityMask;
 Sbi->MaximumUserModeAddress       =
  SbiReal->MaximumUserModeAddress;
 Sbi->MinimumUserModeAddress       =
  SbiReal->MinimumUserModeAddress;
 Sbi->AllocationGranularity        =
  SbiReal->AllocationGranularity;
 Sbi->HighestPhysicalPageNumber    =
  SbiReal->HighestPhysicalPageNumber;
 Sbi->LowestPhysicalPageNumber     =
  SbiReal->LowestPhysicalPageNumber;
 Sbi->NumberOfPhysicalPages        =
  SbiReal->NumberOfPhysicalPages;

 return Sbi;
}

(The structure can be converted in-place due to how the two versions are laid out in memory; the “expected” version is larger than the “real” version (and more specifically, the offsets of all the fields we care about are greater in the “expected” version than the “real” version), so structure fields can safely be copied from the tail of the “real” structure to the tail of the “expected” structure.)

If you find yourself in a similar bind, and need a working kernrate for x64 (until if and when Microsoft puts out a new version that is compatible with production x64 kernels), I’ve posted the patch (assembler, with opcode diffs) that I made to the “amd64” release of kernrate.exe from the 3790.0 DDK. Any conventional hex editor should be sufficient to apply it (as far as I know, third parties aren’t authorized to redistribute kernrate in its entirety, so I’m not posting the entire binary). Note that DDKs after the 3790.0 DDK don’t include any kernrate for x64 (even the prerelease version, broken as it was), so you’ll also need the original 3790.0 DDK to use the patch. Hopefully, we may see an update to kernrate at some point, but for now, the patch suffices in a pinch if you really need to profile a Windows x64 system.

What’s the difference between the Wow64 and native x86 versions of a DLL?

June 1st, 2007

Wow64 includes a complete set of 32-bit system DLLs implementing the Win32 API (for use by Wow64 programs). So, what’s the difference between the native 32-bit DLLs and their Wow64 versions?

On Windows x64, not much, actually. Most of the DLLs are actually the 32-bit copies from the 32-bit version of the operating system (not even recompiled). For example, the Wow64 ws2_32.dll on Vista x64 is the same file as the native 32-bit ws2_32.dll on Vista x86. If I compare the wow64 version of ws2_32.dll on Vista with the x86 native version, they are clearly identical:

32-bit Vista:

C:\\Windows\\System32>md5sum ws2_32.dll
d99a071c1018bb3d4abaad4b62048ac2 *ws2_32.dll

64-bit Vista (Wow64):

C:\\Windows\\SysWOW64>md5sum ws2_32.dll
d99a071c1018bb3d4abaad4b62048ac2 *ws2_32.dll

Some DLLs, however, are not identical. For instance, ntdll is very different across native x86 and Wow64 versions, due to the way that Wow64 system calls are translated to the 64-bit user mode portion of the Wow64 layer. Instead of calling kernel mode directly, Wow64 system services go through a translation layer in user mode (so that the kernel does not need to implement 32-bit and 64-bit versions of each system service, or stubs thereof in kernel mode). This difference can clearly be seen when comparing the Wow64 ntdll with the native x64 ntdll, with respect to a system service:

If we look at the native x86 ntdll, we see the expected call through the SystemCallStub pointers in SharedUserData:

0:000> u ntdll!NtClose
ntdll!ZwClose:
mov     eax,30h
mov     edx,offset SharedUserData!SystemCallStub
call    dword ptr [edx]
ret     4

However, an examination of the Wow64 ntdll shows something different; a call is made through a field at offset +C0 in the 32-bit TEB:

0:000> u ntdll!NtClose
ntdll!ZwClose:
mov     eax,0Ch
xor     ecx,ecx
lea     edx,[esp+4]
call    dword ptr fs:[0C0h]
ret     4

If we examine the 32-bit TEB, this field as marked as “WOW32Reserved” (kind of a misnomer, since it’s actually reserved for the 32-bit portion of Wow64):

0:000> dt ntdll!_TEB
   +0x000 NtTib            : _NT_TIB
[...]
   +0x0c0 WOW32Reserved    : Ptr32 Void

As I had previously mentioned, this all leads up to a transition to 64-bit mode in user mode in the Wow64 case. From there, the 64-bit portion of the Wow64 layer reformats the request into a 64-bit system call, makes the actual kernel transition, then repackages the results into what a 32-bit system call would return.

Thus, one case where a Wow64 DLL will necessarily differ is when that DLL contains a system call stub. The most common case of this is ntdll, of course, but there are a couple of other modules with inline system call stubs (user32.dll and gdi32.dll are the most common other modules, though winsrv.dll (part of CSRSS) also contains stubs, however it has no Wow64 version).

There are other cases, however, where Wow64 DLLs differ from their native x86 counterparts. Usually, a DLL that makes an LPC (“Local Procedure Call”) call to another component also has a differing version in the Wow64 layer. This is because most of the LPC interfaces in 64-bit Windows have been “widened” to 64-bits, where they included pointers (most notably, this includes the LSA LPC interface for use in implementing code that talks to LSA, such as LogonUser or calls to authentication packages). As a result, modules like advapi32 or secur32 differ on Wow64.

This has, in fact, been the source of some rather insidious bugs where mistakes have been made in the translation layer. For instance, if you call LsaGetLogonSessionData on Windows Server 2003 x64, from a Wow64 process, you may unhappily find out that if your program accesses any fields beyond “LogonTime” in the SECURITY_LOGON_SESSION_DATA structure returned by LsaGetLogonSessionData, it will crash. This is the result of a bug in the Wow64 version of Secur32.dll (on Windows Server 2003 x64 / Windows XP x64), where part of the structure was not properly converted to its 32-bit counterpart after the 64-bit version is read “off the wire” from an LSA LPC call. This particular problem has been fixed in Windows Vista x64 (and can be worked around with some clever hackery if you take a look at how the conversion layer in Secur32 works for Srv03 x64). Note that as far as I know from having spoken to PSS, Microsoft has no plans to fix the bug for pre-Vista platforms (and it is still broken in Srv03 x64 SP2), so if you are affected by that problem then you’ll have to live with implementing a hackish workaround if you can’t rebuild your app as native 64-bit (the problem only occurs for 32-bit processes on Srv03 x64, not 32-bit processes on Srv03 32-bit, or 64-bit processes on Srv03 x64).

There are a couple of other corner cases which result in necessary differences between Wow64 modules and their native 32-bit counterparts, though these are typically few and far between. Most notably absent in the list of things that require code changes for a Wow64 DLL are modules that directly talk to 64-bit drivers via IOCTLs. For example, even “low-level” modules such as mswsock.dll (the AFD Winsock2 Tcpip service provider which implements the user mode to kernel mode translation layer from Winsock2 service provider calls to AFD networking calls in kernel mode) that communicate heavily with drivers via IOCTLs are unchanged in Wow64. This is due to direct support for Wow64 processes in the kernel IOCTL interface (requiring changes to drivers to be aware of 32-bit callers via IoIs32bitProcess), such that 32-bit and 64-bit processes can share the same IOCTL codes even though such structures are typically “widened” to 64-bits for 64-bit processes.

Due to this approach to Wow64, most of the code that implements the 32-bit Win32 API on x64 computers is identical to their 32-bit counterparts (the same binary in the vast majority of cases). For DLLs that do require a rebuild, typically only small portions (system call stubs and LPC stubs in most cases) require any code changes. The major benefits to this design are savings with respect to QA and development time for the Wow64 layer, as most of the code implementing the Win32 API (at least in user mode) did not need to be re-engineered. That’s not to say that Microsoft hasn’t tested it, but it’s a far cry from having to modify every single module in a non-trivial way.

Additionally, this approach has the added benefit of preserving the vast majority of implementation details that have slowly sept into the Win32 programming model, as much as Microsoft has tried to prevent such from happening. Even programs that use undocumented system calls from NTDLL should generally work on the 64-bit version of that operating system via the Wow64 layer, for instance.

In a future posting later on down the line, I’ll be expanding on this high level overview of Wow64 and taking a more in depth look at the guts of Wow64 (of course, subject to change between OS releases without warning, as usual).

Can I get that without the blinky lights?

May 15th, 2007

It seems that every sort of electronic device you get nowadays has to have some sort of ridiculously annoying blinky light.

The set of Bluetooth stereo headphones that I recently bought have an annoying blue light on the “M” button on each earpiece. While you’re playing stereo audio through them, the two lights slowly fade in and out. It’s really quite annoying to walk into a dark room with the headphones on and have a weird blue glow playing off the walls because of it. (It seems that there must be a rule out that somewhere that every Bluetooth device has to have a blue LED somewhere, or it’s not really Bluetooth(tm).)

Then there’s the V740 ExpressCard that I got recently for EVDO Rev.A Internet access. The improved upstream and latency on it is great, but I wish that someone would have written on the box: “Warning, this product includes a blinky LED whose sole purpose is to annoy the operator.”

Specifically, there’s bright LED (low quality camera – sorry) on the card that blinks several times per second while in use. The LED doesn’t provide any sort of indication of data rate, what type of service you are using; just that you’re connected (in other words, nothing useful). Apparently, Novatel thought it important enough to remind you that you’re still online with an annoying bright light right by your laptop keyboard (where most ExpressCard ports are) that strobes a couple times per second. Great idea, there, Novatel…

Then, there’s a PNY USB flash memory stick I bought recently that also blinks many times per second while it’s in use (bright red, this time). This, of course, really sucks if you use it for ReadyBoost, pretty much forcing you to plug it into the back of your computer (to the eternal annoyance of anyone sitting across from you at a table).

Even my laptop (an XPS M1710) is adorned with flashy LEDs, though at least these are programmable (and can be disabled if desired), making them at least somewhat redeemable.

Ironically, my cell phone is about the least intrusive gadget I use in terms of annoying blinky lights. I think that is saying something about the unfortunate state of affairs with modern electronic gadgets and the obsession with blinky lights nowadays…

Beware GetThreadContext on Wow64

May 11th, 2007

If you’re planning on running a 32-bit program of yours under Wow64, one of the things that you may need to watch out for is a subtle change in how GetThreadContext / SetThreadContext operate for Wow64 processes.

Specifically, these operations require additional access rights when operating on a Wow64 context (as is the case of a Wow64 process calling Get/SetThreadContext). This is hinted at in MSDN in the documentation for GetThreadContext, with the following note:

WOW64: The handle must also have THREAD_QUERY_INFORMATION access.

However, the reality of the situation is not completely documented by MDSN.

Under the hood, the Wow64 context (e.g. x86 context) for a thread on Windows x64 is actually stored at a relatively well-known location in the virtual address space of the thread’s process, in user-mode. Specifically, one of the TLS slots in thread’s TLS array is repurposed to point to a block of memory that contains the Wow64 context for the thread (used to read or write the context when the Wow64 “portion” of the thread is not currently executing). This is presumably done for performance reasons, as the Wow64 thunk layer needs to be able to quickly transition to/from x86 mode and x64 mode. By storing the x86 context in user mode, this transition can be managed without a kernel mode call. Instead, a simple far call is used to make the transition from x86 to x64 (and an iretq is used to transition from x64 to x86). For example, the following is what you might see when stepping into a Wow64 layer call with the 64-bit debugger while debugging a 32-bit process:

ntdll_777f0000!ZwWaitForMultipleObjects+0xe:
00000000`7783aebe 64ff15c0000000  call    dword ptr fs:[0C0h]
0:000:x86> t
wow64cpu!X86SwitchTo64BitMode:
00000000`759c31b0 ea27369c753300  jmp     0033:759C3627
0:012:x86> t
wow64cpu!CpupReturnFromSimulatedCode:
00000000`759c3627 67448b0424      mov     r8d,dword ptr [esp]

A side effect of the Wow64 register context being stored in user mode, however, is that it is not so easily accessible to a remote process. In order to access the Wow64 context, what needs to occur is that the TEB for a thread in question must be located, then read out of the thread’s process’s address space. From there, the TLS array is processed to locate the pointer to the Wow64 context structure, which is then read (or written) from the thread’s process’s address space.

If you’ve been following along so far, you might see the potential problem here. From the perspective of GetThreadContext, there is no handle to the process associated with the thread in question. In other words, we have a missing link here: In order to retrieve the Wow64 context of the thread whose handle we are given, we need to perform a VM read operation on its process. However, to do that, we need a process handle, but we’ve only got a thread handle.

The way that the Wow64 layer solves this problem is to query the process ID associated with the requested thread, open a new handle to the process with the required access rights, and then performs the necessary VM read / write operations.

Now, normally, this works fine; if you can get a handle to the thread of a process, then you should almost always be able to get a handle to the process itself.

However, the devil is in the details, here. There are situations where you might not have access to open a handle to a process, even though you have a handle to the thread (or even an existing process handle, but you can’t open a new one). These situations are relatively rare, but they do occur from time to time (in fact, we ran into one at work here recently).

The most likely scenario for this issue is when you are dealing with (e.g. creating) a process that is operating under a different security context than your process. For example, if you are creating a process operating as a different user, or are working with a process that has a higher integrity level than the current process, then you might not have access to open a new handle to the process (even if you might already have an existing handle returned by a CreateProcess* family routine.

This turns into a rather frustrating problem, especially if you have a handle to both the process and a thread in the process that you’re modifying; in that case, you already have the handle that the Wow64 layer is trying to open, but you have no way to communicate it to Wow64 (and Wow64 will try and fail to open the handle on its own when you make the GetThreadContext / SetThreadContext call).

There are two effective solutions to this problem, neither of which are particularly pretty.

First, you could reverse engineer the Wow64 layer and figure out the exact specifics behind how it locates the PWOW64_CONTEXT, and implement that logic inline (using your already-existing process handle instead of creating a new one). This has the downside that you’re way into undocumented implementation details land, so there isn’t a guarantee that your code will continue to operate on future Windows versions.

The other option is to temporarily modify the security descriptor of the process to allow you to open a second handle to it for the duration of the GetThreadContext / SetThreadContext calls. Although this works, it’s definitely a pain to have to go muddle around with security descriptors just to get the Wow64 layer to work properly.

Note that if you’re a native x64 process on x64 Windows, and you’re setting the 64-bit context of a 64-bit process, this problem does not apply. (Similarly, if you’re a 32-bit process on 32-bit Windows, things work “as expected” as well.)

So, in case you’ve been getting mysterious STATUS_ACCESS_DENIED errors out of GetThreadContext / SetThreadContext in a Wow64 program, now you know why.

Don’t perform complicated tasks in your unhandled exception filter

May 10th, 2007

When it comes to crash reporting, the mechanism favored by many for globally catching “crashes” is the unhandled exception filter (as set by SetUnhandledExceptionFilter).

However, many people tend to go wrong with this mechanism with respect to what actions they take when the unhandled exception filter is called. To understand what I am talking about, it’s necessary to define the conditions under which an unhandled exception filter is executed.

The unhandled exception filter is, by definition, called when an unhandled exception occurs in any Win32 thread in a process. This sort of event is virtually always caused by some sort of corruption of the process state somewhere, such that something eventually probably touched a non-allocated page somewhere and caused an unhandled access violation (or some other similarly severe problem).

In other words, in the context of the unhandled exception filter, you don’t really know what lead up to the current unhandled exception, and more importantly, you don’t know what you can rely on in the process. For example, if you get an AV that bubbles up to your UEF, it might have been caused by corruption in the process heap, which would mean that you probably can’t safely perform heap allocations or you’re risking running into the same problem that caused the original crash in the first place. Or perhaps the problem was an unhandled allocation failure, and another attempt by your unhandled exception filter to allocate memory might just similarly fail.

Actually, the problem gets a bit worse, because you aren’t even guaranteed anything about what the other threads in the process are doing when the crash occurs (in fact, if there are any other threads in the process at the time of the crash, chances are that they’re still running when your unhandled exception filter is called – there is no magical logic to suspend all other activity in the process while your filter is called). This has a couple of other implications for you:

  1. You can’t really rely on the state of synchronization objects in the process. For all you know, the thread that crashed owned a lock that will cause a deadlock if you try to acquire a second lock, which might be owned by a thread that is waiting on the lock owned by the crashed thread.
  2. You can’t with 100% certainty assume that a secondary failure (precipitated by the “original” crash) won’t occur in a different thread, causing your exception filter to be entered by an additional thread at the same time as it is already processing an event from the “first” crash.

In fact, it would be safe to say that there is even less that you can safely do in an unhandled exception filter than under the infamous loader lock in DllMain (which is saying something indeed).

Given these rather harsh conditions, performing actions like heap allocations or writing minidumps from within the current process are likely to fail. Essentially, as whatever kind of recovery action you take from the UEF grows more complicated, it becomes extremely more likely to fail (possibly causing secondary failures that obscure the original problem) as a result of corruption that caused (or is caused by) the original crash. Even something as seemingly innocuous as creating a new process is potentially dangerous (are you sure that nothing in CreateProcess will ever touch the process heap? What about acquire the loader lock? I’ll give you a hint – the latter is definitely not true, such as in the case where Software Restriction Policies are defined).

If you’ve ever taken a look at the kernel32 JIT debugger support in Windows XP, you may have noticed that it doesn’t even follow these rules – it calls CreateProcess, after all. This is part of the reason why sometimes you’ll have programs silently crash even with a JIT debugger. For programs where you want truly robust crash reporting, I would recommend putting as much of the crash reporting logic into a separate process that is started before a crash occurs (e.g. during program initialization), instead of following the JIT launch-reporting-process approach. This “watchdog process” can then sit idle until it is signaled by the process it is watching over that a crash has occured.

This signaling mechanism should, ideally, be pre-constructed during initialization so that the actual logic within the unhandled exception filter just signals the watchdog process that an event occurs, with a pointer to the exception/context information. The filter should then wait for the watchdog process to signal that it is finished before exiting the process.

The mechanism that we use here at work with our programs to communicate between the “guarded” process and the watchdog is simply a file mapping that is mapped into both processes (for passing information between the two, such as the address of the exception record and context record for an active exception event) and a pair of events that are used to communicate “exception occured” and “dump writing completed” between the two processes. With this configuration, all the exception filter needs to do is to store some data in the file mapping view (already mapped ahead of time) and call SetEvent to notify the watchdog process to wake up. It then waits for the watchdog process to signal completion before terminating the process. (This particular mechanism does not address the issue of multiple crashes occuring at the same time, which is something that I deemed acceptable in this case.) The watchdog process is responsible for all of the “heavy lifting” of the crash reporting process, namely, writing the actual dump with MiniDumpWriteDump.

An alternative to this approach is to have the watchdog process act as a debugger on the guarded process; however, I do not typically recommend this as acting as a debugger has a number of adverse side effects (notably, that a great many events cause the process to be suspended entirely while the debugger/watchdog process can inspect the state of the process for a particular debugger event). The watchdog process mechanism is more performant (if ever so slightly less robust), as there is virtually no run-time overhead in the guarded process unless an unhandled exception occurs.

So, the moral of the story is: keep it simple (at least with respect to your unhandled exception filters). I’ve dealt with mechanisms that try to do the error reporting logic in-process, and those that punt the hard work off to a watchdog process in a clean state, and the latter is significantly more reliable in real world cases. The last thing you want to be happening with your crash reporting mechanism is that it causes secondary problems that hide the original crash, so spend the bit of extra work to make your reporting logic that much more reliable and save yourself the headaches later on.

What is the “lpReserved” parameter to DllMain, really? (Or a crash course in the internals of user mode process initialization)

May 9th, 2007

One of the parameters to the DllMain function is the enigmatic lpReserved argument. According to MSDN, this parameter is used to convey whether a DLL is being loaded (or unloaded) as part of process startup or termination, or as part of a dynamic DLL load/unload operation (e.g. LoadLibrary/FreeLibrary).

Specifically, MSDN says that for static DLL operations, lpReserved contains a non-null value, whereas for dynamic DLL operations, it contains a null value.

There’s actually a little bit more to this parameter, though. While it’s true that in the case of DLL_PROCESS_DETACH operations, it is little more than a boolean value, it has some more significance in DLL_PROCESS_ATTACH operations.

To understand this, you need to know a little bit more about how process/thread initialization occurs. When a new user mode thread is started, the kernel queues a user mode APC to it, pointing to a function in ntdll called LdrInitializeThunk (ntdll is always mapped into the address space of a new process before any user mode code runs). The kernel also arranges for the thread to dispatch the user mode APC the first time it begins execution.

One of the arguments to the APC is a pointer to a CONTEXT structure describing the initial execution state of the new thread. (The actual contents of the CONTEXT structure are based at the stack of the thread.)

When the thread is first resumed, the APC executes and control transfers to ntdll!LdrInitializeThunk. From there, depending on whether the process is already initialized or not, either the process initialization code is executed (loading DLLs statically linked to the process, and soforth), or per-thread initialization code is run (for instance, making DLL_THREAD_ATTACH callouts to loaded DLLs).

If a user mode thread always actually begins execution at ntdll!LdrInitializeThunk, then you might be wondering how it ever starts executing at the start address specified in a CreateThread call. The answer is that eventually, the code called by LdrInitializeThunk passes the context record argument supplied by the kernel to the NtContinue system call, which you can think of as simply taking that context and transferring control to it. Because the context record argument to the APC contained the information necessary for control to be transferred to the starting address supplied to CreateThread, the thread then begins executing at expected thread starting address*.

(*: Actually, there is typically another layer here – usually, control would go to a kernel32 or ntdll function (depending on whether you are running on a downlevel platform or on Vista), which sets up a top level exception handler and then calls the start routine supplied to CreateThread. But, for the purposes of this discussion, you can consider it as just running the requested thread start routine.)

As all of this relates to DllMain, the value of the lpReserved parameter to DllMain (when process initialization is occuring and static linked DLLs are being loaded and initialized) corresponds to the context record argument supplied to the LdrInitializeThunk APC, and is thus representative of the initial context that will be set for the thread after initialization completes. In fact, it’s actually the context that will be used after process initialization is complete, and not just a copy of it. This means that by treating the lpReserved argument as a PCONTEXT, the initial execution state for the first thread in the process can be examined (or even altered) from the DllMain of a static-linked DLL.

This can be verified experimentally by using some trickery to step into process initialization DllMain (more on just how to do that with the user mode debugger in a future entry, as it turns out to be a bit more complicated than what one might imagine):

1:001> g
Breakpoint 3 hit
ADVAPI32!DllInitialize:
000007fe`feb15580 48895c2408      mov     qword ptr [rsp+8],
rbx ss:00000000`001defc0=0000000000000000
1:001> r
rax=0000000000000000 rbx=00000000002c4770 rcx=000007fefeaf0000
rdx=0000000000000001 rsi=000007fefeb15580 rdi=0000000000000003
rip=000007fefeb15580 rsp=00000000001defb8 rbp=0000000000000000
 r8=00000000001df530  r9=00000000001df060 r10=00000000002c1310
r11=0000000000000246 r12=0000000000000000 r13=00000000001df0f0
r14=0000000000000000 r15=000000000000000d
iopl=0         nv up ei pl zr na po nc
cs=0033  ss=002b  ds=002b  es=002b  fs=0053  gs=002b
             efl=00000246
ADVAPI32!DllInitialize:
000007fe`feb15580 48895c2408      mov     qword ptr [rsp+8],
rbx ss:00000000`001defc0=0000000000000000
1:001> k
RetAddr           Call Site
00000000`77414664 ADVAPI32!DllInitialize
00000000`77417f29 ntdll!LdrpRunInitializeRoutines+0x257
00000000`7748e974 ntdll!LdrpInitializeProcess+0x16af
00000000`7742c4ee ntdll! ?? ::FNODOBFM::`string'+0x1d641
00000000`00000000 ntdll!LdrInitializeThunk+0xe
1:001> .cxr @r8
rax=0000000000000000 rbx=0000000000000000 rcx=00000000ffb3245c
rdx=000007fffffda000 rsi=0000000000000000 rdi=0000000000000000
rip=000000007742c6c0 rsp=00000000001dfa08 rbp=0000000000000000
 r8=0000000000000000  r9=0000000000000000 r10=0000000000000000
r11=0000000000000000 r12=0000000000000000 r13=0000000000000000
r14=0000000000000000 r15=0000000000000000
iopl=0         nv up ei pl nz na pe nc
cs=0033  ss=002b  ds=0000  es=0000  fs=0000  gs=0000
             efl=00000200
ntdll!RtlUserThreadStart:
00000000`7742c6c0 4883ec48        sub     rsp,48h
1:001> u @rcx
tracert!mainCRTStartup:
00000000`ffb3245c 4883ec28        sub     rsp,28h

If you’ve been paying attention thus far, then you might then be able to explain why when you set a hardware breakpoint at the initial process breakpoint, the debugger warns you that it will not take effect. For example:

(16f0.1890): Break instruction exception
- code 80000003 (first chance)
ntdll!DbgBreakPoint:
00000000`7742fdf0 cc              int     3
0:000> ba e1 kernel32!CreateThread
        ^ Unable to set breakpoint error
The system resets thread contexts after the process
breakpoint so hardware breakpoints cannot be set.
Go to the executable's entry point and set it then.
 'ba e1 kernel32!CreateThread'
0:000> k
RetAddr           Call Site
00000000`774974a8 ntdll!DbgBreakPoint
00000000`77458068 ntdll!LdrpDoDebuggerBreak+0x35
00000000`7748e974 ntdll!LdrpInitializeProcess+0x167d
00000000`7742c4ee ntdll! ?? ::FNODOBFM::`string'+0x1d641
00000000`00000000 ntdll!LdrInitializeThunk+0xe

Specifically, this message occurs because the debugger knows that after the process breakpoint occurs, the current thread context will be discarded at the call to NtContinue. As a result, hardware breakpoints (which rely on the debug register state) will be wiped out when NtContinue restores the expected initial context of the new thread.

If one is clever, it’s possible to apply the necessary modifications to the appropriate debug register values in the context record image that is given as an argument to LdrInitializeThunk, which will thus be realized when NTDLL initialization code for the thread runs.

The fact that every user mode thread actually begins life at ntdll!LdrInitializeThunk also explains why, exactly, you can’t create a new thread in the current process from within DllMain and attempt to synchronize with it; by virtue of the fact that the current thread is executing DllMain, it must have the enigmatic loader lock acquired. Because the new thread begins execution at LdrInitializeThunk (even before the start routine you supply is called), for the purpose of making DLL_THREAD_ATTACH callouts, it too will become almost immediately blocked on the loader lock. This results in a classic deadlock if the thread already in DllMain tries to wait for the new thread.

Parting shots:

  • Win9x is, of course, completely dissimilar as far as DllMain goes. None of this information applies there.
  • The fact that lpReserved is a PCONTEXT is only very loosely documented by a couple of ancient DDK samples that used the PCONTEXT argument type, and some more recent SDK samples that name the “lpReserved” parameter “lpvContext”. As far as I know, it’s been around on all versions of NT (including Vista), but like other pseudo-documented things, it isn’t necessarily guaranteed to remain this way forever.
  • Oh, and in case you’re wondering why I used advapi32 instead of kernel32 in this example, it’s because due to a rather interesting quirk in ntdll on recent versions of Windows, kernel32 is always dynamic-loaded for every Win32 process (regardless of whether or not the main process image is static linked to kernel32). To make things even more interesting, kernel32 is dynamic loaded before static-linked DLLs are loaded. As a result, I decided it would be best to steer clear of it for the purposes of making this posting simple; be suitably warned, then, about trying this out on kernel32 at home.