Archive for the ‘Windows’ Category

SDbgExt extensions – part 2.

Monday, July 10th, 2006

Last post, I discussed some of the major extensions available in SDbgExt.  This post is a continuation that explains the remaining major / interesting extensions available in the extension library.

 Another set of extensions that may prove useful if you do frequent C++ development are the STL datastructures display functions.  These allow you to interpret many of the complicated STL container datastructures and format them in more understandable forms from the debugger.  At present, the extension DLL only understands the Visual C++ 7.0 / 7.1 STL structure layout (no support for VC6 or VC8 as of yet, sorry).  This complements the built-in WinDbg support for STL, which does not cover all of VC7 (or at least did not the last time I checked).  The SDbgExt STL type display functions do not rely on type information contained in symbols, so you can use them to help debug third party programs that you do not have symbols for.  However, some extensions will require you to have a bit of knowledge about the structures contained in an STL container (usually the size of a container item).  These extensions will work on local or remote 32-bit targets.

The main functions that you might find useful in this grouping are these:

!stlmap allows you to traverse an std::map (binary tree) and optionally dump some bytes from each of the key and value types.  Due to the layout of the std::map type, you will need to tell SDbgExt about the size of the key and value types.

!stlset allows you to traverse an std::set (binary tree) and optionally dump some bytes from each of the value types.  It is very similar to the !stlmap extension, except that it works on std::set structures and thus only needs information about a value type and not an additional key type.

!stllist and !stlvector allow you to display the contents of an std::list or std::vector, respectively.  Optionally, you may provide the size of an element to dump some bytes from each element to the debugger.

!stlstring and !stlwstring allow you to display std::string and std::wstring structures.  If you are displaying very long strings (>16K characters), then you will need to provide a second argument specifying the maximum number of characters to display.  This limit is always capped at 64K characters.

Most of the STL datastructures traversal functions have only minimal protection against damaged or corrupted datastructures.  If you attempt to use them on a datastructure that is broken (perhaps it has a link that references itself, causing an infinite loop), then you will need to break out of the extension with Ctrl-C or Ctrl-Break depending on which debugger you use.  To ensure that your system remains responsive enough to have the option of breaking out, SDbgExt will lower its priority to at least normal (WinDbg runs at high priority by default) temporarily, for the duration of the call to the extension (the original priority is restored before the extension returns).
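To illustrate the kind of guard such a traversal needs, here is a minimal Python sketch (not SDbgExt's actual code) that walks a singly linked list held in a simulated memory map and stops when it revisits a node, follows a bad pointer, or exceeds a node limit.  The memory layout, the walk_list name, and the max_nodes parameter are all assumptions made for illustration.

```python
def walk_list(memory, head, max_nodes=10000):
    """Walk a linked list stored in `memory` (hypothetical dict: address -> next address).

    Returns (visited_addresses, status), where status is one of
    "ok", "cycle", "bad-pointer", or "limit".
    """
    visited = []
    seen = set()
    node = head
    while node != 0:
        if node in seen:
            # A link points back at a node we already walked.
            return visited, "cycle"
        if node not in memory:
            # The link points outside of the memory we can read.
            return visited, "bad-pointer"
        if len(visited) >= max_nodes:
            return visited, "limit"
        seen.add(node)
        visited.append(node)
        node = memory[node]
    return visited, "ok"


# A healthy three-node list, and a damaged one whose last node links to itself.
healthy = {0x1000: 0x2000, 0x2000: 0x3000, 0x3000: 0}
damaged = {0x1000: 0x2000, 0x2000: 0x2000}

print(walk_list(healthy, 0x1000))
print(walk_list(damaged, 0x1000))
```

Tracking visited nodes costs memory proportional to the list length, which is one plausible reason a debugger extension might prefer a simple iteration cap or rely on the user breaking in.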

The last major category of functions supported by SDbgExt are those that are related to Windows UI debugging.  These extensions generally require a live 32-bit target on the local computer in order to function (no remote debugging).  They work by directly querying internal window structures or by calling user32/gdi32 APIs about a particular UI object.  These can be used as a handy replacement for Spy++, which has a nasty tendency to break badly (and break every GUI application on the same desktop with it) when it encounters a GUI program that is frozen in the debugger.

The !hwnd extension is the primary UI debugging extension supported by SDbgExt.  It will dump many of the interesting attributes about a window given its handle (HWND) to the debugger console.  Additionally, it can be used to enumerate the active windows on the current thread.  This extension is particularly useful for programs that store things like class object pointers at GWLP_USERDATA or DWLP_USERDATA, which are normally hard to get at from WinDbg.

The !getprop extension can be used to enumerate window properties associated with a particular window, or to query a single specific window property associated with a particular window.  These are useful if a program stores information (like a class object pointer) in a window property and you need to get at it from the debugger, which is something that you cannot easily do from WinDbg normally.

The !msg extension will display information about a MSG structure given a pointer to it.  It has a limited built-in set of message names for some of the most common standard window messages.  It will also display information about the window with which the window message is associated (if any).

Finally, there are a couple of misc. functions that SDbgExt supports which don’t fit cleanly into any specific category.  Many of these are niche extensions that are only useful for very specific scenarios.

The !switchdesk extension will switch the active desktop for the computer hosting the debugger.  This can be useful if you are debugging a full screen DirectX program using a GUI debugger on an alternate desktop and a console debugger in full screen mode connected to the GUI debugger using remote debugging on the same desktop as the program being debugged.

The !lmx extension will allow you to view the loaded module list in the various forms that the loader maintains it (in-load-order, in-memory-order, in-initialization-order).  When used on kernel targets, there is only one loaded module list, so the list identification parameter is unused in that usage case.

!findwfptr (courtesy of skape) allows you to scan the address space of the debuggee for function pointers that are located in writable memory.  It will work on all target types.  For live targets, it can also optionally place breakpoints at all of the functions pointed to by pointers residing in writable memory.  This extension is useful if you are auditing a program for potential security risks; many of the static addresses (i.e. global variables) in standard system DLLs that contain function pointers have been changed to encode the pointer values based on a random cookie to prevent exploits from overwriting them to gain control over the program.  This extension can help you identify which function pointers might be used by an attacker to compromise your program when used in conjunction with certain types of security flaws.
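The basic idea behind such a scan can be sketched in a few lines of Python.  This is only a simplified model, not the extension's real implementation: it treats any aligned pointer-sized value in a writable region that lands inside an executable range as a candidate, and the region lists, addresses, and function name are hypothetical.

```python
import struct


def find_writable_function_pointers(writable, executable, ptr_size=4):
    """Scan writable regions for values that point into executable ranges.

    writable:   list of (base_address, bytes) tuples (hypothetical snapshot data)
    executable: list of (start, end) address ranges, end exclusive
    Returns a list of (pointer_location, pointer_value) tuples.
    """
    fmt = "<I" if ptr_size == 4 else "<Q"
    hits = []
    for base, data in writable:
        # Only aligned slots are examined in this simplified model.
        for offset in range(0, len(data) - ptr_size + 1, ptr_size):
            (value,) = struct.unpack_from(fmt, data, offset)
            if any(start <= value < end for start, end in executable):
                hits.append((base + offset, value))
    return hits


# Made-up layout: one code section and one writable data section.
code_ranges = [(0x00401000, 0x00402000)]
data_region = (0x00403000, struct.pack("<III", 0x12345678, 0x00401234, 0))

for location, value in find_writable_function_pointers([data_region], code_ranges):
    print("%08x -> %08x" % (location, value))
```

A real scanner would presumably also need to filter out values that merely happen to fall inside a code range, which this sketch does not attempt.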

!cmpmem allows you to compare ranges of memory over time, with the ability to exclude modified ranges.  This is particularly useful if you want to watch a data structure (or even the global variables of an entire module) and quickly determine what changes when a particular operation happens in the context of the debuggee.  For example, if you are trying to reverse engineer a function and want to understand some of the side effects it has on data structures or global variables, this function can help quickly identify modified areas without requiring you to analyze the entire function in a disassembler.
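As a rough model of this kind of comparison (a sketch under assumptions, not how !cmpmem is actually implemented), the following Python function takes two snapshots of the same memory range and reports the address ranges that differ, skipping any ranges you have chosen to exclude.  The function name, the snapshot format, and the addresses are made up for the example.

```python
def changed_ranges(before, after, base=0, exclude=()):
    """Return [(start, end), ...] address ranges (end exclusive) that differ.

    before/after: equal-length bytes snapshots of the same region
    base:         address of the first byte in the snapshots
    exclude:      iterable of (start, end) ranges to ignore
    """
    if len(before) != len(after):
        raise ValueError("snapshots must cover the same range")
    ranges = []
    start = None
    for i, (old, new) in enumerate(zip(before, after)):
        addr = base + i
        skipped = any(lo <= addr < hi for lo, hi in exclude)
        if old != new and not skipped:
            if start is None:
                start = addr          # open a new run of modified bytes
        elif start is not None:
            ranges.append((start, addr))  # close the current run
            start = None
    if start is not None:
        ranges.append((start, base + len(before)))
    return ranges


before = bytes([0, 1, 2, 3, 4, 5, 6, 7])
after = bytes([0, 9, 9, 3, 4, 5, 8, 7])

print(changed_ranges(before, after, base=0x1000))
print(changed_ranges(before, after, base=0x1000, exclude=[(0x1006, 0x1007)]))
```

Excluding a range is handy for bytes that change on every run (counters, timestamps), so that only the interesting modifications remain.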

That’s all for this series.  There are a couple of extensions that I didn’t mention, but they are either very obvious or not generally useful enough to be worth mentioning.  The online help (!help) provides basic syntax and a very brief description about all extensions supported by SDbgExt, so you can find parameter information about all of the extensions I mentioned there.

Using SDbgExt to aid your debugging and reverse engineering efforts (part 1).

Friday, July 7th, 2006

One of the programs up on my homepage (does anyone call their website a “homepage” anymore?) is SDbgExt, a WinDbg-compatible extension DLL.  It’s a little collection of various debugging tools that I have put together over time that you might find useful.  There is some minimal documentation included, but not a whole lot to tell you where and when certain extensions are useful – hence, the topic of today’s blog post.

 This posting series assumes that you have already installed WinDbg, installed the Visual C++ 2005 Runtimes, and placed SDbgExt.dll into your Debugging tools for Windows\WinExt directory.  At present, SDbgExt can only be loaded into the 32-bit WinDbg package, although some extensions do support 64-bit targets (such as 64-bit kernel debugging targets).

To get started using SDbgExt, you’ll need to load it into your debugger.  For WinDbg, ntsd, cdb, and kd, you can use the “.load SDbgExt” command to do this.  If you then use the “!sdbgext.help” command, you should be presented with a list of the available extensions supported by SDbgExt.  Most of the extensions are targeted at live 32-bit processes on the local computer (such as the extensions dealing with displaying HWND information).  The documentation does not specify which targets are supported by each extension yet, so if you aren’t sure whether an extension is supported against your current debugging target, just try it; if not, it will harmlessly fail in an obvious manner.

Many of the SDbgExt extensions have very specific purposes, so I’ll try to address several of them individually and describe what they are best used for.  Additionally, most of the SDbgExt extensions can be broken down into several categories, such as symbol management, UI debugging, STL datastructures manipulation, kernel object support, and debuggee state manipulation.  I’ll be skipping some of the extensions that are either obvious or not generally useful, but I’ll try to cover most of the interesting ones.

To start off, I’ll talk about some of the debuggee state manipulation extensions.  These extensions allow you to control the state of a live 32-bit target process that is running on the same computer as the debugger (remote debugging and 64-bit targets are not supported yet by these extensions).  The extensions that fall under this category include:

  • !valloc, !vallocrwx, !heapalloc, !heapfree: Allocate memory within the address space of the target.
  • !remotecall, !remotecall64: Call a function in the target, using the currently active thread (symbols are not required, unlike “.call”).
  • !loaddll, !unloaddll: Load or unload a .dll within the address space of the target, using the currently active thread.
  • !close: Close a kernel object handle in the target’s handle space.
  • !killthread: Terminate a thread in the target process.
  • !adjpriv: Alter privileges of the target process or currently active thread.
  • !ret: Effect a virtual function return in the context of the currently active thread.

Many of these extensions are useful if you are doing some runtime patching of a target or are trying to see how the target will react to specific circumstances.  If you are trying to reverse engineer or modify the behavior of a target, in particular, you might find several of these extensions very useful.

The first group of extensions are used for managing memory allocation in the target’s address space – either by directly allocating pages within the target or allocating heap memory.  The latter requires that you resume execution of the target as internally the remote heap manipulation functions use the remote function call support to call the heap manager in the target process.  Perhaps the most common use for this family of functions is allocating some space on the fly when you are adding code to the debuggee (maybe you need to patch a function and add several instructions, in which case you could allocate memory with !vallocrwx, and then patch a jump instruction to refer to the newly allocated memory block).
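For the jump-patching case, the bytes to write are easy to work out by hand, but a small helper makes the arithmetic explicit.  The sketch below (Python, with made-up addresses; the function name is hypothetical) encodes a 5-byte x86 near jump: opcode E9 followed by a 32-bit little-endian displacement measured from the end of the jump instruction.

```python
import struct


def encode_jmp_rel32(source, target):
    """Encode a 5-byte x86 near JMP placed at `source` that lands on `target`.

    The displacement is relative to the address of the next instruction,
    i.e. source + 5, and wraps around at 32 bits.
    """
    displacement = (target - (source + 5)) & 0xFFFFFFFF
    return b"\xE9" + struct.pack("<I", displacement)


# Hypothetical addresses: the function being patched, and a block that
# was allocated in the target with something like !vallocrwx.
patch_site = 0x00401000
new_block = 0x00A50000

patch = encode_jmp_rel32(patch_site, new_block)
print(patch.hex())
```

The resulting bytes could then be written over the start of the original function with the debugger's usual memory editing commands; the new block would normally end with a similar jump back to the instruction that follows the patch.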

The next two groups of functions are used for directly calling functions in the target, from the context of the currently active thread in the debuggee.  Be warned that this is an invasive operation and may cause undesirable side effects in the context of the debuggee.  The main advantage of the !remotecall family of extensions over the built-in .call command is that you do not need to have private symbols for the target, which makes it particularly useful if you are reverse engineering something or need to call a Win32 API function in the context of the debuggee from the debugger.  These functions can allow you to do complicated things that are difficult or impossible to do remotely (from the debugger), such as calling SetHandleInformation to mark a handle closable or nonclosable from WinDbg. You can use !loaddll and !unloaddll as shortcuts (as opposed to using !remotecall on kernel32!LoadLibraryA/kernel32!FreeLibrary manually) for DLL management in the target process space.

!close is mostly analogous to the .closehandle built-in command, and is used to close a handle in the context of a debuggee.

!killthread is useful if you need to instantly terminate a thread in the debuggee process, for whatever reason.  You can manually achieve something like this with a command like “r eip=kernel32!ExitThread;g” for Win32 targets, but this extension provides a more elegant means of killing debuggee threads (for instance, in certain scenarios you might want to kill a thread that has crashed so that it doesn’t take down the rest of the process in the default SEH handler).

!adjpriv is useful if you are debugging problems related to the privileges that are enabled in a primary or impersonation token.  You can use the built-in !token extension to determine what privileges are currently present, enabled, or disabled in a token, and the !adjpriv extension to manipulate these privileges from the debugger itself.  This can also be used to work around buggy programs that don’t properly enable privileges before they try to do certain privileged operations (such as things written for Win9x).

!ret is primarily useful in conditional breakpoints if you want to return from the middle of a function at a breakpoint location based on a particular condition.  It alters the context of the currently active thread (modifying the stack pointer, instruction pointer, and optionally return address registers) according to its arguments.

The next group of extensions that I’d like to describe are the symbol management extensions.  These are extremely useful if you are reverse engineering a program and want to synchronize your work between a disassembler (such as IDA) and the debugger.  The two extensions that fall into this category are !loadsym and !unloadsym.

These two extensions allow you to either create or remove custom virtual symbols in the target.  A virtual symbol allows you to name an address (although it does not allow you to convey type information, unfortunately).  This can be extremely useful if you are debugging a third-party program that has no symbols, and you want to name certain addresses to make them easier to recognize.

Both extensions can operate on two different types of symbol files: a custom format that is specific to SDbgExt and allows you to specify all possible attributes that are supported by virtual symbols (primarily the size of the symbol, name, and its offset from a base module), and a standard linker .map file.  The latter is generally the most useful of the two formats, as there are many things that can write symbol information to a .map file which you can then load into SDbgExt and access through WinDbg.  For instance, IDA allows you to dump all names in a database (disassembly project) to a .map file, which you could then load using SDbgExt and have names in WinDbg that match the work you have done in IDA.  These commands can also be useful if the only symbols you have for a particular binary are the linker map files (which has happened to me once or twice, on rare occasions).

Both extensions require a 32-bit target, although the target may be a remote target and can be either a user mode or kernel mode target.  For kernel targets, the symbol loading support will apply to modules in the kernel mode loaded module list primarily.  Virtual symbols are automatically unloaded whenever you reload symbols (such as with the “.reload” command), so you may find yourself needing to re-apply the custom symbols more than once in a session.  Additionally, due to a bug / limitation in how DbgHelp and DbgEng manage virtual symbols, the process of creating virtual symbols unfortunately gets exponentially slower relative to how many virtual symbols are currently in existence.  As a result, creating more than a couple thousand virtual symbols may take a while.

The last group of extensions that I am going to cover in the first installment of this series is the kernel object support extension group.  These extensions are intended to complement the built-in support (such as !handle or !token) for querying kernel objects by allowing access to things that are otherwise not easily queryable from the debugger.  Although the information available from the built-in debugger support is usually sufficient, in special cases you may need additional information (such as security descriptor information).  Most of these extensions require a live 32-bit target on the local computer to operate correctly.

The !objname extension takes an object handle relative to the handle space of the debuggee and returns the full name for it.  This is similar to the built-in !handle extension, except that it works on all kernel object types (unlike !handle, which does not work on some object types, such as file object handles).

!tokeninfo will allow you to inspect some additional information about an access token object, beyond that which the built-in !token extension makes available to you (either given a token handle, or by operating on the primary or impersonation token that is effective for the currently active thread).  The most useful pieces of information available from this extension are the TokenId (uniquely identifying a token object throughout the system) and the ModifiedId (which increments whenever a token’s attributes are altered in any way).

The !objsec extension is useful for displaying detailed information about the security descriptor of an object given an object handle.  In kernel mode, you can use the !sd extension based on the security descriptor pointer embedded in a kernel object header, but this extension allows you to perform a similar function from user mode.  It has built-in knowledge about the object specific access rights supported by all of the kernel object types (as of Windows Server 2003 SP1) and will automatically translate access right values in access masks to more human readable values.

If you are dealing with a raw access mask directly (perhaps passed to a security related function as a parameter), and you want to know what it means given a particular object type, then you can use the !amask extension to have SDbgExt interpret the access mask bits as they apply to a particular object type.  The !objtypes extension lists the object type names that are supported by !amask.  If you do not supply an object type argument to !amask, it will only interpret generic and standard access rights.
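When no object type is given, the interpretation only involves the upper bits of the mask, whose meanings are the same for every object type.  As a rough illustration of that decoding (a Python sketch, not the extension's own code or output format), the generic and standard right values from the Windows headers can be matched against a mask like this; the function name and return format are assumptions.

```python
# Generic and standard access rights as defined in the Windows SDK headers.
GENERIC_AND_STANDARD_RIGHTS = [
    (0x80000000, "GENERIC_READ"),
    (0x40000000, "GENERIC_WRITE"),
    (0x20000000, "GENERIC_EXECUTE"),
    (0x10000000, "GENERIC_ALL"),
    (0x02000000, "MAXIMUM_ALLOWED"),
    (0x01000000, "ACCESS_SYSTEM_SECURITY"),
    (0x00100000, "SYNCHRONIZE"),
    (0x00080000, "WRITE_OWNER"),
    (0x00040000, "WRITE_DAC"),
    (0x00020000, "READ_CONTROL"),
    (0x00010000, "DELETE"),
]


def decode_access_mask(mask):
    """Split an access mask into named generic/standard rights and the
    remaining object-specific bits (the low 16 bits), which cannot be
    named without knowing the object type."""
    names = [name for bit, name in GENERIC_AND_STANDARD_RIGHTS if mask & bit]
    specific = mask & 0x0000FFFF
    return names, specific


names, specific = decode_access_mask(0x001F0003)
print(", ".join(names))
print("object-specific bits: 0x%04x" % specific)
```

Naming the low 16 bits is where the per-type tables come in, since the same bit means different things for a file, an event, or a token.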

The !sidname extension can be used to convert a SID into an account name.  Unlike most extensions, this extension does not operate on the debuggee at all; instead, it simply takes a string SID as a single argument and attempts to resolve it against the security database of the computer hosting the debugger.  This is a shortcut for command line utilities (like PsGetSid) that could do the same for you, since many of the access token related functions will give you back a raw string SID and will not translate it into a more understandable account name.

The !threadinfo extension will display some basic information about a thread running in the debuggee.  It will only work on local targets on the same computer as the debugger.  This extension allows you to view a couple of rarely-used fields that aren’t easily viewable from the debugger in user mode, like processor affinity or thread and exit times.

 That’s all for this post.  The next post in this series will cover the remaining major extensions in SDbgExt.

Fun with Logitech MX900 Bluetooth receivers

Thursday, July 6th, 2006

For some time now, I have been partial to cordless mice; they’re much less of a hassle to use than “conventional” mice, especially if you use a laptop primarily.  Several months ago, I decided to upgrade from a Logitech MX700 cordless optical mouse to an MX900 Bluetooth optical mouse, so that with my new Bluetooth-enabled laptop, I would not need to bring the bulky charger/base station to plug into my computer at work every day.

As it would happen, the MX900 base station has a Bluetooth receiver that you can use (in conjunction with the WIDCOMM – now Broadcom – Bluetooth stack) to connect to other Bluetooth devices out there.  At the time when I first got the mouse, I didn’t really see this as all that useful, as my laptop already had an integrated Bluetooth receiver that was supported by the Microsoft Bluetooth stack included with Windows XP SP2.  Recently, however, I got a second Bluetooth enabled device – a new cell phone – and decided that I might as well see what I could do with getting one of my other computers at my apartment talking to it.

 Now, a bit of background about the MX900 base station.  It’s actually a pretty neat thing – during boot (and even in normal operating system use, if you don’t have the right software installed), the MX900 will act as if it were a standard HID USB mouse even though it is actually connected through Bluetooth – “HID emulation mode”, as I call it.  This is a cool feature because it allows you to use your standard USB mouse drivers with the MX900 without having to go install all of Logitech’s drivers and the like before the mouse will work.  Additionally, if your BIOS supports USB input devices (most modern ones do), you can use the MX900 there even though it functions over Bluetooth.

As a result of the handy HID emulation mode feature of the MX900, I can already use it as a mouse on my other, non-Bluetooth computers as if it were a plain USB mouse, with the operating system none the wiser.  Therein is the rub, however; in order for me to be able to connect the MX900 base station to non-keyboard/mouse devices, I need to be able to convince Windows that it is actually a full-fledged Bluetooth receiver and not just a USB mouse.  Normally, Logitech’s SetPoint software installs a program that runs when you log in to Windows and magically turns the MX900 base station into Bluetooth HCI mode, that is, native Bluetooth receiver mode – assuming you had installed the WIDCOMM Bluetooth stack, that is.

So, I set out to install SetPoint on my test computer.  Unfortunately, this didn’t really work out as planned.  The computer I had available to work with was running Windows Server 2003 and it seems that the SetPoint installer for the version I needed wasn’t exactly well tested on Windows Server 2003.  The installer would tend to blow up with heap corruption right away, making it impossible to do anything.  I then tried running the installer under the Windows XP SP2 compatibility layer (right click the .exe, there is a compatibility option in the propsheet if you are an administrator).  This got me a bit further, but the Logitech installer inevitably crashed.

Looking around a bit, there was actually a more recent version of SetPoint available (Logitech supports 2.22 with the MX900, the latest being 2.60 which is designed for Logitech’s Bluetooth keyboard and mouse suite).  I figured that it was worth a try to install 2.60 and see if that worked.  Sure enough, the installer actually didn’t crash this time, but unfortunately, it would not accept that I had a Bluetooth device that was compatible with it; I got stuck at a dialog that instructed me to connect my Logitech Bluetooth device and hit OK, or skip the installation of the Bluetooth support and install “just plain” SetPoint.  Well, that sucks – the whole point of this exercise was to get Bluetooth working on the test computer, not Logitech’s middleware.

Poking around in my temp directory, I noticed that while the installer was running, one of the temporary directories it created seemed to have a second installer for the WIDCOMM Bluetooth stack (WIDCOMM – now Broadcom - does not make their software directly available for download to end users, and instead requires them to get it bundled with hardware from an equipment manufacturer).  A-ha – maybe there was light at the end of the tunnel, after all.  While the Logitech installer was waiting for me to hit Next in one of the wizard steps, I manually launched the WIDCOMM installer from the temp directory that the Logitech installer had created.  The installer actually worked fine, except that it too complained that it could not detect an active Bluetooth device (fortunately, though, it allowed me the option of continuing the install anyway).

After the WIDCOMM installer finished, I canceled out of the Logitech install and went to see if I could convince the WIDCOMM stack that I really did have a Bluetooth device.  After not getting anywhere on my own, I turned to Google, where I found a number of people complaining about the same problem (about not being able to turn their MX900 receivers to native HCI mode), but no quick solution for Windows.  I did, however, find something for Linux – a program called “hid2hci” that knew how to turn an MX900 Bluetooth receiver to HCI mode.  Fortunately, source code was included, so it was easy enough to see what it was doing.  Unfortunately, I don’t really have a whole lot of experience with USB, on Windows or other platforms, and what I needed to do was port hid2hci to Windows.

The Linux program is fairly simple.  Even with my limited knowledge of USB, what it was doing appeared to be straightforward enough.  The program sends three vendor-specific HID output reports (a HID report is the basic way to either report information from a device to the computer or change a setting on the device for HID devices) to the MX900 receiver.  After receiving the three special HID reports, the MX900 changes its PnP ID and appears to the operating system as a different piece of hardware, an HCI Bluetooth receiver.

 So, I got started working on a Windows version of hid2hci.  The first step was to crack open the Windows DDK documentation (you can download the DDK with the free KMDF 1.1 ISO distribution) and start looking around for ways to talk to USB devices.  It turns out that there is already a rather full featured API to do this, from both user mode and kernel mode.  Because all I really needed to do here was to send three custom commands to the MX900, I picked the user mode HID API to start with.

The user mode HID APIs live in hid.dll and come in two flavors: HID Parser routines (prefixed HidP_), and HID Device/Driver routines (prefixed HidD_).  The former provide a high level interface for formatting, preparing, and parsing the actual USB HID reports, while the latter deal with actually sending and receiving USB HID reports.  The API is a bit cumbersome, but it turns out to not be too hard to use.  The basic idea is:

  1. Open a handle to the HID device that you want to talk to with CreateFile.
  2. Call HidD_GetPreparsedData to load the preparsed data for the collection.  This basically retrieves all sorts of information about the HID device ahead of time in one blob that is later used by the HID Parser routines to ensure that you are matching the report formats used by the device.
  3. Call HidD_GetAttributes and HidP_GetCaps to make sure that the device is the one you meant to connect to and supports the HID usages that you are going to use.  Here, I wanted to verify that the vendor specific usage page 0xFF00, usage 0x0001 is present (as this is where I wanted to send the magic 3 reports to turn the receiver to HCI mode).
  4. Build the HID report.  I originally attempted to do this using the high level HID Parser APIs, but I couldn’t get it to work right – the HID Parser APIs kept complaining that the usage I requested didn’t exist.  I assume that this is because Logitech never bothered to completely describe the format of the HID reports for this vendor specific usage, resulting in the high level parser becoming unhappy if you asked it to format a report for that usage.  As a result, I just built the report manually by just writing the raw data into the report buffer and prepending the HID report ID (0x10) to the report data.
  5. Send the completed report to the device.  There are two ways to do this – WriteFile and HidD_SetOutputReport.  I attempted to use WriteFile first, but it always failed with ERROR_ACCESS_DENIED.  Not being an expert on HID devices, I tried the other documented routine (HidD_SetOutputReport) to send the report, which worked fine.  HidD_SetOutputReport internally just sends a special IOCTL to the driver for the device you open, so the code paths are in fact different.

Steps 4 and 5 will basically need to be repeated for each of the three HID reports that we need to send to the Bluetooth receiver.
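The manual report construction in step 4 and the repetition of steps 4 and 5 can be modeled with a short Python sketch.  This is not the code from my program: the payload bytes are placeholders rather than the real magic values (those come from the hid2hci source), the report length of 7 is made up (on Windows the real length would come from the OutputReportByteLength field of HIDP_CAPS), and the send callback merely stands in for a call such as HidD_SetOutputReport.

```python
REPORT_ID = 0x10  # HID report ID that is prepended to the raw report data


def build_output_report(payload, report_length):
    """Build a raw output report: report ID first, then the payload,
    zero-padded up to the length that the device expects."""
    if len(payload) + 1 > report_length:
        raise ValueError("payload does not fit in the report")
    report = bytes([REPORT_ID]) + bytes(payload)
    return report.ljust(report_length, b"\x00")


def send_all(payloads, report_length, send):
    """Build and send each report in order; stop at the first failure.

    `send` is a stand-in for the real transport and must return a
    true value on success."""
    for payload in payloads:
        if not send(build_output_report(payload, report_length)):
            return False
    return True


# Placeholder payloads only; these are NOT the real MX900 commands.
PLACEHOLDER_PAYLOADS = [bytes([0x01, 0x02]), bytes([0x03, 0x04]), bytes([0x05, 0x06])]

sent = []
ok = send_all(PLACEHOLDER_PAYLOADS, 7, lambda report: sent.append(report) or True)
print(ok, [report.hex() for report in sent])
```

Stopping at the first failed report seems like the safer choice here, since a partially delivered sequence is unlikely to leave the receiver in a useful state.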

 There are a couple of other things that you need to do in order to get this to work that I glossed over.  In particular, you need to actually find the device that you want to open with CreateFile.  The best way to do this is by using the SetupDi family of APIs to enumerate all HID devices.  We can then verify that each device has the expected vendor ID, product ID, and HID usages before we try to send it the magical commands to convert the device to native HCI mode.

After putting all of these steps together, I had something that appeared to do what the Linux hid2hci program did.  Sure enough, when I ran my prototype hid2hci port on my test box, a new device appeared in Device Manager, which was detected by the WIDCOMM Bluetooth stack as a Cambridge Silicon Radio Bluetooth Receiver.  Success!

The device itself stays in native HCI mode until it is reset (i.e. rebooting the computer or unplugging the receiver itself), so the HCI conversion program needs to either periodically scan for devices to switch to HCI mode, or register for a device change notification in order to enable full Bluetooth functionality if you reboot or disconnect the Bluetooth receiver.

The source code for my Logitech HID-to-HCI converter program is available for download if you are interested in it.  You will need the Windows DDK installed in order to build it.  Alternatively, you can download the binary if you just want the program and don’t want to install the development environment to build it.  It takes two command line arguments: the hexadecimal vendor ID and hexadecimal product ID of the device that it should switch from HID emulation mode to native HCI mode.  You can find these under Device Manager if you are using Windows XP SP2 or Windows Server 2003 SP1 by going to your device’s property sheet and going to the details tab, then selecting the Hardware Ids listbox item.  The device you want is probably going to be named “USB Composite Device”.  If you are using an MX900, then you can use 046d for the vendor ID and c705 for the product ID.  There is no harm in running the program repeatedly after it has already switched your device(s) to HCI mode.
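If you want to double check that the IDs you are about to pass match what Device Manager shows, the relationship between the two is simple.  The following Python sketch (the helper names are hypothetical, and this is not part of the converter program) parses the two hexadecimal arguments and compares them against a hardware ID string of the usual USB\VID_xxxx&PID_xxxx form.

```python
def parse_hex_id(text):
    """Parse a 16-bit hexadecimal ID such as "046d" or "c705"."""
    value = int(text, 16)
    if not 0 <= value <= 0xFFFF:
        raise ValueError("ID must fit in 16 bits: %r" % text)
    return value


def hardware_id_prefix(vendor_id, product_id):
    """Format the leading part of a USB hardware ID string."""
    return "USB\\VID_%04X&PID_%04X" % (vendor_id, product_id)


def matches(hardware_id, vendor_id, product_id):
    """Case-insensitive check of a hardware ID against the given IDs."""
    return hardware_id.upper().startswith(hardware_id_prefix(vendor_id, product_id))


vendor = parse_hex_id("046d")
product = parse_hex_id("c705")
print(hardware_id_prefix(vendor, product))
# The revision suffix below is invented for the example.
print(matches("USB\\Vid_046d&Pid_c705&Rev_1234", vendor, product))
```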

VMware Server and RDP don’t always play nicely together.

Wednesday, July 5th, 2006

Steve already stole my thunder (well, if that makes sense, since it was my paper anyway) by posting my analysis of this earlier, but I figure that it is also worth discussing here.

Recently, I finally* got a new development box at work – multiproc, x64 capable (with the ability to run 64-bit VMs too!), lots of RAM, generally everything you would want in a really nice development box.  Needless to say, I was rather excited to see what I could do with it.  The first thing I had in mind was setting up a dedicated set of VMs to run my test network on and host various dedicated services such as our symbol server here at the office.

 (*: There is a long, sad story behind this.  For a long time, I had been running a VM on an ancient ~233MHz box that nobody else at the office wanted (for obvious reasons!).  I had been trying to get a replacement box that didn’t suck so much to put this VM (and others) on to run full time, but just about everything that could possibly go wrong with requesting a purchase from work did go wrong, resulting in it being delayed by over half a year…).

 The box came with Windows XP Professional x64 Edition installed, so I figured that I might as well use the install instead of blowing it away and putting Windows Server 2003 on for now.  As it turned out, this came back to bite me later.  After installing all of the usual things (service packs, hotfixes, and so forth), I went to grab the latest VMware Server installer so that I could put the box to work running my library of VMs.  Everything seemed to be going okay at the start, until I began to do things that were a bit outside the box, so to speak.  Here, I wanted to have my XP x64 box route through a VM running on the same computer.  Why on earth would I possibly want to do that, you ask?  Well, I have an internal VPN-based network that overlays the office network here at work and connects all of the VMs I have running on various boxes at the office.  I wanted to be able to interconnect all of those VMs with various services (in particular, lots and lots of storage space) running on the beefy x64 box over this trusted VPN network instead of the public office network (which I have for testing purposes designated the untrusted Internet network).  If I have the x64 box routing through something that is connected to the entire overlay network, then I don’t need to worry about creating connections to every single other VM in existence to grant access to those resources.  (At this point, our x64 support is still in beta, and XP doesn’t have a whole lot of useful support for dedicated VPN links.)

 Anyways, things start to get weird when I finally get this setup going.  The first problem I run into is that sometimes on boot, all of the VMs that I had configured to autostart would appear to hang on startup – I would have to go to Task Manager and kill the vmware-vmx.exe processes, then restart the vmserverdWin32 service before I could get them to come up properly.  After a bit of poking around, I noticed a suspicious Application Eventlog entry that seemed to correlate with when this problem happened on a boot:

Event Type: Information
Event Source: VMware Registration Service
Event Category: None
Event ID: 1000
Date:  6/13/2006
Time:  2:10:06 PM
User:  N/A
Computer: MOGHEDIEN
Description:
vmserverdWin32 took too much time to initialize.

 Hmm… that doesn’t look good.  Well, digging a bit deeper, it turns out that VMware Server has several different service components, and apparently there are dependencies between them.  However, the VMware Server developers neglected to properly assign dependencies between all of the services; instead, they appear to have just let the services start in whatever order and have a timeout window in which the services are supposed to establish communication with each other.  Unfortunately, this tends to randomly break on some configurations (like mine, apparently).

 Fortunately, the fix for this problem turned out to be fairly easy.  Using sc.exe, the command line service configuration app (which used to ship with the SDK, but now ships with Windows XP and later – a handy tool to remember), I added an SCM dependency between the main VMware launcher service (“vmserverdWin32”) and the VMware authorization service (“VMAuthdService”):

C:\Documents and Settings\Administrator>sc config vmserverdWin32 depend= RPCSS/VMAuthdService
[SC] ChangeServiceConfig SUCCESS

After fixing the service dependencies, everything seemed to be okay, but of course, that wasn’t really the case…
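To see why an explicit dependency removes the startup race, here is a minimal model of dependency-ordered startup. It is a sketch only: it uses Python's `graphlib` (3.9 or later) as a stand-in and does not claim to reflect how the SCM is implemented. The only inputs taken from the command above are the service names and the slash-separated `depend=` value; `parse_depend` is a made-up helper.

```python
from graphlib import TopologicalSorter

def parse_depend(value: str):
    """Split an sc.exe style 'depend=' value into individual service names."""
    return [name for name in value.split("/") if name]

# Maps each service to the services that must be started before it.
dependencies = {
    "vmserverdWin32": parse_depend("RPCSS/VMAuthdService"),
}

# static_order() yields every service only after all of its prerequisites.
start_order = list(TopologicalSorter(dependencies).static_order())
```

With the dependency declared, the launcher service can no longer be started ahead of the authorization service, so it does not have to rely on the timeout window at all.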

 When I went home later that day, I decided to VPN into the office and RDP into my new development box in order to change some hardware settings on one of my VMs.  In this particular case, some of the VPN traffic from my apartment to the development box at the office happened to pass through that router VM which I had running on the development box.  Whenever I tried to RDP into the development box, the connection would hang right after I entered valid logon credentials at the winlogon prompt, and it would stay hung until TCP gave up and broke off the connection.  This happened every single time I tried to RDP into my new box, but the office connection was fine (I could still connect to other things at the office while this was happening).  Definitely not cool.  So, I opened a session on our development server at the office and decided to try an experiment – ping my new dev box from it while I try to RDP in.  The initial results of this experiment were not at all what I expected; my dev box kept responding to pings the whole time, even though it was apparently unreachable over RDP and the TCP connection was timing out.  The next time I tried RDPing in, I ran a ping from my home box to my dev box, and the pings were dropped while I was trying to make the RDP session connection to the console session after providing valid logon credentials, and yet the box still responded to pings from a different box at the office.

After poking around a bit more, I determined that every single VM on my brand new dev box would just freeze and stop responding whenever I tried to RDP into my dev box from home (but not from the office).  To make matters even more strange, I could connect to a different box at the office, and bounce RDP through that box to my new dev server and things would work fine.  Well, that sucks – what’s going on here?  A partial explanation stems from how exactly I had set up the routing on my new dev box; the default gateway was set to my router VM (running on that box) using one of the VMnet virtual NICs, but I had left the physical NIC on the box still bound to TCP (without a default gateway set however).  So, for traffic destined to the office subnet, there is no need for packets to traverse the router VM – but for traffic from the VPN connection to my home, packets are routed through the router VM.
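The routing behaviour described above comes down to ordinary most-specific-route selection. Here is a minimal sketch of it; the subnets and addresses are invented for illustration (the real office and VPN address ranges are not given here), and the route labels are just descriptive strings.

```python
import ipaddress

# Hypothetical routing table for the dev box: (destination network, where packets go).
ROUTES = [
    (ipaddress.ip_network("192.168.1.0/24"), "physical NIC (on-link)"),   # assumed office subnet
    (ipaddress.ip_network("0.0.0.0/0"), "router VM via VMnet NIC"),       # default gateway
]

def next_hop(destination: str) -> str:
    """Pick the matching route with the longest prefix, as an IP stack would."""
    addr = ipaddress.ip_address(destination)
    matches = [(net, hop) for net, hop in ROUTES if addr in net]
    _, hop = max(matches, key=lambda item: item[0].prefixlen)
    return hop

office_peer = next_hop("192.168.1.50")  # another box at the office (assumed address)
home_peer = next_hop("10.8.0.5")        # my home box over the VPN (assumed address)
```

Replies to an office machine leave directly through the physical NIC, while replies to the home machine have to pass through the router VM, which is why only the connection from home depended on that VM staying responsive.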

 Given this information, it seemed that I had at least found why the problem was happening, on some level – whenever I tried to RDP into my new dev box over the VPN, all of the VMs on my new dev box would freeze.  Because traffic through the VPN to my new dev box is routed through a VM on the new dev box, the RDP connection stalls and times out (because the router VM has hung).

 At this point, I had to turn to a debugger to understand what was going on.  Popping the vmware-vmx.exe process corresponding to the router VM open in the debugger and comparing call stacks between when it was running normally and when it was frozen while I was trying to RDP in pointed to the thread that emulated the virtual CPU becoming blocked on an internal NtUser call to win32k.sys.  At this point, I couldn’t really do a whole lot more without resorting to a kernel debugger, making that my next step.

 With the help of kd, I was able to track down the problem a bit further; the vmware CPU simulator thread was blocking on acquiring the global win32k NtUser lock that almost all NtUser calls acquire at the start of their implementation.  With the “!locks” command, I was able to track down the owner of the lock – which happened to be (surprise!) a Terminal Server thread in CSRSS for the console session.  This thread was waiting on a kernel event, which turns out to be signalled when the RDP TCP transport driver receives data from the network.  So, we have a classic deadlock situation; the router VM is blocking on win32k’s internal NtUser global lock, and there is a CSRSS thread that is holding the win32k internal NtUser global lock while waiting on network I/O (from the RDP client).  Because the RDP client (me at home connecting through the VPN) needs to route traffic through the router VM to reach the RDP TCP transport on my new dev box, everything appears to freeze until the TCP connection times out.
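The shape of this deadlock can be reproduced in a few lines. The following is a toy model only, with ordinary Python threads standing in for the CSRSS thread and the VM's virtual CPU thread, a plain lock standing in for the global NtUser lock, and a short timeout standing in for the TCP timeout; none of the names correspond to real Windows or VMware APIs.

```python
import threading

def simulate(route_through_vm: bool, tcp_timeout: float = 0.2) -> bool:
    """Return True if the RDP data arrived before the (simulated) TCP timeout."""
    user_lock = threading.Lock()      # stands in for the global win32k NtUser lock
    data_arrived = threading.Event()  # stands in for the event the RDP transport signals
    lock_held = threading.Event()     # test scaffolding: fixes the order of events
    result = {}

    def csrss_thread():
        # Holds the lock for the whole time it waits on network I/O.
        with user_lock:
            lock_held.set()
            result["ok"] = data_arrived.wait(timeout=tcp_timeout)

    def router_vm_cpu_thread():
        # The virtual CPU thread makes a call that needs the same lock,
        # so it cannot forward the client's packet until the lock is free.
        lock_held.wait()
        with user_lock:
            pass
        data_arrived.set()

    def direct_network_path():
        # Traffic that does not pass through the VM arrives regardless of the lock.
        lock_held.wait()
        data_arrived.set()

    sender = router_vm_cpu_thread if route_through_vm else direct_network_path
    threads = [threading.Thread(target=csrss_thread), threading.Thread(target=sender)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return result["ok"]

from_home = simulate(route_through_vm=True)     # client traffic goes through the router VM
from_office = simulate(route_through_vm=False)  # client traffic arrives directly
```

In the routed case the lock holder waits for data that can only be delivered by a thread waiting for that same lock, so nothing moves until the timeout fires; in the direct case the data arrives and the lock is released promptly.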

 Unfortunately, there isn’t really a very good solution to this problem.  Installing Windows Server 2003 would have helped, in my case, because then VMware Server and its services would be running on session 0, and RDP connections would be diverted to new Terminal Server sessions (with their own per-session-instanced win32k NtUser locks), thus avoiding the deadlock (unless you happened to connect to Terminal Server using the “/console” option).

 So there you have it – why VMware Server and RDP can make a bad mix sometimes.  This is a real shame, too, because RDPing into a box and running the VMware Server console client “locally” is sooo superior to running the VMware Server console client over the network (updates *much* faster, even over a LAN).

 If you’re interested, I did a writeup of most of the technical details of the actual debugging (with WinDbg and kd) of this problem that you can look at here – you are encouraged to do so if you want to see some of the steps I took in the debugger to further analyze the problem.

 In the future, I’ll try not to gloss over some of the debugger steps so much in blog posts; this time, I had already written the writeup beforehand, and didn’t want to just reformat the whole thing for an entire blog post.

 Whew, that was a long second post – hopefully, future ones won’t be quite so long-winded (if you consider that a bad thing).  Hopefully, future posts won’t be written at 1am just before I go to sleep, too…