Using SDbgExt to aid your debugging and reverse engineering efforts (part 1).

July 7th, 2006

One of the programs up on my homepage (does anyone call their website a “homepage” anymore?) is SDbgExt, a WinDbg-compatible extension DLL.  It’s a small collection of debugging tools that I have put together over time and that you might find useful.  There is some minimal documentation included, but not a whole lot to tell you where and when certain extensions are useful – hence the topic of today’s blog post.

This posting series assumes that you have already installed WinDbg, installed the Visual C++ 2005 Runtimes, and placed SDbgExt.dll into your Debugging Tools for Windows\WinExt directory.  At present, SDbgExt can only be loaded into the 32-bit WinDbg package, although some extensions do support 64-bit targets (such as 64-bit kernel debugging targets).

To get started using SDbgExt, you’ll need to load it into your debugger.  For WinDbg, ntsd, cdb, and kd, you can use the “.load SDbgExt” command to do this.  If you then use the “!sdbgext.help” command, you should be presented with a list of the available extensions supported by SDbgExt.  Most of the extensions are targeted at live 32-bit processes on the local computer (such as the extensions dealing with displaying HWND information).  The documentation does not yet specify which targets are supported by each extension, so if you aren’t sure whether an extension is supported against your current debugging target, just try it; if it isn’t, it will fail harmlessly and in an obvious manner.
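For instance, a first session with the extension looks something like this (the help listing shown here is paraphrased and will vary by version):

```text
0:000> .load SDbgExt
0:000> !sdbgext.help
SDbgExt extensions:
  !valloc, !vallocrwx, !heapalloc, !heapfree, !remotecall, ...
```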

Many of the SDbgExt extensions have very specific purposes, so I’ll try to address several of them individually and describe what each is best used for.  Additionally, most of the SDbgExt extensions can be broken down into several categories, such as symbol management, UI debugging, STL data structure manipulation, kernel object support, and debuggee state manipulation.  I’ll be skipping some of the extensions that are either obvious or not generally useful, but I’ll try to cover most of the interesting ones.

To start off, I’ll talk about some of the debuggee state manipulation extensions.  These extensions allow you to control the state of a live 32-bit target process that is running on the same computer as the debugger (remote debugging and 64-bit targets are not yet supported by these extensions).  The extensions that fall under this category include:

  • !valloc, !vallocrwx, !heapalloc, !heapfree: Allocate memory within the address space of the target.
  • !remotecall, !remotecall64: Call a function in the target, using the currently active thread (symbols are not required, unlike “.call”).
  • !loaddll, !unloaddll: Load or unload a .dll within the address space of the target, using the currently active thread.
  • !close: Close a kernel object handle in the target’s handle space.
  • !killthread: Terminate a thread in the target process.
  • !adjpriv: Alter privileges of the target process or currently active thread.
  • !ret: Simulate a function return in the context of the currently active thread.

Many of these extensions are useful if you are doing some runtime patching of a target or are trying to see how the target will react to specific circumstances.  If you are trying to reverse engineer or modify the behavior of a target, in particular, you might find several of these extensions very useful.

The first group of extensions is used for managing memory in the target’s address space – either by directly allocating pages within the target or by allocating heap memory.  The latter requires that you resume execution of the target, as internally the remote heap manipulation functions use the remote function call support to call the heap manager in the target process.  Perhaps the most common use for this family of functions is allocating scratch space when you are patching code into the debuggee on the fly (say you need to patch a function and add several instructions; you could allocate memory with !vallocrwx, then patch in a jump instruction that refers to the newly allocated memory block).
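To make that concrete, a patching session might look roughly like the following.  I’m assuming here that !vallocrwx takes a size argument and prints the address of the new block (check the included documentation for the exact syntax); the addresses and module names are, of course, made up:

```text
0:000> !vallocrwx 1000                 ; reserve a 0x1000-byte RWX block
Allocated block at 00980000            ; (hypothetical output)
0:000> a 00980000                      ; assemble the detour code there
...
0:000> a myapp!SomeFunc                ; then redirect the original function
jmp 00980000
```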

The next two groups of functions are used for directly calling functions in the target, from the context of the currently active thread in the debuggee.  Be warned that this is an invasive operation and may cause undesirable side effects in the context of the debuggee.  The main advantage of the !remotecall family of extensions over the built-in .call command is that you do not need to have private symbols for the target, which makes it particularly useful if you are reverse engineering something or need to call a Win32 API function in the context of the debuggee from the debugger.  These functions can allow you to do complicated things that are difficult or impossible to do remotely (from the debugger), such as calling SetHandleInformation to mark a handle closable or nonclosable from WinDbg. You can use !loaddll and !unloaddll as shortcuts (as opposed to using !remotecall on kernel32!LoadLibraryA/kernel32!FreeLibrary manually) for DLL management in the target process space.
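As an illustration, marking handle 0x54 in the debuggee as protected from close might look something like the following.  The argument layout shown is hypothetical – consult the included documentation for !remotecall’s actual syntax – but the idea is that you pass the function and its parameters, with no private symbols required beyond resolving the export:

```text
0:000> !remotecall kernel32!SetHandleInformation 54 2 2
                        ; SetHandleInformation(0x54,
                        ;   HANDLE_FLAG_PROTECT_FROM_CLOSE,
                        ;   HANDLE_FLAG_PROTECT_FROM_CLOSE)
```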

!close is mostly analogous to the built-in .closehandle command, and is used to close a handle in the context of a debuggee.

!killthread is useful if you need to instantly terminate a thread in the debuggee process, for whatever reason.  You can manually achieve something like this with a command like “r eip=kernel32!ExitThread;g” for Win32 targets, but this extension provides a more elegant means of killing debuggee threads (for instance, in certain scenarios you might want to kill a thread that has crashed so that it doesn’t take down the rest of the process in the default SEH handler).
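Spelled out, the manual version of this trick looks like the following (switch to the victim thread first so the register write lands in the right context; note that ExitThread will pick up a garbage exit code from the stack, which is part of why this is inelegant):

```text
0:000> ~2s                          ; make thread 2 the active thread
0:002> r eip=kernel32!ExitThread    ; point it at thread exit
0:002> g                            ; resume; thread 2 exits
```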

!adjpriv is useful if you are debugging problems related to the privileges that are enabled in a primary or impersonation token.  You can use the built-in !token extension to determine what privileges are currently present, enabled, or disabled in a token, and the !adjpriv extension to manipulate these privileges from the debugger itself.  This can also be used to work around buggy programs that don’t properly enable privileges before they try to do certain privileged operations (such as things written for Win9x).

!ret is primarily useful in conditional breakpoints if you want to return from the middle of a function at a breakpoint location based on a particular condition.  It alters the context of the currently active thread (modifying the stack pointer, instruction pointer, and optionally return address registers) according to its arguments.

The next group of extensions that I’d like to describe are the symbol management extensions.  These are extremely useful if you are reverse engineering a program and want to synchronize your work between a disassembler (such as IDA) and the debugger.  The two extensions that fall into this category are !loadsym and !unloadsym.

These two extensions allow you to either create or remove custom virtual symbols in the target.  A virtual symbol allows you to name an address (although it does not allow you to convey type information, unfortunately).  This can be extremely useful if you are debugging a third-party program that has no symbols, and you want to name certain addresses to make them easier to recognize.

Both extensions can operate on two different types of symbol files: a custom format that is specific to SDbgExt and allows you to specify all possible attributes that are supported by virtual symbols (primarily the size of the symbol, its name, and its offset from a base module), and a standard linker .map file.  The latter is generally the more useful of the two formats, as there are many tools that can write symbol information to a .map file, which you can then load into SDbgExt and access through WinDbg.  For instance, IDA allows you to dump all names in a database (disassembly project) to a .map file, which you could then load using SDbgExt so that the names in WinDbg match the work you have done in IDA.  These commands can also be useful if the only symbols you have for a particular binary are linker map files (which has happened to me on rare occasions).
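For example, after using IDA’s Produce file > Create MAP file option, loading the result might look roughly like this (a hypothetical invocation – the exact argument order is in the bundled documentation; the module and symbol names are made up):

```text
0:000> !loadsym myapp.map myapp
0:000> u myapp!DecodePacket        ; names from the IDA database now resolve
```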

Both extensions require a 32-bit target, although the target may be a remote target and can be either a user mode or kernel mode target.  For kernel targets, the symbol loading support applies primarily to modules in the kernel-mode loaded module list.  Virtual symbols are automatically unloaded whenever you reload symbols (such as with the “.reload” command), so you may find yourself needing to re-apply the custom symbols more than once in a session.  Additionally, due to a bug / limitation in how DbgHelp and DbgEng manage virtual symbols, creating a virtual symbol unfortunately gets slower and slower as the number of virtual symbols already in existence grows.  As a result, creating more than a couple thousand virtual symbols may take a while.

The last group of extensions that I am going to cover in the first installment of this series is the kernel object support extension group.  These extensions are intended to complement the built-in support (such as !handle or !token) for querying kernel objects by allowing access to things that are otherwise not easily queryable from the debugger.  Although the information available from the built-in debugger support is usually sufficient, in special cases you may need additional information (such as security descriptor information).  Most of these extensions require a live 32-bit target on the local computer to operate correctly.

The !objname extension takes an object handle relative to the handle space of the debuggee and returns the full name for it.  This is similar to the built-in !handle extension, except that it works on all kernel object types (unlike !handle, which does not work on some object types, such as file object handles).

!tokeninfo will allow you to inspect some additional information about an access token object, beyond that which the built-in !token extension makes available to you (either given a token handle, or by operating on the primary or impersonation token that is effective for the currently active thread).  The most useful pieces of information available from this extension are the TokenId (uniquely identifying a token object throughout the system) and the ModifiedId (which increments whenever a token’s attributes are altered in any way).

The !objsec extension is useful for displaying detailed information about the security descriptor of an object given an object handle.  In kernel mode, you can use the !sd extension based on the security descriptor pointer embedded in a kernel object header, but this extension allows you to perform a similar function from user mode.  It has built-in knowledge about the object-specific access rights supported by all of the kernel object types (as of Windows Server 2003 SP1) and will automatically translate access right values in access masks into more human-readable values.

If you are dealing with a raw access mask directly (perhaps passed to a security related function as a parameter), and you want to know what it means given a particular object type, then you can use the !amask extension to have SDbgExt interpret the access mask bits as they apply to a particular object type.  The !objtypes extension lists the object type names that are supported by !amask.  If you do not supply an object type argument to !amask, it will only interpret generic and standard access rights.

The !sidname extension can be used to convert a SID into an account name.  Unlike most extensions, this extension does not operate on the debuggee at all; instead, it simply takes a string SID as a single argument and attempts to resolve it against the security database of the computer hosting the debugger.  This is a shortcut for the command-line utilities (like PsGetSid) that could do the same for you, and it is handy because many of the access-token-related functions will give you back a raw string SID without translating it into a more understandable account name.

The !threadinfo extension will display some basic information about a thread running in the debuggee.  It will only work on local targets on the same computer as the debugger.  This extension allows you to view a couple of rarely-used fields that aren’t easily viewable from the debugger in user mode, like processor affinity or thread creation and exit times.

That’s all for this post.  The next post in this series will cover the remaining major extensions in SDbgExt.

I made MVP this year

July 6th, 2006

It’s official – I’m a Microsoft SDK MVP!

This is a pretty cool experience, and I’m really looking forward to expanding my knowledge with the resources that being an MVP makes available to you.

Besides the blog, you can often find me on the Microsoft public newsgroups (e.g. microsoft.public.development.device.drivers) and the OSR mailing lists (e.g. NTDEV), where I’m a frequent poster.

Fun with Logitech MX900 Bluetooth receivers

July 6th, 2006

For some time now, I have been partial to cordless mice; they’re much less of a hassle to use than “conventional” mice, especially if you primarily use a laptop.  Several months ago, I decided to upgrade from a Logitech MX700 cordless optical mouse to an MX900 Bluetooth optical mouse so that, with my new Bluetooth-enabled laptop, I would not need to bring the bulky charger/base station to plug into my computer at work every day.

As it would happen, the MX900 base station has a Bluetooth receiver that you can use (in conjunction with the WIDCOMM – now Broadcom – Bluetooth stack) to connect to other Bluetooth devices out there.  At the time when I first got the mouse, I didn’t really see this as all that useful, as my laptop already had an integrated Bluetooth receiver that was supported by the Microsoft Bluetooth stack included with Windows XP SP2.  Recently, however, I got a second Bluetooth-enabled device – a new cell phone – and decided that I might as well see about getting one of the other computers at my apartment talking to it.

 Now, a bit of background about the MX900 base station.  It’s actually a pretty neat thing – during boot (and even in normal operating system use, if you don’t have the right software installed), the MX900 will act as if it were a standard HID USB mouse even though it is actually connected through Bluetooth – “HID emulation mode”, as I call it.  This is a cool feature because it allows you to use your standard USB mouse drivers with the MX900 without having to go install all of Logitech’s drivers and the like before the mouse will work.  Additionally, if your BIOS supports USB input devices (most modern ones do), you can use the MX900 there even though it functions over Bluetooth.

As a result of the handy HID emulation mode feature of the MX900, I can already use it as a mouse on my other, non-Bluetooth computers as if it were a plain USB mouse, with the operating system none the wiser.  Therein lies the rub, however; in order for me to be able to connect the MX900 base station to non-keyboard/mouse devices, I need to be able to convince Windows that it is actually a full-fledged Bluetooth receiver and not just a USB mouse.  Normally, Logitech’s SetPoint software installs a program that runs when you log in to Windows and magically switches the MX900 base station into Bluetooth HCI mode – that is, native Bluetooth receiver mode (assuming you have installed the WIDCOMM Bluetooth stack, that is).

So, I set out to install SetPoint on my test computer.  Unfortunately, this didn’t really work out as planned.  The computer I had available to work with was running Windows Server 2003, and it seems that the SetPoint installer for the version I needed wasn’t exactly well tested on Windows Server 2003.  The installer would tend to blow up with heap corruption right away, making it impossible to do anything.  I then tried running the installer under the Windows XP SP2 compatibility layer (right-click the .exe; there is a compatibility option in the propsheet if you are an administrator).  This got me a bit further, but the Logitech installer inevitably crashed.

Looking around a bit, I found that there was actually a more recent version of SetPoint available (Logitech supports 2.22 with the MX900, the latest being 2.60, which is designed for Logitech’s Bluetooth keyboard and mouse suite).  I figured that it was worth a try to install 2.60 and see if that worked.  Sure enough, the installer actually didn’t crash this time, but unfortunately, it would not accept that I had a Bluetooth device that was compatible with it; I got stuck at a dialog that instructed me to connect my Logitech Bluetooth device and hit OK, or skip the installation of the Bluetooth support and install “just plain” SetPoint.  Well, that sucks – the whole point of this exercise was to get Bluetooth working on the test computer, not Logitech’s middleware.

Poking around in my temp directory, I noticed that while the installer was running, one of the temporary directories it created seemed to have a second installer for the WIDCOMM Bluetooth stack (WIDCOMM – now Broadcom - does not make their software directly available for download to end users, and instead requires them to get it bundled with hardware from an equipment manufacturer).  A-ha – maybe there was light at the end of the tunnel, after all.  While the Logitech installer was waiting for me to hit Next in one of the wizard steps, I manually launched the WIDCOMM installer from the temp directory that the Logitech installer had created.  The installer actually worked fine, except that it too complained that it could not detect an active Bluetooth device (fortunately, though, it allowed me the option of continuing the install anyway).

After the WIDCOMM installer finished, I canceled out of the Logitech install and went to see if I could convince the WIDCOMM stack that I really did have a Bluetooth device.  After not getting anywhere on my own, I turned to Google, where I found a number of people complaining about the same problem (not being able to switch their MX900 receivers into native HCI mode), but no quick solution for Windows.  I did, however, find something for Linux – a program called “hid2hci” that knew how to switch an MX900 Bluetooth receiver into HCI mode.  Fortunately, source code was included, so it was easy enough to see what it was doing.  Unfortunately, I don’t really have a whole lot of experience with USB, on Windows or other platforms, and what I needed to do was port hid2hci to Windows.

The Linux program is fairly simple.  Even with my limited knowledge of USB, what it was doing appeared straightforward enough.  The program sends three vendor-specific HID output reports (a HID report is the basic way for a HID device to either report information to the computer or have a setting on the device changed) to the MX900 receiver.  After receiving the three special HID reports, the MX900 changes its PnP ID and appears to the operating system as a different piece of hardware: an HCI Bluetooth receiver.

 So, I got started working on a Windows version of hid2hci.  The first step was to crack open the Windows DDK documentation (you can download the DDK with the free KMDF 1.1 ISO distribution) and start looking around for ways to talk to USB devices.  It turns out that there is already a rather full featured API to do this, from both user mode and kernel mode.  Because all I really needed to do here was to send three custom commands to the MX900, I picked the user mode HID API to start with.

The user mode HID APIs live in hid.dll and come in two flavors: HID Parser routines (prefixed HidP_), and HID Device/Driver routines (prefixed HidD_).  The former provide a high level interface for formatting, preparing, and parsing the actual USB HID reports, while the latter deal with actually sending and receiving USB HID reports.  The API is a bit cumbersome, but it turns out to not be too hard to use.  The basic idea is:

  1. Open a handle to the HID device that you want to talk to with CreateFile.
  2. Call HidD_GetPreparsedData to load the preparsed data for the collection.  This basically retrieves all sorts of information about the HID device ahead of time in one blob that is later used by the HID Parser routines to ensure that you are matching the report formats used by the device.
  3. Call HidD_GetAttributes and HidP_GetCaps to make sure that the device is the one you meant to connect to and supports the HID usages that you are going to use.  Here, I wanted to verify that the vendor-specific usage page 0xFF00, usage 0x0001 is present (as this is where I wanted to send the three magic reports to switch the receiver into HCI mode).
  4. Build the HID report.  I originally attempted to do this using the high-level HID Parser APIs, but I couldn’t get it to work right – the HID Parser APIs kept complaining that the usage I requested didn’t exist.  I assume this is because Logitech never bothered to completely describe the format of the HID reports for this vendor-specific usage, resulting in the high-level parser becoming unhappy when asked to format a report for that usage.  As a result, I built the report manually, writing the raw data into the report buffer and prepending the HID report ID (0x10) to the report data.
  5. Send the completed report to the device.  There are two ways to do this – WriteFile and HidD_SetOutputReport.  I attempted to use WriteFile first, but it always failed with ERROR_ACCESS_DENIED.  Not being an expert on HID devices, I tried the other documented routine (HidD_SetOutputReport) to send the report, which worked fine.  HidD_SetOutputReport internally just sends a special IOCTL to the driver for the device you open, so the two code paths are in fact different.

Steps 4 and 5 will basically need to be repeated for each of the three HID reports that we need to send to the Bluetooth receiver.

 There are a couple of other things that you need to do in order to get this to work that I glossed over.  In particular, you need to actually find the device that you want to open with CreateFile.  The best way to do this is by using the SetupDi family of APIs to enumerate all HID devices.  We can then verify that each device has the expected vendor ID, product ID, and HID usages before we try to send it the magical commands to convert the device to native HCI mode.

After putting all of these steps together, I had something that appeared to do what the Linux hid2hci program did.  Sure enough, when I ran my prototype hid2hci port on my test box, a new device appeared in Device Manager, which was detected by the WIDCOMM Bluetooth stack as a Cambridge Silicon Radio Bluetooth Receiver.  Success!

The device itself stays in native HCI mode until it is reset (e.g. by rebooting the computer or unplugging the receiver), so the HCI conversion program needs to either periodically scan for devices to switch to HCI mode, or register for a device change notification, in order to keep full Bluetooth functionality available after you reboot or disconnect the Bluetooth receiver.

The source code for my Logitech HID-to-HCI converter program is available for download if you are interested in it.  You will need the Windows DDK installed in order to build it.  Alternatively, you can download the binary if you just want the program and don’t want to install the development environment to build it.  It takes two command line arguments: the hexadecimal vendor ID and hexadecimal product ID of the device that it should switch from HID emulation mode to native HCI mode.  You can find these under Device Manager on Windows XP SP2 or Windows Server 2003 SP1 by opening your device’s property sheet, going to the Details tab, and selecting the Hardware Ids listbox item.  The device you want is probably going to be named “USB Composite Device”.  If you are using an MX900, then you can use 046d for the vendor ID and c705 for the product ID.  There is no harm in running the program repeatedly after it has already switched your device(s) to HCI mode.

On the selection of blogging software.

July 5th, 2006

After wasting about 4 hours on this subject, I think it has deserved the right to be spoken about.

Since I decided to finally give this whole blogging thing a try, I went looking for an appropriate set of blogging software to use.  My first instinct was to try Community Server with SQL Server 2005 Express Edition.  Hey, it works for Microsoft (MSDN Blogs is powered by Community Server), so it should work for me, right?  And to make things even better, it’s a native Windows solution (I was planning on deploying the blog on a Windows box), not a Unix port (which in my experience tend to be half-hearted and generally low quality, as far as applications go).

 Boy, was I wrong.

 After waiting about half an hour (I think) for SQL Server 2005 to install on my interim hosting box (which, although it isn’t the fastest box in the world, should really install the “lite” edition of SQL Server a little bit faster than that, I think), I set to work on setting up my blog.

So I start the setup app, and everything is going great until the install almost finishes and goes into the configuration wizard, which then wants me to enter my database information.  Now, I’m no SQL Server expert, but I (foolishly) think – how hard can this be?  I try the defaults – create the default database name on the locally running SQL Server.  Well, the installer freezes for about 30 seconds and then comes up with an error messagebox saying that it can’t talk to the database or that I gave it invalid credentials.  Since SQL Server 2005 Express is supposed to use integrated Windows authentication by default, and I am running the installer as admin (and told it to use integrated Windows authentication, per recommendations), I discount that possibility for the moment.

I think back to when I set up SQL Server 2005 half an hour earlier, and to it having said something about disabling all remote network access (a good thing from a security perspective!).  So, I figure, something must be stupid here – it must be trying to connect with one of the network transports instead of one of the local transports, and that is why things are failing.  I (foolishly, as it turns out!) cancel out of the Community Server configuration wizard so that I can reconfigure SQL Server 2005 to enable network access.

That was mistake no. 1.  The Community Server installer comes back into focus with a big happy “setup was successful!” dialog and no option to go back and re-run the wizard.  Oops.  I had to uninstall the *whole thing* and reinstall it to get the post-installation configuration wizard back (which had failed, mind you, not run successfully!) after reconfiguring SQL Server.  So, after waiting around for the Community Server installer to take its time, I get back to the SQL Server selection dialog.  Again, no dice – nothing I enter seems to appease it.

Now, I’m starting to get pretty annoyed.  This was supposed to be a quick and easy thing to set up, not something that I wanted to spend my whole afternoon on.  Well, after doing a bit of research on how SQL Server works (not exactly what I was expecting to have to do; this was supposed to all work out of the box, remember?), I figure out that the default configuration is supposed to listen on port 1433 if you selected TCP/IP for a network transport (which I did, in the SQL Server 2005 Surface Area Configuration Wizard thingie).  Well, a “netstat -anp tcp” says that nobody is listening on that port.  Oops, something is clearly wrong here, even though I followed all of the rules and used the supported UI and everything to configure SQL Server.  Well, I start poking around a bit more with the management tools that SQL Server installed, and eventually I got to where you tell the TCP transport which IP addresses it should listen on.  I figure that maybe I need to manually tell SQL Server to listen on the IP I want on port 1433 here, if it wasn’t already (even though the TCP transport was enabled according to everything I could see).

Now, this is a prime example of how you should not design a UI.  The UI for selecting the IP addresses/interfaces to listen on is vaguely reminiscent of the Visual Studio property pages, where you have a column of property types (descriptions) and a column of property values (that you fill in to configure it).  Each IP address has a bunch of options that you can pick from in this form.

Unfortunately, because I had turned on IPv6 support for this box, it has about 26 IP addresses (due to IPv6-over-IPv4 automatic tunneling interfaces) between all of the VMware virtual NICs and the two physical NICs in the box.  And all of the IP addresses were “expanded” by default, in a tiny listbox that only had room for about one and a half property sets per IP address.  To make matters even worse, the IP that I wanted to enable the listener for was #3, but the sorting for the UI was wrong – it goes 1, 10, 11, … 2, 20, 21 … and so forth, so the third IP address was at the *end* of the list.  Great UI here, guys!

 Anyways, after agonizing over that particular piece of user interface meets train wreck, I (or so I think) tell SQL Server to listen on my internal LAN IP for port 1433, hit OK, and restart the service.

Then, I go try hitting “Next” on the Community Server setup.  Again, it doesn’t work.  Back to netstat, and SQL Server STILL isn’t listening on port 1433.  Aaargh!  Well, next I do a “netstat -anpo tcp” and match the listening process IDs and port numbers against the SQL Server process ID to see if SQL Server is actually listening on ANYTHING.  It turns out that it IS listening, but on a completely unexpected (to me, anyway) port – 1398.  Huh??  The TCP transport uses 1433 by default!  Furthermore, I had configured it through the UI to listen on 1433.  Well, not being an expert on SQL Server, I start hunting around for how to convince Community Server’s setup wizard to use port 1398 for a remote TCP connection (still having no idea why the local connection mechanism doesn’t work, as that too is enabled in the SQL Server configuration manager).

It turns out that I have to go and define a “connection alias” in the SQL Server Native Client Configuration section of the SQL Server manager UI.  After filling everything in (connect to the IP that SQL Server was listening on, port 1398) and hitting OK, I went back to the Community Server setup wizard and hit Next, praying for it to work.

Success!  Only about 2.5 hours into the installation attempt, I’ve at least got it talking to the database.  The next step is to give it the initial account name and password for the Community Server administrator account when creating the database.  I enter the information and continue through the next steps in the wizard, and finally get to the end.  After hitting the last “Next” button, naturally, I get an error message; Community Server failed to create the database properly.  I am then presented with a Notepad view of the setup log, where the problem is quickly evident – apparently, the Community Server setup app didn’t bother to escape its strings before passing them to SQL queries, and blew up because the password I used had a ‘ (single quote) in it.  This gives me warm fuzzies about how safe Community Server would be against SQL injection attacks if I ever got it working, let me tell you.

So, I back out of that part of the wizard and pick a new password that doesn’t have any characters that are unfriendly to SQL if not properly escaped.  Now I get ANOTHER error at the end of the wizard: the setup program failed to create the database because it already exists.  Apparently, the wizard doesn’t roll back all of its changes if it fails partway through, and some of the database goo from the previous unsuccessful configuration attempt was still there.

 So, then I’m back to uninstalling and reinstalling Community Server for about the 5th time today, and getting madder every minute.  After the reinstall finally finishes, and I reenter all of my information yet another time, the configuration wizard actually completes without any visible errors (yay!!!).  So, the last thing the wizard wants to do is launch the Community Server site for the local IIS install so that I can see all of the Community Server goodness that is now installed.

Guess what? The site doesn’t load. Instead, I’m presented with a 404 Not Found in my browser window, from the URL that the configuration wizard so nicely launched at completion. Apparently, there were several problems here: default.aspx wasn’t properly registered as a default document, and Community Server had picked the wrong IIS site to configure itself on (without even asking me which one I wanted).

At this point, I’ve had it – after having burned a couple of hours on this problem, I’m just not willing to give it any more of my time. I uninstalled SQL Server and Community Server and went in search of other alternatives, which led me to my current blog software, WordPress.

WordPress is not exactly what you would call a native Windows solution – it relies on PHP, and the only backend database provider it can talk to is MySQL. None of these were really designed to work on Windows, but at this point I was ready to try anything that didn’t touch SQL Server or have Community Server in its name. All in all, it only took me about 15 minutes to download everything I needed (from various sites, too, not one centralized location) to get WordPress running on IIS (Win32 PHP 5, Win32 MySQL, and of course WordPress itself), work through one or two minor setup hiccups (I made a typo in the database user password in the MySQL console once, forgot to tell PHP where to find php.ini, and forgot to enable the PHP MySQL extension DLL), and have everything working.

Wow. Now, I’m not what you would call a Unix guy – I do as much as I can on Windows and avoid Unix wherever possible – and here I am, having figured out this “crude” set of tools without any friendly, advanced setup wizards (well, MySQL has a relatively nice GUI setup, actually), learned how to create the database that WordPress needed myself in MySQL, and debugged setting up a complicated ISAPI extension in IIS with partially outdated documentation, all in a mere fraction of the time that the leading Windows solution (with its do-everything-for-you-the-right-way setup wizards and all) took to get into an almost-working state. I hate to admit it, but sometimes software that came from Unix just does things better.

VMware Server and RDP don’t always play nicely together.

July 5th, 2006

Steve already stole my thunder (well, if that makes sense, since it was my paper anyway) by posting my analysis of this earlier, but I figure that it is also worth discussing here.

Recently, I finally* got a new development box at work – multiproc, x64 capable (with the ability to run 64-bit VMs too!), lots of RAM, generally everything you would want in a really nice development box. Needless to say, I was rather excited to see what I could do with it. The first thing I had in mind was setting up a dedicated set of VMs to run my test network on and host various dedicated services, such as our symbol server here at the office.

(*: There is a long, sad story behind this. For a long time, I’d had a VM running on an ancient ~233MHz box that nobody else at the office wanted (for obvious reasons!). I had been trying to get a replacement box that didn’t suck so much to put this VM (and others) on to run full time, but just about everything that could possibly go wrong with requesting a purchase from work did go wrong, resulting in it being delayed on the order of half a year…).

The box came with Windows XP Professional x64 Edition installed, so I figured that I might as well use that install instead of blowing it away and putting Windows Server 2003 on for now. As it turned out, this came around to bite me later. After installing all of the usual things (service packs, hotfixes, and so forth), I went to grab the latest VMware Server installer so that I could put the box to work running my library of VMs. Everything seemed to be going okay at the start, until I began to do things that were a bit outside the box, so to speak. Here, I wanted to have my XP x64 box route through a VM running on the same computer. Why on earth would I possibly want to do that, you ask? Well, I have an internal VPN-based network that overlays the office network here at work and connects all of the VMs I have running on various boxes at the office. I wanted to be able to interconnect all of those VMs with various services (in particular, lots and lots of storage space) running on the beefy x64 box over this trusted VPN network instead of the public office network (which I have, for testing purposes, designated the untrusted Internet network). If I have the x64 box routing through something that is connected to the entire overlay network, then I don’t need to worry about creating connections to every single other VM in existence to grant access to those resources. (At this point, our x64 support is still in beta, and XP doesn’t have a whole lot of useful support for dedicated VPN links.)

 Anyways, things start to get weird when I finally get this setup going.  The first problem I run into is that sometimes on boot, all of the VMs that I had configured to autostart would appear to hang on startup – I would have to go to Task Manager and kill the vmware-vmx.exe processes, then restart the vmserverdWin32 service before I could get them to come up properly.  After a bit of poking around, I noticed a suspicious Application Eventlog entry that seemed to correlate with when this problem happened on a boot:

Event Type: Information
Event Source: VMware Registration Service
Event Category: None
Event ID: 1000
Date:  6/13/2006
Time:  2:10:06 PM
User:  N/A
Computer: MOGHEDIEN
Description:
vmserverdWin32 took too much time to initialize.

Hmm… that doesn’t look good. Well, digging a bit deeper, it turns out that VMware Server has several different service components, and apparently there are dependencies between them. However, the VMware Server developers neglected to properly assign dependencies between all of the services; instead, they appear to have just let the services start in whatever order, with a timeout window in which the services are supposed to establish communication with each other. Unfortunately, this tends to randomly break on some configurations (like mine, apparently).

Fortunately, the fix for this problem turned out to be fairly easy. Using sc.exe, the command line service configuration app (which used to ship with the SDK, but now ships with Windows XP and later – a handy tool to remember), I added an SCM dependency between the main VMware launcher service (“VMServerdwin32”) and the VMware authorization service (“VMAuthdService”); note that sc.exe requires the space after “depend=”:

C:\Documents and Settings\Administrator>sc config vmserverdWin32 depend= RPCSS/VMAuthdService
[SC] ChangeServiceConfig SUCCESS

After fixing the service dependencies, everything seemed to be okay, but of course, that wasn’t really the case…

When I went home later that day, I decided to VPN into the office and RDP into my new development box in order to change some hardware settings on one of my VMs. In this particular case, some of the VPN traffic from my apartment to the development box at the office happened to pass through that router VM running on the development box. Whenever I tried to RDP into the development box, the connection would freeze after I entered valid logon credentials at the winlogon prompt, and hang until TCP gave up and broke off the connection. This happened every single time I tried to RDP into my new box, yet the office connection was fine (I could still connect to other things at the office while this was happening). Definitely not cool. So, I opened a session on our development server at the office and decided to try an experiment – ping my new dev box from it while I tried to RDP in. The initial results were not at all what I expected; my dev box responded to pings the whole time, even though it was apparently unreachable over RDP as the TCP connection timed out. The next time I tried RDPing in, I ran a ping from my home box to my dev box: those pings were dropped while I was trying to connect to the console session after providing valid logon credentials, and yet the box still responded to pings from a different box at the office.

After poking around a bit more, I determined that every single VM on my brand new dev box would just freeze and stop responding whenever I tried to RDP into my dev box from home (but not from the office).  To make matters even more strange, I could connect to a different box at the office, and bounce RDP through that box to my new dev server and things would work fine.  Well, that sucks – what’s going on here?  A partial explanation stems from how exactly I had setup the routing on my new dev box; the default gateway was set to my router VM (running on that box) using one of the VMnet virtual NICs, but I had left the physical NIC on the box still bound to TCP (without a default gateway set however).  So, for traffic destined to the office subnet, there is no need for packets to traverse the router VM – but for traffic from the VPN connection to my home, packets are routed through the router VM.

 Given this information, it seemed that I had at least found why the problem was happening, on some level – whenever I tried to RDP into my new dev box over the VPN, all of the VMs on my new dev box would freeze.  Because traffic through the VPN to my new dev box is routed through a VM on the new dev box, the RDP connection stalls and times out (because the router VM has hung).

 At this point, I had to turn to a debugger to understand what was going on.  Popping the vmware-vmx.exe process corresponding to the router VM open in the debugger and comparing call stacks between when it was running normally and when it was frozen while I was trying to RDP in pointed to the thread that emulated the virtual CPU becoming blocked on an internal NtUser call to win32k.sys.  At this point, I couldn’t really do a whole lot more without resorting to a kernel debugger, making that my next step.

With the help of kd, I was able to track down the problem a bit further; the vmware CPU simulator thread was blocking on acquiring the global win32k NtUser lock that almost all NtUser calls acquire at the start of their implementation. With the `!locks' command, I was able to identify the owner of the lock – which happened to be (surprise!) a Terminal Server thread in CSRSS for the console session. This thread was waiting on a kernel event, which turns out to be signalled when the RDP TCP transport driver receives data from the network. So, we have a classic deadlock situation: the router VM is blocked on win32k’s internal NtUser global lock, and a CSRSS thread holds that lock while waiting on network I/O (from the RDP client). Because the RDP client (me at home connecting through the VPN) needs to route traffic through the router VM to reach the RDP TCP transport on my new dev box, everything appears to freeze until the TCP connection times out.
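The shape of that deadlock can be sketched in a few lines. This toy model (the names are illustrative stand-ins, not the real win32k internals) has one thread hold a lock while waiting on an event that can only ever be signalled if the other, lock-blocked thread makes progress – so neither side can move until a timeout breaks the cycle:

```python
import threading

nt_user_lock = threading.Lock()   # stands in for win32k's global NtUser lock
rdp_data = threading.Event()      # stands in for "RDP transport received data"
lock_held = threading.Event()     # scaffolding: signals that CSRSS has the lock

def csrss_thread():
    # The Terminal Server thread in CSRSS: takes the NtUser lock, then
    # blocks waiting on network I/O from the RDP client.
    with nt_user_lock:
        lock_held.set()
        rdp_data.wait(timeout=2.0)  # the real wait only ends when TCP gives up

def router_vm_thread(result):
    # The router VM's CPU thread: an NtUser call blocks on the same lock.
    # Because this thread is stuck, the VM never forwards the RDP client's
    # packets, so rdp_data is never set -- a circular wait.
    lock_held.wait()
    result["got_lock"] = nt_user_lock.acquire(timeout=0.3)
    if result["got_lock"]:
        nt_user_lock.release()

result = {}
t1 = threading.Thread(target=csrss_thread)
t2 = threading.Thread(target=router_vm_thread, args=(result,))
t1.start(); t2.start()
t2.join(); t1.join()
print(result["got_lock"])  # False: the lock never became available
```

In the real system there are no timeouts on the lock side, so the “VM freeze” lasts exactly as long as the RDP client’s TCP connection takes to give up, matching the observed behavior.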

 Unfortunately, there isn’t really a very good solution to this problem.  Installing Windows Server 2003 would have helped, in my case, because then VMware Server and its services would be running on session 0, and RDP connections would be diverted to new Terminal Server sessions (with their own per-session-instanced win32k NtUser locks), thus avoiding the deadlock (unless you happened to connect to Terminal Server using the `/console’ option).

 So there you have it – why VMware Server and RDP can make a bad mix sometimes.  This is a real shame, too, because RDPing into a box and running the VMware Server console client “locally” is sooo superior to running the VMware Server console client over the network (updates *much* faster, even over a LAN).

 If you’re interested, I did a writeup of most of the technical details of the actual debugging (with WinDbg and kd) of this problem that you can look at here – you are encouraged to do so if you want to see some of the steps I took in the debugger to further analyze the problem.

In the future, I’ll try not to gloss over the debugger steps so much in blog posts; this time, I had already written the writeup beforehand, and didn’t want to just reformat the whole thing into an entire blog post.

 Whew, that was a long second post – hopefully, future ones won’t be quite so long-winded (if you consider that a bad thing).  Hopefully, future posts won’t be written at 1am just before I go to sleep, too…

And so the blog begins…

July 4th, 2006

Steve Dispensa and Justin Olbrantz have been bothering me to start blogging for some time now.  I’ve finally broken down and decided to give it a try, so here goes.

N.B. The blog site is temporarily hosted on a spare box that I have in the closet of my apartment (until I have proper hosting arrangements finalized), so here’s to hoping there won’t be too much downtime in the beginning.