An introduction to kernrate (the Windows kernel profiler)

December 7th, 2006

One useful utility for tracking down performance problems that you might not have heard of until now is kernrate, the Windows kernel profiler. This utility currently ships with the Windows Server 2003 Resource Kit Tools package (though you can use kernrate on Windows XP as well) and is freely downloadable. Currently, you’ll have to match the version of kernrate you want to use with your processor architecture, so if you are using your processor in x64 mode with an x64 Windows edition, then you’ll have to dig up an x64 version of kernrate (the one that ships with the Srv03 resource kit tools is x86); KrView (see below) ships with an x64 compatible version of kernrate.

Kernrate requires that you have the SeProfilePrivilege assigned (which is typically only granted to administrators), so in most cases you will need to be a local administrator on your system in order to use it. This privilege allows access to the (undocumented) profile object system services. These APIs allow programmatic access to sample the instruction pointer at certain intervals (typically, a profiler program selects the timer interrupt for use with instruction pointer sampling). This allows you to get a feel for what the system is doing over time, which is in turn useful for identifying the cause of performance issues where a particular operation appears to be processor bound and taking longer than you would like.
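If you were writing your own sampling tool against these services, the first step would be to enable the profiling privilege on your token. Here is a minimal sketch of that step (my own illustration, assuming the privilege in question is the one exposed to Win32 as SE_SYSTEM_PROFILE_NAME, with most error handling omitted):

#include <windows.h>

// Enable the system profiling privilege on the current process token.
// Returns TRUE if the privilege was successfully enabled.
BOOL EnableProfilePrivilege(void)
{
   HANDLE           Token;
   TOKEN_PRIVILEGES Privileges;
   BOOL             Ok;

   if (!OpenProcessToken(GetCurrentProcess(), TOKEN_ADJUST_PRIVILEGES,
      &Token))
      return FALSE;

   Privileges.PrivilegeCount           = 1;
   Privileges.Privileges[0].Attributes = SE_PRIVILEGE_ENABLED;

   // SE_SYSTEM_PROFILE_NAME corresponds to "SeSystemProfilePrivilege".
   Ok = LookupPrivilegeValue(NULL, SE_SYSTEM_PROFILE_NAME,
      &Privileges.Privileges[0].Luid);

   if (Ok)
      Ok = AdjustTokenPrivileges(Token, FALSE, &Privileges,
         sizeof(Privileges), NULL, NULL) &&
         GetLastError() == ERROR_SUCCESS;

   CloseHandle(Token);

   return Ok;
}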

There are a multitude of options that you can give kernrate (and you are probably best served by experimenting with them a bit on your own), so I’ll just cover the common ones that you’ll need to get started (use “kernrate -?” to get a list of all supported options).

Kernrate can be used to profile both user mode and kernel mode performance issues. By default, it operates only on kernel mode code, but you can override this via the -a (and -av) options, which cause kernrate to include user mode code in its profiling operations in addition to kernel mode code. Additionally, by default, kernrate operates over the entire system at once; to get meaningful results with profiling user mode code, you’ll want to specify a process (or group of processes) to profile, with the “-p pid” and/or “-n process-name” arguments. (The process name is the first 8 characters of a process’s main executable filename.)

To terminate collection of profiling data, use Ctrl-C. (On pre-Windows-Vista systems where you might be running kernrate.exe via runas, remember that Ctrl-C does not work on console processes started via runas.) Additionally, you can use the “-s seconds” argument to specify that profiling should be automagically stopped after a given count of seconds have elapsed.
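Putting the common options together, a typical invocation for profiling a user mode process might look something like the following (the process name here is only a placeholder):

kernrate -a -n myapp -s 30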

If you run kernrate on kernel mode code only, or just specify a process (or group of processes) as described above, you’ll notice that you get a whole lot of general system-wide output (information about interrupt counts, global processor time usage, context switch counts, I/O operation counts) in addition to output about which modules used a noteworthy amount of processor time. Here’s an example output of running kernrate on just the kernel on my system, as described above (including just the module totals):

D:\Programs\Utilities>kernrate
Kernrate User-Specified Command Line:
kernrate


Kernel Profile (PID = 0): Source= Time,
Using Kernrate Default Rate of 25000 events/hit
Starting to collect profile data

***> Press ctrl-c to finish collecting profile data
===> Finished Collecting Data, Starting to Process Results

------------Overall Summary:--------------

[...]

OutputResults: KernelModuleCount = 153
Percentage in the following table is based on
the Total Hits for the Kernel

Time   197 hits, 25000 events per hit --------
 Module    Hits   msec  %Total  Events/Sec
intelppm     67        980    34 %     1709183
ntkrnlpa     52        981    26 %     1325178
win32k       35        981    17 %      891946
hal          19        981     9 %      484199
dxgkrnl       6        980     3 %      153061
nvlddmkm      6        980     3 %      153061
fanio         3        981     1 %       76452
bcm4sbxp      2        981     1 %       50968
portcls       2        980     1 %       51020
STAC97        2        980     1 %       51020
bthport       1        981     0 %       25484
BTHUSB        1        981     0 %       25484
Ntfs          1        980     0 %       25510

Using kernrate in this fashion is a good first step towards profiling a performance problem (especially if you are working with someone else’s program), as it quickly allows you to narrow down a processor hog to a particular module. While this is useful as a first step, however, it doesn’t really give you a whole lot of information about what specific code in a particular module is taking a lot of processor time.

To dig in deeper as to the cause of the problem (beyond just tracing it to a particular module), you’ll need to use the “-z module-name” option. This option tells kernrate to “zoom in” on a particular module; that is, for the given module, kernrate will track instruction pointer locations within the module to individual functions. This level of granularity is often what you’ll need for tracking down a performance issue (at least as far as profiling is concerned). You can repeat the “-z” option multiple times to “zoom in” to multiple modules (useful if the problem you are tracking down involves high processor usage across multiple DLLs or binaries).

Because kernrate is resolving instruction pointer sampling down to a more granular level than modules (with the “-z” option), you’ll need to tell it how to load symbols for all affected modules (otherwise, the granularity for profiler output will typically be very poor, often restricted to just exported functions). There are two ways to do this. First, you can use the “-j symbol-path” command line option; this option tells kernrate to pass a particular symbol path to DbgHelp for use with loading symbols. I recommend the second option, however, which is to configure your _NT_SYMBOL_PATH before-hand so that it points to a valid DbgHelp symbol path. This relieves you of having to manually tell kernrate a symbol path every time you execute it.
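For example, before launching kernrate, you might point _NT_SYMBOL_PATH at the Microsoft symbol server with a local downstream cache (the cache directory here is just a placeholder):

set _NT_SYMBOL_PATH=srv*C:\Symbols*http://msdl.microsoft.com/download/symbols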

Continuing with the example I gave above, we might be interested in just what the “win32k” (the Win32 kernel mode support driver for USER/GDI) module is doing that was taking up 17% of the processor time spent in kernel mode on my system (for the interval that I was profiling). To do that, we can use the following command line (the output has been truncated to only include information that we are interested in):

D:\Programs\Utilities>kernrate -z win32k

Kernrate User-Specified Command Line:
kernrate -z win32k


Kernel Profile (PID = 0): Source= Time,
Using Kernrate Default Rate of 25000 events/hit
CallBack: Finished Attempt to Load symbols for
90a00000 \SystemRoot\System32\win32k.sys

Starting to collect profile data

***> Press ctrl-c to finish collecting profile data
===> Finished Collecting Data, Starting to Process Results

------------Overall Summary:--------------

[...]

OutputResults: KernelModuleCount = 153
Percentage in the following table is based on the
Total Hits for the Kernel

Time   2465 hits, 25000 events per hit --------
 Module      Hits   msec  %Total  Events/Sec
ntkrnlpa     1273      14799    51 %     2150483
win32k        388      14799    15 %      655449
intelppm      263      14799    10 %      444286
hal           236      14799     9 %      398675
bcm4sbxp       66      14799     2 %      111494
spsys          55      14799     2 %       92911
nvlddmkm       48      14799     1 %       81086
STAC97         31      14799     1 %       52368

[...]


===> Processing Zoomed Module win32k.sys...


----- Zoomed module win32k.sys (Bucket size = 16 bytes,
Rounding Down) --------
Percentage in the following table is based on the
Total Hits for this Zoom Module

Time   388 hits, 25000 events per hit --------
 Module                  Hits   msec  %Total  Events/Sec
xxxInternalDoPaint         44      14799    10 %       74329
XDCOBJ::bSaveAttributes    20      14799     4 %       33786
DelayedDestroyCacheDC      20      14799     4 %       33786
HANDLELOCK::vLockHandle    15      14799     3 %       25339
mmxAlphaPerPixelOnly       15      14799     3 %       25339
XDCOBJ::RestoreAttributes  13      14799     2 %       21960
DoTimer                    12      14799     2 %       20271
_SEH_prolog4               11      14799     2 %       18582
memmove                     9      14799     2 %       15203
_GetDCEx                    6      14799     1 %       10135
HmgLockEx                   6      14799     1 %       10135
XDCOBJ::bCleanDC            5      14799     1 %        8446
XEPALOBJ::ulIndexToRGB      5      14799     1 %        8446
HmgShareCheckLock           4      14799     0 %        6757
RGNOBJ::bMerge              4      14799     0 %        6757

[...]

This should give you a feel for the kind of information that you’ll get from kernrate. Although the examples I gave were profiling kernel mode code, the whole process works the same way for user mode if you use the “-p” or “-n” options as I mentioned earlier. In conjunction with a debugger, the information that kernrate gives you can often be a great help in narrowing down CPU usage performance problems (or at the very least point you in the general direction as to where you’ll need to do further research).

There are also a variety of other options that are available in kernrate, such as features for gathering information about “hot” locks that have a high degree of contention, and support for launching new processes under the profiler. There is also support for outputting the raw sampled profile data, which can be used to graph the output (such as you might see used with tools like KrView).

Although kernrate doesn’t have all the “bells and whistles” of some of the high-end profiling tools (like Intel’s vTune), it’s often enough to get the job done, and it’s also available to you at no extra cost (and can be quickly and easily deployed to help find the source of a problem). I’d highly recommend giving it a shot if you are trying to analyze a performance problem and don’t already have a profiling solution that you are using.

Frame pointer omission (FPO) optimization and consequences when debugging, part 2

December 6th, 2006

This series is about frame pointer omission (FPO) optimization and how it impacts the debugging experience.

  1. Frame pointer omission (FPO) and consequences when debugging, part 1.
  2. Frame pointer omission (FPO) and consequences when debugging, part 2.

Last time, I outlined the basics as to just what FPO does, and what it means in terms of generated code when you compile programs with or without FPO enabled. This article builds on the last, and lays out just what the impacts of having FPO enabled (or disabled) are when you end up having to debug a program.

For the purposes of this article, consider the following example program with several do-nothing functions that shuffle stack arguments around and call each other. (For this posting, I have disabled global optimizations and function inlining.)

#include <windows.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <intrin.h>

__declspec(noinline)
void
f3(
   int* c,
   char* b,
   int a
   )
{
   *c = a * 3 + (int)strlen(b);

   __debugbreak();
}

__declspec(noinline)
int
f2(
   char* b,
   int a
   )
{
   int c;

   f3(
      &c,
      b + 1,
      a - 3);

   return c;
}

__declspec(noinline)
int
f1(
   int a,
   char* b
   )
{
   int c;
   
   c = f2(
      b,
      a + 10);

   c ^= (int)rand();

   return c + 2 * a;
}

int
__cdecl
wmain(
   int ac,
   wchar_t** av
   )
{
   int c;

   c = f1(
      (int)rand(),
      "test");

   printf("%d\\n",
      c);

   return 0;
}

If we run the program and break in to the debugger at the hardcoded breakpoint, with symbols loaded, everything is as one might expect:

0:000> k
ChildEBP RetAddr  
0012ff3c 010015ef TestApp!f3+0x19
0012ff4c 010015fe TestApp!f2+0x15
0012ff54 0100161b TestApp!f1+0x9
0012ff5c 01001896 TestApp!wmain+0xe
0012ffa0 77573833 TestApp!__tmainCRTStartup+0x10f
0012ffac 7740a9bd kernel32!BaseThreadInitThunk+0xe
0012ffec 00000000 ntdll!_RtlUserThreadStart+0x23

Regardless of whether FPO optimization is turned on or off, since we have symbols loaded, we’ll get a reasonable call stack either way. The story is different, however, if we do not have symbols loaded. Looking at the same program, with FPO optimizations enabled and symbols not loaded, we get somewhat of a mess if we ask for a call stack:

0:000> k
ChildEBP RetAddr  
WARNING: Stack unwind information not available.
Following frames may be wrong.
0012ff4c 010015fe TestApp+0x15d8
0012ffa0 77573833 TestApp+0x15fe
0012ffac 7740a9bd kernel32!BaseThreadInitThunk+0xe
0012ffec 00000000 ntdll!_RtlUserThreadStart+0x23

Comparing the two call stacks, we lost three of the call frames entirely in the output. The only reason we got anything slightly reasonable at all is that WinDbg’s stack trace mechanism has some intelligent heuristics to guess the location of call frames, even in a stack where not every frame uses a frame pointer.

If we look back to how call stacks are set up with frame pointers (from the previous article), a program trying to walk the stack on x86 without symbols works by treating the stack as a sort of linked list of call frames. Recall that I mentioned the layout of the stack when a frame pointer is used:

[ebp-01]   Last byte of the last local variable
[ebp+00]   Old ebp value
[ebp+04]   Return address
[ebp+08]   First argument...

This means that if we are trying to perform a stack walk without symbols, the way to go is to assume that ebp points to a “structure” that looks something like this:

typedef struct _CALL_FRAME
{
   struct _CALL_FRAME* Next;
   void*               ReturnAddress;
} CALL_FRAME, * PCALL_FRAME;

Note how this corresponds to the stack layout relative to ebp that I described above.

A very simple stack walk function designed to walk frames that are compiled with frame pointer usage might then look like so (using the _AddressOfReturnAddress intrinsic to find “ebp”, assuming that the old ebp is 4 bytes before the address of the return address):

LONG
StackwalkExceptionHandler(
   PEXCEPTION_POINTERS ExceptionPointers
   )
{
   if (ExceptionPointers->ExceptionRecord->ExceptionCode
      == EXCEPTION_ACCESS_VIOLATION)
      return EXCEPTION_EXECUTE_HANDLER;

   return EXCEPTION_CONTINUE_SEARCH;
}

void
stackwalk(
   void* ebp
   )
{
   PCALL_FRAME frame = (PCALL_FRAME)ebp;

   printf("Trying ebp %p\\n",
      ebp);

   __try
   {
      for (unsigned i = 0;
          i < 100;
          i++)
      {
         if ((ULONG_PTR)frame & 0x3)
         {
            printf("Misaligned frame\\n");
            break;
         }

         printf("#%02lu %p  [@ %p]\\n",
            i,
            frame,
            frame->ReturnAddress);

         frame = frame->Next;
      }
   }
   __except(StackwalkExceptionHandler(
      GetExceptionInformation()))
   {
      printf("Caught exception\\n");
   }
}

#pragma optimize("y", off)
__declspec(noinline)
void printstack(
   )
{
   void* ebp = (ULONG*)_AddressOfReturnAddress()
     - 1;

   stackwalk(
      ebp);
}
#pragma optimize("", on)

If we recompile the program, disable FPO optimizations, and insert a call to printstack inside the f3 function, the console output is something like so:

Trying ebp 0012FEB0
#00 0012FEB0  [@ 0100185C]
#01 0012FED0  [@ 010018B4]
#02 0012FEF8  [@ 0100190B]
#03 0012FF2C  [@ 01001965]
#04 0012FF5C  [@ 01001E5D]
#05 0012FFA0  [@ 77573833]
#06 0012FFAC  [@ 7740A9BD]
#07 0012FFEC  [@ 00000000]
Caught exception

In other words, without using any symbols, we have successfully performed a stack walk on x86.

However, this all breaks down when a function somewhere in the call stack does not use a frame pointer (i.e. was compiled with FPO optimizations enabled). In this case, the assumption that ebp always points to a CALL_FRAME structure is no longer valid, and the call stack is either cut short or is completely wrong (especially if the function in question repurposed ebp for some other use besides as a frame pointer). Although it is possible to use heuristics to try and guess what is really a call frame or return address on the stack, this is really nothing more than an educated guess, and tends to be at least slightly wrong (and typically missing one or more frames entirely).
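To give a rough idea of what such a heuristic might look like (this is purely an illustrative sketch, not the actual algorithm any debugger uses), the following code scans stack slots and flags a value as a plausible return address if it points into committed, executable memory and is immediately preceded by the common E8 (call rel32) opcode:

#include <windows.h>
#include <stdio.h>

// Very rough test: does Address look like a return address? It must point
// into committed, executable memory, and the byte 5 bytes before it must be
// an E8 (call rel32) opcode. A real heuristic would also check the other
// call encodings (e.g. FF /2) and handle the case where Address - 5 crosses
// into an unreadable page; this sketch does not.
static BOOL LooksLikeReturnAddress(ULONG_PTR Address)
{
   MEMORY_BASIC_INFORMATION mbi;

   if (Address < 5 ||
       !VirtualQuery((void*)Address, &mbi, sizeof(mbi)))
      return FALSE;

   if (mbi.State != MEM_COMMIT ||
       !(mbi.Protect & (PAGE_EXECUTE_READ | PAGE_EXECUTE_READWRITE |
                        PAGE_EXECUTE_WRITECOPY)))
      return FALSE;

   return *(unsigned char*)(Address - 5) == 0xE8;
}

// Walk Slots stack slots starting at StackPointer and print the candidates.
void GuessReturnAddresses(ULONG_PTR* StackPointer, size_t Slots)
{
   for (size_t i = 0; i < Slots; i++)
   {
      if (LooksLikeReturnAddress(StackPointer[i]))
         printf("Possible return address in slot %p: %p\n",
            (void*)&StackPointer[i],
            (void*)StackPointer[i]);
   }
}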

Now, you might be wondering why you might care about doing stack walk operations without symbols. After all, you have symbols for the Microsoft binaries that your program will be calling (such as kernel32) available from the Microsoft symbol server, and you (presumably) have private symbols corresponding to your own program for use when you are debugging a problem.

Well, the answer to that is that you will end up needing to record stack traces without symbols in the course of normal debugging for a wide variety of problems. The reason for this is that there is a lot of support baked into NTDLL (and NTOSKRNL) to assist in debugging a class of particularly insidious problems: handle leaks (and other problems where the wrong handle value is getting closed somewhere and you need to find out why), memory leaks, and heap corruption.

These (very useful!) debugging features offer options that allow you to configure the system to log a stack trace on each heap allocation, heap free, or each time a handle is opened or closed. Now the way these features work is that they will capture the stack trace in real time as the heap operation or handle operation happens, but instead of trying to break into the debugger to display the results of this output (which is undesirable for a number of reasons), they save a copy of the current stack trace in-memory and then continue execution normally. To display these saved stack traces, the !htrace, !heap -p, and !avrf commands have functionality that locates these saved traces in-memory and prints them out to the debugger for you to inspect.
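For reference, these facilities are typically switched on with gflags.exe and the corresponding debugger commands. For instance (the image name is just a placeholder; double check the exact switches against your tools version):

gflags.exe /i myapp.exe +ust   (create a user mode stack trace database)
gflags.exe /i myapp.exe +hpa   (enable page heap)

Handle tracing for a user mode target can then be turned on from within the debugger with “!htrace -enable”.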

However, NTDLL/NTOSKRNL needs a way to create these stack traces in the first place, so that it can save them for later inspection. There are a couple of requirements here:

  1. The functionality to capture stack traces must not rely on anything layered above NTDLL or NTOSKRNL. This already means that anything as complicated as downloading and loading symbols via DbgHelp is instantly out of the picture, as those functions are layered far above NTDLL / NTOSKRNL (and indeed, they must make calls into the same functions that would be logging stack traces in the first place in order to find symbols).
  2. The functionality must work when symbols for everything on the call stack are not even available to the local machine. For instance, these pieces of functionality must be deployable on a customer computer without giving that computer access to your private symbols in some fashion. As a result, even if there was a good way to locate symbols where the stack trace is being captured (which there isn’t), you couldn’t even find the symbols if you wanted to.
  3. The functionality must work in kernel mode (for saving handle traces), as handle tracing is partially managed by the kernel itself and not just NTDLL.
  4. The functionality must use a minimum amount of memory to store each stack trace, as operations like heap allocation, heap deallocation, handle creation, and handle closure are extremely frequent operations throughout the lifetime of the process. As a result, options like just saving the entire thread stack for later inspection when symbols are available cannot be used, since that would be prohibitively expensive in terms of memory usage for each saved stack trace.

Given all of these restrictions, the code responsible for saving stack traces needs to operate without symbols, and it must furthermore be able to save stack traces in a very concise manner (without using a great deal of memory for each trace).

As a result, on x86, the stack trace saving code in NTDLL and NTOSKRNL assumes that all functions in the call frame use frame pointers. This is the only realistic option for saving stack traces on x86 without symbols, as there is insufficient information baked into each individual compiled binary to reliably perform stack traces without assuming the use of a frame pointer at each call site. (The 64-bit platforms that Windows supports solve this problem with the use of extensive unwind metadata, as I have covered in a number of past articles.)
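Incidentally, the same kind of compact, symbol-free trace can be captured programmatically through the documented CaptureStackBackTrace API (a.k.a. RtlCaptureStackBackTrace), which on x86 walks the same ebp chain and is thus subject to the same frame pointer assumption. A minimal sketch:

#include <windows.h>
#include <stdio.h>

// Capture up to 32 raw return addresses from the current thread's stack,
// skipping this function's own frame. No symbols are involved; the saved
// addresses can be resolved later (for instance with "ln" in the debugger)
// once symbols are available.
void PrintBackTrace(void)
{
   void*  Frames[32];
   USHORT Captured;
   USHORT i;

   Captured = CaptureStackBackTrace(1, 32, Frames, NULL);

   for (i = 0; i < Captured; i++)
      printf("#%02u %p\n", (unsigned)i, Frames[i]);
}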

So, the functionality exposed by pageheap’s stack trace logging, and handle tracing are how stack traces without symbols end up mattering to you, the developer with symbols for all of your binaries, when you are trying to debug a problem. If you make sure to disable FPO optimization on all of your code, then you’ll be able to use tools like pageheap’s stack tracing on heap operations, UMDH (the user-mode dump heap tool), and handle tracing to track down heap-related problems and handle-related problems. The best part of these features is that you can even deploy them on a customer site without having to install a full debugger (or run your program under a debugger), only later taking a minidump of your process for examination in the lab. All of them rely on FPO optimizations being disabled (at least on x86), though, so remember to turn FPO optimizations off on your release builds for the increased debuggability of these tough-to-find problems in the field.
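As far as the mechanics go, with the Microsoft compiler, disabling FPO is just a matter of adding the /Oy- switch, which overrides the /Oy (frame pointer omission) implied by /O1 and /O2. For example (the source file name is just a placeholder):

cl /O2 /Oy- /Zi program.cpp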

Use a custom symbol server in conjunction with IDA with Vladimir Scherbina’s IDA plugin

December 5th, 2006

Vladimir Scherbina has recently released a useful IDA plugin that enhances IDA’s now built-in support for loading symbols via the symbol server to allow custom symbol server paths. This is something I personally have been wanting for some time; IDA’s PDB loading mechanism overrides _NT_SYMBOL_PATH with a hardcoded value of the Microsoft symbol server. This breaks my little trick for injecting symbol server support into programs that do not already support it, which is fairly annoying. Now, with Vladimir’s plugin, you can have IDA use a custom symbol server without having to hack the PDB plugin and change its hardcoded string constant for the Microsoft symbol server path. (Plus, you can have IDA use your local downstream store cache as well – addressing another disadvantage of how IDA normally loads symbols via PDB.)

Improve performance in kernel debugging and remote debugging with WinDbg by closing extra debugger windows

December 5th, 2006

Here’s a little known tidbit of information that you might find useful while debugging things:

All the extra windows that you may have open in WinDbg (disassembly, registers, memory, and so forth) actually slow the debugger down. Really.

This is typically only noticeable when there is a non-local connection between your local debugger and the target. Primarily, then, this applies to kernel debugging and remote debugging.

These windows slow down the debugger because they represent extra information that the debugger must fetch each time the target breaks into the debugger. The debugger engine tends to follow a rule of “lazy fetching”; that is, it will try to avoid asking the target for extra information until it absolutely must do so. If you don’t have your disassembly window open, then, you’ll save the debugger from having to read memory from the disassembly target address on every debugger event that causes a breakin (until you actually explicitly request that information via the “u” command, for instance). There are similar speed consequences for having the registers window open (or setting your default register mask to show all registers, as usermode debugging does by default – this is, I suspect, why kernel debugging defaults to a very minimal register mask that is displayed on each prompt), and for having the memory window open. Keeping these windows open will require the debugger to go out of its way to request extra register information (and to read memory on each debugger event).

The result of having these extra windows open is that the debugger will just seem significantly slower when you are doing things like stepping or hitting breakpoints and continuing execution. If you’ve ever wondered why sometimes it takes the debugger a couple seconds or more to come back after each step or trace operation, this is a big contributor to that (especially in kernel debugging).

So, close those extra windows unless you really need them, as they really do slow things down in kernel debugging and remote debugging scenarios. You’ll find that your kernel debugging and remote debugging experiences are much more responsive without the debugger having all those extra round trip times being inserted on each trace/step/break event.

Oh, and you can also temporarily suspend the behavior of keeping these windows up to date while still keeping them open with a command that was recently introduced to WinDbg: “.suspend_ui flag”, where flag is either 0 to disable UI window refreshing, or 1 to enable it. This is mostly useful for scenarios where you still find the disassembly or memory windows useful, but are about to execute a large number of traces or steps and want to reduce overhead for the duration of that activity only.

Analysis of a networking problem: The case of the mysterious SMB connection resets (or “How to not design a network protocol”)

December 4th, 2006

Recently, I had the unpleasant task of troubleshooting a particularly strange problem at work, in which a particular SMB-based file server would disconnect users if more than one user attempted to simultaneously initiate a file transfer. This would typically result in a file transfer (i.e. drag and drop file copy) initiated by Explorer failing with a “The specified network name is no longer available.” (error #64) dialog box. As this particular problem involved SMB traffic going through a fair amount of custom code we had written, being a VPN/remote access company (this particular problem was seemingly only occurring when using our VPN software), we (naturally) assumed that the issue was caused by some sort of bug in our software that was doing something to upset either the Windows SMB client or SMB server.

So, first things first; after doing initial troubleshooting, we obtained some packet captures of the problematic SMB server (and SMB clients), detailing just what was happening on the network when the problem occurred.

Unfortunately, in this particular case, the packet captures did not really end up providing a whole lot of useful information; they did, however, succeed in raising more questions. What we observed is that in the middle of an SMB datastream, the SMB server would just mysteriously send a TCP RST packet, thereby forcibly closing the TCP connection on which SMB was running. This corresponded exactly with one of the file share clients getting the dreaded error #64 dialog, but there was no clue as to what the problem was. In this particular case, there was no packet loss to speak of, and nothing else to indicate some kind of connectivity problem; the SMB server just simply sent an RST seemingly out of the blue to one of the SMB clients shortly after a different SMB client attempted to initiate a file transfer.

To make matters worse, there was no correlation at all as to what the SMB client whose connection got killed was doing when the connection got reset. The client could be either submitting a read request for more data, waiting for a previously sent read request to finish processing, or doing any other operation; the SMB server would just mysteriously close the connection.

In this particular case, the problem would also only occur when SMB was used in conjunction with our VPN software. When the SMB server was accessed over the LAN, the SMB connection would operate fine in the presence of multiple concurrent users. Additionally, when the SMB server was used in conjunction with alternative remote access methods other than our standard VPN system, the problem would mysteriously vanish.

By this time, this problem was starting to look like a real nightmare. The information we had said that there was some kind of problem that was preventing SMB from being used with our VPN software (which obviously would need to be fixed, and quickly), and yet gave no actual leads as to what might cause the problem to occur. According to logs and packet captures, the SMB server would just arbitrarily reset connections of users connecting to SMB servers when used in conjunction with our VPN software.

Fortunately, we did eventually manage to duplicate the problem in-house with our own internal test network. This eventually turned out to be key to solving the problem, but did not (at least initially) provide any immediate new information. While it did prevent us from having to bother the customer this problem was originally impacting while troubleshooting, it did not immediately get us closer to understanding the problem. In the mean time, our industrious quality assurance group had engaged Microsoft in an attempt to see if there was any sort of known issue with SMB that might possibly explain this problem.

After spending a significant amount of time doing exhaustive code reviews of all of our code in the affected network path, and banging our collective heads on the wall while trying to understand just what might be causing the SMB server to kill off users of our VPN software, we eventually ended up hooking up a kernel debugger to an SMB server machine exhibiting this problem in order to see if I could find anything useful by debugging the SMB server (which is a kernel mode driver known as srv.sys). After not getting anywhere with that initially, I decided to try and trace the problem from the root of its observable effects; the reset TCP connection. Through the contact that our quality assurance group had made with Microsoft Product Support Services (PSS), Microsoft had supplied us with a couple of hotfixes for tcpip.sys (the Windows TCP stack) for various issues that ultimately turned out to be unrelated to the underlying trouble with SMB (and did not end up alleviating our problem). Although the hotfixes we received didn’t end up resolving our problem, we decided to take a closer look at what was happening inside the TCP state machine when the SMB connections were being reset.

This turned out to hit the metaphorical jackpot. I had set a breakpoint on every function in tcpip.sys that is responsible for flagging TCP connections for reset, and the call stack caught the SMB server (srv.sys) red-handed:

kd> k
ChildEBP RetAddr  
f4a4fa98 f4cad286 tcpip!SendRSTFromTCB
f4a4fac0 f4cb0ee3 tcpip!CloseTCB+0xbc
f4a4fad0 f4cb0ec7 tcpip!TryToCloseTCB+0x38
f4a4faf4 f4cace69 tcpip!TdiDisconnect+0x205
f4a4fb40 f4cabbed tcpip!TCPDisconnect+0xfd
f4a4fb5c 804e37f7 tcpip!TCPDispatchInternalDeviceControl+0x14d
f4a4fb6c f4c7d132 nt!IopfCallDriver+0x31
f4a4fb84 f4c7cd93 netbt!TdiDisconnect+0x10a
f4a4fbb4 f4c7d0b8 netbt!TcpDisconnect+0x40
f4a4fbd4 f4c7d017 netbt!DisconnectLower+0x42
f4a4fc14 f4c7d11f netbt!NbtDisconnect+0x339
f4a4fc44 f4c7ba5a netbt!NTDisconnect+0x4b
f4a4fc60 804e37f7 netbt!NbtDispatchInternalCtrl+0xb4
f4a4fc70 f43fb9c1 nt!IopfCallDriver+0x31
f4a4fc7c f441e8db srv!StartIoAndWait+0x1b
f4a4fcb0 f441f13e srv!SrvIssueDisconnectRequest+0x4d
f4a4fccc f43eed3e srv!SrvDoDisconnect+0x18
f4a4fce4 f440d3b5 srv!SrvCloseConnection+0xec
f4a4fd18 f44007ae srv!SrvCloseConnectionsFromClient+0x163
f4a4fd88 f43fba98 srv!BlockingSessionSetupAndX+0x274

As it turns out, the SMB server was explicitly disconnecting the pre-existing SMB client when the second SMB client tried to setup a session with the SMB server. This explains why even though the pre-existing SMB client was operating normally, not breaking the SMB protocol and running with no packet loss, it would mysteriously have its connection reset for apparently no good reason.

Further analysis on packet captures revealed that there was always a correlation to one client sending an SMB_SESSION_SETUP_ANDX command to the SMB server and all other clients on the SMB server being abortively disconnected. After digging around a bit more in srv.sys, it became clear that what was happening is that the SMB server would explicitly kill all SMB connections from an IPv4 address that was opening a new SMB session (via SMB_SESSION_SETUP_ANDX), except for the new SMB TCP connection. It turns out that this behavior is engaged if the SMB client sets the “VcNumber” field of the SMB_SESSION_SETUP_ANDX request to a zero value (which the Windows SMB redirector, mrxsmb.sys, does), and the client enables extended security on the connection (this is typically true for modern Windows versions, say Windows XP and Windows Server 2003).

This explains the problem that we were seeing. In one of the configurations, this customer was set up to have remote clients NAT’d behind a single LAN IP address. This, when combined with the SMB redirector sending a zero VcNumber field, resulted in the SMB server killing everyone else’s SMB connections (behind the NAT) when a new remote client opened an SMB connection through the NAT. Additionally, this also fit with an additional piece of information that we eventually uncovered from running into this trouble with customers; one customer in particular had an old Windows NT 4.0 fileserver which always worked perfectly over the VPN, but their newer Windows Server 2003 boxes would tend to experience these random disconnects. This was because NT4 doesn’t support the extended security options in the SMB protocol that Windows Server 2003 does, further lending credence to this particular theory. (It turns out that the code path that disconnects users for having a zero VcNumber is only active when extended security is being negotiated on an SMB session.)

Additionally, someone else at work managed to dig up knowledge base article 301673, otherwise titled “You cannot make more than one client connection over a NAT device”. Evidently, the problem is actually documented (and has been so since Windows 2000), though the listed workaround of using NetBIOS over TCP doesn’t seem to actually work. (As an aside, it’s pretty silly that of all things, Microsoft is recommending that people fall back to NetBIOS. Haven’t we been trying to get rid of NetBIOS for years and years now…?).

Looking around a bit on Google for documentation about this particular problem, I ran into this article about SMB/CIFS which documented the shortcoming relating to SMB and NAT. Specifically:

Whenever a new transport-layer connection is created, the client is supposed to assign a new VC number. Note that the VcNumber on the initial connection is expected to be zero to indicate that the client is starting from scratch and is creating a new logical session. If an additional VC is given a VcNumber of zero, the server may assume that any existing connections with that same client are now bogus, and shut them down.

Why do such a thing?

The explanation given in the LANMAN documentation, the Leach/Naik IETF draft, and the SNIA doc is that clients may crash and reboot without first closing their connections. The zero VcNumber is the client’s signal to the server to clean up old connections. Reasonable or not, that’s the logic behind it. Unfortunately, it turns out that there are some annoying side-effects that result from this behavior. It is possible, for example, for one rogue application to completely disrupt SMB filesharing on a system simply by sending Session Setup requests with a zero VcNumber. Connecting to a server through a NAT (Network Address Translation) gateway is also problematic, since the NAT makes multiple clients appear to be a single client by placing them all behind the same IP address.

So, at least we (finally) found out just what was causing the mysterious resets. Apparently, from a protocol design standpoint, SMB is deliberately incompatible with NAT.

As someone who has designed network protocols before, this just completely blows my mind. I cannot think of any good reason to possibly justify breaking NAT, especially with the incredible proliferation of NAT devices in every day life (it seems like practically everyone has a cable/DSL/wireless router type device that does NAT today, not to mention the increased pressure to reuse IP addresses as pressure on the limited IPv4 address space grows). Not to mention that the Windows file sharing protocol has to be one of the most widely used networking protocols in the world. Breaking NAT for that (evidently just for the sake of breaking NAT!) just seems like such an incredibly horrible, not-well-thought-out design decision. Normally, I am usually fairly impressed by how well Microsoft designs their software (particularly their kernel software), but this particular little part of the SMB protocol design is just uncharacteristically completely wrong.

Even more unbelievably, the stated reason that Microsoft gave for this behavior is an optimization to handle the case where the user’s computer has crashed, rebooted, and reconnected to the server before the SMB server’s TCP stack has noticed that the crashed SMB client’s TCP connection has gone away. Evidently, conserving server resources in the case of a client crash is more important than being compatible with NAT. One has to wonder how unstable the Microsoft SMB redirector must have been at the time that this “feature” was added to the SMB protocol to make anyone in their right mind consider such an absolutely ridiculous, mind-bogglingly bad tradeoff.

To date, we haven’t had a whole lot of luck in trying to get Microsoft to fix this “minor” problem in SMB. I’ll post an update if we ever succeed, but as of now, things are unfortunately not looking very promising on that front.

SDbgExt 1.09 released (support for displaying x64 EH data)

November 23rd, 2006

I’ve put out SDbgExt 1.09. This is an incremental release of my collection of WinDbg debugger extensions.

The 1.09 release primarily adds support for displaying exception handler data on x64. While there is “some” built-in debugger support for this (via the “.fnent” command), this support is extremely minimal. You are essentially required to dump the unwind data structures yourself and manually parse them out, which isn’t exactly fun. So, I added support for doing all of that hard work to SDbgExt, via the !fnseh SDbgExt extension (display function SEH data). This support is complementary to the !exchain command supplied by ext.dll for x86 targets.

The “!fnseh” command supports displaying most of the interesting fields of the unwind metadata (besides information on how the prologue works). It also properly supports chained unwind information records (both the documented and undocumented formats). There is also basic support for detecting and processing CL’s C/C++ exception scope tables, if a function uses C language exception handling (__try/__except/__finally).

Here’s a couple of quick examples of it:

1: kd> !fnseh nt!CcPinRead
nt!CcPinRead L6a 2B,10 [ U ] nt!_C_specific_handler (C)
> fffff800012bf937 L2 (fffff800012fe4c0 -> fffff80001000000)
> fffff800012bf939 L16 (fffff800012fe4c0 -> fffff80001000000)
> fffff800012bf94f L5b (fffff800012fe4c0 -> fffff80001000000)
> fffff800012bf9aa L48 (fffff800012fe4c0 -> fffff80001000000)
> fffff800012c5199 Ld (fffff800012fe4c0 -> fffff80001000000)
> fffff800012c51a6 L58 (fffff800012fe4c0 -> fffff80001000000)
> fffff800012c51fe L1b (fffff800012fe4c0 -> fffff80001000000)
1: kd> !fnseh nt!CcCopyRead
nt!CcCopyRead Lae 3A,10 [E  ] nt!_C_specific_handler (C)
> fffff80001272c01 Lbf (fffff800012fe2e0 -> fffff800012c4c39)
> fffff80001272cc0 Lc (fffff800012fe2e0 -> fffff800012c4c39)
> fffff800012871f4 L2b (fffff800012fe2e0 -> fffff800012c4c39)
> fffff8000128721f L5a (fffff800012fe2e0 -> fffff800012c4c39)
> fffff800012961b1 L8 (fffff800012fe2c0 -> fffff800012c4b93)
> fffff800012961b9 L56 (fffff800012fe2c0 -> fffff800012c4b93)
> fffff800012c4aae Lcc (fffff800012fe2c0 -> fffff800012c4b93)
> fffff800012c4b7a L19 (fffff800012fe2c0 -> fffff800012c4b93)
1: kd> !fnseh nt!NtAllocateVirtualMemory
nt!NtAllocateVirtualMemory L5e 30,10 [E  ]
nt!_C_specific_handler (C)
> fffff8000103f74f L22 ( -> fffff8000103f771)
> fffff8000103f7f9 L16 ( -> fffff8000109adbf)
> fffff8000105ed3e L46 (fffff800010f3dc0 -> fffff8000105f173)
> fffff8000105f14d L22 ( -> fffff8000105f16f)
1: kd> !fnseh nt!KiSystemCall64
nt!KiSystemCall64 L390 50,0C [EU ]
nt!KiSystemServiceHandler (assembler/unknown)
0:000> !fnseh ntoskrnl + 00001180
401180 L2e 29,06 [  C] <none> (none)
 ntoskrnl!CcUnpinFileDataEx L32 13,05 [   ] <none> (none)

The basic output format for unwind information, as presented by the extension, is as follows:

EH-start-address LEH-effective-length prologue-size,unwind-code-count [unwind-flags (Exception handler, Unwind handler, Chained unwind information)] exception-handler (exception-handler-language)

Additionally, if the extension thinks that the function in question is using C language support, it will display each of the scope table entries as well (scope table entries divide up the various regions in a function that may have a __try or __except; there is typically only one lowest-level exception handler per function, with the scope table being used to implement multiple __try/__except clauses per function):

> EH-start-address LEH-effective-length (__except-filter-function (optional) -> __except-handler-function)

This information can be useful for tracking down exception filters and handlers to breakpoint on and the like, as EH registrations are completely isolated from code on x64. Note that not all functions may have exception or unwind handlers, as unwind information is only required for x64 functions that modify the stack or call other functions.

For more information on how x64 exception handling works under the hood, you might look at my article on the subject, or skape’s paper about x64 Windows binary analysis.

The kernel object namespace and Win32, part 3

November 22nd, 2006

Recently, I have posted about various features of the kernel object namespace. Here’s the table of contents for this series:

  1. The kernel object namespace and Win32, part 1.
  2. The kernel object namespace and Win32, part 2.
  3. The kernel object namespace and Win32, part 3.

This posting (the last in the series) attempts to focus on the remaining two parts of the kernel namespace that are visible to Win32. These two parts are broken up as follows:

  1. DOS devices. DOS device names are object names that can be manipulated with the file management functions, such as CreateFile or DefineDosDevice. The set of DOS devices also includes traditional DOS device names (hence the name), such as NUL, CON, AUX, and so forth. This object namespace is “sessionized” by Terminal Server.
  2. Window station and desktop objects. These are in fact named kernel objects, although this fact is only barely made visible to Win32. Like DOS devices, window stations and desktops have a namespace that is “sessionized” in the presence of Terminal Server.

First, I’ll cover how DOS devices are “sessionized”. This information reflects how the namespace is implemented in Windows XP and later.

In Windows XP, DOS devices are virtualized based off of the LSA logon session ID a token has associated with it (otherwise known as an authentication ID). This value is actually a bit more granular than the Terminal Server session ID.

While a Terminal Server session ID represents a complete “terminal” (essentially, a complete input/output stack with all of the software support that goes with that – a “session”), an LSA logon session ID simply represents an instance of a user that is logged on to the local computer.

Now, in normal circumstances, there is typically a very similar correlation between Terminal Server sessions and LSA logon sessions. For each Terminal Server session, there is typically at least one LSA logon session, which corresponds to the user that is logged on interactively to that Terminal Server session. However, this is not necessarily a one-to-one relationship; for instance, a Terminal Server session that is idle does not really have an LSA logon session ID associated with it (other than perhaps the system logon session, under which winlogon and csrss operate). Or, a Terminal Server session could have many LSA logon session IDs associated with it; think of the case where a user starts a program under a different account with the “RunAs” command. To add to that, some LSA logon session IDs may not even correspond to a local user at all; for example, a remote network logon from a different computer that corresponds to an attempt to access a share hosted on the local machine may be given a logon session ID, even though it corresponds to no user physically logged on to the local machine.

The reason why DOS devices must be virtualized is fairly obvious; different users may have different drive maps, and you need to be able to map the same drive letter as someone else in a different session. The way this was handled needed to be more granular than Terminal Server sessions, though (at least in Windows XP); this is due to how RunAs interacts with drive share access and network drive mappings.

Specifically, the behavior of associating DOS devices with an LSA logon session ID is used because the credentials used to access a remote share are associated with the LSA logon session of the calling user. This means that if you start a program under RunAs, it won’t have access to the same credentials as the parent LSA logon session. As a result, a program started via RunAs doesn’t really have access to network drive mappings (in particular, drive letter based network drive mappings). Because of this, it makes more sense to isolate DOS devices based on LSA logon sessions than Terminal Server sessions, as a separate LSA logon session cannot access drive share connections made by a different LSA logon session anyway.

The resulting architecture that isolates DOS devices on a per-LSA-logon-session basis thus operates as follows:

For each Terminal Server session ID, there is an object directory of the name \Sessions\<TS-session-ID>\DosDevices. Although this object directory is created under the session-named directory for each Terminal Server logon session, only the object directory \Sessions\0\DosDevices is actually used in Windows XP and beyond.

In the \Sessions\0\DosDevices directory, there is a set of subdirectories that are each named by an LSA logon session ID, in the form of a 64-bit hexadecimal number with two 32-bit halves separated by a dash, with each 32-bit half padded out to 8 characters by leading zeros (Terminal Server session IDs used to name the \Sessions\<TS-session-ID> object directories are decimal with no leading zeroes). So, you might see a directory named in the form \Sessions\0\DosDevices\00000000-00b73dfe for an LSA logon session ID 0x0000000000b73dfe.
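As an illustration, the per-logon-session DosDevices path for the calling process could be built from the token’s authentication ID along these lines (a quick sketch; error handling omitted):

#include <windows.h>
#include <stdio.h>

int main(void)
{
   HANDLE           Token;
   TOKEN_STATISTICS Stats;
   DWORD            Returned;
   WCHAR            Path[MAX_PATH];

   // The LSA logon session ID (authentication ID) is a LUID stored in
   // TOKEN_STATISTICS.AuthenticationId.
   OpenProcessToken(GetCurrentProcess(), TOKEN_QUERY, &Token);
   GetTokenInformation(Token, TokenStatistics, &Stats, sizeof(Stats),
      &Returned);
   CloseHandle(Token);

   // Two 32-bit halves, each zero-padded to 8 hex digits, separated by a
   // dash, as described above.
   swprintf_s(Path, MAX_PATH, L"\\Sessions\\0\\DosDevices\\%08lx-%08lx",
      (unsigned long)Stats.AuthenticationId.HighPart,
      (unsigned long)Stats.AuthenticationId.LowPart);

   wprintf(L"%ls\n", Path);

   return 0;
}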

Each of these object directories named by an LSA logon session ID contains any DOS devices that were created by that LSA logon session. The way the system evaluates DOS device names is by first trying the LSA logon session local object directory, and then trying the global DOS device object directory (which is named \GLOBAL??, and has DOS devices that are associated with no one session in particular, such as physical hard drive DOS device names like C:, or predefined DOS devices like NUL or PhysicalDrive0). To override this lookup behavior and specify that a DOS device name only references the corresponding object in the global DOS device directory, the “Global\” prefix may be used with the DOS device name (similar to named kernel objects, such as mutex objects). This support is handled by a symbolic link named “Global” that is created in each LSA-logon-session ID-named object directory. This symbolic link refers to the “\GLOBAL??” object directory. (Of course, only \DosDevices is publicly documented, to my knowledge.)
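To tie this back to Win32, creating and resolving a DOS device from user mode is just a matter of DefineDosDevice and QueryDosDevice; a small sketch (the drive letter and target path are placeholders, and the drive letter is assumed to be unused):

#include <windows.h>
#include <stdio.h>

int main(void)
{
   WCHAR Target[MAX_PATH];

   // Create a DOS device named "X:"; per the discussion above, on XP and
   // later the resulting symbolic link lands in the calling user's
   // per-LSA-logon-session DosDevices directory.
   if (!DefineDosDeviceW(0, L"X:", L"C:\\Temp"))
      return 1;

   // Resolve the name back to the target of the symbolic link.
   if (QueryDosDeviceW(L"X:", Target, MAX_PATH))
      wprintf(L"X: -> %ls\n", Target);

   // Remove the mapping again.
   DefineDosDeviceW(DDD_REMOVE_DEFINITION, L"X:", NULL);

   return 0;
}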

If you’ve been reading the previous articles, you may be wondering how the “\DosDevices” and “\??” object directories relate to the “\GLOBAL??” object directory that I just said contained all global DOS device symbolic links (after all, the DDK documents these symbolic links as being present in “\DosDevices”). Well, the way it ends up working out is that there is a symbolic link named “\DosDevices” which points to “\??”. The “\??” name is a bit of a hack; it is not actually a real object name, but instead a hardcoded special case in the object manager’s parser routines that is interpreted as a reference to the global DOS device directory. (This was done for performance reasons; it is very common to look up objects in the \DosDevices directory, and adding a special case check first, before the normal object directory traversal, speeds up name translations for Win32 names a bit.) As a result, you can actually use “\??”, “\DosDevices”, and “\GLOBAL??” to refer to global DOS devices; they all end up pointing to the same underlying location (eventually).

The other part of Terminal Server session instancing on the kernel object namespace deals with window stations and desktops. Window stations for a particular Terminal Server session are placed in a directory named in the form of \Sessions\<TS-Session-ID>\Windows\WindowStations\<Window-Station-Name> (desktop objects have names that treat a window station as an object directory, such as \Sessions\2\Windows\WindowStations\WinSta0\Default). Note that for session zero, this convention is slightly different; instead, window stations for session zero reside in \Windows\WindowStations.
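For what it’s worth, you can retrieve the name component of your own process’s window station from Win32 with GetUserObjectInformation; a small sketch (this returns just the name, such as WinSta0, not the full kernel object path described above):

#include <windows.h>
#include <stdio.h>

int main(void)
{
   HWINSTA WindowStation;
   WCHAR   Name[256];
   DWORD   Needed;

   WindowStation = GetProcessWindowStation();

   // UOI_NAME retrieves the object name of the window station.
   if (GetUserObjectInformationW(WindowStation, UOI_NAME, Name,
      sizeof(Name), &Needed))
   {
      wprintf(L"Window station: %ls\n", Name);
   }

   return 0;
}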

That wraps up the series on the kernel object namespace (as it relates to Win32). By now, you should have a working understanding of how to use the kernel object namespace from Win32, and how the “magic” behind Terminal Server and LSA logon session isolation is actually implemented. (It is worth noting that the abstraction layer that Win32 provides added real value here, as there were no gross hacks required to be added on to the kernel object namespace to support Terminal Server. Instead, everything is implemented as plain object directories and symbolic links.)

Terminal Server session isolation for the rest of Win32 (namely, the input/output portion) is much more complicated and is hardly as clean as the separation imposed on the object namespace is. That, however, is a story for a different series…

Frame pointer omission (FPO) optimization and consequences when debugging, part 1

November 21st, 2006

During the course of debugging programs, you’ve probably run into the term “FPO” once or twice. FPO refers to a specific class of compiler optimizations that, on x86, deal with how the compiler accesses local variables and stack-based arguments.

With a function that uses local variables (and/or stack-based arguments), the compiler needs a mechanism to reference these values on the stack. Typically, this is done in one of two ways:

  • Access local variables directly from the stack pointer (esp). This is the behavior if FPO optimization is enabled. While this does not require a separate register to track the location of locals and arguments, as is needed if FPO optimization is disabled, it makes the generated code slightly more complicated. In particular, the displacement from esp of locals and arguments actually changes as the function is executed, due to things like function calls or other instructions that modify the stack. As a result, the compiler must keep track of the actual displacement from the current esp value at each location in a function where a stack-based value is referenced. This is typically not a big deal for a compiler to do, but in hand written assembler, this can get a bit tricky.
  • Dedicate a register to point to a fixed location on the stack relative to local variables and stack-based arguments, and use this register to access locals and arguments. This is the behavior if FPO optimization is disabled. The convention is to use the ebp register to access locals and stack arguments. Ebp is typically set up such that the first stack argument can be found at [ebp+08], with local variables typically at a negative displacement from ebp.

A typical prologue for a function with FPO optimization disabled might look like this:

push   ebp               ; save away old ebp (nonvolatile)
mov    ebp, esp          ; load ebp with the stack pointer
sub    esp, sizeoflocals ; reserve space for locals
...                      ; rest of function

The main concept is that when FPO optimization is disabled, a function will immediately save away ebp (as the first operation touching the stack), and then load ebp with the current stack pointer. This sets up a stack layout like so (relative to ebp):

[ebp-01]   Last byte of the last local variable
[ebp+00]   Old ebp value
[ebp+04]   Return address
[ebp+08]   First argument...

Thereafter, the function will always use ebp to access locals and stack based arguments. (The prologue of the function may vary a bit, especially with functions using a variation __SEH_prolog to setup an initial SEH frame, but the end result is always the same with respect to the stack layout relative to ebp.)

This does (as previously stated) make it so that the ebp register is not available for other uses to the register allocator. However, this performance hit is usually not enough to be a large concern relative to a function compiled with FPO optimization turned on. Furthermore, there are a number of conditions that require a function to use a frame pointer which you may hit anyway:

  • Any function using SEH must use a frame pointer, as when an exception occurs, there is no way to know the displacement of local variables from the esp value (stack pointer) at exception dispatching (the exception could have happened anywhere, and operations like making function calls or setting up stack arguments for a function call modify the value of esp).
  • Any function using automatic C++ objects with destructors must use SEH for compiler unwind support. This means that most C++ functions end up with FPO optimization disabled. (It is possible to change the compiler assumptions about SEH exceptions and C++ unwinding, but the default [and recommended setting] is to unwind objects when an SEH exception occurs.)
  • Any function using _alloca to dynamically allocate memory on the stack must use a frame pointer (and thus have FPO optimization disabled), as the displacement from esp for local variables and arguments can change at runtime and is not known to the compiler at compile time when code is being generated.

Because of these restrictions, many functions you may be writing will already have FPO optimization disabled, without you having explicitly turned it off. However, it is still likely that many of your functions that do not meet the above criteria have FPO optimization enabled, and thus do not use ebp to reference locals and stack arguments.

Now that you have a general idea of just what FPO optimization does, I’ll cover why it is to your advantage to turn off FPO optimization globally when debugging certain classes of problems in the second half of this series. (It is actually the case that most shipping Microsoft system code turns off FPO as well, so you can rest assured that a real cost benefit analysis has been done between FPO and non-FPO optimized code, and it is overall better to disable FPO optimization in the general case.)

Update: Pavel Lebedinsky points out that the C++ support for SEH exceptions is disabled by default for new projects in VS2005 (and that it is no longer the recommended setting). For most programs built prior to VS2005 and using the defaults at that time, though, the above statement about C++ destructors causing SEH to be used for a function (and thus requiring the use of a frame pointer) still applies.

Impressions of Vista

November 20th, 2006

Since Vista went RTM, I decided that it was finally time to bite the bullet, so to speak, and install a copy of Vista on a box I use on at least a semiregular basis (since I’ll have to deal with it sooner or later, as it’s going to be in stores soon).

I went into this with a fair amount of reservations, having encountered plenty of annoyances with Vista before (things like UAC prompts as a limited user for things like event viewer or regedit, or other not-so-polished things that were really annoying in the previous releases). However, I pretty much reversed my opinion after what I encountered…

The first thing was installation. Gone completely is text mode setup (although that isn’t something that I am all that excited about – setup is setup, after all, as long as it works). It’s definitely a big visual change from the previous versions of setup, and it’s perhaps a bit less power-user-ish (no option to format drives as NTFS or things like change the Windows install directory on a clean install that I saw, although I might have missed something). Installation went fairly smoothly, although it definitely took its sweet time.

The first real surprise, though, was that after Vista (finally) finished installing, I did not need a single third party driver. Not one. Even my Bluetooth transceiver and my GeForce 6800 Go just worked out of the box at full functionality without my having to do anything extra.

This, in and of itself, is a pretty impressive thing. Normally, you’d virtually always have to go and install drivers for your wireless NIC, Bluetooth adapter, video card (if you wanted to do 3D), and so forth, but with my Vista install, everything was golden out of the box. Obviously, your experience may vary depending on your hardware, but having the whole thing work at full functionality out of the box (for a non-server Windows flavor) on a brand new (clean) install is something I haven’t had in a looong time.

Now, that isn’t to say that things were perfect. I did have to do a fair amount of tweaking settings to get things to my liking, but most of the things I had to change were truly user preference type things and not necessities like setting up drivers and the like.

Like Srv03 x64, Vista prompts you right away to enable Windows Update in auto-install mode (and to check for updates immediately). This is definitely a Good Thing from a security perspective for your average user, who is likely not to bother with that sort of thing unless it’s put right in their face the first time they log on to their new box.

For my configuration, I had the Vista box joined to a domain, and I intended to use a domain account (a limited user) as my main user account. This turned out to work fairly well; there was the usual business of adding the computer to the domain, but unlike Windows XP, Vista doesn’t turn off the user switching features (or the new logon UI) when the box is domain-joined. Being able to run multiple sessions is a real benefit for me, since I run as a limited user; for complex tasks, instead of doing runas, I find it much easier to just start up another session entirely. Previously, you could only do this on server OS’s by starting, say, a TS session to localhost and logging on as admin. With Vista, you can do that naturally with user switching, even if you’re joined to a domain.

Speaking of user switching, another nice change is that switching sessions no longer abruptly cuts off A/V output during the switch, as it does in XP (where you would usually have to restart your song in WMP after coming back to your original session). The new audio architecture (you can thank Larry Osterman in part for that) is also pretty cool – in particular, being able to control volume on a per-application basis, even for programs that don’t have their own volume sliders.

A couple of other pet peeves of mine from XP are fixed as well. For instance, you finally no longer need to have the SeSystemtimePrivilege in order to pull up a calendar by clicking on the clock in the notify icon area. Also, the “there is nothing interesting here, go away” nag notices that used to plague the root partition, Windows, and System32 directories of your Windows install after a fresh install are gone as well.

UAC turned out to be not quite as bad as I had thought it would be. I was already planning on running as a limited user, so for me, UAC is pretty much just a glorified RunAs. Looked at in that light, it’s somewhat better than before, in that you can start some things as yourself and only later elevate them to admin if necessary (especially good for things launched by other applications instead of the shell, where you might not have a good place to insert a “runas” command or shortcut to boost the new process to admin). I did end up having to disable UAC file system and registry virtualization, though (it’s under Security Options in Group Policy – thanks to Alex Ionescu for that tip – and be warned that it looks like you need to reboot, and not just run “gpupdate.exe /force”, to apply the policy). UAC virtualization ended up being the root cause of some nasty performance issues in a couple of programs I was using, which was strange given that I had already done the work necessary to give them write access to the things they needed to write to in order to function. On the whole, though, I think I’ll be able to live with UAC as a limited user.

UAC is a bit over the top for my tasks as administrator, since the only time I’m doing something as admin is when I know I really want to do it as admin, which makes the UAC confirmations there rather redundant at best. However, I still prefer the “improved RunAs” aspect of it while running as a limited user, so I think I’ll keep UAC enabled for now (especially since I don’t end up having to do most daily tasks as admin – for the most part, just installing things and reconfiguring system-wide settings). Having lived as a limited user on Windows XP and Windows Server 2003 for a long time now, I didn’t find this to be nearly as much of a shock as it might be for someone who is used to only running as an admin.

Last but not least is DWM and the new video driver model. The box I put Vista on to play around with has a GeForce 6800, which (as I mentioned earlier) came with drivers out of the box. It just so happened that those drivers support WDDM (the new video driver model), though the card itself only implements the DX9 DDI (the first DirectX 10 video cards were just released a week or two ago; the GeForce 8 class from NVIDIA).

Vista enabled Aero Glass up front, which is essentially the transparency and 3D window-flip switching support that you may have heard about. This is somewhat cool in and of itself, but what really blew me away is how it held up when I went ahead and tossed an intensive 3D application at the WDDM drivers – World of Warcraft (naturally). WoW had a bit of weirdness running on Vista at first – it ran really, really slowly in windowed mode until I determined, through trial and error, that doing ctrl-alt-del and switching to the secure desktop once after starting WoW (extremely mysteriously) cures the slowness. After working around that little bit of strangeness, I was really surprised to find that in windowed mode you could do 3D window flipping (or have Aero windows obscure WoW) while WoW was running and drawing at full speed, with practically no drop in performance. Seeing a full D3D program rendering away into a window in flip-window mode, alongside the rest of my desktop applications, was definitely very cool, especially when you consider that it all magically “just works” without any modifications to either DirectX programs or plain vanilla GDI programs.

All things considered, Vista was definitely more solid than I thought it would be. There were a couple of snags along the way, but none any worse than with previous versions of Windows, and on the whole the experience was a lot more painless (at least on my system, the hell of trying to find the NVIDIA video driver that crashed the least was gone – the driver that shipped with the OS really worked, and while I twice got a “Your video card has encountered a problem and has recovered” bubble in the notification area, my system didn’t bluescreen and none of my programs crashed – a *big* improvement over what I’ve seen from NVIDIA on Windows XP with previous driver versions written against the old display driver model, to be certain). There are other less-than-savory aspects to consider, though, such as the even more pervasive DRM tied into the OS, although for me that is more a philosophical problem than a practical one, as I make a point of not buying DRM-laden content. Even so, I’m definitely looking forward to Vista becoming generally available in the next month or two.

Windows Vista RTM released to MSDN subscribers

November 17th, 2006

If you’ve got an MSDN subscription with operating systems included, you can now head on over to MSDN subscriber downloads and grab a copy of Windows Vista RTM (final build).