Archive for the ‘Windows’ Category

New WinDbg (6.7.5.0) released

Friday, April 27th, 2007

It’s finally here – WinDbg 6.7.5.0.

I haven’t gotten around to trying out all of the new goodies yet, but there are some nice additions. For one, .fnent now decodes unwind information in a more meaningful way on x64 (although it still doesn’t understand C scope table entries, making it less useful than SDbgExt’s !fnseh if that is what you were interested in.

Looks like they’ve finally gotten around to signing WinDbg.exe too (though, curiously, not the .msi the installer extracts), so the elevation prompts for WinDbg are now of the more friendlier sort instead of the “this program will destroy your computer” sort.

There is also reportedly source server support for CVS included; I imagine that I’ll be taking a stab at that again now that it is supposedly fully baked now.

In other news, the blog (and DNS) will be moving to a more ideal hosting location (read: not my apartment) as early as this weekend (if all goes according to plan, that is). It’ll be moving to a yummy new quad core Xeon box (with a real connection), a nice step up from the original hardware that it has been running on until a short while ago (good riddance). Crossing my fingers, but hopefully the random unavailability been hardware dying on me and Road Runner sucking should be going away Real Soon Now(tm).

A brief discussion of Windows Vista’s IE Protected Mode (and user/process level security)

Wednesday, April 25th, 2007

I was discussing the recent QuickTime bug on Matasano Chargen, and the question of whether it would work in the presence IE7 + Vista and protected mode came up. I figured a more in depth explanation as to just what IE7’s Protected Mode actually does might be in order, hence this posting.

One of the new features introduced with Internet Explorer 7 on Windows Vista is something called “Protected Mode” (or “IE Protected Mode”). It’s an on-by-default security that is sold as something that greatly hardens Internet Explorer against meaningful exploitation, even if an exploitable hole in IE (or a component of IE, such as an ActiveX control) is found.

Let’s dig in a little bit deeper as to what IE Protected Mode is (and isn’t), and what it means for you.

First things first. Protected mode is not related to the “enhanced security configuration” that is introduced in Windows Server 2003 in any way. The “enhanced security configuration” IE included with Srv03 is, at its core, just a set of more restrictive (i.e. locked down) default settings with regard to things like scripting, downloading files, and soforth. Protected mode does not rely on locking down security zone settings to the point where you cannot download files or run any scripts by default, and is completely unrelated to the IE hardening work done in the default Srv03 configuration. I’d imagine that protected mode will probably be included in Longhon Server, but the underlying technologies are very different, and are designed to address different “market segments” (“enhanced security configuration” being just a set of more restrictive defaults, whereas protected mode is a fundamental rethink of how the browser interacts with the rest of the operating system).

Protected mode is a feature that is designed to make “surfing the web a safer experience” for end users. Unlike Srv03, where a locked down IE might fly because you are ostensibly not supposed to be doing lots of fancy web-browser-ish things from a server box, end users are clearly not going to take kindly towards not being permitted to download files, run JavaScript, and soforth in the default configuration.

The way protected mode takes a stab at making things better for the end users of the world is to build upon the new “integrity level” security mechanism that has been introduced into the Windows NT security model starting with Vista, with the goal of making the web browser an “untrusted” process that cannot perform “dangerous” things.

To understand what this means, it’s necessary to know what these new-fangled “integrity levels” in Vista are all about. Integrity levels are assigned to a token representing a user, and tokens are assigned to a process (and can be impersonated by a thread, typically something done by something like an IPC server process that needs to perform something on behalf of a lesser-privileged caller). What’s meaningful about integrity levels is that they allow you to partition what we know of as a “user” into something with multiple different “trust levels” (low, medium, high, with several other infrequently-used levels), such that a thread or a process running as a certain integrity level (or “trust level”) cannot “intefere” with something running at a higher integrity level.

The way this is implemented is by an additional level of security check that is performed when some kind of access rights check is performed. This additional check compares the integrity level of the caller (i.e. the thread or process token’s integrity level) with a new type of field in the security descriptor of the target object (called the “mandatory label“) that specifies what sorts of access a caller of a certain integrity level is allowed to request. The “mandatory label” allows an integrity level to be associated with an object for security checks, and allows three basic policies (lower integrity levels cannot request read access, low integrity levels cannot request write access, lower integrity levels cannot request execute access) to be set, comparing the integrity level of a caller against the integrity level specified with an object’s security descriptor. (Only these three generic access rights may be “guarded” by the integrity level in this way; there is no granularity to allow object specific access rights to be given specific minimum caller integrity levels).

The default settings in most places do not allow write access to be granted to processes of a lower integrity level, and the default minimum integrity level is usually “medium”. The new, label/integrity level access check is performed before conventional ACL-based checks.

In this respect, integrity levels are an attempt to inject something of a sort of process-level security into the NT security model.

If you’re at all familiar with how NT security works, this may be a bit new to you. NT is based upon user-level security, where processes (and threads, in the case of impersonation) run under the context of a user, and derive their security rights (i.e. what securable objects they have access to – files, directories, registry keys, and soforth) and privileges (i.e. the ability to shut down the system, the ability to load a driver, the ability to bypass ACL checks for backup/restore, and soforth) from the user context they run under. The thinking behind this sort of model is that each distinct user on a system will run as, well, a different user. Processes from one user cannot interfere with processes (or files, directories, and soforth) running as a different user, without being granted access to do so (i.e. via an ACL, or by special, administrator-level privileges). The “operating system” (i.e. the services and programs that support the system) conceptually runs as yet another “user”, and is thus ostensibly protected from adverse modifications by malicious users on the system. Each user thus exists in a sort of sandbox, unable to interfere with any other user. Conversely, any process running as a particular user can do anything to any other process (or file or directory) owned by that same user; there is no protection within a user security context.

Obviously, this is a gross oversimplification of the NT security model, but it gets the point across (or so I hope!): The security system in NT revolves around the user as the means to control access in a meaningful fashion. This does make sense in environments like large corporate settings, where many users share the same computer (or where computers are centrally managed), such that users cannot interfere with eachother, and ostensibly cannot attack their computers (i.e. the operating system) because they are running as “plain users” without administrator access and cannot perform “dangerous” tasks.

Unfortunately, in the era of the internet, exploitable software bugs, and computers with end users that run code they do not entirely trust, this model isn’t quite as good as we would like. Because the user is the security boundary, here, if an attacker can run code under your user account, they have full access to all of the processes, files, directories (and soforth) that are accessible to that user. And if that user account happened to be a computer administrator account, then things are even worse; now the attacker has free reign over the entire computer, and everything on it (including all other users present on the box).

Clearly, this isn’t such a great situation, especially given the reality that many users run untrusted code (or more generally, buggy or exploitable code) on a frequent basis. In this Internet-enabled age, user-level security as it has been traditionally implemented isn’t really enough.

There are still ways to make things “work” with user-level security; namely, to give each human several user accounts, specific to the task that they are doing. For example, I might have one user account that I use for browsing and games, and another user account that I use for accessing top secret corporate documents. If the user account that I use to browse the Internet with gets compromised somehow, such as by my running an exploitable program and getting “owned”, then my top secret corporate documents are still safe; the malicious code running under the Internet-browsing-and-games account doesn’t have access to do anything to my secret documents, since they are owned by a different account and the default ACL protects them from other users.

Of course, this is a tough pill to expect end users to swallow; having to switch user accounts as they switch between tasks of differing importance is at best inconvenient and at worst confusing and problematic (for example, if I want to download a file from the Internet for use with my top secret corporate documents, I have to go to (comparatively) a lot of trouble to give it to my other user, and doing so opens an implicit trust relationship between my secret-documents-user and my less-trusted-Internet-browsing user, that the program I just downloaded is 1) not inherently malicious, 2) not tampered with or compromised, and 3) not full of exploitable holes that would put my documents at risk anyway the moment my secret-documents-user runs it). Clearly, while you could theoretically still get by with user level access in today’s world, as a single user, doing so as it is implemented in Windows today is a major pain (and even with everyone’s best intentions, few people I have seen really follow through completely with the concept and do not share programs or files between their users in any way whatsoever).

(Note that I am not suggesting that things like running as nonadmin or breaking tasks up into different users are a lost cause, just that to get things truly right and secure, it is a much more difficult job than one might expect initially, so much so that most “joe users” will not stand a chance at doing it perfectly. I’m also not trying to knock on user-level security as just outright flawed, useless, or broken, but the fact remains there are problems in today’s world that merit additonal consideration.)

Whew, that’s a rather long segway into user-level security. Anyways, protected mode is Microsoft’s first real attempt to tackle this problem – the fact that user level security does not always provide fine enough granularity, in the fact of untrusted or buggy programs – in a consumer-level system, in such a way that is palatable to “joe users”. The way that it works is to leverage the integrity level mechanism to create an additonal security barrier between the user’s web browser (i.e. Internet Explorer in protected mode) and the rest of the user’s files and programs. This is done by assigning the IE process a low integrity level. Following with what we know of integrity levels above, this means that the IE process will be denied access (by the security manager in the kernel) to do things like overwrite your documents, place malicious programs in your “startup” Start Menu directory, overwrite executables in your system directory (of course, if you were already running as a plain user, it wouldn’t be able to do this anyway…), and soforth. This is clearly a good thing. In case the implications haven’t fully sunk in yet:

If an attacker compromises a low integrity process, they should not be able to destroy your data or install a trojan (or other malicious code) on your system*.

(*: This is, of course, barring implementation errors, design oversights, and local privilege escalation holes. The latter may prove to be an especially important sticking point, as many companies (Microsoft included) have often “looked down” upon local privilege escalation bugs as relatively not important to fix in a timely fashion. Perhaps the introduction of process-level security control will help add impetus to shatter the idea that leaving local privilege escalation holes sitting around is okay.)

Now, this is a very important departure from where we have been traditionally with user level access control. Combining per process access control with per user access control allows us to do a lot more to protect users from malicious code and buggy software (in other words, protecting users from themselves), in a fashion that is much easier to deal with from a user perspective.

However, I think it would be premature to say that we’re “all the way there” yet. Protected mode and low integrity level processes are definitely a great step in the right direction, but there still remain issues to be solved. For example, as I alluded to previously, the default configuration allows medium integrity objects to still be opened for read access by low integrity callers. This means that, for example, if an attacker compromises an IE process running in protected mode, they still do have a chance at doing some damage. For instance, even though an attacker might not be able to destroy your data, per-se, he or she can still read it (and thus steal it). So, to continue with our previous example of someone who works with top secret corporate documents, an attacker might not be able to wipe out the company financial records, or joe user’s credit card numbers, but he or she could still steal them (and post them on the Internet for anyone to see, or what have you). In other words, an attacker who compromises a low integrity process can’t destroy all your data (as would be the case if there were no process-level security and we were dealing with just one user account), but he or she can still read it and steal it.

There are other things to watch out for, too, with protected mode. Don’t get into the habit of clicking “OK” on that “are you sure you want this program to run outside of IE Protected Mode” dialog box, or you’re setting yourself up to be burned by clever malware. And certainly never click the “don’t ask me again” check box on the consent dialog, or you’re just begging for some piece of malware to abuse your implicit consent without you even realizing that something’s gone wrong. (In case you’re wondering, the mechanism in IE that allows processes to elevate to medium integrity involves an appcompat hook on CreateProcess that requests a medium integrity process (ieuser.exe) to prompt the user for consent, with the medium integrity process creating the process if the user agrees. So user interaction is still required there, though we know how much users love to click “OK” on all those pesky security warnings. Oh, and there is also some hardening that has been done in win32k.sys to prevent lower integrity processes from sending “dangerous” window messages to higher integrity processes (even WM_USER and friends are disabled by default across an integrity level boundary), so “shatter attacks” ought not to work against the consent dialog. Note that if you bypass the appcompat hook, the new process created is also low integrity, and won’t be able to write to anywhere outside of the “low integrity” sandbox writable files and directories.)

So, while running IE in protected mode does, in some respects limit the damage that can be done if you get compromised, I would still recommend not running untrusted programs under the same user account as your important documents (if you really care that much). Perhaps in a future release, we’ll see a solution that addresses the idea of not allowing untrusted programs to read arbitrary user data as well (such should be possible with proper use of the new integrity level mechanisms, although I suspect the true difficulty shall be in getting third party applications to play nicely as we continue to try and place control of the user’s documents more firmly in the control of the actual user instead of in any arbitrary application that runs on the box).

Sometimes, a cheap hack does the trick after all (or how I worked around some mysterious USB/Bluetooth issues)

Monday, April 16th, 2007

My relatively new Dell XPS M1710 laptop (running Vista x64) has an annoying problem that I’ve recently tracked down.

It appears that when you connect a USB 1.1 device (such as a USB keyboard) to it, and then disconnect that device, Bluetooth and the built-in smart card reader tend to break the next time the computer goes through a suspend/resume cycle. This is fairly annoying for me, as I use a USB keyboard at work (and I also use Bluetooth extensively); it got to the point where pretty much every other time I went to the office and came home, Bluetooth and the smart card reader would break.

The internal Bluetooth transceiver and the built-in smart card reader are both connected to the system via an internal USB hub. It seems like this is becoming a popular thing nowadays among laptop manufacturers, connecting “internal” peripherals on laptops via USB instead of them being on-board and hardwired to the PCI bus or the like. Anyways, when the Bluetooth hub and smart card reader get into the broken state, I’ll typically get a “A USB device attached to the system is not functioning” notification, and the internal USB hub shows up as not started in device manager (with a problem code of CM_PROB_FAILED_START – code 10). Occasionally, the Bluetooth transceiver itself shows up with a CM_PROB_FAILED_START error, but the vast majority of the time it is the USB hub that fails.

I’ve done a bit of searching for a real fix for this problem, and closest I’ve found is this KB article which describes a problem that does sound like mine – Bluetooth breaks after suspending – but, the hotfix isn’t publicly available yet. I suppose I could have called PSS and tried to talk them into getting me a copy of that hotfix to try out, but I tend to try and avoid muddling through the various tiers of technical support until I really have no other options. Furthermore, there’s no guarantee that the hotfix would actually solve the underlying problem; the article doesn’t make any mention of internal USB hubs acting flaky in conjunction with a Bluetooth transceiver, only the Bluetooth module itself.

Until recently, to get out of this state, I typically either have to suspend/resume the laptop again (although this is dangerous*), reboot entirely, or remove and reinstall the devnode associated with the USB hub in device manager. (*Trying to suspend/resume in this case doesn’t always work. Sometimes, it won’t fix the problem, and more than a couple of times it has resulted in Vista hanging while trying to suspend, forcing a hard reboot.).

None of these solutions are particularly desirable; nobody likes having to shut everything down and reboot all the time with a laptop (kind of defeats the point of that nice sleep feature, doesn’t it?). Removing and reinstalling the devnode is less painful than a reboot, but only slightly; waiting for everything on the internal USB hub to get reinstalled after doing that takes up to a minute or two of disk thrashing while INFs are searched, and if things are going to take that long then it would almost be faster just to shut down and reboot anyway.

Eventually, though, I discovered that simply disabling and reenabling the devnode corresponding to the USB hub would resolve the problem. This is definitely a much nicer solution than any of the above; it’s (relatively) quick and less painful, and it doesn’t involve me sitting around and waiting for either a shutdown or reboot, or for a bunch of devices to get reinstalled by PnP.

Unfortunately, that workaround is still a rather manual process. It’s not too bad if I just keep an elevated Device Manager window open for easy access, but among other things this prevents me from, say, unlocking my computer with a smart card after an unsuspend (as the Bluetooth transceiver and smart card reader share the same USB hub that tends to flake out after a suspend/resume, after a USB 1.1 device is removed).

So, I set about seeing whether I could try and automate this process. I could have tried to track down the root cause a bit more, but as far as I can tell, the problem is either 1) a bug in Vista’s USB support (perhaps or perhaps not an x64 specific bug), or 2) a bug in firmware/hardware relating to the internal USB hub itself, or 3) a bug in firmware/hardware relating to the Bluetooth transceiver, or 4) a bug in the Bluetooth drivers for my laptop. None of those possibilities would be something that I could easily solve (as far as the root cause goes), even if I managed to track down the originating cause of the problem. As a result, I decided that the most time-effective solution would be to just try and automate the process of disabling and reenabling the devnode associated with the USB hub (if it breaks), or the Bluetooth transceiver (if it breaks instead). Restarting the devnode is comparatively fast when viewed in light of the other workarounds, and in any case it is fast enough to be a viable way to at least alleviate the symptoms of this problem.

After doing some brief research, it looked like the way to go here was a combination of the CM_Xxx APIs and the SetupDi APIs. Although there’s a relatively large amount of indirection you have to go through to restart an individual devnode (and the CM_Xxx APIs / setupapi are not particularly easy to use in the first place), there happens to be a WDK sample that has the capability to do just what I want – DevCon. DevCon is a console equivalent to the Device Manager MMC snapin; it’s capable of enumerating device nodes, installing/removing them, updating drivers, disabling/reenabling devnodes, and soforth.

Sure enough, I verified that DevCon’s `restart’ command was sufficient to restart the broken devnode (in a fashion similar to disabling and reenabling it in Device Manager), with the end result of causing the USB hub to start working again.

At this point, all I had to do was come up with a good way to locate the broken devnodes, a good way to know when the computer went in/out of sleep (as a sleep cycle causes the problem to occur), and then inject the DevCon code responsible for restarting a device into my program. To make a long story short, I ended up keying the program based on the hardware ids of the internal USB hub and Bluetooth transceiver such that it would check for all devnodes that 1) matched a hardware id in that list, and 2) were in a disabled state with a problem code of CM_PROB_FAILED_START. Detecting a sleep transition is fairly easy for a service (services receive notifications of PnP/Power events through the callback registered via RegisterServiceCtrlHandlerEx, and as I wanted the program to function continuously, a service seemed like the logical way to run it anyway.

An hour or two later and I had my workaround problem done. It’s certainly not pretty, and it doesn’t do anything to fix the root cause of the problem, but as far as treating the symtoms goes, it gets the job done. The basic way the program works is that every time the system resumes from suspend, it will poll all devnodes every second for 30 seconds, looking for a devnode that failed to start and is one of the known list of problematic hardware ids. If it finds such a device, it attempts to restart it (thereby working around the root cause of the problem and alleviating the symptoms). Polling for a fixed time after resume isn’t pretty by any means, but it can occasionally take a bit for one of the devnodes to show as broken when it’s not working, so this works around that.

While you could safely call it a giant hack, the program does get the job done; now, I can unlock my laptop via smart card or use Bluetooth almost immediately after a resume, even if the breakage-after-USB1-device-is-unplugged problem strikes, and all without having to manually futz around in device manager every time.

In case anyone’s interested, I’ve put the source code for the program (“Broken Device Bouncer”, or DevBouncer for short) up. It’s fairly hardcoded to be specific to my machine, so if you wanted to use it for some reason, you’d need to rebuild it.

How I ended up in the kernel debugger while trying to get PHP and Cacti working…

Saturday, April 14th, 2007

Some days, nothing seems to work properly. This is the sad story of how something as innocent as trying to install a statistics graphing Web application culminated in my breaking out the kernel debugger in an attempt to get things working. (I don’t seem to have a lot of luck with web applications. So much for the way of the future being “easy to develop/deploy/use” web-based applications…)

Recently, I decided that to try installing Cacti in order to get some nice, pretty graphs describing resource utilization on several boxes at my apartment. Cacti is a PHP program that queries SNMP data and, with the help of a program called RRDTool, creates friendly historical graphs for you. It’s commonly used for monitoring things like network or processor usage over time.

In this particular instance, I was attempting to get Cacti working on a Windows Server 2003 x64 SP2 box. Running an amalgam of unix-ish programs on Windows is certainly “fun”, and doing it on native x64 is even more “interesting”. I didn’t expect to find myself in the kernel debugger while trying to get Cacti working, though…

To start out, the first thing I had to do was convert IIS6’s worker processes to 32-bit instead of 64-bit, as the standard PHP 5 distribution doesn’t support x64. (No, I don’t consider spending who knows how many hours to get PHP building on x64 natively a viable solution here, so I just decided to stick with the 32-bit release. I don’t particularly want to be in the habit of having to then maintain rebuild my own PHP distribution from a custom build environment each time security updates come out either…).

This wasn’t too bad (at least not at first); a bit of searching revealed this KB article that documented an IIS metabase flag that you can set to turn on 32-bit worker processes (with the help of the adsutil.vbs script included in the IIS Adminscripts directory).

One small snag here was that I happened to be running a symbol proxy in native x64 mode on this system already. Since the 32-bit vs 64-bit IIS worker process flag is an all-or-nothing option, I had to go install the 32-bit WinDbg distribution on this system and copy over the 32-bit symproxy.dll and symsrv.dll into %systemroot%\system32\inetsrv. Additionally, the registry settings used by the 64-bit symproxy weren’t directly accessible to the 32-bit version (due to a compatiblity feature in 64-bit versions of Windows known as Registry Reflection), so I had to manually copy over the registry settings describing which symbol paths the symbol proxy used to the Wow64 version of HKLM\Software. No big deal so far, just a minor annoyance.

The first unexpected problem that cropped up happened after I had configured the 32-bit symbol proxy ISAPI filter and installed PHP; after I enabled 32-bit worker processes, IIS started tossing HTTP 500 Internal Server Error statuses whenever I tried to browse any site on the box. Hmm, not good…

After determining that everything was still completely broken even after disabling the symbol proxy and PHP ISAPI modules, I discovered some rather unpleasant-looking event log messages:

ISAPI Filter ‘%SystemRoot%\Microsoft.NET\Framework64\v2.0.50727\aspnet_filter.dll’ could not be loaded due to a configuration problem. The current configuration only supports loading images built for a x86 processor architecture. The data field contains the error number. To learn more about this issue, including how to troubleshooting this kind of processor architecture mismatch error, see http://go.microsoft.com/fwlink/?LinkId=29349.

It seemed that the problem was the wrong version of ASP.NET being loaded (still the x64 version). The link in the event message wasn’t all that helpful, but a bit of searching located yet another knowledge base article – this time, about how to switch back and forth between 32-bit and 64-bit versions of ASP.NET. After running aspnet_regiis as described in that article, IIS was once again in a more or less working state. Another problem down, but the worst was yet to come…

With IIS working again, I turned towards configuring Cacti in IIS. Although, at first it appeared as though everything might actually go as planned (after configuring Cacti’s database settings, I soon found myself at its php-based initial configuration page), such things were not meant to be. The first sign of trouble appeared after I completed the initial configuration page and attempted to log on with the default username and password. Doing so resulted in my being thrown back to the log on page, without any error messages. A username and password combination not matching the defaults did result in a logon failure error message, so something besides a credential failure was up.

After some digging around in the Cacti sources, it appeared that the way that Cacti tracks whether a user is logged in or not is via setting some values in the standard PHP session mechanism. Since Cacti was apparently pushing me back to the log on page as soon as I logged on, I guessed that there was probably some sort of failure with PHP’s session state management.

Rewind a bit to back when I installed PHP. In the interest of expediency (hah!), I decided to try out the Win32 installer package (as opposed to just the zip distribution for a manual install) for PHP. Typically, I’ve just installed PHP for IIS the manual way, but I figured that if they had an installer nowadays, it might be worth giving it a shot and save myself some of the tedium.

Unfortunately, it appears that PHP’s installer is not all that intelligent. It turns out that in the IIS ISAPI mode, PHP configures the system-wide PHP settings to point the session state file directory to the user-specific temp directory (i.e. pointing to a location under %userprofile%). This, obviously, isn’t going to work; anonymous users logged on to IIS aren’t going to have access to the temp directory of the account I used to install PHP with.

After manually setting up a proper location for PHP’s session state with the right security permissions (and reconfiguring php.ini to match), I tried logging in to Cacti again. This time, I actually got to the main screen after changing the password (hooray, progress!).

From here, all that I had left to do was some minor reconfiguring of the Windows SNMP service in order to allow Cacti to query it, set up the Cacti poller task job (Cacti is designed to poll data from its configured data sources at regular intervals), and configure my graphs in Cacti.

Configuring SNMP wasn’t all that difficult (though something I hadn’t done before with the Windows SNMP service), and I soon had Cacti successfully querying data over SNMP. All that was left to do was graph it, and I was home free…

Unfortunately, getting Cacti to actually graph the data turned out to be rather troublesome. In fact, I still haven’t even got it working, though I’ve at least learned a bit more about just why it isn’t working…

When I attempted to create graphs in Cacti, everything would appear to work okay, but no RRDTool datafiles would ever appear. No amount of messing with filesystem permissions resolved the problem, and the Cacti log files were not particularly helpful (even on debug severity). Additionally, attempting to edit graph properties in Cacti would result in that HTTP session mysteriously hanging forever more (definitely not a good sign). After searching around (unsuccessfully) for any possible solutions, I decided to try and take a closer look at what exactly was going on when my requests to Cacti got stuck.

Checking the process list after repeating the sequence that caused a particular Cacti session to hang several times, I found that there appeared to be a pair of cmd.exe and rrdtool.exe instances corresponding to each hung session. Hmm, it would appear that something RRDTool was doing was freezing and PHP was waiting for it… (PHP uses cmd.exe to call RRDTool, so I guessed that PHP would be waiting for cmd.exe, which would be waiting for RRDTool).

At first, I attempted to attach to one of the cmd processes with WinDbg. (Incidentally, it would appear that there are currently no symbols for the Wow64 versions of the Srv03SP2 ntdll, kernel32, user32, and a large number of other core DLLs with Wow64 builds available on the Microsoft symbol server for some reason. If any Microsoft people are reading this, it would be greaaaat if you could fix the public symbol server for Srv03 SP2 x64 Wow64 DLLs …) However, symbols for cmd.exe were fortunately available, so it was relatively easy to figure out what it was up to, and prove my earlier hypothesis that it was simply waiting on an rrdtool instance:

0:001:x86> ~1k
ChildEBP RetAddr
0012fac4 7d4d8bf1 ntdll_7d600000!NtWaitForSingleObject+0x15
0012fad8 4ad018ea KERNEL32!WaitForSingleObject+0x12
0012faec 4ad02611 cmd!WaitProc+0x18
0012fc24 4ad01a2b cmd!ExecPgm+0x3e2
0012fc58 4ad019b3 cmd!ECWork+0x84
0012fc70 4ad03c58 cmd!ExtCom+0x40
0012fe9c 4ad01447 cmd!FindFixAndRun+0xa9
0012fee0 4ad06cf6 cmd!Dispatch+0x137
0012ff44 4ad07786 cmd!main+0x108
0012ffc0 7d4e7d2a cmd!mainCRTStartup+0x12f
0012fff0 00000000 KERNEL32!BaseProcessInitPostImport+0x8d
0:001:x86> !peb
[...]
CommandLine: 'cmd.exe /c c:/progra~2/rrdtool/rrdtool.exe -'
[...]

Given this, the next logical step to investigate would be the RRDTool.exe process. Unfortunately, something really weird seemed to be going on with all the RRDTool.exe processes (naturally). WinDbg would give me an access denied error for all of the RRDTool PIDs in the F6 process list, despite my being a local machine administrator.

Attempting to attach to these processes failed as well:

Microsoft (R) Windows Debugger Version 6.6.0007.5
Copyright (c) Microsoft Corporation. All rights reserved.

Cannot debug pid 4904, NTSTATUS 0xC000010A
“An attempt was made to duplicate an object handle into or out of an exiting process.”
Debuggee initialization failed, NTSTATUS 0xC000010A
“An attempt was made to duplicate an object handle into or out of an exiting process.”

This is not something that you want to be seeing on a server box. This particular error means that the process in question is in the middle of being terminated, which prevents a debugger from successfully attaching. However, processes typically terminate in timely fashion; in fact, it’s almost unheard of to actually see a process in the terminating state, since it happens so quickly. However, in this particular instances, the RRDTool processes were remaining in this half-dead state for what appeared to be an indefinite interval.

There are two things that commonly cause this kind of problem, and all of them are related to the kernel:

  1. The disk hardware is not properly responding to I/O requests and they are hanging indefinitely. This can block a process from exiting while the operating system waits for an I/O to finishing canceling or completing. Since this particular box was brand new (and with respectable, high-quality server hardware), I didn’t think that failing hardware was the cause here (or at least, I certainly hoped not!). Given that there were no errors in the event log about I/Os failing, and that I was still able to access files on my disks without issue, I decided to rule this possiblity out.
  2. A driver (or other kernel mode code in the I/O stack) is buggy and is not allowing I/O requests to be canceled or completed, or has deadlocked itself and is not able to complete an I/O request. (You might be familiar with the latter if you’ve tried to use the 1394 mass storage support in Windows for a non-trivial length of time.) Given that I had tentatively ruled out bad hardware, this would seem to be the most likely cause here.

Since the frozen process would be stuck in kernel mode, in either case, to proceed any further I would need to use the kernel debugger. I decided to start out with local kd, as that is sufficient for at least retrieving thread stacks and doing basic passive analysis of potential deadlock issues where the system is at least mostly still functional.

Sure enough, the stuck RRDTool process I had unsuccessfully tried to attach to was blocked in kernel mode:


lkd> !process 0n4904
Searching for Process with Cid == 1328
PROCESS fffffadfcc712040
SessionId: 0 Cid: 1328 Peb: 7efdf000 ParentCid: 1354
DirBase: 5ea6c000 ObjectTable: fffffa80041d19d0 HandleCount: 68.
Image: rrdtool.exe
[...]
THREAD fffffadfcca9a040 Cid 1328.1348 Teb: 0000000000000000 Win32Thread: 0000000000000000 WAIT: (Unknown) KernelMode Non-Alertable
fffffadfccf732d0 SynchronizationEvent
Impersonation token: fffffa80041db980 (Level Impersonation)
DeviceMap fffffa8001228140
Owning Process fffffadfcc712040 Image: rrdtool.exe
Wait Start TickCount 6545162 Ticks: 367515 (0:01:35:42.421)
Context Switch Count 445 LargeStack
UserTime 00:00:00.0000
KernelTime 00:00:00.0015
Win32 Start Address windbg!_imp_RegCreateKeyExW (0x0000000000401000)
Start Address 0x000000007d4d1510
Stack Init fffffadfc4a95e00 Current fffffadfc4a953b0
Base fffffadfc4a96000 Limit fffffadfc4a8f000 Call 0
Priority 10 BasePriority 8 PriorityDecrement 1
RetAddr Call Site
fffff800`01027752 nt!KiSwapContext+0x85
fffff800`0102835e nt!KiSwapThread+0x3c9
fffff800`013187ac nt!KeWaitForSingleObject+0x5a6
fffff800`012b2853 nt!IopAcquireFileObjectLock+0x6d
fffff800`01288dff nt!IopCloseFile+0xad
fffff800`01288f0e nt!ObpDecrementHandleCount+0x175
fffff800`0126ceb0 nt!ObpCloseHandleTableEntry+0x242
fffff800`0128d7a6 nt!ExSweepHandleTable+0xf1
fffff800`012899b6 nt!ObKillProcess+0x109
fffff800`01289d3b nt!PspExitThread+0xa3a
fffff800`0102e3fd nt!NtTerminateProcess+0x362
00000000`77ef0caa nt!KiSystemServiceCopyEnd+0x3
0202c9fc`0202c9fb ntdll!NtTerminateProcess+0xa

Hmm… not quite what I expected. If a buggy driver was involved, it should have at least been somewhere on the call stack, but in this particular instance all we have is ntoskrnl code, straight from the system call to the wait that isn’t coming back. Something was definitely wrong in kernel mode, but it wasn’t immediately clear what was causing it. It appeared as if the kernel was blocked on the file object lock (which, to my knowledge, is used to guard synchronous I/O’s that are issued for a particular file object), but, as the file object lock is built upon KEVENTs, the usual lock diagnostics extensions (like `!locks’) would not be particularly helpful. In this instance, what appeared to be happening was that the process rundown logic in the kernel was attempting to release all still-open handles in the exiting RRDTool process, and it was (for some reason) getting stuck while trying to close a handle to a particular file object.

I could at least figure out what file was “broken”, though, by poking around in the stack of IopCloseFile:

lkd> !fileobj fffffadf`ccf73250
\Temp\php\session\sess_bkcavai8fak8antv9coq46at95
LockOperation Set Device Object: 0xfffffadfce423370 \Driver\dmio
Vpb: 0xfffffadfce864840
Access: Read Write SharedRead SharedWrite
Flags: 0x40042
Synchronous IO
Cache Supported
Handle Created
File Object is currently busy and has 1 waiters.
FsContext: 0xfffffa800390e110 FsContext2: 0xfffffa8000106a10
CurrentByteOffset: 0
Cache Data:
Section Object Pointers: fffffadfcd601c20
Shared Cache Map: fffffadfccfdebb0 File Offset: 0 in VACB number 0
Vacb: fffffadfce97fb08
Your data is at: fffff98070e80000

From here, there are a couple of options:

  1. We could look for processes with an open handle to that file and check their stacks.
  2. We could look for an IRP associated with that file object and try and trace our way back from there.

Initially, I tried the first option, but this ended up not working particularly well. I attempted to use Process Explorer to locate all processes that had a handle to that file, but this ended up failing rather miserably as Process Explorer itself got deadlocked after it opened a handle to the file. This was actually rather curious; it turned out that processes could open a handle to this “broken” file just fine, but when they tried to close the handle, they would get blocked in kernel mode indefinitely.

That unsuccessful, I tried the second option, which is made easier by the use of `!irpfind’. Normally, this extension is very slow to operate (over a serial cable), but local kd makes it quite usable. This revealed something of value:


lkd> !irpfind -v 0 0 fileobject fffffadf`ccf73250
Looking for IRPs with file object == fffffadfccf73250
Scanning large pool allocation table for Tag: Irp? (fffffadfccdf6000 : fffffadfcce56000)
Searching NonPaged pool (fffffadfcac00000 : fffffae000000000) for Tag: Irp?
Irp [ Thread ] irpStack: (Mj,Mn) DevObj [Driver] MDL Process
fffffadfcc225380: Irp is active with 7 stacks 7 is current (= 0xfffffadfcc225600)
No Mdl: No System Buffer: Thread fffffadfccea27d0: Irp stack trace.
cmd flg cl Device File Completion-Context
[ 0, 0] 0 0 00000000 00000000 00000000-00000000
Args: 00000000 00000000 00000000 00000000
[...]
>[ 11, 1] 2 1 fffffadfce7b6040 fffffadfccf73250 00000000-00000000 pending
\FileSystem\Ntfs
Args: fffffadfcd70e0a0 00000000 00000000 00000000

There was an active IRP for this file object. Hopefully, it could be related to whatever is holding the file object lock for that file object. Digging a bit deeper, it’s possible to determine what thread is associated with the IRP (if it’s a thread IRP), and from there, we can grab a stack (which might just give us the smoking gun we’re looking for)…:

lkd> !irp fffffadfcc225380 1
Irp is active with 7 stacks 7 is current (= 0xfffffadfcc225600)
No Mdl: No System Buffer: Thread fffffadfccea27d0: Irp stack trace.
Flags = 00000000
ThreadListEntry.Flink = fffffadfcc2253a0
ThreadListEntry.Blink = fffffadfcc2253a0
[...]
CancelRoutine = fffff800010ba930 nt!FsRtlPrivateResetLowestLockOffset
[...]
lkd> !thread fffffadfccea27d0
THREAD fffffadfccea27d0 Cid 10f8.138c Teb: 00000000fffa1000 Win32Thread: fffffa80023cd860 WAIT: (Unknown) UserMode Non-Alertable
fffffadfccf732e8 NotificationEvent
Impersonation token: fffffa8002c62060 (Level Impersonation)
DeviceMap fffffa8002f3b7b0
Owning Process fffffadfcc202c20 Image: w3wp.exe
Wait Start TickCount 6966187 Ticks: 952 (0:00:00:14.875)
Context Switch Count 1401 LargeStack
UserTime 00:00:00.0000
KernelTime 00:00:00.0000
Win32 Start Address 0x00000000003d87d8
Start Address 0x000000007d4d1504
Stack Init fffffadfc4fbee00 Current fffffadfc4fbe860
Base fffffadfc4fbf000 Limit fffffadfc4fb8000 Call 0
Priority 10 BasePriority 8 PriorityDecrement 0
RetAddr : Call Site
fffff800`01027752 : nt!KiSwapContext+0x85
fffff800`0102835e : nt!KiSwapThread+0x3c9
fffff800`012afb38 : nt!KeWaitForSingleObject+0x5a6
fffff800`0102e3fd : nt!NtLockFile+0x634
00000000`77ef14da : nt!KiSystemServiceCopyEnd+0x3
00000000`00000000 : ntdll!NtLockFile+0xa

This might just be what we’re looking for. There’s a thread in w3wp.exe (the IIS worker process), which is blocking on a synchronous NtLockFile call for that same file object that is in the “broken” state. Since I’m running PHP in ISAPI mode, this does make sense – if PHP is doing something to that file (which it could certainly be, since it’s a PHP session state file as we saw above), then it should be in the context of w3wp.exe.

In order to get a better user mode stack trace as to what might be going on, I was able to attach a user mode debugger to w3wp.exe and get a better picture as to what the deal was:


0:006> .effmach x86
Effective machine: x86 compatible (x86)
0:006:x86> ~6s
ntdll_7d600000!ZwLockFile+0x12:
00000000`7d61d82e c22800 ret 28h
0:006:x86> k
ChildEBP RetAddr
WARNING: Stack unwind information not available. Following frames may be wrong.
014edf84 023915a2 ntdll_7d600000!ZwLockFile+0x12
014edfbc 0241d886 php5ts!flock+0x82
00000000 00000000 php5ts!zend_reflection_class_factory+0xb576

It looks like that thread is indeed related to PHP; PHP is trying to acquire a file lock on the session state file. With a bit of work, we can figure out just what kind of lock it was trying to acquire.

The prototype for NtLockFile is as so:

// NtLockFile locks a region of a file.
NTSYSAPI
NTSTATUS
NTAPI
NtLockFile(
IN HANDLE FileHandle,
IN HANDLE Event OPTIONAL,
IN PIO_APC_ROUTINE ApcRoutine OPTIONAL,
IN PVOID ApcContext OPTIONAL,
OUT PIO_STATUS_BLOCK IoStatusBlock,
IN PULARGE_INTEGER LockOffset,
IN PULARGE_INTEGER LockLength,
IN ULONG Key,
IN BOOLEAN FailImmediately,
IN BOOLEAN ExclusiveLock
);

Given this, we can easily deduce the arguments off a stack dump:

0:006:x86> dd @esp+4 l0n10
00000000`014edf48 000002b0 00000000 00000000 014edfac
00000000`014edf58 014edfac 014edf74 014edf7c 00000000
00000000`014edf68 00000000 00000001
0:006:x86> dq 014edf74 l1
00000000`014edf74 00000000`00000000
0:006:x86> dq 014edf7c l1
00000000`014edf7c 00000000`00000001

It seems that PHP is trying to acquire an exclusive lock for a range of 1 byte starting at offset 0 in this file, with NtLockFile configured to wait until it acquires the lock.

Putting this information together, it’s now possible to surmise what is going on here:

  1. The child processes created by php have a file handle to the session state file (probably there from process creation inheritance).
  2. PHP tries to acquire an exclusive lock on part of the session state file. This takes the file object lock for that file and waits for the file to become exclusively available.
  3. The child process exits. Now, it tries to acquire the file object lock so that it can close its file handle. However, the file object lock cannot be acquired until the child process releases its handle as the handle is blocking PHP’s NtLockFile from completing.
  4. Deadlock! Neither thread can continue, and PHP appears to hang instead of configuring my graphs properly.

In this particular instance, it was actually possible to “recover” from the deadlock without rebooting; the IIS worker process’s wait in NtLockFile is marked as a UserMode wait, so it is possible to terminate the w3wp.exe process, which releases the file object lock and ultimately allows all the frozen processes that are trying to close a handle to the PHP session state file to finish the close handle operation and exit.

This is actually a nasty little problem; it looks like it’s possible for one user mode process to indefinitely freeze another user mode process in kernel mode via a deadlock. Although you can break the deadlock by terminating the second user mode process, the fact that a user mode process can, at all, cause the kernel to deadlock during process exit (“breakable” or not) does not appear to be a good thing to me.

Meanwhile, knowing this right now doesn’t really solve my problem. Furthermore, I suspect that there’s probably a different problem here too, as the command line that was given to RRDTool (simply “-“) doesn’t look all that valid to me. I’ll see if I can come up with some way to work around this deadlock problem, but it definitely looks like an unpleasant one. If it really is a file handle being incorrectly inherited to a child process, then it might be possible to un-mark that handle for inheritance with some work. The fact that I am having to consider making something to patch PHP to work around this is definitely not a happy one, though…

Silly me for thinking that it would just take a kernel debugger to get a web application running…

Debugger internals: Why do ntoskrnl and ntdll have type information?

Wednesday, February 14th, 2007

Johan Johansson sent me a mail asking why nt and ntdll have partial type information included (if you’re at all experienced with debugging on Windows, you’ll know that public symbols, such as what Microsoft ships on the public symbol server at http://msdl.microsoft.com/download/symbols, don’t typically include type information. Instead, one typically needs access to private symbols in order to view types.

However, nt and ntdll are an exception to this rule, on Windows XP or later. Unlike all the other PDBs shipped by Microsoft, the ones corresponding to ntdll and ntoskrnl do include type information for a seemingly arbitrary mix of types, some publicly documented and some undocumented. There is, however, a method to the madness with respect to what symbols are included in the public nt/ntdll PDBs.

To understand what symbols are chosen and why, though, it’s necessary to know a bit of history.

Way back in the days when Windows NT was still called Windows NT (and not Windows 2000 or Windows XP), the debugger scene was a much less friendly place. In those days, “remote debugging” involved dialing up your debugger person with a modem, and if you wanted to kernel debug a target, you had to run a different kernel debugger program specific to the architecture on the target computer.

Additionally, you had to use a debugger version that was newer than the operating system on the target computer or things wouldn’t work out very well. Furthermore, one had to load architecture-specific extension dlls for many kernel debugging tasks. One of the reasons for these restrictions, among other things, is that for different architectures (and different OS releases), the size and layout of many internal structures used by the debugger (and debugger extension modules) to do their work varied. In other words, the size and layout of, say, EPROCESS might not be the same on Windows NT 3.1 x86 vs Windows NT 4.0 for the DEC Alpha.

When Windows 2000 was released, things became a little bit better- Windows 2000 only publicly supported x86, which reduced the number of different architectures that WinDbg needed to publicly support going forward. However, Windows XP and Windows Server 2003 reintroduced mainstream support for non-x86 architectures (first IA64 and then x64).

At some point on the road to Windows XP and Windows Server 2003, a decision was made to clean things up from the debugger perspective and introduce a more manageable way of supporting a large matrix of operating systems and target architectures.

Part of the solution devised involved providing a unified, future-compatible (where possible; obviously, new or radically redesigned OS functionality would require debugger extension changes, but things like simple structure size chages shouldn’t require such drastic measures) method for accessing data on the remote system. Since all that a debugger extension does in the common case is simply quickly reformat data on the target system into an easily human-readable format (such as the process list returned by !process), a unified way to communicate structure sizes and layouts to debugger extensions (and the debugger engine itself) would greatly reduce the headache of supporting the debugger on an ever-expanding set of platforms and operating systems. This solution involved putting the structures used by the debugger engine itself and many debugger extension into the symbol files shipped with ntoskrnl and ntdll, and then providing a well-defined API for retrieving that type information from the perspective of a debugger extension.

Fast-forward to 2007. Now, there is a single, unified debugger engine for all target platforms and architectures (the i386kd.exe and ia64kd.exe that ship with the DTW distribution are essentially the same as the plain kd.exe and are vestigal remains of a by-gone era; these files are simply retained for backwards compability with scripts and programs that drive the debugger), real remote debugging exists, and your debugger doesn’t break every time a service pack is released. All of this is made possible in part due to the symbol information available in the ntoskrnl and ntdll PDB files. This is why you can use WinDbg 6.6.7.5 to debug Windows Vista RTM, despite the fact that WinDbg 6.6.7.5 was released months before the final RTM build was shipped.

Symbol support is also a reason why there is no ‘srv03ext’, ‘srv03fre’, or ‘srv03chk’ extension directories under your Debugging Tools for Windows folder. The nt4chk/nt4fre/w2kchk/w2kfre directories contain debugger extensions specific to that Windows build. Due to the new unified debugger architecture, there is no longer a need to tie a binary to a particular operating system build going forward. Because Windows 2000 and Windows NT 4.0 doesn’t include type data, however, the old directories still remain for backwards compatibility with those platforms.

So, to answer Johan’s question: All of the symbols in the ntoskrnl or ntdll PDBs should be used by the debugger engine itself, or some debugger extension, somewhere. This is the final determining factor as to what types are exposed via those PDBs, to my knowledge; whether a public debugger extension DLL (or the debugger engine itself) uses them or not.

Enabling the local kernel debugger on Vista RTM

Friday, February 9th, 2007

If you’re a kernel developer, and you’ve upgraded to Vista, then one of the changes that you may have noticed is that you can’t perform local kernel debugging anymore.

This is true even if you elevate WinDbg. If you try, you’ll get an error message stating that the debugger failed to get KD version information (error 5), which corresponds to the Win32 ERROR_ACCESS_DENIED error code.

This is due to a change from Vista RC2 to Vista RTM, where the kernel function responsible for supporting much of the local KD functionality in WinDbg (KdSystemDebugControl) was altered to require the system to be booted with /DEBUG. This is apparent if we compare RC2 to RTM.

RC2 has the following check in KdSystemDebugControl (one comparison against KdpBootedNodebug):

nt!KdSystemDebugControl:
push    0F4h
push    81841938
call    nt!_SEH_prolog4
xor     ebx,ebx
mov     dword ptr [ebp-28h],ebx
mov     dword ptr [ebp-20h],ebx
mov     dword ptr [ebp-24h],ebx
cmp     byte ptr [nt!KdpBootedNodebug],bl
je      nt!KdSystemDebugControl+0x2c ; Success
mov     eax,0C0000022h ; STATUS_ACCESS_DENIED

On Vista RTM, two additional checks were added against nt!KdPitchDebugger and nt!KdDebuggerEnabled (disregard the fact that the RTM disassembly is from the x64 version; both the x86 and x64 Vista versions have the same checks):

nt!KdSystemDebugControl:
mov     qword ptr [rsp+8],rbx
mov     qword ptr [rsp+10h],rdi
push    r12
sub     rsp,170h
mov     r10,rdx
and     dword ptr [rsp+44h],0
and     qword ptr [rsp+48h],0
and     qword ptr [rsp+50h],0
cmp     byte ptr [nt!KdpBootedNodebug)],0
jne     nt!KdSystemDebugControl+0x8b7 ; Fail
cmp     byte ptr [nt!KdPitchDebugger],0
jne     nt!KdSystemDebugControl+0x8b7 ; Fail
cmp     byte ptr [nt!KdDebuggerEnabled],0
je      nt!KdSystemDebugControl+0x8b7 ; Fail

The essence of these checks is that you need to be booted with /DEBUG enabled in order for local kernel debugging to work.

There is a simple way to accomplish this, however, without the usual painful aspects of having a kernel debugger attached (e.g. breaking on user mode exceptions or breakpoints).

All you have to do is enable kernel debugging, and then disable user mode exception handling. This requires the following options to be set via BCDEdit.exe, the Vista boot configuration database manager:

  1. bcdedit /debug on. This enables kernel debugging for the booted OS configuration.
  2. bcdedit /dbgsettings <type> /start disable /noumex (where type corresponds to a usable KD type on your computer, such as 1394). This disables user mode exception handling for the kernel debugger. You should still be able to boot the system without a kernel debugger attached.

After setting these options, reboot, and you should be set. You’ll now be able to use local KD (you must still remember to elevate the debugger, though), but you won’t have user mode programs try to break into the kernel debugger when they crash without a user mode debugger attached.

Note, however, that you’ll still be able to break in to the system with a kernel debugger after boot if you choose these options (and if the box crashes in kernel mode, it’ll freeze waiting for a debugger to attach). However, at least you will not have to contend with errant user mode programs causing the system to break into the kernel debugger.

Programming against the x64 exception handling support, part 7: Putting it all together, or building a stack walk routine

Monday, February 5th, 2007

This is the final post in the x64 exception handling series, comprised of the following articles:

  1. Programming against the x64 exception handling support, part 1: Definitions for x64 versions of exception handling support
  2. Programming against the x64 exception handling support, part 2: A description of the new unwind APIs
  3. Programming against the x64 exception handling support, part 3: Unwind internals (RtlUnwindEx interface)
  4. Programming against the x64 exception handling support, part 4: Unwind internals (RtlUnwindEx implementation)
  5. Programming against the x64 exception handling support, part 5: Collided unwinds
  6. Programming against the x64 exception handling support, part 6: Frame consolidation unwinds
  7. Programming against the x64 exception handling support, part 7: Putting it all together, or building a stack walk routine

Armed with all of the information in the previous articles in this series, we can now do some fairly interesting things. There are a whole lot of possible applications for the information described here (some common, some estoric); however, for simplicities sake, I am choosing to demonstrate a classical stack walk (albeit one that takes advantage of some of the newly available information in the unwind metadata).

Like unwinding on x86 where FPO is disabled, we are able to do simple tasks like determine frame return addresses and stack pointers throughout the call stack. However, we can expand on this a great deal on x64. Not only are our stack traces guaranteed to be accurate (due to the strict calling convention requirements on unwind metadata), but we can retrieve parts of the nonvolatile context of each caller with perfect reliability, without having to manually disassemble every function in the call stack. Furthermore, we can see (at a glance) which functions modify which non-volatile registers.

For the purpose of implementing a stack walk, it is best to use RtlVirtualUnwind instead of RtlUnwindEx, as the RtlUnwindEx will make irreversible changes to the current execution state (while RtlVirtualUnwind operates on a virtual copy of the execution state that we can modify to our heart’s content, without disturbing normal program flow). With RtlVirtualUnwind, it’s fairly trivial to implement an unwind (as we’ve previously seen based on the internal workings of RtlUnwindEx). The basic algorithm is simply to retrieve the current unwind metadata for the active function, and execute a virtual unwind. If no unwind metadata is present, then we can simply set the virtual Rip to the virtual *Rsp, and increment the virtual *Rsp by 8 (as opposed to invoking RlVirtualUnwind). Since RtlVirtualUnwind does most of the hard work for us, the only thing left after that is to interpret and display the output (or save it away, if we are logging stack traces).

With the above information in mind, I have written an example x64 stack walking routine that implements a basic x64 stack walk, and displays the nonvolatile context as viewed by each successive frame in the stack. The example routine also makes use of the little-used ContextPointers argument to RtlVirtualUnwind in order to detect functions that have used a particular non-volatile register (other than the stack pointer, which is immediately obvious). If a function in the call stack writes to a non-volatile register, the stack walk routine takes note of this and displays the modified register, its original value, and the backing store location on the stack where the original value is stored. The example stack walk routine should work “as-is” in both user mode and kernel mode on x64.

As an aside, there is a whole lot of information that is being captured and displayed by the stack walk routine. Much of this information could be used to do very interesting things (like augment disassembly and code analysis by defintiively identifying saved registers and parameter usage. Other possible uses are more estoric, such as skipping function calls at run-time in a safe fashion, or altering the non-volatile execution context of called functions via modification of the backing store pointers returned by RtlVirtualUnwind in ContextPointers. The stack walk use case, as such, really only begins to scratch the surface as it relates to some of the many very interesting things that x64 unwind metadata allows you to do.

Comparing the output of the example stack walk routine to the debugger’s built-in stack walk, we can see that it is accurate (and in fact, even includes more information; the debugger does not, presently, have support for displaying non-volatile context for frames using unwind metadata (Update: Pavel Lebedinsky points out that the debugger does, in fact, have this capability with the “.frame /r” command)):

StackTrace64: Executing stack trace...
FRAME 00: Rip=00000000010022E9 Rsp=000000000012FEA0
 Rbp=000000000012FEA0
r12=0000000000000000 r13=0000000000000000
 r14=0000000000000000
rdi=0000000000000000 rsi=0000000000130000
 rbx=0000000000130000
rbp=0000000000000000 rsp=000000000012FEA0
 -> Saved register 'Rbx' on stack at 000000000012FEB8
   (=> 0000000000130000)
 -> Saved register 'Rbp' on stack at 000000000012FE90
   (=> 0000000000000000)
 -> Saved register 'Rsi' on stack at 000000000012FE88
   (=> 0000000000130000)
 -> Saved register 'Rdi' on stack at 000000000012FE80
   (=> 0000000000000000)

FRAME 01: Rip=0000000001002357 Rsp=000000000012FED0
[...]

FRAME 02: Rip=0000000001002452 Rsp=000000000012FF00
[...]

FRAME 03: Rip=0000000001002990 Rsp=000000000012FF30
[...]

FRAME 04: Rip=00000000777DCDCD Rsp=000000000012FF60
[...]
 -> Saved register 'Rbx' on stack at 000000000012FF60
   (=> 0000000000000000)
 -> Saved register 'Rsi' on stack at 000000000012FF68
   (=> 0000000000000000)
 -> Saved register 'Rdi' on stack at 000000000012FF50
   (=> 0000000000000000)

FRAME 05: Rip=000000007792C6E1 Rsp=000000000012FF90
[...]


0:000> k
Child-SP          RetAddr           Call Site
00000000`0012f778 00000000`010022c6 DbgBreakPoint
00000000`0012f780 00000000`010022e9 StackTrace64+0x1d6
00000000`0012fea0 00000000`01002357 FaultingSubfunction3+0x9
00000000`0012fed0 00000000`01002452 FaultingFunction3+0x17
00000000`0012ff00 00000000`01002990 wmain+0x82
00000000`0012ff30 00000000`777dcdcd __tmainCRTStartup+0x120
00000000`0012ff60 00000000`7792c6e1 BaseThreadInitThunk+0xd
00000000`0012ff90 00000000`00000000 RtlUserThreadStart+0x1d

(Note that the stack walk routine doesn’t include DbgBreakPoint or the StackTrace64 frame itself. Also, for brevity, I have snipped the verbose, unchanging parts of the nonvolatile context from all but the first frame.)

Other interesting uses for the unwind metadata include logging periodic stack traces at choice locations in your program for later analysis when a debugger is active. This is even more powerful on x64, especially when you are dealing with third party code, as even without special compiler settings, you are guaranteed to get good data with no invasive use of symbols. And, of course, a good knowledge of the fundamentals of how the exception/unwind metadata works is helpful for debugging failure reporting code (such as custom unhandled exception filter) on x64.

Hopefully, you’ve found this series both interesting and enlightening. In a future series, I’m planning on running through how exception dispatching works (as opposed to unwind dispatching, which has been relatively thoroughly covered in this series). That’s a topic for another day, though.

Thoughts on PatchGuard (otherwise known as Kernel Patch Protection)

Monday, January 29th, 2007

Recently, there has been a fair bit of press about PatchGuard. I’d like to clarify a couple of things (and clear up some common misconceptions that appear to be floating around out there).

First of all, these opinions are my own, based on the information that I have available to me at this time, and are not sponsored by either Microsoft or my employer. Furthermore, these views are based on PatchGuard as it is implemented today, and do not relate to any theoretical extensios to PatchGuard that may occur sometime in the future. That being said, here’s some of what I think about PatchGuard:

PatchGuard is (mostly) not a security system.

Although some people out there might try to tell you this, I don’t buy it. The thing about PatchGuard is that it protects the kernel (and a couple of other Microsoft-supplied core kernel libraries) from being patched. PatchGuard also protects a couple of kernel-related processor registers (MSRs) that are used in conjunction with functionality like making system calls. However, this doesn’t really equate to improving computer security directly. Some persons out there would like to claim that PatchGuard is the next great thing in the anti-malware/anti-rootkit war, but they’re pretty much just wrong, and here’s why:

  1. Malware doesn’t need to patch the kernel in the vast majority of all cases. Virtually all of the “interesting” things out there that malware does on compromised computers (create botnets, blast out spam emails, and the like) don’t require anything that is even remotely related to what PatchGuard blocks in terms of kernel patch prevention. Even in the rootkit case, most of the hiding of nefarious happenings could be done with clever uses of documented APIs (such as creating threads in other processes, or filesystem filters) without even having to patch the kernel at all. I will admit that many rootkits out there now do simply patch the kernel, but I attribute this mostly to a 1) lack of knowledge about how the kernel works by the part of rootkit authors, and 2) the fact that it might be “easier” to simply patch the kernel and introduce race conditions that crash rootkit’d computers rarely than to do things the right way. Once things like system call hooks in rootkits start to run afoul of PatchGuard, though, rootkit authors have innumerable other choices that are completely unguarded by PatchGuard. As a result, it’s not really correct (in my opinion) to call PatchGuard an anti-malware technology.
  2. Malware authors are more agile than Microsoft. What I mean when I say “more agile” is that malware vendors can much more easily release new software versions without the “burdens” of regression testing, quality assurance, and the like. After all, if you’re already in the malicious software business, it probably doesn’t matter if 5% of your customers (sorry, victims) will have systems that crash with your software installed because you didn’t fully test your releases and fix a corner case bug. On the other hand, Microsoft does have to worry about this sort of thing (and Microsoft’s install base for Windows is huge), which means that Microsoft needs to be very, very careful about releasing updates to something as “dangerous” as PatchGuard. I say “dangerous” in the sense that if a PatchGuard version is released with a bug, it could very well cause some number of Microsoft’s customers to bluescreen on boot, which would clearly not fly very well. Given the fact that Microsoft can’t really keep up with malware authors (many who are dedicated to writing malicious code with financial incentives to keep doing so) as it comes to the “cat and mouse” game with PatchGuard, it doesn’t make sense to try and use PatchGuard to stop malware.
  3. PatchGuard is targetted at vendors that are accountable to their customers. This point seems to be often overlooked, but PatchGuard works by making it painful for vendors to patch the kernel. This pain comes in the fact that ISVs who choose to bypass PatchGuard are at risk of causing customer computers to bluescreen on boot en-masse the next time that Microsoft releases a PatchGuard update. For malware authors, this is really much less of an issue; one compromised computer is the same as another, and many flavors of malware out there will try to do things like break automatic updates anyway. Furthermore, it is generally easier for malware authors to push out new versions of software to victims (that might counter a PatchGuard update) than it is for most ISVs to deploy updates to their paying customers, who often tend to be stubborn on the issue of software updates, typically insisting on internal testing before mass deployment.
  4. By the time malware is in a position to have to deal with PatchGuard as a potential blocking point, the victim has already lost. In order for PatchGuard to matter to malware, said malware must be able to run kernel level code on the victim’s computer. At this point, guarding the computer is really somewhat of a moot point, as the victim’s passwords, personal information, saved credit card numbers, secret documents, and whatnot are already toast in that the malware already has complete access to anything on the system. Now, there are a couple of scenarios (like trojaning a library computer kiosk) where the information desired by an attacker is not yet available, and the attacker has to hide his or her malicious code until a user comes around to type it in on a keyboard, but for the typical home or business computer scenario, the game is over the moment malicious code gets kernel level access to the box in question.

Now, for a bit of clarification. I said that PatchGuard was mostly not a security system. There is one real security benefit that PatchGuard conferns on customers, and that is the fact that PatchGuard keeps out ill-written software that does things like patch system calls and then fail to validate parameters correctly, resulting in new security issues being introduced. This is actually a real issue, believe it or not. Now, not everything that hooks the kernel introduces race conditions, security problems, or the like; it is possible (though difficult) to do so in a safe and correct manner in many circumstances, depending on what it is exactly that you’re trying to do. However, most of the software out there does tend to do things incorrectly, and in the process often inadvertently introduces security holes (typically local privilege escalation issues).

PatchGuard is not a DRM mechanism.

People who try to tell you that PatchGuard is DRM-related likely don’t really understand what it does. Even with PatchGuard installed, it’s still possible to extend the kernel with custom drivers. Additionally, the things that PatchGuard protects are only related to the new HD DRM mechanisms in Vista in a very loose sense. Many of the DRM provisions in Windows Vista are implemented in the form of drivers and not the kernel image itself, and are thus not protected by PatchGuard. If nothing else, third party drivers are responsible for the negotiation and lowest level I/O as relating to most of the new HD DRM schemes, and present an attractive target for DRM-bypass attempts that PatchGuard has no jurasdiction over. Unless Microsoft supplies all multimedia-output-related drivers in the system, this is how it will have to stay for the forseeable future, and it would be extremely difficult for Microsoft to protect just DRM-related drivers with PatchGuard in a non-trivially bypassable faction.

Preventing poorly written code from hooking the kernel is good for Windows customers.

There is a whole lot of bad code out there that hooks things incorrectly, and typically introduces race conditions and crash conditions. I have a great deal of first hand experience with this, as our primary product here at Positve has run afoul of third party software that hooks things and causes hard to debug random breakage that we get the blame for from a customer support perspective. The most common offenders that cause us pain are third party hooks that inject themselves into networking-related code and cause issues by subtlely breaking API semantics (for instance, we’ve run into certain versions of McAfee’s LSPs where if you call select in a certain way (Yes, I know that select is hardly optimal on Windows), the LSP will close a garbage handle value, sometimes nuking something important in the process. We’ve also run into a ton of cases where poorly behaved Internet Explorer add-ons will cause various corruption and other issues that we tend to get the blame for when our (rather complicated, code-wise, being a full-fledged VPN client) browser-based product is used in a partially corrupted iexplore process and eventually falls over and crashes. Another common trouble case relates to third party software that attempts to hook the CreateProcess* family of routines in order to propagate hooked code to child processes, and in the process manages to break legitimate usage of CreateProcess in certain obscure scenarios.

In our case, though, this is all just third-party “junkware” that breaks stuff in usermode. When you get to poorly written code that breaks the kernel, things get just that much worse; instead of a process blowing up, now the entire system hangs, crashes (i.e. bluescreens), experiences subtle filesystem corruption, or has local guest-to-kernel privilege escalation vulnerabilities introduced. Not only are the consequences of failure that much worse with hooking in kernel mode, but you can guess who gets the blame when a Windows box bluescreens: Microsoft, even though that sleazy anti-virus (or whatnot) software you just installed was really to blame. Microsoft claims that there are a significant amount of crashes on x86 Windows from third parties hooking the kernel wrong, and based on my own analysis of certain anti-virus software (and my own experience with hooking things blowing up our code in usermode), I don’t have any basis for disagreeing with Microsoft’s claim. From this perspective, PatchGuard is a good thing for consumers, as it represents a non-trivial attempt to force the industry to clean house and fix code that is at best questionable.

PatchGuard makes it significantly more difficult for ISVs out there which provide value on top of Windows through the use of careful hooking (or other non-blessed) means for extending the kernel.

Despite the dangers involved in kernel hooking, it is in fact possible to do it right in many circumstances, and there really are products out there which require kernel-level alterations that simply can’t be done without hooking (or other aspects that are blocked by PatchGuard). Note that, for the most part, I consider AV software as not in this class of software, and if nothing else, Microsoft’s Live OneCare demonstrates that it is perfectly feasible to deploy an AV solution without kernel hacks. Nonetheless, there are a number of niche solutions that revolve around things like kernel-level patching, which are completely shut out by PatchGuard. This presents an unfortunate negative atmosphere to any ISV that falls into this zone of having deployed technology that is now blocked in principle by PatchGuard, just because XYZ large anti-virus (the most common offenders, though there are certainly plenty of non-AV programs out there that are similarly “kernel-challenged”) vendor couldn’t be bothered to clean up their code and do things the correct way.

From this perspective, PatchGuard is damaging to customers (and affected ISVs), as they are essesntially forced to stay away from x64 (or take the risky road of trying to play cat-and-mouse with Microsoft and hack PatchGuard). This is unfortunate, and until recently, Microsoft has provided an outward appearance that they were taking a hard-line, stonewall stance against anything that ran afoul of PatchGuard, regardless of whether the program in question really had a “legitimate” need to alter the kernel. Fortunuately, for ISVs and customers, Microsoft appears to have very recently (or at least, very recently started saying the opposite thing in public) warmed up to the fact that completely shutting out a subset of ISVs and Windows customers isn’t such a great idea, and has reversed its previous statements regarding its willingness to cooperate with ISVs that run into problems with PatchGuard. I see this as a completely positive thing myself (keeping in mind that at work here, we don’t ship anything that conflicts with PatchGuard), as it signals that Microsoft is willing to work with ISVs that have “legitimate” needs that are being blocked by PatchGuard.

There is no giant conspiracy in Microsoft to shut out the ISVs of the world.

While I may not completely agree with the new reality that PatchGuard presents to ISVs, I have no illusions that Microsoft is trying to take over the world with PatchGuard. Like it or not, Microsoft is quite aware that the success of the Windows platform depends on the applications (and drivers) that run on top of it. As a result, it would be grossly stupid of Microsoft to try and leverage PatchGuard as a way to shut out other vendors entirely; customers don’t like changing vendors and ripping out all the experience and training that they have with an existing install base, and so you can bet that Microsoft trying to take over the software market “by force” with the use of PatchGuard wouldn’t go over well with Windows customers (or help Microsoft in its sales case for future Windows upgrades featuring PatchGuard), not to mention the legal minefield that Microsoft would be waltzing into should they attempt to do such a thing. Now, that being said, I do believe that somebody “dropped the ball”, so to speak, as far as cooperating with ISVs when PatchGuard was initially deployed. Things do appear to be improving from the perspective of Microsoft’s willingness to work with ISVs on PatchGuard, however, which is a great first step in the right direction (though it will remain to be seen how well Microsoft’ s current stance willl work out with the industry).

PatchGuard does represent, on some level, Microsoft exerting control over what users do with their hardware.

This is the main aspect of PatchGuard that I am most uncomfortable with, as I am of the opinion that when I buy a computer and pay for software, that I should absolutely be permitted to do what I want with it, even if that involves “dangerous” things like patching my own box’s kernel. In this regard, I think that Microsoft is attempting to play the “benevolent dictrator” with respect to kernel software; drivers that are dangerous to the reliablity of Windows computers, on average, are being blocked by Microsoft. Now, I do trust Microsoft and its code a whole lot more than most ISVs out there (I know first-hand how much more seriously Microsoft considers issues like reliablity and security at the present day (I’m not talking about Windows 95 here…) than a whole lot of other software vendors out there. Still, I don’t particularly like the fact that PatchGuard is akin to Microsoft telling me that “sorry, that new x64 computer and Windows license you bought won’t let you do things that we have classified as dangerous to system stability”. For example, I can’t write a program on Windows x64 that patches things blocked by PatchGuard, even for research and education purposes. I can understand that Microsoft is finding themselves up against the wall here against an uncooperative industry that doesn’t want to clean up its act, but still, it’s my computer, so I had damn well better be able to use it how I like. (No, I don’t consider the requirement of having a kernel debugger attached at boot time as an acceptable one, especially in automated scenarios.)

PatchGuard could make it more difficult to prove that a system is uncompromised, or to analyze a known compromised system for clues about an attack, if the malware involved is clever enough.

If you are trying to analyze a compromised (or suspected compromised) system for information about an attack, PatchGuard represents a dangerous unknown: a deliberately obfuscated chunk of code with sophisticated anti-analysis/anti-debugging/anti-reverse-engineering that ships with the operating system. This makes it very difficult to look at a system and determine if, say, it’s really PatchGuard that is running every so often, or perhaps some malware that has hijacked PatchGuard for nefarious purposes. Without digging though layer upon layer of obfuscation and anti-debugging code, definitively saying that PatchGuard on a system is uncompromised is just plain not do-able. In this respect, PatchGuard’s obfuscation presents a potentially attractive place for malicious code to hide and avoid being picked up in the course of post-compromise forensics of a running system that has been successfully compromised.

PatchGuard will not be obfuscation-based forever.

Eventually, I would expect that it will be replaced by hardware-enforced (hypervisor) based systems that utilize hardware-supported virtualization technology in new processors to provide a “ring -1” for code presently guarded by PatchGuard to execute in. This approach would move the burden of guarding kernel code to the processor itself, instead of the current “cat and mouse” game in software that exists with PatchGuard, as PatchGuard executes at the same privilege isolation level as code that might try to subvert it. Note that, in a hypervisor based system, hardware drivers would ideally be unable to cause damage (in terms of things like memory corruption and the like) to the kernel itself, which might eventually allow the system to continue functioning even if a driver fails. Of course, if drivers rely on being able to rewrite the kernel, this goal is clearly unattainable, and PatchGuard helps to ensure that in the future, there won’t be a backwards compatibility nightmare caused by a plethora of third-party drivers that rely on being able to directly alter the behavior of the kernel. (I suspect that x64 will supplant x86 in terms of being the operating sytem execution environment in the not-too-distant future). In usermode, x86 will likely live on for a very long time, but as x64 processors can execute x86 code at native speed, this is likely to not be an issue.

When PatchGuard is hypervisor-backed, it won’t be feasible to simply patch it out of existance, which means that ISVs will either have to comply with Microsoft’s requirements or find a way to evade PatchGUard entirely.

Overall, I think that there are both significant positive (and negative) aspects for PatchGuard. Whether it will turn out to be the best business decision (or the best experience for customers) remains to be seen; in an ideal world, though, only developers that really understand the full implications of what they are doing would patch the kernel (and only in safe ways), and things like PatchGuard would be unnecessary. I fear that this has become too much to expect for every programmer to do the right thing, though; one need only look at the myriad security vulnerabilities in software all over to see how little so many programmers care about correctness.

Programming against the x64 exception handling support, part 6: Frame consolidation unwinds

Sunday, January 14th, 2007

In the last post in the programming x64 exception handling series, I described how collided unwinds were implemented, and just how they operate. That just about wraps up the guts of unwinding (finally), except for one last corner case: So-called frame consolidation unwinds.

Consolidation unwinds are a special form of unwind that is indicated to RtlUnwindEx with a special exception code, STATUS_UNWIND_CONSOLIDATE. This exception code changes RtlUnwindEx’s behavior slightly; it suppresses the behavior of substituting the TargetIp argument to RtlUnwindEx with the unwound context’s Rip value.

As far as RtlUnwindEx goes, that’s all that there is to consolidation unwinds. There’s a bit more that goes on with this special form of unwind, though. Specifically, as with longjump style unwinds, there is special logic contained within RtlRestoreContext (used by RtlUnwindEx to realize the final, unwound execution context) that detects the consolidation unwind case (by virtue of the ExceptionCode member of the ExceptionRecord argument), and enables a special code path. As it is currently implemented, some of the logic relating to unwind consolidation is also resident within the pseudofunction RcFrameConsolidation. This function is tightly coupled with RtlUnwindContext; it is only separated into another logical function for purposes of describing RtlRestoreContext and RcFrameConsolidation in the unwind metadata for ntdll (or ntoskrnl).

The gist of what RtlRestoreContext/RcFrameConsolidation does in this case is essentially the following:

  1. Make a local copy of the passed-in context record.
  2. Treating ExceptionRecord->ExceptionInformation[0] as a callback function, this callback function is called (given a single argument pointing to the ExceptionRecord provided to RtlRestoreContext).
  3. The return address of the callback function is treated as a new Rip value to place in the context that is about to be restored.
  4. The context is restored as normal after the Rip value is updated based on the callback’s decision.

The callback routine pointed to by ExceptionRecord->ExceptionInformation[0] has the following signature:

// Returns a new RIP value to be used in the restored context.
typedef ULONG64 (* PFRAME_CONSOLIDATION_ROUTINE)(
   __in PEXCEPTION_RECORD ExceptionRecord
   );

After the confusing interaction between multiple instances of the case function that is collided unwinds, frame consolidation unwinds may seem a little bit anti-climactic.

Frame consolidation unwinds are typically used in conjunction with C++ try/catch/throw support. Specifically, when a C++ exception is thrown, a consolidation unwind is executed within a function that contains a catch handler. The frame consolidation callback is used by the CRT to actually invoke the various catch filters and handlers (these do not necessarily directly correspond to standard C SEH scope levels). The C++ exception handling routines use the additional ExceptionInformation fields available in a consolidation unwind in order to pass information about the exception object itself; this usage of the ExceptionInformation fields does not have any special support in the OS-level unwind routines, however. Once the C++ exception is going to cross a function-level boundary, in my observations, it is converted into a normal exception for purposes of unwinding and exception handling. Then, consolidation unwinds are used again to invoke any catch handlers encountered within the next function (if applicable).

Essentially, consolidation unwinds can be thought as a normal unwind, with a conditionally assigned TargetIp whose value is not determined until after all unwind handlers have been called, and the specified context has been unwound. In most circumstances, this functionality is not particularly useful (or critical) when speaking in terms of programming the raw OS-level exception support directly. Nonetheless, knowing how it works is still useful, if only for debugging purposes.

I’m not covering the details of how all of the C++ exception handling framework is built on top of the unwind handling routines in this post series; there is no further OS-level “awareness” of C++ exceptions within the OS’s unwind related support routines, however.

Next up: Tying it all together, or using x64’s improved exception handling support to our gain in the real world.

Things that don’t quite work right in Windows Vista

Friday, January 12th, 2007

Having switched to Windows Vista full-time, I’ve now had an opportunity to run into most of the little “daily annoyances” that detract from the general experience (at least, my experience, anyway – your mileage my vary). Many of these problems existed in Windows XP, but that doesn’t change the fact that they’re annoying and/or frustrating. Some of them are new to Vista. Following is a list of some of the big annoyances that bother me on a regular basis:

  1. Vista’s 1394 support is mostly broken. Specifically, the 1394 storage support (a-la sbp2port.sys) in Vista is pretty much horrific. This sad state of affairs is carried over from Windows XP and Windows Server 2003, and isn’t really new to Vista, but it’s highly frustrating nonetheless. It seems that the 1394 storage support is prone to randomly breaking (in my experience), and when it breaks, it breaks badly; usually, you end up with either a bluescreen in sbp2port.sys due to it corrupting one of it’s internal data structures, or the I/O system grinds to a halt as all IRPs that make their way to a 1394 storage device get “stuck” forever, essentially permanently pinning (and freezing) any process that touches a 1394 storage device. Since there are an amazing number of things that do this indirectly by asking for information about storage volumes, this typically manifests itself as an inability to do practically anything (logon, logoff, start programs, and soforth).

    This problem is particularly prone to happening when you disconnect a 1394 device, but I’ve also seen it spontaneously happen without user interaction (and without cables becoming lose, as far as I can tell). I’ve experienced these problems on Windows XP, Windows Server 2003, and Windows Vista (x64 and x86), across a wide range of 1394 chipsets and 1394 storage devices.

    In order to recover from this breakage, a hard reboot is usually necessary, since shutting down is virtually impossible due to any I/Os hitting a 1394 device getting frozen indefinitely.

    This is really a shame, as 1394 is a pretty nice bus for storage devices. Other uses of 1394 on Windows are fairly stable; kernel debugging works fine, and networking (used) to work fine, although Vista removes support for IP/1394 (this negatively impacted me, as it was fairly handy for transferring large amounts of data between laptops which typically have 1394 but not gigabit ethernet. Not the end of the world, but it is a feature that I used which disappeared out from under me).

  2. Hybrid sleep takes forever to complete the low power transition on a laptop with 2GB (or 1.5GB) of RAM. Microsoft advertises it as powering the system down in a few seconds for a laptop, but as far as I can tell, this is pure marketing fiction. It regularly takes 30-45 seconds (sometimes more) for hybrid sleep to finish doing its thing and transition the system to a low power state. I’ve observed this on at least two respectable laptops, so I’m fairly sure this isn’t just bad hardware and just a limitation of the hibernate-based technology. Still, despite the annoyance factor, I think hybrid sleep is an overall improvement (especially with being able to swap out batteries without fear if your system went into suspend due to low power).
  3. The Windows Vista RDP client has an ill-advised default behavior relating to the selection of the account domain name sent to the remote system when logging on. Unlike previous RDP client versions, the Vista RDP client requires you to enter credentials before connecting. The problem here is that if you don’t explicitly specify a domain name in the username part of your name, and you aren’t connecting to a target system with its netbios name, then your logon will virtually always fail. This is because the RDP client will prepend “hostname\” to the username sent to the remote system, where “hostname” is the hostname you tell the RDP client to connect to. This results in all sorts of stupidly broken scenarios, where mstsc provides an IP address as your logon domain, or a FQDN for a computer that isn’t a member of that domain, or a different FQDN that leads to a computer but doesn’t match the computer name. All of these will result in a failed logon, for no particularly good reason. The workaround to this is fairly simple; specify “\username” as your username in mstsc, and the local (SAM) account database of the remote system is used to log you on. Still, there is almost no reason to not default to this behavior…
  4. Credential saving in IE7 is unintuitive at best. Specifically, even if you check the “save my credentials” box, the credentials are mysteriously not used unless you explicitly put the target site in your Intranet zone. This is highly annoying until you figure out what’s going on, as the credential manager UI in the control panel shows your credentials as being saved, but they’re never actually used. At the very least, there needs to be some kind of UI feedback on the save credentials dialog if your current settings dictate that your saved credentials will never actually be used.
  5. Switching users will kill RAS (dial-up) connections. There used to be a nicely documented way for how to suppress this behavior on pre-Vista systems, but from disassembling and debugging Vista, it’s abundantly clear that there is no possible way to suppress this behavior in Vista, at least without patching rasmans.dll to not kill connections on logon/logoff). Vista does have an explicit exemption for a connection shared by Internet Connection Sharing (ICS), that prevents it from being killed on logon/logout, but even this is stupidly buggy: there are no checks done to make sure that dependant connections of the ICS RAS connection aren’t killed on logout. This is especially ironic, given all of the work that has been put into developing routing compartments for Vista and “Longhorn” (culminating in the nice advantages of them being completely ignored on Vista). For me, this is a real problem, as I often use a dependant RAS connection setup for remote Internet access: A dial-up connection through Bluetooth to my cell phone (EVDO) for physical Internet access, and then an L2TP/IPSec link on top of that for security. Unfortunately, this means that no matter what I do, each time I switch users in Vista, my mobile Internet access gets nuked. I’ve been pondering writing something to patch out this stupidness in rasmans.dll…
  6. Something seems to cause Vista to hold on to IP addresses after their associated adapter has gone down. I’ve noticed this with my Broadcom NetXtra 57xx gigabit NIC; my previous laptop had a similar problem with its Broadcom 100mbit NIC as well. I don’t know if this is a Broadcom problem or an OS-level problem, but what ends up happening is that tcpip still claims ownership of the IP address that you had assigned to you while that interface is disconnected (media unplugged). Normally, this isn’t a big problem, but if you have a setup where VPN’ing in and plugging into Ethernet will net you the same static IP on a network, you’ll occasionally run into bizzare problems where the VPN connection doesn’t get an IP address. This tends to be a result of Vista claiming that same IP address that you would be assigned via the VPN on an ethernet interface that is now media-disconnected (but was previously connected to the network in question and had the IP address in question). This problem isn’t new to Vista; I’ve seen it on Windows XP as well. I’ve pretty much given up on retaining the same IP for remoting and just being plugged directly into my network as a result of this problem…
  7. You can’t use runas on explorer or IE anymore. This means you’re forced to use Fast User Switching for things that UAC doesn’t have a convenient UI for, which also means that you’re in trouble if you’re using RAS for Internet access. Doh.
  8. Managing RAS connections is a royal pain. In order to make a copy of a RAS connection, rename it from the default name (“original name – Copy”), and edit the properties (presumably, you wanted to actually change the connection without just duplicating it for no reason, right?), you need to go through no less than three elevation prompts in rapid succession. This isn’t so terrible if you’re logged on as an admin, but having to type your admin password three different times (if you’re logged on as a limited user, the “recommended” configuration) is just stupidly redundant and annoying.
  9. There is a no way that I can tell to change the default domain name used in UAC elevation prompts (if you’re a non-admin user providing over-the-shoulder (OTS) credentials). This sucks if you’re on a domain, as the typical “recommended” scenario is you would be logged on with a domain account (and a limited user) on the local system. If you need to perform an administrative task, then you elevate using a local admin account (presumably, you don’t give all your users domain admin accounts). Unfortunately, Vista always defaults to your domain as the logon domain, instead of the local domain (for elevation prompts). This means that in the “recommended”, “secure” configuration, you have to type computername\ before your account name every single time you get an elevation prompt. This is a basic convenience thing that really needs a way to change the default logon domain, as you pretty much always use one or the other, and if you aren’t the network domain for elevation, you’re stuck doing completely redundant extra work each time you elevate.
  10. You can’t scroll down in the bottom pane of the “Networking” tab in Task Manager anymore, if you have enough adapters to cause a scroll bar to appear. Each time Task Manager refreshes, the listbox scrolls to the top and blows away whatever you were looking at. This (minor, but still annoying) issue is new to Vista.
  11. There is no way to modify Terminal Server connection ACLs. The tool to do this (tscc.msc) isn’t shipped on “workstation” systems. It’s there on Windows Server 2003, but not Windows XP. I suppose this is a “business decision” by Microsoft, but I happen to want to be able to make myself able to perform session connections from the command line or Task Manager without going through the switch user GUI. This is a pretty minor gripe, but it’s still something that wastes a bit of my time each time I need to switch sessions.

Okay, enough ranting about things that aren’t quite the way I would like in Vista. Despite these shortcomings (which you may or may not agree about), I still think that Vista’s worth having over XP (there are a whole lot of things that *do* work right in Vista, and plenty of useful new additions to boot). Nothing’s perfect, though, and improving awareness of issues can only improve the chances of them getting fixed, someday.

Oh, and if anyone’s got any suggestions or information about how to work around some of the problems I’ve talked about here, I’d be happy to hear them.