Some days, nothing seems to work properly. This is the sad story of how something as innocent as trying to install a statistics graphing Web application culminated in my breaking out the kernel debugger in an attempt to get things working. (I don’t seem to have a lot of luck with web applications. So much for the way of the future being “easy to develop/deploy/use” web-based applications…)
Recently, I decided that to try installing Cacti in order to get some nice, pretty graphs describing resource utilization on several boxes at my apartment. Cacti is a PHP program that queries SNMP data and, with the help of a program called RRDTool, creates friendly historical graphs for you. It’s commonly used for monitoring things like network or processor usage over time.
In this particular instance, I was attempting to get Cacti working on a Windows Server 2003 x64 SP2 box. Running an amalgam of unix-ish programs on Windows is certainly “fun”, and doing it on native x64 is even more “interesting”. I didn’t expect to find myself in the kernel debugger while trying to get Cacti working, though…
To start out, the first thing I had to do was convert IIS6’s worker processes to 32-bit instead of 64-bit, as the standard PHP 5 distribution doesn’t support x64. (No, I don’t consider spending who knows how many hours to get PHP building on x64 natively a viable solution here, so I just decided to stick with the 32-bit release. I don’t particularly want to be in the habit of having to then maintain rebuild my own PHP distribution from a custom build environment each time security updates come out either…).
This wasn’t too bad (at least not at first); a bit of searching revealed this KB article that documented an IIS metabase flag that you can set to turn on 32-bit worker processes (with the help of the adsutil.vbs script included in the IIS Adminscripts directory).
One small snag here was that I happened to be running a symbol proxy in native x64 mode on this system already. Since the 32-bit vs 64-bit IIS worker process flag is an all-or-nothing option, I had to go install the 32-bit WinDbg distribution on this system and copy over the 32-bit symproxy.dll and symsrv.dll into %systemroot%\system32\inetsrv. Additionally, the registry settings used by the 64-bit symproxy weren’t directly accessible to the 32-bit version (due to a compatiblity feature in 64-bit versions of Windows known as Registry Reflection), so I had to manually copy over the registry settings describing which symbol paths the symbol proxy used to the Wow64 version of HKLM\Software. No big deal so far, just a minor annoyance.
The first unexpected problem that cropped up happened after I had configured the 32-bit symbol proxy ISAPI filter and installed PHP; after I enabled 32-bit worker processes, IIS started tossing HTTP 500 Internal Server Error statuses whenever I tried to browse any site on the box. Hmm, not good…
After determining that everything was still completely broken even after disabling the symbol proxy and PHP ISAPI modules, I discovered some rather unpleasant-looking event log messages:
ISAPI Filter ‘%SystemRoot%\Microsoft.NET\Framework64\v2.0.50727\aspnet_filter.dll’ could not be loaded due to a configuration problem. The current configuration only supports loading images built for a x86 processor architecture. The data field contains the error number. To learn more about this issue, including how to troubleshooting this kind of processor architecture mismatch error, see http://go.microsoft.com/fwlink/?LinkId=29349.
It seemed that the problem was the wrong version of ASP.NET being loaded (still the x64 version). The link in the event message wasn’t all that helpful, but a bit of searching located yet another knowledge base article – this time, about how to switch back and forth between 32-bit and 64-bit versions of ASP.NET. After running aspnet_regiis as described in that article, IIS was once again in a more or less working state. Another problem down, but the worst was yet to come…
With IIS working again, I turned towards configuring Cacti in IIS. Although, at first it appeared as though everything might actually go as planned (after configuring Cacti’s database settings, I soon found myself at its php-based initial configuration page), such things were not meant to be. The first sign of trouble appeared after I completed the initial configuration page and attempted to log on with the default username and password. Doing so resulted in my being thrown back to the log on page, without any error messages. A username and password combination not matching the defaults did result in a logon failure error message, so something besides a credential failure was up.
After some digging around in the Cacti sources, it appeared that the way that Cacti tracks whether a user is logged in or not is via setting some values in the standard PHP session mechanism. Since Cacti was apparently pushing me back to the log on page as soon as I logged on, I guessed that there was probably some sort of failure with PHP’s session state management.
Rewind a bit to back when I installed PHP. In the interest of expediency (hah!), I decided to try out the Win32 installer package (as opposed to just the zip distribution for a manual install) for PHP. Typically, I’ve just installed PHP for IIS the manual way, but I figured that if they had an installer nowadays, it might be worth giving it a shot and save myself some of the tedium.
Unfortunately, it appears that PHP’s installer is not all that intelligent. It turns out that in the IIS ISAPI mode, PHP configures the system-wide PHP settings to point the session state file directory to the user-specific temp directory (i.e. pointing to a location under %userprofile%). This, obviously, isn’t going to work; anonymous users logged on to IIS aren’t going to have access to the temp directory of the account I used to install PHP with.
After manually setting up a proper location for PHP’s session state with the right security permissions (and reconfiguring php.ini to match), I tried logging in to Cacti again. This time, I actually got to the main screen after changing the password (hooray, progress!).
From here, all that I had left to do was some minor reconfiguring of the Windows SNMP service in order to allow Cacti to query it, set up the Cacti poller task job (Cacti is designed to poll data from its configured data sources at regular intervals), and configure my graphs in Cacti.
Configuring SNMP wasn’t all that difficult (though something I hadn’t done before with the Windows SNMP service), and I soon had Cacti successfully querying data over SNMP. All that was left to do was graph it, and I was home free…
Unfortunately, getting Cacti to actually graph the data turned out to be rather troublesome. In fact, I still haven’t even got it working, though I’ve at least learned a bit more about just why it isn’t working…
When I attempted to create graphs in Cacti, everything would appear to work okay, but no RRDTool datafiles would ever appear. No amount of messing with filesystem permissions resolved the problem, and the Cacti log files were not particularly helpful (even on debug severity). Additionally, attempting to edit graph properties in Cacti would result in that HTTP session mysteriously hanging forever more (definitely not a good sign). After searching around (unsuccessfully) for any possible solutions, I decided to try and take a closer look at what exactly was going on when my requests to Cacti got stuck.
Checking the process list after repeating the sequence that caused a particular Cacti session to hang several times, I found that there appeared to be a pair of cmd.exe and rrdtool.exe instances corresponding to each hung session. Hmm, it would appear that something RRDTool was doing was freezing and PHP was waiting for it… (PHP uses cmd.exe to call RRDTool, so I guessed that PHP would be waiting for cmd.exe, which would be waiting for RRDTool).
At first, I attempted to attach to one of the cmd processes with WinDbg. (Incidentally, it would appear that there are currently no symbols for the Wow64 versions of the Srv03SP2 ntdll, kernel32, user32, and a large number of other core DLLs with Wow64 builds available on the Microsoft symbol server for some reason. If any Microsoft people are reading this, it would be greaaaat if you could fix the public symbol server for Srv03 SP2 x64 Wow64 DLLs …) However, symbols for cmd.exe were fortunately available, so it was relatively easy to figure out what it was up to, and prove my earlier hypothesis that it was simply waiting on an rrdtool instance:
0:001:x86> ~1k
ChildEBP RetAddr
0012fac4 7d4d8bf1 ntdll_7d600000!NtWaitForSingleObject+0x15
0012fad8 4ad018ea KERNEL32!WaitForSingleObject+0x12
0012faec 4ad02611 cmd!WaitProc+0x18
0012fc24 4ad01a2b cmd!ExecPgm+0x3e2
0012fc58 4ad019b3 cmd!ECWork+0x84
0012fc70 4ad03c58 cmd!ExtCom+0x40
0012fe9c 4ad01447 cmd!FindFixAndRun+0xa9
0012fee0 4ad06cf6 cmd!Dispatch+0x137
0012ff44 4ad07786 cmd!main+0x108
0012ffc0 7d4e7d2a cmd!mainCRTStartup+0x12f
0012fff0 00000000 KERNEL32!BaseProcessInitPostImport+0x8d
0:001:x86> !peb
[...]
CommandLine: 'cmd.exe /c c:/progra~2/rrdtool/rrdtool.exe -'
[...]
Given this, the next logical step to investigate would be the RRDTool.exe process. Unfortunately, something really weird seemed to be going on with all the RRDTool.exe processes (naturally). WinDbg would give me an access denied error for all of the RRDTool PIDs in the F6 process list, despite my being a local machine administrator.
Attempting to attach to these processes failed as well:
Microsoft (R) Windows Debugger Version 6.6.0007.5
Copyright (c) Microsoft Corporation. All rights reserved.Cannot debug pid 4904, NTSTATUS 0xC000010A
“An attempt was made to duplicate an object handle into or out of an exiting process.”
Debuggee initialization failed, NTSTATUS 0xC000010A
“An attempt was made to duplicate an object handle into or out of an exiting process.”
This is not something that you want to be seeing on a server box. This particular error means that the process in question is in the middle of being terminated, which prevents a debugger from successfully attaching. However, processes typically terminate in timely fashion; in fact, it’s almost unheard of to actually see a process in the terminating state, since it happens so quickly. However, in this particular instances, the RRDTool processes were remaining in this half-dead state for what appeared to be an indefinite interval.
There are two things that commonly cause this kind of problem, and all of them are related to the kernel:
- The disk hardware is not properly responding to I/O requests and they are hanging indefinitely. This can block a process from exiting while the operating system waits for an I/O to finishing canceling or completing. Since this particular box was brand new (and with respectable, high-quality server hardware), I didn’t think that failing hardware was the cause here (or at least, I certainly hoped not!). Given that there were no errors in the event log about I/Os failing, and that I was still able to access files on my disks without issue, I decided to rule this possiblity out.
- A driver (or other kernel mode code in the I/O stack) is buggy and is not allowing I/O requests to be canceled or completed, or has deadlocked itself and is not able to complete an I/O request. (You might be familiar with the latter if you’ve tried to use the 1394 mass storage support in Windows for a non-trivial length of time.) Given that I had tentatively ruled out bad hardware, this would seem to be the most likely cause here.
Since the frozen process would be stuck in kernel mode, in either case, to proceed any further I would need to use the kernel debugger. I decided to start out with local kd, as that is sufficient for at least retrieving thread stacks and doing basic passive analysis of potential deadlock issues where the system is at least mostly still functional.
Sure enough, the stuck RRDTool process I had unsuccessfully tried to attach to was blocked in kernel mode:
lkd> !process 0n4904
Searching for Process with Cid == 1328
PROCESS fffffadfcc712040
SessionId: 0 Cid: 1328 Peb: 7efdf000 ParentCid: 1354
DirBase: 5ea6c000 ObjectTable: fffffa80041d19d0 HandleCount: 68.
Image: rrdtool.exe
[...]
THREAD fffffadfcca9a040 Cid 1328.1348 Teb: 0000000000000000 Win32Thread: 0000000000000000 WAIT: (Unknown) KernelMode Non-Alertable
fffffadfccf732d0 SynchronizationEvent
Impersonation token: fffffa80041db980 (Level Impersonation)
DeviceMap fffffa8001228140
Owning Process fffffadfcc712040 Image: rrdtool.exe
Wait Start TickCount 6545162 Ticks: 367515 (0:01:35:42.421)
Context Switch Count 445 LargeStack
UserTime 00:00:00.0000
KernelTime 00:00:00.0015
Win32 Start Address windbg!_imp_RegCreateKeyExW (0x0000000000401000)
Start Address 0x000000007d4d1510
Stack Init fffffadfc4a95e00 Current fffffadfc4a953b0
Base fffffadfc4a96000 Limit fffffadfc4a8f000 Call 0
Priority 10 BasePriority 8 PriorityDecrement 1
RetAddr Call Site
fffff800`01027752 nt!KiSwapContext+0x85
fffff800`0102835e nt!KiSwapThread+0x3c9
fffff800`013187ac nt!KeWaitForSingleObject+0x5a6
fffff800`012b2853 nt!IopAcquireFileObjectLock+0x6d
fffff800`01288dff nt!IopCloseFile+0xad
fffff800`01288f0e nt!ObpDecrementHandleCount+0x175
fffff800`0126ceb0 nt!ObpCloseHandleTableEntry+0x242
fffff800`0128d7a6 nt!ExSweepHandleTable+0xf1
fffff800`012899b6 nt!ObKillProcess+0x109
fffff800`01289d3b nt!PspExitThread+0xa3a
fffff800`0102e3fd nt!NtTerminateProcess+0x362
00000000`77ef0caa nt!KiSystemServiceCopyEnd+0x3
0202c9fc`0202c9fb ntdll!NtTerminateProcess+0xa
Hmm… not quite what I expected. If a buggy driver was involved, it should have at least been somewhere on the call stack, but in this particular instance all we have is ntoskrnl code, straight from the system call to the wait that isn’t coming back. Something was definitely wrong in kernel mode, but it wasn’t immediately clear what was causing it. It appeared as if the kernel was blocked on the file object lock (which, to my knowledge, is used to guard synchronous I/O’s that are issued for a particular file object), but, as the file object lock is built upon KEVENTs, the usual lock diagnostics extensions (like `!locks’) would not be particularly helpful. In this instance, what appeared to be happening was that the process rundown logic in the kernel was attempting to release all still-open handles in the exiting RRDTool process, and it was (for some reason) getting stuck while trying to close a handle to a particular file object.
I could at least figure out what file was “broken”, though, by poking around in the stack of IopCloseFile:
lkd> !fileobj fffffadf`ccf73250
\Temp\php\session\sess_bkcavai8fak8antv9coq46at95
LockOperation Set Device Object: 0xfffffadfce423370 \Driver\dmio
Vpb: 0xfffffadfce864840
Access: Read Write SharedRead SharedWrite
Flags: 0x40042
Synchronous IO
Cache Supported
Handle Created
File Object is currently busy and has 1 waiters.
FsContext: 0xfffffa800390e110 FsContext2: 0xfffffa8000106a10
CurrentByteOffset: 0
Cache Data:
Section Object Pointers: fffffadfcd601c20
Shared Cache Map: fffffadfccfdebb0 File Offset: 0 in VACB number 0
Vacb: fffffadfce97fb08
Your data is at: fffff98070e80000
From here, there are a couple of options:
- We could look for processes with an open handle to that file and check their stacks.
- We could look for an IRP associated with that file object and try and trace our way back from there.
Initially, I tried the first option, but this ended up not working particularly well. I attempted to use Process Explorer to locate all processes that had a handle to that file, but this ended up failing rather miserably as Process Explorer itself got deadlocked after it opened a handle to the file. This was actually rather curious; it turned out that processes could open a handle to this “broken” file just fine, but when they tried to close the handle, they would get blocked in kernel mode indefinitely.
That unsuccessful, I tried the second option, which is made easier by the use of `!irpfind’. Normally, this extension is very slow to operate (over a serial cable), but local kd makes it quite usable. This revealed something of value:
lkd> !irpfind -v 0 0 fileobject fffffadf`ccf73250
Looking for IRPs with file object == fffffadfccf73250
Scanning large pool allocation table for Tag: Irp? (fffffadfccdf6000 : fffffadfcce56000)
Searching NonPaged pool (fffffadfcac00000 : fffffae000000000) for Tag: Irp?
Irp [ Thread ] irpStack: (Mj,Mn) DevObj [Driver] MDL Process
fffffadfcc225380: Irp is active with 7 stacks 7 is current (= 0xfffffadfcc225600)
No Mdl: No System Buffer: Thread fffffadfccea27d0: Irp stack trace.
cmd flg cl Device File Completion-Context
[ 0, 0] 0 0 00000000 00000000 00000000-00000000
Args: 00000000 00000000 00000000 00000000
[...]
>[ 11, 1] 2 1 fffffadfce7b6040 fffffadfccf73250 00000000-00000000 pending
\FileSystem\Ntfs
Args: fffffadfcd70e0a0 00000000 00000000 00000000
There was an active IRP for this file object. Hopefully, it could be related to whatever is holding the file object lock for that file object. Digging a bit deeper, it’s possible to determine what thread is associated with the IRP (if it’s a thread IRP), and from there, we can grab a stack (which might just give us the smoking gun we’re looking for)…:
lkd> !irp fffffadfcc225380 1
Irp is active with 7 stacks 7 is current (= 0xfffffadfcc225600)
No Mdl: No System Buffer: Thread fffffadfccea27d0: Irp stack trace.
Flags = 00000000
ThreadListEntry.Flink = fffffadfcc2253a0
ThreadListEntry.Blink = fffffadfcc2253a0
[...]
CancelRoutine = fffff800010ba930 nt!FsRtlPrivateResetLowestLockOffset
[...]
lkd> !thread fffffadfccea27d0
THREAD fffffadfccea27d0 Cid 10f8.138c Teb: 00000000fffa1000 Win32Thread: fffffa80023cd860 WAIT: (Unknown) UserMode Non-Alertable
fffffadfccf732e8 NotificationEvent
Impersonation token: fffffa8002c62060 (Level Impersonation)
DeviceMap fffffa8002f3b7b0
Owning Process fffffadfcc202c20 Image: w3wp.exe
Wait Start TickCount 6966187 Ticks: 952 (0:00:00:14.875)
Context Switch Count 1401 LargeStack
UserTime 00:00:00.0000
KernelTime 00:00:00.0000
Win32 Start Address 0x00000000003d87d8
Start Address 0x000000007d4d1504
Stack Init fffffadfc4fbee00 Current fffffadfc4fbe860
Base fffffadfc4fbf000 Limit fffffadfc4fb8000 Call 0
Priority 10 BasePriority 8 PriorityDecrement 0
RetAddr : Call Site
fffff800`01027752 : nt!KiSwapContext+0x85
fffff800`0102835e : nt!KiSwapThread+0x3c9
fffff800`012afb38 : nt!KeWaitForSingleObject+0x5a6
fffff800`0102e3fd : nt!NtLockFile+0x634
00000000`77ef14da : nt!KiSystemServiceCopyEnd+0x3
00000000`00000000 : ntdll!NtLockFile+0xa
This might just be what we’re looking for. There’s a thread in w3wp.exe (the IIS worker process), which is blocking on a synchronous NtLockFile call for that same file object that is in the “broken” state. Since I’m running PHP in ISAPI mode, this does make sense – if PHP is doing something to that file (which it could certainly be, since it’s a PHP session state file as we saw above), then it should be in the context of w3wp.exe.
In order to get a better user mode stack trace as to what might be going on, I was able to attach a user mode debugger to w3wp.exe and get a better picture as to what the deal was:
0:006> .effmach x86
Effective machine: x86 compatible (x86)
0:006:x86> ~6s
ntdll_7d600000!ZwLockFile+0x12:
00000000`7d61d82e c22800 ret 28h
0:006:x86> k
ChildEBP RetAddr
WARNING: Stack unwind information not available. Following frames may be wrong.
014edf84 023915a2 ntdll_7d600000!ZwLockFile+0x12
014edfbc 0241d886 php5ts!flock+0x82
00000000 00000000 php5ts!zend_reflection_class_factory+0xb576
It looks like that thread is indeed related to PHP; PHP is trying to acquire a file lock on the session state file. With a bit of work, we can figure out just what kind of lock it was trying to acquire.
The prototype for NtLockFile is as so:
// NtLockFile locks a region of a file.
NTSYSAPI
NTSTATUS
NTAPI
NtLockFile(
IN HANDLE FileHandle,
IN HANDLE Event OPTIONAL,
IN PIO_APC_ROUTINE ApcRoutine OPTIONAL,
IN PVOID ApcContext OPTIONAL,
OUT PIO_STATUS_BLOCK IoStatusBlock,
IN PULARGE_INTEGER LockOffset,
IN PULARGE_INTEGER LockLength,
IN ULONG Key,
IN BOOLEAN FailImmediately,
IN BOOLEAN ExclusiveLock
);
Given this, we can easily deduce the arguments off a stack dump:
0:006:x86> dd @esp+4 l0n10
00000000`014edf48 000002b0 00000000 00000000 014edfac
00000000`014edf58 014edfac 014edf74 014edf7c 00000000
00000000`014edf68 00000000 00000001
0:006:x86> dq 014edf74 l1
00000000`014edf74 00000000`00000000
0:006:x86> dq 014edf7c l1
00000000`014edf7c 00000000`00000001
It seems that PHP is trying to acquire an exclusive lock for a range of 1 byte starting at offset 0 in this file, with NtLockFile configured to wait until it acquires the lock.
Putting this information together, it’s now possible to surmise what is going on here:
- The child processes created by php have a file handle to the session state file (probably there from process creation inheritance).
- PHP tries to acquire an exclusive lock on part of the session state file. This takes the file object lock for that file and waits for the file to become exclusively available.
- The child process exits. Now, it tries to acquire the file object lock so that it can close its file handle. However, the file object lock cannot be acquired until the child process releases its handle as the handle is blocking PHP’s NtLockFile from completing.
- Deadlock! Neither thread can continue, and PHP appears to hang instead of configuring my graphs properly.
In this particular instance, it was actually possible to “recover” from the deadlock without rebooting; the IIS worker process’s wait in NtLockFile is marked as a UserMode wait, so it is possible to terminate the w3wp.exe process, which releases the file object lock and ultimately allows all the frozen processes that are trying to close a handle to the PHP session state file to finish the close handle operation and exit.
This is actually a nasty little problem; it looks like it’s possible for one user mode process to indefinitely freeze another user mode process in kernel mode via a deadlock. Although you can break the deadlock by terminating the second user mode process, the fact that a user mode process can, at all, cause the kernel to deadlock during process exit (“breakable” or not) does not appear to be a good thing to me.
Meanwhile, knowing this right now doesn’t really solve my problem. Furthermore, I suspect that there’s probably a different problem here too, as the command line that was given to RRDTool (simply “-“) doesn’t look all that valid to me. I’ll see if I can come up with some way to work around this deadlock problem, but it definitely looks like an unpleasant one. If it really is a file handle being incorrectly inherited to a child process, then it might be possible to un-mark that handle for inheritance with some work. The fact that I am having to consider making something to patch PHP to work around this is definitely not a happy one, though…
Silly me for thinking that it would just take a kernel debugger to get a web application running…
Passing “-” as the command line usually means input will be taken from stdin, so it shouldn’t be a problem.
It’s that synchronous I/O in Windows is badly designed, because it makes system calls block inside a lock. The same reason System.Io.Stream.Synchronized in .NET is a really stupid idea: it’s a wrapper for Stream objects that makes them “thread-safe” by wrapping all operations with a lock (think about it). I don’t think the implications of passing file around through handle duplication (as opposed to reopening – POSIX dup/dup2, which is often implemented through handle duplication, is actually a reopen) have sunk in yet
Yeah, I’ve been coming to the same conclusion while investigating this issue the past few days – whoever designed how byte range locking works in NT just didn’t consider how that would work given handle duplication/inheritance and the synchronous file I/O lock.
The more I look at this problem, it seems like the way locking was designed and implemented is just not completely baked, and that given the infrequency of handle duplication in conjuction with byte range locks, it just slipped through the cracks.