One useful utility for tracking down performance problems that you might not have heard of until now is kernrate, the Windows kernel profiler. This utility currently ships with the Windows Server 2003 Resource Kit Tools package (though you can use kernate on Windows XP is well) and is freely downloadable. Currently, you’ll have to match the version of kernrate you want to use with your processor architecture, so if you are using your processor in x64 mode with an x64 Windows edition, then you’ll have to dig up an x64 version of kernrate (the one that ships with the Srv03 resource kit tools is x86); KrView (see below) ships with an x64 compatible version of kernrate.
Kernrate requires that you have the SeProfilePrivilege assigned (which is typically only granted to administrators), so in most cases you will need to be a local administrator on your system in order to use it. This privilege allows access to the (undocumented) profile object system services. These APIs allow programmatic access to sample the instruction pointer at certain intervals (typically, a profiler program selects the timer interrupt for use with instruction pointer sampling). This allows you to get a feel for what the system is doing over time, which is in turn useful for identifying the cause of performance issues where a particular operation appears to be processor bound and taking longer than you would like.
There are a multitude of options that you can give kernrate (and you are probably best served by experimenting with them a bit on your own), so I’ll just cover the common ones that you’ll need to get started (use “kernrate -?” to get a list of all supported options).
Kernrate can be used to profile both user mode and kernel mode performance issues. By default, it operates only on kernel mode code, but you can override this via the -a (and -av) options, which cause kernrate to include user mode code in its profiling operations in addition to kernel mode code. Additionally, by default, kernrate operates over the entire system at once; to get meaningful results with profiling user mode code, you’ll want to specify a process (or group of processes) to profile, with the “-p pid” and/or “-n process-name” arguments. (The process name is the first 8 characters of a process’s main executable filename.)
To terminate collection of profiling data, use Ctrl-C. (On pre-Windows-Vista systems where you might be running kernrate.exe via runas, remember that Ctrl-C does not work on console processes started via runas.) Additionally, you can use the “-s seconds” argument to specify that profling should be automagically stopped after a given count of seconds have elapsed.
If you run kernrate on kernel mode code only, or just specify a process (or group of processes) as described above, you’ll notice that you get a whole lot of general system-wide output (information about interrupt counts, global processor time usage, context switch counts, I/O operation counts) in addition to output about which modules used a noteworthy amount of processor time. Here’s an example output of running kernrate on just the kernel on my system, as described above (including just the module totals):
D:\\Programs\\Utilities>kernrate Kernrate User-Specified Command Line: kernrate Kernel Profile (PID = 0): Source= Time, Using Kernrate Default Rate of 25000 events/hit Starting to collect profile data ***> Press ctrl-c to finish collecting profile data ===> Finished Collecting Data, Starting to Process Results ------------Overall Summary:-------------- [...] OutputResults: KernelModuleCount = 153 Percentage in the following table is based on the Total Hits for the Kernel Time 197 hits, 25000 events per hit -------- Module Hits msec %Total Events/Sec intelppm 67 980 34 % 1709183 ntkrnlpa 52 981 26 % 1325178 win32k 35 981 17 % 891946 hal 19 981 9 % 484199 dxgkrnl 6 980 3 % 153061 nvlddmkm 6 980 3 % 153061 fanio 3 981 1 % 76452 bcm4sbxp 2 981 1 % 50968 portcls 2 980 1 % 51020 STAC97 2 980 1 % 51020 bthport 1 981 0 % 25484 BTHUSB 1 981 0 % 25484 Ntfs 1 980 0 % 25510
Using kernrate in this fashion is a good first step towards profiling a performance problem (especially if you are working with someone else’s program), as it quickly allows you to narrow down a processor hog to a particular module. While this is useful as a first step, however, it doesn’t really give you a whole lot of information about what specific code in a particular mode is taking a lot of processor time.
To dig in deeper as to the cause of the problem (beyond just tracing it to a particular module), you’ll need to use the “-z module-name” option. This option tells kernrate to “zoom in” on a particular module; that is, for the given module, kernrate will track instruction pointer locations within the module to individual functions. This level of granularity is often what you’ll need for tracking down a performance issue (at least as far as profiling is concerned). You can repeat the “-z” option multiple times to “zoom in” to multiple modules (useful if the problem you are tracking down involves high processor usage across multiple DLLs or binaries).
Because kernrate is resolving instruction pointer sampling down to a more granular level than modules (with the “-z” option), you’ll need to tell it how to load symbols for all affected modules (otherwise, the granularity for profiler output will typically be very poor, often restricted to just exported functions). There are two ways to do this. First, you can use the “-j symbol-path” command line option; this option tells kernrate to pass a particular symbol path to DbgHelp for use with loading symbols. I recommend the second option, however, which is to configure your _NT_SYMBOL_PATH before-hand so that it points to a valid DbgHelp symbol path. This relieves you of having to manually tell kernrate a symbol path every time you execute it.
Continuing with the example I gave above, we might be interested in just what the “win32k” (the Win32 kernel mode support driver for USER/GDI) module is doing that was taking up 17% of the processor time spent in kernel mode on my system (for the interval that I was profiling). To do that, we can use the following command line (the output has been truncated only include information that we are interested in):
D:\\Programs\\Utilities>kernrate -z win32k Kernrate User-Specified Command Line: kernrate -z win32k Kernel Profile (PID = 0): Source= Time, Using Kernrate Default Rate of 25000 events/hit CallBack: Finished Attempt to Load symbols for 90a00000 \\SystemRoot\\System32\\win32k.sys Starting to collect profile data ***> Press ctrl-c to finish collecting profile data ===> Finished Collecting Data, Starting to Process Results ------------Overall Summary:-------------- [...] OutputResults: KernelModuleCount = 153 Percentage in the following table is based on the Total Hits for the Kernel Time 2465 hits, 25000 events per hit -------- Module Hits msec %Total Events/Sec ntkrnlpa 1273 14799 51 % 2150483 win32k 388 14799 15 % 655449 intelppm 263 14799 10 % 444286 hal 236 14799 9 % 398675 bcm4sbxp 66 14799 2 % 111494 spsys 55 14799 2 % 92911 nvlddmkm 48 14799 1 % 81086 STAC97 31 14799 1 % 52368 [...] ===> Processing Zoomed Module win32k.sys... ----- Zoomed module win32k.sys (Bucket size = 16 bytes, Rounding Down) -------- Percentage in the following table is based on the Total Hits for this Zoom Module Time 388 hits, 25000 events per hit -------- Module Hits msec %Total Events/Sec xxxInternalDoPaint 44 14799 10 % 74329 XDCOBJ::bSaveAttributes 20 14799 4 % 33786 DelayedDestroyCacheDC 20 14799 4 % 33786 HANDLELOCK::vLockHandle 15 14799 3 % 25339 mmxAlphaPerPixelOnly 15 14799 3 % 25339 XDCOBJ::RestoreAttributes 13 14799 2 % 21960 DoTimer 12 14799 2 % 20271 _SEH_prolog4 11 14799 2 % 18582 memmove 9 14799 2 % 15203 _GetDCEx 6 14799 1 % 10135 HmgLockEx 6 14799 1 % 10135 XDCOBJ::bCleanDC 5 14799 1 % 8446 XEPALOBJ::ulIndexToRGB 5 14799 1 % 8446 HmgShareCheckLock 4 14799 0 % 6757 RGNOBJ::bMerge 4 14799 0 % 6757 [...]
This should give you a feel for the kind of information that you’ll get from kernrate. Although the examples I gave were profiling kernel mode code, the whole process works the same way for user mode if you use the “-p” or “-n” options as I mentioned earlier. In conjunction with a debugger, the information that kernrate gives you can often be a great help in narrowing down CPU usage performance problems (or at the very least point you in the general direction as to where you’ll need to do further research).
There are also a variety of other options that are available in kernrate, such as features for gathering information about “hot” locks that have a high degree of contention, and support for launching new processes under the profiler. There is also support for outputting the raw sampled profile data, which can be used to graph the output (such as you might see used with tools like KrView).
Although kernrate doesn’t have all the “bells and whistles” of some of the high-end profiling tools (like Intel’s vTune), it’s often enough to get the job done, and it’s also available to you at no extra cost (and can be quickly and easily deployed to help find the source of a problem). I’d highly recommend giving it a shot if you are trying to analyze a performance problem and don’t already have a profiling solution that you are using.