x64 Debugging Review

July 17th, 2006

Here’s an index of all of the x64 debugging topics I have covered thus far. This series takes you through the experience of doing user mode debugging on x64, including native x64 debugging, Wow64 debugging, and the various different combinations of 32-bit and 64-bit debuggers that you’ll find available to you on a 64-bit machine (and when to use which one).

  1. Introduction to x64 debugging, part 1
  2. Introduction to x64 debugging, part 2
  3. Introduction to x64 debugging, part 3
  4. Introduction to x64 debugging, part 4
  5. Introduction to x64 debugging, part 5

Ed: This was back-posted to appear after the last x64 debugging posting for sorting purposes.

Introduction to x64 debugging, part 5

July 17th, 2006

If you are porting a program to x64, one of the first things that you might have to debug are 64-bit portability problems. The most common types of these problems are pointer truncation problems, where assumptions are made by your (previously 32-bit) program that a LONG/DWORD/other 32-bit integral type can completely contain a pointer. On x64, this is no longer the case, and if you happen to be given a pointer with more than 32 significant bits, you’ll probably crash.
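To make the failure mode concrete, here is a minimal sketch of what happens when a pointer-sized value takes a round trip through a 32-bit variable. The function names are made up for illustration, and the addresses in the usage note are arbitrary examples rather than values from a real process.

```c
#include <stdint.h>

/* Returns nonzero if the address needs more than 32 bits to represent. */
static int needs_more_than_32_bits(uint64_t address)
{
    return (address >> 32) != 0;
}

/* Simulates storing a pointer-sized value in a DWORD-sized variable
 * and then widening it back to 64 bits. */
static uint64_t round_trip_through_dword(uint64_t address)
{
    uint32_t truncated = (uint32_t)address; /* the upper 32 bits are lost here */
    return (uint64_t)truncated;
}
```

A low address such as 0x0012fa28 survives the round trip unchanged, which is why these bugs can hide for a long time. An address such as 0x000007fffffde000 comes back as 0xfffde000, which is a different (and probably invalid) location.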

Although the compiler has very good support for helping detect some of these problems (the /Wp64 command line option, or “Detect 64-bit portability issues” in the VC++ GUI), sometimes it won’t catch all of them. Fortunately, using your debugging knowledge, you can help catch many of these problems very quickly.

There is some built-in support to do this already, in the way of a feature that forces the operating system to load DLLs top-down on 64-bit Windows. This means that instead of starting at the low end of the user mode address space and going upwards when looking for free address space, the memory manager will start at the high end and move downwards (when loading DLLs). In practical terms, this means that instead of usually getting base addresses that are entirely contained within 32 significant bits of address space, you will often get load addresses that are above the 4GB boundary, thus quickly exposing pointer truncation problems with global variable pointers or function pointers. You can enable this support with the gflags utility in the Debugging Tools for Windows package.

Unfortunately, as far as I could tell, there isn’t any corresponding functionality to randomize other memory allocations. This means that things like heap allocations or VirtualAlloc-style allocations will still often get back pointers that are below 4GB, which can result in pointer truncation bugs being masked when you are testing your program and only showing up in high load conditions, maybe even on a customer site. Not good!

However, we can work around this with a conditional breakpoint in the DTW debuggers. Conditional breakpoints are extremely useful, and here we’ll use one to set a flag that forces top-down allocation on the lowest level memory allocation routine that is accessible to user mode: NtAllocateVirtualMemory. (The Win32 heap manager, and things built on top of it such as new or malloc, ultimately call this routine.) This function is the system call interface to ask the memory manager to allocate a block of address space (and possibly commit it). Both VirtualAlloc and the heap manager are implemented on top of it, so by passing the appropriate flag here, we can guarantee that almost all user mode allocations will be top down.

How do we do this? Well, it’s actually pretty simple. Create a process under the debugger and then enter the following command:

bp ntdll!NtAllocateVirtualMemory "eq @rsp+28 qwo(@rsp+28)|100000;g"

This command sets a breakpoint on NtAllocateVirtualMemory that sets the 0x100000 flag in the fifth parameter (recall my previous discussion on x64 calling conventions). After altering that parameter, execution is resumed and the program continues to run normally.

If we look at the prototype for NtAllocateVirtualMemory:

// NtAllocateVirtualMemory allocates
// virtual memory in the user mode
// address range.
NTSYSAPI
NTSTATUS
NTAPI
NtAllocateVirtualMemory(
IN HANDLE ProcessHandle,
IN OUT PVOID *BaseAddress,
IN ULONG ZeroBits,
IN OUT PULONG AllocationSize,
IN ULONG AllocationType,
IN ULONG Protect
);
 

 

… we can see that we are modifying the “AllocationType” parameter. Compare this to the documentation of the VirtualAlloc function, and you’ll see what is going on here (the flAllocationType parameter is passed as AllocationType). The flag we passed is MEM_TOP_DOWN, which, according to MSDN, “allocates memory at the highest possible address”.
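As a rough model of what the breakpoint command does each time it fires, the following sketch treats the stack at function entry as an array of 64-bit slots and ORs the flag into the fifth-argument slot. The flag values match the Windows SDK definitions; the function name and the array model are illustrative assumptions, not debugger internals.

```c
#include <stdint.h>

#define MEM_COMMIT   0x00001000u
#define MEM_RESERVE  0x00002000u
#define MEM_TOP_DOWN 0x00100000u

/* Byte offset of the fifth argument from rsp at function entry:
 * 8 bytes of return address plus 32 bytes of register home space. */
#define FIFTH_ARG_OFFSET 0x28

/* stack_at_entry models the quadwords starting at rsp when the
 * breakpoint hits. This mirrors "eq @rsp+28 qwo(@rsp+28)|100000". */
static void force_top_down(uint64_t *stack_at_entry)
{
    stack_at_entry[FIFTH_ARG_OFFSET / 8] |= MEM_TOP_DOWN;
}
```

For a call that passed MEM_COMMIT | MEM_RESERVE (0x3000), the slot holds 0x103000 after the modification, and every other slot is left alone.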

After performing this modification, most allocations will have more than 32 significant bits, which will help catch pointer truncation bugs that deal with dynamic memory allocations very quickly.

Earlier, I said that this will only affect most allocations. There are a couple of caveats for this technique:

  • It does not modify data section view mappings (file mappings). I leave it as an exercise for the reader to make a similar conditional breakpoint for ntdll!NtMapViewOfSection.
  • It does not catch the first heap segment in the first heap (the process heap) normally, unless you go out of your way to apply the breakpoint before the process heap is created. One workaround is to just add some dummy allocations at the start of the program to consume the first heap segment, such that subsequent allocations are forced to go through a new heap segment which will be allocated in the high end of the address space.

Despite these limitations, however, I think you’ll find this to be an effective tool to help catch pointer truncation bugs quickly.

For my next few posts, I’m going to take a break from x64 debugging topics and focus on a different topic for a bit. Stay tuned!

Update: Pavel Lebedinsky commented that you can set HKLM\System\CurrentControlSet\Control\Session Manager\Memory Management\AllocationPreference (REG_DWORD) to 0x100000 to achieve a similar effect as the steps I posted, without some of the caveats of the conditional breakpoint I described above (in particular, the initial heap segments will reside in the high end of the address space).  This is a more elegant solution than the one I proposed, so I would recommend using it instead.  Note that this alters the allocation behavior on a system-wide basis instead of a process-wide basis.

Nothing a little cardboard won’t fix…

July 16th, 2006

The box that I have been hosting this blog on has been having hardware problems lately, or so it seems. Well, that’s never a fun thing, certainly. It’s been hard-locking periodically for awhile now, enough that I can’t get anything out of it via the “Special Administration Console” (SAC) – the Windows version of a serial console. I can’t even break in with the kernel debugger when it locks up, so I’m fairly sure it is hardware related.

Now, a bit of background; this isn’t exactly a new box, and in fact I got it used (for free), so I can’t really complain too much about it. But, suffice to say it has seen better days. There is only one screw holding down the motherboard (!) and both of the NICs that the box came with, well, suck – a Netgear FA310 (the predecessor to the infamously terrible FA311) that doesn’t even have drivers past NT4, and a VIA NIC that occasionally stopped receiving traffic until reboots. Well, I’ve since replaced the NICs with good ol’ reliable 3COM 3C905C-TX-based cards and that at least stopped the NIC-related problems the box has had. (The box needs two NICs as it is my gateway system that sits in between the cable modem and the rest of my LAN here.)

Anyways, there still remained the problem of the periodic complete system lockups. I tried repositioning the computer a few times (on its side, put it sitting up straight, etc), but nothing really helped. This was getting fairly annoying, as it would die every couple of days, naturally in the middle of some extremely inconvenient situation in which to be disconnected in Wow. Since the box only had one motherboard screw, I suspected that it might be shorting out against something (as it is not really very well secured against the case). Unfortunately, I didn’t have any screws compatible with the motherboard/case on hand (my other computers here are laptops). So, I set to looking for other solutions; what I came up with is none other than cardboard. Specifically, I ended up just slipping a chunk of cardboard in between the motherboard and the part of the case that it seemed most likely to be shorting out against.

Ever since, I’ve been lockup-free for at least a couple days now. Hopefully this will tide me over for now until I manage to grab some screws…

Introduction to x64 debugging, part 4

July 14th, 2006

Last time, I talked about how exception handling and unwinding work in x64, what it means to you when debugging, and how you can access exception handlers from the debugger. In this installment, I’ll be covering some more of the common pitfalls that can sneak up and bite you when doing Wow64 debugging with the native x64 debugger.

As I had alluded to in the first installment of this series, debugging Wow64 programs with the x64 debugger introduces a lot of extra complexity. I had already illustrated one of the major annoyances – that you need to manually toggle between the x86 and x64 contexts in many places.

The problems don’t end there, though. Many extensions, especially legacy extensions that were written long before x64 was introduced, do not handle the Wow64 case gracefully. This results from extensions not properly checking the current effective processor (IDebugControl::GetEffectiveProcessorType). This is something to watch out for if you are writing a debugger extension of your own, as it is no longer enough to just see if the target uses 64-bit pointers or not, since with Wow64 debugging, the target processor type can change rapidly within the debugging session as the user switches modes with the “.effmach” command.

One example of a very useful extension that breaks like this is “!locks”, which analyzes the list of critical sections in a process (maintained by NTDLL) in order to help provide information about deadlocks. The !locks extension will always currently operate on the 64-bit critical section list, which makes it difficult to debug deadlocks in Wow64 programs with the native debugger.

Another common cause for confusion with Wow64 debugging is that references to NTDLL may not actually do what you expect. Under Wow64, there are actually two copies of NTDLL in every 32-bit process; the native 64-bit NTDLL, used by the Wow64 layer itself, and a modified version of the original 32-bit NTDLL (which thunks to Wow64 instead of making system calls itself). The problem here is that if you reference the name “ntdll”, you will tend to get the 64-bit version of ntdll back, even if you are in x86 mode. For instance, consider the following:

0:026:x86> u ntdll!NtClose
ntdll!ZwClose:
00000000`78ef1350 4c      dec     esp
00000000`78ef1351 8bd1    mov     edx,ecx
00000000`78ef1353 b80c000000 mov  eax,0xc
00000000`78ef1358 0f05    syscall
00000000`78ef135a c3      ret
00000000`78ef135b 666690  nop
00000000`78ef135e 6690    nop
ntdll!NtQueryObject:
00000000`78ef1360 4c      dec     esp
0:026:x86> .effmach .
Effective machine: x64 (AMD64)
0:026> u ntdll!NtClose
ntdll!ZwClose:
00000000`78ef1350 4c8bd1           mov     r10,rcx
00000000`78ef1353 b80c000000       mov     eax,0xc
00000000`78ef1358 0f05             syscall
00000000`78ef135a c3               ret
00000000`78ef135b 666690           nop
00000000`78ef135e 6690             nop
ntdll!NtQueryObject:
00000000`78ef1360 4c8bd1           mov     r10,rcx
00000000`78ef1363 b80d000000       mov     eax,0xd

Here, we got the same address back even if we switched to x86 mode, and as a result the code we tried to disassemble wasn’t valid (because of the new instruction prefixes added by x64). This can get particularly insidious if you are trying to set a breakpoint in the middle of an ntdll function, since if you are not careful, you might set a breakpoint in the wrong copy of ntdll (and probably in the middle of an instruction, which would likely lead to a crash later on instead of the expected stop at a breakpoint exception). If you want to reference the 32-bit ntdll, then you have to use a special name that is a concatenation of the string “ntdll_” and the base address at which the 32-bit ntdll was loaded. For instance:

0:026:x86> u ntdll_7d600000!NtClose
ntdll_7d600000!NtClose:
00000000`7d61c917 b80c000000 mov  eax,0xc
00000000`7d61c91c 33c9    xor     ecx,ecx
00000000`7d61c91e 8d542404 lea    edx,[esp+0x4]
00000000`7d61c922 64ff15c0000000 call dword ptr fs:[000000c0]
00000000`7d61c929 c20400  ret     0x4
00000000`7d61c92c 8d4900  lea     ecx,[ecx]
ntdll_7d600000!NtQueryObject:
00000000`7d61c92f b80d000000 mov  eax,0xd
00000000`7d61c934 33c9    xor     ecx,ecx

Another common gotcha is forgetting that you are in the wrong processor mode for the code you are disassembling. The disassembler operates in the current effective processor as set by “.effmach”, regardless of whether you are disassembling code in a 32-bit or 64-bit module. This can be confusing if you forget to change the processor type, as you can end up looking at something that is almost valid code, but not quite (due to some subtle differences in the 32-bit and 64-bit instruction sets).
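One of those subtle differences explains the bogus “dec esp” in the earlier listing. Bytes in the range 0x40 through 0x4f are one-byte inc/dec instructions in 32-bit mode, but in 64-bit mode they are REX prefixes that modify the instruction that follows. A small sketch of that distinction, with made-up names and covering only this one byte range:

```c
#include <stdint.h>

typedef enum { MODE_X86, MODE_X64 } cpu_mode;

/* Describes how a single opcode byte in the range 0x40-0x4f is
 * interpreted under each effective machine type. */
static const char *describe_byte(uint8_t byte, cpu_mode mode)
{
    if (byte < 0x40 || byte > 0x4f)
        return "not in the 0x40-0x4f range";
    if (mode == MODE_X64)
        return "REX prefix";
    /* 0x40-0x47 increment a register, 0x48-0x4f decrement one. */
    return (byte < 0x48) ? "inc r32" : "dec r32";
}
```

The byte 0x4c that starts “mov r10,rcx” is therefore a prefix to the x64 disassembler and a complete “dec esp” instruction to the x86 one, after which the remaining bytes are decoded on their own.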

Finally, one other source of confusion can be filenames. Remember that under Wow64, programs have an altered view of certain filesystem locations, such as %SystemRoot%\System32. Some filenames (especially for loaded modules) may refer to %SystemRoot%\system32, and some may refer to %SystemRoot%\syswow64. Despite the difference in apparent filenames, if you are debugging a Wow64 process, these two directories are the same (and both refer to %SystemRoot%\SysWOW64 on the actual filesystem as viewed from 64-bit programs).

Next time: Tricks for catching 64-bit portability problems with the debugger.

Introduction to x64 debugging, part 3

July 13th, 2006

The last installment of this series described some of the basics of the new calling convention in use on x64 Windows, and how it will impact the debugging experience. This post describes how the unwinding and exception handling aspects matter to you when you debug programs.

I touched on some of the benefits of the new unwind mechanism in the last post – specifically, how you can expect to see full stack traces even without symbols – but, I didn’t really go into a whole lot of detail as to how they are implemented. Microsoft has the full set of details available on MSDN. Rather than restate them all here, I’m going to try to put them into perspective with respect to debugging and how they matter to you.

Perhaps the easiest way to do this is to compare them with x86 exception handling (EH)/unwind support. In the x86 Win32 world, EH/unwind are implemented as a linked list of EXCEPTION_REGISTRATION structures stored at fs:[0] (the start of the current thread’s TEB). When an exception occurs, the exception dispatcher code (either in NTDLL for a user mode exception or NTOSKRNL for a kernel mode exception) searches through this linked list and calls each handler with information about the exception. The exception handler can indicate that control should be resumed immediately to the faulting context, or that the next handler should be called, or that the exception handler has handled the exception and that the stack should be unwound to it. The first two paths are fairly straightforward; either a context record is continued via NtContinue (if you aren’t familiar with the native API layer, this is effectively a longjmp), or the next handler in the chain is called. If the last handler in the list is reached and does not handle the exception then the thread is terminated (for Win32 programs, this should never happen, as Kernel32 installs an exception handler that will catch all exceptions before it calls process / thread entrypoint functions defined by an application).
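The search through the handler list can be modeled in a few lines of portable C. This is only a simulation of the idea: the type names, disposition names, and sample handlers are invented for illustration, and details of the real x86 mechanism (the layout of EXCEPTION_REGISTRATION, the actual disposition values, how the end of the list is marked) are simplified away.

```c
#include <stddef.h>

typedef enum {
    DISPOSITION_CONTINUE_EXECUTION, /* resume the faulting context */
    DISPOSITION_CONTINUE_SEARCH,    /* ask the next handler in the list */
    DISPOSITION_HANDLED             /* unwind the stack to this handler */
} disposition;

typedef disposition (*handler_fn)(unsigned long exception_code);

typedef struct registration {
    struct registration *next; /* next (outer) registration, NULL at the end */
    handler_fn handler;
} registration;

/* Sample handler that never handles anything. */
static disposition decline_everything(unsigned long code)
{
    (void)code;
    return DISPOSITION_CONTINUE_SEARCH;
}

/* Sample handler that handles access violations only. */
static disposition handle_access_violation(unsigned long code)
{
    return (code == 0xC0000005UL) ? DISPOSITION_HANDLED
                                  : DISPOSITION_CONTINUE_SEARCH;
}

/* Walks the list from the innermost registration outwards. Returns the
 * registration whose handler ended the search, or NULL if none did. */
static const registration *dispatch(const registration *head,
                                    unsigned long code,
                                    disposition *result)
{
    for (const registration *r = head; r != NULL; r = r->next) {
        disposition d = r->handler(code);
        if (d != DISPOSITION_CONTINUE_SEARCH) {
            *result = d;
            return r;
        }
    }
    *result = DISPOSITION_CONTINUE_SEARCH;
    return NULL;
}
```

With an inner registration that declines everything and an outer one that handles access violations, dispatching 0xC0000005 stops at the outer registration, while any other code falls off the end of the list, which corresponds to the thread-termination case described above.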

The unwind path is a bit more interesting; here, all of the exception handlers between the one that requested an unwind and the top of the list are called with a flag indicating that they should unwind the stack. Each exception handler routine “knows” how to unwind the procedure(s) that it is responsible for. In this mechanism, the stack gets unwound properly back to the point where the exception was handled. While this works well enough for the actual exception handling process itself, there is a flaw in this design; it precludes unwinding call frames without actually calling the unwind handlers in question. In addition, functions in the middle of an unwind path which did not register an exception handler are invisible to the unwind code itself (this does not pose a problem for normal unwinds, as any function that has special unwind requirements, such as functions with C++ objects on the stack that have destructors, will implicitly register an exception handler).

What this means for you as it relates to debugging is that on x86, it isn’t generally possible to cleanly unwind *without calling the unwind/exception handler functions*. This means that the debugger cannot automatically unwind the stack and produce a valid stack trace with reliable results, without special help, typically in the form of symbols that specify how a function uses the stack. If a function in the middle of the call stack doesn’t have symbols, then there is a good chance that any debugger-initiated stack traces will stop at that function (a common and frustrating occurrence if you are debugging code without symbols on x86).

As I alluded to in the previous posting, this problem has gone away on x64, thanks to the new unwind semantics. The way this works under the hood is that every function that is a non-leaf function (that is, every function which calls another function) is required to have a set of metadata associated with it that describes how the function is to be unwound. This is similar in principle to the symbol unwind information used in x86 if you have symbols, except that it is built into the binary itself (or dynamically registered at runtime, for dynamically generated code, like .NET). This unwind metadata has everything necessary to unwind a function without actually having to call exception handling code (and, indeed, exception handlers no longer perform “manual” unwinds as is the case on x86 – the NTDLL or NTOSKRNL exception dispatcher can take care of this for you thanks to the new unwind metadata).

For most purposes, you can be oblivious to this fact while debugging something; the debugger will automagically use the unwind metadata to construct accurate stack traces, even with no symbols available. An example of this is:

 

0:000> k
Child-SP          RetAddr           Call Site
00000000`0012fa28 00000000`78ef6301 ntdll!ZwRequestWaitReplyPort+0xa
00000000`0012fa30 00000000`78ddc6ed ntdll!CsrClientCallServer+0x61
00000000`0012fa60 00000000`78ddc92a kernel32!GetConsoleInputWaitHandle+0x39d
00000000`0012fbd0 00000000`4ad1df2c kernel32!ReadConsoleW+0x7a
00000000`0012fca0 00000000`4ad15fa7 cmd+0x1df2c
00000000`0012fd60 00000000`4ad02530 cmd+0x15fa7
00000000`0012fdc0 00000000`4ad035ca cmd+0x2530
00000000`0012fe30 00000000`4ad17027 cmd+0x35ca
00000000`0012fe80 00000000`4ad04eef cmd+0x17027
00000000`0012ff20 00000000`78d5965c cmd+0x4eef
00000000`0012ff80 00000000`00000000 kernel32!BaseProcessStart+0x2c

 

With symbols loaded, we can see that the stack trace is exactly the same:

 

0:000> k
Child-SP          RetAddr           Call Site
00000000`0012fa28 00000000`78ef6301 ntdll!ZwRequestWaitReplyPort+0xa
00000000`0012fa30 00000000`78ddc6ed ntdll!CsrClientCallServer+0x9f
00000000`0012fa60 00000000`78ddc92a kernel32!ReadConsoleInternal+0x23d
00000000`0012fbd0 00000000`4ad1df2c kernel32!ReadConsoleW+0x7a
00000000`0012fca0 00000000`4ad15fa7 cmd!ReadBufFromConsole+0x11c
00000000`0012fd60 00000000`4ad02530 cmd!FillBuf+0x3d6
00000000`0012fdc0 00000000`4ad035ca cmd!Lex+0xd2
00000000`0012fe30 00000000`4ad17027 cmd!Parser+0x132
00000000`0012fe80 00000000`4ad04eef cmd!main+0x458
00000000`0012ff20 00000000`78d5965c cmd!mainCRTStartup+0x171
00000000`0012ff80 00000000`00000000 kernel32!BaseProcessStart+0x29

 

As you can see, even with no symbols, we still get a stack trace that includes all of the functions active in the selected thread context.

Sometimes you will need to manually examine the unwind data, however. One of the major reasons for this is if you need to do some work with an exception handler. On x86, the familiar set of instructions “push fs:[0]; mov fs:[0], esp” (or equivalent) signify an exception handler registration. In x64 debugging, you won’t see anything like this, because there is no runtime registration of exception handlers (except via calls to RtlAddFunctionTable). To determine if a function has an exception handler (and what the address is), you’ll need to use a command that you have probably never touched before – .fnent. The .fnent (function entry) command displays the active EH/unwind metadata associated with a function, among other misc. information about the function in question (such as its extents). For instance:

 

0:000> .fnent kernel32!LocalAlloc
Debugger function entry 00000000`01dc2ab0 for:
(00000000`78d6e690)   kernel32!LocalAlloc   |
(00000000`78d6e730)   kernel32!GetCurrentProcessId
Exact matches:
kernel32!LocalAlloc = 

BeginAddress      = 00000000`0002e690
EndAddress        = 00000000`0002e6c3
UnwindInfoAddress = 00000000`000d9174

 

Unfortunately, this command does not directly translate the exception handler information that we are interested in, so we have to do some manual work. The offsets provided are relative to the base of the module in which the function resides, so working with our existing example, we’ll need to add the value “kernel32” to each of the offsets to form a completed address.
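The conversion from a module-relative offset to a full address is just an addition, sketched below with illustrative names. The kernel32 base of 0x78d40000 used in the usage note is not printed anywhere in the output; it is inferred by subtracting the BeginAddress offset (0x2e690) from the LocalAlloc address shown by .fnent (0x78d6e690).

```c
#include <stdint.h>

/* The three image-relative fields shown by .fnent. */
typedef struct {
    uint32_t begin_address;
    uint32_t end_address;
    uint32_t unwind_info_address;
} function_entry;

/* Converts an image-relative offset into a virtual address. */
static uint64_t rva_to_va(uint64_t module_base, uint32_t rva)
{
    return module_base + rva;
}
```

With that inferred base, the UnwindInfoAddress offset of 0xd9174 resolves to 0x78e19174, which is where the UNWIND_INFO structure for this function would be found in memory.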

The format of the unwind information itself is described on MSDN; the important parts are as follows:

 

typedef struct _UNWIND_INFO {
UBYTE Version       : 3;
UBYTE Flags         : 5;
UBYTE SizeOfProlog;
UBYTE CountOfCodes;
UBYTE FrameRegister : 4;
UBYTE FrameOffset   : 4;
UNWIND_CODE UnwindCode[1];
/*  UNWIND_CODE MoreUnwindCode[((CountOfCodes + 1) & ~1) - 1];
*   union {
*       OPTIONAL ULONG ExceptionHandler;
*       OPTIONAL ULONG FunctionEntry;
*   };
*   OPTIONAL ULONG ExceptionData[]; */
} UNWIND_INFO, *PUNWIND_INFO; 

typedef union _UNWIND_CODE {
struct {
UBYTE CodeOffset;
UBYTE UnwindOp : 4;
UBYTE OpInfo   : 4;
};
USHORT FrameOffset;
} UNWIND_CODE, *PUNWIND_CODE;
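As a sketch of how these fields locate the optional handler field, the following C function computes the byte offset of ExceptionHandler from the start of an UNWIND_INFO structure. The function name is made up; the flag constants are the documented UNW_FLAG_* values. It assumes the usual bitfield layout (Version in the low three bits of the first byte) and, as a simplification, gives up on chained unwind information.

```c
#include <stdint.h>

#define UNW_FLAG_EHANDLER  0x1
#define UNW_FLAG_UHANDLER  0x2
#define UNW_FLAG_CHAININFO 0x4

/* Returns the byte offset from the start of UNWIND_INFO to the
 * ExceptionHandler field, or -1 if there is no handler field or the
 * unwind information is chained (not handled by this sketch). */
static long exception_handler_field_offset(const uint8_t *unwind_info)
{
    uint8_t flags = (uint8_t)(unwind_info[0] >> 3); /* upper 5 bits */
    uint8_t count_of_codes = unwind_info[2];

    if (flags & UNW_FLAG_CHAININFO)
        return -1;
    if (!(flags & (UNW_FLAG_EHANDLER | UNW_FLAG_UHANDLER)))
        return -1;

    /* 4-byte header, then the unwind codes rounded up to an even
     * count, at 2 bytes per UNWIND_CODE. */
    return 4 + ((count_of_codes + 1) & ~1) * 2;
}
```

For example, a structure with 11 unwind codes has its handler field at offset 28: the count is padded to 12 codes, which occupy 24 bytes after the 4-byte header.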

 

Given the structure definition above, we can write a simplified debugger expression to parse the unwind information structure and tell us the interesting bits. This expression does not handle all cases – in particular, it doesn’t handle chained unwind information properly, for which you would need to write a more complicated expression or do the work manually.

 

0:000> u kernel32+dwo(kernel32+00000000`000d9174+
@@c++((1+ @@masm(by(2+kernel32+00000000`000d9174))) & ~1) * 2 + 4)
kernel32!_C_specific_handler:
00000000`78d92180 ff25eafafaff jmp qword ptr
[kernel32!_imp___C_specific_handler (0000000078d41c70)]

 

The expression finds the count of unwind codes from an UNWIND_INFO structure, performs the necessary alignment calculations, multiplies the resulting value by the size of the UNWIND_CODE union, and adds the result to the offset into the UNWIND_INFO structure where unwind codes are stored. Then, this value is added to the pointer to the UNWIND_INFO structure itself, which gives us a pointer to UNWIND_INFO.ExceptionHandler. This value is an offset into the module with which the exception handler routine is associated, so by adding the base address of the module, we (finally!) get the address of the exception handler function itself. In this case, it’s __C_specific_handler, which is the equivalent of _except_handler3 in x86 (the standard VC++ generated exception handler for C/C++ code). __C_specific_handler has its own metadata stored in the “ExceptionData” member that describes where the actual C/C++ exception handlers are (i.e. the exception filter/exception handler defined with __except in CL). The format of these structures is as so:

 

typedef struct _CL_SCOPE {
ULONG BeginOffset;   // imagebase relative
ULONG EndOffset;     // imagebase relative
ULONG HandlerOffset; // imagebase relative
ULONG TargetOffset;  // imagebase relative
} CL_SCOPE, * PCL_SCOPE; 

typedef struct _CL_EXCEPTION_DATA {
ULONG NumEntries;
CL_SCOPE ScopeEntries[1]; // NumEntries entries follow
} CL_EXCEPTION_DATA, * PCL_EXCEPTION_DATA;

 

If the exception handler is a CL one using __C_specific_handler (as is the case here), we can find the code corresponding to the __except filter/handler by dumping the CL scope table entries as so:
 

0:000> dd kernel32+00000000`000d9174+
@@c++((1+ @@masm(by(2+kernel32+00000000`000d9174))) & ~1)
* 2 + 4 + 4 + 4 L dwo(kernel32+00000000`000d9174+
@@c++((1+ @@masm(by(2+kernel32+00000000`000d9174))) & ~1) * 2 + 4 + 4) * 4
00000000`78e19198  000164fb 00016524 00000001 000709ef
00000000`78e191a8  00016524 00016565 00000001 000709ef
00000000`78e191b8  00016565 00016583 00000001 000709ef
00000000`78e191c8  00016583 00016585 00000001 000709ef
00000000`78e191d8  00070968 0007098d 00000001 000709ef
00000000`78e191e8  0007098d 000709cc 00000001 000709ef
00000000`78e191f8  000709cc 000709ef 00000001 000709ef

 

This command gave us a list of address ranges within kernel32!LocalAlloc that are covered by a C/C++ exception handler, whether each range has a filter expression (indicated by HandlerOffset; a value of 1 signifies that the exception is simply handled by executing the “TargetOffset” routine), and the offset of the handler (TargetOffset). All of the offsets are relative to the base address of kernel32. We can unassemble the handler specified by each of them to see that it is simply setting the last Win32 error based on an exception code:

 

0:000> u kernel32+000709ef
kernel32!LocalAlloc+0x1cb:
00000000`78db09ef 33ff             xor     edi,edi
00000000`78db09f1 48897c2420       mov     [rsp+0x20],rdi
00000000`78db09f6 8bc8             mov     ecx,eax
00000000`78db09f8 e863dcfbff call kernel32!BaseSetLastNTError (0000000078d6e660)
00000000`78db09fd 8d7701           lea     esi,[rdi+0x1]
00000000`78db0a00 448b642460       mov     r12d,[rsp+0x60]
00000000`78db0a05 488b5c2428       mov     rbx,[rsp+0x28]
00000000`78db0a0a e9765bfaff jmp kernel32!LocalAlloc+0x1e6 (0000000078d56585)
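The scope table dump shown earlier can also be decoded programmatically. The sketch below copies the first three rows of that dump into a C array and looks up which entry covers a given image-relative address. The type and function names are illustrative, and treating EndOffset as exclusive is an assumption about how the ranges are meant to be read.

```c
#include <stddef.h>
#include <stdint.h>

typedef struct {
    uint32_t begin_offset;   /* image-relative start of the guarded range */
    uint32_t end_offset;     /* image-relative end of the guarded range */
    uint32_t handler_offset; /* filter routine, or 1 for "always handle" */
    uint32_t target_offset;  /* image-relative __except block */
} cl_scope;

/* First three rows of the dd output above. */
static const cl_scope sample_scopes[] = {
    { 0x000164fb, 0x00016524, 0x00000001, 0x000709ef },
    { 0x00016524, 0x00016565, 0x00000001, 0x000709ef },
    { 0x00016565, 0x00016583, 0x00000001, 0x000709ef },
};

/* Returns nonzero when the entry has no filter routine to call. */
static int handles_unconditionally(const cl_scope *scope)
{
    return scope->handler_offset == 1;
}

/* Finds the first scope covering rva (begin inclusive, end exclusive),
 * or NULL if no entry covers it. */
static const cl_scope *find_scope(const cl_scope *scopes, size_t count,
                                  uint32_t rva)
{
    for (size_t i = 0; i < count; i++) {
        if (rva >= scopes[i].begin_offset && rva < scopes[i].end_offset)
            return &scopes[i];
    }
    return NULL;
}
```

Looking up offset 0x16530 finds the second entry, which has no filter and transfers control to offset 0x709ef, the handler disassembled above.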

 

That’s all for this post. Next time, I’ll talk about some of the common “gotchas” when dealing with Wow64 debugging.

Additional credits for this article: C++ exception handling information from “Improved Automated Analysis of Windows x64 Binaries” by skape.

VMware Server 1.0 released

July 12th, 2006

It’s here – VMware Server 1.0 has been released!  You can get it here.

I’ve been using the VMware Server betas for some time and it is well worth a look if you need to set up some dedicated/always on VMs.  It is not quite a replacement for Workstation (in particular, with its lack of multiple snapshot support) for certain testing scenarios, but if you need to run a set of VMs always on it does the job well.

Be sure to read my earlier posting for an interoperability problem with RDP if you try to connect to a console session, as this problem may limit its usefulness if you do not use full Terminal Server (a reason to consider installing it on Windows Server 2003 instead of Windows XP).

Introduction to x64 debugging, part 2

July 12th, 2006

Last time, I talked about some of the basic differences you’ll see when switching to an x64 system if you are doing debugging using the Debugging Tools for Windows package.  In this installment, I’ll run through some of the other differences with debugging that you’ll likely run into – in particular, how changes to the x64 calling convention will make your life much easier when debugging.

Although the x64 architecture is in many respects very similar to x86, many of the conventions of x86-Win32 that you might be familiar with have changed.  Microsoft took the opportunity to “clean house” with many aspects of Win64, since for native x64 programs, there is no concern of backwards binary compatibility.

One of the major changes that you will quickly discover is that the calling conventions that x86 used (__fastcall, __cdecl, __stdcall) are not applicable to x64.  Instead of many different calling conventions, x64 unifies everything into a single calling convention that all functions use.  You can read the full details of the new calling convention on MSDN, but I’ll give you the executive summary as it applies to debugging programs here.

  • The first four arguments of a function are passed in registers: rcx, rdx, r8, and r9 respectively.  Subsequent arguments are passed on the stack.
  • The caller allocates the space on the stack for parameter passing, like for __stdcall on x86.  However, the caller must allocate at least 32 bytes of stack space for the callee to use as a “register home space” for the first four parameters (or as scratch space).  This must be done even if the callee has no arguments or fewer than four arguments.
  • The caller always cleans the stack of arguments passed (like __cdecl on x86) if necessary.
  • Stack unwinding and exception handling are significantly different on x64; more details on that later.  The new stack unwinding model is data-driven rather than code-driven (like on x86).
  • Except for dynamic stack adjustments (like _alloca), all stack space must be allocated in the prologue.  Effectively, for most functions, the stack pointer will remain constant throughout the execution process.
  • The rax register is used for return values.  For return values larger than 64 bits, a hidden pointer argument is used.  There is no more spillover into a second register for large return values (like edx:eax, on x86).
  • The rax, rcx, rdx, r8, r9, r10, r11 registers are volatile, all other registers must be preserved.  For floating point usage, the xmm0, xmm1, xmm2, xmm3, xmm4, xmm5 registers are volatile, and the other registers must be preserved.
  • For floating point arguments, the xmm0 through xmm3 registers are used for the first four arguments, after which stack spillover is performed.
  • The instructions permitted in function prologues and epilogues are highly restricted to a very small subset of the instruction set to facilitate unwinding operations.

The main takeaways here from a debugging perspective are thus:

  • Even though a register calling convention like __fastcall is used, the register arguments are often spilled to the “home area” and so are typically visible in call stacks, especially in debug builds.
  • Due to the nature of parameter passing on x64, the “push” instruction is seldom used for setting up arguments.  Instead, the compiler allocates all space up front (like for local variables on x86) and uses the “mov” instruction to write stack parameters onto the stack for function calls.  This also means that you typically will not see an “add rsp” (or equivalent) after each function call, despite the fact that the caller cleans the stack space.
  • The first stack arguments (argument 5, etc) will appear at [rsp+28h] instead of [rsp+08h], because of the mandatory register home area.  This is a departure from how __fastcall worked on x86, where the first stack argument would be at [esp+04h].
  • Because of the data driven unwind semantics, you will see perfect stack unwinding even without symbols.  This means that even if you don’t have any symbols at all for a third party binary, you should always get a complete stack trace all the way back to the thread start routine.  As a side effect, this means that the stack traces captured by PageHeap or handle traces will be much more reliable than on x86, where they tended to break at the first function that did not use ebp (because those stack traces never used symbols).
  • Because of the restrictions on the prologue and epilogue instruction usage, it is very easy to recognize where the actual important function code begins and the boilerplate prologue/epilogue code ends.
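To make the offsets in the list above a bit more concrete, here is a minimal Python sketch that computes where each integer argument lives at function entry (that is, before the prologue adjusts rsp). It assumes the layout described here: an 8-byte return address at [rsp], then the 32-byte home area for the four register arguments (rcx, rdx, r8, r9), then the stack arguments. The function and constant names are my own illustration and not part of any debugger API.

```python
# Sketch of x64 argument locations at function entry (before the prologue
# adjusts rsp). Names here are illustrative, not part of any debugger API.

RETURN_ADDRESS_SIZE = 8      # pushed by the call instruction
SLOT_SIZE = 8                # every argument slot is 8 bytes on x64
REGISTER_ARGS = ("rcx", "rdx", "r8", "r9")


def x64_arg_location(n):
    """Return (register_or_None, offset_from_rsp) for 1-based integer argument n.

    For arguments 1-4 the offset is the home slot reserved for the register;
    for argument 5 and beyond it is the actual stack slot.
    """
    if n < 1:
        raise ValueError("argument numbers start at 1")
    offset = RETURN_ADDRESS_SIZE + (n - 1) * SLOT_SIZE
    register = REGISTER_ARGS[n - 1] if n <= len(REGISTER_ARGS) else None
    return register, offset


for n in range(1, 7):
    register, offset = x64_arg_location(n)
    where = register if register else "stack"
    print(f"arg {n}: {where:5} [rsp+{offset:02X}h]")
```

Once the prologue has allocated the frame, the same slots sit at larger offsets from rsp, so this only describes the view from the first instruction of the callee.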

If you’ve been debugging on x86 for a long time, then you are probably pretty excited about the features of the new calling convention.  Because of the perfect unwind semantics and the model of a constant stack pointer throughout function execution, debugging code that you don’t have symbols for (and using the built-in heap and handle verification utilities) is much more reliable than on x86.  Additionally, compiler generated code is usually easier to understand, because you don’t have to manually track the value of the stack pointer changing throughout the function call like you often did on x86 functions compiled with frame pointer omission (FPO) optimizations.

There are some exceptions to the rules I laid out above for the x64 calling convention.  For functions that do not call any other functions (called “leaf” functions), it is permissible to utilize custom calling conventions so long as the stack pointer (rsp) is not modified.  If the stack pointer is modified then regular calling convention semantics are required.

Next time, I’ll go into more detail on how exception handling and unwinding is different on x64 from the perspective of what the changes mean to you if you are debugging programs, and how you can access some of the metadata associated with unwinding/exception handling and use it to your advantage within the debugger.

Introduction to x64 debugging, part 1

July 11th, 2006

There are some subtle differences between using the Debugging Tools for Windows (DTW) toolset on x86 and x64 that are worth mentioning, especially if you are new to doing x64 debugging. Most of this post applies to all of the debuggers shipped in the DTW package, which is why I avoid talking about WinDbg or ntsd or cdb specifically, and often just refer to the “DTW debuggers”. This is the first post in a multipart series, and it provides a general overview of the options you have for doing 32-bit and 64-bit debugging on an x64 machine, and how to set up the debugger properly to support both, using either the 32-bit or 64-bit packages.

There are many ways to do x64 debugging, which can get confusing, simply because there are so many different choices. You can use both the 32-bit and 64-bit DTW packages, with some restrictions. Here’s a summary of the most common cases (including “cross-debugging” scenarios, where you are using the 32-bit debugger to debug 64-bit processes). For now, I’ll just limit this to user mode, although you can use many of these options for kernel debugging too.

  • Natively debugging 64-bit processes on the same computer using the 64-bit DTW package
  • Natively debugging 32-bit (Wow64) processes on the same computer using the 64-bit DTW package
  • Debugging 32-bit (Wow64) processes on the same computer using the 32-bit DTW package (running the debugger itself under Wow64)
  • Debugging 64-bit processes or 32-bit (Wow64) processes on the same or a different computer using either the 64-bit or 32-bit DTW package, with the remote debugging support (e.g. dbgsrv.exe, or -remote/-server). This requires a 64-bit remote debugger server.
  • Debugging 32-bit (Wow64) processes on the same or a different computer using either the 64-bit or 32-bit DTW package, with the remote debugger support (e.g. dbgsrv.exe, or -remote/-server). This works with a 32-bit remote debugging server.
  • Debugging a 64-bit or 32-bit dump file using the 32-bit or 64-bit DTW package. Both DTW packages are capable of doing this task natively.

There are actually even more combinations, but to keep it simple, I just listed the major ones. Now, as for which setup you want to use, there are a couple of considerations to keep in mind. Most of the important differences for the actual debugging experience stem from whether the process that is making the actual Win32 debugger API calls is a 64-bit or 32-bit process. For the purposes of this discussion, I’ll call the process that makes the actual debugger API calls (e.g. DebugActiveProcess) the actual debugger process.

If the actual debugger process is a 32-bit process under Wow64, then it will be unable to interact meaningfully with 64-bit processes (if you are using WinDbg, 64-bit processes will all show as “System” in the process list). For 32-bit processes, it will see them exactly as you would under an x86 Win32 system; there is no direct indication that they are running under Wow64, and the extra Wow64 functionality is completely isolated from the debugger (and the person driving the debugger). This can be handy, as the extra Wow64 infrastructure can in many cases just get in the way if you are debugging a pure 32-bit program running under Wow64 (unless you suspect a bug in Wow64 itself, which is fairly unlikely to be the case).

If the actual debugger process is a native 64-bit process, then the whole debugging environment changes. The native 64-bit debugging environment allows you to debug both 32-bit (Wow64) and 64-bit targets. However, when you are debugging 32-bit targets, the experience is not the same as if you were just debugging a 32-bit program on a 32-bit Windows installation. The 64-bit debugger will see all of the complexities of Wow64, which often gets confusing and can get in your way. I’ll go into specifics of what exactly is different and how the 64-bit debugger can sometimes be annoying when working with Wow64 processes in a moment; for now, stick with me.

So, if you need to do development on 64-bit computers, which debugging package is the best for you to use? Well, that really depends on what you are doing, but I would recommend installing both the 32-bit and 64-bit DTW packages. The main reason to do this is that it will allow you to debug 32-bit processes without having to deal with the Wow64 layer all the time, but at the same time it will allow you to debug native 64-bit processes.

After you have installed the DTW packages, one of the familiar first steps with setting up the debugger tools on a new system is to register WinDbg as your default postmortem debugger. This turns out to be a bit more complicated on 64-bit systems than on 32-bit systems, however, in large part due to a new concept added to Windows to support Wow64: registry reflection. Registry reflection allows for 32-bit and 64-bit applications to have their own virtualized view of several key sections of the registry, such as HKEY_LOCAL_MACHINE\Software. What this means in practice is that if you write to the registry from a 32-bit process, you might not see the changes from 64-bit processes (and vice versa), depending on which registry keys are changed. Alternatively, you might see different changes than you made, such as if you are registering a COM interface in HKEY_CLASSES_ROOT.

So, what does all of this mean to you, as it relates to doing debugging on 64-bit systems? Well, the main difference that impacts you is that there are different JIT handlers for 32-bit and 64-bit processes. This means that if you register a 32-bit DTW debugger as a default postmortem debugger, it won’t be activated for 64-bit processes. Conversely, if you register a 64-bit DTW debugger as a default postmortem debugger, it won’t be activated for 32-bit processes.

This leaves you with a couple of options: Register both the 32-bit and 64-bit DTW packages as default postmortem debuggers (if you only want to use the 64-bit DTW package on 64-bit processes and not 32-bit (Wow64) processes as a JIT debugger), or register the 64-bit DTW debugger as a default postmortem debugger for both 32-bit and 64-bit processes. If you want to do the former, then what you need to do is as simple as logging in as an administrator and running both the 32-bit and 64-bit DTW debuggers with the -I command line option (install as default postmortem debugger), and then you’re set. However, if you want to use the 64-bit debugger for both 64-bit and 32-bit processes as a JIT debugger, then things are a bit more complicated. The best way to set this up is to install the 64-bit DTW debugger as a default postmortem debugger (run it with -I), and then open the 64-bit version of regedit.exe, navigate to HKEY_LOCAL_MACHINE\Software\Microsoft\Windows NT\CurrentVersion\AeDebug, and copy the value of the “Debugger” entry into the clipboard. Then, navigate to the 32-bit view of this key, located at HKEY_LOCAL_MACHINE\Software\Wow6432Node\Microsoft\Windows NT\CurrentVersion\AeDebug, create (or modify, if it already exists) the “Auto” string value and set it to “1”, then create (or modify, if it already exists) the “Debugger” string value and set it to the value you copied from the 64-bit view of the AeDebug key. For my system, the “Debugger” value is set to something like “C:\Program Files\Debugging Tools for Windows 64-bit\WinDbg.exe” -p %ld -e %ld -g. If you don’t see a Wow6432Node registry key under HKEY_LOCAL_MACHINE\Software, then you are probably accidentally running the 32-bit version of regedit.exe and not the 64-bit version of regedit.exe.
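As a small illustration of how the two AeDebug key paths above relate to each other, here is a Python sketch that derives the 32-bit view of a key under HKEY_LOCAL_MACHINE\Software by inserting Wow6432Node. It is pure string manipulation and does not touch the registry; the function name and the placeholder "Debugger" value are hypothetical, and you would still copy the real value from the 64-bit view as described.

```python
# Sketch: derive the 32-bit (Wow64) view of a key path under HKLM/Software.
# Pure string manipulation; it does not read or write the registry.

SOFTWARE_PREFIX = "HKEY_LOCAL_MACHINE\\Software\\"
WOW64_NODE = "Wow6432Node\\"


def wow64_view(path):
    """Map a 64-bit-view key path under HKLM\\Software to its 32-bit view."""
    if not path.lower().startswith(SOFTWARE_PREFIX.lower()):
        raise ValueError("only HKEY_LOCAL_MACHINE\\Software keys are handled here")
    rest = path[len(SOFTWARE_PREFIX):]
    if rest.lower().startswith(WOW64_NODE.lower()):
        return path  # already the 32-bit view
    return SOFTWARE_PREFIX + WOW64_NODE + rest


AEDEBUG_64 = SOFTWARE_PREFIX + "Microsoft\\Windows NT\\CurrentVersion\\AeDebug"
AEDEBUG_32 = wow64_view(AEDEBUG_64)

# Values to mirror into the 32-bit view, per the steps above. The Debugger
# string is a placeholder: copy the real one from the 64-bit view.
values_for_32bit_view = {
    "Auto": "1",
    "Debugger": "<value copied from the 64-bit AeDebug key>",
}

print(AEDEBUG_64)
print(AEDEBUG_32)
```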

Now, there are a couple of other considerations to take into account when picking whether to use the 32-bit or 64-bit DTW tools on 32-bit processes. Besides the ease of use consideration (which I’ll come back to in more detail shortly), many third party extension DLLs (including my own SDbgExt, for the moment) are only available as 32-bit binaries. While these extension DLLs might support 64-bit targets, they will only run under a 32-bit debugger host.

I said I’d describe some of the reasons why debugging Wow64 processes under the native 64-bit debugger can be cumbersome. The main problem with doing this is that you need to be careful with whether the debugger is active as a 32-bit or 64-bit debugger. This is controlled by something that the DTW package calls the effective machine, which is a way to tell the debugger that it should be treating the program as a 32-bit or 64-bit program. If you are using the native 64-bit debugger on a Wow64 process, you will often find yourself having to manually switch between the native (x64) machine mode and the Wow64 (x86) mode.

To give you an idea of what I mean, let’s take a simple example of breaking into the 32-bit version of CMD.EXE, and getting a call stack of the first thread (thread 0). If you are experienced with the DTW tools, then you probably already know how to do this on x86-based systems: the “~0k” command, which means “show me a stack trace for thread 0”. If you run this on the 32-bit CMD.exe process, though, you won’t quite get what you were expecting:

0:000> ~0k
Child-SP          RetAddr           Call Site
00000000`0013e318 00000000`78ef6301 ntdll!ZwRequestWaitReplyPort+0xa
00000000`0013e320 00000000`78bc0876 ntdll!CsrClientCallServer+0x9f
00000000`0013e350 00000000`78ba1394 wow64win!ReadConsoleInternal+0x236
00000000`0013e4c0 00000000`78be6866 wow64win!whReadConsoleInternal+0x54
00000000`0013e510 00000000`78b83c7d wow64!Wow64SystemServiceEx+0xd6
00000000`0013edd0 00000000`78be6a5a wow64cpu!ServiceNoTurbo+0x28
00000000`0013ee60 00000000`78be5e0d wow64!RunCpuSimulation+0xa
00000000`0013ee90 00000000`78ed8501 wow64!Wow64LdrpInitialize+0x2ed
00000000`0013f6c0 00000000`78ed6416 ntdll!LdrpInitializeProcess+0x17d9
00000000`0013f9d0 00000000`78ef3925 ntdll!LdrpInitialize+0x18f
00000000`0013fab0 00000000`78d59630 ntdll!KiUserApcDispatch+0x15
00000000`0013ffa8 00000000`00000000 0x78d59630
00000000`0013ffb0 00000000`00000000 0x0
00000000`0013ffb8 00000000`00000000 0x0
00000000`0013ffc0 00000000`00000000 0x0
00000000`0013ffc8 00000000`00000000 0x0
00000000`0013ffd0 00000000`00000000 0x0
00000000`0013ffd8 00000000`00000000 0x0
00000000`0013ffe0 00000000`00000000 0x0
00000000`0013ffe8 00000000`00000000 0x0

Hey, that doesn’t look like the 32-bit CMD at all! Well, the reason for the strange call stack is that the 32-bit CMD’s first thread is sleeping in a system call to the 64-bit kernel, and the last active processor state for that thread was native 64-bit mode, and NOT 32-bit mode. You will find that this is the common case for threads that are not spinning or doing actual work when you break in with the debugger.

In order to get the more useful 32-bit stack trace, we’ll have to use a debugger command that is probably unfamiliar to you if you haven’t done Wow64 debugging before: .effmach. This command controls the “effective machine” of the debugger, which I previously described. We’ll want to tell the debugger to show us the 32-bit state of the target, which we can do with the “.effmach x86” command. Then, we can get a 32-bit stack trace for the first thread with the “~0k” command:

0:002> .effmach x86
Effective machine: x86 compatible (x86)
0:002:x86> ~0k
ChildEBP          RetAddr
002dfd68 7d542f32 KERNEL32!ReadConsoleInternal+0x15
002dfdf4 4ad0fe14 KERNEL32!ReadConsoleW+0x42
002dfe5c 4ad15803 cmd!ReadBufFromConsole+0xb5
002dfe88 4ad02378 cmd!FillBuf+0x174
002dfe8c 4ad02279 cmd!GetByte+0x11
002dfea8 4ad026c5 cmd!Lex+0x6b
002dfeb8 4ad02783 cmd!GeToken+0x20
002dfec8 4ad02883 cmd!ParseStatement+0x36
002dfedc 4ad164c0 cmd!Parser+0x46
002dff44 4ad04cdd cmd!main+0x1d6
002dffc0 7d4e6e1a cmd!mainCRTStartup+0x12f
002dfff0 00000000 KERNEL32!BaseProcessStart+0x28

Much better! That’s more in line with what we’d be expecting an idle CMD.EXE to be doing. We can now treat the target as a 32-bit process, including things like displaying and altering register contexts, disassembling, and so forth. For instance:

0:002:x86> ~0s
KERNEL32!ReadConsoleInternal+0x15:
00000000`7d54e9c3 c22000  ret     0x20
0:000:x86> r
eax=00000001 ebx=002dfe84 ecx=00000000 edx=00000000 esi=00000003 edi=4ad2faa0
eip=7d54e9c3 esp=002dfd6c ebp=002dfdf4 iopl=0         nv up ei pl nz na pe nc
cs=0023  ss=002b  ds=002b  es=002b  fs=0053  gs=002b             efl=00000202
KERNEL32!ReadConsoleInternal+0x15:
00000000`7d54e9c3 c22000  ret     0x20
0:000:x86> u poi(esp)
KERNEL32!ReadConsoleW+0x42:
00000000`7d542f32 8b4dfc  mov     ecx,[ebp-0x4]
00000000`7d542f35 5f      pop     edi
00000000`7d542f36 5e      pop     esi
00000000`7d542f37 5b      pop     ebx
00000000`7d542f38 e8545df9ff call KERNEL32!__security_check_cookie (7d4d8c91)
00000000`7d542f3d c9      leave
00000000`7d542f3e c21400  ret     0x14
00000000`7d542f41 90      nop

If we want to switch the debugger back to the 64-bit view of the process, we can use “.effmach .” to change to the native processor type:

0:000:x86> .effmach .
Effective machine: x64 (AMD64)

Now, we’re back to 64-bit mode, and all of the debugger commands will reflect this:

0:000> r
rax=000000000000000c rbx=000000000013e3a0 rcx=0000000000000000
rdx=00000000002df1f4 rsi=0000000000000000 rdi=00000000003e0cd0
rip=0000000078ef148a rsp=000000000013e318 rbp=00000000002dfdf4
r8=000000007d61c929  r9=000000007d61caf1 r10=0000000000000000
r11=00000000002df1f4 r12=00000000002dfe34 r13=0000000000000001
r14=00000000002dfe84 r15=000000004ad2faa0
iopl=0         nv up ei pl zr na po nc
cs=0033  ss=002b  ds=002b  es=002b  fs=0053  gs=002b             efl=00000244
ntdll!ZwRequestWaitReplyPort+0xa:
00000000`78ef148a c3               ret

That should give you a basic idea as to what you will be needing to do most of the time when you are doing Wow64 debugging. If you are running the 32-bit debugger packages, then all of this extra complexity is hidden and the process will appear to be a regular 32-bit process, with all of the transitions to Wow64 looking like 32-bit system calls (these typically happen in places like ntdll or user32.dll/gdi32.dll).

That’s the end of this post. The next in this series will go into more detail as to what has changed when you take the plunge and start debugging things on a 64-bit system.

SDbgExt extensions – part 2.

July 10th, 2006

Last post, I discussed some of the major extensions available in SDbgExt.  This post is a continuation that explains the remaining major / interesting extensions available in the extension library.

Another set of extensions that may prove useful if you do frequent C++ development are the STL datastructures display functions.  These allow you to interpret many of the complicated STL container datastructures and format them in more understandable forms from the debugger.  At present, the extension DLL only understands the Visual C++ 7.0 / 7.1 STL structure layout (no support for VC6 or VC8 as of yet, sorry).  This complements the built-in WinDbg support for STL, which does not cover all of VC7 (or at least did not the last time I checked).  The SDbgExt STL type display functions do not rely on type information contained in symbols, so you can use them to help debug third party programs that you do not have symbols for.  However, some extensions will require you to have a bit of knowledge about the structures contained in an STL container (usually the size of a container item).  These extensions will work on local or remote 32-bit targets.

The main functions that you might find useful in this grouping are these:

!stlmap allows you to traverse an std::map (binary tree) and optionally dump some bytes from each of the key and value types.  Due to the layout of the std::map type, you will need to tell SDbgExt about the size of the key and value types.

!stlset allows you to traverse an std::set (binary tree) and optionally dump some bytes from each of the value types.  It is very similar to the !stlmap extension, except that it works on std::set structures and thus only needs information about a value type and not an additional key type.

!stllist and !stlvector allow you to display the contents of an std::list or std::vector, respectively.  Optionally, you may provide the size of an element to dump some bytes from each element to the debugger.

!stlstring and !stlwstring allow you to display std::string and std::wstring structures.  If you are displaying very long strings (>16K characters), then you will need to provide a second argument specifying the maximum number of characters to display.  This limit is always capped at 64K characters.

Most of the STL datastructures traversal functions have only minimal protection against damaged or corrupted datastructures.  If you attempt to use them on a datastructure that is broken (perhaps it has a link that references itself, causing an infinite loop), then you will need to break out of the extension with Ctrl-C or Ctrl-Break depending on which debugger you use.  To ensure that your system remains responsive enough to have the option of breaking out, SDbgExt will lower its priority to at least normal (WinDbg runs at high priority by default) temporarily, for the duration of the call to the extension (the original priority is restored before the extension returns).
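To illustrate the failure mode described above (a link that references itself), here is a Python sketch of a list walk that stops instead of looping forever. This is not how SDbgExt is implemented; it is just one common way to guard a traversal, using a set of visited addresses plus a node limit. Target memory is modeled as a plain dict mapping a node address to the address of the next node, which is purely an assumption for the sake of the example.

```python
# Sketch: walking a linked structure read from a target without looping
# forever on corruption. The "memory" is modeled as a dict mapping a node
# address to the address of the next node; 0 terminates the list.

def walk_list(next_of, head, max_nodes=65536):
    """Return (nodes, status); status is 'ok', 'cycle', 'unreadable' or 'limit'."""
    seen = set()
    nodes = []
    current = head
    while current != 0:
        if current in seen:
            return nodes, "cycle"        # a link points back at an earlier node
        if current not in next_of:
            return nodes, "unreadable"   # link points at memory we cannot read
        if len(nodes) >= max_nodes:
            return nodes, "limit"        # give up on implausibly long lists
        seen.add(current)
        nodes.append(current)
        current = next_of[current]
    return nodes, "ok"


healthy = {0x1000: 0x2000, 0x2000: 0x3000, 0x3000: 0}
corrupt = {0x1000: 0x2000, 0x2000: 0x2000}   # second node links to itself

print(walk_list(healthy, 0x1000))
print(walk_list(corrupt, 0x1000))
```

The visited set costs memory proportional to the list length, which is usually an acceptable trade for a debugger-side tool that would otherwise hang.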

The last major category of functions supported by SDbgExt are those that are related to Windows UI debugging.  These extensions generally require a live 32-bit target on the local computer in order to function (no remote debugging).  They work by directly querying internal window structures or by calling user32/gdi32 APIs about a particular UI object.  These can be used as a handy replacement for Spy++, which has a nasty tendency to break badly (and break every GUI application on the same desktop with it) when it encounters a GUI program that is frozen in the debugger.

The !hwnd extension is the primary UI debugging extension supported by SDbgExt.  It will dump many of the interesting attributes about a window given its handle (HWND) to the debugger console.  Additionally, it can be used to enumerate the active windows on the current thread.  This extension is particularly useful for programs that store things like class object pointers at GWLP_USERDATA or DWLP_USERDATA, which are normally hard to get at from WinDbg.

The !getprop extension can be used to enumerate window properties associated with a particular window, or to query a single specific window property associated with a particular window.  These are useful if a program stores information (like a class object pointer) in a window property and you need to get at it from the debugger, which is something that you cannot easily do from WinDbg normally.

The !msg extension will display information about a MSG structure given a pointer to it.  It has a limited built in set of message names for some of the most common standard window messages.  It will also display information about the window with which the window message is associated (if any).

Finally, there are a couple of misc. functions that SDbgExt supports which don’t fit cleanly into any specific category.  Many of these are niche extensions that are only useful for very specific scenarios.

The !switchdesk extension will switch the active desktop for the computer hosting the debugger.  This can be useful if you are debugging a full screen DirectX program using a GUI debugger on an alternate desktop and a console debugger in full screen mode connected to the GUI debugger using remote debugging on the same desktop as the program being debugged.

The !lmx extension will allow you to view the loaded module list in the various forms that the loader maintains it (in-load-order, in-memory-order, in-initialization-order).  When used on kernel targets, there is only one loaded module list, so the list identification parameter is unused in that usage case.

!findwfptr (courtesy of skape) allows you to scan the address space of the debuggee for function pointers that are located in writable memory.  It will work on all target types.  For live targets, it can also optionally place breakpoints at all of the functions pointed to by pointers residing in writable memory.  This extension is useful if you are auditing a program for potential security risks; many of the static addresses (i.e. global variables) in standard system DLLs that contain function pointers have been changed to encode the pointer values based on a random cookie to prevent exploits from overwriting them to gain control over the program.  This extension can help you identify which function pointers might be used by an attacker to compromise your program when used in conjunction with certain types of security flaws.
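The general idea behind this kind of scan can be sketched in a few lines of Python. This is not the extension’s actual implementation, just a simplified model under my own assumptions: a writable region is given as raw bytes plus its base address, executable regions are given as (start, end) address ranges, and any aligned pointer-sized value that lands inside an executable range is reported as a candidate function pointer. All names and addresses are made up for the example.

```python
import struct

# Sketch: find pointer-sized values in writable memory that point into code.
# A simplified model of the idea only; all names and addresses are made up.


def find_code_pointers(data, base, code_ranges, pointer_size=4):
    """Scan `data` (bytes of a writable region starting at address `base`)
    for aligned, little-endian pointer-sized values that fall inside any
    (start, end) range in `code_ranges`. Returns (address, target) pairs."""
    if pointer_size not in (4, 8):
        raise ValueError("pointer_size must be 4 or 8")
    fmt = "<I" if pointer_size == 4 else "<Q"
    hits = []
    for offset in range(0, len(data) - pointer_size + 1, pointer_size):
        (value,) = struct.unpack_from(fmt, data, offset)
        if any(start <= value < end for start, end in code_ranges):
            hits.append((base + offset, value))
    return hits


# Hypothetical executable range and a 16-byte writable region at 0x00420000.
code_ranges = [(0x7D4C0000, 0x7D5F0000)]
writable = struct.pack("<4I", 0x00000000, 0x7D542F32, 0x12345678, 0x7D4E6E1A)

for address, target in find_code_pointers(writable, 0x00420000, code_ranges):
    print(f"{address:08x} -> {target:08x}")
```

A scan like this will also report values that merely look like code addresses, so each hit is a lead to investigate and not proof of a function pointer.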

!cmpmem allows you to compare ranges of memory over time, with the ability to exclude modified ranges.  This is particularly useful if you want to watch a data structure (or even the global variables of an entire module) and quickly determine what changes when a particular operation happens in the context of the debuggee.  For example, if you are trying to reverse engineer a function and want to understand some of the side effects it has on data structures or global variables, this function can help quickly identify modified areas without requiring you to analyze the entire function in a disassembler.
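Here is a rough Python sketch of the snapshot-and-compare idea, assuming two equal-length byte snapshots of the same memory range taken before and after an operation. It reports the modified offsets as half-open (start, end) ranges and lets you exclude ranges you already know about. It is only an illustration of the concept, not the extension’s implementation, and the function name is hypothetical.

```python
# Sketch: compare two snapshots of the same memory range and report which
# byte ranges changed, optionally ignoring ranges already known to change.


def changed_ranges(before, after, excluded=()):
    """Return a list of half-open (start, end) offsets where the snapshots
    differ, skipping any offset covered by an (start, end) pair in `excluded`."""
    if len(before) != len(after):
        raise ValueError("snapshots must cover the same range")

    def is_excluded(i):
        return any(start <= i < end for start, end in excluded)

    ranges = []
    run_start = None
    for i in range(len(before)):
        differs = before[i] != after[i] and not is_excluded(i)
        if differs and run_start is None:
            run_start = i                      # a modified run begins
        elif not differs and run_start is not None:
            ranges.append((run_start, i))      # the run ended at i
            run_start = None
    if run_start is not None:
        ranges.append((run_start, len(before)))
    return ranges


before = bytes(16)
after = bytearray(before)
after[2:4] = b"\x01\x02"
after[9] = 0xFF
after[12:16] = b"\xAA\xBB\xCC\xDD"

print(changed_ranges(before, after))
print(changed_ranges(before, after, excluded=[(12, 16)]))
```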

That’s all for this series.  There are a couple of extensions that I didn’t mention, but they are either very obvious or not generally useful enough to be worth mentioning.  The online help (!help) provides basic syntax and a very brief description about all extensions supported by SDbgExt, so you can find parameter information about all of the extensions I mentioned there.

Upcoming topics & suggestions

July 9th, 2006

For next week, I’ve got a series planned that describes some of the things you’ll run into if you are debugging programs on x64 Windows for the first time.  After that, I have something planned that will talk about how the various remote debugging options for WinDbg work (and when you want to use which ones).

Are there any other topics that you’d like me to cover in particular?  (Post a comment and let me know!)