Archive for the ‘Debugging’ Category

Don’t always trust the disassembler…

Friday, November 3rd, 2006

A coworker of mine just ran into a particularly strange bug in the WinDbg disassembler that turns out to be kind of nasty if you don’t watch out for it.

There is a bug with how WinDbg disassembles e9/imm32 jump instructions on x86. (The e9/imm32 instruction takes a 4-byte relative displacement from eip+5, which when combined with eip, is the new eip value after the jump. It is thus in the format e9XXXXXXXX, where XXXXXXXX is the relative displacement.)

Specifically, if the absolute value of the relative displacement immediate value for the e9 opcode is the virtual address of a named symbol in the target, WinDbg incorrectly displays the target of the jump as that symbol. This does not occur if it does not resolve to a named symbol, i.e. something that would just show up as a raw address and not as a name or name+offset in the debugger.

For example, here’s a quick repro that allocates some space in the target and builds an e9/imm32 instruction that is disassembled incorrectly:

0:001> .dvalloc 1000
Allocated 1000 bytes starting at 00980000
0:001> eb 00980000 e9;
ed 00980001 kernel32!CreateFileA;
u 00980000 l1
00980000 e9241a807c      jmp kernel32!CreateFileA (7c801a24)

Here, CreateFileA is not the target of the jump. Instead, 7d181a29 is the target.

You are most likely to see this problem when looking at e9-style jumps that are used with function hooks.

It is interesting to note that the uf (unassemble function with code analysis) is not fooled by the jump, and does properly recognize the real target for the jump. (Here, the target is bogus, so uf refuses to unassemble it. However, in testing, it has followed the jump correctly when the target is valid.)

This bug is appearing for me in WinDbg 6.6.7.5. I would assume that it is probably present in earlier versions.

Here’s why you do not normally see this bug:

  1. This bug only occurs if the relative displacement, when incorrectly taken as a raw absolute virtual address, matches the absolute virtual address of a named symbol. This means that if you tried to resolve an address as a symbol, you would get a name (or name+offset) back and not just a number.
  2. Most relative displacements are very small, except in atypical cases like function hooks. This means that as a general rule of thumb, it is very rare to see a relative displacement for an e9/imm32 jump that is more than a couple of megabytes.
  3. There are rarely any named symbols in the first few megabytes of the address space. The first 64K are reserved (in non-ntvdm processes) entirely, and after that, things like RTL_USER_PROCESS_PARAMETERS and the environmental variable block for the process, and soforth are typically present. These are allocated at the low end of the address space because by default, the memory manager allocates from low addresses and up. Therefore, low address ranges are very quickly used up by allocations.
  4. Most images do not have a preferred image base in the extreme low end of the address space.
  5. By the time most DLLs with conflicting base addresses are being loaded, the low end of the address space has been sufficiently consumed and/or fragmented such that there is not room to map the image section view in the extreme low end of the address space, continuing to prevent there being named symbols in this region.

Given this, it’s easy to see how this bug has gone unseen and unfixed for so long. Hopefully, it’ll be fixed in the next DTW drop.

Resolution to CreateIpForwardEntry failing on Vista

Friday, November 3rd, 2006

Previously, I had posted about a compatibility problem with Windows Vista if you used CreateIpForwardEntry to manage the IP routing table. In particular, if you call this routine on Vista with the intent to create a new route in the IP routing table, you may get an inexpicibly ERROR_BAD_ARGUMENTS error code returned.

There is an officially supported workaround, though it is not very well documented, and was in fact only recently made available in the Platform SDK documentation to my knowledge.

The official line is that you must call the GetIpInterfaceEntry function on Vista, if you wish to continue to be able to add routes.

(Yes, this does suck. It is a total breaking change for anyone who did route manipulation on OS’s prior to Vista, until you patch your programs out in the field. If this is unacceptable to you, I would encourage you to provide feedback to Microsoft about how this issue impacts customer experiences and your ability to deploy and use your product on Vista.)

I would encourage you to use this documented API instead of the undocumented solution that I posted about earlier, simply because it will (ostensibly) continue to work on future OS versions. (Although, given that the reason you have to call this is because of a breaking future-compatibility change in Vista, I am not sure that it is really justified to use that line here…)

Win32 calling conventions: __thiscall in assembler

Thursday, November 2nd, 2006

The final calling convention that I haven’t gone over in depth is __thiscall.

Unlike the other calling conventions that I have previously discussed, __thiscall is not typically explicitly decorated on functions (and in most cases, you cannot decorate it explicitly).

As the name might imply, __thiscall is used exclusively for functions that have a this pointer – that is, non-static C++ class member functions. In a non-static C++ class member function, the this pointer is passed as a hidden argument. In Microsoft C++, the hidden argument is the first actual argument to the routine.

When a function is __thiscall, you will typically see the this pointer passed in the ecx register. In this respect, __thiscall is rather similar to __fastcall, as the first argument (this) is passed via register in ecx. Unlike __fastcall, however, all remaining arguments are passed via the stack; edx is not supported as an additional argument register. Like __fastcall and __stdcall, when using __thiscall, the callee cleans the stack (and not the caller).

In some circumstances, with non-exported, internal-only-to-a-module functions, CL may use ebx instead of ecx for this. For any exported __thiscall function (or a function whose address escapes a module), the compiler must use ecx for this, however.

Continuing with the previous examples, consider a function implementation like so:

class C
{
public:
	int c;

	__declspec(noinline)
	int ThiscallFunction1(int a, int b)
	{
		return (a + b) * c;
	}
};

This function operates the same as the other example functions that we have used, with the exception that ‘c’ is a member variable and not a parameter.

The implementation of this function looks like so in assembler:

C::ThiscallFunction1 proc near

a= dword ptr  4
b= dword ptr  8

mov     eax, [esp+8]       ; eax=b
mov     edx, [esp+4]       ; edx=a
add     eax, edx           ; eax=a+b
imul    eax, [ecx]         ; eax=eax*this->c
retn    8                  ; return eax;
C::ThiscallFunction1 endp

Note that [ecx+0] is the offset of the member variable ‘c’ from this. This function is similar to the __stdcall version, except that instead of being passed as an explicit argument, the ‘c’ parameter is implicitly passed as part of the class object this and is then referenced off of the this pointer.

Consder a call to this function like this in C:

C* c = new C;
c->c = 3;
c->ThiscallFunction1(1, 2);

This is actually a bit more complicated than the other examples, because we also have a call to operator new to allocate memory for the class object. In this instance, operator new is a __cdecl function that takes a single argument, which is the count in bytes to allocate. Here, sizeof(class C) is 4 bytes.

In assembler, we can thus expect to see something like this:

push    4                    ; sizeof(class C)   
call    operator new         ; allocate a class C object
add     esp, 4               ; clean stack from new call
push    2                    ; 'a'
push    1                    ; 'b'
mov     ecx, eax             ; (class C* c)'this'
mov     dword ptr [eax], 3   ; c->c = 3
call    C::ThiscallFunction1 ; Make the call

Ignoring the call to operator new for the most part, this is relatively what we would expect. ecx is used to pass this, and this->c is set to 3 before the call to ThiscallFunction1, as we would expect, given the C++ code.

With all of this information, you should have all you need to recognize and identify __thiscall functions. The main takeaways are:

  • ecx is used as an argument register, along with the stack, but not edx. This allows you to differentiate between a __fastcall and __thiscall function.
  • Arguments passed on the stack are cleaned by the caller and not the callee, like __stdcall.
  • For virtual function calls, look for a vtable pointer as the first class member (at offset 0) from this. (For multiple inheritance, things are a bit more complex; I am ignoring this case right now). Vtable accesses to retrieve a function pointer to call through after loading ecx before a function call are a tell-table sign of a __thiscall virtual function call.
  • For functions whose visibility scope is confined to one module, the compiler sometimes substitutes ebx for ecx as a volatile argument register for this.

Note that if you explicitly specify a calling convention on a class member function, the function ceases to be __thiscall and takes on the characteristics of the specified calling convention, passing this as the first argument according to the conventions of the requested calling convention.

That’s all for __thiscall. Next up in this series is a brief review and table of contents of what we have covered so far with common Win32 x86 calling conventions.

Removing kernel patching on the fly with the kernel debugger

Wednesday, November 1st, 2006

Occasionally, you may find yourself in a situation where you need to “un-patch” the kernel in order to make forward progress with investigating a problem. This has often been the case with me and certain unnamed anti-virus programs that have the misguided intention to prevent the computer administrator from administering their computer, by denying everyone access to certain protected processes.

Normally, this is done by the use of a kernel driver that hooks various kernel system calls and prevents usermode from being able to access a protected process. While this may be done in the name of preventing malware from interfering with anti-virus software, it also has the unfortunate side effect of preventing legitimate troubleshooting of software issues.

Fortunately, with WinDbg installed and a little knowledge of the debugger, it is easy to reverse these abusive kernel patches that undermine the ability of a system administrator to do his or her job.

Now, normally, you might think that one would be stuck reverse engineering large sections of code in order to disable such kinds of protection mechanisms. However, in the vast majority of cases like these, you can simply have the kernel debugger perform a comparison of the kernel memory image with the image retrieved from the symbol server, and fix up any differences (accounting for relocations). This may be done with the !chkimg -f nt command. Using !chkimg in this fashion allows you to quickly remove unwanted kernel patches without having to dig through third party code that has injected itself into the system.

If you are feeling particularly adventurerous, you can even do this in local kernel debugger mode on Windows XP or later, without having to boot the system with /DEBUG. Be warned that this does carry an inherent race condition, though very unlikely in most cases with system service patching, that you might crash the system if someone makes a call to one of the regions of the kernel that you are unpatching while the kernel is being restored to its pristine state.

You should also be aware that depending on how the third party software that has patched the kernel is written, removing the patches out from under it may have varying negative side effects; be careful. As a result, if you are working on a critical or production system, you may want to pick a different approach. If you are just working on a throw-away repro environment in a VM, though, this can be a good quick-n-dirty way to get the job done.

Despite these potential problems, I’ve successfully used this trick in a pinch several times successfully. If you are running into a brick wall with debugging malfunctioning anti-virus software interactions with your product because of anti-debug protection mechanisms, you might give this technique a try.

Debugging programs that block symbol server access

Tuesday, October 31st, 2006

One of the rather frustrating things to debug in Windows is one of the programs or services that is in the path of symbol server requests. This is most often the case with a service running in a svchost group with a number of other services that are responsible for things like DNS or some internal support that WinInet relies on or soforth. This deadlock condition is also prone to happening if you are debugging something else that makes lots of calls out to WinInet, as WinInet has some cross-process state that allows the symbol server engine to get deadlocked waiting on the target to release a global mutex (or similar global synchronization / state).

If you naively try to simply attach a debugger and load symbols in such a situation, you’ll end up with a nasty surprise; the debugger will hang, and you’ll have to kill it (and whatever you were debugging) to recover. In the case of svchost groups, many of those services will not properly restart after just being abruptly killed, so you may even have to reboot.

Now, one obvious solution to this problem is to just turn off symbol server access at all and work without symbols. This is obviously a major pain, though – nobody wants to debug without symbols if you actually have access to them, right?

There are a couple of other things that you can do to debug things in this scenario, however, that are a bit less painful than forgoing symbols:

First, you can use the kernel debugger to debug user mode processes. I have often seen Microsoft employees recommend this on the newsgroups. While this works (the kernel debugger is not affected by the state of any services that you are poking on the target computer), it is most certainly a royal pain to do. Using the kernel debugger means that breakpoints will often affect all processes (unless used via ba), you’ll have to deal with parts of the program being paged out, and not to mention the fact that kernel debugger connections are typically much less responsive than a local user mode debugger.

Because of the inordinate amount of pain involved in using kd to debug a user mode process, I do not recommend ever going this route unless you absolutely positively have no other recourse for accurate debugging.

Therefore, I recommend a different procedure in this case:

  1. Disable symbol server entirely (remove all SRV* symbol server references from your symbol path) or make sure that you have loaded symbols for ntdll.dll.
  2. Attach to the process in question normally, but do not load symbols or issue any commands.
  3. Write a full user minidump out to disk somewhere. For example, .dump /ma c:\tmp.dmp.
  4. Detach from the process that blocks symbol server from working.
  5. Open the minidump you saved earlier, and issue a .reload /f command. This causes all symbols in the process to be downloaded from the symbol server if you do not already have them.
  6. Re-attach to the process that you wanted to debug, and set your symbol path to refer to the downstream store you used with symbol server, but without invoking symbol server. That is, if you had previously used SRV*C:\symbols*http://msdl.microsoft.com/download/symbols for your symbol path, set it to C:\symbols. This ensures that you will never try to hit the symbol server.
  7. Debug as normal.

After that, you’ll have everything in the process that could have symbols on the symbol server already downloaded into your downstream store. By turning off symbol server and just using the downstream store when you actually start your real debugging efforts, you’ll make sure that the debugger will never deadlock itself against the target. Best of all, you don’t need to download a very large symbol pack from Microsoft that might be missing files that have been hotfixed, since you are still (preloading) all of the symbols from the symbol server. This trick can, of course, also be used with your own internal symbol servers as well.

Win32 calling conventions: __fastcall in assembler

Monday, October 30th, 2006

The __fastcall calling convention is the last major major C-supported Win32 (x86) calling convention that I have not covered yet. (There still exists __thiscall, which I’ll discuss later).

__fastcall is, as you might guess from the name, a calling convention that is designed for speed. In this spirit, it attempts to borrow from many RISC calling conventions in that it tries to be register-based instead of stack based. Unfortunately, for all but the smallest or simplest functions, __fastcall typically does not end up being a particularly stellar thing performance-wise, for x86, primarily due to the (comparatively) extremely limited register set that x86 sports.

This calling convention has a great deal in common with the x64 calling convention that Win64 uses. In fact, aside from the x64-specific parts of the x64 calling convention, you can think of the x64 calling convention as a logical extension of __fastcall that is designed to take advantage of the expanded register set available with x64 processors.

What this boils down to is that __fastcall will try to pass the first two pointer-sized arguments in the ecx and edx registers. Any additional registers are passed on the stack as per __stdcall.

In practice, the key things to look out for with a __fastcall function are thus:

  • The callee assumes a meaningful value in the ecx (or edx and ecx) registers. This is a tell-tale sign of __fastcall (although, you may sometimes see __thiscall make use of ecx for the this pointer).
  • No arguments are cleaned off the stack by the caller. Only __cdecl functions have this property.
  • The callee ends in a retn (args-2)*4 instruction. In general, this is the pattern that you will see with __fastcall functions that use the stack. For __fastcall functions where no stack parameters are used, the function typically ends in a ret instruction with no stack displacement argument.
  • The callee is a short function with very few arguments. These are the most likely cases where a smart programmer will use __fastcall, as otherwise, __fastcall does not tend to buy you very much over __stdcall.
  • Functions that interface directly with assembler. Having access to ecx and edx can be a handy shortcut for a C function that is being called by something that is obviously written in assembler.

Taking these into account, let’s take a look at the same sample function and function call that we have been previously dealing with in our earlier examples, this time in __fastcall.

The function that we are going to call is declared as so:

__declspec(noinline)
int __fastcall FastcallFunction1(int a, int b, int c)
{
	return (a + b) * c;
}

This is consistent with our previous examples, save that it is declared __fastcall.

The function call that we shall make is as so:

FastcallFunction1(1, 2, 3);

With this code, we can expect the function call to look something like so in assembler:

push    3                 ; push 'c' onto the stack
push    2                 ; place a constant 2 on the stack
xor     ecx, ecx          ; move 0 into 'a' (ecx)
pop     edx               ; pop 2 off the stack and into edx.
inc     ecx               ; set 'a' -- ecx to 1 (0+1)
call    FastcallFunction1 ; make the call (a=1, b=2, c=3)

This is actually a bit different than we might expect. Here, the compiler has been a bit clever and used some basic optimizations with setting up constants in registers. These optimizations are extremely common and something that you should get used to seeing as simply constant assignments to registers, given how frequently they show up. In a future series, I’ll go into some more details as to common compiler optimizations like these, but that’s a tale for a different time.

Continuing with __fastcall, here’s what the implementation of FastcallFunction1 looks like in assembler:

FastcallFunction1 proc near

c= dword ptr  4

lea     eax, [ecx+edx] ; eax = a + b
imul    eax, [esp+4]   ; eax = (eax * c)
retn    4              ; return eax;
FastcallFunction1 endp

As you can see, in this particular instance, __fastcall turns out to be a big saver as far as instructions executed (and thus size, and in a lesser degree, speed) of the callee. This kind of benefit is usually restricted to extremely simple functions, however.

The main things, then, to consider if you are trying to identify if a function is __fastcall or not are thus:

  • Usage of the ecx (or ecx and edx) registers in the function without loading them with explicit values before-hand. This typically indicates that they are being used as argument registers, as with __fastcall.
  • The caller does not clean any arguments off the stack (no add esp instruction to clean the stack after the call). With __fastcall, the callee always cleans the arguments (if any).
  • A ret instruction (with no stack displacement argument) terminating the function, if there are two or less arguments that are pointer-sized or smaller. In this case, __fastcall has no stack arguments.
  • A retn (args-2)*4 instruction terminating the function, if there are three or more arguments to the function. In this case, there are stack arguments that must be cleaned off the stack via the retn instruction.

That’s all for __fastcall. More on other calling conventions next time…

Debugger commands review

Sunday, October 29th, 2006

This posting is a master list of all the other posts and post series that cover different WinDbg commands, whether they be built-in commands, extension commands, or even third-party extension commands.

  1. Using SDbgExt to aid your debugging and reverse engineering efforts (part 1). SDbgExt is the debugger extension that I maintain and make publicly available. This series provides a high-level overview of the different commands that it offers.
  2. SDbgExt extensions, part 2
  3. Useful WinDbg commands: .formats
  4. Using knf to track down excessive stack usage. This trick is discussed in a section of the “Beware of stack usage with the new network stack in Windows Vista” post.
  5. Removing kernel patching on the fly with the kernel debugger. This article discusses how you can use the !chkimg command to remove patches and hooks on loaded module code at runtime. (This particular command is also available and applicable to the user mode debuggers, and not just the kernel debugger.)
  6. Debugger flow control: More on breakpoints (part 2). This article explores some of the inner workings of the various breakpoints supported by WinDbg. In addition, it describes the .apply_dbp command that can be used to apply a set of hardware breakpoints to the current register context, or a saved register context image in-memory.
  7. SDbgExt 1.09 released (support for displaying x64 EH data). This article describes the !fnseh command in SDbgExt that can be used to view exception handlers and unwind handlers for x64 targets from the debugger.
  8. Useful debugger commands: .writemem and .readmem. This article covers the .writemem and .readmem commands that can be used to move large sections of raw data into or out of the debugger.

Things to watch out for if you hook functions on Windows Vista

Friday, October 27th, 2006

There are a couple of things that I have ran into that you should keep in mind if you are hooking functions and are planning to run under Windows Vista.

First, watch out for things being moved around in memory. For example, in Windows Vista, the VirtualProtect function in kernel32 and the CreateProcessA function in kernel32 are now on the same page, for the x86 build [NOTE: this is subject to rapid change with hotfixes, and may not still be the case on RTM]. If you have some code that works conceptually like so:

DWORD  OldProt;
PVOID  MyCreateProcessA;
PUCHAR _CreateProcessA;
static ULONG MyHook;

MyHook = (ULONG)&MyCreateProcessA;

VirtualProtect(_CreateProcessA, 6,
	PAGE_READWRITE, &OldProt);

//
// [...] Disassembly and stub saving
//       code goes here...
//

//
// jmp dword ptr [MyHook]
//

_CreateProcessA[0] = 0xFF:
_CreateProcessA[1] = 0x25;
*(PULONG)(&_CreateProcessA[2]) = &MyHook;

VirtualProtect(_CreateProcessA, 6,
	OldProt, &OldProt);

… you’ll run into some strange crashes in Vista, because you might end up making the pages backing VirtualProtect’s implementation non-executable by accident. (Remember that memory protections only have page granularity.)

The solution? Use PAGE_EXECUTE_READWRITE for your “intermediate” states when hooking things.

Secondly, watch out for AcLayers.dll and ShimEng.dll. These two DLLs are the core of Microsoft’s Application Compatibility Layer, which is the engine used to apply compatibility fixes at runtime to broken programs that would otherwise fail to work on Windows Vista. (This engine is also used if you select a particular compatibility layer in the property sheet for a shortcut to an executable or an executable.)

The thing to watch out for here is that AcLayers likes to do import table hooking on various kernel32 APIs. In particular, AcLayers tends to hook GetProcAddress and then occasionally redirect returned function pointers to point into AcLayers.dll and not kernel32.dll. If you have a program that assumes that any pointer that it retrieves from kernel32.dll via GetProcAddress will remain at the same address for any other process in the same session, this can result in some unpleasant surprises.

For instance, consider the classic case of wanting to inject some code to run before the main process entrypoint of a child process. You might do something like inject some code that calls kernel32!LoadLibraryA on some DLL your application surprise, and then kernel32!GetProcAddress to get the address of a function in that DLL. Then the patch code might invoke a function in your DLL and return to the initial program entrypoint of the child process. This is actually a fairly common paradigm if you need to modify some sort of behavior of a child process. Unfortunately, it can easily break if the parent process is under the influence of the dreaded application compatibility layer.

The main problem here is that when you, say, find the address of LoadLibraryA or GetProcAddress in kernel32, AcLayers.dll steps in and actually hands you the address of a stub function inside AcLayers.dll which filters requests to load DLLs or get function pointers. This is all well and fine with the parent process; AcLayers.dll is there and can do whatever it’s work is whenever you call GetProcAddress or LoadLibraryA.

The catch is what happens when you try to make a child process call LoadLibraryA on a DLL before it runs the main program entrypoint. In this case, instead of passing a pointer into kernel32 (which is guaranteed to be present and at the same base address in every Win32 process in the same session), you are passing a pointer into AcLayers.dll to the child process. The problem case is when AcLayers.dll is not loaded immediately into the child process. Here, your patch code in the newly created child process might try to call LoadLibraryA to get your custom DLL unloaded. However, it actually tries to call an internal AcLayers.dll function – but AcLayers.dll isn’t actually loaded into the address space of the child process (or might have even been rebased), so your child process mysteriously crashes instantly. This typically manifests itself as nothing happening when you try to launch a child program, depending on computer configuration.

There is unfortunately no particularly elegant way to work around this particular problem that I have found. The best advice I have to offer here is to try and bypass any possibility that any function pointer you pass to another process (in kernel32.dll) is never intercepted by AcLayers.dll. Perhaps the most fool-proof way to do this is to manually walk the export table of kernel32.dll and locate the address of the export that you are interested in, although this is not a particularly easy task.

Beware of stack usage with the new network stack in Windows Vista

Tuesday, October 24th, 2006

In Windows Vista, much of the network stack that ships with the OS uses much more stack than in previous versions of the operating system.

From my experience, just indicating a UDP datagram up to NDIS can require you to have over 4K of kernel stack available on x86, or you risk taking a double fault and causing the system to bugcheck.

For example, here’s a portion of the stack that I ran into while debugging an unrelated problem at the Vista compatibility lab:

0: kd> k100
ChildEBP RetAddr  
818e6bdc 818ad19b RtlpBreakWithStatusInstruction
818e6c2c 818adc08 KiBugCheckDebugBreak+0x1c
818e6fdc 8184845e KeBugCheck2+0x5f4
818e6fdc 81871d35 KiTrap08+0x75
9c9cb084 8186dd14 SepAccessCheck+0x1e0
9c9cb0e0 81887907 SeAccessCheck+0x1a4
9c9cb51c 8715474c SeAccessCheckFromState+0xe4
9c9cb55c 871546d6 CompareSecurityContexts+0x47
9c9cb57c 87153b1a MatchValues+0xd4
9c9cb59c 87153aa7 CheckEqualConditionEnumMatch+0x3f
9c9cb63c 87153a1b MatchConditionOverlap+0x72
9c9cb660 87153774 FilterMatchEnum+0x6c
9c9cb674 8715948b FilterMatchEnumVisible+0x28
9c9cb6ac 87159520 IndexHashFastEnum+0x4d
9c9cb6f8 87158624 IndexHashEnum+0x139
9c9cb724 87159362 FeEnumLayer+0x7a
9c9cb7ac 87159b16 KfdGetLayerActionFromEnumTemplate+0x50
9c9cb7cc 8d6af9e4 KfdCheckAndCacheAcceptBypass+0x27
9c9cb8c4 8d6afc87 CheckAcceptBypass+0x146
9c9cb9a0 8d6b185d WfpAleAuthorizeReceive+0x82
9c9cba08 8d6ad542 WfpAleConnectAcceptIndicate+0x98
9c9cba74 8d6ad432 ProcessALEForTransportPacket+0xc5
9c9cbaf0 8d6ae6b3 ProcessAleForNonTcpIn+0x6f
9c9cbd28 8d6b0df0 WfpProcessInTransportStackIndication+0x2ab
9c9cbd78 8d6b0ae0 InetInspectReceiveDatagram+0x9a
9c9cbdfc 8d6b091c UdpBeginMessageIndication+0x33
9c9cbe44 8d6aecf3 UdpDeliverDatagrams+0xce
9c9cbe90 8d6aec40 UdpReceiveDatagrams+0xab
9c9cbea0 8d6acdd4 UdpNlClientReceiveDatagrams+0x12
9c9cbecc 8d6acba4 IppDeliverListToProtocol+0x49
9c9cbeec 8d6acad3 IppProcessDeliverList+0x2a
9c9cbf40 8d6ab443 IppReceiveHeaderBatch+0x1da
9c9cbfd0 8d6ac61d IpFlcReceivePackets+0xc06
9c9cc04c 8d6abf36 FlpReceiveNonPreValidatedNetBufferListChain
                  +0x6db
9c9cc074 8727b0b0 FlReceiveNetBufferListChain+0x104
9c9cc0a8 8726d737 ndisMIndicateNetBufferListsToOpen+0xab
9c9cc0d0 8726d6ae ndisIndicateSortedNetBufferLists+0x4a
9c9cc24c 871b53c3 ndisMDispatchReceiveNetBufferLists+0x129
9c9cc268 872802c4 ndisMTopReceiveNetBufferLists+0x2c
9c9cc2b4 b0a3fb4d ndisMIndicatePacketsToNetBufferLists+0xe9

From ndisMIndicatePacketsToNetBufferLists to where the system double faulted (in my case) inside of SeAccessCheck, a whopping
4656 bytes
of kernel stack were consumed.

So, now is the time to slim down your stack usage in your NDIS-related drivers, or you might be in for some unpleasant surprises when your drivers are used in conjunction with multiple third party IM drivers or the like (even better, you might investigate switching away from IM drivers and to the new filtering architecture). You should also be especially wary of any code that loops a packet that might potentially go back into tcpip.sys in a receive calling context (or any other context where you might have limited stack space available), as this can prove an unexpectedly expensive operation on Vista (and potentially beyond).

Oh, and a tip for finding stack hog functions with stack overflow problems: Use the ‘f’ flag with the ‘k’ command in WinDbg. For example:

0: kd> knf
 #   Memory  ChildEBP RetAddr  
00           818e6bdc 818ad19b RtlpBreakWithStatusInstruction
01        50 818e6c2c 818adc08 KiBugCheckDebugBreak+0x1c
02       3b0 818e6fdc 8184845e KeBugCheck2+0x5f4
03         0 818e6fdc 81871d35 KiTrap08+0x75
[...]

This has the debugger compute the stack (arguments + locals) usage at each call frame point for you, saving you a bit of work with the calculator.

Debugging (or reverse engineering…) a real life Windows Vista compatibility problem: CreateIpForwardEntry in iphlpapi

Tuesday, October 24th, 2006

Since I’m at the Microsoft Vista compatibity lab, it only makes sense that I’ve fixed a few Vista compatibility bugs in our product today.

Some of these are real bugs, but I ran into one in particular that is particularly infuriating: a completely undocumented, seemingly completely arbitrary restriction placed on a publicly documented API that has been around since Windows 98.

In this particular case, I was running into a problem where one of our products was being unable to add routes on Vista. This worked fine on prior platforms we supported, and so I started looking into it as a compatibility problem. First things first, I narrowed the problem down to a particular API that was failing.

We have a function that wrappers the various details about creating routes. The function in question went approximately like so:

//
// Add a route through the desired gateway.
//

DWORD
AddRoute(
	__in unsigned long Network,
	__in unsigned long Mask,
	__in unsigned long Gateway
	)
{
	MIB_IPFORWARDROW Row;
	DWORD            Status, ForwardType;
	unsigned long    InterfaceIp, InterfaceIndex;

[...]	// (Code to determine the local
	// interface to add the route on)

	//
	// Setup the IP forward row.
	//

	ZeroMemory(&Row,
		sizeof(Row));

	Row.dwForwardDest    = Network;
	Row.dwForwardMask    = Mask;
	Row.dwForwardPolicy  = 0;
	Row.dwForwardNextHop = Gateway;
	Row.dwForwardIfIndex = InterfaceIndex;
	Row.dwForwardType    = ForwardType;
	Row.dwForwardProto   = PROTO_IP_NETMGMT;
	Row.dwForwardAge     = INFINITE;
	Row.dwForwardMetric1 = 0;

	//
	// Create the route.
	//

	if ((Status = CreateIpForwardEntry(&Row))
		!= NO_ERROR)
	{
		wprintf(L"Creation failed, %lu.\\n",
			Status);
		return Status;
	}

[...]	// (More unrelated boilerplate code)

	return Status;
}

Essentially, the problem here was that CreateIpForwardEntry was failing. Checking logs, the error code logged was 0xA0.

Using the handy Microsoft error code lookup utility (err.exe), it was easy to determine what this error code means:

C:\\>err a0
# for hex 0xa0 / decimal 160 :
  INTERNAL_POWER_ERROR                            bugcodes.h
  LLC_STATUS_BIND_ERROR                           dlcapi.h
  SQL_160_severity_15                             sql_err
# Rule does not contain a variable.
  ERROR_BAD_ARGUMENTS                             winerror.h
# One or more arguments are not correct.
  SCW_E_TOOMUCHDATAIN                             wpscoserr.mc
# Too much incoming data%0
# 5 matches found for "a0"

The only error that makes sense in this context is ERROR_BAD_ARGUMENTS. Unfortunately, that is not really all that helpful. Checking the latest MSDN documentation for CreateIpForwardEntry, there is, of course, no mention of this error code whatsoever.

Additionally, looking at the Microsoft documentation, nothing immediately jumped to mind as to what the problem is.

Although the Microsoft people here for the Vista lab did offer to see about getting me in touch with someone in the product team who might have an explanation for this behavior, I eventually decided that I would just take a crack at digging into the internals of CreateIpForwardEntry and understand the problem myself in the meanwhile to see if I might be able to come up with a fix sooner. After searching around a bit on Google and not coming up with any good explanation for what was going wrong, I eventually decided to step into iphlpapi!CreateIpForwardEntry in the debugger and see just what was going wrong first-hand.

0:000> bu iphlpapi!CreateIpForwardEntry
breakpoint 0 redefined
0:000> g
Breakpoint 0 hit
eax=0012fd6c ebx=00000004 ecx=00000000 edx=00000000
esi=01040a0a edi=00000003
eip=751bdfc1 esp=0012fd58 ebp=0012fdb0 iopl=0
nv up ei pl nz ac pe nc
cs=001b  ss=0023  ds=0023  es=0023  fs=003b  gs=0000
efl=00000216
iphlpapi!CreateIpForwardEntry:
751bdfc1 8bff            mov     edi,edi

Looking at the disassembly of CreateIpForwardEntry, it’s clear that this function is now just a stub that forwards the call onto another function that performs the real work:

0:000> u @eip
iphlpapi!CreateIpForwardEntry:
751bdfc1 8bff       mov     edi,edi
751bdfc3 55         push    ebp
751bdfc4 8bec       mov     ebp,esp
751bdfc6 6a01       push    1
751bdfc8 ff7508     push    dword ptr [ebp+8]
751bdfcb e820ffffff call    CreateOrSetIpForwardEntry
751bdfd0 5d         pop     ebp
751bdfd1 c20400     ret     4

So, I pressed onward, stepping into iphlpapi!CreateOrSetIpForwardEntry

0:000> tc
iphlpapi!CreateIpForwardEntry+0xa:
751bdfcb e820ffffff call    CreateOrSetIpForwardEntry
0:000> t
eax=0012fd6c ebx=00000004 ecx=00000000 edx=00000000
esi=01040a0a edi=00000003
eip=751bdef0 esp=0012fd48 ebp=0012fd54 iopl=0
nv up ei pl nz ac pe nc
cs=001b  ss=0023  ds=0023  es=0023  fs=003b  gs=0000
efl=00000216
iphlpapi!CreateOrSetIpForwardEntry:
751bdef0 8bff            mov     edi,edi

Looking at the disassembly, there appears to be only one place where the error code ERROR_BAD_ARGUMENTS (disassembly truncated for better viewing):

0:000> uf @eip
iphlpapi!CreateOrSetIpForwardEntry:
751bdef0 8bff            mov     edi,edi
751bdef2 55              push    ebp
751bdef3 8bec            mov     ebp,esp
751bdef5 83ec48          sub     esp,48h
751bdef8 8365b800        and     dword ptr [ebp-48h],0
751bdefc 56              push    esi
751bdefd 6a2c            push    2Ch
751bdeff 8d45bc          lea     eax,[ebp-44h]
751bdf02 6a00            push    0
751bdf04 50              push    eax
751bdf05 e8f053ffff      call    memset
751bdf0a 8b7508          mov     esi,dword ptr [ebp+8]

[...]

;
; Convert the interface metric we passed in with
; the pRoute structure into an interface LUID,
; stored at [ebp-30].
;

751bdf36 8d45d0          lea     eax,[ebp-30h]
751bdf39 50              push    eax
751bdf3a ff7610          push    dword ptr [esi+10h]
751bdf3d e86590ffff      call    ConvertInterfaceIndexToLuid
751bdf42 85c0            test    eax,eax
751bdf44 7571            jne     751bdfb7


;
; Get the interface metric for the requested interface,
; and store it at [ebp+8].  We pass in the address of
; the LUID of the requested interface in order to make
; the check.
;

iphlpapi!CreateOrSetIpForwardEntry+0x56:
751bdf46 8d4508          lea     eax,[ebp+8]
751bdf49 50              push    eax
751bdf4a 8d45d0          lea     eax,[ebp-30h]
751bdf4d 50              push    eax
751bdf4e e802f4ffff      call    GetInterfaceMetric

[...]

;
; Load esi with pRoute->dwForwardMetric1
;

751bdf6c 8b7624          mov     esi,dword ptr [esi+24h]
751bdf6f 6a06            push    6
751bdf71 8945e0          mov     dword ptr [ebp-20h],eax
751bdf74 83c8ff          or      eax,0FFFFFFFFh
751bdf77 3b7508          cmp     esi,dword ptr [ebp+8]
751bdf7a 59              pop     ecx
751bdf7b 8d7de8          lea     edi,[ebp-18h]
751bdf7e f3ab            rep stos dword ptr es:[edi]
751bdf80 8945ec          mov     dword ptr [ebp-14h],eax
751bdf83 8945f0          mov     dword ptr [ebp-10h],eax
751bdf86 5f              pop     edi

;
; Check that esi is not less than [ebp+8]
; ... in other words, verify that
; pRoute->dwForwardMetric1 >= InterfaceMetric,
; where InterfaceMetric is set by GetInterfaceMetric()
;

751bdf87 7229            jb      751bdfb2 ; failure

iphlpapi!CreateOrSetIpForwardEntry+0x99:
751bdf89 2b7508          sub     esi,dword ptr [ebp+8]
751bdf8c 6a18            push    18h
751bdf8e 8d45e8          lea     eax,[ebp-18h]
751bdf91 50              push    eax
751bdf92 6a30            push    30h
751bdf94 8d45b8          lea     eax,[ebp-48h]
751bdf97 50              push    eax
751bdf98 6a10            push    10h
751bdf9a 6864331b75      push    751b3364
751bdf9f ff750c          push    dword ptr [ebp+0Ch]
751bdfa2 8975f4          mov     dword ptr [ebp-0Ch],esi
751bdfa5 6a01            push    1
751bdfa7 c645ff01        mov     byte ptr [ebp-1],1

;
; Call the NsiSetAllParameters internal API to create the
; route, and return its return value to the caller.
;

751bdfab e86857ffff      call    NsiSetAllParameters
751bdfb0 eb05            jmp     751bdfb7
[...]

iphlpapi!CreateOrSetIpForwardEntry+0xc2:
;
; Return ERROR_BAD_ARGUMENTS
;
751bdfb2 b8a0000000      mov     eax,0A0h

iphlpapi!CreateOrSetIpForwardEntry+0xc7:
751bdfb7 5e              pop     esi
751bdfb8 c9              leave
751bdfb9 c20800          ret     8

From this annotated disassembly, we can conclude that there are only two possibilities that might result in this behavior. The first is that GetInterfaceMetric(InterfaceIndex, &InterfaceMetric) is returning an InterfaceMetric greater than the metric we are supplying. The second is that NsiSetAllParameters is returning ERROR_BAD_ARGUMENTS.

To test this theory, we need to examine the comparison at 751bdf87 to determine if that is taking the failure branch, and we need to check the return value of NsiSetAllParameters. This is fairly easy to do with a couple of breakpoints:

0:000> bu 751bdf87 
0:000> bu 751bdfb0 
0:000> g
Breakpoint 1 hit
eax=ffffffff ebx=00000004 ecx=00000000 edx=7707e524
esi=00000000 edi=00000003
eip=751bdf87 esp=0012fcf8 ebp=0012fd44 iopl=0
nv up ei ng nz ac pe cy
cs=001b  ss=0023  ds=0023  es=0023  fs=003b  gs=0000
efl=00000297
iphlpapi!CreateOrSetIpForwardEntry+0x97:
751bdf87 7229            jb      751bdfb2 [br=1]

Our first breakpoint, the one on the comparison with the “Interface Metric” and the route metric we supplied in pRoute->dwForwardMetric1, was the one that hit first (as expected). Looking at the register context supplied by WinDbg, though, we can clearly see that the program is going to take the branch and head down the code path that returns ERROR_BAD_ARGUMENTS. Problem identified!

There still remains the issue of solving the problem, though. Looking at [ebp+8], it appears that the undocumented iphlpapi!GetInterfaceMetric returned 10:

0:000> ? dwo(@ebp+8)
Evaluate expression: 10 = 0000000a

This makes sense. We supplied a metric of 0, which is obviously less than 10. Unfortunately, now we need a good way to determine whether we should use a zero metric (for previous OS versions) or a different metric (for Vista), assuming we want our route to be the most precedent for a particular network/mask value.

Unfortunately, MSDN doesn’t turn up any hits on GetInterfaceMetric, and neither does Google. Well, that sucks – it looks like that for Vista, unless I want to hardcode 10, I’ll have to go off into undocumented land to use a publicly documented API. There seems to be something a bit ironic about that to me, but, nonetheless, the problem remains to be solved.

Update: There is a (minimally) documented solution that was very recently made available. See the bottom of the post for details.

So, all that we need to do is reverse engineer the parameters to this undocumented GetInterfaceMetric function and call it, right?

Well, no, not exactly – things actually get worse. It turns out that GetInterfaceMeteric is not even exported from iphlpapi.dll – it’s a purely internal function!

The only other option at this point, aside from hardcoding 10 as a minimum metric, is to reimplement all of the functionality of GetInterfaceMetric ourselves. Taking a look at GetInterfaceMetric, things look unfortunately rather complicated:

0:000> uf iphlpapi!GetInterfaceMetric
iphlpapi!GetInterfaceMetric:
751bd355 8bff            mov     edi,edi
751bd357 55              push    ebp
751bd358 8bec            mov     ebp,esp
751bd35a 6a1c            push    1Ch
751bd35c 6a04            push    4
751bd35e ff750c          push    dword ptr [ebp+0Ch]
751bd361 6a00            push    0
751bd363 6a08            push    8
751bd365 ff7508          push    dword ptr [ebp+8]
751bd368 6a07            push    7
751bd36a 6864331b75      push    NPI_MS_IPV4_MODULEID
751bd36f 6a01            push    1
751bd371 e88f5fffff      call    NsiGetParameter
751bd376 5d              pop     ebp
751bd377 c20800          ret     8

NPI_MS_IPV4_MODULEID is a global variable of some sort in iphlpapi:

0:000> db iphlpapi!NPI_MS_IPV4_MODULEID l8
751b3364  18 00 00 00 01 00 00 00  ........

Using the x command with ascending order, we can make an educated guess as to the size of this global by enumerating all symbols in iphlpapi in address space order:

0:000> x /a iphlpapi!*
[...]
751b3364 iphlpapi!NPI_MS_IPV4_MODULEID = <no type information>
751b3381 iphlpapi!NsiAllocateAndGetTable = <no type information>
[...]

So, we know that NPI_MS_IPV4_MODULEID must be no more than 0x1d bytes long. Taking a look around NPI_MS_IPV4_MODULE_ID, we see that past 0x18 bytes in, there appears to be code (nop instructions), making it likely that the global is 0x18 bytes long.

0:000> db 751b3364 
751b3364  18 00 00 00 01 00 00 00-00 4a 00 eb 1a 9b d4 11
751b3374  91 23 00 50 04 77 59 bc-90 90 90 90 90 ff 25 94

(The repeated 90 90 90 90 bytes are a typical sign of code. 90 is the opcode for the nop instruction on x86, which the compiler typically uses for padding out function start offsets for alignment.)

Given this, we should be able to replicate the behavior of GetInterfaceMetrics, as the only function it calls, NsiGetParameter, is exported by nsi.dll (of course, it isn’t documented…). From the above disassembly, we can see that NsiGetParameter takes a ulong-sized argument (constant 0x1), a pointer argument (address of NPI_MS_IPV4_MODULEID), a ulong-sized argument (constant 0x7), a pointer that is the address of the interface LUID (argument 1 of GetInterfaceMetrics, which we saw earlier), a ulong-sized argument (constant 0x8), a ulong or pointer-sized argument (constant 0x0), a pointer-sized argument (address of a ULONG containing the “interface metric”), a ulong-sized argument (constant 0x4), and (finally!) a ulong-sized argument (constant 0x1c). I would surmise that the 0x8 and 0x4 constants are the sizes of the LUID and output buffer, though I haven’t bothered to confirm that at this point.

From our knowledge of __stdcall, we can identify NsiGetParameter as __stdcall quickly by looking at the disassembly of GetInterfaceMetrics and noticing the behavior after the function call (not removing arguments from the stack space, assuming the callee (NsiGetParameter) performs that task.

Given all of this, we can make our own function that implements GetInterfaceMetric. Now, just to be clear, I would not recommend actually using this, unless Microsoft fails to provide a documented mechanism to determine the minimum metric permitted for CreateIpForwardEntry (or removes the restriction) prior to Vista RTM. I am going to try and do whatever I can to see what ISV’s are supposed to do with this particular problem (and whether it can be fixed before RTM) before this week is up, but in the event that I don’t get anywhere, I’ll have a backup plan (as ugly and hackish as it may be) – better than not being able to manipulate the route table, period, on Vista.

Anyway, the basic idea is that we call ConvertInterfaceIndexToLuid on the InterfaceIndex that we already have from iphlpapi, to convert this into a NET_LUID structure (new to Vista). It does so happen that ConvertInterfaceIndexToLuid is a documented API, which makes that the easy part.

Then, we simply replicate the call that we saw in GetInterfaceMetric inside iphlpapi.dll. For brevity, I am not posting the entire source code for my implementation of GetInterfaceMetric inline; you can, however, download it. With this reverse engineered implementation, all that is left is to call it to get the minimum metric for the interface we are about to add a route on, and place that metric in the MIB_IPFORWARDROW that we pass to CreateIpForwardEntry.

I’ll post back when I hear from Microsoft as to the official word as to how one is to handle this situation; I fully expect that there will be a documented API (or the restriction will go away) before RTM, at this point, given that this is a rather bad compatibility bug that breaks a long-existing documented API in such a way that requires you to go into undocumented hackery to continue to use it (especially since there is no other good way that I know of to replicate the functionality of the API in question).

Update: You can use the GetIpInterfaceEntry routine (new to Vista, in iphlpapi) to find the minimum metric for an interface. Note that you will very likely need to search on MSDN to find information on this function, as it’s not been included in recent SDKs to my knowledge.

(Note: Some of the debugger output was slightly modified or truncated by me to keep the formatting sane.)