Archive for the ‘Windows’ Category

Win32 calling conventions: __fastcall in assembler

Monday, October 30th, 2006

The __fastcall calling convention is the last major C-supported Win32 (x86) calling convention that I have not covered yet. (There still exists __thiscall, which I’ll discuss later.)

__fastcall is, as you might guess from the name, a calling convention that is designed for speed. In this spirit, it attempts to borrow from many RISC calling conventions in that it tries to be register-based instead of stack based. Unfortunately, for all but the smallest or simplest functions, __fastcall typically does not end up being a particularly stellar thing performance-wise, for x86, primarily due to the (comparatively) extremely limited register set that x86 sports.

This calling convention has a great deal in common with the x64 calling convention that Win64 uses. In fact, aside from the x64-specific parts of the x64 calling convention, you can think of the x64 calling convention as a logical extension of __fastcall that is designed to take advantage of the expanded register set available with x64 processors.

What this boils down to is that __fastcall will try to pass the first two pointer-sized arguments in the ecx and edx registers. Any additional arguments are passed on the stack as per __stdcall.

In practice, the key things to look out for with a __fastcall function are thus:

  • The callee assumes a meaningful value in the ecx (or edx and ecx) registers. This is a tell-tale sign of __fastcall (although, you may sometimes see __thiscall make use of ecx for the this pointer).
  • No arguments are cleaned off the stack by the caller. Only __cdecl functions have the caller clean the stack after the call.
  • The callee ends in a retn (args-2)*4 instruction. In general, this is the pattern that you will see with __fastcall functions that use the stack. For __fastcall functions where no stack parameters are used, the function typically ends in a ret instruction with no stack displacement argument.
  • The callee is a short function with very few arguments. These are the most likely cases where a smart programmer will use __fastcall, as otherwise, __fastcall does not tend to buy you very much over __stdcall.
  • Functions that interface directly with assembler. Having access to ecx and edx can be a handy shortcut for a C function that is being called by something that is obviously written in assembler.

Taking these into account, let’s take a look at the same sample function and function call that we have been previously dealing with in our earlier examples, this time in __fastcall.

The function that we are going to call is declared as so:

__declspec(noinline)
int __fastcall FastcallFunction1(int a, int b, int c)
{
	return (a + b) * c;
}

This is consistent with our previous examples, save that it is declared __fastcall.

The function call that we shall make is as so:

FastcallFunction1(1, 2, 3);

With this code, we can expect the function call to look something like so in assembler:

push    3                 ; push 'c' onto the stack
push    2                 ; place a constant 2 on the stack
xor     ecx, ecx          ; move 0 into 'a' (ecx)
pop     edx               ; pop 2 off the stack and into edx.
inc     ecx               ; set 'a' -- ecx to 1 (0+1)
call    FastcallFunction1 ; make the call (a=1, b=2, c=3)

This is actually a bit different than we might expect. Here, the compiler has been a bit clever and used some basic optimizations with setting up constants in registers. These optimizations are extremely common and something that you should get used to seeing as simply constant assignments to registers, given how frequently they show up. In a future series, I’ll go into some more details as to common compiler optimizations like these, but that’s a tale for a different time.

Continuing with __fastcall, here’s what the implementation of FastcallFunction1 looks like in assembler:

FastcallFunction1 proc near

c= dword ptr  4

lea     eax, [ecx+edx] ; eax = a + b
imul    eax, [esp+4]   ; eax = (eax * c)
retn    4              ; return eax;
FastcallFunction1 endp

As you can see, in this particular instance, __fastcall turns out to be a big saver as far as instructions executed (and thus size, and to a lesser degree, speed) of the callee. This kind of benefit is usually restricted to extremely simple functions, however.

The main things, then, to consider if you are trying to identify if a function is __fastcall or not are thus:

  • Usage of the ecx (or ecx and edx) registers in the function without loading them with explicit values before-hand. This typically indicates that they are being used as argument registers, as with __fastcall.
  • The caller does not clean any arguments off the stack (no add esp instruction to clean the stack after the call). With __fastcall, the callee always cleans the arguments (if any).
  • A ret instruction (with no stack displacement argument) terminating the function, if there are two or fewer arguments that are pointer-sized or smaller. In this case, __fastcall has no stack arguments.
  • A retn (args-2)*4 instruction terminating the function, if there are three or more arguments to the function. In this case, there are stack arguments that must be cleaned off the stack via the retn instruction.
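As a rough cross-check of the last two rules, the stack displacement can be computed from the argument count. The helper below is a hypothetical illustration (the name fastcall_retn_bytes is my own), and it assumes every argument is pointer-sized or smaller, so that the first two travel in ecx and edx and each remaining one takes a 4-byte stack slot:

```c
/* Hypothetical helper: expected operand of the callee's "retn N" for a
 * __fastcall function on x86, assuming every argument is pointer-sized
 * (4 bytes) or smaller. */
unsigned fastcall_retn_bytes(unsigned arg_count)
{
    const unsigned register_args = 2; /* passed in ecx and edx */
    const unsigned slot_size = 4;     /* one stack slot per remaining argument */

    if (arg_count <= register_args)
        return 0;                     /* plain ret, nothing on the stack */
    return (arg_count - register_args) * slot_size;
}
```

For FastcallFunction1 above (three arguments), this gives 4, which matches the retn 4 in the disassembly.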

That’s all for __fastcall. More on other calling conventions next time…

Things to watch out for if you hook functions on Windows Vista

Friday, October 27th, 2006

There are a couple of things that I have run into that you should keep in mind if you are hooking functions and are planning to run under Windows Vista.

First, watch out for things being moved around in memory. For example, in Windows Vista, the VirtualProtect function in kernel32 and the CreateProcessA function in kernel32 are now on the same page, for the x86 build [NOTE: this is subject to rapid change with hotfixes, and may not still be the case on RTM]. If you have some code that works conceptually like so:

DWORD  OldProt;
PVOID  MyCreateProcessA;
PUCHAR _CreateProcessA;
static ULONG MyHook;

MyHook = (ULONG)&MyCreateProcessA;

VirtualProtect(_CreateProcessA, 6,
	PAGE_READWRITE, &OldProt);

//
// [...] Disassembly and stub saving
//       code goes here...
//

//
// jmp dword ptr [MyHook]
//

_CreateProcessA[0] = 0xFF;
_CreateProcessA[1] = 0x25;
*(PULONG)(&_CreateProcessA[2]) = (ULONG)&MyHook;

VirtualProtect(_CreateProcessA, 6,
	OldProt, &OldProt);

… you’ll run into some strange crashes in Vista, because you might end up making the pages backing VirtualProtect’s implementation non-executable by accident. (Remember that memory protections only have page granularity.)

The solution? Use PAGE_EXECUTE_READWRITE for your “intermediate” states when hooking things.
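To see whether a given patch can affect another function in this way, it is enough to compare page base addresses. The sketch below assumes 4 KB pages (as on x86); the function names are my own and the addresses in the usage note are made up for illustration:

```c
#include <stdint.h>

#define X86_PAGE_SIZE 0x1000u /* 4 KB pages on x86 */

/* Round an address down to the start of the page that contains it. */
uint32_t page_base(uint32_t address)
{
    return address & ~(X86_PAGE_SIZE - 1u);
}

/* Nonzero if changing the protection of any byte in
 * [patch_start, patch_start + patch_len) also changes the protection of
 * the page holding other_address. patch_len must be at least 1. */
int patch_shares_page_with(uint32_t patch_start, uint32_t patch_len,
                           uint32_t other_address)
{
    uint32_t first_page = page_base(patch_start);
    uint32_t last_page  = page_base(patch_start + patch_len - 1u);
    uint32_t other_page = page_base(other_address);

    return other_page >= first_page && other_page <= last_page;
}
```

With a 6-byte patch at a hypothetical 0x7c801a54 and another function at 0x7c801f10, patch_shares_page_with reports a shared page, so that function would lose execute permission while the page is PAGE_READWRITE.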

Secondly, watch out for AcLayers.dll and ShimEng.dll. These two DLLs are the core of Microsoft’s Application Compatibility Layer, which is the engine used to apply compatibility fixes at runtime to broken programs that would otherwise fail to work on Windows Vista. (This engine is also used if you select a particular compatibility layer in the property sheet for a shortcut to an executable or an executable.)

The thing to watch out for here is that AcLayers likes to do import table hooking on various kernel32 APIs. In particular, AcLayers tends to hook GetProcAddress and then occasionally redirect returned function pointers to point into AcLayers.dll and not kernel32.dll. If you have a program that assumes that any pointer that it retrieves from kernel32.dll via GetProcAddress will remain at the same address for any other process in the same session, this can result in some unpleasant surprises.

For instance, consider the classic case of wanting to inject some code to run before the main process entrypoint of a child process. You might do something like inject some code that calls kernel32!LoadLibraryA on some DLL your application supplies, and then kernel32!GetProcAddress to get the address of a function in that DLL. Then the patch code might invoke a function in your DLL and return to the initial program entrypoint of the child process. This is actually a fairly common paradigm if you need to modify some sort of behavior of a child process. Unfortunately, it can easily break if the parent process is under the influence of the dreaded application compatibility layer.

The main problem here is that when you, say, find the address of LoadLibraryA or GetProcAddress in kernel32, AcLayers.dll steps in and actually hands you the address of a stub function inside AcLayers.dll which filters requests to load DLLs or get function pointers. This is all well and fine with the parent process; AcLayers.dll is there and can do whatever its work is whenever you call GetProcAddress or LoadLibraryA.

The catch is what happens when you try to make a child process call LoadLibraryA on a DLL before it runs the main program entrypoint. In this case, instead of passing a pointer into kernel32 (which is guaranteed to be present and at the same base address in every Win32 process in the same session), you are passing a pointer into AcLayers.dll to the child process. The problem case is when AcLayers.dll is not loaded immediately into the child process. Here, your patch code in the newly created child process might try to call LoadLibraryA to get your custom DLL loaded. However, it actually tries to call an internal AcLayers.dll function – but AcLayers.dll isn’t actually loaded into the address space of the child process (or might have even been rebased), so your child process mysteriously crashes instantly. This typically manifests itself as nothing happening when you try to launch a child program, depending on computer configuration.

There is unfortunately no particularly elegant way to work around this particular problem that I have found. The best advice I have to offer here is to try to make sure that any function pointer (in kernel32.dll) that you pass to another process has not been intercepted by AcLayers.dll. Perhaps the most fool-proof way to do this is to manually walk the export table of kernel32.dll and locate the address of the export that you are interested in, although this is not a particularly easy task.

The kernel object namespace and Win32, part 1

Thursday, October 26th, 2006

The kernel object namespace is partially exposed by various Win32 APIs. Everything that allows you to create a named object that returns a kernel handle is interacting with the kernel object namespace in some form or another, and many Win32 APIs internally use the object namespace under the hood.

The kernel object namespace is fairly similar to a filesystem; there are object directories, which contain named objects. Objects can be of various different types, such as a Device object (created by a kernel driver) or an Event object, a Semaphore object, and so forth. Additionally, there are symbolic link objects, which (like filesystem links on a UNIX-based system) allow you to create one name that simply refers to another named object in the system.

Until the introduction of Windows 2000, the part of the kernel object namespace that Win32 exposed was a fairly limited and simple subset of the full object namespace available to drivers and programs using the native system call interfaces.

First, file-related APIs interact with the \DosDevices object directory (otherwise known as \??). This is the object directory that holds anything that you might open with CreateFile() and related calls, such as drive letter links (say, C:), serial ports (COM1), other standard DOS devices, and custom devices created by kernel drivers. This is why, if you are a driver, you need to explicitly specify \DosDevices\DeviceName instead of that being automatically assumed (as it is in Win32, if you call CreateFile). Otherwise, the created object name will not be easily accessible to Win32.

Secondly, there is the \BaseNamedObjects object directory. This object directory is where named Event, Mutex, Semaphore, and Section (file mapping) objects are placed when created with the Win32 API.
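The relationship between Win32 names and native object names described above can be pictured as simple prefixing. The sketch below is only a conceptual model of the pre-Windows 2000 layout discussed in this post; the function names are hypothetical, and this is not how kernel32 builds the names internally:

```c
#include <stddef.h>
#include <string.h>

/* Write "<directory>\<name>" into out. Returns 1 on success, 0 if the
 * result (including the terminating NUL) does not fit. */
static int join_object_name(char *out, size_t out_size,
                            const char *directory, const char *name)
{
    size_t dir_len = strlen(directory);
    size_t name_len = strlen(name);

    if (dir_len + 1 + name_len + 1 > out_size)
        return 0;
    memcpy(out, directory, dir_len);
    out[dir_len] = '\\';
    memcpy(out + dir_len + 1, name, name_len + 1); /* copies the NUL too */
    return 1;
}

/* Conceptual native name for a device opened via CreateFile, e.g. "COM1". */
int native_name_for_device(char *out, size_t out_size, const char *name)
{
    return join_object_name(out, out_size, "\\DosDevices", name);
}

/* Conceptual native name for a named Event, Mutex, Semaphore, or Section. */
int native_name_for_named_object(char *out, size_t out_size, const char *name)
{
    return join_object_name(out, out_size, "\\BaseNamedObjects", name);
}
```

Under this model, "COM1" becomes \DosDevices\COM1, and an event named "MyEvent" becomes \BaseNamedObjects\MyEvent.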

\BaseNamedObjects is managed and created by the Base API server dll (basesrv.dll) running in the context of CSRSS at boot time. This means that, in particular, boot start drivers cannot rely on \BaseNamedObjects as being present early in the boot process (which can be a problem if you want to share a named event object with a user mode program, from a boot start driver). \DosDevices, however, is created by the kernel itself at boot time and is generally always accessible.

In general, that is the limit to how much of the kernel namespace is directly exposed to (and used to support) Win32 prior to Windows 2000. (This is technically not quite true. There is a little used pair of kernel32 APIs called DefineDosDevice and QueryDosDevice that allow limited manipulation of symbolic links based within the \DosDevices object directory. Using these APIs, you can discover the native target names of many of the internal symbolic links (for example, C: -> \Device\HarddiskVolume2). You can also create symbolic links based in \DosDevices that point to other parts of the NT object namespace with the DDD_RAW_TARGET_PATH flag using DefineDosDevice.)

Next time I’ll go into a bit more detail as to how some of the changes to the object manager namespace work with Windows 2000, and then Windows XP, which both introduce some significant changes to how Win32 interacts with object names (first with improved multi-session support for Terminal Server and Fast User Switching, and then with how mapped drive letters work with LSA logon sessions).

Beware of stack usage with the new network stack in Windows Vista

Tuesday, October 24th, 2006

In Windows Vista, much of the network stack that ships with the OS uses much more stack than in previous versions of the operating system.

From my experience, just indicating a UDP datagram up to NDIS can require you to have over 4K of kernel stack available on x86, or you risk taking a double fault and causing the system to bugcheck.

For example, here’s a portion of the stack that I ran into while debugging an unrelated problem at the Vista compatibility lab:

0: kd> k100
ChildEBP RetAddr  
818e6bdc 818ad19b RtlpBreakWithStatusInstruction
818e6c2c 818adc08 KiBugCheckDebugBreak+0x1c
818e6fdc 8184845e KeBugCheck2+0x5f4
818e6fdc 81871d35 KiTrap08+0x75
9c9cb084 8186dd14 SepAccessCheck+0x1e0
9c9cb0e0 81887907 SeAccessCheck+0x1a4
9c9cb51c 8715474c SeAccessCheckFromState+0xe4
9c9cb55c 871546d6 CompareSecurityContexts+0x47
9c9cb57c 87153b1a MatchValues+0xd4
9c9cb59c 87153aa7 CheckEqualConditionEnumMatch+0x3f
9c9cb63c 87153a1b MatchConditionOverlap+0x72
9c9cb660 87153774 FilterMatchEnum+0x6c
9c9cb674 8715948b FilterMatchEnumVisible+0x28
9c9cb6ac 87159520 IndexHashFastEnum+0x4d
9c9cb6f8 87158624 IndexHashEnum+0x139
9c9cb724 87159362 FeEnumLayer+0x7a
9c9cb7ac 87159b16 KfdGetLayerActionFromEnumTemplate+0x50
9c9cb7cc 8d6af9e4 KfdCheckAndCacheAcceptBypass+0x27
9c9cb8c4 8d6afc87 CheckAcceptBypass+0x146
9c9cb9a0 8d6b185d WfpAleAuthorizeReceive+0x82
9c9cba08 8d6ad542 WfpAleConnectAcceptIndicate+0x98
9c9cba74 8d6ad432 ProcessALEForTransportPacket+0xc5
9c9cbaf0 8d6ae6b3 ProcessAleForNonTcpIn+0x6f
9c9cbd28 8d6b0df0 WfpProcessInTransportStackIndication+0x2ab
9c9cbd78 8d6b0ae0 InetInspectReceiveDatagram+0x9a
9c9cbdfc 8d6b091c UdpBeginMessageIndication+0x33
9c9cbe44 8d6aecf3 UdpDeliverDatagrams+0xce
9c9cbe90 8d6aec40 UdpReceiveDatagrams+0xab
9c9cbea0 8d6acdd4 UdpNlClientReceiveDatagrams+0x12
9c9cbecc 8d6acba4 IppDeliverListToProtocol+0x49
9c9cbeec 8d6acad3 IppProcessDeliverList+0x2a
9c9cbf40 8d6ab443 IppReceiveHeaderBatch+0x1da
9c9cbfd0 8d6ac61d IpFlcReceivePackets+0xc06
9c9cc04c 8d6abf36 FlpReceiveNonPreValidatedNetBufferListChain
                  +0x6db
9c9cc074 8727b0b0 FlReceiveNetBufferListChain+0x104
9c9cc0a8 8726d737 ndisMIndicateNetBufferListsToOpen+0xab
9c9cc0d0 8726d6ae ndisIndicateSortedNetBufferLists+0x4a
9c9cc24c 871b53c3 ndisMDispatchReceiveNetBufferLists+0x129
9c9cc268 872802c4 ndisMTopReceiveNetBufferLists+0x2c
9c9cc2b4 b0a3fb4d ndisMIndicatePacketsToNetBufferLists+0xe9

From ndisMIndicatePacketsToNetBufferLists to where the system double faulted (in my case) inside of SeAccessCheck, a whopping 4656 bytes of kernel stack were consumed.

So, now is the time to slim down your stack usage in your NDIS-related drivers, or you might be in for some unpleasant surprises when your drivers are used in conjunction with multiple third party IM drivers or the like (even better, you might investigate switching away from IM drivers and to the new filtering architecture). You should also be especially wary of any code that loops a packet that might potentially go back into tcpip.sys in a receive calling context (or any other context where you might have limited stack space available), as this can prove an unexpectedly expensive operation on Vista (and potentially beyond).

Oh, and a tip for finding stack hog functions with stack overflow problems: Use the ‘f’ flag with the ‘k’ command in WinDbg. For example:

0: kd> knf
 #   Memory  ChildEBP RetAddr  
00           818e6bdc 818ad19b RtlpBreakWithStatusInstruction
01        50 818e6c2c 818adc08 KiBugCheckDebugBreak+0x1c
02       3b0 818e6fdc 8184845e KeBugCheck2+0x5f4
03         0 818e6fdc 81871d35 KiTrap08+0x75
[...]

This has the debugger compute the stack (arguments + locals) usage at each call frame point for you, saving you a bit of work with the calculator.
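The arithmetic behind that Memory column is just the difference between successive ChildEBP values. The sketch below reproduces it by hand, with frames ordered innermost first as in the debugger output; the function names are my own, and the same subtraction yields the total between any two frames on the same stack:

```c
#include <stddef.h>
#include <stdint.h>

/* Fill memory[i] with the number of bytes between frame i and the frame
 * listed just before it (frame i - 1), mirroring the "Memory" column.
 * Frames are ordered innermost first, so ChildEBP values ascend. */
void frame_memory_usage(const uint32_t *child_ebp, size_t count,
                        uint32_t *memory)
{
    size_t i;

    if (count == 0)
        return;
    memory[0] = 0; /* the innermost frame has no earlier frame to compare */
    for (i = 1; i < count; i++)
        memory[i] = child_ebp[i] - child_ebp[i - 1];
}

/* Total stack consumed between an inner and an outer frame. */
uint32_t stack_between(uint32_t inner_child_ebp, uint32_t outer_child_ebp)
{
    return outer_child_ebp - inner_child_ebp;
}
```

Feeding in the four ChildEBP values from the knf output gives 0x50, 0x3b0 and 0 for frames 01 through 03, and stack_between(0x9c9cb084, 0x9c9cc2b4), the SepAccessCheck and ndisMIndicatePacketsToNetBufferLists frames, gives the 4656 bytes quoted earlier.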

Debugging (or reverse engineering…) a real life Windows Vista compatibility problem: CreateIpForwardEntry in iphlpapi

Tuesday, October 24th, 2006

Since I’m at the Microsoft Vista compatibility lab, it only makes sense that I’ve fixed a few Vista compatibility bugs in our product today.

Some of these are real bugs, but I ran into one in particular that is particularly infuriating: a completely undocumented, seemingly completely arbitrary restriction placed on a publicly documented API that has been around since Windows 98.

In this particular case, I was running into a problem where one of our products was unable to add routes on Vista. This worked fine on prior platforms we supported, and so I started looking into it as a compatibility problem. First things first, I narrowed the problem down to a particular API that was failing.

We have a function that wraps the various details about creating routes. The function in question went approximately like so:

//
// Add a route through the desired gateway.
//

DWORD
AddRoute(
	__in unsigned long Network,
	__in unsigned long Mask,
	__in unsigned long Gateway
	)
{
	MIB_IPFORWARDROW Row;
	DWORD            Status, ForwardType;
	unsigned long    InterfaceIp, InterfaceIndex;

[...]	// (Code to determine the local
	// interface to add the route on)

	//
	// Setup the IP forward row.
	//

	ZeroMemory(&Row,
		sizeof(Row));

	Row.dwForwardDest    = Network;
	Row.dwForwardMask    = Mask;
	Row.dwForwardPolicy  = 0;
	Row.dwForwardNextHop = Gateway;
	Row.dwForwardIfIndex = InterfaceIndex;
	Row.dwForwardType    = ForwardType;
	Row.dwForwardProto   = PROTO_IP_NETMGMT;
	Row.dwForwardAge     = INFINITE;
	Row.dwForwardMetric1 = 0;

	//
	// Create the route.
	//

	if ((Status = CreateIpForwardEntry(&Row))
		!= NO_ERROR)
	{
		wprintf(L"Creation failed, %lu.\n",
			Status);
		return Status;
	}

[...]	// (More unrelated boilerplate code)

	return Status;
}

Essentially, the problem here was that CreateIpForwardEntry was failing. Checking logs, the error code logged was 0xA0.

Using the handy Microsoft error code lookup utility (err.exe), it was easy to determine what this error code means:

C:\>err a0
# for hex 0xa0 / decimal 160 :
  INTERNAL_POWER_ERROR                            bugcodes.h
  LLC_STATUS_BIND_ERROR                           dlcapi.h
  SQL_160_severity_15                             sql_err
# Rule does not contain a variable.
  ERROR_BAD_ARGUMENTS                             winerror.h
# One or more arguments are not correct.
  SCW_E_TOOMUCHDATAIN                             wpscoserr.mc
# Too much incoming data%0
# 5 matches found for "a0"

The only error that makes sense in this context is ERROR_BAD_ARGUMENTS. Unfortunately, that is not really all that helpful. Checking the latest MSDN documentation for CreateIpForwardEntry, there is, of course, no mention of this error code whatsoever.

Additionally, looking at the Microsoft documentation, nothing immediately jumped to mind as to what the problem is.

Although the Microsoft people here for the Vista lab did offer to see about getting me in touch with someone in the product team who might have an explanation for this behavior, I eventually decided that I would just take a crack at digging into the internals of CreateIpForwardEntry and understand the problem myself in the meanwhile to see if I might be able to come up with a fix sooner. After searching around a bit on Google and not coming up with any good explanation for what was going wrong, I eventually decided to step into iphlpapi!CreateIpForwardEntry in the debugger and see just what was going wrong first-hand.

0:000> bu iphlpapi!CreateIpForwardEntry
breakpoint 0 redefined
0:000> g
Breakpoint 0 hit
eax=0012fd6c ebx=00000004 ecx=00000000 edx=00000000
esi=01040a0a edi=00000003
eip=751bdfc1 esp=0012fd58 ebp=0012fdb0 iopl=0
nv up ei pl nz ac pe nc
cs=001b  ss=0023  ds=0023  es=0023  fs=003b  gs=0000
efl=00000216
iphlpapi!CreateIpForwardEntry:
751bdfc1 8bff            mov     edi,edi

Looking at the disassembly of CreateIpForwardEntry, it’s clear that this function is now just a stub that forwards the call onto another function that performs the real work:

0:000> u @eip
iphlpapi!CreateIpForwardEntry:
751bdfc1 8bff       mov     edi,edi
751bdfc3 55         push    ebp
751bdfc4 8bec       mov     ebp,esp
751bdfc6 6a01       push    1
751bdfc8 ff7508     push    dword ptr [ebp+8]
751bdfcb e820ffffff call    CreateOrSetIpForwardEntry
751bdfd0 5d         pop     ebp
751bdfd1 c20400     ret     4

So, I pressed onward, stepping into iphlpapi!CreateOrSetIpForwardEntry:

0:000> tc
iphlpapi!CreateIpForwardEntry+0xa:
751bdfcb e820ffffff call    CreateOrSetIpForwardEntry
0:000> t
eax=0012fd6c ebx=00000004 ecx=00000000 edx=00000000
esi=01040a0a edi=00000003
eip=751bdef0 esp=0012fd48 ebp=0012fd54 iopl=0
nv up ei pl nz ac pe nc
cs=001b  ss=0023  ds=0023  es=0023  fs=003b  gs=0000
efl=00000216
iphlpapi!CreateOrSetIpForwardEntry:
751bdef0 8bff            mov     edi,edi

Looking at the disassembly, there appears to be only one place where the error code ERROR_BAD_ARGUMENTS is returned (disassembly truncated for better viewing):

0:000> uf @eip
iphlpapi!CreateOrSetIpForwardEntry:
751bdef0 8bff            mov     edi,edi
751bdef2 55              push    ebp
751bdef3 8bec            mov     ebp,esp
751bdef5 83ec48          sub     esp,48h
751bdef8 8365b800        and     dword ptr [ebp-48h],0
751bdefc 56              push    esi
751bdefd 6a2c            push    2Ch
751bdeff 8d45bc          lea     eax,[ebp-44h]
751bdf02 6a00            push    0
751bdf04 50              push    eax
751bdf05 e8f053ffff      call    memset
751bdf0a 8b7508          mov     esi,dword ptr [ebp+8]

[...]

;
; Convert the interface index we passed in with
; the pRoute structure into an interface LUID,
; stored at [ebp-30h].
;

751bdf36 8d45d0          lea     eax,[ebp-30h]
751bdf39 50              push    eax
751bdf3a ff7610          push    dword ptr [esi+10h]
751bdf3d e86590ffff      call    ConvertInterfaceIndexToLuid
751bdf42 85c0            test    eax,eax
751bdf44 7571            jne     751bdfb7


;
; Get the interface metric for the requested interface,
; and store it at [ebp+8].  We pass in the address of
; the LUID of the requested interface in order to make
; the check.
;

iphlpapi!CreateOrSetIpForwardEntry+0x56:
751bdf46 8d4508          lea     eax,[ebp+8]
751bdf49 50              push    eax
751bdf4a 8d45d0          lea     eax,[ebp-30h]
751bdf4d 50              push    eax
751bdf4e e802f4ffff      call    GetInterfaceMetric

[...]

;
; Load esi with pRoute->dwForwardMetric1
;

751bdf6c 8b7624          mov     esi,dword ptr [esi+24h]
751bdf6f 6a06            push    6
751bdf71 8945e0          mov     dword ptr [ebp-20h],eax
751bdf74 83c8ff          or      eax,0FFFFFFFFh
751bdf77 3b7508          cmp     esi,dword ptr [ebp+8]
751bdf7a 59              pop     ecx
751bdf7b 8d7de8          lea     edi,[ebp-18h]
751bdf7e f3ab            rep stos dword ptr es:[edi]
751bdf80 8945ec          mov     dword ptr [ebp-14h],eax
751bdf83 8945f0          mov     dword ptr [ebp-10h],eax
751bdf86 5f              pop     edi

;
; Check that esi is not less than [ebp+8]
; ... in other words, verify that
; pRoute->dwForwardMetric1 >= InterfaceMetric,
; where InterfaceMetric is set by GetInterfaceMetric()
;

751bdf87 7229            jb      751bdfb2 ; failure

iphlpapi!CreateOrSetIpForwardEntry+0x99:
751bdf89 2b7508          sub     esi,dword ptr [ebp+8]
751bdf8c 6a18            push    18h
751bdf8e 8d45e8          lea     eax,[ebp-18h]
751bdf91 50              push    eax
751bdf92 6a30            push    30h
751bdf94 8d45b8          lea     eax,[ebp-48h]
751bdf97 50              push    eax
751bdf98 6a10            push    10h
751bdf9a 6864331b75      push    751b3364
751bdf9f ff750c          push    dword ptr [ebp+0Ch]
751bdfa2 8975f4          mov     dword ptr [ebp-0Ch],esi
751bdfa5 6a01            push    1
751bdfa7 c645ff01        mov     byte ptr [ebp-1],1

;
; Call the NsiSetAllParameters internal API to create the
; route, and return its return value to the caller.
;

751bdfab e86857ffff      call    NsiSetAllParameters
751bdfb0 eb05            jmp     751bdfb7
[...]

iphlpapi!CreateOrSetIpForwardEntry+0xc2:
;
; Return ERROR_BAD_ARGUMENTS
;
751bdfb2 b8a0000000      mov     eax,0A0h

iphlpapi!CreateOrSetIpForwardEntry+0xc7:
751bdfb7 5e              pop     esi
751bdfb8 c9              leave
751bdfb9 c20800          ret     8

From this annotated disassembly, we can conclude that there are only two possibilities that might result in this behavior. The first is that GetInterfaceMetric(&InterfaceLuid, &InterfaceMetric) is returning an InterfaceMetric greater than the metric we are supplying. The second is that NsiSetAllParameters is returning ERROR_BAD_ARGUMENTS.

To test this theory, we need to examine the comparison at 751bdf87 to determine if that is taking the failure branch, and we need to check the return value of NsiSetAllParameters. This is fairly easy to do with a couple of breakpoints:

0:000> bu 751bdf87 
0:000> bu 751bdfb0 
0:000> g
Breakpoint 1 hit
eax=ffffffff ebx=00000004 ecx=00000000 edx=7707e524
esi=00000000 edi=00000003
eip=751bdf87 esp=0012fcf8 ebp=0012fd44 iopl=0
nv up ei ng nz ac pe cy
cs=001b  ss=0023  ds=0023  es=0023  fs=003b  gs=0000
efl=00000297
iphlpapi!CreateOrSetIpForwardEntry+0x97:
751bdf87 7229            jb      751bdfb2 [br=1]

Our first breakpoint, the one on the comparison with the “Interface Metric” and the route metric we supplied in pRoute->dwForwardMetric1, was the one that hit first (as expected). Looking at the register context supplied by WinDbg, though, we can clearly see that the program is going to take the branch and head down the code path that returns ERROR_BAD_ARGUMENTS. Problem identified!

There still remains the issue of solving the problem, though. Looking at [ebp+8], it appears that the undocumented iphlpapi!GetInterfaceMetric returned 10:

0:000> ? dwo(@ebp+8)
Evaluate expression: 10 = 0000000a

This makes sense. We supplied a metric of 0, which is obviously less than 10. Unfortunately, now we need a good way to determine whether we should use a zero metric (for previous OS versions) or a different metric (for Vista), assuming we want our route to take precedence for a particular network/mask value.
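Put into C, my reading of the check in the disassembly looks roughly like the sketch below. This is a model of the observed behavior, not Microsoft's source: the function name and the SKETCH_ERROR_BAD_ARGUMENTS constant are hypothetical, and the only things taken from the listing are the unsigned comparison (jb), the 0xA0 failure code, and the subtraction whose result is stored in a local before the call to NsiSetAllParameters.

```c
/* Value of ERROR_BAD_ARGUMENTS (0xA0) from winerror.h, redefined under a
 * hypothetical name so the sketch builds without the Windows headers. */
#define SKETCH_ERROR_BAD_ARGUMENTS 160UL

/* Model of the metric validation seen in CreateOrSetIpForwardEntry.
 * Returns 0 on success and stores (route_metric - interface_metric) in
 * *relative_metric; otherwise returns the failure code and leaves
 * *relative_metric untouched. */
unsigned long check_route_metric(unsigned long route_metric,
                                 unsigned long interface_metric,
                                 unsigned long *relative_metric)
{
    /* cmp esi,[ebp+8] / jb failure: an unsigned "below" comparison */
    if (route_metric < interface_metric)
        return SKETCH_ERROR_BAD_ARGUMENTS;

    /* sub esi,[ebp+8] / mov [ebp-0Ch],esi */
    *relative_metric = route_metric - interface_metric;
    return 0UL;
}
```

With the values from this debugging session (route metric 0, interface metric 10), the model returns 160, matching the 0xA0 that CreateIpForwardEntry handed back.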

Unfortunately, MSDN doesn’t turn up any hits on GetInterfaceMetric, and neither does Google. Well, that sucks – it looks like, for Vista, unless I want to hardcode 10, I’ll have to go off into undocumented land to use a publicly documented API. There seems to be something a bit ironic about that to me, but, nonetheless, the problem remains to be solved.

Update: There is a (minimally) documented solution that was very recently made available. See the bottom of the post for details.

So, all that we need to do is reverse engineer the parameters to this undocumented GetInterfaceMetric function and call it, right?

Well, no, not exactly – things actually get worse. It turns out that GetInterfaceMetric is not even exported from iphlpapi.dll – it’s a purely internal function!

The only other option at this point, aside from hardcoding 10 as a minimum metric, is to reimplement all of the functionality of GetInterfaceMetric ourselves. Taking a look at GetInterfaceMetric, things look unfortunately rather complicated:

0:000> uf iphlpapi!GetInterfaceMetric
iphlpapi!GetInterfaceMetric:
751bd355 8bff            mov     edi,edi
751bd357 55              push    ebp
751bd358 8bec            mov     ebp,esp
751bd35a 6a1c            push    1Ch
751bd35c 6a04            push    4
751bd35e ff750c          push    dword ptr [ebp+0Ch]
751bd361 6a00            push    0
751bd363 6a08            push    8
751bd365 ff7508          push    dword ptr [ebp+8]
751bd368 6a07            push    7
751bd36a 6864331b75      push    NPI_MS_IPV4_MODULEID
751bd36f 6a01            push    1
751bd371 e88f5fffff      call    NsiGetParameter
751bd376 5d              pop     ebp
751bd377 c20800          ret     8

NPI_MS_IPV4_MODULEID is a global variable of some sort in iphlpapi:

0:000> db iphlpapi!NPI_MS_IPV4_MODULEID l8
751b3364  18 00 00 00 01 00 00 00  ........

Using the x command with ascending order, we can make an educated guess as to the size of this global by enumerating all symbols in iphlpapi in address space order:

0:000> x /a iphlpapi!*
[...]
751b3364 iphlpapi!NPI_MS_IPV4_MODULEID = <no type information>
751b3381 iphlpapi!NsiAllocateAndGetTable = <no type information>
[...]

So, we know that NPI_MS_IPV4_MODULEID must be no more than 0x1d bytes long. Taking a look around NPI_MS_IPV4_MODULEID, we see that past 0x18 bytes in, there appears to be code (nop instructions), making it likely that the global is 0x18 bytes long.

0:000> db 751b3364 
751b3364  18 00 00 00 01 00 00 00-00 4a 00 eb 1a 9b d4 11
751b3374  91 23 00 50 04 77 59 bc-90 90 90 90 90 ff 25 94

(The repeated 90 90 90 90 bytes are a typical sign of code. 90 is the opcode for the nop instruction on x86, which the compiler typically uses for padding out function start offsets for alignment.)
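The same educated guess can be written down as a small heuristic: take the distance to the next symbol as an upper bound, then strip any trailing 0x90 padding. The sketch below is my own illustration (the function and array names are hypothetical), and it would of course be fooled by data that legitimately ends in 0x90 bytes:

```c
#include <stddef.h>
#include <stdint.h>

/* Bytes from the db output above, from iphlpapi!NPI_MS_IPV4_MODULEID up
 * to (but not including) the next symbol at 751b3381. */
const uint8_t module_id_dump[] = {
    0x18, 0x00, 0x00, 0x00, 0x01, 0x00, 0x00, 0x00,
    0x00, 0x4a, 0x00, 0xeb, 0x1a, 0x9b, 0xd4, 0x11,
    0x91, 0x23, 0x00, 0x50, 0x04, 0x77, 0x59, 0xbc,
    0x90, 0x90, 0x90, 0x90, 0x90
};

/* Given the bytes between one symbol and the next, drop trailing nop
 * (0x90) padding and return the likely size of the first symbol. */
size_t size_without_nop_padding(const uint8_t *bytes, size_t upper_bound)
{
    size_t size = upper_bound;

    while (size > 0 && bytes[size - 1] == 0x90)
        size--;
    return size;
}
```

Here the upper bound is 0x751b3381 - 0x751b3364 = 0x1d, and stripping the padding leaves 0x18 bytes.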

Given this, we should be able to replicate the behavior of GetInterfaceMetric, as the only function it calls, NsiGetParameter, is exported by nsi.dll (of course, it isn’t documented…). From the above disassembly, we can see that NsiGetParameter takes a ulong-sized argument (constant 0x1), a pointer argument (address of NPI_MS_IPV4_MODULEID), a ulong-sized argument (constant 0x7), a pointer that is the address of the interface LUID (argument 1 of GetInterfaceMetric, which we saw earlier), a ulong-sized argument (constant 0x8), a ulong or pointer-sized argument (constant 0x0), a pointer-sized argument (address of a ULONG containing the “interface metric”), a ulong-sized argument (constant 0x4), and (finally!) a ulong-sized argument (constant 0x1c). I would surmise that the 0x8 and 0x4 constants are the sizes of the LUID and output buffer, though I haven’t bothered to confirm that at this point.

From our knowledge of __stdcall, we can identify NsiGetParameter as __stdcall quickly by looking at the disassembly of GetInterfaceMetric and noticing the behavior after the function call (not removing arguments from the stack space, assuming the callee (NsiGetParameter) performs that task).

Given all of this, we can make our own function that implements GetInterfaceMetric. Now, just to be clear, I would not recommend actually using this, unless Microsoft fails to provide a documented mechanism to determine the minimum metric permitted for CreateIpForwardEntry (or removes the restriction) prior to Vista RTM. I am going to try and do whatever I can to see what ISVs are supposed to do with this particular problem (and whether it can be fixed before RTM) before this week is up, but in the event that I don’t get anywhere, I’ll have a backup plan (as ugly and hackish as it may be) – better than not being able to manipulate the route table, period, on Vista.

Anyway, the basic idea is that we call ConvertInterfaceIndexToLuid on the InterfaceIndex that we already have from iphlpapi, to convert this into a NET_LUID structure (new to Vista). It does so happen that ConvertInterfaceIndexToLuid is a documented API, which makes that the easy part.

Then, we simply replicate the call that we saw in GetInterfaceMetric inside iphlpapi.dll. For brevity, I am not posting the entire source code for my implementation of GetInterfaceMetric inline; you can, however, download it. With this reverse engineered implementation, all that is left is to call it to get the minimum metric for the interface we are about to add a route on, and place that metric in the MIB_IPFORWARDROW that we pass to CreateIpForwardEntry.
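As a rough illustration of the shape of that call (this is not the downloadable implementation), the argument list described above can be written as a function pointer type. Everything here is a sketch under assumptions: the parameter names, the return type, the helper names (nsi_get_parameter_fn, query_interface_metric, fake_nsi_get_parameter), and the assumption that the arguments were listed in declaration order are all mine. The stand-in function exists only so the wrapper can be exercised without nsi.dll, and the metric value it returns is arbitrary.

```c
#include <stdint.h>

/* __stdcall only means something on 32-bit x86 Windows builds; elsewhere
   the macro expands to nothing so the sketch still compiles. */
#if defined(_WIN32) && (defined(_M_IX86) || defined(__i386__))
#define SKETCH_STDCALL __stdcall
#else
#define SKETCH_STDCALL
#endif

/* Hypothetical prototype inferred from the disassembly. The names and the
   return type are guesses; only the constants come from the listing. */
typedef uint32_t (SKETCH_STDCALL *nsi_get_parameter_fn)(
    uint32_t    unknown_1,    /* constant 0x1 */
    const void *module_id,    /* address of NPI_MS_IPV4_MODULEID */
    uint32_t    unknown_7,    /* constant 0x7 */
    const void *luid,         /* address of the interface LUID */
    uint32_t    luid_size,    /* constant 0x8, presumably the LUID size */
    uintptr_t   unknown_0,    /* constant 0x0 (ulong or pointer) */
    void       *metric_out,   /* receives the interface metric */
    uint32_t    metric_size,  /* constant 0x4, presumably the buffer size */
    uint32_t    unknown_1c);  /* constant 0x1c */

/* Replays the call with the constants seen in the disassembly. */
uint32_t query_interface_metric(nsi_get_parameter_fn nsi_get_parameter,
                                const void *ipv4_module_id,
                                const uint64_t *interface_luid,
                                uint32_t *metric)
{
    return nsi_get_parameter(0x1, ipv4_module_id, 0x7,
                             interface_luid, 0x8,
                             0, metric, 0x4, 0x1c);
}

/* Stand-in for the real export, used only to exercise the wrapper.
   It checks the two presumed sizes and writes an arbitrary metric. */
uint32_t SKETCH_STDCALL fake_nsi_get_parameter(
    uint32_t unknown_1, const void *module_id, uint32_t unknown_7,
    const void *luid, uint32_t luid_size, uintptr_t unknown_0,
    void *metric_out, uint32_t metric_size, uint32_t unknown_1c)
{
    (void)unknown_1; (void)module_id; (void)unknown_7;
    (void)luid; (void)unknown_0; (void)unknown_1c;
    if (luid_size != 0x8 || metric_size != 0x4)
        return 1;
    *(uint32_t *)metric_out = 25;
    return 0;
}
```

On a real system the function pointer would have to come from nsi.dll at runtime, and the same caveat applies as above: this is undocumented territory and may change without notice.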

I’ll post back when I hear from Microsoft as to the official word as to how one is to handle this situation; I fully expect that there will be a documented API (or the restriction will go away) before RTM, at this point, given that this is a rather bad compatibility bug that breaks a long-existing documented API in such a way that requires you to go into undocumented hackery to continue to use it (especially since there is no other good way that I know of to replicate the functionality of the API in question).

Update: You can use the GetIpInterfaceEntry routine (new to Vista, in iphlpapi) to find the minimum metric for an interface. Note that you will very likely need to search on MSDN to find information on this function, as it’s not been included in recent SDKs to my knowledge.

(Note: Some of the debugger output was slightly modified or truncated by me to keep the formatting sane.)

Annoyances with IE7

Sunday, October 22nd, 2006

Since installing IE7, I’ve run into a couple of annoyances.

The largest of which is that you can no longer use the trick to launch an instance of iexplore.exe under Run As, and then navigate to the Control Panel to get an administrator view of Control Panel if you are logged on as a limited user (for pre-Vista). Now, instead, the admin IE instance will just tell the already-running explorer instance (which is running as your limited user account) to open a window at Control Panel. This is of course not what I want, which leaves me stuck with remembering the names of the individual .cpl files and launching them from an admin. Unfortunately for me, this just made running as a limited user on Windows XP and Windows Server 2003 much more painful; not a good thing from the perspective of a browser that is supposed to make things more secure. (In case you were wondering, you can’t just launch an admin explorer.exe while you already have explorer running under your user account. If you try to do this, the admin explorer instance will tell the already running explorer instance to open a new window, and then exit.) Alternatively, I could configure explorer to use a different process for every window, which does actually allow you to run explorer directly with Run As, but this has the unfortunate side effect of dramatically increasing memory usage if you have multiple explorer folder windows open.

The other things I have run into so far are site compatibility problems, like lists breaking for WordPress. I am not sure if this particular problem is a WordPress one or an IE7 one, having not been particularly inclined to delve into HTML DOM debugging, but WordPress does appear to validate cleanly under the W3C XHTML validator. Some compatibility things are to be expected, of course, but it’s a bit disappointing to see them so glaringly obvious without either WordPress or Microsoft having done something to fix (or even acknowledge) the problem by now. Sigh.

As for tabbed browsing, I’m not sure if I really like this much yet. Up till now, I’ve pretty much always used “old-fashioned”, windowed browsing. I’ll see if tabbed browsing grows on me, but I wish I didn’t have to sacrifice ease of running as non-admin for it…

(Update: a commenter, jpassing, suggested using “explorer.exe /separate” with Run As, which appears to work nicely as a replacement for starting iexplore.exe when IE7 is installed.)

Unordered list items broken on the blog in IE7

Saturday, October 21st, 2006

Taking a little segue from the usual topics on the blog, today I got around to installing IE7 for the first time. Unfortunately, it seems that I have run into my first site compatibility issue that I really care about: a problem with WordPress.

For some reason, unordered list items appear to be not showing up as bulleted items on the blog when you are using IE7. The list items are still indented, but they don’t have a bullet prefixing them (just whitespace). I haven’t yet spent much time debugging this issue, which is for now just a minor annoyance. It doesn’t seem to be specific to my blog, as IE7 is having this problem for me with other WordPress blogs.

For example, when using IE7, these list items do not have bullet prefixes presently:

  • test list item 1
  • test list item 2

Anyone have a workaround or fix for this particular annoyance? Comment away if so…

Win32 calling conventions: __stdcall in assembler

Friday, October 20th, 2006

It’s been a while since my last post, unfortunately, primarily due to my being a bit swamped with work and a couple of other things as of late. With that said, I’m going to start by picking up where I had previously left off with the Win32 calling conventions series. Without further ado, here’s the stuff on __stdcall as you’ll see it in assembler…

Like __cdecl, __stdcall is completely stack-based.  The semantics of __stdcall are very similar to __cdecl, except that the arguments are cleaned off the stack by the callee instead of the caller.  Because the number of arguments removed from the stack is burned into the target function at compile time, there is no support for variadic functions (functions that take a variable number of arguments, such as printf) that use the __stdcall calling convention.  The rules for register usage and return values are otherwise identical to __cdecl.

In practice, this typically means that an __stdcall function call will look much like a __cdecl function call until you examine the ret instruction that returns control to the caller at the end of the __stdcall function in question.  (Alternatively, you can look to see if it appears as if stack arguments are cleaned after the function call.  However, the compiler/optimizer sometimes likes to be tricky with __cdecl functions, and defer argument removal until several function calls later, so this method is less reliable.)

Because the callee cleans the arguments off the stack in an __stdcall function, you will always[1] see a ret instruction with an argument count (in bytes) terminating a __stdcall function that takes arguments.  For most functions, this count is four times the number of arguments to the function, but this can vary if arguments that are larger than 32 bits are passed.  On Win32, this byte count is virtually always[2] a multiple of four, as the compiler will always generate code that aligns the stack to at least four bytes for x86 targets.
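The arithmetic behind that byte count can be sketched in a few lines of C. This is only an illustration: the function name stdcall_ret_bytes is made up, and it assumes every argument is passed by value in whole four-byte stack slots.

```c
#include <stddef.h>

/* Estimates the byte count a __stdcall callee passes to ret on x86,
   assuming each argument occupies a whole number of 4-byte stack slots. */
unsigned stdcall_ret_bytes(const size_t *arg_sizes, size_t arg_count)
{
    unsigned total = 0;
    for (size_t i = 0; i < arg_count; i++) {
        /* Round each argument size up to the next multiple of four. */
        total += (unsigned)((arg_sizes[i] + 3u) & ~(size_t)3u);
    }
    return total;
}
```

Three int arguments give 12 (0Ch), while an int followed by a 64-bit value also gives 12, which is why the count is not always exactly four times the number of arguments.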

Given this information, it is usually fairly easy to distinguish an __stdcall function from a __cdecl function, as a __cdecl function will never use an argument to ret.  Note that this does imply, however, that it is generally not possible to distinguish between an __stdcall function and a __cdecl function in the case that both take zero arguments (without any other outside information other than disassembly); in this special case, the calling conventions have the same semantics.  This also means that if you have a function that does not clean any bytes off the stack with ret, you’ll technically have to examine any callers of the function to see if any pass more than zero arguments (or the actual function implementation itself, to see if it ever expects more than zero arguments) in order to be absolutely sure if the function is __cdecl or __stdcall.

Here’s an example of a simple __stdcall function call for the following C function:
 

__declspec(noinline)
int __stdcall StdcallFunction1(int a, int b, int c)
{
 return (a + b) * c;
}

If we call the function like this:

StdcallFunction1(1, 2, 3);

… we can expect to see something like so, for the call:

push    3
push    2
push    1
call    StdcallFunction1

(There will be no add esp instruction after the call.)

This is quite similar to a __cdecl declared function with the same implementation.  The only difference is the lack of an add esp instruction following the call.

Looking at the function implementation, we can see that unlike the __cdecl version of this function, StdcallFunction1 removes the arguments from the stack:

StdcallFunction1 proc near

a= dword ptr  4
b= dword ptr  8
c= dword ptr  0Ch

mov     eax, [esp+8] ; eax = b
mov     ecx, [esp+4] ; ecx = a
add     eax, ecx     ; eax = eax + ecx
imul    eax, [esp+c] ; eax = eax * c
retn    0Ch          ; (return value = eax)
StdcallFunction1 endp

As expected, the only difference here is that the __stdcall version of the function cleans the three arguments off the stack.  The function is otherwise identical to the __cdecl version, with the return value stored in eax.

With all of this information, you should be able to rather reliably identify most __stdcall functions.  The key things to look out for are:

  • All arguments are on the stack.
  • The ret instruction terminating the function has a non-zero argument count if the number of arguments for the function is non-zero.
  • The ret instruction terminating the function has an argument count that is at least four times the number of arguments for the function.  (If the count is less than four times the number of arguments, then the function might be a __fastcall function with three or more arguments.  The __fastcall calling convention passes the first two 32-bit or smaller arguments in registers.)
  • The function does not depend on the state of the ecx and edx volatile registers.  (If the function expects these registers to have a meaningful value initially, then the function is probably a __fastcall or __thiscall function, as those calling conventions pass arguments in the ecx and edx registers.)
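The checklist above can be written down as a rough decision procedure. This is a minimal sketch with made-up names (convention_guess, guess_convention), and it assumes you have already worked out the three inputs by reading the disassembly. It is a heuristic only; hand-written assembler and the special cases in the footnotes will defeat it.

```c
#include <stdbool.h>

typedef enum {
    GUESS_REGISTER_BASED, /* probably __fastcall or __thiscall */
    GUESS_STDCALL,
    GUESS_CDECL,
    GUESS_AMBIGUOUS       /* zero arguments: __cdecl and __stdcall look alike */
} convention_guess;

/* ret_bytes:        operand of the terminating ret (0 for a plain ret).
   stack_args_read:  number of stack arguments the callee appears to read.
   reads_ecx_or_edx: callee uses ecx or edx before writing to them. */
convention_guess guess_convention(unsigned ret_bytes,
                                  unsigned stack_args_read,
                                  bool reads_ecx_or_edx)
{
    if (reads_ecx_or_edx)
        return GUESS_REGISTER_BASED;
    if (ret_bytes > 0)
        return GUESS_STDCALL;   /* callee cleans the stack */
    if (stack_args_read > 0)
        return GUESS_CDECL;     /* arguments exist, so the caller must clean */
    return GUESS_AMBIGUOUS;
}
```

For StdcallFunction1 above (retn 0Ch, three stack arguments, no reliance on ecx or edx on entry), guess_convention(12, 3, false) yields GUESS_STDCALL.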

In the next post in this series, I’ll cover the __fastcall calling convention (and hopefully it won’t be such a long wait this time).  Stay tuned…

 

[1]: For functions declared as __declspec(noreturn) or that otherwise never normally return execution control directly to the caller (e.g. a function that always throws an exception), the ret instruction is typically omitted.  There are a couple of other rare cases where you may see no terminating ret, such as if there are two functions, where one function calls the second, and both have very similar prototypes (such as argument ordering or an additional defaulted argument).  In this case, the compiler may combine two functions by having one perform minor adjustments to the stack and then “falling through” directly to the second function.

[2]: If you see a function with a ret instruction that does not take a multiple of four as its argument, then the function was most likely hand-written in assembler.  The Microsoft compiler will never, to my knowledge, generate code like this (and neither should any sane Win32 compiler).

DxWnd 1.034 released

Sunday, September 24th, 2006

I’ve released a new version of DxWnd (requires the VC8SP0 CRT) – version 1.034. This is a minor release that, along with fixing a couple of bugs and some internal code cleanup and reorganization to build under VC8, adds a new feature: video output rescaling.

I recently got a nice 20.1″ LCD to use as a second monitor for my main laptop at home. Unfortunately, I discovered that a lot of my old favorite classic games tended to do not-so-great things to your desktop color depth when you run them natively (in fullscreen mode), which while you might normally not care about, turns out to be a real bummer if you have something like an IM client or whatnot up on a second monitor.

So, I turned to a program I had written a couple of years ago – DxWnd. DxWnd is a program that lets you run DirectDraw 7 (or below) programs that only support fullscreen mode in a window. It accomplishes this by hooking various DirectDraw APIs and tricking the program into thinking that it is running at 640×480 (or whatever resolution it wants) fullscreen, when it is in fact running in a plain window at that resolution. Unfortunately, while DxWnd solves the color depth issue, running games at 640×480 on a 1920×1200 desktop is not really the best experience. Thus, I set out to make a couple of minor modifications to DxWnd to support rescaling the output. These are fairly simple in principle:

  • Use StretchBlt instead of BitBlt to copy data from the DirectDraw surface that the program writes to into the GDI device context associated with the actual window I am displaying on screen. The reason why I perform this extra buffering step in the first place is that GDI provides nice automatic palette conversions from DirectDraw surface DCs to plain desktop window DCs. Changing the BitBlt to a StretchBlt simply rescales the current video image to a new resolution as it is copied for display purposes.
  • For programs that call ScreenToClient / ClientToScreen / MapWindowPoints (or deal with mouse cursor coordinates), but do not correctly handle the fact that their program’s client area may not be centered at (0, 0) (after all, the program was written to only run in fullscreen mode, so normally this shortcut can be taken), DxWnd needs to alter the lie it tells in these functions. Previously, DxWnd would “fix up” the coordinates that get returned to a program (or that a program gives to Windows) so that the program only sees things centered at (0, 0). Now, in addition to that, DxWnd needs to scale these coordinates either from the real output resolution to the resolution that the program appears to be running at, or vice versa, depending on whether the coordinates are going “into” or “out of” the program. This does have one unfortunate side effect, which is that relative to a program that natively supports a given resolution, there is a perceived loss of precision when you move the mouse pointer in the rescaled video output window. This is because mouse cursor coordinates must be rescaled to values that are relative to the resolution that the program is expecting to be running at. For example, if you are running at twice the program’s native resolution, and the program draws a custom mouse cursor, then the cursor may only appear to move every two pixels that you move it instead of every one pixel (like you might expect).
  • For programs that use DirectInput for mouse coordinates, these coordinates also need to be scaled so that they are relative to the virtual screen at (0, 0) that the program expects all coordinates to be relative to.
  • Since we are scaling the output of a program, DxWnd can now allow the user to resize, maximize, or restore the window it creates to contain the video data from the program being hooked. For programs where the user has asked DxWnd to capture the mouse to the client area of the video output window, the mouse cursor capture needs to be recalculated if the window size changes (otherwise, you could not move the mouse cursor outside of the original window size).
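The coordinate rescaling described in the list can be sketched as follows. This is not DxWnd's actual code; the type and function names (point, extent, real_to_virtual, virtual_to_real) are invented for illustration, and the sketch assumes non-negative coordinates relative to the client area and non-zero sizes.

```c
typedef struct { int x; int y; } point;
typedef struct { int width; int height; } extent;

/* Maps a point in the real (rescaled) window to the resolution the
   hooked program believes it is running at. Integer division truncates,
   which is where the perceived loss of precision comes from. */
point real_to_virtual(point p, extent real, extent virt)
{
    point out;
    out.x = (int)((long long)p.x * virt.width / real.width);
    out.y = (int)((long long)p.y * virt.height / real.height);
    return out;
}

/* Maps a point supplied by the program back to the real window. */
point virtual_to_real(point p, extent real, extent virt)
{
    point out;
    out.x = (int)((long long)p.x * real.width / virt.width);
    out.y = (int)((long long)p.y * real.height / virt.height);
    return out;
}
```

With a 1280×960 window around a program that thinks it is running at 640×480, the real points (100, 50) and (101, 51) both map to (50, 25), which is the "cursor only moves every two pixels" effect mentioned above.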

With the new DxWnd, I can play some old classics like Master of Orion 2 or Privateer 2 rescaled to my desktop resolution on one monitor while still using a second monitor for things like e-mail or IM – and, without the color depth on my auxiliary display being reduced to 8-bit (or worse). There is some more information about DxWnd on the corresponding topic on the Valhalla Legends forum, if you are interested.

The system call dispatcher on x86

Wednesday, August 23rd, 2006

The system call dispatcher on x86 NT has undergone several revisions over the years.

Until recently, the primary method used to make system calls was the int 2e instruction (software interrupt, vector 0x2e). This is a fairly quick way to enter CPL 0 (kernel mode), and it is backwards compatible with all 32-bit capable x86 processors.

With Windows XP, the mainstream mechanism used to do system calls changed; from this point forward, the operating system selects a more optimized kernel transition mechanism based on your processor type. Pentium II and later processors will instead use the sysenter instruction, which is a more efficient mechanism of switching to CPL 0 (kernel mode), as it dispenses with some needless (in this case) overhead of usual interrupt dispatching.

How is this switch accomplished? Well, starting with Windows XP, the system service call stubs do not hardcode a particular instruction (say, int 2e) anymore. Instead, they indirect through a field in the KUSER_SHARED_DATA block (“SystemCall”). The meaning of this field changed in Windows XP SP2 and Windows Server 2003 SP1; in prior versions, the SystemCall field held the actual code used to make the system call (and was filled in at runtime with the proper values). In XP SP2 and Srv03 SP1, in the interests of reducing system attack surface, the KUSER_SHARED_DATA region was marked non-executable, and SystemCall becomes a pointer to a stub residing in NTDLL (with the pointer value being adjusted at runtime based on the processor type, to refer to an appropriate system call stub).

What this means for you today is that on modern systems, you can expect to see a sequence like so for system calls:

0:001> u ntdll!NtClose
ntdll!ZwClose:
7c821138 b81b000000       mov     eax,0x1b
7c82113d ba0003fe7f       mov     edx,0x7ffe0300
7c821142 ff12             call    dword ptr [edx]
7c821144 c20400           ret     0x4
7c821147 90               nop

0x7ffe0300 is +0x300 bytes into KUSER_SHARED_DATA. Looking at the structure definition, we can see that this is “SystemCall”:

0:001> dt ntdll!_KUSER_SHARED_DATA
   +0x000 TickCountLowDeprecated : Uint4B
   +0x004 TickCountMultiplier : Uint4B
   +0x008 InterruptTime    : _KSYSTEM_TIME
   [...]
   +0x300 SystemCall       : Uint4B
   +0x304 SystemCallReturn : Uint4B
   +0x308 SystemCallPad    : [3] Uint8B
   [...]

Since my system is Srv03 SP1, SystemCall is a pointer to a stub in NTDLL.

0:001> u poi(0x7ffe0300)
ntdll!KiFastSystemCall:
7c82ed50 8bd4             mov     edx,esp
7c82ed52 0f34             sysenter
ntdll!KiFastSystemCallRet:
7c82ed54 c3               ret

On my system, the system call dispatcher is using sysenter. You can look at the old int 2e dispatcher if you wish, as it is still supported for compatibility with older processors:

0:001> u ntdll!KiIntsystemCall
ntdll!KiIntSystemCall:
7c82ed60 8d542408         lea     edx,[esp+0x8]
7c82ed64 cd2e             int     2e
7c82ed66 c3               ret

The actual calling convention used by the system call dispatcher is thus:

  • eax contains the system call ordinal.
  • edx points to either the argument array of the system call on the stack (for int 2e), or the return address plus argument array (for sysenter).
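To make the stub layout concrete, here is a small sketch that pulls the system call ordinal and the SystemCall pointer address out of the raw bytes of a stub shaped like the NtClose listing earlier. The names (syscall_stub, decode_syscall_stub) are invented for this example, and the decoder assumes exactly that 15-byte instruction sequence; it does not handle any other stub shape.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef struct {
    uint32_t ordinal;      /* imm32 loaded into eax */
    uint32_t dispatch_ptr; /* imm32 loaded into edx */
    uint16_t ret_bytes;    /* imm16 operand of the trailing ret */
} syscall_stub;

/* Reads a little-endian 32-bit value. */
static uint32_t read_le32(const uint8_t *p)
{
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
           ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

/* Recognizes only this byte pattern:
   b8 imm32   mov  eax, imm32
   ba imm32   mov  edx, imm32
   ff 12      call dword ptr [edx]
   c2 imm16   ret  imm16 */
bool decode_syscall_stub(const uint8_t *code, size_t len, syscall_stub *out)
{
    if (len < 15 || code[0] != 0xb8 || code[5] != 0xba ||
        code[10] != 0xff || code[11] != 0x12 || code[12] != 0xc2)
        return false;
    out->ordinal = read_le32(code + 1);
    out->dispatch_ptr = read_le32(code + 6);
    out->ret_bytes = (uint16_t)(code[13] | (code[14] << 8));
    return true;
}
```

Fed the bytes from the listing (b8 1b 00 00 00 ba 00 03 fe 7f ff 12 c2 04 00), this reports ordinal 0x1b, pointer address 0x7ffe0300, and a four-byte __stdcall cleanup. Keep in mind that the ordinal belongs to the particular build in that listing.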

Most of the time, though, you’ll probably not be dealing directly with the system call dispatching mechanism itself. If you are, however, now you know how it works.