Don’t mind the conflicting advice…

November 6th, 2006

I found this little gem in the VS2005 documentation for /W4:

Level 4 displays all level 3 warnings plus informational warnings, which in most cases can be safely ignored. This option should be used only to provide “lint” level warnings and is not recommended as your usual warning level setting.

For a new project, it may be best to use /W4 in all compilations. This will ensure the fewest possible hard-to-find code defects.

Whoops. Time to file a documentation bug.

It is the little inconsistencies like this that really contribute to the perception that MSDN is difficult to understand and use, I think.

Don’t always trust the disassembler…

November 3rd, 2006

A coworker of mine just ran into a particularly strange bug in the WinDbg disassembler that turns out to be kind of nasty if you don’t watch out for it.

There is a bug with how WinDbg disassembles e9/imm32 jump instructions on x86. (The e9/imm32 instruction takes a 4-byte relative displacement from eip+5, which when combined with eip, is the new eip value after the jump. It is thus in the format e9XXXXXXXX, where XXXXXXXX is the relative displacement.)

Specifically, if the absolute value of the relative displacement immediate value for the e9 opcode is the virtual address of a named symbol in the target, WinDbg incorrectly displays the target of the jump as that symbol. This does not occur if it does not resolve to a named symbol, i.e. something that would just show up as a raw address and not as a name or name+offset in the debugger.

For example, here’s a quick repro that allocates some space in the target and builds an e9/imm32 instruction that is disassembled incorrectly:

0:001> .dvalloc 1000
Allocated 1000 bytes starting at 00980000
0:001> eb 00980000 e9;
ed 00980001 kernel32!CreateFileA;
u 00980000 l1
00980000 e9241a807c      jmp kernel32!CreateFileA (7c801a24)

Here, CreateFileA is not the target of the jump. Instead, 7d181a29 is the target.

You are most likely to see this problem when looking at e9-style jumps that are used with function hooks.

It is interesting to note that the uf (unassemble function with code analysis) is not fooled by the jump, and does properly recognize the real target for the jump. (Here, the target is bogus, so uf refuses to unassemble it. However, in testing, it has followed the jump correctly when the target is valid.)

This bug is appearing for me in WinDbg 6.6.7.5. I would assume that it is probably present in earlier versions.

Here’s why you do not normally see this bug:

  1. This bug only occurs if the relative displacement, when incorrectly taken as a raw absolute virtual address, matches the absolute virtual address of a named symbol. This means that if you tried to resolve an address as a symbol, you would get a name (or name+offset) back and not just a number.
  2. Most relative displacements are very small, except in atypical cases like function hooks. This means that as a general rule of thumb, it is very rare to see a relative displacement for an e9/imm32 jump that is more than a couple of megabytes.
  3. There are rarely any named symbols in the first few megabytes of the address space. The first 64K are reserved (in non-ntvdm processes) entirely, and after that, things like RTL_USER_PROCESS_PARAMETERS and the environmental variable block for the process, and soforth are typically present. These are allocated at the low end of the address space because by default, the memory manager allocates from low addresses and up. Therefore, low address ranges are very quickly used up by allocations.
  4. Most images do not have a preferred image base in the extreme low end of the address space.
  5. By the time most DLLs with conflicting base addresses are being loaded, the low end of the address space has been sufficiently consumed and/or fragmented such that there is not room to map the image section view in the extreme low end of the address space, continuing to prevent there being named symbols in this region.

Given this, it’s easy to see how this bug has gone unseen and unfixed for so long. Hopefully, it’ll be fixed in the next DTW drop.

Resolution to CreateIpForwardEntry failing on Vista

November 3rd, 2006

Previously, I had posted about a compatibility problem with Windows Vista if you used CreateIpForwardEntry to manage the IP routing table. In particular, if you call this routine on Vista with the intent to create a new route in the IP routing table, you may get an inexpicibly ERROR_BAD_ARGUMENTS error code returned.

There is an officially supported workaround, though it is not very well documented, and was in fact only recently made available in the Platform SDK documentation to my knowledge.

The official line is that you must call the GetIpInterfaceEntry function on Vista, if you wish to continue to be able to add routes.

(Yes, this does suck. It is a total breaking change for anyone who did route manipulation on OS’s prior to Vista, until you patch your programs out in the field. If this is unacceptable to you, I would encourage you to provide feedback to Microsoft about how this issue impacts customer experiences and your ability to deploy and use your product on Vista.)

I would encourage you to use this documented API instead of the undocumented solution that I posted about earlier, simply because it will (ostensibly) continue to work on future OS versions. (Although, given that the reason you have to call this is because of a breaking future-compatibility change in Vista, I am not sure that it is really justified to use that line here…)

Win32 calling conventions: __thiscall in assembler

November 2nd, 2006

The final calling convention that I haven’t gone over in depth is __thiscall.

Unlike the other calling conventions that I have previously discussed, __thiscall is not typically explicitly decorated on functions (and in most cases, you cannot decorate it explicitly).

As the name might imply, __thiscall is used exclusively for functions that have a this pointer – that is, non-static C++ class member functions. In a non-static C++ class member function, the this pointer is passed as a hidden argument. In Microsoft C++, the hidden argument is the first actual argument to the routine.

When a function is __thiscall, you will typically see the this pointer passed in the ecx register. In this respect, __thiscall is rather similar to __fastcall, as the first argument (this) is passed via register in ecx. Unlike __fastcall, however, all remaining arguments are passed via the stack; edx is not supported as an additional argument register. Like __fastcall and __stdcall, when using __thiscall, the callee cleans the stack (and not the caller).

In some circumstances, with non-exported, internal-only-to-a-module functions, CL may use ebx instead of ecx for this. For any exported __thiscall function (or a function whose address escapes a module), the compiler must use ecx for this, however.

Continuing with the previous examples, consider a function implementation like so:

class C
{
public:
	int c;

	__declspec(noinline)
	int ThiscallFunction1(int a, int b)
	{
		return (a + b) * c;
	}
};

This function operates the same as the other example functions that we have used, with the exception that ‘c’ is a member variable and not a parameter.

The implementation of this function looks like so in assembler:

C::ThiscallFunction1 proc near

a= dword ptr  4
b= dword ptr  8

mov     eax, [esp+8]       ; eax=b
mov     edx, [esp+4]       ; edx=a
add     eax, edx           ; eax=a+b
imul    eax, [ecx]         ; eax=eax*this->c
retn    8                  ; return eax;
C::ThiscallFunction1 endp

Note that [ecx+0] is the offset of the member variable ‘c’ from this. This function is similar to the __stdcall version, except that instead of being passed as an explicit argument, the ‘c’ parameter is implicitly passed as part of the class object this and is then referenced off of the this pointer.

Consder a call to this function like this in C:

C* c = new C;
c->c = 3;
c->ThiscallFunction1(1, 2);

This is actually a bit more complicated than the other examples, because we also have a call to operator new to allocate memory for the class object. In this instance, operator new is a __cdecl function that takes a single argument, which is the count in bytes to allocate. Here, sizeof(class C) is 4 bytes.

In assembler, we can thus expect to see something like this:

push    4                    ; sizeof(class C)   
call    operator new         ; allocate a class C object
add     esp, 4               ; clean stack from new call
push    2                    ; 'a'
push    1                    ; 'b'
mov     ecx, eax             ; (class C* c)'this'
mov     dword ptr [eax], 3   ; c->c = 3
call    C::ThiscallFunction1 ; Make the call

Ignoring the call to operator new for the most part, this is relatively what we would expect. ecx is used to pass this, and this->c is set to 3 before the call to ThiscallFunction1, as we would expect, given the C++ code.

With all of this information, you should have all you need to recognize and identify __thiscall functions. The main takeaways are:

  • ecx is used as an argument register, along with the stack, but not edx. This allows you to differentiate between a __fastcall and __thiscall function.
  • Arguments passed on the stack are cleaned by the caller and not the callee, like __stdcall.
  • For virtual function calls, look for a vtable pointer as the first class member (at offset 0) from this. (For multiple inheritance, things are a bit more complex; I am ignoring this case right now). Vtable accesses to retrieve a function pointer to call through after loading ecx before a function call are a tell-table sign of a __thiscall virtual function call.
  • For functions whose visibility scope is confined to one module, the compiler sometimes substitutes ebx for ecx as a volatile argument register for this.

Note that if you explicitly specify a calling convention on a class member function, the function ceases to be __thiscall and takes on the characteristics of the specified calling convention, passing this as the first argument according to the conventions of the requested calling convention.

That’s all for __thiscall. Next up in this series is a brief review and table of contents of what we have covered so far with common Win32 x86 calling conventions.

Removing kernel patching on the fly with the kernel debugger

November 1st, 2006

Occasionally, you may find yourself in a situation where you need to “un-patch” the kernel in order to make forward progress with investigating a problem. This has often been the case with me and certain unnamed anti-virus programs that have the misguided intention to prevent the computer administrator from administering their computer, by denying everyone access to certain protected processes.

Normally, this is done by the use of a kernel driver that hooks various kernel system calls and prevents usermode from being able to access a protected process. While this may be done in the name of preventing malware from interfering with anti-virus software, it also has the unfortunate side effect of preventing legitimate troubleshooting of software issues.

Fortunately, with WinDbg installed and a little knowledge of the debugger, it is easy to reverse these abusive kernel patches that undermine the ability of a system administrator to do his or her job.

Now, normally, you might think that one would be stuck reverse engineering large sections of code in order to disable such kinds of protection mechanisms. However, in the vast majority of cases like these, you can simply have the kernel debugger perform a comparison of the kernel memory image with the image retrieved from the symbol server, and fix up any differences (accounting for relocations). This may be done with the !chkimg -f nt command. Using !chkimg in this fashion allows you to quickly remove unwanted kernel patches without having to dig through third party code that has injected itself into the system.

If you are feeling particularly adventurerous, you can even do this in local kernel debugger mode on Windows XP or later, without having to boot the system with /DEBUG. Be warned that this does carry an inherent race condition, though very unlikely in most cases with system service patching, that you might crash the system if someone makes a call to one of the regions of the kernel that you are unpatching while the kernel is being restored to its pristine state.

You should also be aware that depending on how the third party software that has patched the kernel is written, removing the patches out from under it may have varying negative side effects; be careful. As a result, if you are working on a critical or production system, you may want to pick a different approach. If you are just working on a throw-away repro environment in a VM, though, this can be a good quick-n-dirty way to get the job done.

Despite these potential problems, I’ve successfully used this trick in a pinch several times successfully. If you are running into a brick wall with debugging malfunctioning anti-virus software interactions with your product because of anti-debug protection mechanisms, you might give this technique a try.

Debugging programs that block symbol server access

October 31st, 2006

One of the rather frustrating things to debug in Windows is one of the programs or services that is in the path of symbol server requests. This is most often the case with a service running in a svchost group with a number of other services that are responsible for things like DNS or some internal support that WinInet relies on or soforth. This deadlock condition is also prone to happening if you are debugging something else that makes lots of calls out to WinInet, as WinInet has some cross-process state that allows the symbol server engine to get deadlocked waiting on the target to release a global mutex (or similar global synchronization / state).

If you naively try to simply attach a debugger and load symbols in such a situation, you’ll end up with a nasty surprise; the debugger will hang, and you’ll have to kill it (and whatever you were debugging) to recover. In the case of svchost groups, many of those services will not properly restart after just being abruptly killed, so you may even have to reboot.

Now, one obvious solution to this problem is to just turn off symbol server access at all and work without symbols. This is obviously a major pain, though – nobody wants to debug without symbols if you actually have access to them, right?

There are a couple of other things that you can do to debug things in this scenario, however, that are a bit less painful than forgoing symbols:

First, you can use the kernel debugger to debug user mode processes. I have often seen Microsoft employees recommend this on the newsgroups. While this works (the kernel debugger is not affected by the state of any services that you are poking on the target computer), it is most certainly a royal pain to do. Using the kernel debugger means that breakpoints will often affect all processes (unless used via ba), you’ll have to deal with parts of the program being paged out, and not to mention the fact that kernel debugger connections are typically much less responsive than a local user mode debugger.

Because of the inordinate amount of pain involved in using kd to debug a user mode process, I do not recommend ever going this route unless you absolutely positively have no other recourse for accurate debugging.

Therefore, I recommend a different procedure in this case:

  1. Disable symbol server entirely (remove all SRV* symbol server references from your symbol path) or make sure that you have loaded symbols for ntdll.dll.
  2. Attach to the process in question normally, but do not load symbols or issue any commands.
  3. Write a full user minidump out to disk somewhere. For example, .dump /ma c:\tmp.dmp.
  4. Detach from the process that blocks symbol server from working.
  5. Open the minidump you saved earlier, and issue a .reload /f command. This causes all symbols in the process to be downloaded from the symbol server if you do not already have them.
  6. Re-attach to the process that you wanted to debug, and set your symbol path to refer to the downstream store you used with symbol server, but without invoking symbol server. That is, if you had previously used SRV*C:\symbols*http://msdl.microsoft.com/download/symbols for your symbol path, set it to C:\symbols. This ensures that you will never try to hit the symbol server.
  7. Debug as normal.

After that, you’ll have everything in the process that could have symbols on the symbol server already downloaded into your downstream store. By turning off symbol server and just using the downstream store when you actually start your real debugging efforts, you’ll make sure that the debugger will never deadlock itself against the target. Best of all, you don’t need to download a very large symbol pack from Microsoft that might be missing files that have been hotfixed, since you are still (preloading) all of the symbols from the symbol server. This trick can, of course, also be used with your own internal symbol servers as well.

Win32 calling conventions: __fastcall in assembler

October 30th, 2006

The __fastcall calling convention is the last major major C-supported Win32 (x86) calling convention that I have not covered yet. (There still exists __thiscall, which I’ll discuss later).

__fastcall is, as you might guess from the name, a calling convention that is designed for speed. In this spirit, it attempts to borrow from many RISC calling conventions in that it tries to be register-based instead of stack based. Unfortunately, for all but the smallest or simplest functions, __fastcall typically does not end up being a particularly stellar thing performance-wise, for x86, primarily due to the (comparatively) extremely limited register set that x86 sports.

This calling convention has a great deal in common with the x64 calling convention that Win64 uses. In fact, aside from the x64-specific parts of the x64 calling convention, you can think of the x64 calling convention as a logical extension of __fastcall that is designed to take advantage of the expanded register set available with x64 processors.

What this boils down to is that __fastcall will try to pass the first two pointer-sized arguments in the ecx and edx registers. Any additional registers are passed on the stack as per __stdcall.

In practice, the key things to look out for with a __fastcall function are thus:

  • The callee assumes a meaningful value in the ecx (or edx and ecx) registers. This is a tell-tale sign of __fastcall (although, you may sometimes see __thiscall make use of ecx for the this pointer).
  • No arguments are cleaned off the stack by the caller. Only __cdecl functions have this property.
  • The callee ends in a retn (args-2)*4 instruction. In general, this is the pattern that you will see with __fastcall functions that use the stack. For __fastcall functions where no stack parameters are used, the function typically ends in a ret instruction with no stack displacement argument.
  • The callee is a short function with very few arguments. These are the most likely cases where a smart programmer will use __fastcall, as otherwise, __fastcall does not tend to buy you very much over __stdcall.
  • Functions that interface directly with assembler. Having access to ecx and edx can be a handy shortcut for a C function that is being called by something that is obviously written in assembler.

Taking these into account, let’s take a look at the same sample function and function call that we have been previously dealing with in our earlier examples, this time in __fastcall.

The function that we are going to call is declared as so:

__declspec(noinline)
int __fastcall FastcallFunction1(int a, int b, int c)
{
	return (a + b) * c;
}

This is consistent with our previous examples, save that it is declared __fastcall.

The function call that we shall make is as so:

FastcallFunction1(1, 2, 3);

With this code, we can expect the function call to look something like so in assembler:

push    3                 ; push 'c' onto the stack
push    2                 ; place a constant 2 on the stack
xor     ecx, ecx          ; move 0 into 'a' (ecx)
pop     edx               ; pop 2 off the stack and into edx.
inc     ecx               ; set 'a' -- ecx to 1 (0+1)
call    FastcallFunction1 ; make the call (a=1, b=2, c=3)

This is actually a bit different than we might expect. Here, the compiler has been a bit clever and used some basic optimizations with setting up constants in registers. These optimizations are extremely common and something that you should get used to seeing as simply constant assignments to registers, given how frequently they show up. In a future series, I’ll go into some more details as to common compiler optimizations like these, but that’s a tale for a different time.

Continuing with __fastcall, here’s what the implementation of FastcallFunction1 looks like in assembler:

FastcallFunction1 proc near

c= dword ptr  4

lea     eax, [ecx+edx] ; eax = a + b
imul    eax, [esp+4]   ; eax = (eax * c)
retn    4              ; return eax;
FastcallFunction1 endp

As you can see, in this particular instance, __fastcall turns out to be a big saver as far as instructions executed (and thus size, and in a lesser degree, speed) of the callee. This kind of benefit is usually restricted to extremely simple functions, however.

The main things, then, to consider if you are trying to identify if a function is __fastcall or not are thus:

  • Usage of the ecx (or ecx and edx) registers in the function without loading them with explicit values before-hand. This typically indicates that they are being used as argument registers, as with __fastcall.
  • The caller does not clean any arguments off the stack (no add esp instruction to clean the stack after the call). With __fastcall, the callee always cleans the arguments (if any).
  • A ret instruction (with no stack displacement argument) terminating the function, if there are two or less arguments that are pointer-sized or smaller. In this case, __fastcall has no stack arguments.
  • A retn (args-2)*4 instruction terminating the function, if there are three or more arguments to the function. In this case, there are stack arguments that must be cleaned off the stack via the retn instruction.

That’s all for __fastcall. More on other calling conventions next time…

Debugger commands review

October 29th, 2006

This posting is a master list of all the other posts and post series that cover different WinDbg commands, whether they be built-in commands, extension commands, or even third-party extension commands.

  1. Using SDbgExt to aid your debugging and reverse engineering efforts (part 1). SDbgExt is the debugger extension that I maintain and make publicly available. This series provides a high-level overview of the different commands that it offers.
  2. SDbgExt extensions, part 2
  3. Useful WinDbg commands: .formats
  4. Using knf to track down excessive stack usage. This trick is discussed in a section of the “Beware of stack usage with the new network stack in Windows Vista” post.
  5. Removing kernel patching on the fly with the kernel debugger. This article discusses how you can use the !chkimg command to remove patches and hooks on loaded module code at runtime. (This particular command is also available and applicable to the user mode debuggers, and not just the kernel debugger.)
  6. Debugger flow control: More on breakpoints (part 2). This article explores some of the inner workings of the various breakpoints supported by WinDbg. In addition, it describes the .apply_dbp command that can be used to apply a set of hardware breakpoints to the current register context, or a saved register context image in-memory.
  7. SDbgExt 1.09 released (support for displaying x64 EH data). This article describes the !fnseh command in SDbgExt that can be used to view exception handlers and unwind handlers for x64 targets from the debugger.
  8. Useful debugger commands: .writemem and .readmem. This article covers the .writemem and .readmem commands that can be used to move large sections of raw data into or out of the debugger.

Upcoming topics…

October 28th, 2006

Some of the things I’m planning on covering some time soon are:

  • Finish off the Win32 calling conventions series.
  • Continue the series about how the object namespace intersects with the Win32 API a bit further (Terminal Server session isolation, logon session isolation for mapped drives, etc).
  • Describe some of the common optimizer tricks that you might see when reverse engineering something (x86 assembler). For example, multiplication tricks with lea and that sort of thing.
  • Overview of how to use kernrate (the Microsoft profiler utility) to track down processor usage / performance problems.
  • Care and feeding of inline assembler on x86 (when to use it and when not to use it).
  • SEH on x86 (maybe)
  • A basic discussion on kernel debugging at some point in time…

I’m also working on organizing some of the post series into an easily accessible directory for quicker reference (note that some of the links in the Post Directory may not yet be active).

Let me know if there are any other topics that you might be interested in hearing about and I’ll see about posting about them at some point.

Things to watch out for if you hook functions on Windows Vista

October 27th, 2006

There are a couple of things that I have ran into that you should keep in mind if you are hooking functions and are planning to run under Windows Vista.

First, watch out for things being moved around in memory. For example, in Windows Vista, the VirtualProtect function in kernel32 and the CreateProcessA function in kernel32 are now on the same page, for the x86 build [NOTE: this is subject to rapid change with hotfixes, and may not still be the case on RTM]. If you have some code that works conceptually like so:

DWORD  OldProt;
PVOID  MyCreateProcessA;
PUCHAR _CreateProcessA;
static ULONG MyHook;

MyHook = (ULONG)&MyCreateProcessA;

VirtualProtect(_CreateProcessA, 6,
	PAGE_READWRITE, &OldProt);

//
// [...] Disassembly and stub saving
//       code goes here...
//

//
// jmp dword ptr [MyHook]
//

_CreateProcessA[0] = 0xFF:
_CreateProcessA[1] = 0x25;
*(PULONG)(&_CreateProcessA[2]) = &MyHook;

VirtualProtect(_CreateProcessA, 6,
	OldProt, &OldProt);

… you’ll run into some strange crashes in Vista, because you might end up making the pages backing VirtualProtect’s implementation non-executable by accident. (Remember that memory protections only have page granularity.)

The solution? Use PAGE_EXECUTE_READWRITE for your “intermediate” states when hooking things.

Secondly, watch out for AcLayers.dll and ShimEng.dll. These two DLLs are the core of Microsoft’s Application Compatibility Layer, which is the engine used to apply compatibility fixes at runtime to broken programs that would otherwise fail to work on Windows Vista. (This engine is also used if you select a particular compatibility layer in the property sheet for a shortcut to an executable or an executable.)

The thing to watch out for here is that AcLayers likes to do import table hooking on various kernel32 APIs. In particular, AcLayers tends to hook GetProcAddress and then occasionally redirect returned function pointers to point into AcLayers.dll and not kernel32.dll. If you have a program that assumes that any pointer that it retrieves from kernel32.dll via GetProcAddress will remain at the same address for any other process in the same session, this can result in some unpleasant surprises.

For instance, consider the classic case of wanting to inject some code to run before the main process entrypoint of a child process. You might do something like inject some code that calls kernel32!LoadLibraryA on some DLL your application surprise, and then kernel32!GetProcAddress to get the address of a function in that DLL. Then the patch code might invoke a function in your DLL and return to the initial program entrypoint of the child process. This is actually a fairly common paradigm if you need to modify some sort of behavior of a child process. Unfortunately, it can easily break if the parent process is under the influence of the dreaded application compatibility layer.

The main problem here is that when you, say, find the address of LoadLibraryA or GetProcAddress in kernel32, AcLayers.dll steps in and actually hands you the address of a stub function inside AcLayers.dll which filters requests to load DLLs or get function pointers. This is all well and fine with the parent process; AcLayers.dll is there and can do whatever it’s work is whenever you call GetProcAddress or LoadLibraryA.

The catch is what happens when you try to make a child process call LoadLibraryA on a DLL before it runs the main program entrypoint. In this case, instead of passing a pointer into kernel32 (which is guaranteed to be present and at the same base address in every Win32 process in the same session), you are passing a pointer into AcLayers.dll to the child process. The problem case is when AcLayers.dll is not loaded immediately into the child process. Here, your patch code in the newly created child process might try to call LoadLibraryA to get your custom DLL unloaded. However, it actually tries to call an internal AcLayers.dll function – but AcLayers.dll isn’t actually loaded into the address space of the child process (or might have even been rebased), so your child process mysteriously crashes instantly. This typically manifests itself as nothing happening when you try to launch a child program, depending on computer configuration.

There is unfortunately no particularly elegant way to work around this particular problem that I have found. The best advice I have to offer here is to try and bypass any possibility that any function pointer you pass to another process (in kernel32.dll) is never intercepted by AcLayers.dll. Perhaps the most fool-proof way to do this is to manually walk the export table of kernel32.dll and locate the address of the export that you are interested in, although this is not a particularly easy task.