Last time, I described how the x86 compiler sometimes optimizes conditional assignments.
It is worth adding, though, that the x64 compiler is in many respects a different beast from the x86 compiler. The x64 instruction set brings new guaranteed-universally-supported instructions, a new calling convention (with new restrictions on how the stack is used in functions), and an expanded set of registers. All of these can have a significant impact on the code the compiler generates.
As an example, let’s take a look at what the x64 compiler did when compiling (what I assume is) the same source file into the code that we saw yesterday. Even within the small section that I posted, there are a number of differences worth pointing out. The approximately equivalent section of code in the x64 version is as follows:
cmp     cl, 30h                         ; compare cl against 30h, setting the flags for the cmovnz below
mov     ebp, 69696969h                  ; 69696969h cached in ebp (reused for later comparisons)
mov     edx, 30303030h                  ; the other candidate value
mov     eax, ebp                        ; start with 69696969h in eax
mov     esi, 2                          ; unrelated to the selection of PacketLeader
cmovnz  eax, edx                        ; if cl was not 30h, replace eax with 30303030h
mov     [rsp+78h+PacketLeader], eax     ; store the selected value in the PacketLeader local
There are a number of differences here. First, there’s a very significant change that is not readily apparent from the small code snippets that I’ve pasted. Specifically, in the x86 build of this particular module, this code resided in a small helper subfunction that was called by a large exported API. With the x64 build, by contrast, the compiler decided to inline this function into its caller. (This is why this version of the code just stores the value in a local variable instead of storing it through a parameter pointer.)
Secondly, the compiler used cmovnz instead of going to the trouble of the setcc/dec/and sequence. Because all x64 processors support the cmovcc family of instructions, the compiler has a free hand to use them for any x64 platform target.
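The x86 listing from the previous post isn’t reproduced here, but the flag-to-mask idiom that compilers typically fall back on when cmovcc isn’t available looks something like the following sketch (register choices and the exact arithmetic are illustrative rather than lifted from the actual binary):

xor     eax, eax
cmp     cl, 30h
setnz   al                  ; eax = 1 if cl != 30h, else 0
dec     eax                 ; eax = 0FFFFFFFFh if cl == 30h, else 0
and     eax, 39393939h      ; 39393939h = 69696969h - 30303030h
add     eax, 30303030h      ; eax = 69696969h if cl == 30h, else 30303030h

With cmovcc available, the same selection collapses to a compare, two constant loads, and a single conditional move, with no flag-to-mask arithmetic at all.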
There are a number of different reasons why the compiler might perform a particular optimization. Although I’m hardly on the Microsoft compiler team, I can hazard a guess as to why, for instance, the compiler decided to inline this code chunk instead of leaving it in a helper function as it did in the x86 build.
On x86, the call to this helper function looks like so:
push    [ebp+KdContext]                 ; arg 3: KdContext (arguments pushed right to left)
lea     eax, [ebp+PacketLeader]
push    eax                             ; arg 2: address of the PacketLeader local
push    [ebp+PacketType]                ; arg 1: PacketType
call    _KdCompReceivePacketLeader@12   ; __stdcall, 12 bytes of arguments
Following the call instruction, throughout the main function, there are a number of comparisons between the PacketLeader local variable (which was filled in by KdCompReceivePacketLeader) and one of the constants (0x69696969) that we saw PacketLeader being set to in the code fragment. To be exact, there are three occurrences of the following in short order after the call (though there are control flow structures such as loops in between):
cmp [ebp+PacketLeader], 69696969h
These are invariably followed by a conditional jump to another location.
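The general shape of each check is the compare against the full 32-bit immediate followed by a branch on the result; something along these lines (the label is made up for illustration):

cmp     [ebp+PacketLeader], 69696969h   ; compare the local against the 32-bit immediate
jnz     NoMatch                         ; hypothetical label; branch taken if it differs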
Now, when we move to x64, things change a bit. One of the most significant differences between x64 and x86 (aside from the larger address space, of course) is the addition of a number of new general purpose registers. These are great for the optimizer, as they allow more values to be kept in registers instead of being spilled to the stack or, as in this case, encoded repeatedly as instruction operands.
Again, to be clear, I don’t work on the x64 compiler, so this is really just my impression of things based on logical deduction. That being said, it would seem to me that one optimization that becomes easier to make on x64 in this case is to replace all of the large cmp instructions that reference the 0x69696969 constant with comparisons against a value cached in a register. This is desirable because a cmp instruction that compares a value dereferenced off of ebp (the frame pointer; in other words, a local variable) with a 4-byte immediate value (0x69696969) is a whopping 7 bytes long.
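To put concrete numbers on that, the encoding breaks down as follows (assuming the local’s displacement from ebp fits in a single byte, which the 7-byte length implies it does here):

cmp     dword ptr [ebp+PacketLeader], 69696969h
; encodes as 81 7D xx 69 69 69 69  (7 bytes)
;            |  |  |  +-- 32-bit immediate operand (4 bytes)
;            |  |  +----- 8-bit displacement of the local from ebp
;            |  +-------- ModR/M byte ([ebp+disp8], /7 selects cmp)
;            +----------- opcode (cmp r/m32, imm32)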
Now, 7 bytes might not seem like much, but little things like this add up over time and contribute to additional paging I/O by virtue of the program code being larger. Paging I/O is very slow, so it is advantageous for the compiler to try to reduce code size where possible in the interest of cutting down on the number of code pages.
Because x64 has a large number of extra general purpose registers (compared to x86), it is easier for the compiler to “justify” the “expenditure” of devoting a register to, say, caching a frequently used value for purposes of reducing code size.
In this particular instance, because the 0x69696969 constant is referenced in both the helper function and the main function, one benefit of inlining the code is that it becomes possible to “share” the constant in a cached register across both the reference in the (formerly) helper function code and all of the comparisons in the main function.
This is essentially what the compiler does in the x64 version: 0x69696969 is loaded into ebp, copied into eax by the mov eax, ebp instruction, and then, depending on the condition flags at the time the cmovnz executes, either left in eax or overwritten with 0x30303030.
Later on in the main function, comparisons against the 0x69696969 constant are performed by checking against ebp instead of against an immediate 4-byte operand. For example, the long 7-byte cmp instructions on x86 become 4-byte instructions of the following form on x64:
cmp [rsp+78h+PacketLeader], ebp
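The size win falls directly out of the instruction encoding; with the constant held in ebp, the register form needs no immediate operand at all (again assuming a one-byte displacement, consistent with the 4-byte length):

cmp     dword ptr [rsp+78h+PacketLeader], ebp
; encodes as 39 6C 24 xx  (4 bytes)
;            |  |  |  +-- 8-bit displacement from rsp
;            |  |  +----- SIB byte (base register = rsp)
;            |  +-------- ModR/M byte ([rsp+disp8] via SIB, reg = ebp)
;            +----------- opcode (cmp r/m32, r32)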
This is probably not the only reason why the function was inlined for the x64 build and not the x86 build, but the optimizer is fairly competent, and I’d be surprised if this kind of optimization wasn’t factored in. Other reasons favoring inlining on x64 include, for instance, the restrictions that the (required) calling convention places on the sort of custom calling conventions that are possible on x86, and the fact that any non-leaf function that isn’t inlined requires its own unwind metadata entries in the exception directory (which, for small functions, can be a non-trivial amount of overhead compared to the opcode length of the function itself).
Aside from changes in decisions about whether or not to inline code, there are a number of new optimizations that are exclusive to x64. That, however, is a topic for another day.