Recently, Microsoft made a rather uncharacteristic move and (mostly) freely published the specifications for the Viridian hypercall interface (otherwise known as “Windows Server virtualization”). Publishing this documentation is, to be clear, a great thing for Microsoft to have done (in my mind, anyway).
The hypercall interface is in some respects analogous to the “native API” of the Windows kernel. Essentially, the hypercall interface is the mechanism by which a privileged, virtualization-aware component running in a hypervisor partition can request assistance from the hypervisor for a particular task. In that respect, a hypercall is to a hypervisor what a system call is to an operating system kernel.
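To make the analogy a bit more concrete, here is a minimal sketch (in C) of what issuing a hypercall looks like from the guest’s side, assuming the x64 convention the specification outlines: a 64-bit hypercall input value carrying the call code, plus the guest physical addresses of the input and output parameter pages. The HvCallIssue stub below is a hypothetical stand-in for the actual transfer into the hypercall page, and the type names are mine, not the specification’s.

#include <stdint.h>
#include <stdio.h>

typedef uint64_t HV_GPA;   /* guest physical address of a parameter page */

/* Hypothetical stand-in for the transfer into the hypercall page; a real
   guest would execute the hypervisor-provided call sequence here. */
static uint64_t HvCallIssue(uint64_t InputValue, HV_GPA InputGpa, HV_GPA OutputGpa)
{
    printf("hypercall: input=%#llx in=%#llx out=%#llx\n",
           (unsigned long long)InputValue,
           (unsigned long long)InputGpa,
           (unsigned long long)OutputGpa);
    return 0; /* HV_STATUS_SUCCESS */
}

/* Analogous to a system call stub: marshal the call code and parameter
   pages, trap into the more privileged layer, and return a status. */
uint64_t HvCallSimple(uint16_t CallCode, HV_GPA InputGpa, HV_GPA OutputGpa)
{
    return HvCallIssue((uint64_t)CallCode, InputGpa, OutputGpa);
}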
It’s important to note that the documentation attempts to outline the hypercall interface from the perspective of documenting what one would need to implement a compatible hypervisor, and not from the perspective of how the Microsoft hypervisor implements said hypercall interface. However, it’s still worth a read and provides valuable insight into many aspects of how Viridian is architected at a high level.
I’m still working on digesting the whole specification (as the document is 241 pages long), but one thing that caught my eye was that there is special support in the hypervisor for debugging (in other words, kernel debugging). This support is implemented in the form of the HvPostDebugData, HvRetrieveDebugData, and HvResetDebugSession hypercalls (documented in the hypercall specification).
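For a rough feel for the shape of these calls, the prototypes below paraphrase what the specification describes (post outbound data without guaranteed delivery, poll for inbound data, reset the session); the parameter lists are my own illustrative guesses, not the input/output page layouts actually defined in the document.

#include <stdint.h>

typedef uint16_t HV_STATUS;

/* Post a chunk of outbound kernel debugger data; delivery is not
   guaranteed, and the hypervisor reports how much it actually accepted. */
HV_STATUS HvPostDebugData(const void *Data, uint32_t Count, uint32_t *CountWritten);

/* Poll for inbound kernel debugger data; a timeout lets the hypervisor
   deschedule the virtual processor instead of letting it spin. */
HV_STATUS HvRetrieveDebugData(void *Buffer, uint32_t BufferSize,
                              uint32_t *CountRead, uint32_t TimeoutMs);

/* Reset the debug session, discarding any buffered data in both directions. */
HV_STATUS HvResetDebugSession(uint32_t Flags);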
While I’m certainly happy to see that Microsoft is considering kernel debugging when it comes to the Viridian hypervisor, some aspects of how the Viridian debugging hypercalls work seem rather odd to me. After (re)reading the documentation for the debugging hypercalls a couple of times, I arrived at the conclusion that the Viridian debugging support is oriented towards simply virtualizing and multiplexing an unreliable physical debugger link. The goal of this approach appears to be to let multiple partitions (operating system instances running under the hypervisor) share the same physical connection between the physical machine hosting the hypervisor and the kernel debugger machine. Additionally, the individual partitions would be insulated from the actual physical medium that the kernel debugger connection operates over (for example, 1394 or a serial cable), such that only one kernel debugger transport module is needed per partition, regardless of what physical connection is used to connect the partition to the kernel debugger.
While this is a huge step forward from where shipping virtualization products are today with respect to kernel debugging (serial port debugging only), I think that this approach still falls short of ideal. A number of aspects of the debugging hypercalls still carry much of the baggage of a physical, machine-to-machine kernel debugging interface, baggage that is arguably unnecessary and undesirable from a virtualization perspective. Besides the possibility of further improving the performance of the virtualized kernel debugger connection, it is possible to support partition-to-partition kernel debugging in a more convenient fashion than Viridian presently does.
The debugging hypercalls, as currently defined, are in fact very much reminiscent of how I originally implemented VMKD. The hypercalls define an interface for a partition to send large chunks of kernel debugger data as discrete units, without any guarantee of reception. Additionally, they provide a mechanism to notify the hypervisor that the partition is polling for kernel debugger data, so that the hypervisor can take action to reduce the resource consumption of the partition while it is awaiting new data (thus alleviating the CPU spin issue that one often runs into while broken into the kernel debugger with existing virtualization solutions, VMKD notwithstanding, of course).
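In terms of the illustrative prototypes sketched earlier, the receive path of such a transport might look something like the loop below; the timeout-based blocking is an assumption of mine about how the “partition is polling” hint could surface to the caller, not something the specification promises.

#include <stdint.h>

typedef uint16_t HV_STATUS;

/* Illustrative prototype from the earlier sketch. */
HV_STATUS HvRetrieveDebugData(void *Buffer, uint32_t BufferSize,
                              uint32_t *CountRead, uint32_t TimeoutMs);

/* Receive one chunk of kernel debugger data. While the partition is broken
   into the debugger it simply sits in this loop; the timeout gives the
   hypervisor an opening to leave the virtual processor unscheduled rather
   than letting it spin. Error handling is omitted for brevity. */
uint32_t KdReceive(void *Buffer, uint32_t BufferSize)
{
    uint32_t read = 0;

    for (;;) {
        HV_STATUS status = HvRetrieveDebugData(Buffer, BufferSize,
                                               &read, 100 /* ms, assumed */);

        if (status == 0 /* HV_STATUS_SUCCESS */ && read != 0)
            return read;

        /* Timed out or no data pending yet; poll again. */
    }
}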
The original approach that I took with VMKD was fairly similar to the scheme described above. I essentially replaced the serial port I/O instructions in kdcom.dll with a mechanism that buffered data up to a certain point and then transmitted it to (or received it from) the VMM as a discrete unit. Like the Viridian approach to debugging, this greatly reduces the number of VM exits (as compared to a traditional virtual serial port) and provides the VMM with an opportunity to reduce the CPU usage of the guest while it is awaiting new kernel debugger data.
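The sketch below captures the buffering idea in miniature (it is not VMKD’s actual code): rather than trapping to the VMM on every serial byte, bytes accumulate in a buffer and are handed off as one discrete unit, with VmmSendPacket standing in for whatever backdoor mechanism reaches the VMM.

#include <stdint.h>

#define KD_TX_BUFFER_SIZE 4096

static uint8_t  KdTxBuffer[KD_TX_BUFFER_SIZE];
static uint32_t KdTxLength;

/* Hypothetical backdoor into the VMM (for example, a magic I/O port or a
   VMCALL-style instruction); one trap per buffered unit instead of per byte. */
void VmmSendPacket(const void *Data, uint32_t Length);

/* Hand the buffered data to the VMM as a single discrete unit. */
void KdFlush(void)
{
    if (KdTxLength != 0) {
        VmmSendPacket(KdTxBuffer, KdTxLength);
        KdTxLength = 0;
    }
}

/* Replacement for the per-byte serial output path in kdcom.dll. */
void KdWriteByte(uint8_t Byte)
{
    KdTxBuffer[KdTxLength++] = Byte;
    if (KdTxLength == KD_TX_BUFFER_SIZE)
        KdFlush();
}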
However, I believe that it’s possible to improve upon the Viridian debugging hypercalls in much the same way as I improved upon VMKD before I arrived at the current release version. For instance, by dispensing with the provision that data posted to the debugging interface will not be reliably delivered, and by enforcing several additional requirements on the debugger protocol, it is possible to further improve the performance of partition kernel debugging. The suggested additional debugger protocol requirements include stipulating that the data transmitted or received by the debugging hypercalls consists of discrete protocol data units, and that both ends of the kernel debugger connection are able to recover if an unexpected discrete PDU is received after a guest (or kernel debugger) reset.
These restrictions would further reduce VM exits by moving any data retransmit and recovery procedures outside of the partition being debugged. Furthermore, with the ability to reliably and transactionally transmit and receive (or fail in a transacted fashion) as a function of the debugging hypercall itself, there is no longer any need for the hypervisor to ever schedule a partition that is frozen waiting for kernel debugger data until new data is available (or a transactional failure, such as a partition-defined timeout, occurs). (This is, essentially, the approach that VMKD currently takes.)
In actuality, I believe that it should be possible to implement all of the above improvements by moving the Viridian debugging support out of the hypervisor and into the parent partition for a debuggee partition, with the parent partition being responsible for making hypercalls to set up a shared memory mapping for data transfer (HvMapGpaPages) and to allow for event-driven communication with the debuggee partition (HvCreatePort and related APIs) that could be used to request that debugger command data in the shared memory region be processed. Above and beyond the performance implications, this approach has the added advantage of more easily supporting partition-to-partition debugging (unless I’ve missed something in the documentation, the Viridian debugging hypercalls do not provide any mechanism to pass debugging data from one partition to another for processing).
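A rough sketch of what the debuggee side of such a shared-memory channel could look like follows; the channel layout and the SignalDebuggerPort wrapper are assumptions of mine for illustration, with HvMapGpaPages having already established the mapping and HvCreatePort (or a related call) providing the doorbell.

#include <stdint.h>
#include <string.h>

#define DEBUG_CHANNEL_SIZE 4096

typedef struct _DEBUG_CHANNEL {
    volatile uint32_t Sequence;    /* incremented once per published PDU  */
    volatile uint32_t BytesValid;  /* set after the data has been copied  */
    uint8_t           Data[DEBUG_CHANNEL_SIZE];
} DEBUG_CHANNEL;

/* Hypothetical wrapper around the port/event hypercalls: wakes the other
   partition instead of requiring it to spin while waiting for data. */
void SignalDebuggerPort(void);

/* Debuggee side: publish one complete kernel debugger PDU and ring the
   doorbell. The debuggee can then be left unscheduled until a reply PDU
   shows up in the opposite channel. */
int PostDebugPdu(DEBUG_CHANNEL *Channel, const void *Pdu, uint32_t Length)
{
    if (Length > DEBUG_CHANNEL_SIZE)
        return 0;                  /* PDU too large for the channel */

    memcpy(Channel->Data, Pdu, Length);
    Channel->BytesValid = Length;
    Channel->Sequence++;
    SignalDebuggerPort();
    return 1;
}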
Additionally, this approach would also completely eliminate the need to provide any specialized kernel debugging support at all in the hypervisor microkernel, instead moving this support into a parent (or the root) partition, leaving it to that partition to deal with the particulars of data transfer. If that partition, or another partition on the same physical computer as the debuggee partition, is acting as the debugger, then data can be “transferred” using the shared memory mapping. Otherwise, the parent (or root) partition can implement whatever reliable transport mechanism it desires for the kernel debugger data (say, a TCP connection to a remote kernel debugger over an IP-based network). Thus, this proposed approach could potentially not only open up additional remote kernel debugger transport options, but also reduce the code complexity of the hypervisor itself (which I would like to think is almost always a desirable thing, as non-existent code doesn’t have security holes, and the hypervisor is the absolute most trusted (software) component of the system when it is used).
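As one example of the sort of transport the parent (or root) partition could then provide, here is an illustrative relay loop that forwards each debuggee PDU to a remote kernel debugger over TCP; ReadDebugPdu is a hypothetical helper that blocks (via the port/event mechanism) until the debuggee publishes a PDU, and Winsock is used here purely for brevity.

#include <winsock2.h>
#include <stdint.h>

/* Hypothetical helper: blocks until the debuggee publishes one PDU into the
   shared-memory channel, then copies it out and returns its length. */
uint32_t ReadDebugPdu(void *Buffer, uint32_t BufferSize);

/* Root-partition relay: forward each discrete KD PDU to a remote kernel
   debugger over an already-connected TCP socket. TCP supplies the reliable,
   in-order delivery that the proposed protocol requirements assume. */
void RelayDebugDataToRemoteDebugger(SOCKET DebuggerSocket)
{
    char pdu[4096];

    for (;;) {
        uint32_t length = ReadDebugPdu(pdu, sizeof(pdu));

        if (send(DebuggerSocket, pdu, (int)length, 0) == SOCKET_ERROR)
            break;  /* the remote debugger went away; tear the session down */
    }
}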
Given that Viridian has some time yet before RTM, perhaps if we keep our fingers crossed, we’ll yet see some further improvements to the Viridian kernel debugging scene.
Tags: Viridian, Virtualization