Most Win32 programs out there operate with more than one thread in the process at least some of the time, even if the program code doesn’t explicitly create a thread. The reason for this is that a number of system APIs internally create threads for their purposes.
For example, console Ctrl-C/Ctrl-Break events are handled in a new thread by default. (Internally, CSRSS creates a new thread in the address space of the recipient process, which then calls a kernel32 function to invoke the active console handler list. This is one of the reasons why kernel32 is guaranteed to be at the same base address system wide, incidentally, as CSRSS caches the address of the Ctrl-C dispatcher in kernel32 and assumes that it will be valid across all client processes[1].) This can introduce potentially unexpected threading concerns depending on what a console control handler does. For example, if a console control handler writes to the console, this access will not necessarily be synchronized with any I/O the initial thread is doing with the console, unless some form of explicit synchronization is used.
This is one of the reasons why Visual Studio 2005 eschews the single threaded CRT entirely, instead favoring the multithreaded variants. At this point in the game, there is comparatively little benefit to providing a single threaded CRT that will break in strange ways (such as if a console control handler is registered), just to save a minuscule amount of locking overhead. In a program that is primarily single threaded, the critical section package is fast enough that any overhead incurred by the CRT acquiring locks is going to be minuscule compared to the actual processing overhead of whatever operation is being performed.
An even greater subset of Win32 APIs use worker threads internally, though these typically have limited visibility to the main process image. For example, WSAAsyncGetHostByName uses a worker thread to synchronously call gethostbyname and then post the results to the requester via a window message. Even debugging an application typically results in threads being created in the target process for purpose of debugger break-in processing.
Due to the general trend that most Win32 programs will encounter multiple threads in one fashion or another, other parts of the Win32 programming environment besides the CRT are gradually dropping the last vestiges of their support for purely single threaded operation as well. For instance, the low fragmentation heap (LFH) does not support HEAP_NO_SERIALIZE, which disables the usage of the heap’s built-in critical section. In this case, as with the CRT, the cases where it is always provably safe to go without locks entirely are so limited and provide such a small benefit that it’s just not worth the trouble to take out the call to acquire the internal critical section. (Entering an unowned critical section does not involve a kernel mode transition and is quite cheap. In a primarily single threaded application, any critical sections will by definition almost always be either unowned or owned by the current thread in a recursive call.)
[1] This is not strictly true in some cases with the advent of Wow64, but that is a topic for another day.