WinHttpRequest performance woes

As I mentioned the other day, the WinHttpRequest automation object provides access from scripting to the WinHTTP client library. As I was looking into it for retrieving relatively large files for a testing scenario (e.g., the Sun JDK’s 80MB installation file) I couldn’t help but notice a clear flaw in its interface. Access to the response body for a request is provided by the ResponseText property, which simply provides a string containing the response. There’s also a similar ResponseBody property which is an array but otherwise identical. Unfortunately, the ResponseStream property, which returns an IStream reference, is not usable from scripting, which only supports dispinterfaces.

This is a problem because receiving large responses from the server requires maintaining their entire body in memory. Clearly, using ResponseText to download a 4 GB DVD ISO isn’t a great idea. So I figured as long as one keeps to reasonable transfer sizes, not massively ballooning the process working set, one should be fine.

It was only a short while later when a friend let me know about a problem he had in his own script making server requests. He was attempting to retrieve a file a few dozens megabytes in size from a web server, but instead of the retrieval completing promptly as expected, the Windows Script Host process became a CPU hog and continuously crept up in memory usage.

As we were trying to reproduce the problem in a as simple as possible environment, we saw that simply downloading a large file seemed to work promptly and as expected. Granted, memory usage was significant due to the property issue mentioned above, but the system had plenty of RAM and this was not an issue.

After some investigation the difference in the problematic scenario was identified. The transfer of the file was provided by a server generated dynamic page, where server code would dynamically write the contents of the requested file into the response stream.

Attaching the debugger to the running WSH process and examining the stack provides a big clue as to what is going on:
0:006> ~0 k 30
ChildEBP RetAddr
0013f004 7c81248c ntdll!RtlReAllocateHeap+0xc36
0013f05c 77506778 kernel32!GlobalReAlloc+0x17a
0013f078 77506802 ole32!CMemStm::SetSize+0x37
0013f09c 4d527444 ole32!CMemStm::Write+0x76
0013f0d8 4d527ca8 WINHTTP!CHttpRequest::ReadResponse+0x12b
0013f108 4d5248d9 WINHTTP!CHttpRequest::Send+0x1b9
0013f170 6fe9fcf0 WINHTTP!CHttpRequest::Invoke+0x359
0013f1ac 6fe9fc5d jscript!IDispatchInvoke2+0xb5
0013f1e8 6fe9fd78 jscript!IDispatchInvoke+0x59
0013f25c 6fea6c3c jscript!InvokeDispatch+0x90
0013f2a0 6fe9fab8 jscript!VAR::InvokeByName+0x1c2
0013f2e0 6fe9efea jscript!VAR::InvokeDispName+0x43
0013f304 6fea6ff4 jscript!VAR::InvokeByDispID+0xfd
0013f3bc 6fea165d jscript!CScriptRuntime::Run+0x16bd
0013f3d4 6fea1793 jscript!ScrFncObj::Call+0x8d
0013f444 6fe8da72 jscript!CSession::Execute+0xa7
0013f494 6fe8beba jscript!COleScript::ExecutePendingScripts+0x147
0013f4b0 0100220a jscript!COleScript::SetScriptState+0xf1
0013f4bc 0100217d cscript!CScriptingEngine::Run+0xb
0013f4d0 01001f34 cscript!CHost::RunStandardScript+0x85
0013f708 010027fc cscript!CHost::Execute+0x1f0
0013fcac 010024de cscript!CHost::Main+0x385
0013ff50 010025e6 cscript!main+0x6d
0013ffc0 7c816fd7 cscript!_mainCRTStartup+0xc4
0013fff0 00000000 kernel32!BaseProcessStart+0x23

As the WinHttpRequest object is reading the response from the server side script, it is naively writing it to an internal OLE memory stream (presumably the product of CreateStreamOnHGlobal). As more and more data is received, the memory stream is resized to accommodate it. We can track the stream’s growth pattern with a suitable breakpoint on the SetSize method. It is documented as receiving a ULARGE_INTEGER, an unsigned 64-bit integer, specifying the stream’s desired size. Therefore, we expect the argument to be 8 bytes above the position of the stack pointer when we enter the function, on x86. A 64-bit integer is also known as a quadword. The debugger command is hence:
bp ole32!CMemStm::SetSize ".printf \"SetSize called with size %d\", qwo(@esp + 8); .echo; gc"

If we promptly resume execution and let the script run for a while we see something similar to:
SetSize called with size 10650474
SetSize called with size 10658666
SetSize called with size 10666858
SetSize called with size 10675050
SetSize called with size 10683242

And so on… some quick arithmetic indicates that the numerous SetSize calls are apart by 8,192 bytes.

Resizing the stream by 8KB at a time for a file dozens of megabytes in size is clearly not a winning strategy. As can be seen in the initial stack trace, this results in a heap reallocation frenzy. As contiguous room for block resize on the heap is exhausted, the stream resize must be satisfied by allocating a whole new block and moving all the previous stream data there. As the stream grows, the whole process slows down exponentially. Since the block in question is a big one, it resides in a “virtual block” on the heap (a block of address space provided by VirtualAlloc rather than a smaller one from the heap’s free lists) and is presumably moved whenever there is not enough contiguous address space at the virtual block’s present position to extend it.

Let us examine the heap to confirm this. We can suspend execution at ole32!CMemStm::SetSize and proceed to kernel32!GlobalReAlloc. The first parameter to the latter is the handle to the heap allocation:
Breakpoint 0 hit
eax=00000000 ebx=00000000 ecx=006f5fa4 edx=774eee94 esi=00184790 edi=00002000
eip=77506741 esp=0013f07c ebp=0013f09c iopl=0 nv up ei pl zr na pe nc
cs=001b ss=0023 ds=0023 es=0023 fs=003b gs=0000 efl=00000246
ole32!CMemStm::SetSize:
77506741 8bff mov edi,edi
0:000> bp kernel32!GlobalReAlloc
0:000> g
Breakpoint 1 hit
eax=00184d98 ebx=00000000 ecx=006f5fa4 edx=00184a14 esi=00184790 edi=006f5fa4
eip=7c8123b9 esp=0013f060 ebp=0013f078 iopl=0 nv up ei pl nz na pe nc
cs=001b ss=0023 ds=0023 es=0023 fs=003b gs=0000 efl=00000206
kernel32!GlobalReAlloc:
7c8123b9 6a24 push 24h
0:000> kb 1
ChildEBP RetAddr Args to Child
0013f05c 77506778 008e000c 006f5fa4 00002002 kernel32!GlobalReAlloc

If we proceed to examine the process default heap (at 0x00150000 in my case) and its virtual allocation list (we can ignore allocations from the linked lists below, they are for much smaller allocations) we can see the interesting block:
0:000> !heap -a 00150000
Index Address Name Debugging options enabled
1: 00150000
Segment at 00150000 to 00250000 (00035000 bytes committed)
Segment at 016D0000 to 017D0000 (00001000 bytes committed)
Flags: 00000002
ForceFlags: 00000000
Granularity: 8 bytes
Segment Reserve: 00200000
Segment Commit: 00002000
DeCommit Block Thres:00000200
DeCommit Total Thres:00002000
Total Free Size: 000001ee
Max. Allocation Size:7ffdefff
Lock Variable at: 00150608
Next TagIndex: 0000
Maximum TagIndex: 0000
Tag Entries: 00000000
PsuedoTag Entries: 00000000
Virtual Alloc List: 00150050
017d0000: 006f4000 [6f3fa4] - busy (2b) (Handle 008e000c), user flags (1)
...

If we let execution proceed for a while and examine the heap again, we see something like:
0:000> !heap -a 00150000
...
Virtual Alloc List: 00150050
02080000: 008ac000 [8abfa4] - busy (2b) (Handle 008e000c), user flags (1)
...

Notice the virtual allocation for the global heap handle has moved elsewhere in the interim. Originally at virtual address 0x017d0000, it is now at 0x02080000. Further examination shows that this happens repeatedly. In fact, examining the position of the block between calls of GlobalReAlloc shows how it alternates between a small set of virtual addresses, back and forth, as its size is modified. For a reason unclear without diving into deeper heap and virtual allocation internals, the virtual address block seems to change its location even when it can be extended to the new size at its present position, unlike what I speculated above.

Examination of the server response headers for the static page case and the dynamic case provide an insight to the difference in performance behavior. In the static case, the server, knowing in advance the size of the response, can provide the client with a Content-Length header. The header can be used by the client to allocate a block, huge as it may, in one swoop. Despite the slight memory pressure, this happens practically instantaneously.

In the dynamic case, the server cannot guess how much output the page is going to feel like providing. No Content-Length header is emitted, and the client’s naive stream memory allocation strategy comes to light and kills performance.

While WinHttpRequest’s naive reliance on COM’s memory stream allocation strategy is primarily at fault for this issue, the heap manager’s own deficiencies should not be overlooked. I believe its behavior for large block reallocations (virtual blocks) could be somewhat improved to mitigate inefficient clients.

This issue can be mitigated at the server side by ensuring a Content-Length header is always provided in responses. This can be tricky to calculate and I haven’t looked into whether lying (i.e., providing a Content-Length longer than the actual response just to get the client to preallocate memory) works correctly.

While I was looking into this issue, I was hoping to take advantage of Visual Studio 2005 Team Edition’s integrated profiler. I assumed it would be ideal for detecting and clearly illustrating an obvious CPU hog such as this case. However, after short examination, I must express my disappointment.

Visual Studio 2005’s profiler, like most other profilers, has two operation modes. Sampling and Instrumentation. Sampling means examining thread stacks at an interval and noting current execution state, basically. Instrumentation means binary patching of functions with code that notes execution flow has reached them. Instrumentation is therefore a means of “zooming in” on functions of interest, seeing as it is both more precise and has greater overhead.

I initially tried the Sampling mode. Examining its output in the impressive GUI, Performance Explorer, I noticed the sample didn’t clearly capture what was going on. I figured by instrumenting the functions I was suspicious of (the heap reallocator, etc.) a clear graphical illustration would be possible.

It was then I discovered that the profiler’s Instrumentation mode requires full debugging information. Since I am profiling native code that is not my own running under WSH, clearly all I have is public PDBs. Instrumentation refused to work in this scenario, making it useless for these kinds of cases.

This is too bad because other profilers seem content with instrumenting just exports or functions as they are indicated in public PDBs. I hope to see this as a feature in future versions of the Visual Studio profiler, since I like the slick Performance Explorer.

For those of you implementing memory stream classes and the like, keep in mind that sometimes an imperialistic memory allocation strategy is a good thing.

Think before you act, allocate before you write.

Advertisements

No SxS love from the Windows Script Host?

I was automating a scenario with a WSH script the other day that required interaction with a web server. So naturally I figured I’d make use of the WinHttpRequest automation object which wraps the WinHTTP API.

Those familiar with WSH may share my great distaste for the fact that when it functions as an automation controller, the developer is expected to hard-code enumeration constants and the like (as “vars” in JScript or “Consts” in VBScript). I first encountered this ridiculous limitation when a friend showed me how he translated C# code that automated Microsoft Word to VBScript, and had to look up the various constants by hand, with Visual Basic 6’s Object Browser, which functions as a convenient type library viewer. By default, the script engine only uses the automation object’s IDispatch interface, leaving the chore of constant resolution to the caller.

So I was relieved when I found out that .wsf files, which are XML files that wrap scripts executed by WSH, support referencing a type library for the purpose of making available the constants used with a controlled automation object. I was a little disappointed to find out about their not so ideal performance characteristics, but that was not problematic in my case.

I figured I’d introduce a reference of the following form to the script:
<reference object="WinHttp.WinHttpRequest.5.1"/>

Not all was well, however. After introducing the change above, I noticed my script had stopped working on one of the systems. Invocation of the Windows Scripting Host failed, with WSH claiming to be unable to resolve the reference to the specified ProgID.

After looking into it I figured out the problem with that specific system was that it was running Windows Server 2003 rather than Windows XP. It seemed strange the newer Windows Server 2003 would have a regression like that. I continued investigating.

The first clue was that winhttp.dll, the DLL implementing the WinHTTP API, was MIA from Windows Server 2003’s system32 directory. Surely the API was not missing from the OS, MSDN clearly documents its presence. It was indeed there, albeit in a modified form: a native side by side assembly.

OK, so winhttp.dll is there, in an oddly named subdirectory somewhere in the winsxs store instead of system32. Still, I recalled from my previous interaction with SxS that side by side assemblies could expose COM objects to their clients. Examination of the manifest file for WinHTTP in Windows Server 2003’s winsxs store revealed that it was indeed doing so.

Microsoft documents that users of the flat WinHTTP C API under Windows Server 2003 should add winhttp.dll as a dependent assembly to the activation context of the client application, but this approach seemed inappropriate to me in the context of the WinHttpRequest automation object, since clients activate it by ProgID or GUID and do not load winhttp.dll directly. Them being made aware of this relationship would be a serious breach of COM’s encapsulation.

I proceeded to write a test application in C++. It initialized COM and proceeded to call CLSIDFromProgID to translate “WinHttp.WinHttpRequest.5.1” to a GUID. Given success of this translation, it would call CoCreateInstance on returned GUID and if that worked out, QueryInterface for IDispatch and for IWinHttpRequest (defined in the Windows SDK’s httprequest.idl).

To my, I must admit, great surprise, the test application worked. The first surprising thing was that CLSIDFromProgID returned successfully, even though I specified a ProgID exposed by a SxS assembly. The ProgID was clearly absent from the HKEY_CLASSES_ROOT registry key in Windows Server 2003, in contrast to its presence there in Windows XP. Only if ole32.dll, the COM runtime, had specific knowledge of SxS and ability to perform a lookup in the winsxs store, would such a request be serviced successfully, I figured. However, no mention of this functionality could be found directly in CLSIDFromProgID’s documentation.

I was even more surprised that the CLSID returned by CLSIDFromProgID as the result of the lookup was NOT the CLSID of winhttp.dll! I couldn’t find the returned CLSID in the registry. However, when I promptly invoked CoCreateInstance, not only the activation request succeeded, I actually saw a Module Load event for winhttp.dll from the winsxs store in the debugger. I assume that the returned CLSID is part of some COM SxS integration magic.

OK, so my poor man’s automation controller implemented in C++ could obviously activate the WinHttpRequest object even in Windows Server 2003 with no knowledge of its new SxS semantics. It seemed odd that my script would fail to do same, since I assumed similar mechanics were behind its resolution process for locating the type library.

The next thing I did was to try and run my script on Windows Vista. I figured the change to WinHTTP making it a SxS assembly introduced in Windows Server 2003 was incorporated into Microsoft’s latest OS, as well. Continuing the previous chain of surprises, the script suddenly worked.

The first difference between Windows Server 2003 and Windows Vista that I observed was that Windows Script Host was updated to version 5.7 in the new OS. My first theory was that the new WSH had corrected whatever implementation issue that prevented WSH 5.6 from locating SxS type libraries.

I looked it up and found out that only days earlier Microsoft had actually made a release of the new Windows Script Host 5.7 to down-level platforms. Untypically for Microsoft nowadays, they even made a release for Windows 2000. So now I had a chance to test my theory. I installed WSH 5.7 on the Windows Server 2003 system and reran the script. In yet another surprise, it didn’t work, the type library reference giving the same error as before. It seems my instincts are really off about all of this.

So there must be a different reason for the different behavior of Windows Server 2003 and Windows Vista. After examining the Vista system, it appeared the whole thing was a lot simpler than I had originally thought. Windows Vista was a strange hybrid of the Windows XP and Windows Server 2003 behaviors, with winhttp.dll being present both as a SxS assembly in its winsxs store and as a regular DLL in system32. Indeed, examination of HKEY_CLASSES_ROOT in the Vista registry resulted in the discovery of plain old ProgID registration for the non-SxS winhttp.dll. This is most likely the reason that the type library lookup succeeds in the Windows Vista system.

With these details at hand, I was finally able to find a discussion of this issue in a newsgroup. In that newsgroup thread, Microsoft’s Biao Wang acknowledges WSH’s lack of support for SxS type library references. The thread being an old one, the possibility of a fix being introduced in Windows Server 2003 Service Pack 1 was mentioned. However, considering the issue presented itself on the Windows Server 2003 system that had Service Pack 2 installed and that the latest WSH 5.7 still doesn’t support this down-level, it appears that the issue ended up remaining unresolved, for whatever consideration Microsoft had made on the matter.

The thread does mention a satisfactory workaround: reference the SxS type library by GUID and version instead of by object ProgID and it seems to work. I tried referencing the type library by GUID when the ProgID approach didn’t work on Windows Server 2003 originally, but that reference didn’t work either since I left out the “version” directive. Another happy ending.

Lovers of type library constant imports, rejoice!

Windbg’s integrated managed debugging – an accidental feature?

When Windbg 6.7.5.0 came out a while ago, some were quick to notice that stack traces now showed managed and unmanaged stack frames interchanged, integrating the previously separate functionality of Son of Strike’s !CLRStack command. So disappointment was great when this stopped working with no explanation with the 6.7.5.1 debugger release.

The other day I was trying to be helpful to someone on the microsoft.public.windbg newsgroup that noticed the missing functionality in the latest revision. I was promptly taken aback by this surprising response from Pat Styles of Microsoft.

Apparently, integrated managed debugging was never intended to come out of Microsoft at its current, early stage. If you consider messages like “Integrated managed debugging does not support locals. See http://dbg/managed.html&#8221; you get when you try to examine locals in a managed stack frame, obviously being a reference to an Microsoft internal site, it seems to add up.

My best guess is that the dbghelp.dll and dbgeng.dll versions distributed with the 6.7.5.0 DTW release accidentally included support for managed code. They might simply support CLR PDBs or even CLR metadata (wouldn’t that be cool? Symbols for managed code even without a PDB) – but the unfortunate fact is that it’s all being scrapped. Worse, from Pat’s reply it appears Microsoft are unwilling to commit to when we might see this as supported and working functionality in the debugger.

Let’s hope it’s going to be sooner rather than later.

Mysterious disappearing acts by samples from Microsoft SDKs

I was exploring use of the COM type library marshaller as an alternative to a MIDL-generated marshaller a while back (as mentioned on the bottom of my earlier COM post). The type library marshaller, like the rest of COM, is usually inclined towards registration (you specify the TypeLib for an interface in the registry to let COM know how to marshal it and specify the marshaller’s GUID as the ProxyStubClsid32 instead of some MIDL generated class in your custom DLL) – but can actually be used quite comfortably without it. In my search for a solution I stumbled upon Don Box’s old MSJ article that provides a rather exhaustive description of the whole deal. Evidently, the TypeLib entry in the registry is just an accessory and the interesting part is actually the rpcrt4.dll function duo, CreateProxyFromTypeInfo and CreateStubFromTypeInfo.

These APIs receive an ITypeLib reference and use the binary type information to synthesize an /Oicf style marshaller, dynamically. It is quite a feat. Consider the possibilities. An application could marshal arbitrary interface references and not be bound just to a known subset for which it has proxy/stub implementations statically linked. Microsoft recognized how messy linking a MIDL generated proxy/stub factory to a non-C++ project can be and included the Marshaler sample in the .NET 1.1 SDK. This is a C# sample that uses P/Invoke to the type library marshaller and thus avoids MIDL integration. When I looked over the sample, I was amazed to see it use the undocumented, underground CreateProxyFromTypeInfo and CreateStubFromTypeInfo APIs with not even a sign of apology.

As I later found out, I was simply fortunate to happen to be on a machine with the old .NET 1.1 SDK installed. The sample, along with others, has simply disappeared from the .NET 2.0 SDK and its successor, the Windows SDK. Unlike what one may expect, it hasn’t been replaced by anything similar that illustrates what it meant to.

Unfortunately, this is merely one case in a series of disappearing acts by Microsoft SDK samples. As I was exploring the use of RPC as an alternative to the more heavyweight COM, I looked for an SDK sample on the memory allocation and free semantics of [in, out] parameters in MIDL-generated RPC stubs. I found a list of the RPC samples in this MSDN page and figured the listed DYNOUT sample was exactly what I needed. Unfortunately, this time I was on a machine with only the latest Windows SDK installed. A search through Samples\NetDs\rpc quickly resulted in disappointment as the sample was obviously gone forever. This time, unlike the Marshaler sample case, there didn’t seem to be an obvious reason for the sample’s demise. I was able to scramble the answer for what I was looking for through a combination of trial-and-error and examination of those samples that did remain. It remains unclear why the now-defunct sample remains listed in MSDN.

Another trigger for filing a Missing Sample’s Report with the authorities was given to me the other day when a friend asked me about a WMI client sample. This one was a part of the Windows Server 2oo3 DDK but was, again, mysteriously absent from the new Windows Driver Kit. Considering the WDK and its huge DVD contains just about anything (so much that Microsoft decided to trim it down back to ~700MB by refactoring out the Windows Logo testing stuff, etc.) it seems bizarre that this sample of all things would be cut out.

If the motivation for the removal of these and other samples is some sort of quality bar (like the highly unwelcome removal of the trusty Dependency Walker from the Windows SDK because it doesn’t meet some sort of “quality” guideline) then it seems to me the cure is far worse than the disease. A poor sample is better than no sample at all, except perhaps when it comes to security, which doesn’t seem to be the case here. The disappearance of samples is a great disservice to users of the various SDKs.

CRT deployment made slightly easier by Visual C++ 2008 Beta 2

My infamous post on using the OS CRT (msvcrt.dll) instead of the Visual C++ CRT (msvcrXX.dll) enumerated several compelling reasons for avoiding the headache of deploying the latter with your project. Those of you who are less concerned with the size of your distribution package and more with the added complexity of an MSI-based installation will be glad to know that deploying the new CRT as a private assembly has been made easier by the new Orcas Beta. As explained in this post on Nikola’s blog, private deployment is much easier and doesn’t require weird manifest hanky-panky, including the hefty vcredist.exe or using the MSI merge modules.

And the rest of us that want to create ActiveX controls smaller than 900K keep hoping and waiting for an OS debug CRT release in the next WDK…