As I mentioned the other day, the WinHttpRequest automation object provides access from scripting to the WinHTTP client library. As I was looking into it for retrieving relatively large files for a testing scenario (e.g., the Sun JDK’s 80MB installation file) I couldn’t help but notice a clear flaw in its interface. Access to the response body for a request is provided by the ResponseText property, which simply provides a string containing the response. There’s also a similar ResponseBody property which is an array but otherwise identical. Unfortunately, the ResponseStream property, which returns an IStream reference, is not usable from scripting, which only supports dispinterfaces.
This is a problem because receiving large responses from the server requires maintaining their entire body in memory. Clearly, using ResponseText to download a 4 GB DVD ISO isn’t a great idea. So I figured as long as one keeps to reasonable transfer sizes, not massively ballooning the process working set, one should be fine.
It was only a short while later when a friend let me know about a problem he had in his own script making server requests. He was attempting to retrieve a file a few dozens megabytes in size from a web server, but instead of the retrieval completing promptly as expected, the Windows Script Host process became a CPU hog and continuously crept up in memory usage.
As we were trying to reproduce the problem in a as simple as possible environment, we saw that simply downloading a large file seemed to work promptly and as expected. Granted, memory usage was significant due to the property issue mentioned above, but the system had plenty of RAM and this was not an issue.
After some investigation the difference in the problematic scenario was identified. The transfer of the file was provided by a server generated dynamic page, where server code would dynamically write the contents of the requested file into the response stream.
Attaching the debugger to the running WSH process and examining the stack provides a big clue as to what is going on:
0:006> ~0 k 30
0013f004 7c81248c ntdll!RtlReAllocateHeap+0xc36
0013f05c 77506778 kernel32!GlobalReAlloc+0x17a
0013f078 77506802 ole32!CMemStm::SetSize+0x37
0013f09c 4d527444 ole32!CMemStm::Write+0x76
0013f0d8 4d527ca8 WINHTTP!CHttpRequest::ReadResponse+0x12b
0013f108 4d5248d9 WINHTTP!CHttpRequest::Send+0x1b9
0013f170 6fe9fcf0 WINHTTP!CHttpRequest::Invoke+0x359
0013f1ac 6fe9fc5d jscript!IDispatchInvoke2+0xb5
0013f1e8 6fe9fd78 jscript!IDispatchInvoke+0x59
0013f25c 6fea6c3c jscript!InvokeDispatch+0x90
0013f2a0 6fe9fab8 jscript!VAR::InvokeByName+0x1c2
0013f2e0 6fe9efea jscript!VAR::InvokeDispName+0x43
0013f304 6fea6ff4 jscript!VAR::InvokeByDispID+0xfd
0013f3bc 6fea165d jscript!CScriptRuntime::Run+0x16bd
0013f3d4 6fea1793 jscript!ScrFncObj::Call+0x8d
0013f444 6fe8da72 jscript!CSession::Execute+0xa7
0013f494 6fe8beba jscript!COleScript::ExecutePendingScripts+0x147
0013f4b0 0100220a jscript!COleScript::SetScriptState+0xf1
0013f4bc 0100217d cscript!CScriptingEngine::Run+0xb
0013f4d0 01001f34 cscript!CHost::RunStandardScript+0x85
0013f708 010027fc cscript!CHost::Execute+0x1f0
0013fcac 010024de cscript!CHost::Main+0x385
0013ff50 010025e6 cscript!main+0x6d
0013ffc0 7c816fd7 cscript!_mainCRTStartup+0xc4
0013fff0 00000000 kernel32!BaseProcessStart+0x23
As the WinHttpRequest object is reading the response from the server side script, it is naively writing it to an internal OLE memory stream (presumably the product of CreateStreamOnHGlobal). As more and more data is received, the memory stream is resized to accommodate it. We can track the stream’s growth pattern with a suitable breakpoint on the SetSize method. It is documented as receiving a ULARGE_INTEGER, an unsigned 64-bit integer, specifying the stream’s desired size. Therefore, we expect the argument to be 8 bytes above the position of the stack pointer when we enter the function, on x86. A 64-bit integer is also known as a quadword. The debugger command is hence:
bp ole32!CMemStm::SetSize ".printf \"SetSize called with size %d\", qwo(@esp + 8); .echo; gc"
If we promptly resume execution and let the script run for a while we see something similar to:
SetSize called with size 10650474
SetSize called with size 10658666
SetSize called with size 10666858
SetSize called with size 10675050
SetSize called with size 10683242
And so on… some quick arithmetic indicates that the numerous SetSize calls are apart by 8,192 bytes.
Resizing the stream by 8KB at a time for a file dozens of megabytes in size is clearly not a winning strategy. As can be seen in the initial stack trace, this results in a heap reallocation frenzy. As contiguous room for block resize on the heap is exhausted, the stream resize must be satisfied by allocating a whole new block and moving all the previous stream data there. As the stream grows, the whole process slows down exponentially. Since the block in question is a big one, it resides in a “virtual block” on the heap (a block of address space provided by VirtualAlloc rather than a smaller one from the heap’s free lists) and is presumably moved whenever there is not enough contiguous address space at the virtual block’s present position to extend it.
Let us examine the heap to confirm this. We can suspend execution at ole32!CMemStm::SetSize and proceed to kernel32!GlobalReAlloc. The first parameter to the latter is the handle to the heap allocation:
Breakpoint 0 hit
eax=00000000 ebx=00000000 ecx=006f5fa4 edx=774eee94 esi=00184790 edi=00002000
eip=77506741 esp=0013f07c ebp=0013f09c iopl=0 nv up ei pl zr na pe nc
cs=001b ss=0023 ds=0023 es=0023 fs=003b gs=0000 efl=00000246
77506741 8bff mov edi,edi
0:000> bp kernel32!GlobalReAlloc
Breakpoint 1 hit
eax=00184d98 ebx=00000000 ecx=006f5fa4 edx=00184a14 esi=00184790 edi=006f5fa4
eip=7c8123b9 esp=0013f060 ebp=0013f078 iopl=0 nv up ei pl nz na pe nc
cs=001b ss=0023 ds=0023 es=0023 fs=003b gs=0000 efl=00000206
7c8123b9 6a24 push 24h
0:000> kb 1
ChildEBP RetAddr Args to Child
0013f05c 77506778 008e000c 006f5fa4 00002002 kernel32!GlobalReAlloc
If we proceed to examine the process default heap (at 0x00150000 in my case) and its virtual allocation list (we can ignore allocations from the linked lists below, they are for much smaller allocations) we can see the interesting block:
0:000> !heap -a 00150000
Index Address Name Debugging options enabled
Segment at 00150000 to 00250000 (00035000 bytes committed)
Segment at 016D0000 to 017D0000 (00001000 bytes committed)
Granularity: 8 bytes
Segment Reserve: 00200000
Segment Commit: 00002000
DeCommit Block Thres:00000200
DeCommit Total Thres:00002000
Total Free Size: 000001ee
Max. Allocation Size:7ffdefff
Lock Variable at: 00150608
Next TagIndex: 0000
Maximum TagIndex: 0000
Tag Entries: 00000000
PsuedoTag Entries: 00000000
Virtual Alloc List: 00150050
017d0000: 006f4000 [6f3fa4] - busy (2b) (Handle 008e000c), user flags (1)
If we let execution proceed for a while and examine the heap again, we see something like:
0:000> !heap -a 00150000
Virtual Alloc List: 00150050
02080000: 008ac000 [8abfa4] - busy (2b) (Handle 008e000c), user flags (1)
Notice the virtual allocation for the global heap handle has moved elsewhere in the interim. Originally at virtual address 0x017d0000, it is now at 0x02080000. Further examination shows that this happens repeatedly. In fact, examining the position of the block between calls of GlobalReAlloc shows how it alternates between a small set of virtual addresses, back and forth, as its size is modified. For a reason unclear without diving into deeper heap and virtual allocation internals, the virtual address block seems to change its location even when it can be extended to the new size at its present position, unlike what I speculated above.
Examination of the server response headers for the static page case and the dynamic case provide an insight to the difference in performance behavior. In the static case, the server, knowing in advance the size of the response, can provide the client with a Content-Length header. The header can be used by the client to allocate a block, huge as it may, in one swoop. Despite the slight memory pressure, this happens practically instantaneously.
In the dynamic case, the server cannot guess how much output the page is going to feel like providing. No Content-Length header is emitted, and the client’s naive stream memory allocation strategy comes to light and kills performance.
While WinHttpRequest’s naive reliance on COM’s memory stream allocation strategy is primarily at fault for this issue, the heap manager’s own deficiencies should not be overlooked. I believe its behavior for large block reallocations (virtual blocks) could be somewhat improved to mitigate inefficient clients.
This issue can be mitigated at the server side by ensuring a Content-Length header is always provided in responses. This can be tricky to calculate and I haven’t looked into whether lying (i.e., providing a Content-Length longer than the actual response just to get the client to preallocate memory) works correctly.
While I was looking into this issue, I was hoping to take advantage of Visual Studio 2005 Team Edition’s integrated profiler. I assumed it would be ideal for detecting and clearly illustrating an obvious CPU hog such as this case. However, after short examination, I must express my disappointment.
Visual Studio 2005’s profiler, like most other profilers, has two operation modes. Sampling and Instrumentation. Sampling means examining thread stacks at an interval and noting current execution state, basically. Instrumentation means binary patching of functions with code that notes execution flow has reached them. Instrumentation is therefore a means of “zooming in” on functions of interest, seeing as it is both more precise and has greater overhead.
I initially tried the Sampling mode. Examining its output in the impressive GUI, Performance Explorer, I noticed the sample didn’t clearly capture what was going on. I figured by instrumenting the functions I was suspicious of (the heap reallocator, etc.) a clear graphical illustration would be possible.
It was then I discovered that the profiler’s Instrumentation mode requires full debugging information. Since I am profiling native code that is not my own running under WSH, clearly all I have is public PDBs. Instrumentation refused to work in this scenario, making it useless for these kinds of cases.
This is too bad because other profilers seem content with instrumenting just exports or functions as they are indicated in public PDBs. I hope to see this as a feature in future versions of the Visual Studio profiler, since I like the slick Performance Explorer.
For those of you implementing memory stream classes and the like, keep in mind that sometimes an imperialistic memory allocation strategy is a good thing.
Think before you act, allocate before you write.