@davwu If I configure the recipe with the maximum context size (1048576) and tell vLLM to use 0.9 of my RAM (--gpu-memory-utilization=0.9), it seems to run stably and reports room for just short of 4 concurrent requests at full context length:
(APIServer pid=84) INFO 05-17 13:42:09 [model.py:1697] Using max model len 1048576
[snip]
(Worker_TP0_EP0 pid=207) INFO 05-17 13:46:34 [gpu_model_runner.py:6246] Estimated CUDA graph memory: 0.70 GiB total
(Worker_TP0_EP0 pid=207) INFO 05-17 13:46:34 [gpu_worker.py:462] Available KV cache memory: 27.93 GiB
(Worker_TP0_EP0 pid=207) INFO 05-17 13:46:34 [gpu_worker.py:477] CUDA graph memory profiling is enabled (default since v0.21.0). The current --gpu-memory-utilization=0.9000 is equivalent to --gpu-memory-utilization=0.8943 without CUDA graph memory profiling. To maintain the same effective KV cache size as before, increase --gpu-memory-utilization to 0.9057. To disable, set VLLM_MEMORY_PROFILER_ESTIMATE_CUDAGRAPHS=0.
(EngineCore pid=157) INFO 05-17 13:46:35 [kv_cache_utils.py:1710] GPU KV cache size: 4,093,302 tokens
(EngineCore pid=157) INFO 05-17 13:46:35 [kv_cache_utils.py:1711] Maximum concurrency for 1,048,576 tokens per request: 3.90x
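For anyone wondering where the 3.90x comes from, it looks like it is simply the KV cache size in tokens divided by the maximum model length. Here is a minimal sketch of that arithmetic using the numbers from the log above. The variable and function names are my own, not vLLM's, and the bytes-per-token figure is only a rough average that assumes the whole 27.93 GiB is spread evenly over the reported token count:

```python
# Values copied from the vLLM log lines above.
KV_CACHE_TOKENS = 4_093_302   # "GPU KV cache size"
MAX_MODEL_LEN = 1_048_576     # "Using max model len"
KV_CACHE_GIB = 27.93          # "Available KV cache memory"


def max_concurrency(kv_cache_tokens: int, max_model_len: int) -> float:
    """How many full-length requests fit in the KV cache at once."""
    return kv_cache_tokens / max_model_len


concurrency = max_concurrency(KV_CACHE_TOKENS, MAX_MODEL_LEN)
print(f"Maximum concurrency: {concurrency:.2f}x")  # prints "Maximum concurrency: 3.90x"

# Rough average KV cache cost per token (an assumption: even spread over all tokens).
avg_bytes_per_token = KV_CACHE_GIB * 2**30 / KV_CACHE_TOKENS
print(f"Average KV cache per token: {avg_bytes_per_token / 1024:.2f} KiB")
```

Requests that use less than the full 1,048,576 tokens take a smaller share of the cache, so in practice more than 3 can run side by side.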
Here they are while running the Inspect Evals tool in this configuration:
It seems to be stable, although it’s non-trivial to actually exercise such a large context!
Let me know if you’d like me to run any other tests. Thanks for the question; I need to update my blog post on a couple of technical details.

