I'm not sure what you're talking about. I constantly run over 500k and the prefix cache hit rate is 97%, and it rocks at 30-35 t/s. After the initial ramp-up I add maybe 100-500 new tokens per iteration on average, and TTFT is instant. Prefix cache works totally fine. I'm not knowledgeable about the inner workings of “mainstream” models, but the models I did test do not require 100 GB per 1M. The worst offender is Qwen 3.6 27, at about 36 GB per 1M tokens in q8. Nemotron Super and Cascade are very small, 10 GB per 1M. Qwen 3.5 122b is similar, about 14 GB per 1M. And so on.
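For anyone who wants to sanity-check per-1M figures like these, here is a rough sketch of the usual KV cache arithmetic. The layer count, KV head count and head size below are made-up illustration values, not the real config of any model named above, and the formula assumes every counted layer keeps a plain per-token K and V entry.

```python
def kv_bytes_per_token(kv_layers, kv_heads, head_dim, bytes_per_elem):
    """Bytes of KV cache one token occupies.

    The factor 2 covers the separate K and V tensors. kv_layers should
    count only layers that actually store a per-token KV entry.
    """
    return 2 * kv_layers * kv_heads * head_dim * bytes_per_elem


def kv_gb_per_million_tokens(kv_layers, kv_heads, head_dim, bytes_per_elem):
    """KV cache size for 1M tokens, in decimal GB (1e9 bytes)."""
    per_token = kv_bytes_per_token(kv_layers, kv_heads, head_dim, bytes_per_elem)
    return per_token * 1_000_000 / 1e9


# Hypothetical config: 16 KV-bearing layers, 4 KV heads, head_dim 256,
# 1 byte per element (an 8-bit cache). Not taken from any real model card.
print(kv_bytes_per_token(16, 4, 256, 1))        # bytes per token
print(kv_gb_per_million_tokens(16, 4, 256, 1))  # GB per 1M tokens
```

With numbers in that range you land in the tens of GB per 1M tokens, and a model with fewer KV-bearing layers or fewer KV heads would come in proportionally lower.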
rate: 97.5%
(APIServer pid=1) INFO 06-16 14:11:31 [metrics.py:101] SpecDecoding metrics: Mean acceptance length: 2.64, Accepted throughput: 25.00 tokens/s, Drafted throughput: 30.40 tokens/s, Accepted: 250 tokens, Drafted: 304 tokens, Per-position acceptance rate: 0.974, 0.671, Avg Draft acceptance rate: 82.2%
(APIServer pid=1) INFO 06-16 14:11:41 [loggers.py:271] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 39.4 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 97.3%, Prefix cache hit rate: 97.5%
(APIServer pid=1) INFO 06-16 14:11:41 [metrics.py:101] SpecDecoding metrics: Mean acceptance length: 2.49, Accepted throughput: 23.60 tokens/s, Drafted throughput: 31.60 tokens/s, Accepted: 236 tokens, Drafted: 316 tokens, Per-position acceptance rate: 0.886, 0.608, Avg Draft acceptance rate: 74.7%
(APIServer pid=1) INFO 06-16 14:11:51 [loggers.py:271] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 38.8 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 97.3%, Prefix cache hit rate: 97.5%
(APIServer pid=1) INFO 06-16 14:11:51 [metrics.py:101] SpecDecoding metrics: Mean acceptance length: 2.46, Accepted throughput: 23.00 tokens/s, Drafted throughput: 31.60 tokens/s, Accepted: 230 tokens, Drafted: 316 tokens, Per-position acceptance rate: 0.892, 0.563, Avg Draft acceptance rate: 72.8%
(APIServer pid=1) INFO: 192.168.1.2:64972 - "POST /v1/chat/completions HTTP/1.1" 200 OK
(APIServer pid=1) INFO 06-16 14:12:01 [loggers.py:271] Engine 000: Avg prompt throughput: 249.1 tokens/s, Avg generation throughput: 23.1 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 95.3%, Prefix cache hit rate: 97.5%
(APIServer pid=1) INFO 06-16 14:12:01 [metrics.py:101] SpecDecoding metrics: Mean acceptance length: 2.41, Accepted throughput: 13.50 tokens/s, Drafted throughput: 19.20 tokens/s, Accepted: 135 tokens, Drafted: 192 tokens, Per-position acceptance rate: 0.854, 0.552, Avg Draft acceptance rate: 70.3%
(APIServer pid=1) INFO 06-16 14:12:11 [loggers.py:271] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 40.7 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 95.1%, Prefix cache hit rate: 97.5%
(APIServer pid=1) INFO 06-16 14:12:11 [metrics.py:101] SpecDecoding metrics: Mean acceptance length: 2.65, Accepted throughput: 25.40 tokens/s, Drafted throughput: 30.80 tokens/s, Accepted: 254 tokens, Drafted: 308 tokens, Per-position acceptance rate: 0.974, 0.675, Avg Draft acceptance rate: 82.5%
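The SpecDecoding numbers in the log can be cross-checked with a small sketch like the one below. It assumes each draft proposes as many tokens as there are per-position entries (two in this log) and that mean acceptance length is 1 plus accepted tokens per draft; that reading matches the logged values, but treat it as my interpretation. The function name `spec_decoding_summary` is made up for the example.

```python
import re

# One SpecDecoding line copied from the log above (prefix trimmed).
LINE = (
    "SpecDecoding metrics: Mean acceptance length: 2.64, "
    "Accepted throughput: 25.00 tokens/s, Drafted throughput: 30.40 tokens/s, "
    "Accepted: 250 tokens, Drafted: 304 tokens, "
    "Per-position acceptance rate: 0.974, 0.671, Avg Draft acceptance rate: 82.2%"
)


def spec_decoding_summary(line):
    """Recompute acceptance rate and mean acceptance length from raw counts."""
    accepted = int(re.search(r"Accepted: (\d+) tokens", line).group(1))
    drafted = int(re.search(r"Drafted: (\d+) tokens", line).group(1))
    per_pos = re.search(
        r"Per-position acceptance rate: ([\d., ]+?), Avg", line
    ).group(1)
    positions = len(per_pos.split(", "))  # draft tokens proposed per step (assumed)
    num_drafts = drafted / positions
    return {
        "positions": positions,
        "acceptance_rate_pct": 100 * accepted / drafted,
        "mean_acceptance_length": 1 + accepted / num_drafts,
    }


summary = spec_decoding_summary(LINE)
print(summary)
```

For the line above this gives roughly 82.2% and 2.64, the same values the server printed.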
After two days of heavy use, with the agent constantly churning through the data, KV cache usage got to 97% with no slowdown. It has since come down and is holding around 95%.