When you get 30 t/s, is that with the configuration shown in step 3? If so, how do you measure it? Does it include the thinking phase plus the actual answer? For example, when I asked “Can you describe the sun in a few short sentences”, it spent 16 seconds thinking and then wrote a two-line answer in 3 seconds. That is basically one line per second.
Thank you, super helpful! I was trying to manually update nvcr.io/nvidia/vllm:26.02-py3 but kept hitting walls, so thanks again. Takes about 4 minutes to launch on my system.
Looks like the nightly build of vllm/vllm-openai:cu130-nightly is running CUDA 13.0.1, but the NVIDIA vLLM image is running CUDA 13.1.1. Any concerns or improvements there?
Your docker run does not include “--kv-cache-dtype fp8”, which I’ve seen others recommend for memory savings. Any thoughts or suggestions there?
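In case it helps anyone trying this, a sketch of where the flag would go in a launch like the one in the guide. The image tag, model name, and port here are placeholders, not the exact values from the guide; `--kv-cache-dtype fp8` itself is a real vLLM server option.

```shell
# Hypothetical launch showing where --kv-cache-dtype goes; substitute your
# own image, model, and port. The flag is passed to the vLLM server, after
# the image name.
docker run --gpus all --ipc=host -p 8000:8000 \
  vllm/vllm-openai:latest \
  --model meta-llama/Llama-3.1-8B-Instruct \
  --kv-cache-dtype fp8
```

Storing the KV cache in FP8 roughly halves its memory footprint versus a 16-bit cache, which lets you fit longer contexts or more concurrent requests, at the cost of a possible small accuracy hit; worth benchmarking on your own workload before committing to it.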
Thank you!