GenAI-Perf benchmark

I benchmarked a DeepSeek 32B Qwen-distilled model and this is the result I got. I sent 2000 requests at a concurrency of 2000. Is the table taking all of the requests into account in its output metrics? The TTFT is 419,950 ms, which is almost 7 minutes.
Should I divide by 2000, and then by 1,000, to get the TTFT in seconds for one request?
That seems far too large.
Could you help explain the results? I used an H200 GPU with tensor parallelism 2.

Hi @tkhurana1,

What service are you using to carry out the benchmarking?

Thanks,

Sophie

I deployed an LLM using NVIDIA NIM and ran the GenAI-Perf Docker container. The benchmarking was done on my local setup; I got this result in the terminal and copied it into an Excel sheet.
Could you explain the results? Why is the TTFT so large? Is it aggregated over all 2000 requests?
Should I divide by 2000 to get the value for one request?

Could you please share the command you used to run genai-perf?

Thanks

You can submit requests much faster than they can be fulfilled, so requests sit in a queue while the clock is running. The Time-To-First-Token (TTFT) clock starts when a query is submitted and stops when the first token of the response is received, so queue time is included in the measurement. Please refer to GenAI-Perf — NVIDIA Triton Inference Server for a description of all the metrics.
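To make the queueing effect concrete, here is a rough sketch (with made-up numbers, not taken from your run) of why the measured TTFT can reach minutes when 2000 requests are submitted at once but the server can only actively process a limited batch at a time:

```python
# Rough queue model (hypothetical numbers): with high concurrency, most
# requests wait in a queue before the server even starts on them, and
# GenAI-Perf's TTFT clock includes that queue time.

def expected_ttft_ms(position, batch, service_ttft_ms, per_request_ms):
    """Requests are served in 'waves' of size `batch`; a request in wave w
    waits for w full waves to complete before its own TTFT clock can stop."""
    wave = position // batch
    return wave * per_request_ms + service_ttft_ms

# With 2000 queued requests, an assumed active batch of ~250, ~95 s per
# request, and a 500 ms service-side TTFT, the last requests wait minutes:
print(expected_ttft_ms(position=1999, batch=250,
                       service_ttft_ms=500, per_request_ms=95_000) / 1000, "s")
```

The exact batch size and per-request time depend on your deployment; the point is that the reported TTFT is a per-request measurement that already includes queue time, not a sum over all 2000 requests.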

Based on the screenshot provided, the fastest request took around 500 ms to return its first token, with an average inter-token latency (ITL) of 80 ms and 1,200 output tokens per request. The expected time for a complete response is therefore around 500 ms + 80 ms × 1,200 = 96,500 ms, or 96.5 s. That closely matches the minimum request latency you're seeing (94,713 ms).
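The back-of-the-envelope calculation above is just:

```python
# Estimate end-to-end latency from TTFT plus per-token generation time.
ttft_ms = 500          # fastest time to first token (from the screenshot)
itl_ms = 80            # average inter-token latency
output_tokens = 1200   # output tokens per request

expected_latency_ms = ttft_ms + itl_ms * output_tokens
print(expected_latency_ms)  # 96500 ms, i.e. ~96.5 s
```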

Running 2000 concurrent requests may be overwhelming your single H200. I would test lower concurrencies, e.g. [1, 2, 4, 8, 16, ...], to see how concurrency affects user experience. TTFT and ITL should both drop, leaving you with a more responsive system.

Doing the above experiment will help you determine what a single H200 is capable of. It may turn out that you need to scale to more GPUs to achieve acceptable performance for more simultaneous requests.
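Once you have sweep results, a simple way to pick an operating point is to find the highest concurrency that still meets a latency target. This is a hypothetical helper with made-up numbers, not output from any real sweep:

```python
# Hypothetical helper: given sweep results mapping concurrency to measured
# (TTFT, ITL) in ms, return the highest concurrency meeting both SLA limits.

def max_concurrency_under_sla(results, ttft_sla_ms, itl_sla_ms):
    ok = [c for c, (ttft, itl) in results.items()
          if ttft <= ttft_sla_ms and itl <= itl_sla_ms]
    return max(ok) if ok else None

# Illustrative (invented) sweep results: concurrency -> (TTFT ms, ITL ms)
sweep = {1: (150, 20), 4: (220, 25), 16: (480, 40),
         64: (2100, 75), 256: (15000, 90)}

print(max_concurrency_under_sla(sweep, ttft_sla_ms=1000, itl_sla_ms=50))  # 16
```

If no concurrency meets the target on one GPU, that is your signal to scale out.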