How to use Nsight to analyze a 300ms delay issue in VLLM

An abnormal delay of 300 milliseconds was detected.

While stress testing vLLM at 30 QPS, we found an anomalous 300 ms delay that occurred more frequently as the QPS increased. In Nsight, one thread showed 100% utilization, but its Python stack was empty.

Each request in the experiment had 40 input tokens and 20 output tokens.

This situation is very strange. For more specific experimental data, please see [Performance]: Why does VLLM perform worse than TGI in Speculative decoding? · Issue #7540 · vllm-project/vllm · GitHub

How can I determine where this problem occurs and resolve it?

Unfortunately there really isn’t a way to diagnose this from just a screenshot. It looks to me like you have some CPU-side activity that is blocking the other processes and starving the GPU of work. I would look into the backtraces to see why this section is blocking and determine whether there is a way to do it non-blocking.
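To illustrate what "blocking" versus "non-blocking" means here, below is a minimal Python sketch. It is not taken from vLLM: `cpu_side_work`, `heartbeat`, and `max_heartbeat_gap` are hypothetical names, and the 0.3 s sleep only stands in for whatever CPU-side call is stalling. It measures the largest gap between heartbeats on an asyncio event loop when the work runs directly on the loop and when it is offloaded with `asyncio.to_thread` (Python 3.9+).

```python
import asyncio
import time


def cpu_side_work(duration_s: float) -> None:
    """Hypothetical stand-in for a blocking CPU-side call."""
    time.sleep(duration_s)


async def heartbeat(log: list, interval_s: float, stop: asyncio.Event) -> None:
    """Record a timestamp every interval_s for as long as the loop is free to run us."""
    while not stop.is_set():
        log.append(time.perf_counter())
        await asyncio.sleep(interval_s)


async def max_heartbeat_gap(blocking: bool, work_s: float = 0.3,
                            interval_s: float = 0.01) -> float:
    """Return the largest gap (seconds) between heartbeats while the work runs."""
    log: list = []
    stop = asyncio.Event()
    hb = asyncio.create_task(heartbeat(log, interval_s, stop))
    await asyncio.sleep(interval_s * 3)      # let the heartbeat tick a few times
    if blocking:
        cpu_side_work(work_s)                # runs on the loop thread: nothing else can run
    else:
        await asyncio.to_thread(cpu_side_work, work_s)  # loop stays responsive
    await asyncio.sleep(interval_s * 3)      # let the heartbeat resume
    stop.set()
    await hb
    return max(b - a for a, b in zip(log, log[1:]))


if __name__ == "__main__":
    print("blocking  :", asyncio.run(max_heartbeat_gap(blocking=True)))
    print("offloaded :", asyncio.run(max_heartbeat_gap(blocking=False)))
```

In the blocking case the heartbeat gap is at least as long as the work itself, which is the same shape as a stall in the timeline. Note that offloading to a thread only helps when the blocking call releases the GIL (as `time.sleep` and most I/O do); pure-Python CPU work would need a separate process instead.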

Thank you for your reply. Here is an nsys file; please take a look at the issue. I really appreciate it.

There are a whole lot of epoll_wait calls blocking the OS, and no other work is happening, both in this section of the code and in another bare stretch at the end.
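For context on how to read those epoll_wait ranges: a thread sitting in epoll_wait is idle, waiting for an event, rather than doing work. The small Python sketch below shows this under stated assumptions. `idle_wait_profile` is a hypothetical helper, not part of vLLM or Nsight; it waits on a socket that never becomes readable and compares wall-clock time with the thread's CPU time. On Linux, `selectors.DefaultSelector` is backed by epoll.

```python
import selectors
import socket
import time


def idle_wait_profile(timeout_s: float = 0.2):
    """Wait on a socket that never becomes readable.

    Returns (wall_seconds, thread_cpu_seconds, number_of_events).
    """
    reader, writer = socket.socketpair()
    sel = selectors.DefaultSelector()        # EpollSelector on Linux
    sel.register(reader, selectors.EVENT_READ)
    try:
        wall_start = time.perf_counter()
        cpu_start = time.thread_time()
        events = sel.select(timeout=timeout_s)   # blocks until an event or the timeout
        cpu_s = time.thread_time() - cpu_start
        wall_s = time.perf_counter() - wall_start
    finally:
        sel.close()
        reader.close()
        writer.close()
    return wall_s, cpu_s, len(events)


if __name__ == "__main__":
    wall_s, cpu_s, n_events = idle_wait_profile()
    print(f"wall={wall_s:.3f}s cpu={cpu_s:.3f}s events={n_events}")
```

The wall time is roughly the timeout while the CPU time stays near zero, so long epoll_wait stretches usually mean those threads are waiting on something else, such as the one thread that is pegged at 100%.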