We have deployed llama3-70b using vLLM on two H100 cards with tensor parallelism (TP=2), and I would like to profile its execution with nsys-2024.3.1.75-243134195302v0.
I did not add any other nsys configuration parameters besides ‘-o’, and the report was generated normally after the program finished running.
In the final report, only a very small number of CUDA GPU kernels were captured, as follows:
This is unreasonable: during the test, 10 queries were sent to the server, all of them were executed correctly, and we observed GPU usage through nvidia-smi.
We also observed these potentially related warnings:
Thread count limit is exceeded, not all threads will be shown (thread count: 3762, thread limit: 2000).
CUDA profiling might have not been started correctly.
No CUDA events collected. Does the process use CUDA?
(The second warning appears around 100 times.)
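To put exact numbers on how often each warning repeats, something like the following sketch could be used. It assumes the nsys console output was saved to a text file. The helper name `count_warnings` is only illustrative and is not part of nsys; the warning strings are the ones quoted above.

```python
from collections import Counter

# Warning texts copied from the nsys output quoted above.
# The first one is matched by prefix because the reported counts vary.
KNOWN_WARNINGS = (
    "Thread count limit is exceeded",
    "CUDA profiling might have not been started correctly.",
    "No CUDA events collected. Does the process use CUDA?",
)


def count_warnings(log_text: str) -> Counter:
    """Count how many lines of log_text contain each known warning."""
    counts = Counter()
    for line in log_text.splitlines():
        for warning in KNOWN_WARNINGS:
            if warning in line:
                counts[warning] += 1
    return counts
```

Reading the saved output with `open(path).read()` and passing it to `count_warnings` gives one count per warning, which makes it easier to report how many processes or threads failed to start CUDA profiling.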
Here is a related vLLM GitHub issue: Error when using nsys profile · Issue #3247 · vllm-project/vllm
How should I use or configure nsys to obtain correct results?