Nsys Profile VLLM Error

We have deployed llama3-70b using vLLM on two H100 GPUs with tensor parallelism (TP=2), and I would like to profile its execution with nsys-2024.3.1.75-243134195302v0.

I did not add any other nsys configuration parameters besides ‘-o’, and the report was generated normally after the program finished running.

In the final report, only a very small number of CUDA GPU kernels were captured.

This is unexpected: during the test, 10 queries were sent to the server, all of them were executed correctly, and we observed GPU usage through nvidia-smi.

We also observed these potentially related warnings:

Thread count limit is exceeded, not all threads will be shown (thread count: 3762, thread limit: 2000).

CUDA profiling might have not been started correctly.
No CUDA events collected. Does the process use CUDA?

(The second warning appears around 100 times.)

Here is a related vLLM GitHub issue: "Error when using nsys profile" (Issue #3247, vllm-project/vllm).

How should I configure nsys to obtain correct results?

@liuyis can you help with this?

Hi @gnurse, thanks for reaching out. I have a few questions:

  1. Is it possible to share your profiling report?
  2. Is there a way to confirm that your application actually submits more CUDA kernels? For example, is it possible to print a log message every time a CUDA kernel is invoked? This would help confirm whether Nsys is missing any kernels, and knowing which specific kernels are missing would also help debug the actual issue (if any).
  3. Can you also try turning on the GPU metrics sampling feature? You can start with the option --gpu-metrics-device=all. That is another way to confirm GPU usage and whether Nsys is missing CUDA kernels.
  4. Finally, is it possible for us to set up the application on our side to test and debug, or to get access to a system that can run the application?

Thanks,
Liuyi
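
For anyone following along, the GPU metrics suggestion in point 3 could be tried with something like the sketch below. It only assembles and prints the command line; it does not run nsys. The `-o` and `--gpu-metrics-device=all` options come from this thread, while the helper name `build_nsys_command`, the output name `vllm_tp2`, the vLLM launch command, and the model name are assumptions for illustration and should be replaced with whatever you actually run.

```python
import shlex


def build_nsys_command(output, app_cmd, gpu_metrics=True):
    """Return an `nsys profile` command line as a list of arguments.

    Hypothetical helper: it only builds the argument list and never
    launches nsys itself.
    """
    cmd = ["nsys", "profile", "-o", output]
    if gpu_metrics:
        # Option suggested in the reply above: sample GPU metrics on all devices.
        cmd.append("--gpu-metrics-device=all")
    return cmd + list(app_cmd)


# Assumed vLLM launch command; the model name is a placeholder,
# and the original post does not show the exact command used.
server_cmd = [
    "python", "-m", "vllm.entrypoints.openai.api_server",
    "--model", "meta-llama/Meta-Llama-3-70B-Instruct",
    "--tensor-parallel-size", "2",
]

command = build_nsys_command("vllm_tp2", server_cmd)

# Print a shell-safe version that can be copied into a terminal.
print(shlex.join(command))
```

Comparing the GPU metrics rows in the resulting report against the CUDA kernel rows should show whether the GPUs were busy during periods where no kernels were recorded.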