Missing CUDA runtime events from nsys report

I’m using nsys (2025.2) to profile an application within apptainer (a containerization framework).

I can see that the GPU clock, DRAM bandwidth and other key metrics are being collected but there is no CUDA kernel information being tracked. What could be the reason for this?

My command line is as follows:

nsys profile \
  --cpu-core-metrics=0,2 \
  --gpu-metrics-devices=all \
  --cuda-um-cpu-page-faults=true \
  --cuda-um-gpu-page-faults=true \
  --event-sample=system-wide \
  -- \
  python benchmark_serving.py \
    --backend vllm \
    --dataset-name sharegpt \
    --dataset-path ./ShareGPT_V3_unfiltered_cleaned_split.json \
    --model neuralmagic/Meta-Llama-3.1-70B-Instruct-FP8 \
    --num-prompts 1000 \
    --endpoint /v1/completions \
    --tokenizer neuralmagic/Meta-Llama-3.1-70B-Instruct-FP8 \
    --save-result

@liuyis, can you take a look at this?

Hi @rajeshshashikumar, you mentioned apptainer (a containerization framework). Does that mean a container is being spawned and the actual application runs inside the container? If so, Nsys needs to run inside the container as well in order to get CUDA trace data, because Nsys needs to inject into the target application process to collect that data.

If that’s not the case, could you share the report file?

@liuyis, yes, I am running nsys inside the container, not from the outside. I am still not able to see any CUDA trace data. Here’s the setup file for the container

Here is the attached nsys-report file
report2.nsys-rep.zip (75.8 MB)

Thanks for sharing the report. From the report, I can see that there were CUDA activities happening in process 3349650. However, this process was not launched by Nsys. The process launched by Nsys was PID 3350175, marked in green on the timeline.

In order for Nsys to capture the CUDA API & Kernel traces from a process, the process has to be launched by Nsys, because Nsys needs to inject it at launch time.

I assume process 3349650 is a background server process, and 3350175 sends commands to it that trigger the CUDA workload there. Is there a way you can launch the background process 3349650 with Nsys as well?

Doesn’t the --event-sample=system-wide flag above tell Nsys to capture all system activity?

Thank you, I will try to do that. But is there a way to attach nsys to a specific PID? I could not find that in the documentation

Not really. This flag only enables the “event sampling” feature, which indirectly enables the “CPU sampling” feature, and that’s why you can see the background Python process and its callstack in my screenshot. However, trace features like CUDA trace and OSRT trace have no system-wide support, so the process has to be launched by Nsys.

But is there a way to attach nsys to a specific PID? I could not find that in the documentation

Nsys does not support attaching to running processes. You’ll need to launch the process through Nsys, i.e. something like nsys profile --trace=osrt,cuda <the background app that runs the CUDA workload>
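To make that concrete, here is a minimal sketch of wrapping the server launch in nsys. The thread never shows how the background server was started, so the `vllm serve ...` command below is an assumption, and `server_report` is a placeholder output name; substitute whatever actually starts the process that owns the GPU work. The script only prints the command unless `RUN=1` is set.

```shell
#!/bin/sh
# Sketch: profile the process that actually runs the CUDA workload (the server),
# not the benchmark client that only sends requests to it.

# ASSUMPTION: the server is started with "vllm serve <model>". Replace this with
# the real launch command for your background process.
SERVER_CMD=${SERVER_CMD:-"vllm serve neuralmagic/Meta-Llama-3.1-70B-Instruct-FP8"}

# CUDA and OS runtime tracing require nsys to launch the target itself.
# "server_report" is a placeholder report name.
NSYS_CMD="nsys profile --trace=cuda,osrt --output=server_report $SERVER_CMD"

if [ "${RUN:-0}" = "1" ]; then
  # Run inside the container, in the same environment the server normally uses.
  $NSYS_CMD
else
  # Dry run: show what would be executed.
  echo "$NSYS_CMD"
fi
```

Once the server is up under nsys, the benchmark client (benchmark_serving.py) can be run from a second shell without nsys; the CUDA API calls and kernels should then show up under the server process in the server's report.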
