nsys CUDA trace works for threads, but not for subprocesses

Running

nsys profile --trace=cuda,nvtx,osrt -bfp <target> <args>

produces a .qdrep file which I can view in Nsight Systems.

This works fine when the process uses CUDA from two threads: the GPU node shows up and I can view the CUDA kernel profile info.

Running the same experiment with subprocesses instead of threads yields these warnings from Nsight Systems:

CUDA profiling might have not been started correctly.
Warning	Analysis	*box (:0:1:24482)	00:00.196	
CUDA profiling might have not been started correctly.
Warning	Analysis	*box (:0:1:24482)	00:00.196	
Zero CUDA events were collected. Does the application use CUDA?

According to the documentation, the profiler should trace subprocesses by default.

I am seeing similar behaviour when profiling from the Nsight Systems GUI on a host: the CUDA profiling doesn’t seem to take effect.

Is this expected behaviour?

(This is on Ubuntu 18.04 with CUDA 10.1, Nsight Systems version 2019.3.1)


Did the base process exit before the child processes had finished?

I ask because the default behavior is to end the profiling session when the launched application process exits. If that initial process ended before CUDA events were flushed from the child processes, you would see the “Zero CUDA events…” warning.
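To make that concrete, here is a minimal sketch of a parent that stays alive until all of its children have exited. It is only an illustration, not the original poster's launcher: `CHILD_CMD` and `run_children` are hypothetical names, and the child command is a placeholder that sleeps instead of running a CUDA workload.

```python
import subprocess
import sys

# Hypothetical stand-in for the real CUDA worker command.
CHILD_CMD = [sys.executable, "-c", "import time; time.sleep(0.2)"]


def run_children(cmd, count=2):
    """Start `count` child processes and block until every one has exited."""
    children = [subprocess.Popen(cmd) for _ in range(count)]
    # wait() blocks until the child exits and returns its exit code,
    # so the parent cannot finish before any of its children.
    return [child.wait() for child in children]


if __name__ == "__main__":
    exit_codes = run_children(CHILD_CMD)
    print("child exit codes:", exit_codes)
```

If the parent is launched under the profiler, waiting like this keeps the initial process running for as long as the children are doing work.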

When I run this experiment with the CUDA processes held back for 5 seconds, the whole run takes 7 seconds, so it looks like the parent process is alive and well for the duration. But now I see profiling data for the CUDA processes, so that is good!
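For anyone who wants to try the same thing, holding the work back can be sketched roughly like this. This is an assumed reconstruction, not the exact code from the experiment: `run_after_delay` is a hypothetical helper, and `work` is a placeholder for whatever function launches the CUDA workload in the child process.

```python
import time


def run_after_delay(work, delay_s=5.0):
    """Sleep for `delay_s` seconds, then call `work` and time it.

    Returns a tuple (result of work(), seconds spent inside work()).
    """
    # Hold the workload back so it starts well after process launch.
    time.sleep(delay_s)
    start = time.monotonic()
    result = work()
    return result, time.monotonic() - start


if __name__ == "__main__":
    # Placeholder workload; the real one would launch CUDA kernels.
    result, elapsed = run_after_delay(lambda: sum(range(1000)), delay_s=5.0)
    print("result:", result, "work time (s):", round(elapsed, 3))
```

The 5 second default only mirrors the delay used above; a shorter delay may be enough, but that would need to be checked experimentally.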

Could it be that the CUDA processing finished before the profiling got going? (The CUDA workload was a bit short, about 250 ms.)

That is entirely possible.