I’m profiling a PyTorch application in the simplest way possible: nsys profile python app.py.
The total runtime on a single GPU is about 25 minutes. In the resulting report file, processed with nsys stats report1.nsys-rep, I see the following entry for the time spent in cudaLaunchKernel:
Time (%)  Total Time (ns)  Num Calls  Avg (ns)  Med (ns)  Min (ns)  Max (ns)   StdDev (ns)  Name
--------  ---------------  ---------  --------  --------  --------  ---------  -----------  ----------------
    81.8    1262651546752   13432820   93997.5    3206.0      2415  149953081     442752.6  cudaLaunchKernel
[...]
According to this, about 1262 seconds are spent in cudaLaunchKernel, which seems unrealistic relative to the entire runtime of about 1500 seconds. As far as I understand it, cudaLaunchKernel should not include the execution time of the kernels themselves, but only the time it takes to submit them to the device.
A second question is how these numbers change when more than one GPU is used. I made the same profile with four GPUs, but the number of calls only increases slightly, whereas I would expect about four times as many kernel launches. Does the way nsys attaches to the application distinguish between different GPUs in any way?
I would recommend opening your .nsys-rep file in the GUI and looking at the timeline. In particular, I wonder whether the host->device memory transfers are well optimized, which is easier to see in the timeline than in the stats output.
I’m also a little unclear on what you mean by more than one GPU here. If it is the same program being run on more GPUs, it should still be the same amount of work being done, and basically the same number of kernels, modulo whatever extra is needed for splitting up the work. If you mean that you are now running multiple copies of the work, i.e. the full workload on each GPU, are you launching them all inside nsys?
I already tried to investigate the timeline in the GUI, but it is too large for my RAM. I guess I need to figure out how to reduce the problem size in order to get shorter runs. I’m running on a Grace Hopper system, which might give hints about the influence of memory transfers.
I don’t really understand why the optimization of the memory transfers would influence the time that it takes for the kernels to be launched.
The application uses torch.multiprocessing to distribute the work across the GPUs. I thought about your point on how the kernel calls are distributed, and you are right: each GPU needs to launch fewer kernels for each training batch.
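For context, the multi-GPU setup looks roughly like the sketch below (simplified; the worker body is a placeholder for the real training code, and as far as I can tell nsys also follows the spawned child processes):

import torch
import torch.multiprocessing as mp

def worker(rank, world_size):
    # One process per GPU; each rank only sees its shard of the batches,
    # so each process issues roughly 1/world_size of the kernel launches.
    torch.cuda.set_device(rank)
    # ... model setup and training loop for this rank (placeholder) ...

if __name__ == "__main__":
    world_size = torch.cuda.device_count()
    mp.spawn(worker, args=(world_size,), nprocs=world_size)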
You are running for ~25 minutes. Can you use a delayed start, a duration limit, or NVTX ranges/cudaProfilerStart/Stop to limit the run (see the sketch below)? You really only need an example iteration or two.
If you are really focused mostly on the CUDA usage, you might want to remove the CPU sampling.
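For a PyTorch application, one way to capture just a couple of iterations is to bracket them with torch.cuda.profiler (which maps to cudaProfilerStart/Stop) and optionally NVTX ranges, then tell nsys to only record that range. A rough sketch, assuming a recent nsys version; WARMUP_STEPS, PROFILE_STEPS and train_step are placeholders for your own loop:

import torch

WARMUP_STEPS = 10       # skip start-up and warm-up iterations
PROFILE_STEPS = 2       # an example iteration or two is enough

def train_step(step):
    # placeholder for the real per-batch work
    pass

# Launch with something like:
#   nsys profile --capture-range=cudaProfilerApi --capture-range-end=stop python app.py
for step in range(WARMUP_STEPS + PROFILE_STEPS + 1):
    if step == WARMUP_STEPS:
        torch.cuda.profiler.start()               # cudaProfilerStart()
    torch.cuda.nvtx.range_push(f"step {step}")    # optional NVTX annotation
    train_step(step)
    torch.cuda.nvtx.range_pop()
    if step == WARMUP_STEPS + PROFILE_STEPS:
        torch.cuda.profiler.stop()                # cudaProfilerStop()

With the capture range restricted to the profiler start/stop calls, only those iterations are recorded, which keeps the report file small.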
Thanks, with your hint I was able to generate a considerably smaller output file of about 60 MB, which can easily be displayed in the viewer.
In the profile I see chunks of identical blocks consisting of various kernels, with cudaMemcpyAsync calls in between them. My assumption is now that the inflated accumulated time of cudaLaunchKernel is, to a certain degree, made up of time kernels spend waiting for CUDA streams. Is there a way to extract that wait time from the profile?
My original goal was to estimate the overhead incurred by kernel launches. I made a test consisting of a loop calling a kernel which does nothing:
__global__ void empty_kernel() {}   // kernel body is empty

empty_kernel<<<1,1>>>();            // warm-up launch (initializes the CUDA runtime)

long long t1 = get_nanoseconds();
for (int i = 0; i < N; i++) {
    empty_kernel<<<1,1>>>();
}
long long t2 = get_nanoseconds();
I need to do one kernel call beforehand because the first launch comes with a larger overhead, since it needs to initialize the CUDA runtime. With this test, I measure about 2 µs per call.
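For comparison, a similar measurement can be sketched from the PyTorch side (this is not the CUDA test above; it times a tiny in-place op, so the number also includes Python/dispatch overhead, and launches can additionally block if the GPU falls behind):

import time
import torch

N = 100_000
x = torch.zeros(1, device="cuda")

x.add_(0)                      # warm-up: the first launch also initializes the CUDA context
torch.cuda.synchronize()

t1 = time.perf_counter_ns()
for _ in range(N):
    x.add_(0)                  # tiny element-wise kernel; launch/dispatch cost dominates
t2 = time.perf_counter_ns()

print(f"{(t2 - t1) / N:.0f} ns per call")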
I’m going to ask @jyi to answer the wait time question.
I am also going to recommend that you run the expert system rule for async memory and make sure that you aren’t winding up with synchronous calls. You can find this rule in the GUI by using the drop-down in the event view pane at the bottom of the screen.
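On the PyTorch side, what that rule typically flags are cudaMemcpyAsync calls on pageable host memory, which cannot run fully asynchronously. A minimal sketch of the pattern that avoids this (tensor names are illustrative):

import torch

# A host->device copy from pageable memory is staged through a pinned buffer,
# so it is not fully asynchronous; pinning the source allows a truly async copy.
host_batch = torch.randn(1024, 1024).pin_memory()
device_batch = host_batch.to("cuda", non_blocking=True)

# In a DataLoader this corresponds to pin_memory=True together with
# batch.to(device, non_blocking=True) in the training loop.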
Is there a way to extract that wait time from the profile?
I’m not sure which specific wait time you’re referring to, but the only one currently available in the stats system is the queue wait time between the launch call and the kernel execution. You can check nsys stats --help-reports=cuda_kern_exec_trace for the column descriptions.
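For example, applied to the report from the original post, nsys stats --report cuda_kern_exec_trace report1.nsys-rep should produce a per-kernel trace in which the queue wait between the launch call and the kernel execution appears alongside the launch and execution times.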