CUDA HW stops tracing CUDA activities while SM activities are still being recorded

I’m profiling a multi-threaded program that runs inference on three models in three separate threads, and I noticed something odd in Nsight Systems.

In the timeline view, the CUDA HW trace suddenly ends (no more kernel launches are shown), but the SM Active / Warp Occupancy / DRAM Bandwidth metrics stay high for quite a while afterward, as if the GPU were still running kernels.

I noticed the following warning as well:

Here is a link to my nsys report: https://drive.google.com/file/d/1ZJPsIGaC5NnJGlDF8viZ0y3OecLscdmg/view?usp=drive_link

I’m curious why this is happening, and how I can trace all of the CUDA API calls during the whole run.

Any suggestion would be appreciated. :-)

@liuyis can you help?

Hi @goool.yang98, I don’t seem to have permission to view the file on Google Drive; could you relax the permission settings?

One frequent reason for the symptom you are observing is the application being terminated forcibly. Nsys injects into the target app, and profiling data is stored temporarily in buffers inside the target process; if the app is terminated forcibly, those buffers may never get a chance to be flushed out.

One thing you can try is adding --duration=168 to your Nsys CLI command line (the screenshot shows the collection ran longer than 168s; adjust the value if yours differs). That way, Nsys will stop the collection and flush the buffers at 168s, before the app is terminated.
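For example (a sketch; the app name and output name are placeholders, and `--duration` is given in seconds):

```shell
# Stop the collection after 168 seconds and flush the trace buffers,
# even if the application itself keeps running afterwards:
nsys profile --duration=168 -o report ./my_app
```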

Hi @liuyis, thanks for the reply! It works after I added the --duration parameter. There are still two things that confuse me:

  1. I used `--cuda-flush-interval` before, trying to make sure the temporarily stored data was flushed out, but it didn’t work; the CUDA API records were still missing. Am I using `--cuda-flush-interval` in the wrong way?
  2. The application runs interactively, so I have to terminate it forcibly. Since the application’s duration is unpredictable, is there any way to make Nsys trace the whole process from beginning to end?

Hi @goool.yang98, `--cuda-flush-interval` will not force buffers to be saved when they are not full; it only defers buffer flushing until the interval expires, even if the buffers fill up earlier. So it will not help here. See:

--cuda-flush-interval=

   Set the interval, in milliseconds, when buffered CUDA data is automatically saved to
   storage. CUDA data buffer saves may cause profiler overhead. Buffer save behavior can be
   controlled with this switch.

   If the CUDA flush interval is set to 0 on systems running CUDA 11.0 or newer, buffers are
   saved when they fill. If a flush interval is set to a non-zero value on such systems,
   buffers are saved only when the flush interval expires. If a flush interval is set and the
   profiler runs out of available buffers before the flush interval expires, additional buffers
   will be allocated as needed. In this case, setting a flush interval can reduce buffer
   save overhead but increase memory use by the profiler.
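To illustrate the two behaviors described above (a sketch; the app name is a placeholder):

```shell
# Interval of 0 (default on CUDA 11.0+): buffers are saved
# to storage whenever they fill up:
nsys profile --cuda-flush-interval=0 ./my_app

# Non-zero interval: buffers are saved only every 10 seconds;
# if they fill before the interval expires, Nsys allocates
# additional buffers instead of saving early:
nsys profile --cuda-flush-interval=10000 ./my_app
```

In both cases the flush is driven by the profiler while the app is alive, which is why neither setting helps when the process is killed before a flush can happen.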

> The application runs interactively, so I have to terminate it forcibly. Since the application’s duration is unpredictable, is there any way to make Nsys trace the whole process from beginning to end?

One potential option is to find the session ID and use `nsys stop` to manually stop the session before you forcibly terminate the app, i.e.:

  1. `nsys profile xxx`
  2. `nsys sessions list` to find out the session ID of the profiling session you started
  3. `nsys stop --session=<id>` to stop the session, and wait for the report to be generated
  4. Then you can forcibly terminate the app
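Putting the steps above together, a terminal session might look like this (a sketch; the app name, output name, and session ID are placeholders):

```shell
# Terminal 1: start profiling the interactive app
nsys profile -o report ./my_app

# Terminal 2: list active sessions to find the session ID
nsys sessions list

# Terminal 2: stop that session (substitute the real ID),
# then wait for the report file to finish writing
nsys stop --session=<id>

# Now it is safe to forcibly terminate the app.
```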

Does this work for your case?