Focused profiling with nvprof not working?

pyotr777 · April 17, 2020, 7:04am

I use cudaProfilerStart() and cudaProfilerStop() to select regions for profiling with nvprof. However, the output still includes kernels that are launched outside the profiling region.

A fragment of my code:

  // Forward
  LOG(INFO) << "Initialisation FWD";
  dnnmark.SetupWorkspaces(0);
  if (FLAGS_warmup) {
    for (int i = 0; i < FLAGS_warmup; i++) {
      LOG(INFO) << "Warming up...";
      dnnmark.Forward();
    }
  }

  cudaProfilerStart();

  dnnmark.GetTimer()->Clear();
  // Real benchmark
  for (int i = 0; i < FLAGS_iterations; i++) {
    LOG(INFO) << "Iteration " << i;
    dnnmark.Forward();
  }
  dnnmark.GetTimer()->SumRecords();

  cudaProfilerStop();

My code has a warmup part and a benchmarking part.
There is a CUDA kernel (maxwell_gcgemm_32x32_nt) that is launched once per forward pass. (From nsys traces I know it is launched only once during a forward pass, from the ForwardConvolution function, and is never launched in the backward pass.)

So I expect the number of calls to this kernel to equal the number of forward passes in the benchmarking part. However, the reported number of calls is always the number of warmup passes plus the number of benchmark passes.

Moreover, if I remove cudaProfilerStart() and cudaProfilerStop() from around the forward pass and leave them only around the backward pass (not shown in the fragment above), I still see the same number of calls to the kernel, although it should be 0.
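To make the expected count explicit, here is a minimal host-side sketch that mirrors the loop structure of the fragment above without doing any GPU work. RegionCounter and SimulateRun are hypothetical names used only for illustration; they are not part of DNNMark or the CUDA API. In the real program, Start() and Stop() would sit next to cudaProfilerStart() and cudaProfilerStop().

```cpp
#include <cassert>

// Hypothetical bookkeeping helper (not part of DNNMark or CUDA).
// It tracks whether the profiling window is open and counts how many
// forward passes happen inside and outside of that window.
class RegionCounter {
 public:
  // Call next to cudaProfilerStart().
  void Start() {
    assert(!active_);
    active_ = true;
  }
  // Call next to cudaProfilerStop().
  void Stop() {
    assert(active_);
    active_ = false;
  }
  // Call once per forward pass.
  void RecordForward() {
    if (active_) {
      ++inside_;
    } else {
      ++outside_;
    }
  }
  int inside() const { return inside_; }
  int outside() const { return outside_; }

 private:
  bool active_ = false;
  int inside_ = 0;
  int outside_ = 0;
};

// Replays the warmup loop, then the benchmark loop inside the window.
inline RegionCounter SimulateRun(int warmup, int iterations) {
  RegionCounter counter;
  for (int i = 0; i < warmup; i++) {
    counter.RecordForward();  // warmup pass, window closed
  }
  counter.Start();
  for (int i = 0; i < iterations; i++) {
    counter.RecordForward();  // benchmark pass, window open
  }
  counter.Stop();
  return counter;
}
```

For a run with 10 warmup passes and 10 iterations, inside() is 10 and outside() is 10. A Calls value equal to their sum means the warmup launches were counted as well.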

The command I run and a part of the output:

$ nvprof --profile-api-trace none --unified-memory-profiling off --profile-from-start off ./build/benchmarks/test_composed_model/dnnmark_test_composed_model -config conf_tmp.dnnmark --warmup 10 --iterations 10 --debuginfo 0
==4677== Warning: Profiling results might be incorrect with current version of nvcc compiler used to compile cuda app. Compile with nvcc compiler 9.0 or later version to get correct profiling results. Ignore this warning if code is already compiled with the recommended nvcc version
...
Total running time(ms): 11174.345703
==4677== Profiling application: ./build/benchmarks/test_composed_model/dnnmark_test_composed_model -config conf_tmp.dnnmark --warmup 10 --iterations 10 --debuginfo 0
==4677== Profiling result:
            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
 GPU activities:   56.86%  357.19ms        20  17.859ms  16.024ms  18.176ms  maxwell_gcgemm_32x32_nt
...

My environment:
NVDRV:430.64,
CUDA:10.1,
cuDNN:7.6.3.30-1
Ubuntu 18.04.3 LTS

RahulDhoot · May 28, 2020, 10:23am

Hi,
We are unable to reproduce the issue on our side using similar code written in CUDA C.
Can you provide more details?
How many kernels does nvprof report after you remove the cudaProfilerStart()/cudaProfilerStop() calls and the "--profile-from-start off" option?
Can you remove the "--profile-api-trace none" option and check the output?
Also, you can use NVTX ranges to check how many kernels are launched by a region of code.
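A minimal sketch of the NVTX suggestion, assuming the NVTX header nvToolsExt.h from the CUDA toolkit is available (link with -lnvToolsExt). nvtxRangePushA() and nvtxRangePop() are the real NVTX calls; ScopedNvtxRange and RunPhases are hypothetical helpers written for this example, and the no-op stand-ins exist only so the sketch also compiles on a machine without the toolkit.

```cpp
#include <cassert>

#ifdef __has_include
#  if __has_include(<nvToolsExt.h>)
#    include <nvToolsExt.h>  // real NVTX API; link with -lnvToolsExt
#    define SKETCH_HAVE_NVTX 1
#  endif
#endif
#ifndef SKETCH_HAVE_NVTX
// Assumption: no-op stand-ins used only when the NVTX header is missing.
inline int nvtxRangePushA(const char*) { return 0; }
inline int nvtxRangePop() { return 0; }
#endif

// Hypothetical RAII helper: opens a named NVTX range on construction
// and closes it on destruction, so ranges cannot be left unbalanced.
class ScopedNvtxRange {
 public:
  explicit ScopedNvtxRange(const char* name) {
    nvtxRangePushA(name);
    ++Depth();
  }
  ~ScopedNvtxRange() {
    assert(Depth() > 0);
    nvtxRangePop();
    --Depth();
  }
  ScopedNvtxRange(const ScopedNvtxRange&) = delete;
  ScopedNvtxRange& operator=(const ScopedNvtxRange&) = delete;

  // Host-side nesting depth, kept only for sanity checks.
  static int& Depth() {
    static int depth = 0;
    return depth;
  }
};

// Hypothetical driver: labels the warmup and benchmark loops separately
// so that kernels can be attributed to one phase or the other.
template <typename ForwardFn>
void RunPhases(int warmup, int iterations, ForwardFn forward) {
  {
    ScopedNvtxRange range("warmup");
    for (int i = 0; i < warmup; i++) forward();
  }
  {
    ScopedNvtxRange range("benchmark");
    for (int i = 0; i < iterations; i++) forward();
  }
}
```

In the original program, forward would be a call to dnnmark.Forward(). A tool that displays NVTX ranges can then show which of the two ranges each kernel launch falls into.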
