Focused profiling with nvprof not working?

I use cudaProfilerStart() and cudaProfilerStop() to select regions for profiling with nvprof; however, kernels launched outside of the profiling region still show up in the output.

A fragment of my code:

  // Forward
  LOG(INFO) << "Initialisation FWD";
  if (FLAGS_warmup) {
    for (int i = 0; i < FLAGS_warmup; i++) {
      LOG(INFO) << "Warming up...";
      // ... forward pass (warmup, should not be profiled) ...
    }
  }

  // Real benchmark
  for (int i = 0; i < FLAGS_iterations; i++) {
    LOG(INFO) << "Iteration " << i;
    cudaProfilerStart();
    // ... forward pass ...
    cudaProfilerStop();
    // ... backward pass (profiler calls around it omitted here) ...
  }

I have a warmup part and a benchmarking part.
There is a CUDA kernel (maxwell_gcgemm_32x32_nt) that is called once in each forward pass. (From nsys traces I know it is launched exactly once per forward pass, from the ForwardConvolution function, and is never called in the backward pass.)

So I expect the kernel to be called the same number of times as there are forward passes in the benchmarking part. However, the number of calls of the kernel is always equal to the number of warmup passes plus the number of benchmark passes.

What is more, if I remove cudaProfilerStart() and cudaProfilerStop() from around the forward pass and leave them only around the backward pass (not shown in the fragment above), I still see the same number of calls of the kernel, when it should be 0.

The command I run and a part of the output:

$ nvprof --profile-api-trace none --unified-memory-profiling off --profile-from-start off ./build/benchmarks/test_composed_model/dnnmark_test_composed_model -config conf_tmp.dnnmark --warmup 10 --iterations 10 --debuginfo 0
==4677== Warning: Profiling results might be incorrect with current version of nvcc compiler used to compile cuda app. Compile with nvcc compiler 9.0 or later version to get correct profiling results. Ignore this warning if code is already compiled with the recommended nvcc version
Total running time(ms): 11174.345703
==4677== Profiling application: ./build/benchmarks/test_composed_model/dnnmark_test_composed_model -config conf_tmp.dnnmark --warmup 10 --iterations 10 --debuginfo 0
==4677== Profiling result:
            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
 GPU activities:   56.86%  357.19ms        20  17.859ms  16.024ms  18.176ms  maxwell_gcgemm_32x32_nt

My environment:
Ubuntu 18.04.3 LTS

We are unable to reproduce the issue on our side using a similar code based on CUDA C.
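For reference, a minimal sketch of that kind of CUDA C test (the kernel and loop counts here are illustrative, not your actual code; run it under nvprof with --profile-from-start off):

```cpp
#include <cuda_profiler_api.h>

// Placeholder kernel standing in for the real workload.
__global__ void dummyKernel() {}

int main() {
  // Warmup launches: with --profile-from-start off and no profiler
  // region active, these should NOT appear in the nvprof output.
  for (int i = 0; i < 10; i++) dummyKernel<<<1, 1>>>();
  cudaDeviceSynchronize();

  cudaProfilerStart();  // profiling region begins
  for (int i = 0; i < 10; i++) dummyKernel<<<1, 1>>>();
  cudaDeviceSynchronize();
  cudaProfilerStop();   // profiling region ends

  return 0;
}
```

With this pattern, nvprof reports 10 launches of dummyKernel, not 20.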
Can you provide more details?
How many kernels are reported by nvprof after removing the cudaProfilerStart()/cudaProfilerStop() calls and the "--profile-from-start off" option?
Can you remove the option "--profile-api-trace none" and check the output?
Also, by using NVTX ranges, you can check how many kernels are launched by a given region of code.
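For example (a minimal sketch; forward() and backward() are placeholders for your model code, and the program must be linked against -lnvToolsExt):

```cpp
#include <nvToolsExt.h>

void run_benchmark(int warmup, int iterations) {
  nvtxRangePushA("warmup");
  for (int i = 0; i < warmup; i++) {
    forward();  // kernels launched here are attributed to the "warmup" range
  }
  nvtxRangePop();

  nvtxRangePushA("benchmark");
  for (int i = 0; i < iterations; i++) {
    forward();
    backward();
  }
  nvtxRangePop();
}
```

In the nsys (or nvprof) timeline you can then count how many maxwell_gcgemm_32x32_nt launches fall inside each named range.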