I use cudaProfilerStart() and cudaProfilerStop() to select regions for profiling with nvprof. However, the output still shows kernels that were launched outside the profiling region.
A fragment of my code:
// Forward
LOG(INFO) << "Initialisation FWD";
dnnmark.SetupWorkspaces(0);
if (FLAGS_warmup) {
  for (int i = 0; i < FLAGS_warmup; i++) {
    LOG(INFO) << "Warming up...";
    dnnmark.Forward();
  }
}
cudaProfilerStart();
dnnmark.GetTimer()->Clear();
// Real benchmark
for (int i = 0; i < FLAGS_iterations; i++) {
  LOG(INFO) << "Iteration " << i;
  dnnmark.Forward();
}
dnnmark.GetTimer()->SumRecords();
cudaProfilerStop();
I have a warmup part and a benchmarking part.
There is a CUDA kernel (maxwell_gcgemm_32x32_nt) that is called once in each forward pass. (From nsys traces I know it is called only once per forward pass, from the ForwardConvolution function, and is not called in the backward pass.)
So I expect the number of calls of this kernel to equal the number of forward passes in the benchmarking part. However, the reported number of calls is always the number of warmup passes plus the number of benchmark passes.
Moreover, if I remove cudaProfilerStart() and cudaProfilerStop() from around the forward pass and leave them only around the backward pass (not shown in the fragment above), I still see the same number of calls of the kernel, although it should be 0.
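To make the expectation concrete, here is a minimal sketch in plain C++ of the counting I assume the profiler should do. ProfilerModel and ExpectedCalls are hypothetical names made up for this illustration; they are not part of DNNMark, nvprof, or the CUDA API, and the model assumes that profiling starts disabled (as with --profile-from-start off) and that each forward pass launches the kernel exactly once.

```cpp
#include <cstddef>

// Hypothetical model (not a CUDA or DNNMark API): records only the kernel
// launches that happen while the profiling region is open.
struct ProfilerModel {
  bool enabled = false;      // assumption: mirrors --profile-from-start off
  std::size_t recorded = 0;  // launches seen while profiling is enabled

  void Start() { enabled = true; }   // stands in for cudaProfilerStart()
  void Stop() { enabled = false; }   // stands in for cudaProfilerStop()
  void OnKernelLaunch() {
    if (enabled) ++recorded;
  }
};

// Returns the number of kernel calls I would expect the profiler to report.
// Assumption: one forward pass == one launch of maxwell_gcgemm_32x32_nt.
// profile_forward == false models the case where only the backward pass
// (which never launches this kernel) is wrapped in Start()/Stop().
inline std::size_t ExpectedCalls(int warmup, int iterations,
                                 bool profile_forward) {
  ProfilerModel profiler;
  for (int i = 0; i < warmup; ++i) {
    profiler.OnKernelLaunch();  // warmup passes run before Start()
  }
  if (profile_forward) profiler.Start();
  for (int i = 0; i < iterations; ++i) {
    profiler.OnKernelLaunch();  // benchmark passes
  }
  if (profile_forward) profiler.Stop();
  return profiler.recorded;
}
```

With --warmup 10 --iterations 10, ExpectedCalls(10, 10, true) gives 10 and ExpectedCalls(10, 10, false) gives 0, whereas nvprof reports warmup + iterations calls in both cases.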
The command I run and a part of the output:
$ nvprof --profile-api-trace none --unified-memory-profiling off --profile-from-start off ./build/benchmarks/test_composed_model/dnnmark_test_composed_model -config conf_tmp.dnnmark --warmup 10 --iterations 10 --debuginfo 0
==4677== Warning: Profiling results might be incorrect with current version of nvcc compiler used to compile cuda app. Compile with nvcc compiler 9.0 or later version to get correct profiling results. Ignore this warning if code is already compiled with the recommended nvcc version
...
Total running time(ms): 11174.345703
==4677== Profiling application: ./build/benchmarks/test_composed_model/dnnmark_test_composed_model -config conf_tmp.dnnmark --warmup 10 --iterations 10 --debuginfo 0
==4677== Profiling result:
            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
 GPU activities:   56.86%  357.19ms        20  17.859ms  16.024ms  18.176ms  maxwell_gcgemm_32x32_nt
...
My environment:
NVIDIA driver: 430.64
CUDA: 10.1
cuDNN: 7.6.3.30-1
OS: Ubuntu 18.04.3 LTS