prepare functions to profile with nvprof


I am currently profiling a finite difference stencil kernel with nvprof, and it is extremely slow. The profiled region (between cudaProfilerStart() and cudaProfilerStop()) finishes in only 0.00078 sec without nvprof. With nvprof, I have been waiting for over 5 minutes and it is still running.

mpirun -gpu -np 1 nvprof --profile-from-start off -f -o nvprof.%q{OMPI_COMM_WORLD_RANK} $exe 128 128 128 2

The last four numbers are the sizes along the x, y, z, and t axes.

My question is: why is it running so slowly, and is there a way to reduce the runtime by having nvprof gather less information?


It is expected that a kernel takes longer to execute while it is being profiled.
I can't say how much longer a kernel that normally runs in X ms should take when profiled.

Also, make sure to run your program under cuda-memcheck before profiling. It will point out invalid memory accesses, which can potentially slow down profiling.
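As a rough sketch of that suggestion, the check could reuse the launch line from the question with cuda-memcheck in place of nvprof. The ./stencil path is only a placeholder for whatever $exe points to, and the snippet prints the command instead of running it.

```shell
# Placeholder path (assumption); the original post only shows $exe.
exe=./stencil

# Same mpirun launch as in the question, with cuda-memcheck wrapping the binary.
MEMCHECK_CMD="mpirun -gpu -np 1 cuda-memcheck $exe 128 128 128 2"

# Dry run: print the command. Run it with: eval "$MEMCHECK_CMD"
echo "$MEMCHECK_CMD"
```

If cuda-memcheck reports no errors, the same arguments can then be passed to the nvprof run.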

Hi Saulocpp,

Thanks for the reply! I just found out that nvprof stalled because of IBM Spectrum MPI (SMPI). When using nvprof with SMPI, I need to pass --openmp-profiling off to nvprof and remove all OpenMP offload regions; otherwise the program stalls right at MPI_Init().
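For anyone hitting the same stall, the adjusted launch line would look roughly like this. It is the command from my first post with --openmp-profiling off added. The ./stencil path is a placeholder for the actual binary, and the snippet only prints the command.

```shell
# Placeholder path (assumption); substitute the real binary.
exe=./stencil

# nvprof flags from the original command, plus --openmp-profiling off.
NVPROF_FLAGS="--openmp-profiling off --profile-from-start off -f"

# %q{OMPI_COMM_WORLD_RANK} is expanded by nvprof, not by the shell,
# so each rank writes its own output file.
CMD="mpirun -gpu -np 1 nvprof $NVPROF_FLAGS -o nvprof.%q{OMPI_COMM_WORLD_RANK} $exe 128 128 128 2"

# Dry run: print the command. Run it with: eval "$CMD"
echo "$CMD"
```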

A related post:

This is very weird. I will report this issue to IBM.