I am profiling a finite-difference stencil kernel with nvprof, and the run has become extremely slow. The profiled region (between cudaProfilerStart() and cudaProfilerStop()) finishes in only 0.00078 s without nvprof. With nvprof, I have been waiting for more than 5 minutes and it is still running.
It is expected that a kernel takes longer to execute while it is being profiled.
I have no idea, though, how much longer a kernel that normally runs in X ms should take when profiled.
Also, make sure to run your program under cuda-memcheck before profiling; it will point out invalid memory accesses that can potentially slow down profiling.
Thanks for the reply! I just found out that nvprof stalled because of IBM Spectrum MPI. When using nvprof with SMPI, I need to pass --openmp-profiling off to nvprof and also remove all OpenMP offload regions; otherwise the program stalls right at MPI_Init().
A related post:
This is very weird. I will report this issue to IBM.
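For anyone landing on this thread later, here is a minimal sketch of the setup discussed above: a profiled region delimited by cudaProfilerStart() and cudaProfilerStop() (the runtime call that closes the region), with CUDA events timing the same region so the unprofiled runtime can be compared with the profiled one. The kernel stencil1d, the problem size, and the file name stencil.cu are hypothetical stand-ins, not the original poster's code, and the sketch leaves out MPI entirely.

```cuda
// Minimal sketch, not the original poster's code.
// Build (assumed file name): nvcc -o stencil stencil.cu
// Run:   nvprof --profile-from-start off ./stencil
// Under Spectrum MPI, the workaround reported in this thread adds
// "--openmp-profiling off" to the nvprof options.
#include <cstdio>
#include <cstdlib>
#include <vector>
#include <cuda_runtime.h>
#include <cuda_profiler_api.h>

// Abort with a message on any CUDA runtime error.
#define CUDA_CHECK(call)                                             \
    do {                                                             \
        cudaError_t err_ = (call);                                   \
        if (err_ != cudaSuccess) {                                   \
            std::fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__,  \
                         cudaGetErrorString(err_));                  \
            std::exit(EXIT_FAILURE);                                 \
        }                                                            \
    } while (0)

// Hypothetical 3-point second-difference stencil (stand-in kernel).
__global__ void stencil1d(const float* in, float* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i > 0 && i < n - 1) {
        out[i] = in[i - 1] - 2.0f * in[i] + in[i + 1];
    }
}

int main()
{
    const int n = 1 << 20;  // assumed problem size
    const size_t bytes = n * sizeof(float);

    std::vector<float> h_in(n);
    for (int i = 0; i < n; ++i) h_in[i] = static_cast<float>(i);

    float *d_in = nullptr, *d_out = nullptr;
    CUDA_CHECK(cudaMalloc(&d_in, bytes));
    CUDA_CHECK(cudaMalloc(&d_out, bytes));
    CUDA_CHECK(cudaMemcpy(d_in, h_in.data(), bytes, cudaMemcpyHostToDevice));
    CUDA_CHECK(cudaMemset(d_out, 0, bytes));  // boundary points stay zero

    cudaEvent_t start, stop;
    CUDA_CHECK(cudaEventCreate(&start));
    CUDA_CHECK(cudaEventCreate(&stop));

    const int threads = 256;
    const int blocks = (n + threads - 1) / threads;

    // Only work between Start and Stop is collected when nvprof is
    // launched with --profile-from-start off.
    CUDA_CHECK(cudaProfilerStart());
    CUDA_CHECK(cudaEventRecord(start));
    stencil1d<<<blocks, threads>>>(d_in, d_out, n);
    CUDA_CHECK(cudaGetLastError());
    CUDA_CHECK(cudaEventRecord(stop));
    CUDA_CHECK(cudaEventSynchronize(stop));
    CUDA_CHECK(cudaProfilerStop());

    float ms = 0.0f;
    CUDA_CHECK(cudaEventElapsedTime(&ms, start, stop));
    std::printf("profiled region took %.5f s\n", ms / 1000.0f);

    CUDA_CHECK(cudaEventDestroy(start));
    CUDA_CHECK(cudaEventDestroy(stop));
    CUDA_CHECK(cudaFree(d_in));
    CUDA_CHECK(cudaFree(d_out));
    return 0;
}
```

Running the binary once without nvprof and once with it gives the two timings being compared in the first post. If the profiled run never reaches the printf, the stall is happening outside the kernel itself, as turned out to be the case here.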