Why are NVPROF and Nsight not profiling one of the kernels?

I have a CFD program in CUDA. When I run it with a block dimension of 16 * 16 and profile it, it is profiled perfectly and shows a kernel “NLMMNT” taking most of the GPU time. But when I run the same program with a block dimension of 32 * 32, it runs up to 5 times faster than before and the results are correct, yet the profiler no longer shows any profiling output for NLMMNT. The Nsight log also does not show the profiling of NLMMNT as complete. I can't figure out what the reason may be. I tried running the application for hours, but NLMMNT’s profile info is still absent from the profiler’s output.

The Nsight log can be seen in this screenshot:

http://s23.postimg.org/hbaect7uz/profoutput.png

Chances are the kernel isn’t showing up in the profiler output because it did not execute. For example, the kernel may have failed to launch due to an out-of-resources condition. A 32 * 32 block has 1024 threads, so a kernel that uses many registers per thread can exceed the per-block register limit at that size even though it launches fine with 16 * 16. Does the code check the return status of every CUDA API call and every kernel launch? Note that checking the status of kernel launches is a two-step process, to catch both pre-launch and post-launch errors.
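A minimal sketch of that two-step check might look like the following. The kernel `dummyKernel` and the helper `checkCuda` are hypothetical names made up for illustration; they stand in for the real kernel (NLMMNT) and whatever error handling the application already uses. Only the pattern matters: `cudaGetLastError()` right after the launch, then `cudaDeviceSynchronize()`.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical stand-in for the real kernel; doubles each element.
__global__ void dummyKernel(float *data, int n)
{
    int threadsPerBlock = blockDim.x * blockDim.y;
    int tid = threadIdx.y * blockDim.x + threadIdx.x;
    int i = blockIdx.x * threadsPerBlock + tid;
    if (i < n) data[i] *= 2.0f;
}

// Hypothetical helper: prints the error string and returns false on failure.
static bool checkCuda(cudaError_t err, const char *what)
{
    if (err != cudaSuccess) {
        fprintf(stderr, "%s failed: %s\n", what, cudaGetErrorString(err));
        return false;
    }
    return true;
}

int main()
{
    const int n = 1 << 20;
    float *d_data = nullptr;
    if (!checkCuda(cudaMalloc(&d_data, n * sizeof(float)), "cudaMalloc")) return 1;
    if (!checkCuda(cudaMemset(d_data, 0, n * sizeof(float)), "cudaMemset")) return 1;

    dim3 block(32, 32);                 // 1024 threads per block
    dim3 grid((n + 1023) / 1024);       // enough blocks to cover n elements

    dummyKernel<<<grid, block>>>(d_data, n);

    // Step 1: pre-launch errors (bad configuration, too many resources
    // requested for launch, ...) are reported by cudaGetLastError().
    bool ok = checkCuda(cudaGetLastError(), "kernel launch");

    // Step 2: errors that occur while the kernel runs are reported by the
    // next synchronizing call, here cudaDeviceSynchronize().
    ok = checkCuda(cudaDeviceSynchronize(), "kernel execution") && ok;

    cudaFree(d_data);
    return ok ? 0 : 1;
}
```

If step 1 reports a failure for the 32 * 32 configuration, the kernel never ran, which would be consistent with it missing from the profiler output.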

As I said, the output is correct. I cross-checked the return status of the kernel.

I have a similar problem. If this was resolved, can anyone tell me how?

Does your program carefully check the return status of every CUDA API call and every kernel launch? If not, there is a chance the kernel may never have executed.

Also, make sure the profiling data is properly flushed at the end of the application run. To do so, add

#include <cuda_profiler_api.h>

and at application termination, call

cudaProfilerStop();
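Put together, the end of a program could look roughly like this. It is only a sketch: `placeholderKernel` is a made-up kernel standing in for the application's real work, and synchronizing before `cudaProfilerStop()` is included on the assumption that all GPU work should be finished before the profile data is flushed.

```cuda
#include <cstdio>
#include <cuda_runtime.h>
#include <cuda_profiler_api.h>

// Hypothetical placeholder for the application's real kernels.
__global__ void placeholderKernel(int *out)
{
    if (blockIdx.x == 0 && threadIdx.x == 0) *out = 1;
}

int main()
{
    int *d_out = nullptr;
    if (cudaMalloc(&d_out, sizeof(int)) != cudaSuccess) return 1;

    placeholderKernel<<<1, 32>>>(d_out);

    // Make sure all GPU work has finished before the application exits.
    cudaError_t err = cudaDeviceSynchronize();
    if (err != cudaSuccess)
        fprintf(stderr, "kernel failed: %s\n", cudaGetErrorString(err));

    cudaFree(d_out);

    // Flush any buffered profiling data just before termination.
    cudaProfilerStop();
    return err == cudaSuccess ? 0 : 1;
}
```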