I have some CUDA profiler output that is baffling me, and I need help understanding it. For background: I am working on a GTX 280 under Windows, using the CUDA Visual Profiler version 1.1.08. My .cu code has three kernels, none of which uses shared memory for now. Here is the profiler output for reference. I have a few odd observations:
Why does the third kernel show zero for instructions, gld_coherence and gst_coherence? It runs as 1 block of 300 threads (the other kernels use many more thread blocks, as you can see from the profiler output), yet it does a fair amount of work and makes multiple global memory accesses at every step. Even when I select only “instructions” in a profiling session, it still reports zero instructions, gld and gst for kernel 3.
Why is the number of blocks listed under gridSizeY instead of gridSizeX? The current code, with Bx=blockIdx.x, produces correct results. When I change it to By=blockIdx.y throughout the code, I get garbage output. Why is that?
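To make the question concrete, here is a minimal sketch of what I understand about the two grid shapes. It is not my actual code: `markByX`, `countWritten`, and the sizes are made up for illustration. The kernel reads only blockIdx.x, so it covers the whole array only when the blocks are laid out along the grid's x dimension; with the blocks along y, blockIdx.x is 0 in every block and only the first block's worth of elements gets written.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel: the block index is taken from blockIdx.x only.
__global__ void markByX(int *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = 1;
}

// Launch markByX with the given grid shape and return how many
// elements of an n-element array were written.
int countWritten(dim3 grid, int threads, int n) {
    int *d = 0;
    cudaMalloc((void **)&d, n * sizeof(int));
    cudaMemset(d, 0, n * sizeof(int));
    markByX<<<grid, threads>>>(d, n);

    int *h = new int[n];
    cudaMemcpy(h, d, n * sizeof(int), cudaMemcpyDeviceToHost);  // also waits for the kernel
    int written = 0;
    for (int i = 0; i < n; ++i) written += h[i];

    delete[] h;
    cudaFree(d);
    return written;
}

int main() {
    const int threads = 64, nBlocks = 8, n = threads * nBlocks;
    // Blocks along x: blockIdx.x runs over 0..nBlocks-1, every element is covered.
    printf("blocks along x: %d of %d written\n",
           countWritten(dim3(nBlocks, 1), threads, n), n);
    // Blocks along y: blockIdx.x is always 0, so all blocks hit the same 64 elements.
    printf("blocks along y: %d of %d written\n",
           countWritten(dim3(1, nBlocks), threads, n), n);
    return 0;
}
```

So the kernel indexing and the launch shape have to agree, which is why I am confused that the profiler column and my working blockIdx.x code seem to disagree.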
I understand that the gld/gst incoherence counters are hidden in the visual profiler for newer hardware like the GTX 200 series, since the hardware is said to take care of coalescing memory accesses. As I understand it, though, uncoalesced (incoherent) memory accesses can still occur on this hardware (I am using double precision and NO shared memory for now). Is that true? If so, how can I detect them?
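The only check I can think of is indirect: time the same amount of double-precision traffic with a coalescing-friendly and a scattered access pattern and compare the effective bandwidth. Below is a rough sketch of that idea, not my real kernels; `copyPermuted`, `timeCopy`, `effectiveGBps`, the array size, and the stride of 17 are all assumptions chosen for illustration, and it assumes double precision is enabled at compile time (e.g. `-arch=sm_13` on this hardware). It measures a symptom (lower bandwidth), not the hidden counters themselves.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Each thread copies one double. With stride 1, neighbouring threads touch
// neighbouring doubles; with a larger odd stride (n a power of two) the
// mapping is still a permutation, but neighbouring threads are far apart.
__global__ void copyPermuted(const double *in, double *out, int n, int stride) {
    int t = blockIdx.x * blockDim.x + threadIdx.x;
    if (t < n) {
        int i = (t * stride) % n;
        out[i] = in[i];
    }
}

// Effective bandwidth in GB/s for `bytes` moved in `ms` milliseconds.
double effectiveGBps(double bytes, float ms) {
    return bytes / 1.0e6 / ms;
}

// Time one launch of copyPermuted (after a warm-up launch) with CUDA events.
float timeCopy(const double *in, double *out, int n, int stride) {
    const int threads = 256;
    const int blocks = (n + threads - 1) / threads;
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    copyPermuted<<<blocks, threads>>>(in, out, n, stride);  // warm-up, not timed
    cudaEventRecord(start, 0);
    copyPermuted<<<blocks, threads>>>(in, out, n, stride);
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ms;
}

int main() {
    const int n = 1 << 22;                            // 4M doubles, 32 MB per array
    const double bytes = 2.0 * n * sizeof(double);    // one read + one write per element
    double *in = 0, *out = 0;
    cudaMalloc((void **)&in, n * sizeof(double));
    cudaMalloc((void **)&out, n * sizeof(double));
    cudaMemset(in, 0, n * sizeof(double));

    const int strides[2] = {1, 17};                   // contiguous vs scattered
    for (int k = 0; k < 2; ++k) {
        float ms = timeCopy(in, out, n, strides[k]);
        printf("stride %2d: %.3f ms, %.2f GB/s\n",
               strides[k], ms, effectiveGBps(bytes, ms));
    }

    cudaFree(in);
    cudaFree(out);
    return 0;
}
```

If the scattered version comes out much slower for the same number of bytes, I would take that as a sign that access pattern still matters on this card, but I would like to know whether there is a more direct way to see it in the profiler.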
I would be thankful if someone could share some information or experience with me. I am getting very limited speed-up, and none of the optimizations I have tried is helping.
Thanks & regards,