URGENT: Weird CUDA profiler results...need help with analysis

Hi Everyone,

I have some CUDA profiler output that is baffling me, and I need help understanding it. Some background: I am working on a GTX 280 on Windows and am using the CUDA Visual Profiler version 1.1.08. My .cu code has 3 kernels, none of which uses shared memory for now. Here is the profiler output for reference. I have a few weird observations:

  1. Why does the third kernel show “zero” for instructions, gld_coherence and gst_coherence? It runs 1 block of 300 threads (the other kernels use many more thread blocks, as you can see from the profiler output), does quite a lot of work, and makes multiple global memory accesses at every step. Even when I select only “instructions” in a profiling session, it still reports zero instructions, gld and gst for kernel 3.

  2. Why is the number of blocks listed under gridSizeY instead of gridSizeX? The current output, with Bx=blockIdx.x, is correct. When I change it to By=blockIdx.y throughout the code, I get garbage output. Why is that?

  3. I understand that gld/gst incoherence is hidden in the visual profiler for newer hardware like the GTX 200 series (since that hardware is claimed to take care of coalescing memory accesses), but that incoherent/uncoalesced memory accesses still exist (I am using double precision and NO shared memory for now). Is that true? If so, how can I detect them?

I would be thankful if someone could share some info/experience with me!! I am getting very limited speed-up and no optimization is helping!!

Thanks & regards,

Aditi

Did you check your kernel for errors? (But then I think the log is incomplete for bad kernels…)
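In case it helps, here is a rough sketch of what checking a kernel for errors could look like. The kernel body, the name kernel3, the buffer d_data and the helper check() are all made up for illustration (only the 1 block of 300 threads comes from the post); the runtime calls cudaGetLastError, cudaDeviceSynchronize and cudaGetErrorString are standard, though on toolkits from the GTX 280 era the synchronize call is cudaThreadSynchronize instead.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical stand-in for the third kernel: 1 block of 300 threads.
__global__ void kernel3(double *data, int n)
{
    int i = threadIdx.x;
    if (i < n) {
        data[i] = 2.0 * data[i];
    }
}

// Hypothetical helper: report a failed CUDA call and return false.
static bool check(cudaError_t err, const char *what)
{
    if (err != cudaSuccess) {
        fprintf(stderr, "%s failed: %s\n", what, cudaGetErrorString(err));
        return false;
    }
    return true;
}

int main()
{
    const int n = 300;
    double *d_data = NULL;
    if (!check(cudaMalloc((void **)&d_data, n * sizeof(double)), "cudaMalloc")) return 1;
    if (!check(cudaMemset(d_data, 0, n * sizeof(double)), "cudaMemset")) return 1;

    kernel3<<<1, n>>>(d_data, n);

    // Launch errors (bad configuration, too many resources) are reported here.
    bool ok = check(cudaGetLastError(), "kernel3 launch");
    // Errors during execution (e.g. an out-of-bounds access) are reported
    // only after the host waits for the kernel to finish.
    // On older toolkits the equivalent call is cudaThreadSynchronize().
    ok = check(cudaDeviceSynchronize(), "kernel3 execution") && ok;

    cudaFree(d_data);
    return ok ? 0 : 1;
}
```

If either check prints an error for the real third kernel, the zero counters may simply mean the kernel never ran to completion.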

Maybe the driver is NOT profiling the multiprocessor on which the kernel is actually being executed… :-(

Since 1 block runs on only 1 MP and the profiling is done for only 1 MP, maybe the driver is profiling some MP on which the code is NOT run…

I would be surprised if that's the case. The driver can't be that dumb.

Try using the command-line profiler. Set CUDA_PROFILE_CONFIG to the config file name and list what you want to watch in that config file, one option per line. Note that there is a limit of 4 counters per run (the limit applies only to profiler options that depend on hardware counters).
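As a rough example, a config file along these lines might be a starting point. The option names are the ones I remember from the legacy command-line profiler for GT200-class (compute capability 1.2/1.3) hardware, so please check them against the profiler documentation shipped with your toolkit. gridsize and threadblocksize are not hardware counters, so only the last four lines count toward the limit of 4.

```text
gridsize
threadblocksize
instructions
gld_32b
gld_64b
gld_128b
```

The profiler also has to be switched on by setting CUDA_PROFILE=1 in the environment (on Windows, with set in the console before running the program), and CUDA_PROFILE_LOG can be set to choose the log file name. The gld_32b/gld_64b/gld_128b counts break global loads down by transaction size, which may be a way to get at your question 3 on hardware where the coherent/incoherent counters are hidden. A second run with gst_32b, gst_64b and gst_128b would do the same for stores.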