Profiling the OpenCL kernel

Hi folks,

I have two implementations of a problem, one that uses shared (local) memory and one that does not. I use async_work_group_copy to move the data from global to shared memory. I expected improved performance from the shared memory version, but it performs the same as, or worse than, the global memory implementation. I profiled both implementations and collected the numbers.
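For reference, this is roughly the copy pattern I am describing (a simplified sketch, not my actual kernel; the kernel name, tile size, and float data type are just placeholders):

#define TILE_SIZE 256

__kernel void process(__global const float *in, __global float *out)
{
    __local float tile[TILE_SIZE];

    const size_t group_offset = get_group_id(0) * TILE_SIZE;
    const size_t lid          = get_local_id(0);

    /* Copy one tile from global to shared (local) memory;
       every work-item in the work-group takes part in the copy. */
    event_t evt = async_work_group_copy(tile, in + group_offset, TILE_SIZE, 0);
    wait_group_events(1, &evt);

    /* Work on the data out of local memory instead of global memory. */
    out[group_offset + lid] = tile[lid] * 2.0f;
}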

Comparing the numbers for the shared memory implementation (SMI) against the global memory implementation (GMI):

The GPU time in SMI is higher than in GMI
The number of global loads issued is lower in SMI
There are more branches in SMI (I don't understand why)
There are more divergent branches in SMI (0 in GMI, many in SMI)

I now want to see how the time inside the kernel is being spent, so I can find out where it actually goes. I don't know whether the profiler from Nvidia can do this. Any suggestions or hints?

– Bharath