How can I verify that shared memory is used as declared?

I have the following kernel, which I believe is bound by global memory bandwidth. Half of the accesses to global memory are made to a small structure “dC” consisting of 5 unsigned ints (ALPHA_SIZE == 4), a great candidate for shared memory since these values are common to all blocks and all threads. I’ve copied dC into dSharedC as the first step of my kernel, as shown in the code below. Later in the kernel and in the device functions it calls, I pass around dSharedC instead of dC as I did in my original code.

This modification has made no difference in performance. Running the profiler, I get: local_load=[ 0 ] local_store=[ 0 ]. Without much experience with the profiler, I assume this indicates that no reads or writes are made to shared memory, which would be consistent with the lack of a performance gain.

    [*]Is my interpretation of the profiler correct?

    [*]Is my use of shared memory correct?

    [*]Is there any way to confirm that the shared memory is used for dSharedC?

    [*]Is it possible that, even though half of my memory accesses are moved from global to shared memory, I see no performance gain in a kernel that is bound by memory access?

__global__ void kernel (..., unsigned int *dC, ...) {
  __shared__ unsigned int dSharedC[ALPHA_SIZE + 1];

  if (threadIdx.x < (ALPHA_SIZE + 1))
    dSharedC[threadIdx.x] = dC[threadIdx.x];

  __syncthreads ();

[*]Is my interpretation of the profiler correct?

No. In your case you didn’t use local memory at all; you used shared memory. Note that local memory is different from shared memory.

That is why you get local_load = 0 and local_store = 0.

[*]Is my use of shared memory correct?

Yes, it is correct.

[*]Is there any way to confirm that the shared memory is used for dSharedC?

First of all, set each element of the shared array to zero:

dSharedC[threadIdx.x] = 0;

and then run your copy:

dSharedC[threadIdx.x] = dC[threadIdx.x];

__syncthreads ();

Then copy the shared array back to global memory, and finally copy that global buffer back to host memory. On the host it is easy to verify the data.

Of course, there are many other ways to confirm the data stored in shared memory.
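A minimal sketch of that round trip (the kernel name and the output buffer dOut are hypothetical, not from the original code):

```cuda
// Hypothetical verification kernel: zero the shared array, run the copy
// under test, then write the shared contents out to a global buffer dOut
// that the host can cudaMemcpy back and compare against the original dC.
__global__ void verifySharedCopy(const unsigned int *dC, unsigned int *dOut)
{
    __shared__ unsigned int dSharedC[ALPHA_SIZE + 1];

    if (threadIdx.x < (ALPHA_SIZE + 1)) {
        dSharedC[threadIdx.x] = 0;                 // known starting value
        dSharedC[threadIdx.x] = dC[threadIdx.x];   // the copy under test
    }
    __syncthreads ();

    if (threadIdx.x < (ALPHA_SIZE + 1))
        dOut[threadIdx.x] = dSharedC[threadIdx.x]; // round-trip to global
}
```

Independently of this, compiling with nvcc -Xptxas -v reports the static shared-memory allocation per block (the “smem” figure), which already confirms that a statically declared __shared__ array like dSharedC was placed in shared memory.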

[*]Is it possible that, even though half of my memory accesses are moved from global to shared memory, I see no performance gain in a kernel that is bound by memory access?

Yes, it is.

Thanks for your answers, I figured out quite a few things…

Hmmm… it sounds obvious now and I feel like a fool! My kernel runs at an occupancy of 0.5, uses 29 registers and close to no shared memory per block (20 bytes). Using 256 blocks of 512 threads, the kernel runs for about 12 sec, but the profiler reports cta_launched=[ 25 ]. On a Tesla C1060 with 30 multiprocessors, shouldn’t I see cta_launched ≈ 256 / 30? Each block deals with 4000 or so independent queries, so I would expect their running times to be similar on average. If I randomize the assignment of queries to blocks, I get the same results.

Somewhere in my kernel, I do the following:

unsigned long current = dL[current_i];
current <<= (2 * ((bucket_id * bucketSize) - 32 * current_i));

where dL is a 700 MB C array in global memory, and there is really no other instruction the compiler can fit between the memory request and the use of its result. The index current_i is completely random from one query to another, and uniformly distributed over the 700 MB structure. What are my options for hiding the latency of this memory request?
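One common option is to have each thread handle several independent queries and issue all the loads before consuming any result, so the memory latencies overlap (instruction-level parallelism). A hedged sketch, with every name except dL hypothetical:

```cuda
// Hypothetical sketch: two independent queries per thread. Both global
// loads are issued back-to-back, so the second load's latency overlaps
// with the first's instead of being paid serially.
__global__ void twoQueriesPerThread(const unsigned long *dL,
                                    const unsigned int *queryIdx,
                                    unsigned int shiftA, unsigned int shiftB,
                                    unsigned long *out, int nThreads)
{
    int t = blockIdx.x * blockDim.x + threadIdx.x;
    if (t >= nThreads) return;

    unsigned int i0 = queryIdx[2 * t];
    unsigned int i1 = queryIdx[2 * t + 1];

    unsigned long a = dL[i0];   // both loads in flight before either
    unsigned long b = dL[i1];   // shifted result is needed below

    out[2 * t]     = a << shiftA;
    out[2 * t + 1] = b << shiftB;
}
```

The other lever is raising occupancy (e.g. fewer registers per thread, or more, smaller blocks), since more resident warps per multiprocessor gives the scheduler other work to run while a warp waits on its load.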