Shared memory: computation becomes slower when using shared memory

Hi all,

I am still new to CUDA, and I have some code that I am working on optimizing.

The code doesn't use shared memory, so I tried using shared memory to reduce the global memory accesses.

Let's say I have something like this (this code is just an example of the idea and might not compile):

[codebox]

int i, j;
float result, temp;

temp = 0.0f;
result = 0.0f;

for (i = 0; i < 3; i++) {
    for (j = 0; j < 10; j++) {
        temp += cos(gData[idx]);
    }
    result += temp / 10;
}

oData[idx] = result;

[/codebox]

So my shared memory implementation looks like this:

[codebox]

__shared__ float sData[768];

int i, j;
float result, temp;

for (i = 0; i < 3; i++)
    sData[threadIdx.x + blockDim.x * i] = gData[idx];

__syncthreads();

temp = 0.0f;
result = 0.0f;

for (i = 0; i < 3; i++) {
    for (j = 0; j < 10; j++) {
        temp += cos(sData[threadIdx.x + blockDim.x * i]);
    }
    result += temp / 10;
}

oData[idx] = result;

[/codebox]

This reduces the global memory loads by a factor of 10 (the j loop).

But the application ended up running slower.

The system is a GeForce 295.

The block size is 256 (up to 4 blocks can run per SM within the 1024-thread limit).

My shared memory usage is 1520 bytes per block (6080 per SM, still fine for 4 blocks).

I use 13 registers per thread (13,312 per SM, still fine for 4 blocks on the 295).

I was thinking that maybe the extra address calculations for the shared and global memory (which I needed to make my memory accesses coalesced) add computation and slow down my code. I am not sure; I have tried everything I know so far. Any help?

My speedup (slowdown) is 0.9x.

Shared memory is useful for communication between different threads of the same block. In your case, the same thread is reading the same array element ten times in a row. The compiler will optimize this and keep the array element in a register, so that there is no benefit to using shared memory.

Thanks tera,
That makes some sense now.
But why does the profiler show that global memory access is so high (the size of the array multiplied by 10)? Why does it still count them as global memory hits?

That is indeed surprising. It is even more surprising because the reported number of accesses should in any case be smaller than the total number of accesses in your kernel, as the profiler gathers information only from a subset of the multiprocessors. Are you sure you are interpreting the numbers correctly?

How did you measure the global memory bandwidth use? I have seen the Visual Profiler output some very strange numbers, such as a global memory throughput of 419 GB/s. Suffice it to say, I don’t trust that thing.
