Shared memory: computation becomes slower when using shared memory

Hi all,

I am still new to CUDA, and I have some code that I am working on optimizing.

The code doesn't use shared memory, so I tried using shared memory to reduce the global memory accesses.

Let's say I have something like this (this code is just an example of the idea and might not compile):

[codebox]

int i, j;
float result, temp;

temp = 0.0f;
result = 0.0f;

for (i = 0; i < 3; i++) {
    for (j = 0; j < 10; j++) {
        temp += cos(gData[idx]);
    }
    result += temp / 10;
}

oData[idx] = result;

[/codebox]

So my shared memory implementation looks like this:

[codebox]

__shared__ float sData[768];

int i, j;
float result, temp;

for (i = 0; i < 3; i++)
    sData[threadIdx.x + blockDim.x * i] = gData[idx];

__syncthreads();

temp = 0.0f;
result = 0.0f;

for (i = 0; i < 3; i++) {
    for (j = 0; j < 10; j++) {
        temp += cos(sData[threadIdx.x + blockDim.x * i]);
    }
    result += temp / 10;
}

oData[idx] = result;

[/codebox]

This reduces the global memory loads by a factor of 10 (the j loop).

But the application ended up running slower.

The system is a GeForce 295.

The block size is 256 (up to 4 blocks can run per SM within the 1024-thread limit).

My shared memory usage is 1520 bytes per block (6080 per SM, still fine for 4 blocks).

I use 13 registers per thread (13,312 per SM, still fine for 4 blocks on the 295).

I was thinking that maybe the extra address calculations for the shared and global memory (which I needed to make my memory accesses coalesced) add computation and slow down my code. I am not sure; I have tried everything I know so far. Any help?

My speedup (slowdown) is 0.9x.

Shared memory is useful for communication between different threads of the same block. In your case, the same thread is reading the same array element ten times in a row. The compiler will optimize this and keep the array element in a register, so that there is no benefit to using shared memory.

Thanks tera,
That makes some sense now.
But why does the profiler show that global memory access is so high (the size of the array multiplied by 10)? Why does it still count them as global memory hits?

That is indeed surprising. It is even more surprising because the reported number of accesses should in any case be smaller than the total number of accesses in your kernel, as the profiler gathers information only from a subset of the multiprocessors. Are you sure you are interpreting the numbers correctly?

How did you measure the global memory bandwidth use? I have seen the Visual Profiler output some very strange numbers, such as a global memory throughput of 419 GB/s. Suffice it to say, I don’t trust that thing.
