Shared memory optimization fails: major slowdown using shared memory instead of global memory

Good Morning,

I am a newbie when it comes to CUDA, and I have a small question about optimization using shared memory.
I wrote two kernels, the first using global memory and the second using shared memory, but the first is much faster than the second.
Does this happen because I transfer the data between global and shared memory without first loading the data into registers?

The code of both kernels follows below.

Can somebody help me?

/* First kernel */

__global__ void calc_ez(float *ez, float *ga, float *dz, int dimx, int dimy){

int ix = blockIdx.x*blockDim.x + threadIdx.x;
int iy = blockIdx.y*blockDim.y + threadIdx.y;
int idx = iy*dimx + ix;

    ez[idx] = ga[idx]*dz[idx];

    __syncthreads(); // note: not needed here, the threads share no data

}

/* Second kernel */

/* SUB_LARGURA is the tile width of the shared-memory arrays used in the kernel, and TAM is the full width of the arrays allocated in global memory */

__global__ void calc_ez_share(float *ez, float *ga, float *dz, int dimx, int dimy){

__shared__ float ga_s[SUB_LARGURA][SUB_LARGURA];
__shared__ float dz_s[SUB_LARGURA][SUB_LARGURA];



int bx  = blockIdx.x;  
int by = blockIdx.y;

int tx  = threadIdx.x; 
int ty = threadIdx.y;

int ix = blockIdx.x*blockDim.x + threadIdx.x;
int iy = blockIdx.y*blockDim.y + threadIdx.y;

int idx = iy*dimx + ix;

int Row = by * SUB_LARGURA + ty;
int Col = bx * SUB_LARGURA + tx;


	for (int m = 0; m < TAM/SUB_LARGURA; m++) {

		ga_s[ty][tx] = ga[Row*TAM + (m*SUB_LARGURA + tx)];
		dz_s[ty][tx] = dz[Row*TAM + (m*SUB_LARGURA + tx)];

		__syncthreads();

		ez[Row*TAM + (m*SUB_LARGURA + tx)] = ga_s[ty][tx] * dz_s[ty][tx];

		__syncthreads();
	}

	__syncthreads(); // not needed at kernel exit

}

Best Regards

Antonio Carlos

You are just doing a component-wise multiplication of two matrices, so no data is reused.
Shared memory cannot help you here; it only adds a penalty, because you have to write each value to shared memory and then read it back again.

Try other applications, for example the matrix multiplication sample in the SDK.
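For reference, here is a minimal sketch of the tiled multiplication that the SDK sample implements (untested here; it assumes square n-by-n matrices with n a multiple of the hypothetical tile width `TILE`). The point is the reuse: each value staged into shared memory is read `TILE` times, which is what pays back the cost of staging it.

```cuda
#define TILE 16

// C = A * B for square n-by-n matrices, n a multiple of TILE.
__global__ void matmul_tiled(const float *A, const float *B, float *C, int n) {
    __shared__ float As[TILE][TILE];
    __shared__ float Bs[TILE][TILE];

    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    float acc = 0.0f;

    for (int m = 0; m < n / TILE; ++m) {
        // Each thread loads ONE element of each tile from global memory...
        As[threadIdx.y][threadIdx.x] = A[row * n + m * TILE + threadIdx.x];
        Bs[threadIdx.y][threadIdx.x] = B[(m * TILE + threadIdx.y) * n + col];
        __syncthreads();

        // ...but each loaded element is then READ TILE times from shared
        // memory. This reuse is exactly what the element-wise product lacks.
        for (int k = 0; k < TILE; ++k)
            acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads();
    }
    C[row * n + col] = acc;
}
```

Compare with your `calc_ez`: there, every value fetched from global memory is used exactly once, so routing it through shared memory can only add latency.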

I don't understand your point about shared memory requiring data reuse.
If I use a register to receive the data from shared memory before sending it to global memory, will the optimization take effect?

The suggested change is shown below:

/* SUB_LARGURA is the tile width of the shared-memory arrays used in the kernel, and TAM is the full width of the arrays allocated in global memory */

__global__ void calc_ez_share(float *ez, float *ga, float *dz, int dimx, int dimy){

__shared__ float ga_s[SUB_LARGURA][SUB_LARGURA];
__shared__ float dz_s[SUB_LARGURA][SUB_LARGURA];

float value; // was int, which would truncate the float product

int bx = blockIdx.x;
int by = blockIdx.y;

int tx = threadIdx.x;
int ty = threadIdx.y;

int ix = blockIdx.x*blockDim.x + threadIdx.x;
int iy = blockIdx.y*blockDim.y + threadIdx.y;

int idx = iy*dimx + ix;

int Row = by * SUB_LARGURA + ty;
int Col = bx * SUB_LARGURA + tx;

for (int m = 0; m <TAM/SUB_LARGURA; m++) {

ga_s[ty][tx] = ga[Row*TAM + (m*SUB_LARGURA + tx)];
dz_s[ty][tx] = dz[Row*TAM + (m*SUB_LARGURA + tx)];

__syncthreads();

/*
Here the data from shared memory is copied into a register (thread scope)
before being sent to global memory.
*/
value=(ga_s[ty][tx] * dz_s[ty][tx]);
__syncthreads();

ez[Row*TAM + (m*SUB_LARGURA + tx)] = value;

}

__syncthreads();

}

Best Regards

Antonio Carlos

Don’t use shared memory at all in this example. There is no benefit from doing so.
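Staging through a register does not help either: the compiler already keeps intermediate values in registers. For this computation, a plain global-memory kernel is all you need. A sketch (your first kernel, with a bounds check added for grids that overhang the array and the unnecessary `__syncthreads()` removed):

```cuda
__global__ void calc_ez(float *ez, const float *ga, const float *dz,
                        int dimx, int dimy) {
    int ix = blockIdx.x * blockDim.x + threadIdx.x;
    int iy = blockIdx.y * blockDim.y + threadIdx.y;
    if (ix < dimx && iy < dimy) {          // guard partial blocks at the edges
        int idx = iy * dimx + ix;
        ez[idx] = ga[idx] * dz[idx];
    }
    // No __syncthreads() needed: the threads never exchange data.
}
```

This kernel is memory-bandwidth bound, so the main thing that matters is that the accesses are coalesced, which they already are.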