Shared memory optimization fails: major slowdown using shared memory instead of global memory

Good Morning,

I am a newbie when it comes to CUDA, and I have a small question about optimization using shared memory.
I wrote two kernels, the first using global memory and the second using shared memory, but the first is much faster than the second.
Does this happen because I transfer the data between global and shared memory without first loading the data into registers?

The code of both kernels follows below.

Can somebody help me?

/* First kernel */

__global__ void calc_ez(float *ez, float *ga, float *dz, int dimx, int dimy){

int ix = blockIdx.x*blockDim.x + threadIdx.x;
int iy = blockIdx.y*blockDim.y + threadIdx.y;
int idx = iy*dimx + ix;

    ez[idx] = ga[idx]*dz[idx];

    __syncthreads(); // note: not needed here, the threads share no data

}

/* Second kernel */

/* SUB_LARGURA is the tile width of the shared-memory arrays used in the kernel, and TAM is the full width of the arrays allocated in global memory */

__global__ void calc_ez_share(float *ez, float *ga, float *dz, int dimx, int dimy){

__shared__ float ga_s[SUB_LARGURA][SUB_LARGURA];
__shared__ float dz_s[SUB_LARGURA][SUB_LARGURA];



int bx  = blockIdx.x;  
int by = blockIdx.y;

int tx  = threadIdx.x; 
int ty = threadIdx.y;

int ix = blockIdx.x*blockDim.x + threadIdx.x;
int iy = blockIdx.y*blockDim.y + threadIdx.y;

int idx = iy*dimx + ix;

int Row = by * SUB_LARGURA + ty;
int Col = bx * SUB_LARGURA + tx;


	for (int m = 0; m < TAM/SUB_LARGURA; m++) {

		ga_s[ty][tx] = ga[Row*TAM + (m*SUB_LARGURA + tx)];
		dz_s[ty][tx] = dz[Row*TAM + (m*SUB_LARGURA + tx)];

		__syncthreads();

		ez[Row*TAM + (m*SUB_LARGURA + tx)] = ga_s[ty][tx] * dz_s[ty][tx];

		__syncthreads();
	}

	__syncthreads(); // not needed at kernel exit

}

Best Regards

Antonio Carlos

You are just doing a component-wise multiplication of two matrices, so no data is reused.
Shared memory cannot help you here; it only adds a penalty, because you have to write each value to shared memory and then read it back again.

Try other applications, for example the matrix multiplication sample in the SDK.
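For reference, here is a minimal sketch of the tiled multiplication that the SDK sample implements (untested here; it assumes square n-by-n matrices with n a multiple of the hypothetical tile width `TILE`). The point is the reuse: each value staged into shared memory is read `TILE` times, which is what pays back the cost of staging it.

```cuda
#define TILE 16

// C = A * B for square n-by-n matrices, n a multiple of TILE.
__global__ void matmul_tiled(const float *A, const float *B, float *C, int n) {
    __shared__ float As[TILE][TILE];
    __shared__ float Bs[TILE][TILE];

    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    float acc = 0.0f;

    for (int m = 0; m < n / TILE; ++m) {
        // Each thread loads ONE element of each tile from global memory...
        As[threadIdx.y][threadIdx.x] = A[row * n + m * TILE + threadIdx.x];
        Bs[threadIdx.y][threadIdx.x] = B[(m * TILE + threadIdx.y) * n + col];
        __syncthreads();

        // ...but each loaded element is then READ TILE times from shared
        // memory. This reuse is exactly what the element-wise product lacks.
        for (int k = 0; k < TILE; ++k)
            acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads();
    }
    C[row * n + col] = acc;
}
```

Compare with your `calc_ez`: there, every value fetched from global memory is used exactly once, so routing it through shared memory can only add latency.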

I don't understand your point about shared memory requiring data reuse.
If I use a register to receive the data from shared memory before sending it to global memory, will the optimization take effect?

The suggested change is shown below:

/* SUB_LARGURA is the tile width of the shared-memory arrays used in the kernel, and TAM is the full width of the arrays allocated in global memory */

__global__ void calc_ez_share(float *ez, float *ga, float *dz, int dimx, int dimy){

__shared__ float ga_s[SUB_LARGURA][SUB_LARGURA];
__shared__ float dz_s[SUB_LARGURA][SUB_LARGURA];

float value; // was int, which would truncate the float product

int bx = blockIdx.x;
int by = blockIdx.y;

int tx = threadIdx.x;
int ty = threadIdx.y;

int ix = blockIdx.x*blockDim.x + threadIdx.x;
int iy = blockIdx.y*blockDim.y + threadIdx.y;

int idx = iy*dimx + ix;

int Row = by * SUB_LARGURA + ty;
int Col = bx * SUB_LARGURA + tx;

for (int m = 0; m <TAM/SUB_LARGURA; m++) {

ga_s[ty][tx] = ga[Row*TAM + (m*SUB_LARGURA + tx)];
dz_s[ty][tx] = dz[Row*TAM + (m*SUB_LARGURA + tx)];

__syncthreads();

/*
Here the data from shared memory is copied into a register (thread scope)
before being sent to global memory.
*/
value=(ga_s[ty][tx] * dz_s[ty][tx]);
__syncthreads();

ez[Row*TAM + (m*SUB_LARGURA + tx)] = value;

}

__syncthreads();

}

Best Regards

Antonio Carlos

Don’t use shared memory at all in this example. There is no benefit from doing so.
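Staging through a register does not help either: the compiler already keeps intermediate values in registers. For this computation, a plain global-memory kernel is all you need. A sketch (your first kernel, with a bounds check added for grids that overhang the array and the unnecessary `__syncthreads()` removed):

```cuda
__global__ void calc_ez(float *ez, const float *ga, const float *dz,
                        int dimx, int dimy) {
    int ix = blockIdx.x * blockDim.x + threadIdx.x;
    int iy = blockIdx.y * blockDim.y + threadIdx.y;
    if (ix < dimx && iy < dimy) {          // guard partial blocks at the edges
        int idx = iy * dimx + ix;
        ez[idx] = ga[idx] * dz[idx];
    }
    // No __syncthreads() needed: the threads never exchange data.
}
```

This kernel is memory-bandwidth bound, so the main thing that matters is that the accesses are coalesced, which they already are.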