How to implement shared memory in kernel

Hi, I have the following kernel:

__global__ void kernelDFT(u_char *matrix, float *real, float *imag, int idxI, int idxJ)
{
	int idxM = blockIdx.y * blockDim.y + threadIdx.y;
	int idxN = blockIdx.x * blockDim.x + threadIdx.x;
	float angle;

	while(idxM < d_row)
	{
		while(idxN < d_col)
		{
			angle = (idxI*idxM/(float)d_row + idxJ*idxN/(float)d_col)*2.0f*M_PI;
			real[idxI*d_col+idxJ] += matrix[idxM*d_col+idxN]*cosf( angle );   //access to global memory a lot
			imag[idxI*d_col+idxJ] += matrix[idxM*d_col+idxN]*sinf( angle );   //also here
			idxN += blockDim.x * gridDim.x;
		}
		idxM += blockDim.y * gridDim.y;
		idxN = blockIdx.x * blockDim.x + threadIdx.x;
	}
}

In the lines marked with comments I think I'm reading from and writing to global memory a lot, so my kernel runs slowly. Is there a way to use shared memory instead? I read CUDA by Example, and my idea is to declare two variables as shared (for instance s_real and s_imag), keep doing the operations on those variables, and at the end just write the result to the real and imag matrices. What do you think? Any ideas? I don't know how to put that into code; I'm new to CUDA.

Note that your code as it stands is not even correct. All threads add to the same array elements, real[idxI*d_col+idxJ] and imag[idxI*d_col+idxJ], so atomic additions would be needed.
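As a rough sketch of the minimal fix (not a tuned version): each thread can accumulate its own partial sums in registers and then issue a single atomicAdd per output. The kernel name kernelDFTAtomic and the rows/cols parameters are my own assumptions, standing in for your d_row/d_col. It also assumes real and imag are zeroed before the launch and a GPU that supports atomicAdd on float (compute capability 2.0 or newer).

```cuda
#include <cuda_runtime.h>

// Sketch only: same grid-stride loops as kernelDFT, but the racy "+=" on
// global memory is replaced by per-thread sums plus one atomicAdd each.
// rows/cols are passed as arguments here (assumption; the original uses d_row/d_col).
__global__ void kernelDFTAtomic(const unsigned char *matrix, float *real, float *imag,
                                int idxI, int idxJ, int rows, int cols)
{
    const float twoPi = 6.2831853f;      // float literal, no double arithmetic
    float sumRe = 0.0f, sumIm = 0.0f;    // per-thread partial sums in registers

    for (int m = blockIdx.y * blockDim.y + threadIdx.y; m < rows; m += blockDim.y * gridDim.y) {
        for (int n = blockIdx.x * blockDim.x + threadIdx.x; n < cols; n += blockDim.x * gridDim.x) {
            float angle = ((float)(idxI * m) / rows + (float)(idxJ * n) / cols) * twoPi;
            float v = matrix[m * cols + n];
            sumRe += v * cosf(angle);
            sumIm += v * sinf(angle);
        }
    }

    // One atomic update per thread instead of one unsynchronized write per element
    atomicAdd(&real[idxI * cols + idxJ], sumRe);
    atomicAdd(&imag[idxI * cols + idxJ], sumIm);
}
```

This is correct but still serializes every thread on the same two addresses, so it is a starting point rather than the fast version.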

The structure of your problem is basically a reduction (with the added twist that each element gets rotated by angle before the summation). So you can look at the reduction example from the SDK.
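To make the s_real/s_imag idea concrete, here is a hedged sketch of a block-level reduction in the spirit of the SDK example, not a copy of it. The names kernelDFTShared and BLOCK_THREADS and the rows/cols parameters are assumptions of mine. It expects a 1D launch with exactly BLOCK_THREADS threads per block (a power of two) and real/imag zeroed beforehand.

```cuda
#include <cuda_runtime.h>

#define BLOCK_THREADS 256   // hypothetical block size; must be a power of two

// Sketch only: computes one output coefficient (idxI, idxJ).
// Launch as kernelDFTShared<<<numBlocks, BLOCK_THREADS>>>(...).
__global__ void kernelDFTShared(const unsigned char *matrix, float *real, float *imag,
                                int idxI, int idxJ, int rows, int cols)
{
    __shared__ float s_real[BLOCK_THREADS];
    __shared__ float s_imag[BLOCK_THREADS];

    const float twoPi = 6.2831853f;
    float sumRe = 0.0f, sumIm = 0.0f;

    // Grid-stride loop over the flattened matrix; partial sums stay in registers
    for (int k = blockIdx.x * blockDim.x + threadIdx.x; k < rows * cols;
         k += blockDim.x * gridDim.x) {
        int m = k / cols;
        int n = k % cols;
        float angle = ((float)(idxI * m) / rows + (float)(idxJ * n) / cols) * twoPi;
        float v = matrix[k];
        sumRe += v * cosf(angle);
        sumIm += v * sinf(angle);
    }

    s_real[threadIdx.x] = sumRe;
    s_imag[threadIdx.x] = sumIm;
    __syncthreads();

    // Tree reduction in shared memory: halve the number of active threads each step
    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (threadIdx.x < stride) {
            s_real[threadIdx.x] += s_real[threadIdx.x + stride];
            s_imag[threadIdx.x] += s_imag[threadIdx.x + stride];
        }
        __syncthreads();
    }

    // Only one thread per block touches global memory
    if (threadIdx.x == 0) {
        atomicAdd(&real[idxI * cols + idxJ], s_real[0]);
        atomicAdd(&imag[idxI * cols + idxJ], s_imag[0]);
    }
}
```

With this layout the global writes drop to two atomics per block. The SDK reduction goes further (unrolling the last warp, for example), which I have left out to keep the sketch short.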

M_PI is a double constant, so using it also introduces unwanted double-precision operations into your kernel; better use 3.1415926f instead.

And as I am already giving optimization hints: compile with -use_fast_math, or compute 1.0f/d_row and 1.0f/d_col on the CPU so you can replace the expensive division with multiplication by the inverse.
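A small sketch of the last two hints, assuming the reciprocals are computed once on the host and passed in as kernel arguments. The names dftAngle, angleKernel, launchAngles, invRows and invCols are hypothetical and not part of the original code. The kernel only writes out the angles, so the arithmetic can be checked in isolation.

```cuda
#include <cuda_runtime.h>

// Sketch only: the angle computed with a float pi literal and with
// multiplication by precomputed reciprocals instead of division.
__device__ __forceinline__ float dftAngle(int idxI, int idxJ, int m, int n,
                                          float invRows, float invCols)
{
    const float twoPi = 2.0f * 3.1415926f;   // float literal: no promotion to double
    return ((idxI * m) * invRows + (idxJ * n) * invCols) * twoPi;
}

// Writes the angle for every (m, n) so the result can be inspected on the host
__global__ void angleKernel(float *out, int idxI, int idxJ, int rows, int cols,
                            float invRows, float invCols)
{
    int n = blockIdx.x * blockDim.x + threadIdx.x;
    int m = blockIdx.y * blockDim.y + threadIdx.y;
    if (m < rows && n < cols)
        out[m * cols + n] = dftAngle(idxI, idxJ, m, n, invRows, invCols);
}

// Host side: the two divisions happen once on the CPU, not once per element on the GPU
void launchAngles(float *d_out, int idxI, int idxJ, int rows, int cols)
{
    float invRows = 1.0f / rows;
    float invCols = 1.0f / cols;
    dim3 block(16, 16);
    dim3 grid((cols + block.x - 1) / block.x, (rows + block.y - 1) / block.y);
    angleKernel<<<grid, block>>>(d_out, idxI, idxJ, rows, cols, invRows, invCols);
}
```

Multiplying by the reciprocal can round slightly differently from a true division, which is normally harmless for a DFT angle but worth knowing if you compare against a CPU reference.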