Hi, I have the following kernel:

```
__global__ void kernelDFT(u_char *matrix, float *real, float *imag, int idxI, int idxJ)
{
    int idxM = blockIdx.y * blockDim.y + threadIdx.y;
    int idxN = blockIdx.x * blockDim.x + threadIdx.x;
    float angle;
    while (idxM < d_row)
    {
        while (idxN < d_col)
        {
            angle = (idxI*idxM/(float)d_row + idxJ*idxN/(float)d_col)*2.0f*M_PI;
            real[idxI*d_col+idxJ] += matrix[idxM*d_col+idxN]*cosf( angle ); //access to global memory a lot
            imag[idxI*d_col+idxJ] += matrix[idxM*d_col+idxN]*sinf( angle ); //also here
            idxN += blockDim.x * gridDim.x;
        }
        idxM += blockDim.y * gridDim.y;
        idxN = blockIdx.x * blockDim.x + threadIdx.x;
    }
}
```

I think the two commented lines read from and write to global memory a lot, which slows my kernel down. Is there a way to use shared memory here? I read CUDA by Example, and my idea is to declare two variables as shared (for instance s_real and s_imag), do the accumulation on those, and only at the end write the result to the real and imag matrices. What do you think? Any ideas? I don't know how to put that into code; I'm new to CUDA.
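For concreteness, the idea could be sketched roughly as follows. This is only a sketch under several assumptions, not a known-good solution: the name `kernelDFTShared` and the `BLOCK_X`/`BLOCK_Y` macros are made up for illustration, `d_row` and `d_col` are assumed to be `__constant__` ints, the block is assumed to be 16x16 (the thread count must be a power of two for this reduction), `real` and `imag` are assumed to be zeroed before the launch, and `atomicAdd` on `float` needs compute capability 2.0 or newer. Each thread sums into registers, the block combines those sums in shared memory, and one thread per block adds the block total to global memory.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Assumption: the dimensions live in constant memory, as the original kernel implies.
__constant__ int d_row;
__constant__ int d_col;

// Hypothetical block shape; BLOCK_X * BLOCK_Y must be a power of two.
#define BLOCK_X 16
#define BLOCK_Y 16

__global__ void kernelDFTShared(const unsigned char *matrix, float *real,
                                float *imag, int idxI, int idxJ)
{
    __shared__ float s_real[BLOCK_X * BLOCK_Y];
    __shared__ float s_imag[BLOCK_X * BLOCK_Y];

    const float TWO_PI = 6.283185307f;
    int tid = threadIdx.y * blockDim.x + threadIdx.x;

    // Step 1: each thread accumulates its partial sums in registers.
    float sumReal = 0.0f, sumImag = 0.0f;
    for (int m = blockIdx.y * blockDim.y + threadIdx.y; m < d_row;
         m += blockDim.y * gridDim.y) {
        for (int n = blockIdx.x * blockDim.x + threadIdx.x; n < d_col;
             n += blockDim.x * gridDim.x) {
            float angle = (idxI * m / (float)d_row + idxJ * n / (float)d_col) * TWO_PI;
            float pixel = matrix[m * d_col + n];
            sumReal += pixel * cosf(angle);
            sumImag += pixel * sinf(angle);
        }
    }

    // Step 2: publish the per-thread sums to shared memory.
    s_real[tid] = sumReal;
    s_imag[tid] = sumImag;
    __syncthreads();

    // Step 3: tree reduction within the block.
    for (int stride = (blockDim.x * blockDim.y) / 2; stride > 0; stride >>= 1) {
        if (tid < stride) {
            s_real[tid] += s_real[tid + stride];
            s_imag[tid] += s_imag[tid + stride];
        }
        __syncthreads();
    }

    // Step 4: one thread per block adds the block total to global memory.
    if (tid == 0) {
        atomicAdd(&real[idxI * d_col + idxJ], s_real[0]);
        atomicAdd(&imag[idxI * d_col + idxJ], s_imag[0]);
    }
}

int main()
{
    const int rows = 32, cols = 32;            // small test size, chosen arbitrarily
    unsigned char h_matrix[rows * cols];
    float cpuSum = 0.0f;
    for (int i = 0; i < rows * cols; ++i) {
        h_matrix[i] = (unsigned char)(i % 251);
        cpuSum += h_matrix[i];                 // for idxI = idxJ = 0 the angle is always 0
    }

    unsigned char *dev_matrix;
    float *dev_real, *dev_imag;
    cudaMalloc(&dev_matrix, sizeof(h_matrix));
    cudaMalloc(&dev_real, rows * cols * sizeof(float));
    cudaMalloc(&dev_imag, rows * cols * sizeof(float));
    cudaMemcpy(dev_matrix, h_matrix, sizeof(h_matrix), cudaMemcpyHostToDevice);
    cudaMemset(dev_real, 0, rows * cols * sizeof(float));   // outputs must start at zero
    cudaMemset(dev_imag, 0, rows * cols * sizeof(float));
    cudaMemcpyToSymbol(d_row, &rows, sizeof(int));
    cudaMemcpyToSymbol(d_col, &cols, sizeof(int));

    dim3 block(BLOCK_X, BLOCK_Y), grid(2, 2);
    kernelDFTShared<<<grid, block>>>(dev_matrix, dev_real, dev_imag, 0, 0);

    float gpuSum = 0.0f;
    cudaMemcpy(&gpuSum, dev_real, sizeof(float), cudaMemcpyDeviceToHost);
    printf("GPU real[0] = %f, CPU reference = %f\n", gpuSum, cpuSum);

    cudaFree(dev_matrix); cudaFree(dev_real); cudaFree(dev_imag);
    return 0;
}
```

With this layout, each block touches the output element once per launch instead of once per pixel. It also avoids having many threads do an unsynchronized `+=` on the same global address, which is a data race in the kernel as posted.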