using shared memory

Hello everyone,
I need to cross multiply 2 vector arrays, and I am using one thread for each element in the resulting array.
At the moment I'm just reading the elements directly from global memory, like so:

__global__ void CrossMulArray(cufftComplex *A_d, cufftComplex *B_d, cufftComplex *C_d, int BATCH)
{
    int idx = (blockIdx.y*65535*256) + (blockIdx.x*256) + threadIdx.x;
    if (idx < BATCH*256)
    {
        int idx2 = threadIdx.x;

        C_d[idx].x = -1 * ((A_d[idx2].y * (-1*B_d[idx].y)) - (A_d[idx2].x * B_d[idx].x));
        C_d[idx].y = -1 * ((A_d[idx2].y * B_d[idx].x) + (A_d[idx2].x * (-1*B_d[idx].y)));
    }
}

But since shared memory is so much faster, I was going to load all the A_d and B_d elements into shared memory for each block that I launch.

How would I attempt that?

I tried declaring:

__shared__ float ax = A_d[threadIdx.x].x;
__shared__ float ay = A_d[threadIdx.x].y;
.
.
.

but that gives me an error.

Any suggestions? What am I doing wrong?

Take a look at the CUDA Programming Guide from NVIDIA, page 20, Section 3.2.2.

You have to specify the size of the shared memory at kernel launch time:

...

kernel_launch<<<GRID, BLOCK, SIZE_OF_SHARED_MEM>>>(...);

...

In the kernel, you declare the shared array as follows:

extern __shared__ TYPE NAME[];

You do not need to specify a size for the array, since it is determined by the third parameter of the kernel invocation.

Each block has its own shared mem. After the declaration, you can write your data to the array and access it just like any other array.
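For example, here is a minimal sketch of the whole pattern (the kernel and names are just illustrative, not your actual code). Note also that a __shared__ variable cannot have an initializer in its declaration, which is why your "shared float ax = ..." attempt does not compile:

__global__ void scale_kernel(float *data, int n)
{
    // Dynamically sized shared array; the byte count comes from the launch configuration.
    extern __shared__ float buf[];

    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
    {
        buf[threadIdx.x] = data[i];          // stage the value in shared memory
        data[i] = 2.0f * buf[threadIdx.x];   // read it back like any other array
    }
}

// host side: third launch parameter = bytes of dynamic shared memory per block
// scale_kernel<<<grid, 256, 256 * sizeof(float)>>>(d_data, n);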

Thanks =) That's what I needed to know.

OK, so I have figured out the shared memory, but I think my shared memory version is slower than the original one.

How is that possible?

Here is what I did.

First the global mem version:

. //cpu code
.
.

CrossMulArray <<< dimGrid, 256 >>> (d_A, d_B, d_C, BATCH);

.
.
.

//gpu code

__global__ void CrossMulArray(cufftComplex *A_d, cufftComplex *B_d, cufftComplex *C_d, int BATCH)
{
    int idx = (blockIdx.y*65535*256) + (blockIdx.x*256) + threadIdx.x;
    if (idx < BATCH*256)
    {
        int idx2 = threadIdx.x;

        //cufftComplex thing = cuConjf(B_d[idx]);
        C_d[idx].x = -1*((A_d[idx2].y * (-1*B_d[idx].y)) - (A_d[idx2].x * B_d[idx].x));
        C_d[idx].y = -1*((A_d[idx2].y * B_d[idx].x) + (A_d[idx2].x * (-1*B_d[idx].y)));
    }
}

This one takes roughly 340 ms with 218k arrays of 256 elements.

Now the weird thing is that the shared mem version below took 370-400 ms:

. //cpu code
.
.

unsigned int smem_size = sizeof(float) * 4 * 256;

CrossMulArray_smem <<< dimGrid, 256, smem_size >>> (d_A, d_B, d_C, BATCH);

.
.
.

//gpu code

__global__ void CrossMulArray_smem(cufftComplex *A_d, cufftComplex *B_d, cufftComplex *C_d, int BATCH)
{
    int idx = (blockIdx.y*65535*256) + (blockIdx.x*256) + threadIdx.x;

    if (idx < BATCH*256)
    {
        const unsigned int tid = threadIdx.x;
        const unsigned int loc = 4*tid;

        extern __shared__ float a[];
        a[loc]   = A_d[tid].x;
        a[loc+1] = A_d[tid].y;
        a[loc+2] = B_d[idx].x;
        a[loc+3] = B_d[idx].y;

        C_d[idx].x = -1*((a[loc+1] * (-1*a[loc+3])) - (a[loc] * a[loc+2]));
        C_d[idx].y = -1*((a[loc+1] * a[loc+2]) + (a[loc] * (-1*a[loc+3])));
    }
}

I do not know exactly what the problem is, but one hint:
After you have written the data to shared mem, use __syncthreads() to synchronize all threads.
Otherwise there could be race conditions if one thread reads data while another writes it.
So it's possible that this is what slows down your kernel.
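Applied to your kernel, the barrier would go right after the stores to shared memory. A sketch of just the relevant lines (note that __syncthreads() must be reached by every thread of the block, so strictly speaking it belongs outside the if):

a[loc]   = A_d[tid].x;
a[loc+1] = A_d[tid].y;
a[loc+2] = B_d[idx].x;
a[loc+3] = B_d[idx].y;

__syncthreads();  // all stores to a[] are now visible to the whole block

C_d[idx].x = -1*((a[loc+1] * (-1*a[loc+3])) - (a[loc] * a[loc+2]));
C_d[idx].y = -1*((a[loc+1] * a[loc+2]) + (a[loc] * (-1*a[loc+3])));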

Arg, yeah, I tried putting in a __syncthreads(); but that made no difference…

I also made it read the data twice just to see if it would slow it down even more… but it didn't do anything… so I'm thinking it has to be a different part of the code… but which part… I'm down to the bare minimum of code; I can't eliminate the if or the calculation.

OK, I have found my error… my timer also timed a copy… so I guess that was silly on my side…
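For reference, timing just the kernel with CUDA events keeps the copies out of the measurement. A minimal sketch (assuming the launch from above):

cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start, 0);
CrossMulArray_smem <<< dimGrid, 256, smem_size >>> (d_A, d_B, d_C, BATCH);
cudaEventRecord(stop, 0);
cudaEventSynchronize(stop);              // wait until the kernel has finished

float ms = 0.0f;
cudaEventElapsedTime(&ms, start, stop);  // elapsed time in milliseconds

cudaEventDestroy(start);
cudaEventDestroy(stop);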

Now I'm down to 11.5 ms. I think I might be able to speed that up using constant mem… so now I gotta figure that out…
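Since A_d is only 256 cufftComplex values (2 KB), it would fit easily into the 64 KB of constant memory. A minimal sketch of that idea (untested, and with one caveat: the constant cache only broadcasts when all threads of a warp read the same address, and here each thread reads a different A element, so it may not actually be faster):

// 256 complex values = 2 KB, well under the 64 KB constant memory limit
__constant__ cufftComplex A_c[256];

__global__ void CrossMulArray_cmem(cufftComplex *B_d, cufftComplex *C_d, int BATCH)
{
    int idx = (blockIdx.y*65535*256) + (blockIdx.x*256) + threadIdx.x;
    if (idx < BATCH*256)
    {
        int idx2 = threadIdx.x;
        C_d[idx].x = -1*((A_c[idx2].y * (-1*B_d[idx].y)) - (A_c[idx2].x * B_d[idx].x));
        C_d[idx].y = -1*((A_c[idx2].y * B_d[idx].x) + (A_c[idx2].x * (-1*B_d[idx].y)));
    }
}

// host side: copy A into constant memory once, then launch without the A pointer
// cudaMemcpyToSymbol(A_c, h_A, 256 * sizeof(cufftComplex));
// CrossMulArray_cmem <<< dimGrid, 256 >>> (d_B, d_C, BATCH);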