Hello everyone,
I need to cross multiply 2 vector arrays, and I am using one thread for each element in the resulting array.
At the moment I'm just reading the elements directly from global memory, like so:
__global__ void CrossMulArray(cufftComplex *A_d, cufftComplex *B_d, cufftComplex *C_d, int BATCH)
{
    int idx = (blockIdx.y*65535*256)+(blockIdx.x*256)+threadIdx.x;
    if(idx<BATCH*256)
    {
        int idx2 = threadIdx.x;
I don't know exactly what the problem is, but one hint:
After you write the data to shared memory, call __syncthreads() to synchronize all threads in the block.
Otherwise there can be a race condition, where one thread reads data while another is still writing it.
So it's possible that this is what slows down your kernel.
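To illustrate the hint about __syncthreads(), here is a minimal sketch of the write-then-barrier-then-read pattern. It is not the kernel from this thread: ReverseTiles and RunReverseTiles are hypothetical names, float2 stands in for cufftComplex (which is a float2 typedef), and the element count is assumed to be a multiple of 256. Each thread reads an element that a different thread wrote, which is the case where the barrier matters.

```cuda
#include <vector>
#include <cuda_runtime.h>

#define TILE 256  // threads per block, matching the 256 used in the thread

// Hypothetical kernel: reverses each 256-element tile through shared memory.
__global__ void ReverseTiles(const float2 *in, float2 *out, int n)
{
    __shared__ float2 tile[TILE];
    int t = threadIdx.x;
    int idx = blockIdx.x * TILE + t;

    // Each thread writes exactly one element into shared memory.
    if (idx < n) tile[t] = in[idx];

    // Barrier for the whole block. It sits outside the if so that
    // every thread in the block executes it.
    __syncthreads();

    // Read an element that was written by a different thread.
    if (idx < n) out[idx] = tile[TILE - 1 - t];
}

// Hypothetical host helper: copies data over, launches, copies back.
// Assumes h_in.size() is a non-zero multiple of TILE.
bool RunReverseTiles(const std::vector<float2> &h_in, std::vector<float2> &h_out)
{
    int n = (int)h_in.size();
    if (n == 0 || n % TILE != 0) return false;
    h_out.resize(n);

    size_t bytes = (size_t)n * sizeof(float2);
    float2 *d_in = nullptr, *d_out = nullptr;
    if (cudaMalloc(&d_in, bytes) != cudaSuccess) return false;
    if (cudaMalloc(&d_out, bytes) != cudaSuccess) { cudaFree(d_in); return false; }

    cudaMemcpy(d_in, h_in.data(), bytes, cudaMemcpyHostToDevice);
    ReverseTiles<<<n / TILE, TILE>>>(d_in, d_out, n);
    cudaError_t launchErr = cudaGetLastError();
    cudaError_t copyErr = cudaMemcpy(h_out.data(), d_out, bytes, cudaMemcpyDeviceToHost);

    cudaFree(d_in);
    cudaFree(d_out);
    return launchErr == cudaSuccess && copyErr == cudaSuccess;
}
```

If every thread only ever reads back the element it wrote itself, the barrier is not needed for correctness, and staging through shared memory is unlikely to help on its own.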
Argh, yeah, I tried putting in a __syncthreads(); but that made no difference…
I also made it read the data twice just to see if that would slow it down even more, but it didn't do anything, so I'm thinking it has to be a different part of the code. But which part? I'm down to the bare minimum of code; I can't eliminate the if or the calculation.
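One way to narrow down which part costs time is to time the kernel launch on its own with CUDA events. The sketch below assumes a stand-in kernel: DummyKernel and TimeDummyKernel are hypothetical names, not code from this thread, and the real kernel would be swapped in at the launch line.

```cuda
#include <cuda_runtime.h>

// Hypothetical stand-in kernel; replace with the kernel under test.
__global__ void DummyKernel(float *data, int n)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) data[idx] = data[idx] * 2.0f;
}

// Returns the GPU time of one launch in milliseconds, or -1.0f on error.
float TimeDummyKernel(int n)
{
    float *d_data = nullptr;
    if (cudaMalloc(&d_data, (size_t)n * sizeof(float)) != cudaSuccess) return -1.0f;
    cudaMemset(d_data, 0, (size_t)n * sizeof(float));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    int threads = 256;
    int blocks = (n + threads - 1) / threads;

    // Events are recorded in the same (default) stream as the kernel,
    // so they bracket only the kernel's execution on the GPU.
    cudaEventRecord(start);
    DummyKernel<<<blocks, threads>>>(d_data, n);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);  // wait until the kernel and stop event finish

    float ms = -1.0f;
    if (cudaGetLastError() == cudaSuccess)
        cudaEventElapsedTime(&ms, start, stop);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d_data);
    return ms;
}
```

Timing the host-to-device and device-to-host copies separately in the same way would show whether the kernel itself or the transfers dominate.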