Binning with a kernel (newbie question about implementing a loop as a kernel)

Hi All,

I have a Complex array (batchsize*nn*sizeof(Complex) bytes), and I would like to bin its squared magnitudes into a float array (nn*sizeof(float) bytes).

If I do this with host variables, it's simple:

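typedef float2 Complex;

// Accumulate the squared magnitude of sample j into bin (j mod nn).
void ComplexPufferToReal(Complex *h_signal, float *h_puffer, int batchsize, int nn)
{
    int k = 0, j;

    for (j = 0; j < nn * batchsize; j++)
    {
        h_puffer[k] += (h_signal[j].x * h_signal[j].x) + (h_signal[j].y * h_signal[j].y);
        k++;                    // move on to the next bin
        if (k > nn - 1) k = 0;  // wrap around at the end of each batch
    }
}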

This one works just fine on the host. But I would like to do the binning on the device, since I have to do it many times in a loop after a CUFFT call.

I tried to turn the above function into a kernel so that it works on device variables (allocated with cudaMalloc and zeroed with cudaMemset):

__global__ void ComplexPufferToReal(Complex *d_signal, float *d_puffer, int batchsize, int nn)
{
    int k = 0;
    int j = blockIdx.x*blockDim.x + threadIdx.x;

    if (j < batchsize*nn)
    {
        d_puffer[k] += (d_signal[j].x * d_signal[j].x) + (d_signal[j].y * d_signal[j].y);
        k++;
        if (k > nn-1) k = 0;
    }
}

I call it like this, for example:

int batchsize = 1000; 

int nn = 512; 

int Nofmaxthreads = 512;

int nBlocks = (batchsize*nn)/Nofmaxthreads + ((batchsize*nn)%Nofmaxthreads == 0 ? 0 : 1);

ComplexPufferToReal<<<nBlocks, Nofmaxthreads>>>(d_signal, d_puffer, batchsize, nn);

I think I mixed something up: this does not give the same result as the host version above. Could somebody please help me with how to do the simple loop above on device variables in a kernel? (I have a GeForce 8600 GT, WinXP, VS2003 + nvcc.)

Thanks very much in advance :)



In the kernel, k is a per-thread local variable initialized to 0, so its value is 0 for every write to d_puffer: all threads accumulate into d_puffer[0].
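Simply replacing k with j % nn would probably still not be enough, because batchsize different threads would then add into the same d_puffer element at the same time, and += on global memory is not atomic. Here is a minimal sketch of one way around that, not necessarily the best one: launch one thread per bin and let each thread loop over the batches itself. The names BinSquaredMagnitudes and BinOnDevice and the block size of 256 are my own assumptions for illustration, not anything from the code above.

```cuda
#include <cassert>
#include <cuda_runtime.h>

typedef float2 Complex;

// Hypothetical kernel: one thread per output bin k. Only thread k ever
// touches d_puffer[k], so no two threads write to the same element.
__global__ void BinSquaredMagnitudes(const Complex *d_signal, float *d_puffer,
                                     int batchsize, int nn)
{
    int k = blockIdx.x * blockDim.x + threadIdx.x;
    if (k >= nn) return;

    float sum = 0.0f;
    for (int b = 0; b < batchsize; ++b)
    {
        Complex c = d_signal[b * nn + k];   // sample k of batch b
        sum += c.x * c.x + c.y * c.y;
    }
    d_puffer[k] += sum;                     // accumulate onto existing contents
}

// Hypothetical host wrapper: copies the data to the device, runs the kernel
// and copies the bins back. h_puffer is read first, so earlier sums are kept.
cudaError_t BinOnDevice(const Complex *h_signal, float *h_puffer,
                        int batchsize, int nn)
{
    assert(batchsize > 0 && nn > 0);

    size_t signalBytes = (size_t)batchsize * nn * sizeof(Complex);
    size_t pufferBytes = (size_t)nn * sizeof(float);
    Complex *d_signal = NULL;
    float *d_puffer = NULL;

    cudaError_t err = cudaMalloc((void **)&d_signal, signalBytes);
    if (err != cudaSuccess) return err;
    err = cudaMalloc((void **)&d_puffer, pufferBytes);
    if (err != cudaSuccess) { cudaFree(d_signal); return err; }

    cudaMemcpy(d_signal, h_signal, signalBytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_puffer, h_puffer, pufferBytes, cudaMemcpyHostToDevice);

    int threads = 256;                          // assumed block size
    int blocks = (nn + threads - 1) / threads;  // round up to cover all bins
    BinSquaredMagnitudes<<<blocks, threads>>>(d_signal, d_puffer, batchsize, nn);

    // This blocking copy also waits for the kernel to finish.
    err = cudaMemcpy(h_puffer, d_puffer, pufferBytes, cudaMemcpyDeviceToHost);

    cudaFree(d_signal);
    cudaFree(d_puffer);
    return err;
}
```

In the real loop after the FFT, d_signal and d_puffer would already live on the device, so only the kernel launch would be needed. With nn = 512 this uses just 512 threads, which may leave the GPU underused; it trades parallelism for not needing atomics. Neighbouring threads read neighbouring elements of d_signal, which should keep the memory accesses coalesced.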

Oops, I made a really, really primitive error :haha:

Thanks very much for the answer! :)