correlation and reduction on a 2D array

I am trying to compare cross-correlation computed with an FFT against a direct windowing method.

My MATLAB code is:

[codebox]isize = 20;   % window size: ffcorr1 is 20x20 (400 elements)
n = 7;        % 7x7 xcorr output

for i = 1:n
    for j = 1:n
        % ref is a 26x26 (676-element) array, ffcorr1 a 20x20 (400-element) array
        xcout(i,j) = sum(sum(ffcorr1 .* ref(i:i+isize-1, j:j+isize-1)));
    end
end[/codebox]

My attempt at an equivalent CUDA kernel:

[codebox]__global__ void xc_corr(double* in_im, double* ref_im, int pix3, int isize, int n, double* out1, double* temp1, double* sum_temp1)
{
    int p = blockIdx.x * blockDim.x + threadIdx.x;
    int q = 0;
    int i = 0;
    int j = 0;
    int summ = 0;

    for(i = 0; i < n; ++i)
    {
        for(j = 0; j < n; ++j)
        {
            summ = 0; //force update
            for(p = 0; p < pix1; ++p)
            {
                for(q = 0; q < pix1; ++q)
                {
                    temp1[((i*n+j)*pix1*pix1)+p*pix1+q] = in_im[p*pix1+q] * ref_im[(p+i)*pix1+(q+j)];
                    sum_temp1[((i*n+j)*pix1*pix1)+p*pix1+q] += temp1[((i*n+j)*pix1*pix1)+p*pix1+q];
                    out1[i*n+j] = sum_temp1[((i*n+j)*pix1*pix1)+p*pix1+q];
                }
            }
        }
    }
}[/codebox]

I have launched this kernel from the host as:

[codebox]int blocksize = 64;                               //multiple of 32
int nblocks = (pix3 + blocksize - 1) / blocksize; //round up; pix3 = 400
xc_corr <<< nblocks, blocksize >>> (ffcorr1, ref_d, pix3, isize, npix, xcout, xc_partial);
cudaThreadSynchronize();[/codebox]

However, when I diff the output file against the MATLAB result, I see that the CUDA kernel computes only the first 400 elements.

What is the correct way to write this kernel?

Thanks in advance.

Any suggestions?

Correct me if I'm wrong, but it appears that you have every thread perform the same (complete) calculation. Although you set int p = blockIdx.x * blockDim.x + threadIdx.x, you then reuse p as a loop counter, so the thread index is immediately overwritten and the kernel invocation (which looks correct) has no influence on the outcome. First check the code outside a kernel and see why it does not produce the result you want; then think of a way to assign only part of the job to each thread, so you actually get the benefits of a CUDA kernel.
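
For example, here is a minimal sketch of one way to split the work: one thread per element of the n x n output map. The names in_w and ref_w (the widths of the 20x20 input and the 26x26 reference) are my own and not in your code, which loops over an undeclared pix1; note also that if ref really is 26x26, its row stride should be 26, not 20.

[codebox]// Sketch only: one thread computes one element of the n x n output.
// in_w and ref_w are assumed widths (20 and 26 here), replacing the
// undeclared pix1 of the original kernel.
__global__ void xc_corr(const double* in_im,  // in_w  x in_w  input (20x20)
                        const double* ref_im, // ref_w x ref_w reference (26x26)
                        int in_w, int ref_w, int n,
                        double* out1)         // n x n output (7x7)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx >= n * n) return;  // keep the thread index; guard the tail

    int i = idx / n;           // row shift into ref_im
    int j = idx % n;           // column shift into ref_im

    double sum = 0.0;
    for (int p = 0; p < in_w; ++p)
        for (int q = 0; q < in_w; ++q)
            sum += in_im[p * in_w + q] * ref_im[(p + i) * ref_w + (q + j)];

    out1[idx] = sum;           // each thread writes exactly one element
}[/codebox]

With only n*n = 49 output elements a single block suffices, e.g. xc_corr<<<1, 64>>>(ffcorr1, ref_d, 20, 26, npix, xcout); for a bigger map the usual (total + blocksize - 1)/blocksize grid calculation applies. A further refinement would be to parallelize the 20x20 inner reduction across threads as well, but this one-thread-per-output version already removes the redundant work.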
