Convert MATLAB array multiplication and sum function to a CUDA equivalent

I am trying to convert this MATLAB code into its CUDA equivalent:

[codebox]isize = 20;
n = 7;
for i = 1:n   %% 7x7 xcorr
    for j = 1:n
        xcout(i,j) = sum(sum(ffcorr1 .* ref(i:i+isize-1,j:j+isize-1)));   %% ref is a 676-element (26x26) array, ffcorr1 is a 400-element (20x20) array
    end
end[/codebox]

Can someone point me to how sum(sum(array)) can be implemented in CUDA? What is the correct way to write this kernel in CUDA?

sum(sum(array)) should just be a simple reduction.

You can do it yourself by using the code in the Parallel Reduction example @

http://developer.download.nvidia.com/compu…Algorithms.html

Or you can use CUDPP @

http://code.google.com/p/cudpp/

An example of calculating the sum of an array can be found @

http://cudpp.googlecode.com/svn/tags/1.1.1…implecudpp.html

Or you could use Thrust @

http://code.google.com/p/thrust/

The second example on the main page will show you how to do it.


I am new to Thrust and, after going through the examples, tried this:

for(i = 0; i < npix*npix; ++i)
{
    thrust::transform(ref_d.begin()+i, ref_d.end()+i, ffcorr1.begin(), vec_pp.begin(), thrust::multiplies<double>());
}

for(i = 0; i < npix*npix; ++i)
{
    vec_sum[i] = thrust::reduce(vec_pp.begin()+i, vec_pp.end()+i);
}

But this does not produce correct results. Only the i = 0 case gives the right answer. Can the thrust::transform call be modified to model this behavior?


Is the result after the multiplication wrong as well? I think the problem might be the indexing in your second loop (note that thrust::reduce takes an exclusive end iterator). Can you try:

for(i = 0; i < npix; ++i)
{
    vec_sum[i] = thrust::reduce(vec_pp.begin()+i*npix, vec_pp.begin()+(i+1)*npix);
}


The answers after the multiplication are wrong.

As I understand it, I need to multiply 400 values of ref_d with the 400 values of ffcorr1. But ref_d has 676 values, so for each multiplication I want to shift the window of ref_d values: 0 to 400, then 7 to 407, and so on.

Should I be using a permutation_iterator?


Hello,

With Jacket (http://www.accelereyes.com), you can get a CUDA version of this directly in M, as follows:

[codebox]
isize = 20;
n = 7;
ffcorr1 = gdouble(ffcorr1); ref = gdouble(ref);   % only need to add one line of code
for i = 1:n   %% 7x7 xcorr
    for j = 1:n
        xcout(i,j) = sum(sum(ffcorr1 .* ref(i:i+isize-1,j:j+isize-1)));   %% ref is a 676-element array and ffcorr1 is a 400-element array
    end
end
[/codebox]

You only need to add one line of code and you’re done. You might also play with GFOR (http://wiki.accelereyes.com/wiki/index.php/GFOR_Usage) to accelerate and auto-vectorize the inner loop. We’re happy to help you if you have any questions: support@accelereyes.com

Best,

John


MATLAB is not installed on the machine with the GPU, so I will have to rely on Thrust or CUDPP to implement this.
