Scalar product module

Uliveto · October 2, 2007, 2:11pm

Hi,

I am developing an application in which I adapt the scalar product sample to my needs.
Since I only need a scalar product, it shouldn’t be that difficult: all I should need to do is
change the size of the parameters allocated in GPU memory (is that so?).
The thing is that the results I obtain are incorrect and make no sense.
The program works in the sense that if, inside the kernel, I assign an arbitrary value to the returned variable, the value received in main is correct.
So the problem is not in passing data from the kernel to main, nor in the
opposite direction (I’m pretty sure about that). The problem is in the calculation of
the scalar product.
I didn’t change a line of it, so I don’t understand why it doesn’t work…
What are the things I should pay attention to?
And what’s the use of the tree-like reduction cycles at the end of the kernel code?

Thanks a lot

sicb0161 · October 8, 2007, 12:20pm

I think you do not need to change the grid and block size, only the number of input vectors and the vector lengths, which can be set arbitrarily. However, the vector length should be a multiple of the warp size. Otherwise performance degrades (under-populated warps) and you are not using all computing resources (see Programming Guide, section 5.2).
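If your vector length is not already a multiple of the warp size, one option is to round it up and fill the extra elements with zeros, since zero terms add nothing to a scalar product. A minimal sketch in plain C; the name padded_length and the WARP_SIZE constant are my own, and 32 is assumed as the warp size:

```c
#include <assert.h>
#include <stddef.h>

#define WARP_SIZE 32  /* assumed warp size; check it for your device */

/* Hypothetical helper: smallest multiple of `multiple` that is >= n. */
size_t padded_length(size_t n, size_t multiple)
{
    assert(multiple > 0);
    return (n + multiple - 1) / multiple * multiple;
}
```

You would then allocate padded_length(n, WARP_SIZE) elements for each vector and set the elements from n onward to 0 before copying to the GPU.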
The tree-like reduction is used to sum up the products in parallel. Instead of the tree-like reduction you could also use sequential code (not a nice solution):

sum = 0;
for (i = 0; i < vec_length; ++i)
    sum += multiplied_data[i];
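
For comparison, here is a rough sketch of what a tree-like reduction does, written as plain C so it can be run without a GPU. It is not the code from the sample, only an emulation of the access pattern; the function name tree_reduce is made up, and it assumes the number of elements is a power of two:

```c
#include <assert.h>
#include <stddef.h>

/* Sums data[0..n-1] in place; the result ends up in data[0].
 * Assumption: n is a power of two. */
float tree_reduce(float *data, size_t n)
{
    assert(n > 0 && (n & (n - 1)) == 0);  /* power-of-two check */

    /* Halve the active range on every step: n/2, n/4, ..., 1. */
    for (size_t stride = n / 2; stride > 0; stride /= 2) {
        /* In a kernel, each i would be handled by a different thread,
         * with a __syncthreads() barrier between strides. */
        for (size_t i = 0; i < stride; ++i)
            data[i] += data[i + stride];
    }
    return data[0];
}
```

The inner loop is the part that runs in parallel on the GPU, so the sum takes about log2(n) steps instead of n. If the number of partial sums in your adapted code is no longer a power of two, a reduction written this way will skip or double-count elements, which could be one thing to check.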