I wonder if I can steal some suggestions as well. :-)
I’m trying to implement some very basic element-to-element array functions on the GPU, which is a G92 (8800 GTS 512MB).
for (int i = 0; i < N_ELEMENTS; i++)
    S[i] = A[i] + B[i];
and so on with other operators (arrays are of type “cufftComplex” in my case).
I’m facing extremely heavy performance hits because of uncoalesced memory access in my (admittedly very naive) implementation, to the point where the CPU is much faster.
I made sure to use dimensions that are multiples of 128, and I selected block and grid sizes after test runs, but nonetheless my GPU functions are very, very slow (2 ms for a 1024x1024-element S=A+B sum: that’s 0.5 GFlop/s!).
The problem is, I’m new to parallel programming, so it’s not easy for me to follow the “reduction” example, which was written to address a more complex problem than my basic array functions.
I’m also studying the “matrix multiply” example, but again, being a matrix-level operation (select a submatrix, load it into shared memory, accumulate the cross product, write it back to device memory, and so on), it’s rather more complex than a simple element-level sum, so it’s not straightforward for me to adapt that algorithm to my needs.
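For what it’s worth, here’s roughly what I have in mind as a starting point, a minimal one-thread-per-element kernel (names like addKernel, d_A, d_B, d_S are just my placeholders). My understanding is that if consecutive threads touch consecutive elements like this, the global memory accesses should coalesce, though I may be missing something:

```
#include <cufft.h>  // for cufftComplex

// One thread per element: thread i handles S[i] = A[i] + B[i].
// Consecutive threads in a warp read/write consecutive elements,
// which should give coalesced global memory access.
__global__ void addKernel(const cufftComplex *A, const cufftComplex *B,
                          cufftComplex *S, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {                    // guard for n not a multiple of blockDim
        S[i].x = A[i].x + B[i].x;   // real part
        S[i].y = A[i].y + B[i].y;   // imaginary part
    }
}

// Launch sketch, assuming d_A, d_B, d_S are already on the device:
//   int n = 1024 * 1024;
//   int block = 256;
//   int grid = (n + block - 1) / block;  // round up
//   addKernel<<<grid, block>>>(d_A, d_B, d_S, n);
```

Is something along these lines what I should be doing, or is the per-element struct access itself what kills coalescing?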
Any simple suggestions?
Thanks a lot!