Nice thread!

I wonder if I can steal some suggestions as well. :-)

I’m trying to implement some very basic element-wise array operations on the GPU, which is a G92 (8800 GTS 512MB).

Something like:

```
for (int i = 0; i < N_ELEMENTS; i++)
    S[i] = A[i] + B[i];
```

and so on for other operators (the arrays are of type “cufftComplex” in my case).

I’m facing *extremely heavy* performance hits due to uncoalesced memory accesses in my (admittedly very naive) implementation, to the point where the CPU is much faster.

I made sure to use dimensions that are multiples of 128, and I selected block and grid sizes after test runs, but my GPU functions are nonetheless very slow (2 ms for a 1024x1024-element S = A + B sum — that’s 0.5 GFlop/s!).
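To make the question concrete, here is a minimal sketch of the kind of kernel I’m trying to write — illustrative only, not my exact code, and the names are made up:

```
#include <cufft.h>

// One thread per element; consecutive threads read/write
// consecutive cufftComplex values (8 bytes each).
__global__ void addKernel(cufftComplex *S, const cufftComplex *A,
                          const cufftComplex *B, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        // cufftComplex has no operator+, so add real and
        // imaginary parts separately.
        S[i].x = A[i].x + B[i].x;
        S[i].y = A[i].y + B[i].y;
    }
}
```

Whatever my actual code is doing with indexing must be worse than this, given the profiler reports uncoalesced accesses.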

Problem is, I’m new to parallel programming, so it’s not easy for me to follow the “reduction” example, which was written to address a more complex problem than my basic array functions.

I’m also studying the “matrix multiply” example, but being a matrix-level operation (select a submatrix, load it into shared memory, accumulate the cross product, write it back to device memory, and so on), it’s quite a bit more complex than a simple element-level sum, so adapting that algorithm to my needs isn’t straightforward either.

Any simple suggestion?

Thanks a lot!

Fernando