Hi there,

I’m new to CUDA, and I hope I’m missing something simple to fix.

The code I’m writing sums up contributions from a huge number of point scatterers (sources) placed on the surface of a geometric shape. I’ve parallelized the code by dividing the set of sources into subsets of arbitrary size.

[A greater subset size means longer kernel runtime and eventually a kernel timeout; the total number of sources can be very large, ranging from 1e3 to 1e12.]

Each thread works on a different subset of the geometry by looping over all its sources.

For each source, a fairly complicated procedure (>200 lines of code) calculates a single float value and an index; so, speaking serially, **result[index] = result[index] + value** for each source of each sub-geometry.
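Schematically, the serial reference does something like this (compute_contribution() is a hypothetical stand-in for the >200-line procedure):

[codebox]/* Serial reference, schematically. compute_contribution() is a hypothetical
   placeholder for the complicated per-source procedure. */
for (long s = 0; s < num_sources; ++s) {
    int   index;
    float value;
    compute_contribution(s, parameters, &index, &value);
    result[index] = result[index] + value;
}[/codebox]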

In pseudocode my kernel looks like

[codebox]__global__ void mykernel(float *result, float *parameters)
{
    int CONTINUE = 1;

    while (CONTINUE) {
        #include "some_complicated_calculations" /* yields INDEX, VALUE, CONTINUE */
        result[INDEX] = result[INDEX] + VALUE;
    }

    return;
}[/codebox]

Now I’ve got two problems with this code:

The first problem is correctness: I’m unsure whether the line **result[INDEX] = result[INDEX] + VALUE;** will yield the correct result. Each thread may read and write each index of the result array multiple times, and there is no way to know beforehand when or how the result array is updated by the various threads/sub-geometries.

It seems I would at least need some atomic operation here, to ensure that no other thread can modify the memory cell between my reading result[INDEX] and writing back the increased value.

In device emulation mode the result is the same for CUDA and the plain C reference; without emulation mode there are discrepancies, which could stem from different float handling but may also come from the problem described above.
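Something like the following is what I have in mind (untested sketch; as far as I know, the built-in atomicAdd() on float requires compute capability 2.0, so on my GTX 280 (compute 1.3) it would have to be emulated with an atomicCAS() loop on the bit pattern):

[codebox]/* Untested sketch: emulate a float atomic add via atomicCAS() on the int
   bit pattern, since float atomicAdd() is not available on compute 1.3. */
__device__ void atomicAddFloat(float *addr, float value)
{
    int *addr_as_int = (int *)addr;
    int old = *addr_as_int, assumed;
    do {
        assumed = old;
        old = atomicCAS(addr_as_int, assumed,
                        __float_as_int(value + __int_as_float(assumed)));
    } while (assumed != old);
}

/* In the kernel, instead of result[INDEX] = result[INDEX] + VALUE; */
/* one would call atomicAddFloat(&result[INDEX], VALUE);            */[/codebox]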

The second problem is speed:

Using the CPU reference version, a test program needs about 500 s to run.

Using the kernel as shown above, my GTX 280 needs 350 s, and ptxas info shows 60 registers used and 3000 bytes of lmem.

If I simply skip the line **result[INDEX] = result[INDEX] + VALUE;**, ptxas info shows 0 registers and otherwise negligible resource usage, and the program completes in *4 s*!

That is, all the complicated calculations hidden behind the #include appear to need only 1% of the execution time, while the repeated modification of the kernel’s result array needs 99%.

Is there a simple way to overcome this? I could populate a thread-local array and add it to the kernel result only once per thread. But length(result) may be around 1000, and the timing results above also hold for very small sub-geometries, where most entries of result are never touched at all.
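To make that idea concrete, the per-thread accumulation would look roughly like this (untested sketch; NUM_BINS is a hypothetical constant standing in for length(result), and partial would need gridDim.x * blockDim.x * NUM_BINS zero-initialized floats):

[codebox]/* Untested sketch of per-thread accumulation: each thread sums into its own
   private slice of a large buffer, so no two threads ever write the same cell. */
#define NUM_BINS 1024 /* hypothetical: length(result) */

__global__ void mykernel(float *partial, float *parameters)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    float *mine = partial + tid * NUM_BINS; /* this thread's private slice */

    int CONTINUE = 1;
    while (CONTINUE) {
        #include "some_complicated_calculations" /* yields INDEX, VALUE, CONTINUE */
        mine[INDEX] = mine[INDEX] + VALUE;
    }
}[/codebox]

A second kernel (or the host) would then still have to sum the per-thread slices into the final result.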