slowdown modifying kernel return variable

ekroemer · December 27, 2009, 2:47pm

Hi there,

I’m new to CUDA and I hope, I’m missing something simple to fix.

The code I’m writing sums up contributions from a huge number of point scatterers (sources) placed on the surface of a geometric shape. I’ve parallelized the code by dividing the set of sources into subsets of arbitrary size.

[

greater size means longer kernel runtime and eventually kernel timeout; total number of sources is possibly

great, ranging from 1e3 to 1e12

]

Each thread works on a different subset of the geometry by looping over all its sources.

For each source after a fairly complicated procedure (>200 lines of code) there is calculated

[*]a single float value and

[*]an index,

so speaking serially result[index] = result[index] + value for each source of each sub-geometry.

In pseudocode my kernel looks like[codebox]global void mykernel(float *result, float *parameters)

{

CONTINUE = 1;

while ( CONTINUE ) {

#include "some_complicated_calculations"  /* yielding INDEX, VALUE, CONTINUE */

result[INDEX] = result[INDEX] + VALUE;

}

return;

}[/codebox]Now I’ve got 2 problems with this code:

[list=1]

[*] I’m unsure if the line result[INDEX] = result[INDEX] + VALUE; will yield the correct result. Each thread may read/write multiple times into each index of the array result, there is no beforehand knowledge about when or how, the array result is added to by all threads/sub-geometries.

Maybe I at least would need some atomic operation here, to ensure that between reading result[INDEX] and

writing the increased value there is no possibility for another thread to modify this memory cell.

In device emulation mode the result is the same for CUDA and the plain C reference; without emulation mode there are discrepancies that could stem from different float handling but may come from the problem described above.

[*]The second problem is speed:

[*]Using the CPU-reference version, a test program needs about 500s to run.

[*]Using the kernel as shown above my GTX280 needs 350s and ptxasinfo shows 60 used registers and 3000bytes lmem.

[*]Simply skipping the line result[INDEX] = result[INDEX] + VALUE; has ptxasinfo with 0 registers and negligible resources otherwise, the program completes in 4s!

That is: all the complicated calculations shown as #include only need 1% of the execution time; the repeated modification of the kernel’s result array needs 99%.

Is there a simple way to overcome this? I could populate a thread-local array and add it to the kernel result only once per thread. But length(result) may be some 1000 and the timing results above hold true for very small sub-geometries, too, where most result[INDEX] are not touched at all.

E.D_Riedijk · December 27, 2009, 9:42pm

Now I’ve got 2 problems with this code:

[list=1]

[*] I’m unsure if the line result[INDEX] = result[INDEX] + VALUE; will yield the correct result. Each thread may read/write multiple times into each index of the array result, there is no beforehand knowledge about when or how, the array result is added to by all threads/sub-geometries.

Maybe I at least would need some atomic operation here, to ensure that between reading result[INDEX] and

writing the increased value there is no possibility for another thread to modify this memory cell.

In device emulation mode the result is the same for CUDA and the plain C reference; without emulation mode there are discrepancies that could stem from different float handling but may come from the problem described above.

[*]The second problem is speed:

[*]Using the CPU-reference version, a test program needs about 500s to run.

[*]Using the kernel as shown above my GTX280 needs 350s and ptxasinfo shows 60 used registers and 3000bytes lmem.

[*]Simply skipping the line result[INDEX] = result[INDEX] + VALUE; has ptxasinfo with 0 registers and negligible resources otherwise, the program completes in 4s!

That is: all the complicated calculations shown as include only need 1% of the execution time; the repeated modification of the kernel’s result array needs 99%.

Is there a simple way to overcome this? I could populate a thread-local array and add it to the kernel result only once per thread. But length(result) may be some 1000 and the timing results above hold true for very small sub-geometries, too, where most result[INDEX] are not touched at all.

You may have problems with the summation, depends a bit, you can normally add things up in a block and if you make sure your blocks all write to other locations in memory, it will be okay.

Now for your speed problem: It is probably because each thread uses 3kB of local (slow) memory, maybe you can try to use shared memory?

As for the reason of the speedup, the compiler probably removes all code from the kernel, because nothing is written to global memory. The dead-code optimization is quite good…

ekroemer · December 28, 2009, 4:22pm

Thanks for your reply!

I commented out an optional part of the code that would produce most of the lmem, so now I’m left with approximately 60 registers/80 byte lmem, or, using -maxrregcount e.g. 4 reg / 400 byte lmem.

This has no measurable effect on execution time, so obviously execution time does not scale with lmem usage.

Does this mean I need to get rid of every single usage of lmem?

I think I could try to reduce the number of variables but this will play hell with code readability…

I’m afraid, you could be right, that would be quite a blow.

I did some more tests commenting out parts of the code and must admit that I’m confused, for example:

[*] compiling all of the code and simply adding a weight of zero, result[index] = result[index] + value * 0.0, will result in extreme fast execution. Maybe, as you said, the whole kernel is cleared.

[*] the result[index] = … line is protected by something like if (index<0) || (index >= imax) continue.

index itself is calculated using something like sqrt(distance), distance for each source being derived from internally calculated coordinates (no fixed values the compiler could know of).

Now, deleting the sqrt will produce an out-of-bound index, and the result[index] = … line won’t be reached.

execution time: again extremely fast and I do not see how the compiler can predict this case.

Seems I’ll have to test more…