I’m in the process of porting an image registration program to the GPU. To do this, I identified a total of eight bottlenecks in the software that will later become my kernels. I have already ported one of these kernels, reducing its execution time from 500 ms to 10 ms. Nice!
However, I’m facing a problem with the kernel I’m working on right now. This routine adds values to a common array of data, and there are lots of threads but only a handful of output elements. So multiple threads sometimes add values to the same address, which of course corrupts the data.
I need to fix this.
My idea:
Since the resulting data is very small, I could just multiply the length of the array by the number of processors and have each processor update the memory locations at its ‘own’ index into this larger array. Then each processor updates only its own addresses, and no race condition can occur.
Does CUDA expose the physical processor number to the code? I think not. Secondly, this is an ugly brute-force solution to the problem.
deviceQuery in the SDK should give you the number of physical multiprocessors.
However, if I understand you correctly, you can just use an arbitrary number of blocks/thread groups/whatever, allocate a space in RAM for each such group, and have each group accumulate into its own space (thus hopefully removing the race condition). Then just use some sort of reduction to combine the intermediate results into one (using a standard parallel reduction, for example).
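The scheme above can be sketched in host C++ with threads standing in for thread blocks (names like accumulate_in_groups and num_groups are illustrative, not from the original program; on the GPU the reduction step would itself be a kernel):

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

// Hypothetical CPU sketch of the idea: each "group" (a thread here, a thread
// block on the GPU) accumulates into its own private copy of the small output
// array, so no two groups ever write the same address. A final reduction
// folds the per-group copies into one result.
std::vector<double> accumulate_in_groups(const std::vector<int>& input,
                                         int num_groups, int num_bins) {
    // one private output array per group
    std::vector<std::vector<double>> partial(num_groups,
                                             std::vector<double>(num_bins, 0.0));
    std::vector<std::thread> workers;
    for (int g = 0; g < num_groups; ++g) {
        workers.emplace_back([&, g] {
            // each group processes a strided slice of the input
            for (std::size_t i = g; i < input.size(); i += num_groups)
                partial[g][input[i] % num_bins] += 1.0;
        });
    }
    for (auto& w : workers) w.join();

    // reduction step: fold the per-group copies into a single array
    std::vector<double> result(num_bins, 0.0);
    for (int g = 0; g < num_groups; ++g)
        for (int b = 0; b < num_bins; ++b)
            result[b] += partial[g][b];
    return result;
}
```

Since the output array is small, the extra memory (array length × number of groups) is usually negligible, and the reduction touches only that small buffer.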
There are atomic update functions which may serve your purpose (see the CUDA
programming manual).
For integers, I believe there is an atomic add function. For floating-point numbers,
you can simulate an atomic add using the atomic CAS (compare-and-swap)
function. This is done in NVIDIA’s sparse matrix-vector multiply library,
available here:
I tried the atomic adds. Since I need double output, I used a 64-bit integer atomic add in combination with fixed-point math and a post-processing step to cast everything back to doubles. I get only a factor-of-2 speedup, so I think I’m going to try rewriting the function so that it iterates over the different output variables instead, avoiding this scatter problem altogether.
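For reference, the fixed-point workaround described above amounts to something like the following (a host C++ sketch; the scale factor of 2^32 is an assumption that trades range for precision, and the names are illustrative):

```cpp
#include <atomic>
#include <cmath>
#include <cstdint>

// Fixed-point accumulation trick: scale each double, accumulate as a 64-bit
// integer (which has a native atomic add), then divide the scale back out in
// a post-processing step. kScale = 2^32 is an assumed choice.
constexpr double kScale = 4294967296.0;  // 2^32

void fixed_point_add(std::atomic<std::int64_t>& acc, double value) {
    // quantize to fixed point and use the integer atomic add
    acc.fetch_add(static_cast<std::int64_t>(std::llround(value * kScale)));
}

double fixed_point_result(const std::atomic<std::int64_t>& acc) {
    return static_cast<double>(acc.load()) / kScale;  // cast back to double
}
```

Note the two costs this carries: quantization error from the rounding, and overflow if the running sum exceeds what 63 bits can hold after scaling, which may help explain why the measured speedup was modest.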