I’m in the process of porting an image registration program to the GPU. To do this I identified eight bottlenecks in the software, each of which will later become a kernel. I have already ported one of them, reducing its execution time from 500 ms to 10 ms. Nice!
However, I’m facing a problem with the kernel I’m working on right now. This routine adds values to a shared array of data, and there are lots of threads but just a handful of data elements. So at times multiple threads add values to the same address, which of course corrupts the data.
I need to fix this.
Since the resulting data is very small, I could just multiply the length of the array by the number of processors and have each processor update a memory location at its ‘own’ index in the enlarged array. Then each processor updates its own address and no race condition can occur (the separate copies would have to be summed at the end).
Does CUDA expose the physical processor number to the code? I think not. Secondly, this is an ugly brute-force solution to the problem.
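For reference, here is a minimal sketch of that brute-force idea, assuming the logical thread index (`blockIdx.x * blockDim.x + threadIdx.x`) stands in for a physical processor number. All names (`accumulatePrivate`, `reducePartials`, `sumIntoBins`) and all sizes are made up for illustration and are not from my actual program. It also assumes every bin index lies in `[0, kNumBins)`, and error checking is left out.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical sizes, chosen only for illustration.
constexpr int kNumBins = 8;           // the "handful" of output values
constexpr int kNumBlocks = 4;
constexpr int kThreadsPerBlock = 64;
constexpr int kNumThreads = kNumBlocks * kThreadsPerBlock;

// Each thread accumulates into its own private slice of `partial`,
// so no two threads ever write to the same address.
__global__ void accumulatePrivate(const int *binIdx, const float *value,
                                  int n, float *partial)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;  // logical thread index
    int stride = gridDim.x * blockDim.x;
    float *mine = partial + tid * kNumBins;           // this thread's copy
    for (int i = tid; i < n; i += stride)
        mine[binIdx[i]] += value[i];
}

// One thread per bin sums that bin over all private copies.
__global__ void reducePartials(const float *partial, int numCopies,
                               float *result)
{
    int bin = blockIdx.x * blockDim.x + threadIdx.x;
    if (bin >= kNumBins) return;
    float sum = 0.0f;
    for (int c = 0; c < numCopies; ++c)
        sum += partial[c * kNumBins + bin];
    result[bin] = sum;
}

// Host wrapper: copies the input to the device, runs both kernels,
// and returns the kNumBins accumulated values.
std::vector<float> sumIntoBins(const std::vector<int> &idx,
                               const std::vector<float> &val)
{
    int n = static_cast<int>(idx.size());
    int *dIdx; float *dVal, *dPartial, *dResult;
    cudaMalloc(&dIdx, n * sizeof(int));
    cudaMalloc(&dVal, n * sizeof(float));
    cudaMalloc(&dPartial, kNumThreads * kNumBins * sizeof(float));
    cudaMalloc(&dResult, kNumBins * sizeof(float));
    cudaMemcpy(dIdx, idx.data(), n * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemcpy(dVal, val.data(), n * sizeof(float), cudaMemcpyHostToDevice);
    // All private copies start at zero (all-zero bytes are 0.0f).
    cudaMemset(dPartial, 0, kNumThreads * kNumBins * sizeof(float));

    accumulatePrivate<<<kNumBlocks, kThreadsPerBlock>>>(dIdx, dVal, n, dPartial);
    reducePartials<<<1, kNumBins>>>(dPartial, kNumThreads, dResult);

    std::vector<float> result(kNumBins);
    cudaMemcpy(result.data(), dResult, kNumBins * sizeof(float),
               cudaMemcpyDeviceToHost);
    cudaFree(dIdx); cudaFree(dVal); cudaFree(dPartial); cudaFree(dResult);
    return result;
}

int main()
{
    const int n = 1 << 16;
    std::vector<int> idx(n);
    std::vector<float> val(n, 1.0f);
    for (int i = 0; i < n; ++i) idx[i] = i % kNumBins;

    // Every bin receives n / kNumBins contributions of 1.0f.
    std::vector<float> result = sumIntoBins(idx, val);
    for (int b = 0; b < kNumBins; ++b)
        printf("bin %d: %.1f\n", b, result[b]);
    return 0;
}
```

The cost is what makes it feel ugly to me: the scratch array grows with the thread count (here 256 copies of 8 floats), and a second pass is needed just to add the copies together.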
Any ideas will be appreciated :)