I’m attempting to accelerate a gridding algorithm in CUDA, and I’m hitting some serious problems getting the code to run faster on the GPU.
I have a list of sources (around 130k) which have to be placed on a 1500x1500 grid. Each source affects a 5x5 (or 7x7; the patch size is a runtime decision, but it’s the same for all sources) patch of the larger grid. The grid is of complex numbers (and there are actually four of them, but that doesn’t affect the algorithm; we just do the same thing once for each grid).
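To make the setup concrete, here’s roughly what the data looks like (the names and the weight() stub are simplified stand-ins, not the real code):

```cpp
#include <cuComplex.h>

// Simplified shape of the data; the real code has four grids and more per-source fields.
struct Source {
    float x, y;          // position in grid coordinates on the 1500x1500 grid
    cuFloatComplex amp;  // complex value to be spread over the support patch
};

// Placeholder for the real convolution weight, which depends on (dx, dy).
__host__ __device__ inline cuFloatComplex weight(const Source& s, int dx, int dy)
{
    return s.amp;  // stand-in only
}

const int GRID_SIZE   = 1500;    // each grid is GRID_SIZE x GRID_SIZE complex values
const int NUM_SOURCES = 130000;  // roughly 130k sources
```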
The CPU-based code loops over all sources, adding each source’s contribution to the grid cells in its patch. This takes about 500ms. However, since each pixel in the large grid can potentially be affected by more than one source, it’s not safe to give each GPU thread a source (I’ve verified this with a test kernel - quite a few contributions get lost to race conditions when two threads update the same cell).
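The CPU version is essentially a scatter over each source’s patch, something like this (simplified, using the types from the sketch above):

```cpp
// CPU reference: for each source, add its weighted contribution to the cells
// in its support patch. This runs in roughly 500ms for ~130k sources.
void gridSourcesCPU(const Source* sources, int numSources,
                    cuFloatComplex* grid, int gridSize, int support)
{
    const int half = support / 2;
    for (int s = 0; s < numSources; ++s) {
        const int cx = (int)sources[s].x;
        const int cy = (int)sources[s].y;
        for (int dy = -half; dy <= half; ++dy) {
            for (int dx = -half; dx <= half; ++dx) {
                const int gx = cx + dx;
                const int gy = cy + dy;
                if (gx < 0 || gx >= gridSize || gy < 0 || gy >= gridSize) continue;
                cuFloatComplex w = weight(sources[s], dx, dy);
                grid[gy * gridSize + gx] = cuCaddf(grid[gy * gridSize + gx], w);
            }
        }
    }
}
```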
So, in the CUDA code, I give each cell in the 1500x1500 grid to a thread, and let each thread loop over all the sources. This is thread-safe but hideously slow (around 20x slower than the CPU implementation), since most grid cells only have one or two sources in range (and quite frequently none at all). Right now I’m trying to use texture arrays to speed up the source lookups, but I’m not convinced this will yield any substantial gain over the CPU implementation.
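For reference, the current kernel is the gather version of the same thing, one thread per cell (again a sketch, reading sources straight from global memory rather than textures):

```cpp
// One thread per grid cell; every thread scans the full source list and keeps
// only the sources whose patch covers its cell, so there are no write conflicts.
__global__ void gridCellsKernel(const Source* sources, int numSources,
                                cuFloatComplex* grid, int gridSize, int support)
{
    const int gx = blockIdx.x * blockDim.x + threadIdx.x;
    const int gy = blockIdx.y * blockDim.y + threadIdx.y;
    if (gx >= gridSize || gy >= gridSize) return;

    const int half = support / 2;
    cuFloatComplex acc = make_cuFloatComplex(0.0f, 0.0f);
    for (int s = 0; s < numSources; ++s) {
        const int dx = gx - (int)sources[s].x;
        const int dy = gy - (int)sources[s].y;
        // Only a handful of the ~130k sources ever pass this test for a given
        // cell, but every thread still pays for the full scan.
        if (abs(dx) <= half && abs(dy) <= half)
            acc = cuCaddf(acc, weight(sources[s], dx, dy));
    }
    grid[gy * gridSize + gx] = cuCaddf(grid[gy * gridSize + gx], acc);
}

// Launched with something like:
//   dim3 block(16, 16);
//   dim3 blocks((GRID_SIZE + 15) / 16, (GRID_SIZE + 15) / 16);
//   gridCellsKernel<<<blocks, block>>>(d_sources, NUM_SOURCES, d_grid, GRID_SIZE, 5);
```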
Someone else worked on an earlier version of the code and used OpenGL for this portion. Apparently the pixel pipelines there can do atomic adds on floats while compositing an image. Unfortunately, I don’t know OpenGL.
Has anyone else come across a similar problem, and if so, how did you solve it within CUDA? Thanks in advance…