Best way to do scatter write without memory conflict?

I’m trying to do something that is very similar to the problem of splatting in direct volume rendering:
Each of my threads will produce an array of data that is to be written to or “splatted” onto my final output array. Each element in the output array may be visited by more than one threads, and I want these contributions to the same output element from different threads to be added.

Obviously this gives rise to potential memory conflict as different threads try to write to the same output array address simultaneously. I did some search on the internet but couldn’t find a good solution to it.

Atomic add is no good since it doesn’t handle floating point and I need that.

One solution that I could think of is to do a kind of merge sort on the data arrays produced by the threads. ie, take the two arrays from two threads. merge them and combine those elements that will contribute to the same output element, and keep repeating this process until I’m left with one single array that can be added to my output conflict-free.

This seems a rather complicated solution to a simple problem to me. Any suggestion to a better/simpler solution?


As you say, there are indeed many strategies. And they all depend on your application… each has tradeoffs.
If your problem can be reversed and posed as a GATHER where each voxel grabs its contributions, you’re all set since you can iterate through those without conflict. This may not be possible with splatting, but for many apps it’s the best.

Another solution is binning to a subgrid. You have a zillion voxels. Split them up into subregions of say 1000-4000 voxels.
Stream through your data and bin each splat into new lists for each subgrid.
Now launch a new kernel, and each block here “owns” a subgrid. It goes through the lists you just made and applies the splats to its own voxels. Here the subgrid is chosen to be small enough to fit into shared memory so you can control memory access much more easily.

If your problem is small enough to fit into shared to begin with, you could have every block apply some of the splats to a local copy of the grid, then run a second kernel to do a reduction that adds all subgrids together.

Lots of other solutions but again the details of your problem really dominate, and you should always try a couple variants to see which work best.

Hello there.
I did not make a new topic cause my problem can integrate in this one.
So, about the details of the problem SPWorley, I have 2 images processed by a kernel that returns into a width * height float array (global variable) the square difference between pixels(RGB).
In the next phase I would like all those values in the array to be added.
As rhyc said, I cannot use atomicAdd cause there are no integers in my array, so I was thinking of which way is better:

  • should I use shared memory per block and then, as you said, recall a reduction kernel until I hit the desired value?
  • should I directly use the global array that I saved my data into and after computing my array, call __syncthreads() and then in the same kernel doing a for (the same way as reduction algorithm does)
    and calculating my interest value?

Is there such an optimization using shared memory so I would need to quit my second idea? Or is not a big one, then the method will not interest me…


gufeatza, you have several solutions. Luckily they’re all pretty easy.
Easiest is to do summation per block and come up with a single sum for the block, then combine the results.

So lets say your image was 5K by 5K, you might assign one block for each 256x256 subimage, so you’d have 400 blocks.
You’d use your blocks to loop over the pixels, likely each thread would sum up one column of values, for example.
Then you’d sum up each of those totals to get a single value for the whole block. This step is a classic block-wide reduction.

Finally, you have 400 block results, and need to add THOSE. you could write them out into global memory and have a second kernel add them, you could just copy them to the CPU and have the CPU add them, or you could use global atomics. Any of those solutions is fine since it’s a small amount of work and no big deal compared to the original accumulation.


“Any of those solutions is fine since it’s a small amount of work and no big deal compared to the original accumulation”
This leads me to another question, I guess that’s the one I should’ve ask first of all :)

My interest is to compute an “error” between two images. (difference is a quite better term to use instead of error)
Through every thread I compute the sum of square differences of pixel colors. err_per_thread = (red_error)^2 + (red_green)^2 + (red_blue)^2 ,
each of those 3 variables meaning of course the difference of color between the 2 images.

Finally as I said I add all these results to obtain the total error (difference).

Well, here comes the question: Is there any improvement to make this computations on the GPU instead of CPU(2 whiles)?

If not, any idea on how to get that improvement using CUDA(where to apply parallelism) ?

Your question, “is it worth doing on the GPU” has tons of answers.
Is this a problem that’s a bottleneck for you?

Your problem itself may tell you the answer. If your images are already on the device, and you need to use the error on the device, it will be expensive to send the data to the CPU, do a compute, and send it to the GPU.
Conversely, if your data is on the CPU and you don’t need it on the device, is it worthwhile to send it to the GPU?

For such a simple compute, you’re likely going to be limited by PCIe bus speed and not compute speed at all.
So take your image size in gigabytes and divide by 4 GB/sec, and that will give you a back-of-the-envelope speed estimate of the transfer time alone. You might compare that to the speed you can compute on the CPU, the CPU may win even if the GPU’s computation is infinitely fast.

Thanks SPWorley.

I actually have a solution using a ray-casting approach which is a gather operation, and I want to try a splatting approach to compare the performances so…

I’ll try the method binning the voxels into groups that guarantee conflict-free writes, as you suggested.