shared atomics with compute capability 1.1

Hi,

I want to port an implementation of a gray-tone spatial-dependence matrix (GLCM) to the GPU.
For example:
512x512x4 times I have to increment values in a matrix of dimension 256x256, and the coordinates at which I have to increment are essentially random.
The whole thing works with atomic global memory writes, but this is very slow.

Now I want to generate a single row of the 256x256 matrix with each thread block. But there, the writes to the shmem[256] array can also hit the same location, so I need a routine to increment a value in shared memory atomically.

Is there any way to do this on devices with compute capability 1.1?

Thx in advance!

burnie

A good puzzle. Lots of solutions, and it may be hard to choose without testing. You're right that compute capability 1.2+ devices would solve your problem, since they DO have shared atomics.

If you're processing 256 bins per block, you could give each of 32 threads its own 256 bins in shared memory, accumulate, then finally use a parallel reduction to collapse them into a single 256-bin histogram. With 16-bit counters that's 32 x 256 x 2 = 16K, which pretty much maxes out your shared mem, and a one-warp block may not use your threads effectively.
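As a rough sketch of that layout — untested, and assuming 16-bit counters, so on a 1.x part you'd have to shave it slightly since kernel arguments also occupy shared memory there:

```cuda
#include <cuda_runtime.h>

// Per-thread sub-histograms: launch as histPerThread<<<1, 32>>>(img, n, out).
__global__ void histPerThread(const unsigned char *data, int n,
                              unsigned int *out)
{
    __shared__ unsigned short s_hist[32][256];   // one private row per thread

    // Clear this thread's private row.
    for (int i = 0; i < 256; ++i)
        s_hist[threadIdx.x][i] = 0;
    __syncthreads();

    // Each thread tallies its own slice of the input. No two threads ever
    // touch the same counter, so no atomics are needed. 16-bit counters are
    // safe for a 512x512 image: 512*512 / 32 = 8192 max hits per counter.
    for (int i = threadIdx.x; i < n; i += 32)
        s_hist[threadIdx.x][data[i]]++;
    __syncthreads();

    // Reduction: fold the 32 private rows into one 256-bin histogram,
    // each thread summing 8 of the 256 columns.
    for (int bin = threadIdx.x; bin < 256; bin += 32) {
        unsigned int sum = 0;
        for (int t = 0; t < 32; ++t)
            sum += s_hist[t][bin];
        out[bin] = sum;
    }
}
```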

How about this? Create two 256-bin arrays in shared memory. The first is your accumulator. The second is a notification array.

Do:
1. Clear the notification array to 0xFFFFFFFF. Syncthreads.
2. Each thread computes the bin it wants to increment.
3. (A) Each thread writes its unique THREAD ID into the notification array, at the index of the accumulator slot it wants to increment.
4. Syncthreads.
5. Each thread checks the notification array. If its own thread ID is there, it increments the accumulator and resets the notifier to 0xFFFFFFFF.
6. Syncthreads.
7. If any threads did NOT see their own ID, go back to the notify/check step (labeled A) until all threads have done their increment.
8. Go back to the start, ready for the next increment.

This all relies on the fact that when shared-memory writes collide, one of them is guaranteed to succeed. It doesn't matter which thread wins, just that afterwards everyone knows who's going to increment.
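In CUDA the whole loop might look something like this — a minimal untested sketch (all the names are mine) that builds a plain 256-bin histogram. Note every thread runs every pass, even past the end of the input, so the __syncthreads() calls stay uniform:

```cuda
#include <cuda_runtime.h>

#define BINS  256
#define EMPTY 0xFFFFFFFFu   // no thread ID ever matches this

__global__ void histNotify(const unsigned char *data, int n, unsigned int *out)
{
    __shared__ unsigned int s_accum[BINS];    // the accumulator
    __shared__ unsigned int s_notify[BINS];   // "who increments this bin next"
    __shared__ int s_anyPending;              // did anyone lose this round?

    for (int i = threadIdx.x; i < BINS; i += blockDim.x) {
        s_accum[i]  = 0;
        s_notify[i] = EMPTY;
    }
    __syncthreads();

    for (int base = 0; base < n; base += blockDim.x) {
        int idx = base + threadIdx.x;
        unsigned int bin = (idx < n) ? data[idx] : 0;
        bool pending = (idx < n);     // this thread still owes an increment

        do {
            if (threadIdx.x == 0) s_anyPending = 0;
            __syncthreads();

            // (A) Everyone still pending stakes a claim. When writes
            // collide, exactly one of them is guaranteed to land.
            if (pending) s_notify[bin] = threadIdx.x;
            __syncthreads();

            // Only the thread whose ID survived does the increment.
            if (pending && s_notify[bin] == threadIdx.x) {
                s_accum[bin] += 1;
                s_notify[bin] = EMPTY;       // reset the notifier
                pending = false;
            }
            if (pending) s_anyPending = 1;   // lost the race; go around again
            __syncthreads();
        } while (s_anyPending);
    }

    for (int i = threadIdx.x; i < BINS; i += blockDim.x)
        out[i] = s_accum[i];   // write the finished histogram out
}
```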

This wouldn’t even be too slow, since collisions will be rare with the big 256 wide array.

Compute capability 1.2 devices can do atomic updates with shared memory… this would work very well for you though you say you need 1.1.

Thanks a lot for your help! I think this should help me get things right!
greetings burnie

To generate, for example, one row of the 256x256 GLCM matrix with each thread block, I'd have to iterate through the whole 512x512 input array just to check whether each pixel belongs to my specific row. So this isn't a good solution for me either, because the memory bandwidth cost is too high, as you know…
My fastest solution is still the one with atomic writes to global memory (the whole 256x256 matrix stays in global memory).
If anyone has a genius idea to solve my problem, feel free to write here :thumbsup:
Let me try to explain: there is an 8-bit gray-tone matrix (the 512x512 input image) where I have to look at the neighbors of each pixel.
If a pixel with the value 245 sits next to a pixel with the value 13, I have to increment the element [245][13] in the GLCM matrix. Because of the 8-bit values, the GLCM matrix is [256][256] at maximum.
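The global-atomic version is basically the following — sketched for just one neighbor offset (the right-hand pixel) as an illustration, which may not match the exact offsets I use:

```cuda
#include <cuda_runtime.h>

// One pass over the image, tallying (pixel, right neighbor) pairs with
// global atomics. Needs compute capability 1.1 for atomicAdd on global ints.
// glcm must point at 256*256 zero-initialized counters.
__global__ void glcmGlobal(const unsigned char *img, int w, int h,
                           unsigned int *glcm)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= w - 1 || y >= h) return;     // last column has no right neighbor

    unsigned int a = img[y * w + x];      // reference pixel, e.g. 245
    unsigned int b = img[y * w + x + 1];  // its neighbor,    e.g. 13
    atomicAdd(&glcm[a * 256 + b], 1u);    // bumps element [245][13]
}
```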

What about not doing one row per block, but multiple rows at once?
How big are your bins? I assume 16-bit ints. So make a shared array of, say, 256x26 bins; that uses a conservative 13K, leaving room for your “notification” array of 256 signal slots. [The row count could be increased or decreased if you like…]

So this would require 10 passes through the image (26 rows at a time covers all 256 rows with ten blocks), but it'd all be tallied in shared memory. You'd probably just use 10 blocks and you're done.

Reading the image from global memory is decently fast since it's coalesced, and your random scatter writes would go to fast shared RAM.
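Putting the pieces together, a sketch might be (untested; the names and the right-neighbor pairing are my assumptions again, and the 16-bit counters assume no single pair occurs more than 65535 times):

```cuda
#include <cuda_runtime.h>

#define ROWS_PER_BLOCK 26       // 26 * 256 * 2 bytes = 13K of counters
#define EMPTY 0xFFFFFFFFu

// Each block owns ROWS_PER_BLOCK consecutive GLCM rows and scans the whole
// image, tallying only the pairs whose reference value falls in its band.
// Launch as glcmBanded<<<10, 256>>>(img, 512, 512, glcm).
__global__ void glcmBanded(const unsigned char *img, int w, int h,
                           unsigned int *glcm /* 256*256, output */)
{
    __shared__ unsigned short s_band[ROWS_PER_BLOCK][256];
    __shared__ unsigned int   s_notify[256];   // claims, keyed by column
    __shared__ int            s_anyPending;

    const unsigned int rowBase = blockIdx.x * ROWS_PER_BLOCK;
    unsigned short *flat = &s_band[0][0];

    for (int i = threadIdx.x; i < ROWS_PER_BLOCK * 256; i += blockDim.x)
        flat[i] = 0;
    for (int i = threadIdx.x; i < 256; i += blockDim.x)
        s_notify[i] = EMPTY;
    __syncthreads();

    const int nPix = w * h;
    for (int base = 0; base < nPix; base += blockDim.x) {
        int idx = base + threadIdx.x;
        unsigned int row = 0, col = 0;
        bool pending = false;
        if (idx < nPix && (idx % w) < w - 1) {   // skip the last column
            row = img[idx];                      // reference pixel
            col = img[idx + 1];                  // right-hand neighbor
            // Keep only pairs landing in this block's band of rows.
            pending = (row >= rowBase && row < rowBase + ROWS_PER_BLOCK);
        }

        do {   // the notification loop from the earlier post
            if (threadIdx.x == 0) s_anyPending = 0;
            __syncthreads();
            if (pending) s_notify[col] = threadIdx.x;
            __syncthreads();
            if (pending && s_notify[col] == threadIdx.x) {
                s_band[row - rowBase][col]++;
                s_notify[col] = EMPTY;
                pending = false;
            }
            if (pending) s_anyPending = 1;
            __syncthreads();
        } while (s_anyPending);
    }

    // Flush the band; the last block's band runs past row 255, so clamp.
    for (int i = threadIdx.x; i < ROWS_PER_BLOCK * 256; i += blockDim.x) {
        unsigned int r = rowBase + i / 256;
        if (r < 256) glcm[r * 256 + (i % 256)] = flat[i];
    }
}
```

Keying the claims by column alone is still correct: whichever thread wins a column bumps its own (row, col) cell, and losers just retry; it only serializes same-column updates a bit more.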

But as you say, it depends. Try it both ways. Perhaps global atomics really are faster.

After many tests I think there is no reason to port this kind of function to the GPU: it all comes down to memory bandwidth, plus the atomic writes that the parallelism forces.
The CPU and its RAM are faster here, so I think this part of my program should stay on the CPU.