Random memory access and += Advice needed

I have a problem I’m trying to solve with CUDA but so far I haven’t been able to come up with anything that resembles an acceptable solution.

I need advice.

My input is an array of objects with the following format:

struct input


  uint index;

  float data[10];


The output is an array of objects with the following format:

struct output


  float data[10];


This is what I need, in pseudo-code:

void process( input const * in, size_t s, output * out )


  for( size_t i=0; i!=s; ++i )

    out[in[i].index].data += in[i].data;


(as you’ve probably guessed, += means “add the corresponding data elements”)

I will also add that the size of the input and the output arrays is many many millions of elements, however I know that a single output element is not used by more than 48 input elements (48 is the most common case as well.)

I figured, I could pad the output struct to 64 bytes, which would allow me to have 16 threads where 10 threads output 1 float each, and 6 threads output nothing, so that an entire output object’s write can be coalesced.

The problem is, I don’t need to write the output, I need +=.

Any ideas?

Here’s one way to handle it. It’s off the top of my head, so there may well be a more clever approach.

Let N be the number of input elements and M be the number of output elements. Also suppose that both N and M are large with respect to the number of processors in your GPU.

First, sort the input array using the output index field as key. If your input array is already sorted this way then you can skip this part. You can use a temporary array of input/output index pairs to do this sort and reduce the cost. This costs O(N log N log N) for bitonic sort. Radix sort may be faster, but it’s a little more complicated to code.

Second, launch M threads, where M is the size of your output. Each thread is associated with one output index. Each thread does a binary search to find the first place where the thread output index is equal to the output index from the temporary array (hint: use a texture binding for faster search). Then you iterate through the temporary array while the output indexes match and accumulate all of the data values. This costs O(M log N) for the search and O(N) for the operations.

Right, I forgot to mention that the elements in the input array are not sorted.

I thought about sorting, but isn’t that going to be really slow? We’re talking about sorting many millions of elements of size 44 bytes. Has anyone done something like this? I understand how Bitonic sort works, but I’m not sure how to implement it efficiently in CUDA. There is an example of it in the SDK but it’s for very small N and very small element size; I don’t think that this approach will work well in my case.

The other problem is, binary search. How can this be done with coalesced reads?

Am I wrong being concerned about coalescing all reads and writes?

The sorting I suggested uses an array of pairs, each pair having an input index and an output index, so you don’t actually move that much data. Bitonic sort is pretty easy to extend to large arrays, although it might not be the most efficient way to do things. You might even find that you can make up the indexing array using CPU routines using a fast quicksort or radix sort and transfer it to the GPU.

Once the indexing array is on the GPU you can bind a texture to it and use texture reads, getting some benefit from cacheing. The benefit (or penalty) from using texture reads will depend on the size of your data set. Depending on the data sizes you might find that doing an interpolation search would be faster than binary.

You might choose to do the searching at the block level, and to have the threads do the data element addition in parallel. This would leverage the coalesced reads/writes.

Just some suggestions. I’ve coded some similar things, but not this exact problem. The basic idea is for any given output location you can identify one and only one thread that is going to store it. That gives you the atomic update you want.

At compute level 1.1 you have atomic update as a primitive, I believe, but it can be inefficient to use. See the SDK example simpleAtomicIntrinsics for usage. Your actual milage will vary.

Thanks, this is very helpful.