a not so simple question

I have a “puzzle” that I need implemented on parallel processors, and it needs to be done with CUDA to work with the rest of the program. After thinking about it for a while, I decided it would be best to ask the community for your opinions on how to do it. I will try to write it in plain English because it is hard to understand otherwise.

I have a kernel with 64 blocks and 32 threads per block. Each thread has a register variable ‘int myInt;’ and another piece of data, ‘int data;’. myInt is a random number between 0 and 50. I need the data values written to linear memory (global space), grouped according to myInt (but only when myInt is not 0). If some of the data is lost, that is OK, but I’d like to keep as much of it as possible.

So, optimally, this is what would happen (with myInt between 1 and 9):

Each pair of integers lives in a thread’s registers:

myInt    data
8        123
4        346
6        2
8        7445
7        23421
3        6516
8        3262
6        23445
3        93

and would look like this in global memory (order of the data doesn't matter):

globalPointer[3]: 6516, 93
globalPointer[4]: 346
globalPointer[6]: 2, 23445
globalPointer[7]: 23421
globalPointer[8]: 7445, 123, 3262

Sorry for replying late. I think this is a canonical partitioning problem.

Do you pre-allocate a separate “large enough” buffer for each “globalPointer”? That case is easy: each thread keeps a counter for the current write offset of its globalPointer. Each time it reads a “myInt”, it writes to globalPointer[counter++].

Otherwise, if you require all the "globalPointer"s to line up sequentially into one contiguous array, that’s a little tricky, since you have to know the output count of each thread beforehand in order to scatter the data so that it lands in a line. You can do it in 3 passes: in the 1st pass, each thread only “counts” but does not “write”; in the 2nd pass, you do a prefix sum over the “counters” of all threads, which yields the writing location of every thread; in the 3rd pass, you actually write using those offsets. Please search this board for my “in memory grid files” PPT for the 3-pass method.

One step in my code performs a similar function. Things I’ve tried/thought about.

  1. Each thread (or block) writes to one location in the globalPointer array.
    Result: VERY slow. For my application, the number of partitions is proportional to the number of unique “data” values, so this algorithm is O(N^2).
  2. In my app, at least, I can just as easily run threads over the “data” elements. But of course, one cannot build a list in parallel this way without an atomic read-increment instruction. I’m looking forward to trying this method on the G92, which is fast and supports sm_11.
  3. Just do the partitioning on the CPU. This is much faster than 1) in my implementation as long as the number of data elements is more than 20,000. At 20,000, the times for 3) and 1) are about equal. And this IS counting the overhead of copying the data from the GPU, partitioning, then copying the partition table back to the GPU.

One could always sort by myInt as a last resort.
A radix sort is doable and has reasonable performance on the G80.
However, the CPU would still likely be a much better choice for less than 20,000, as Mr. Anderson said.

misteranderson: that second thing you thought about with the G92… can you tell me more about it, or point me to where I can find out more?

Also… I’m using way more than 20,000, sadly, so the CPU is out. I’m looking at about 150,000. However, as I said, I’m only using 48 blocks and 32 threads, so to get to 150,000 unique data values each thread loops about 100 times.

I’m thinking of having each thread sort its values by myInt in shared memory, then use y = maxMyInts / blockDim.x to figure out how many different myInt values each thread will have to handle. Next, every thread with tid + each multiple of blockDim.x sends its values to the globalPointers in this fashion: globalPointers[n + myInt1Offset], globalPointers[n + myInt2Offset] (of course I’d be using an array of constants for the offset values to save on registers). Kind of confusing, but if anyone understands what I just wrote, could you tell me whether this would be a good method?

Also… with 48 blocks and 32 threads/block, are my global memory accesses coalesced?

In my idea for the G92 (or any sm_11 device), each thread handles a single “data” value. From the value of data, I can determine what “myInt” is. globalPointer is preallocated to the expected maximum size of any one list, and all of the list lengths are set to 0.

In the kernel, after “myInt” is determined, the length of the appropriate list is incremented atomically and the value of data is written at that location in the list. The memory writes are certainly not coalesced.

And there seems to be some confusion about the 20,000 number I quoted. The GPU is FASTER for LESS than 20,000 data elements. Beyond that, the O(N^2) behavior kicks in and the CPU becomes faster for larger systems, because it can use a simple O(N) algorithm.