How to express this efficiently in CUDA?


I’m looking for a bit of advice on how to code the following effectively in CUDA:

I’ve got 16 (or, more generally, some 2^small_n) lists of 32-bit ints as input, each sorted in ascending order and terminated by a sentinel (0x7fffffff). They are read linearly from 16 offsets within a 1D texture. I need to merge these 16 lists into a single result containing only the values that are present in >= k (0 < k <= 16) of the lists.

My (unimplemented) plan, which can probably be improved upon a lot, was to cache 4 ints at a time in each of 16 threads, then:

1. Calculate the minimum across the 16 threads in shared memory.

2. If the minimum equals the sentinel, break.

3. Each thread whose current value equals the minimum sets array[thread] to 1 and moves to its next value (possibly requiring a new texture fetch); every other thread sets array[thread] to 0.

4. If the parallel sum across array is >= k, thread 0 writes the value.
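To make the loop concrete, here is a sketch of those four steps, assuming a single block of 16 threads (one list per thread) and a hypothetical fetch_next() helper that hides the 4-int caching and the texture fetches (neither the helper nor the naive thread-0 reductions are meant as the final implementation):

```cuda
#define SENTINEL 0x7fffffff

// Hypothetical helper (not shown): returns the calling thread's next list
// element, or SENTINEL forever once that list is exhausted.
__device__ int fetch_next(int list_id);

// Assumes a single block of exactly 16 threads, one list per thread.
__global__ void merge_k_of_16(int *out, int *out_count, int k)
{
    __shared__ int vals[16];   // current head of each list
    __shared__ int flags[16];  // 1 if this thread's value == minimum
    __shared__ int minval;

    int t = threadIdx.x;
    vals[t] = fetch_next(t);

    while (true) {
        __syncthreads();
        // Step 1: minimum across the 16 threads (naive: thread 0 scans;
        // a log-step shared-memory reduction would replace this).
        if (t == 0) {
            int m = SENTINEL;
            for (int i = 0; i < 16; ++i)
                m = min(m, vals[i]);
            minval = m;
        }
        __syncthreads();

        // Step 2: every list is exhausted.
        if (minval == SENTINEL)
            break;

        // Step 3: flag matching threads and advance them; others clear.
        if (vals[t] == minval) {
            flags[t] = 1;
            vals[t] = fetch_next(t);
        } else {
            flags[t] = 0;
        }
        __syncthreads();

        // Step 4: count the flags; thread 0 emits the value if >= k.
        // (Single block, so a plain increment of *out_count is safe here.)
        if (t == 0) {
            int c = 0;
            for (int i = 0; i < 16; ++i)
                c += flags[i];
            if (c >= k)
                out[(*out_count)++] = minval;
        }
    }
}
```

Because each iteration consumes the global minimum, the emitted values come out already sorted in ascending order.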


It seems like a relatively inefficient approach, but maybe it’ll be OK, given that the cycles spent on memory access will probably dominate? Is there an accepted pattern for kernels that need to return a variable-sized amount of data?
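One common pattern for variable-sized output is to keep a global counter and have each producer reserve a unique output slot with atomicAdd() (this needs global-memory atomics, i.e. compute capability 1.1+); a sketch:

```cuda
// Global output counter; the host zeroes it (e.g. via cudaMemcpyToSymbol)
// before launch and reads it back afterwards to learn how many results
// were written. 'out' must be sized for the worst case.
__device__ unsigned int d_out_count;

// Any thread in any block can append a result; atomicAdd hands each
// caller a distinct slot index.
__device__ void emit(int *out, int value)
{
    unsigned int slot = atomicAdd(&d_out_count, 1u);
    out[slot] = value;
}
```

Note the output order is then whatever order the atomics happened to land in; the usual alternative, if order matters across many blocks, is to have each block write flags/counts into its own region and compact with a prefix sum (scan) in a second pass. In the single-block case above you don’t need atomics at all, since only thread 0 writes.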

As an aside, the latter half of the loop (parallel sum, then comparison with a constant) seems like a useful generalisation of the warp vote functions __all() and __any().
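For what it’s worth, that generalisation exists as a pair of intrinsics on newer hardware: __ballot() packs a per-thread predicate into a 32-bit warp mask, and __popc() counts the set bits, which gives exactly a "did at least k lanes vote yes" primitive. A sketch (assumes compute capability 2.0+ for __ballot(), and that the 16 lists live in the first 16 lanes of one warp; recent toolkits spell the intrinsic __ballot_sync(mask, pred) instead):

```cuda
// Count-vote within a warp: did at least k lanes' predicates hold?
// __ballot() sets bit i of the result if lane i passed a nonzero
// predicate; __popc() is a single population-count instruction.
__device__ bool at_least_k(int predicate, int k)
{
    unsigned int mask = __ballot(predicate);
    return __popc(mask) >= k;
}
```

That would replace both the flags[] array and the parallel sum in the loop above, with no shared memory or __syncthreads() needed for that step.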

An alternative approach (and the way I’ve implemented it serially) is to associate a counter with each distinct value, scan the 16 arrays incrementing each value’s counter, and output a value when its counter reaches k. My intuition is that this’ll be slower on the GPU, as it involves more memory traffic and more synchronisation.

If anyone has any ideas, I’d be really grateful.