Generating unknown amount of data

In a CUDA application I need to create new data on the GPU, based on the values in an existing array. I would like to start a thread for each point in the array, and each thread creates a new array.
However:

  • if, for example, the value in the array equals zero, the thread needs to do nothing
  • it is not known in advance how many threads will generate how much data

I could allocate a maximum per thread on the GPU, but:

  • this data structure will take up too much memory
  • the data structure will be sparse, so useless data will be moved from GPU to host

An example:
In block X, only thread A encounters a value > 0 and creates an array of 5 KB.
In block Y, threads B and C encounter values > 0 and create arrays of 1 KB and 3 KB.

I was thinking about allocating, for example, 20 MB and using blocks of 1 KB for storing information (much like a hard disk). But this might also create a sparse data structure. And how would the administration work?
Am I making any sense? Does anybody have pointers to literature or websites (or something in the SDK I missed)? I can’t imagine I’m the first to encounter this problem O:)

You might try making two arrays: a short array with one integer per thread, storing the number of bytes (or floats) that thread wrote into the main array. The main array is laid out so that each thread owns a fixed-width chunk of it, where the width of each chunk is the maximum amount of data a thread can output.

After the kernel runs, first you read back the short usage array to find out how many bytes to read from each chunk of the main array. This is still a little inefficient, since you will have to issue one memory copy per chunk, rather than read back all the data in one large copy operation.

The problem with this memory layout is that all the writes will be uncoalesced if each thread in a block is writing to a completely separate chunk of the main array. It would be better if each chunk were assigned to a block, rather than a thread.

Part of my code needs to perform a similar operation (one variable length array built per thread). My solution to this problem is to keep it simple and allocate the maximum amount for each thread. This does waste memory, but there is unfortunately no way around it that I can think of.

Any sort of memory allocation mechanism on the device is going to need slow atomic operations with many threads competing to increment the same value. If your algorithm is at all memory access bound, I would expect the performance to be no faster than an optimized CPU version.

If the problem is more in the wasted GPU->CPU memory transferred, you can always perform another pass and copy the sparse data to a compact array for transfer.

This is an interesting thought, siebert; however, I cannot wrap my mind around the following: there is no communication between blocks, so you don’t know how many threads in other blocks need to place information in the short array. Hence you must assume the worst case and allocate an integer for each possible thread. And again, this will also create a sparse array.
I’m not interested in which thread created what, but only in the results.
Is a reduction algorithm of some sort perhaps an option?

In reply to MisterAnderson42:
Compacting the original array could be an option! It requires some additional administration, but that’s okay. The original array is about 620 MB, but contains only about 10–20 MB of relevant data.
Does anybody have examples of how to compact your data? (true, haven’t used google yet B)

Sounds like we’re trying to solve similar problems… I need to compact (i.e., collect all floats > 0 in) a large array. Check this thread:

http://forums.nvidia.com/index.php?showtopic=60352

Haven’t come up with a good solution yet but will check the links provided by wumpus.

/Lars

You can perform a prefix sum (see the sample and whitepaper in the SDK) on an array that lists the length of each variable-sized array. Since the prefix-summed array lists the cumulative totals, each thread can then use its value from the prefix sum array as the start point for writing into the compacted array.

i.e.

Perform an exclusive prefix sum on length[] into cumulative_length[]

Each thread:

    start = cumulative_length[thread_index]

    for i from 0 to length[thread_index] - 1
        packed_data[start + i] = my_data[i]