Output through shared memory

I have a kernel where threads generate an output quite randomly. The maximum number of outputs per thread is defined, but the actual value depends largely on the input data. When each thread writes directly to the global memory writes are non-coalesced, therefore it is quite slow.
Another possibility is to use a small buffer in shared memory as an temporary output. When this buffer becomes full (or at the end of processing) write the content into global memory. I have problems with the former case.
As I see, when the SM buffer becomes full, all threads (or at least the first warp) should write into global memory - during this time, other threads should be paused. When writing ends, threads should resume processing. Any idea how to achieve this?.

You can use __syncthreads() to synchronize threads and make sure all threads see the same data in shared memory.

You could possible also use a more fine-grained mechanism using warp votes…

N.

Thanks for the reply.

My problem is where to synchronize threads. I have 512 threads in a thread block and a single output is 2x8 byte. To have ~0.5 occupancy I cannot allocate 512x2x8 byte output buffer (moreover SM is also used for input data), therefore I cannot guarantee that all threads can at least write an output before the buffer becomes full. So the thread which writes the last data should stop there - using __syncthreads() there is an option. However, it is also not guaranteed that other threads will ever execute the same code path, therefore they will not run into __syncthreads().

What I need is something like:

  • for the thread which writes last data, right after the write: jump to SM -> global memory copy

  • all other threads (irrespectively of the executed code path) jump to SM -> global memory copy

  • copy data

  • resume normal execution

I don’t see a way to do this with __syncthreads(), but I am really open to suggestions :).

By “threads generate an output quite randomly” do you mean that on an iteration of some outer loop that only some threads produce output ?
or were you meaning they all produce output but it is to locations that may be far apart in global memory ?

if it is the former what is the absolute maximum number of threads that might produce an output on the same iteration of that loop ?
Can you give some info on the probablity of a thread producing output on an iteration ?

If the maximum ‘M’ is a fair bit a less than the blocksize then you could have a buffer that any threads with output write to and if the free space left in the buffer is less than M then on that iteration you have to write the data to global. (i.e. the buffer must always have enough free space at start of an iteration to hold all the output that iteration might make)

If ‘M’ is up near blocksize then then this requires making the buffer able to hold one value per thread which you say you can’t do. In that case I think I would be looking at reducing the blocksize.

If you can’t reduce the blocksize and have to use a rather small buffer then I think it gets more complicated. e.g. If your threads have a spare register they could save one output value in that register and when all threads in a warp have an output they could all write to global.

I try to be a bit more specific and give more details. The algorithm is some kind of filtering on input values. That is, from every input several output candidates are generated. If the candidate fulfills a predefined filter equation, it is written to the output buffer. Every thread works on a small amount of input, and at maximum every thread generate 32 output values (a single output is 2*64 bit) - one in every iteration. The input is relatively small (~2kbyte) and stored in SM.

As the maximum number of outputs for a thread block in a single iteration is larger than what can be stored efficiently in SM it is not an option to allocate this amount of memory and write output to global memory at the end of every iteration.

Using registers to store a single output value for every thread is also problematic, because it may happen that a thread have several outputs before other threads have one.

What I did - without using SM for temporary output - is the following. For every thread I allocate a small amount of global memory, typically between 8 - 32 output large. If a thread used all of its allocated space, it gets a new one with global atomic. I did tests with three cases:

  • when no output is generated the runtime is 0.08 sec (still, all computation is done)

  • when all possible outputs are generated (32 per thread) the runtime is 3.65 sec with 32-output large buffers (that is no thread have to allocate a new memory buffer during execution). When the buffer is 32-output large, the runtime is 5.93 sec, when only 8-output large, the runtime is 22 sec

  • when filtering is done (~20% of the possible outputs are generated) the runtime is 1.34 sec for 8-output large buffer and 1.49 sec for 4-output large buffer

Without atomic, due to the non-coalesced write, the write memory bandwidth is only ~18 GB/s (GTX260). If a lot of new buffer allocation is required with global atomics the achieved bandwidth decreases considerably.

Naturally if you don’t mind gaps in your output buffer, you could simply always write the output along with a flag ok / not ok.

I had a very similar problem where the output buffer later was used as input, so this didn’t work for me as I wanted coalescing global memory reads.
I solved this using a global memory output buffer per threadblock, by using this approach you can store the current pointer in the output buffer in SM.
The basic idea is for each thread to use atomicInc to increase this pointer when necessary.
This has two problems:

  • writes coalescing is probably still bad
  • all threads in the block all want to perform atomicInc on the same pointer, so they will have to wait on each other.

This can be improved to obtain more coalescing global memory writes and reduce the number of threads that might have to wait on each other.
Each (half)warp first inventarises how many of its threads want to store something and settle on their order,
then the ‘leader’ thread (lowest threadId) uses atomicAdd to reserve the required contiguous space in the output buffer.
Then each thread that wants to store something simply writes it to the correct place within this reserved contiguous space.

More detailed:
Per (half)warp I additionally used a ok-counter and a temporary start pointer in SM.
I assume the (half)warp is syncronized at this point:

  1. thread with lowest id resets the ok-counter
  2. each thread that wants to store something uses AtomicInc to increase ok-counter and obtains its index
  3. thread with lowest id uses atomicadd to add the ok-counter to the current block output pointer, and obtains and stores a start pointer
  4. __threadfence_block() so each thread will see the correct start pointer
  5. each thread that wants to store something writes to start pointer + its index

Ideally the ok-counter stuff could be replaced by a vote function if CUDA supported one that would return the actual number of votes.
As it is now, the threads within a (half)warp have to wait on each other with the atomicinc’s,
but at least now they don’t have to wait anymore on other (half)warps except for the one atomicadd per (half)warp.

Global writes per (half)warp are now contiguous and are therefore coalesced (except it may take 2 iterations when the writes cross a coalescing bound?).

Also the atomic operations are now in SM (requires compute capability 1.3) and are a lot faster than in global memory.

First of all thanks everyone for the suggestions.

I have implemented two versions.

The first one (V1) allocates small blocks in global memory on a per thread basis. Block allocation is done with global atomic add. As consecutive addresses are written by the same thread this is expected to have not so good performance due to non-coalesced access.

The second one (V2) allocates a little larger global memory blocks per warp (again with a global atomic add). If they have valid output data threads within a warp write to consecutive addresses. The drawback is the shared memory atomic add to get the number of valid outputs within the warp.

In the table below nodata means that all computations are done, but no global memory write happens (so this is more or less the non-bendwidth limited case). filtered is a version where output validity is determined using a filter function.

_________GTX260 _______GTX480
_________V1___V2______V1____V2
nodata___2.8___14______1.3____8.4
filter_____53___15______4.3____8.7

So it seems that the L2 cache helps a lot for Fermi. Although V2 scales better for both cards, memory writes seem to be coalesced thanks to the L2 cache. And in the nodatacase V1 is probably faster because there are no shared memory atomics used. Interesting, I think.