Threads, branching and writing to global memory


I’m fairly new to CUDA development and am having some issues on best practice for CUDA kernel performance.

For my application I’ve written a kernel which searches some data to find certain patterns. This works well :) the data is partitioned across the threads and each thread works on its own section of the data.

The problem I’m having is getting the results back to the host. What’s the best way to do this?

So the majority of threads will find nothing, and need to simply terminate. Those that do find something need to get this information back to the host.


  1. Is it best to simply have a location in global memory for each thread, and for each thread to write its result there? This is simplest on the kernel side, but requires a lot of global memory and then requires the host to search that global memory to collate the results.

  2. If instead I have a much smaller area of global memory that any thread could write to, this is more memory-efficient, but it complicates the kernel because I now need critical sections to coordinate the writes.

  3. More generally, I’ve noticed that having conditional branches in the kernel really kills performance. I could understand this if the threads were diverging, but I don’t believe they are. So if you want to test a condition in the kernel and act on it, what’s the best way of doing that without killing performance?



I’ve been using the following approach in code we have. globalCounter is a pointer to an unsigned int in global memory that has been pre-initialized to 0 before the kernel is launched (e.g. with cudaMemset).

if (found_result) {
    unsigned int pos = atomicAdd(globalCounter, 1);

    // the result array can only hold resultArrayMaxSize results
    if (pos < resultArrayMaxSize) {
        resultArray[pos] = myResult;   // write to the result array at index pos
    }
}

Recently I’ve replaced atomicAdd(globalCounter, 1) with atomicInc(globalCounter, 0xFFFFFFFF), because on Volta and Turing the atomicAdd started to produce synchronization errors in cuda-memcheck’s synccheck tool. atomicInc() doesn’t exhibit this behavior.
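Put together, a minimal sketch of what the whole kernel might look like (searchKernel, matches, and the Result layout here are placeholders I made up for the application’s actual search logic):

```cuda
// Hypothetical result record; the real payload depends on the search.
struct Result {
    unsigned int offset;   // where in the data the pattern was found
};

// Placeholder pattern test standing in for the real search.
__device__ bool matches(unsigned char b) { return b == 0x42; }

__global__ void searchKernel(const unsigned char *data, unsigned int n,
                             Result *resultArray,
                             unsigned int resultArrayMaxSize,
                             unsigned int *globalCounter)
{
    unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n)
        return;                 // most threads find nothing and terminate

    if (matches(data[i])) {
        // reserve a unique slot in the result array
        unsigned int pos = atomicInc(globalCounter, 0xFFFFFFFF);
        if (pos < resultArrayMaxSize)
            resultArray[pos].offset = i;
    }
}
```

The bounds check matters because the counter keeps incrementing even after the result array is full.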

To get the results back from device to host, you first need to copy the contents of globalCounter back, followed by a second cudaMemcpy for the number of elements recorded in globalCounter (clamped to resultArrayMaxSize, since the counter keeps incrementing after the array fills up).

For faster cudaMemcpy operations, consider making the host-side destination buffers for globalCounter and the result array page-locked (pinned) memory, i.e. allocate them with cudaMallocHost.
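In host code that retrieval might look like the following sketch (the d_/h_ variable names are assumptions; error checking omitted):

```cuda
unsigned int h_count = 0;

// 1. fetch how many results the kernel recorded
cudaMemcpy(&h_count, d_globalCounter, sizeof(h_count),
           cudaMemcpyDeviceToHost);

// 2. the counter keeps incrementing once the array is full, so clamp it
unsigned int numResults =
    h_count < resultArrayMaxSize ? h_count : resultArrayMaxSize;

// 3. copy back only the results actually found
cudaMemcpy(h_results, d_resultArray, numResults * sizeof(Result),
           cudaMemcpyDeviceToHost);
```

Here h_results would have been allocated once up front with cudaMallocHost(&h_results, resultArrayMaxSize * sizeof(Result)) so the copy can DMA directly from pinned memory.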



Thanks for the prompt reply. I like! Much better than my attempt last night looping on a global CAS, which just felt totally wrong. I’ll try this later.


Also, if the atomic counter becomes a bottleneck there are several tricks you can use: do a warp- or threadblock-wide reduce/prefix sum before incrementing the counter, or use several result arrays, each with its own atomic counter, with a stride of 1 kbyte between two different atomic counters so they don’t contend with each other.
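The warp-wide variant is the classic warp-aggregated atomics pattern: one lane per warp does the atomic for the whole warp, and the other lanes derive their slots from a ballot. A sketch, assuming the `*_sync` intrinsics (compute capability 3.0+, CUDA 9+); the function name is mine:

```cuda
// Returns a unique slot index for every active thread that calls this,
// issuing only one atomic per warp instead of one per thread.
__device__ unsigned int warpAggregatedInc(unsigned int *counter)
{
    unsigned int active = __activemask();           // lanes wanting a slot
    int leader = __ffs(active) - 1;                 // lowest active lane
    int lane   = threadIdx.x % 32;

    unsigned int base = 0;
    if (lane == leader)
        base = atomicAdd(counter, __popc(active));  // reserve slots for all
    base = __shfl_sync(active, base, leader);       // broadcast base index

    // this lane's slot = base + number of active lanes below it
    return base + __popc(active & ((1u << lane) - 1));
}
```

Each thread that finds a result would call pos = warpAggregatedInc(globalCounter) and then bounds-check pos against resultArrayMaxSize as before.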