What is the best way to keep threads busy doing nothing until they are needed again?

I have a fairly complex 1 block kernel with 1020 threads.

Part way through, only a subset of the threads are required and the rest would mess up shared memory if they execute the same function.

If I use:

    if (tx < x && ty < y)
        subset_function();   // requires __syncthreads()

Synchronization gets messed up for the remaining functions, which require more threads than x and y.

What is the best way to make threads greater than x and y wait for those less than x and y?

Is this what __threadfence() is for?

If the number of threads required for the subset function is no more than the warp size (32), and they all fall within the same warp, reconsider whether you really need __syncthreads() within that particular function at all.

Otherwise, have all threads execute - move/jump to - the subset function, and instead make execution of the function (or of code blocks within it) conditional on the thread index or count.
This keeps __syncthreads() converged.

In plain words, move your if() to reside within the function instead

If you actually need __syncthreads() within the particular if-section (now in the function) as well, consider storing the condition in a boolean, and break the if-section up into the blocks delineated by the __syncthreads() calls.

Like so - instead of:

    if (tx < x && ty < y)
    {
        [code block A]
        __syncthreads();
        [code block B]
        __syncthreads();
        [code block C]
    }

do:

    bool_recast = false;
    if (tx < x && ty < y)
        bool_recast = true;

    if (bool_recast)
        [code block A]
    __syncthreads();

    if (bool_recast)
        [code block B]
    __syncthreads();

    if (bool_recast)
        [code block C]
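Put together as actual CUDA, the converged form might look like the following minimal sketch (the kernel name, the [code block] placeholders and the parameters are all illustrative):

```cuda
__global__ void subsetKernel(int x, int y)
{
    int tx = threadIdx.x;
    int ty = threadIdx.y;

    // Evaluate the condition once, up front.
    bool active = (tx < x && ty < y);

    if (active) {
        // [code block A] - only the subset does real work here
    }
    __syncthreads();   // reached by ALL threads, so it stays converged

    if (active) {
        // [code block B]
    }
    __syncthreads();   // reached by ALL threads

    if (active) {
        // [code block C]
    }
}
```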


On sm_20 or higher you can use bar.sync inline PTX with a thread count. The thread count is the number of active warps times the warp size (32).
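A minimal sketch of what that inline PTX could look like (the helper name and the fixed barrier number 0 are my assumptions; the thread count must be a multiple of the warp size, and the participating threads are assumed to be the first activeWarps * 32 threads of the block):

```cuda
// sm_20+: synchronize only the first activeWarps warps of the block.
__device__ void partialBarrier(int activeWarps)
{
    // bar.sync <barrier number>, <thread count>
    asm volatile("bar.sync 0, %0;" :: "r"(activeWarps * 32));
}
```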

On any compute capability you can have a flag in shared memory that indicates whether you are still in the block with fewer active threads and have those warps that have no active threads loop over __syncthreads() while they are unused.

If the number of __syncthreads() needed inside the block is deterministic, just issue that many __syncthreads() for those warps with no active threads in the block.
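For the deterministic case, the matching could be sketched as follows (the three barriers and the condition are purely illustrative; as the NOTE below stresses, every warp must fall entirely on one side of the branch):

```cuda
__global__ void matchedBarriers(int x, int y)
{
    bool active = (threadIdx.x < x && threadIdx.y < y);

    if (active) {
        // [work]
        __syncthreads();   // barrier 1
        // [work]
        __syncthreads();   // barrier 2
        // [work]
        __syncthreads();   // barrier 3
    } else {
        // Warps with no active threads issue the same, known number
        // of barriers, so every __syncthreads() is matched.
        for (int i = 0; i < 3; ++i)
            __syncthreads();
    }
}
```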

As an addendum, the simplest case would be where you can arrange threads so that each warp has at least one active thread inside the block. In that case you could just use __syncthreads() inside the block. However, this arrangement would be inefficient, as each warp would be only partially utilized.

NOTE: In all these cases, it is important to think in terms of warps, not threads. Each warp is either part of the synchronization, or is not. Having any warp take both code paths will lead to undefined results.


“On sm_20 or higher you can use bar.sync inline assembly with a thread count. The thread count is the number of active warps times the warp size (32).”

Could you possibly elaborate on this, please

Also, would your addendum not be complex - or at least non-trivial - to implement, particularly when one wishes to reference global memory and/or perform operations like scans across the active threads?
Regardless, I fully agree that it is a hypothetical possibility/option.

Ah sorry, had meant to link to the PTX documentation for bar.sync. Now done.

The addendum indeed is quite hypothetical.

Semi-OT, this is an educational read: Singe: Leveraging Warp Specialization for High Performance on GPUs

Thanks to all for your suggestions.

I’ve decided the best approach for my app is to make reduction functions (which use __syncthreads()) that accept dummy values for the shared memory pointer and offset, and to call them with the correct values for threads less than x and y, and dummy values for the larger threads.

That way, all threads call the same functions, but the threads greater than x and y will not mess up the shared memory being reduced.

ETA: initially I was going to use a single shared memory address for the dummy functions, but then I realized there will be memory conflicts as multiple threads try to read/write the same address.
The solution is to maintain a separate shared memory block for the dummy functions.
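A rough sketch of how such a reduction function with dummy operands might look (the names, the 256-element size, and the index mapping are assumptions for illustration, not the actual code):

```cuda
// Every thread in the block calls this with the same trip count,
// so the __syncthreads() inside remains converged.
__device__ void reduceSum(float *sdata, int idx, int n)
{
    for (int stride = n / 2; stride > 0; stride >>= 1) {
        if (idx < stride)
            sdata[idx] += sdata[idx + stride];
        __syncthreads();
    }
}

__global__ void kernel(int x, int y)
{
    __shared__ float s_real[256];
    __shared__ float s_dummy[256];   // separate block: dummy writes
                                     // never disturb the real data

    int tid = threadIdx.y * blockDim.x + threadIdx.x;
    bool active = (threadIdx.x < x && threadIdx.y < y);

    // Surplus threads run the identical code path on the dummy block.
    reduceSum(active ? s_real : s_dummy, tid % 256, 256);
}
```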

Maybe some day I will learn how to use bar.sync…

With regards to your dummy functions, why would you actually care if memory conflicts occur (assuming that the conflicts are confined to the threads that are redundant/obsolete with regards to the particular function that must be executed)?

Also, just be careful not to have threads that are not required at the time, but execute anyway, draw on scarce resources - like cores - as this would unnecessarily slow down overall execution.

Since all the threads have to sync up later, I am under the impression any slowdown in the superfluous threads due to memory conflicts would slow down all synced threads. Is this not true?

I am not familiar with how threads occupy cores. Would bar.sync free up cores that would otherwise be used?

I, on the other hand, was more under the impression that memory conflicts would more or less imply “expect garbage in memory”, because of races.
I know that on the host side you may have memory deadlocks with regards to memory shared by threads; but I am not sure whether this can actually occur on the device; perhaps someone more knowledgeable than me can shed light on this.
Because you do not care about the result, you would not care about garbage in memory.

The reference to cores was a reference to CUDA cores - if, for example, you have arithmetic in your function that is processed by the CUDA cores, you would have more threads than truly necessary competing for those cores.

A mere flag in local or shared memory (and I suppose bar.sync too) would prevent a particular thread from executing the function code, if the thread is not needed; if a thread does not execute the function, it cannot compete for such resources.

This is true, but I am under the impression that if 2 threads try to write to the same shared memory address, they will be serialized and take more clock cycles.

Multiply by N extra cycles for N threads trying to read/write the same address. Not so?

I was not aware that threads occupy and relinquish cores. I would have thought that once a thread starts, it occupies the same core until it terminates.

Threads do not occupy cores, but reference - make use of - them.
Double precision (data type double) arithmetic serves as a good example: if you have double-precision arithmetic in your code, you can be sure that the cores process it, not the threads per se; the work is sent to the cores for processing.

OK, I think you’re referring to an ALU, not a core.

IIRC, there are only 2-4 floating point ALUs per SM, which are allocated to threads as needed, whereas there are 1-2K cores. Not sure whether a thread can stay alive w/o running on (occupying) a core. I believe hyper-threading on an Intel CPU may store thread info in extra registers, which get swapped into actual CPU state registers when the hyper-thread becomes an active thread. Don’t know if Nvidia has hyper-threads. It seems that the 2K cores per SM in a Titan may be the source of the 2K thread limit per SM.

If it had hyper-threads, it seems the thread count would not be limited by the core count.

ETA: I was mistaken about the core count. Only 192 cores per SM on the Titan, 2688 total, so apparently they must have hyper-threading to allow 2K threads per SM.

Thanks for your suggestion, it appears you are correct.

I should have said that the easiest way is to maintain dummy functions which do not write relevant memory, not the best way.

I will have to reexamine how to best do this when further optimizing for speed.

Going back to your point regarding threads writing to the same shared memory:

As per the programming guide:

“Shared memory has 32 banks that are organized such that successive 32-bit words map
to successive banks. Each bank has a bandwidth of 32 bits per clock cycle.
A shared memory request for a warp does not generate a bank conflict between two
threads that access any address within the same 32-bit word (even though the two
addresses fall in the same bank): In that case, for read accesses, the word is broadcast to
the requesting threads and for write accesses, each address is written by only one of the
threads (which thread performs the write is undefined).”

This is for devices of compute capability 3.x and higher, I believe.

Thus, threads in the same warp would not cause serialization when writing to the same shared memory location, it seems.
Honestly, I do not know what happens when threads across multiple warps attempt to write to the same shared memory location.
At the same time, there is the question of whether different warps would even arrive at such an instruction - a write to shared memory - at the same time.

With regards to scarce resources and cores:
I have come across mention of the divide between the programming model that cuda provides, and actual, physical gpu architecture/ hardware
When I look at diagrams of gpu architecture, I see cuda cores, load/ store units, special function units, the SM, the schedulers, and of course the possibility of SIMT; but hardly threads, or units that ‘store’ them, per se
Perhaps I should look closer