Is it faster for all threads in a warp to compute the same value, or for one thread to compute it and broadcast it to the other threads?

Hi all, I have a somewhat general question regarding a pattern that seems to come up fairly frequently.

Often I’ll have a situation where all threads need to compute a particular value, but I can guarantee that that value will be equal across all threads in a warp (e.g. if each thread in the warp is processing a different channel at the same x,y location then any f(x,y) will be equal across all threads in the warp).

In these situations, is it faster for each thread to compute the value individually, or for the first thread in the warp to compute it (within an if (threadIdx.x % warpSize == 0) {...} block) and then broadcast it to the other threads with __shfl_sync?

On the one hand, I’d think that since all threads in a warp execute the same instructions under the SIMT model, there would be no advantage to having 31 threads idle rather than performing (redundant) calculations. On the other hand, the number of cores in an SM is finite, so perhaps if only a single thread actually needs them (i.e. the others are masked out), that might leave room for the SM to execute instructions from other warps in the block? That seems unlikely, but honestly I have no idea how the hardware actually operates at that level. Are the execution units still “occupied” by inactive threads?
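For concreteness, here is a minimal sketch of the two variants I’m comparing (compute_f is a hypothetical stand-in for the per-(x,y) computation that is identical across the warp; FULL_MASK assumes all 32 lanes are active):

    // Hypothetical warp-uniform computation (declaration only, for the sketch).
    __device__ float compute_f(int x, int y);

    constexpr unsigned FULL_MASK = 0xffffffffu;

    // Variant A: every lane computes the value redundantly.
    __device__ float value_redundant(int x, int y)
    {
        return compute_f(x, y);
    }

    // Variant B: lane 0 computes the value, then broadcasts it to the warp.
    __device__ float value_broadcast(int x, int y)
    {
        float v = 0.0f;
        if (threadIdx.x % warpSize == 0) {
            v = compute_f(x, y);
        }
        return __shfl_sync(FULL_MASK, v, 0);   // broadcast lane 0's result
    }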

There is no advantage (for basic arithmetic computations) to using just one lane or one thread.

Rethink your algorithm: can the other 31 threads do the same computation on other data, e.g. for the next iterations or other locations? Then you can calculate 32 locations at once and afterwards distribute them one after the other to all 32 threads for common processing of each location, as in the sketch below.
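A rough sketch of that idea, assuming each warp is responsible for 32 consecutive locations (compute_f_for_location and process_location are hypothetical names):

    __device__ void process_32_locations(int base)
    {
        const unsigned FULL_MASK = 0xffffffffu;
        int lane = threadIdx.x % warpSize;

        // Step 1: each lane computes f for a different location (no idle lanes).
        float my_f = compute_f_for_location(base + lane);    // hypothetical

        // Step 2: the warp processes the 32 locations one after another;
        // the precomputed value is broadcast from the lane that owns it.
        for (int i = 0; i < warpSize; ++i) {
            float f_i = __shfl_sync(FULL_MASK, my_f, i);     // value for location base+i
            process_location(base + i, lane, f_i);           // hypothetical cooperative work
        }
    }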

Very basic integer computations can be done with the uniform instructions, which act like a separate lane and execution unit shared by all 32 threads. To use it, all 32 threads have to do the computation, and it has to be clear to the compiler that they are doing the same computation. It is often used for index calculations, and it works in parallel with the regular datapath (though the instruction still has to be issued).
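As an illustration of the kind of computation the compiler can treat as warp-uniform (whether ptxas actually emits U* instructions depends on the GPU architecture — Turing or newer — and the compiler version, and can only be confirmed by looking at the SASS):

    __global__ void scale_rows(float* data, int width, float s)
    {
        // blockIdx.x * width is the same for every thread of the block,
        // so ptxas may compute it on the uniform datapath / in uniform registers.
        int row_offset = blockIdx.x * width;

        // threadIdx.x differs per lane, so this part stays on the regular
        // INT32 datapath.
        int idx = row_offset + threadIdx.x;

        if (threadIdx.x < width)
            data[idx] = s * data[idx];
    }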

The uniform datapath seems to be a complete integer datapath, as it even includes UFLO, UPOPC, and UPRMT. Are there operations supported by the GPU’s regular integer datapath that are not supported by the uniform datapath? Off-hand I cannot think of any.

Redundant address computation certainly seems to be the biggest application of the uniform datapath, and presumably was also the prime motivation for adding it. In floating-point computation with matrices, (1) a not insignificant portion of the instructions is needed for addressing, and (2) much of the address computation occurs in uniform control flow, so it makes sense to move these integer operations “out of the way” of the floating-point instructions that do the data-transforming work.

This improves overall efficiency of execution, possibly also with respect to energy consumption for a given amount of data transformation, something that is important in supercomputers.

Ah, thank you, this idea hadn’t occurred to me (I’m quite new to CUDA programming). Very helpful :)
I suppose it would mean launching 32x fewer threads, so it might not always be ideal in terms of occupancy if the inputs aren’t sufficiently large, though I guess I could tweak this to use only a portion of the 32 values in such cases.

Is there any way of knowing what the compiler will be able to recognise as uniform without somehow inspecting the SASS? E.g. is it likely to know that dividing the thread index by 64 will give the same result for all threads in the warp?

I guess this leaves open the possibility that in some cases (non-“basic” arithmetic) it could be faster?

There are two general ways to do it:

  • Each thread is responsible for a different location, or
  • In some sections of your kernel each thread handles a different location, and in other sections each location is processed cooperatively by all 32 threads, iterating over the locations in a for loop.

So you are quite free to change the responsibility of the threads on the fly, as long as you handle the data transfer in a clever way and add some synchronization (like __syncwarp()).

Some sub-tasks can be parallelized more easily than others and sometimes you just want to parallelize global memory accesses for reading or writing.
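For the memory-access case, a sketch (assuming a block of exactly one warp for brevity; the tile size and process_tile are made-up names):

    __global__ void per_location_kernel(const float* __restrict__ src, int num_locations)
    {
        const int TILE = 32;
        __shared__ float tile[TILE];
        int lane = threadIdx.x;                 // blockDim.x == 32 assumed

        for (int loc = blockIdx.x; loc < num_locations; loc += gridDim.x) {
            // Coalesced read: each lane fetches one element of this location's data.
            tile[lane] = src[loc * TILE + lane];
            __syncwarp();                       // make the loads visible within the warp

            // All 32 lanes now cooperate on processing this one location.
            process_tile(tile, lane);           // hypothetical

            __syncwarp();                       // don't overwrite tile while still in use
        }
    }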

So far I have not manually optimized the use of uniform instructions and have not looked into how intelligently the assembler does it (as far as I remember, there is no way to prescribe uniform instructions in PTX; it is done as an optimization by ptxas when creating SASS from PTX). It could be that the effect of better utilizing the computation units is negligible (compared to using the normal INT32 path), and that the main advantage is using fewer regular registers in favor of the uniform register set shared by the warp, thus lowering register pressure.

I am not sure about some special operations like FP64 on consumer cards or FP16 on older cards, where there may be just 2 such computation units, or where those units are shared across the whole SM.

But for the computation units integrated into the SM Partitions, there is no advantage. Those have a fixed relationship to one lane or a group of a few lanes.
There is (probably) no way those computation units can read the registers of other lanes as operands.


(Assuming one block consists of 1024 threads.)

One option then would be to have one warp of the block do those calculations, creating 32 results and storing them in shared memory.

Then the 32 warps each read one of the 32 results, and each of those warps (with its 32 threads) cooperatively processes one location.

So the first calculation is done by one warp for the whole block. The second calculation is done by all warps.

Whether that makes sense depends on how complicated or simple those calculations are.
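A minimal sketch of this pattern, assuming 1024 threads per block (f and process_location are hypothetical placeholders for the two calculations):

    __global__ void two_phase_sketch()
    {
        __shared__ float results[32];

        int lane   = threadIdx.x % warpSize;
        int warpId = threadIdx.x / warpSize;

        // Phase 1: warp 0 alone computes the 32 per-location values.
        if (warpId == 0) {
            results[lane] = f(blockIdx.x * 32 + lane);               // hypothetical
        }
        __syncthreads();   // make the results visible to all 32 warps

        // Phase 2: each warp reads one result and cooperatively processes
        // that location with its 32 threads.
        float v = results[warpId];
        process_location(blockIdx.x * 32 + warpId, lane, v);         // hypothetical
    }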


I don’t know of any other way than inspecting the generated SASS. Anecdotally, from looking at quite a bit of SASS (but mostly for kernels of fairly low complexity), the compiler does a fairly good job of finding opportunities to use the uniform operations / registers.
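If you do want to check, the usual way is to dump the SASS of the compiled kernel and look for U-prefixed instructions and UR (uniform) registers, e.g. (assuming you target an architecture that has the uniform datapath, sm_75 or newer):

    nvcc -arch=sm_75 -cubin -o kernel.cubin kernel.cu
    cuobjdump -sass kernel.cubin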

Based on that, my take is that CUDA programmers should not worry about this until they have reached ninja-level optimization efforts.


For posterity: having tried both versions on a little test kernel, I can confirm that there is definitely no advantage to the single-thread + broadcast approach.