Is it faster for all threads in a warp to compute the same value, or for one thread to compute it and broadcast it to the other threads?
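For concreteness, the broadcast version I have in mind would look roughly like this (just a sketch; `expensive_fn` is a stand-in for whatever the shared computation is, and I'm assuming the block size is a multiple of 32 so every warp is full):

```cuda
__global__ void broadcast_example(const float *in, float *out)
{
    int tid  = blockIdx.x * blockDim.x + threadIdx.x;
    int lane = threadIdx.x % 32;  // lane index within the warp

    // Option A: every lane redundantly computes the same value.
    // float v = expensive_fn(in[tid / 32]);

    // Option B: lane 0 computes it, then broadcasts to the rest of the warp.
    float v = 0.0f;
    if (lane == 0)
        v = expensive_fn(in[tid / 32]);
    v = __shfl_sync(0xffffffffu, v, 0);  // all lanes read lane 0's value

    out[tid] = v;
}
```

With the full mask `0xffffffffu`, every lane in the warp must reach the `__shfl_sync`, which is why I've kept the kernel free of early returns.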

Ah, thank you, this idea hadn’t occurred to me (I’m quite new to CUDA programming). Very helpful :)
I suppose it would mean launching 32x fewer threads, so it might not always be ideal for occupancy if the inputs aren't sufficiently large. I guess I could tweak it to use only a portion of the 32 values in such cases, though.

Is there any way of knowing what the compiler will be able to recognise as warp-uniform without somehow inspecting the SASS? E.g. is it likely to work out that dividing the thread index by 64 gives the same result for every thread in a warp?
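Concretely, the kind of pattern I'm asking about is something like this (my own illustrative kernel, with a hypothetical `table` lookup):

```cuda
__global__ void uniform_index_example(const float *table, float *out)
{
    // Warps are aligned groups of 32 consecutive threads, so for any
    // divisor that is a multiple of 32, threadIdx.x / 64 is identical
    // for every thread in a warp. The question is whether the compiler
    // can prove this and treat `group` as a uniform value, rather than
    // computing/loading it independently per lane.
    int group = threadIdx.x / 64;
    out[threadIdx.x] = table[group];
}
```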

I guess this leaves open the possibility that in some cases (non-"basic" arithmetic) the broadcast could actually be faster?