Is it faster for all threads in a warp to compute the same value, or for one thread to compute it and broadcast it to the other threads?
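For concreteness, the broadcast version I have in mind would look roughly like this (just a sketch; `expensive_fn` is a stand-in for whatever the shared computation is, and I'm assuming the block size is a multiple of 32 so every warp is full):

```cuda
__global__ void broadcast_example(const float *in, float *out)
{
    int tid  = blockIdx.x * blockDim.x + threadIdx.x;
    int lane = threadIdx.x % 32;  // lane index within the warp

    // Option A: every lane redundantly computes the same value.
    // float v = expensive_fn(in[tid / 32]);

    // Option B: lane 0 computes it, then broadcasts to the rest of the warp.
    float v = 0.0f;
    if (lane == 0)
        v = expensive_fn(in[tid / 32]);
    v = __shfl_sync(0xffffffffu, v, 0);  // all lanes read lane 0's value

    out[tid] = v;
}
```

With the full mask `0xffffffffu`, every lane in the warp must reach the `__shfl_sync`, which is why I've kept the kernel free of early returns.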

Ah, thank you, this idea hadn’t occurred to me (I’m quite new to CUDA programming). Very helpful :)
I suppose it would mean launching 32x fewer threads, so it might not always be ideal for occupancy if the inputs aren't sufficiently large. I guess I could tweak it to use only a portion of the 32 values in such cases, though.

Is there any way of knowing what the compiler will be able to recognise as warp-uniform without somehow inspecting the SASS? E.g. is it likely to work out that dividing the thread index by 64 gives the same result for every thread in a warp?
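Concretely, the kind of pattern I'm asking about is something like this (my own illustrative kernel, with a hypothetical `table` lookup):

```cuda
__global__ void uniform_index_example(const float *table, float *out)
{
    // Warps are aligned groups of 32 consecutive threads, so for any
    // divisor that is a multiple of 32, threadIdx.x / 64 is identical
    // for every thread in a warp. The question is whether the compiler
    // can prove this and treat `group` as a uniform value, rather than
    // computing/loading it independently per lane.
    int group = threadIdx.x / 64;
    out[threadIdx.x] = table[group];
}
```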

I guess this leaves open the possibility that in some cases (non-"basic" arithmetic) the broadcast could actually be faster?