A tight reduction loop may be the best example of where this helps. It’s the best case I can think of for this kind of optimization: very tight, and dominated by __syncthreads(). For the best contrast, assume the worst case of 512 threads to maximize the potential savings.
You could transform this code:
int tid = threadIdx.x;
if (tid < 256) val[tid] += val[tid+256];
__syncthreads();
if (tid < 128) val[tid] += val[tid+128];
__syncthreads();
if (tid < 64) val[tid] += val[tid+64];
__syncthreads();
if (tid < 32) { // no __syncthreads() needed within a single warp, which is always in step with itself (val must be volatile here)
    val[tid] += val[tid+32];
    val[tid] += val[tid+16];
    val[tid] += val[tid+8];
    val[tid] += val[tid+4];
    val[tid] += val[tid+2];
    val[tid] += val[tid+1];
}
__syncthreads();
into this code:
int tid = threadIdx.x;
if (tid < 256) {
    val[tid] += val[tid+256];
    __syncthreads();
    if (tid < 128) {
        val[tid] += val[tid+128];
        __syncthreads();
        if (tid < 64) {
            val[tid] += val[tid+64];
            __syncthreads();
            if (tid < 32) { // no __syncthreads() needed within a single warp, which is always in step with itself
                val[tid] += val[tid+32];
                val[tid] += val[tid+16];
                val[tid] += val[tid+8];
                val[tid] += val[tid+4];
                val[tid] += val[tid+2];
                val[tid] += val[tid+1];
            }
        }
    }
}
__syncthreads();
What’s the savings? The warps disabled by the conditionals no longer participate in the __syncthreads() overhead, and they skip the extra tests and compares.
Is this much of a savings? A bit. If __syncthreads() costs 2 clocks per warp, this saves 8+12+14 = 34 warp-level calls to __syncthreads(), or 68 clocks. It also saves 8+12+14 = 34 “if (tid<x)” tests, which are another 34 or 68 clocks. (I’m just guessing at the exact clock counts here.)
Let’s try a simple cost model for a quick analysis. Say an if test, an add, and a __syncthreads() all have a similar cost of 1 “op” per warp. Assume there are plenty of blocks, so no latency or pipeline costs. Then the first reduction code, annotated with per-warp op costs:
if (tid < 256) val[tid] += val[tid+256]; // 16 (test) + 8 (add)
__syncthreads();                         // 16
if (tid < 128) val[tid] += val[tid+128]; // 16 (test) + 4 (add)
__syncthreads();                         // 16
if (tid < 64) val[tid] += val[tid+64];   // 16 (test) + 2 (add)
__syncthreads();                         // 16
if (tid < 32) {                          // 16
    val[tid] += val[tid+32];             // 1
    val[tid] += val[tid+16];             // 1
    val[tid] += val[tid+8];              // 1
    val[tid] += val[tid+4];              // 1
    val[tid] += val[tid+2];              // 1
    val[tid] += val[tid+1];              // 1
}
__syncthreads();                         // 16
The total is 148 warp-ops.
The “relaxed syncthreads” cost:
if (tid < 256) {                  // 16
    val[tid] += val[tid+256];     // 8
    __syncthreads();              // 8
    if (tid < 128) {              // 8
        val[tid] += val[tid+128]; // 4
        __syncthreads();          // 4
        if (tid < 64) {           // 4
            val[tid] += val[tid+64]; // 2
            __syncthreads();      // 2
            if (tid < 32) {       // 2
                val[tid] += val[tid+32]; // 1
                val[tid] += val[tid+16]; // 1
                val[tid] += val[tid+8];  // 1
                val[tid] += val[tid+4];  // 1
                val[tid] += val[tid+2];  // 1
                val[tid] += val[tid+1];  // 1
            }
        }
    }
}
__syncthreads();                  // 16
The total is 80 warp-ops, versus 148. So that’s really nice: about 45% fewer ops.
Of course this is a best-case analysis of the best function that could benefit, and my assumptions about op costs may not be valid, but it’s a quick analysis anyway.
To be honest, though, I don’t see many cases other than reduction where this optimization would help noticeably.
Perhaps easier for NVIDIA (and users) would be to leave __syncthreads() alone, but add a new super-optimized __syncthreads_reduce() primitive that acts like __syncthreads_count() but does full reductions on the major types: int, float, long, and double. (Such a routine might require dynamic shared memory, which would need to be allocated as the example above implies anyway.)
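As a sketch only, such a primitive might be used like this. The name and signature below are invented for illustration; no such intrinsic exists in CUDA, and the real __syncthreads_count() only counts nonzero predicates rather than summing arbitrary values:

```cuda
// Hypothetical intrinsic: each thread contributes one value, and every
// thread in the block gets back the block-wide sum, the way
// __syncthreads_count() returns a block-wide count to all threads.
__device__ float __syncthreads_reduce_add(float v);  // imagined, not real CUDA

__global__ void blockSumExample(const float *in, float *out)
{
    float v   = in[blockIdx.x * blockDim.x + threadIdx.x];
    float sum = __syncthreads_reduce_add(v);  // one call replaces the whole tree
    if (threadIdx.x == 0)
        out[blockIdx.x] = sum;
}
```

The attraction is that the hardware or driver could pick the best internal strategy (and manage any scratch shared memory) without the user hand-tuning the reduction tree.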