It doesn’t work that way, because your code is executed by a SIMD (“SIMT”) machine.
The compiler will insert a branch reconvergence point just after the end of the loop, so that the processor has to wait for all threads of the warp to finish executing the loop before proceeding further.
This heuristic minimizes the amount of branch divergence in common code, but causes a deadlock here (S=0; is never executed).
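A minimal sketch of the pattern being described (the name `S` and the release `S = 0;` are taken from the snippets under discussion; the exact lock convention is my assumption):

```cuda
__device__ int S = 0;               // 0 = free, 1 = held (assumed convention)

__global__ void intra_warp_deadlock()
{
    // One lane's atomicCAS succeeds and would enter the critical section,
    // but the compiler places the reconvergence point after the loop, so
    // that lane cannot reach "S = 0;" until its warp mates leave the loop.
    // They never will, because the lock is never released: deadlock.
    while (atomicCAS(&S, 0, 1) != 0)
        ;                           // losing lanes spin here forever
    /* critical section */
    S = 0;                          // never executed
}
```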
I used cuda-gdb to debug the code.
It seems one thread is stuck at `if (atomicCAS(&S, 1, 0))`,
while the other threads are stuck at the end of the while loop, perhaps waiting to reconverge before starting the next iteration.
These things were discussed long back in the NVIDIA forums… We even got working code…
Here is code that demonstrates a spinlock, written by Tim Murray a long time back… But I am not sure whether this copy was modified by me or is a pure copy… It could be an untested version as well…
But it demonstrates how to go about locking in CUDA:
#include <stdio.h>
#include <stdlib.h>

__device__ volatile int lock = -1;     /* -1 = free, otherwise holder's blockIdx.x */
__device__ volatile int counter = 0;

__global__ void spinlol()
{
    __shared__ int intraCTAlock;

    if (!threadIdx.x && !threadIdx.y)
        intraCTAlock = -1;
    __syncthreads();

    /* One thread per CTA contends for the global lock. */
    if (!threadIdx.x && !threadIdx.y)
        while (atomicCAS((int*)&lock, -1, blockIdx.x) != -1);
    __syncthreads();

    /* ATTENTION: Changed 12 to threadIdx.x */
    /* Note: all threads of a warp contend here, which is exactly the
       intra-warp contention warned about elsewhere in this thread. */
    while (atomicCAS(&intraCTAlock, -1, threadIdx.x) != -1);
    counter++;
    __threadfence();
    atomicExch(&intraCTAlock, -1);     /* release the per-CTA lock */
    __syncthreads();

    if (!threadIdx.x && !threadIdx.y)
        atomicExch((int*)&lock, -1);   /* release the global lock */
}

int main(int argc, char** argv)
{
    int hostcounter = -1;
    dim3 grid, block;

    if (argc != 3)
    {
        printf("Usage: <executable-name> N_Blocks N_Threads\n");
        return -1;
    }
    grid.x = atoi(argv[1]);
    block.x = atoi(argv[2]);
    printf("Spawning kernel with %d blocks, %d threads_per_block\n", grid.x, block.x);
    spinlol<<<grid, block>>>();
    cudaDeviceSynchronize();
    printf("err = %s\n", cudaGetErrorString(cudaGetLastError()));
    /* Copy by symbol, not by string name (the string form was removed in CUDA 5). */
    cudaMemcpyFromSymbol(&hostcounter, counter, sizeof(int), 0, cudaMemcpyDeviceToHost);
    printf("counter = %d\n", hostcounter);
    return 0;
}
honestly, if you’re trying to do this you’re probably going down the wrong path, but general rules of thumb are
don’t have multiple threads within a warp contending for a lock, that leads to all sorts of confusing issues for most people because inter-warp branches are not the same as intra-warp branches
avoid global memory contention as much as possible (e.g., if you need to have a critical section among all warps in all CTAs, do per-CTA shared memory locks then a global lock)
traditional threading primitives implemented with atomics are a pretty terrible idea; if you can avoid atomics as much as possible (or entirely) you can get a big perf win (there are very interesting ways to do this, and when I say big perf win, I mean on the order of 5-10x)
(“well,” you think, “it sounds like tim is speaking from experience!” oh yes, I am)
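For the first rule of thumb, one common workaround is to elect a single leader lane per warp, so that threads in the same warp never contend with each other. A hedged sketch (the lock convention and all names here are mine, not from the thread):

```cuda
__device__ int gate = 0;                 /* 0 = free, 1 = held (my convention) */
__device__ volatile int total = 0;

__global__ void leader_per_warp()
{
    int lane = threadIdx.x & 31;         /* lane index within the warp */
    if (lane == 0) {                     /* only one thread per warp contends */
        while (atomicCAS(&gate, 0, 1) != 0)
            ;                            /* the other 31 lanes are simply idle
                                            at reconvergence, not spinning, so
                                            there is no intra-warp deadlock */
        total = total + 1;               /* critical section */
        __threadfence();
        atomicExch(&gate, 0);            /* release */
    }
}
```

Only leaders of different warps spin against each other, which sidesteps the intra-warp reconvergence problem described at the top of this thread.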
I knew this could be the wrong path in the first place; I just wanted to check it for myself, because if it worked efficiently, I would definitely gain perf on what I'm trying to accomplish.
Check this out… Biiig thread… A lot of the initial findings were challenged and somewhere in the thread concluded… I don't have time to explore this… If you have, you may check it out… If I remember things, I will post it for you.
The easiest way to do a per warp critical section is to just make all threads in the warp operate on the exact same data - hence, if you wanted only the first element of an array processed by a given warp, have all threads in the warp just process element 0. Since the warp is executed in lockstep, every thread will be computing identical results to if there were just one thread running.
So, to do a warp wide critical section, simply make all threads run through identical code paths with identical data input.
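A sketch of that idea (the names here are mine): every lane loads the same element and runs the same computation, so the warp stays converged and the result is the same as if a single thread had done the work.

```cuda
__global__ void warp_uniform(const int *in, int *out)
{
    /* Every lane in the warp processes element 0 with identical input,
       so all 32 lanes compute the same value in lockstep: no divergence
       and no lock needed for this per-warp "critical section". */
    int v = in[0] * 2 + 1;               /* identical work in every lane */
    if ((threadIdx.x & 31) == 0)         /* one lane per warp publishes it */
        out[threadIdx.x >> 5] = v;
}
```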