Busy Waiting in CUDA

Hi all,

I am new to CUDA programming and need to write a program that performs some operation on a matrix. I split the matrix into columns and assign one thread to process each column. The logic of the program, however, forces each thread y to wait until its predecessor y-1 has completed its task. My idea was to use a token that is shared by all threads and starts at 0. Before doing its work, each thread checks whether the token equals its thread ID (y). At the end of its task, a thread increments the token, giving the next thread the opportunity to do its job.

My problem is that I need atomic operations; otherwise, things like incrementing a shared variable will not work properly. Of course, I know CUDA has atomicInc(), and that works very well.

The trouble starts when I try to write the loop that makes a thread wait until it is its turn to proceed (busy waiting).

Let y be the thread ID and token a pointer to the memory location where the actual token is stored. With while(*token != y) {} I of course get into trouble, since the read is not performed as an atomic operation. Then I found atomicCAS(), another CUDA built-in function. It seemed very likely to solve the problem; however, it didn’t. And here is why: atomicCAS() changes the value at the memory location I am comparing against! Here is what the CUDA Programming Guide 4.2 says:

[i]int atomicCAS(int* address, int compare, int val);

reads the 32-bit or 64-bit word old located at the address address in global or shared memory, computes (old == compare ? val : old), and stores the result back to memory at the same address. These three operations are performed in one atomic transaction. The function returns old (Compare And Swap).[/i]

So, instead of comparing and returning the result of the comparison, it only returns the old value of the token (which I don’t care about as a return value) and stores the value I give it into the token if the comparison succeeds. So I cannot perform the simple check "is token equal to thread ID?" atomically. I am pretty sure there is another way to do this, but I cannot figure out what it is.
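
One observation on the quoted definition, offered as a sketch rather than a recommendation: since atomicCAS() stores (old == compare ? val : old), passing the same value as both compare and val leaves the word unchanged whichever branch is taken, so the call behaves like an atomic read. The names probe_token and token_after_probe below are made up for illustration.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel: every thread asks "is the token equal to my id?"
// atomicCAS(token, y, y) stores y only when the old value is already y,
// and stores the old value back otherwise, so the token never changes.
__global__ void probe_token(int *token, int *is_my_turn) {
    int y = blockIdx.x * blockDim.x + threadIdx.x;
    int old = atomicCAS(token, y, y);
    is_my_turn[y] = (old == y) ? 1 : 0;
}

// Hypothetical host helper: sets the token to `start`, lets n threads probe
// it, copies the per-thread answers into flags[0..n-1], and returns the
// token value read back afterwards.
int token_after_probe(int start, int n, int *flags) {
    int *d_token, *d_flags;
    cudaMalloc(&d_token, sizeof(int));
    cudaMalloc(&d_flags, n * sizeof(int));
    cudaMemcpy(d_token, &start, sizeof(int), cudaMemcpyHostToDevice);

    probe_token<<<1, n>>>(d_token, d_flags);

    int token = -1;
    cudaMemcpy(&token, d_token, sizeof(int), cudaMemcpyDeviceToHost);
    cudaMemcpy(flags, d_flags, n * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(d_token);
    cudaFree(d_flags);
    return token;
}

int main() {
    int flags[10];
    int token = token_after_probe(3, 10, flags);
    printf("token after probing: %d\n", token);
    for (int y = 0; y < 10; ++y)
        printf("thread %d sees its turn: %d\n", y, flags[y]);
    return 0;
}
```

atomicAdd(token, 0) would be another way to get the current value back without modifying it.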

Could you guys help me please?

P.S.: I am using CUDA 5 on a GPU with compute capability 3.0.

Hi Mhkgalvez,

Actually, your

while (*token != y);

loop should work… Provided you have declared token as

volatile int

This forces the compiler to re-read the value from memory every time you access it, instead of keeping a copy in a register.

I am a bit confused, though, about how you’re trying to synchronize threads. Recall that within a warp, i.e. within each chunk of 32 consecutive threads of a block, the threads execute in strict lock-step. Making one single thread wait will cause the whole warp to wait.
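
To make the volatile spin loop and the warp caveat concrete, here is a minimal sketch. It sidesteps the lock-step problem by launching one thread per block, so no two waiting threads share a warp. It assumes that all blocks are resident on the GPU at the same time (reasonable for the 8 blocks used here, not for arbitrarily large grids), because a spinning block whose predecessor has not been scheduled yet would wait forever. The names ordered_columns and run_chain and the data array are hypothetical stand-ins for the per-column work.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel: column y may only be processed after column y-1.
// Launched with ONE thread per block so that waiters never share a warp.
__global__ void ordered_columns(volatile int *token, volatile int *data) {
    int y = blockIdx.x;              // one thread per block: y is the column

    while (*token != y) { }          // volatile read: reloaded on every pass
    __threadfence();                 // order the token read before the data read

    // Stand-in for the real work: depends on the predecessor's result.
    int prev = (y == 0) ? 0 : data[y - 1];
    data[y] = prev + 1;

    __threadfence();                 // make data[y] visible before passing the token
    *token = y + 1;                  // only the token holder writes, so a plain store is enough
}

// Hypothetical host helper: runs n columns in order and returns data[n-1].
int run_chain(int n) {
    int *d_token, *d_data;
    cudaMalloc(&d_token, sizeof(int));
    cudaMalloc(&d_data, n * sizeof(int));
    cudaMemset(d_token, 0, sizeof(int));
    cudaMemset(d_data, 0, n * sizeof(int));

    ordered_columns<<<n, 1>>>(d_token, d_data);   // n blocks, 1 thread each

    int last = -1;
    cudaMemcpy(&last, d_data + (n - 1), sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(d_token);
    cudaFree(d_data);
    return last;
}

int main() {
    printf("last column value: %d\n", run_chain(8));
    return 0;
}
```

One thread per block wastes most of the hardware, so this only illustrates the synchronization pattern. If the columns really must run strictly one after another, a single thread looping over them does the same work without any waiting.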

Hi PedroUK,

Thank you so much for your answer.

I really didn’t know about the lock-step execution. As I am testing my application with only 10 threads, of course they are all in the same warp. But if that is how it works, I suppose I really cannot do what I want, right? If I make thread 1 wait for thread 0 while both are in the same warp, then I have a livelock: thread 0 cannot advance while thread 1 is spinning, so both are stuck.

Does anyone have another idea or solution for the problem?


I just tested volatile int, but it does not work for me, since what I have is not an integer but a pointer to an integer.
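
For what it is worth, the volatile qualifier can also be applied to what a pointer points at. A minimal sketch, assuming the token lives in global memory and reaches the kernel as a pointer; load_token, read_token and read_token_on_device are hypothetical names:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// `volatile int *p` is a pointer to a volatile int: every *p is a real load.
// (`int *volatile p` would instead make the pointer variable itself volatile.)
__device__ int load_token(int *token) {
    volatile int *vtoken = token;    // add volatile to the pointee
    return *vtoken;
}

// Alternative: declare the kernel parameter itself as volatile int *.
__global__ void read_token(volatile int *token, int *out) {
    out[0] = *token;                       // parameter form
    out[1] = load_token((int *)token);     // cast form inside a helper
}

// Hypothetical host helper: stores `value` in the token, reads it back on the
// device through both forms, and returns it if they agree (-1 otherwise).
int read_token_on_device(int value) {
    int *d_token, *d_out;
    cudaMalloc(&d_token, sizeof(int));
    cudaMalloc(&d_out, 2 * sizeof(int));
    cudaMemcpy(d_token, &value, sizeof(int), cudaMemcpyHostToDevice);

    read_token<<<1, 1>>>(d_token, d_out);

    int out[2] = {-1, -1};
    cudaMemcpy(out, d_out, 2 * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(d_token);
    cudaFree(d_out);
    return (out[0] == out[1]) ? out[0] : -1;
}

int main() {
    printf("token read on device: %d\n", read_token_on_device(7));
    return 0;
}
```

With either form, the wait loop from the earlier reply stays as while (*token != y) { } and the token is reloaded on each pass.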