Busy Waiting in CUDA

Hi all,

I am new to CUDA programming and need to write a program that performs some operation on a matrix. I split the matrix into columns and assign one thread to each column. The logic of the program, however, forces each thread y to wait until its predecessor y-1 has finished its task. My idea is to use a token that is shared by all threads and starts at 0. Before doing its work, each thread checks whether the token equals its thread ID (y). When it finishes its task, a thread increments the token, giving the next thread the opportunity to do its job.

My problem is that this needs atomic operations; otherwise, things like incrementing a shared variable will not work properly. Of course, I know CUDA has atomicInc(), and that works very well.

The problem arises when I try to write the loop that makes a thread wait until it is its turn to proceed (busy waiting).

Let y be the thread ID and token a pointer to the memory location where the actual token is stored. With while(*token != y) {} I of course get into trouble, since the read is not performed as an atomic operation. Then I found atomicCAS(), another CUDA built-in function. It seemed very likely to solve the problem; however, it didn’t. Here is why: atomicCAS() changes the value at the memory location I am comparing against! This is what the CUDA Programming Guide 4.2 says:

[i]int atomicCAS(int* address, int compare, int val);

reads the 32-bit or 64-bit word old located at the address address in global or shared memory, computes (old == compare ? val : old), and stores the result back to memory at the same address. These three operations are performed in one atomic transaction. The function returns old (Compare And Swap).[/i]
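To make the quoted semantics concrete, here is a small sketch that records what atomicCAS() returns and what it leaves in memory. The kernel name `casDemo` and the host helper `runCasDemo` are made up for illustration, and the token is assumed to start at 0 as in my scheme.

```cuda
#include <cuda_runtime.h>

// Single-thread kernel: records the return value of each atomicCAS call and
// the token value afterwards. The volatile read forces a fresh load each time.
__global__ void casDemo(int* token, int* out)
{
    volatile int* vtoken = token;

    out[0] = atomicCAS(token, 5, 9);  // token is 0 and 0 != 5: nothing changes, returns 0
    out[1] = *vtoken;                 // still 0
    out[2] = atomicCAS(token, 0, 0);  // match, but val == compare: value unchanged, returns 0
    out[3] = *vtoken;                 // still 0
    out[4] = atomicCAS(token, 0, 1);  // match: stores 1, returns the old value 0
    out[5] = *vtoken;                 // now 1
}

// Hypothetical host helper: runs the kernel with the token initialised to 0
// and copies the six recorded values into `results`.
// Returns true if the CUDA calls succeeded.
bool runCasDemo(int results[6])
{
    int* dToken = nullptr;
    int* dOut = nullptr;
    if (cudaMalloc(&dToken, sizeof(int)) != cudaSuccess) return false;
    if (cudaMalloc(&dOut, 6 * sizeof(int)) != cudaSuccess) {
        cudaFree(dToken);
        return false;
    }
    cudaMemset(dToken, 0, sizeof(int));

    casDemo<<<1, 1>>>(dToken, dOut);

    // cudaMemcpy waits for the kernel to finish before copying.
    cudaError_t err = cudaMemcpy(results, dOut, 6 * sizeof(int),
                                 cudaMemcpyDeviceToHost);
    cudaFree(dToken);
    cudaFree(dOut);
    return err == cudaSuccess;
}
```

The memory is only modified when old equals compare, and even then the stored value is whatever is passed as val.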

So, instead of comparing and returning the result of the comparison, it only returns the old value of the token (which I don't care about as a return value) and writes the value I give it into the token if the comparison succeeds. So I cannot perform the simple check "is token equal to thread ID?" atomically. I am pretty sure there is another way to do this, but I have not been able to figure it out.
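One possible workaround, sketched here under assumptions rather than as a confirmed solution: passing y as both compare and val makes atomicCAS() store y over y, so the token stays unchanged and the return value can be compared against y. The sketch assumes one single-thread block per column instead of one thread per column, because threads of the same warp run in lockstep on hardware like mine and a thread spinning on its warp neighbour can hang. It also assumes that all blocks are resident on the GPU at the same time. The names `tokenChain` and `runTokenChain` are made up, and the per-column "work" is only a stand-in.

```cuda
#include <cuda_runtime.h>

// Block y waits until the token equals y, does its work, then passes the token on.
__global__ void tokenChain(int* token, int* out)
{
    const int y = blockIdx.x;

    // atomicCAS(token, y, y) can only replace y with y, so it behaves like an
    // atomic read that never changes the token.
    while (atomicCAS(token, y, y) != y) { }
    __threadfence();  // order the token read before the reads of the predecessor's data

    // Stand-in for the real column work: depends on the predecessor's result.
    volatile int* vout = out;
    vout[y] = (y == 0) ? 1 : vout[y - 1] + 1;

    __threadfence();      // make the result visible before releasing the token
    atomicAdd(token, 1);  // hand the token to column y + 1
}

// Hypothetical host helper: launches one single-thread block per column and
// copies the per-column results back. Keep numColumns small enough that all
// blocks can be resident at once. Returns true if the CUDA calls succeeded.
bool runTokenChain(int numColumns, int* results)
{
    int* dToken = nullptr;
    int* dOut = nullptr;
    if (cudaMalloc(&dToken, sizeof(int)) != cudaSuccess) return false;
    if (cudaMalloc(&dOut, numColumns * sizeof(int)) != cudaSuccess) {
        cudaFree(dToken);
        return false;
    }
    cudaMemset(dToken, 0, sizeof(int));            // token starts at 0
    cudaMemset(dOut, 0, numColumns * sizeof(int));

    tokenChain<<<numColumns, 1>>>(dToken, dOut);

    // cudaMemcpy waits for the kernel to finish before copying.
    cudaError_t err = cudaMemcpy(results, dOut, numColumns * sizeof(int),
                                 cudaMemcpyDeviceToHost);
    cudaFree(dToken);
    cudaFree(dOut);
    return err == cudaSuccess;
}
```

Calling `runTokenChain(8, results)` should leave `results[i] == i + 1`, since each column adds 1 to its predecessor's value. Whether this scales to many columns, or to several threads per block, is exactly the part I am unsure about.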

Could you guys help me please?

P.S.: I am using CUDA 5 on a GPU with compute capability 3.0.