Busy Waiting in CUDA

mhkgalvez · February 8, 2013, 10:53pm

Hi all,

I am new to CUDA programming and need to write a program that performs some operation on a matrix. I split the matrix into columns, assigning one thread to process each column. The logic of the program, however, forces each thread y to wait until its predecessor y-1 has completed its task. I thought about using a token that is shared by all threads and starts at 0. Before doing its work, each thread checks whether the token equals its thread ID (y). At the end of its task, a thread increments the token, giving the next thread the opportunity to do its job.

My problem is that I need atomic operations; otherwise, operations such as incrementing a shared variable will not work properly. I know that CUDA has atomicInc(), and that works very well.

The problem arises when I try to write the loop that makes a thread wait until it is its turn to proceed (busy waiting).

Let y be the thread ID and token a pointer to the memory location where the actual token is stored. With while(*token != y) {} I of course get into trouble, since the read is not performed as an atomic operation. Then I found atomicCAS(), another CUDA built-in function. It seemed likely to solve the problem, but it didn’t. Here is why: atomicCAS() changes the value at the memory location I am comparing against! Here is what the CUDA Programming Guide 4.2 says:

int atomicCAS(int* address, int compare, int val);

reads the 32-bit or 64-bit word old located at the address address in global or shared memory, computes (old == compare ? val : old), and stores the result back to memory at the same address. These three operations are performed in one atomic transaction. The function returns old (Compare And Swap).

So, instead of comparing and returning the result of the comparison, it only returns the old value of the token (which I don’t care about as a return value) and writes the value I give it into the token if the comparison succeeds. So I cannot perform the simple operation "is token equal to thread ID?" atomically. I am pretty sure there is another way to do this, but I have not been able to figure it out.
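
For illustration only, here is a minimal sketch of how the quoted atomicCAS() definition can still be used for a read-only check. When compare and val are the same value, the word in memory is rewritten either with its own old value or with an equal value, so it never changes, and the returned old value can be compared against y. atomicAdd(token, 0) gives the same effect. The names probe_token and probe are hypothetical helpers and not part of the original post. Note also that a plain aligned 32-bit load is already a single memory access, so the atomic is not strictly required just to read the token.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Reads *token atomically in two ways, neither of which changes its value.
__global__ void probe_token(int* token, int y, int* result)
{
    // compare == val: memory is rewritten with an equal value or with its own
    // old value, so *token is unchanged and the old value is returned.
    result[0] = atomicCAS(token, y, y);
    // Adding zero also returns the current value without modifying it.
    result[1] = atomicAdd(token, 0);
}

// Hypothetical host helper: sets the token to `start`, runs one probe, and
// fills out[] with {value from atomicCAS, value from atomicAdd, token after}.
bool probe(int start, int y, int out[3])
{
    int *d_token = nullptr, *d_result = nullptr;
    bool ok = cudaMalloc(&d_token, sizeof(int)) == cudaSuccess
           && cudaMalloc(&d_result, 2 * sizeof(int)) == cudaSuccess;
    if (ok) {
        cudaMemcpy(d_token, &start, sizeof(int), cudaMemcpyHostToDevice);
        probe_token<<<1, 1>>>(d_token, y, d_result);
        ok = cudaDeviceSynchronize() == cudaSuccess
          && cudaMemcpy(out, d_result, 2 * sizeof(int), cudaMemcpyDeviceToHost) == cudaSuccess
          && cudaMemcpy(out + 2, d_token, sizeof(int), cudaMemcpyDeviceToHost) == cudaSuccess;
    }
    cudaFree(d_token);   // cudaFree(nullptr) is a harmless no-op
    cudaFree(d_result);
    return ok;
}

int main()
{
    int r[3];
    const int ys[2] = {3, 7};   // one matching and one non-matching thread ID
    for (int i = 0; i < 2; ++i) {
        if (!probe(3, ys[i], r)) { fprintf(stderr, "CUDA error\n"); return 1; }
        printf("y=%d: cas saw %d, add saw %d, token afterwards %d\n",
               ys[i], r[0], r[1], r[2]);
    }
    return 0;
}
```

Inside a kernel the test "is it my turn?" would then read atomicCAS(token, y, y) == y. This only covers the atomic read; it does not by itself make waiting on another thread of the same warp safe.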

Could you guys help me please?

p.s.: I am using CUDA 5 on a GPU with compute capability 3.0

PedroUK · February 9, 2013, 1:25pm

Hi Mhkgalvez,

Actually, your

while (*token != y);

loop should work… Provided you have declared token as

volatile int

This forces the code to re-read it from global memory every time you access it.

I am a bit confused, though, about how you’re trying to synchronize threads. Recall that within a warp, i.e. within chunks of 32 threads in each block, the threads execute in strict lock-step parallelism. Making one single thread wait will cause the whole warp to wait.
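
As a rough sketch of how the volatile spin loop and the warp caveat fit together, the example below puts each waiting thread in its own block, so that no thread ever waits on a member of its own warp. It is an assumption-laden illustration, not the original poster's program: the running sum stands in for the unknown per-column operation, chained_sum and run_chain are hypothetical names, and the scheme assumes that all blocks of the grid are resident on the GPU at the same time (true for a handful of one-thread blocks, not guaranteed for large grids). The __threadfence() calls order the data accesses relative to the token accesses.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Block y waits until the token equals y, does its step, then passes the token on.
// Assumes a launch with one thread per block and all blocks co-resident.
__global__ void chained_sum(volatile int* token, volatile int* out, const int* in)
{
    int y = blockIdx.x;
    while (*token != y) { }   // volatile: the token is re-read on every iteration
    __threadfence();          // order the token read before reading out[y - 1]
    int prev = (y == 0) ? 0 : out[y - 1];
    out[y] = prev + in[y];    // stand-in for the real dependent work
    __threadfence();          // make out[y] visible before releasing the successor
    *token = y + 1;           // only the token holder writes, so a plain store suffices
}

// Hypothetical host helper: runs the chain over n values and copies the result back.
bool run_chain(const int* h_in, int* h_out, int n)
{
    int *d_token = nullptr, *d_in = nullptr, *d_out = nullptr;
    bool ok = cudaMalloc(&d_token, sizeof(int)) == cudaSuccess
           && cudaMalloc(&d_in, n * sizeof(int)) == cudaSuccess
           && cudaMalloc(&d_out, n * sizeof(int)) == cudaSuccess;
    if (ok) {
        cudaMemset(d_token, 0, sizeof(int));   // the token starts at 0
        cudaMemcpy(d_in, h_in, n * sizeof(int), cudaMemcpyHostToDevice);
        chained_sum<<<n, 1>>>(d_token, d_out, d_in);   // n blocks, 1 thread each
        ok = cudaDeviceSynchronize() == cudaSuccess
          && cudaMemcpy(h_out, d_out, n * sizeof(int), cudaMemcpyDeviceToHost) == cudaSuccess;
    }
    cudaFree(d_token);
    cudaFree(d_in);
    cudaFree(d_out);
    return ok;
}

int main()
{
    const int n = 8;
    int in[n] = {1, 2, 3, 4, 5, 6, 7, 8};
    int out[n];
    if (!run_chain(in, out, n)) { fprintf(stderr, "CUDA error\n"); return 1; }
    for (int i = 0; i < n; ++i) printf("%d ", out[i]);   // running sums of in[]
    printf("\n");
    return 0;
}
```

The lock-step behaviour described above applies to the compute capability 3.0 hardware in question. GPUs of compute capability 7.0 and later schedule the threads of a warp independently, which changes the picture, but spinning on a neighbouring thread remains a fragile design.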

mhkgalvez · February 9, 2013, 6:37pm

Hi PedroUK,

Thank you so much for your answer.

I really didn’t know about the lock-step parallelism. As I am testing my application with only 10 threads, of course they are all in the same warp. But if that is the case, I suppose I really cannot do what I want, right? If I make thread 1 wait for thread 0 and both are in the same warp, then I have a livelock condition, since both are blocked.

Does anyone have another idea or solution for the problem?
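
One possible way around the same-warp hang, sketched here under assumptions, is to drop the spin loop entirely and let the threads of a single block take turns using __syncthreads(). Every thread runs the same loop and reaches the barrier on every iteration, so nobody blocks a warp-mate. The per-column operation (adding the previous column to the current one) is a hypothetical stand-in for the real work, and ordered_columns and run_ordered are made-up names. The sketch assumes all columns fit in one block.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Thread y owns column y of a row-major rows x cols matrix. Columns are
// processed strictly in order; the barrier replaces the busy-wait token.
__global__ void ordered_columns(int* m, int rows, int cols)
{
    int y = threadIdx.x;
    for (int turn = 1; turn < cols; ++turn) {   // column 0 has no predecessor
        if (y == turn) {
            for (int r = 0; r < rows; ++r)
                m[r * cols + y] += m[r * cols + (y - 1)];   // depends on column y-1
        }
        // Reached by every thread on every iteration; also makes the writes
        // above visible to the thread that owns the next column.
        __syncthreads();
    }
}

// Hypothetical host helper: processes the matrix in place on the device.
bool run_ordered(int* h_m, int rows, int cols)
{
    int* d_m = nullptr;
    size_t bytes = (size_t)rows * cols * sizeof(int);
    bool ok = cudaMalloc(&d_m, bytes) == cudaSuccess;
    if (ok) {
        cudaMemcpy(d_m, h_m, bytes, cudaMemcpyHostToDevice);
        ordered_columns<<<1, cols>>>(d_m, rows, cols);   // one block, one thread per column
        ok = cudaDeviceSynchronize() == cudaSuccess
          && cudaMemcpy(h_m, d_m, bytes, cudaMemcpyDeviceToHost) == cudaSuccess;
    }
    cudaFree(d_m);
    return ok;
}

int main()
{
    const int rows = 2, cols = 4;
    int m[rows * cols] = {1, 1, 1, 1,
                          1, 1, 1, 1};
    if (!run_ordered(m, rows, cols)) { fprintf(stderr, "CUDA error\n"); return 1; }
    for (int r = 0; r < rows; ++r) {
        for (int c = 0; c < cols; ++c) printf("%d ", m[r * cols + c]);
        printf("\n");
    }
    return 0;
}
```

Like the token scheme, this runs the columns one after another, so it gives the required ordering but no parallelism across columns. If the rows are independent, it may be worth assigning threads to rows instead and looping over the columns inside each thread.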

mhkgalvez · February 9, 2013, 8:21pm

PedroUK,

I just tested volatile int, but it won’t work since I don’t have an integer. What I have is a pointer to an integer.
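
For a pointer, the qualifier goes on the pointed-to type: volatile int* is a pointer to a volatile int, which is what the re-read behaviour requires (int* volatile would instead make the pointer variable itself volatile). The sketch below is an illustration under assumptions, with hypothetical names try_turn and run_turns. It keeps the kernel parameter as a plain int* so it can still be passed to atomicAdd(), which is declared for int*, and creates a volatile alias for reading. To stay deterministic it checks the token once instead of spinning.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Checks once whether it is y's turn; if so, records that and advances the token.
__global__ void try_turn(int* token, int y, int* ran)
{
    volatile int* vtoken = token;   // the int is volatile, not the pointer
    if (*vtoken == y) {             // read goes to memory, not to a cached register
        *ran = 1;
        atomicAdd(token, 1);        // atomics are used through the plain int* pointer
    } else {
        *ran = 0;
    }
}

// Hypothetical host helper: launches try_turn once per entry of ys[], in order,
// recording whether each launch ran and the final token value.
bool run_turns(const int* ys, int count, int* ran, int* final_token)
{
    int *d_token = nullptr, *d_ran = nullptr;
    bool ok = cudaMalloc(&d_token, sizeof(int)) == cudaSuccess
           && cudaMalloc(&d_ran, sizeof(int)) == cudaSuccess;
    if (ok) cudaMemset(d_token, 0, sizeof(int));   // the token starts at 0
    for (int i = 0; ok && i < count; ++i) {
        try_turn<<<1, 1>>>(d_token, ys[i], d_ran);
        ok = cudaMemcpy(&ran[i], d_ran, sizeof(int), cudaMemcpyDeviceToHost) == cudaSuccess;
    }
    if (ok)
        ok = cudaMemcpy(final_token, d_token, sizeof(int), cudaMemcpyDeviceToHost) == cudaSuccess;
    cudaFree(d_token);
    cudaFree(d_ran);
    return ok;
}

int main()
{
    const int ys[3] = {1, 0, 1};   // y=1 is too early the first time
    int ran[3], token = -1;
    if (!run_turns(ys, 3, ran, &token)) { fprintf(stderr, "CUDA error\n"); return 1; }
    for (int i = 0; i < 3; ++i) printf("y=%d ran=%d\n", ys[i], ran[i]);
    printf("final token=%d\n", token);
    return 0;
}
```

The same alias can be used in a spin loop, while (*vtoken != y) { }, subject to the warp caveat raised earlier in the thread.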
