I am trying to implement lock and unlock functions in CUDA.
The principle is simple. I put a global variable “lockVal” in device memory; its initial value is 0.
If one thread wants to enter the critical section, it has to read 0 from lockVal and change it to 1 atomically. Otherwise, it keeps spinning in the loop.
__device__ inline void lock(int *lockVal)
{
    int tmp0 = 0;
    int tmp1;
    int val = 1;
    // spin until we atomically swap lockVal from 0 to 1
    while ((tmp1 = atomicCAS(lockVal, tmp0, val)) != tmp0);
}
__device__ inline void unlock(int *lockVal)
{
    int tmp0 = 1;
    int tmp1;
    int val = 0;
    // swap lockVal from 1 back to 0 to release the lock
    while ((tmp1 = atomicCAS(lockVal, tmp0, val)) != tmp0);
}
But it does not work. It seems that no thread can enter the critical section.
Can anyone help me figure out the reason for the failure? It is driving me crazy.
Correct. When one thread grabs your lock, that thread is temporarily disabled while the remaining 31 warp threads keep cycling, trying to reach that same “success!” instruction so the warp can be reconverged. But that lone disabled thread is the one holding your lock, so the remaining 31 threads will never succeed. Boom, you shot yourself in the foot.
Locks are tricky even on the CPU… on the GPU they’re even more complicated. As tmurray will always tell us: don’t go there, don’t do it, it’s not worth it, you’ll hurt yourself.
I’m almost regretting helping you (read my warning again!), but you may have more luck with this kind of hack: inline the atomic operation into the lock acquisition and check, so the work and the lock release happen inside the same branch.
Something like:
{
    // assume *lock has been globally initialized to 1. Any thread which can "grab" this value owns the lock and must return it.
    bool needToDoWork = true;
    while (needToDoWork) {
        if (atomicExch(lock, 0)) {
            /* Lucky winner! I got the lock! */
            // do my work here......
            atomicExch(lock, 1);   // return the lock
            needToDoWork = false;
        }
    }
}
Caveat: I have not tried the above code… I am just showing you a form I used successfully. It’s still evil.
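To show how that fragment might sit in a real kernel, here is a minimal, untested sketch. The names (lockedIncrement, incrementKernel, lock, counter) and the choice of work (a protected counter increment) are mine, not something from your code, and the lock still has to be initialized to 1 from the host before the launch.

__device__ void lockedIncrement(int *lock, int *counter)
{
    bool needToDoWork = true;
    while (needToDoWork) {
        if (atomicExch(lock, 0)) {           // old value was 1, so this thread now owns the lock
            volatile int *c = counter;       // volatile to avoid stale cached reads of the protected value
            int old = *c;                    // critical section: plain (non-atomic) read-modify-write
            *c = old + 1;
            __threadfence();                 // make the write visible before handing the lock back
            atomicExch(lock, 1);             // return the lock
            needToDoWork = false;
        }
    }
}

__global__ void incrementKernel(int *lock, int *counter)
{
    lockedIncrement(lock, counter);
}

Note that every thread in the launch contends for the single lock here, so the whole grid serializes on it; in practice you would usually let only one thread per block (or per warp) take the lock and share the result with its neighbours.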