Whoa! I think that is because the “ELSE” part gets executed first during warp divergence, and hence the warp scheduler is busy with the loop and does NOT execute “*lock = 0”.
Let me leverage this concept and see if that clicks (thanks to PDan (iirc) for raising that interesting topic).
Great ideas usually occur only in restrooms. I think it is high time scientists research this phenomenon.
The variable result is thread-local, so declaring it volatile has no effect.
I suspect the race condition is due to multiple threads writing to the same location. The program definitely hangs because you reversed the while condition. Can you try this:
......
} while (result != 0xFF);
__syncthreads();
if (threadIdx.x == 0)
{
    atomicAdd(variable, 1);
    *lock = 0;
}
Yep, I tried that __syncthreads() thing and that too did NOT work.
Here is an alternative version that also does NOT work, using atomicExch():
__global__ void mykernel(int *lock, int *variable)
{
    volatile int result;
spin:
    result = atomicExch(lock, 0xFF);
    __syncthreads();
    if (result == 0)
    {
        *variable = *variable + 1;
        atomicExch(lock, 0);
    }
    __syncthreads();
    if (result != 0)
    {
        goto spin;
    }
}
This one works for 10 blocks with 512 threads – but it is DEAD slow!
But it does NOT work with 20 blocks… The kernel times out.
There is something seriously wrong, I guess…
At this point, I think the problem is related to “starvation” due to bad scheduling…
The atomicExch(…, 0) – the one that releases the lock – is NOT scheduled promptly because of incessant spinning from other threads…
I think one should NOT tight-spin… That's the moral. There should be a way to wait…
Oh yes… I remember the same problem while writing CPU code as well…
TRY first, and only if that succeeds, LOCK… That was the secret of performance… Otherwise cache lines were unnecessarily invalidated, resulting in cache-line ping-pong between multiple CPUs…
Maybe I should try doing that.
Nonetheless, I think NVIDIA's hardware must be smart enough to find the isolated transaction and schedule it fast (because only one thread uses things atomically, and that lone memory transaction should be honoured ahead of the atomic locking requests).
Of course your code is going to deadlock. One thread will succeed and progress past both __syncthreads(), while every other thread in the CTA is going to go back and hit a __syncthreads() that not all threads in the CTA hit. There is no vagary of scheduling here; your code is just wrong.
The locking code is absolutely wrong… You keep spinning while “result != 0xFF”, i.e. whenever it is unlocked, you spin… That is backwards.
You have to spin only when it is locked (result == 0xFF). atomicCAS returns the old value present in that memory location. Please check the documentation of “atomicCAS”.
Also, I don't understand why you have used the “threadIdx.x == 0” condition. Only the thread that acquires the lock must do the operation.
Here is the code that works for 10x512 perfectly. It takes 2.xxx seconds to complete, though. See the attachment for the required header file “PerformanceCounter.h”.
It does NOT work for 20x512 – it just times out. On a few occasions, I have seen it pass without hitting the watchdog; it took 12 seconds in those cases.
Here is a sample output.
"
CUDA Kernel Launch Error Value = 0 (0 means success)
Kernel ran for 12.644320 seconds
10240
"
Without the __syncthreads(), the code JUST hangs! This is because of the warp-divergence problem, I think – over which programmers have no control.
#include <stdio.h>
#include "PerformanceCounter.h"
__global__ void mykernel(int *lock, int *variable)
{
    volatile int result;
spin:
    result = atomicExch(lock, 0xFF);
    if (result == 0)
    {
        *variable = *variable + 1;
        atomicExch(lock, 0);
    }
    __syncthreads();
    if (result != 0)
    {
        goto spin;
    }
}
int main(void)
{
    int *dev_variable, init = 0;
    cudaError_t err;
    HPTimer profiler;

    SetThreadAffinityMask(GetCurrentThread(), 1);

    cudaMalloc((void **)&dev_variable, 2 * sizeof(int));
    init = 0;
    cudaMemcpy(dev_variable, &init, sizeof(int), cudaMemcpyHostToDevice);
    init = 0;
    cudaMemcpy(dev_variable + 1, &init, sizeof(int), cudaMemcpyHostToDevice);

    profiler.start();
    mykernel<<<10, 512>>>(dev_variable + 1, dev_variable);
    err = cudaThreadSynchronize();
    profiler.stop();

    printf("CUDA Kernel Launch Error Value = %d (0 means success)\n", err);
    if (err == 0)
    {
        printf("Kernel ran for %f seconds\n", profiler.TimeInSeconds());
    }
    cudaMemcpy(&init, dev_variable, sizeof(int), cudaMemcpyDeviceToHost);
    printf("%d\n", init);
    cudaFree(dev_variable);
    return 0;
}
If not all threads in a block hit a __syncthreads(), behavior is undefined. Your code may work. It probably won’t. It certainly isn’t guaranteed to work consistently.
// keep track of ownership by storing blockIdx.x
// -1 implies unlocked
__device__ volatile int lock = -1;

__global__ void spinlol()
{
    // hey what's up it's a spinlock
    while (atomicCAS(&lock, -1, blockIdx.x) != -1);
    /* do a bunch of stuff */
    // unlock
    atomicExch(&lock, -1);
}
Completely off the top of my head and totally untested because I am super lazy, but it should work fine. Add an if (threadIdx.x == 0) or warp-ID checks as necessary.
(edit: okay, to avoid pedantry, that should be volatile although since it’s only being accessed via atomics it won’t matter)
Your code times out. Find below the full .cu code. I removed “volatile” because there is no overload of “atomicCAS” that takes “volatile int *” as an argument (i.e. compilation fails).
Technically, there is no difference between your code and mine. It's all the same. Locking with “blockIdx” does NOT guarantee that all threads in the block will pass through.
However, if you had used “atomicCAS(&lock, blockIdx.x, blockIdx.x)”, it would guarantee that all (participating) threads of a single block pass through. However, the initial value of “lock” must then be some block ID – and that means you are assuming something about the scheduling order of blocks, which is unsupported.
Even if all threads of a block participate, I have to again use a lock in shared memory to synchronize among them. And they too will suffer the __syncthreads() problem…
But I think this is all too much info at the moment. I request you to test the code below on your hardware and reason out why it times out (I am sure it's because of the warp divergence over which the programmer has no control). And this is the point I have been making since yesterday night.
Thanks,
#include <stdio.h>

// keep track of ownership by storing blockIdx.x
// -1 implies unlocked
__device__ int lock = -1;
__device__ int variable = 0;

__global__ void spinlol()
{
    // hey what's up it's a spinlock
    while (atomicCAS(&lock, -1, blockIdx.x) != -1);
    variable++;
    // unlock
    atomicExch(&lock, -1);
}

int main()
{
    cudaError_t err;
    int host = 0;

    spinlol<<<10, 512>>>();
    err = cudaThreadSynchronize();
    printf("CUDA Launch Error = %d (0 means success)\n", err);

    cudaMemcpyFromSymbol(&host, "variable", sizeof(int), 0, cudaMemcpyDeviceToHost);
    printf("Variable = %d\n", host);
}
You don’t understand how atomicCAS works. One block will return -1, every other block will return the first block’s blockIdx.x. The first block has acquired the lock, the rest attempt to acquire the lock. Eventually, the first block will decide it’s finished, insert -1, and finish. One of the n - 1 spinlocking blocks will then get -1 from atomicCAS while every other remaining block will get that block’s blockIdx.x, repeat.
Of course this code doesn’t work in any reasonable length of time, and if you think it doesn’t eventually complete because it times out, congrats on solving the halting problem. Add an if (!threadIdx.x && !threadIdx.y) conditional around the lock/unlock if you want it to execute cleanly. Whatever scheduling policy you think you’ve discovered is irrelevant in this case. Spinlocks are not hard.
(your test case is wrong anyway without inserting a threadfence after the variable++)
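Putting those two suggestions together – lock/unlock done by one thread per block only, plus a __threadfence() after the increment – the kernel might look like this (an untested sketch along the lines described above, not a verified fix):

```cuda
__device__ int lock = -1;
__device__ int variable = 0;

__global__ void spinlol()
{
    // Only one thread per CTA contends for the global lock, so the
    // remaining threads never diverge around the spin loop at all.
    if (!threadIdx.x && !threadIdx.y)
    {
        while (atomicCAS(&lock, -1, blockIdx.x) != -1);
        variable++;
        __threadfence();        // publish the update before releasing
        atomicExch(&lock, -1);  // unlock
    }
}
```

With a <<<10, 512>>> launch this would count blocks, so variable should end up at 10, not 5120.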
Atomics work at the thread level, not the block level. When one thread acquires the lock, the value out there is no longer -1… The other threads of the block fail the compare and keep spinning. So only one thread in that block passes through while the others spin.
Well, that's what I get from the manual. Moreover, as an insider, you might know more. Can you confirm?
Thanks for the heads-up. It should affect only the value of variable, though, and not the time-out problem.
But I don't see any such reference in the manual. My code actually wants to implement a per-thread mutex. So ultimately I want the counter to be gridDim.x * blockDim.x and not just gridDim.x.
For example, the atomicAdd() function can be used to get gridDim.x * blockDim.x. It works at a per-thread level. So what is wrong if I try to implement a spin-lock at the per-thread level?
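For contrast, the atomicAdd() approach mentioned above really is per-thread, with no spinning at all; something like this (a minimal sketch, assuming the host initializes the hypothetical counter pointer to 0 before launch):

```cuda
__global__ void count(int *counter)
{
    // Each of the gridDim.x * blockDim.x threads increments once;
    // the hardware serializes the updates, so no lock is needed.
    atomicAdd(counter, 1);
}
```

The difference is that atomicAdd is a single hardware operation per thread, whereas a spinlock makes every thread loop until it wins – which is where the contention explodes.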
What I wrote earlier is a per-CTA mutex and is way too slow to be used as a per-thread mutex. It's not necessarily wrong, but it obviously is unbelievably slow. A better way is to combine an inter-CTA mutex with an intra-CTA mutex. I do that here to count warps; extend it to whatever you like.
#include <stdio.h>
#include <stdlib.h>

__device__ volatile int lock = -1;
__device__ volatile int counter = 0;

__global__ void spinlol()
{
    __shared__ int intraCTAlock;

    if (!threadIdx.x && !threadIdx.y)
        intraCTAlock = -1;
    __syncthreads();

    if (!threadIdx.x && !threadIdx.y)
        while (atomicCAS((int*)&lock, -1, blockIdx.x) != -1);
    __syncthreads();

    if (threadIdx.x % 32 == 0)
    {
        while (atomicCAS(&intraCTAlock, -1, 12) != -1);
        counter++;
        __threadfence();
        atomicExch(&intraCTAlock, -1);
    }
    __syncthreads();

    if (!threadIdx.x && !threadIdx.y)
        atomicExch((int*)&lock, -1);
}

int main(int argc, char** argv)
{
    int hostcounter = -1;
    spinlol<<<60, 512>>>();
    cudaThreadSynchronize();
    printf("err = %s\n", cudaGetErrorString(cudaGetLastError()));
    cudaMemcpyFromSymbol(&hostcounter, "counter", sizeof(int), 0, cudaMemcpyDeviceToHost);
    printf("counter = %d\n", hostcounter);
}
None of these cases have taken me long to write, and I’m not using any secret knowledge that can only be obtained if you pass through the ring of fire in NVIDIA HQ. Basic mutexes aren’t hard, just potentially very slow. If I ever thought I needed a per-thread mutex, I would probably decide I was doing something very wrong.