Dear All,
I am trying to implement a FIFO list for each thread-in-warp position (lane). For simplicity, I am running just 1 block with 256 threads, so that 256/32 = 8 threads (one per warp) access the same FIFO list; also for simplicity, I only implement the push function here.
As all 8 warps are multiplexed, I need one atomic lock per lane to protect each FIFO list's position index.
Note: I use pyCUDA, meaning the kernel code is compiled with nvcc (in my case with arch="compute_11" and code="sm_11").
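To make the layout concrete, this is the per-lane structure the kernel further down works on (just a sketch for reference, the locks themselves live in global memory and are passed in from the host):

__shared__ int storage[32][8];  // one FIFO list of max_length 8 per lane
__shared__ int pos[32];         // write position per list, -1 means empty
// plus one int lock per list, passed in from the host as int *locks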
The following code should therefore result in 32 full FIFO lists, each filled with 5s.
However, the resultant counter-array looks like this:
(thread in warp -->)
5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5
5 5 5 5 5 -1 5 5 5 5 5 5 5 -1 5 5 5 5 5 5 5 -1 5 5 5 5 5 5 5 -1 5 5
5 5 5 5 5 5 -1 -1 5 5 5 5 5 5 -1 -1 5 5 5 5 5 -1 -1 5 5 5 5 5 5 -1 -1 5
-1 -1 -1 5 5 5 5 5 -1 -1 -1 5 5 5 5 5 -1 -1 -1 5 5 5 5 5 -1 -1 -1 5 5 5 5 5
-1 5 5 5 5 -1 -1 -1 -1 5 5 5 5 -1 -1 -1 -1 5 5 5 5 -1 -1 -1 -1 5 5 5 5 -1 -1 -1
-1 -1 5 5 5 -1 -1 -1 -1 -1 5 5 5 -1 -1 -1 -1 -1 5 5 5 -1 -1 -1 -1 -1 5 5 5 -1 -1 -1
-1 -1 -1 -1 5 5 -1 -1 -1 -1 -1 -1 5 5 -1 -1 -1 -1 -1 -1 5 5 -1 -1 -1 -1 -1 -1 5 5 -1 -1
-1 -1 -1 -1 -1 -1 -1 5 -1 -1 -1 -1 -1 -1 -1 5 5 -1 -1 -1 -1 -1 -1 -1 5 -1 -1 -1 -1 -1 -1 -1
EDIT: obviously, this output has some pattern: in the first warp, all threads are able to push to the lists; in the second warp, one of the threads is unable to do so, in the third two, and so on. I don't know why that is, though…
__device__ bool x_push(int entry, int &pos, int max_length, int *stor)
{
    pos++;
    if (pos < max_length) {
        stor[pos] = entry;
        return true;
    } else {
        pos--;
        return false;
    }
}
extern "C" __global__ void multiply_them(int *counter, int *locks)
{
__shared__ int storage[32][8];
__shared__ int pos[32];
unsigned int BankID = threadIdx.x % 32;
unsigned int BankIDt = (threadIdx.x - threadIdx.x % 32) / 32;
storage[BankID][BankIDt] = -1;
pos[BankID] = -1;
__syncthreads();
bool buf = 0, needlock = true;
for(int i=0; i< 32; i++){
if(threadIdx.x%32 == i){
while(needlock){
if(atomicCAS(&locks[i], 0, 1) != 0){
buf = x_push(5, pos[i], 8, storage[i]);
if( buf ){
counter[i + 32*BankIDt] = storage[i][BankIDt];
} else {
counter[i + 32*BankIDt] = 7;
}
atomicExch(&locks[i], 0);
needlock = false;
}
}
}
}
}
The exact output pattern actually seems to be slightly non-deterministic, which seems to point to problems with the locks.
I have already tried removing the cycling through each bank (as we have 32 independent locks), but that did not change the problems with the output.
Using a while-loop directly to check the status of the mutex did not bring any benefits either.
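To be concrete, by "using a while-loop directly" I mean a critical section roughly along these lines, replacing the needlock construction in the kernel above (only a sketch, not the exact code I ran):

for (int i = 0; i < 32; i++) {
    if (threadIdx.x % 32 == i) {
        while (atomicCAS(&locks[i], 0, 1) != 0);   // spin until the lock is acquired
        buf = x_push(5, pos[i], 8, storage[i]);
        counter[i + 32*BankIDt] = buf ? storage[i][BankIDt] : 7;
        atomicExch(&locks[i], 0);                  // release the lock
    }
}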
The push function itself should be fine: I tested it on the CPU using boost threads and a similar lock pattern, and it worked flawlessly.
For example, the following code successfully returns an array of 8s:
__device__ void lock(int *mutex)
{
    while (atomicCAS(mutex, 0, 1) != 0);   // spin until the mutex is acquired
}

__device__ void unlock(int *mutex)
{
    atomicExch(mutex, 0);                  // release the mutex
}

extern "C" __global__ void multiply_them(int *counter, int *locks)
{
    unsigned int BankID = threadIdx.x % 32;   // lane within the warp

    lock(&locks[BankID]);
    counter[BankID]++;                        // 8 threads (one per warp) increment each slot once
    unlock(&locks[BankID]);
}
Output: [8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8]
This means the latter example does have a working lock system, demonstrating that one lock per lane does work (as it should).
My experience is that the output becomes non-deterministic as soon as I insert anything more advanced than "counter[BankID]++;" inside the protected region.
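To give an idea of what I mean by "more advanced": already a small extension of the critical section like the sketch below is the kind of thing that misbehaves for me (an extra shared-memory update plus a dependent write, reusing the lock()/unlock() helpers from above; the kernel name and body are only illustrative):

extern "C" __global__ void multiply_them_ext(int *counter, int *locks)   // illustrative only
{
    __shared__ int pos[32];
    unsigned int BankID = threadIdx.x % 32;   // lane within the warp

    pos[BankID] = -1;
    __syncthreads();

    lock(&locks[BankID]);
    pos[BankID]++;                    // extra shared-memory update inside the lock
    counter[BankID] = pos[BankID];    // dependent write based on the shared value
    unlock(&locks[BankID]);
}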
I am currently using a GeForce 8600M GT; however, I did test the snippet above with arch/code = 20 on a Tesla C2050 and did not get different results.
I'd really appreciate any help with this issue, as I don't really know how to proceed. Of course, I could try to keep the pos variables in global memory and just protect them using atomicAdd/atomicInc (or even protect them in shared memory on the Tesla card), but I think the above should work somehow and I haven't yet understood something about the details of atomics in CUDA.
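For reference, the atomicAdd fallback I have in mind would be roughly the following (only a sketch; the kernel name push_atomic and the global array pos_g are placeholders, and pos_g would need to be zero-initialised on the host):

extern "C" __global__ void push_atomic(int *counter, int *pos_g)   // sketch of the fallback idea
{
    unsigned int BankID = threadIdx.x % 32;       // lane within the warp
    int slot = atomicAdd(&pos_g[BankID], 1);      // reserve the next slot of this lane's list

    if (slot < 8) {                               // max_length = 8, as above
        counter[BankID + 32*slot] = 5;            // push the entry
    } else {
        atomicSub(&pos_g[BankID], 1);             // list is full, undo the reservation
    }
}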
Thanks and Regards
Christian
EDIT: I am adding the initialising Python (host) code:
import numpy
import pycuda.driver as drv
import pycuda.autoinit
from pycuda.compiler import SourceModule

# kernel_source holds the CUDA code shown above
mod = SourceModule(kernel_source, arch="compute_11", code="sm_11")

multiply_them = mod.get_function("multiply_them")

counter = numpy.zeros(256).astype(numpy.int32)
locks = numpy.zeros(32).astype(numpy.int32)

locks_h = drv.mem_alloc(locks.nbytes)
counter_h = drv.mem_alloc(counter.nbytes)
drv.memcpy_htod(locks_h, locks)
drv.memcpy_htod(counter_h, counter)

multiply_them(counter_h, locks_h, block=(256,1,1), grid=(1,1))

drv.memcpy_dtoh(counter, counter_h)

for idx in range(0, 8):
    for id in range(0, 32):
        print counter[idx*32 + id],
    print "\n"