Implementing a FIFO-list in shared memory using atomics - memory inconsistencies Each thread in a wa

chris90 · February 22, 2012, 1:01pm

Dear All,

I am trying to implement a FIFO-list for each thread in a warp. For simplicity, I am running just 1 block with 256 threads, so that 256/32 = 8 threads should access the same FIFO-list (for simplicity, I only implement the push function here) .

As all 8 warps are being multiplexed, we need to use one atomic lock per thread in the warp in order to protect the FIFO-list position index.

Note: I use pyCUDA, meaning the kernel code is compiled with nvcc (and, in my case, arch=“compute_11” and code=“sm_11”).

However, the following code should result in 32 full FIFO-lists, filled with 5s.

However, the resultant counter-array looks like this:

(thread in warp -->)

5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5

5 5 5 5 5 -1 5 5 5 5 5 5 5 -1 5 5 5 5 5 5 5 -1 5 5 5 5 5 5 5 -1 5 5

5 5 5 5 5 5 -1 -1 5 5 5 5 5 5 -1 -1 5 5 5 5 5 -1 -1 5 5 5 5 5 5 -1 -1 5

-1 -1 -1 5 5 5 5 5 -1 -1 -1 5 5 5 5 5 -1 -1 -1 5 5 5 5 5 -1 -1 -1 5 5 5 5 5

-1 5 5 5 5 -1 -1 -1 -1 5 5 5 5 -1 -1 -1 -1 5 5 5 5 -1 -1 -1 -1 5 5 5 5 -1 -1 -1

-1 -1 5 5 5 -1 -1 -1 -1 -1 5 5 5 -1 -1 -1 -1 -1 5 5 5 -1 -1 -1 -1 -1 5 5 5 -1 -1 -1

-1 -1 -1 -1 5 5 -1 -1 -1 -1 -1 -1 5 5 -1 -1 -1 -1 -1 -1 5 5 -1 -1 -1 -1 -1 -1 5 5 -1 -1

-1 -1 -1 -1 -1 -1 -1 5 -1 -1 -1 -1 -1 -1 -1 5 5 -1 -1 -1 -1 -1 -1 -1 5 -1 -1 -1 -1 -1 -1 -1

EDIT: obviously, this output has some pattern: in the first warp, all threads are able to push to the lists.

in the second warp, 4one of the threads is unable to do so, in the third 4two and so on. I don’t know why that is though…

__device__ bool x_push(int entry, int &pos, int max_length, int* stor){

    pos++;

    if(pos < max_length){

        stor[pos] = entry;

        return true;

    } else {

        pos--;

        return false;

    }

}

extern "C" __global__ void multiply_them(int *counter, int *locks)

{

    __shared__ int storage[32][8];

    __shared__ int pos[32];

unsigned int BankID = threadIdx.x % 32;

    unsigned int BankIDt = (threadIdx.x - threadIdx.x % 32) / 32;

storage[BankID][BankIDt] = -1;

    pos[BankID] = -1;

    __syncthreads();

bool buf = 0, needlock = true;

    for(int i=0; i< 32; i++){

        if(threadIdx.x%32 == i){

            while(needlock){

                if(atomicCAS(&locks[i], 0, 1) != 0){

                    buf = x_push(5, pos[i], 8, storage[i]);

                    if( buf ){

                        counter[i + 32*BankIDt] = storage[i][BankIDt];

                    } else {

                        counter[i + 32*BankIDt] = 7;

                    }

                    atomicExch(&locks[i], 0);

                    needlock = false;

                }

            }

        }

    }

}

The exact output pattern actually seems to be slightly non-deterministic, so that seems to point to problems with the locks.

I have already tried to remove the cycling through each bank (as we have 32 independent locks), but that did not change the problems with the output.

Also, using a while-loop directly to check the status of the mutex did not bring any benefits.

Also, the push-function should work fine, I tested it on CPU using boost-threads and a similar lock-pattern and it worked flawlessly.

For example, the following code successfully returns an array of 8s:

__device__ void lock(int *mutex){

    while(atomicCAS( mutex, 0, 1) != 0);

}

__device__ void unlock(int *mutex){

    atomicExch(mutex, 0);

}

extern "C" __global__ void multiply_them(int *counter, int *locks)

{

    unsigned int BankID = threadIdx.x % 32;

    lock(&locks[BankID]);

    counter[BankID]++;

    unlock(&locks[BankID]);

}

Output: [8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8]

This means, in the latter example we do have a working lock system, demonstrating that one lock per thread in warp does work (as it should).

My experience is that the output starts to be non-deterministic as soon as I insert anything more advanced than “counter[BankID]++;” within the restricted area.

I am currently using a Geforce8600M GT, however, did test the upper snippet for arch,code = 20 on a Tesla C2050 and did not get different results.

I’d really appreciate any help with the above issue as I don’t really know how to proceed - of course I could try to have the pos variables in global memory and just protect them using atomicAdd/atomicInc (or even protect them in shared memory on the Tesla card), but I guess the above should work somehow and I havn’t understood sth. yet about the details of atomics in CUDA.

Thanks and Regards

Christian

EDIT: I add the initialising python code (host):

multiply_them = mod.get_function("multiply_them")

counter = numpy.zeros(256).astype(numpy.int32)

locks = numpy.zeros(32).astype(numpy.int32)

locks_h = drv.mem_alloc(locks.nbytes)

counter_h = drv.mem_alloc(counter.nbytes)

drv.memcpy_htod(locks_h, locks)

drv.memcpy_htod(counter_h, counter)

multiply_them(counter_h, locks_h,

        block=(256,1,1), grid=(1,1))

drv.memcpy_dtoh(counter, counter_h)

for idx in range(0,8):

    for id in range(0,32):

        print counter[idx*32 + id],

    print "\n"

tera · February 22, 2012, 2:44pm

counter[i + 32*BankIDt] = storage[i][pos[i]];

?

There is no guarantee that the warps execute exactly round-robin.

Two performance-related comments:

[*]It would be better to organize the storage as

__shared__ int storage[8][32];

to avoid bank conflicts.

[*]Why use locks at all? FIFOs are much simpler to implement using atomicInc() / atomicDec().

chris90 · February 22, 2012, 11:21pm

counter[i + 32*BankIDt] = storage[i][pos[i]];
?

There is no guarantee that the warps execute exactly round-robin.

Two performance-related comments:
[*]It would be better to organize the storage as
__shared__ int storage[8][32];
to avoid bank conflicts.

[*]Why use locks at all? FIFOs are much simpler to implement using atomicInc() / atomicDec().

First of all, thanks for your answer!

I now use atomicInc (on position index in global memory) instead of a lock, and I think (not sure) that I address shared memory without bank conflicts.

__device__ bool x_push(int entry, unsigned int *pos, int max_length, int* stor){

    unsigned int BankID = threadIdx.x % 32;

    unsigned int posbuf = atomicInc(pos, 8);

    if(posbuf < max_length){

        stor[BankID + posbuf*32] = entry;

        return true;

    } else {

        atomicDec(pos, -1);

        return false;

    }

 }

extern "C" __global__ void multiply_them(int *counter, int *locks, unsigned int *pos)

{

    __shared__ int storage[8*32];

unsigned int BankID = threadIdx.x % 32;

    unsigned int BankIDt = (threadIdx.x - threadIdx.x % 32) / 32;

//storage[BankID + 32*BankIDt] = -1;

    storage[255] = 5;

    pos[BankID] = -1;

bool buf = 0;

buf = x_push(5, &pos[BankID], 16, storage);

__syncthreads();

    counter[BankID + 32*BankIDt] = storage[BankID + BankIDt*32];

}

However, the output from the code above is still corrupt, i.e. I get sth. like:

5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5

0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 5

(which is obviously not all fives).

Do you know what is going on here?

Thanks a lot,

Regards,

Christian

tera · February 22, 2012, 11:53pm

Each thread needs to use the value returned by atomicInc(), not the content of *pos which may have been changed again by other threads since the atomicInc():

__device__ bool x_push(int entry, unsigned int *pos_p, int max_length, int* stor){

    unsigned int BankID = threadIdx.x % 32;

    unsigned int pos = atomicAdd(pos_p, 1);

    if(pos < max_length){

        stor[BankID + pos*32] = entry;

        return true;

    } else {

        atomicSub(pos_p, 1);

        return false;

    }

}

requiring to change the initialization to

pos[BankID] = 0;

([font=“Courier New”]pos[/font] now holds the position of the first free element, not of the last used one.)

Also note the tricky interaction if the storage overflows for multiple threads. Above code would be appropriate for a LIFO. In a FIFO, overrun or underrun detection is difficult and it would be best if you could avoid these situations altogether.

chris90 · February 23, 2012, 12:09am

Thanks, your LIFO works like a charm! :)

That might actually be feasible for my application.

I still do not really understand why my FIFO does not work, though. I think my last posting does use the value returned by atomicInc (your posting was probably written before my last edit).

(Is this correct :) ?)

Is there a straight-forward way to get my FIFO code to work? Do you have any literature/links about the issue with FIFOs and multithreading on CUDA that you mention (just so I can wise up a bit)?

Thank you so much, I’d never have been able to get my head around the LIFO.

Christian

chris90 · February 23, 2012, 12:15pm

Ok, so I am trying to implement a pop for a LIFO list:

__device__ bool x_push(int entry, unsigned int *pos_p, int max_length, int* stor){

    unsigned int BankID = threadIdx.x % 32;

    unsigned int pos = atomicAdd(pos_p, 1);

    if(pos < max_length ){

        stor[BankID + pos*32] = entry;

        return true;

    } else {

        atomicSub(pos_p, 1);

        return false;

    }

}

__device__ int x_pop(unsigned int *pos_p, int* stor){

   unsigned int BankID = threadIdx.x % 32;

   unsigned int pos = atomicDec(pos_p, 0);

   if(pos != 0){

       return stor[BankID + (pos-1)*32]; 

   } else {

       return 255;

   }

}

extern "C" __global__ void multiply_them(int *counter, int *locks, unsigned int *pos)

{

    __shared__ int storage[8*32];

unsigned int BankID = threadIdx.x % 32;

    unsigned int BankIDt = (threadIdx.x - threadIdx.x % 32) / 32;

storage[255] = 5;

    pos[BankID] = 0;

bool buf = 0;

buf = x_push(5, &pos[BankID], 8, storage);

__syncthreads();

    counter[BankID + 32*BankIDt] = x_pop(&pos[BankID], storage);

}

Result:

5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5

255 255 255 255 255 255 255 255 255 255 255 255 255 255 255 255 255 255 255 255 255 255 255 255 255 255 255 255 255 255 255 255

i.e. somehow the atomicDec is being ignored?

So what is happening now?

Thanks and Regards

Christian

tera · February 24, 2012, 10:21am

Sorry I don’t have time currently to look closely, but there are two things: Convert [font=“Courier New”]pos[/font] to int and change the check on pop to [font=“Courier New”]pos<=0[/font] to guard against concurrent underflow, and put a [font=“Courier New”]__syncthreads()[/font] between push and pop (and also between pop and push) operations, to make sure the datum has been written/read.

tera · February 24, 2012, 9:18pm

Looking a bit closer now I notice the [font=“Courier New”]atomicDec(pos_p, 0)[/font], which seems to be a clever optimization but actually behaves exactly like [font=“Courier New”]*pos_p=0[/font]. Convert [font=“Courier New”]pos[/font] to int instead as written above and handle the underflow case analogous to the overflow on push.

Looking further down in the code you already have the [font=“Courier New”]__syncthreads()[/font], so my second point was moot.