Serialise access to database: several threads wish to add the same datum

I have a case where threads try to add to a common area (in shared memory).
If the datum is already present they should append to it rather than
create a new instance. When several threads simultaneously have the same (new) datum,
all their checks to see if it is already present report it as missing,
and so they all try to add it :-(
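
In outline the racy pattern is this (only a sketch to show the problem; table, find_or_create and the layout are placeholders, not my real code):

//Sketch of the problem only: names and layout are placeholders.
//table[0] holds the number of entries, table[1..] the data already stored.
__device__ int find_or_create(const int datum) {
  extern __shared__ int table[];      //the common area in shared memory
  int n = table[0];
  for (int i = 0; i < n; ++i)         //1. is the datum already there?
    if (table[1 + i] == datum)
      return i;                       //   yes: the caller appends to entry i
  //2. no: create a new entry.
  //   Several threads with the same new datum can all pass the check above
  //   before any of them stores it, so the datum gets stored more than once.
  int slot = table[0]++;              //this increment also races
  table[1 + slot] = datum;
  return slot;
}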

At present I am getting round this by forcing all updates through thread zero.
I am not sure my code is working, and it seems an ugly hack.

Is there a better way?
Has anyone got a working example?

As always any help would be most welcome.
Bill

Take a look at the histogram example in the CUDA SDK documentation. That should answer your queries. :)

Thanks for the pointer to Podlozhnyuk’s PDF.

As I understand it, Victor assembles partial histograms in shared memory, either per thread or per warp of 32 threads. He then uses additional kernels or atomic operations to assemble these into a single histogram in global memory. A lot of what is described concerns shared memory and working around the lack of shared-memory atomicAdd prior to compute capability 1.2. This includes packing 8-bit counts and a 5-bit thread tag into 32 bits. However, he also exploits the fact that the hardware serialises simultaneous writes to the same shared memory location, and uses this to emulate atomic operations.
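
From memory, the per-warp emulation looks roughly like this (my paraphrase, not a verbatim copy; s_WarpHist is the warp's sub-histogram and threadTag carries the thread's index within the warp in the top 5 bits):

__device__ void addData256(volatile unsigned int* s_WarpHist, unsigned int data, unsigned int threadTag) {
  unsigned int count;
  do {
    count = s_WarpHist[data] & 0x07FFFFFFU; //strip the previous 5-bit tag
    count = threadTag | (count + 1);        //increment and re-tag with our thread
    s_WarpHist[data] = count;               //hardware serialises colliding stores
  } while (s_WarpHist[data] != count);      //our store lost the collision? try again
}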

(Does the hardware do the same for 64-bit access?)

I am sorry my description was poor. The data are 32-bit unsigned ints. I phrased my original question as a database update question as I thought that would be more familiar to more people.

I expected everyone else was already wise to this, but on first reading I did not spot the volatile keyword in listing 3. (I expect I shall add volatile to pretty much all shared memory from now on…)

Thanks again

Bill

The following seems to be OK. It can save data for up to 32 threads and spot cases where threads wish to save the same data (it is stored only once).

__device__ int save(const int data) {
  volatile int* s_ndata = &shared_array[0];
  volatile int* s_data  = &shared_array[1];
  int id;
  do { //assume operates in a warp
    id = s_ndata[0];
    //at least one thread will succeed
    s_ndata[0] = id + 1;
    s_data[id] = data;  //one thread in warp will update s_data[id]
  } while (s_data[id] != data); //s_data must be volatile
  return id;
}
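
(On newer hardware, compute capability 1.2 and up, which has atomics on shared memory, I think the same effect could be had with atomicCAS, and it would not be limited to a single warp. An untested sketch; EMPTY and the shared_array declaration are my guesses at the missing pieces, and shared_array[1..] must be pre-filled with EMPTY before the first call:)

#define EMPTY (-1) //sentinel value that must never occur as real data

//Untested sketch: needs compute capability 1.2+ (atomics on shared memory)
//and shared_array[1..] pre-filled with EMPTY, e.g. by the block plus __syncthreads().
__device__ int save_cas(const int data) {
  extern __shared__ int shared_array[];   //[0] = count, [1..] = saved data
  int* s_data = &shared_array[1];
  for (int id = 0; ; ++id) {
    //try to claim slot id; atomicCAS returns whatever was there before
    int old = atomicCAS(&s_data[id], EMPTY, data);
    if (old == EMPTY) {
      atomicAdd(&shared_array[0], 1);     //we claimed a new slot: bump the count
      return id;
    }
    if (old == data)
      return id;                          //another thread already stored this datum
    //slot belongs to a different datum: probe the next slot
  }
}

Because a slot never changes once it is non-EMPTY, two threads carrying the same datum should always end up agreeing on the same slot.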

Any comments, help, or suggestions on how to extend this to more than 32 threads would be most welcome.

Thank you

Bill

I think this should work:

__device__ int save(const int data) {
  int* s_ndata = &shared_array[0];
  int* s_data  = &shared_array[1];
  __shared__ int more_data;
  int id;
  do { //assume operates in a warp
    id = s_ndata[0];
    __syncthreads();
    //at least one thread will succeed
    s_ndata[0] = id + 1;
    s_data[id] = data;  //one thread in warp will update s_data[id]
    more_data = 0;
    __syncthreads();
    if (s_data[id] != data)
      more_data = 1;
    __syncthreads();
  } while (more_data);
  return id;
}

My current approach is to follow the SDK histogram example and have separate data structures per warp. There is sufficient shared memory to allow 8 of them, which takes the code from 32 threads to 256 threads. When all warps complete, a second stage in the same kernel uses 32 threads to combine the 8 structures into one (roughly as sketched below).
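
(A sketch only: WARPS, warp_count, warp_data and the flat layout are shorthand for my real structures, and save() stands for an insert routine on the combined structure that stores each datum once, e.g. the atomicCAS sketch earlier.)

#define WARPS 8 //per-warp structures built in the first stage

__device__ int save(const int data); //insert routine on the combined structure

//Second stage: run by threads 0..31 only, after a __syncthreads()
//that marks the end of the first stage.
__device__ void combine(const int* warp_count, //entries per warp, length WARPS
                        const int* warp_data,  //warp w's entries start at warp_data + w*stride
                        int stride) {
  for (int w = 0; w < WARPS; ++w) {
    //the 32 threads take a strided share of warp w's entries
    for (int i = threadIdx.x; i < warp_count[w]; i += 32)
      save(warp_data[w * stride + i]); //the same datum from different warps ends up in one entry
  }
}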

Thanks again

Bill