problem with atomic operations on global memory implement mutual exclusion with atomicop

I have been working for over a month to port a code of seismic waves propagation on CUDA.

This code uses the spectral elements method : the calculations are made element by element, then the contributions are sumed in global arrays. One ‘global’ point (of a global array) may correspond to several points of various elements (ie a point at the surface of an element is also part of an other neighbor element).

The calculations are done on arrays which correspond to elements points, then we sum the contributions of all the elements in a global array with an indirection array :

float local_array[num_elem][i][j][k]

float global_array[num_glob]

int ibool[num_elem][i][j][k] = num_glob

To avoid expensives transfers host-> device, device-> host, we wanted to make the sum of contributions on the device, but as many threads can read and write simultaneously at the same place in the global arrays, we need a mutual exclusion mechanism (or atomic operation on float).

Early versions made this update on the host, because the graphic card that we had (8800 GTX) did not allow atomic operations. We have to buy a 8800 GT 1GB PCIe 2.0 to be able to do everything on the device thanks to the atomic operations.

So I implemented a mutual exclusion like this (atomic operations are available only on integers):

device void get_mutex(int* mutex)


while(atomicCAS(mutex,0,1)) {}


device void rel_mutex(int* mutex)


*mutex = 0;


global void kernel_2(…)


// many calculations on one element

iglob = d_ibool[bx*NGLL3_ALIGN+tx]; // d_ibool = tableau d’indirection local->global

// sum of the contributions in global arrays

// zone of mutual exclusion


        d_accel_x[iglob] -= fac1*tempx1l + fac2*tempx2l + fac3*tempx3l

        d_accel_y[iglob] -= fac1*tempy1l + fac2*tempy2l + fac3*tempy3l

        d_accel_z[iglob] -= fac1*tempz1l + fac2*tempz2l + fac3*tempz3l



The code works fine using emulation (but mutual exclusion is not relevant in sequential). In contrast to the normal mode, all calls CUDA subsequent to the initial kernel call throw this error:

The launch timed out and was terminated

Then the machine freeze completely! <img src=‘http://hqnveipbwb20/public/style_emoticons/<#EMO_DIR#>/crying.gif’ class=‘bbc_emoticon’ alt=’:’(’ />

This problem does not appear when I commented in the kernel 2 calls get_mutex() and rel_mutex(). The code is running normally, although it calculates anything, of course …

I assumed that this problem could be due to the latency of global access memory (d_mutex_accel is declared as pointer then allocated in global memory from the host : cudaMalloc((void**) &d_mutex_accel, NGLOB*sizeof(int))) … But in fact I do’nt know …

Have anyone encountered similar problems on codes of finite elements? What does it Could it be? Is there a solution “on the device” for such problems?

I also have another question: I have been unable to find information on the distribution of the card memory among texture and global memory. Could someone tell me what are the different max memories sizes ?

Thank you.

Using atomic operations is not the best way to achieve this. Parallel reduction may be much better. Check reduction sample in SDK.