thread writing into global memory (thread sync)

Hi all,

my problem is this one.

I have a matrix, say U, initialized to zero, that has to be filled in a particular way. If the element U(i,j) should be set to a value different from zero, then also the block sized (2K+1)x(2K+1) and centered on the considered element has to be set to values different from zero (K is a parameter that can be chosen). Accordingly if two elements of the matrix, that should be set, are distant less or equal to 2K+1 in coloumns or in rows, the relative blocks overlaps and in the overlapping positions the values should be the sum of the corresponding positions in the two blocks.

Due to the nature of the algorithm I chose this implementation. I defined a block of threads sized (2K+1)x(2K+1) and a properly sized grid of block that allows to cover all elements of the matrix that should be set. Each block of threads is responsible to evaluate the (2K+1)x(2K+1) sub-metrix and to put it into the final matrix, so each thread computes a value of the sub-matrix. I exploited a shared memory area sized (2K+1) x (2K+1) and after all the evaluation of the block of threads the sub-matrix is copied back to the global memory into the original matrix U.

But what happens when overlaps of sub-matrix occur? In particular, if two threads of two different block of threads should write into the same global memory area, say U(h,k), the last writing overwrites the other. For this reason I used a workaround solution, the thread READS the U(h,k) and immediately after (i.e. next instruction in the code) adds the computed value.

Nevertheless, I come now to another doubt. I’m wondering if the following situation could happen:

assume that U(h,k)=0;

TIME 2, THREAD A BLOCK X WRITES new U(h,k)= U(h,k) RED + alpha = 0+ alpha=alpha;
TIME 3, THREAD B BLOCK Y WRITES new U(h,k) = U(h,k) RED + beta = 0+ beta=beta;

The final correct value of U(h,k) should be: alpha+beta, but the result is U(h,k)=beta;

Is it possible that the above described sequence of events could present? Is there a way to sync the threads of different blocks in order to be sure that if a thread acces to U(h,k) to READ the same thread MUST write the new value before other threads of other blocks READ U(h,k)?




Absolutely this kind of memory race can occur. You’re right to worry about it.

The use of a double-check read after writing is not a solution, because again the ordering is not guaranteed… what if two threads read, one writes, that same thread reads again and the double check seems to work, but afterwards the second thread finally writes? You lose.

There’s an easy hardware solution for this problem… atomics. But the performance of atomics is usually unpleasant. However, it may be your application is insensitive to memory speeds. Since atomics are a one-line change, it’s easy enough to check if its performance hurts you. It likely will, but checking is so easy it’s worthwhile.

Other solutions can involve gathering instead of scattering… can you make your algorithm reverse itself so each element finds the values it needs to add to itself? If so, that’s not subject to race conditions.

Thank you for your quick reply.

I was wondering if I use a CUBLAS function such as cublasSaxpy(int n, float alpha, float y, floatx ) that returns y=y+alpha*x, do you think that I can avoid the issue of consecutive readings followed by consecutive writings? I mean if each thread execute the CUBLAS mentioned function the coerence of the data in the global memory is ensured?

I guess no…probably, the issues can occur the same. But, I ask the same.


the instruction cublasSaxpy completely before