Read a value in global memory which was written by another thread block

Hi all,

GPU: GTX 480
CUDA 6

In my kernel code, thread 0 of a thread block reads a value written by a thread of another thread block. Basically, each thread block waits until that value is updated by the previous thread block. The dependence among thread blocks is pairwise: block 1 waits for block 0, block 2 waits for block 1, and so on (0<-1, 1<-2, 2<-3, …).

To avoid deadlocks, I’m launching fewer thread blocks than there are SMs, so that every block can be resident at once. I also skip the L1 cache by using the compiler parameter “-dlcm=cg”, so I should not read stale values from L1.

But my code still deadlocks. Can someone point out what is wrong with this approach?

PS: I checked the generated PTX code; none of the loads carry the .cg suffix,
i.e. the PTX has ld.global.u32 instead of ld.global.cg.u32.
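
By the way, one thing I could try is forcing the .cg load at the source level with inline PTX, independent of any compiler flags. This is only a sketch: load_cg is a helper name I made up, and the “l” pointer constraint assumes a 64-bit build:

__device__ unsigned int load_cg(const unsigned int *p)
{
  unsigned int v;
  // ld.global.cg caches only in L2 and bypasses L1 on Fermi
  asm volatile("ld.global.cg.u32 %0, [%1];" : "=r"(v) : "l"(p) : "memory");
  return v;
}

The waiting thread would then call load_cg(&flag) in its spin loop instead of reading the flag directly.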

Let me know if you need more information.

Thanks in advance,
Waruna

Am I blind or is there not any code posted?

Here is a code sample of what I tried to explain in words in my previous post. The sample is a little more complicated than what I described: a thread block communicates with the previous thread block multiple times (10 times in this sample; see the outer for loop over column_of_tile) instead of just once.
The array A (of type unsigned int) has one element per thread block, and all elements are initialized to 0.

int row_of_tile = blockIdx.x;

for (int column_of_tile = 0; column_of_tile < 10; column_of_tile++) {
  // Wait until the previous thread block has finished this column.
  // Block 0 never waits; it starts the chain.
  if (threadIdx.x == 0 && row_of_tile != 0) {
    while (A[row_of_tile - 1] <= column_of_tile)
    {}
  }
  __syncthreads();

  // some computations

  // Turn on the green light for the next thread block to start
  // processing the same column of a different row.
  if (threadIdx.x == 0) {
    A[row_of_tile]++;
  }
  __syncthreads();
}

Let me know if you need more clarification or details.

Thanks,
Waruna

I would suggest providing a short, complete code that demonstrates the problem, something that can be copied, pasted, compiled, and run without adding or changing anything, and give the compile command line, CUDA version, OS, and GPU you are using.

I assume your A variable is in global memory. Have you marked it as “volatile”?

CUDA version: 6
OS: Fedora 20
GPU: GTX 480

I will provide a link to a simplified, self-contained sample; whether it is needed depends on the answers to my questions below.

Yes, A is in global memory. I did not mark A as volatile, since I’m using the compiler option “-dlcm=cg” to skip the L1 cache. But when I marked A as “volatile”, it worked fine (the sketch after my questions below shows the change).

  1. Do you know what is wrong with using the compiler flag “-dlcm=cg” instead of marking the global memory array volatile?
  2. If I mark a global array volatile, will read requests to those array locations always go all the way to global memory, even skipping the L2 cache? (I know volatile skips the L1 cache, but I’m not sure about L2.)
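
For reference, here is roughly the change that made it work. This is only a sketch, since I have not posted the full kernel: the kernel name and parameter list are made up, and the real change is just the volatile qualifier on A.

__global__ void my_kernel(volatile unsigned int *A)
{
  int row_of_tile = blockIdx.x;

  for (int column_of_tile = 0; column_of_tile < 10; column_of_tile++) {
    if (threadIdx.x == 0 && row_of_tile != 0) {
      // volatile forces every iteration of this loop to re-read A
      // from the memory system instead of reusing a cached value
      while (A[row_of_tile - 1] <= column_of_tile)
      {}
    }
    __syncthreads();

    // some computations

    if (threadIdx.x == 0) {
      A[row_of_tile]++;
    }
    __syncthreads();
  }
}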

Thanks
Waruna

I think I found the answer to my 2nd question by profiling my kernel. The profiler does not show any increase in read requests to global memory, only to the L2 cache. So volatile does not skip L2, which is good. But I would still like to know the answer to my 1st question.

Thanks,
Waruna