Read a value in global memory which was written by another thread block

Hi all,

GPU: GTX 480
CUDA 6

In my kernel code, thread 0 of a thread block reads a value written by a thread of another thread block. Basically, each thread block waits until that value is updated by the previous thread block. The dependence among thread blocks is pairwise: block 1 waits for block 0, block 2 waits for block 1, and so on (0<-1, 1<-2, 2<-3, …).

To avoid deadlocks, I’m launching fewer thread blocks than there are SMs, so that every block can be resident at once. I also skip the L1 cache by using the compiler parameter “-dlcm=cg”, so I should not read stale values from L1.

But my code still deadlocks. Can someone point out what is wrong with this approach?

PS: I checked the generated PTX code; none of the loads carry the .cg suffix,
i.e. the PTX has ld.global.u32 instead of ld.global.cg.u32.
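
By the way, one thing I could try is forcing the .cg load at the source level with inline PTX, independent of any compiler flags. This is only a sketch: load_cg is a helper name I made up, and the “l” pointer constraint assumes a 64-bit build:

__device__ unsigned int load_cg(const unsigned int *p)
{
  unsigned int v;
  // ld.global.cg caches only in L2 and bypasses L1 on Fermi
  asm volatile("ld.global.cg.u32 %0, [%1];" : "=r"(v) : "l"(p) : "memory");
  return v;
}

The waiting thread would then call load_cg(&flag) in its spin loop instead of reading the flag directly.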

Let me know if you need more information.

Thanks in advance,
Waruna

Am I blind or is there not any code posted?

Here is a code sample of what I tried to explain in words in my previous post. The sample is a little more complicated than what I described: a thread block communicates with the previous thread block multiple times (10 times in this sample; see the outer for loop over column_of_tile) instead of just once.
The array A (of type unsigned int) has one element per thread block, and all elements are initialized to 0.

int row_of_tile = blockIdx.x;

for (int column_of_tile = 0; column_of_tile < 10; column_of_tile++) {
  // Wait until the previous thread block has finished this column.
  // Block 0 never waits; it starts the chain.
  if (threadIdx.x == 0 && row_of_tile != 0) {
    while (A[row_of_tile - 1] <= column_of_tile)
    {}
  }
  __syncthreads();

  // some computations

  // Turn on the green light for the next thread block to start
  // processing the same column of a different row.
  if (threadIdx.x == 0) {
    A[row_of_tile]++;
  }
  __syncthreads();
}

Let me know if you need more clarification or details.

Thanks,
Waruna

I would suggest providing a short, complete code that demonstrates the problem, something that can be copied, pasted, compiled, and run without adding or changing anything, and give the compile command line, CUDA version, OS, and GPU you are using.

I assume your A variable is in global memory. Have you marked it as “volatile”?

CUDA version: 6
OS: Fedora 20
GPU: GTX 480

I will provide a link to a simplified, self-contained sample; whether it is needed depends on the answers to my questions below.

Yes, A is in global memory. I did not mark A as volatile, since I’m using the compiler option “-dlcm=cg” to skip the L1 cache. But when I marked A as “volatile”, it worked fine (the sketch after my questions below shows the change).

  1. Do you know what is wrong with using the compiler flag “-dlcm=cg” instead of marking the global memory array volatile?
  2. If I mark a global array volatile, will read requests to those array locations always go all the way to global memory, even skipping the L2 cache? (I know volatile skips the L1 cache, but I’m not sure about L2.)
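
For reference, here is roughly the change that made it work. This is only a sketch, since I have not posted the full kernel: the kernel name and parameter list are made up, and the real change is just the volatile qualifier on A.

__global__ void my_kernel(volatile unsigned int *A)
{
  int row_of_tile = blockIdx.x;

  for (int column_of_tile = 0; column_of_tile < 10; column_of_tile++) {
    if (threadIdx.x == 0 && row_of_tile != 0) {
      // volatile forces every iteration of this loop to re-read A
      // from the memory system instead of reusing a cached value
      while (A[row_of_tile - 1] <= column_of_tile)
      {}
    }
    __syncthreads();

    // some computations

    if (threadIdx.x == 0) {
      A[row_of_tile]++;
    }
    __syncthreads();
  }
}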

Thanks
Waruna

I think I found the answer to my 2nd question by profiling my kernel. The profiler does not show any increase in read requests to global memory, only to the L2 cache. So volatile does not skip L2, which is good. But I would still like to know the answer to my 1st question.

Thanks,
Waruna