Correct usage of ldcg and stcg for inter-block communication

wolfangA · February 3, 2023, 6:03pm

Below is a dummy example of what I am trying to achieve. I am trying to use ldcg and stcg to bypass the L1 cache so that consecutive blocks can share data through global memory/L2. In my example, I would expect the value of flag to be != -1 after exiting the while loop. However, on my GPU (GeForce RTX 5000, cc 7.5, CUDA 11.2, Linux), the last block hits the assert. This should not be possible, as the threads in that block should not be able to exit the while loop without the flag changing value. This suggests to me that flag/val is somehow being overwritten after the loop exits. Could someone advise me whether this is a bug or whether I am assuming too much when using these primitives?

Thanks!

#include <cassert>
#include <cstdint>
#include <cstdio>
#include <iostream>

__device__
int32_t GetNextBlock(int32_t* __restrict__ & blockCounter) {
    __shared__ int32_t sBlockIndex;
    // only the first thread in a block increments the counter
    // so all threads in a block share the same block index. 
    if (threadIdx.x == 0) {
        // increase counter and store the previous value in shared memory
        sBlockIndex = atomicAdd(blockCounter, 1);
    }
    __syncthreads();
    // broadcast to all threads in the block.
    return sBlockIndex;
}

template<int32_t BLOCKSIZE>
__global__ 
void kGlobalRead(int32_t * __restrict__ blockCounter,
                 int32_t * __restrict__ globalStore,
                 int32_t totalThreads) {

    // Thread blocks get incremental block indexes. 
    // This guarantees that the previous thread block has finished its computation 
    // before the current one.
    int32_t blockIndex = GetNextBlock(blockCounter);
    int32_t tIdx = threadIdx.x;

    int32_t val = 0;
    if (blockIndex > 0) {
        int32_t flag = __ldcg(&globalStore[blockIndex-1]);
        printf("BEFORE bIdx %i tIdx %i flag %i\n", blockIndex, tIdx, flag);
        while (flag == -1) {
            flag = __ldcg(&globalStore[blockIndex-1]);
        }
        printf("AFTER bIdx %i tIdx %i flag %i\n", blockIndex, tIdx, flag);
        val = flag;
    }
    // thread should not be able to get here unless flag is != -1
    if (val == -1) {
        printf("FAIL bIdx %i tIdx %i val %i\n", blockIndex, tIdx, val);
    }
    assert(val != -1); // FAILS somehow

    if (tIdx == BLOCKSIZE - 1) {
        /*atomically write the partial sum of the thread block to global memory*/
        int32_t sum = blockIndex + val;
        printf("STORE: bIdx %i tIdx %i val %i\n", blockIndex, tIdx, sum);
        __stcg(&globalStore[blockIndex], sum);
    }
    __syncthreads();

}

template<int32_t BLOCKSIZE>
int32_t RunGlobalReadGpu(
        int32_t totalThreads,
        int32_t smem) {
    
    int32_t* blockCounterBuff;
    cudaMalloc(&blockCounterBuff, sizeof(int32_t));
    cudaMemset(blockCounterBuff, 0, sizeof(int32_t));

    int32_t numBlocks = totalThreads % BLOCKSIZE == 0 ? 
        totalThreads / BLOCKSIZE : 
        (totalThreads / BLOCKSIZE) + 1;

    int32_t* prevBlockIndexBuff;
    size_t bytes = numBlocks * sizeof(int32_t);
    cudaMalloc(&prevBlockIndexBuff, bytes);
    cudaMemset(prevBlockIndexBuff, -1, bytes);
    kGlobalRead<BLOCKSIZE><<<numBlocks, BLOCKSIZE, smem, 0>>>(blockCounterBuff, prevBlockIndexBuff, totalThreads);

    cudaError_t err = cudaStreamSynchronize(0);
    if (err != cudaSuccess) {
        printf("CUDA error: %s\n", cudaGetErrorString(err));
    }
    return 0;
}


int main() {
    int32_t N = 15;

    constexpr int32_t BLOCKSIZE = 5;
    int32_t smem = 0;

    int32_t i = N;

    std::cout << "Problem size: " << i << std::endl << 
        "Number of blocks: " << (i % BLOCKSIZE == 0 ? i / BLOCKSIZE : (i / BLOCKSIZE) + 1) << std::endl;

    RunGlobalReadGpu<BLOCKSIZE>(i, smem);
}

template int32_t RunGlobalReadGpu<32>(
    int, 
    int);
template int32_t RunGlobalReadGpu<10>(
    int, 
    int);
template int32_t RunGlobalReadGpu<5>(
    int, 
    int);

Robert_Crovella · February 3, 2023, 8:22pm

~~On CUDA 11.4, cc7.5, the SASS looks broken to me. In short,~~ there is no evidence of the while loop in the SASS. ~~So it looks like a compiler code generation issue to me. I would assume 11.2 might be similar.~~

~~My suggestion is as follows:~~

~~Check behavior on latest CUDA 12.0~~

~~If it still manifests a problem, file a bug.~~

wolfangA · February 3, 2023, 11:10pm

Thanks!

njuffa · February 3, 2023, 11:58pm

The way the code is written, by standard C++ semantics, the compiler can safely assume that the value of flag never changes after initialization. Therefore the while-loop is redundant and can safely be eliminated.

The compiler has been instructed, by use of __restrict__, that the data object pointed to by globalStore is not reachable via some other path, and since there are no writes to globalStore inside the if-block, reading from the same location globalStore[blockIndex-1] multiple times will always result in the same value of flag.

In C++, if we have a data object that can be modified by an agent outside the present scope, this data object needs to be declared volatile. In this case this presumably applies to globalStore[blockIndex-1].

In many such situations, declaring a data object volatile is a necessary but not sufficient condition to achieve some intended functionality. For example, some sort of synchronization may be required in addition. I have not further examined the code in this regard.

wolfangA · February 4, 2023, 12:19am

Thanks for the response!

The kernel works if I mark globalStore as volatile in the kernel args but then I can’t use ldcg or stcg as they don’t have overloads with the volatile qualifier.

Removing ldcg and stcg altogether and just doing a read and store the old fashioned way, i.e.

flag = globalStore[blockIndex-1];

gives the correct answer but as you mentioned, I’m concerned that it is not sufficient.

So far so good, but I will keep pushing on it.

njuffa · February 4, 2023, 1:26am

Depending on how you wrote the code, volatile used with the kernel arguments may not give you the semantics needed. The use of qualifiers can be tricky: A volatile pointer to data OR a pointer to volatile data?
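
To make the distinction concrete, here is a minimal sketch of the two spellings. The alias names PtrToVolatile and VolatilePtr and the kernel kPollExample are made up for illustration and do not appear in the original code; the static_asserts only check which part of each declaration is qualified.

```cuda
#include <cstdint>
#include <type_traits>

// Hypothetical alias names, for illustration only.
using PtrToVolatile = volatile int32_t*;  // the pointed-to int is volatile
using VolatilePtr   = int32_t* volatile;  // only the pointer variable is volatile

// Pointer to volatile data: every read of *p must actually be performed,
// which is the property a polling loop on globalStore[...] relies on.
static_assert(std::is_volatile<std::remove_pointer<PtrToVolatile>::type>::value,
              "pointee is volatile");
static_assert(!std::is_volatile<PtrToVolatile>::value,
              "pointer itself is not volatile");

// Volatile pointer: the pointee is an ordinary int32_t, so the compiler may
// still merge repeated reads of *p.
static_assert(std::is_volatile<VolatilePtr>::value,
              "pointer itself is volatile");
static_assert(!std::is_volatile<std::remove_pointer<VolatilePtr>::type>::value,
              "pointee is not volatile");

// As a kernel parameter, the pointer-to-volatile form would be spelled like this.
__global__ void kPollExample(volatile int32_t* globalStore) {
    (void)globalStore;  // body omitted; only the signature matters here
}
```

For a flag that another block modifies, the first form is presumably the one that is wanted.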

Specifying exactly the loads you want using some PTX inline assembly inside an asm volatile block may be the way to go. Since no larger context was provided, making this a bit of an XY problem, I cannot tell one way or the other.
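
As a rough sketch of what such an inline-assembly load could look like: the helper LoadVolatile, the kernel kCopyOne, and the host function RoundTrip are hypothetical names, and the "l" constraint assumes a 64-bit build. It uses a generic-address ld.volatile rather than a .cg load, which is a substitution on my part and may not match the caching behavior you were after.

```cuda
#include <cstdint>
#include <cuda_runtime.h>

// Hypothetical helper: a volatile 32-bit load through a generic address.
// "asm volatile" plus the "memory" clobber tells the compiler not to delete
// the load or move it across other memory operations.
__device__ __forceinline__ int32_t LoadVolatile(const int32_t* ptr) {
    int32_t value;
    asm volatile("ld.volatile.s32 %0, [%1];"
                 : "=r"(value)
                 : "l"(ptr)   // assumes 64-bit pointers
                 : "memory");
    return value;
}

// Copies one value using the helper, just to exercise it.
__global__ void kCopyOne(const int32_t* src, int32_t* dst) {
    *dst = LoadVolatile(src);
}

// Hypothetical host driver: round-trips one value through the helper.
// Error checking is omitted to keep the sketch short.
int32_t RoundTrip(int32_t input) {
    int32_t* buf = nullptr;
    cudaMalloc(&buf, 2 * sizeof(int32_t));
    cudaMemcpy(buf, &input, sizeof(int32_t), cudaMemcpyHostToDevice);
    kCopyOne<<<1, 1>>>(buf, buf + 1);
    int32_t output = 0;
    // The blocking copy on the default stream also waits for the kernel.
    cudaMemcpy(&output, buf + 1, sizeof(int32_t), cudaMemcpyDeviceToHost);
    cudaFree(buf);
    return output;
}
```

In a polling loop, LoadVolatile(&globalStore[blockIndex-1]) would take the place of the __ldcg call, and the loop should then survive optimization.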

Robert_Crovella · February 4, 2023, 2:34pm

Thanks njuffa for sorting this out.

It isn’t going to be productive to file a bug.

It should be possible to achieve what you want without special intrinsics but by marking the pointer as volatile. In the example you have shown here, your while loop should provide the necessary synchronization.

Regarding how to mark the pointer to achieve the desired effect, I think the typical method is correct. I have never had any trouble with that approach, and you can find CUDA sample codes that use that approach (e.g. p2pBandwidthLatencyTest).
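
For reference, a minimal sketch of the volatile approach applied to a kernel shaped like the one in the first post. The names kChain and RunChain are hypothetical, only thread 0 of each block polls (a simplification, not what the original code does), and __restrict__ is dropped from globalStore because other blocks do write to that array.

```cuda
#include <cstdint>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical kernel, modeled on kGlobalRead from the first post.
__global__ void kChain(int32_t* blockCounter, volatile int32_t* globalStore) {
    if (threadIdx.x == 0) {
        // Blocks take tickets in the order they start running, so the block
        // holding ticket k-1 is already resident when ticket k begins to wait.
        int32_t blockIndex = atomicAdd(blockCounter, 1);
        int32_t prev = 0;
        if (blockIndex > 0) {
            // Volatile read: the load is re-issued on every iteration.
            do {
                prev = globalStore[blockIndex - 1];
            } while (prev == -1);
        }
        // Publish this block's running sum; it is never -1 for these inputs.
        globalStore[blockIndex] = blockIndex + prev;
    }
    // The other threads of the block wait here until thread 0 is done.
    __syncthreads();
}

// Hypothetical host driver: returns the per-block running sums.
// Error checking is omitted to keep the sketch short.
std::vector<int32_t> RunChain(int numBlocks, int blockSize) {
    int32_t* counter = nullptr;
    int32_t* store = nullptr;
    size_t bytes = numBlocks * sizeof(int32_t);
    cudaMalloc(&counter, sizeof(int32_t));
    cudaMalloc(&store, bytes);
    cudaMemset(counter, 0, sizeof(int32_t));
    cudaMemset(store, 0xFF, bytes);  // every slot starts as -1
    kChain<<<numBlocks, blockSize>>>(counter, store);
    std::vector<int32_t> sums(numBlocks);
    // The blocking copy on the default stream also waits for the kernel.
    cudaMemcpy(sums.data(), store, bytes, cudaMemcpyDeviceToHost);
    cudaFree(counter);
    cudaFree(store);
    return sums;
}
```

Each slot ends up holding its ticket number plus the previous slot, so RunChain(4, 32) should return 0, 1, 3, 6. Here the flag is itself the data being passed along. If a block published a separate payload next to the flag, a __threadfence() between the payload write and the flag write would be needed as well.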

Robert_Crovella · February 7, 2023, 6:08pm

If I take the statement that the while-loop “is redundant and can safely be eliminated” at face value (I’m definitely not an expert here), then it seems to me that the stated allowance (“safely”) results in a change in application behavior. So I don’t know what “safely” means in this context.

When entering the while loop, there are two possibilities. Either the value is equal to -1 or it is not. If the value is not equal to -1, the while loop should exit. If the value is equal to -1, and we posit that the read value will never change, then the application behavior is a hang. However, the actual observation is that the application does not hang, but instead produces unexpected results.

I don’t know much about compiler optimization, but it seems odd to me that this could be a valid/proper outcome from applying the optimization “the value will never change”.

njuffa · February 7, 2023, 7:53pm

That is an interesting point.

I did not look at the generated code myself, I merely responded to “no evidence of a while loop”, which I took to mean “no evidence of a while-loop that performs __ldcg(&globalStore[blockIndex-1]) multiple times”. Multiple loads are redundant, and the code as posted is (by my understanding of C++) equivalent to:

int32_t flag = __ldcg(&globalStore[blockIndex-1]);
if (flag == -1) for (;;);

One would have to go back to the disassembly to examine the generated machine code in detail. I understand your point that if the part if (flag == -1) for (;;) were actually there, the kernel should hang, but that there is no evidence that it does. The assumption underlying that is that this code reads from a location that was previously initialized to -1, which may or may not be the case. It could be reading from the wrong location or at the correct location which is uninitialized. I have not studied the code to find out.

More work would be required to find out whether the compiler translates anything incorrectly here. Given that the CUDA compiler is mature, it is (in my experience) usually a bad bet to assume a compiler bug, but of course the possibility is always there and you may want to run this case by NVIDIA’s compiler folks.

njuffa · February 7, 2023, 8:14pm

I created a small test program, and the CUDA compiler translates a while-loop with redundant reads into the if-statement plus infinite loop that I expected:

#include <stdio.h>
#include <stdlib.h>

#define INIT_DATA  (0x00)   // use 0xff for infinite loop

__global__ void kernel (int *data)
{
    int flag = data[0];
    while (flag == -1) {
        flag = data[0];
    }
}

int main (void)
{
    int *data_d = 0;
    cudaMalloc ((void**)&data_d, sizeof (*data_d));
    cudaMemset (data_d, INIT_DATA, sizeof (*data_d));
    kernel <<<1,1>>>(data_d);
    return EXIT_SUCCESS;
}

With CUDA 9.5, compiled for sm_30, kernel() translates to:

        code for sm_30
                Function : _Z6kernelPi
        .headerflags    @"EF_CUDA_SM30 EF_CUDA_PTX_SM(EF_CUDA_SM30)"
                                                                       /* 0x22e2f2c3f2804307 */
        /*0008*/                   MOV R1, c[0x0][0x44];               /* 0x2800400110005de4 */
        /*0010*/                   MOV R2, c[0x0][0x140];              /* 0x2800400500009de4 */
        /*0018*/                   MOV R3, c[0x0][0x144];              /* 0x280040051000dde4 */
        /*0020*/                   LD.E R2, [R2];                      /* 0x8400000000209c85 */
        /*0028*/                   ISETP.EQ.AND P0, PT, R2, -0x1, PT;  /* 0x190efffffc21dc23 */
        /*0030*/              @!P0 EXIT;                               /* 0x80000000000021e7 */
        /*0038*/                   BRA 0x38;                           /* 0x4003ffffe0001de7 */  <<<<<< infinite loop
        /*0040*/                   BRA 0x40;                           /* 0x4003ffffe0001de7 */
