Correct usage of ldcg and stcg for inter-block communication

Below is a dummy example of what I am trying to achieve. I am trying to use __ldcg and __stcg to bypass the L1 cache so that consecutive blocks can share data through global memory/L2. In my example, I would expect the value of flag to be != -1 after exiting the while loop. However, on my GPU (GeForce RTX 5000, cc 7.5, CUDA 11.2, Linux), the last block hits the assert. This should not be possible, as the threads in that block should not be able to exit the while loop without the flag changing value. This suggests to me that flag/val is somehow being overwritten after the loop exits. Could someone advise me whether this is a bug, or whether I am assuming too much about these primitives?


__device__ int32_t GetNextBlock(int32_t* __restrict__ blockCounter) {
    __shared__ int32_t sBlockIndex;
    // only the first thread in a block increments the counter,
    // so all threads in a block share the same block index.
    if (threadIdx.x == 0) {
        // increase counter and store the previous value in shared memory
        sBlockIndex = atomicAdd(blockCounter, 1);
    }
    // broadcast to all threads in the block.
    __syncthreads();
    return sBlockIndex;
}

template<int32_t BLOCKSIZE>
__global__ void kGlobalRead(int32_t* __restrict__ blockCounter,
                            int32_t* __restrict__ globalStore,
                            int32_t totalThreads) {

    // Thread blocks get incremental block indexes.
    // This guarantees that the previous thread block has finished its
    // computation before the current one.
    int32_t blockIndex = GetNextBlock(blockCounter);
    int32_t tIdx = threadIdx.x;

    int32_t val = 0;
    if (blockIndex > 0) {
        int32_t flag = __ldcg(&globalStore[blockIndex - 1]);
        printf("BEFORE bIdx %i tIdx %i flag %i\n", blockIndex, tIdx, flag);
        while (flag == -1) {
            flag = __ldcg(&globalStore[blockIndex - 1]);
        }
        printf("AFTER bIdx %i tIdx %i flag %i\n", blockIndex, tIdx, flag);
        val = flag;
        // thread should not be able to get here unless flag is != -1
        if (val == -1) {
            printf("FAIL bIdx %i tIdx %i val %i\n", blockIndex, tIdx, val);
        }
        assert(val != -1); // FAILS!! somehow
    }

    if (tIdx == BLOCKSIZE - 1) {
        /* write the partial sum of the thread block to global memory */
        int32_t sum = blockIndex + val;
        printf("STORE: bIdx %i tIdx %i val %i\n", blockIndex, tIdx, sum);
        __stcg(&globalStore[blockIndex], sum);
    }
}


template<int32_t BLOCKSIZE>
int32_t RunGlobalReadGpu(
        int32_t totalThreads,
        int32_t smem) {
    int32_t* blockCounterBuff;
    cudaMalloc(&blockCounterBuff, sizeof(int32_t));
    cudaMemset(blockCounterBuff, 0, sizeof(int32_t));

    int32_t numBlocks = totalThreads % BLOCKSIZE == 0 ?
        totalThreads / BLOCKSIZE :
        (totalThreads / BLOCKSIZE) + 1;

    int32_t* prevBlockIndexBuff;
    size_t bytes = numBlocks * sizeof(int32_t);
    cudaMalloc(&prevBlockIndexBuff, bytes);
    cudaMemset(prevBlockIndexBuff, -1, bytes);
    kGlobalRead<BLOCKSIZE><<<numBlocks, BLOCKSIZE, smem, 0>>>(blockCounterBuff, prevBlockIndexBuff, totalThreads);
    cudaError_t err = cudaStreamSynchronize(0);
    if (err != cudaSuccess) {
        printf("Uh oh!\n");
    }
    cudaFree(blockCounterBuff);
    cudaFree(prevBlockIndexBuff);
    return 0;
}

int main() {
    int32_t N = 15;

    constexpr int32_t BLOCKSIZE = 5;
    int32_t smem = 0;

    int32_t i = N;

    std::cout << "Problem size: " << i << std::endl <<
        "Number of blocks: " << (i % BLOCKSIZE == 0 ? i / BLOCKSIZE : (i / BLOCKSIZE) + 1) << std::endl;

    RunGlobalReadGpu<BLOCKSIZE>(i, smem);
    return 0;
}

template int32_t RunGlobalReadGpu<32>(int32_t totalThreads, int32_t smem);
template int32_t RunGlobalReadGpu<10>(int32_t totalThreads, int32_t smem);
template int32_t RunGlobalReadGpu<5>(int32_t totalThreads, int32_t smem);

On CUDA 11.4, cc7.5, the SASS looks broken to me. In short, there is no evidence of the while loop in the SASS. So it looks like a compiler code generation issue to me. I would assume 11.2 might be similar.

My suggestion is as follows:

  1. Check behavior on latest CUDA 12.0
  2. If it still manifests a problem, file a bug.


The way the code is written, by standard C++ semantics, the compiler can safely assume that the value of flag never changes after initialization. Therefore the while-loop is redundant and can safely be eliminated.

The compiler has been instructed, by use of __restrict__, that the data object pointed to by globalStore is not reachable via some other path, and since there are no writes to globalStore inside the if-block, reading from the same location globalStore[blockIndex-1] multiple times will always result in the same value of flag.

In C++, if we have a data object that can be modified by an agent outside the present scope, this data object needs to be declared volatile. In this case this presumably applies to globalStore[blockIndex-1].

In many such situations, declaring a data object volatile is a necessary but not sufficient condition to achieve some intended functionality. For example, some sort of synchronization may be required in addition. I have not further examined the code in this regard.
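As a minimal sketch of that suggestion (a hypothetical rewrite of the polling loop, not the poster's actual code), declaring the pointed-to data volatile forces the compiler to emit a fresh load on every iteration:

```cuda
#include <cstdint>

// Sketch only: kernel name and parameters are illustrative, assuming the
// same globalStore layout as in the original kernel. The volatile
// qualifier on the pointed-to data prevents the compiler from hoisting
// the load out of the loop.
__global__ void kPollVolatile(volatile int32_t* globalStore,
                              int32_t blockIndex)
{
    if (blockIndex > 0) {
        int32_t flag = globalStore[blockIndex - 1];
        while (flag == -1) {
            flag = globalStore[blockIndex - 1]; // re-read cannot be eliminated
        }
        // flag now holds a value written by another agent
    }
}
```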

Thanks for the response!

The kernel works if I mark globalStore as volatile in the kernel args but then I can’t use ldcg or stcg as they don’t have overloads with the volatile qualifier.

Removing ldcg and stcg altogether and just doing a read and store the old fashioned way, i.e.

flag = globalStore[blockIndex-1];

gives the correct answer but as you mentioned, I’m concerned that it is not sufficient.

So far so good but I will keep pushing on it .

Depending on how you wrote the code, volatile used with the kernel arguments may not give you the semantics needed. The use of qualifiers can be tricky: A volatile pointer to data OR a pointer to volatile data?
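To illustrate the distinction with hypothetical declarations (not taken from the code above):

```cuda
#include <cstdint>

// Placement of the volatile qualifier changes what is volatile:
int32_t* volatile p1;   // volatile POINTER to plain data: only accesses
                        // to p1 itself are volatile, *p1 can be cached
volatile int32_t* p2;   // pointer to VOLATILE data: every access to *p2
                        // is a volatile access and must hit memory

// For inter-block polling it is the pointed-to data that changes
// asynchronously, so the second form is the one needed.
```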

Specifying exactly the loads you want using some PTX inline assembly inside an asm volatile block may be the way to go. Since no larger context was provided, making this a bit of an XY problem, I cannot tell one way or the other.
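A sketch of what that might look like, with hypothetical wrapper names (the PTX cache operator .cg caches at L2 but not L1):

```cuda
#include <cstdint>

// Hypothetical helpers: issue .cg loads/stores via inline PTX. The
// "memory" clobber together with asm volatile keeps the compiler from
// caching the value in a register or reordering the accesses.
__device__ int32_t LoadCG(const int32_t* ptr)
{
    int32_t v;
    asm volatile("ld.global.cg.s32 %0, [%1];" : "=r"(v) : "l"(ptr) : "memory");
    return v;
}

__device__ void StoreCG(int32_t* ptr, int32_t v)
{
    asm volatile("st.global.cg.s32 [%0], %1;" :: "l"(ptr), "r"(v) : "memory");
}
```

Unlike __ldcg/__stcg, wrappers like these can be applied regardless of how the pointer is qualified, since the compiler treats the asm volatile block as opaque.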

Thanks njuffa for sorting this out.

It isn’t going to be productive to file a bug.

It should be possible to achieve what you want without special intrinsics, simply by marking the pointer as volatile. In the example you have shown here, your while loop should provide the necessary synchronization.

Regarding how to mark the pointer to achieve the desired effect, I think the typical method is correct. I have never had any trouble with that approach, and you can find CUDA sample codes that use it (e.g. p2pBandwidthLatencyTest).

If I take that statement at face value (I’m definitely not an expert here) then it seems to me that the stated allowance (“safely”) results in a change in application behavior. So I don’t know what “safely” means in this context.

When entering the while loop, there are two possibilities. Either the value is equal to -1 or it is not. If the value is not equal to -1, the while loop should exit. If the value is equal to -1, and we posit that the read value will never change, then the application behavior is a hang. However, the actual observation is that the application does not hang, but instead produces unexpected results.

I don’t know much about compiler optimization, but it seems odd to me that this could be a valid/proper outcome from applying the optimization “the value will never change”.

That is an interesting point.

I did not look at the generated code myself, I merely responded to “no evidence of a while loop”, which I took to mean “no evidence of a while-loop that performs __ldcg(&globalStore[blockIndex-1]) multiple times”. Multiple loads are redundant, and the code as posted is (by my understanding of C++) equivalent to:

int32_t flag = __ldcg(&globalStore[blockIndex-1]);
if (flag == -1) for (;;);

One would have to go back to the disassembly to examine the generated machine code in detail. I understand your point that if the part if (flag == -1) for (;;) were actually there, the kernel should hang, but that there is no evidence that it does. The assumption underlying that is that this code reads from a location that was previously initialized to -1, which may or may not be the case. It could be reading from the wrong location, or from the correct location but one that is uninitialized. I have not studied the code to find out.

More work would be required to find out whether the compiler translates anything incorrectly here. Given that the CUDA compiler is mature, it is (in my experience) usually a bad bet to assume a compiler bug, but of course the possibility is always there and you may want to run this case by NVIDIA’s compiler folks.

I created a small test program, and the CUDA compiler translates a while-loop with redundant reads into the if-statement plus infinite loop that I expected:

#include <stdio.h>
#include <stdlib.h>

#define INIT_DATA  (0x00)   // use 0xff for infinite loop

__global__ void kernel (int *data)
{
    int flag = data[0];
    while (flag == -1) {
        flag = data[0];
    }
}

int main (void)
{
    int *data_d = 0;
    cudaMalloc ((void**)&data_d, sizeof (*data_d));
    cudaMemset (data_d, INIT_DATA, sizeof (*data_d));
    kernel <<<1,1>>>(data_d);
    cudaDeviceSynchronize ();
    return EXIT_SUCCESS;
}

With CUDA 9.5, compiled for sm_30, kernel() translates to:

        code for sm_30
                Function : _Z6kernelPi
        .headerflags    @"EF_CUDA_SM30 EF_CUDA_PTX_SM(EF_CUDA_SM30)"
                                                                       /* 0x22e2f2c3f2804307 */
        /*0008*/                   MOV R1, c[0x0][0x44];               /* 0x2800400110005de4 */
        /*0010*/                   MOV R2, c[0x0][0x140];              /* 0x2800400500009de4 */
        /*0018*/                   MOV R3, c[0x0][0x144];              /* 0x280040051000dde4 */
        /*0020*/                   LD.E R2, [R2];                      /* 0x8400000000209c85 */
        /*0028*/                   ISETP.EQ.AND P0, PT, R2, -0x1, PT;  /* 0x190efffffc21dc23 */
        /*0030*/              @!P0 EXIT;                               /* 0x80000000000021e7 */
        /*0038*/                   BRA 0x38;                           /* 0x4003ffffe0001de7 */  <<<<<< infinite loop
        /*0040*/                   BRA 0x40;                           /* 0x4003ffffe0001de7 */