Thread safety of reading and writing different areas of constant memory in multiple concurrently executed kernels?

Suppose there are multiple kernels launched in different streams. These kernels update or read some part of constant memory. The kernels in the same stream access the same part of constant memory. Kernels in different streams are guaranteed to access different parts of constant memory. There is no overlap of constant memory accessed by kernels in different streams, but the parts may be located in the same cache line.
Is it safe to run these kernels concurrently?

“Safe” in which sense? What specific concerns do you have? I am assuming that you reasoned through this scenario and in so doing stumbled across something that indicates some risk.

(1) First off, it is unlikely for kernels to run concurrently. I have encountered very few real-life scenarios where kernels are small enough (in terms of resource usage) to actually run concurrently. Witness the several questions in this forum asking how to demonstrate concurrent execution, because it occurs rarely.

(2) Based on “no overlap of constant memory accessed by kernels in different streams”, there is no issue of data integrity that I can think of. While kernels in the same stream share data objects in constant memory, they are also guaranteed to execute in enqueuing order, i.e. serially with respect to each other and other activity, such as data transfers, in the same stream.

(3) Re “may be located in the same cache line”: I do not know at what granularity constant memory (as a logical memory space) is carved out of global memory, but I think it is somewhat likely that this occurs with a granularity >= the size of a cache line. However, if two data objects belonging to kernels in different streams were to share a cache line, and such kernels were able to run concurrently, this could lead to performance degradation due to false sharing (in particular caused by cache misses due to cache line invalidation when one of the streams updates a data object in constant memory). Given the read-only nature of the data as seen from the device side, the effect should be so small as to be non-observable in practice.

Kernels don’t update or write to constant memory.

I came up with a scenario which may cause a problem, based on some assumptions about the hardware implementation:
There is an array located in constant memory:
__constant__ int cdata[2];
A pointer ptr to the cdata array is obtained via the CUDA runtime API cudaGetSymbolAddress.
There are two kinds of kernels which update or read a specific element of the cdata array:
kernel W updates cdata through the pointer ptr
kernel R reads cdata
Now suppose 4 kernels (two W and two R) are launched in two streams:
W0 (updates cdata[0] in stream 0),
R0 (reads cdata[0] in stream 0),
W1 (updates cdata[1] in stream 1),
R1 (reads cdata[1] in stream 1).
The LDC instruction in R0 will load the cache line which contains cdata[0] into the constant cache of every SM that executes one or more blocks belonging to kernel R0.
As cdata[1] is right after cdata[0], it will also be loaded into the constant cache. Since W1 and R1 are launched in a different stream, they may execute concurrently with R0 on the same SMs (R1 runs after W1). The store instruction in W1 may execute after the LDC instruction in R0. I don’t know whether the store in W1 will also update the copy of cdata[1] sitting in the constant cache that was loaded by the earlier LDC instruction in kernel R0. If not, the data in the constant cache will be stale.
Then, if the LDC instruction in kernel R1 treats this as a cache hit and uses the data in the constant cache, R1 will not read the data updated by W1. That would be a problem.
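
Here is a minimal sketch of that scenario (error checking omitted; whether the writes in W actually reach constant memory, and whether R1 is guaranteed to see the value written by W1 once R0 has pulled the shared cache line into an SM's constant cache, is exactly what I am unsure about):

#include <cstdio>

__constant__ int cdata[2];

// W: update one element of cdata through a pointer obtained on the host
__global__ void W(int* elem, int value){
    *elem = value;
}

// R: read one element of cdata by name
__global__ void R(int index){
    printf("cdata[%d] = %d\n", index, cdata[index]);
}

int main(){
    int* ptr = nullptr;
    cudaGetSymbolAddress((void**)&ptr, cdata);   // global memory address of cdata

    cudaStream_t stream0, stream1;
    cudaStreamCreate(&stream0);
    cudaStreamCreate(&stream1);

    W<<<1, 1, 0, stream0>>>(ptr + 0, 42);   // W0: updates cdata[0]
    R<<<1, 1, 0, stream0>>>(0);             // R0: reads  cdata[0]
    W<<<1, 1, 0, stream1>>>(ptr + 1, 43);   // W1: updates cdata[1]
    R<<<1, 1, 0, stream1>>>(1);             // R1: reads  cdata[1]

    cudaDeviceSynchronize();
    cudaStreamDestroy(stream0);
    cudaStreamDestroy(stream1);
    return 0;
}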

There is a CUDA runtime API, cudaGetSymbolAddress.
I can use this API to get the global memory address of the __constant__ variable and pass it to a kernel.
Then the kernel can update the data in constant memory.
I don’t update constant memory through the __constant__ variable directly.

I am really curious now: What happened when you tried this? So your code tries to circumvent the read-only property of __constant__ memory as seen from device code by using a generic pointer to access a data object. Honestly I cannot tell you what would happen off the top of my head, but I have doubts it would have the intended effect.

Is this a question based on pure curiosity, or is there some specific use case you have in mind for this? If the latter, what is that use case? There may be a better / cleaner way to accomplish whatever it is you are trying to accomplish.

As for hardware details: NVIDIA puts everything it wants people to know about its hardware in the documentation. Efforts to extract additional information from NVIDIA are unlikely to succeed. You can of course try to reverse engineer implementation details, but any knowledge so gained will be brittle, i.e. it could easily change with every architecture generation.

I wrote a simple test case to see what happens when one tries to write to constant memory. My observations for sm_89:
Writing to constant memory seems to be possible if the address is passed via a kernel argument (writetocmembyaddresskernel).
Attempting to write to the cmem variable directly (writetocmembynamekernel) or via a generic address (writetocmembygenericaddresskernel) results in an undefined launch failure of the respective kernel. Looking at the PTX of those kernels, this is no surprise, because those kernels consist of a single trap instruction.

For kernel writetocmembygenericaddresskernel the compiler gives the warning “Warning: Cannot store to pointer that points to constant memory space”. That is also mentioned in the PTX guide: “Stores to const memory are illegal…”

#include <cstdio>
#include <cassert>

__constant__ int cmem[4];

__global__
void printcmemkernel(){
    for(int i = 0; i < 4; i++){
        printf("%d ", cmem[i]);
    }
    printf("\n");
}

__global__
void printaddresseskernel(int* addressFromApi){
    size_t cmemaddressnumber = size_t(&cmem[0]);
    int* genericaddress = (int*)__cvta_constant_to_generic(cmemaddressnumber);
    printf("api: %p, &cmem: %p, generic: %p\n", addressFromApi, &cmem, genericaddress);
}

__global__
void writetocmembyaddresskernel(int* address){
    for(int i = 0; i < 4; i++){
        address[i] = 2*i;
    }
}

__global__
void writetocmembynamekernel(){
    for(int i = 0; i < 4; i++){
        cmem[i] = 3*i;
    }
}

__global__
void writetocmembygenericaddresskernel(){
//__device__ void * __cvta_constant_to_generic(size_t rawbits);
    size_t cmemaddress = size_t(&cmem[0]);
    int* genericaddress = (int*)__cvta_constant_to_generic(cmemaddress);
    for(int i = 0; i < 4; i++){
        genericaddress[i] = 4*i;
    }
}


int main(){
    cudaError_t status = cudaSuccess;

    int data[4]{0,1,2,3};

    status = cudaMemcpyToSymbol(cmem, &data[0], sizeof(int) * 4);
    assert(status == cudaSuccess);

    printcmemkernel<<<1,1>>>();
    status = cudaDeviceSynchronize();
    assert(status == cudaSuccess);

    int* address = nullptr;
    status = cudaGetSymbolAddress((void**)&address, cmem);
    assert(status == cudaSuccess);

    printaddresseskernel<<<1,1>>>(address);
    status = cudaDeviceSynchronize();
    assert(status == cudaSuccess);

    //works
    {

        writetocmembyaddresskernel<<<1,1>>>(address);
        status = cudaDeviceSynchronize();
        assert(status == cudaSuccess);

        printcmemkernel<<<1,1>>>();
        status = cudaDeviceSynchronize();
        assert(status == cudaSuccess);

    }

    //fails
    {

        writetocmembynamekernel<<<1,1>>>();
        status = cudaDeviceSynchronize();
        assert(status == cudaSuccess);

        printcmemkernel<<<1,1>>>();
        status = cudaDeviceSynchronize();
        assert(status == cudaSuccess);

    }

    //fails
    {

        writetocmembygenericaddresskernel<<<1,1>>>();
        status = cudaDeviceSynchronize();
        assert(status == cudaSuccess);

        printcmemkernel<<<1,1>>>();
        status = cudaDeviceSynchronize();
        assert(status == cudaSuccess);

    }

    status = cudaDeviceSynchronize();
    assert(status == cudaSuccess);
}
0 1 2 3 
api: 0x7fb220c00000, &cmem: 0x7fb220c00000, generic: 0xff6441800000
0 2 4 6 
int main(): Assertion `status == cudaSuccess' failed.

I wrote a program to test whether a kernel can update constant memory and another kernel can then read the updated value. The program works as expected. But it’s just one example; a massively parallel program working in one test doesn’t mean there is no bug. So I want to figure out whether it’s guaranteed by the CUDA programming model or something else:
Whether a kernel is guaranteed to read the data in constant memory updated by a previous kernel in the same stream, no matter how other concurrently running kernels update nearby data (especially data in the same cache line).

It’s a use case in my work. There are several kernels that work in sequence to complete a job (multiple jobs run in parallel in different streams). Two of these kernels work like a producer and a consumer. The producer kernel is executed by one thread and generates several bytes which are one input of the consumer kernel. The consumer kernel is executed by a lot of threads. The best place to store these bytes is constant memory, because they can all be held in the constant cache, which reduces the demand on register file throughput. That is how I came up with this question.
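
To make the use case concrete, here is a minimal sketch of the producer/consumer pair I have in mind (names and sizes are illustrative; it relies on the producer's write through the pointer from cudaGetSymbolAddress being visible to the consumer kernel that follows it in the same stream, which is exactly the guarantee I am asking about):

#include <cstdio>

__constant__ unsigned char cparams[16];   // the few bytes produced for the consumer

// Producer: a single thread generates the bytes and writes them through the
// global memory address of cparams obtained via cudaGetSymbolAddress.
__global__ void producer(unsigned char* cparamsAddress){
    for(int i = 0; i < 16; i++){
        cparamsAddress[i] = (unsigned char)i;   // placeholder for the real computation
    }
}

// Consumer: many threads all read the same bytes, which should be served
// from the constant cache once loaded.
__global__ void consumer(){
    int sum = 0;
    for(int i = 0; i < 16; i++){
        sum += cparams[i];
    }
    if(blockIdx.x == 0 && threadIdx.x == 0){
        printf("sum = %d\n", sum);
    }
}

int main(){
    unsigned char* addr = nullptr;
    cudaGetSymbolAddress((void**)&addr, cparams);

    cudaStream_t stream;
    cudaStreamCreate(&stream);

    // Each job would use its own stream and its own disjoint region of
    // constant memory; only a single stream is shown here.
    producer<<<1, 1, 0, stream>>>(addr);
    consumer<<<64, 128, 0, stream>>>();

    cudaDeviceSynchronize();
    cudaStreamDestroy(stream);
    return 0;
}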

The compiler is very smart and knows that you are trying to write to constant memory inside a kernel, even if you convert the address to a generic address before writing to it.
I think there is a rule that a kernel is not allowed to update constant memory, but it is only enforced at compile time. That’s why the kernels writetocmembynamekernel and writetocmembygenericaddresskernel cannot be compiled into working code (they are compiled to a trap).

But if you pass the pointer as a kernel parameter, the compiler just assumes it points to global memory and uses global memory access instructions to store the data. Maybe it works as expected because constant memory is also located in global memory. But I don’t know how the corresponding cache line in the constant cache gets updated.

I was just going to point out that caveat. Given the intended operation, with the logical memory space __constant__ being read-only from the device, and the constant cache presumably being invalidated at the start of each kernel launch (thus allowing modification from the host side via cudaMemcpyToSymbol), I would claim “no”, as I would not expect any cache snooping to take place.

But I could not point you to chapter & verse in the documentation to support that assessment, which therefore should be considered speculative.
