Why is my __constant__ memory float array sometimes all zeros?

I’m declaring a __constant__ float array and initializing it like this in CUDA C++:

__constant__ float fullConvKernel[ MAX_CONV_KERNEL_SIZE * 2 ];

ThrowIfCudaError( "SetupKernel",
                  cudaMemcpyToSymbol( fullConvKernel, fullConvKernelCopy, sizeof( fullConvKernel ) ) );

I’ve asserted that fullConvKernelCopy is not zeroed out on the host, but on the device the array is all zeros (0.0) on some runs, while on other runs the constant memory holds the correct floating-point values.
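
For what it’s worth, a read-back with cudaMemcpyFromSymbol along these lines could confirm what the device-side copy actually holds (rough sketch, not my exact code; it reuses the same ThrowIfCudaError wrapper and assumes assert is available):

// Sketch: copy the __constant__ symbol back to the host and compare it
// against the data that was just uploaded.
float readback[ MAX_CONV_KERNEL_SIZE * 2 ] = { 0 };
ThrowIfCudaError( "VerifyKernel",
                  cudaMemcpyFromSymbol( readback, fullConvKernel, sizeof( readback ) ) );
for ( size_t i = 0; i < MAX_CONV_KERNEL_SIZE * 2; ++i )
{
    assert( readback[ i ] == fullConvKernelCopy[ i ] );
}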

If I use shared memory instead of constant memory, the code works correctly every time.
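
The working shared-memory variant looks roughly like this (simplified sketch; kernelData is a global-memory copy of the same coefficients and convKernelSize is the element count, both names made up here):

// Sketch of the shared-memory variant: each block stages the
// coefficients from global memory into shared memory before using them.
__global__ void Convolve( const float* kernelData, int convKernelSize )
{
    __shared__ float s_kernel[ MAX_CONV_KERNEL_SIZE * 2 ];

    for ( int i = threadIdx.x; i < convKernelSize; i += blockDim.x )
    {
        s_kernel[ i ] = kernelData[ i ];
    }
    __syncthreads();

    // ... convolution using s_kernel[] ...
}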

I tried moving it into a struct, but that didn’t help:

struct FullConvKernel
{
    float m_fullConvKernel[ MAX_CONV_KERNEL_SIZE * 2 ];
};

__constant__ FullConvKernel fullConvKernel;
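
With the struct variant the upload is the same idea, something like the following (sketch; hostKernelStruct is just a placeholder name):

// Sketch of the copy for the struct variant.
FullConvKernel hostKernelStruct;
// ... fill hostKernelStruct.m_fullConvKernel from the host data ...
ThrowIfCudaError( "SetupKernelStruct",
                  cudaMemcpyToSymbol( fullConvKernel, &hostKernelStruct, sizeof( FullConvKernel ) ) );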

I’m not sure what I’m doing wrong here. I’m using CUDA 9.2 on CentOS 6, and the failure occurs on multiple video card types; it behaves like some kind of race condition.

Why does the documentation for cudaMemcpyToSymbol() (https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__MEMORY.html#group__CUDART__MEMORY_1g9bcf02b53644eee2bef9983d807084c7) state:

This function exhibits synchronous behavior for most use cases.

In what cases is it not synchronous?

https://docs.nvidia.com/cuda/cuda-runtime-api/api-sync-behavior.html#api-sync-behavior
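
To rule asynchrony out on my side, I assume something like the following would force the copy to complete before any kernel launch (sketch only, not claiming the extra sync is required):

// Sketch: force the symbol copy to complete before any kernel uses it.
ThrowIfCudaError( "SetupKernel",
                  cudaMemcpyToSymbol( fullConvKernel, fullConvKernelCopy, sizeof( fullConvKernel ) ) );
ThrowIfCudaError( "SetupKernelSync", cudaDeviceSynchronize() );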

I’ve tracked down my issue to using dlopen() to load a shared library that is statically linked against the code containing my constant memory, which gives me two copies of the symbol.

I suspect this duplication is the source of the problem. Is there a way to force the proper copy of the constant memory to be used, or otherwise resolve this?
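
For anyone hitting the same thing: comparing the device address of the symbol as seen from the main binary and from inside the dlopen()’d library seems like a reasonable way to confirm the duplication (sketch, assuming each copy of the code can report its own address):

// Sketch: print the device address of the __constant__ symbol.
// If the statically linked copy inside the dlopen()'d library reports a
// different address than the main binary, there really are two copies.
void* devPtr = nullptr;
ThrowIfCudaError( "GetSymbolAddress",
                  cudaGetSymbolAddress( &devPtr, fullConvKernel ) );
printf( "fullConvKernel device address: %p\n", devPtr );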