Symbol stripped by MSVC in device compilation pass

kshegunov · October 28, 2022, 9:06am

Hello,
I’ve hit a rather strange problem related to code optimization with MSVC as the host compiler. I have a small lookup table (a couple of kilobytes) that I need to load into the constant memory of the GPU. This works gloriously in debug mode, but in release cudaGetSymbolAddress fails with cudaErrorInvalidSymbol. After some digging around (assuming my code is correct) I narrowed the problem down to the optimization flags passed to the host compiler during the nvcc device compilation pass. It appears that the global variable is stripped whenever I pass any optimization flag to the host compiler, probably during the DSE/DCE optimizer pass(es). For what it’s worth, the pragma to disable optimizations (see the code below) doesn’t appear to change anything for me.
I build by generating relocatable device code separately before linking everything together, in case it matters (which I don’t believe it should).
The code which is giving me headaches is as follows:
data.cpp

#include "data.h"

#pragma optimize("", off)

MM_CUDA_CONSTANT MyNamespace::Data GlobalData;

namespace MyNamespace {

bool Data::initialize() noexcept
{
    H.initialize();
    F.initialize();

#if defined(MM_CUDA)
    void * dataSymbol = nullptr;
    cudaError_t status = cudaGetSymbolAddress(&dataSymbol, GlobalData);
    if (status != cudaSuccess)  {
        log("Couldn't get symbol address off the CUDA device!");
        return false;
    }

    status = cudaMemcpy(dataSymbol, this, sizeof(MyNamespace::Data), cudaMemcpyHostToDevice);
    if (status != cudaSuccess)  {
        log("Couldn't initialize the memory on CUDA device!");
        return false;
    }
#endif

    return true;
}

MM_CUDA_LOCAL const Data & data() noexcept
{
    return GlobalData;
}

}

The header only declares the (trivially constructible POD) struct:

#include "fdata.h"
#include "hdata.h"

namespace MyNamespace  {

struct Data
{
    HData H;
    FData F;

    bool initialize() noexcept;
};
MM_CUDA_LOCAL const Data & data() noexcept;

}

The macro MM_CUDA_LOCAL expands to __device__ __host__, whereas MM_CUDA_CONSTANT is equivalent to __constant__. As you can probably tell, my code is “mixed-mode”, so it can be compiled with or without CUDA support. In any case, here’s the compile line I prepare and use (from qmake):

nvcc.exe -arch=native -m64 --no-exceptions -DQ_NVCC -x cu -std c++17 -I. (... more include paths ...) -Xcompiler -nologo,-Zc:wchar_t,-FS,-Zc:rvalueCast,-Zc:inline,-Zc:strictStrings,-Zc:throwingNew,-Zc:referenceBinding,-Zc:__cplusplus,-O2,-Zi,-MD,-std:c++17,-DUNICODE,-D_UNICODE,-DWIN32,-D_ENABLE_EXTENDED_ALIGNED_STORAGE,-DWIN64,-DMM_MSVC,-DMM_WITH_SELFTEST,-DMM_CUDA,-DNDEBUG,-DQ_NVCC --device-c data.cpp -o release\data.cpp.obj

Changing -O2 to -Od makes the above code resolve the symbol and copy the data.
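
For reference, here is a minimal sketch of how mixed-mode macros like MM_CUDA_LOCAL and MM_CUDA_CONSTANT could be defined. The switch on __CUDACC__ (which nvcc defines when compiling CUDA sources) and the empty host fallbacks are assumptions, not necessarily the project's exact definitions. SampleTable, sampleTable, twice and hostSelfCheck are hypothetical names used only for illustration.

```cpp
#include <cassert>

// Assumption: pick the qualifiers based on __CUDACC__, which nvcc defines
// when it compiles a translation unit as CUDA code.
#if defined(__CUDACC__)
#  define MM_CUDA_LOCAL    __device__ __host__
#  define MM_CUDA_CONSTANT __constant__
#else
// Plain host build: the qualifiers expand to nothing.
#  define MM_CUDA_LOCAL
#  define MM_CUDA_CONSTANT
#endif

// Hypothetical stand-in for the real lookup table type (a trivial POD).
struct SampleTable
{
    int values[4];
};

// Placed in constant memory under nvcc, an ordinary global otherwise.
MM_CUDA_CONSTANT SampleTable sampleTable;

// Callable from host and device under nvcc, a plain function otherwise.
MM_CUDA_LOCAL inline int twice(int x) noexcept
{
    return 2 * x;
}

// Host-side sanity check that works in both build modes.
inline void hostSelfCheck()
{
    assert(twice(2) == 4);
}
```

With definitions of this kind the same source file builds with a plain host compiler, where the qualifiers disappear, and with nvcc, where the global ends up in constant memory.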

MSVC version is: 19.28.29337
NVCC version is: 11.8.89

So the question is:
Is this a bug, or am I doing something wrong?

Kind regards,
Konstantin.