Static Shared Memory and C++ auto&& alias?

I’ve looked all around, but search-engine rot is making it ever harder to find useful information. The closest I came was Robert Crovella’s answer in Basic question about shared memory usage and device functions:

I can’t say that this will replicate the issue, but I have some code like this:

__global__ auto myFunc() -> void
{
    __shared__ unsigned int histogram[MAX_BINS];

    auto idx = getHistIndex( /* implementation detail */ ); // returns an index into shared memory

    auto&& myHist = histogram[idx]; // this appears problematic?

    --myHist;
    if ((myHist & 0xFFFF) == 0xFFFF)
    {
        printf("This is bad: %#010x   %#010x\n", myHist, histogram[idx]);
    }
}

What I see is:

This is bad: 0000000000  0xffffffff

I don’t understand how the condition can evaluate to true while the value printed is 0. I also don’t understand how the unaliased version produces a different value (the expected -1).

Is this just something that isn’t supported in CUDA?

nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2023 NVIDIA Corporation
Built on Wed_Nov_22_10:17:15_PST_2023
Cuda compilation tools, release 12.3, V12.3.107
Build cuda_12.3.r12.3/compiler.33567101_0

I assume you meant ((myHist & 0xFFFF) == 0xFFFF), as == has higher precedence than &.
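
For illustration only: without the inner parentheses, the & would be applied to the result of the comparison, so the test degenerates into checking the low bit of myHist. A quick host-side check of the parse (the value of x is just an example):

#include <cstdio>

int main()
{
    unsigned int x = 0x10001u;
    // Without the inner parentheses, == binds tighter than &:
    bool unparenthesized = x & 0xFFFF == 0xFFFF;    // parses as x & (0xFFFF == 0xFFFF), i.e. x & 1
    bool parenthesized   = (x & 0xFFFF) == 0xFFFF;  // the intended comparison
    printf("%d %d\n", unparenthesized, parenthesized); // prints "1 0" for this x
    return 0;
}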

Aren’t you exploring UB? You haven’t initialized the histogram contents, but then proceed to manipulate them arithmetically. The compiler is free to do anything it wants in the presence of UB like that, in my understanding.
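
For reference, the usual way to avoid that kind of UB is to have the block zero the shared array cooperatively before any thread reads it; a minimal sketch (MAX_BINS, the kernel name, and the launch configuration are assumptions here):

#define MAX_BINS 256   // assumed bin count

__global__ void histKernel()
{
    __shared__ unsigned int histogram[MAX_BINS];

    // Each thread zeroes a strided subset of bins so the whole array
    // is defined before any arithmetic touches it.
    for (unsigned int i = threadIdx.x; i < MAX_BINS; i += blockDim.x)
        histogram[i] = 0;
    __syncthreads();

    // ... increments/decrements on histogram[] are well-defined from here on
}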

And FWIW, my attempt to create a simple reproducer around the code you have shown produced this output:

This is bad: 0xffffffff  0xffffffff

Without being able to reproduce the observation, I wouldn’t be able to say anything conclusive about it. But defects are always possible.

I’m not aware of any stated limitations around use of auto&& in any CUDA setting. My testing suggests it seems to work.
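
For what it’s worth, my reproducer was roughly along these lines (the bin count and the indexing are assumptions, since the original post elides them):

#include <cstdio>

#define MAX_BINS 256   // assumed

__global__ void myFunc()
{
    __shared__ unsigned int histogram[MAX_BINS];

    auto idx = threadIdx.x;          // stand-in for getHistIndex()
    histogram[idx] = 0;              // shared memory initialized before use
    __syncthreads();

    auto&& myHist = histogram[idx];  // binds an lvalue reference to the shared element

    --myHist;                        // decrements the shared value through the reference
    if ((myHist & 0xFFFF) == 0xFFFF)
    {
        printf("This is bad: %#010x   %#010x\n", myHist, histogram[idx]);
    }
}

int main()
{
    myFunc<<<1, 1>>>();
    cudaDeviceSynchronize();
    return 0;
}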

Fortunately, yes, I did mean that. The actual code in question does have the parens. I’ll edit my question.

No, not UB (at least, not the initialization). The shared memory is initialized; I intentionally elided all but the most essential syntax.

I’m glad to know that you can’t reproduce the behavior; I was quite concerned by what I found.

At this point, I think it wasn’t the auto&& syntax at all, but odd behavior of printf. In the “real” code, I printed a lot more context, including the size_t idx. According to cppreference, the proper format specifier for size_t should be "%zu"; however, using that specifier caused the subsequent arguments to be distorted. Ultimately (and this was after backing out the auto&& syntax) I used "%llu", and that cleared up the output.

Sorry for the confusion. I spent more than a little time debugging my debugging print statements! It appears this one really took the cake.
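
For anyone who finds this later, the change that cleared up my output amounted to casting the size_t argument to a type whose specifier device-side printf does honor; a minimal sketch (kernel and variable names are made up):

#include <cstdio>

__global__ void show(size_t idx)
{
    // "%zu" was not honored here, so cast explicitly and use "%llu",
    // which device-side printf does support.
    printf("idx=%llu\n", (unsigned long long)idx);
}

int main()
{
    show<<<1, 1>>>(42);
    cudaDeviceSynchronize();   // flush device-side printf output
    return 0;
}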

That is the correct format specifier for size_t according to the ISO C++ standard. However, this format specifier does not seem to be supported by device-side printf() in CUDA 12.3 (I tried it on a Win10 / MSVC 2019 / CUDA 12.3 platform). You may want to file an enhancement request with NVIDIA via the bug reporting form.

#include <cstdio>

__global__ void kernel(int arg)
{
    printf ("GPU: arg=%d sizeof(arg)=%zu\n", arg, sizeof(arg));
}

int main (void)
{
    int arg = 42;
    printf ("CPU: arg=%d sizeof(arg)=%zu\n", arg, sizeof(arg));
    kernel<<<1,1>>>(arg);
    cudaDeviceSynchronize();   // ensure device-side printf output is flushed before exit
    return 0;
}

With CUDA 12.3, this prints

CPU: arg=42 sizeof(arg)=4
GPU: arg=42 sizeof(arg)=%zu

Yes, device-side printf could be a lot pickier about argument types vs. format specifiers at compile time. Such mismatches easily lead to difficult-to-find bugs, and there have been several instances of that in this forum, too.

Your output is surprising. If you swap the order of the args, do surprising things happen?

printf("GPU: sizeof(arg)=%zu  arg=Z%d\n", sizeof(arg), arg);

That’s closer to how my debug output was structured.

When I swap the arguments, I see this:

CPU: sizeof(arg)=4 arg=42
GPU: sizeof(arg)=4 arg=-790847680

I wouldn’t say it is particularly surprising, given that mismatches between the argument size assumed from the format specifier and the actual argument size often have this kind of effect.

The most straightforward solution is for NVIDIA to simply add support for all simple format specifiers to device-side printf. The easiest way to motivate such a change is to file a bug report.
