I can’t say that this will replicate the issue, but I have some code like this:
__global__ auto myFunc() -> void
{
    __shared__ unsigned int histogram[MAX_BINS];
    auto idx = getHistIndex( /* implementation detail */ ); // returns an index into shared memory
    auto&& myHist = histogram[idx]; // this appears problematic?
    --myHist;
    if ((myHist & 0xFFFF) == 0xFFFF)
    {
        printf("This is bad: %#010x %#010x\n", myHist, histogram[idx]);
    }
}
What I see is:
This is bad: 0000000000 0xffffffff
I don’t understand how the condition can evaluate to true while the value printed is 0. I also don’t understand how the unaliased access, histogram[idx], prints a different value (the expected -1).
Is this just something that isn’t supported in CUDA?
nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2023 NVIDIA Corporation
Built on Wed_Nov_22_10:17:15_PST_2023
Cuda compilation tools, release 12.3, V12.3.107
Build cuda_12.3.r12.3/compiler.33567101_0
Aren’t you exploring UB? You haven’t initialized the histogram contents, but then proceed to manipulate them arithmetically. The compiler is free to do anything it wants in the presence of UB like that, in my understanding.
And FWIW, my attempt to create a simple reproducer around the code you have shown produced this output:
This is bad: 0xffffffff 0xffffffff
Without being able to reproduce the observation, I wouldn’t be able to say anything conclusive about it. But defects are always possible.
I’m not aware of any stated limitations around the use of auto&& in any CUDA setting. My testing suggests that it works.
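For reference, a minimal reproducer along those lines might look like the sketch below; MAX_BINS, the launch configuration, and the index computation are arbitrary placeholders rather than your actual code. With the bin initialized to 0 and a single-thread launch, the decrement wraps to 0xffffffff and both printed values match, consistent with the output above.

#include <cstdio>

constexpr unsigned int MAX_BINS = 256;       // placeholder size

__global__ auto myFunc() -> void
{
    __shared__ unsigned int histogram[MAX_BINS];
    auto idx = threadIdx.x % MAX_BINS;       // stand-in for getHistIndex()

    histogram[idx] = 0;                      // initialize the bin before touching it
    __syncthreads();

    auto&& myHist = histogram[idx];          // binds an lvalue reference to the shared element
    --myHist;                                // 0 wraps to 0xffffffff

    if ((myHist & 0xFFFF) == 0xFFFF)
    {
        printf("This is bad: %#010x %#010x\n", myHist, histogram[idx]);
    }
}

int main()
{
    myFunc<<<1, 1>>>();                      // single thread, so no races on the bin
    cudaDeviceSynchronize();
    return 0;
}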
No, not UB (at least not from uninitialized data). The shared memory is initialized in the real code; I intentionally elided all but the most essential syntax.
I’m glad to know that you can’t reproduce the behavior; I was quite concerned by what I found.
At this point, I think it wasn’t the auto&& syntax at all, but odd behavior of printf. In the “real” code I printed a lot more context, including the size_t idx. According to cppreference, the proper format specifier for size_t is "%zu"; however, using that specifier caused the subsequent arguments to be distorted. Ultimately (and this was after backing out the auto&& syntax) I used "%llu", and that cleared up the output.
Sorry for the confusion. I spent more than a little time debugging my debugging print statements! It appears this one really took the cake.
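For anyone who runs into the same thing, the change was roughly along these lines; the format string and the "hist" label are placeholders from the trimmed-down example above, and the explicit cast is just for portability:

    // Distorted the arguments that followed it on this platform:
    // printf("idx %zu hist %#010x\n", idx, histogram[idx]);

    // Cleared up the output: cast the size_t and use %llu instead.
    printf("idx %llu hist %#010x\n", (unsigned long long)idx, histogram[idx]);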
That is the correct format specifier for size_t according to the ISO C++ standard. However, this format specifier does not seem to be supported by device-side printf() in CUDA 12.3 (I tried it on a Win10 / MSVC 2019 / CUDA 12.3 platform). You may want to file an enhancement request with NVIDIA via the bug reporting form.
Yes, device-side printf could be a lot pickier about argument types vs. format specifiers at compile time. Such mismatches easily lead to difficult-to-find bugs, and there have been several instances of that in this forum, too.
I wouldn’t say it is particularly surprising, given that a mismatch between the argument size assumed from the format specifier and the actual argument size often has exactly this kind of effect.
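As a purely illustrative sketch of that effect (the exact misbehavior in a mismatch case is not specified, so the output can vary by platform and toolkit version):

#include <cstdio>

__global__ void mismatchDemo()
{
    unsigned long long big  = 0x100000001ULL; // 8-byte argument
    unsigned int       next = 0xabcdu;

    // "%u" assumes a 4-byte argument; the leftover bytes of 'big' can end up
    // decoded as (part of) the next argument, so 'next' may print incorrectly.
    printf("mismatched: %u %#010x\n", big, next);

    // Matching the specifier to the argument size keeps later arguments aligned.
    printf("matched:    %llu %#010x\n", big, next);
}

int main()
{
    mismatchDemo<<<1, 1>>>();
    cudaDeviceSynchronize();
    return 0;
}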
The most straightforward solution is for NVIDIA to simply add support for all simple format specifiers to device-side printf. The easiest way to motivate such a change is to file a bug report.