Hello again, I seem to have noticed that, in a kernel with up to 1024 threads per block and high register pressure (62-64 registers per thread) one of my
long long int variables is getting silently swapped for
int32_t. In another context, a kernel with only 40-41 registers per thread and 256 x 5 threads per SM, I am seeing no problem. I found something to the effect that high register pressure can sometimes do this. When I do have high register pressure, I do this instead, which seems to get me the int64_t that I want while preserving all of the correct information. Some of the declarations are simply for demonstrative purposes, to show you what each variable is.
__shared__ int overflow;
__shared__ int primary;
int pos = threadIdx.x;
long long int ifrc = __double2ll_rn((double)(overflow[pos]) * max_int_accumulation);
ifrc += primary[pos];
This is the simpler code I would rather implement, and do use in the lower-pressure kernel:
constexpr max_int_accumulation_ll = 2147483648;
const long long int ifrc = ((long long int)(overflow[pos]) * max_int_accumulation_ll) +
(long long int)(primary[pos]);
So, to summarize, I’m finding that I sometimes have to be clever in order to get a faithful combination of int32_t into a single int64_t. As for why I’m using 2^31 and not 2^32 to scale one of the numbers, I’m working in a 63-bit range for reasons that you might see on my other thread.
I am also eyeing this thread from Norbert Juffa, which may be useful in other situations.
Code or it did not happen :-) The titular question maybe be unanswerable based only on the snippets presented. Consider posting a minimal complete example that reproduces whatever the issue is.
Generally speaking C++ compilers optimize according to the as-if rule: The generated code must behave as if it were following the abstract execution model specified by the language standard. Changing an integer type “under the hood” is possible in limited circumstances, but only if no effect of the switch is unobservable in the context of the program.
It is not clear to me whether the question is claiming that an observable deviation from the abstract C++ execution model is happening, and how exactly that manifests itself . If so, that would be a compiler bug, and the first thing to try is to switch to the latest CUDA toolchain to see whether the issue reproduces there.
Before jumping to that step: I am wondering why there is no explicit type given in
constexpr max_int_accumulation_ll = 2147483648;
By my understanding of C++, this makes
max_int_accumulation_ll take type
int (as the default). And since 2147483648 cannot be represented as an
int, this causes an overflow, and the result of an overflowed
int expression is undefined per the C++ standard. What happens when the above is replaced with
constexpr long long int max_int_accumulation_ll = 2147483648LL;
FWIW, it is not clear to me why one would want to use
constexpr here instead of simply
Thanks, @njuffa! Indeed, I was paraphrasing a great deal and in my actual code I did explicitly add the LL to that integer, which would indeed overflow the standard int format (I coded it as
(1LL << 31), and checked the result to verify that it came out correct). It may not be so easy to get a minimal example here, given that one has to code a fairly complex kernel, which pushes register pressure close to the limit and then relies on the compiler to move things around in order to stay within the 64k bounds, to make the problem appear. Let me see what I can do.
Given that 64-bit integers have to be represented as register pairs in the 32-bit GPU, a compiler bug where it loses track of half of such a pair is certainly within the realm of possibilities, and I seem to vaguely recall just such a bug occurring in the distant past.
I am fairly certain that outside of defined interfaces, the compiler does not force the register pair to comprise aligned consecutive registers (e.g. Rn and Rn+1, n even), making losing location information more likely, with “loss” of the more significant half a plausible error mechanism.
INTERESTING. In that case, I will raise the priority of getting my code into a state where it is ready for release, but also something I can transfer to some NVIDIA techs for analysis. Your insight on the casting of 2147483648 was, as always, right on the mark–the way I decided that the int32 casting was probably happening was that I was consistently seeing values of 1 in
overflow[pos] come out -2147483648 in the final
"long long int" result. Larger and large values would continue to run 'round and 'round within the 32-bit format limits, resulting in a sum that was disappointingly small the more “precise” my fixed-precision format got. But if the things can get cleaved in this way it might explain the behavior I was seeing. For now, I’m working under the mode “trust but verify,” with many unit tests that engage this part of the code and bring both parts of the larger number format to bear.