I’m getting a few nans for certain inputs into my kernel. I’ve tracked down the source of the nan to a multiplication, which has me a little confused. Here is the snippet I have for my debugging…
This produces the following output…
0.127511 * 0.000000 = nan
0.062237 * 0.000000 = nan
etc…
So it seems to me that I have two properly formatted floating point values, one of which is a zero, and their product is giving me a nan?? I’m clearly missing something.
Thanks for any help
Edit:
I’ve realised that I should just continue to backtrack through my algorithm but using “assert(!isnan(value * 10.0f))” as my test.
I guess now I’m interested to know how the value is secretly a nan but only realises it when you mutiply it by something, so I’ll leave this post here.
It may be easier to help you if you also print out the actual bit pattern of the underlying values. Something like this (assuming all values are float):
With that info, you can simply decode the bit pattern to find out what kind of number each one represents in IEEE-754 format.
However I wouldn’t be surprised if njuffa comes along and identifies the issue without that piece of information.
The only multiplication by zero that I am aware of that can produce a NaN is infinity by zero, and that doesn’t seem to be the case here. And of course its possible that your number printing out as zero is not actually zero.
Something else is going on here. For example the bit pattern for 63.632366 is not 2684354560 (=0xA0000000). Multiplying either pattern by 10.0f also does not produce NaN.
My guess would be some sort of stack corruption, or some sort of variable errors such as interpreting double as unsigned, or using variables that are shared by multiple threads. It’s also always possible that there is a compiler defect. I’m not sure if these cases cannot be sorted out via forum discussion without a complete reproducer.
Sounds resonable, I’ve stripped it back to a single thread so its not a concurrency issue. I’ll investigate stack/memory issues. I am on the experimental driver for wsl2 but I’m no where near ready to start blaming the driver for this.
Okay, I was working with some scratch memory which I was allocating locally…
float scratch_a[N][N];
float scratch_b[N][N];
(Note that for me N is around 3-5). Changing this to shared memory (which I intended to do when I optimise this algorithm anyway) seems to have gotten rid of my shrodinger nans.
The easiest way to reinterpret the bit pattern underlying a float in device code is to use the __float_as_int() device function intrinsic. The commonly used standard-compliant way to do this in C++ host code is:
The call to memcpy() is optimized out by modern compilers, and one winds up with the machine instruction provided for reinterpretation (e.g. movd on x86-64 hardware, fmov on ARM64).
Since isinf() and isnan() are functions with trivial implementations (something like fabsf(x) == __int_as_float(0x7f8000000 and !(fabsf(x) <= __int_as_float (0x7f800000) the chance of a bug is tiny. The chance of an issue with underlying floating-point hardware is pretty much nil. Which leaves use with some sort of data corruption as a plausible hypothesis.
It’s not obvious to me that that should cause issues for N values of 3 to 5 based on what you have shown, but it certainly seems like an interesting clue as to what may actually be happening. I wouldn’t rule out the possibility of a compiler defect at this point. Such guesses can’t be confirmed without a full test case, however. If you have time to invest, you might try experiments such as seeing if the behavior is different in debug (-G) vs. not, and also perhaps try different CUDA versions.
Ahh debug compilation fixes the issue aswell. I guess that suggests a compiler optimisation as the cause. Switching to volatile for the local memory also works, again pointing to a compiler optimisation?
marking local memory as volatile seems comical to me, even more so that it makes a difference. (why should marking a local variable as volatile have any effect at all? I don’t know. I guess maybe the compiler is taking a rigid view of things and forcing reads/writes to local memory rather than allowing a value to persist in a register, but this seems really outlandish, to me.) I really would hesitate to speculate further as to what is going on. In any event it certainly seems like the compiler is an actor here. Whether it is a bad actor I cannot say. I don’t know what it means to say “pointing to compiler optimisation?” I agree it points to the compiler as an actor here.
Can you post a self-contained minimal reproducer code? The observations regarding optimization level and use of volatilemay point at a compiler issue, but these changes may also mask other potential sources of error (such as missing synchronization in source code).
Does the code in question implement a reduction, by any chance?
The semantics of volatile are: the data object so marked may be changed by an agent outside the code scope in which is defined at any time. Classic use cases are in device drivers for memory mapped-hardware status registers and data updated by ISR (interrupt service routines), such as a timer tick count. Typically insufficient in multi-threaded scenarios, where explicit synchronization (e.g. mutex) is needed. Generally volatile forces the compiler to access the underlying physical storage of that data item every time the data object is accessed, i.e. it cannot load it into a register-based temp variable and use that copy for the duration of the code, which is the common optimization normally allowed by the as-if model (as far as observable effects are concerned, code behaves as-if it is following the abstract execution model of C++ exactly).
Alrighty I’ve managed to trim it all the way down for you folks. In doing so I discovered some uninitialised values in the arrays I was using, this seams to be the primary cause of the issue. I’m unsure on the exact guarentees for unitialised local memory. But I have a feeling this is still a bit of a bug, at least an odd interaction given that volatile/__shared__/-G fix it.
__global__ void Kernel()
{
// no nans for volatile or __shared__
// volatile float array[N];
//__shared__ float array[N];
float array[N];
// no nans is we initialise all the values
// for (int i = 0; i < N; i++)
// {
// array[i] = 1.0f;
// }
for (int i = 0; i < N; i++)
{
printf("as_float: %f as_uint: %u\n", array[i], __float_as_int(array[i]));
// This never throws
assert(!isnan(array[i]));
// This throws
assert(!isnan(array[i] * 10.0f));
}
}
int main()
{
printf("dispatching...\n");
Kernel<<<1, 1>>>();
cudaDeviceSynchronize();
}
There are no guarantees when uninitialized data is used. You would always want to initialize all source data that is consumed anywhere in your code.
For future reference: self-contained means someone else can cut&paste the code into a text file and throw that at the compiler. N, assert, printf, isnan have to come from somewhere …
For what it is worth, the code (with include files added and N set to 10) works fine for me because the uninitialized memory happens to contain zeros.
General C++ notes: func(); and func(void); are not the same thing. For a function with no arguments, you would want func(void);. main has an int return value that you would want to set. EXIT_SUCCESS (from cstdlib) is a good default choice.