I came across what I believe is a bug in the compiler (version 3.1, both Linux and Windows). Here’s the smallest piece of code I could trigger it with:
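A sketch of the kind of code involved (not the original listing; addmul_1 and the carry variable cy are the names used later in this thread, everything else is illustrative):

    #include <stdio.h>

    typedef unsigned int       u32;
    typedef unsigned long long u64;

    #define N 4   /* limbs per operand; the result buffer holds 2*N+1 limbs */

    /* r[0..n-1] += u[0..n-1] * v; returns the carry-out limb.
       The carry "cy" is a u32 on purpose: re-widening it to 64 bits inside
       the loop is where an erroneous sign extension would show up. */
    __device__ u32 addmul_1(u32 *r, const u32 *u, int n, u32 v)
    {
        u32 cy = 0;
        for (int i = 0; i < n; ++i) {
            u64 t = (u64)u[i] * v + r[i] + cy;   /* cy must be zero-extended */
            r[i] = (u32)t;                       /* low word of the partial sum */
            cy   = (u32)(t >> 32);               /* high word is the next carry */
        }
        return cy;
    }

    /* r = x + y*z (schoolbook), single thread for simplicity; r holds x on entry. */
    __global__ void madd(u32 *r, const u32 *y, const u32 *z)
    {
        for (int j = 0; j < N; ++j) {
            u32 cy = addmul_1(r + j, y, N, z[j]);
            int k = j + N;                       /* propagate the returned carry */
            u64 t = (u64)r[k] + cy;
            r[k] = (u32)t;
            while ((t >> 32) != 0 && ++k <= 2 * N) {
                t = (u64)r[k] + 1;
                r[k] = (u32)t;
            }
        }
    }

    static void print_limbs(const char *tag, const u32 *a, int n)
    {
        printf("%s", tag);
        for (int i = n - 1; i >= 0; --i) printf(" %08x", a[i]);
        printf("\n");
    }

    int main(void)
    {
        u32 x[2 * N + 1] = { 0 }, y[N], z[N], r[2 * N + 1];
        for (int i = 0; i < N; ++i) {            /* all-ones operands force large carries */
            x[i] = 0xffffffffu; y[i] = 0xffffffffu; z[i] = 0xffffffffu;
        }

        u32 *d_r, *d_y, *d_z;
        cudaMalloc((void **)&d_r, sizeof(x));
        cudaMalloc((void **)&d_y, sizeof(y));
        cudaMalloc((void **)&d_z, sizeof(z));
        cudaMemcpy(d_r, x, sizeof(x), cudaMemcpyHostToDevice);
        cudaMemcpy(d_y, y, sizeof(y), cudaMemcpyHostToDevice);
        cudaMemcpy(d_z, z, sizeof(z), cudaMemcpyHostToDevice);

        madd<<<1, 1>>>(d_r, d_y, d_z);
        cudaMemcpy(r, d_r, sizeof(r), cudaMemcpyDeviceToHost);

        print_limbs("X       =", x, N);
        print_limbs("Y       =", y, N);
        print_limbs("Z       =", z, N);
        print_limbs("X + Y*Z =", r, 2 * N + 1);  /* most significant limb should be 0 */

        cudaFree(d_r); cudaFree(d_y); cudaFree(d_z);
        return 0;
    }

With correct code generation the most significant limb of the printed result is 0; if the carry were sign-extended as described below, it would not be.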
The problem lies in line 98: an unsigned integer is converted to a larger unsigned type, but it is treated as signed and its sign is extended, which causes the code to produce incorrect results.
Would it be possible to post a self-contained, runnable little program, stating both the actual and the expected output? I would be happy to look at the app and file a compiler bug if necessary. Thanks!
The last line is the output from the kernel. Call the first line X, the second Y, and the third Z; the last line is then X + Y*Z (all multiple-precision integers). The most significant word of that last line is wrong: it should be 0. If I change the type of “cy” in the function addmul_1 to u64, the result is correct, since no sign extension is then performed.
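In terms of the sketch above, that change amounts to widening the carry so nothing in the inner loop has to re-widen a u32 value:

    __device__ u32 addmul_1(u32 *r, const u32 *u, int n, u32 v)
    {
        u64 cy = 0;                              /* was: u32 cy = 0; */
        for (int i = 0; i < n; ++i) {
            u64 t = (u64)u[i] * v + r[i] + cy;   /* cy is already 64-bit */
            r[i] = (u32)t;
            cy   = t >> 32;
        }
        return (u32)cy;                          /* the carry still fits in 32 bits */
    }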
I was able to reproduce the problem on WinXP64 with a GTX285. Thank you for taking the time to reduce the code to an easily runnable standalone app, and for bringing this issue to our attention. I will file a compiler bug. For ease of debugging, I further simplified to a repro case with N=1:
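This is not the actual simplified code from the post, just a sketch of what an N=1 case might look like: a single multiply-accumulate step whose u32 carry has its top bit set, so a wrong sign extension becomes visible in the high word of the result.

    #include <stdio.h>

    typedef unsigned int       u32;
    typedef unsigned long long u64;

    __global__ void repro(u64 *out, u32 a, u32 b, u32 c, u32 cy)
    {
        /* cy should be zero-extended to 64 bits here; with the bug described
           above it would be sign-extended instead, corrupting the high word
           whenever cy >= 0x80000000 */
        *out = (u64)a * b + c + cy;
    }

    int main(void)
    {
        u64 *d_out, h_out = 0;
        cudaMalloc((void **)&d_out, sizeof(u64));
        repro<<<1, 1>>>(d_out, 1u, 1u, 0u, 0x80000000u);
        cudaMemcpy(&h_out, d_out, sizeof(u64), cudaMemcpyDeviceToHost);
        /* expected: 0000000080000001
           if cy were sign-extended, this would print ffffffff80000001 instead */
        printf("%016llx\n", h_out);
        cudaFree(d_out);
        return 0;
    }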
I agree with your analysis that the instruction cvt.u64.s32 in the generated PTX is a key component of the incorrect behavior.
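For reference, the conversion in question is of this shape (the register names in the comments are made up, not taken from the actual PTX dump):

    typedef unsigned int       u32;
    typedef unsigned long long u64;

    __device__ u64 widen_carry(u32 cy)
    {
        /* correct PTX for this zero-extension:   cvt.u64.u32 %rd1, %r1;  */
        /* the PTX discussed above instead uses:  cvt.u64.s32 %rd1, %r1;  */
        /* which sign-extends cy whenever its top bit is set              */
        return (u64)cy;
    }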
My experiments indicate that the problem disappears when the Open64 optimization level is dialed down to -O2. I would suggest trying that as a workaround: simply add ‘-Xopencc -O2’ to the nvcc invocation, as in the example below.
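For example (the file name here is just a placeholder):

    nvcc -Xopencc -O2 repro.cu -o repro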
Indeed, -O2 makes it go away. Thanks. By the way, I’ve noticed that -Os, which does not appear to be very different from -O2, also causes the problem. That might be useful for narrowing down the cause.