When I use CUDA release 4.1 with its default compiler (the llvm one) to build and run my program, the output is not the same as what I get with the open64 compiler from version 4.1 or earlier.
I did my best to strip down my large program and ended up with the simple program attached. It may seem nonsensical, but its results differ between the open64 and llvm nvcc compilers.
If I try to simplify it any further, the output difference disappears.
Basically, this little example does some calculations and fills an array in global memory. It uses only 1 block with 1 thread. Running the make.sh script generates a file diff.log, which contains the output difference between the two binaries built with open64 and llvm.
I hope you or a compiler expert can kindly help me find the exact root cause of this problem; it will help me fix the problem in my project.
The platform info is as follows:
[b]OS: CentOS release 5.5
CPU: Intel Xeon X5660
CUDA toolkit: release 4.1[/b]
I deeply appreciate your attention.
code_with_problem_41_llvm.tar.gz (1.48 KB)
I can reproduce the difference in behavior between the two builds on my Linux64 system. The code compiles without warnings, and I did not spot any instances of undefined C behavior being invoked, so this looks like a compiler bug to me.
Please file a bug against the compiler. If you log into the registered developer website partners.nvidia.com, there is a menu on the left side of the starting page; the third item from the top is a link to the bug reporting system. The archive you attached to this forum post is fully sufficient to reproduce the issue, so please attach that to the bug report as well. Thank you for your help, and sorry for the inconvenience. If you could let me know the bug number assigned to your bug report, I would appreciate it.
The problem seems to be with the bitcount() function. If I replace that with the CUDA intrinsic __popc(), the results of the two executables match. I would suggest trying that with your full application.
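For readers who want to try the same substitution, here is a minimal sketch of comparing a software bit count against the `__popc()` intrinsic. The `bitcount()` from the attached archive is not shown in this thread, so `bitcount_sw()` below is a hypothetical stand-in, and the kernel name and test values are likewise assumptions made for illustration. It mirrors the 1 block, 1 thread launch described above.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical software population count (NOT the bitcount() from the
// attached archive, which is not shown in this thread).
__host__ __device__ unsigned int bitcount_sw(unsigned int x)
{
    unsigned int n = 0;
    while (x) {
        x &= x - 1;  // clear the lowest set bit
        ++n;
    }
    return n;
}

// Single-thread kernel: fills two global arrays, one with the software
// count and one with the __popc() intrinsic, so they can be compared.
__global__ void compare_popcount(const unsigned int *in, unsigned int *sw,
                                 unsigned int *hw, int n)
{
    for (int i = 0; i < n; ++i) {
        sw[i] = bitcount_sw(in[i]);
        hw[i] = (unsigned int)__popc(in[i]);
    }
}

int main()
{
    const int n = 4;
    unsigned int h_in[n] = {0u, 1u, 0xF0F0F0F0u, 0xFFFFFFFFu};
    unsigned int h_sw[n], h_hw[n];
    unsigned int *d_in, *d_sw, *d_hw;

    // Error checking is omitted to keep the sketch short.
    cudaMalloc((void **)&d_in, sizeof(h_in));
    cudaMalloc((void **)&d_sw, sizeof(h_sw));
    cudaMalloc((void **)&d_hw, sizeof(h_hw));
    cudaMemcpy(d_in, h_in, sizeof(h_in), cudaMemcpyHostToDevice);

    compare_popcount<<<1, 1>>>(d_in, d_sw, d_hw, n);  // 1 block, 1 thread

    cudaMemcpy(h_sw, d_sw, sizeof(h_sw), cudaMemcpyDeviceToHost);
    cudaMemcpy(h_hw, d_hw, sizeof(h_hw), cudaMemcpyDeviceToHost);

    int mismatches = 0;
    for (int i = 0; i < n; ++i) {
        printf("0x%08x sw=%u hw=%u\n", h_in[i], h_sw[i], h_hw[i]);
        if (h_sw[i] != h_hw[i]) ++mismatches;
    }

    cudaFree(d_in);
    cudaFree(d_sw);
    cudaFree(d_hw);
    return mismatches != 0;  // nonzero exit code if the two counts disagree
}
```

Building this once with each compiler and diffing the printed output, in the same way make.sh produces diff.log, would be one way to check whether a given bit-count routine is affected.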
Thank you very much for your kind reply.
The bug ID: 939870
I tried using __popc() in my original project, but the result still fails. It does work in the code I attached here. However, for this simple code, if I modify something else and tweak it a little, the error can also disappear… I attached another package in which line 27 of lfg_kernel.cu is changed from:
new_fill = new_fill = 0;
to:
new_fill = 0;
new_fill = 0;
The problem disappeared after this modification; the outputs became identical.
Thank you again for your time and your suggestion.
code_with_problem_41_llvm_2.tar.gz (1.47 KB)
Thanks for filing the bug. It is good to hear you found a workaround (a quite surprising one, for sure). As for using __popc() for bitcount(), that may still be worth considering, as it maps to a hardware instruction on sm_2x and up, and thus may result in better performance.
Thank you for your reply.
This modification does not solve my original problem. It only works in this stripped sample code. There are other modifications that produce a similar result.
I mentioned it to show that the symptom is beyond my understanding… As far as I can tell, there is no reason this modification should solve the problem.
Such unexplained modifications, even if they work in this stripped code, do not help me find the cause or the solution.
Thank you, and I am waiting for further results from your side on this problem.
It seems I missed that the workaround you identified only works for the stripped version of the code. Maybe the compiler team can suggest a robust workaround once they have identified the root cause. From inspecting the PTX, it seemed to me that incorrect code was being generated for bitcount(), which is why I tried __popc(), which took care of the problem in my local build.
I got a status update on the bug: it was closed today, and the comment says that the next release will contain the fix.
Would you please tell me the root cause, and whether it is possible to get a binary for testing my original project?
Thanks again for your kind help.
All registered CUDA developers have access to the initial release candidate (RC1) for each new CUDA version, which includes the latest compiler. Questions related to a specific bug are normally handled via the corresponding bug report, so I would suggest you ask there.