Problem with CUDA release 4.1, using default LLVM compiler

susangao · February 12, 2012, 11:32pm

Hi all,

When I build and run my program with CUDA release 4.1 using the default (LLVM-based) compiler, the output differs from what I get with the Open64 compiler in 4.1 or with earlier releases.

I did my best to strip down my large program and arrived at the simple program attached. It may look nonsensical, but its results differ between the Open64 and LLVM nvcc compilers.

If I simplify it any further, the output difference disappears.

Basically, this little example does some calculation and fills an array in global memory, using only 1 block and 1 thread. Running the make.sh script generates a file diff.log, which contains the output difference between the two binaries built with Open64 and LLVM.

I hope to get help from you or from compiler experts in identifying the exact root cause of this problem; that would help me fix the problem in my project.

The platform info is as follows:

OS: CentOS release 5.5

CPU: Intel Xeon X5660

GPU: M2070

CUDA toolkit: release 4.1

I deeply appreciate your attention.

Best Regards,

Susan

code_with_problem_41_llvm.tar.gz (1.48 KB)

njuffa · February 14, 2012, 12:18am

I can reproduce the difference in behavior between the two builds on my Linux64 system. The code compiles without warnings, and I did not spot any instances of undefined C behavior being invoked, so this looks like a compiler bug to me.

Please file a bug against the compiler. If you log into the registered developer website partners.nvidia.com, there is a menu on the left side of the starting page; the third item from the top is a link to the bug reporting system. The archive you attached to this forum post is fully sufficient to reproduce the issue, so please attach that to the bug report as well. Thank you for your help, and sorry for the inconvenience. If you could let me know the bug number assigned to your bug report, I would appreciate it.

The problem seems to be with the bitcount() function. If I replace that with the CUDA intrinsic __popc(), the results of the two executables match. I would suggest trying that with your full application.
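As an illustration of the substitution described above, here is a minimal sketch. The original bitcount() is only in the attached archive and is not shown in this thread, so `bitcount_sw` below is a hypothetical stand-in (a common clear-lowest-set-bit loop), and the `HD` macro is an assumption that lets the same file build as plain C++ or under nvcc. Only `__popc()` is an actual CUDA intrinsic; it takes an `unsigned int` and returns the number of set bits.

```cpp
#include <cstdint>

// HD expands to CUDA qualifiers under nvcc and to nothing in plain C++.
#ifdef __CUDACC__
#define HD __host__ __device__
#else
#define HD
#endif

// Hypothetical software bit count (NOT the code from the attachment):
// each iteration clears the lowest set bit, so the loop runs once per set bit.
HD inline unsigned int bitcount_sw(std::uint32_t x) {
    unsigned int n = 0;
    while (x != 0) {
        x &= x - 1;
        ++n;
    }
    return n;
}

// Same interface; device code uses the __popc() intrinsic,
// host code falls back to the software loop.
HD inline unsigned int bitcount(std::uint32_t x) {
#ifdef __CUDA_ARCH__
    return static_cast<unsigned int>(__popc(x));
#else
    return bitcount_sw(x);
#endif
}
```

Keeping the software version around makes it easy to compare the two paths on the same inputs when checking whether a mismatch comes from this function.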

susangao · February 14, 2012, 1:58am

Thank you very much for your kind reply.

The bug ID: 939870

I tried using __popc() in my original project, but the result is still wrong. It does work in the code I attached here. However, for this simple code, the error also disappears if I modify something else slightly. I have attached another package in which line 27 of lfg_kernel.cu is changed from:

new_fill[1] = new_fill[0] = 0;

to

new_fill[1] = 0;
new_fill[0] = 0;

After this modification the problem disappeared and the outputs became identical.
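For reference, the two forms are equivalent in C and C++: a chained assignment evaluates right to left, so both elements end up zero either way, which is why a change in output between them points at the compiler rather than at the source. A minimal sketch, assuming `new_fill` is a two-element `unsigned int` array (its real type is not shown in the thread) and using hypothetical helper names:

```cpp
// HD expands to CUDA qualifiers under nvcc and to nothing in plain C++.
#ifdef __CUDACC__
#define HD __host__ __device__
#else
#define HD
#endif

// Form used on the original line 27: new_fill[0] is assigned 0,
// then the value of that assignment (0) is stored in new_fill[1].
HD inline void clear_chained(unsigned int* new_fill) {
    new_fill[1] = new_fill[0] = 0;
}

// Form used in the second attachment: two independent stores.
HD inline void clear_separate(unsigned int* new_fill) {
    new_fill[1] = 0;
    new_fill[0] = 0;
}
```

Both helpers leave the array in the same state, so a correct compiler must produce the same observable result for either one.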

Thank you again for your time and your suggestion.

Best Regards,

Susan

code_with_problem_41_llvm_2.tar.gz (1.47 KB)

njuffa · February 14, 2012, 2:22am

Thanks for filing the bug. It is good to hear you found a workaround (a quite surprising one, for sure). As for using __popc() for bitcount(), that may still be worth considering, as it maps to a hardware instruction on sm_2x and up and may thus result in better performance.

susangao · February 14, 2012, 2:36am

Thank you for your reply.

This modification does not solve my original problem. It only works in this stripped-down sample code. There are other modifications that produce a similar result.

I mention it to show that the symptom is beyond my understanding… To me, there is no reason this modification should solve the problem.

Even if such unexplained modifications work in this stripped-down code, they do not help me find the cause or the solution.

Thank you. I look forward to further results from your side on this problem.

Best Regards,
Susan

njuffa · February 14, 2012, 3:08am

Seems I missed that the workaround you identified only works for the stripped version of the code. Maybe the compiler team can suggest a robust workaround once they have identified the root cause. From inspecting the PTX it seemed to me that incorrect code was being generated for bitcount(), which is why I tried __popc(), which took care of the problem in my local build.

susangao · February 24, 2012, 11:47pm

Hi,

I got a status update on the bug: it was closed today, and the comment says that the next release will contain the fix.

Would you please tell me the root cause and whether it is possible to get a binary for testing my original project?

Thanks again for your kind help.

Best Regards,
Susan

njuffa · February 25, 2012, 12:56pm

All registered CUDA developers have access to the initial release candidate (RC1) for each new CUDA version, which includes the latest compiler. Questions related to a specific bug are normally handled via the corresponding bug report, so I would suggest you ask there.
