'cicc' compilation error and debug flag

Hi,

When compiling my code in release mode, I stumbled across a cryptic compilation error:

Stack dump:
0. Running pass ‘NVPTX DAG->DAG Pattern Instruction Selection’ on function ‘@_Z10kernel_bugIfLj4ELj32ELj64EEvP8DataIT_XT0_EXT1_EXT2_EE’
nvcc error : ‘cicc’ died due to signal 11 (Invalid memory reference)
nvcc error : ‘cicc’ core dumped

This was tested on Linux with:

GPU: GeForce GT 650M
Driver Version: 313.09
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2012 NVIDIA Corporation
Built on Fri_Sep_21_17:28:58_PDT_2012
Cuda compilation tools, release 5.0, V0.2.1221

I could not find any information on this error anywhere (it is mentioned in the “nvcc error” thread on the CUDA Programming and Performance board of these forums, though).

What really surprised me is the fact that compiling with the “-G” debug flag “solves” the error. How am I supposed to find the actual source of the problem?

Since this error happened in a large project, I cannot provide an example code yet. I will post one as soon as I can.
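In the meantime, for reference, the mangled name from the stack dump demangles to roughly void kernel_bug<float, 4u, 32u, 64u>(Data<float, 4u, 32u, 64u>*). A bare skeleton of that shape (the Data layout, kernel body, and launch code below are placeholders I made up, not the actual project code) looks like this:

// Hypothetical skeleton matching only the demangled signature; the real
// Data layout and kernel body from the project are not shown here.
template <typename T, unsigned N1, unsigned N2, unsigned N3>
struct Data
{
    T values[N1 * N2 * N3];
};

template <typename T, unsigned N1, unsigned N2, unsigned N3>
__global__ void kernel_bug(Data<T, N1, N2, N3> *d)
{
    unsigned i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < N1 * N2 * N3)
        d->values[i] = T(0);   // placeholder body: touch one element per thread
}

// Host-side launch that forces the instantiation named in the stack dump:
// void kernel_bug<float, 4u, 32u, 64u>(Data<float, 4u, 32u, 64u>*)
void launch(Data<float, 4u, 32u, 64u> *d)
{
    kernel_bug<float, 4u, 32u, 64u><<<64, 128>>>(d);
}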


Signal 11 would indicate a memory access out of bounds, which should not happen and would point to a bug inside the compiler. I would suggest the following:

(1) Please double-check that you are running the compiler from the CUDA 5.0 final release (as opposed to one of the 5.0 release candidates).

(2) Please double-check that you are not using older CUDA header files with the CUDA 5.0 compiler. There have been multiple reports of compiler crashes caused by the inadvertent use of the CUDA 5.0 compiler with CUDA 4.2 header files, which were resolved by installing the correct CUDA 5.0 files; a quick header check is sketched below.
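As a quick sanity check for (2), you can compile and run a trivial program that prints CUDART_VERSION. The macro comes from whichever cuda_runtime_api.h the compiler actually picks up, so it should report 5000 with the CUDA 5.0 headers, whereas leftover CUDA 4.2 headers would report 4020. A minimal sketch:

// Minimal header check (sketch): CUDART_VERSION is defined by the
// cuda_runtime_api.h that the toolchain actually finds on the include path.
#include <cstdio>
#include <cuda_runtime_api.h>

int main()
{
    printf("CUDART_VERSION = %d\n", CUDART_VERSION);   // expect 5000 for CUDA 5.0
    return 0;
}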

If nothing questionable turns up during these checks, it is reasonable to assume that there is a problem with an out-of-bounds memory access inside the compiler. In that case, please file a bug report through the registered developer website, attaching self-contained repro code. Thank you.

Thanks for the information!

(1) I installed CUDA with the Arch Linux community package (this one: [url]https://www.archlinux.org/packages/community/x86_64/cuda/[/url]). The current version (displayed by Arch Linux) is 5.0.35-3. I will try to install it some other way, just in case.

(2) This is actually a recent install, I only installed CUDA 5.0 so this should not be an issue.

I will post a self-contained repro code as soon as I have one. Thanks again for the help!

I am not familiar with the Arch Linux community package. I would suggest downloading NVIDIA’s installation package for the supported Linux distribution of your choice from this website:

[url]https://developer.nvidia.com/cuda-downloads[/url]

Please report bugs through the registered developer website. You can reach it via the following page, for example:

[url]https://developer.nvidia.com/cuda-toolkit[/url]

If you scroll down a bit you can see where it says:

Members of the CUDA Registered Developer Program can report issues and file bugs
Login or Join Today

“Login” and “Join Today” are clickable links. If you are not a registered developer yet, the sign-up process is straightforward, and in general a registration request will be approved within one business day. Let me know should you encounter an undue delay.

I registered and I am waiting for the approval. I also managed to create a repro code with some specific compilation flags. I will post a bug report as soon as I have access to the bug report system.

This was indeed a compiler bug. It should be fixed in the next CUDA release. Thanks again for the help!

Thanks to you.

I also hit this in CUDA 5.0.
Adding -G to the nvcc command line did indeed work (and considerably shortened the cicc phase),
but -G took my kernel’s run time from about 6.8 milliseconds to about 220 milliseconds.
An alternative workaround is to remove -arch=sm_20.
Bill
ps: in this case, removing -arch=sm_20 made only about a one percent change to the kernel’s run time
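For anyone who wants to make the same comparison on their own kernel, a minimal timing sketch using CUDA events (the kernel and launch configuration below are placeholders, not my actual code) is:

// Sketch: time a kernel with CUDA events to compare -G vs. release builds.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void my_kernel(float *data, int n)   // placeholder kernel
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] *= 2.0f;
}

int main()
{
    const int n = 1 << 20;
    float *d_data;
    cudaMalloc(&d_data, n * sizeof(float));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    my_kernel<<<(n + 255) / 256, 256>>>(d_data, n);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("kernel time: %.3f ms\n", ms);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d_data);
    return 0;
}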

It looks like this can still happen in CUDA 6.0.
I have reported it. (The bug ID is 1600042.)
Bill

I’m met with a similar error on CUDA 12 (Gentoo Linux, 1660 Ti mobile), but this time it’s signal 6, and the only solution is to use -G without -dopt=on, which removes all optimizations, if I understand correctly.
Also, LLVM is complaining about not having enough memory, even though the memory isn’t even full. It happens with caffe2’s aten/src/ATen/native/cuda/Sort.cu, from inside PyTorch.

The specific error messages produced (use cut & paste to post them here) likely convey important information. Is the complaint about running out of system memory, or disk space in a particular partition perhaps? The latter is a more common scenario in my experience. How much system memory is in this system, and how much does the compiler use during the build? Is the build itself parallelized, and could memory usage be reduced by reducing the degree of parallelization?

“memory isn’t even full” doesn’t mean much. For a hypothetical scenario, assume that 768 MB of system memory are still available, but that the compiler now needs to allocate (based on some characteristic of the code it is compiling) a block of 1 GB. This allocation would fail even though the “memory isn’t even full”.

Use of -G turns off all compiler optimizations, as you noted.

Failed compile log: #$ _NVVM_BRANCH_=nvvm#$ _SPACE_= #$ _CUDART_=cudart#$ _HERE_=/opt/cuda/bin - Pastebin.com
Successful compile log: nvcc warning : '--device-debug (-G)' overrides '--generate-line-info (-lineinfo) - Pastebin.com
It uses ~7 GiB of RAM at most, I have 28 GiB available, and my make flags have -j12 in them, if that’s of any importance.

As I said, you may want to try to reduce the level of build parallelism. Maybe start with -j1 to see whether the build succeeds when run in serial fashion.

LLVM ERROR: out of memory
nvcc error   : 'cicc' died due to signal 6 
nvcc error   : 'cicc' core dumped

So it looks like LLVM runs out of memory in some unspecified way, sees no way to continue under these circumstances, and then terminates itself abnormally with SIGABRT. To the best of my recollection, I have never come across this scenario. Unfortunately, the internet is full of instances of this error message, and I have yet to determine a predominant root cause.


OK, will do and report back.
Update: the same situation as the previous failure.

Other standard advice when dealing with strange compiler issues is to try the latest available toolchain (CUDA 12.1 at present) to see if things work better with that.

Got it

After updating to the latest CUDA and cuDNN, I get this:

nvcc error   : 'cicc' died due to signal 11 (Invalid memory reference)
nvcc error   : 'cicc' core dumped

I think the usual suggestion at this point would be to create a short, self-contained, complete test case, that reproduces the issue, and file a bug.
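Roughly speaking, such a test case is a single .cu file that needs nothing outside itself to compile and still triggers the cicc failure with a one-line nvcc command. Something shaped like this (the contents below are placeholders; only the self-contained structure matters):

// repro.cu -- shape of a self-contained repro case (placeholder contents).
// Record the exact command that triggers the crash in a comment, e.g.:
//   nvcc -c -O3 -arch=sm_75 repro.cu
#include <cstdio>
#include <cuda_runtime.h>

template <typename T, int N>
__global__ void repro_kernel(T *out)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < N)
        out[i] = static_cast<T>(i);
}

int main()
{
    float *d_out;
    cudaMalloc(&d_out, 256 * sizeof(float));
    repro_kernel<float, 256><<<1, 256>>>(d_out);
    cudaDeviceSynchronize();
    cudaFree(d_out);
    printf("done\n");
    return 0;
}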

I don’t really write CUDA code though…
RIP.

Or should I use the files at hand?