What really surprised me is that compiling with the “-G” debug flag “solves” the error. How am I supposed to find the actual source of the problem?
Since this error happened in a large project, I cannot provide example code yet. I will post some as soon as I can.
Signal 11 would indicate an out-of-bounds memory access, which should not happen and would point to a bug inside the compiler. I would suggest the following:
(1) Please double-check that you are running the compiler from the CUDA 5.0 final release (as opposed to one of the 5.0 release candidates).
(2) Please double-check that you are not using older CUDA header files with the CUDA 5.0 compiler. There have been multiple reports of compiler crashes caused by the inadvertent use of the CUDA 5.0 compiler with CUDA 4.2 header files, which were resolved by installing the correct CUDA 5.0 header files.
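For both checks, something along these lines should tell you what is actually being picked up (the paths assume a default /usr/local/cuda install location; adjust for your system):

    # Which nvcc is on the PATH, and which release does it report?
    which nvcc
    nvcc --version

    # CUDA_VERSION should read 5000 for CUDA 5.0;
    # 4020 would indicate stale CUDA 4.2 headers
    grep "define CUDA_VERSION" /usr/local/cuda/include/cuda.h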
If nothing questionable turns up during these checks, it is reasonable to assume that there is a problem with an out-of-bounds memory access inside the compiler. In that case, please file a bug report through the registered developer website, attaching self-contained repro code. Thank you.
(1) I installed CUDA with the Arch Linux community package (this one: https://www.archlinux.org/packages/community/x86_64/cuda/). The current version (displayed by Arch Linux) is 5.0.35-3. I will try to install it some other way, just in case.
(2) This is actually a fresh install; I have only ever installed CUDA 5.0, so mixed header files should not be an issue.
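For reference, this is roughly how the install can be checked (assuming the Arch package’s usual /opt/cuda prefix):

    # Package version as recorded by the package manager
    pacman -Qi cuda

    # Release string reported by the compiler itself
    /opt/cuda/bin/nvcc --version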
I will post a self-contained repro code as soon as I have one. Thanks again for the help!
I am not familiar with the Arch Linux community package. I would suggest downloading NVIDIA’s installation package for the supported Linux distribution of your choice from this website:
If you scroll down a bit you can see where it says:
Members of the CUDA Registered Developer Program can report issues and file bugs
Login or Join Today
“Login” and “Join Today” are clickable links. If you are not a registered developer yet, the sign-up process is straightforward, and in general a registration request will be approved within one business day. Let me know should you encounter an undue delay.
I registered and am waiting for approval. I also managed to create repro code with some specific compilation flags. I will post a bug report as soon as I have access to the bug report system.
I also hit this in CUDA 5.0. Adding -G to the nvcc command line did indeed work (and considerably shortened the cicc phase), but -G took my kernel’s run time from about 6.8 milliseconds to about 220 milliseconds.
An alternative workaround is to remove -arch=sm_20.
Bill
PS: in this case, removing -arch=sm_20 changed the kernel’s run time by only about one percent.
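To spell out the two workarounds as command lines (kernel.cu is just a placeholder for the actual source file):

    # Workaround 1: debug build; avoids the crash but disables optimizations
    nvcc -G -arch=sm_20 -o kernel kernel.cu

    # Workaround 2: drop -arch=sm_20 and let nvcc fall back to its default target
    nvcc -o kernel kernel.cu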
I’m hitting a similar error on CUDA 12 (Gentoo Linux, GeForce 1660 Ti mobile), but this time it’s signal 6, and the only fix I have found is to compile with -G and without -dopt=on, which removes all optimizations, if I understand correctly.
Also, LLVM complains about not having enough memory, even though system memory isn’t full. It happens with caffe2’s aten/src/ATen/native/cuda/Sort.cu, from inside pytorch.
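Roughly, assuming a single-file compile of the offending translation unit, the behavior is:

    # Crashes cicc with signal 6:
    nvcc -c Sort.cu

    # The only variant that gets through, with all optimizations disabled:
    nvcc -G -c Sort.cu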
The specific error messages produced (use cut & paste to post them here) likely convey important information. Is the complaint about running out of system memory, or perhaps about disk space in a particular partition? The latter is the more common scenario in my experience. How much system memory does this system have, and how much of it does the compiler use during the build? Is the build itself parallelized, and could memory usage be reduced by reducing the degree of parallelization?
“memory isn’t even full” doesn’t mean much. For a hypothetical scenario, assume that 768 MB of system memory are still available, but that the compiler now needs to allocate (based on some characteristic of the code it is compiling) a block of 1 GB. This allocation would fail even though the “memory isn’t even full”.
Use of -G turns off all compiler optimizations, as you noted.
As I said, you may want to try reducing the level of build parallelism. Maybe start with -j1 to see whether the build succeeds when run serially; a sketch follows below.
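Since the failing file is part of a pytorch build, something along these lines should force a serial build (pytorch’s setup.py honors the MAX_JOBS environment variable; for a plain make-based project, make -j1 is the equivalent):

    # pytorch: limit compilation to a single job
    MAX_JOBS=1 python setup.py install

    # plain make-based project: serial build
    make -j1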
LLVM ERROR: out of memory
nvcc error : 'cicc' died due to signal 6
nvcc error : 'cicc' core dumped
So it looks like LLVM runs out of memory in some unspecified way, sees no way to continue under these circumstances, and then terminates itself abnormally with SIGABRT (signal 6). To the best of my recollection, I have never come across this scenario. Unfortunately, the internet is full of instances of this error message, and I have yet to determine a predominant root cause.
Other standard advice when dealing with strange compiler issues is to try the latest available toolchain (CUDA 12.1 at present) to see if things work better with that.