Illegal instruction error, but the instruction seems to be okay?

ragerdl · October 7, 2019, 2:49pm

Hi,

Background

When I enable debugging (-G) and some extra code on my end (think extra runtime checks), I’m able to reliably make my program encounter a “CUDA Exception: Warp Illegal Instruction”. Thing is, I don’t get at all how the instruction executed is an illegal instruction. I poked around for combinations of “malloc”, “call.abs.noinc”, “warp illegal instruction”, “cuda”, etc. but haven’t yet found anything helpful. I ran cuda-memcheck ./my_program, and no errors were reported (not that I’m implying my code is bug-free – actually, I’m trying to find the source of a particular bug, but I’ve encountered the illegal instruction exception before introducing that particular bug).

Questions
First off, (1) which instruction is causing the error? Is it the one pointed to by the arrow (MOV R6, R4), or is it the instruction before it (CALL.ABS.NOINC)?

Secondly, (2) if it’s the call instruction, what’s that formulation mean? I’d think that “abs” would be short for absolute value, and “inc” would be short for increment, but I don’t know how to combine those two concepts. If it’s the mov instruction, then I’m really lost, as the kernel has somewhere between 64 and 135 registers allocated to it.

Details

Below I provide a log. I can also generate the same error and equivalent information using Eclipse Nsight. This is on Oracle Linux x86_64, kernel 4.14.35-1844.3.2.el7uek.x86_64, with CUDA 10.1.

An elided version of the offending line of code is:

an_array_of_pointers[0] = (Foo *) malloc(sizeof(Foo) * (THRESHOLD_1 - THRESHOLD_0));
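
For readers following along, here is a minimal sketch of a device-side malloc in the same shape as the line above. It is not the original code and not a diagnosis of the exception: `Foo`, `THRESHOLD_0`, `THRESHOLD_1`, the kernel name, and the 64 MB heap size are hypothetical stand-ins. It only shows the usual precautions, which are checking the pointer returned by in-kernel `malloc` and checking for errors after the launch.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical stand-ins for the names in the elided line above.
struct Foo { int value; };
constexpr int THRESHOLD_0 = 0;
constexpr int THRESHOLD_1 = 64;

// One thread allocates from the device heap and reports whether malloc failed.
__global__ void allocate_on_device(int *malloc_failed)
{
    if (blockIdx.x == 0 && threadIdx.x == 0) {
        Foo *an_array_of_pointers[1];
        an_array_of_pointers[0] = (Foo *) malloc(sizeof(Foo) * (THRESHOLD_1 - THRESHOLD_0));
        if (an_array_of_pointers[0] == nullptr) {
            *malloc_failed = 1;   // device malloc returns NULL when the heap is exhausted
            return;
        }
        an_array_of_pointers[0][0].value = 42;
        free(an_array_of_pointers[0]);
    }
}

// Prints a message and returns false if a runtime call failed.
static bool ok(cudaError_t err, const char *what)
{
    if (err != cudaSuccess) {
        fprintf(stderr, "%s: %s\n", what, cudaGetErrorString(err));
        return false;
    }
    return true;
}

// Returns 0 on success, 1 if device malloc returned NULL, -1 on a CUDA error.
int run_device_malloc_check()
{
    // The device heap defaults to 8 MB; the limit has to be set before the
    // first kernel that uses malloc is launched. 64 MB is an arbitrary choice.
    if (!ok(cudaDeviceSetLimit(cudaLimitMallocHeapSize, 64u * 1024u * 1024u),
            "cudaDeviceSetLimit")) return -1;

    int *d_failed = nullptr;
    if (!ok(cudaMalloc(&d_failed, sizeof(int)), "cudaMalloc")) return -1;
    if (!ok(cudaMemset(d_failed, 0, sizeof(int)), "cudaMemset")) return -1;

    allocate_on_device<<<1, 32>>>(d_failed);
    if (!ok(cudaGetLastError(), "kernel launch")) return -1;
    // Errors raised while the kernel runs are reported by the synchronizing call.
    if (!ok(cudaDeviceSynchronize(), "kernel execution")) return -1;

    int failed = 0;
    if (!ok(cudaMemcpy(&failed, d_failed, sizeof(int), cudaMemcpyDeviceToHost),
            "cudaMemcpy")) return -1;
    cudaFree(d_failed);
    return failed;
}

int main()
{
    int result = run_device_malloc_check();
    printf("device malloc check returned %d\n", result);
    return result == 0 ? 0 : 1;
}
```

Building this with and without -G is one way to check whether a bare device-side malloc behaves the same in both configurations on a given setup.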

Thanks,
David

Log

[build]$ cuda-gdb ./my_program
NVIDIA (R) CUDA Debugger
10.1 release
Portions Copyright (C) 2007-2018 NVIDIA Corporation
GNU gdb (GDB) 7.12
Copyright (C) 2016 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later http://gnu.org/licenses/gpl.html
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law. Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-pc-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
http://www.gnu.org/software/gdb/bugs/.
Find the GDB manual and other documentation resources online at:
http://www.gnu.org/software/gdb/documentation/.
For help, type "help".
Type "apropos word" to search for commands related to "word"…
Reading symbols from ./my_program…done.
(cuda-gdb) run
Starting program: /home/…/my_program
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
[New Thread 0x7fffef822700 (LWP 93948)]
[New Thread 0x7fffeefd0700 (LWP 93949)]
… program output …
CUDA Exception: Warp Illegal Instruction

Thread 4 "my_program" received signal CUDA_EXCEPTION_4, Warp Illegal Instruction.
[Switching focus to CUDA kernel 11, grid 42, block (22,0,0), thread (80,0,0), device 0, sm 44, warp 3, lane 16]
0x00000000016439c0 in Bar::insert (this=0xfffeffff62f9a620, cat_id=30730, site_id=0) at /home/…/bar.h:114
114

(cuda-gdb) info registers
R0 0x10000 65536
R1 0xffd708 16766728
R2 0xfffeffff -65537
R3 0x10000 65536
R4 0xfffeffff -65537
R5 0x10 16
R6 0x7128aa00 1898490368
R7 0x10000 65536
R8 0x0 0
R9 0x7ffc 32764

pc 0x16439c0 0x16439c0 <Bar::insert(int, int)+7104>

(cuda-gdb) disassemble 0x16439c0
Dump of assembler code for function Bar::insert(int, int):
…
0x00000000016439b0 <+7088>: CALL.ABS.NOINC 0x0
=> 0x00000000016439c0 <+7104>: MOV R6, R4
…
End of assembler dump.
(cuda-gdb)

ragerdl · October 7, 2019, 3:11pm

It may be worth noting that if the instruction pointer points to the next instruction to execute rather than the last instruction executed, it's possible (though unlikely) that an instruction somewhere else jumped to 0x16439c0. I think that's unlikely because the stack backtrace looks appropriate.

andrewrobbins · March 14, 2024, 4:43pm

I’ve just started receiving:

========= Illegal instruction
========= at __cuda_sm70_votesync_ballot+0x50
========= by thread (0,0,0) in block (36,0,0)
========= Device Frame:__ballot_sync(unsigned int, int)+0x150 in /usr/local/cuda/include/sm_30_intrinsics.hpp:134

This happens on a __ballot_sync call when compiling with -G in CUDA 12.4. I didn't get this error in CUDA 12.3 with -G, and the code works fine in both versions when -G is not specified. I've checked that all threads in the warp call it and that the mask is correct. To me this looks like a bug in CUDA 12.4.
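
A minimal sketch of what a standalone test case for this could look like, assuming a 1D launch with full warps and a full mask. This is not the failing application code; the kernel name, launch shape, and predicate are hypothetical. Every thread in each warp reaches the __ballot_sync call, and even lanes vote true.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Every lane of every warp calls __ballot_sync with a mask naming all 32 lanes.
__global__ void ballot_kernel(unsigned int *votes_per_warp)
{
    int predicate = (threadIdx.x % 2 == 0);            // even lanes vote true
    unsigned int votes = __ballot_sync(0xffffffffu, predicate);
    if (threadIdx.x % 32 == 0) {                       // lane 0 stores its warp's result
        int warps_per_block = blockDim.x / 32;
        votes_per_warp[blockIdx.x * warps_per_block + threadIdx.x / 32] = votes;
    }
}

// Returns the number of warps whose ballot differs from the expected
// even-lane pattern, or -1 if any CUDA call reports an error.
int run_ballot_check()
{
    const int blocks = 2;
    const int threads = 64;                            // multiple of 32: only full warps
    const int warps = blocks * threads / 32;

    unsigned int *d_votes = nullptr;
    if (cudaMalloc(&d_votes, warps * sizeof(unsigned int)) != cudaSuccess) return -1;

    ballot_kernel<<<blocks, threads>>>(d_votes);
    cudaError_t err = cudaGetLastError();
    if (err == cudaSuccess) err = cudaDeviceSynchronize();
    if (err != cudaSuccess) {
        fprintf(stderr, "kernel failed: %s\n", cudaGetErrorString(err));
        return -1;
    }

    unsigned int h_votes[warps];
    if (cudaMemcpy(h_votes, d_votes, sizeof(h_votes), cudaMemcpyDeviceToHost) != cudaSuccess)
        return -1;
    cudaFree(d_votes);

    int mismatches = 0;
    for (int w = 0; w < warps; ++w) {
        if (h_votes[w] != 0x55555555u) ++mismatches;   // bits 0, 2, 4, ... set
    }
    return mismatches;
}

int main()
{
    int result = run_ballot_check();
    printf("ballot check returned %d\n", result);
    return result == 0 ? 0 : 1;
}
```

Building the same file once with -G and once without, under both toolkit versions, would show whether the error depends only on the debug build.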

AKravets · March 15, 2024, 8:18am

Hi @andrewrobbins
Thank you very much for reporting the issue. Could you please confirm a few details?

Do you see this error only when running the application under debugger (cuda-gdb) or is it reproducible without the debugger as well?
Could you share the output of the following commands (for the setup, where you see this issue):
- nvcc --version
- nvidia-smi
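
The same version information can also be read from inside a program. The following is a minimal sketch, assuming device 0 is the GPU in question; it complements the two commands above rather than replacing them.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Prints the CUDA runtime version, the driver's supported CUDA version, and
// the name and compute capability of device 0. Returns 0 on success, -1 on error.
int print_cuda_versions()
{
    int runtime_version = 0;
    int driver_version = 0;
    if (cudaRuntimeGetVersion(&runtime_version) != cudaSuccess) return -1;
    if (cudaDriverGetVersion(&driver_version) != cudaSuccess) return -1;

    // Both values are encoded as 1000 * major + 10 * minor.
    printf("runtime: %d.%d\n", runtime_version / 1000, (runtime_version % 1000) / 10);
    printf("driver supports CUDA: %d.%d\n", driver_version / 1000, (driver_version % 1000) / 10);

    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, 0) != cudaSuccess) return -1;  // device 0 is an assumption
    printf("device 0: %s, compute capability %d.%d\n", prop.name, prop.major, prop.minor);
    return 0;
}

int main()
{
    return print_cuda_versions() == 0 ? 0 : 1;
}
```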

veraj · April 16, 2024, 6:47am

Hi, @andrewrobbins

Could you provide a repro and the exact nvcc and driver versions you used?

veraj · June 28, 2024, 10:58am

This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.
