Why do I get CUDA_EXCEPTION_10, Device Illegal Address?

I have the following PTX code:

ld.param.u64 %r2, [param_0 + 0];
ld.param.s32 %r3, [param_1 + 0];
ld.global.s32 %r4, [%r2 + 12];
setp.ge.s32 %p0, %r3, %r4;
@%p0 bra L1;
bra L2;

and here is the trace of it in cuda-gdb

Can someone please tell me why I might be geting “CUDA_EXCEPTION_10, Device Illegal Address.” even when I can access the device memory fine in cuda-gdb?

(cuda-gdb) si
0x00007ffff04af688 in samplePTXFunc<<<(1,1,1),(1,1,1)>>> ()
1: x/i $pc
=> 0x7ffff04af688 <samplePTXFunc+8>: MOV R1, c[0x0][0x44]
(cuda-gdb)
0x00007ffff04af690 in samplePTXFunc<<<(1,1,1),(1,1,1)>>> ()
1: x/i $pc
=> 0x7ffff04af690 <samplePTXFunc+16>: MOV R0, c[0x0][0x140]
(cuda-gdb)
0x00007ffff04af698 in samplePTXFunc<<<(1,1,1),(1,1,1)>>> ()
1: x/i $pc
=> 0x7ffff04af698 <samplePTXFunc+24>: IADD R2.CC, R0, 0xc
(cuda-gdb)
0x00007ffff04af6a0 in samplePTXFunc<<<(1,1,1),(1,1,1)>>> ()
1: x/i $pc
=> 0x7ffff04af6a0 <samplePTXFunc+32>: IADD.X R3, RZ, c[0x0][0x144]
(cuda-gdb)
0x00007ffff04af6a8 in samplePTXFunc<<<(1,1,1),(1,1,1)>>> ()
1: x/i $pc
=> 0x7ffff04af6a8 <samplePTXFunc+40>: LD.E R0, [R2]
(cuda-gdb)
Program received signal CUDA_EXCEPTION_10, Device Illegal Address.
0x00007ffff04af6b0 in samplePTXFunc<<<(1,1,1),(1,1,1)>>> ()
1: x/i $pc
=> 0x7ffff04af6b0 <samplePTXFunc+48>: ISETP.LE.AND P0, PT, R0, c[0x0][0x148], PT
(cuda-gdb) p/x $R2
$1 = 0x2058000c
(cuda-gdb) p *(@global int *) 0x2058000c
$2 = 9
(cuda-gdb) p $R0
$3 = 0
(cuda-gdb)

Thanks!

Hi chanakya, CUDA_EXCEPTION_10 (Device Illegal Address) occurs when a CUDA kernel attempts to access an illegal memory address (an address that is not mapped to a page, to be specific).

From your log above, you are tracking this down correctly in cuda-gdb. But there’s one additional piece of information that you need to collect: just after you single-step (using si) into the LD.E R0, [R2] instruction, you will also want to read the value in R3 (p/x $R3). The .E component of instruction means that the address is held in a pair of registers (for a full 40-bit VA access from the SM). So R3 holds the upper 32-bits and R2 holds the lower 32-bits of the address. Because $R2 is legitimate (the 32-bit address is valid), cuda-gdb can read from that address without any problems. But your program is actually accessing the R3,R2 address pair, so that’s why your kernel hits this exception.

From the looks of it, it seems that the input to this program may be the problem. It looks like the value passed in param_0 may have the wrong data in the upper 32-bits.

Please let me know if you are able to resolve this based on the information above.

Thanks for your reply, geoffg. It is very helpful to understand what is going on.

On the host side that sets up kernel arguments, I have the following:

324 ((CUdeviceptr) &_kernelArgBuffer[_bufferOffset]) = arrayArgOnDev;
(cuda-gdb)
336 _bufferOffset += sizeof(arrayArgOnDev);
(cuda-gdb) p/x *(unsigned long long *)&_kernelArgBuffer[0]
$2 = 0x220580038
(cuda-gdb) ptype arrayArgOnDev
type = unsigned long long

Once I break during execution of the kernel,

(cuda-gdb)
0x00007ffff045ff78 in samplePTXFunc<<<(1,1,1),(1,1,1)>>> ()
1: x/i $pc
=> 0x7ffff045ff78 <samplePTXFunc+40>: LD.E R0, [R2]
(cuda-gdb) p/x $R3
$3 = 0x2
(cuda-gdb) p/x $R2
$4 = 0x20580044
(cuda-gdb) p/x $R0
$7 = 0x20580038
(cuda-gdb) p/x $R1
$8 = 0xfffc40
(cuda-gdb) p *(@global int *) 0x220580044
Error: Failed to read global memory at address 0x220580044 on device 0 sm 11 warp 0 lane 0 (error=7).
(cuda-gdb) p *(@global int *) 0x20580044
$5 = 9

I was expecting 0x220580044 to contain 9.

What am I doing wrong? I am using driver version 319 on Geforce GTX 780.

Thanks again!

This is good news – the problem is what we thought. R3 has a value of 2 in it, and that creates an address that is inaccessible (and CUDA-GDB agrees with the execution of your kernel).

That value of 2 is coming from the host side of your program (thanks for posting that piece). As you’ve printed:

(cuda-gdb) p/x *(unsigned long long *)&_kernelArgBuffer[0]
$2 = 0x220580038

Right there, the address matches the value that ends up in R3/R2, which appears to be incorrect. Can you print the value of arrayArgOnDev, and show the snippet where it is allocated?

The value in R2/R3 i.e., 0x220580044 is the correct one to be loaded from. This because 0xc is added to 0x20580038. This addition of the offset is as expected/coded. Pl. see the PTX code in my first post. Let me reproduce the debug session. Just to recap, I am trying to debug the reason for the failure of kernel launch which it appears to be due incorrect access of LD.E “instruction”.

Just side note: I tried using cuMemAlloc to allocate device memory, followed by cuMemcpyHtoD. The behavior was no different.

// Host side

307 status = cuMemHostRegister(arg, argSize, CU_MEMHOSTREGISTER_DEVICEMAP);
(cuda-gdb) n
308 if (status != CUDA_SUCCESS) {
(cuda-gdb)
315 status = cuMemHostGetDevicePointer(&arrayArgOnDev, arg, 0);
(cuda-gdb)
316 if (status != CUDA_SUCCESS) {
(cuda-gdb) p/x arrayArgOnDev
$1 = 0x220580038
(cuda-gdb) n
323 ((CUdeviceptr) &_kernelArgBuffer[_bufferOffset]) = arrayArgOnDev;
(cuda-gdb) n
335 _bufferOffset += sizeof(arrayArgOnDev);
(cuda-gdb) p/x _bufferOffset
$2 = 0x0
(cuda-gdb) p/x ((CUdeviceptr) &_kernelArgBuffer[_bufferOffset])
$3 = 0x220580038
(cuda-gdb) b samplePTXFunc
Breakpoint 2 at 0x7ffff049f600
(cuda-gdb) c
Continuing.
[Switching focus to CUDA kernel 0, grid 1, block (0,0,0), thread (0,0,0), device 0, sm 11, warp 0, lane 0]

// Device code debugging

Breakpoint 2, 0x00007ffff049f600 in samplePTXFunc<<<(1,1,1),(1,1,1)>>> ()
(cuda-gdb) disp/i $pc
1: x/i $pc
=> 0x7ffff049f600 : Cannot disassemble instruction
(cuda-gdb) si
0x00007ffff049f608 in samplePTXFunc<<<(1,1,1),(1,1,1)>>> ()
1: x/i $pc
=> 0x7ffff049f608 <samplePTXFunc+8>: MOV R1, c[0x0][0x44]
(cuda-gdb)
0x00007ffff049f610 in samplePTXFunc<<<(1,1,1),(1,1,1)>>> ()
1: x/i $pc
=> 0x7ffff049f610 <samplePTXFunc+16>: MOV R0, c[0x0][0x140]
(cuda-gdb) p/x $R1
$4 = 0xfffc40
(cuda-gdb) si
0x00007ffff049f618 in samplePTXFunc<<<(1,1,1),(1,1,1)>>> ()
1: x/i $pc
=> 0x7ffff049f618 <samplePTXFunc+24>: IADD R2.CC, R0, 0xc
(cuda-gdb) p/x $R0
$6 = 0x20580038
(cuda-gdb) si
0x00007ffff049f620 in samplePTXFunc<<<(1,1,1),(1,1,1)>>> ()
1: x/i $pc
=> 0x7ffff049f620 <samplePTXFunc+32>: IADD.X R3, RZ, c[0x0][0x144]
(cuda-gdb) p/x $R2
$7 = 0x20580044
(cuda-gdb) si
0x00007ffff049f628 in samplePTXFunc<<<(1,1,1),(1,1,1)>>> ()
1: x/i $pc
=> 0x7ffff049f628 <samplePTXFunc+40>: LD.E R0, [R2]
(cuda-gdb) p/x $R3
$8 = 0x2
(cuda-gdb) si

Program received signal CUDA_EXCEPTION_10, Device Illegal Address.
0x00007ffff049f630 in samplePTXFunc<<<(1,1,1),(1,1,1)>>> ()
1: x/i $pc
=> 0x7ffff049f630 <samplePTXFunc+48>: ISETP.LE.AND P0, PT, R0, c[0x0][0x148], PT
(cuda-gdb) p *(@global int *) 0x20580044
$9 = 9
(cuda-gdb) p *(@global int *) 0x220580044
Error: Failed to read global memory at address 0x220580044 on device 0 sm 11 warp 0 lane 0 (error=7).
(cuda-gdb)

I really appreciate your quick and insightful responses. Thanks!

No problem!

Yes the addition of the offset is expected based on the PTX you’ve provided, no concerns there. Although, we do want to make sure that the addition doesn’t cross the boundary of the allocation. My other concern lies in the value (2) that is held in R3 and is contributing to the address.

From the result of the call to cuMemHostGetDevicePointer, I can see that it’s the full 0x220580038

(cuda-gdb) p/x arrayArgOnDev
$1 = 0x220580038

And that matches the value that is used as the address (in R3/R2) for the LD.E instruction, which is failing.

So, we’ll need to run a few more experiments:

(1) Can you confirm that the kernel fails when running your application outside of the debugger?

(2) Can you show a few more lines on the host side (namely the allocation of ‘arg’), and can you also print argSize (we’d like to make sure that the addition of 12 fits safely within the allocation being done here)?

(3) Can you confirm you’re on a 64-bit system? Is the application compiled with -m32, or is it compiled for native 64-bits?

(4) Can you try the following experiment: just before stepping over the LD.E instruction, please modify the value of R3 to 0 (p $R3 = 0). Then try stepping the LD.E instruction again. We expect this to succeed, since you were able to read the memory from 0x20580044, but not 0x220580044.

One more experiment, (5): Similar to the experiment in (4), but instead of modifying R3 to 0, can you modify R2 to be 0x20580038 (the base of the allocation), and then try stepping the LD.E again? (This will indicate whether the offset is safe)

Also, are you able to post the full source code?

Answers to post #6

(1) Outside of the debugger, although the status returned by cuLaunchKernel of the PTX kernel returns SUCCESS, the subsequent call to cuCtxSynchronize returns with error 700 which as I understand indicates failure of the previous launch call. So, yes, the kernel execution fails outside the debugger. The PTX kernel is generated by a Java-to-PTX compiler. We are implementing a PTX backend as part of the open source OpenJDK Project Sumatra (OpenJDK: Project Sumatra) using Graal (graal/graal: log).

Here is the full PTX kernel generated:

.version 3.0
.target sm_30
.entry testStoreArray1I (
.param .u64 param0,
.param .s32 param1
) {
.reg .s32 %r3;
.reg .s32 %r4;
.reg .s32 %r9;
.reg .s64 %r5;
.reg .s64 %r6;
.reg .s64 %r7;
.reg .s64 %r8;
.reg .u64 %r2;
.reg .pred %p<1>;
.reg .pred %r;
L0:
ld.param.u64 %r2, [param0 + 0];
ld.param.s32 %r3, [param1 + 0];
ld.global.s32 %r4, [%r2 + 12];
setp.ge.s32 %p0, %r3, %r4;
@%p0 bra L1;
bra L2;
L2:
cvt.s64.s32 %r5, %r3;
shl.b64 %r6, %r5,2;
mov.s64 %r7, %r2;
add.s64 %r8, %r7, %r6;
mov.s32 %r9, 42;
st.global.s32 [%r8 + 16], %r9;
ret;
L1:
exit;
}

To test the generated kernel outside of JVM execution, I wrote a C++ program which just sets up the parameters and launches the above kernel. Interestingly, this stand alone program runs on the GPU correctly. The same sequence of CUDA driver API calls do not work when running in the JVM that generates this PTX code.

(2) Please refer to the following cuda-gdb snippets

// Host side debugging

256 int argSize = arg->size() * HeapWordSize;
(cuda-gdb)
[tcsetpgrp failed in terminal_inferior: Inappropriate ioctl for device]
261 if (is_kernel_arg_setup()) {
(cuda-gdb) p argSize
$1 = 56
(cuda-gdb) x/56b arg
0x7ad817038: 1 0 0 0 0 0 0 0
0x7ad817040: 32 11 0 0 9 0 0 0
0x7ad817048: 1 0 0 0 2 0 0 0
0x7ad817050: 3 0 0 0 4 0 0 0
0x7ad817058: 5 0 0 0 6 0 0 0
0x7ad817060: 7 0 0 0 8 0 0 0
0x7ad817068: 9 0 0 0 0 0 0 0

// Device side debugging
(cuda-gdb) si
0x00007ffff04bf608 in testStoreArray1I<<<(1,1,1),(1,1,1)>>> ()
1: x/i $pc
=> 0x7ffff04bf608 <testStoreArray1I+8>: MOV R1, c[0x0][0x44]
(cuda-gdb)
0x00007ffff04bf610 in testStoreArray1I<<<(1,1,1),(1,1,1)>>> ()
1: x/i $pc
=> 0x7ffff04bf610 <testStoreArray1I+16>: MOV R0, c[0x0][0x140]
(cuda-gdb) p/x $R1
$1 = 0xfffc40
(cuda-gdb) si
0x00007ffff04bf618 in testStoreArray1I<<<(1,1,1),(1,1,1)>>> ()
1: x/i $pc
=> 0x7ffff04bf618 <testStoreArray1I+24>: IADD R2.CC, R0, 0xc
(cuda-gdb) p/x $R0
$2 = 0x20580000
(cuda-gdb) x/56b (@global char*) $R0
0x20580000: 0x01 0x00 0x00 0x00 0x00 0x00 0x00 0x00
0x20580008: 0x20 0x0b 0x00 0x00 0x09 0x00 0x00 0x00
0x20580010: 0x01 0x00 0x00 0x00 0x02 0x00 0x00 0x00
0x20580018: 0x03 0x00 0x00 0x00 0x04 0x00 0x00 0x00
0x20580020: 0x05 0x00 0x00 0x00 0x06 0x00 0x00 0x00
0x20580028: 0x07 0x00 0x00 0x00 0x08 0x00 0x00 0x00
0x20580030: 0x09 0x00 0x00 0x00 0x00 0x00 0x00 0x00
(cuda-gdb)

(3) Yes, it is a 64-bit system. The application is compiled for native 64-bits.

(4) Setting R3 to 0 does seem to load the expected value of 9 to $R0. However, it appears that I see problems with a subsequent

@!P0 ST.E [R2+0x10], R4

trying to store to

(cuda-gdb) p/x $R2
$6 = 0x2058001c
(cuda-gdb) p/x $R1
$7 = 0xfffc40

where as I expect the store to 0x2058001c. But I’ll worry about it later.

(5) Setting R2 to the base of allocation resulted in the same CUDA_EXCEPTION_10.

The file that has the sources that set up the kernel arguments is at http://hg.openjdk.java.net/graal/graal/file/3b178baf3edb/src/gpu/ptx/vm/ptxKernelArguments.cpp. Please look at the function do_array(): Lines 262-270. This version uses cuMemalloc;cuMemcpyHtoD which behaves the same way as cuMemHostRegister;cuMemHostGetDevicePointer as I pointed out earlier.

I am happy to continue further debugging via direct email (if you prefer). I can post a summary of the results/resolution to the forum.

Thanks!

Bharadwaj

Thanks for posting the results of each experiment.

In particular, you’ve confirmed that the entire allocation is not accessible (using the base address yields the same result as base+offset when accessed from the GPU). The size of the allocation also seams reasonable (and an offset of 12 fits within it).

Since you mentioned your standalone C++ program that sets up the arguments and runs this same kernel executes successfully on the GPU, it appears to be something specific within the full application (perhaps freeing the allocation by the time the GPU begins to execute, etc.).

I will follow up you with you on the side on this.

(Posting from a different profile as my earlier profile was inaccessible)

The issue I was facing has been resolved thanks to geoffg’s suggestions and help.

The solution is to look up and invoke appropriate _v2 versions of CUDA driver APIs as the app is 64-bit.

Thanks!