CUDA-GDB captured "Illegal access to address" exception when I invoke child kernel, but the result is correct when free run

Hi all,
I invoke a child kernel to do sorting of thrust, the result is correct.
But when I use CUDA-GDB/Nsingt to check the program, the “Illegal access to address” exception was captured.
I have read related documents, I’m sure the passed arguments are not local variables. So I try to pass
arguments only, and the kernel so nothing. However, the exception still occurs. Only the kernel without argument can
pass the check. Is it a real bug in my program?

Provide some experiments

__global__ void kernel_sorting(int a){
//nothing
}
__device__ int v;
__global__ void kernel_main(....){
int tid = threadIdx.x + blockDim.x * blockIdx.x;
....
if(tid==0)
    kernel_sorting<<<gridDim.x,blockDim.x>>>(v); //pass int
}

GDB show: Illegal access to address (@global)0x70382d540 detected. 
So v is in the global memory
__global__ void kernel_sorting(int *a){
//nothing
}
__device__ int v;
__global__ void kernel_main(....){
int tid = threadIdx.x + blockDim.x * blockIdx.x;
....
if(tid==0)
    kernel_sorting<<<gridDim.x,blockDim.x>>>(&v); //pass address
}
//GDB show: Illegal access to address (@global)0x70382d540 detected. 
//So v is in the global memory
__global__ void kernel_sorting(int *a){
//nothing
}
__device__ int v;
__global__ void kernel_main(int *input){
int tid = threadIdx.x + blockDim.x * blockIdx.x;
....
if(tid==0)
    kernel_sorting<<<gridDim.x,blockDim.x>>>(input); //pass an argument of parent kernel, a buffer in global memory 
}

provide more GDB’s information

Breakpoint 1, kernel_main<<<(1,1,1),(64,1,1)>>> (…) at …/fpg.cu:583
583 kernel_sorting<<<gridDim.x,blockDim.x>>>(&v);
(cuda-gdb) p &v
$1 = (@global int *) 0x70528dc00
(cuda-gdb) c
Continuing.

Illegal access to address (@global)0x703623d40 detected.

Program received signal CUDA_EXCEPTION_1, Lane Illegal Address.
[Switching focus to CUDA kernel 0, grid 22, block (0,0,0), thread (0,0,0), device 0, sm 4, warp 1, lane 0]
0x00007ffff06340a8 in kernel_main<<<(1,1,1),(64,1,1)>>> (…)at …/fpg.cu:583
583 kernel_sorting<<<gridDim.x,blockDim.x>>>(&v);

//The address triggers the exception is not the address of v. Who touch the address?

Here is my suggestion, if you want help:

[url]https://devtalk.nvidia.com/default/topic/999733/cuda-programming-and-performance/function-only-works-correctly-with-cuda-8-0-in-release-mode-/post/5109126/#5109126[/url]

Thanks for your suggestion, I create a project with Nsignt. The case can be reproduce.
I tried cuda-memcheck, but the program can be executed successfully with error.
You can switch the child kernel, all of them cause exception on my PC.
CUDA version : 8.0
GPU driver version:375.26
OS details : OpenSuse 42
the GPU model you are running on : Maxwell

#include <iostream>
#include <numeric>
#include <stdlib.h>
#include <stdio.h>
static void CheckCudaErrorAux (const char *, unsigned, const char *, cudaError_t);
#define CUDA_CHECK_RETURN(value) CheckCudaErrorAux(__FILE__,__LINE__, #value, value)

/**
 * CUDA kernel that computes reciprocal values for a given vector
 */

__global__ void kernel_sorting1(int *data) {
unsigned idx = blockIdx.x*blockDim.x+threadIdx.x;
	printf("tid:%d\n",idx);

}
__global__ void kernel_sorting2(int *data) {
unsigned idx = blockIdx.x*blockDim.x+threadIdx.x;
	printf("tid:%d\n",idx);

}

__global__ void kernel_sorting3(int data) {
	unsigned idx = blockIdx.x*blockDim.x+threadIdx.x;
	printf("tid:%d\n",idx);

}

__device__ int v;
__global__ void kernel_main(int *data) {
	unsigned idx = blockIdx.x*blockDim.x+threadIdx.x;
	int x =1;
	if (idx==0){
		kernel_sorting1<<<1,64>>>(data);
		//kernel_sorting2<<<1,64>>>(&v);
		//kernel_sorting3<<<1,64>>>(x);
	}
}

int main(void)
{
	int *input;
	cudaMalloc((void**) &input, sizeof(int) * 100 );
	cudaMemset(input, 0, sizeof(int) * 100 );
	kernel_main<<<1,64>>>(input);
	cudaFree(input);
	return 0;
}

NVIDIA (R) CUDA Debugger
8.0 release
Portions Copyright (C) 2007-2016 NVIDIA Corporation
GNU gdb (GDB) 7.6.2
Copyright (C) 2013 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later http://gnu.org/licenses/gpl.html
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law. Type “show copying”
and “show warranty” for details.
This GDB was configured as “x86_64-unknown-linux-gnu”.
For bug reporting instructions, please see:
http://www.gnu.org/software/gdb/bugs/
Reading symbols from /home/…/work/pp/Debug/pp…done.
(cuda-gdb) r
Starting program: /home/…/work/pp/Debug/./pp
[Thread debugging using libthread_db enabled]
Using host libthread_db library “/lib64/libthread_db.so.1”.
[New Thread 0x7ffff5f31700 (LWP 5103)]
[New Thread 0x7ffff5730700 (LWP 5104)]
[New Thread 0x7ffff4f2f700 (LWP 5105)]

Illegal access to address (@global)0x7032fb540 detected.

Program received signal CUDA_EXCEPTION_1, Lane Illegal Address.
[Switching focus to CUDA kernel 0, grid 3, block (0,0,0), thread (0,0,0), device 0, sm 4, warp 1, lane 0]
0x0000000000d0d2c8 in kernel_main<<<(1,1,1),(64,1,1)>>> (data=0x705380000) at …/src/pp.cu:44
44 kernel_sorting1<<<1,64>>>(data);

I don’t have any trouble running your code on a sm_35 device, cuda 8, CentOS7:

$ nvcc -g -G -arch=sm_35 -o t1303 t1303.cu -rdc=true -lcudadevrt
t1303.cu(32): warning: variable "x" was declared but never referenced

t1303.cu(5): warning: function "CheckCudaErrorAux" was declared but never referenced

t1303.cu(32): warning: variable "x" was declared but never referenced

t1303.cu(5): warning: function "CheckCudaErrorAux" was declared but never referenced

$ cuda-gdb ./t1303
NVIDIA (R) CUDA Debugger
8.0 release
Portions Copyright (C) 2007-2016 NVIDIA Corporation
GNU gdb (GDB) 7.6.2
Copyright (C) 2013 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-unknown-linux-gnu".
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>...
Reading symbols from /home/user2/misc/t1303...done.
(cuda-gdb) r
Starting program: /home/user2/misc/./t1303
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
[New Thread 0x7ffff5f8e700 (LWP 837)]
[New Thread 0x7ffff578d700 (LWP 838)]
tid:0
tid:1
tid:2
tid:3
tid:4
tid:5
tid:6
tid:7
tid:8
tid:9
tid:10
tid:11
tid:12
tid:13
tid:14
tid:15
tid:16
tid:17
tid:18
tid:19
tid:20
tid:21
tid:22
tid:23
tid:24
tid:25
tid:26
tid:27
tid:28
tid:29
tid:30
tid:31
tid:32
tid:33
tid:34
tid:35
tid:36
tid:37
tid:38
tid:39
tid:40
tid:41
tid:42
tid:43
tid:44
tid:45
tid:46
tid:47
tid:48
tid:49
tid:50
tid:51
tid:52
tid:53
tid:54
tid:55
tid:56
tid:57
tid:58
tid:59
tid:60
tid:61
tid:62
tid:63
[Thread 0x7ffff578d700 (LWP 838) exited]
[Thread 0x7ffff7fcc740 (LWP 810) exited]
[Inferior 1 (process 810) exited normally]
(cuda-gdb)

It doesn’t appear to me that OpenSUSE 42 is an officially supported linux distro for CUDA:

http://docs.nvidia.com/cuda/cuda-installation-guide-linux/index.html#system-requirements

but I’m not sure that is related to your issue. I don’t think there is anything wrong with your cuda code as posted, and I think your cuda-memcheck results support that claim.

Hi txbob,
Thanks again for your help. Maybe I should try other OS.
Actually, cuda-memcheck always reports “Internal error (7)” on my PC.
Even if I use it to check CUDA sample provided by NVIDIA.
Have you met the case?

========= CUDA-MEMCHECK
[CUDA Bandwidth Test] - Starting...
Running on...

 Device 0: GeForce GTX 750 Ti                                                                                                                                                                
 Quick Mode                                                                                                                                                                                  
                                                                                                                                                                                             
 Host to Device Bandwidth, 1 Device(s)                                                                                                                                                       
 PINNED Memory Transfers                                                                                                                                                                     
   Transfer Size (Bytes)        Bandwidth(MB/s)                                                                                                                                              
   33554432                     12753.9                                                                                                                                                      

 Device to Host Bandwidth, 1 Device(s)
 PINNED Memory Transfers
   Transfer Size (Bytes)        Bandwidth(MB/s)
   33554432                     12794.5

 Device to Device Bandwidth, 1 Device(s)
 PINNED Memory Transfers
   Transfer Size (Bytes)        Bandwidth(MB/s)
   33554432                     70159.5

Result = PASS

NOTE: The CUDA Samples are not meant for performance measurements. Results may vary when GPU Boost is enabled.
========= Internal error (7)
========= No CUDA-MEMCHECK results found

Sounds like there is something wrong with your CUDA install. I don’t know what it is exactly. Previously you said cuda-memcheck reported no errors. On my system, cuda-memcheck reported no errors on your code (and I didn’t get any in cuda-gdb either).

Thanks. Because the internal error always occurs, even if The output is right, so I just ignore the error.
Thanks for your help, changing an OS is easier than debug :)