Segfault on cudaGetDeviceCount? Apparent segfault in the CUDA library; does anyone have an idea how that could happen?

ceearem · December 21, 2010, 5:47pm

Hi

I got a strange crash and wanted to ask whether anyone has a hint about what to look for as a cause. (Unfortunately, I don't have a small reproduction case.) It seems to crash inside the CUDA library when cudaGetDeviceCount() is called. If I start my program with different parameters, which are seemingly unrelated to the following code, it doesn't crash.

Here is the apparently problematic code snippet (removing the CUDA_SAFE_CALL_NO_SYNC wrapper gives the same result). When this code is called, no GPU should be assigned to the process yet. On the other hand, a prior implicit assignment (which I might have overlooked) should have no implications for cudaGetDeviceCount, should it?

    printf("Debug A\n");
    static int deviceCount=0;
    static bool sharedmode=false;
    if(deviceCount && !sharedmode) return;
    if(deviceCount && sharedmode) {printf("ERROR\n");cudaThreadExit();}
    printf("Debug B %i\n",deviceCount);
    CUDA_SAFE_CALL_NO_SYNC( cudaGetDeviceCount(&deviceCount) );
    printf("Debug B1A %i\n",deviceCount);

Output:

Debug A

Debug B 0

[aristarch:17278] *** Process received signal ***

[aristarch:17278] Signal: Segmentation fault (11)

[aristarch:17278] Signal code: Address not mapped (1)

[aristarch:17278] Failing at address: 0x8

[aristarch:17278] [ 0] /lib64/libpthread.so.0 [0x303440eb10]

[aristarch:17278] [ 1] /lib64/libc.so.6 [0x3033872bcc]

[aristarch:17278] [ 2] /lib64/libc.so.6(__libc_malloc+0x6e) [0x3033874cde]

[aristarch:17278] [ 3] /lib64/libc.so.6 [0x303386170a]

[aristarch:17278] [ 4] /usr/lib64/libcuda.so.1 [0x2b717d053eef]

[aristarch:17278] [ 5] /usr/lib64/libcuda.so.1 [0x2b717d0571c3]

[aristarch:17278] [ 6] /usr/lib64/libcuda.so.1 [0x2b717d057925]

[aristarch:17278] [ 7] /usr/lib64/libcuda.so.1 [0x2b717d00f229]

[aristarch:17278] [ 8] /usr/lib64/libcuda.so.1 [0x2b717d00d0e7]

[aristarch:17278] [ 9] /usr/lib64/libcuda.so.1 [0x2b717d051ae5]

[aristarch:17278] [10] /usr/lib64/libcuda.so.1 [0x2b717cfb8b8f]

[aristarch:17278] [11] /usr/lib64/libcuda.so.1(cuInit+0x4c) [0x2b717d06dcfc]

[aristarch:17278] [12] /usr/local/cuda/lib64/libcudart.so.3 [0x2b717d98c1f5]

[aristarch:17278] [13] /usr/local/cuda/lib64/libcudart.so.3 [0x2b717d98d59c]

[aristarch:17278] [14] /usr/local/cuda/lib64/libcudart.so.3(cudaGetDeviceCount+0x4b) [0x2b717d9ad3eb]

[aristarch:17278] [15] ../lmp_openmpi-20-s(CudaWrapper_Init+0x6b) [0x78fb7b]

[aristarch:17278] [16] ../lmp_openmpi-20-s(_ZN9LAMMPS_NS4Cuda9setDeviceEPNS_6LAMMPSE+0x392) [0x50df22]

[aristarch:17278] [17] ../lmp_openmpi-20-s(_ZN9LAMMPS_NS6LAMMPS4initEv+0xa) [0x645c2a]

[aristarch:17278] [18] ../lmp_openmpi-20-s(_ZN9LAMMPS_NS3Run7commandEiPPc+0x9ab) [0x73551b]

[aristarch:17278] [19] ../lmp_openmpi-20-s(_ZN9LAMMPS_NS5Input15execute_commandEv+0x29da) [0x63afda]

[aristarch:17278] [20] ../lmp_openmpi-20-s(_ZN9LAMMPS_NS5Input4fileEv+0x2e8) [0x637f08]

[aristarch:17278] [21] ../lmp_openmpi-20-s(main+0xad) [0x64982d]

[aristarch:17278] [22] /lib64/libc.so.6(__libc_start_main+0xf4) [0x303381d994]

[aristarch:17278] [23] ../lmp_openmpi-20-s(_ZNSt8ios_base4InitD1Ev+0x51) [0x47d239]

Has anyone got an idea?

Lev · December 21, 2010, 6:03pm

ceearem · December 21, 2010, 9:36pm

Aha!! ;-)

ceearem · December 21, 2010, 9:54pm

Actually, it also crashes if I put a cudaSetDevice(0) call in front of the code I have shown so far:

    printf("Debug A\n");
    cudaSetDevice(0);
    printf("Debug A1\n");

Output:

Debug A

[aristarch:18869] *** Process received signal ***

[aristarch:18869] Signal: Segmentation fault (11)

[aristarch:18869] Signal code: Address not mapped (1)

[aristarch:18869] Failing at address: 0x8

[aristarch:18869] [ 0] /lib64/libpthread.so.0 [0x303440eb10]

[aristarch:18869] [ 1] /lib64/libc.so.6 [0x3033872bcc]

[aristarch:18869] [ 2] /lib64/libc.so.6(__libc_malloc+0x6e) [0x3033874cde]

[aristarch:18869] [ 3] /lib64/libc.so.6 [0x303386170a]

[aristarch:18869] [ 4] /usr/lib64/libcuda.so.1 [0x2ab63d7bfeef]

[aristarch:18869] [ 5] /usr/lib64/libcuda.so.1 [0x2ab63d7c31c3]

[aristarch:18869] [ 6] /usr/lib64/libcuda.so.1 [0x2ab63d7c3925]

[aristarch:18869] [ 7] /usr/lib64/libcuda.so.1 [0x2ab63d77b229]

[aristarch:18869] [ 8] /usr/lib64/libcuda.so.1 [0x2ab63d7790e7]

[aristarch:18869] [ 9] /usr/lib64/libcuda.so.1 [0x2ab63d7bdae5]

[aristarch:18869] [10] /usr/lib64/libcuda.so.1 [0x2ab63d724b8f]

[aristarch:18869] [11] /usr/lib64/libcuda.so.1(cuInit+0x4c) [0x2ab63d7d9cfc]

[aristarch:18869] [12] /usr/local/cuda/lib64/libcudart.so.3 [0x2ab63e0f81f5]

[aristarch:18869] [13] /usr/local/cuda/lib64/libcudart.so.3 [0x2ab63e0f959c]

[aristarch:18869] [14] /usr/local/cuda/lib64/libcudart.so.3(cudaSetDevice+0x4b) [0x2ab63e11878b]

[aristarch:18869] [15] ../lmp_openmpi-20-s(CudaWrapper_Init+0x2f) [0x78fb3f]

[aristarch:18869] [16] ../lmp_openmpi-20-s(_ZN9LAMMPS_NS4Cuda9setDeviceEPNS_6LAMMPSE+0x392) [0x50df22]

[aristarch:18869] [17] ../lmp_openmpi-20-s(_ZN9LAMMPS_NS6LAMMPS4initEv+0xa) [0x645c2a]

[aristarch:18869] [18] ../lmp_openmpi-20-s(_ZN9LAMMPS_NS3Run7commandEiPPc+0x9ab) [0x73551b]

[aristarch:18869] [19] ../lmp_openmpi-20-s(_ZN9LAMMPS_NS5Input15execute_commandEv+0x29da) [0x63afda]

[aristarch:18869] [20] ../lmp_openmpi-20-s(_ZN9LAMMPS_NS5Input4fileEv+0x2e8) [0x637f08]

[aristarch:18869] [21] ../lmp_openmpi-20-s(main+0xad) [0x64982d]

[aristarch:18869] [22] /lib64/libc.so.6(__libc_start_main+0xf4) [0x303381d994]

[aristarch:18869] [23] ../lmp_openmpi-20-s(_ZNSt8ios_base4InitD1Ev+0x51) [0x47d239]

[aristarch:18869] *** End of error message ***

Dittoaway · December 22, 2010, 3:12pm

printf("Debug A\n");
cudaThreadExit();
cudaSetDevice(0);
printf("Debug A1\n");

Does this crash?

ceearem · December 23, 2010, 1:48pm

It does:

Debug A

[aristarch:07957] *** Process received signal ***

[aristarch:07957] Signal: Segmentation fault (11)

[aristarch:07957] Signal code: Address not mapped (1)

[aristarch:07957] Failing at address: 0x8

[aristarch:07957] [ 0] /lib64/libpthread.so.0 [0x303440eb10]

[aristarch:07957] [ 1] /lib64/libc.so.6 [0x3033872bcc]

[aristarch:07957] [ 2] /lib64/libc.so.6(__libc_malloc+0x6e) [0x3033874cde]

[aristarch:07957] [ 3] /lib64/libc.so.6 [0x303386170a]

[aristarch:07957] [ 4] /usr/lib64/libcuda.so.1 [0x2b9a0126ceef]

[aristarch:07957] [ 5] /usr/lib64/libcuda.so.1 [0x2b9a012701c3]

[aristarch:07957] [ 6] /usr/lib64/libcuda.so.1 [0x2b9a01270925]

[aristarch:07957] [ 7] /usr/lib64/libcuda.so.1 [0x2b9a01228229]

[aristarch:07957] [ 8] /usr/lib64/libcuda.so.1 [0x2b9a012260e7]

[aristarch:07957] [ 9] /usr/lib64/libcuda.so.1 [0x2b9a0126aae5]

[aristarch:07957] [10] /usr/lib64/libcuda.so.1 [0x2b9a011d1b8f]

[aristarch:07957] [11] /usr/lib64/libcuda.so.1(cuInit+0x4c) [0x2b9a01286cfc]

[aristarch:07957] [12] /usr/local/cuda/lib64/libcudart.so.3 [0x2b9a01ba51f5]

[aristarch:07957] [13] /usr/local/cuda/lib64/libcudart.so.3 [0x2b9a01ba659c]

[aristarch:07957] [14] /usr/local/cuda/lib64/libcudart.so.3(cudaThreadExit+0x3e) [0x2b9a01bc878e]

[aristarch:07957] [15] ../lmp_openmpi-20-s(CudaWrapper_Init+0x2d) [0x78fc1d]

[aristarch:07957] [16] ../lmp_openmpi-20-s(_ZN9LAMMPS_NS4Cuda9setDeviceEPNS_6LAMMPSE+0x392) [0x50df22]

[aristarch:07957] [17] ../lmp_openmpi-20-s(_ZN9LAMMPS_NS6LAMMPS4initEv+0xa) [0x645c5a]

[aristarch:07957] [18] ../lmp_openmpi-20-s(_ZN9LAMMPS_NS3Run7commandEiPPc+0x9ab) [0x7355fb]

[aristarch:07957] [19] ../lmp_openmpi-20-s(_ZN9LAMMPS_NS5Input15execute_commandEv+0x29fc) [0x63affc]

[aristarch:07957] [20] ../lmp_openmpi-20-s(_ZN9LAMMPS_NS5Input4fileEv+0x2e8) [0x637f08]

[aristarch:07957] [21] ../lmp_openmpi-20-s(main+0xad) [0x64985d]

[aristarch:07957] [22] /lib64/libc.so.6(__libc_start_main+0xf4) [0x303381d994]

[aristarch:07957] [23] ../lmp_openmpi-20-s(_ZNSt8ios_base4InitD1Ev+0x51) [0x47d239]

[aristarch:07957] *** End of error message ***

I suspect it is somehow connected to an implicit (and unintended) GPU initialisation prior to that code, due to memory allocations and/or copies (I am relatively sure it is not caused by a kernel launch). But I still wonder why it crashes at this stage and not earlier.

Cheers
