Segfault on cudaGetDeviceCount? Apparent segfault in the CUDA library, does anyone have an idea how that could happen?

Hi

I got a strange crash and wanted to ask whether anyone has a hint about what to look for as the cause. (Unfortunately I don't have a small reproduction case.) It seems to crash inside the CUDA library when calling cudaGetDeviceCount(). If I start my program with different parameters, which are seemingly unrelated to the following code, it doesn't crash.

Here is the apparently problematic code snippet (removing the SAFE_CALL gives the same result). When this code is called, no GPU should have been assigned to the process yet. On the other hand, a prior implicit assignment (which I might have overlooked) should have no implications for cudaGetDeviceCount, should it?

    printf("Debug A\n");

    // deviceCount stays 0 until the first successful query
    static int deviceCount=0;
    static bool sharedmode=false;

    // already initialised: return early unless in shared mode
    if(deviceCount && !sharedmode) return;
    if(deviceCount && sharedmode) {printf("ERROR\n");cudaThreadExit();}

    printf("Debug B %i\n",deviceCount);
    // the crash happens inside this call
    CUDA_SAFE_CALL_NO_SYNC( cudaGetDeviceCount(&deviceCount) );
    printf("Debug B1A %i\n",deviceCount);

Output:

Debug A

Debug B 0

[aristarch:17278] *** Process received signal ***

[aristarch:17278] Signal: Segmentation fault (11)

[aristarch:17278] Signal code: Address not mapped (1)

[aristarch:17278] Failing at address: 0x8

[aristarch:17278] [ 0] /lib64/libpthread.so.0 [0x303440eb10]

[aristarch:17278] [ 1] /lib64/libc.so.6 [0x3033872bcc]

[aristarch:17278] [ 2] /lib64/libc.so.6(__libc_malloc+0x6e) [0x3033874cde]

[aristarch:17278] [ 3] /lib64/libc.so.6 [0x303386170a]

[aristarch:17278] [ 4] /usr/lib64/libcuda.so.1 [0x2b717d053eef]

[aristarch:17278] [ 5] /usr/lib64/libcuda.so.1 [0x2b717d0571c3]

[aristarch:17278] [ 6] /usr/lib64/libcuda.so.1 [0x2b717d057925]

[aristarch:17278] [ 7] /usr/lib64/libcuda.so.1 [0x2b717d00f229]

[aristarch:17278] [ 8] /usr/lib64/libcuda.so.1 [0x2b717d00d0e7]

[aristarch:17278] [ 9] /usr/lib64/libcuda.so.1 [0x2b717d051ae5]

[aristarch:17278] [10] /usr/lib64/libcuda.so.1 [0x2b717cfb8b8f]

[aristarch:17278] [11] /usr/lib64/libcuda.so.1(cuInit+0x4c) [0x2b717d06dcfc]

[aristarch:17278] [12] /usr/local/cuda/lib64/libcudart.so.3 [0x2b717d98c1f5]

[aristarch:17278] [13] /usr/local/cuda/lib64/libcudart.so.3 [0x2b717d98d59c]

[aristarch:17278] [14] /usr/local/cuda/lib64/libcudart.so.3(cudaGetDeviceCount+0x4b) [0x2b717d9ad3eb]

[aristarch:17278] [15] ../lmp_openmpi-20-s(CudaWrapper_Init+0x6b) [0x78fb7b]

[aristarch:17278] [16] ../lmp_openmpi-20-s(_ZN9LAMMPS_NS4Cuda9setDeviceEPNS_6LAMMPSE+0x392) [0x50df22]

[aristarch:17278] [17] ../lmp_openmpi-20-s(_ZN9LAMMPS_NS6LAMMPS4initEv+0xa) [0x645c2a]

[aristarch:17278] [18] ../lmp_openmpi-20-s(_ZN9LAMMPS_NS3Run7commandEiPPc+0x9ab) [0x73551b]

[aristarch:17278] [19] ../lmp_openmpi-20-s(_ZN9LAMMPS_NS5Input15execute_commandEv+0x29da) [0x63afda]

[aristarch:17278] [20] ../lmp_openmpi-20-s(_ZN9LAMMPS_NS5Input4fileEv+0x2e8) [0x637f08]

[aristarch:17278] [21] ../lmp_openmpi-20-s(main+0xad) [0x64982d]

[aristarch:17278] [22] /lib64/libc.so.6(__libc_start_main+0xf4) [0x303381d994]

[aristarch:17278] [23] ../lmp_openmpi-20-s(_ZNSt8ios_base4InitD1Ev+0x51) [0x47d239]

Has anyone got an idea?

Aha!! ;-)

Actually, it also crashes if I put the following in front of the code I have shown so far:

    printf("Debug A\n");
    cudaSetDevice(0);
    printf("Debug A1\n");

Output:

Debug A

[aristarch:18869] *** Process received signal ***

[aristarch:18869] Signal: Segmentation fault (11)

[aristarch:18869] Signal code: Address not mapped (1)

[aristarch:18869] Failing at address: 0x8

[aristarch:18869] [ 0] /lib64/libpthread.so.0 [0x303440eb10]

[aristarch:18869] [ 1] /lib64/libc.so.6 [0x3033872bcc]

[aristarch:18869] [ 2] /lib64/libc.so.6(__libc_malloc+0x6e) [0x3033874cde]

[aristarch:18869] [ 3] /lib64/libc.so.6 [0x303386170a]

[aristarch:18869] [ 4] /usr/lib64/libcuda.so.1 [0x2ab63d7bfeef]

[aristarch:18869] [ 5] /usr/lib64/libcuda.so.1 [0x2ab63d7c31c3]

[aristarch:18869] [ 6] /usr/lib64/libcuda.so.1 [0x2ab63d7c3925]

[aristarch:18869] [ 7] /usr/lib64/libcuda.so.1 [0x2ab63d77b229]

[aristarch:18869] [ 8] /usr/lib64/libcuda.so.1 [0x2ab63d7790e7]

[aristarch:18869] [ 9] /usr/lib64/libcuda.so.1 [0x2ab63d7bdae5]

[aristarch:18869] [10] /usr/lib64/libcuda.so.1 [0x2ab63d724b8f]

[aristarch:18869] [11] /usr/lib64/libcuda.so.1(cuInit+0x4c) [0x2ab63d7d9cfc]

[aristarch:18869] [12] /usr/local/cuda/lib64/libcudart.so.3 [0x2ab63e0f81f5]

[aristarch:18869] [13] /usr/local/cuda/lib64/libcudart.so.3 [0x2ab63e0f959c]

[aristarch:18869] [14] /usr/local/cuda/lib64/libcudart.so.3(cudaSetDevice+0x4b) [0x2ab63e11878b]

[aristarch:18869] [15] ../lmp_openmpi-20-s(CudaWrapper_Init+0x2f) [0x78fb3f]

[aristarch:18869] [16] ../lmp_openmpi-20-s(_ZN9LAMMPS_NS4Cuda9setDeviceEPNS_6LAMMPSE+0x392) [0x50df22]

[aristarch:18869] [17] ../lmp_openmpi-20-s(_ZN9LAMMPS_NS6LAMMPS4initEv+0xa) [0x645c2a]

[aristarch:18869] [18] ../lmp_openmpi-20-s(_ZN9LAMMPS_NS3Run7commandEiPPc+0x9ab) [0x73551b]

[aristarch:18869] [19] ../lmp_openmpi-20-s(_ZN9LAMMPS_NS5Input15execute_commandEv+0x29da) [0x63afda]

[aristarch:18869] [20] ../lmp_openmpi-20-s(_ZN9LAMMPS_NS5Input4fileEv+0x2e8) [0x637f08]

[aristarch:18869] [21] ../lmp_openmpi-20-s(main+0xad) [0x64982d]

[aristarch:18869] [22] /lib64/libc.so.6(__libc_start_main+0xf4) [0x303381d994]

[aristarch:18869] [23] ../lmp_openmpi-20-s(_ZNSt8ios_base4InitD1Ev+0x51) [0x47d239]

[aristarch:18869] *** End of error message ***

printf("Debug A\n");
cudaThreadExit();
cudaSetDevice(0);
printf("Debug A1\n");

Does this crash?

It does:

Debug A

[aristarch:07957] *** Process received signal ***

[aristarch:07957] Signal: Segmentation fault (11)

[aristarch:07957] Signal code: Address not mapped (1)

[aristarch:07957] Failing at address: 0x8

[aristarch:07957] [ 0] /lib64/libpthread.so.0 [0x303440eb10]

[aristarch:07957] [ 1] /lib64/libc.so.6 [0x3033872bcc]

[aristarch:07957] [ 2] /lib64/libc.so.6(__libc_malloc+0x6e) [0x3033874cde]

[aristarch:07957] [ 3] /lib64/libc.so.6 [0x303386170a]

[aristarch:07957] [ 4] /usr/lib64/libcuda.so.1 [0x2b9a0126ceef]

[aristarch:07957] [ 5] /usr/lib64/libcuda.so.1 [0x2b9a012701c3]

[aristarch:07957] [ 6] /usr/lib64/libcuda.so.1 [0x2b9a01270925]

[aristarch:07957] [ 7] /usr/lib64/libcuda.so.1 [0x2b9a01228229]

[aristarch:07957] [ 8] /usr/lib64/libcuda.so.1 [0x2b9a012260e7]

[aristarch:07957] [ 9] /usr/lib64/libcuda.so.1 [0x2b9a0126aae5]

[aristarch:07957] [10] /usr/lib64/libcuda.so.1 [0x2b9a011d1b8f]

[aristarch:07957] [11] /usr/lib64/libcuda.so.1(cuInit+0x4c) [0x2b9a01286cfc]

[aristarch:07957] [12] /usr/local/cuda/lib64/libcudart.so.3 [0x2b9a01ba51f5]

[aristarch:07957] [13] /usr/local/cuda/lib64/libcudart.so.3 [0x2b9a01ba659c]

[aristarch:07957] [14] /usr/local/cuda/lib64/libcudart.so.3(cudaThreadExit+0x3e) [0x2b9a01bc878e]

[aristarch:07957] [15] ../lmp_openmpi-20-s(CudaWrapper_Init+0x2d) [0x78fc1d]

[aristarch:07957] [16] ../lmp_openmpi-20-s(_ZN9LAMMPS_NS4Cuda9setDeviceEPNS_6LAMMPSE+0x392) [0x50df22]

[aristarch:07957] [17] ../lmp_openmpi-20-s(_ZN9LAMMPS_NS6LAMMPS4initEv+0xa) [0x645c5a]

[aristarch:07957] [18] ../lmp_openmpi-20-s(_ZN9LAMMPS_NS3Run7commandEiPPc+0x9ab) [0x7355fb]

[aristarch:07957] [19] ../lmp_openmpi-20-s(_ZN9LAMMPS_NS5Input15execute_commandEv+0x29fc) [0x63affc]

[aristarch:07957] [20] ../lmp_openmpi-20-s(_ZN9LAMMPS_NS5Input4fileEv+0x2e8) [0x637f08]

[aristarch:07957] [21] ../lmp_openmpi-20-s(main+0xad) [0x64985d]

[aristarch:07957] [22] /lib64/libc.so.6(__libc_start_main+0xf4) [0x303381d994]

[aristarch:07957] [23] ../lmp_openmpi-20-s(_ZNSt8ios_base4InitD1Ev+0x51) [0x47d239]

[aristarch:07957] *** End of error message ***

I suspect it is somehow connected to an implicit (and unintended) GPU initialisation prior to that code, due to memory allocations and/or copies (but I am relatively sure not due to a kernel). Still, I wonder why it crashes at this stage and not earlier.
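One more observation on the backtraces: all three fault inside __libc_malloc, reached from cuInit, at address 0x8. A crash inside malloc itself is often (though not always) a sign that the heap was corrupted earlier, for example by an out-of-bounds write in host code. That would also fit a crash that only shows up at a later allocation and only for some input parameters. This is an assumption, not something the traces prove. A minimal sketch of one way to check it is to run a small allocation probe just before the first CUDA call and see whether the crash moves into the probe. The name heap_probe is hypothetical and the block sizes are arbitrary.

```cpp
#include <cstddef>
#include <cstdio>
#include <cstdlib>
#include <cstring>

// Hypothetical helper (not part of CUDA or LAMMPS): allocate, touch and free
// a few blocks of different sizes. On a healthy heap this returns true.
// If the heap metadata is already damaged, the crash may show up in here
// instead of inside cuInit, which would point away from the CUDA library.
static bool heap_probe()
{
    const std::size_t sizes[] = {16, 64, 256, 4096, 65536};
    const std::size_t count = sizeof(sizes) / sizeof(sizes[0]);

    for (std::size_t i = 0; i < count; ++i) {
        void* p = std::malloc(sizes[i]);
        if (p == NULL) {
            std::fprintf(stderr, "heap_probe: malloc(%lu) failed\n",
                         (unsigned long)sizes[i]);
            return false;
        }
        std::memset(p, 0xAB, sizes[i]);  // touch the whole block
        std::free(p);
    }
    return true;
}
```

Calling heap_probe() right after the "Debug A" printf would show whether plain host allocations already fail at that point. A probe that passes does not prove the heap is intact, so running the program under a memory checker such as valgrind would be the more thorough test.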

Cheers