ceearem
December 21, 2010, 5:47pm
1
Hi
I got a strange crash and wanted to ask whether anyone has a hint about what to look for as a cause. (Unfortunately, I don't have a small reproduction case.) It seems to crash inside the CUDA library upon calling cudaGetDeviceCount(). If I start my program with different parameters, which are seemingly unrelated to the following code, it doesn't crash.
Here is the apparently problematic code snippet (removing the SAFE_CALL gives the same result). When this code is called, no GPU should be assigned to the process yet. On the other hand, a prior implicit assignment (which I might have overlooked) should have no implications for cudaGetDeviceCount, should it?
printf("Debug A\n");
static int deviceCount=0;
static bool sharedmode=false;
if(deviceCount && !sharedmode) return;
if(deviceCount && sharedmode) {printf("ERROR\n");cudaThreadExit();}
printf("Debug B %i\n",deviceCount);
CUDA_SAFE_CALL_NO_SYNC( cudaGetDeviceCount(&deviceCount) );
printf("Debug B1A %i\n",deviceCount);
Output:
Debug A
Debug B 0
[aristarch:17278] *** Process received signal ***
[aristarch:17278] Signal: Segmentation fault (11)
[aristarch:17278] Signal code: Address not mapped (1)
[aristarch:17278] Failing at address: 0x8
[aristarch:17278] [ 0] /lib64/libpthread.so.0 [0x303440eb10]
[aristarch:17278] [ 1] /lib64/libc.so.6 [0x3033872bcc]
[aristarch:17278] [ 2] /lib64/libc.so.6(__libc_malloc+0x6e) [0x3033874cde]
[aristarch:17278] [ 3] /lib64/libc.so.6 [0x303386170a]
[aristarch:17278] [ 4] /usr/lib64/libcuda.so.1 [0x2b717d053eef]
[aristarch:17278] [ 5] /usr/lib64/libcuda.so.1 [0x2b717d0571c3]
[aristarch:17278] [ 6] /usr/lib64/libcuda.so.1 [0x2b717d057925]
[aristarch:17278] [ 7] /usr/lib64/libcuda.so.1 [0x2b717d00f229]
[aristarch:17278] [ 8] /usr/lib64/libcuda.so.1 [0x2b717d00d0e7]
[aristarch:17278] [ 9] /usr/lib64/libcuda.so.1 [0x2b717d051ae5]
[aristarch:17278] [10] /usr/lib64/libcuda.so.1 [0x2b717cfb8b8f]
[aristarch:17278] [11] /usr/lib64/libcuda.so.1(cuInit+0x4c) [0x2b717d06dcfc]
[aristarch:17278] [12] /usr/local/cuda/lib64/libcudart.so.3 [0x2b717d98c1f5]
[aristarch:17278] [13] /usr/local/cuda/lib64/libcudart.so.3 [0x2b717d98d59c]
[aristarch:17278] [14] /usr/local/cuda/lib64/libcudart.so.3(cudaGetDeviceCount+0x4b) [0x2b717d9ad3eb]
[aristarch:17278] [15] ../lmp_openmpi-20-s(CudaWrapper_Init+0x6b) [0x78fb7b]
[aristarch:17278] [16] ../lmp_openmpi-20-s(_ZN9LAMMPS_NS4Cuda9setDeviceEPNS_6LAMMPSE+0x392) [0x50df22]
[aristarch:17278] [17] ../lmp_openmpi-20-s(_ZN9LAMMPS_NS6LAMMPS4initEv+0xa) [0x645c2a]
[aristarch:17278] [18] ../lmp_openmpi-20-s(_ZN9LAMMPS_NS3Run7commandEiPPc+0x9ab) [0x73551b]
[aristarch:17278] [19] ../lmp_openmpi-20-s(_ZN9LAMMPS_NS5Input15execute_commandEv+0x29da) [0x63afda]
[aristarch:17278] [20] ../lmp_openmpi-20-s(_ZN9LAMMPS_NS5Input4fileEv+0x2e8) [0x637f08]
[aristarch:17278] [21] ../lmp_openmpi-20-s(main+0xad) [0x64982d]
[aristarch:17278] [22] /lib64/libc.so.6(__libc_start_main+0xf4) [0x303381d994]
[aristarch:17278] [23] ../lmp_openmpi-20-s(_ZNSt8ios_base4InitD1Ev+0x51) [0x47d239]
Has anyone got an idea?
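For anyone who wants to test the guard logic from the snippet above without CUDA, here is a minimal sketch. It is not the actual LAMMPS code: stub_get_device_count is a hypothetical stand-in for cudaGetDeviceCount(), the device count of 2 is an assumed value, and wrapper_init_sketch and query_calls are names made up for illustration.

```cpp
#include <cassert>
#include <cstdio>

// Hypothetical stand-in for cudaGetDeviceCount(); returns 0 on success.
// The value 2 is an assumption chosen purely for illustration.
static int stub_get_device_count(int *count) {
    assert(count != nullptr);
    *count = 2;
    return 0;
}

// Counts how often the device query actually ran (illustration only).
int query_calls = 0;

// Init-once guard: query the device count on the first call and
// return the cached value on every later call.
// Returns the device count, or -1 if the query failed.
int wrapper_init_sketch() {
    static int deviceCount = 0;
    if (deviceCount > 0) return deviceCount;  // already initialised

    ++query_calls;
    int count = 0;
    if (stub_get_device_count(&count) != 0) {
        std::fprintf(stderr, "device query failed\n");
        return -1;
    }
    deviceCount = count;
    return deviceCount;
}
```

Calling wrapper_init_sketch() twice should run the query only once. If this guard behaves as expected in isolation, the static variables are probably not the cause, which would fit the backtrace, where the fault occurs inside the library call itself.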
ceearem
December 21, 2010, 9:54pm
4
Actually, it also crashes if I put the following in front of the code shown above:
printf("Debug A\n");
cudaSetDevice(0);
printf("Debug A1\n");
Output:
Debug A
[aristarch:18869] *** Process received signal ***
[aristarch:18869] Signal: Segmentation fault (11)
[aristarch:18869] Signal code: Address not mapped (1)
[aristarch:18869] Failing at address: 0x8
[aristarch:18869] [ 0] /lib64/libpthread.so.0 [0x303440eb10]
[aristarch:18869] [ 1] /lib64/libc.so.6 [0x3033872bcc]
[aristarch:18869] [ 2] /lib64/libc.so.6(__libc_malloc+0x6e) [0x3033874cde]
[aristarch:18869] [ 3] /lib64/libc.so.6 [0x303386170a]
[aristarch:18869] [ 4] /usr/lib64/libcuda.so.1 [0x2ab63d7bfeef]
[aristarch:18869] [ 5] /usr/lib64/libcuda.so.1 [0x2ab63d7c31c3]
[aristarch:18869] [ 6] /usr/lib64/libcuda.so.1 [0x2ab63d7c3925]
[aristarch:18869] [ 7] /usr/lib64/libcuda.so.1 [0x2ab63d77b229]
[aristarch:18869] [ 8] /usr/lib64/libcuda.so.1 [0x2ab63d7790e7]
[aristarch:18869] [ 9] /usr/lib64/libcuda.so.1 [0x2ab63d7bdae5]
[aristarch:18869] [10] /usr/lib64/libcuda.so.1 [0x2ab63d724b8f]
[aristarch:18869] [11] /usr/lib64/libcuda.so.1(cuInit+0x4c) [0x2ab63d7d9cfc]
[aristarch:18869] [12] /usr/local/cuda/lib64/libcudart.so.3 [0x2ab63e0f81f5]
[aristarch:18869] [13] /usr/local/cuda/lib64/libcudart.so.3 [0x2ab63e0f959c]
[aristarch:18869] [14] /usr/local/cuda/lib64/libcudart.so.3(cudaSetDevice+0x4b) [0x2ab63e11878b]
[aristarch:18869] [15] ../lmp_openmpi-20-s(CudaWrapper_Init+0x2f) [0x78fb3f]
[aristarch:18869] [16] ../lmp_openmpi-20-s(_ZN9LAMMPS_NS4Cuda9setDeviceEPNS_6LAMMPSE+0x392) [0x50df22]
[aristarch:18869] [17] ../lmp_openmpi-20-s(_ZN9LAMMPS_NS6LAMMPS4initEv+0xa) [0x645c2a]
[aristarch:18869] [18] ../lmp_openmpi-20-s(_ZN9LAMMPS_NS3Run7commandEiPPc+0x9ab) [0x73551b]
[aristarch:18869] [19] ../lmp_openmpi-20-s(_ZN9LAMMPS_NS5Input15execute_commandEv+0x29da) [0x63afda]
[aristarch:18869] [20] ../lmp_openmpi-20-s(_ZN9LAMMPS_NS5Input4fileEv+0x2e8) [0x637f08]
[aristarch:18869] [21] ../lmp_openmpi-20-s(main+0xad) [0x64982d]
[aristarch:18869] [22] /lib64/libc.so.6(__libc_start_main+0xf4) [0x303381d994]
[aristarch:18869] [23] ../lmp_openmpi-20-s(_ZNSt8ios_base4InitD1Ev+0x51) [0x47d239]
[aristarch:18869] *** End of error message ***
printf("Debug A\n");
cudaThreadExit();
cudaSetDevice(0);
printf("Debug A1\n");
Does this crash?
ceearem
December 23, 2010, 1:48pm
6
It does:
Debug A
[aristarch:07957] *** Process received signal ***
[aristarch:07957] Signal: Segmentation fault (11)
[aristarch:07957] Signal code: Address not mapped (1)
[aristarch:07957] Failing at address: 0x8
[aristarch:07957] [ 0] /lib64/libpthread.so.0 [0x303440eb10]
[aristarch:07957] [ 1] /lib64/libc.so.6 [0x3033872bcc]
[aristarch:07957] [ 2] /lib64/libc.so.6(__libc_malloc+0x6e) [0x3033874cde]
[aristarch:07957] [ 3] /lib64/libc.so.6 [0x303386170a]
[aristarch:07957] [ 4] /usr/lib64/libcuda.so.1 [0x2b9a0126ceef]
[aristarch:07957] [ 5] /usr/lib64/libcuda.so.1 [0x2b9a012701c3]
[aristarch:07957] [ 6] /usr/lib64/libcuda.so.1 [0x2b9a01270925]
[aristarch:07957] [ 7] /usr/lib64/libcuda.so.1 [0x2b9a01228229]
[aristarch:07957] [ 8] /usr/lib64/libcuda.so.1 [0x2b9a012260e7]
[aristarch:07957] [ 9] /usr/lib64/libcuda.so.1 [0x2b9a0126aae5]
[aristarch:07957] [10] /usr/lib64/libcuda.so.1 [0x2b9a011d1b8f]
[aristarch:07957] [11] /usr/lib64/libcuda.so.1(cuInit+0x4c) [0x2b9a01286cfc]
[aristarch:07957] [12] /usr/local/cuda/lib64/libcudart.so.3 [0x2b9a01ba51f5]
[aristarch:07957] [13] /usr/local/cuda/lib64/libcudart.so.3 [0x2b9a01ba659c]
[aristarch:07957] [14] /usr/local/cuda/lib64/libcudart.so.3(cudaThreadExit+0x3e) [0x2b9a01bc878e]
[aristarch:07957] [15] ../lmp_openmpi-20-s(CudaWrapper_Init+0x2d) [0x78fc1d]
[aristarch:07957] [16] ../lmp_openmpi-20-s(_ZN9LAMMPS_NS4Cuda9setDeviceEPNS_6LAMMPSE+0x392) [0x50df22]
[aristarch:07957] [17] ../lmp_openmpi-20-s(_ZN9LAMMPS_NS6LAMMPS4initEv+0xa) [0x645c5a]
[aristarch:07957] [18] ../lmp_openmpi-20-s(_ZN9LAMMPS_NS3Run7commandEiPPc+0x9ab) [0x7355fb]
[aristarch:07957] [19] ../lmp_openmpi-20-s(_ZN9LAMMPS_NS5Input15execute_commandEv+0x29fc) [0x63affc]
[aristarch:07957] [20] ../lmp_openmpi-20-s(_ZN9LAMMPS_NS5Input4fileEv+0x2e8) [0x637f08]
[aristarch:07957] [21] ../lmp_openmpi-20-s(main+0xad) [0x64985d]
[aristarch:07957] [22] /lib64/libc.so.6(__libc_start_main+0xf4) [0x303381d994]
[aristarch:07957] [23] ../lmp_openmpi-20-s(_ZNSt8ios_base4InitD1Ev+0x51) [0x47d239]
[aristarch:07957] *** End of error message ***
I suspect it is somehow connected to an implicit (and unintended) GPU initialisation prior to that code, due to memory allocations and/or copies (but I am relatively sure not by a kernel). Still, I wonder why it crashes at this stage and not earlier.
Cheers