Potential bug with cuda-memcheck, can someone verify? Program crashing on GPU initialisation under cuda-memcheck


When I wanted to debug a program of mine with cuda-memcheck, it kept crashing right at the GPU initialisation. After stripping everything down I ended up with the absolute minimum CUDA program that is possible (except for an empty main):

#include <stdio.h>

#include "cuda_runtime.h"

int main()
{
  cudaThreadSynchronize();
  printf("Done\n");
}

It's CUDA 4.0 with the 275.09.07 driver on Scientific Linux 6.0:

/usr/wrk/people9/chmu-tph/bug-check/cudamemcheck-init :> nvcc -g cuda-init-simple.cu

/usr/wrk/people9/chmu-tph/bug-check/cudamemcheck-init :> nvcc --version

nvcc: NVIDIA (R) Cuda compiler driver

Copyright (c) 2005-2011 NVIDIA Corporation

Built on Thu_May_12_11:09:45_PDT_2011

Cuda compilation tools, release 4.0, V0.2.1221

/usr/wrk/people9/chmu-tph/bug-check/cudamemcheck-init :> nvidia-smi -a | grep Driver

Driver Version                  : 275.09.07

    Driver Model

    Driver Model

    Driver Model

/usr/wrk/people9/chmu-tph/bug-check/cudamemcheck-init :> nvidia-smi -s

COMPUTE mode rules for GPU 0: 1

COMPUTE mode rules for GPU 1: 1

COMPUTE mode rules for GPU 2: 2

/usr/wrk/people9/chmu-tph/bug-check/cudamemcheck-init :> cuda-memcheck ./a.out



========= Error: process didn't terminate successfully

========= ERROR SUMMARY: 0 errors

/usr/wrk/people9/chmu-tph/bug-check/cudamemcheck-init :> ./a.out

Done

/usr/wrk/people9/chmu-tph/bug-check/cudamemcheck-init :> g++ --version

g++ (GCC) 4.4.4 20100726 (Red Hat 4.4.4-13)

Copyright (C) 2010 Free Software Foundation, Inc.

This is free software; see the source for copying conditions. There is

NO warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

/usr/wrk/people9/chmu-tph/bug-check/cudamemcheck-init :> cat /etc/issue

Scientific Linux release 6.0 (Carbon)

Can someone reproduce that behaviour?



Add return 0; as the last statement in main() - otherwise it's indeed correct that the process did not terminate successfully.
I'm surprised that your compiler does not complain about main being declared to return an int, yet having no return statement.


If you take a look at the output, the run using cuda-memcheck does not print "Done" as it does without cuda-memcheck. So I think it really crashes on the cudaThreadSynchronize(). Indeed, in the larger program I get a segfault message which is traced back to the cudaThreadSynchronize in the CUDA runtime and then to some CUDA library call.

Btw, adding "return 0" doesn't change anything; it's all the same output.



One more addition: the problem does not seem to occur on another machine of mine with two GTX 295s and an 8400 GS running CentOS 5.5, CUDA 3.2, and gcc 4.1.2-48.
It does occur on the other machine (or rather machines) using CUDA 3.2 and CUDA 4.0, either with Scientific Linux 6.0 (gcc 4.4.4-13) or CentOS 5.6 (gcc 4.1.2-50). Those machines have two GTX 470s and a GT 220. It doesn't matter whether I use compute-exclusive mode or not. A cudaSetDevice(x) in the code does not change the behaviour.
All are using the 275.09.07 driver.


Btw, here is the backtrace of the full program:

/usr/wrk/people9/chmu-tph/workspace/LAMMPS_CUDA/USER-CUDA/Examples :> mpirun -np 1 cuda-memcheck ../../../LAMMPS-Jul2011/src/lmp_openmpi-20-d-g -sf cuda < in.eam.cuda
LAMMPS (19 Jul 2011)


USER-CUDA mode is enabled

CUDA: Activate GPU

[perseus09:19943] *** Process received signal ***
[perseus09:19943] Signal: Segmentation fault (11)
[perseus09:19943] Signal code: Address not mapped (1)
[perseus09:19943] Failing at address: 0x170
[perseus09:19943] [ 0] /lib64/libpthread.so.0() [0x38c6a0f4c0]
[perseus09:19943] [ 1] /usr/lib64/libcuda.so.1(+0xfe287) [0x7f58917e2287]
[perseus09:19943] [ 2] /usr/lib64/libcuda.so.1(+0xf66ce) [0x7f58917da6ce]
[perseus09:19943] [ 3] /usr/lib64/libcuda.so.1(+0x16b0a8) [0x7f589184f0a8]
[perseus09:19943] [ 4] /usr/lib64/libcuda.so.1(+0x1602ee) [0x7f58918442ee]
[perseus09:19943] [ 5] /usr/lib64/libcuda.so.1(+0x160219) [0x7f5891844219]
[perseus09:19943] [ 6] /usr/lib64/libcuda.so.1(+0x17f6dc) [0x7f58918636dc]
[perseus09:19943] [ 7] /usr/lib64/libcuda.so.1(+0x176021) [0x7f589185a021]
[perseus09:19943] [ 8] /usr/lib64/libcuda.so.1(+0xccd5b) [0x7f58917b0d5b]
[perseus09:19943] [ 9] /usr/lib64/libcuda.so.1(+0x17cd51) [0x7f5891860d51]
[perseus09:19943] [10] /usr/local/cuda/lib64/libcudart.so.4(+0x206a6) [0x7f58914b36a6]
[perseus09:19943] [11] /usr/local/cuda/lib64/libcudart.so.4(+0x20bfd) [0x7f58914b3bfd]
[perseus09:19943] [12] /usr/local/cuda/lib64/libcudart.so.4(+0x215e4) [0x7f58914b45e4]
[perseus09:19943] [13] /usr/local/cuda/lib64/libcudart.so.4(+0x16826) [0x7f58914a9826]
[perseus09:19943] [14] /usr/local/cuda/lib64/libcudart.so.4(+0x926a) [0x7f589149c26a]
[perseus09:19943] [15] /usr/local/cuda/lib64/libcudart.so.4(cudaThreadSynchronize+0x137) [0x7f58914cd367]
[perseus09:19943] [16] ../../../LAMMPS-Jul2011/src/lmp_openmpi-20-d-g(CudaWrapper_Init+0x265) [0x7e85c5]
[perseus09:19943] [17] ../../../LAMMPS-Jul2011/src/lmp_openmpi-20-d-g(_ZN9LAMMPS_NS4Cuda11acceleratorEiPPc+0x3ee) [0x51952e]
[perseus09:19943] [18] ../../../LAMMPS-Jul2011/src/lmp_openmpi-20-d-g(_ZN9LAMMPS_NS5Input15execute_commandEv+0x18ef) [0x63f67f]
[perseus09:19943] [19] ../../../LAMMPS-Jul2011/src/lmp_openmpi-20-d-g(_ZN9LAMMPS_NS5Input4fileEv+0x2e8) [0x63d698]
[perseus09:19943] [20] ../../../LAMMPS-Jul2011/src/lmp_openmpi-20-d-g(main+0xad) [0x64f4ad]
[perseus09:19943] [21] /lib64/libc.so.6(__libc_start_main+0xfd) [0x38c621ec5d]
[perseus09:19943] [22] ../../../LAMMPS-Jul2011/src/lmp_openmpi-20-d-g(_ZNSt8ios_base4InitD1Ev+0x49) [0x47f2b9]
[perseus09:19943] *** End of error message ***
========= Error: process didn’t terminate successfully
========= ERROR SUMMARY: 0 errors

Two questions:


What happens if you execute cudaFree(0) before the cudaThreadExit call?

What happens in CUDA 4.0 if you replace the cudaThreadExit call with cudaDeviceExit?

  1. There is no cudaThreadExit call in the code, only cudaThreadSynchronize (in order to initialise the GPU at a defined point in the code).

  2. Same as 1.



The perils of tablet autocompletion. I didn’t notice the changes it was making while I was replying. I meant cudaDeviceSynchronize and cudaThreadSynchronize, but otherwise the questions still stand.

To expand a little - (1) tests whether doing something that explicitly establishes a context before trying to perform a synchronization call makes any difference, and (2) tests whether the new API calls behave differently from the now-deprecated ones. For what it is worth, I cannot reproduce the crash in cuda-memcheck running on 64-bit Linux (Ubuntu 10.04 LTS) with 4.0rc2 or 4.0 final.
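A minimal sketch of what (1) and (2) would look like in code - this is hypothetical test code in the spirit of the suggestion, not from the original poster; cudaFree(0) is a common idiom to force context creation, and basic error checking is added so a crash inside initialisation can be told apart from a returned error code:

```cuda
#include <cstdio>
#include "cuda_runtime.h"

int main()
{
    // (1) Explicitly establish the CUDA context before any sync call.
    cudaError_t err = cudaFree(0);
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaFree(0): %s\n", cudaGetErrorString(err));
        return 1;
    }

    // (2) Use the CUDA 4.0 replacement for the deprecated
    //     cudaThreadSynchronize.
    err = cudaDeviceSynchronize();
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaDeviceSynchronize: %s\n",
                cudaGetErrorString(err));
        return 1;
    }

    printf("Done\n");
    return 0;
}
```

Running this once directly and once under cuda-memcheck would show whether the crash depends on lazy context initialisation or on the deprecated API entry point.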

Using cudaFree or cudaDeviceSynchronize didn't change anything, but upgrading to the 275.21 driver did.
So it probably was a bug in the driver …


Hi ceearem. I am having trouble reproducing the issue you see as well. Does your issue recur with the 270.41 driver? Also, could you include more information about the system where this issue occurs: how much system memory is present, and how much video memory on each card?

The error does not occur under the 270.41 driver (I just tried it out: first going to 270.41 -> no error, then 275.09.07 -> error, then 275.21 -> no error).

System specs (ScientificLinux 6.0):

/usr/wrk/people9/chmu-tph/bug-check/cudamemcheck-init :> gcc --version

gcc (GCC) 4.4.4 20100726 (Red Hat 4.4.4-13)

/usr/wrk/people9/chmu-tph/bug-check/cudamemcheck-init :> nvcc --version

nvcc: NVIDIA (R) Cuda compiler driver

Copyright (c) 2005-2011 NVIDIA Corporation

Built on Thu_May_12_11:09:45_PDT_2011

Cuda compilation tools, release 4.0, V0.2.1221

/usr/wrk/people9/chmu-tph/bug-check/cudamemcheck-init :> uname -a

Linux perseus09.physik.tu-ilmenau.de 2.6.32-131.2.1.el6.x86_64 #1 SMP Thu Jun 2 09:49:26 CDT 2011 x86_64 x86_64 x86_64 GNU/Linux

/usr/wrk/people9/chmu-tph/bug-check/cudamemcheck-init :>


GPU 0:3:0

Product Name                : GeForce GTX 470

Device Id               : 6CD10DE

Total                   : 1279 MB

GPU 0:4:0

Product Name                : GeForce GTX 470

Device Id               : 6CD10DE

Total                   : 1279 MB

GPU 0:5:0

Product Name                : GeForce GT 220

Device Id               : A2010DE

Total                   : 511 MB

Compute mode doesn't matter (checked 0 0 0 and 1 1 2). The code crashes on all three GPUs.

It's not the cudaThreadSynchronize, btw: putting a single cudaMalloc in the code gives the same result (everything runs fine without cuda-memcheck, but crashes with cuda-memcheck [which does not report any error]).

Adding an explicit cudaSetDevice does not change the behaviour.

If you got any more questions let me know.


@tera In C99 and C++, if control reaches the end of main without returning a value, 0 is implicitly returned, and it is not undefined behavior (for C, see http://www.open-std.org/jtc1/sc22/wg14/www/docs/n1256.pdf; for C++, see http://eel.is/c++draft/basic.start.main#5).