Random kernel execution failure with "unknown error" (CUDA programming on Linux)

Ladies & Gentlemen around here,

I have the following problem with using CUDA.

Recently, I’ve ported an SPH code to the GPU; namely, the parts which evaluate gravity and SPH forces. Gravity is currently computed as a particle-particle interaction, which has complexity O(N^2), while SPH forces are done in the following way: first, nearest neighbours (ngb) are found by means of a kD-tree, and then these ngb are used to compute the forces.

The kD-tree is built on the host, but the walk is done on the GPU. Once a list of ngb for each particle has been built in device memory, I run the SPH loops to compute density, forces and other quantities which are only influenced by the ngb. The speed-up is great, up to 10x even without a cache-optimised kD-tree.

Apart from all these great numbers, I have a huge problem. The code is unstable in the following sense: now and then CUDA throws something like this after kernel execution: “Cuda error: Kernel execution failure! xxx.cu line xx unknown error” (I call CUT_CHECK_ERROR(“Kernel execution failure!”) after each kernel). The strange part is that these errors are random in nature and are not reproducible. That is, if I take a system state on which the kernel fails and run it again, the error disappears and the system evolves further until another kernel failure occurs, and so forth.
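
For reference, the check after each kernel launch boils down to roughly the following; this is only a sketch, and dummy_kernel plus the surrounding scaffolding are placeholders, not my actual code:

[code]
// Minimal sketch of a per-launch check like CUT_CHECK_ERROR:
// synchronise, read the last error, and abort on failure.
#include <cuda_runtime.h>
#include <cstdio>
#include <cstdlib>

__global__ void dummy_kernel(float *x, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= 2.0f;
}

void check_kernel(const char *msg, const char *file, int line)
{
    cudaThreadSynchronize();               // kernel errors only surface after a sync
    cudaError_t err = cudaGetLastError();
    if (err != cudaSuccess) {
        fprintf(stderr, "Cuda error: %s %s line %d: %s\n",
                msg, file, line, cudaGetErrorString(err));
        exit(EXIT_FAILURE);
    }
}

int main()
{
    const int n = 1024;
    float *d_x = NULL;
    cudaMalloc((void **)&d_x, n * sizeof(float));
    dummy_kernel<<<(n + 255) / 256, 256>>>(d_x, n);
    check_kernel("Kernel execution failure!", __FILE__, __LINE__);
    cudaFree(d_x);
    return 0;
}
[/code]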

As a result the code is quite unstable, but the possibility of restarting it from the last snapshot allows me to evolve the system.

I would appreciate any help that could allow me to localise & eliminate the error. Is there something wrong with my code, or is this a CUDA driver error, etc.? Incidentally, the code runs great, without any breaks, in device-emulation mode, though 10x slower.

I can give a partial listing of the code, if necessary. If so, please let me know which parts would be most suitable.

System config: Debian 4, kernel 2.6.21, C2D 3.0 GHz, x86_64, CUDA 1.1, no X running, 8800 Ultra.

Thank you all for your help!

Cheers,
Evghenii

About the behavior you are seeing: welcome to the club: http://forums.nvidia.com/index.php?showtopic=59188

If you can create a minimal test case and post the code here, that would be great. See my test case in that forum post: I just call the same kernel with the same data 100,000 times and see how long it is before the error occurs.
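
Something along these lines, as a rough sketch; the kernel here is a trivial placeholder rather than the one from my actual test case:

[code]
// Stress-test sketch: launch the same kernel with the same data many times
// and report the first iteration at which an error shows up.
#include <cuda_runtime.h>
#include <cstdio>

__global__ void same_kernel(float *x, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] = x[i] * 1.0001f + 1.0f;
}

int main()
{
    const int n = 1 << 20;
    float *d_x = NULL;
    cudaMalloc((void **)&d_x, n * sizeof(float));
    cudaMemset(d_x, 0, n * sizeof(float));

    for (int iter = 0; iter < 100000; iter++) {
        same_kernel<<<(n + 255) / 256, 256>>>(d_x, n);
        cudaThreadSynchronize();
        cudaError_t err = cudaGetLastError();
        if (err != cudaSuccess) {
            printf("first failure at iteration %d: %s\n",
                   iter, cudaGetErrorString(err));
            break;
        }
    }
    cudaFree(d_x);
    return 0;
}
[/code]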

Still, it might be a bit premature to assume you are seeing the same CUDA bug. Writing past the end of a memory array can cause this behavior too. You can check your application by running it through valgrind in emulation mode.
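
Just to illustrate what I mean by writing past the end of an array: a kernel launched with more threads than there are elements needs a bounds check, otherwise the extra threads scribble past the allocation and the damage can show up later as exactly this kind of error (placeholder kernels, not from any real code):

[code]
// The launch configuration usually rounds the thread count up to a multiple
// of the block size, so some threads get indices i >= n.
__global__ void scale_bad(float *x, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    x[i] *= 2.0f;              // BUG: threads with i >= n write out of bounds
}

__global__ void scale_ok(float *x, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= 2.0f;   // guard keeps every access inside the array
}
[/code]

In emulation mode the same out-of-bounds access becomes an ordinary host memory access, which is why valgrind can catch it there.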

Also, how different is the neighbor list in SPH from Molecular Dynamics? You may be interested in our paper where we generate the neighbor list for MD on the GPU. It isn’t published yet, but you can get the preprint from the journal http://dx.doi.org/10.1016/j.jcp.2008.01.047 or our website: http://www.ameslab.gov/hoomd/.

Thanks for quick reply!

It’s not a trivial task to post a simple example code similar to the one you’ve posted on this forum. The reason is that I have the kd-tree built on the host, while the tree walk is done on the GPU; but I’ll see what I can do.

However, the code seems to perform quite well, i.e. without crashes, in an artificial situation. That is, if I use the last snapshot prior to the crash and execute ~10k iterations of the GPU part (including kd-tree construction and tree walk + SPH & gravity forces + slight randomisation), no problems occur at all. The loop of 10k iterations completes successfully.

In a real simulation, however, the CUDA part of the code crashes once every ~100 iterations.

[quote]

Moreover, on one other PC with an 8800GTX, the code runs like a charm without any problems (Dual Xeon 3.4 GHz, HT enabled, Debian 4, kernel 2.6.18, i686, CUDA 1.0);

[/quote]

Update: actually, the code does crash there too, but not with “Cuda error: unknown error”; instead it crashes with SIGSEGV. Here is the GDB output:

Advancing on the GPU: first half <<<<<

  sizeof(cuda_sph_body)= 96

    building kd-tree ...  done in 0.00573802 sec

    copying data to the device ...  done in 0.00105 sec

      solve_range ...

Program received signal SIGSEGV, Segmentation fault.

[Switching to Thread -1214511424 (LWP 12256)]

0xb7db107e in cuTexRefSetAddress () from /usr/local/cuda/lib/libcuda.so

(gdb) back

#0 0xb7db107e in cuTexRefSetAddress () from /usr/local/cuda/lib/libcuda.so

#1 0xb7d9f6bf in cuTexRefSetAddress () from /usr/local/cuda/lib/libcuda.so

#2 0xb7da60bf in cuTexRefSetAddress () from /usr/local/cuda/lib/libcuda.so

#3 0xb7d4de40 in cudaMemcpyToSymbol () from /usr/local/cuda/lib/libcudart.so

#4 0xb7d4a00f in cudaLaunch () from /usr/local/cuda/lib/libcudart.so

#5 0x0809f665 in host_solve_range ()

#6 0x08096eb9 in GPUsph_advance_first ()

#7 0x0804b023 in advance () at advance.f:55

#8 0x080609b1 in mainit () at main.f:155

#9 0x08060299 in main$main_$BLK () at main.f:63

It appears that the main Fortran code does something bad, which results in a crash of the CUDA part. I will try to debug it to figure out whether this is true.

This perhaps explains why, when the code crashes with “unknown error”, no Xid message is reported in dmesg, such as Xid 13 when a segfault occurs on the GPU (the system on which the code crashes is a C2D 3.0 GHz, Debian 4, kernel 2.6.21, x86_64, 8800 Ultra, CUDA 1.1, NVIDIA-x86_64-169.04 driver).

Evghenii

PS: There is no difference between neighbour lists in MD & SPH. That is, I need to find the neighbours (ngb) of a particle within a sphere of a given radius, and I could also use a grid-based search as described in your paper. The difference is that in MD the size of the computational box stays roughly the same, I guess. In SPH, when applied to astrophysical problems, this is not true, and some particles can be far away from the system, not to mention the large density contrasts in various parts of the computational volume. In other words, the size of the computational box grows with time, and grid-based ngb search routines become incredibly slow (10x slower if not more, even allowing for multi-grid ngb search algorithms). The kd-tree approach, though inherently slower than a grid-based one for a fixed computational volume, becomes far superior when the size of the box increases and the gas distribution becomes highly asymmetric and non-uniform in density.
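
For clarity, by ngb search I mean a plain fixed-radius range query on the kd-tree; the following host-side sketch shows the idea, though the node layout and names here are made up and not taken from my code:

[code]
// Fixed-radius neighbour query on a kd-tree: collect all particles within
// distance h of the query point x.
#include <vector>
#include <cmath>

struct KDNode {
    float  pos[3];          // position of the particle stored at this node
    int    particle;        // particle index
    int    axis;            // splitting axis: 0, 1 or 2
    KDNode *left, *right;
};

void range_search(const KDNode *node, const float x[3], float h,
                  std::vector<int> &ngb)
{
    if (!node) return;

    // test the particle stored at this node
    float dx = node->pos[0] - x[0];
    float dy = node->pos[1] - x[1];
    float dz = node->pos[2] - x[2];
    if (dx*dx + dy*dy + dz*dz <= h*h)
        ngb.push_back(node->particle);

    // the signed distance from the query point to the splitting plane decides
    // whether one or both subtrees have to be visited
    float d = x[node->axis] - node->pos[node->axis];
    range_search(d <= 0.0f ? node->left  : node->right, x, h, ngb);   // near side
    if (std::fabs(d) <= h)                                            // sphere crosses the plane
        range_search(d <= 0.0f ? node->right : node->left, x, h, ngb);
}
[/code]

On the device the walk has to be expressed without recursion, e.g. with an explicit stack, but the pruning logic is the same.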

Snipped due to updated post above :)

Further progress, which perhaps will be of help in solving the problem.

I ran the code through valgrind. The output is presented below. Does anybody have any idea what the “invalid read of size 8” is, and why it occurs in the middle of the simulation?

Could this problem be generated by the kernel running on the GPU, even though there is no Xid report in dmesg?

The error does not appear in emulation mode. Running the code in emulation mode generates zero problems, whether under valgrind or not.

Thanks for the help!

Evghenii

Advancing on the GPU: first half <<<<<
sizeof(cuda_sph_body)= 96
building kd-tree … done in 0.711116 sec
copying data to the device … done in 0.00889897 sec
solve_range … ==22462==
==22462== Invalid read of size 8
==22462== at 0x13FD7EC0: (within /usr/lib/libcuda.so.169.04)
==22462== by 0x13FE2EB6: (within /usr/lib/libcuda.so.169.04)
==22462== by 0x13FCB340: (within /usr/lib/libcuda.so.169.04)
==22462== by 0x13FC2250: cuLaunchGridAsync (in /usr/lib/libcuda.so.169.04)
==22462== by 0x1433FD86: (within /usr/local/cuda/lib/libcudart.so.1.1)
==22462== by 0x1432FAF8: cudaLaunch (in /usr/local/cuda/lib/libcudart.so.1.1)
==22462== by 0x45BB85: host_solve_range (in /home/egaburov/simulations/simtest/GPUsph_sph)
==22462== by 0x453096: GPUsph_solve_range(dev_data_struct&) (GPUsph_advance.cpp:606)
==22462== by 0x454C64: GPUsph_advance_first(int, double*, double*, double*, double*, double*, double*, double*, double*, double*, double*, double*, double*, double*, double*, double*, double*, double*, double*, int*, int*, int*) (GPUsph_advance.cpp:187)
==22462== by 0x4554B3: gpu_sph_advance_first_ (GPUsph_advance.cpp:648)
==22462== by 0x4048F0: advance_ (advance.f:55)
==22462== by 0x41C2B7: mainit_ (main.f:155)
==22462== Address 0x0 is not stack’d, malloc’d or (recently) free’d
==22462==
==22462== Invalid read of size 8
==22462== at 0x14D5F9AD: __cxa_begin_catch (in /usr/lib/libstdc++.so.6.0.8)
==22462== by 0x1432FEB2: cudaLaunch (in /usr/local/cuda/lib/libcudart.so.1.1)
==22462== by 0x45BB85: host_solve_range (in /home/egaburov/simulations/simtest/GPUsph_sph)
==22462== by 0x453096: GPUsph_solve_range(dev_data_struct&) (GPUsph_advance.cpp:606)
==22462== by 0x454C64: GPUsph_advance_first(int, double*, double*, double*, double*, double*, double*, double*, double*, double*, double*, double*, double*, double*, double*, double*, double*, double*, double*, int*, int*, int*) (GPUsph_advance.cpp:187)
==22462== by 0x4554B3: gpu_sph_advance_first_ (GPUsph_advance.cpp:648)
==22462== by 0x4048F0: advance_ (advance.f:55)
==22462== by 0x41C2B7: mainit_ (main.f:155)
==22462== by 0x41BA30: MAIN__ (main.f:63)
==22462== by 0x4041A1: main (in /home/egaburov/simulations/simtest/GPUsph_sph)
==22462== Address 0x7FEFFE310 is not stack’d, malloc’d or (recently) free’d
==22462==
==22462== Invalid read of size 8
==22462== at 0x14D5FA3E: __cxa_end_catch (in /usr/lib/libstdc++.so.6.0.8)
==22462== by 0x1432FEB7: cudaLaunch (in /usr/local/cuda/lib/libcudart.so.1.1)
==22462== by 0x45BB85: host_solve_range (in /home/egaburov/simulations/simtest/GPUsph_sph)
==22462== by 0x453096: GPUsph_solve_range(dev_data_struct&) (GPUsph_advance.cpp:606)
==22462== by 0x454C64: GPUsph_advance_first(int, double*, double*, double*, double*, double*, double*, double*, double*, double*, double*, double*, double*, double*, double*, double*, double*, double*, double*, int*, int*, int*) (GPUsph_advance.cpp:187)
==22462== by 0x4554B3: gpu_sph_advance_first_ (GPUsph_advance.cpp:648)
==22462== by 0x4048F0: advance_ (advance.f:55)
==22462== by 0x41C2B7: mainit_ (main.f:155)
==22462== by 0x41BA30: MAIN__ (main.f:63)
==22462== by 0x4041A1: main (in /home/egaburov/simulations/simtest/GPUsph_sph)
==22462== Address 0x7FEFFE310 is not stack’d, malloc’d or (recently) free’d
==22462==
==22462== Invalid read of size 8
==22462== at 0x14A9CA10: _Unwind_DeleteException (in /lib/libgcc_s.so.1)
==22462== by 0x1432FEB7: cudaLaunch (in /usr/local/cuda/lib/libcudart.so.1.1)
==22462== by 0x45BB85: host_solve_range (in /home/egaburov/simulations/simtest/GPUsph_sph)
==22462== by 0x453096: GPUsph_solve_range(dev_data_struct&) (GPUsph_advance.cpp:606)
==22462== by 0x454C64: GPUsph_advance_first(int, double*, double*, double*, double*, double*, double*, double*, double*, double*, double*, double*, double*, double*, double*, double*, double*, double*, double*, int*, int*, int*) (GPUsph_advance.cpp:187)
==22462== by 0x4554B3: gpu_sph_advance_first_ (GPUsph_advance.cpp:648)
==22462== by 0x4048F0: advance_ (advance.f:55)
==22462== by 0x41C2B7: mainit_ (main.f:155)
==22462== by 0x41BA30: MAIN__ (main.f:63)
==22462== by 0x4041A1: main (in /home/egaburov/simulations/simtest/GPUsph_sph)
==22462== Address 0x7FEFFE318 is not stack’d, malloc’d or (recently) free’d
done in 0.076726 sec
Cuda error: Kernel execution failed! in file ‘host_advance.cu’ in line 78 : unknown error.

It looks like there is some bug in cuLaunchGridAsync in CUDA 1.1 that you are triggering. I would contact an NVIDIA employee or someone with developer access to file a bug.

Interesting progress on your incarnation of this bug. I don’t think I ever tried to run my code through valgrind when not in emulation mode; I will have to do so.

And making a minimal test case is hard, so I understand. The one I posted took quite a while to come up with. Some very simple changes to the kernel can make the problem disappear.

I can see how a grid-based approach wouldn’t work in astrophysics. The sheer range of length scales would probably cause you to run out of memory creating the grid, so I can see why the k-d tree is preferable.

I rewrote the integration part of the legacy Fortran 77 code in C++, and the bug is gone. Everything runs stably for thousands of iterations (the f77 version crashed systematically about every ~100 iterations).

I am not sure whether it is an f77 issue (linking f77 with nvcc) or not. Funny that everything works fine in device-emulation mode. If I solve the problem, I will post the solution here.

evghenii

Dear All,

After playing with CUDA for quite a while and writing half a dozen CUDA projects, I managed to hit this bug a few more times, and, hurrah, I figured out what causes it.

The bug is indeed somewhere in the CUDA library (version <= 1.1), and is triggered when there are too many calls to cudaMalloc/cudaFree. In some of my CUDA applications I allocate device memory on an as-needed basis, since the amount of memory to be allocated is not known at the beginning and varies; or, if the code in question is a library, memory is allocated/freed on every call to that library. The time spent allocating/deallocating memory is, however, a negligible fraction of the compute time.

After some time, this scenario triggers a SEGFAULT, which upon debugging brought me into libcudart.a. Once I made sure that cudaMalloc/cudaFree are called only at the very beginning/end of the application, these SEGFAULTs disappeared.
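
In other words, the workaround amounts to keeping one device buffer alive for the whole run and growing it only when necessary, instead of pairing a cudaMalloc/cudaFree with every call. Roughly like this; the names are illustrative and not from my actual code:

[code]
// Persistent device buffer: allocate once (or when it must grow) and reuse it,
// freeing only at program exit.
#include <cuda_runtime.h>
#include <cstdio>

static void  *d_buf      = NULL;   // persistent device buffer
static size_t d_buf_size = 0;      // its current capacity in bytes

void *get_device_buffer(size_t nbytes)
{
    if (nbytes > d_buf_size) {                     // grow only when needed
        if (d_buf) cudaFree(d_buf);
        if (cudaMalloc(&d_buf, nbytes) != cudaSuccess) {
            fprintf(stderr, "cudaMalloc of %lu bytes failed\n",
                    (unsigned long)nbytes);
            d_buf = NULL;
            d_buf_size = 0;
            return NULL;
        }
        d_buf_size = nbytes;
    }
    return d_buf;                                  // reused across calls
}

void release_device_buffer()                       // call once, at the very end
{
    if (d_buf) { cudaFree(d_buf); d_buf = NULL; d_buf_size = 0; }
}
[/code]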

One of these days, hopefully, I’ll find time to install CUDA v2.0 beta and check whether the bug is still there. The funny part is that the application has to run for quite a while before the SEGFAULT is triggered, which made the debugging very annoying.

To the NVIDIA developers: what kind of information could I provide you with, to help fix this?

Cheers,
Evghenii

If this reproduces with the CUDA 2.0 beta, then please provide a test app which reproduces the problem, along with an nvidia-bug-report.log.