NVPROF randomly fails (no kernels/API calls, code 139) on *the exact same code*

I’m running nvprof on my code, and roughly 70% of the time it quits with no kernels or API calls reported and “application received signal 139.” Sometimes it works fine. I am not oversubscribing the GPU’s global memory, shared memory, or registers. I’m literally compiling and running nvprof repeatedly without changing anything.

I can’t replicate this behavior in cuda-memcheck; it runs without a problem every time. And when I run my code sans memcheck or nvprof, it also runs fine.

I thought the issue was related to transfers from global (device) memory, but now I have no way of checking, because nvprof is not behaving deterministically. I was trying the good old comment-out-all-but-one-line method.

What on earth could cause this to happen?

I now think again that this is a memory issue. Somehow the compiler/runtime isn’t catching that I’m trying to access an unallocated address. Indeed, if I try to access the Nth, (N+1)th, etc., element of the length-N array (living in device memory) that I pass to the kernel, no error is thrown. My guess is that the placement of these arrays in device memory is occasionally close enough to “the edge” to throw errors. When I increase N by a large factor, nvprof fails almost 100% of the time.
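To illustrate what I mean, here is a minimal sketch (not my actual code; all names are made up) of the kind of access I’m describing: thread 0 reads one element past the end of a length-N device array, and at runtime no error is reported:

```
#include <cstdio>
#include <cuda_runtime.h>

__global__ void read_past_end(const float *d_arr, int N, float *d_out)
{
    // Thread 0 reads element N of a length-N array, i.e. one past the end.
    // cudaMalloc allocations are padded/aligned, so a small overrun like this
    // often returns garbage silently instead of faulting.
    if (blockIdx.x == 0 && threadIdx.x == 0)
        d_out[0] = d_arr[N];
}

int main()
{
    const int N = 1024;
    float *d_arr = nullptr, *d_out = nullptr;
    cudaMalloc(&d_arr, N * sizeof(float));
    cudaMalloc(&d_out, sizeof(float));

    read_past_end<<<1, 32>>>(d_arr, N, d_out);
    cudaError_t err = cudaDeviceSynchronize();
    printf("kernel status: %s\n", cudaGetErrorString(err));  // typically reports success for a small overrun

    cudaFree(d_arr);
    cudaFree(d_out);
    return 0;
}
```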

I have no idea how this is happening. I have wrapped the contents of my kernel in a conditional to ensure that every accessed array index is less than N. It makes no difference. I don’t know how I can 1) be able to access unallocated memory, 2) have no threads touching these illegal indices, but 3) still have nvprof crash randomly.
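For concreteness, the guard looks roughly like this (a 1D sketch with placeholder names, not my real kernel):

```
__global__ void update(float *d_field, int N)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;

    if (i < N) {             // only threads mapping to a valid index touch memory
        d_field[i] += 1.0f;  // placeholder update
    }
    // A guard like this only protects the accesses inside it; any reads done
    // outside it (e.g. cooperative loads of neighbour/halo elements into
    // shared memory before the guard) are not covered.
}
```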

For context, I am doing a 3D integration of partial differential equations, and the change that induced this behavior was enlarging the thread block to cover all elements needed by the spatial derivatives (stencils), while only updating the threads enclosed by this “halo.” I believe this is the standard/optimal practice (it sure beats manually loading the halo elements, as I was doing before), and yes, I am certain that my blocks have the correct dimension, but the dimension of my grid is set as if the blocks did not include the halo. I have additionally checked that gridDim.x * (dimension of the inner/enclosed block) equals the dimension of the full lattice.
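To make the pattern concrete, here is a 1D sketch (placeholder names and sizes, not my actual 3D kernel): the block is widened by a halo, every thread loads one element into shared memory, only the interior threads write a result, and the grid is sized from the interior tile:

```
#define TILE   256                  // interior points updated per block
#define RADIUS 1                    // stencil half-width (halo size)
#define BDIM   (TILE + 2 * RADIUS)  // actual blockDim.x, including the halo

__global__ void stencil_1d(const float *d_in, float *d_out, int n)
{
    __shared__ float s[BDIM];

    // The grid is laid out over the interior tile size; shift left by RADIUS
    // so the first/last RADIUS threads of each block load the halo neighbours.
    int g = blockIdx.x * TILE + threadIdx.x - RADIUS;

    // Clamp the load so halo threads at the domain edges stay in bounds.
    int gc = min(max(g, 0), n - 1);
    s[threadIdx.x] = d_in[gc];
    __syncthreads();

    // Only interior (non-halo) threads that map to valid points write output.
    if (threadIdx.x >= RADIUS && threadIdx.x < TILE + RADIUS && g < n) {
        d_out[g] = 0.25f * s[threadIdx.x - 1]
                 + 0.50f * s[threadIdx.x]
                 + 0.25f * s[threadIdx.x + 1];
    }
}

// Launch: grid sized from the interior tile, block sized including the halo.
// stencil_1d<<<(n + TILE - 1) / TILE, BDIM>>>(d_in, d_out, n);
```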

Any insight or suggestions would be greatly appreciated.

cuda-memcheck can help with illegal/out-of-bounds accesses, for both global and shared memory.

Like I said, memcheck runs perfectly every time.

If you access the (n+1)th element of a length-n array, cuda-memcheck will catch that. If cuda-memcheck throws no errors, it’s likely that your code is not making any out-of-bounds accesses.

If you google “application received signal 139” you may get some things worth reading.

Thanks, but I’ve read every single result on Google, and nothing has helped. From everything I’ve seen, nothing would make sense other than this being a bug in nvprof, but I’m no expert, of course.

Perhaps you should file a bug at developer.nvidia.com.

However, that is unlikely to go anywhere unless you can provide a reproducible test case.