I think again that this is a memory issue. I suspect the compiler (or runtime, or whatever) isn't catching that I'm accessing an unallocated address. Indeed, if I read the Nth, (N+1)th, etc., element of the length-N array (living in device memory) that I pass to the kernel, no error is thrown. My guess is that the placement of these arrays in device memory is only occasionally close enough to "the edge" of an allocation to trigger an error. Consistent with this, increasing N by a large factor raises the rate of nvprof failures to nearly 100%.
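For concreteness, here is a stripped-down sketch of the kind of access I mean (this is not my actual code; the kernel and variable names are placeholders):

```cuda
// probe.cu -- reading past the end of a length-N device array
// often does NOT fault, presumably because the allocation is
// padded to some granularity by the driver.
__global__ void probe(const float* d_arr, float* d_out, int N)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= N)
        // out-of-bounds read: frequently "succeeds" silently,
        // only occasionally landing outside the padded region
        d_out[0] = d_arr[i];
}
```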
I have no idea how this is happening. I have wrapped the body of my kernel in a conditional that guarantees every accessed index is less than N, and it makes no difference. I don't see how I can 1) be able to read unallocated memory, 2) have no threads actually touching these illegal indices, and yet 3) still have nvprof crash randomly.
For context, I am doing a 3D integration of partial differential equations. The change that induced this behavior was enlarging the thread block to cover all the elements needed by the spatial derivatives (stencils), while only updating the threads enclosed by this "halo." I believe this is the standard/optimal practice (it certainly beats manually loading the halo elements, as I was doing before), and yes, I am certain that my blocks have the correct dimensions; the dimensions of my grid, however, are set as if the blocks did not include the halo. I have additionally verified that gridDim.x times the dimension of the inner (enclosed) block equals the dimension of the full lattice.
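Schematically, my new indexing looks like the following, reduced to 1D (identifiers and the stencil itself are placeholders, not my actual code):

```cuda
#define HALO   1                     // stencil radius
#define INNER  8                     // interior points per block
#define BDIM   (INNER + 2 * HALO)    // launched block dimension

// Launched as: step<<<nx / INNER, BDIM>>>(in, out, nx);
// i.e. the grid is sized as if the blocks did not include the halo,
// so neighboring blocks' halo regions overlap.
__global__ void step(const float* in, float* out, int nx)
{
    __shared__ float tile[BDIM];

    // Each block's interior tiles the lattice; halo threads hang
    // one element off either side of the interior.
    int gx = blockIdx.x * INNER + (int)threadIdx.x - HALO;
    gx = max(0, min(gx, nx - 1));    // clamp at the domain boundary

    tile[threadIdx.x] = in[gx];      // every thread (halo included) loads
    __syncthreads();

    // Only the interior threads update; halo threads exist solely
    // to populate shared memory.
    if (threadIdx.x >= HALO && threadIdx.x < BDIM - HALO)
        out[gx] = 0.5f * (tile[threadIdx.x - 1] + tile[threadIdx.x + 1]);
}
```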
Any insight or suggestions would be greatly appreciated.