Obviously you would want to limit the use of printf() to maybe just one or at most a few threads, e.g. just have thread index 0 print. It is also possible to increase the size of the buffer used by device-side printf().
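To illustrate, here is a minimal sketch of both ideas; the kernel name, arguments, and the 16 MB buffer size are placeholders, not anything from your code. Gating on `blockIdx.x == 0 && threadIdx.x == 0` restricts output to a single thread, and `cudaDeviceSetLimit()` with `cudaLimitPrintfFifoSize` enlarges the FIFO backing device-side `printf()` (the default is on the order of 1 MB):

```cpp
#include <cstdio>
#include <cuda_runtime.h>

__global__ void kernel(const float *data, int n)
{
    // Only one thread in the entire grid prints, to keep output manageable
    if ((blockIdx.x == 0) && (threadIdx.x == 0)) {
        printf("n=%d  data[0]=%f\n", n, data[0]);
    }
    // ... rest of kernel ...
}

int main(void)
{
    // Must be called before the kernel launch that uses printf()
    cudaDeviceSetLimit(cudaLimitPrintfFifoSize, 16 * 1024 * 1024);
    // ... allocate, copy, launch kernel, etc. ...
    return 0;
}
```

Note that output from device-side `printf()` is flushed at synchronization points, so a `cudaDeviceSynchronize()` after the launch helps ensure you actually see it.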
In addition, it is probably a good idea to reduce the problem size, and thus presumably the grid configuration. Likewise, reduce the iteration count. You might also be able to selectively disable pieces of code with #ifdef and narrow down the location of the problem by bisecting the code in this manner. Or change the random numbers used in your code to well-known fixed numbers during debugging.
I would focus on the out-of-bounds access for now, as you already know that this problem exists in the code. Out-of-bounds accesses that are not of the off-by-1 type are often the result of operating on huge array indices, caused by mixing signed and unsigned data in index computations, for example.
Not sure how you got to this state of affairs. It may boil down to bugs in your code, bugs in the compiler, bugs in cuda-memcheck, or a combination thereof. Given that the CUDA development tools are quite robust these days, my initial hypothesis would be that the source of the trouble is bugs in the code itself.
My recommended software development strategy (not just for CUDA, but in general) is to develop test scaffolding concurrently with, or ahead of, the code, start small, and build the code base incrementally with continuous, automated test coverage. That way any bugs can often be limited to the last code increment, and one avoids the issue of having to track down a bug “de novo” in a code base of several thousand lines or more. Even if the worst case occurs, it is possible to find the bug(s) with a systematic approach using classical debugging techniques, it may just be pretty painful. Been there, done that, got the t-shirt :-)