You know the old saying: “MCVE or it didn’t happen!” :-)
Seriously, it is totally unclear (to me, anyhow) under what circumstances this occurs, and I can’t recall having seen this before. It is possible that there are code generation errors in the backend that occasionally construct illegal machine instruction encodings (perhaps due to an uninitialized variable or an out-of-bounds access inside PTXAS), but without a repro case it seems impossible to diagnose whether that is the case or not.
I assume that by “trouble even finding where it’s failing” you mean that you have trouble finding the specific instruction inside a particular kernel that triggers the exception, rather than identifying which kernel is affected?
If you run a release build under the CUDA debugger, wouldn’t it be able to at least pinpoint the general region of kernel code where this illegal instruction occurs?
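Something along these lines might at least narrow it down (flags and names are placeholders for whatever your build actually uses):

    nvcc -O3 -lineinfo -o app app.cu
    cuda-gdb ./app
    (cuda-gdb) run
    (cuda-gdb) info cuda kernels
    (cuda-gdb) backtrace

-lineinfo keeps the release optimizations but adds source-line correlation, so when the debugger stops on the device exception it should at least report the offending kernel and an approximate source location.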
Have you tried disassembling the SASS for the kernel in question? If the root cause is a bad instruction encoding produced by the compiler backend, one would expect either the disassembler to complain about it, or the assembly language output to contain a noticeably “weird” instruction of some kind.
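If you have the executable handy, cuobjdump can pull the SASS out of the embedded fatbinary, and nvdisasm can disassemble a standalone cubin (file names below are placeholders):

    cuobjdump --dump-sass ./app > app.sass
    nvdisasm kernel.cubin > kernel.sass

If the encoding itself is bad, I would expect the disassembler either to error out or to print something obviously out of place near the faulting address.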
How reproducible is this issue? Is there a possibility that it is a transient error caused by a flipped bit in some memory location? Also, is the app maybe using the driver interface to download code to the GPU manually, and is it possible it could be loading a corrupted binary image?
I can run into these in sass programming if I set an illegal stall count (like trying to dual-issue an instruction that can’t be) or if the generated opcode is bad. I can’t recall how these behave under cuda-memcheck.
You could try stepping through the code in the sass debugger.
Or if you have code that reliably generates this bug, I’d just submit it to NVIDIA. It’s probably an issue with ptxas.
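If it reproduces reliably, it is also a cheap experiment to run it under the memory checker first and see whether it reports a device-side hardware exception and for which launch (app is a placeholder):

    cuda-memcheck ./app
    compute-sanitizer ./app     (the newer replacement, if your toolkit ships it)

That won’t point at the exact instruction, but it narrows down which kernel to look at before stepping through it in the sass debugger.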
It’s good to read that you were able to track down the root cause of this, but I am a bit puzzled how an errant SMEM pointer triggers an “illegal instruction” exception.
The kernel was mixing a PTX “named barrier” [ bar.sync(barrier,threads) ] with a regular __syncthreads(), and it looks like barrier 0 was used in both cases, which is not what I would ever have wanted.
Misuse of bar.sync() that results in bad arrival counts is probably a hard error under certain conditions.
The kernels are generated by a program, and only this particular configuration was emitting the experimental bar.sync() PTX.
I hadn’t run this code in a long, long time and totally missed that this code was being emitted.
Replacing the experimental bar.sync(b,t) ops with __syncthreads() fixed the problem (for real this time).
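For anyone who runs into the same thing, here is a rough reconstruction of the conflict (a minimal sketch, not the actual generated kernel): the inline-PTX named barrier and __syncthreads() were both landing on barrier 0, with different expected arrival counts.

    // Minimal reconstruction of the bug, not the real generated kernel.
    // Compile with something like: nvcc -arch=sm_70 barrier_mix.cu

    // Named-barrier sync on barrier `bar` with `count` expected threads
    // (count must be a multiple of the warp size).
    __device__ __forceinline__ void named_bar_sync(unsigned bar, unsigned count)
    {
        asm volatile("bar.sync %0, %1;" :: "r"(bar), "r"(count));
    }

    __global__ void mixed_barriers(int *out)
    {
        // BUG: half the block syncs on barrier 0 with a partial count...
        if (threadIdx.x < blockDim.x / 2)
            named_bar_sync(0, blockDim.x / 2);

        // ...but __syncthreads() is also barrier 0, expecting the whole
        // block to arrive, so the counts for barrier 0 no longer match.
        __syncthreads();

        out[threadIdx.x] = (int)threadIdx.x;
    }

    // Fix used here: drop the named barrier and use __syncthreads()
    // everywhere (or at least move the partial sync to a non-zero
    // barrier id, e.g. named_bar_sync(1, blockDim.x / 2)).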