Diagnosing launch failures due to resource constraints

Is there a way to ask CUDA why a kernel launch failed with an out-of-resources error?

I know that register use is often the culprit, but in my case the kernel uses 63 registers per thread, and it still fails on a compute capability 2.1 device with a block size of (32, 1, 1). I’ve used the occupancy calculator, and everything seems okay, so either there’s a corner case it doesn’t cover, or I’m using it wrong.

So I’d like a way to ask the system which resource limit was actually exceeded. Is that possible?
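
For context, here is a minimal sketch of the kind of manual check that seems possible through the runtime API. It prints what the compiler recorded for the kernel (`cudaFuncGetAttributes`) next to the device limits (`cudaGetDeviceProperties`), then inspects the launch status. `myKernel`, its body, and the use of device 0 are hypothetical placeholders, not the real code. As far as I can tell this only reports the numbers; it does not say which limit was exceeded.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical stand-in for the kernel that fails to launch.
__global__ void myKernel(float *out)
{
    out[threadIdx.x] = 2.0f * threadIdx.x;
}

int main()
{
    const dim3 block(32, 1, 1);   // block size from the question
    const dim3 grid(1, 1, 1);

    // Per-kernel resource usage as recorded at compile time.
    cudaFuncAttributes attr;
    cudaError_t err = cudaFuncGetAttributes(&attr, myKernel);
    if (err != cudaSuccess) {
        printf("cudaFuncGetAttributes: %s\n", cudaGetErrorString(err));
        return 1;
    }

    // Limits of the device (assumption: device 0 is the one in use).
    cudaDeviceProp prop;
    err = cudaGetDeviceProperties(&prop, 0);
    if (err != cudaSuccess) {
        printf("cudaGetDeviceProperties: %s\n", cudaGetErrorString(err));
        return 1;
    }

    printf("registers per thread      : %d\n", attr.numRegs);
    printf("registers per block (need): %u\n", attr.numRegs * block.x * block.y * block.z);
    printf("registers per block (max) : %d\n", prop.regsPerBlock);
    printf("static shared mem (need)  : %zu bytes\n", attr.sharedSizeBytes);
    printf("shared mem per block (max): %zu bytes\n", prop.sharedMemPerBlock);
    printf("local mem per thread      : %zu bytes\n", attr.localSizeBytes);
    printf("max threads/block (kernel): %d\n", attr.maxThreadsPerBlock);
    printf("max threads/block (device): %d\n", prop.maxThreadsPerBlock);

    float *d_out = NULL;
    err = cudaMalloc(&d_out, block.x * sizeof(float));
    if (err != cudaSuccess) {
        printf("cudaMalloc: %s\n", cudaGetErrorString(err));
        return 1;
    }

    myKernel<<<grid, block>>>(d_out);

    // Errors in the launch configuration show up here, right after the launch.
    err = cudaGetLastError();
    if (err == cudaErrorLaunchOutOfResources) {
        printf("launch failed: out of resources (%s)\n", cudaGetErrorString(err));
    } else if (err != cudaSuccess) {
        printf("launch failed: %s\n", cudaGetErrorString(err));
    }

    // Errors raised while the kernel runs show up after synchronization.
    err = cudaDeviceSynchronize();
    if (err != cudaSuccess) {
        printf("execution failed: %s\n", cudaGetErrorString(err));
    }

    cudaFree(d_out);
    return 0;
}
```

Compiling with `nvcc -Xptxas -v` prints similar per-kernel register and shared-memory figures at build time, which can be cross-checked against the runtime values. The simple register product above ignores allocation granularity, so it is a lower bound rather than the exact figure the hardware uses.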