cuFuncSetBlockShape failure

cuFuncSetBlockShape(kernel, 32, 16, 1) fails with invalid value, and I can’t understand why.

I’m running a Quadro 2000:

Maximum number of threads per multiprocessor: 1536
Maximum number of threads per block: 1024
Maximum sizes of each dimension of a block: 1024 x 1024 x 64

At 32 x 16 x 1 = 512 threads, I seem to be well under the threads-per-block limit, and I am not violating the per-dimension limits.
What am I missing here?
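
For reference, the limit check described above can be written out explicitly. This is a minimal sketch in plain C++, assuming only the limits quoted in this post; `BlockLimits` and `blockShapeFits` are hypothetical names for illustration and are not part of the CUDA API. It covers only the shape limits listed above, nothing else the driver might validate.

```cpp
// Device limits as quoted above for the Quadro 2000.
struct BlockLimits {
    unsigned maxThreadsPerBlock;        // 1024
    unsigned maxDimX, maxDimY, maxDimZ; // 1024 x 1024 x 64
};

// Hypothetical helper: returns true when an (x, y, z) block shape
// respects both the per-dimension limits and the total thread limit.
bool blockShapeFits(const BlockLimits& lim, unsigned x, unsigned y, unsigned z) {
    if (x == 0 || y == 0 || z == 0) return false;
    if (x > lim.maxDimX || y > lim.maxDimY || z > lim.maxDimZ) return false;
    // Use a 64-bit product so large shapes cannot overflow.
    unsigned long long total = 1ULL * x * y * z;
    return total <= lim.maxThreadsPerBlock;
}
```

With the limits above, `blockShapeFits({1024, 1024, 1024, 64}, 32, 16, 1)` evaluates to true, which is why the failure is surprising.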

One possibility is that you’re getting an error from a previous (asynchronous) CUDA call. I’d advise adding error-checking code to every CUDA call to see whether the error originates upstream and is only being detected at this one.
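
As a rough sketch of what per-call checking could look like: the wrapper below reports the failing expression with its file and line. To keep it compilable without the CUDA toolkit, it uses a stand-in result enum and stand-in functions (`StubResult`, `stubEarlierCall`, `stubSetBlockShape`, `checkCall`, and `CHECK_CALL` are all hypothetical names); only the values 0 and 1 are taken from the real `CUDA_SUCCESS` and `CUDA_ERROR_INVALID_VALUE`.

```cpp
#include <cstdio>

// Stand-in for CUresult so the sketch builds without the CUDA headers.
// The values mirror CUDA_SUCCESS (0) and CUDA_ERROR_INVALID_VALUE (1).
enum StubResult { STUB_SUCCESS = 0, STUB_ERROR_INVALID_VALUE = 1 };

// Logs a failing call with its source text and location.
// Returns true when the call succeeded.
inline bool checkCall(int result, const char* expr, const char* file, int line) {
    if (result == 0) return true;
    std::fprintf(stderr, "%s:%d: %s failed with error code %d\n",
                 file, line, expr, result);
    return false;
}

// Wrap every call so a failure is attributed to the call that returned it.
#define CHECK_CALL(call) checkCall(static_cast<int>(call), #call, __FILE__, __LINE__)

// Hypothetical stand-ins for two driver calls; they are NOT CUDA functions.
inline StubResult stubEarlierCall() { return STUB_SUCCESS; }
inline StubResult stubSetBlockShape(int x, int y, int z) {
    return (x > 0 && y > 0 && z > 0) ? STUB_SUCCESS : STUB_ERROR_INVALID_VALUE;
}
```

In real code the same macro would wrap calls returning `CUresult`, such as `cuFuncSetBlockShape`. Putting a checked `cuCtxSynchronize()` just before the suspect call is one way to flush out errors from earlier asynchronous work.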

Also, I can’t help but notice that cuFuncSetBlockShape is deprecated (and that was from the driver API, wasn’t it?). Are you limited in your alternatives? IIRC, the runtime API is now fully interoperable with the driver API and is much more convenient to use.

Sure, the error check was my first thought, but no, it is indeed this call that is causing the error. The whole app is already a mix of driver and runtime API calls, but the class that contains this call generally works, and I don’t want to rewrite it because of one strange error. In any case, I’d like to make sure I’m not misunderstanding (yet again) some esoteric aspect of CUDA, even in a deprecated function.