Using cudaOccupancyMaxActiveBlocksPerMultiprocessor with a function acquired with cuModuleGetFunction

I am trying to use cudaOccupancyMaxActiveBlocksPerMultiprocessor() with a function/kernel pointer I got from cuModuleGetFunction(), but I am getting an invalid device function error.

I have reproduced the same error with the vectorAddDrv sample.

I added the following line (plus an include of cuda_runtime_api.h):

assert(cudaOccupancyMaxActiveBlocksPerMultiprocessor ( &blocks, vecAdd_kernel, 256, 0 ) == cudaSuccess);

I placed it right after the existing cuModuleGetFunction call (lines 114-115 in my sources):

checkCudaErrors(cuModuleGetFunction(&vecAdd_kernel, cuModule, "VecAdd_kernel"));

Running under cuda-gdb, I get the following warning:

warning: Cuda API error detected: cudaOccupancyMaxActiveBlocksPerMultiprocessor returned (0x62)

The code corresponds to cudaErrorInvalidDeviceFunction, which is strange, as the kernel/function is valid and runnable.

As the documentation does not mention any corner cases, I assumed I could use cudaOccupancyMaxActiveBlocksPerMultiprocessor with a function/kernel pointer returned from cuModuleGetFunction. Is this not the case?

(I am using CUDA 10.1)

Thanks in advance

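Why not use the driver API function for this? vectorAddDrv is a driver API code. Runtime API functions start with cuda…, while driver API functions start with cu… (but not cuda, of course). The driver API counterpart is cuOccupancyMaxActiveBlocksPerMultiprocessor, which takes the CUfunction handle returned by cuModuleGetFunction. It is documented here: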

https://docs.nvidia.com/cuda/cuda-driver-api/group__CUDA__OCCUPANCY.html#group__CUDA__OCCUPANCY