I am trying to use cudaOccupancyMaxActiveBlocksPerMultiprocessor() with a function/kernel pointer I got from cuModuleGetFunction(), but the call fails with an invalid device function error.
I have reproduced the same behaviour with the vectorAddDrv sample.
I added the following line (plus an #include of cuda_runtime_api.h):
assert(cudaOccupancyMaxActiveBlocksPerMultiprocessor ( &blocks, vecAdd_kernel, 256, 0 ) == cudaSuccess);
I placed it right after the existing cuModuleGetFunction call (lines 114-115 in my sources):
checkCudaErrors(cuModuleGetFunction(&vecAdd_kernel, cuModule, "VecAdd_kernel"));
Running under cuda-gdb, I get the following warning:
warning: Cuda API error detected: cudaOccupancyMaxActiveBlocksPerMultiprocessor returned (0x62)
The code 0x62 corresponds to cudaErrorInvalidDeviceFunction, which is strange, as the kernel/function is valid and runnable.
As the documentation does not mention any corner cases, I assumed I could use cudaOccupancyMaxActiveBlocksPerMultiprocessor with a function/kernel pointer returned by cuModuleGetFunction. Is this not the case?
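To make the decoding of the error code explicit: 0x62 is 98 in decimal, which is the value of cudaErrorInvalidDeviceFunction in the CUDA 10.1 headers I am using. Below is a minimal standalone sketch of that lookup. The helper name runtimeErrorName is hypothetical and the table is a hand-written stand-in covering only two values, not the CUDA headers; in a real program cudaGetErrorName() from the runtime API should give the same name.

```cpp
#include <string>

// Hypothetical helper, NOT part of CUDA: maps a raw runtime error code to its
// enum name. Only two values are covered, numbered as in CUDA 10.1
// (assumption: other toolkit versions may number them differently).
std::string runtimeErrorName(unsigned int code) {
    switch (code) {
        case 0:
            return "cudaSuccess";
        case 98:  // 0x62, the value reported by cuda-gdb
            return "cudaErrorInvalidDeviceFunction";
        default:
            return "unknown (not covered by this sketch)";
    }
}
```

Calling runtimeErrorName(0x62) yields the name cudaErrorInvalidDeviceFunction, matching the warning above.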
(I am using CUDA 10.1.)
Thanks in advance.