[SOLVED] Overhead launching a kernel in Python. Do you all experience it?

I wrote a shared library that I am now testing on different programming languages/environments.
A friend is testing it in Python. We open the shared library and call simple functions that report the compute capability and device name, which respond immediately, but when it comes to the actual computation, a kernel seems to take anywhere between 4.5 and 5 seconds to complete even when the dataset has just 4 float elements.

I was reading here: https://github.com/numba/numba/issues/3003
It looks like people experience this overhead to varying degrees. Since Python is not really my thing, maybe some of you have experience mitigating it (if that is at all possible), or have concluded that it simply is what it is; I'd like to hear your opinions. If you've done the same in MATLAB, I'd like to hear about that too.

I’ve done a fair amount of fiddling with numba, pycuda, and also with writing a shared library and interfacing with it via Python ctypes.
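For context, the ctypes route I'm referring to is nothing more than loading the .so and calling an exported C function. A minimal sketch, where the library path and the run_kernel export are made-up placeholders, not anything from your library:

import ctypes
import numpy as np

# Hypothetical library and exported function, purely for illustration
lib = ctypes.CDLL("./libmylib.so")

n = 4
a = np.arange(n, dtype=np.float32)
out = np.empty(n, dtype=np.float32)

# Pass raw float* pointers plus the element count to the C entry point
lib.run_kernel(a.ctypes.data_as(ctypes.POINTER(ctypes.c_float)),
               out.ctypes.data_as(ctypes.POINTER(ctypes.c_float)),
               ctypes.c_int(n))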

I’ve not witnessed anything like an arbitrary 4-5 second delay. The numba issue you linked is complaining about roughly 200 microseconds of launch overhead, nothing like 4-5 seconds.

The only thing that comes to mind is JIT compilation delay. Make sure that the library contains code for the relevant GPU architectures, to avoid JIT effects. The JIT effects can be quite bad if you are linking to a library such as CUBLAS, from an older CUDA version that does not have support for the GPU you are running on (e.g. CUDA 6, on a Pascal GPU, or CUDA 7 on a Volta GPU, or CUDA 8 on a Turing GPU). In that case the JIT delays can be quite long.
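A quick way to see whether this is a one-time JIT (or context creation) cost rather than per-launch overhead is to time the first call against subsequent calls from Python. A rough sketch, again with placeholder library and function names:

import ctypes
import time
import numpy as np

lib = ctypes.CDLL("./libmylib.so")  # placeholder library name
float_p = ctypes.POINTER(ctypes.c_float)

n = 4
a = np.arange(n, dtype=np.float32)
b = np.empty(n, dtype=np.float32)

for i in range(3):
    t0 = time.perf_counter()
    lib.run_kernel(a.ctypes.data_as(float_p), b.ctypes.data_as(float_p), ctypes.c_int(n))
    print("call {}: {:.3f} s".format(i, time.perf_counter() - t0))
# A slow first call followed by fast subsequent calls points to a one-time
# cost (JIT compilation, context creation), not to per-launch overhead.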

Thanks, Robert.
I am going to check my compilation to see whether it is missing code for the target GPU.
The toolkit I used is CUDA 9.1, compiled with clang 4 on Ubuntu 16.04, dynamically linking against glibc and statically against cuFFT and libstdc++.
It runs on RHEL 6.x (which required glibc 2.14 to be built separately, as the distribution natively ships 2.12) and Python 3.5, if I am not mistaken. I will investigate the code generation as you suggest; I couldn’t really find anything online reporting this much overhead, so there is definitely a problem somewhere.

It seems that we were using improper types and letting the interpreter try to guess/convert, so input and output float arrays were changed to:

import ctypes
import numpy as np

n = int(4e1)
a = np.arange(n, dtype=np.float32)  # explicit float32, matching the library's float*
b = np.ones(n, dtype=np.float32)
ac = a.ctypes.data_as(ctypes.POINTER(ctypes.c_float))  # raw float* pointers for the call
bc = b.ctypes.data_as(ctypes.POINTER(ctypes.c_float))

Then it worked fine.
I no longer have the wrong code to compare against, but it runs properly now.
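One more thing that might help anyone who runs into the same: declaring the function signature on the ctypes handle makes the expected types explicit, so mismatched arguments fail loudly instead of being silently converted. A rough sketch of what that could look like, where the library and function names are placeholders rather than our real exports:

import ctypes
import numpy as np

lib = ctypes.CDLL("./libmylib.so")  # placeholder name for the shared library

float_p = ctypes.POINTER(ctypes.c_float)
lib.run_kernel.argtypes = [float_p, float_p, ctypes.c_int]  # declare the C signature once
lib.run_kernel.restype = None

n = int(4e1)
a = np.arange(n, dtype=np.float32)
b = np.ones(n, dtype=np.float32)
lib.run_kernel(a.ctypes.data_as(float_p), b.ctypes.data_as(float_p), n)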