Accessing the cudaLaunchCooperativeKernel API from Python (PyCUDA, CuPy, etc.?)

To date I’ve run most of my CUDA kernels from PyCUDA, but I now need to use cooperative groups to synchronize across the grid, which requires launching through the cudaLaunchCooperativeKernel API.

Unfortunately, PyCUDA does not appear to support this, and support does not seem to be planned.

Is there another route I can take on this, and keep my host code in python?

Kernels launched into the same CUDA stream are serialized: the second cannot start until the first has finished. You could therefore split your kernel into two kernels at the sync point and launch them one after another.

i.e. instead of

__global__ void kernel(){
   //part A
   cooperative_groups::this_grid().sync();
   //part B
}
...
cudaLaunchCooperativeKernel(kernel, ...)

you can do

__global__ void kernelA(){
   //part A
}

__global__ void kernelB(){
   //part B
}
...
cudaLaunchKernel(kernelA)
cudaLaunchKernel(kernelB)
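Since the host code should stay in Python, the same two-kernel workaround can be sketched in PyCUDA. This is a minimal sketch with placeholder kernel bodies; the import is done lazily so the file can be loaded on a machine without a GPU.

```python
# Split-kernel workaround: two ordinary kernels launched into the same
# (default) stream are serialized, so no grid-wide sync is needed.
# Kernel bodies and names here are illustrative placeholders.
KERNELS = r'''
__global__ void kernelA(float* data) { /* part A */ }
__global__ void kernelB(float* data) { /* part B */ }
'''

def run_split(data_gpu, grid, block):
    # Imported lazily so this module can be inspected without a GPU present.
    import pycuda.autoinit  # noqa: F401 -- creates a CUDA context
    from pycuda.compiler import SourceModule

    mod = SourceModule(KERNELS)
    kernel_a = mod.get_function("kernelA")
    kernel_b = mod.get_function("kernelB")

    # Same stream: kernelB will not start until kernelA has finished,
    # which gives the same ordering guarantee as a grid sync between
    # part A and part B.
    kernel_a(data_gpu, grid=grid, block=block)
    kernel_b(data_gpu, grid=grid, block=block)
```

Note this only works when part B needs no data held in registers or shared memory from part A; anything carried across the split has to go through global memory.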

You can do pretty much any combination of python and CUDA using python ctypes. There are various examples, including showing kernel launches, although probably not any that show a cooperative kernel launch.
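Along those lines, here is a hedged sketch of what a ctypes call into cudaLaunchCooperativeKernel could look like. It assumes the CUDA runtime is available as libcudart.so and that `func` is a valid host pointer to a `__global__` function (e.g. exported from a shared library built with nvcc); the helper and struct names are illustrative.

```python
import ctypes

class Dim3(ctypes.Structure):
    # Mirrors CUDA's dim3 struct: three unsigned ints.
    _fields_ = [("x", ctypes.c_uint), ("y", ctypes.c_uint), ("z", ctypes.c_uint)]

def launch_cooperative(cudart, func, grid, block, kernel_args,
                       shared_mem=0, stream=None):
    """Calls cudaLaunchCooperativeKernel(func, gridDim, blockDim, args,
    sharedMem, stream) and returns the cudaError_t code as an int."""
    cudart.cudaLaunchCooperativeKernel.restype = ctypes.c_int
    cudart.cudaLaunchCooperativeKernel.argtypes = [
        ctypes.c_void_p, Dim3, Dim3,
        ctypes.POINTER(ctypes.c_void_p), ctypes.c_size_t, ctypes.c_void_p,
    ]
    # Build the void** args array: one pointer per kernel parameter.
    argv = (ctypes.c_void_p * len(kernel_args))(
        *[ctypes.cast(ctypes.pointer(a), ctypes.c_void_p)
          for a in kernel_args])
    return cudart.cudaLaunchCooperativeKernel(
        func, grid, block, argv, ctypes.c_size_t(shared_mem), stream)

# Usage (requires a CUDA install and a valid kernel pointer):
#   cudart = ctypes.CDLL("libcudart.so")
#   err = launch_cooperative(cudart, func, Dim3(32, 1, 1), Dim3(256, 1, 1),
#                            [ctypes.c_void_p(dev_ptr)])
```

The kernel itself must still be compiled with cooperative-groups support, and the grid size must not exceed what cudaOccupancyMaxActiveBlocksPerMultiprocessor reports, or the launch will fail.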

Thank you Both!