Returning to host code in pyCUDA after asynchronous kernel launch [pyCUDA]

I am trying to launch a kernel in pyCUDA and then terminate the kernel by writing to a GPU global memory location. Here is a simple example kernel that I would like to be able to terminate at some point after it enters the infinite while loop:

__global__ void countUp(u16 *inShot, u64 *counter) {
  while(inShot[0]) {
    counter[0]++;
  }
}

From what I have read about streams in CUDA, I should be able to launch this kernel after creating a stream and it will be non-blocking on the host, ie. I should be able to do stuff on the host after this kernel is launched and is running. I compile the above kernel to a cubin file and launch it in pyCUDA like so:

import numpy as np
from pycuda import driver, compiler, gpuarray, tools
# -- initialize the device
import pycuda.autoinit

strm1 = driver.Stream()

h_inShot = np.zeros((1,1))
d_inShot = gpuarray.to_gpu_async(h_inShot.astype(np.uint16), stream = strm1)
h_inShot = np.ones((1,1))
h_counter = np.zeros((1,1))
d_counter = gpuarray.to_gpu_async(h_counter.astype(np.uint64), stream = strm1)

testCubin = "testKernel.cubin"
mod = driver.module_from_file(testCubin)
countUp = mod.get_function("countUp")

countUp(d_inShot, d_counter,
        grid = (1, 1, 1),
        block = (1, 1, 1),
        stream = strm1
        )

Running this script causes the kernel to enter an infinite while loop for obvious reasons. Launching this script from the ipython environment does not seem to return control to the host after the kernel launch (I can’t input new commands as I guess its waiting for the kernel to finish). I would like control to return to the host so that I can change the value in GPU global memory pointer d_inShot and have the kernel exit the while loop. Is this even possible and if so, how do I do it in pyCUDA? Thanks.

do you have additional device calls after the kernel launch - a supposed asynchronous memory copy from device to host, that is not acting asynchronously perhaps?

“I can’t input new commands as I guess its waiting for the kernel to finish”

can’t you input new commands, or does the input have no effect?
there is a difference

seems the answer is discovered already:

[url]python - Returning to host code in pyCUDA after asynchronous kernel launch - Stack Overflow