What's the fastest way for the CPU to update a variable in GPU memory while a kernel is running?

I tried pinned memory, but it is extremely slow because the GPU kernel periodically polls that variable. I also tried cudaMemcpyAsync on a different stream, but the running kernel does not seem to see the update while it is running.

I’ve successfully used cudaMemcpyAsync on a different stream for this, although it can be challenging if your device is in WDDM mode.
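For what it's worth, here is a minimal sketch of how that pattern might look. It is not a definitive implementation: the kernel name `wait_for_flag`, the variable names, and the value 42 are made up for illustration. It assumes the polled variable is read through a `volatile` pointer (so the compiler does not keep it in a register), that the host source buffer is pinned, and that both streams are created as non-blocking so the copy is free to overlap with the running kernel.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Abort with a message if a CUDA runtime call fails.
#define CHECK(call)                                                   \
    do {                                                              \
        cudaError_t err_ = (call);                                    \
        if (err_ != cudaSuccess) {                                    \
            fprintf(stderr, "CUDA error: %s (line %d)\n",             \
                    cudaGetErrorString(err_), __LINE__);              \
            exit(EXIT_FAILURE);                                       \
        }                                                             \
    } while (0)

// Hypothetical kernel: spins until the host writes a nonzero value to *flag.
// The volatile qualifier forces a fresh load from memory on every iteration.
__global__ void wait_for_flag(volatile int *flag, int *seen)
{
    while (*flag == 0) { /* poll */ }
    *seen = *flag;  // record the value the kernel observed
}

int main()
{
    int *d_flag = nullptr, *d_seen = nullptr, *h_value = nullptr;
    CHECK(cudaMalloc(&d_flag, sizeof(int)));
    CHECK(cudaMalloc(&d_seen, sizeof(int)));
    CHECK(cudaMemset(d_flag, 0, sizeof(int)));
    CHECK(cudaMemset(d_seen, 0, sizeof(int)));
    CHECK(cudaDeviceSynchronize());  // make sure the zeroing is done first

    // Pinned host buffer, so cudaMemcpyAsync can run asynchronously.
    CHECK(cudaMallocHost(&h_value, sizeof(int)));
    *h_value = 42;

    // Non-blocking streams do not implicitly synchronize with the default stream.
    cudaStream_t kernel_stream, copy_stream;
    CHECK(cudaStreamCreateWithFlags(&kernel_stream, cudaStreamNonBlocking));
    CHECK(cudaStreamCreateWithFlags(&copy_stream, cudaStreamNonBlocking));

    // Launch the polling kernel; it keeps running until the flag changes.
    wait_for_flag<<<1, 1, 0, kernel_stream>>>(d_flag, d_seen);
    CHECK(cudaGetLastError());

    // Update the flag from the host on the other stream while the kernel runs.
    CHECK(cudaMemcpyAsync(d_flag, h_value, sizeof(int),
                          cudaMemcpyHostToDevice, copy_stream));
    CHECK(cudaStreamSynchronize(copy_stream));

    // The kernel can only finish once it has seen the new value.
    CHECK(cudaStreamSynchronize(kernel_stream));

    int seen = 0;
    CHECK(cudaMemcpy(&seen, d_seen, sizeof(int), cudaMemcpyDeviceToHost));
    printf("kernel observed flag = %d\n", seen);

    CHECK(cudaStreamDestroy(kernel_stream));
    CHECK(cudaStreamDestroy(copy_stream));
    CHECK(cudaFreeHost(h_value));
    CHECK(cudaFree(d_flag));
    CHECK(cudaFree(d_seen));
    return 0;
}
```

If the copy does not overlap with the kernel on your setup, this program will hang in the final stream synchronize rather than print anything, which is one quick way to check whether the update is actually reaching the running kernel. In WDDM mode the driver may batch or delay submitted work, so the behavior there can differ.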