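CUDA kernel launches are asynchronous with respect to the host: the launch call returns right away, before the kernel has finished. The host only waits for the GPU work to complete when you synchronize explicitly, either on the stream (cudaStreamSynchronize()) or on the whole device (cudaDeviceSynchronize()).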
If you are using NumPy for your operations today, you may want to look into CuPy, which provides a largely NumPy-compatible array API that runs on CUDA GPUs.
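As a rough sketch of what that looks like in practice, assuming CuPy is installed and a CUDA device is visible (it falls back to NumPy otherwise so it still runs): the array operations are queued on the GPU and the explicit synchronize is what makes the host wait. The names xp, on_gpu and sum_of_squares are made up for this example.

```python
import numpy as np

try:
    import cupy as cp
    cp.cuda.runtime.getDeviceCount()  # raises if no usable CUDA device
    xp, on_gpu = cp, True
except Exception:
    # Assumption for this sketch: fall back to NumPy when CuPy/GPU is missing.
    xp, on_gpu = np, False


def sum_of_squares(n):
    """Sum of i*i for i in [0, n), computed with whichever backend is active."""
    x = xp.arange(n, dtype=xp.float64)
    y = (x * x).sum()  # with CuPy, this work is queued on the current stream
    if on_gpu:
        # Block the host until everything queued on the default stream is done.
        cp.cuda.Stream.null.synchronize()
    return float(y)  # converting to a Python float copies the result to the host


print(sum_of_squares(4))
```

If you time GPU code, put the synchronize before you read the clock; otherwise you mostly measure how long it took to queue the work, not to run it.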