Assuming the kernel in question does not use the memory being copied, can I run a kernel while also copying data from device to host memory, and vice versa? Will there be any major performance penalty? Are there any other considerations (other than making sure not to copy data that kernels are working on)?
Yes, basically you can. What you're looking for are the async functions, such as cudaMemcpyAsync, used together with CUDA streams. Streams are a way to order and synchronise asynchronous work. Hope that gets you going in the right direction.
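To make that a bit more concrete, here is a rough sketch of what overlapping a kernel with a device-to-host copy might look like. The kernel `scale`, the buffer names, the sizes and the `CHECK` macro are all made up for illustration. Two things to keep in mind: the host buffer should be page-locked (allocated with cudaMallocHost here), otherwise the copy generally won't overlap with the kernel, and the kernel and the copy need to go into different non-default streams. Whether they really run at the same time also depends on the device having a copy engine (the asyncEngineCount field of cudaDeviceProp), so treat this as a starting point and check it in a profiler.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: bail out of main() on any CUDA error.
#define CHECK(call)                                                   \
    do {                                                              \
        cudaError_t err_ = (call);                                    \
        if (err_ != cudaSuccess) {                                    \
            fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__,        \
                    cudaGetErrorString(err_));                        \
            return 1;                                                 \
        }                                                             \
    } while (0)

// Hypothetical kernel: scales one buffer in place.
__global__ void scale(float *data, int n, float factor) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;
}

int main() {
    const int n = 1 << 20;
    const size_t bytes = n * sizeof(float);

    // Page-locked (pinned) host buffer, so the async copy can overlap.
    float *h_out = nullptr;
    CHECK(cudaMallocHost((void **)&h_out, bytes));

    // Two separate device buffers: the kernel works on d_work while
    // d_done (not touched by the kernel) is copied back to the host.
    float *d_work = nullptr, *d_done = nullptr;
    CHECK(cudaMalloc((void **)&d_work, bytes));
    CHECK(cudaMalloc((void **)&d_done, bytes));
    CHECK(cudaMemset(d_work, 0, bytes));
    CHECK(cudaMemset(d_done, 0, bytes));

    // One stream for compute, one for the copy.
    cudaStream_t compute, copy;
    CHECK(cudaStreamCreate(&compute));
    CHECK(cudaStreamCreate(&copy));

    // Both calls return to the host immediately; the GPU is free to
    // run them concurrently because they are in different streams.
    scale<<<(n + 255) / 256, 256, 0, compute>>>(d_work, n, 2.0f);
    CHECK(cudaGetLastError());
    CHECK(cudaMemcpyAsync(h_out, d_done, bytes,
                          cudaMemcpyDeviceToHost, copy));

    // Wait for each stream before touching its data on the host.
    CHECK(cudaStreamSynchronize(copy));
    CHECK(cudaStreamSynchronize(compute));
    printf("kernel and copy both finished\n");

    CHECK(cudaStreamDestroy(compute));
    CHECK(cudaStreamDestroy(copy));
    CHECK(cudaFree(d_work));
    CHECK(cudaFree(d_done));
    CHECK(cudaFreeHost(h_out));
    return 0;
}
```

On the performance side, the copy and the kernel still share memory bandwidth on the device, so the kernel may slow down somewhat while the transfer is in flight; how much depends on the hardware and on how memory-bound the kernel is.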
Cheers. :)