Hey, I had a random question that I am pretty sure I know the answer to, but I haven’t found a definitive answer anywhere. My question is this:
Is it possible for me to have the CPU muck around with global memory WHILE the kernel is running? That is, let's say I have some kernel threads that do some reads and writes to global memory. Is it possible for the CPU to do reads and writes to the same memory locations at the same time (well, not exactly the same time, but while the kernel is still executing)? What exactly does the CPU do when it launches a kernel? Is the CPU's serial execution basically blocked until the kernel is done?
Thanks for the help!
Kernel launch is asynchronous, which means that the CUDA runtime library queues up your kernel for execution with the driver, then returns immediately so that your program can keep running while the GPU does its thing. There is a maximum queue depth (that appears to vary between devices), so if you queue up ~24 kernel launches in a row, your program will block until a kernel has finished running.
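To make the asynchronous launch concrete, here is a minimal sketch. The kernel `increment` and the helper `launch_and_wait` are names made up for illustration, and error handling is kept to a bare minimum. The launch returns to the host right away, and the host only waits when it calls cudaDeviceSynchronize().

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel for illustration: adds 1 to every element.
__global__ void increment(int *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] += 1;
}

// Hypothetical helper: launches the kernel, does host work, then waits.
// Returns 0 on success, 1 on any CUDA error.
int launch_and_wait(int n)
{
    int *d_data = nullptr;
    if (cudaMalloc(&d_data, n * sizeof(int)) != cudaSuccess) return 1;
    cudaMemset(d_data, 0, n * sizeof(int));

    // The launch is queued and control returns to the host immediately.
    increment<<<(n + 255) / 256, 256>>>(d_data, n);

    // The host thread is free to do unrelated work here.
    printf("host keeps running while the kernel executes\n");

    // Explicitly block until all queued GPU work has finished.
    cudaError_t err = cudaDeviceSynchronize();

    cudaFree(d_data);
    return err == cudaSuccess ? 0 : 1;
}
```

Calling `launch_and_wait(1 << 20)` from main() should print the host message before the function blocks in cudaDeviceSynchronize().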
Even though kernel launch is asynchronous, some operations will force your process to block until the kernel is complete. The most common of these is cudaMemcpy(), which automatically waits until all previous kernel launches are complete before running, so that you do not read or write device memory while it is being used (which is usually the behavior you want). cudaMemcpy() is synchronous and only returns after the memory copy has completed.
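As a rough sketch of that implicit wait: the code below launches a kernel and then calls cudaMemcpy() with no explicit synchronization in between, and the host still sees the kernel's results. The names `fill_with_index` and `copy_sees_kernel_results` are made up for this example, and everything runs on the default stream.

```cuda
#include <vector>
#include <cuda_runtime.h>

// Hypothetical kernel for illustration: writes each element's own index.
__global__ void fill_with_index(int *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = i;
}

// Hypothetical helper: returns true if the host copy holds the kernel's output.
bool copy_sees_kernel_results(int n)
{
    std::vector<int> host(n, -1);
    int *d_out = nullptr;
    if (cudaMalloc(&d_out, n * sizeof(int)) != cudaSuccess) return false;

    fill_with_index<<<(n + 255) / 256, 256>>>(d_out, n);

    // No explicit synchronization: this copy waits for the kernel above
    // to finish, and only returns once the data is in host memory.
    cudaError_t err = cudaMemcpy(host.data(), d_out, n * sizeof(int),
                                 cudaMemcpyDeviceToHost);
    cudaFree(d_out);
    if (err != cudaSuccess) return false;

    for (int i = 0; i < n; ++i)
        if (host[i] != i) return false;
    return true;
}
```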
CUDA supports the concept of “streams”, where a stream is a sequence of GPU operations that execute in order. If you create two streams and run kernel launches and CUDA memory copies (using cudaMemcpyAsync to avoid the blocking behavior) on different streams, then kernels can execute at the same time as memory copies, assuming your device and operating system support that. Modifying the same device memory locations being accessed by a kernel is possible, I believe, but almost certainly a bad idea. Overlapping kernel execution and memory copies is usually done with some kind of double-buffering scheme.
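One way the double-buffering idea might look, as a sketch under some assumptions: the data splits into independent chunks, the host buffer is page-locked (allocated with cudaMallocHost, which cudaMemcpyAsync needs in order to overlap with kernels), and the names `scale_by_two` and `double_buffered_scale` are invented here. Each of the two streams owns one device buffer, so a copy in one stream can overlap with the kernel in the other, while in-order execution within a stream keeps each buffer from being reused too early.

```cuda
#include <cuda_runtime.h>

// Hypothetical kernel for illustration: doubles every element of a chunk.
__global__ void scale_by_two(float *buf, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) buf[i] *= 2.0f;
}

// Hypothetical helper: processes num_chunks chunks using two streams and
// two device buffers. Returns true if every element ends up as 2.0f.
bool double_buffered_scale(int num_chunks, int chunk_len)
{
    const size_t chunk_bytes = chunk_len * sizeof(float);
    const size_t total = (size_t)num_chunks * chunk_len;

    // Page-locked host memory so the async copies can truly overlap.
    float *h_data = nullptr;
    if (cudaMallocHost(&h_data, total * sizeof(float)) != cudaSuccess)
        return false;
    for (size_t i = 0; i < total; ++i) h_data[i] = 1.0f;

    cudaStream_t stream[2];
    float *d_buf[2] = {nullptr, nullptr};
    bool ok = true;
    for (int s = 0; s < 2; ++s) {
        ok = ok && cudaStreamCreate(&stream[s]) == cudaSuccess;
        ok = ok && cudaMalloc(&d_buf[s], chunk_bytes) == cudaSuccess;
    }

    for (int c = 0; ok && c < num_chunks; ++c) {
        int s = c % 2;  // alternate between the two streams/buffers
        float *h_chunk = h_data + (size_t)c * chunk_len;
        // Operations within one stream run in order, so d_buf[s] is not
        // overwritten until the previous chunk's copy-back has finished.
        cudaMemcpyAsync(d_buf[s], h_chunk, chunk_bytes,
                        cudaMemcpyHostToDevice, stream[s]);
        scale_by_two<<<(chunk_len + 255) / 256, 256, 0, stream[s]>>>(
            d_buf[s], chunk_len);
        cudaMemcpyAsync(h_chunk, d_buf[s], chunk_bytes,
                        cudaMemcpyDeviceToHost, stream[s]);
    }

    // Wait for both streams before the host reads the results.
    ok = ok && cudaDeviceSynchronize() == cudaSuccess;
    for (size_t i = 0; ok && i < total; ++i)
        ok = (h_data[i] == 2.0f);

    for (int s = 0; s < 2; ++s) {
        cudaFree(d_buf[s]);
        cudaStreamDestroy(stream[s]);
    }
    cudaFreeHost(h_data);
    return ok;
}
```

Whether the copies and kernels actually overlap still depends on the device and driver, as noted above. The sketch only sets things up so that overlap is allowed.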
This is explained in more detail in the programming guide in section 3.2.6, “Asynchronous Concurrent Execution.”