I’m curious: what happens if you issue a cudaMalloc, cudaMemcpy, or cudaFree while a kernel is executing on the device? Does the memory call block until the kernel completes, or does CUDA allow concurrent host and device memory accesses? Thanks for any answers :)
cudaMalloc and cudaFree, for the most part, wait for the GPU to finish all pending work before they return. cudaMemcpyAsync can overlap with kernels running in different streams, which means that you, the developer, are guaranteeing that a memcpy and a kernel never access the same region at the same time (unless they’re both only reading).
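To make the stream part concrete, here is a minimal sketch of a copy that is allowed to overlap with a kernel. The kernel `scale`, the function `overlap_demo`, and the buffer and stream names are made up for illustration; the sketch also assumes pinned host memory (cudaMallocHost) and a device that supports copy/compute overlap. The kernel and the second copy touch different buffers, which is the guarantee described above.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel: doubles every element in place.
__global__ void scale(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;
}

// Returns the number of wrong elements (0 on success), or -1 on a CUDA error.
// Sketch only: the early returns do not release what was already allocated.
int overlap_demo() {
    const int n = 1 << 20;
    const size_t bytes = n * sizeof(float);
    float *h_a = nullptr, *h_b = nullptr, *d_a = nullptr, *d_b = nullptr;
    cudaStream_t s_kernel, s_copy;

    // Pinned host memory lets cudaMemcpyAsync run asynchronously.
    if (cudaMallocHost(&h_a, bytes) != cudaSuccess) return -1;
    if (cudaMallocHost(&h_b, bytes) != cudaSuccess) return -1;
    if (cudaMalloc(&d_a, bytes) != cudaSuccess) return -1;
    if (cudaMalloc(&d_b, bytes) != cudaSuccess) return -1;
    if (cudaStreamCreate(&s_kernel) != cudaSuccess) return -1;
    if (cudaStreamCreate(&s_copy) != cudaSuccess) return -1;

    for (int i = 0; i < n; ++i) { h_a[i] = 1.0f; h_b[i] = 3.0f; }

    // Stream s_kernel: upload buffer A, process it, download the result.
    cudaMemcpyAsync(d_a, h_a, bytes, cudaMemcpyHostToDevice, s_kernel);
    scale<<<(n + 255) / 256, 256, 0, s_kernel>>>(d_a, n);
    cudaMemcpyAsync(h_a, d_a, bytes, cudaMemcpyDeviceToHost, s_kernel);

    // Stream s_copy: upload buffer B. This copy may overlap with the kernel
    // because it is in a different stream and touches a different buffer.
    cudaMemcpyAsync(d_b, h_b, bytes, cudaMemcpyHostToDevice, s_copy);

    // Wait for both streams before the host reads h_a.
    if (cudaStreamSynchronize(s_kernel) != cudaSuccess) return -1;
    if (cudaStreamSynchronize(s_copy) != cudaSuccess) return -1;

    int wrong = 0;
    for (int i = 0; i < n; ++i) {
        if (h_a[i] != 2.0f) ++wrong;
    }

    cudaStreamDestroy(s_kernel);
    cudaStreamDestroy(s_copy);
    cudaFree(d_a);
    cudaFree(d_b);
    cudaFreeHost(h_a);
    cudaFreeHost(h_b);
    return wrong;
}
```

Calling overlap_demo() from main and checking for a return value of 0 is enough to try it. Whether the copy actually overlaps with the kernel depends on the hardware; a profiler timeline is the way to confirm it.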
OK, I see how to use streams in the CUDA programming guide, but only for choosing which stream a memory copy runs in. Is there any way to use streams with cudaMalloc or cudaFree?
Not that I know of.
In the toolkit version discussed here you won’t find a function like cudaMallocAsync(), although you will find cudaMemcpyAsync().
After all, cudaMalloc() usually takes next to no time, so it is not worth overlapping it with anything else.
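If you want to check the cost on your own setup, a rough sketch like the following times one cudaMalloc/cudaFree pair from the host. The function name `time_malloc_free_us` is made up for illustration, and the cudaFree(0) warm-up is there on the assumption that the first runtime call pays for context creation, which would otherwise dominate the measurement.

```cuda
#include <chrono>
#include <cstddef>
#include <cuda_runtime.h>

// Returns the wall-clock time of one cudaMalloc + cudaFree pair in
// microseconds, or -1.0 if either call fails.
double time_malloc_free_us(size_t bytes) {
    // Warm-up: force context creation so it is not part of the timing.
    cudaFree(0);

    void *p = nullptr;
    auto t0 = std::chrono::steady_clock::now();
    if (cudaMalloc(&p, bytes) != cudaSuccess) return -1.0;
    if (cudaFree(p) != cudaSuccess) return -1.0;
    auto t1 = std::chrono::steady_clock::now();

    return std::chrono::duration<double, std::micro>(t1 - t0).count();
}
```

The number will vary with allocation size, driver, and whatever else the GPU is doing, so treat a single measurement as a rough indication only.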