You can see that all of these operations are being processed serially.
My question is: do I need to use multiple host threads to launch all of these concurrently? (I’m currently using a single host thread and launching them in a for loop.)
Or do I have to replace all of the pageable memory with pinned memory?
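For context, the structure is roughly like this (a simplified sketch, not my actual code; names, sizes, and the kernel are placeholders):

```cpp
#include <cuda_runtime.h>
#include <vector>

// Placeholder kernel standing in for the real per-chunk work.
__global__ void myKernel(float* data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] += 1.0f;
}

// Single host thread, one stream per chunk, host buffers allocated with new
// (i.e. pageable memory). Cleanup omitted for brevity.
void enqueueAll(int numChunks, int chunkElems) {
    std::vector<float*>       hSrc(numChunks);
    std::vector<float*>       dDst(numChunks);
    std::vector<cudaStream_t> streams(numChunks);
    for (int i = 0; i < numChunks; ++i) {
        hSrc[i] = new float[chunkElems]();                        // pageable host memory
        cudaMalloc((void**)&dDst[i], chunkElems * sizeof(float));
        cudaStreamCreate(&streams[i]);
    }
    for (int i = 0; i < numChunks; ++i) {
        // These are the transfers that show up back to back in the timeline.
        cudaMemcpyAsync(dDst[i], hSrc[i], chunkElems * sizeof(float),
                        cudaMemcpyHostToDevice, streams[i]);
        myKernel<<<(chunkElems + 255) / 256, 256, 0, streams[i]>>>(dDst[i], chunkElems);
    }
    cudaDeviceSynchronize();
}
```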
You need to use pinned memory and cudaMemcpyAsync.
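A minimal sketch of that pattern (sizes, names, and the kernel below are placeholders, not your actual code): allocate the host buffers with cudaMallocHost instead of malloc/new, then enqueue cudaMemcpyAsync and the kernel launch into per-chunk streams from your single host thread. No extra host threads are needed for the copies and kernels to overlap.

```cpp
#include <cuda_runtime.h>
#include <cstdio>

// Placeholder kernel; substitute your own.
__global__ void process(float* data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;
}

int main() {
    const int    numChunks  = 4;
    const int    chunkElems = 1 << 20;
    const size_t chunkBytes = chunkElems * sizeof(float);

    float*       hBuf[numChunks];
    float*       dBuf[numChunks];
    cudaStream_t streams[numChunks];

    for (int i = 0; i < numChunks; ++i) {
        cudaMallocHost((void**)&hBuf[i], chunkBytes);   // pinned (page-locked) host memory
        cudaMalloc((void**)&dBuf[i], chunkBytes);
        cudaStreamCreate(&streams[i]);
        for (int j = 0; j < chunkElems; ++j) hBuf[i][j] = 1.0f;
    }

    // One host thread is enough: each iteration only enqueues work and returns
    // immediately, so copies and kernels in different streams can overlap.
    for (int i = 0; i < numChunks; ++i) {
        cudaMemcpyAsync(dBuf[i], hBuf[i], chunkBytes,
                        cudaMemcpyHostToDevice, streams[i]);
        process<<<(chunkElems + 255) / 256, 256, 0, streams[i]>>>(dBuf[i], chunkElems);
        cudaMemcpyAsync(hBuf[i], dBuf[i], chunkBytes,
                        cudaMemcpyDeviceToHost, streams[i]);
    }
    cudaDeviceSynchronize();

    printf("first element of chunk 0: %f\n", hBuf[0][0]);

    for (int i = 0; i < numChunks; ++i) {
        cudaStreamDestroy(streams[i]);
        cudaFree(dBuf[i]);
        cudaFreeHost(hBuf[i]);
    }
    return 0;
}
```

With this arrangement, Nsight Systems should show the transfers and kernels from different streams overlapping instead of lining up back to back.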
In Nsight Systems, you can switch to the Expert System View. There you can filter for CUDA Synchronous Memcpy and CUDA Async Memcpy with Pageable Memory and see the suggestions it gives:
The following APIs use PAGEABLE memory which causes asynchronous CUDA memcpy operations to block and be executed synchronously. This leads to low GPU utilization.
Suggestion: If applicable, use PINNED memory instead.
The following are synchronous memory transfers that block the host. This does not include host to device transfers of a memory block of 64 KB or less.
I’ve been struggling with a similar issue. See here. If you’re working under Windows, like me, this answer might be interesting for you. I accepted that answer as the solution, although it’s more of an explanation and didn’t actually help me solve the problem.
If you’re interested in a couple of workarounds, see the conversation that follows in the linked thread.