Why don't I see overlap between D2H operations and kernels?

Hi Experts,

I am analyzing my kernels with nsys.
My device is an RTX 4060.

The green items in the image are H2D, the blue ones are cuFFT operations, and red ones are D2H.

All of them are asynchronous, and I use 5 different streams, none of which is the default stream.

You can see that all of these are executed serially.

My question is: do I need to use multiple host threads to launch all of these concurrently? (I'm using a single host thread and launching them in a for loop.)

Or do I have to replace all pageable memory with pinned memory?

Thanks!

You need to use pinned memory and cudaMemcpyAsync.
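As a minimal sketch of what that could look like with a single host thread and 5 streams: the `scale` kernel below is a hypothetical stand-in for the cuFFT work, and the buffer size and the `CHECK` macro are assumptions made for illustration, not taken from the original program.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical helper: abort on any CUDA runtime error.
#define CHECK(call)                                                     \
    do {                                                                \
        cudaError_t err_ = (call);                                      \
        if (err_ != cudaSuccess) {                                      \
            std::fprintf(stderr, "CUDA error %s at %s:%d\n",            \
                         cudaGetErrorString(err_), __FILE__, __LINE__); \
            std::exit(EXIT_FAILURE);                                    \
        }                                                               \
    } while (0)

// Placeholder kernel standing in for the cuFFT work (assumption).
__global__ void scale(float* data, int n, float factor) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;
}

int main() {
    const int kStreams = 5;        // matches the 5 streams in the question
    const int kChunk = 1 << 20;    // elements per stream (assumed size)
    const size_t bytes = kChunk * sizeof(float);

    float* h_buf[kStreams];
    float* d_buf[kStreams];
    cudaStream_t streams[kStreams];

    for (int s = 0; s < kStreams; ++s) {
        // Pinned (page-locked) host memory instead of malloc/new.
        CHECK(cudaMallocHost(&h_buf[s], bytes));
        CHECK(cudaMalloc(&d_buf[s], bytes));
        CHECK(cudaStreamCreate(&streams[s]));
        for (int i = 0; i < kChunk; ++i) h_buf[s][i] = 1.0f;
    }

    const int threads = 256;
    const int blocks = (kChunk + threads - 1) / threads;

    // One host thread issues H2D, kernel, and D2H into each stream in a loop.
    for (int s = 0; s < kStreams; ++s) {
        CHECK(cudaMemcpyAsync(d_buf[s], h_buf[s], bytes,
                              cudaMemcpyHostToDevice, streams[s]));
        scale<<<blocks, threads, 0, streams[s]>>>(d_buf[s], kChunk, 2.0f);
        CHECK(cudaGetLastError());
        CHECK(cudaMemcpyAsync(h_buf[s], d_buf[s], bytes,
                              cudaMemcpyDeviceToHost, streams[s]));
    }

    // Wait for all streams before reading the results on the host.
    CHECK(cudaDeviceSynchronize());
    std::printf("h_buf[0][0] = %.1f\n", h_buf[0][0]);

    for (int s = 0; s < kStreams; ++s) {
        CHECK(cudaStreamDestroy(streams[s]));
        CHECK(cudaFree(d_buf[s]));
        CHECK(cudaFreeHost(h_buf[s]));
    }
    return 0;
}
```

With cuFFT in place of the placeholder kernel, each plan would also have to be associated with its stream through `cufftSetStream`, which probably means one plan per stream. Whether the copies and kernels actually overlap in the timeline still depends on the device and driver.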

In Nsight Systems, you can switch to the Expert System View. There you can filter for CUDA Synchronous Memcpy and CUDA Async Memcpy with Pageable Memory, and see the given suggestions.

The following APIs use PAGEABLE memory which causes asynchronous CUDA memcpy operations to block and be executed synchronously. This leads to low GPU utilization.

Suggestion: If applicable, use PINNED memory instead.

The following are synchronous memory transfers that block the host. This does not include host to device transfers of a memory block of 64 KB or less.

Suggestion: Use cudaMemcpy*Async() APIs instead.
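If reallocating every host buffer is impractical, one possible alternative is to page-lock an existing allocation in place with `cudaHostRegister`. The following is only a sketch under assumed names and sizes (`h_data`, `d_data`, one stream), not the layout of the original program.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

int main() {
    const size_t n = 1 << 20;                 // assumed element count
    const size_t bytes = n * sizeof(float);

    // Existing pageable buffer, e.g. allocated elsewhere in the application.
    float* h_data = static_cast<float*>(std::malloc(bytes));
    if (h_data == nullptr) return EXIT_FAILURE;
    for (size_t i = 0; i < n; ++i) h_data[i] = 1.0f;

    // Page-lock the buffer in place so async copies from it can be truly asynchronous.
    cudaError_t err = cudaHostRegister(h_data, bytes, cudaHostRegisterDefault);
    if (err != cudaSuccess) {
        std::fprintf(stderr, "cudaHostRegister: %s\n", cudaGetErrorString(err));
        std::free(h_data);
        return EXIT_FAILURE;
    }

    float* d_data = nullptr;
    cudaStream_t stream;
    cudaMalloc(&d_data, bytes);
    cudaStreamCreate(&stream);

    // The copy is queued on the stream and the call returns to the host without waiting.
    cudaMemcpyAsync(d_data, h_data, bytes, cudaMemcpyHostToDevice, stream);
    cudaStreamSynchronize(stream);

    // Unregister before freeing the host buffer.
    cudaStreamDestroy(stream);
    cudaFree(d_data);
    cudaHostUnregister(h_data);
    std::free(h_data);
    return EXIT_SUCCESS;
}
```

Registering memory has a cost of its own, so this tends to pay off for buffers that are reused across many transfers rather than registered once per copy.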

I’ve been struggling with a similar issue. See here. If you’re working under Windows, as I am, this answer might be interesting for you. I accepted it as the solution, although it is more of an explanation and didn’t help me solve the problem.

If you’re interested in a couple of workarounds, see the conversation that follows it in the linked thread.