What is the purpose of using asynchronous CUDA APIs?

According to the CUDA documentation, CUDA provides a mechanism that makes the calling thread yield while it waits for results from the GPU. With that mechanism and the synchronous CUDA APIs, CPU time is saved and programming seems much easier.
So why should we use asynchronous CUDA APIs?

What asynchronous APIs and thread yielding mechanisms are you referring to?

This is from the documentation of the CUDA Driver API:

  • CU_CTX_SCHED_YIELD: Instruct CUDA to yield its thread when waiting for results from the GPU. This can increase latency when waiting for the GPU, but can increase the performance of CPU threads performing work in parallel with the GPU.

There is no direct relation between synchronization policy and whether or not an API call is synchronous.
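
To make that distinction concrete, here is a minimal sketch using the runtime API, where `cudaDeviceScheduleYield` is the counterpart of `CU_CTX_SCHED_YIELD`. The kernel `addOne` and the helper `runYieldDemo` are made-up names for illustration. The flag only changes how the host thread waits inside a blocking call; it does not turn any call into an asynchronous one.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Trivial kernel so the host has something to wait for (illustrative only).
__global__ void addOne(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] += 1.0f;
}

// Hypothetical helper: returns 0 on success, 1 on any CUDA error.
int runYieldDemo() {
    // Request the yield policy before any other runtime call touches the device.
    if (cudaSetDeviceFlags(cudaDeviceScheduleYield) != cudaSuccess) return 1;

    const int n = 1 << 20;
    float *d = nullptr;
    if (cudaMalloc(&d, n * sizeof(float)) != cudaSuccess) return 1;
    if (cudaMemset(d, 0, n * sizeof(float)) != cudaSuccess) {
        cudaFree(d);
        return 1;
    }

    // The kernel launch returns to the host without waiting for the kernel.
    addOne<<<(n + 255) / 256, 256>>>(d, n);

    // The host thread waits here. Under the yield policy it gives up its
    // time slice to other runnable CPU threads while the GPU is busy.
    cudaError_t err = cudaDeviceSynchronize();

    cudaFree(d);
    return err == cudaSuccess ? 0 : 1;
}

int main() {
    int rc = runYieldDemo();
    printf("%s\n", rc == 0 ? "done" : "CUDA error");
    return rc;
}
```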

I’m not asking about the synchronization policy between kernels, or between the host and the GPU.
Since synchronous API calls may not cost much CPU time, why should we use asynchronous API calls instead?

Asynchronous calls let you overlap memory transfers with compute kernels and so hide latency. See “How to Overlap Data Transfers in CUDA C/C++” on the NVIDIA Developer Blog.
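
A rough sketch of that overlap pattern, assuming two streams and a made-up `scale` kernel (the helper name `runOverlapDemo` and the chunk size are also just for illustration). Each stream gets its own copy-in, kernel, and copy-out, so the copy for one chunk may overlap with the kernel of the other if the hardware allows it. Pinned host memory is needed for the copies to be truly asynchronous.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Illustrative kernel: multiply every element by a.
__global__ void scale(float *x, int n, float a) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= a;
}

// Hypothetical helper: returns the number of wrong results, or -1 on a CUDA error.
int runOverlapDemo() {
    const int nStreams = 2;
    const int chunk = 1 << 20;
    const int n = nStreams * chunk;
    const size_t chunkBytes = chunk * sizeof(float);

    float *h = nullptr, *d = nullptr;
    // Pinned (page-locked) host memory so cudaMemcpyAsync can run asynchronously.
    if (cudaMallocHost(&h, n * sizeof(float)) != cudaSuccess) return -1;
    if (cudaMalloc(&d, n * sizeof(float)) != cudaSuccess) {
        cudaFreeHost(h);
        return -1;
    }
    for (int i = 0; i < n; ++i) h[i] = 1.0f;

    cudaStream_t s[nStreams];
    for (int k = 0; k < nStreams; ++k) cudaStreamCreate(&s[k]);

    // Enqueue copy-in, kernel, copy-out per chunk. None of these calls
    // waits for the GPU, so the loop finishes almost immediately.
    for (int k = 0; k < nStreams; ++k) {
        int off = k * chunk;
        cudaMemcpyAsync(d + off, h + off, chunkBytes, cudaMemcpyHostToDevice, s[k]);
        scale<<<(chunk + 255) / 256, 256, 0, s[k]>>>(d + off, chunk, 2.0f);
        cudaMemcpyAsync(h + off, d + off, chunkBytes, cudaMemcpyDeviceToHost, s[k]);
    }

    // The host thread is free to do unrelated work here.

    // Wait for both streams before reading the results.
    bool ok = true;
    for (int k = 0; k < nStreams; ++k)
        ok = (cudaStreamSynchronize(s[k]) == cudaSuccess) && ok;

    int wrong = 0;
    for (int i = 0; i < n; ++i)
        if (h[i] != 2.0f) ++wrong;

    for (int k = 0; k < nStreams; ++k) cudaStreamDestroy(s[k]);
    cudaFree(d);
    cudaFreeHost(h);
    return ok ? wrong : -1;
}

int main() {
    int wrong = runOverlapDemo();
    printf("mismatches: %d\n", wrong);
    return wrong == 0 ? 0 : 1;
}
```

With purely synchronous calls the same three steps would run back to back for each chunk, and the host thread would be stuck in each call in turn.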

What do you mean by “Since using synchronous API calls may not cost much CPU time”? I am not sure I completely follow your argument. CU_CTX_SCHED_YIELD does not improve the performance of API calls. It can allow better CPU utilization (by other threads).

That’s what I’d like to do.

That means that a synchronous API call can block the thread but release the CPU. From the perspective of CPU utilization, I think synchronous and asynchronous APIs show little difference.

That’s what I overlooked but it answers my question.

However, this raises another question (should I open another topic?):
What are the pros and cons of the two paradigms depicted in the figure below?


I assume that in Paradigm 2 the host function will be executed serially after Synchronize Stream. Pros and cons depend on the use-case. What does your image source say about the two options?
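
For reference, this is how I read Paradigm 2, as a sketch only. The names `fill`, `hostFunction`, and `runSyncThenHost` are placeholders of mine and not taken from the figure. The calling thread enqueues the GPU work, waits in `cudaStreamSynchronize`, and then runs the host function itself.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Illustrative kernel: write the same value into every element.
__global__ void fill(int *out, int n, int value) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = value;
}

// Placeholder for the host-side step that consumes the GPU result.
int hostFunction(const int *data, int n) {
    int sum = 0;
    for (int i = 0; i < n; ++i) sum += data[i];
    return sum;
}

// Hypothetical helper: returns the host function's result, or -1 on a CUDA error.
int runSyncThenHost() {
    const int n = 1024;
    const size_t bytes = n * sizeof(int);
    int *h = nullptr, *d = nullptr;
    cudaStream_t stream;

    if (cudaMallocHost(&h, bytes) != cudaSuccess) return -1;
    if (cudaMalloc(&d, bytes) != cudaSuccess) {
        cudaFreeHost(h);
        return -1;
    }
    if (cudaStreamCreate(&stream) != cudaSuccess) {
        cudaFree(d);
        cudaFreeHost(h);
        return -1;
    }

    // Enqueue the GPU work; neither call waits for completion.
    fill<<<(n + 255) / 256, 256, 0, stream>>>(d, n, 3);
    cudaMemcpyAsync(h, d, bytes, cudaMemcpyDeviceToHost, stream);

    // The calling thread waits here until the stream has drained.
    cudaError_t err = cudaStreamSynchronize(stream);

    // Only then does the host function run, serially, on the same thread.
    int result = (err == cudaSuccess) ? hostFunction(h, n) : -1;

    cudaStreamDestroy(stream);
    cudaFree(d);
    cudaFreeHost(h);
    return result;
}

int main() {
    printf("host function result: %d\n", runSyncThenHost());
    return 0;
}
```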

There is no specific use case. All I want is a programming paradigm that gives better CPU utilization on a heavily multithreaded system.