This is the output of pgaccelinfo on an RTX 3090. What does the Async Engines value of 2 denote?
The same command on an A100 shows 3 for Async Engines.
Is this in any way related to the async(x) clause in OpenACC? If not, how many async queues can I create on a device?
Hi dhrubajyoti98,
Is this in any way related to the async(x) clause in OpenACC? If not, how many async queues can I create on a device?
No, at least not directly. pgaccelinfo/nvaccelinfo queries the CUDA device properties.
“Async Engines” is the result of the “asyncEngineCount” property, which has the following definition:
asyncEngineCount is 1 when the device can concurrently copy memory between host and device while executing a kernel. It is 2 when the device can concurrently copy memory between host and device in both directions and execute a kernel at the same time. It is 0 if neither of these is supported.
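For reference, the same property can be read directly through the CUDA runtime API. Here’s a minimal sketch, assuming device 0 (compile with e.g. nvcc):

```c
#include <stdio.h>
#include <cuda_runtime.h>

int main(void)
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);  /* query device 0 */

    /* asyncEngineCount is what pgaccelinfo reports as "Async Engines" */
    printf("Async Engines:      %d\n", prop.asyncEngineCount);
    printf("Concurrent kernels: %d\n", prop.concurrentKernels);
    return 0;
}
```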
In practical terms, it’s saying that if you use an OpenACC update directive with an async clause, the memory transfers can run asynchronously with an OpenACC compute region (parallel/kernels) when they are placed on different async queues (CUDA streams). The “2” means this can be done both when copying to and from the device.
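A minimal sketch of that pattern in C with OpenACC (the array names and sizes are illustrative; compile with something like nvc -acc):

```c
#define N 1000000
float a[N], b[N];

void overlap_example(void)
{
    #pragma acc enter data copyin(a, b)

    /* ... a is modified on the host here ... */

    /* The host-to-device transfer of a on queue 1 ... */
    #pragma acc update device(a) async(1)

    /* ... can overlap this kernel on queue 2, because the copy
       engine operates independently of the SMs. */
    #pragma acc parallel loop present(b) async(2)
    for (int i = 0; i < N; ++i)
        b[i] = 2.0f * b[i];

    /* Synchronize both queues before depending on the results. */
    #pragma acc wait(1, 2)

    #pragma acc exit data copyout(a, b)
}
```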
-Mat
Thanks.
So given a particular NVIDIA GPU, how many async queues can I create? Most of the tutorials online show no more than 2.
So given a particular NVIDIA GPU, how many async queues can I create?
The async queues map to CUDA streams, so there’s effectively no limit. There’s probably a hard limit somewhere (likely in the thousands), but it would be well beyond what could be used effectively.
Instead, you’ll want to look at the maximum number of concurrent kernels that can be launched. This ranges from 16 to 128 depending on the compute capability of your device. See: Programming Guide :: CUDA Toolkit Documentation
Typically, though, only a few queues/streams are used by a program. If your kernels are large and able to utilize the full device, additional streams are unlikely to be beneficial: a subsequent kernel can only begin once the previous kernel starts to free up device resources, resulting in little overlap. Concurrency is only achieved with smaller kernels (see the sketch below).
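A hedged sketch of the small-kernel case (the names and sizes are made up): several independent compute regions, each far too small to fill the device, placed on different queues so they can run concurrently.

```c
#define M 4096
float u[M], v[M], w[M];

void concurrent_small_kernels(void)
{
    #pragma acc enter data copyin(u, v, w)

    /* Each loop is far too small to fill the GPU on its own, so
       placing each on its own queue lets them run concurrently. */
    #pragma acc parallel loop present(u) async(1)
    for (int i = 0; i < M; ++i) u[i] += 1.0f;

    #pragma acc parallel loop present(v) async(2)
    for (int i = 0; i < M; ++i) v[i] *= 2.0f;

    #pragma acc parallel loop present(w) async(3)
    for (int i = 0; i < M; ++i) w[i] -= 3.0f;

    /* Synchronize all three queues before using the results. */
    #pragma acc wait(1, 2, 3)

    #pragma acc exit data copyout(u, v, w)
}
```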
I usually attempt to have kernels with enough work to fill the GPU. Only if the work is algorithmically too small do I investigate using concurrent kernels.
Also, creating streams does have overhead, which can start to dominate performance if you use too many.
The most effective use of streams is interleaving data movement and compute, which may “hide” much, if not all, of the data movement time.
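Here’s a minimal sketch of that interleaving pattern, assuming the data can be processed in independent chunks (the names and chunk count are illustrative):

```c
#define N       (1 << 22)
#define NCHUNKS 4
#define CHUNK   (N / NCHUNKS)

float x[N];

void pipeline_example(void)
{
    /* Allocate device memory up front; transfers are staged per chunk. */
    #pragma acc enter data create(x)

    for (int c = 0; c < NCHUNKS; ++c) {
        int lo = c * CHUNK;

        /* Chunk c's host-to-device copy on queue c can overlap
           chunk c-1's kernel running on queue c-1. */
        #pragma acc update device(x[lo:CHUNK]) async(c)

        #pragma acc parallel loop present(x) async(c)
        for (int i = lo; i < lo + CHUNK; ++i)
            x[i] = x[i] * x[i];

        /* Copy the finished chunk back, still on queue c. */
        #pragma acc update self(x[lo:CHUNK]) async(c)
    }

    /* Wait on all queues before touching x on the host. */
    #pragma acc wait

    #pragma acc exit data delete(x)
}
```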
-Mat
ok. Thank you.