This is the output of pgaccelinfo on an RTX 3090. What does the Async Engines value of 2 denote?
The same command on an A100 shows 3 for Async Engines.
Is this in any way related to the async(x) clause in OpenACC? If not, how many async queues can I create on a device?
Hi dhrubajyoti98,
Is this in any way related to the async(x) clause in OpenACC? If not, how many async queues can I create on a device?
No, at least not directly. pgaccelinfo/nvaccelinfo queries the CUDA device properties.
“Async Engines” is the result of the “asyncEngineCount” property, which has the following definition:
asyncEngineCount is 1 when the device can concurrently copy memory between host and device while executing a kernel. It is 2 when the device can concurrently copy memory between host and device in both directions and execute a kernel at the same time. It is 0 if neither of these is supported.
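For reference, the same property can be read directly through the CUDA runtime API. Here’s a minimal sketch, assuming device 0 (compile with e.g. nvcc):

```c
#include <stdio.h>
#include <cuda_runtime.h>

int main(void)
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);  /* query device 0 */

    /* asyncEngineCount is what pgaccelinfo reports as "Async Engines" */
    printf("Async Engines:      %d\n", prop.asyncEngineCount);
    printf("Concurrent kernels: %d\n", prop.concurrentKernels);
    return 0;
}
```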
In practical terms, it’s saying that if you use an OpenACC update directive with an async clause, the memory transfers can run asynchronously with an OpenACC compute region (parallel/kernels) when they are placed on different async queues (CUDA streams). The “2” means this can be done both when copying to and from the device.
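A minimal sketch of that pattern in C with OpenACC (the array names and sizes are illustrative; compile with something like nvc -acc):

```c
#define N 1000000
float a[N], b[N];

void overlap_example(void)
{
    #pragma acc enter data copyin(a, b)

    /* ... a is modified on the host here ... */

    /* The host-to-device transfer of a on queue 1 ... */
    #pragma acc update device(a) async(1)

    /* ... can overlap this kernel on queue 2, because the copy
       engine operates independently of the SMs. */
    #pragma acc parallel loop present(b) async(2)
    for (int i = 0; i < N; ++i)
        b[i] = 2.0f * b[i];

    /* Synchronize both queues before depending on the results. */
    #pragma acc wait(1, 2)

    #pragma acc exit data copyout(a, b)
}
```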
-Mat
Thanks.
So given a particular NVIDIA GPU, how many async queues can I create? Most of the tutorials online show no more than 2.
So given a particular NVIDIA GPU, how many async queues can I create?
The async queues map to CUDA streams, so there’s effectively no limit. There’s probably a hard limit somewhere (likely in the thousands), but it would be well beyond what could be used effectively.
Instead, you’ll want to look at the maximum number of concurrent kernels that can be launched. This ranges from 16 to 128 depending on the compute capability of your device. See: Programming Guide :: CUDA Toolkit Documentation
Typically, though, only a few queues/streams are used by a program. If your kernels are large and able to utilize the full device, additional streams are unlikely to be beneficial: a subsequent kernel can only begin once the previous kernel starts to free up device resources, resulting in little overlap. Concurrency is only achieved with smaller kernels (see the sketch below).
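A hedged sketch of the small-kernel case (the names and sizes are made up): several independent compute regions, each far too small to fill the device, placed on different queues so they can run concurrently.

```c
#define M 4096
float u[M], v[M], w[M];

void concurrent_small_kernels(void)
{
    #pragma acc enter data copyin(u, v, w)

    /* Each loop is far too small to fill the GPU on its own, so
       placing each on its own queue lets them run concurrently. */
    #pragma acc parallel loop present(u) async(1)
    for (int i = 0; i < M; ++i) u[i] += 1.0f;

    #pragma acc parallel loop present(v) async(2)
    for (int i = 0; i < M; ++i) v[i] *= 2.0f;

    #pragma acc parallel loop present(w) async(3)
    for (int i = 0; i < M; ++i) w[i] -= 3.0f;

    /* Synchronize all three queues before using the results. */
    #pragma acc wait(1, 2, 3)

    #pragma acc exit data copyout(u, v, w)
}
```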
I usually attempt to have kernels with enough work to fill the GPU. Only if the work is algorithmically too small do I investigate using concurrent kernels.
Also, creating streams does have overhead, which can start to dominate performance if you use too many.
The most effective use of streams is interleaving data movement and compute, which may “hide” much, if not all, of the data movement time.
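Here’s a minimal sketch of that interleaving pattern, assuming the data can be processed in independent chunks (the names and chunk count are illustrative):

```c
#define N       (1 << 22)
#define NCHUNKS 4
#define CHUNK   (N / NCHUNKS)

float x[N];

void pipeline_example(void)
{
    /* Allocate device memory up front; transfers are staged per chunk. */
    #pragma acc enter data create(x)

    for (int c = 0; c < NCHUNKS; ++c) {
        int lo = c * CHUNK;

        /* Chunk c's host-to-device copy on queue c can overlap
           chunk c-1's kernel running on queue c-1. */
        #pragma acc update device(x[lo:CHUNK]) async(c)

        #pragma acc parallel loop present(x) async(c)
        for (int i = lo; i < lo + CHUNK; ++i)
            x[i] = x[i] * x[i];

        /* Copy the finished chunk back, still on queue c. */
        #pragma acc update self(x[lo:CHUNK]) async(c)
    }

    /* Wait on all queues before touching x on the host. */
    #pragma acc wait

    #pragma acc exit data delete(x)
}
```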
-Mat
ok. Thank you.