I am a little confused about this terminology.
Is it possible to compare these terms with work groups, warps, and threads?
Thanks.
PGI’s current implementation, when targeting NVIDIA GPUs, maps a “gang” to a CUDA thread block, a “worker” to the y dimension of the thread index (threadidx%y), and a “vector” to the x dimension (threadidx%x).
However, for a different target, such as a multi-core x86 system, the mapping would be very different.
One of the benefits of OpenACC is that it allows you, the programmer, to abstract away the details of the underlying architecture. You can focus on exposing the parallelism rather than on how to map it to a particular device, which gives greater performance portability.
Think of “gang” as coarse-grained parallelism, where the gangs work independently of each other and cannot synchronize with one another. “vector” is the finest granularity, with an individual instruction operating on multiple pieces of data (SIMD/SIMT). “worker” sits between the two and allows for grouping of vectors.
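To make the three levels concrete, here is a minimal sketch in C of a triply nested loop with one level assigned to each. The function name add_one_3d and the flat array layout are my own assumptions for illustration, not something from the OpenACC documentation, and in practice you can often leave the clauses off and let the compiler choose the mapping.

```c
#include <stddef.h>

/* Hypothetical helper (illustration only): adds 1.0f to every element
 * of a flat array holding ni * nj * nk floats. */
void add_one_3d(float *a, int ni, int nj, int nk)
{
    /* Outer loop: coarse-grained, independent iterations spread across gangs. */
    #pragma acc parallel loop gang copy(a[0:ni*nj*nk])
    for (int i = 0; i < ni; ++i) {
        /* Middle loop: workers within each gang. */
        #pragma acc loop worker
        for (int j = 0; j < nj; ++j) {
            /* Inner loop: finest granularity, vector (SIMD/SIMT) lanes. */
            #pragma acc loop vector
            for (int k = 0; k < nk; ++k) {
                a[((size_t)i * nj + j) * nk + k] += 1.0f;
            }
        }
    }
}
```

With the PGI mapping described above, I would expect the i loop to be spread over CUDA blocks, the j loop over the y dimension of the threads, and the k loop over the x dimension, but the same source can be mapped differently on another target. If you build without OpenACC enabled (for example, without -acc for the PGI compilers or -fopenacc for GCC), the pragmas are ignored and the loops simply run sequentially with the same result.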
You might find this section of the OpenACC Best Practices Guide helpful.
*Note that the link to the best practices guide from 2015 was no longer valid, so I updated to the new document, May 2024.