Q1. I saw from the OpenACC lecture video that the maximum gang size allowed for NVIDIA devices is 1024. But doesn't an SMX have a maximum of 2048 threads? (Kepler architecture)
On Tesla devices, the maximum number of threads per block is 1024, while the maximum number of threads an SMX may execute concurrently is 2048.
What makes the gang size restricted?
The gang size itself is not restricted. However, the OpenACC implementation may use a different value than what is specified by the user, based on the limitations of the target device. For example, if you specify a vector length of 128, the compiler can accommodate this length on an NVIDIA Tesla or AMD Radeon, but will reduce it when targeting an x86 processor with 256-bit AVX.
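For instance, a minimal sketch of requesting a 128-wide vector on an inner loop (the loop bounds and array names here are made up for illustration):

!$ACC KERNELS
!$ACC LOOP GANG
DO j = 1, m
   ! request a vector length of 128; the compiler may shorten it for the target
   !$ACC LOOP VECTOR(128)
   DO i = 1, n
      A(i, j) = B(i, j) + C(i, j)
   END DO
END DO
!$ACC END KERNELS

On a Tesla or Radeon the compiler can keep the 128; on a 256-bit AVX target it may quietly fall back to a shorter SIMD length.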
If I want to fully utilize an SMX, do I have to use 2 gangs of size 1024?
You could, assuming you have not used up other resources such as registers or shared memory. Though since you can have up to 16 blocks resident per SMX (an OpenACC gang maps to a CUDA block when targeting a Tesla), I'd recommend using smaller block sizes, such as 128: 16 blocks of 128 threads still adds up to the full 2048 threads.
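For example (again just a sketch, with a made-up array and trip count), a combined construct with an explicit vector length of 128 gives each gang a 128-thread block and lets the hardware pack up to 16 of them onto each SMX:

! each gang becomes one 128-thread CUDA block
!$ACC PARALLEL LOOP GANG VECTOR VECTOR_LENGTH(128)
DO i = 1, n
   A(i) = 2.0 * A(i)
END DO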
Are the gangs automatically distributed to multiple SMXs?
Yes, this is done by the CUDA device driver.
For example, if there are 12 SMXs in a device and I launch 24 gangs of size 1024, are all SMXs fully occupied?
Yes, but again there are other limiting factors such as register usage and shared memory which may limit the number of threads that can be run on an SMX. Doing a web search for the term “CUDA Occupancy” will give you more detailed information.
Q3. What is the maximum gang size and minimum vector size on AMD Graphics Core Next devices?
The minimum vector size for any target is 1. Though if you go below the wavefront size (64) then you’ll have idle threads.
The PGI compiler will use a maximum vector length of 256 when targeting a Radeon (-ta=radeon).
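So as a practical matter on Radeon, keep the vector length between 64 (one wavefront) and 256. A minimal sketch (the loop and arrays are made up), compiled with something like pgfortran -acc -ta=radeon -Minfo=accel so you can see the schedule the compiler actually picked:

!$ACC KERNELS
!$ACC LOOP GANG
DO j = 1, m
   ! one full wavefront (64); anything narrower leaves lanes idle
   !$ACC LOOP VECTOR(64)
   DO i = 1, n
      A(i, j) = A(i, j) + 1.0
   END DO
END DO
!$ACC END KERNELS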
Q4. Assuming that other conditions are identical, which loop is better?
Neither is very good. Though if I had to venture a guess between these two loops, #2 would most likely be better since the vector loop, “i”, accesses A’s stride-1 (contiguous in memory) dimension.
If you can, the best option would be to parallelize both loops:
!$ACC LOOP GANG WORKER
DO j = 1, 10000
   !$ACC LOOP VECTOR(32)
   DO i = 1, 32
      A(i, j) = i + j
   END DO
END DO
For NVIDIA devices, “worker” maps to the “y” dimension of the thread block while “vector” maps to the “x” dimension.
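For reference, loops #1 and #2 from the question were presumably scheduled something like the following. This is my reconstruction from the “vector/seq” and “seq/vector” labels below, so the exact clauses and bounds are assumptions:

! #1: vector on the outer "j" loop, the inner "i" loop left sequential
!$ACC LOOP VECTOR
DO j = 1, 10000
   DO i = 1, 32
      A(i, j) = i + j
   END DO
END DO

! #2: the outer "j" loop sequential, vector on the inner "i" loop
DO j = 1, 10000
   !$ACC LOOP VECTOR
   DO i = 1, 32
      A(i, j) = i + j
   END DO
END DO

With #1, neighboring vector lanes (different j) touch elements a full column apart at every step, while in #2 neighboring lanes (different i) touch consecutive elements of A.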
To see if I was correct, I wrote a small example using each of these three loops. The results:
#1 (vector/seq): 9808.6 ms
#2 (seq/vector): 180.3 ms
#3 (gang,worker/vector): 16.7 ms
So, yes. The poor data access pattern in #1 severely hurts the loop’s performance.
Hope this helps,