Questions about 'vector' and 'gang'

Q1.
I saw from the OpenACC lecture video that maximum gang size allowed for NVIDIA devices is 1024.

But doesn’t an SMX have maximum of 2048 threads? (Kepler architecture)

What makes the gang size restricted?

If I want to fully utilize an SMX, do I have to use 2 gangs of size 1024?

Q2.
Are the gangs automatically distributed to multiple SMXs?

For example, if there are 12 SMXs in a device and I launch 24 gangs of size 1024, are all SMXs fully occupied?

Q3.
What is the maximum gang size and minimum vector size on AMD Graphics Core Next devices?

Q4.
Assuming that other conditions are identical, which loop is better?

!$ACC LOOP VECTOR(32)
DO i = 1, 32
   !$ACC LOOP SEQ
   DO j = 1, 10000
      A(j, i) = j + i
   ENDDO
ENDDO

!$ACC LOOP SEQ
DO j = 1, 10000
   !$ACC LOOP VECTOR(32)
   DO i = 1, 32
      A(i, j) = i + j
   ENDDO
ENDDO

I suppose the second code incurs more do-loop overhead, but I'm not sure whether a CUDA core is good at executing a complex task (the long do loop in the first code).

Hi CNJ,

Q1. I saw from the OpenACC lecture video that maximum gang size allowed for NVIDIA devices is 1024. But doesn’t an SMX have maximum of 2048 threads? (Kepler architecture)

On Tesla devices, the maximum number of threads per block is 1024, while an SMX can have up to 2048 threads resident at a time.

What makes the gang size restricted?

The gang size itself is not restricted. However, the OpenACC implementation may use a different value than what the user specifies, based on the limitations of the target device. For example, if you specify a vector length of 128, the compiler can accommodate this length on an NVIDIA Tesla or AMD Radeon, but will reduce it when targeting an x86 processor with 256-bit AVX.
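As a sketch of what that looks like (the loop bounds and arrays here are just placeholders), the vector length is only a request and the compiler may lower it on a target that cannot support it:

!$ACC PARALLEL LOOP GANG VECTOR VECTOR_LENGTH(128)
DO i = 1, N
   ! 128 can be honored on a Tesla or Radeon target; on a 256-bit AVX
   ! x86 target the compiler may use a smaller effective SIMD width.
   B(i) = 2.0 * A(i)
ENDDO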

If I want to fully utilize an SMX, do I have to use 2 gangs of size 1024?

You could, assuming you have not used up other resources such as registers or shared memory. However, since an SMX can hold up to 16 blocks (an OpenACC gang maps to a CUDA block when targeting a Tesla), I'd recommend using smaller block sizes, such as 128.

Q2.
Are the gangs automatically distributed to multiple SMXs?

Yes, this is done by the CUDA device driver.

For example, if there are 12 SMXs in a device and I launch 24 gangs of size 1024, are all SMXs fully occupied?

Yes, but again there are other limiting factors, such as register usage and shared memory, which may limit the number of threads that can run on an SMX. Doing a web search for the term "CUDA occupancy" will give you more detailed information.
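For illustration only (the array and loop bounds are invented), explicitly asking for 24 gangs of 1024 threads would look like the sketch below; whether two such blocks actually fit on each SMX still depends on register and shared memory usage:

!$ACC PARALLEL LOOP GANG VECTOR NUM_GANGS(24) VECTOR_LENGTH(1024)
DO i = 1, 24*1024
   ! 24 gangs x 1024 vector lanes: one iteration per thread.
   C(i) = C(i) + 1.0
ENDDO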

Q3. What is the maximum gang size and minimum vector size on AMD Graphics Core Next devices?

The minimum vector size for any target is 1. Though if you go below the wavefront size (64) then you’ll have idle threads.

The PGI compiler will use a maximum vector length of 256 when targeting a Radeon (-ta=radeon).

Q4. Assuming that other conditions are identical, which loop is better?

Neither is very good. Though if I had to venture a guess between these two loops, #2 would most likely be better since the vector loop, “i”, accesses A’s stride-1 (contiguous in memory) dimension.

If you can, the best option would be to parallelize both loops:

!$ACC LOOP GANG WORKER
DO j = 1, 10000
   !$ACC LOOP VECTOR(32)
   DO i = 1, 32
      A(i, j) = i + j
   ENDDO
ENDDO

For NVIDIA devices, "worker" maps to the "y" dimension of the thread block while "vector" maps to the "x" dimension.
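As a sketch of that mapping (the worker count of 4 is just an example value), explicit worker and vector sizes translate into the block's y and x dimensions on a Tesla target:

!$ACC PARALLEL LOOP GANG WORKER NUM_WORKERS(4) VECTOR_LENGTH(32)
DO j = 1, 10000
   !$ACC LOOP VECTOR
   DO i = 1, 32
      ! On Tesla this requests a 32x4 thread block:
      ! vector -> threadidx%x (32), worker -> threadidx%y (4)
      A(i, j) = i + j
   ENDDO
ENDDO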


To see if I was correct, I wrote a small example using each of these three loop variants. The results:
#1 (vector/seq): 9808.6 ms
#2 (seq/vector): 180.3 ms
#3 (gang worker/vector): 16.7 ms

So, yes. The poor data access pattern in #1 severely hurts the loop’s performance.

Hope this helps,
Mat

Q1.

Then, what happens if I launch 64 workers with vector length 32 in a gang?

Does this separate into 2 blocks of 1024 threads? Or does the number of workers reduce to 32?

Our code is really sensitive to this, because we have to tally each gang's and worker's data individually. If the compiler does not use the values we have designated manually, the data can be corrupted by simultaneous access.

It would be best if the atomic directive worked on array elements.

Q3.

I didn't get a clear answer about the maximum gang size on AMD Graphics Core Next.

An AMD Graphics Core Next CU can interleave 40 wavefronts, which gives 2560 threads per CU. So is the maximum gang size 2560?

Q4.

Why is the first loop's data access pattern bad?

It seems that both loops' data access patterns are proper; j is read first in the first loop.

Am I missing something?

Q1. Then, what happens if I launch 64 workers with vector length 32 in a gang?

For Tesla devices, “worker” maps to the threadidx%y dimension. So you’d be requesting a block size of 32x64. Of course this is too big, so the compiler would need to reduce the number of workers.
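A sketch of that request (the loop extents N and M are placeholders); since 32x64 is 2048 threads, which exceeds the 1024 threads-per-block limit, the compiler will lower the worker count rather than split the gang:

!$ACC PARALLEL LOOP GANG WORKER NUM_WORKERS(64) VECTOR_LENGTH(32)
DO j = 1, N
   !$ACC LOOP VECTOR
   DO i = 1, M
      ! The requested block is 32x64 = 2048 threads; expect the compiler
      ! to reduce num_workers (e.g. to 32) so the block fits in 1024.
      A(i, j) = i + j
   ENDDO
ENDDO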

Does this separate into 2 blocks of 1024 threads? Or does the number of workers reduce to 32?

Reduce.

Our code is really sensitive to this, because we have to tally each gang's and worker's data individually. If the compiler does not use the values we have designated manually, the data can be corrupted by simultaneous access.

Are you trying to perform an inner loop reduction? If so, OpenACC has a loop reduction clause which can be quite useful.

!$ACC LOOP GANG WORKER
DO j = 1, N
   sum = 0
   !$ACC LOOP VECTOR reduction(+:sum)
   DO i = 1, M
      sum = sum + i
   ENDDO
   A(j) = sum
ENDDO



It would be best if the atomic directive worked on array elements.

Atomics work on array elements so long as it's an array of an intrinsic type. You just can't use them on the array as a whole.

For example:

 integer :: f(n)
 real :: r(m), x
...
 !$acc parallel loop
 do i = 1, n
    j = mod(f(i), m)
    ! atomically accumulate into a single element of r
    !$acc atomic update
    r(j+1) = r(j+1) + x
    !$acc end atomic
 enddo



Q3. I didn't get a clear answer about the maximum gang size on AMD Graphics Core Next. An AMD Graphics Core Next CU can interleave 40 wavefronts, which gives 2560 threads per CU. So is the maximum gang size 2560?

No. On my Tahiti-based Radeon, the maximum workgroup size is 256, so the maximum gang size (and hence the max vector length) would be 256. Other AMD architectures might have different workgroup sizes, but I'm not sure; you can use the PGI 'pgaccelinfo' utility to check. To fully utilize a CU, multiple workgroups are run.

Q4. Why the first loop’s data access pattern is bad?
It seems that both loops’ data access patterns are proper; j is read first in first loop. Am I missing something?

Because you want the "vector" loop index to be on the stride-1 dimension. In the first example you have the "i" loop as the vector, but "j" indexes the stride-1 dimension.

Threads in a warp are SIMT (single instruction, multiple threads), meaning that every thread executes the same instruction at the same time. If one fetches memory, they all do. If the memory is contiguous, it can all be brought into the cache together. If it's not contiguous, you get memory divergence, where each thread must wait for the others to fetch memory.
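Concretely (a sketch reusing A and the loops from the question), here is what the 32 threads of one warp touch at a single step of the sequential loop:

! Loop #1: vector on i, array indexed A(j, i).
!$ACC LOOP VECTOR(32)
DO i = 1, 32
   ! Threads touch A(j,1)..A(j,32): each element is a full column
   ! (10000 elements) apart in column-major memory, so uncoalesced.
   A(j, i) = j + i
ENDDO

! Loop #2: vector on i, array indexed A(i, j).
!$ACC LOOP VECTOR(32)
DO i = 1, 32
   ! Threads touch A(1,j)..A(32,j): contiguous in memory, so the
   ! warp's loads and stores coalesce into a few transactions.
   A(i, j) = i + j
ENDDO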

  • Mat

Do you mean that I can use the atomic directive on a single array element?

Does it also hold if the array holding that element is allocatable?

Do you mean that I can use the atomic directive on a single array element?

Yes, provided that it's an intrinsic type. So using an integer array element is fine, but an array of a user-defined type is not.

Does it also hold if the array holding that element is allocatable?

Yes. The element itself can't be allocatable, but the array can.
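A minimal sketch of that case (the names and sizes are mine): the array is allocatable and the atomic update applies to one element of it:

 real, allocatable :: r(:)
 integer :: i
 allocate(r(1000))
 r = 0.0
 !$acc parallel loop copy(r)
 do i = 1, 100000
    !$acc atomic update
    r(mod(i,1000)+1) = r(mod(i,1000)+1) + 1.0
 enddo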

  • Mat