Mapping between OpenACC and CUDA parallelism levels

I’d like to know what the mapping between the OpenACC and CUDA parallelism levels is. Michael Wolfe from PGI states in a presentation that the mapping is as follows:

  • gang => thread block
  • worker => warp
  • vector => thread

He adds, however, that this mapping is “loose” and “not strict”.

  1. What does that mean? Why is there not more clarity/more guarantees on how the mapping occurs?

  2. Furthermore, the mapping doesn’t make sense to me. Given that the warp size on an NVIDIA GPU is 32 threads (hardware-defined), a loop with gang and vector parallelism, 100 gangs, and 256 vector lanes (vector_length(256)) spawns 100 thread blocks of 256 threads each, as I verified with the profiler (see the sketch after question 3 below). In the OpenACC execution model there would be only one worker, but in CUDA/on the hardware there would be 256 / 32 = 8 warps, so I don’t see the correspondence between warps and workers.

EDIT to question 3) below: I just saw this in the OpenACC documentation for the vector_length clause: “There may be implementation-defined limits on the allowed values for the vector length expression.” That is the case for NVIDIA GPUs, and it explains why the mapping doesn’t actually hold perfectly here.

3) In the previous example, the 256 vector lanes will not execute in SIMD lockstep. Instead, there will be 8 groups of 32 threads executing in SIMD/SIMT lockstep. So isn’t there a mismatch between the SIMD parallelism we are expressing in OpenACC and the actual SIMD/SIMT parallelism on the hardware?
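
For reference, the kind of loop I’m describing looks roughly like this (array names and the bound n are just placeholders):

!$acc parallel loop gang vector num_gangs(100) vector_length(256)
      do i = 1, n
        c(i) = a(i) + b(i)
      enddo
! Profiler: 100 thread blocks of 256 threads each, i.e. 256/32 = 8
! hardware warps per block, yet only one OpenACC worker.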

What does that mean? Why is there not more clarity/more guarantees on how the mapping occurs?

OpenACC is meant to target a generic accelerator. How that is mapped to a particular target device is implementation dependent. This allows for great flexibility and performance portability.

  1. I don’t see the correspondence between warps and workers.

A worker is a group of vector lanes which conceptually maps to a CUDA warp. Our actual implementation maps vector to threadidx%x and worker to threadidx%y.
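
For instance, a gang/worker/vector loop nest would land on the CUDA indices roughly as in the sketch below (loop bounds and array names are placeholders; the comments show the intended mapping, not actual compiler output):

!$acc parallel num_workers(4) vector_length(32)
!$acc loop gang
      do k = 1, p                   ! gang   -> thread block (blockidx%x)
!$acc loop worker
      do j = 1, m                   ! worker -> threadidx%y  (blockdim%y = num_workers)
!$acc loop vector
      do i = 1, n                   ! vector -> threadidx%x  (blockdim%x = vector_length)
        a(i,j,k) = b(i,j,k)
      enddo
      enddo
      enddo
!$acc end parallel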

Hope this helps,
Mat

Hi Mat!

Yes indeed. My question was how the PGI compiler implements this on NVIDIA GPUs.

A worker is a group of vector lanes which conceptually maps to a CUDA warp. Our actual implementation maps vector to threadidx%x and worker to threadidx%y.

OK, got it. Some follow-up questions:

  1. 3D thread blocks are never used then (threadidx%z) by the PGI implementation? Or could this be the case if vector_length and/or num_workers is greater than the maximum x- or y-dimension of a block (1024)?

  2. I’m guessing that gangs map to thread blocks. Could a 2D grid be launched if num_gangs is greater than the maximum x-dimension of a grid (2^31-1 since CC 3.0), and a 3D grid if num_gangs is greater than the maximum x-dimension multiplied by the maximum y-dimension of a grid ((2^31-1)*65535 = (2^31-1)*(2^16-1) = 2^47 - 2^31 - 2^16 + 1)?

I would have preferred to verify these hypotheses before asking, but I can’t access our workstation at the moment.

Thanks Mat!

  1. 3D thread blocks are never used then (threadidx%z) by the PGI implementation? Or could this be the case if vector_length and/or num_workers is greater than the maximum x- or y-dimension of a block (1024)?

As part of the PGI Accelerator model, you could do this by adding “vector” clauses to each loop. The OpenACC standards committee decided to make this illegal and instead added “worker”. However, we still support the PGI Accelerator model behavior with the “kernels” construct.

For example, the following loop will use all three threadidx dimensions:

!$acc kernels
!$acc loop gang vector
      do k=1,nsys
!$acc loop gang vector
      do j=1,n
!$acc loop vector
      do i=1,n
        A(i,j,k) = B(j,i,k)
      enddo
      enddo
      enddo
!$acc end kernels
... as shown in the -Minfo=accel output:
     30, Loop is parallelizable
         Accelerator kernel generated
         26, !$acc loop gang, vector(4) ! blockidx%y threadidx%z
         28, !$acc loop gang, vector(2) ! blockidx%x threadidx%y
         30, !$acc loop vector(64) ! threadidx%x



  1. I’m guessing that gangs map to thread blocks. Could a 2D grid be launched if num_gangs is greater than the maximum x-dimension of a grid (2^31-1 since CC 3.0), and a 3D grid if num_gangs is greater than the maximum x-dimension multiplied by the maximum y-dimension of a grid ((2^31-1)*65535 = (2^31-1)*(2^16-1) = 2^47 - 2^31 - 2^16 + 1)?

If you put too large a value in num_gangs for a particular target, the runtime will scale it back to the largest value supported by that target.

Also, we strip-mine the loop. In other words, we add a strided loop to the generated compute kernel so that each thread can process more than one element. If you reach the max grid dimension, each thread simply gets more work.
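
Conceptually, the generated code behaves like a hand-written grid-stride kernel along the lines of the CUDA Fortran sketch below (an illustration only, not the actual generated code):

module stride_example
  use cudafor
contains
  ! Each thread starts at its global index and strides by the total
  ! number of threads launched, so a fixed grid can cover any n.
  attributes(global) subroutine add_strided(n, a, b, c)
    integer, value :: n
    real :: a(n), b(n), c(n)
    integer :: i, i0, stride
    i0     = (blockidx%x - 1)*blockdim%x + threadidx%x
    stride = griddim%x*blockdim%x
    do i = i0, n, stride
      c(i) = a(i) + b(i)
    enddo
  end subroutine add_strided
end module stride_example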

-Mat