Mapping between OpenACC and CUDA parallelism levels

I’d like to know what the mapping between the OpenACC and CUDA parallelism levels is. Michael Wolfe from PGI states in a presentation that the mapping is as follows:

  • gang => thread block
  • worker => warp
  • vector => thread

He adds, however, that this mapping is “loose” and “not strict”.

  1. What does that mean? Why is there not more clarity/more guarantees on how the mapping occurs?

  2. Furthermore, the mapping doesn’t make sense to me. Given that the warp size on an NVIDIA GPU is 32 threads (hardware-defined), a loop with gang and vector parallelism, 100 gangs, and 256 vector lanes (vector_length(256)) spawns 100 thread blocks of 256 threads each, as I verified with the profiler (see the sketch after question 3 below). In the OpenACC execution model there would be only one worker, but in CUDA/on the hardware there would be 256 / 32 = 8 warps, so I don’t see the correspondence between warps and workers.

EDIT to question 3) below: I just saw this in the OpenACC documentation for the vector_length clause: “There may be implementation-defined limits on the allowed values for the vector length expression.” That is the case for NVIDIA GPUs, and it explains why the mapping doesn’t actually hold perfectly here.

3) In the previous example, the 256 vector lanes will not execute in SIMD lockstep. Instead, there will be 8 groups of 32 threads executing in SIMD/SIMT lockstep. So isn’t there a mismatch between the SIMD parallelism we are expressing in OpenACC and the actual SIMD/SIMT parallelism on the hardware?
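
For reference, the kind of loop I’m describing looks roughly like this (array names and the bound n are just placeholders):

!$acc parallel loop gang vector num_gangs(100) vector_length(256)
      do i = 1, n
        c(i) = a(i) + b(i)
      enddo
! Profiler: 100 thread blocks of 256 threads each, i.e. 256/32 = 8
! hardware warps per block, yet only one OpenACC worker.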

What does that mean? Why is there not more clarity/more guarantees on how the mapping occurs?

OpenACC is meant to target a generic accelerator. How that is mapped to a particular target device is implementation dependent. This allows for great flexibility and performance portability.

  1. I don’t see the correspondence between warps and workers.

A worker is a group of vector lanes which conceptually maps to a CUDA warp. Our actual implementation maps vector to threadidx%x and worker to threadidx%y.
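
For instance, a gang/worker/vector loop nest would land on the CUDA indices roughly as in the sketch below (loop bounds and array names are placeholders; the comments show the intended mapping, not actual compiler output):

!$acc parallel num_workers(4) vector_length(32)
!$acc loop gang
      do k = 1, p                   ! gang   -> thread block (blockidx%x)
!$acc loop worker
      do j = 1, m                   ! worker -> threadidx%y  (blockdim%y = num_workers)
!$acc loop vector
      do i = 1, n                   ! vector -> threadidx%x  (blockdim%x = vector_length)
        a(i,j,k) = b(i,j,k)
      enddo
      enddo
      enddo
!$acc end parallel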

Hope this helps,
Mat

Hi Mat!

Yes indeed. My question was how the PGI compiler implements this on NVIDIA GPUs.

A worker is a group of vector lanes which conceptually maps to a CUDA warp. Our actual implementation maps vector to threadidx%x and worker to threadidx%y.

OK, got it. Some follow-up questions:

  1. 3D thread blocks are never used then (threadidx%z) by the PGI implementation? Or could this be the case if vector_length and/or num_workers is greater than the maximum x- or y-dimension of a block (1024)?

  2. I’m guessing that gangs map to thread blocks. Could a 2D grid be launched if num_gangs is greater than the maximum x-dimension of a grid (2^31-1 since CC 3.0), and a 3D grid if num_gangs is greater than the maximum x-dimension multiplied by the maximum y-dimension of a grid ((2^31-1)*65535 = (2^31-1)*(2^16-1) = 2^47 - 2^31 - 2^16 + 1)?

I would have preferred to verify these hypotheses before asking, but I can’t access our workstation at the moment.

Thanks Mat!

  1. 3D thread blocks are never used then (threadidx%z) by the PGI implementation? Or could this be the case if vector_length and/or num_workers is greater than the maximum x- or y-dimension of a block (1024)?

As part of the PGI Accelerator model, you could do this by adding “vector” clauses to each loop. The OpenACC standards committee decided to make this illegal and instead added “worker”. However, we still support the PGI Accelerator model behavior with the “kernels” construct.

For example, the following loop will use all three threadidx dimensions:

!$acc kernels
!$acc loop gang vector
      do k=1,nsys
!$acc loop gang vector
      do j=1,n
!$acc loop vector
      do i=1,n
        A(i,j,k) = B(j,i,k)
      enddo
      enddo
      enddo
!$acc end kernels
... as shown in the -Minfo=accel output:
     30, Loop is parallelizable
         Accelerator kernel generated
         26, !$acc loop gang, vector(4) ! blockidx%y threadidx%z
         28, !$acc loop gang, vector(2) ! blockidx%x threadidx%y
         30, !$acc loop vector(64) ! threadidx%x



  1. I’m guessing that gangs map to thread blocks. Could a 2D grid be launched if num_gangs is greater than the maximum x-dimension of a grid (2^31-1 since CC 3.0), and a 3D grid if num_gangs is greater than the maximum x-dimension multiplied by the maximum y-dimension of a grid ((2^31-1)*65535 = (2^31-1)*(2^16-1) = 2^47 - 2^31 - 2^16 + 1)?

If you put too large a value in num_gangs for a particular target, the runtime will scale it back to the largest value supported by that target.

Also, we strip-mine the loop. In other words, we add a strided loop to the generated compute kernel so that each thread can process more than one element. If you reach the max grid dimension, each thread simply gets more work.
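
Conceptually, the generated code behaves like a hand-written grid-stride kernel along the lines of the CUDA Fortran sketch below (an illustration only, not the actual generated code):

module stride_example
  use cudafor
contains
  ! Each thread starts at its global index and strides by the total
  ! number of threads launched, so a fixed grid can cover any n.
  attributes(global) subroutine add_strided(n, a, b, c)
    integer, value :: n
    real :: a(n), b(n), c(n)
    integer :: i, i0, stride
    i0     = (blockidx%x - 1)*blockdim%x + threadidx%x
    stride = griddim%x*blockdim%x
    do i = i0, n, stride
      c(i) = a(i) + b(i)
    enddo
  end subroutine add_strided
end module stride_example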

-Mat