I’d like to know what the mapping between the OpenACC and CUDA parallelism levels is. Michael Wolfe from PGI states in a presentation that the mapping is as follows:
- gang => thread block
- worker => warp
- vector => thread
He adds, however, that this mapping is “loose” and “not strict”.
What does that mean? Why isn’t there more clarity, or stronger guarantees, about how the mapping is done?
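For context, this is the kind of loop nest I have in mind when I talk about the three levels. It's just an illustrative sketch (the function and array names are mine, not from the presentation), with each OpenACC level applied to one loop:

```c
// Minimal sketch: all three OpenACC levels on a nested loop.
// Per the talk, on an NVIDIA GPU this would map to
// gang -> thread block, worker -> warp, vector -> thread.
void add3d(int N, int M, int K,
           const float *restrict b, const float *restrict c, float *restrict a)
{
    #pragma acc parallel loop gang copyin(b[0:N*M*K], c[0:N*M*K]) copyout(a[0:N*M*K])
    for (int i = 0; i < N; ++i) {            // gang-level loop
        #pragma acc loop worker
        for (int j = 0; j < M; ++j) {        // worker-level loop
            #pragma acc loop vector
            for (int k = 0; k < K; ++k) {    // vector-level loop
                a[(i*M + j)*K + k] = b[(i*M + j)*K + k] + c[(i*M + j)*K + k];
            }
        }
    }
}
```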
Furthermore, the mapping doesn’t make sense to me. Given that the warp size on an NVIDIA GPU is 32 threads (fixed by the hardware), a loop with gang and vector parallelism using 100 gangs and 256 vector lanes (vector_length(256)) spawns 100 thread blocks of 256 threads each, which I verified in the profiler. In the OpenACC execution model there would be only one worker, but in CUDA / on the hardware there are 256 / 32 = 8 warps, so I don’t see the correspondence between workers and warps.
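For reference, the loop I profiled is essentially of this shape (the names and the loop body are placeholders, not my real code):

```c
// Roughly what I ran: gang and vector parallelism on a single loop,
// 100 gangs, vector length 256, no explicit worker clause.
// The profiler reports 100 thread blocks of 256 threads each.
void scale(int n, const float *restrict x, float *restrict y)
{
    #pragma acc parallel loop gang vector num_gangs(100) vector_length(256) copyin(x[0:n]) copyout(y[0:n])
    for (int i = 0; i < n; ++i) {
        y[i] = 2.0f * x[i];
    }
}
```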
EDIT to question 3) below: I just saw this in the OpenACC documentation for the vector_length clause: “There may be implementation-defined limits on the allowed values for the vector length expression.” That is the case for NVIDIA GPUs, and it explains why the mapping doesn’t hold perfectly in this case.
3) In the previous example, the 256 vector lanes will not execute in SIMD lockstep. Instead, there will be 8 groups of 32 threads, each group executing in SIMD/SIMT lockstep. So isn’t there a mismatch between the SIMD parallelism we express in OpenACC and the actual SIMD/SIMT parallelism on the hardware?
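Just to spell out the arithmetic behind that question, here is a trivial host-side sketch of how a 256-wide vector decomposes into warps (warp size hard-coded to 32, which is what NVIDIA hardware uses):

```c
#include <stdio.h>

// Shows how the 256 threads of one block are split into warps of 32 by the
// hardware: lane i belongs to warp i / 32, so vector_length(256) gives
// 256 / 32 = 8 lockstep groups, not a single 256-wide SIMD group.
int main(void)
{
    const int vector_length = 256;
    const int warp_size = 32;  /* fixed by the hardware */

    printf("warps per block: %d\n", vector_length / warp_size);  /* prints 8 */
    for (int lane = 0; lane < vector_length; lane += warp_size) {
        printf("lanes %3d..%3d execute in lockstep as warp %d\n",
               lane, lane + warp_size - 1, lane / warp_size);
    }
    return 0;
}
```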