Hi, recently I’ve been trying to port CUDA code to OpenACC, and I have a question about the OpenACC parallel loop directive.
Given source CUDA code as follows:
// kernel1, thread-level parallelism
__global__ void kernel1(..)
{
    cal_func(); // each thread executes this function simultaneously
}
// compute the launch configuration
int blocksize = 64;
int gridsize = N / blocksize;
if (N % blocksize != 0)
    gridsize++;
kernel1<<<gridsize, blocksize, 0, stream_id>>>(..);
I tried to write a simple OpenACC version:
#pragma acc routine seq
void cal_func();
#pragma acc parallel loop vector vector_length(64)
for (size_t i = 0; i < N; ++i)
{
    cal_func();
}
But it turns out that the OpenACC code is translated into a CUDA kernel with launch dimensions (1, 64), i.e. only 1 block containing 2 warps. This is not what I expected, since with a single thread block the chunks of the loop are effectively processed sequentially.
So my question is: is there a simple way to launch an OpenACC vector-level loop with a fixed vector size and multiple thread blocks? I haven’t found such a directive.
I’m not 100% clear on what you’re asking, since a vector-only loop wouldn’t use multiple thread blocks: adding just vector forces the schedule not to use gang-level parallelism.
The OpenACC gang schedule maps to a CUDA block, so to get multiple thread blocks you would add “gang” to your loop schedule.
#pragma acc parallel loop gang vector vector_length(64)
for (size_t i = 0; i < N; ++i)
With “gang vector”, the compiler will distribute the outer loop into chunks sized to the vector length, with each vector (thread within the block) executing one iteration of the chunk. Depending on the number of gangs, each gang will execute one or more of the chunks.
Conceptually, it’s similar to strip-mining, where the compiler adds an inner loop of a given chunk size. Something like:
// gang loop
for (size_t i = 0; i < N; i += vector_length)
    // vector loop with length of 64
    for (size_t j = i; j < i + vector_length; ++j) {
        if (j < N) {
            ..
        }
    }
Note, if you want a fixed number of gangs (blocks), use the “num_gangs()” clause.
An OpenACC gang maps to a CUDA block and vector maps to a CUDA thread in the x dimension. Worker is a thread in the y dimension.
In CUDA, threads are grouped in what’s called a “warp” where a warp contains 32 threads. Hence the vector_length should be at least 32. If it’s lower, like 16, then the warp will still have 32 threads, just 16 will be wasted. Additional vectors should be added in increments of 32 with the max being 1024.
How do you know what is the optimum number of gangs for a given vector length?
Without the “num_gangs” clause, the number of gangs is set dynamically at runtime based on the loop trip count, and this is often the best schedule, though what’s optimal will depend heavily on the code.