OpenACC parallel loop gang, vector

Hi, recently i’m trying to transfer CUDA code into OpenACC code. And i have a question for OpenACC parallel loop directive.

Given a source CUDA code as follow:

// kernel1, thread level parallelism
__device__ double kernel1(..)
{
	cal_func(); // each thread execute this function simultaneously
}

// obtain kernel size
int blocksize = 64;
int gridsize = N / blocksize;
if (n % blocksize != 0)
	gridsize++;

kernel1<gridsize, blocksize, stream_id>(..);

I tried to write a simple OpenACC version code:

#pragam acc routine seq
cal_func();

#pragma acc parallel vector vector_length(64)
for (size_t i = 0; i < N; ++i)
{
	cal_func();
}

But it turns out that the OpenACC code is translated into CUDA kernel with dimension (1, 64), where only 1 grid is called containing 2 warps. This is not what i expected, since the OpenACC code will only launch a threadblock to perform function sequentially.

So my question is, is there a simple way to launch a OpenACC vector-level loop with fixed size and multiple threadblocks? I haven’t find such directives.

Thanks in advance.
Tao

Hi Tao,

So my question is, is there a simple way to launch a OpenACC vector-level loop with fixed size and multiple threadblocks? I haven’t find such directives.

I’m not 100% clear on what you’re asking since a vector only loop wouldn’t use multiple thread blocks. Adding just vector forces the schedule to not use gang level parallelism.

The OpenACC gang schedule maps to a CUDA block so to get multiple thread blocks you would add “gang” to your loop schedule.

#pragma acc parallel loop gang vector vector_length(64)
for (size_t i = 0; i < N; ++i)

With “gang vector”, the compiler will distribute the outer loop into chunks sized to the vector length, with each vector (thread within the block) executing one iteration of the chunk. Depending on the number of gangs, each gang will execute one or more of the chunks.

Conceptually, it’s similar to strip-mining where the compiler adds an inner loop of a given chunk size. Something like:

// gang loop
for (size_t i = 0; i < N; i+=vector_length)
    // vector loop with length of 64
    for (size_t j = i; j < i+vector_length; ++j) {
       if (j < N) {
      ..

Note, if you want a fixed number of gangs (blocks), use the “num_gangs()” clause.



Hope this helps,
Mat

Hi, Mat,

Thanks for reply.

This is exactly what i want.

Tao