Hi, recently I’ve been trying to port CUDA code to OpenACC, and I have a question about the OpenACC parallel loop directive.
Given source CUDA code as follows:
// kernel1, thread-level parallelism
__global__ void kernel1(..)
{
    cal_func(); // each thread executes this function simultaneously
}
// compute the launch configuration
int blocksize = 64;
int gridsize = N / blocksize;
if (N % blocksize != 0)
    gridsize++;
kernel1<<<gridsize, blocksize, 0, stream_id>>>(..);
I tried to write a simple OpenACC version:
#pragma acc routine seq
void cal_func();
#pragma acc parallel loop vector vector_length(64)
for (size_t i = 0; i < N; ++i)
{
    cal_func();
}
But it turns out that the OpenACC code is translated into a CUDA kernel with launch dimensions (1, 64), i.e. only 1 block containing 2 warps. This is not what I expected, since with a single thread block the chunks of the loop are effectively processed sequentially.
So my question is: is there a simple way to launch an OpenACC vector-level loop with a fixed vector size and multiple thread blocks? I haven’t found such a directive.
I’m not 100% clear on what you’re asking, since a vector-only loop wouldn’t use multiple thread blocks: adding just vector forces the schedule not to use gang-level parallelism.
The OpenACC gang schedule maps to a CUDA block, so to get multiple thread blocks you would add “gang” to your loop schedule.
#pragma acc parallel loop gang vector vector_length(64)
for (size_t i = 0; i < N; ++i)
With “gang vector”, the compiler will distribute the outer loop into chunks sized to the vector length, with each vector (thread within the block) executing one iteration of the chunk. Depending on the number of gangs, each gang will execute one or more of the chunks.
Conceptually, it’s similar to strip-mining, where the compiler adds an inner loop of a given chunk size. Something like:
// gang loop
for (size_t i = 0; i < N; i += vector_length)
    // vector loop with length of 64
    for (size_t j = i; j < i + vector_length; ++j) {
        if (j < N) {
            ..
        }
    }
Note, if you want a fixed number of gangs (blocks), use the “num_gangs()” clause.
An OpenACC gang maps to a CUDA block and vector maps to a CUDA thread in the x dimension. Worker is a thread in the y dimension.
In CUDA, threads are grouped in what’s called a “warp” where a warp contains 32 threads. Hence the vector_length should be at least 32. If it’s lower, like 16, then the warp will still have 32 threads, just 16 will be wasted. Additional vectors should be added in increments of 32 with the max being 1024.
How do you know what is the optimum number of gangs for a given vector length?
Without the “num_gangs” clause, the number of gangs is set dynamically at runtime based on the loop trip count, and this is often the best schedule, though what’s optimal will depend heavily on the code.