Questions about "parallel" and "loop"

I want to make sure that I understand these two directives correctly, since the generated CUDA code is rather unreadable.

#pragma acc parallel
{
    code_1;
}

By using only the parallel directive here, the compiler would launch a kernel for this code section, and all the threads would execute the same “code_1”, is that correct? Even though it might be redundant.
However, how many threads would be launched for this kernel if the grid layout is not configured?

So would multiple “parallel” regions mean multiple kernel launches?

If “parallel” means a kernel launch, what does a nested “parallel” inside a “parallel” region mean?

#pragma acc parallel num_gangs(1) num_workers(32) vector_length(32)
{

}

This directive would launch a kernel with only one block, and that block would have 32*32 threads, is that correct?

Now consider a single “loop” inside the “parallel”:

    #pragma acc parallel
    {
        #pragma acc loop
        for (…)
    }

What would the compiler do in this case in terms of CUDA? Distribute the loop iterations across multiple blocks with one thread each, or across one block with many threads?

If I want to launch 1024 threads in a single block and distribute each loop iteration to a single thread, can I write something like this:

#pragma acc parallel num_gangs(1) num_workers(32) vector_length(32)
{
    …,
    …,
    #pragma acc loop vector
    for (…)
}



Also, what would happen if I replaced the “vector” after the “loop” with “gang” or “worker”?


Is there any reasonable way to debug when the serial code is correct but the generated parallel version gives a wrong answer? This might be the biggest concern while using this tool.

By using only the parallel directive here, the compiler would launch a kernel for this code section, and all the threads would execute the same “code_1”, is that correct?

No. Assuming “code_1” is sequential code or a call, a kernel will be generated but run sequentially by a single gang/vector. If “code_1” contains a loop, the compiler may auto-vectorize it, though only a single gang would be used.
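
As a concrete sketch (the variable and the copy clause here are my own illustration, not from the original post): with no “loop” directive, the region body is executed redundantly by each gang, and since scalars default to firstprivate in a parallel construct, a data clause is needed to see the result back on the host.

    #include <stdio.h>

    int main(void) {
        int x = 0;
        /* No "loop" directive: each gang redundantly executes the body.
           copy(x) is needed because scalars default to firstprivate. */
        #pragma acc parallel copy(x)
        {
            x = 42;
        }
        printf("x = %d\n", x);   /* prints 42 */
        return 0;
    }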

So would multiple “parallel” regions mean multiple kernel launches?

Yes, if they are not nested.
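
For example (a sketch with placeholder arrays), two un-nested regions like these would typically each become a separate kernel launch:

    #pragma acc parallel loop copyout(a[0:n])
    for (int i = 0; i < n; ++i)   /* first kernel */
        a[i] = 1.0f;

    #pragma acc parallel loop copyout(b[0:n])
    for (int i = 0; i < n; ++i)   /* second kernel */
        b[i] = 2.0f;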

If “parallel” means a kernel launch, what does a nested “parallel” inside a “parallel” region mean?

Nested parallel regions occur when one “parallel” region is placed inside another. Each vector would then launch a new compute region. Note that we haven’t found a good use case for nested parallelism, so we have put off implementing this feature.

The idea is based on CUDA’s Dynamic Parallelism.

This directive would launch a kernel with only one block, and that block would have 32*32 threads, is that correct?

For an NVIDIA device, yes.
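
To picture the mapping: in the CUDA code the PGI compiler typically generates (a compiler behavior, not something the OpenACC spec mandates), a gang maps to a thread block, a worker to threadIdx.y, and a vector lane to threadIdx.x, so the directive above behaves like a <<<1, dim3(32, 32)>>> launch.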

What would the compiler do in this case in terms of CUDA? Distribute the loop iterations across multiple blocks with one thread each, or across one block with many threads?

It could be either. With “loop”, you’ve just told the compiler that it should parallelize the loop, but not how to schedule it. The compiler will determine the schedule based upon the characteristics of the loop.

Most likely though, it will divide the loop across multiple gangs and vectors.
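
If you would rather be explicit than rely on the compiler’s choice, you can spell the schedule out yourself; a sketch with placeholder arrays:

    /* Explicit gang+vector schedule; in CUDA terms this is roughly
       i = blockIdx.x * blockDim.x + threadIdx.x. */
    #pragma acc parallel loop gang vector copyin(a[0:n], b[0:n]) copyout(c[0:n])
    for (int i = 0; i < n; ++i)
        c[i] = a[i] + b[i];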

If I want to launch 1024 threads in a single block and distribute each loop iteration to a single thread, can I write something like this:
#pragma acc parallel num_gangs(1) num_workers(32) vector_length(32)

Yes.

Though, you’d want to add “worker” to your loop directive; otherwise you’ll only get the 32 vector lanes, since you haven’t specified where the worker parallelism should be applied. You could also remove “num_workers” and set the vector length to 1024.

#pragma acc loop worker vector

Actually, there’s a good chance the compiler will combine the two and create a vector of 1024 anyway. This helps reduce the address computation required when accessing arrays. “worker” is better for use with nested for loops.
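
Sketches of both options, plus a nested-loop case where “worker” is the better fit (loop bodies and array names are placeholders):

    /* Option 1: 32 workers x 32 vector lanes = 1024 threads per block. */
    #pragma acc parallel num_gangs(1) num_workers(32) vector_length(32)
    {
        #pragma acc loop worker vector
        for (int i = 0; i < 1024; ++i)
            y[i] = 2.0f * x[i];
    }

    /* Option 2: drop num_workers and use a single vector of 1024. */
    #pragma acc parallel num_gangs(1) vector_length(1024)
    {
        #pragma acc loop vector
        for (int i = 0; i < 1024; ++i)
            y[i] = 2.0f * x[i];
    }

    /* Nested loops: worker on the outer loop, vector on the inner. */
    #pragma acc parallel loop gang worker
    for (int j = 0; j < rows; ++j) {
        #pragma acc loop vector
        for (int i = 0; i < cols; ++i)
            a[j * cols + i] *= 2.0f;
    }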

Also, what would happen if I replaced the “vector” after the “loop” with “gang” or “worker”?

If you use gang there, you’d have a gang loop with no vector parallelism. Given “num_gangs(1)”, you’d then have a single gang with a single vector.

Replacing it with “worker” would have little effect, except that instead of a block of 32x1 threads you’d have a block of 1x32 threads.
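
Put differently, you can keep the same launch configuration and just swap the schedule on the loop to compare (typical NVIDIA mapping assumed; the loop body is a placeholder):

    #pragma acc parallel num_gangs(1) num_workers(32) vector_length(32)
    {
        /* Try "gang", "worker", or "vector" here:
           gang   -> the single gang runs all iterations, no vector parallelism
           worker -> 32 workers (threadIdx.y) share the iterations
           vector -> 32 vector lanes (threadIdx.x) share the iterations */
        #pragma acc loop worker
        for (int i = 0; i < n; ++i)
            a[i] += 1.0f;
    }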

Is there any reasonable way to debug when the serial code is correct but the generated parallel version gives a wrong answer? This might be the biggest concern while using this tool.

Supported debuggers are TotalView, Allinea DDT, and cuda-gdb, with the caveat that some low-level information may not translate back to your source code. I also use cuda-memcheck when hunting down memory access problems.

Finding race conditions, such as those in your code, can be more difficult. There’s no tool that I’m aware of for finding race conditions on an accelerator. There are a few tools for multi-core CPUs, but those handle only a few threads and I haven’t found them very useful. I, like most folks I know, rely on experience when hunting down race conditions.
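
One hand-rolled approach that often helps (my own sketch, not a specific tool): keep the serial loop around and compare its output against the accelerated version, which at least narrows the problem down to a particular kernel.

    #include <math.h>
    #include <stdio.h>

    /* Reference: plain serial loop. */
    void saxpy_serial(int n, float a, const float *x, float *y) {
        for (int i = 0; i < n; ++i)
            y[i] += a * x[i];
    }

    /* Accelerated version under test. */
    void saxpy_acc(int n, float a, const float *x, float *y) {
        #pragma acc parallel loop copyin(x[0:n]) copy(y[0:n])
        for (int i = 0; i < n; ++i)
            y[i] += a * x[i];
    }

    /* Compare within a tolerance and report the first mismatch. */
    int compare(int n, const float *ref, const float *got, float tol) {
        for (int i = 0; i < n; ++i)
            if (fabsf(ref[i] - got[i]) > tol) {
                printf("mismatch at i=%d: %g vs %g\n", i, ref[i], got[i]);
                return 1;
            }
        return 0;
    }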

Hope this helps,
Mat