OpenACC: no parallelisation with -ta=multicore

Hi,

I am new to OpenACC and I am trying to get multicore parallelisation running. Currently it gives no runtime improvement over a working OpenMP implementation of the same program.

The loop I want to parallelize looks like this:

#pragma acc parallel loop
for (int y = 0; y < yDim; y++)
{
    for (int x = 0; x < xDim; x++)
    {
        int color = determineColor(x, y, points, numPoints, max_res);
        imageBuffer[x][y] = color;
    }
}

I am using the pgcc compiler with the following settings:

pgcc -Minfo -ta=multicore

I know that the inner loop probably won't be parallelized, but I was expecting at least some improvement from the outer loop. So far, however, no improvement over a sequential run of the program or over the OpenMP version has been measurable.
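
For reference, the OpenMP version parallelizes the same outer loop; it looks essentially like this (a sketch, the real code may differ slightly):

/* OpenMP counterpart: distribute the rows (outer loop) across host threads */
#pragma omp parallel for
for (int y = 0; y < yDim; y++)
{
    for (int x = 0; x < xDim; x++)
    {
        int color = determineColor(x, y, points, numPoints, max_res);
        imageBuffer[x][y] = color;
    }
}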


Hi Blob,

Can you please provide a minimal reproducing example? Unfortunately there's not enough information here for us to help.

-Mat

voronoi.c (3.2 KB) main.c (3.2 KB)

Hey, these two files should get the program running and give a general idea of what it looks like. Basically it generates Voronoi diagrams. They are a slightly older version, but the problem remains the same.

Might just be a binding issue. When I

  • fix the header file,
  • uncomment the OpenMP pragma,
  • compile twice, building OpenMP and OpenACC binaries (see the sketch below),
  • set OMP_NUM_THREADS=20 and ACC_NUM_CORES=20 so only a single socket is used,
  • use “taskset -c 0-19 a.out” to bind,

then run each binary, the times are roughly the same and consistent between runs. At least when I run it, going cross-socket (i.e. 40 cores) leads to severe run-to-run variation, presumably due to memory being allocated on a single NUMA node. Unfortunately, numactl isn't installed on the system I'm using, otherwise I'd try interleaving the memory to see if that helps the run-to-run variance.
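
For concreteness, here is a minimal sketch of the "compile twice" step, assuming one source file carries both pragmas (the attached files may simply comment one of them out instead; the output names are just examples):

/* One source, two builds: _OPENACC is predefined when OpenACC is enabled,
   so the preprocessor picks the matching pragma for each binary.
     OpenMP binary:            pgcc -mp -o voronoi_omp main.c voronoi.c
     OpenACC multicore binary: pgcc -Minfo -ta=multicore -o voronoi_acc main.c voronoi.c */
#ifdef _OPENACC
#pragma acc parallel loop
#else
#pragma omp parallel for
#endif
for (int y = 0; y < yDim; y++)
    for (int x = 0; x < xDim; x++)
        imageBuffer[x][y] = determineColor(x, y, points, numPoints, max_res);

Either way, both binaries then time the identical loop nest, so the comparison is apples-to-apples.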

-Mat

Thanks for the answer. I am working on a Windows machine, so I can't use taskset. I know this might sound really dumb, but where do I set ACC_NUM_CORES in the code? I tried calling acc_set_num_cores(int) in the first line of my main function, but it changed nothing in the runtime. If you could give a code example of how it should look, that would help me a lot.


Apologies if I wasn't clear. ACC_NUM_CORES and OMP_NUM_THREADS are environment variables, so they would be set in your Windows command shell via “set ACC_NUM_CORES=20”, or, if you're using a bash shell, via “export ACC_NUM_CORES=20”.
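
If you want to confirm the variables are actually visible to the program, a quick sanity check like this can be dropped into main (plain C; nothing here is OpenACC-specific):

#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    /* Print what the runtime will see; both variables must be set in the
       shell (e.g. "set ACC_NUM_CORES=20") before the program is launched. */
    const char *acc_cores   = getenv("ACC_NUM_CORES");
    const char *omp_threads = getenv("OMP_NUM_THREADS");
    printf("ACC_NUM_CORES   = %s\n", acc_cores   ? acc_cores   : "(not set)");
    printf("OMP_NUM_THREADS = %s\n", omp_threads ? omp_threads : "(not set)");
    /* ... rest of the program ... */
    return 0;
}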

If I am running on a single GPU, how do I set the number of threads (or parallel processes)?
Is it ACC_NUM_CORES or OMP_NUM_THREADS?
Thanks, Giles.

These only set the number of host threads to use.

The number of threads on the GPU is the product of the number of gangs, workers, and vector lanes. You could set this explicitly via the “num_gangs”, “num_workers”, and “vector_length” clauses, but I wouldn't recommend it. While these clauses can be useful under certain circumstances, in general it's best to let the compiler use as many threads as possible depending on the parallel loop trip counts and the target architecture.
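
For completeness, a sketch of what explicitly sizing the launch could look like on a loop nest like the one above; the numbers are purely illustrative, and as said, letting the compiler decide is usually the better default:

/* Explicit launch sizing (illustrative values only):
   256 gangs, 128-wide vectors; num_workers(...) could be added the same way. */
#pragma acc parallel loop gang num_gangs(256) vector_length(128)
for (int y = 0; y < yDim; y++)
{
    #pragma acc loop vector
    for (int x = 0; x < xDim; x++)
    {
        int color = determineColor(x, y, points, numPoints, max_res);
        imageBuffer[x][y] = color;
    }
}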