OpenACC: Best way to parallelize nested DO loops (continued)

Yes - you’re right, I just included that snippet for the record, to show which method worked best for my final code. The initial testing I did with all six of your suggestions used just a very simple nested do-loop with one equation in it. Then I had to apply that method to the full LBM equations, where there are 8 sets of nested do-loops inside one iteration loop. So it was more difficult to get working than the initial demo case. Got there eventually though, with a calculated speedup of 103x compared with the serial CPU approach.

Hi Mat,

What is the correct compiler flag for an NVIDIA A10G GPU, please? (Is it -gpu=cc86?)
I have previously been using -gpu=cc75 for a Tesla T4, but now I need to run on something with more RAM.

Also, how would I run on multiple GPUs? I tried export ACC_NUM_CORES=4
but it didn't seem to make any difference.

Thanks, Giles.

Hi Giles,

I’m not familiar with the A10G card since it’s an AI card and my team focuses on HPC devices like the A100. Looking at the CUDA - Wikipedia page, an A10 is compute capability 8.6, but I’m not sure if an A10G is different.

Try running ‘nvaccelinfo’, which will show you the compute capability of the card.

Also, how would I run on multiple GPUs? I tried export ACC_NUM_CORES=4
but it didn't seem to make any difference.

ACC_NUM_CORES sets the number of CPU cores to use when targeting multicore CPUs (-acc=multicore). For multiple GPUs, you need to use something like MPI, typically assigning one rank per device (see the sketch below).
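
A common pattern is to launch one MPI rank per GPU and have each rank bind to its own device through the OpenACC runtime API. Here is a minimal Fortran sketch of that idea; the file name, array size, and compile line are only illustrative assumptions for an NVHPC toolchain:

```fortran
! Minimal sketch: bind each MPI rank to one GPU, then run an OpenACC loop there.
! Example compile line (assumed NVHPC wrapper): mpif90 -acc -gpu=cc86 mpi_acc.f90
program mpi_acc
  use mpi
  use openacc
  implicit none
  integer :: ierr, rank, nranks, ndev, mydev, i
  real, allocatable :: a(:)

  call MPI_Init(ierr)
  call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
  call MPI_Comm_size(MPI_COMM_WORLD, nranks, ierr)

  ! Assign this rank to one of the visible GPUs (round-robin over devices)
  ndev  = acc_get_num_devices(acc_device_nvidia)
  mydev = mod(rank, ndev)
  call acc_set_device_num(mydev, acc_device_nvidia)

  allocate(a(1000000))

  ! Each rank fills its own array on its own GPU
  !$acc parallel loop copyout(a)
  do i = 1, size(a)
     a(i) = real(i + rank)
  end do

  print *, 'rank', rank, 'using device', mydev, 'a(1) =', a(1)

  deallocate(a)
  call MPI_Finalize(ierr)
end program mpi_acc
```

Running with, say, mpirun -np 4 would then give each of four ranks its own GPU (assuming four devices are visible); in a real code each rank would also work on its own portion of the domain.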

For a basic tutorial, please see: Using OpenACC with MPI Tutorial Version 23.11 for ARM, OpenPower, x86

Jiri also has a presentation which focuses on the MPI side; mine focused more on the OpenACC side. See: https://on-demand.gputechconf.com/gtc/2015/presentation/S5711-Jiri-Kraus.pdf

A web search for MPI and OpenACC will give many other examples.

-Mat