Yes - you’re right; I just included that ‘snippet’ for the record, to show which method worked best in my final code. The initial testing I did with all six of your suggestions used just a very simple nested do-loop with one equation in it. I then had to apply that method to the full LBM equations, where there are 8 sets of nested do-loops inside one iterator, so it was more difficult to get working than the initial demo case. Got there eventually though, with a calculated speedup of 103x compared with the serial CPU approach.
Hi Mat,
What is the correct compiler flag for an NVIDIA A10G GPU, please? (Is it -gpu=cc86?)
I have previously been using -gpu=cc75 for a Tesla T4, but now I need to run on something with more RAM.
Also, how would I run on multiple GPUs? I tried export ACC_NUM_CORES=4
but it didn't seem to make any difference.
Thanks Giles.
Hi Giles,
I’m not familiar with the A10G card since it’s an AI card and my team focuses on HPC devices like the A100. Looking at the CUDA page on Wikipedia, an A10 is compute capability 8.6, but I’m not sure if an A10G is different.
Try running ‘nvaccelinfo’ which will show you the compute capability of the card.
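As a sketch of what that looks like in practice (the source file name lbm.f90 is hypothetical; nvaccelinfo and nvfortran come with the NVIDIA HPC SDK):

```shell
# Query the attached GPU; the output includes a "Compute Capability" line
nvaccelinfo | grep -i "compute capability"

# If it reports 8.6, compile for that target, e.g.:
nvfortran -acc -gpu=cc86 -Minfo=accel lbm.f90 -o lbm
```

-Minfo=accel is optional but useful, since it reports which loops the compiler actually offloaded.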
> Also, how would I run on multiple GPUs? I tried export ACC_NUM_CORES=4 but it didn't seem to make any difference.
ACC_NUM_CORES sets the number of CPU cores to use when targeting multicore CPUs (-acc=multicore). For multiple GPUs, you need to use something like MPI.
For a basic tutorial, please see: "Using OpenACC with MPI" Tutorial (Version 23.11 for ARM, OpenPower, x86)
Jiri also has a presentation which focuses on the MPI side; mine focused more on OpenACC. See: https://on-demand.gputechconf.com/gtc/2015/presentation/S5711-Jiri-Kraus.pdf
A web search for MPI and OpenACC will give many other examples.
-Mat