I do have automatic arrays.
To get it to run I already had to set:
setenv PGI_ACC_CUDA_HEAPSIZE 67000000
Is that the same thing?
I tried raising NV_ACC_CUDA_HEAPSIZE from 67000000 to 500000000, but that did not fix it. I may try removing the automatic arrays.
Thanks,
Jacques
Yes, although the older "PGI" prefix is deprecated. "NVCOMPILER" is the official prefix for environment variables, but I prefer the abbreviated "NV", which is also acceptable.
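In other words, any of these should set the same heap limit (the first is the official spelling, the last the deprecated one):

setenv NVCOMPILER_ACC_CUDA_HEAPSIZE 67000000
setenv NV_ACC_CUDA_HEAPSIZE 67000000
setenv PGI_ACC_CUDA_HEAPSIZE 67000000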
Hi Mat,
In subroutine mynn_tendencies I changed all 19 automatic arrays to arrays in the calling sequence and, in the calling routine, put them in a private clause. That sped up the entire main loop from 1.33 seconds to 0.90 seconds, i.e. to 68% of the original time. Now I'm going to look for other subroutines with automatic arrays.
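Schematically, the change looks like this (placeholder names, not the real mynn_tendencies interface):

! Before: tmp was an automatic array, allocated on every call
!   subroutine tend(nz, col)
!     real :: tmp(nz)                 ! automatic array -> device-side allocation
! After: tmp comes in through the calling sequence instead
subroutine tend(nz, col, tmp)
!$acc routine seq
  integer, intent(in)    :: nz
  real,    intent(inout) :: col(nz)
  real,    intent(inout) :: tmp(nz)  ! dummy argument, no allocation here
  tmp = 0.5*col
  col = col + tmp
end subroutine tend

! ...and in the caller, the scratch array goes in a private clause:
!$acc kernels loop private(tmp)
do i = 1, ncol
  call tend(nz, fld(:,i), tmp)
end do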
Thanks for the great tip!
Jacques
Hi Mat,
I removed all the automatic arrays and it sped up by 4X. I don't know what's taking the remaining time, but I wonder if it is the private arrays. I time the main loop, which has a kernels directive specifying 180 private arrays, most dimensioned (128) and some dimensioned (128,10). Does that take a lot of start-up time?
Thanks,
Jacques
Well, the private arrays do need to get allocated. Normally the overhead is not significant, but 180 arrays could take a while; I personally haven't used this many. Granted, the device memory should get re-used, so the allocation time only impacts the first time the kernel is called.
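For a rough picture (placeholder names, not your actual arrays), every private copy in a loop like this is backed by device memory that the runtime allocates on the first launch and then re-uses:

real :: p1(128), p2(128), p3(128,10)
!$acc kernels loop private(p1, p2, p3)  ! each iteration gets its own copies
do i = 1, nx
  p1 = 0.0
  p2 = 1.0
  p3 = 2.0
end do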
Have you profiled the code? If not, I suggest profiling using Nsight-Systems with OpenACC tracing enabled (i.e. "nsys profile -o <output> -t cuda,openacc <your-exe>"; optionally add "--stats=true" to see the text output). This will show the device memory allocation time.
-Mat