I recently converted a large code to use mostly “do concurrent” (DC) loops in place of OpenACC parallel loops. That works fine, and I am now porting more of the code directly to DC.
I am able to get the code working, but I am not sure it is optimized.
Since DC loops do not have a “default(present)” functionality, it is not always easy to tell whether my manual data management is complete, or whether the runtime is transferring a lot of data back and forth to the GPU.
For small codes, I can simply use ACC_NOTIFY to look for the transfers, or use Nsight Systems. For this large code, however, it has always been far easier to put “default(present)” on all my OpenACC loops, in which case the code would crash if I had forgotten to place the data on the device (and the runtime would helpfully say which variable it was).
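To make the difference concrete, here is a minimal sketch of the two loop styles side by side. The program and the names in it (`dc_present_sketch`, `a`, `b`, `n`) are made up for illustration and are not taken from my actual code; it assumes manual OpenACC data management with managed memory turned off.

```fortran
program dc_present_sketch
  implicit none
  integer, parameter :: n = 1000000          ! hypothetical problem size
  real(8), allocatable :: a(:), b(:)         ! hypothetical arrays
  integer :: i

  allocate(a(n), b(n))
  a = 1.0d0
  b = 0.0d0

  ! Manual data management, shared by both loop styles below.
  !$acc enter data copyin(a) create(b)

  ! OpenACC loop: with default(present), the run stops with an error
  ! if a or b had not been placed on the device beforehand.
  !$acc parallel loop default(present)
  do i = 1, n
    b(i) = 2.0d0 * a(i)
  end do

  ! DO CONCURRENT loop: there is no clause to request the same check,
  ! so a forgotten array is handled implicitly instead of being reported.
  do concurrent (i = 1:n)
    b(i) = b(i) + a(i)
  end do

  ! Bring the result back and release the device copies.
  !$acc exit data copyout(b) delete(a)

  print *, b(1), b(n)
  deallocate(a, b)
end program dc_present_sketch
```

I would build something like this with both options enabled, for example `nvfortran -acc=gpu -stdpar=gpu`. If the `enter data` line is removed, the first loop fails at run time, while the second loop still runs and gives no indication that anything is missing.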
Now, for DC, I cannot do this, making it very hard to know if my code is optimal.
I would like to request a new compiler flag that would activate “default(present)” behavior on DO CONCURRENT loops (i.e., treat each loop exactly as if it were an OpenACC loop with default(present) on it). Maybe something like “-stdpar=gpu,checkpresent”?