copyin copyout with CUDA Fortran

Recently, when I turned on the flag -Minfo, I started seeing a lot of ‘copyin’ and ‘copyout’ messages for function calls.

setup_ryr_2:                                                                                
   2825, Possible copy in and copy out of akrm in call to getcompk_4                                                                                          
   2826, Possible copy in and copy out of akrp in call to getcompk_4    

put_sfu2grid3d_nonuniform_rogue_ryr:                                                   
   2246, Copy in and copy out of sfu_coords in call to valid_sfu_location                                                                                  
   2247, Copy in and copy out of sfu_coords in call to map_loc2grid                                                                                        
   2283, Copy in and copy out of sfu_coords in call to valid_sfu_location                                                                                  
   2284, Copy in and copy out of sfu_coords in call to map_loc2grid

I’m not using the Accelerator programming model, but CUDA Fortran. So, my question is: how can I optimize code written in CUDA Fortran to tell the compiler when to do copyin and/or copyout?
Can it be done using clauses like in the Accelerator model? Please give me an example if it can be done.

Thanks,
Tuan

Hi Tuan,

These have nothing to do with CUDA Fortran or OpenACC. The compiler is telling you that it may (in one case) and has (in another) created a temporary array in order to pass a non-unit-stride (non-contiguous) array to a subroutine. This can cause performance problems, which can only be fixed by passing in contiguous arrays.
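To illustrate (a minimal host-side sketch with made-up names, not taken from your code): passing a strided section such as a(1,:) to a routine with an explicit-shape dummy forces the compiler to copy it into a contiguous temporary and back, while a column section such as a(:,1) is already contiguous and needs no temporary.

program temp_copy_example
  implicit none
  real :: a(100, 100)
  a = 0.0

  ! a(1,:) strides across a column-major array, so it is non-contiguous:
  ! the compiler builds a temporary (copy in / copy out) for this call.
  call scale_vec(a(1, :), 100)

  ! a(:,1) is a whole column, hence contiguous: no temporary is needed.
  call scale_vec(a(:, 1), 100)

contains
  subroutine scale_vec(v, n)
    integer, intent(in) :: n
    real, intent(inout) :: v(n)   ! explicit-shape dummy expects contiguous storage
    v = 2.0 * v
  end subroutine scale_vec
end program temp_copy_example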

  • Mat

Hi Mat,
Based on your response I have a question. Padding is used quite often in C. So, my question is: if I use padding and pass a device array like this in Fortran,

real, device, allocatable :: matA(:)

allocate(matA(pad+size))

call my_kernel <<<...,...>>> (matA(pad+1:), ...)

should the CUDA Fortran compiler create a duplicate of matA(pad+1:), or will it just use the original matA array? Can the compiler detect that matA(pad+1:) is still a contiguous array here?

Thanks,
Tuan

Hi Tuan,

Can the compiler detect that matA(pad+1:) is still a contiguous array here?

Yes. However, the compiler may still have to create a new descriptor if the interface for this array is assumed-size (*).

should the CUDA Fortran compiler create a duplicate of matA(pad+1:), or will it just use the original matA array?

In CUDA Fortran, it actually can’t duplicate the array. However, since the contiguity test can’t be done until run time, you may get run-time failures if a reshape is needed or if the test can’t determine that the section is contiguous.
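For instance (a minimal sketch with a hypothetical kernel scale_kernel and made-up sizes, not your actual code): matA(pad+1:) is just an offset into contiguous device storage, so it can be passed to the kernel directly, whereas a strided section such as matA(1:n:2) would need a reshape that CUDA Fortran cannot perform on device memory, which is where run-time failures show up.

module device_kernels
  use cudafor
contains
  attributes(global) subroutine scale_kernel(a, n)
    real, device :: a(*)          ! assumed-size dummy: expects contiguous storage
    integer, value :: n
    integer :: i
    i = (blockIdx%x - 1) * blockDim%x + threadIdx%x
    if (i <= n) a(i) = 2.0 * a(i)
  end subroutine scale_kernel
end module device_kernels

program pad_example
  use cudafor
  use device_kernels
  implicit none
  integer, parameter :: pad = 32, n = 1024
  real, device, allocatable :: matA(:)

  allocate(matA(pad + n))
  matA = 1.0

  ! Contiguous: same storage, just offset by pad elements; no copy needed.
  call scale_kernel<<<(n + 255) / 256, 256>>>(matA(pad+1:), n)

  ! Non-contiguous (stride 2): would require a temporary reshape, which is
  ! not possible for device memory, so a section like this should be avoided:
  ! call scale_kernel<<<(n/2 + 255) / 256, 256>>>(matA(1:n:2), n/2)

  deallocate(matA)
end program pad_example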

  • Mat