OpenACC use_device / OpenMP use_device_ptr / use_device_addr in combination with cuBLAS

Dear NVIDIA-team,

please consider the following test code, which calls cublasDgemm:

OpenACC:

subroutine test(A,B,C)

  use cublas
  use openacc

  implicit none

  real(kind=8) :: A(:,:), B(:,:), C(:,:)

!$acc data copyin(A,B), copyout(C)
!$acc host_data use_device(A,B,C)
  call cublasDgemm('N','N', size(A,1), size(A,1), size(A,1), 1.0d0, A, size(A,1), B, size(B,1), 0.0d0, C, size(C,1))
!$acc end host_data
!$acc end data

end subroutine test

OpenMP:

subroutine test(A,B,C)

  use cublas
  use omp_lib

  implicit none

  real(kind=8) :: A(:,:), B(:,:), C(:,:)

!$omp target data map(to:A,B) map(from:C)
!$omp target data use_device_addr(A,B,C)
  call cublasDgemm('N','N', size(A,1), size(A,1), size(A,1), 1.0d0, A, size(A,1), B, size(B,1), 0.0d0, C, size(C,1))
!$omp end target data
!$omp end target data

end subroutine test

When the codes are compiled without the use_device* directives, the compiler reports:

     14, Generating copyin(a(:,:)) [if not already present]
         Generating copyout(c(:,:)) [if not already present]
         Generating copyin(b(:,:)) [if not already present]
     16, Possible copy in and copy out of c in call to cublasdgemmcu_hpm
         Possible copy in and copy out of b in call to cublasdgemmcu_hpm
         Possible copy in and copy out of a in call to cublasdgemmcu_hpm

When the host_data use_device directive is used in OpenACC, the “Possible copy” warning disappears. In the OpenMP case, however, the warning remains even when the target data use_device_ptr or use_device_addr directives are used. What is the reason for this?

On the other hand, nsys shows that, regardless of whether use_device is used, only 2 copyin and 1 copyout operations are generated.

So I am wondering whether use_device makes sense in these cases at all.

Thanks a lot and regards,

René

In this case, it’s optional. The cublasDgemm interface uses the CUDA Fortran “device” attribute so the compiler knows to pass in the device address of the arrays.

However, if you were using the generic “dgemm” interface, then you’d need “use_device” to ensure the device routine, not the host routine, is called.

Though as best practice, I’d recommend still using “use_device” so you’re not relying on an NVIDIA specific CUDA Fortran feature.
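For illustration, here is a sketch of the generic-interface case where “use_device” is required (assuming a BLAS library providing the standard dgemm is linked, and that the matrices are square as in the example above):

```fortran
subroutine test_generic(A, B, C)

  use openacc

  implicit none

  real(kind=8) :: A(:,:), B(:,:), C(:,:)

!$acc data copyin(A,B) copyout(C)
!$acc host_data use_device(A,B,C)
  ! With the generic interface there is no "device" attribute on the
  ! arguments, so use_device is what makes the compiler pass the
  ! device addresses of A, B, and C instead of the host addresses.
  call dgemm('N','N', size(A,1), size(A,1), size(A,1), 1.0d0, &
             A, size(A,1), B, size(B,1), 0.0d0, C, size(C,1))
!$acc end host_data
!$acc end data

end subroutine test_generic
```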

Possible copy in and copy out of c in call to cublasdgemmcu_hpm

This is just a warning indicating that the compiler has detected that, because the interface passes the array as assumed-size, the actual argument is possibly non-contiguous, which in turn can require the compiler to create a temporary array for the argument. This is not occurring in this case, so it’s safe to ignore.

The warning is mostly there so that, if you notice a performance issue, this is one spot to investigate.
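If you want to address the warning’s underlying concern at the source level, one option (a sketch, assuming the actual arguments really are always contiguous) is to give the dummy arguments the Fortran `contiguous` attribute:

```fortran
subroutine test(A, B, C)

  use cublas
  use openacc

  implicit none

  ! The "contiguous" attribute (Fortran 2008) guarantees the
  ! assumed-shape dummies are stride-1, so the compiler can rule
  ! out the temporary-copy case at the cublasDgemm call.
  real(kind=8), contiguous :: A(:,:), B(:,:), C(:,:)

!$acc data copyin(A,B) copyout(C)
!$acc host_data use_device(A,B,C)
  call cublasDgemm('N','N', size(A,1), size(A,1), size(A,1), 1.0d0, &
                   A, size(A,1), B, size(B,1), 0.0d0, C, size(C,1))
!$acc end host_data
!$acc end data

end subroutine test
```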

Now why it goes away in OpenACC when “host_data” is used, I’m not sure. My best guess is that engineering disabled the warning on the OpenACC code-gen path when it knows it’s passing a device address, but missed disabling it on the OpenMP path.

I can ask them to investigate if it’s an issue.

-Mat

Hi Mat,

thanks a lot for the quick answer; it clarifies the behavior a lot.

But regarding implicit copy messages, the OpenMP compilation also behaves differently from OpenACC in other cases. When compiling this piece of code:

!$omp target data map(to: A) map(alloc: B), map(tofrom: C)
!$omp target teams distribute parallel do
  do i = 1, size(A,1)
    B(i) = sqrt(A(i))
  end do
!$omp end target teams distribute parallel do

!$omp target teams distribute parallel do
  do i = 1, size(A,1)
    C(i) = C(i) + B(i)
  end do
!$omp end target teams distribute parallel do
!$omp end target data

the compiler also reports:

     13, !$omp target teams distribute parallel do
         13, Generating "nvkernel_test_mod_test__F1L13_2" GPU kernel
     13, Generating implicit map(tofrom:b(:),a(:))
     19, !$omp target teams distribute parallel do
         12, Generating map(to:a(:))
             Generating map(tofrom:c(:))
             Generating map(alloc:b(:))
         19, Generating "nvkernel_test_mod_test__F1L19_4" GPU kernel
     19, Generating implicit map(tofrom:c(:),b(:),a(:))

When compiling the corresponding OpenACC code with a data region, there are no longer any implicit copy messages.

Therefore, from a user’s point of view it would be a big improvement if the compiler also suppressed those messages. This would assure the user that no unnecessary copies are done.


Furthermore, I observed that when compiling this code:

!$omp target teams loop
  do i = 1, size(A,1)
    B(i) = B(i) + A(i)
  end do
!$omp end target teams loop

the compiler output is very detailed and informative:

     12, !$omp target teams loop
         12, Generating "nvkernel_test_mod_test__F1L12_2" GPU kernel
             Generating NVIDIA GPU code
           14, Loop parallelized across teams, threads(128) ! blockidx%x threadidx%x
         12, Generating Multicore code
           14, Loop parallelized across threads
     12, Generating implicit map(tofrom:b(:),a(:))
     14, Generated vector simd code for the loop
         Vectorized loop was interleaved

On the other hand, when parallelizing the same code with the !$omp target teams distribute parallel do directive, the output is very short and the information about parallelization is missing:

     13, !$omp target teams distribute parallel do
         13, Generating "nvkernel_test_mod_test__F1L13_2" GPU kernel
     13, Generating implicit map(tofrom:b(:),a(:))

But when compiling the code for the CPU with -mp instead of -mp=gpu, the compiler again gives more details about the parallelization:

     13, !$omp target teams distribute parallel do
         14, Loop parallelized across teams and threads, schedule(static)
     14, Generated vector simd code for the loop

It would be very nice if the detailed parallelization information could also be enabled for -mp=gpu in combination with the !$omp target teams distribute parallel do directive.

Thanks a lot!


For the extra implicit map messages, I agree that the expectation would be the same as in OpenACC, where the implicit map messages aren’t shown when the compute regions are inside a structured data region. There may be a reason why they occur, but I have added a report, TPR #38261, and asked engineering to investigate.

For the “loop” versus “distribute” messages, this is expected due to how the two constructs work.

“loop” is similar to OpenACC, where the compiler can create the schedule (i.e. the blocks and threads) at compile time.

However with “distribute”, the schedule is determined at runtime, so no information is presented at compile time. The runtime scheduling is needed in order to support features such as metadirectives, which are allowed with “distribute” but not with “loop”.

In general, we recommend using “loop” over “distribute”. It’s more restrictive on what can be used, but it’s rare for folks to use things like barriers or metadirectives in their offload kernels. Hence the added overhead of “distribute” may lead to lower performance.
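Applied to the first kernel from the earlier example, the recommended form would be (a sketch, with the surrounding data region omitted):

```fortran
! Same kernel as the "distribute parallel do" version, but using the
! "loop" construct so the compiler can build the schedule at compile
! time and report the parallelization details in its output.
!$omp target teams loop
  do i = 1, size(A,1)
    B(i) = sqrt(A(i))
  end do
!$omp end target teams loop
```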


Hi Mat,

thanks a lot for filing the report and for the additional explanations!

Regards,

René