"Host array used in CUF kernel"

Hello, I have code like this:
1 module XXX
2 use :: cusparse
3 use :: cublas
4 use :: cudafor
5 implicit none
6 type, extends(FX) :: FX
7 private
8 real(wp), device, allocatable :: F(:)
9 contains

10 end type FX
11 contains
12 subroutine FFX(this, A, B, L, n, nc, nb, …)
13 use :: cublas
14 use :: cudafor
15 class(FX), intent(inout) :: this
16 real(wp), device, intent(in) :: A(:), B(:), L(:)
17 integer, device :: n, nc, nb
18 real(wp), device :: q(3)
19 allocate( this%F(n) )
20 this%F = 0._wp
22 !$cuf kernel do <<< , >>>
23 do i = 1, nc
24 do ii = 1, nb
25 q = A((i-1)*nc + ii:(i-1)*nc + ii+3)
26 this%F((i-1)*nc + ii:(i-1)*nc + ii+3) = this%F((i-1)*nc + ii:(i-1)*nc + ii+3) + q
27 end do
28 end do
29 end subroutine …
30 end module …

Although F is declared as a device array, I get the errors below for lines 22 and 26.
I don’t understand why the kernel region is being ignored.

NVFORTRAN-W-0155-Data clause needed for exposed use of pointer this%F$p
NVFORTRAN-S-0155-Kernel region ignored; see -Minfo messages (22)
NVFORTRAN-S-0155-Host array used in CUF kernel - F$f(:) (26)
NVFORTRAN-S-0155-Host array used in CUF kernel - F$f202(:)
NVFORTRAN-S-0155-Host array used in CUF kernel - F$f203(:)
NVFORTRAN-F-0155-Compiler failed to translate accelerator region (see -Minfo messages): Unable to find associated device pointer
NVFORTRAN/x86-64 Linux 20.7-0: compilation aborted

I would appreciate any help.

Do I need to define a local array for each of the do loops and then copy the values into F?
F((i-1)*nc + 1:(i-1)*nc + nb+3) = Ftemp(1:nb+3)
But again, similar to the original case, I would still need access to F inside the kernel region.

Your main issue is that “this” is a host variable which contains a device array. So, when you are on the device (inside the CUF kernel), you cannot access F through “this”. There are a couple of workarounds: you can perhaps make “this” a managed variable, so it can be accessed on both host and device. Or, you can cast this%F to a bare pointer, either an F90 pointer or a Cray pointer, and access it that way in your loop.
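For instance, the F90 pointer variant might look roughly like this (a sketch only, not compiled against your code; it assumes `this` can be given the `target` attribute so a pointer may be associated with its `F` component, and that `wp` and the loop bound come from the enclosing scope):

```fortran
subroutine FFX(this, n, nc)
  class(FX), intent(inout), target :: this
  integer, intent(in) :: n, nc        ! plain host scalars
  real(wp), device, pointer :: F(:)   ! bare device pointer
  integer :: i

  allocate( this%F(n) )
  F => this%F          ! associate the pointer on the host, before the kernel

  !$cuf kernel do <<< *, * >>>
  do i = 1, nc
     F(i) = 0._wp      ! the kernel touches F directly, never the host-resident this
  end do
end subroutine FFX
```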

Thanks for your response. I tried another method: instead of accessing F through this%F, I defined F inside the subroutine, but the same error occurs, as below:

subroutine FFX(this, A, B, L, n, nc, nb, …)
use :: cublas
use :: cudafor
real(wp), device, allocatable :: F(:)
real(wp), device, intent(in) :: A(:), B(:), L(:)
integer, device :: n, nc, nb
real(wp), device :: q(3)
allocate( F(n) )
F = 0._wp
!$cuf kernel do <<< , >>>
do i = 1, nc
do ii = 1, nb
q = A((i-1)*nc + ii:(i-1)*nc + ii+3)
F((i-1)*nc + ii:(i-1)*nc + ii+3) = F((i-1)*nc + ii:(i-1)*nc + ii+3) + q
end do
end do
end subroutine …

NVFORTRAN-S-0155-Host array used in CUF kernel - F$f(:)
NVFORTRAN-S-0155-Host array used in CUF kernel - F$f202(:)

Can you send a complete program which demonstrates the problem that we can reproduce here?

Please find the code at: the issue is in subroutine “update_bendforce”

Best, MB

I am wondering if you found what I am missing here. I am compiling with “nvidia-hpc_sdk_cuda_10.1/20.7”.

Sorry, I am having trouble matching your problem description with the link to the code you sent. Also, the file has a bunch of module dependencies, so I guess I will have to download the entire app to build it. It will take me a bit of time to do.

Oh, sorry, my bad! I should have left some comments.
The parameter is defined at line 296: real(wp), managed, allocatable :: Fbnd_d(:)
and it is used at line 368 and in a couple of other places.
Previously I defined it at line 68 and accessed it as the pointer this%Fbnd_d. Both methods led to a similar error.

mpif90 -DUSE_GPU -Mpreprocess -mp -Mcuda=charstring -Mcudalib=cublas,cusolver,cusparse,cufft,curand -Minfo -Mbounds -Minfo=all -traceback -Mchkfpstk -Mchkstk -Mdalign -g -I/usr/include -I/include -I/~/mkl_pgi/include/intel64/lp64 -I …/common/inc -I ./inc -module ./inc -c cuda/sprforce_cumod.cuf -o cuda/sprforce_cumod.o
nvfortran-Warning-CUDA Fortran or OpenACC GPU targets disables -Mbounds
213, CUDA kernel generated
213, !$cuf kernel do <<< (*), (128) >>>
NVFORTRAN-S-0155-Host array used in CUF kernel - fbnd_d$f(:) (cuda/sprforce_cumod.cuf: 384)
NVFORTRAN-S-0155-Host array used in CUF kernel - fbnd_d$f202(:) (cuda/sprforce_cumod.cuf: 384)
NVFORTRAN-S-0155-Host array used in CUF kernel - fbnd_d$f203(:) (cuda/sprforce_cumod.cuf: 384)
NVFORTRAN-F-0155-Compiler failed to translate accelerator region (see -Minfo messages): Unable to find associated device pointer (cuda/sprforce_cumod.cuf: 384)
NVFORTRAN/x86-64 Linux 20.7-0: compilation aborted
make[1]: *** [cuda/sprforce_cumod.o] Error 2

So, we’ve had a lot of discussion about this here. I think we have 3 recommendations for problems we’ve seen in various versions of your code.

  1. Don’t put scalars on the device if you don’t need to, especially loop bounds like nc and nb above. Let the compiler pass them in from the host, and it will then also be able to make good decisions about the CUDA kernel launch schedule.
  2. CUF kernels do not support thread-private data. If you want small arrays that are private to each thread, you can either try OpenACC, which allows you to mark variables as thread-private and can operate on CUDA Fortran device data, or expand the small arrays into scalars.
  3. Array syntax in CUF kernels can cause the compiler to insert temp arrays, because the entire right-hand side must be evaluated before the left-hand side is updated. We don’t do a good job of creating temp arrays in CUF kernels, which is just as well: you don’t really want to call malloc to dynamically allocate a small scratch space from every thread in your CUDA grid, because that would kill performance. This temp-array creation is the cause of the “Host array used in CUF kernel” error.

Regarding point 3 above, when there is a shift in the slice from the right- to the left-hand side, the compiler will create a temp array even if there is no aliasing. So instead of:

array(1:3) = array(4:6) + …

use an explicit loop:

do slice = 1, 3
array(slice) = array(3+slice) + …
end do

Cases like:

array(1:3) = array(1:3) + …

are fine.
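Putting the three recommendations together, the original loop might be restructured along these lines (a sketch only, not compiled; it assumes `wp` comes from the enclosing module, keeps the original ii:ii+3 slice as a four-element run, and eliminates q entirely, since F(slice) = F(slice) + A(slice) involves no shift):

```fortran
subroutine FFX_sketch(F, A, nc, nb)
  real(wp), device, intent(inout) :: F(:)
  real(wp), device, intent(in)    :: A(:)
  integer, intent(in) :: nc, nb    ! ordinary host scalars (recommendation 1)
  integer :: i, ii, k

  !$cuf kernel do <<< *, * >>>
  do i = 1, nc
     do ii = 1, nb
        ! explicit element loop instead of slice syntax (recommendation 3);
        ! no thread-private q array is needed (recommendation 2)
        do k = 0, 3
           F((i-1)*nc + ii + k) = F((i-1)*nc + ii + k) + A((i-1)*nc + ii + k)
        end do
     end do
  end do
end subroutine FFX_sketch
```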

Thank you for your notes. First, I will change the loop-bound parameters to host or managed type. I had thought about defining thread-private data but avoided it.
I am still a bit confused about the array F (or Fbnd_d), which is defined as a device array. Sorry if my question is very basic, but isn’t it true that threads have access to device memory? In that case they should have access to my large F array.
How does “do concurrent” differ from the current method?
And finally, do you think it would help if I define a small device array inside the loop and copy its values into the original array?
!$cuf kernel do
do i = 1, nc

do j = 1, nb
end do

Fbnd_d((i-1)*nc + 1:(i-1)*nc + nb) = Ftmp(1:nb)
end do

I’ll try it. Thanks!

Threads do have access to device memory. If you have a CUF kernel containing:

F(expr1:expr2) = F(expr3:expr4) + ...

and F() here is a device array, the compiler recognizes that different slices of F() are involved on the right- and left-hand sides and will create a temp array to evaluate the right-hand side. The temp array it creates is a host array, hence the error.

The issue with accessing Fbnd_d on the device through this%Fbnd_d is that while the Fbnd_d component resides on the device, this resides on the host. So:

!$cuf kernel do <<<*,*>>>
do i = 1, n
  this%Fbnd_d(i) = 0.0
end do

needs to access the host-resident this to get to Fbnd_d, hence the error. I get around this type of thing by using an associate block:

associate(F => this%Fbnd_d)
  !$cuf kernel do <<<*,*>>>
  do i = 1, n
    F(i) = 0.0
  end do
end associate

Hope this helps.

The problem here is that Ftmp would need to be a large device array, since every thread would be accessing the same Ftmp array. As previously mentioned, CUF kernels are not set up for thread-private data.

The best way around this is to use an explicit do loop rather than slice notation.
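For the sketch above, that might look something like the following (not compiled; `some_term` is a hypothetical placeholder for whatever the inner j loop was computing into Ftmp(j)):

```fortran
!$cuf kernel do <<< *, * >>>
do i = 1, nc
   do j = 1, nb
      ! accumulate straight into the element this (i, j) pair owns,
      ! instead of staging results in a thread-private Ftmp(1:nb)
      Fbnd_d((i-1)*nc + j) = Fbnd_d((i-1)*nc + j) + some_term
   end do
end do
```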

Thank you for your detailed explanation. However, after your previous comment I am no longer using the this%Fbnd_d style; I defined Fbnd_d as a device array inside the subroutine. Inside the first loop, which I intend to run as a CUF kernel, I want to modify part of Fbnd_d so that the per-thread calculations are independent.
Since my intention is not to parallelize the second loop, I am hoping that each thread runs the second loop sequentially.

One more question, about variables inside the loops: if an integer variable is defined inside a loop that I intend to run in parallel, will it be overwritten by all the threads? Is my understanding correct?