"Host array used in CUF kernel"

Hello, I have code like this:
1 module XXX
2 use :: cusparse
3 use :: cublas
4 use :: cudafor
5 implicit none
6 type, extends(FX) :: FX
7 private
8 real(wp), device, allocatable :: F(:)
9 contains

10 end type FX
11 contains
12 subroutine FFX(this, A, B, L, n, nc, nb, …)
13 use :: cublas
14 use :: cudafor
15 class(FX), intent(inout) :: this
16 real(wp), device, intent(in) :: A(:), B(:), L(:)
17 integer, device :: n, nc, nb
18 real(wp), device :: q(3)
19 allocate( this%F(n) )
20 this%F = 0._wp
22 !$cuf kernel do <<< , >>>
23 do i = 1, nc
24 do ii = 1, nb
25 q = A((i-1)*nc + ii:(i-1)*nc + ii+3)
26 this%F((i-1)*nc + ii:(i-1)*nc + ii+3) = this%F((i-1)*nc + ii:(i-1)*nc + ii+3) + q
27 end do
28 end do
29 end subroutine …
30 end module …

Although F is declared as a device array, I get the errors below for lines 22 and 26.
I don’t understand why the kernel region is being ignored.

NVFORTRAN-W-0155-Data clause needed for exposed use of pointer this%F$p
NVFORTRAN-S-0155-Kernel region ignored; see -Minfo messages (22)
NVFORTRAN-S-0155-Host array used in CUF kernel - F$f(:) (26)
NVFORTRAN-S-0155-Host array used in CUF kernel - F$f202(:)
NVFORTRAN-S-0155-Host array used in CUF kernel - F$f203(:)
NVFORTRAN-F-0155-Compiler failed to translate accelerator region (see -Minfo messages): Unable to find associated device pointer
NVFORTRAN/x86-64 Linux 20.7-0: compilation aborted

I would appreciate any help.

Do I need to define a local array for each of the do loops and then copy the values into F?
F((i-1)*nc + 1:(i-1)*nc + nb+3) = Ftemp(1:nb+3)
But again, similar to the original case, I would still need access to F inside the kernel region.

Your main issue is that “this” is a host variable which contains a device array. So, when you are on the device (inside the CUF kernel), you cannot access F through “this”. There are a couple of workarounds: you can perhaps make “this” a managed variable, so it can be accessed on both host and device. Or, you can cast this%F to a bare pointer, either an F90 pointer or a Cray pointer, and access it that way in your loop.
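For instance, the F90 pointer variant might look roughly like this (a sketch only, not compiled against your code; it assumes `this` can be given the `target` attribute so a pointer may be associated with its `F` component, and that `wp` and the loop bound come from the enclosing scope):

```fortran
subroutine FFX(this, n, nc)
  class(FX), intent(inout), target :: this
  integer, intent(in) :: n, nc        ! plain host scalars
  real(wp), device, pointer :: F(:)   ! bare device pointer
  integer :: i

  allocate( this%F(n) )
  F => this%F          ! associate the pointer on the host, before the kernel

  !$cuf kernel do <<< *, * >>>
  do i = 1, nc
     F(i) = 0._wp      ! the kernel touches F directly, never the host-resident this
  end do
end subroutine FFX
```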

Thanks for your response. I tried another method: instead of accessing F through this%F, I defined F inside the subroutine, but the same error occurs, as below:

subroutine FFX(this, A, B, L, n, nc, nb, …)
use :: cublas
use :: cudafor
real(wp), device, allocatable :: F(:)
real(wp), device, intent(in) :: A(:), B(:), L(:)
integer, device :: n, nc, nb
real(wp), device :: q(3)
allocate( F(n) )
F = 0._wp
!$cuf kernel do <<< , >>>
do i = 1, nc
do ii = 1, nb
q = A((i-1)*nc + ii:(i-1)*nc + ii+3)
F((i-1)*nc + ii:(i-1)*nc + ii+3) = F((i-1)*nc + ii:(i-1)*nc + ii+3) + q
end do
end do
end subroutine …

NVFORTRAN-S-0155-Host array used in CUF kernel - F$f(:)
NVFORTRAN-S-0155-Host array used in CUF kernel - F$f202(:)

Can you send a complete program which demonstrates the problem that we can reproduce here?

Please find the code at: the issue is in subroutine “update_bendforce”

Best, MB

I am wondering if you found what I am missing here. I am compiling with “nvidia-hpc_sdk_cuda_10.1/20.7”.

Sorry, I am having trouble matching your problem description with the link to the code you sent. Also, the file has a bunch of module dependencies, so I guess I will have to download the entire app to build it. It will take me a bit of time to do.

Oh, sorry, my bad! I should have left some comments.
The parameter is defined at line 296: real(wp), managed, allocatable :: Fbnd_d(:)
and it is used at line 368 and in a couple of other places.
Previously I defined it at line 68 and accessed it as the pointer this%Fbnd_d. Both methods led to a similar error.

mpif90 -DUSE_GPU -Mpreprocess -mp -Mcuda=charstring -Mcudalib=cublas,cusolver,cusparse,cufft,curand -Minfo -Mbounds -Minfo=all -traceback -Mchkfpstk -Mchkstk -Mdalign -g -I/usr/include -I/include -I/~/mkl_pgi/include/intel64/lp64 -I …/common/inc -I ./inc -module ./inc -c cuda/sprforce_cumod.cuf -o cuda/sprforce_cumod.o
nvfortran-Warning-CUDA Fortran or OpenACC GPU targets disables -Mbounds
213, CUDA kernel generated
213, !$cuf kernel do <<< (*), (128) >>>
NVFORTRAN-S-0155-Host array used in CUF kernel - fbnd_d$f(:) (cuda/sprforce_cumod.cuf: 384)
NVFORTRAN-S-0155-Host array used in CUF kernel - fbnd_d$f202(:) (cuda/sprforce_cumod.cuf: 384)
NVFORTRAN-S-0155-Host array used in CUF kernel - fbnd_d$f203(:) (cuda/sprforce_cumod.cuf: 384)
NVFORTRAN-F-0155-Compiler failed to translate accelerator region (see -Minfo messages): Unable to find associated device pointer (cuda/sprforce_cumod.cuf: 384)
NVFORTRAN/x86-64 Linux 20.7-0: compilation aborted
make[1]: *** [cuda/sprforce_cumod.o] Error 2

So, we’ve had a lot of discussion about this here. I think we have 3 recommendations for problems we’ve seen in various versions of your code.

  1. Don’t put scalars on the device if you don’t need to, especially loop bounds like nc and nb above. Let the compiler pass them in from the host, and it will then also be able to make good decisions about the CUDA kernel launch schedule.
  2. CUF kernels do not support thread-private data. If you want small arrays that are private to each thread, you can either try OpenACC, which allows you to mark variables as thread-private and can operate on CUDA Fortran device data, or expand the small arrays into scalars.
  3. Array syntax in CUF kernels can cause the compiler to insert temp arrays, because the entire right-hand side must be evaluated before the left-hand side is updated. We don’t do a good job of creating temp arrays in CUF kernels, which is just as well: you don’t really want to call malloc to dynamically allocate a small scratch space from every thread in your CUDA grid, because that would kill performance. This temp-array creation is the cause of the “Host array used in CUF kernel” error.

Regarding point 3 above, when there is a shift in the slice from the right- to the left-hand side, the compiler will create a temp array even if there is no aliasing. So instead of:

array(1:3) = array(4:6) + …

use an explicit loop:

do slice = 1, 3
array(slice) = array(3+slice) + …
end do

Cases like:

array(1:3) = array(1:3) + …

are fine.
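Putting the three recommendations together, the original loop might be restructured along these lines (a sketch only, not compiled; it assumes `wp` comes from the enclosing module, keeps the original ii:ii+3 slice as a four-element run, and eliminates q entirely, since F(slice) = F(slice) + A(slice) involves no shift):

```fortran
subroutine FFX_sketch(F, A, nc, nb)
  real(wp), device, intent(inout) :: F(:)
  real(wp), device, intent(in)    :: A(:)
  integer, intent(in) :: nc, nb    ! ordinary host scalars (recommendation 1)
  integer :: i, ii, k

  !$cuf kernel do <<< *, * >>>
  do i = 1, nc
     do ii = 1, nb
        ! explicit element loop instead of slice syntax (recommendation 3);
        ! no thread-private q array is needed (recommendation 2)
        do k = 0, 3
           F((i-1)*nc + ii + k) = F((i-1)*nc + ii + k) + A((i-1)*nc + ii + k)
        end do
     end do
  end do
end subroutine FFX_sketch
```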

Thank you for your notes. First, I will change the loop-bound parameters to host or managed type. I had thought about defining thread-private data but avoided it.
I am still a bit confused about the array F (or Fbnd_d), which is defined as a device array. Sorry if my question is very basic, but isn’t it true that threads have access to device memory? In that case they should have access to my large F array.
How does “do concurrent” differ from the current method?
And finally, do you think it would help if I define a small device array inside the loop and copy its values into the original array?
!$cuf kernel do
do i = 1, nc

do j = 1, nb
end do

Fbnd_d((i-1)*nc + 1:(i-1)*nc + nb) = Ftmp(1:nb)
end do

I’ll try it. Thanks!

Threads do have access to device memory. If you have a CUF kernel containing:

F(expr1:expr2) = F(expr3:expr4) + ...

and F() here is a device array, the compiler recognizes that different slices of F() are involved on the right- and left-hand sides and will create a temp array to evaluate the right-hand side. The temp array it creates is a host array, hence the error.

The issue with accessing Fbnd_d on the device through this%Fbnd_d is that while the Fbnd_d component resides on the device, this resides on the host. So:

!$cuf kernel do <<<*,*>>>
do i = 1, n
  this%Fbnd_d(i) = 0.0
end do

needs to access the host-resident this to get to Fbnd_d, hence the error. I get around this type of thing by using an associate block:

associate(F => this%Fbnd_d)
  !$cuf kernel do <<<*,*>>>
  do i = 1, n
    F(i) = 0.0
  end do
end associate

Hope this helps.

The problem here is that Ftmp would need to be a large device array, since every thread would be accessing the same Ftmp array. As previously mentioned, CUF kernels are not set up for thread-private data.

The best way around this is to use an explicit do loop rather than slice notation.
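For the sketch above, that might look something like the following (not compiled; `some_term` is a hypothetical placeholder for whatever the inner j loop was computing into Ftmp(j)):

```fortran
!$cuf kernel do <<< *, * >>>
do i = 1, nc
   do j = 1, nb
      ! accumulate straight into the element this (i, j) pair owns,
      ! instead of staging results in a thread-private Ftmp(1:nb)
      Fbnd_d((i-1)*nc + j) = Fbnd_d((i-1)*nc + j) + some_term
   end do
end do
```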

Thank you for your detailed explanation. However, after your previous comment I am no longer using the this%Fbnd_d style; I defined Fbnd_d as a device array inside the subroutine. Inside the first loop, which I intend to run as a CUF kernel, I want to modify part of Fbnd_d so that the per-thread calculations are independent.
Since my intention is not to parallelize the second loop, I am hoping that each thread runs the second loop sequentially.

One more question, about variables inside the loops: if an integer variable is defined inside a loop that I intend to run in parallel, will it be overwritten by all the threads? Is my understanding correct?