Declaring local arrays in device code

Hi folks,

I’m trying to get around a problem in CUDA Fortran but I’m not having much luck, so I’m putting it to the masses. If anyone can suggest anything, it would be much appreciated.

Okay, here’s the problem…

I have a device array declared in MODULE scope as A(N,M).

The number of threads in my system is equal to N.

Within device code I want to call a device subroutine and pass only one row (a 1D slice) of that array for each thread, i.e.

call subroutine(A(thread,:))

but I get the error:

Array reshaping is not supported for device subprogram calls

So to get around this problem I tried to get each thread to copy its vector into a separate 1D array, i.e.

B(:)=A(thread,:)

but in declaring B(M) in the device subroutine I get another compiler error:

device arrays may not be automatic

The only solution I can see is to declare B with a fixed large size, but I don’t really want to do this as it may waste memory or restrict the system size.

Does anyone know a way around my conundrum?

Cheers,
Crip_crop

Crip_crop,

I think both issues are due to the same fact: the device (GPU) cannot allocate its own memory.

When you try to call a subroutine with an array slice, the compiler (most likely?) wants to make a temporary copy and pass a reference to that temp copy to the subroutine. Since the device can’t allocate that temporary: error.[1]

The second case hits the same issue: automatic arrays are allocated upon entry, and the GPU can’t do that. In my code I often do just what you are tempted to do, which is to give local, per-thread arrays a fixed maximum size at compile time, one I can know a priori (the number of levels in the system, say, which we know roughly); I set it with the preprocessor in my Makefile. I’m lucky that my “M” is fairly small (O(100)) so I can get away with this. If your “M” is big, it might not work. But if you can, and you aren’t using much shared memory, you can tell the compiler to prefer L1 cache (make it 48k), which will increase your chances of getting a hit on L1-cached local memory.
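A minimal sketch of that fixed-size approach, with invented names (MAXM is a hypothetical preprocessor symbol, e.g. -DMAXM=128 set in the Makefile, and it assumes A and its second extent m are visible from the module):

      attributes(device) subroutine per_thread_work(thread, m)
      implicit none
      integer :: thread, m, j
!     Fixed size known at compile time, so no device-side allocation;
!     B lives in (L1-cacheable) local memory. Assumes m <= MAXM.
      double precision :: B(MAXM)
      do j = 1, m
         B(j) = A(thread, j)
      end do
!     ... work on B(1:m) ...
      end subroutine per_thread_work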

But, if you don’t want to do that, or can’t due to the size of M, the only other thought I have is to pass a reference to all of A along with the thread number:

call subroutine(A,thread)

and then, inside that subroutine, do all your work on A(thread,:). It’s not ideal, but try it out; you might find there isn’t much of a performance hit at all.
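Roughly like this hedged sketch (the subroutine and argument names are made up; the whole device array and the thread index come in, and the work stays on one row):

      attributes(device) subroutine row_work(A, n, m, thread)
      implicit none
      integer :: n, m, thread, j
!     A is the full device array; only this thread's row is touched
      double precision :: A(n,m)
      do j = 1, m
         A(thread,j) = 2.0d0*A(thread,j)
      end do
      end subroutine row_work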

Matt

[1] Note: I think this is also why you can’t do math in subroutine calls:

call subroutine(2*A,...)

since the compiler would try to make a temp array B=2*A and use that.

Cheers Matt, that’s really useful.

How would I go about increasing the size of the L1 cache?

Crip_crop

You can do that on a per-device or per-function basis. You can use:

status = cudaDeviceSetCacheConfig(cacheconfig)

status = cudaFuncSetCacheConfig(func, cacheconfig)

where “cacheconfig” is cudaFuncCachePreferNone, cudaFuncCachePreferShared, or cudaFuncCachePreferL1 (pretty self-explanatory). For the FuncSet, func is usually decorated by the module name, so subroutine dothis in module mymodule would (I think) be mymodule_dothis.

Obviously, it’s probably not best to do the device-wide set if you use near 48k of shared memory in most of your code, but the function version might help.

But benchmark, as always: some codes might respond, some might not. Per the spec, this is only the “preferred” configuration that the user wants. I think CUDA can always choose its own configuration when it determines it is better. Or when it’s Tuesday. I’m not sure I’ve ever seen how it determines this (probably by looking at how many registers spilled into local memory, &c.).

Oh, and if PGI chimes in here saying differently, believe them more than me!

Matt

Hi Crip_crop,

Matt is correct about the problem being lack of dynamic allocation from device code. That may change with CUDA 5 and the Kepler K20 GPUs, but for now we’re stuck.

Though, another thing to try is using automatic arrays in shared memory. The third argument to the kernel chevron is the size in bytes to dynamically allocate in shared memory, and the compiler can then map this dynamic shared memory to device automatic arrays. The catch is that the automatic arrays are shared by all the threads in the block, and the amount of shared memory is relatively small.
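A rough sketch of this, with invented names and assuming the kernel sits in a module that uses cudafor: the automatic shared array tmp is sized by a kernel argument and backed by the bytes supplied as the third chevron argument (fine with one thread per block; with more, the threads would have to partition tmp among themselves):

      attributes(global) subroutine mykernel(A, n, m)
      implicit none
      integer, value :: n, m
      double precision :: A(n,m)
!     Automatic array mapped onto dynamic shared memory; shared by
!     every thread in the block
      double precision, shared :: tmp(m)
      integer :: i, j
      i = (blockidx%x-1)*blockdim%x + threadidx%x
      if (i <= n) then
         do j = 1, m
            tmp(j) = A(i,j)
         end do
      end if
      end subroutine mykernel

and on the host side, something like:

!     Third chevron argument = dynamic shared memory size in bytes
      call mykernel<<<dimGrid, dimBlock, 8*m>>>(A_d, n, m)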

  • Mat

Okay, I tried implementing the following:

      integer::cudaFuncCachePreferL1

!     Set device to prefer L1 cache to shared memory
      istat=cudaDeviceSetCacheConfig(cudaFuncCachePreferL1)

but I’m getting the error:

undefined reference to `cudadevicesetcacheconfig_’

Can you see anything wrong with how I’m calling the function?
N.B. The function is being called in host code.

Cheers,
Crip_crop

Do you have ‘use cudafor’ at the top of the module? And are you compiling -Mcuda?

ETA: Oh, wait, if you have ‘use cudafor’, you shouldn’t need the integer:: declaration.

Matt

Do you have ‘use cudafor’ at the top of the module? And are you compiling -Mcuda?

…yes and yes.

ETA: Oh, wait, if you have ‘use cudafor’, you shouldn’t need the integer:: declaration.

I get the same compiler error both with and without the integer declaration.

Hmm. Are you using an old version of PGI? Because I can confirm it works for me in the code I’m running with PGI 12.5:

    STATUS = cudaDeviceSetCacheConfig(cudaFuncCachePreferL1)
    if (STATUS /= 0) then 
       write (*,*) "cudaDeviceSetCacheConfig failed: ", cudaGetErrorString(STATUS)
    end if

When I see “undefined reference” I think you aren’t linking a library, or the library doesn’t contain the function.

Matt

I’m using 11.8… does that not support it?

cudaDeviceSetCacheConfig was added along with the other new CUDA 4.0 features in 11.7, though we did not switch over to using CUDA 4.0 by default until 12.0. Hence, if you are using 11.7 through 11.10, compile with -Mcuda=4.0.
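For example, something along these lines (the file name is just a placeholder and the driver is assumed to be pgfortran):

    pgfortran -Mcuda=4.0 -c myprog.f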

  • Mat

That’s it! I was racking my brain to figure out how I was running this with 11.8: it was because I’d moved to 4.0 by default (and hopefully, soon, to 4.1!).

I’d be interested to know if you see any big benchmark differences with the FuncCache usage.

That’s worked a treat, cheers guys.

But, as programming tends to go, one problem solved, another problem formed…

I really don’t have a clue what it doesn’t like about this code:

      Module Acceler_formd
      USE cudafor
      implicit none

      parameter(maxatm=DEFMAXATM,maxelmnt=DEFMAXELMNT)

!     GPU specific declarations
      integer,allocatable,device,dimension(:)::ian_d
      integer,allocatable,device,dimension(:)::natorb_d
      integer,allocatable,device,dimension(:)::lowlim_d
      double precision,allocatable,device,dimension(:)::globdens,ftot_d
      integer,allocatable,device,dimension(:,:)::totsubsys_d,
     &     coresubsys_d
      integer,allocatable,device,dimension(:)::subbasis_d
      double precision,allocatable,device,dimension(:,:)::B,
     &     subeval,subnelec
      double precision,allocatable,device,dimension(:,:,:)::coeff,
     &     subevec,subscr1,subdens
      
      integer::maxatm,maxelmnt

      CONTAINS

!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!

      subroutine formd_cuda(ftot_h,dtot,lowt,natoms,ian_h,natorb_h,
     &     lowlim_h,itr,nbasis,ifact_h,nelecs,maxbasfun,
     &     subsystems,subnelec_h,subbasis_h,totsubsys_h,coresubsys_h)
      
      implicit none

      integer,dimension(maxatm)::ian_h
      integer,dimension(maxelmnt)::natorb_h
      integer,dimension(natoms)::lowlim_h
      integer,dimension(nbasis)::ifact_h
      double precision, dimension(lowt)::ftot_h,dtot
      integer::subsystems,cresidues,bresidues,x,y,I,xend,
     &     xj,xk,maxbasfun
      double precision::temp
      integer,dimension(subsystems,2)::totsubsys_h,coresubsys_h
      integer,dimension(subsystems)::subbasis_h
      double precision,dimension(subsystems,maxbasfun)::B_h,subeval_h,
     &     subnelec_h
      integer::lowt,natoms,itr,counter,nbasis,nelecs,
     &     j,llk,ij,callno
      double precision::ef

!     GPU specific declarations
      integer:: nthreads,blocksize,threadblocks,istat,cuError
      type(dim3)::dimGrid,dimBlock
      character*120 errmsg

!     Set device to prefer L1 cache to shared memory
      istat=cudaDeviceSetCacheConfig(cudaFuncCachePreferL1)


      write(*,*)maxatm,maxelmnt,natoms,lowt,subsystems,maxbasfun

      write(*,*)"1"
      allocate(ian_d(maxatm))
      write(*,*)"2"
      allocate(natorb_d(maxelmnt))
      write(*,*)"3"
      allocate(lowlim_d(natoms))
       write(*,*)"4"
      allocate(globdens(lowt))
      write(*,*)"5"
      allocate(ftot_d(lowt))
      write(*,*)"6"
      allocate(totsubsys_d(subsystems,2))
      write(*,*)"7"
      allocate(coresubsys_d(subsystems,2))
      write(*,*)"8"
      allocate(subbasis_d(subsystems))
      write(*,*)"9"
      allocate(B(subsystems,maxbasfun))
      write(*,*)"10"
      allocate(subeval(subsystems,maxbasfun))
      write(*,*)"11"
      allocate(subnelec(subsystems,maxbasfun))
      write(*,*)"12"
      allocate(subevec(subsystems,maxbasfun,maxbasfun))
      write(*,*)"13"
      allocate(coeff(subsystems,maxbasfun,maxbasfun))
      write(*,*)"14"
      allocate(subscr1(subsystems,maxbasfun,maxbasfun))
      write(*,*)"15"
      allocate(subdens(subsystems,maxbasfun,maxbasfun))
      write(*,*)"16"
      ...

I’m getting the runtime error:

0: ALLOCATE: copyin Symbol Memcpy FAILED:11(invalid argument)

As far as I can see, I’ve declared the correct arrays as device and allocatable, and I’ve checked that all the extent scalars have values. When I run it with the “write” statements in, it doesn’t get further than “1”…

any ideas?

Cheers for your help,
Crip_crop

Does the code run without the call to cudaDeviceSetCacheConfig? You need a Fermi card to use this feature (compute capability 2.0), though it should just be a noop on other devices. Doubt it would cause this error, but maybe.

Other than that, I don’t see anything obvious. The error seems to suggest the problem occurs at a memcpy call, not an allocate, so it’s unclear what’s happening. Try running in emulation mode (-Mcuda=emu) and see if the error still occurs. If so, then you can run the code through the debugger to find the error. Otherwise, start commenting out code till you can narrow down the problem.
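For example, something like this (placeholder file name; -g is just there so the debugger has symbols):

    pgfortran -Mcuda=emu -g -c myprog.f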

  • Mat

Does the code run without the call to cudaDeviceSetCacheConfig? You need a Fermi card to use this feature (compute capability 2.0), though it should just be a noop on other devices. Doubt it would cause this error, but maybe.

I’m running it on a Fermi so this shouldn’t be an issue.

When I run it in emulation mode it gives me this error:

/home/mbdx6pn2/work/DivNCon/gpu_div/div/./acceler_formd.f:58: undefined reference to `pgf90_dev_mod_alloc03_i8'
/home/mbdx6pn2/work/DivNCon/gpu_div/div/./acceler_formd.f:58: undefined reference to `pgf90_dev_mod_alloc03_i8'
/home/mbdx6pn2/work/DivNCon/gpu_div/div/./acceler_formd.f:58: undefined reference to `pgf90_dev_mod_alloc03_i8'
/home/mbdx6pn2/work/DivNCon/gpu_div/div/./acceler_formd.f:58: undefined reference to `pgf90_dev_mod_alloc03_i8'
/home/mbdx6pn2/work/DivNCon/gpu_div/div/./acceler_formd.f:58: undefined reference to `pgf90_dev_mod_alloc03_i8'
acceler_formd.o:/home/mbdx6pn2/work/DivNCon/gpu_div/div/./acceler_formd.f:58: more undefined references to `pgf90_dev_mod_alloc03_i8' follow
/opt/pgi/linux86-64/11.8/libso/libcudaforemu.so: undefined reference to `cublasAlloc'
/opt/pgi/linux86-64/11.8/libso/libcudaforemu.so: undefined reference to `cublasFree'

which is referring to the first line in the following block of code (line 58):

      write(*,*)"2"
      write(*,*)"3"
       write(*,*)"4"
      allocate(globdens(lowt))
      write(*,*)"5"
      allocate(ftot_d(lowt))
      write(*,*)"6"
      allocate(totsubsys_d(subsystems,2))
      write(*,*)"7"
      allocate(coresubsys_d(subsystems,2))
      write(*,*)"8"
      allocate(subbasis_d(subsystems))
      write(*,*)"9"
      allocate(B(subsystems,maxbasfun))
      write(*,*)"10"
      allocate(subeval(subsystems,maxbasfun))
      write(*,*)"11"
      allocate(subnelec(subsystems,maxbasfun))
      write(*,*)"12"
      allocate(subevec(subsystems,maxbasfun,maxbasfun))
      write(*,*)"13"
      allocate(coeff(subsystems,maxbasfun,maxbasfun))
      write(*,*)"14"
      allocate(subscr1(subsystems,maxbasfun,maxbasfun))
      write(*,*)"15"
      allocate(subdens(subsystems,maxbasfun,maxbasfun))
      write(*,*)"16"
      allocate(ian_d(maxatm))
      allocate(natorb_d(maxelmnt))
      allocate(lowlim_d(natoms))

            
      istat=cudathreadsynchronize()
      cuError = cudaGetLastError()
      if (cuError .ne. 0) then
         errMsg = cudaGetErrorString(cuError)
         print *, "allocate"
         print *, trim(errMsg)
      end if

! Copy host arrays to device memory
      
      ftot_d=ftot_h
      globdens=dtot
      B=B_h
      subeval=subeval_h
      

      nthreads=subsystems
      blocksize=1

      if (mod(nthreads, blocksize)==0) then
         threadblocks=nthreads/blocksize
      else
         threadblocks=(nthreads/blocksize)+1
      end if
      
! Create the grid and block dimensions
      dimGrid= dim3(threadblocks, 1, 1)
      dimBlock= dim3 (blocksize, 1, 1)

… although it says all the errors are at line 58, which can’t be right as that is a simple write statement.

I’m really stuck here… I’m struggling to isolate the problem because I think the compiler may be merging all the allocates together.

When will these compiler error messages become more specific? I’ve been programming with CUDA Fortran now for over a year and this is always what slows me down.

Crip_crop

FYI: The problem was on the compiler side… I was compiling with the flag -mcmodel=medium, which isn’t supported in V11.8.

Correct. CUDA Fortran didn’t support the medium memory model until 11.10. Support came later because initially there were no cards with more than 2GB of memory; we added it once the 6GB cards came out.

  • Mat