Error running simple CUDA Fortran program

At present, I’m looking at converting a program that works with Accelerators to CUDA Fortran. I’m not sure it’ll be useful in the long run, but the experience will be well worth it. Unfortunately, I got caught with an ICE:

PGF90-F-0000-Internal compiler error. unexpected runtime function call

Okay, I’ll probably report that to Support if I can’t figure it out. But then I wondered about whether my PGI setup might be wonky. So I tried this simple program which has worked before:

module assign_mod
   use cudafor
contains
   attributes(global) subroutine assign_kernel(A, N)
      implicit none

      integer, value :: N
      integer, device, dimension(N) :: A
      integer, device :: idx 

      ! compute this thread's global (1-based) index
      idx = (blockidx%x - 1) * blockdim%x + threadidx%x

      if (idx <= N) A(idx) = blockidx%x * blockdim%x + threadidx%x
   end subroutine assign_kernel
end module assign_mod

program main
   use cudafor
   use assign_mod

   implicit none

   integer, parameter :: n = 32
   integer, allocatable, dimension(:) :: a_host, b_host
   integer, device, allocatable, dimension(:) :: a_device

   type(dim3) :: dimGrid, dimBlock

   integer, parameter :: blocksize = 4

   integer :: i

   dimBlock = dim3(blocksize,1,1)
   dimGrid = dim3(n/blocksize,1,1)

   allocate(a_host(n))
   allocate(b_host(n))

   allocate(a_device(n))

   forall (i=1:n)
      a_host(i) = 99
   end forall

   a_device = a_host

   call assign_kernel<<<dimGrid, dimBlock>>> (a_device, n)

   b_host = a_device

   write (*,"(I2,1X)",advance="no") b_host
   write (*,*)

end program main

But, when I try to compile and run it:

> pgfortran trial.cuf 
> ./a.out 
0: ALLOCATE: 128 bytes requested; status = 35

Can you help me figure out what I’ve done to wreck my CUDA Fortran setup? (Another example: running the cufinfo.cuf example from 10.2 does nothing on first run and then dumps core when you run it again.)

FYI, I’m running 10.2 and my environment looks like:

> env | grep -i pgi
MANPATH=/usr/share/man:/usr/local/share/man:/usr/X11R6/man:/opt/pgi/linux86-64/2010/man
LD_LIBRARY_PATH=/home/mathomp4/lib:/opt/pgi/linux86-64/2010/mpi/mpich/lib:/opt/pgi/linux86-64/2010/cuda/lib:/opt/pgi/linux86-64/2010/cuda/open64/lib:/opt/pgi/linux86-64/2010/lib:/opt/pgi/linux86-64/2010/libso:/opt/cuda/lib64::/home/mathomp4/GMAO-Baselibs-3_1_5/Linux/lib:/opt/pgi/linux86-64/2010/mpi/mpich/lib:/opt/cuda/lib64
PGI=/opt/pgi
PATH=.:/home/mathomp4/bin:/home/mathomp4/cvstools:/home/mathomp4/opengrads:/opt/pgi/linux86-64/2010/bin:/opt/pgi/linux86-64/2010/mpi/mpich/bin:/home/dkokron/play/pdt/pdt-3.15/x86_64/bin:/home/dkokron/play/tau/tau-2.19/x86_64/bin:/home/mathomp4/Fortuna/GEOSagcm/src/GMAO_Shared/GEOS_Util/post:/home/mathomp4/Fortuna/GEOSagcm/src/GMAO_Shared/GEOS_Util/plots:/opt/cuda/bin:/home/mathomp4/bin:/opt/pgi/linux86-64/2010/bin:/opt/pgi/linux86-64/2010/mpi/mpich/bin:/opt/pgi/linux86-64/2010/cuda/bin:/opt/cuda/bin:/usr/kerberos/bin:/usr/local/bin:/bin:/usr/bin:/home/mathomp4/bin
LM_LICENSE_FILE=/opt/pgi/license.dat
PGIABBR=/opt/pgi/linux86-64/2010

Thanks,
Matt

Hi Matt,

PGF90-F-0000-Internal compiler error. unexpected runtime function call

Most likely you’re using an unsupported device intrinsic like FRACTION or EXPONENT. Can you determine which intrinsic is causing the error? I can then push engineering to get this one bumped up in priority.
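One quick way to find it is to compile a one-line kernel per suspect and see which one trips the ICE. A minimal probe, sketched here with FRACTION as the suspect (the module and kernel names are hypothetical):

module probe_mod
   use cudafor
contains
   attributes(global) subroutine probe(a, n)
      implicit none
      integer, value :: n
      real, device, dimension(n) :: a
      integer :: i

      i = (blockidx%x - 1) * blockdim%x + threadidx%x
      ! swap in one suspect intrinsic at a time and recompile
      if (i <= n) a(i) = fraction(a(i))
   end subroutine probe
end module probe_mod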

0: ALLOCATE: 128 bytes requested; status = 35

This is a runtime error meaning that the allocate failed with status 35. I’m assuming this is coming from the device array’s allocate, in which case status 35 is coming from a call to cudaMalloc and means “cudaErrorInsufficientDriver: CUDA runtime is newer than driver”.
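If you want to confirm where the failure is coming from, you can capture the status yourself with stat= on the allocate and translate it. A minimal sketch, assuming the stat value is the CUDA error code (as it is when the allocate maps to cudaMalloc) and that your release’s cudafor module provides cudaGetErrorString:

program check_alloc
   use cudafor
   implicit none
   integer, device, allocatable, dimension(:) :: a_device
   integer :: istat

   ! capture the status instead of letting the runtime abort
   allocate(a_device(32), stat=istat)
   if (istat /= 0) then
      print *, 'device allocate failed, status = ', istat
      print *, trim(cudaGetErrorString(istat))
      stop
   end if
   deallocate(a_device)
end program check_alloc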

On occasion the driver will stop working correctly, so the first thing I’d do is reboot. If that doesn’t work, can you please post your NVIDIA driver version?

cat /proc/driver/nvidia/version
NVRM version: NVIDIA UNIX x86_64 Kernel Module  195.17  Mon Oct 26 06:19:11 PST 2009
GCC version:  gcc version 4.1.2 20080704 (Red Hat 4.1.2-44)

Thanks,
Mat

I’ll take a look. The code is the one you’ve seen from me before, but I’ve had to transform it from F77 into F90-esque code for my sanity, so it’s entirely possible there’s an intrinsic I’m missing. I’ve converted some FLOAT and DBLE calls to plain REAL in case those did it, but I’m still getting the error. All that’s left are tamer ones: MAX, MIN, LOG10, EXP, SQRT, etc. Could the fact that I’m still using old-style DATA statements to assign arrays (rather than RESHAPE) do it?

0: ALLOCATE: 128 bytes requested; status = 35

This is a runtime error meaning that the allocate failed with status 35. I’m assuming this is coming from the device array’s allocate, in which case status 35 is coming from a call to cudaMalloc and means “cudaErrorInsufficientDriver: CUDA runtime is newer than driver”.

On occasion the driver will stop working correctly, so the first thing I’d do is reboot. If that doesn’t work, can you please post your NVIDIA driver version?

cat /proc/driver/nvidia/version
NVRM version: NVIDIA UNIX x86_64 Kernel Module 195.17 Mon Oct 26 06:19:11 PST 2009
GCC version: gcc version 4.1.2 20080704 (Red Hat 4.1.2-44)


Ah ha. A reboot or more might be needed:

> cat /proc/driver/nvidia/version 
NVRM version: NVIDIA UNIX x86_64 Kernel Module  185.18.14  Wed May 27 01:23:47 PDT 2009
GCC version:  gcc version 4.1.2 20080704 (Red Hat 4.1.2-46)

Is that version of the driver too old for 10.2?

Matt,

All the mathematical intrinsics are supported, unless you’re using complex data types, which we’re still working on. Data statements are fine as well.

Is that version of the driver too old for 10.2?

The 185 driver supports cards with compute capability 1.3 (Tesla, GTX280, etc.) but is for CUDA 2.2. With the 10.2 CUDA Fortran, we use CUDA 2.3. I’ll need to ask my contacts at NVIDIA to see if this is indeed a conflict. Just in case, you can download the latest NVIDIA drivers at Official Advanced Driver Search | NVIDIA.
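You can also compare the two versions at run time. A sketch, assuming your release’s cudafor module exposes the CUDA 2.2+ runtime queries cudaDriverGetVersion and cudaRuntimeGetVersion (both return versions encoded as 1000*major + 10*minor, e.g. 2030 for CUDA 2.3):

program versions
   use cudafor
   implicit none
   integer :: istat, drv, rt

   istat = cudaDriverGetVersion(drv)    ! highest CUDA version the installed driver supports
   istat = cudaRuntimeGetVersion(rt)    ! CUDA version the program was built against
   print *, 'driver CUDA version: ', drv, '  runtime CUDA version: ', rt
   if (rt > drv) print *, 'driver is older than the runtime -- update the driver'
end program versions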

  • Mat

Welp, I’m sunk, then. I can’t seem to figure it out. Since it’s a big file, I’ll send something to Technical Support rather than copy-paste it here.

Is that version of the driver too old for 10.2?

The 185 driver supports cards with compute capability 1.3 (Tesla, GTX280, etc.) but is for CUDA 2.2. With the 10.2 CUDA Fortran, we use CUDA 2.3. I’ll need to ask my contacts at NVIDIA to see if this is indeed a conflict. Just in case, you can download the latest NVIDIA drivers at Official Advanced Driver Search | NVIDIA.

Yep, that did it. Needed CUDA 2.3 and the latest drivers. Thanks.

Hi Matt,

Customer support sent me your code, and it turns out that the “unexpected runtime function call” was a call to “pgf90_auto_alloc”. This routine handles the allocation of automatic arrays. So the compiler should be giving a semantic error, since automatics aren’t allowed in device routines: a thread can’t call malloc, which is what automatics require. I’ve sent a report to engineering (TPR#16653) to have them catch this semantic error. To fix it, you’ll need to use fixed-size local arrays.

Also, I was wrong about data statements. Engineering is working on allowing data statements for module variables, but they won’t be allowed for local device variables. The reason is that CUDA C can only initialize data that has file scope, not local scope, so there’s not yet a way to map local data statements to CUDA C.

Thanks,
Mat

Not sure I understand this one. Does this mean I need to hardwire some array sizes that currently aren’t? Do you have an example of what was wrong and what is correct?

Also, I was wrong about data statements. Engineering is working on allowing data statements for module variables, but they won’t be allowed for local device variables. The reason is that CUDA C can only initialize data that has file scope, not local scope, so there’s not yet a way to map local data statements to CUDA C.

So, in this case, should I instantiate within the program and figure out a massive RESHAPE? Or READ/DATA before the CUDA call and pass it into the program as an extra input (which, essentially, it is)?

Does this mean I need to hardwire some array sizes that currently aren’t?

You’ll need to use fixed sizes for your local arrays; the code currently passes the sizes in at run time.

Do you have an example of what was wrong and what is correct?

Here’s a module that uses an automatic array called “local_data”.

module test_cuda

  contains

  attributes(global) subroutine kernel_1(data,N,NP)

    use cudafor

    implicit none

    real, device, dimension(N) :: data
    integer, value :: N, NP

    integer :: i, j, idx, nthrd
    real :: local_data(NP)      ! automatic array: size unknown until run time

    idx = (blockidx%x-1)*blockdim%x + threadidx%x
    nthrd = blockDim%x * gridDim%x
    do i=idx,N,nthrd
         do j=1,NP 
              local_data(j) = data(i) * j
         end do
    end do

  end subroutine
end module
% pgf90 -c test.cuf 
PGF90-S-0155-device arrays may not be automatic - local_data (test.cuf)
  0 inform,   0 warnings,   1 severes, 0 fatal for kernel_1

The problem is that the size of local_data is not known until run time, so it would have to be allocated on entry to the subroutine. Unfortunately, threads can’t allocate memory, so the size of a local array needs to be known at compile time. To fix it, local_data’s size must be fixed.

module test_cuda

  integer :: maxNP
  parameter (maxNP = 10)

  contains

  attributes(global) subroutine kernel_1(data,N,NP)

    use cudafor

    implicit none

    real, device, dimension(N) :: data
    integer, value :: N, NP

    integer :: i, j, idx, nthrd
    real :: local_data(maxNP)   ! fixed size, known at compile time

    idx = (blockidx%x-1)*blockdim%x + threadidx%x
    nthrd = blockDim%x * gridDim%x
    do i=idx,N,nthrd
         do j=1,NP
              local_data(j) = data(i) * j
         end do
    end do

  end subroutine
end module

So, in this case, should I instantiate within the program and figure out a massive RESHAPE? Or READ/DATA before the CUDA call and pass it into the program as an extra input (which, essentially, it is)?

Spot checking, it appears that all the variables where you use a data statement are constant values. In this case, I would make them module variables and add the ‘constant’ attribute to have them placed in constant memory. While you’re limited in the amount of data that can be stored in constant memory (on my Tesla it’s 64K), constant memory is much faster. You would then set the values in your host code:

module test_cuda

  real, dimension(3), constant :: aig

  contains

  attributes(global) subroutine kernel_1(data,N)

    use cudafor

    implicit none

    real, device, dimension(N) :: data
    integer, value :: N

    integer :: i, j, idx, nthrd

    idx = (blockidx%x-1)*blockdim%x + threadidx%x
    nthrd = blockDim%x * gridDim%x
    do i=idx,N,nthrd
      data(i) = i * aig(1) * aig(2) * aig(3)
    end do

  end subroutine
end module

program test
  
   use cudafor
   use test_cuda

   real, device, dimension(:),allocatable :: dData
   real, dimension(:),allocatable :: data
   integer :: N
   N = 256

   ! update the device's constant memory
   aig(1) = 1.1
   aig(2) = 2.1
   aig(3) = 3.1

   allocate(data(N))
   allocate(dData(N))
   call kernel_1<<<128,2>>>(dData,N)
   data=dData
   print *, data(1), data(2)

   deallocate(data)
   deallocate(dData)
end program test
% pgf90 test.cuf -V10.2
% a.out
    7.161000        14.32200

Hope this helps,
Mat

Ahh, I get it now. It makes sense thanks to your example and further explanation. I was sort of asking the GPU to do some magical memory allocation.

Spot checking, it appears that all the variables where you use a data statement are constant values. In this case, I would make them module variables and add the ‘constant’ attribute to have them placed in constant memory. While you’re limited in the amount of data that can be stored in constant memory (on my Tesla it’s 64K), constant memory is much faster. You would then set the values in your host code:

I was indeed planning on using constant memory for this data, which is one reason I wanted to try a CUDA Fortran version, since I’m not sure the directive-based generator can use constant memory yet, can it? (At least, I’m pretty sure I can’t direct it to put data in constant memory.)

However, I seem to have found another ICE, but this one has an error message with it:

ptxas error   : Entry function 'soradcuf' uses too much local data (0x60e0 bytes, 0x4000 max)
PGF90-F-0000-Internal compiler error. pgnvd job exited with nonzero status code       0 (src/sorad.cudafor.constant.cuf: 3622)
PGF90/x86-64 Linux 10.2-0: compilation aborted

So it’s saying I’m trying to use 24800 bytes (0x60e0) of local data and I only have 16384 bytes (0x4000) max. I’m back at “I have no idea” when it comes to what this means.

FYI, I am using a Tesla S1070, so I have CC 1.3, if you’re trying to figure out which hardware resource I’m hitting. The file that throws this now has all my “constant” DATA arrays set up as in your example (36768 bytes’ worth), and my count of local data is 25848 bytes using maxnp=72 (which I guess I thought would go into global memory with no attribute).

Thanks for all the help as I struggle with this,
Matt

Hi Matt,

Well, this one was new to me as well. I knew that there were limits on constant and shared memory, but it appears there’s one on local memory as well. Granted, I haven’t tried porting over a 3500-line subroutine, so I wouldn’t have hit this limit.

One thing I see is that you have code which stages values from a global array into a per-thread local array (like “tai(k) = ta(idx,k)”). Can you get rid of the local arrays and index the global arrays directly? Most likely it will be slower, but the first step is getting it working.
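For illustration, here’s a sketch of that change; the names ta, tai, and idx echo your line, while the kernel shape, nk, and the sum are hypothetical stand-ins:

module no_stage_mod
   use cudafor
contains
   attributes(global) subroutine no_stage(ta, out, n, nk)
      implicit none
      integer, value :: n, nk
      real, device, dimension(n, nk) :: ta
      real, device, dimension(n) :: out
      integer :: idx, k
      real :: s

      idx = (blockidx%x - 1) * blockdim%x + threadidx%x
      if (idx <= n) then
         s = 0.0
         do k = 1, nk
            ! read ta(idx,k) directly instead of staging it into a
            ! per-thread copy tai(k); no local array, no local-memory cost
            s = s + ta(idx, k)
         end do
         out(idx) = s
      end if
   end subroutine no_stage
end module no_stage_mod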

Also, could you break the subroutine up into multiple kernels?
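A minimal sketch of what I mean (the module, kernel, and array names are hypothetical): device arrays persist between kernel launches, so two small kernels can hand results to each other through global memory, and each kernel then carries only its own, smaller, per-thread state.

module split_mod
   use cudafor
contains
   ! phase 1: produce intermediate results in global memory
   attributes(global) subroutine phase1(a, n)
      implicit none
      integer, value :: n
      real, device, dimension(n) :: a
      integer :: i
      i = (blockidx%x - 1) * blockdim%x + threadidx%x
      if (i <= n) a(i) = a(i) * 2.0
   end subroutine phase1

   ! phase 2: consume phase 1's results from global memory
   attributes(global) subroutine phase2(a, n)
      implicit none
      integer, value :: n
      real, device, dimension(n) :: a
      integer :: i
      i = (blockidx%x - 1) * blockdim%x + threadidx%x
      if (i <= n) a(i) = a(i) + 1.0
   end subroutine phase2
end module split_mod

The host then launches them back to back, e.g. call phase1<<<grid,block>>>(d_a, n) followed by call phase2<<<grid,block>>>(d_a, n).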

  • Mat