Nvfortran error

Hi Mat,
I encountered an nvfortran error:nvfortran-f-0155-compiler failed to translate accelerator region (see-minfo message):device compiler exited with error status code,can you help me?

Mat’s on vacation this week. Can you provide your source or more information?

Hello, I have solved this problem because I wrote a piece of code repeatedly, but now I have some problems. I wonder if you can provide some resources? Can nvfortran use the trim function? Does nvfortran support len_ The function of trim and len to calculate string length?

In host code or device code? Yes for host code. I’ll have to check for device code. In general, our device code support for Fortran character data is not complete.

My current code includes the parallel code on the CPU and GPU. In my code, the device code does not recognize trim. It is always recognized as trima, but my host code does not recognize trim. Normally, trim, len, and len_ Trim is an internal function of fortran, shouldn’t it support nvfortran?

My window runs like this:

This fragment of my code is like this:

Under normal circumstances, the result of running my window should be that there is no blank line between 3, 4 and 5. I don’t know what is going on. Can you help me?

I’d really need a test I can compile and run myself. I do not see how wep_set_fold is set. And, line 343 is overwriting what is set in line 342.

Thank you bleblack, I have solved it. When I use nvfortran to compile and run my program, in the kernel function configuration, when the grid value is set to 2, 4, 6, 8, and 12, the running time is almost the same. When the number of threads is changed to 1024, the running time does not change much. What is the matter? The code defined by my kernel function and the running results are as follows:

Hi 1799336883,

Without a reproducing example, we can’t give a definite answer. Though some possible reasons are:

  • The calls to fuzhi_GPU_day and fuzhi_CPU_day dominate the time so the time spent in Cycle_Runoff doesn’t matter
  • There not enough work in Cycle_Runoff so increasing the grid or block dimensions doesn’t matter.
  • There’s some issue with Cycle_Runoff

Have you profile your code with Nsight-Systems and Nsight-Compute to get a more accurate view of the performance?


Hello,Mat, I have solved this problem, and now I have a new problem,I want to ask you some questions. Can nvfortran compile openmp statements? I want to use cuda fortran and openmp to rewrite my fortran serial code under linux.

Can nvfortran compile openmp statements?

I want to use cuda fortran and openmp to rewrite my fortran serial code under linux.

That’s fine, though if you’re wanting to use OpenMP to support multi-GPU programming, I prefer using MPI instead. With OpenMP you end up having to do domain decomposition and it’s more difficult to manage the memory movement. With MPI, domain decomposition is inherent, allows the program to run across multiple nodes, and GPU Aware MPI allows for direct memory transfers between GPUs.


hi,Mat.Do you have a case about cuda fortran and openmp combined programming? I want to learn, but I can’t find any relevant cases.

Off-hand, no sorry. Again it’s rare to use OpenMP with CUDA Fortran given for multi-gpu programming, folks use MPI and for launching concurrent Kernels, you’d use CUDA streams.

What are you trying to achieve by using OpenMP here?

Are you wanting to include CUDA Fortran kernels within OpenMP parallel regions, or will the OpenMP regions be in other areas of the code?


hi,Mat.My code is written in fortran. Now I use openmp for parallel acceleration. I want to use cuda fortran for gpu acceleration on the basis of this openmp. I use openmp to start the main process and cuda as the slave process for calculation. I don’t know whether or how it can be implemented? Could you help me?

Without a clear understanding of what you’re trying to do, it’s difficult to write an example. Better for you to write a simple test showing what you’re attempting, and then I can help you work through issues.

However, I did go ahead an write a very simple example which may or may not represent what you’re looking for. I’m not using OpenMP for multi-gpu programming, more just to show you that a CUDA kernel can be called from within an OpenMP parallel region.

If the kernel is able to fully utilize the GPU, then the OpenMP threads will serialize waiting for their turn to use the device. If the kernel only uses a portion of the GPU, then you’d want to use CUDA streams to have the kernels execute concurrently on the device. There’s no advantage to using OpenMP threads to launch kernels, at least for this example, but does help when parallelizing the CPU code.


!     Copyright (c) 2017, NVIDIA CORPORATION.  All rights reserved.
! NVIDIA CORPORATION and its licensors retain all intellectual property
! and proprietary rights in and to this software, related documentation
! and any modifications thereto.  Any use, reproduction, disclosure or
! distribution of this software and related documentation without an express
! license agreement from NVIDIA CORPORATION is strictly prohibited.

! An example of single precision matrix multiply
! Build for running without optimizations:
!   nvfortran sgemm.cuf
! Build for running with optimizations:
!   nvfortran -O2 sgemm.cuf

#ifndef NSTREAMS
#define NSTREAMS 4

MODULE saxpy_sgemm

  attributes(device) subroutine saxpy16(a, b, c)
    real, device :: a
    real, dimension(16) :: b
    real, device, dimension(16) :: c
    c = c + a * b
  end subroutine

  attributes(global) subroutine sgemmNN_16x16(a, b, c, m, n, k, alpha, beta)
    real, device :: a(m,*), b(k,*), c(m,*)
    integer, value :: m, n, k
    real, value :: alpha, beta

    real, shared, dimension(17,16) :: bs
    real, device, dimension(16) :: cloc

    inx = threadidx%x
    iny = threadidx%y
    ibx = (blockidx%x-1) * 256
    iby = (blockidx%y-1) * 16

    ia = ibx + (iny-1)*16 + inx
    ib = inx
    ic = ia

    jb = iby + iny
    jc = iby + 1

    cloc = 0.0

    do ik = 1, k, 16
      bs(iny,inx) = b(ib,jb)
      call syncthreads()

      do j = 1, 16
        call saxpy16(a(ia,ik+j-1), bs(1,j), cloc)
      end do

      ib = ib + 16
      call syncthreads()
    end do

    do ii = 1, 16
      c(ic,jc+ii-1) = alpha*cloc(ii) + beta*c(ic,jc+ii-1)
    end do
  end subroutine


subroutine sgemm_cpu(a, b, c, m, n, k, alpha, beta)
  real, dimension(m,k) :: a
  real, dimension(k,n) :: b
  real, dimension(m,n) :: c
  real alpha, beta
  do im = 1, m
    do in = 1, n
      temp = 0.0
      do ik = 1, k
        temp = temp + a(im,ik) * b(ik,in)
      end do
      c(im,in) = alpha*temp + beta*c(im,in)
    end do
  end do
end subroutine

program main
  use cudafor
  use saxpy_sgemm
#ifdef _OPENMP
  use omp_lib
  implicit none
  integer, parameter :: N = 256
  integer, parameter :: NREPS = 1000
  integer :: i, j, k, ii, nargs, istat, ilen, nerrors, m
  integer(kind=cuda_stream_kind),dimension(:), allocatable :: istream
  integer :: sid, nstreams
  real :: time
  ! matrix data
  real, dimension(N,N,N) :: A, B, C, gold
  real, allocatable, device, dimension(:,:,:) :: dA, dB, dC
  real alpha, beta
  type(cudaDeviceProp) :: prop
  type(cudaEvent) :: start, stop
  type(dim3) :: blocks
  type(dim3) :: threads
  character*20 arg
  integer idevice

  nargs = command_argument_count()
  idevice = 0
  do i = 1, nargs
    call get_command_argument(i,arg)
    if ((arg(1:7) .eq. "-device") .and. (i.lt.nargs)) then
      call get_command_argument(i+1,arg)
      read(arg,'(i2)') idevice
    end if
  end do

  istat = cudaSetDevice(idevice)
  istat = cudaGetDeviceProperties(prop,idevice)
  ilen = verify(prop%name, ' ', .true.)
  write (*,900) prop%name(1:ilen), &
                real(prop%clockRate)/1000.0, &

  istat = cudaEventCreate(start)
  istat = cudaEventCreate(stop)

   nstreams = NSTREAMS
   print *, "Using ", nstreams, " CUDA Streams"
   nstreams = 1
   do i=1,nstreams
      istat = cudaStreamCreate(istream(i))

  call random_number(A)
  call random_number(B)


  dA = A
  dB = B
  dC = 0.0

  alpha = 1
  beta = 0
  m = N
  k = N
  blocks = dim3(N/256, N/16, 1)
  threads = dim3(16, 16, 1)

!$omp parallel do
  do ii=1,N
      call sgemm_cpu(A(:,:,ii), B(:,:,ii), gold(:,:,ii), m, N, k, alpha, beta)
  end do

 istat = cudaEventRecord(start, 0)

#if defined(USE_BOTH) || !defined(USE_STREAMS)
!$omp parallel do private(sid)
  do ii=1,N
    sid = mod(ii,NSTREAMS)+1
     call sgemmNN_16x16<<<blocks, threads,0,istream(sid)>>>(dA(:,:,ii), dB(:,:,ii), dC(:,:,ii), m, N, k, alpha, beta)
  istat = cudaEventRecord(stop, 0)
  istat = cudaDeviceSynchronize()
  istat = cudaEventElapsedTime(time, start, stop)
  print *, "TIME: ", time
  C = dC

  nerrors = 0
  do ii=1,N
  do j = 1, N
    do i = 1, N
      if (abs(gold(i,j,ii) - C(i,j,ii)) .gt. 1.0e-4) then
        nerrors = nerrors + 1
      end if
    end do
  end do
  end do

  if (nerrors .eq. 0) then
    print *," Test PASSED"
    print *, " Test FAILED"
    print *,nerrors," errors were encountered"

900 format('\nDevice:',a,', ',f6.1,' MHz clock, ',f6.1,' MB memory.\n')

end program

Run the code without OpenMP using a single CUDA Stream:

% nvfortran -O2 sgemm.CUF -Minfo=mp  ; a.out

Device:NVIDIA A100-SXM4-80GB, 1410.0 MHz clock, ****** MB memory.

 TIME:     26.82781

Run the code with 32 OpenMP threads using a single CUDA Stream:

% nvfortran -O2 sgemm.CUF -Minfo=mp -mp ; a.out
    166, !$omp parallel
    174, !$omp parallel

Device:NVIDIA A100-SXM4-80GB, 1410.0 MHz clock, ****** MB memory.

 TIME:     26.87830

Run the code using 16 CUDA Streams with OpenMP enabled for the CPU only loop:

% nvfortran -O2 sgemm.CUF -Minfo=mp -mp -DUSE_STREAMS -DNSTREAMS=16 ; a.out
    166, !$omp parallel

Device:NVIDIA A100-SXM4-80GB, 1410.0 MHz clock, ****** MB memory.

 Using            16  CUDA Streams
 TIME:     2.149920

Finally, run with 16 CUDA Streams and use OpenMP parallel do on the kernel launch loop.

% nvfortran -O2 sgemm.CUF -Minfo=mp -mp -DUSE_STREAMS -DNSTREAMS=16 -DUSE_BOTH ; a.out
    166, !$omp parallel
    174, !$omp parallel

Device:NVIDIA A100-SXM4-80GB, 1410.0 MHz clock, ****** MB memory.

 Using            16  CUDA Streams
 TIME:     65.98038

For this example, it’s best to use multiple CUDA Streams.

Thank you Mat,I’m learning now. If I have any questions, I’ll ask you again. Thank you for your help

Hi,Mat,I don’t understand why you need to use device when defining data types in the device of the module.
For example: attributes(device) subroutine saxpy16(a, b, c)
real, device :: a

Mat,Why are there negative cuda streams when I use them? What’s the problem?

You don’t. Variables defined in a device or global routine are implicitly have the device attribute. Explicitly adding “device” is optional, but I typically use it for readability.

Why are there negative cuda streams when I use them? What’s the problem?

A “cuda_stream_kind” is really a C pointer, i.e. a 64-bit unsigned integer. Fortran doesn’t have unsigned integers so if you print it, it’s likely to overflow a signed integer and print the value as negative.


I have a project, but I don’t know how to change this part of code? Can you help me? I want to change this to call cuda fortran in openmp. How can I change it?
!$omp parallel IF(conflux_node(L)>1)
!$omp do schedule(guided)private(ix)

        do N = 1,conflux_node(L)      
            ix = sub_conflux(L,N)
            CALL Conflux_Overland (ix,its,conflux_times)
            call Conflux_Gully    (ix,its,conflux_times)
            call Conflux_Gl_2_Rv  (ix,its,conflux_times,conflux_steps)
            CALL Conflux_River    (ix,its,conflux_times,conflux_steps) 

!$omp end do
!$omp end parallel
These subroutines of call are not related to each other.