Hi,
I’ve recently been trying to learn GPU programming in Fortran. I’m using the nvfortran compiler from the NVIDIA HPC SDK on the Jetson Nano 2GB Developer Kit. I would really appreciate some help with the errors I’ve been getting when trying to run the following code (compiled with the command nvfortran saxpy_gpu.f90):
module mathOps
contains
  ! SAXPY kernel: each thread updates one element of y
  attributes(global) subroutine saxpy(x, y, a)
    implicit none
    real :: x(:), y(:)
    real, value :: a
    integer :: i, n
    n = size(x)
    ! Global thread index (CUDA Fortran block/thread indices are 1-based)
    i = blockDim%x * (blockIdx%x - 1) + threadIdx%x
    if (i <= n) y(i) = y(i) + a*x(i)
  end subroutine saxpy
end module mathOps

program testSaxpy
  use mathOps
  use cudafor
  implicit none
  integer, parameter :: N = 40000
  real :: x(N), y(N), a
  real, device :: x_d(N), y_d(N)   ! device copies of x and y
  type(dim3) :: grid, tBlock

  tBlock = dim3(256,1,1)
  grid = dim3(ceiling(real(N)/tBlock%x),1,1)   ! 157 blocks of 256 threads

  x = 1.0; y = 2.0; a = 2.0
  x_d = x   ! host-to-device copies
  y_d = y
  call saxpy<<<grid, tBlock>>>(x_d, y_d, a)
  y = y_d   ! device-to-host copy
  write(*,*) 'Max error: ', maxval(abs(y-4.0))
end program testSaxpy
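For completeness: I haven’t passed any flags beyond the bare command above. I believe a GPU target can also be requested explicitly, along the lines of the command below, where the -Mcuda=cc53 sub-option (targeting the Nano’s Maxwell GPU) is my guess from the compiler documentation, not something I’ve confirmed that 20.7 accepts:

$ nvfortran -Mcuda=cc53 saxpy_gpu.f90 -o saxpy_gpu

Either way, on a working setup the program should print Max error: 0.000000.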
The program compiles without any complaints, but it produces the wrong result when run. Running it under cuda-memcheck gives the following output:
$ cuda-memcheck a.out
========= CUDA-MEMCHECK
========= Program hit cudaErrorNoKernelImageForDevice (error 209) due to "no kernel image is available for execution on the device" on CUDA API call to cudaLaunchKernel.
========= Saved host backtrace up to driver entry point at error
========= Host Frame:/usr/lib/aarch64-linux-gnu/tegra/libcuda.so.1 [0x2fdb04]
========= Host Frame:/usr/local/cuda/lib64/libcudart.so.10.2 (cudaLaunchKernel + 0x1ac) [0x4b6a4]
========= Host Frame:/opt/nvidia/hpc_sdk/Linux_aarch64/20.7/compilers/lib/libcudafor.so (__pgiLaunchKernel + 0x278) [0x89f80]
=========
Max error: 2.000000
========= ERROR SUMMARY: 1 error
The max error should be 0.0.
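The reported 2.0 is consistent with the kernel never executing: y keeps its initial value of 2.0, so abs(y - 4.0) is 2.0 everywhere. To confirm that the launch itself is what fails, I think a check along these lines could be added right after the kernel call (a sketch using the cudafor error-query routines, not something already in the code above):

  integer :: istat
  ! ... after the kernel launch:
  istat = cudaDeviceSynchronize()   ! wait for the kernel to finish
  istat = cudaGetLastError()        ! retrieve any launch error
  if (istat /= cudaSuccess) then
     write(*,*) 'Kernel launch failed: ', cudaGetErrorString(istat)
  end if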
I can compile and run the matrixMul and simplePrintf nvcc sample programs on the Jetson Nano without issue (after following advice here to sudo chmod a+rw /dev/nvhost-dbg-gpu). I suspect the problem is specific to the attributes(global) subroutine, since I can also compile and run the simple device query program described here.
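That device query program is roughly the following sketch (my reconstruction, not the exact listing from the link), and it runs fine:

program deviceQuery
  use cudafor
  implicit none
  type(cudaDeviceProp) :: prop
  integer :: nDevices, istat, i
  istat = cudaGetDeviceCount(nDevices)
  do i = 0, nDevices - 1
     istat = cudaGetDeviceProperties(prop, i)
     write(*,*) 'Device: ', trim(prop%name)
     write(*,*) 'Compute capability: ', prop%major, prop%minor
  end do
end program deviceQuery

Notably, it only calls the cudafor host API and never launches an attributes(global) kernel, which is why I suspect kernel code generation rather than the runtime setup.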
I’m running JetPack 4.6 with CUDA 10.2, and HPC SDK 20.7 is installed. nvfortran is run after loading the pre-made modulefile, with CUDA_HOME=/usr/local/cuda and with PATH and LD_LIBRARY_PATH also pointing into /usr/local/cuda.
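In other words, the environment is set up roughly like this (the modulefile name nvhpc/20.7 is my guess; the exports show what the modulefile is meant to provide):

$ module load nvhpc/20.7
$ export CUDA_HOME=/usr/local/cuda
$ export PATH=/usr/local/cuda/bin:$PATH
$ export LD_LIBRARY_PATH=/usr/local/cuda/lib64:$LD_LIBRARY_PATH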