Back to problem with CUDA Fortran and R

Hello! I’m using WSL2 and trying to use CUDA Fortran in conjunction with R. I have the multi-version CUDA install, with CUDA 11.7 and 11.0.
Here is the CUDA compilation:

nvfortran -c -fPIC cuda1a.cuf -o cuda1a.o -ta=tesla:nordc -V22.7 -Mcudalib=cublas -lcuda
nvfortran -shared -fPIC cuda1a.o -o cuda1a.so -ta=tesla:nordc -V22.7 -Mcudalib=cublas -lcuda

And here is the R output:

dyn.load("cuda1a.so")
Error in dyn.load("cuda1a.so") :
  unable to load shared object '/home/ehodgess/cuda1a.so':
  /opt/nvidia/hpc_sdk/Linux_x86_64/22.7/compilers/lib/libcudaforwrapblas117.so: undefined symbol: cublasSgemvBatched

I also tried:

nvfortran -c -fPIC cuda1a.cuf -o cuda1a.o -ta=tesla:nordc -Mcuda=cuda11.0 -Mcudalib=cublas -lcuda
nvfortran -shared -fPIC cuda1a.o -o cuda1a.so -ta=tesla:nordc -Mcuda=cuda11.0 -Mcudalib=cublas -lcuda

And the R output is:

> dyn.load("cuda1a.so")
> .Fortran("t2",as.integer(50),as.integer(1:50),as.single(0.0))
0: ALLOCATE: 200 bytes requested; status = 100(no CUDA-capable device is detected)

Finally, here is the CUDA Fortran subroutine:

module mytests
contains
  attributes (global) subroutine test1(a)
    integer, device :: a(*)
    !real, device :: a(*)
    i = threadIdx%x
    a(i) = i + 2
    !a(i) = a(i) + 2.0*i
    return
  end subroutine test1
end module mytests

subroutine t2(n,h,xt)
  !DEC$ ATTRIBUTES DLLEXPORT :: t2
  use cudafor
  use mytests
  integer, allocatable, device :: iarr(:)
  !real, allocatable, device :: iarr(:)
  integer n,h(n)
  !integer n
  !real :: h(n)
  real :: xt,x1,x2
  type(dim3) :: grid, tBlock
  istat = cudaSetDevice(0)
  allocate(iarr(n))
  !h = 0;
  iarr = h
  x1=0.0;x2=0.0

  tBlock = dim3(512,1,1)
  grid = dim3(ceiling(real(N)/tBlock%x),1,1)
  call cpu_time(x1)
  call test1<<<grid,tBlock>>> (iarr)

  h = iarr
  call cpu_time(x2)
  deallocate(iarr)

  xt = x2-x1
end subroutine t2

Any suggestions much appreciated.

Thanks,
Erin