Calling CUDA C from Fortran

Hello,

I am trying to write Fortran code that calls CUDA C code.

I found an old example on the internet. All the CUDA operations (allocation, copy from host to device, kernel launch, copy from device to host) are in one .cu file, and the Fortran code just calls a single wrapper function:

#include <cstdio>

__global__ void vect_add(double *a, int N)
{
   int idx = threadIdx.x;
   if (idx<N) a[idx] = a[idx] * a[idx];
}

// function called from main fortran program
extern "C" void kernel_wrapper_(double *a, int *Np)
{
   double  *a_d;  // declare GPU vector copies

   int blocks = 1;     // uses 1 block of
   int N = *Np;        // N threads on GPU

   // Allocate memory on GPU
   cudaMalloc( (void **)&a_d, sizeof(double) * N );

   // copy vectors from CPU to GPU
   cudaMemcpy( a_d, a, sizeof(double) * N, cudaMemcpyHostToDevice );

   // call function on GPU
   vect_add<<< blocks, N >>>( a_d, N);

   // copy vectors back from GPU to CPU
   cudaMemcpy( a, a_d, sizeof(double) * N, cudaMemcpyDeviceToHost );

   // free GPU memory
   cudaFree(a_d);
   printf("Test 00 \n");
   return;
}

The Fortran code is very simple:

PROGRAM test
IMPLICIT NONE

integer*4 :: i
integer*4, parameter :: N=8, tpb=8
real*8, Dimension(N) :: a

DO i=1,N
  a(i)=i*1.0
END DO

 print *, ' a = ', (a(i), i=1,N)

  CALL kernel_wrapper(a, N)

 print *, 'a**2=', (a(i), i=1,N)
END PROGRAM test

The .cu file is compiled with the -c option and the final link is done with gfortran. This code works without problems.
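
For reference, the build is just two steps, roughly like this (the file names here are placeholders and the CUDA library path depends on the installation):

$ nvcc -c vect_add.cu
$ gfortran main.f90 vect_add.o -o test -L/usr/local/cuda/lib64 -lcudart -lstdc++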

Then I tried to break the code up so that each operation has its own wrapper: one for allocation, one for the copy from host to device, one for the kernel launch, and finally one for the copy from device to host:

extern "C" void cuda_malloc_(double *a_d, int *Np)
{
   int N = *Np;        // N threads on GPU

   // Allocate memory on GPU
   cudaMalloc( (void **)&a_d, sizeof(double) * N );
   printf("Test 01 \n");
   return;
}

extern "C" void cuda_cpy_htod_(double *a, double *a_d, int *Np)
{
   int N = *Np;        // N threads on GPU
   cudaMemcpy( a_d, a, sizeof(double) * N, cudaMemcpyHostToDevice );
   printf("Test 02 \n");

   return;
}
extern "C" void cuda_cpy_dtoh_(double *a_d, double *a ,int *Np)
{
   int N = *Np;        // N threads on GPU
   cudaMemcpy( a, a_d, sizeof(double) * N, cudaMemcpyDeviceToHost );
   printf("Test 04 \n");

   return;
}



extern "C" void cuda_free_(double *a_d)
{
   cudaFree(a_d);
   printf("Test 05 \n");
   return;
}

__global__ void vect_square(double *a, int N)
{
   int idx = threadIdx.x+blockIdx.x*gridDim.x;
   printf("%lf n", a[idx] * a[idx]);
   if (idx<N) a[idx] = a[idx] * a[idx];
}

extern "C" void dev_square_(double *a_d, int *Np, int *tpb)
{
   int N = *Np;        // N threads on GPU
   int threads= *tpb;
   int blocks = (N+threads-1)/threads;
   // call function on GPU
   vect_square<<< blocks, threads>>>(a_d, N);
   printf("Test 04 \n");

   return;
}

In the Fortran code I do this:

PROGRAM test
use iso_c_binding
IMPLICIT NONE

integer*4 :: i
integer*4, parameter :: N=8, tpb=8
real*8, Dimension(N) :: a
type(c_ptr) :: a_d

DO i=1,N
  a(i)=i*1.0
END DO

 call cuda_malloc(a_d, N)

 call cuda_cpy_htod(a, a_d, N)

 call dev_square(a_d, N, tpb)
 DO i=1,N
   a(i)=0.0
 END DO

 call cuda_cpy_dtoh(a_d, a, N)

 call cuda_free(a_d)

 print *, 'a**2=', (a(i), i=1,N)
END PROGRAM test

In this case the wrappers appear to execute, but the CUDA part does not seem to work.

What am I doing wrong here?

That won’t work. The pointer a_d is passed by value: a copy of the numerical value of the pointer is made for use within the function, so modifications to that value made within the function body are not reflected in the calling environment. This kind of issue is covered in a variety of forum posts; here is one, here is another, and I’m sure you can find others.
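
A tiny stand-alone illustration of the difference (plain C, nothing CUDA-specific; the function names are made up for this example):

#include <stdio.h>
#include <stddef.h>

void set_local(double *p)        // p is a copy of the caller's pointer value
{
   p = (double *)0x1234;         // only the local copy changes
}

void set_through(double **pp)    // pp is the address of the caller's pointer
{
   *pp = (double *)0x1234;       // the caller's pointer itself changes
}

int main(void)
{
   double *a = NULL;
   set_local(a);                 // a is still NULL afterwards
   printf("a after set_local:   %p\n", (void *)a);
   set_through(&a);              // a now holds 0x1234
   printf("a after set_through: %p\n", (void *)a);
   return 0;
}

Your cuda_malloc_ has exactly the set_local shape: cudaMalloc writes the device address into the local copy of a_d, and the caller never sees it.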


Hello,

Thank you for your reply. The two links left me more confused. I spent several days before writing to the forum trying to learn more about this kind of issue, and I understand now that it is more complicated than I thought. My final goal is to be able to do something like these guys https://www.researchgate.net/publication/330940860_Three_Dimensional_Pseudo-Spectral_Compressible_Magnetohydrodynamic_GPU_Code_for_Astrophysical_Plasma_Simulation (similar to their figure 8), where they have wrappers and mix OpenACC with cuFFT: they call cuFFT with device pointers allocated by OpenACC.
My attempt here was something more basic, but so far all my attempts have failed.
I also think there may be a problem with the compiler: calling cudaMemGetInfo from Fortran just prints asterisks:

$ srun --account=project_2001659 --partition=gputest  --time=00:15:00 --nodes=1 --ntasks-per-node=1  --gres=gpu:v100:1 ./a.out 
srun: job 9515419 queued and waiting for resources
srun: job 9515419 has been allocated resources
  free: **********
 total: **********
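
For context, the kind of binding I mean is roughly this (a sketch with names invented for this post, not my exact code). cudaMemGetInfo takes size_t arguments, so the Fortran side would have to pass integer(c_size_t); also, the byte counts on a V100 are 11-digit numbers, which overflow a narrow edit descriptor such as I10 and then print as a row of asterisks.

// cuda_meminfo.cu -- hypothetical wrapper, names invented for this sketch
#include <cstdio>

extern "C" void cuda_mem_get_info_(size_t *free_b, size_t *total_b)
{
   cudaMemGetInfo(free_b, total_b);   // both arguments are size_t on the C side
}

and on the Fortran side:

! kinds must match size_t; list-directed output avoids field-width overflow
use iso_c_binding
integer(c_size_t) :: free_b, total_b
call cuda_mem_get_info(free_b, total_b)
print *, 'free : ', free_b
print *, 'total: ', total_b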

This seems to work for me:

$ cat cuda.cu
#include <cstdio>

extern "C" void cuda_malloc_(double **a_d, int *Np)
{
   int N = *Np;        // N threads on GPU

   // Allocate memory on GPU
   cudaMalloc( a_d, sizeof(double) * N );
   printf("Test 01 \n");
   return;
}

extern "C" void cuda_cpy_htod_(double *a, double **a_d, int *Np)
{
   int N = *Np;        // N threads on GPU
   cudaMemcpy( *a_d, a, sizeof(double) * N, cudaMemcpyHostToDevice );
   printf("Test 02 \n");

   return;
}
extern "C" void cuda_cpy_dtoh_(double **a_d, double *a ,int *Np)
{
   int N = *Np;        // N threads on GPU
   cudaMemcpy( a, *a_d, sizeof(double) * N, cudaMemcpyDeviceToHost );
   printf("Test 04 \n");

   return;
}



extern "C" void cuda_free_(double **a_d)
{
   cudaFree(*a_d);
   printf("Test 05 \n");
   return;
}

__global__ void vect_square(double *a, int N)
{
   int idx = threadIdx.x+blockIdx.x*gridDim.x;
   printf("%f n", a[idx] * a[idx]);
   if (idx<N) a[idx] = a[idx] * a[idx];
}

extern "C" void dev_square_(double **a_d, int *Np, int *tpb)
{
   int N = *Np;        // N threads on GPU
   int threads= *tpb;
   int blocks = (N+threads-1)/threads;
   // call function on GPU
   vect_square<<< blocks, threads>>>(*a_d, N);
   printf("Test 06 \n");

   return;
}
$ cat fort.f90
PROGRAM test
use iso_c_binding
IMPLICIT NONE

integer*4 :: i
integer*4, parameter :: N=8, tpb=8
real*8, Dimension(N) :: a
type(c_ptr) :: a_d

DO i=1,N
  a(i)=i*1.0
END DO
 call cuda_malloc(a_d, N)

 call cuda_cpy_htod(a, a_d, N)

 call dev_square(a_d, N, tpb)
 DO i=1,N
   a(i)=0.0
 END DO

 call cuda_cpy_dtoh(a_d, a, N)

 call cuda_free(a_d)

 print *, 'a**2=', (a(i), i=1,N)
END PROGRAM test
$ nvcc -c cuda.cu
$ gfortran fort.f90 cuda.o -o test -L/usr/local/cuda/lib64 -lcudart -lstdc++
$ cuda-memcheck ./test
========= CUDA-MEMCHECK
Test 01
Test 02
Test 06
1.000000 n4.000000 n9.000000 n16.000000 n25.000000 n36.000000 n49.000000 n64.000000 nTest 04
Test 05
 a**2=   1.0000000000000000        4.0000000000000000        9.0000000000000000        16.000000000000000        25.000000000000000        36.000000000000000        49.000000000000000        64.000000000000000
========= ERROR SUMMARY: 0 errors
$

I’m not really a Fortran expert, but in this “functions without an interface” kind of binding that you are using, Fortran seems to pass all arguments by their address (i.e. as pointers, which might also be called by reference). If we pass a c_ptr by its address, it seems logical to me that the effective C prototype for that is a pointer-to-pointer, which is, as it happens, exactly what we need to solve the issue I suggested in my previous post.
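
If you wanted to make that explicit rather than relying on the implicit external call, an interface block along these lines should describe the same calling convention (just a sketch, assuming the underscore-suffixed symbol names from the cuda.cu above; the call sites in the program stay the same):

interface
   subroutine cuda_malloc(a_d, n) bind(C, name="cuda_malloc_")
     import :: c_ptr, c_int
     type(c_ptr)    :: a_d   ! no VALUE attribute -> passed by reference, i.e. void** on the C side
     integer(c_int) :: n     ! likewise passed as int*
   end subroutine cuda_malloc

   subroutine dev_square(a_d, n, tpb) bind(C, name="dev_square_")
     import :: c_ptr, c_int
     type(c_ptr)    :: a_d
     integer(c_int) :: n, tpb
   end subroutine dev_square
end interface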


Thank you for trying it. It worked on my supercomputer with Slurm.

I had tried double **a_d, but I guess I failed at the other parts, the cudaMemcpy and cudaMalloc calls.

I spent so much time on this that I cannot express my feelings at this moment.

Thank you so much. I hope that in the end I can combine this with OpenMP using !$omp target data use_device_ptr(...).
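
If I understand the pattern correctly, it should look roughly like this: an untested sketch that assumes an OpenMP-offload-capable Fortran compiler (e.g. nvfortran) and a hypothetical wrapper dev_square_dev_ that is like dev_square_ but takes double* directly, i.e. launches the kernel on the device address it receives without dereferencing:

!$omp target data map(tofrom: a) use_device_ptr(a)
  ! inside this region, a refers to the device copy of the array,
  ! so its address can be handed straight to the CUDA wrapper
  call dev_square_dev(a, N, tpb)
!$omp end target data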

Cristian