Using Fortran 90 function in CUDA kernel (Visual Studio 2010)

Relevant Information:

  • I need to implement Fortran 90 functions on the GPU. I’ve set up a basic C runtime that defines /launches a very simple kernel and handles memory allocation / movement to and from the device.

  • I am using Visual Studio 2010 / Intel Fortran Compiler 2013 / CUDA Toolkit v7.5 all on Windows 7 64bit SP1

  • My VS solution is organized as follows: 1 solution with 2 projects. First project is an Intel Visual Fortran Dynamic-Link Library with the functions I wish to call in CUDA C. Second, an NVIDIA CUDA 7.5 Runtime project.

  • I have, according to various other resources, configured Visual Studio for use in making Fortran functions callable in a C runtime (i.e. I can use the Fortran functions I make in my C runtime just fine)

  • The GPU is an NVIDIA K600

Problem:
When I try to compile my C program I receive the following error

error MSB3721: The command ““C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v7.5\bin\nvcc.exe” -gencode=arch=compute_20,code="sm_20,compute_20" --use-local-env --cl-version 2010 -ccbin “c:\Program Files (x86)\Microsoft Visual Studio 10.0\VC\bin” -I"C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v7.5\include” -I"C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v7.5\include" -G --keep-dir Debug -maxrregcount=0 --machine 32 --compile -cudart static -g -DWIN32 -D_DEBUG -D_CONSOLE -D_MBCS -Xcompiler “/EHsc /W3 /nologo /Od /Zi /RTC1 /MDd " -o Debug\kernel.cu.obj “C:\LAGO\HF\CUDA\Alpha2\Runtime\kernel.cu”” exited with code 255.

Current Code:
This is the complete C code (kernel.cu) that produces the above error

Note: (This is simply the example ‘kernel.cu’ program given if a new CUDA 7.5 Runtime project is made… only modified to use a Fortran function to do the addition)

#include "cuda_runtime.h"
#include "device_launch_parameters.h"

#include <stdio.h>
#include <stdlib.h>
#include <math.h>

extern "C" __device__ void FOO ( int *c, int *a, int *b );

cudaError_t addWithCuda(int *c, int *a, int *b, unsigned int size);

__global__ void addKernel(int *c, int *a, int *b)
{
    int i = threadIdx.x;
    FOO( &c[i], &a[i], &b[i] );
}

int main()
{
    const int arraySize     = 5;
    int a[arraySize]        = { 1, 2, 3, 4, 5 };
    int b[arraySize]        = { 10, 20, 30, 40, 50 };
    int c[arraySize]        = { 0 };

    cudaError_t cudaStatus = addWithCuda( c, a, b, arraySize );
    if ( cudaStatus != cudaSuccess ) {
        fprintf( stderr, "addWithCuda failed!" );
        return 1;
    }

    // Reset Device
    cudaStatus = cudaDeviceReset();
    if (cudaStatus != cudaSuccess) {
        fprintf(stderr, "cudaDeviceReset failed!");
        return 1;
    }

    // Quick check
    printf( "%d + %d = %d\n", a[0], b[0], c[0] );
    system("pause");

    return 0;
}

cudaError_t addWithCuda(int *c, int *a, int *b, unsigned int size)
{
    int *dev_a = 0;
    int *dev_b = 0;
    int *dev_c = 0;
    cudaError_t cudaStatus;

    // Choose GPU
    cudaStatus = cudaSetDevice(0);
    if (cudaStatus != cudaSuccess) {
        fprintf(stderr, "cudaSetDevice failed!");
        goto Error;
    }

    // Allocate
    cudaStatus = cudaMalloc((void**)&dev_c, size * sizeof(int));
    if (cudaStatus != cudaSuccess) {
        fprintf(stderr, "cudaMalloc failed!");
        goto Error;
    }

    cudaStatus = cudaMalloc((void**)&dev_a, size * sizeof(int));
    if (cudaStatus != cudaSuccess) {
        fprintf(stderr, "cudaMalloc failed!");
        goto Error;
    }

    cudaStatus = cudaMalloc((void**)&dev_b, size * sizeof(int));
    if (cudaStatus != cudaSuccess) {
        fprintf(stderr, "cudaMalloc failed!");
        goto Error;
    }

    // Copy from host to GPU
    cudaStatus = cudaMemcpy(dev_a, a, size * sizeof(int), cudaMemcpyHostToDevice);
    if (cudaStatus != cudaSuccess) {
        fprintf(stderr, "cudaMemcpy failed!");
        goto Error;
    }

    cudaStatus = cudaMemcpy(dev_b, b, size * sizeof(int), cudaMemcpyHostToDevice);
    if (cudaStatus != cudaSuccess) {
        fprintf(stderr, "cudaMemcpy failed!");
        goto Error;
    }

    // Launch kernel with one thread for each element.
    addKernel<<<1, size>>>(dev_c, dev_a, dev_b);

    // Check errors
    cudaStatus = cudaGetLastError();
    if (cudaStatus != cudaSuccess) {
        fprintf(stderr, "addKernel launch failed: %s\n", 
                cudaGetErrorString(cudaStatus));
        goto Error;
    }
    
    // Wait for the kernel to finish
    cudaStatus = cudaDeviceSynchronize();
    if (cudaStatus != cudaSuccess) {
        fprintf(stderr, "cudaDeviceSynchronize returned error code %d\n", cudaStatus);
        goto Error;
    }

    // Copy from GPU to host
    cudaStatus = cudaMemcpy(c, dev_c, size * sizeof(int), cudaMemcpyDeviceToHost);
    if (cudaStatus != cudaSuccess) {
        fprintf(stderr, "cudaMemcpy failed!");
        goto Error;
    }

Error:
    cudaFree(dev_c);
    cudaFree(dev_a);
    cudaFree(dev_b);
    
    return cudaStatus;
}

And the Fortran 90 code for the function “FOO( c, a, b)” is:

SUBROUTINE FOO ( c, a, b ) BIND ( C )
    !DEC$ ATTRIBUTES DLLEXPORT :: FOO
    use, intrinsic :: iso_c_binding
    implicit none
    
    integer ( kind = C_INT ), intent ( IN ) :: a, b
    integer ( kind = C_INT ) :: c
    
    c = a  + b
    
    return
END SUBROUTINE ! FOO

Notes:
This code works if I call the function FOO from my CUDA runtime program main, or if I pass the kernel any other C function. The issue is as soon as I attempt to define a kernel using it the program fails to compile.

Request:
I would like to know if I have any glaring issues with this program (in either the code itself or in my description of the CUDA/Fortran/VS configuration). Any suggestions as to how to get this code to compile and run?

Note: I cannot use PGI CUDA Fortran, it is not relevant here

Without using PGI CUDA Fortran, what you’re suggesting is not possible.

Code that runs on the GPU must, among other things, be compiled by a compiler that knows how to generate GPU code. Intel Visual Fortran does not know how to do this, and any machine code, libraries, or other objects created by IVF are not compatible with execution on a CUDA GPU.

You should be able to cause CUDA C/C++ (or PGI CUDA FORTRAN) routines to be usable (called) from IVF Fortran, with appropriate project setup and linking.

Well, that explains a lot.

I was told that Fortran functions called in C generate “C equivalent” code at compile time so as long as the Fortran code is using appropriate C type variables (as with iso_c_bindings).

Was this information wrong?

Generally it is possible for C code for the host (CPU) to call Fortran code for the host (CPU), and vice versa (there may be issues with name decoration, calling conventions, etc but these are all solvable). The important point is this is object code for the host calling other object code for the host, just produced by different compielrs from different HLLs.

As txbob explained, in order to run Fortran code on the device (GPU) you need a Fortran toolchain that produces object code for the GPU. The PGI Fortran compiler can do this, Intel Fortran cannot. In addition, device code, i.e. a kernel running on the GPU, cannot call host code.

What you can do is this: Implement your kernel in CUDA, put a host-side C wrapper around it, and call the wrapper from Fortran using the ISO C bindings. In that scenario you have Fortran host code calling C host code which in turn invokes a GPU kernel.