host, device function: Fortran counterpart of __CUDA_ARCH__

Hi all,

I’m studying CUDA Fortran for my thesis, and I have a question:

As in C, in Fortran we can define functions (or subroutines) that are both host and device, i.e. functions that are callable from either host or device code.

In C, if such a function has to determine who the caller is, I can use the __CUDA_ARCH__ macro: it is defined when the code is compiled for the GPU, and undefined when it is compiled for the CPU. Here is an example of the use of __CUDA_ARCH__:

__host__ __device__ void function() {
#ifdef __CUDA_ARCH__
    // __CUDA_ARCH__ defined, the GPU is the caller
#if __CUDA_ARCH__ >= 200
    // Compute capability >= 2.x
#else
    // Compute capability < 2.x
#endif
#else
    // __CUDA_ARCH__ not defined, the host is calling the function
#endif
}

So, my question is: is there something like this in Fortran?
Thanks to all!

Hi canemacchina,

is there something like this in Fortran?

While “__CUDA_ARCH__” is not a predefined macro variable, CUDA Fortran supports preprocessing, so you can use the macro yourself and set its value on the command line (e.g. -D__CUDA_ARCH__=200). Note that “_CUDA” is predefined when “-Mcuda” is used or the file extension is “.cuf”.
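A minimal sketch of how that might look in a CUDA Fortran source file. The file name, macro value, and flags are illustrative; preprocessing is typically enabled by an uppercase “.CUF” extension or the -Mpreprocess flag:

```fortran
! sketch.CUF -- compiled e.g. with: pgfortran -Mcuda -D__CUDA_ARCH__=200 sketch.CUF
subroutine example()
#ifdef _CUDA
  ! _CUDA is predefined under -Mcuda / .cuf builds: CUDA Fortran path
#if defined(__CUDA_ARCH__) && __CUDA_ARCH__ >= 200
  ! code assuming compute capability 2.x or higher
#else
  ! code for older devices
#endif
#else
  ! plain (non-CUDA) Fortran build
#endif
end subroutine example
```

Unlike in CUDA C, the macro’s value here comes from whoever builds the file, not from the compiler, so it selects a compile-time configuration rather than detecting the actual caller.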

Though, in the case of CUDA Fortran, you may not want to use this method to select the compute capability. The PGI compiler will automatically create multiple versions of your code targeting the various compute capabilities; when the binary is run, the appropriate CC version is used.

The one caveat is that all CC versions will use the same user code, the only difference being the compiler optimisation applied. If your device kernels are significantly different, then you may want to go the route of selecting the kernel based on the CC version. However, why do it at compile time? It seems to me that you’d want to wait until runtime, call the device properties routine, and then select the appropriate kernel to launch.
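A runtime check along those lines might be sketched like this, using cudaGetDeviceProperties from the cudafor module (the kernel names are made up for illustration):

```fortran
program pick_kernel
  use cudafor
  implicit none
  type(cudaDeviceProp) :: prop
  integer :: istat

  ! query the properties of device 0 at runtime
  istat = cudaGetDeviceProperties(prop, 0)

  if (prop%major >= 2) then
     ! launch the kernel tuned for compute capability 2.x and above, e.g.
     ! call kernel_cc2x<<<grid, block>>>(args)
  else
     ! launch the fallback kernel for older devices, e.g.
     ! call kernel_cc1x<<<grid, block>>>(args)
  end if
end program pick_kernel
```

This moves the decision from compile time to run time, so one binary can pick the right kernel on whatever device it finds.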

  • Mat

Ok, thanks.

Maybe I haven’t explained myself well, but my question was:
suppose you have to write some utility functions for your application, for example one that sorts a given array. Suppose you want to use this function on both device and host arrays. You could do this by implementing two different functions, for example dev_arraySort and host_arraySort, which sort an array in device memory or in host memory respectively. In that case, if a kernel wants to sort an array it has to call dev_arraySort, and if it is the host that wants to sort an array, it has to call host_arraySort.
My goal instead is to write a single function called arraySort, callable from both host and device, able to sort an array in device memory as well as in host memory. As I wrote before, in C I can write:

__host__ __device__ void function() {
#ifdef __CUDA_ARCH__
    // __CUDA_ARCH__ defined, the GPU is the caller
    // here the code to sort an array in device memory
#else
    // __CUDA_ARCH__ not defined, the host is calling the function
    // here the code to sort an array in host memory
#endif
}

Maybe it’s a strange example or a strange way to solve my problem, but I need this information to write about this in my thesis.

Is it possible in Fortran or not?

Thanks again.

Hi canemacchina,

Defined flags such as “__CUDA_ARCH__” are used by the preprocessor to control conditional compilation. So this code can be compiled with either the call to the device or the call to the host, but not both.

What you need to do is add a runtime call to the device properties to determine what type of device you are using, and then use an IF statement to call the appropriate routine. You could also write a generic interface that selects the appropriate version of the routine based on the argument types, but ultimately there are still two different versions of the routine.
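A generic-interface version of the earlier arraySort idea might be sketched like this (the module and routine names are illustrative; the assumption here is that the “device” attribute on the dummy argument is enough to disambiguate the two specifics):

```fortran
module sort_utils
  use cudafor
  implicit none

  ! one generic name, two specific routines
  interface arraySort
     module procedure host_arraySort, dev_arraySort
  end interface

contains

  subroutine host_arraySort(a)
    real, intent(inout) :: a(:)
    ! sort the array in host memory here
  end subroutine host_arraySort

  subroutine dev_arraySort(a)
    real, device, intent(inout) :: a(:)
    ! launch a sorting kernel on the device array here
  end subroutine dev_arraySort

end module sort_utils
```

The caller just writes “call arraySort(x)”, and the compiler resolves it to the host or device version depending on whether x has the device attribute — but, as Mat says, there are still two routine bodies underneath.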

In the first draft of CUDA Fortran we did have the concept of a “Unified Binary” where two versions of the kernel would be written, one for the device and one for the host, but the implementation proved too difficult. Hopefully we will be able to add it back at some point.

Note that the PGI Accelerator Model and the PGI implementation of the OpenACC API do support unified binary. If your project absolutely requires a single routine that supports both the device and the host, then you may wish to consider using directives instead.

  • Mat

Sorry, I haven’t understood yet.

Defined flags such as “__CUDA_ARCH__” are used by the preprocessor to control conditional compilation. So this code can be compiled with either the call to the device or the call to the host, but not both.

This refers to Fortran, right?

What you need to do is add a runtime call to the device properties to determine what type of device you are using, and then use an IF statement to call the appropriate routine.

Ok, but with this solution I can determine the compute capability of the device, while what I need is to determine whether the caller is the host or the GPU…

As far as this goes, I am not sure you can have an “attributes(host,device)” anymore. Back in the old days I think it was allowed, but at the moment I think you need two separate routines: one for the host, one for the device.

Matt

This refers to Fortran, right?

Yes, PGI, like most Fortran compilers, supports C-style preprocessing.

Ok, but with this solution I can determine the compute capability of the device, while what I need is to determine whether the caller is the host or the GPU…

At least until CUDA 5 comes out, there isn’t support for true device-side calling, so the routine will always be called from the host. You need to add the logic as to which path to take, either via an IF-THEN-ELSE control structure or via generic interfaces.

I am not sure you can have an “attributes(host,device)” anymore.

That’s what I was mentioning before with the “Unified Binary”. This would be the ideal solution for canemacchina, but unfortunately it proved too difficult to implement. Hopefully we can add it back in a future release.

  • Mat