How to create dummy context in CUDA Fortran?

sWienke · April 20, 2011, 9:21am

Hi,

how can I exclude the device initialization time in my CUDA Fortran program? In CUDA C, I can create a dummy context (e.g. allocate one integer on the device) before the time measurement starts.

The problem is my program structure: I have a C program P which calls a certain C function X, and P also does the time measurement around this function X. X is called several times and all runtimes are aggregated. I cannot modify this part of the program.
The C-function X calls (more or less) a Fortran function Y. The Fortran function Y uses CUDA Fortran (one CUF-file comprising function Y and the kernel). It does the memory allocation and the kernel execution.
In C I could implement a global variable (in the cu-file) of an object that just allocates some memory in its constructor and is thus called only once, before the time measurement. But how do I do this in Fortran? I tried this in my CUF-file:

module my_globals
integer, device:: cudaHolder = 1
end module

But it didn’t work. Any other suggestions? I am not a native Fortran programmer, so please forgive me if there is a really simple Fortran solution.
Sandra

MatColgrove · April 25, 2011, 6:13pm

Hi Sandra,

The simplest thing to do is run the PGI utility pgcudainit in the background. This utility holds your devices open thus removing the initialization time on Linux altogether.

If you can’t use pgcudainit, try initializing a device variable before the timing loop (maybe in main?). The device(s) are initialized upon first use, so any operation on a device variable (such as initializing it to zero) will cause the device initialization to occur. No need to launch a kernel.
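
Since the main file cannot be changed in the setup described above, one possible way to make that "first use" happen before the timing loop is a load-time hook in the C file that contains X. The following is only a minimal sketch under several assumptions: it relies on the GCC/Clang extension `__attribute__((constructor))`, the names `warm_up_device` and `device_warmed_up` are hypothetical, and it assumes the CUDA Fortran code in Y ends up reusing the context created through the CUDA runtime API, which nothing in this thread confirms. The `__has_include` guard only lets the file build without the CUDA toolkit, in which case the hook does nothing useful.

```c
#include <stddef.h>

/* Use the CUDA runtime only if its header is available at build time. */
#if defined(__has_include)
#  if __has_include(<cuda_runtime.h>)
#    include <cuda_runtime.h>
#    define HAVE_CUDA_RUNTIME 1
#  endif
#endif

/* Hypothetical flag: set to 1 once the hook has run (success or not). */
int device_warmed_up = 0;

/* Runs when the program is loaded, before main() and therefore before the
 * timing loop. Allocating and zeroing one integer is the "first use" that
 * triggers device initialization; no kernel launch is needed. */
__attribute__((constructor))
static void warm_up_device(void)
{
#ifdef HAVE_CUDA_RUNTIME
    int *dummy = NULL;
    if (cudaMalloc((void **)&dummy, sizeof *dummy) == cudaSuccess) {
        cudaMemset(dummy, 0, sizeof *dummy);
        cudaFree(dummy);
    }
#endif
    device_warmed_up = 1;
}
```

When built with the toolkit, the file has to be linked against the CUDA runtime library. If the first timed call to X is still slow with this hook in place, the assumption about the shared context probably does not hold for the compiler version in use.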

Mat

sWienke · April 26, 2011, 7:58am

Hi,
Actually I don’t want to use pgcudainit… and as I stated, I cannot modify the main-file (where the timing loop is located) :-(

But as there seem to be no alternatives, I tried pgcudainit. Running it in the background gave me the problem shown below, and my initialization time did not change (-help is not working and there is no man page):

$ pgcudainit &
[2] 17064
$ pgcudainit called cuInit, now waiting for input
[2]  + suspended (tty input)  pgcudainit
$

What is wrong? Which kind of input does pgcudainit need?

If I run it in the foreground and press Return, I get the message “pgcudainit completeing”.

MatColgrove · April 26, 2011, 5:31pm

Hi Sandra,

“What is wrong? Which kind of input does pgcudainit need?”

pgcudainit will stop when any input is given. Hence, if you put it in the background it will continue to hold open the devices. Hitting return will stop it.

Mat

sWienke · April 27, 2011, 6:29am

Okay. But then the impact of pgcudainit, i.e. holding the devices open, is not really big. My first short kernel run decreases only from 3.5 seconds to 2.9 seconds. In contrast, if I create a dummy context with CUDA C (as described above), the same kernel takes only 0.03 seconds.
Any ideas?

MatColgrove · April 27, 2011, 6:32pm

“Any ideas?”

No, sorry. I guess I’d need to see an example. Can you send something to PGI Customer Service (trs@pgroup.com)?

Thanks,
Mat

TheMatt · April 27, 2011, 7:05pm

This might not be useful, but does cudaSetDevice create a context? It might because I know in my code I have to issue a cudaThreadExit call before I can issue cudaSetDevice because (in the past at least), I had to clear all contexts before I initialized my devices.

If so, maybe issuing a cudaSetDevice call might create the context early on. It might be functionally superfluous because you only have one device, but that superfluous call might allow you to create the context and then not have it impact a timer later on. (I use it because my boxes have 2 or 4 GPUs and I’m running with MPI. The 2-3 seconds of time this takes is shown in the timers for this routine, and thus in the overall time of the code, but not in the subsequent CUDA Fortran kernel calls.)
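
The idea of an early cudaSetDevice call could look roughly like the sketch below, written in C against the CUDA runtime API. It is not taken from anyone's code in this thread: `select_and_init_device` is a hypothetical helper name, and because whether cudaSetDevice alone creates the context has depended on the CUDA version, the sketch follows it with `cudaFree(0)`, a common idiom for forcing the runtime to initialize. The `__has_include` guard only lets the file build without the CUDA toolkit.

```c
/* Use the CUDA runtime only if its header is available at build time. */
#if defined(__has_include)
#  if __has_include(<cuda_runtime.h>)
#    include <cuda_runtime.h>
#    define HAVE_CUDA_RUNTIME 1
#  endif
#endif

/* Hypothetical helper: select a device and force its context to be created.
 * Returns 0 on success, -1 on a CUDA error or when built without the toolkit. */
int select_and_init_device(int device_id)
{
#ifdef HAVE_CUDA_RUNTIME
    if (cudaSetDevice(device_id) != cudaSuccess)
        return -1;
    /* cudaFree(0) frees nothing; it is used here only to make the runtime
     * initialize now instead of inside a timed region later. */
    return cudaFree(0) == cudaSuccess ? 0 : -1;
#else
    (void)device_id;  /* no toolkit at build time: nothing to initialize */
    return -1;
#endif
}
```

Such a helper would have to be called once from code that runs before the timed region. Whether CUDA Fortran kernels launched afterwards benefit from it depends on the Fortran runtime using the same context, which is an assumption here.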

Matt