Using constant memory in Fortran CUDA and with multiple GPUs


I’m developing a program in CUDA Fortran and trying to use/access multiple GPUs from a single host thread. The original code (single GPU only) was using global and constant memory, and when adding support for multiple GPUs I could not find a way to specify on which device to place Fortran variables with the “constant” attribute.

I have tried this:

integer, constant :: iconst

DO dev = 0, maxdev
   ignore = cudaSetDevice(dev)
   iconst = 1
END DO

and this compiles and runs, but trying to access iconst from a kernel launched on the higher-numbered devices results in an “unspecified launch failure”.

Is there a way to specify the placement of variables in the constant memory of a specific device? I looked through the user manual and “CUDA Fortran for Scientists and Engineers”, but there is little information on supporting multiple GPUs in general.



Hi Maciej,

Did you set up the Peer-to-Peer communication first? It’s required in order to use GPUDirect.

My article on multi-GPU programming in CUDA Fortran has a section on GPUDirect (part 4), including the set-up code. While I don’t use constant memory in this example, I just went back and tried adding some variables, and it worked as expected. If you continue to encounter issues, let me know and we can work through them.
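Roughly, the peer-access set-up looks like this (a minimal sketch; the device numbers and error handling are illustrative, not copied from the article):

```fortran
! Sketch: enable peer access from device 1 to device 0's memory
integer :: istat, canAccess

istat = cudaDeviceCanAccessPeer(canAccess, 1, 0)
if (canAccess == 1) then
   istat = cudaSetDevice(1)
   istat = cudaDeviceEnablePeerAccess(0, 0)  ! second argument (flags) must be 0
end if
```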

  • Mat

Hi Mat,

Thanks for your reply.

I’m not sure if GPUDirect is actually relevant to what I am trying to achieve. I understand GPUDirect is required if a kernel running on device 1 is trying to access constant memory on device 0 - is that correct?

What I am trying to do is have a kernel running on dev 0 access constant memory on dev 0, and kernels on dev 1 access constant memory on dev 1. But it is not clear to me how to specify (we’re talking Fortran here) that a variable declared with the constant attribute is allocated in the constant memory of device 1 or 2 instead of device 0.

Is there a way of achieving this in CUDA Fortran (PGI 13.2 and CUDA 5.0), or is it something that is not currently supported?



I believe there are actually multiple contexts created, hence you need to establish Peer-to-Peer access so you can manage them. Granted, I’ve only done a little work with using multiple GPUs from a single host thread, so there may be a better way, but using Peer-to-Peer seems to work.

Personally, I much prefer using MPI and attaching a single GPU context to each MPI process. Logically I find it easier to manage, cleaner in implementation, and it scales better. Of course, you do what’s best for your program.
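As a sketch, the one-rank-per-GPU binding is just a few lines (names and the rank-to-device mapping are illustrative):

```fortran
! Sketch: one MPI rank per GPU; each rank gets its own context,
! so constant-memory variables behave just like the single-GPU case
program one_rank_per_gpu
   use cudafor
   use mpi
   implicit none
   integer :: ierr, rank, ndev, istat

   call MPI_Init(ierr)
   call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
   istat = cudaGetDeviceCount(ndev)
   istat = cudaSetDevice(mod(rank, ndev))  ! bind this rank to one device
   ! ... allocate device data, set constant variables, launch kernels ...
   call MPI_Finalize(ierr)
end program one_rank_per_gpu
```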

  • Mat

I could do that, but that would just let me access constant memory on dev A from a kernel running on dev B, thereby negating the performance benefits of using constant memory. ;)

That is something we’ve been thinking about for later. I hoped that a pipelined copy between two GPUs accessed from the same host thread would be a bit faster than MPI, so that we could reap the benefits of multiple GPUs even for moderately sized problems.

Thanks for your help anyway. I’ve decided to refactor the code so that scalar constants become kernel arguments passed by value, while array constants move to global memory.
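Roughly, the refactored kernel will look something like this (a sketch; the names and launch configuration are illustrative):

```fortran
! Sketch: the scalar constant becomes a value argument instead of
! living in constant memory, so it works the same on every device
attributes(global) subroutine scale(a, n, iconst)
   integer, value :: n, iconst
   integer :: a(n)
   integer :: i

   i = (blockIdx%x - 1) * blockDim%x + threadIdx%x
   if (i <= n) a(i) = a(i) * iconst
end subroutine scale

! launched per device from the host loop, e.g.:
!   istat = cudaSetDevice(dev)
!   call scale<<<grid, tBlock>>>(a_d, n, 1)
```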