CUDA constant memory in multi-GPU programming

Hello.

In CUDA Fortran, constant variables must be declared in the specification part of a module, and their values are set from the host side.

The kernels that use the constant memory must be contained in that same module.
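For example, this is the pattern I mean (a minimal sketch; the module and variable names are just placeholders):

module cdata_mod
  use cudafor
  integer, constant :: cval        ! constant variable, declared in the module
contains
  attributes(global) subroutine show_cval()
    print *, "cval = ", cval       ! kernel in the same module can read it
  end subroutine show_cval
end module cdata_mod

program demo
  use cdata_mod
  use cudafor
  integer :: istat
  cval = 42                        ! host-side assignment copies into constant memory
  call show_cval<<<1,1>>>()
  istat = cudaDeviceSynchronize()
end program demo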

But I’m not sure whether this approach carries over to multi-GPU programming.

Modules are a somewhat confusing concept in multi-GPU programming: I can’t assume that each GPU has its own copy of a module’s data.

How do I use constant memory in multi-GPU programming?

It would be best if you could provide an example.

Hi CNJ,

If you are using MPI, then each MPI process will have its own independent copy of the module data. If you are using a single host process with multiple threads (like OpenMP), then there is a single copy of the module data. This is true whether you’re using CUDA Fortran or just basic Fortran.

CUDA Fortran static device module data is created and initialized at load time. This means that if you launch your process and then change the device number within the program, the static module data (including constant data) will be on the wrong device.
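For example, in a sketch like this (my_const_data and my_kernel stand in for any static device module data and any module kernel):

! my_const_data was created and initialized on the default device at load
istat = cudaSetDevice(1)             ! program switches to another device
call my_kernel<<<1,1>>>(0)           ! launched on device 1, but my_const_data
istat = cudaDeviceSynchronize()      ! lives on device 0 -- the wrong device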

For a single-process, multi-threaded, multi-device program (like OpenMP), this is problematic, and static device module data can’t really be used.

For MPI, you can use static device module data provided that you don’t set the device in the program, but rather when you launch the program.

The way I do it is to launch the binary from a script that sets the environment variable “CUDA_VISIBLE_DEVICES” to the local rank of each MPI process. This assumes a 1-to-1 correspondence of MPI processes to devices, but you could add logic to the script if that isn’t the case for you.

For example, here’s the simple script I use with OpenMPI under csh:

% cat run.csh
#!/bin/csh -f
setenv CUDA_VISIBLE_DEVICES $OMPI_COMM_WORLD_LOCAL_RANK
$1   # run the executable passed as the first argument
% mpirun -np 4 ./run.csh ./a.out

Hope this helps,
Mat

Unfortunately I’m using OpenMP.

Our program does use MPI, but it’s for distributing workloads between nodes, not between cores.

So is there currently no way to define constant memory on each device, the way you can in CUDA C?

Hi CNJ,

I decided to go back and see if I could make this work. It looks like we made a change back in release 15.5 that helped; at least in my toy example, it seems to work the same way it does in CUDA C.

Note that if you have two OpenMP threads sharing the same GPU, they will also share the same constant memory. Hence, in this code, when I oversubscribe the GPUs there’s a race condition on the constant data when it’s updated from the host.

% cat testmp.cuf
module data

 use cudafor
 integer,constant :: my_const_data = 100

contains
 attributes(global) subroutine  my_kernel(id)
  integer, value :: id
  print *,"thread ", id, " constant data is: ", my_const_data
 end subroutine my_kernel

end module data

program foo

  use data
  use cudafor
  use omp_lib

  integer :: num_dev, my_dev, cdata
  integer :: istat, thdid
  num_dev = 0
  istat = cudaGetDeviceCount(num_dev)
  if (num_dev .eq. 0) then
    print *, "no cuda devices found!"
    stop 1
  end if
  print *, "num_dev=", num_dev
  ! Set the constant data from the host and check it on the default device
  cdata = 1000
  my_dev = -1
  my_const_data = cdata
  call my_kernel<<<1,1>>>(my_dev)
  istat = cudaDeviceSynchronize()

!$omp parallel private(thdid,my_dev,cdata)
    thdid = omp_get_thread_num()
    ! Assign threads to devices round-robin
    my_dev = mod(thdid,num_dev)
    print *, "thread:", thdid, " Device:", my_dev
    istat = cudaSetDevice(my_dev)
    ! Set the constant data on this thread's device, then read it back
    cdata = thdid*10
    my_const_data = cdata
    call my_kernel<<<1,1>>>(thdid)
    istat = cudaDeviceSynchronize()
!$omp end parallel

end program foo
% pgf90 -fast testmp.cuf -mp -V16.9
% setenv OMP_NUM_THREADS 8
% ./a.out
 num_dev=            8
 thread            -1  constant data is:          1000
 thread:            0  Device:            0
 thread:            6  Device:            6
 thread:            4  Device:            4
 thread:            2  Device:            2
 thread:            1  Device:            1
 thread:            5  Device:            5
 thread: thread             0  constant data is:             0
            3  Device:            3
 thread:            7  Device:            7
 thread             3  constant data is:            30
 thread             7  constant data is:            70
 thread             5  constant data is:            50
 thread             1  constant data is:            10
 thread             6  constant data is:            60
 thread             4  constant data is:            40
 thread             2  constant data is:            20
% setenv OMP_NUM_THREADS 16
% ./a.out
 num_dev=            8
 thread            -1  constant data is:          1000
 thread:            0  Device:            0
 thread:            4  Device:            4
 thread:            6  Device:            6
 thread:            1  Device:            1
 thread:           13  Device:            5
 thread:           12  Device:            4
 thread             0  constant data is:             0
 thread:           15  Device:            7
 thread:            8  Device:            0
 thread:           10  Device:            2
 thread:           14  Device:            6
 thread:            7  Device:            7
 thread:            3  Device:            3
 thread:            2  Device:            2
 thread:            5  Device:            5
 thread:            9  Device:            1
 thread             8  constant data is:            80
 thread:           11  Device:            3
 thread             6  constant data is:           140
 thread            14  constant data is:           140
 thread            15  constant data is:           150
 thread             7  constant data is:            70
 thread            13  constant data is:           130
 thread             5  constant data is:            50
 thread             1  constant data is:            10
 thread             9  constant data is:            90
 thread            10  constant data is:            20
 thread             2  constant data is:            20
 thread             3  constant data is:            30
 thread            11  constant data is:           110
 thread             4  constant data is:           120
 thread            12  constant data is:           120
-Mat

Oh, that’s very good news!

I don’t have a multi-GPU machine, so I can’t test it right now, but I’m sure it will work as it did in your example.

Thank you very much.

Mat,

I’ve been attempting to implement multi-GPU support in the ALADYN application using CUDA Fortran. My approach is essentially the same as the testmp.cuf program you posted in this thread. My issue is that I’d like to make the device constants OpenMP private; however, doing so results in incorrect behavior. In particular, device 0 prints 1000 while the other three devices all print 100. Is this the expected behavior?

Dan

Hi Dan,

Apologies for the late reply. I’m away at a conference this week.

Are you also initializing the constant variable in the module? If so, try initializing it after setting the device instead. Initialization in the module, as I have it in my example, occurs when the binary is loaded, so only the default device would get initialized.
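In other words, something along these lines (a sketch reusing the names from my testmp.cuf example above):

!$omp parallel private(thdid, my_dev)
    thdid = omp_get_thread_num()
    my_dev = mod(thdid, num_dev)
    istat = cudaSetDevice(my_dev)    ! select the device first...
    my_const_data = 100              ! ...then initialize its copy of the constant
    call my_kernel<<<1,1>>>(thdid)
    istat = cudaDeviceSynchronize()
!$omp end parallel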

-Mat