Hello.
Constant variables in CUDA Fortran must be declared at module scope, and their values are set from the host side.
The kernels that use the constant memory must be contained in that module as well.
But I’m not sure whether this also applies to multi-GPU programming.
A module is a somewhat confusing concept in multi-GPU programming: I can’t assume that each GPU has its own copy of a module’s data.
How do I use constant memory in multi-GPU programming?
It would be best if you could provide an example.
Hi CNJ,
If you are using MPI, then each MPI process will have its own independent copy of the module data. If you are using a single host process with multiple threads (like OpenMP), then there is a single copy of the module data. This is true whether you’re using CUDA Fortran or just basic Fortran.
CUDA Fortran static device module data is created and initialized at load. This means that if you launch your process and then in the program change the device number, the static module data (including constant data) will be on the wrong device.
For the single-process multi-threaded multi-device program (like OpenMP), this is problematic and static device module data can’t really be used.
For MPI, you can use static device module data provided that you don’t set the device in the program, but rather when you launch the program.
The way I do it is to launch the binary from a script that sets the environment variable “CUDA_VISIBLE_DEVICES” to the local rank of each MPI process. This assumes a 1-to-1 correspondence of MPI process to device, but you could add logic to the script if that isn’t correct in your case.
For example, here’s the simple script I use with OpenMPI under csh:
% cat run.csh
setenv CUDA_VISIBLE_DEVICES $OMPI_COMM_WORLD_LOCAL_RANK
$1
% mpirun -np 4 ./run.csh ./a.out
Hope this helps,
Mat
Unfortunately, I’m using OpenMP.
Our program does use MPI, but it’s for distributing workloads between nodes, not between cores.
So currently there’s no way to define constant memory on each device, as in CUDA C?
Hi CNJ,
I decided to go back and see if I could make this work. It looks like we made a change back in 15.5 that helped, and, at least in my toy example, it seems to work as it does in CUDA C.
Note that if two OpenMP threads share the same GPU, they will also share the same constant memory. Hence in this code, when I oversubscribe the GPUs, there’s a race condition on the constant data when it is updated from the host.
% cat testmp.cuf
module data
  use cudafor
  integer, constant :: my_const_data = 100
contains
  attributes(global) subroutine my_kernel(id)
    integer, value :: id
    print *, "thread ", id, " constant data is: ", my_const_data
  end subroutine my_kernel
end module data

program foo
  use data
  use cudafor
  use omp_lib
  integer :: num_dev, my_dev, cdata
  integer :: istat, thdid

  num_dev = 0
  istat = cudaGetDeviceCount(num_dev)
  if (num_dev .eq. 0) then
    print *, "no cuda devices found!"
    stop 1
  end if
  print *, "num_dev=", num_dev

  ! set the constant on the default device from the host
  cdata = 1000
  my_dev = -1
  my_const_data = cdata
  call my_kernel<<<1,1>>>(my_dev)
  istat = cudaDeviceSynchronize()

  ! each thread selects its device, then updates that device's constant copy
!$omp parallel private(thdid,my_dev,cdata)
  thdid = omp_get_thread_num()
  my_dev = mod(thdid, num_dev)
  print *, "thread:", thdid, " Device:", my_dev
  istat = cudaSetDevice(my_dev)
  cdata = thdid*10
  my_const_data = cdata
  call my_kernel<<<1,1>>>(thdid)
  istat = cudaDeviceSynchronize()
!$omp end parallel
end program foo
% pgf90 -fast testmp.cuf -mp -V16.9
% setenv OMP_NUM_THREADS 8
% ./a.out
num_dev= 8
thread -1 constant data is: 1000
thread: 0 Device: 0
thread: 6 Device: 6
thread: 4 Device: 4
thread: 2 Device: 2
thread: 1 Device: 1
thread: 5 Device: 5
thread: thread 0 constant data is: 0
3 Device: 3
thread: 7 Device: 7
thread 3 constant data is: 30
thread 7 constant data is: 70
thread 5 constant data is: 50
thread 1 constant data is: 10
thread 6 constant data is: 60
thread 4 constant data is: 40
thread 2 constant data is: 20
% setenv OMP_NUM_THREADS 16
% ./a.out
num_dev= 8
thread -1 constant data is: 1000
thread: 0 Device: 0
thread: 4 Device: 4
thread: 6 Device: 6
thread: 1 Device: 1
thread: 13 Device: 5
thread: 12 Device: 4
thread 0 constant data is: 0
thread: 15 Device: 7
thread: 8 Device: 0
thread: 10 Device: 2
thread: 14 Device: 6
thread: 7 Device: 7
thread: 3 Device: 3
thread: 2 Device: 2
thread: 5 Device: 5
thread: 9 Device: 1
thread 8 constant data is: 80
thread: 11 Device: 3
thread 6 constant data is: 140
thread 14 constant data is: 140
thread 15 constant data is: 150
thread 7 constant data is: 70
thread 13 constant data is: 130
thread 5 constant data is: 50
thread 1 constant data is: 10
thread 9 constant data is: 90
thread 10 constant data is: 20
thread 2 constant data is: 20
thread 3 constant data is: 30
thread 11 constant data is: 110
thread 4 constant data is: 120
thread 12 constant data is: 120
Oh, that’s very good news!
I don’t have a multi-GPU machine, so I can’t test it right now, but I’m sure it will work as it did in your example.
Thank you very much.
Mat,
I’ve been attempting to implement multi-GPU support in the ALADYN application using CUDA Fortran. My approach is essentially the same as the testmp.cuf program you posted to this thread. My issue is that I’d like to include the device constants in the OpenMP private clause; however, doing so results in incorrect behavior. In particular, device 0 prints 1000 while the other three devices all print 100. Is this the expected behavior?
Dan
Hi Dan,
Apologies for the late reply; I’m away at a conference this week.
Are you also initializing the constant variable in the module? If so, try initializing it after setting the device instead. Initialization in the module, as I have it, occurs when the binary loads, so only the default device would be initialized.
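A minimal sketch of that ordering, reusing the names from the testmp.cuf example earlier in this thread (my_const_data, my_kernel, num_dev), might look like this: select the device inside the parallel region first, and only then assign the constant from the host.

```fortran
! Sketch only: do NOT rely on "integer, constant :: my_const_data = 100"
! in the module declaration to cover every device; that runs at load time
! and only initializes the default device. Instead, assign after
! cudaSetDevice so each device's constant copy is set explicitly.
!$omp parallel private(thdid, my_dev)
  thdid = omp_get_thread_num()
  my_dev = mod(thdid, num_dev)
  istat = cudaSetDevice(my_dev)    ! select this thread's device first
  my_const_data = thdid * 10       ! then initialize its constant copy
  call my_kernel<<<1,1>>>(thdid)
  istat = cudaDeviceSynchronize()
!$omp end parallel
```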
-Mat