CUDA Fortran + OpenMP problem

carowin · March 1, 2022, 9:57am

Hi all,
I want to use two CPU threads to control two GPUs, one thread per GPU. Here is the test program:
/////////////////////////////////////////////////
PROGRAM TEST
USE cudafor
USE omp_lib
IMPLICIT NONE
INTEGER :: istat
INTEGER :: GPU_ID,GPU_NUM,dev
REAL(KIND=8) :: cc
REAL(KIND=8),allocatable,dimension(:,:),device :: a

call omp_set_num_threads(2)
!$omp parallel private(GPU_ID,dev)
GPU_ID=omp_get_thread_num()
istat=cudaSetDevice(GPU_ID)
istat=cudaGetDevice(dev)
print*,'No.',GPU_ID,'CPU thread control','No.',dev,'GPU'
allocate(a(2,2))
a(:,:)=GPU_ID
!$omp end parallel

print*,'------------------------'
istat=cudaSetDevice(1)
istat=cudaGetDevice(GPU_ID)
print*,'GPU_ID',GPU_ID
cc=a(1,1)
print*,cc

print*,'------------------------'
istat=cudaSetDevice(0)
istat=cudaGetDevice(GPU_ID)
print*,'GPU_ID',GPU_ID
cc=a(1,1)
print*,cc
END PROGRAM

//////////////////////////////////////////
After running this program, I find that both printed values of “a(1,1)” are the same, which means the array “a” holds the same values on both GPUs. Why does this happen? Can I use allocate to create arrays with the same name on different GPUs?

Thanks

bleback · March 1, 2022, 8:44pm

You have one copy of the array “a”, and a race condition: two CPU threads both try to allocate it and then set it to a value.

If you want a separate copy of “a” for each thread, and therefore for each GPU, you must make it OMP private.

MatColgrove · March 1, 2022, 8:48pm

Hi carowin,

This happens because “a” is shared by the threads. Unfortunately we don’t support putting device variables in “threadprivate” (which would be the non-device way of fixing this), so you’ll either need to make “a” private within the parallel region, which limits its scope, or use multiple device arrays.
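As a rough illustration of the “multiple device arrays” option, here is a minimal sketch, assuming at least two GPUs are visible. The names a0, a1, and two_arrays are made up for this example and do not come from the original program. Each thread allocates a differently named array on its own device, so nothing is shared between the threads.

```fortran
! Hypothetical sketch: a0, a1 and two_arrays are illustrative names only.
program two_arrays
  use cudafor
  use omp_lib
  implicit none
  integer :: istat, tid
  real(kind=8) :: cc
  ! One distinctly named device array per GPU, so no thread touches the other's array.
  real(kind=8), allocatable, device :: a0(:,:), a1(:,:)

  call omp_set_num_threads(2)
  !$omp parallel private(tid, istat)
  tid = omp_get_thread_num()
  istat = cudaSetDevice(tid)        ! assumes device IDs 0 and 1 exist
  if (tid == 0) then
    allocate(a0(2,2))               ! allocated on device 0
    a0 = 0.0d0
  else
    allocate(a1(2,2))               ! allocated on device 1
    a1 = 1.0d0
  end if
  !$omp end parallel

  ! Back on the initial thread: select each device before reading its array.
  if (allocated(a0)) then
    istat = cudaSetDevice(0)
    cc = a0(1,1)
    print *, 'device 0, a0(1,1) =', cc
  end if
  if (allocated(a1)) then
    istat = cudaSetDevice(1)
    cc = a1(1,1)
    print *, 'device 1, a1(1,1) =', cc
  end if
end program two_arrays
```

The obvious drawback is that the number of arrays is fixed in the source, so this does not scale beyond a handful of GPUs.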

Note that in my experience it’s much easier to use MPI+X (where X is CUDA Fortran, OpenACC, or OpenMP with target offload) when doing multi-GPU programming. It’s very challenging to manage multiple discrete memories, and you end up having to manually decompose the problem. In MPI, domain decomposition is inherent in the model, and you then have a one-to-one relationship between a rank and a GPU. Plus, MPI allows arbitrary scaling of the number of GPUs, both within a single node and across multiple nodes.

-Mat

bleback · March 1, 2022, 9:12pm

I agree with your 2nd statement. The first is not quite right. We do support OMP threadprivate, but you need to change your code a bit:

PROGRAM TEST
USE cudafor
USE omp_lib
IMPLICIT NONE
INTEGER :: istat
INTEGER :: GPU_ID,GPU_NUM,dev
REAL(KIND=8) :: cc
REAL(KIND=8),allocatable,dimension(:,:),device :: a
! Each thread gets its own copy of the device array
!$omp threadprivate(a)

call omp_set_num_threads(2)
!$omp parallel private(GPU_ID,dev)
GPU_ID=omp_get_thread_num()
istat=cudaSetDevice(GPU_ID)
istat=cudaGetDevice(dev)
print*,'No.',GPU_ID,'CPU thread control','No.',dev,'GPU'
! Allocates this thread's copy on the device selected above
allocate(a(2,2))
a(:,:)=GPU_ID
!$omp end parallel

print*,'------------------------'

! Each thread's copy of "a" is only reachable from that thread,
! so the read-back must also happen inside a parallel region
!$omp parallel private(dev, cc)
istat=cudaGetDevice(dev)
cc=a(1,1)
print*,dev, cc
print*,'------------------------'
!$omp end parallel

end

It is far easier, and may be more flexible, to call an orphaned subroutine from your top-level OMP region and declare the device variables there, if they are short-lived. I don’t believe we support threadprivate device arrays at the module level, only declared at the same level as the OMP region, which is probably what Mat is referring to.
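A minimal sketch of the orphaned-subroutine approach, assuming two visible GPUs, might look like the following. The names gpu_work_mod, per_gpu_work, and orphan_demo are hypothetical and chosen only for illustration. The device array is a local variable of the subroutine, so each calling thread works with its own instance and no threadprivate directive is needed.

```fortran
! Hypothetical sketch: module, subroutine and program names are illustrative only.
module gpu_work_mod
  use cudafor
  implicit none
contains
  subroutine per_gpu_work(gpu_id)
    integer, intent(in) :: gpu_id
    integer :: istat, dev
    real(kind=8) :: cc
    ! Local (non-SAVE) device array: one instance per call, hence per thread.
    real(kind=8), allocatable, device :: a(:,:)

    istat = cudaSetDevice(gpu_id)   ! select this thread's GPU before allocating
    istat = cudaGetDevice(dev)
    allocate(a(2,2))                ! allocated on the currently selected device
    a = real(gpu_id, kind=8)
    cc = a(1,1)                     ! copies one element back to the host
    print *, 'thread', gpu_id, 'device', dev, 'a(1,1) =', cc
    deallocate(a)
  end subroutine per_gpu_work
end module gpu_work_mod

program orphan_demo
  use omp_lib
  use gpu_work_mod
  implicit none
  call omp_set_num_threads(2)
  !$omp parallel
  ! Orphaned call: the device array lives entirely inside the subroutine.
  call per_gpu_work(omp_get_thread_num())
  !$omp end parallel
end program orphan_demo
```

With nvfortran this would typically be built with both CUDA Fortran and OpenMP enabled (for example, -cuda -mp). The array disappears when the subroutine returns, which is why this suits short-lived device data.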

MatColgrove · March 1, 2022, 9:51pm

My bad. It’s only device arrays in modules that can’t be put in “threadprivate”.

carowin · March 2, 2022, 1:20am

That works! But why does “cc” have to be assigned and printed inside an OMP region? I find that if I do not use OMP, the value of “cc” does not change.

carowin · March 2, 2022, 2:51am

I also wonder how MPI solves this problem if I want to use the same array name on different GPUs.

MatColgrove · March 2, 2022, 3:21pm

Sorry but I’m not understanding this question.

Are you asking why the two separate sections need to be combined into a single “omp parallel” region? If so, it’s because each thread has its own separate private copy of “a”, so if you want to access each copy, it needs to be done within a parallel region.

carowin wrote: “I wonder how MPI solves this problem if I want to use the same array name on different GPUs?”

Each MPI rank is an independent process with its own address space. Hence, while the name of a variable may be the same in each rank, the addresses are completely different.
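To make the MPI picture concrete, here is a minimal sketch, assuming all ranks run on a single node so that the rank number modulo the device count selects a distinct GPU. The program name mpi_gpu_demo is hypothetical. Every rank declares an array called “a”, but each one lives in a separate process and on a separate device.

```fortran
! Hypothetical sketch: assumes a single node, with one rank per GPU.
program mpi_gpu_demo
  use mpi
  use cudafor
  implicit none
  integer :: ierr, rank, ndev, dev, istat
  real(kind=8) :: cc
  ! Same name in every rank, but each rank is its own process with its own "a".
  real(kind=8), allocatable, device :: a(:,:)

  call MPI_Init(ierr)
  call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)

  istat = cudaGetDeviceCount(ndev)
  istat = cudaSetDevice(mod(rank, ndev))   ! simple rank-to-GPU mapping
  istat = cudaGetDevice(dev)

  allocate(a(2,2))                         ! allocated on this rank's device
  a = real(rank, kind=8)
  cc = a(1,1)                              ! copies one element back to the host
  print *, 'rank', rank, 'device', dev, 'a(1,1) =', cc

  deallocate(a)
  call MPI_Finalize(ierr)
end program mpi_gpu_demo
```

On multiple nodes the global rank is not a safe device index; a node-local rank would be needed instead, which this sketch leaves out.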

carowin · March 3, 2022, 3:13am

I will try to use MPI+CUDA. Thanks!
