Device Memory in OpenMP

Hi all,

I’m trying to assign the same values on two GPUs in one node using OpenMP, but I found that after the “$omp end parallel”, all the values are “0” or “0.0”. When I tried to add “lastprivate”, the compiler said “PGF90-S-0533-Clause ‘LASTPRIVATE’ not allowed in OMP PARALLEL”.
So how should I code this so that the values can still be used outside the OpenMP parallel region that assigns them?
BTW, a memory copy works correctly when it is done right after the device memory is allocated.

Thanks in advance!

Hi qijiin21c,

I’m assuming you mean that you have an OpenMP+OpenACC code where you’re using OpenMP to parallelize the host code across multiple CPU cores and then using OpenACC to offload to the GPU?

I think what would help here is if you could post a code snippet showing what you’re doing. I don’t think you need to post the whole code, just the OpenMP and OpenACC directives as well as any API calls. In particular, I’d like to know which host variables are shared or private in OpenMP and where you have your OpenACC data directives.

Note that the OpenMP “lastprivate” clause is not allowed on “omp parallel”; it goes on a worksharing directive such as “omp do” (“omp for” in C/C++).
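To illustrate where the clause is accepted, here is a minimal sketch of “lastprivate” on a combined “parallel do”. The module and function names (lastprivate_demo, last_square) are made up for this example. It uses the “!$omp” sentinel in column 1 and no continuation lines so that it should build as either fixed- or free-form source.

```fortran
      module lastprivate_demo
      implicit none
      contains
!     Returns the value assigned in the final iteration, i = n.
!     Assumes n >= 1 so that the loop runs at least once.
      function last_square(n) result(res)
      integer, intent(in) :: n
      integer :: res, last, i
      last = 0
!     lastprivate goes on the worksharing loop, not on a bare
!     parallel directive.
!$omp parallel do lastprivate(last)
      do i = 1, n
         last = i*i
      end do
!$omp end parallel do
!     After the loop, last holds the value from iteration i = n.
      res = last
      end function last_square
      end module lastprivate_demo
```

Each thread works on its own copy of “last” inside the loop, and the copy from the sequentially final iteration is written back to the original variable when the loop ends.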

Thanks,
Mat

Hi Mat,

Sorry, I did not say that it’s CUDA + OpenMP. I cannot post all the code right now. What I’m trying to do is allocate GPU memory on two devices in one routine, assign values in another routine, and use them in a third routine. The problem is that outside the assignment routine the values are gone; they have all become zero.
Yes, “lastprivate” is added to “omp for” directives.

Thanks!

Hey, here is what I’m trying to do. There are two GPUs and 12 CPU cores in this server. I have allocated memory on both GPUs in one subroutine, but after that I also want to use all the GPUs and CPU cores in the next subroutine. So can I declare the GPU variables as private or firstprivate in a c$omp clause? How on earth should I code this?

Thanks.

Hi qijiin21c,

I much prefer and recommend using MPI+OpenACC for multi-GPU programming. Not that you can’t do OpenMP+OpenACC; MPI+OpenACC is just a lot easier to program. With MPI, the domain decomposition is a natural part of the program, and you can take advantage of GPUDirect so that device-to-device transfers do not need to go back through the host.


“I have allocated memory on both GPUs in one subroutine, but after that I also want to use all the GPUs and CPU cores in the next subroutine.”

This should be fine provided that you have allocated memory on each device from an OpenMP parallel region and have each thread set the appropriate device (via acc_set_device_num). PGI OpenMP threads are persistent, so the next time you enter an OpenACC compute region the same device and device data will be used.

“So can I declare the GPU variables as private or firstprivate in a c$omp clause?”

OpenMP firstprivate variables are only implicitly created on the host. You can then use them on the GPU by putting them in a data region or data clause.

While not exactly what you’re doing, you may want to take a look at the sample code from Chapter 7 of the Parallel Programming with OpenACC book. It has samples for both using MPI+OpenACC and OpenMP+OpenACC.

Hi Mat,
You can try the following simple code:

      program main
      use cudafor
      use omp_lib
      Implicit None
      Integer*4 :: myid,istat,nGPU
      Real*8,Device,Target,Allocatable :: ADev(:)

      istat = cudaGetDeviceCount(nGPU)
      write(6,'(a,i3,a)') 'You have ',nGPU,' devices'

      call omp_set_num_threads(nGPU)
c$omp parallel private(myid)
        myid = omp_get_thread_num()
        istat = cudaSetDevice(myid)
        istat = cudaDeviceReset()
        Allocate(ADev(1024))
        Write(6,'(a7,i3,a11,z20)') 'Device ',myid,
     &    ', address: ',loc(ADev)
c$omp end parallel

        Write(6,'(a)') 'After allocation'

c$omp parallel private(myid)
        myid = omp_get_thread_num()
        istat = cudaSetDevice(myid)
        Write(6,'(a7,i3,a11,z20)') 'Device ',myid,
     &    ', address: ',loc(ADev)
c$omp end parallel

      end program main

compile options: pgfortran test.cuf -o test -Mcuda=ptxinfo,cuda7.5,cc35 -Mfixed -mcmodel=medium -O2 -mp

output:

You have 2 devices
Device 1, address: 1303EE0000
Device 0, address: 1307EC0000
After allocation
Device 0, address: 1307EC0000
Device 1, address: 1307EC0000

It’s strange: in the second parallel region, both threads report the same address. How can I fix this?
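A possible explanation, offered as a sketch rather than a confirmed diagnosis: ADev does not appear in a private clause, so OpenMP treats it as shared by default. Both threads then allocate through the same descriptor, and whichever allocation finishes last is the one the second region sees. One way around this is to keep one buffer per device, indexed by device id, instead of a single shared allocatable. The host-only sketch below shows that layout. The names per_device_buffers, dev_buf and alloc_all are hypothetical, and the device attribute and the cudaSetDevice call are deliberately left out so that it builds without a GPU.

```fortran
      module per_device_buffers
      implicit none
!     One buffer per device. In CUDA Fortran the component would
!     also carry the device attribute; that is omitted here so the
!     sketch builds with any Fortran compiler.
      type dev_buf
         real(8), allocatable :: a(:)
      end type dev_buf
      contains
!     Allocate slot i (standing in for device i-1) with n elements
!     and fill it with i so the slots can be told apart.
      subroutine alloc_all(bufs, n)
      type(dev_buf), intent(inout) :: bufs(:)
      integer, intent(in) :: n
      integer :: i
!$omp parallel do
      do i = 1, size(bufs)
!        A real multi-GPU code would select device i-1 here,
!        for example with cudaSetDevice, before allocating.
         allocate(bufs(i)%a(n))
         bufs(i)%a = real(i, 8)
      end do
!$omp end parallel do
      end subroutine alloc_all
      end module per_device_buffers
```

Because each slot has its own descriptor, a later parallel region can look up bufs(myid+1) after setting the device and get the allocation that belongs to that device, rather than whatever the last thread left in a shared variable.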