accelerate a single loop with mpi and gpu

Hi,

If I have a single large loop containing lots of computations that is already divided among mpi tasks, is there an effective way to also use a gpu?

Or is it only possible to either (a) divide the loop into mpi tasks or (b) run the loop threads in parallel on the gpu?

Similarly, to use both mpi and gpus do I need nested loops?

Thanks,
Ben

Hi Ben,

You would use MPI to divide your work across multiple accelerators and then use OpenACC to move the individual MPI process’s work over to the accelerator.

This article may help: http://www.pgroup.com/lit/articles/insider/v4n1a3.htm.

  • Mat

How do you deal with having a lot more MPI processes then GPUs? For example, if I’m running on nodes with 8 cores / GPU and every core runs an MPI process, they’re all fighting over that 1 gpu.

I’m having trouble understanding how you can utilize all 8 cores and the gpu efficiently. Even if you had just 1 MPI process/node and you used something like OpenMP to distribute work among the cores on the node, it still seems you’d have the same problem.

With reguards to the article:
At the very bottom, the table shows the processors used with run times and such. There looks to be 48 cores total. Are they all being used in the GPU run? Also on that system you have 2 cores/GPU. But on computers like Titan at Oak Ridge where you have 16 cores/ gpu will you suffer in performance?

Thanks,
Ben

Hi Ben,

For example, if I’m running on nodes with 8 cores / GPU and every core runs an MPI process, they’re all fighting over that 1 gpu.

It’s a problem. Unless you have a K20, NVIDIA doesn’t support multiple host processes (MPI) using the same GPU. It may work, it’s just not supported. Even if it does work, then you’ve serialized the GPU portion of the code. This situation works only if the MPI processes use the GPU infrequently and not at the same time.

I’m having trouble understanding how you can utilize all 8 cores and the gpu efficiently. Even if you had just 1 MPI process/node and you used something like OpenMP to distribute work among the cores on the node, it still seems you’d have the same problem.

You can do a hybrid model (MPI+OMP) but it is difficult to code. Though, typically using just MPI process (1 per core) is fine. You may loose a bit of time since the processes don’t share memory, but it may not be much.

At the very bottom, the table shows the processors used with run times and such. There looks to be 48 cores total. Are they all being used in the GPU run? Also on that system you have 2 cores/GPU. But on computers like Titan at Oak Ridge where you have 16 cores/ gpu will you suffer in performance?

On my runs, I had one GPU per MPI process. On my test system, although I had 4 host cores, I only used two MPI processes since I had 2 GPUs on that box. On the cluster run, I had 8 nodes with each node having 3 MPI process and GPUs. For the pure host versions (no GPUs), I used a hybrid MPI/OpenMP version of the code. The number of MPI processes were used, but I then used OpenMP to fully populate the host cores.

  • Mat

Not entirely related to the previous posts, but:

What if I had an OpenMP + GPU (either CUDA or directives) code, and again I just have one big loop that needs to be parallelized (nested inside is a smaller, but mostly insignificant loop).

Would it be viable to have one MP thread, say the master thread, execute half the entire loop with a call to a GPU kernel, and then have the other 3 open MP threads take on the work of executing the other half of the loop? Since it wouldn’t make sense to have the loop split between the threads and then to have every thread call the gpu.

If I had, say, 2 GPUs, and 4 processors, then maybe the better way would be to split the work of the loop between 2 open MP threads, and then allow each open mp thread to call one of the two GPUs to do the actual work. Is that reasoanble?

What would be an example of a code structure that would be able to fully take advantage of both open MP and gpu parallelization? Just large nested loops?

Hi Brush,

You certainly could do this with some effort. However, now you have a load balancing problem. You’d need to be able to calculate the amount work being performed, what resources are available, and then divide accordingly. You might be able to for a specific system with a specific workload, but a general solution will be very difficult to get correct.

Also, with a GPU you really want to bring the data over once use it across multiple kernel calls. By introducing the CPU n the mix, you’ll need to move data back more often.

It might work better to use MPI and then use the GPU or OpenMP depending upon the resources. Seem to me that it would be easier to manage. Though, you still have a load balancing problem. Maybe you can arrange the problem so that there is limited synchronization between processes, and a variable work flow (such as a producer/consumer model where the master process queues the work and each slave processes take work as their resources are free)

  • Mat

Why is more data transferred back and forth when more CPUs are introduced?

Why is more data transferred back and forth when more CPUs are introduced?

My assumption would be if you are splitting up a large loop where a portion is run the host and a portion on the GPU, you would need to synchronize the host and device copies of the data at the end of the loop. Having a GPU only version allows more opportunity for you copy the data over to the device and then re-use it in other accelerated loops without synchronizing the data.

Though, there’s no steadfast rule here. I’m sure you can come up with methods where you have one very large OpenMP parallel section which can divide up the work across the CPU and GPU with limited synchronization. If you can get it work for your algorithm, then please do so. Though, to me, this is not natural OpenMP programming and better fits MPI.

  • Mat

Thanks Mat. So if I had two GPUs, would it make sense to split up the work of a single large loop between the two GPUs? And if not, why? Would OMP or MPI be better for doing this?

So if I had two GPUs, would it make sense to split up the work of a single large loop between the two GPUs?

The bottle neck here is the PCIe bus. Both GPUs will share this one bus and essentially serializes your data copies. Hence, the benefit of splitting a single loop across multiple GPUs will depend upon how much compute is done versus how much data needs to be copied. If there is little to no data movement, then, you can expect a near 2x speed-up going from one to two GPUs. However, as you add data movement the speed-up diminishes.

Would OMP or MPI be better for doing this?

In general, you want to move your data over to the GPU once, perform all your computation on the device, and then move it back once. Most likely you’ll need to preform some data synchronization in-between, but ideally this is kept to a minimum.

This is why I prefer MPI over OpenMP for multi-GPU programming using OpenACC. MPI more naturally fits this model since data is already decomposed across processes, each process can manage it’s own device data, and data synchronization is typically limited.

For OpenMP, there is an assumption of data coherence with shared memory. However, when you start using multiple devices, each with their own discrete memory systems, you the programmer now need to make sure the host shared memory maintains this coherence. Secondly, OpenMP makes it more difficult to have a whole program view of your device data since OpenMP’s parallelism is more fine grained. You could have a single parallel region in your main routine that would fix this, but again it’s not normally how OpenMP is used.

Now, if you are only splitting a single loop across multiple GPUs, then I can see using OpenMP instead of MPI. However, I would question if you would see benefit of using the GPU at all, and if using two would be worth the extra programming effort. If 99% of your program time is spent in this loop, and there is limited data movement, then it’s absolutely worth doing multiple GPUs. If 5% of your program time is spent in this loop, then absolutely not.

Again, there is not one way of doing this. You need to evaluate your particular algorithm as to what is the best solution.

  • Mat

Hi again- so at the moment I’m splitting the work of a loop between two GPUs, and using OMP to do so. I’m not expecting this to work well for reasons you have mentioned, but I’m still curious.

Previously I accelerated this loop with just a PGI compute region, and another time with just OMP. Both seemed to work.

Now, to achieve my goal I have added some code, and changed the do loop slightly:

         call omp_set_num_threads(2)

!$OMP PARALLEL PRIVATE(myid)
             myid = OMP_GET_THREAD_NUM()
             call acc_set_device_num(myid, acc_device_nvidia)
!$OMP END PARALLEL 

             chunk=mm/2

!$OMP PARALLEL PRIVATE(exp1,eyp,wx0,wx1,wy0,
!$OMP&                 wy1,m,i,j,rhog,vfac,kap,th,xt,yt)
!$acc kernels loop
             do 100 m=i*chunk+1,i*chunk+chunk ! previously was do 100 m=1, mm

The second PRIVATE statement is the same one I used in my “just omp” implementation. However, in this case, upon compiling I get:

pgfortran -fast -Msave -mp -acc -Minfo=accel  -c slab.f
ppush:
    288, Generating present_or_copy(y3(:))
         Generating present_or_copy(y1(:))
         Generating present_or_copy(x3(:))
         Generating present_or_copy(x1(:))
         Generating present_or_copyin(u2(:))
         Generating present_or_copy(u3(:))
         Generating present_or_copy(u1(:))
         Generating present_or_copyin(ey(:,:))
         Generating present_or_copyin(ex(:,:))
         Generating present_or_copyin(rwx(:lr))
         Generating present_or_copyin(x2(:))
         Generating present_or_copyin(rwy(:lr))
         Generating present_or_copyin(y2(:))
         Generating present_or_copyin(mu(:))
         Generating present_or_copyin(w2(:))
         Generating present_or_copy(w1(:))
         Generating present_or_copy(w3(:))
         Generating NVIDIA code
         Generating compute capability 1.0 binary
         Generating compute capability 2.0 binary
         Generating compute capability 3.0 binary
    289, Complex loop carried dependence of 'u1' prevents parallelization
         Complex loop carried dependence of 'u3' prevents parallelization
         Loop carried dependence due to exposed use of 'u3(:)' prevents parallelization
         Complex loop carried dependence of 'x1' prevents parallelization
         Complex loop carried dependence of 'x3' prevents parallelization
         Loop carried dependence due to exposed use of 'x3(:)' prevents parallelization
         Complex loop carried dependence of 'y1' prevents parallelization
         Complex loop carried dependence of 'y3' prevents parallelization
         Loop carried dependence due to exposed use of 'y3(:)' prevents parallelization
         Loop carried dependence due to exposed use of 'u1(:)' prevents parallelization
         Loop carried dependence due to exposed use of 'x1(:)' prevents parallelization
         Loop carried dependence due to exposed use of 'y1(:)' prevents parallelization
         Complex loop carried dependence of 'w1' prevents parallelization
         Complex loop carried dependence of 'w3' prevents parallelization
         Loop carried dependence due to exposed use of 'w3(:)' prevents parallelization
         Loop carried dependence due to exposed use of 'w1(:)' prevents parallelization
         Accelerator kernel generated
        295, !$acc loop vector(128) ! threadidx%x
         Loop is parallelizable

and the code crashes at run time. I am wondering: why would these dependencies pop up when they did not appear on the case with just the PGI directives? Looking at the code, I do not see an actual dependency, as demonstrated by the working “just directives” version.

Thanks,
Ben

Hi Ben,

why would these dependencies pop up when they did not appear on the case with just the PGI directives?

Sorry, I can’t tell from just this. Can you post a bit more of the code, including how these variables are being used? To work around it you can add the “independent” clause to your “kernels” directive.

!$OMP PARALLEL PRIVATE(exp1,eyp,wx0,wx1,wy0,
!$OMP& wy1,m,i,j,rhog,vfac,kap,th,xt,yt)
!$acc kernels loop
do 100 m=ichunk+1,ichunk+chunk ! previously was do 100 m=1, mm

Is this just a snip-it and you’ve cut out code? If not, then I’m wondering if “i” is uninitialized.

  • Mat

Is this just a snip-it and you’ve cut out code? If not, then I’m wondering if “i” is uninitialized.

Oh wow, you’re right! Can’t believe I forgot this. I meant to have myid instead of i. Changing this got rid of the unwanted dependancies. Thanks Mat.

Now, at run time I still get some errors, which change from run to run. Perhaps you could tell me what some of these mean:

First run:
Segmentation fault (core dumped)

Second run:
call to cuMemHostUnregister returned error 713: Other
Error: _mp_pcpu_reset: lost thread

Third run:
call to cuMemcpyDtoHAsync returned error 700: Launch failed
Error: _mp_pcpu_reset: lost thread

In case it matters, the new compiler feedback and the full subroutine, respectively. myid is defined before the acc directive, so I think that is okay. Also the number 8404992 is just chunk, as my total particle number is twice that, so that seems okay. Adding copy statements to the code so that the entire arrays are copied still got the same errors.

pgfortran -fast -Msave -mp -acc -Minfo=accel  -c slab.f
ppush:
    288, Generating present_or_copyout(y3(myid*8404992+1:myid*8404992+8404992))
         Generating present_or_copy(y1(myid*8404992+1:myid*8404992+8404992))
         Generating present_or_copyout(x3(myid*8404992+1:myid*8404992+8404992))
         Generating present_or_copy(x1(myid*8404992+1:myid*8404992+8404992))
         Generating present_or_copyin(u2(myid*8404992+1:myid*8404992+8404992))
         Generating present_or_copyout(u3(myid*8404992+1:myid*8404992+8404992))
         Generating present_or_copy(u1(myid*8404992+1:myid*8404992+8404992))
         Generating present_or_copyin(ey(:,:))
         Generating present_or_copyin(ex(:,:))
         Generating present_or_copyin(rwx(:lr))
         Generating present_or_copyin(x2(myid*8404992+1:myid*8404992+8404992))
         Generating present_or_copyin(rwy(:lr))
         Generating present_or_copyin(y2(myid*8404992+1:myid*8404992+8404992))
         Generating present_or_copyin(mu(myid*8404992+1:myid*8404992+8404992))
         Generating present_or_copyin(w2(myid*8404992+1:myid*8404992+8404992))
         Generating present_or_copy(w1(myid*8404992+1:myid*8404992+8404992))
         Generating present_or_copyout(w3(myid*8404992+1:myid*8404992+8404992))
         Generating NVIDIA code
         Generating compute capability 1.0 binary
         Generating compute capability 2.0 binary
         Generating compute capability 3.0 binary
    289, Loop is parallelizable
         Accelerator kernel generated
        289, !$acc loop gang, vector(128) ! blockidx%x threadidx%x
    295, Loop is parallelizable



c
          subroutine ppush

          use OMP_LIB

          include 'slab.h'


          real exp1,eyp
          real wx0,wx1,wy0,wy1
          integer m,i,j,myid,chunk
          real rhog,vfac
          real kap,th
c
! hardcoded for 2 threads
         call omp_set_num_threads(2)
c
!$OMP PARALLEL PRIVATE(myid)
             myid = OMP_GET_THREAD_NUM()
             call acc_set_device_num(myid, acc_device_nvidia)
!$OMP END PARALLEL 
             chunk=mm/2
             dv=dx*dy
!$OMP PARALLEL PRIVATE(exp1,eyp,wx0,wx1,wy0,
!$OMP&                 wy1,m,i,j,rhog,vfac,kap,th,xt,yt)
!$acc kernels loop
             do 100 m=myid*chunk+1,myid*chunk+chunk
c
                exp1=0.
                eyp=0.

          rhog=sqrt(mu(m))/mims
          do l=1,lr

                xt=x2(m)+rwx(l)*rhog
                yt=y2(m)+rwy(l)*rhog

              if(xt.gt.lx) xt=2.*lx-xt
              if(xt.lt.0.)  xt=-xt
              if(xt.eq.lx)  xt=0.9999*lx
              if(yt.ge.ly) yt=yt-ly
              if(yt.le.0.)  yt=yt+ly
              if(yt.eq.ly)  yt=0.9999*ly


        if (ngp.eq.1) then
                i=int(xt/dx+0.5)
                j=int(yt/dy+0.5)
                exp1=exp1 + ex(i,j)
                eyp=eyp + ey(i,j)

        else
                i=int(xt/dx)
                j=int(yt/dy)
c
                wx0=float(i+1)-xt/dx
                wx1=1.-wx0
                wy0=float(j+1)-yt/dy
                wy1=1.-wy0
c
            exp1=exp1 + wx0*wy0*ex(i,j) + wx1*wy0*ex(i+1,j)
     %      + wx0*wy1*ex(i,j+1) + wx1*wy1*ex(i+1,j+1)
c
            eyp=eyp + wx0*wy0*ey(i,j) + wx1*wy0*ey(i+1,j)
     %      + wx0*wy1*ey(i,j+1) + wx1*wy1*ey(i+1,j+1)

        endif
        enddo ! end loop over 4 pt avg
          exp1=exp1/float(lr)
          eyp=eyp/float(lr)
c
c
c   LINEAR: epara=0. for no e para.
c
             th=theta
c1             th=(x2(m)-0.5*lx)/ls
             u3(m)=u1(m)+ epara*2.*dt*q*mims*(eyp*th)
c
             vfac=0.5*( u2(m)*u2(m) + mu(m) )*mims/tets
c             vfac=0.5*( u2(m)*u2(m) - 1.0 )*mims/tets
c
             kap=( kapn -(1.5-vfac)*kapt )
c            kap=( kapn + vfac*kapt )
c
c    LINEAR: next 3 lines are commented out if linear...
c
            x3(m)=x1(m)+ 2.*dt*( ecrossb*eyp )
            y3(m)=y1(m)+ 2.*dt*( u2(m)*th + ecrossb*(-exp1) )
c
            u1(m)=u2(m)  + .25*( u3(m) - u1(m) )
            x1(m)=x2(m)  + .25*( x3(m) - x1(m) )
            y1(m)=y2(m)  + .25*( y3(m) - y1(m) )

            if(x3(m).gt.lx) x3(m)=2.*lx-x3(m)
            if(x3(m).lt.0.) x3(m)=-x3(m)
            if(x3(m).eq.lx) x3(m)=0.9999*lx
            if(y3(m).ge.ly)  y3(m)=y3(m)-ly
            if(y3(m).le.0.)  y3(m)=y3(m)+ly
            if(y3(m).eq.ly)  y3(m)=0.9999*ly

c
c    now, calculate weight for linearized case
c
c---------weigthing delft-f (dependent on ldtype)----

            w3(m)=w1(m) + 2.*dt*( eyp*kap+q*tets*
     %      (th*eyp*u2(m)) )
     %      *(1-w2(m))
c
            w1(m)=w2(m)  + .25*( w3(m) - w1(m) )
c
  100     continue
!$OMP END PARALLEL
   90     continue
c
          return
          end

myid isn’t private in the second parallel region so this might be the source of the new error. Try merging the two regions.

            chunk=mm/2 
             dv=dx*dy
!$OMP PARALLEL PRIVATE(myid,exp1,eyp,wx0,wx1,wy0, 
!$OMP&                 wy1,m,i,j,rhog,vfac,kap,th,xt,yt) 
             myid = OMP_GET_THREAD_NUM() 
             call acc_set_device_num(myid, acc_device_nvidia) 

!$acc kernels loop 
             do 100 m=myid*chunk+1,myid*chunk+chunk

You might need to set the environment variable PGI_ACC_SYNCHRONOUS=1, especially if you’re using an earlier 13.x compiler.

  • Mat

Thanks Mat, that mostly did the trick (I get the correct output now, it just crashes occasioanlly but less often with PGI_ACC_SYNCHRONOUS=1).

I had another question with reguards to something you said earlier:

The bottle neck here is the PCIe bus. Both GPUs will share this one bus and essentially serializes your data copies.

For a majority of my copies, I only have to transfer the portion of the array that is used for half of the loop. So it seems I still transfer about the same amount of data overall. Is it that transfering half of the array still takes about the same time as transferring the entire array?

Ben

Hi Ben,

Is it that transfering half of the array still takes about the same time as transferring the entire array?

If both threads transfer half the array at about the same time, then yes, it typically takes about the same amount of time as if one thread transferred the whole array. So your compute time should half, but the overall data transfer time will stay about the same.

If you can interleave data and compute, then you might be able to maximize the data bandwidth. Though this is tough to do in an OpenMP context given there’s typically a tighter synchronization between threads. Eventually, you’ll also be able to use the OpenACC async clauses which might help in interleaving, but unfortunately, we don’t quite have async working well enough within OpenMP (hence the PGI_ACC_SYNCHRONOUS variable). Async works fine in a serial and MPI context though.

  • Mat

Could you clarify the usage of acc_set_device_num(devicenum,devicetype):

For the device number, are the GPUs numbered 0, 1, 2, … or 1, 2, 3… I thought it was the former, but according to this link (http://www.catagle.com/26-23/pgi_accel_prog_model_1_2.htm) passing in 0 gives default behavior, not the first GPU. Is the CUDA Device Number, as displayed by pgaccelinfo, the number I need to input as my argument to get that device?

What does a devicetype of 0 or 1 do? (I didn’t understand the documentation linked above).

Thanks,
Ben

Hi Ben,

The GPU are numbered 0-N.

For us, the default behavior is to use the lowest numbered device the binary will run. Typically this would be device zero, though could be something higher. The device information, including the numbering, can be found by running the “pgaccelinfo” utility.

For the the devicetype, you should use the enumerated names such as ACC_DEVICE_NVIDIA since the numbering may not be consistent between compilers. You can see the PGI list by viewing the header file “include/accel.h” (located in your PGI installation directory).

From 13.5’s accel.h:

typedef enum{
        acc_device_none = 0,
        acc_device_default = 1,
        acc_device_host = 2,
        acc_device_not_host = 3,
        acc_device_nvidia = 4,
        acc_device_radeon = 5,
        acc_device_xeonphi = 6,
        acc_device_pgi_opencl = 7,
        acc_device_nvidia_opencl = 8,
        acc_device_opencl = 9
    }acc_device_t;

Hope this helps,
Mat

Hi Matt,

If you do have a K20 GPU is there anything special you need to do to make it possible for multiple MPI processes to make calls to the same GPU simultaneously? Or should it work automatically? (ANSWER EDITED IN BELOW!)

Ben

EDIT: I did my homework this time. From the link below:
" All it takes is a Tesla K20 GPU with a CUDA 5 installation and setting an environment variable to let multiple MPI ranks share the GPU – Hyper-Q is then ready to use."
http://blogs.nvidia.com/blog/2012/08/23/unleash-legacy-mpi-codes-with-keplers-hyper-q/

Here’s a question, though: what’s the environment variable?