I have a computer with 2 GPUs and 8 cores, and I want to use all of them with OpenMP. The GPU jobs are few but each is time-consuming; the CPU jobs are far more numerous but each takes much less time. I tried a dynamic schedule, but some of the jobs were missed. How should I arrange the jobs and the threads?
I’m not sure how much we can help here. Load balancing questions are often very application- and workload-dependent, and logic needs to be added to the program to determine the work sizes for the threads using the GPUs and for those using the CPUs.
Can you provide more details and perhaps a code example of what you’re doing?
Thanks for your reply. The following is a code example.
      program main
      use omp_lib
      integer,parameter :: nTask=20,nCore=4,nGPU=2,nGPUTask=4,
     &                     ChunkSize=nGPUTask/nGPU
      integer :: myid,iTask,secs(nTask)
      character(len=8) :: cmd
      secs=1
      secs(1:nGPUTask)=10
c$omp parallel num_threads(nCore)
c$omp do schedule(dynamic,ChunkSize) private(myid,cmd,iTask)
      do iTask=1,nTask
         myid=omp_get_thread_num()
         write(cmd,'(a6,i2)') 'sleep ',secs(iTask)
         call system(cmd)
         print *,'time=',secs(iTask),', id=',myid,', Task ',iTask
      end do
c$omp end do
c$omp end parallel
      end program main
Notes: We have 20 tasks. The first 4 must be executed on the 2 GPUs, and the rest on CPU cores. Suppose we have 4 threads right now. The GPU tasks take longer, while the CPU tasks are smaller but more numerous. I’m using the chunk size here to try to place Tasks 0 and 1 on Thread 0, and Tasks 2 and 3 on Thread 1. But sometimes Tasks 2 and 3 end up on other threads. So what can I do?
Compilation: pgfortran -Bstatic_pgi -i8 -r8 -mp test.f -o test
The OpenMP specification states:

"Different loop regions with the same schedule and iteration count, even if they occur in the same parallel region, can distribute iterations among threads differently. The only exception is for the static schedule as specified in Table 2.5. Programs that depend on which thread executes a particular iteration under any other circumstances are non-conforming."
In other words, you cannot rely on the order in which the threads will execute the loop iterations. If you change this to a static schedule, then the first 8 iterations (Ncores*ChunkSize) will use the same threads each time, but the remaining 12 iterations could be executed by any thread. For this simple example a static schedule would work, but it is probably not useful in the general case.
To determine which tasks should be executed on the GPU, you might consider using some other metric, such as the amount of work. What work threshold to use will depend on the system you’re using, the order in which the work appears in the queue, how busy the GPUs are, etc.
Possible, but I haven’t used pthreads since grad school 20 years ago, so I may not be able to give the best advice here.
Personally, I would use MPI, where one rank is the producer, other ranks use the GPUs (CUDA and/or OpenACC), and one or more ranks use the CPUs. For the CPU ranks, you can either run serially with one rank per core, or run a single rank parallelized with OpenMP or OpenACC across all cores. MPI also has the added benefit of allowing you to scale to multiple systems.
Of course, I don’t know your application so you should do what you think is best.