I have a computer with 2 GPUs and 8 cores, and I want to use all of them with OpenMP. The GPU jobs are few but each is time-consuming; the CPU jobs are far more numerous but each takes much less time. I tried a dynamic schedule, but some of the jobs were missed. How should I arrange the jobs and the threads?
I’m not sure how much we can help here. Load balancing questions are often very application- and workload-dependent, and logic needs to be added to the program to determine the work sizes for the threads using the GPUs and for those using the CPUs.
Can you provide more details and perhaps a code example of what you’re doing?
Thanks for your reply. The following is a code example.
      program main
      use omp_lib
      integer,parameter :: nTask=20,nCore=4,nGPU=2,nGPUTask=4,
     &                     ChunkSize=nGPUTask/nGPU
      integer :: myid,iTask,secs(nTask)
      character(len=8) :: cmd
      secs=1
      secs(1:nGPUTask)=10
c$omp parallel num_threads(nCore)
c$omp do schedule(dynamic,ChunkSize) private(myid,cmd,iTask)
      do iTask=1,nTask
         myid=omp_get_thread_num()
         write(cmd,'(a6,i2)') 'sleep ',secs(iTask)
         call system(cmd)
         print *,'time=',secs(iTask),', id=',myid,', Task ',iTask
      end do
c$omp end do
c$omp end parallel
      end program main
Notes: We have 20 tasks. The first 4 must be executed on the 2 GPUs, and the rest on CPU cores. Suppose we have 4 threads right now. The GPU tasks take longer, while the CPU tasks are smaller but more numerous. I’m using the chunk size here to try to place Tasks 0 and 1 on Thread 0, and Tasks 2 and 3 on Thread 1. But sometimes Tasks 2 and 3 end up on other threads. So what can I do?
Compilation: pgfortran -Bstatic_pgi -i8 -r8 -mp test.f -o test
The OpenMP specification states:

"Different loop regions with the same schedule and iteration count, even if they occur in the same parallel region, can distribute iterations among threads differently. The only exception is for the static schedule as specified in Table 2.5. Programs that depend on which thread executes a particular iteration under any other circumstances are non-conforming."
In other words, you cannot rely on the order in which the threads will execute the loop iterations. If you change this to a static schedule, then the first 8 iterations (Ncores*ChunkSize) will use the same threads each time, but the remaining 12 iterations could be executed by any thread. For this simple example a static schedule would work, but it is probably not useful in the general case.
To determine which tasks should be executed on the GPU, you might consider using some other metric, such as the amount of work. What work threshold to use will depend on the system you’re using, the order in which the work appears in the queue, how busy the GPUs are, etc.
Possible, but I haven’t used pthreads since grad school 20 years ago, so I may not be able to give the best advice here.
Personally, I would use MPI, where one rank is the producer, other ranks use the GPUs (CUDA and/or OpenACC), and one or more ranks use the CPUs. For the CPU ranks, you can either run serially with one rank per core, or run a single rank parallelized with OpenMP or OpenACC across all cores. MPI also has the added benefit of allowing you to scale to multiple systems.
Of course, I don’t know your application so you should do what you think is best.