Performance problem launching multiple target regions

unrue · August 4, 2021, 7:35am

I’m experiencing performance problem trying to launch multiple target regions in parallel. This is my actual code:

      do k = 2,m
         call add2s2_omp(xbar,xx(1,k),alpha(k),n)
         call add2s2_omp(bbar,bb(1,k),alpha(k),n)
         call add2s2_omp(b,bb(1,k),-alpha(k),n)
      enddo

where add2s2_omp is:

      subroutine add2s2_omp(a,b,c1,n)
      real a(n),b(n)
!$OMP TARGET TEAMS LOOP
      do i=1,n
        a(i)=a(i)+c1*b(i)
      enddo
!$OMP END TARGET TEAMS LOOP
      return
      end

My goal is to launch in parallel such target regions, in order to reduce useless synchronizations at the end of target region. A at the moment are launched one per time, so one target region starts only when the previous has finished. I tried using task:

!$OMP PARALLEL
!$OMP MASTER
      do k = 2,m
!$OMP TASK      
         call add2s2_omp(xbar,xx(1,k),alpha(k),n)
!$OMP END TASK 
!$OMP TASK 
         call add2s2_omp(bbar,bb(1,k),alpha(k),n)
!$OMP END TASK
!$OMP TASK 
         call add2s2_omp(b,bb(1,k),-alpha(k),n)
!$OMP END TASK
!$OMP TASKWAIT
      enddo
!$OMP END MASTER
!$OMP END PARALLEL

Performances are quite bad. Second attemps is to use nowait in target regions and no tasks. Ever
performances bad. I’m doing something wrong? I’m using CUDA/11.3 and V100 Gpus. Thanks.

MatColgrove · August 4, 2021, 2:56pm

How are you handling data movement? If you’re not using map directives, then the compiler’s needing to copy the “a” and “b” arrays each time entering and exiting the target region.

I’d also suggest you use the Nsight-Systems Profiler to better understand where the performance bottle-necks are.

unrue · August 5, 2021, 8:16am

I don’t see any MAP about a and b in the code (the code is quite big). But I don’t understand why this problem is present only using task. A continuos data movement from-to GPu should be get low performance in each case, regardless using the tasks. I try to investigate more in depth. Thanks.

unrue · August 6, 2021, 7:45am

I tried the following new version:

!$OMP TARGET DATA MAP(TOFROM:xbar,bbar,b) MAP(TO:xx,bb)
!$OMP PARALLEL
!$OMP MASTER
      do k = 2,m
!$OMP TASK      
         call add2s2_omp(xbar,xx(1,k),alpha(k),n)
!$OMP END TASK 
!$OMP TASK 
         call add2s2_omp(bbar,bb(1,k),alpha(k),n)
!$OMP END TASK
!$OMP TASK 
         call add2s2_omp(b,bb(1,k),-alpha(k),n)
!$OMP END TASK
!$OMP TASKWAIT
         enddo
!$OMP END MASTER
!$OMP END PARALLEL
!$OMP END TARGET DATA

No peformance improvement are detected. I’m using nsys and add2s2_omp is the most expensive routines on GPU and cuStreamSynchronize takes 89% of total time.

MatColgrove · August 6, 2021, 3:05pm

That’s just the host being blocked so is inclusive of the GPU execution time.

unrue · August 10, 2021, 6:53am

Ok, but doing more target regions in parallel in async mode cuStreamSynchronize should decrease. Right?

MatColgrove · August 10, 2021, 3:04pm

Possibly. Though it will depend on if the kernels can execute concurrently on the GPU. Since one kernel can’t start until the prior kernel releases device resources (i.e. the multiprocessors (SMs) ), if each kernel fully utilizes the GPUs, then you wont see much difference. It will just move where the CPU is blocked (i.e. moves to the ‘wait’).

Though if each kernel is only using part of the available SMs, and multiple kernels are executing concurrently (via different async queues (CUDA streams)), then you’d see less GPU execution time versus running each kernel sequentially.

unrue · August 16, 2021, 7:36am

Hi Mat,

from NSight, I see three kernels run on the same Stream (14) and each of them are using grid <<108, 1, 1>> and blocks <<1024, 1, 1>>

Since I’m using A100 GPU (not V100), I don’t think each kernel occupies all SMs. Right? Maybe I have to specify to run over different streams? ( I don’t know if is it possibile using omp offload)

unrue · August 16, 2021, 11:31am

Attached a little example. The three kernels run sequentially and I don’t understand the reason. I tried also witk OMP TASK. Same behaviour.

add2s2_omp.f (1.1 KB)

MatColgrove · August 16, 2021, 7:20pm

When I run your program it hangs when “nowait” is there. Not sure the discrepancy, but possibly since a target region is itself a task, there may be a problem when trying to join. Though, I didn’t look into this since that not the core issue.

“nowait” just launches the kernel asynchronously with the host. It does not specify use of different CUDA streams so all kernels will use the same stream. While we don’t support the ‘depends’, we’re looking at how this could be adapted to use multiple streams. Though unlike OpenACC which uses async queues, ‘depends’ creates dependencies on the data making it more challenging for the compiler to implicitly generate CUDA streams.

Though I find it very rare for a program to benefit from running multiple concurrent kernels. It more beneficial at overlapping data movement and compute, or hiding launch latency. Only for kernels with few blocks, would you possibly see benefit and I would rather see if I can modify the code or use a bigger dataset so the full GPU was utilized with each kernel.

Note when I ran your program, the grid size is 36,480. Far larger that the 108 and 1024 you posted above. So not much opportunity to overlap.

I’m not really understanding why you’re using TASK here since it doesn’t seem like it would help. I’d simplify to something like the following.

% cat add2s2_omp.f

      module foo
      contains
      subroutine add2s2_omp_async(a,b,c1,n)
      real a(n),b(n)
      real,value:: c1
      integer,value:: n
!$OMP TARGET TEAMS DISTRIBUTE PARALLEL DO MAP(tofrom:a,b) nowait
      do i=1,n
        a(i)=a(i)+c1*b(i)
      enddo
      return
      end
      end module foo

      program add2s2_omp
        use foo
        implicit none
        integer k, m, n
        real, dimension(:), allocatable :: xbar, bbar, b
        real, dimension(:,:), allocatable :: xx, bb
        real alpha_k

        m = 3
        n = 4669440

        allocate(xbar(n))
        allocate(bbar(n))
        allocate(b(n))
        allocate(xx(n,m))
        allocate(bb(n,m))

        xbar = 1.1
        bbar = 2.2
        b = 3.3
        alpha_k = 0.4
        xx = 3.5
        bb = 2.1

!$OMP TARGET DATA MAP(TOFROM:xbar,bbar,b) MAP(TO:xx,bb)
      do k = 2,m
         call add2s2_omp_async(xbar,xx(:,k),alpha_k,n)
         call add2s2_omp_async(bbar,bb(:,k),alpha_k,n)
         call add2s2_omp_async(b,bb(:,k),-alpha_k,n)
         enddo
!$OMP TASKWAIT
!$OMP END TARGET DATA

       print *, xbar(1),bbar(1),b(1)
       deallocate(xbar)
       deallocate(bbar)
       deallocate(b)
       deallocate(xx)
       deallocate(bb)

      end program

% nvfortran add2s2_omp.f -Minfo=mp -mp=gpu -V21.7 ; a.out
add2s2_omp_async:
      8, !$omp target teams distribute parallel do
          8, Generating Tesla and Multicore code
             Generating "nvkernel_foo_add2s2_omp_async__F1L8_1" GPU kernel
add2s2_omp:
     40, Generating map(tofrom:bbar(:))
         Generating map(to:bb(:,:))
         Generating map(tofrom:xbar(:))
         Generating map(to:xx(:,:))
         Generating map(tofrom:b(:))
     46, Taskwait
    3.900000        3.880000        1.620000

Once we support depends, we can go back and see about using multiple streams. However again, I doubt it will make much impact. To illustrate, here’s the equivalent OpenACC version of this code using multiple async queues. If you run this through Nsight, you’ll see a slight bit of overlap with the kernels, but not enough to offset the cost of creating the streams.

% cat add2s2_acc.f

      subroutine add2s2_acc_async(a,b,c1,n,qid)
      real a(n),b(n)
      real:: c1
      integer:: n,qid
!$ACC PARALLEL LOOP present(a,b) async(qid)
      do i=1,n
        a(i)=a(i)+c1*b(i)
      enddo
      return
      end

      program add2s2_omp
        implicit none
        integer k, m, n
        real, dimension(:), allocatable :: xbar, bbar, b
        real, dimension(:,:), allocatable :: xx, bb
        real alpha_k

        m = 3
        n = 4669440

        allocate(xbar(n))
        allocate(bbar(n))
        allocate(b(n))
        allocate(xx(n,m))
        allocate(bb(n,m))

        xbar = 1.1
        bbar = 2.2
        b = 3.3
        alpha_k = 0.4
        xx = 3.5
        bb = 2.1

!$ACC DATA copy(xbar,bbar,b) copyin(xx,bb)
      do k = 2,m
         call add2s2_acc_async(xbar,xx(1,k),alpha_k,n,1)
         call add2s2_acc_async(bbar,bb(1,k),alpha_k,n,2)
         call add2s2_acc_async(b,bb(1,k),-alpha_k,n,3)
         enddo
!$ACC WAIT
!$acc END DATA

       print *, xbar(1),bbar(1),b(1)
       deallocate(xbar)
       deallocate(bbar)
       deallocate(b)
       deallocate(xx)
       deallocate(bb)

      end program

% nvfortran add2s2_acc.f -Minfo=accel -acc -V21.7 ; a.out
add2s2_acc_async:
      6, Generating present(a(:),b(:))
         Generating Tesla code
          7, !$acc loop gang, vector(128) ! blockidx%x threadidx%x
add2s2_omp:
     36, Generating copy(bbar(:)) [if not already present]
         Generating copyin(bb(:,:),xx(:,:)) [if not already present]
         Generating copy(xbar(:),b(:)) [if not already present]
    3.900000        3.880000        1.620000

unrue · August 17, 2021, 6:58am

Hi Mat,

thanks for very clear explanation. About grid size, I think is strictly depends form the compiler and form the GPU you are using, because From Nsight I used such grid. The problem of such code is I have multiple parts with loops like that and takes the major amount of total execution, so my original idea was to overlap as much possible kernels, because I cannot split these loops to overlap something. I tried your code, bur From Nsight I have a strange behaviour. Apparentely, The code does only memcpy. Attached the screenshot.

MatColgrove · August 17, 2021, 5:54pm

Which compiler version are you using? Is the binary running successfully?

The code I posted will error with 21.1 and 21.3 (segv) but will work with 21.5 and 21.7. Given you have an older Nsight-Systems version, I’m going to assume you have an older compiler version as well and the binary is crashing hence why the kernel is not appearing in the timeline.

unrue · August 18, 2021, 6:35am

I’m using NVHPC 21.5
Yes, the code crashed:

FATAL ERROR: variable in data clause is partially present on the device: name=b

I didn’t realize it. Why with some compilers works and other crashes?

Anyway,my original idea to overlap multiple target region, is started reading the OpenMP manual version 4.5. At page 135, it does an example of target region in a loop, very similar to my case. And from the manual I read:

The following example shows how the task and target constructs are used to execute multiple
target regions asynchronously. The task that encounters the task construct generates an
explicit task that contains a target region. The thread executing the explicit task encounters a
task scheduling point while waiting for the execution of the target region to complete, allowing
the thread to switch back to the execution of the encountering task or one of the previously
generated explicit tasks.

But if each target region run on same stream, how can be run in asynchronous manner? Attached a screenshot of such page. This point in not totally crear from myside.

MatColgrove · August 18, 2021, 5:53pm

Are you using the example I posted? This occurs with your original example because on how you’ve declared your arrays and passing them in. Basically the array size is wrong so when you pass in the pointer, the size of the data on the device doesn’t match.

Anyway,my original idea to overlap multiple target region, is started reading the OpenMP manual version 4.5. At page 135, it does an example of target region in a loop, very similar to my case. And from the manual I read:

I’ve not looked at this example before, but there’s multiple tasks each executing a target region. Each task would be blocked since there’s no “nowait”, but would at least launch each target region concurrently. There’s nothing here to indicate using different CUDA streams.

This example would be more what I’d expect: https://github.com/OpenMP/Examples/blob/main/devices/sources/async_target.4.c

Since we are still working on supporting the “depend” clause so multiple streams wont be used until we add this.

Though my original assertion that executing concurrent kernels is not useful for most algorithms (better to have each kernel fully utilize the GPU). If you really do think you need this, I’d use OpenACC.

-Mat

unrue · August 19, 2021, 11:06am

Ok, understood. Just to expand my example. This is a snippet from my code that is quite big:

!$OMP PARALLEL
!$OMP MASTER
      do k = 2,m
         call add2s2_omp(xbar,xx(1,k),alpha(k),n)
         call add2s2_omp(bbar,bb(1,k),alpha(k),n)
         call add2s2_omp(b,bb(1,k),-alpha(k),n)
      enddo
!$OMP END MASTER

      !Second round of CGS
      do k = 1, m
            alpha(k) = vlsc3_omp_cpu(xx(1,k),w,b,n)
      enddo

!$OMP END PARALLEL

Where vlsc3_omp_cpu is:

      function vlsc3_omp_cpu(x,y,b,n)
      dimension x(n),y(n),b(n)
      real dt

      dt = 0.0
!$OMP DO REDUCTION(+:dt)
      do i=1,n
         dt = dt+x(i)*y(i)*b(i)
      enddo
!$OMP END DO

      vlsc3_omp_cpu = dt

      return
      end

Now, since add2s2_omp has nowait inside, should be run on GPU while vlsc3_omp_cpu run on CPU right? Because it appears to run ever sequentially.

MatColgrove · August 19, 2021, 11:29pm

should be run on GPU while vlsc3_omp_cpu run on CPU right?

I believe so, but isn’t there a dependency on “b”? It’s computed in the call to add2s2_omp but used in vlsc3_omp_cpu.

Overall, this isn’t a strategy I typically like to take. Many users think that they need to maximize utilization of both the GPU and CPU but in this case you may be just making things slower if you end up needing to synchronize the host and device data more often. For example updating “alpha” on the host means that you’ll need to update it the device copy of the data if you enter this routine again.

Also there’s often load-balancing issues where performance bottle-necks on the CPUs. I find it better to offload as much of the compute as you can and limit data movement.

Granted I don’t know your full code or your algorithms, so I’m not saying you can’t get this working or that you shouldn’t, just in my experience this type of strategy of mixing parallelization on the device and CPU just makes things more complex without much or any gain in performance.

unrue · August 20, 2021, 9:23am

Many users think that they need to maximize utilization of both the GPU and CPU

I’m one ot them. I mean, when GPU is working, CPUs does nothing. Why don’t try to assign some work also on CPUs? In this case I have 48 cores that does nothing and it is a waste of resources in my opinion. Otherwhise, if compute node has only 4 cores, for example, probably it is not convenient to use such strategy because the performance improvment from cores is very limited compared to GPUs.

For example updating “alpha” on the host means that you’ll need to update it the device copy of the data if you enter this routine again.

Alpha is just a scalar (alpha(k)). The copy of a scalar can impact over performance?

Also there’s often load-balancing issues where performance bottle-necks on the CPUs. I find it better to offload as much of the compute as you can and limit data movement.

In such code all can be offloaded is offloaded. But I’m trying to improve further. First attempt was to overlap work on CPU and GPU, but maybe I have to focus my work on copies from/to GPU. I’m not sure there are fully optimized. Some advice? The code is very huge.

MatColgrove · August 20, 2021, 4:24pm

If there are no dependencies between the host and device, or limited data movement, sure this will work fine. Though the complexity and performance loss occurs when having to manage the two discrete memories. Even worse is trying to manage multiple CPU threads memories and/or multiple GPUs. This style of programming can lead to dramatic increase in data movement offsetting any gains from using the CPU for compute.

Alpha is just a scalar (alpha(k)). The copy of a scalar can impact over performance?

Fair enough. Alpha is an array, but you’re correct only the scalar alpha(k) is being used on the device. It will be implicitly copied to the device, most likely as an argument to the generated CUDA kernel. I was just using it as an example. Perhaps a better example would be “b”. You fill it on device and then consume it on the host. Besides needing a barrier after the end master, you’ll also need to copy b back to the host.

First attempt was to overlap work on CPU and GPU, but maybe I have to focus my work on copies from/to GPU. I’m not sure there are fully optimized. Some advice?

I don’t want to say that overlapping compute on the CPU and GPU can’t be done in a performant manner, I just haven’t seen it.

My typical strategy is to offload as much compute to the GPU as possible (even in some small sequential sections when the cost to move the data outweighs the cost of running serially on the device), and minimize data movement After that use the profilers, Nsight-System and Nsight-Compute, Nsight-System to identify data movement and CPU bottlenecks, Nsight-Compute to understand the performance of the individual kernels.

You may also consider using the “target teams loop” construct rather than “target teams distribute parallel do”. “loop” is more like OpenACC and allows the compiler more opportunity to optimize. With “distribute”, the compiler must outline the region and pass it to the runtime. Granted “loop” is more restrictive and support by other compilers is limited, but something you should look at using.

unrue · August 23, 2021, 8:44am

Hi Mat,

thanks for the reply. I already use TEAMS LOOP in my code. It was first optimization I did, having 10% of performance improvement.

From Nsight-System, I see the computation and data movement are not overlapped. So I think this is bigger problem of such code. Do you confirm from the screenshot?

Another problem is that some kernels has a GPU theoretical occupancy of only 25%, so I Have to analyze it using Nsight-Compute

screenshot|690x350

MatColgrove · August 23, 2021, 4:37pm

So I think this is bigger problem of such code. Do you confirm from the screenshot?

In general, you’d want to eliminate as much data movement as you can. What’s going on the host that requires the data to be copied back and forth? Are there additional sections of code that offloaded so the data movement is not necessary?

I see the computation and data movement are not overlapped.

This may or may not be an issue. Interleaving data movement and compute only works if there’s no data dependencies. Typically done when batching so one batch is computing while the next is sending it’s data to the device.

Again, I’d first try to remove data movement, but then if there’s not a data dependency, try to overlap compute and data movement by adding “nowait” to the compute regions and update directives.

-Mat