Calling function with target region inside from task

Dear Nvidia developers,

I’m trying to launch a function containing a target region from a task, but the application gets a SIGSEGV. This is the function with the target region inside:

subroutine add2s2_omp(a,b,c1,n)
      real a(n),b(n)

!$OMP TARGET TEAMS LOOP 
      do i=1,n
        a(i)=a(i)+c1*b(i)
      enddo
      return
end

And I call it like this:

!$OMP TASK
      call add2s2_omp(b,bb(1,1),-alpha(1),n)
!$OMP END TASK
!$OMP TASKWAIT

The application gets:

[jwb0033:3448 :0:3448] Caught signal 11 (Segmentation fault: address not mapped to object at address (nil))
nek5000: malloc.c:4048: _int_malloc: Assertion `(unsigned long) (size) >= (unsigned long) (nb)' failed.
[jwb0033:3450 :0:3450] Caught signal 11 (Segmentation fault: Sent by the kernel at address (nil))
[jwb0033:3446 :0:3446] Caught signal 11 (Segmentation fault: Sent by the kernel at address (nil))
nek5000: malloc.c:4048: _int_malloc: Assertion `(unsigned long) (size) >= (unsigned long) (nb)' failed.

Is it possible to call a routine containing a target region from a task? I’m using NVHPC/21.5-GCC-10.3.0 and ParaStation MPI. Thanks.

Hi unrue,

Yes, it should be possible. We created a simple test code based on what you posted, and it worked correctly for us.

Can you please provide a reproducing example so we can investigate?

Thanks,
Mat

I don’t think so; this code is part of a huge code base (nek5000). I don’t know how to isolate just that portion.

Ok, but without a reproducing example we won’t be able to help much in determining the issue.

Hi,

I prepared a little test with the same dimensions used in the nek5000 test. The application does not get a segfault, but using a task the output array is not modified. Without the task, the output is modified correctly. Maybe this could be part of the problem?

I compiled as:

mpif90 -O2 -Mipa=acc -mp=gpu -Mcuda=cc80 -mcmodel=medium -Minfo=all -Mpreprocess -r8 task_test.f -o task_test

task_test.f (813 Bytes)

A task needs to be within a parallel region, and the code works correctly when I add one. While I’m not sure if this is the problem in the full code, it’s what’s wrong here.

% cat task_test.f
        program task_test
          implicit none
          real, dimension(:), allocatable :: b
          real, dimension(:, :), allocatable :: bb
          real alpha
          integer i, n, m

          n = 4669440
          m = 1
          alpha = 1.3

          allocate(b(n))
          allocate(bb(n,m))

          do i=1, n
              b(i) = 1.1
              bb(i,1) = 1.2
          end do

!$OMP PARALLEL
!$OMP SINGLE
!$OMP TASK
      call add2s2_omp(b,bb(1,1),alpha,n)
!$OMP END TASK
!$OMP END SINGLE
!$OMP TASKWAIT
!$OMP END PARALLEL

        do i=1, 10
          write(*,*)  b(i)
        end do

        deallocate(b)
        deallocate(bb)

        end program

      subroutine add2s2_omp(a,b,c1,n)
        real a(n),b(n)
!$OMP TARGET TEAMS LOOP
        do i=1,n
          a(i)=a(i)+c1*b(i)
        enddo
!$OMP END TARGET TEAMS LOOP
        return
      end

% nvfortran -mp -acc task_test.f -Minfo=accel; a.out
add2s2_omp:
     40, Generating implicit map(tofrom:b(:),a(:))
    2.660000
    2.660000
    2.660000
    2.660000
    2.660000
    2.660000
    2.660000
    2.660000
    2.660000
    2.660000

Hi Mat, yes, I totally agree. Now the application works with no errors. But another question: using this approach in a loop, performance is quite bad:

!$OMP PARALLEL
      do k = 2,m
!$OMP TASK      
         call add2s2_omp(xbar,xx(1,k),alpha(k),n)
!$OMP END TASK
!$OMP TASK 
         call add2s2_omp(bbar,bb(1,k),alpha(k),n)
!$OMP END TASK
!$OMP TASK 
         call add2s2_omp(b,bb(1,k),-alpha(k),n)
!$OMP END TASK
!$OMP TASKWAIT
      enddo
!$OMP END PARALLEL

Am I doing something wrong? Thanks.

I rarely use tasks myself, so I may not be of much help, but don’t you need a “single” region so each thread doesn’t spawn every task? I don’t know if this would fix the performance issue, but you’re generating more tasks than needed here.
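For reference, a minimal sketch of that suggestion, using the variable names from your posted loop (the helper `add2s2_omp` is the subroutine defined earlier in this thread). With a SINGLE region, one thread generates the tasks and the others pick them up:

```fortran
!$OMP PARALLEL
!$OMP SINGLE
      do k = 2,m
!$OMP TASK
         call add2s2_omp(xbar,xx(1,k),alpha(k),n)
!$OMP END TASK
!$OMP TASK
         call add2s2_omp(bbar,bb(1,k),alpha(k),n)
!$OMP END TASK
!$OMP TASK
         call add2s2_omp(b,bb(1,k),-alpha(k),n)
!$OMP END TASK
!$OMP TASKWAIT
      enddo
!$OMP END SINGLE
!$OMP END PARALLEL
```

Note this only removes the duplicated task generation; since all three tasks offload target regions to the same GPU, they may still serialize on the device, so don’t expect a 3x speedup from tasking alone.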

I tried as you suggested; same performance :/