Code execution depends strangely on irrelevant parameters

Hi,

This is probably something stupid, but I’m at my wit’s end here. My code doesn’t seem to be executing properly. I’ve stripped the problem down to its simplest form (obviously I’ve changed variable names and initialized the variables to silly values, since the work is proprietary).

program Mat_test

  implicit none
  integer :: i, j, endn
  integer :: inds(260)
  real*8  :: A(260, 260, 89), B(260, 89), C(260, 89, 89), D(260, 260)
  real*8  :: test(89), test1(89), test2(89)
  

  do i = 1, 89
     A(:,:,i) = i*2.5
     B(:,i) = i * 3.5
     do j = 1, 89
        C(:,i,i) = i*j*0.01
     end do
  end do

  endn = 250

!$acc kernels
!$acc loop independent private(D)
  do  i = 1, 2
     
     do  j = 1, 260
        A(:,j, i) = A(:,j,i) / B(:, i)
     enddo
     
     
     do  j = 1, endn
        D(:,j) = - A(:,j,i) * C(j,i,i)
        D(j,j) = D(j,j) + 1.0         
     enddo
     test(i) = D(5,5)
     test1(i) = - A(5,5,i)
     test2(i) = C(5,i,i)
     
     
  enddo
!$acc end kernels

  print *, test(1), test1(1), test2(1)
  print *, test(2), test1(2), test2(2)
  print *, A(5,5,1)
  
end program Mat_test

In my full version, D is used later in the loop body, but it is unique to each iteration of i.

The problem is that the value of “test(2)” depends on “endn”. For endn = 10 it gives the correct answer, while for endn = 250 it doesn’t. In principle the loop should run all the way to 260, but that’s causing me problems.

What am I missing here?
-Rob

Hi Rob,

It looks like we might be privatizing D for each vector thread, not for each gang, when using the kernels region. This causes the program to use too much global data.
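
To put rough numbers on it (my back-of-the-envelope arithmetic, not compiler output): D is 260x260 real*8, so each copy is 260*260*8 = 540,800 bytes, about 0.5 MB. One private copy per gang (2 gangs here) costs about 1 MB, but one copy per vector thread (2 gangs x 256 threads) would need roughly 270 MB of global memory.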

The workaround is to use the “parallel” directive instead.

% cat mat_test.F90
program Mat_test

   implicit none
   integer :: i, j, endn
   integer :: inds(260)
   real*8  :: A(260, 260, 89), B(260, 89), C(260, 89, 89), D(260, 260)
   real*8  :: test(89), test1(89), test2(89)


   do i = 1, 89
      A(:,:,i) = i*2.5
      B(:,i) = i * 3.5
      do j = 1, 89
         C(:,i,i) = i*j*0.01
      end do
   end do

   endn = 250

 !$acc parallel loop gang private(D)
   do  i = 1, 2

!$acc loop vector
      do  j = 1, 260
         A(:,j, i) = A(:,j,i) / B(:, i)
      enddo

!$acc loop vector
      do  j = 1, endn
         D(:,j) = - A(:,j,i) * C(j,i,i)
         D(j,j) = D(j,j) + 1.0
      enddo
      test(i) = D(5,5)
      test1(i) = - A(5,5,i)
      test2(i) = C(5,i,i)


   enddo
 !$acc end parallel loop

   print *, test(1), test1(1), test2(1)
   print *, test(2), test1(2), test2(2)
   print *, A(5,5,1)

 end program Mat_test
% pgf90 -acc -Minfo=accel mat_test.F90
mat_test:
     20, Accelerator kernel generated
         21, !$acc loop gang ! blockidx%x
         24, !$acc loop vector(256) ! threadidx%x
         29, !$acc loop vector(256) ! threadidx%x
     20, Generating present_or_copyout(test2(:2))
         Generating present_or_copyout(test1(:2))
         Generating present_or_copyout(test(:2))
         Generating present_or_copy(a(:,:,:2))
         Generating present_or_copyin(b(:,:2))
         Generating present_or_copyin(c(:250,:2,:2))
         Generating NVIDIA code
         Generating compute capability 1.3 binary
         Generating compute capability 2.0 binary
         Generating compute capability 3.0 binary
     24, Loop is parallelizable
     25, Loop is parallelizable
     29, Loop is parallelizable
     30, Loop is parallelizable
% a.out
   0.3642857245036534       -0.7142857142857143        0.8899999856948853
  -0.2714285509926933       -0.7142857142857143         1.779999971389771
   0.7142857142857143

Hope this helps,
Mat

Still no dice. I used the same directives as you did, and for endn = 250 I still don’t get the right result. I’ve also included some other info below in case it helps.

crw8398 ~/codes/sandbox> pgf90 -acc -Minfo=accel -fast -o Mat_test Mat_test.f90
mat_test:
     20, Accelerator kernel generated
         21, !$acc loop gang ! blockidx%x
         24, !$acc loop vector(256) ! threadidx%x
         30, !$acc loop vector(256) ! threadidx%x
     20, Generating present_or_copyout(test2(:2))
         Generating present_or_copyout(test1(:2))
         Generating present_or_copyout(test(:2))
         Generating present_or_copy(a(:,:,:2))
         Generating present_or_copyin(b(:,:2))
         Generating present_or_copyin(c(:250,:2,:2))
         Generating NVIDIA code
         Generating compute capability 1.3 binary
         Generating compute capability 2.0 binary
         Generating compute capability 3.0 binary
     24, Loop is parallelizable
     25, Loop is parallelizable
     30, Loop is parallelizable
     31, Loop is parallelizable
crw8398 ~/codes/sandbox> Mat_test
launch CUDA kernel  file=/home/wiersmar/codes/sandbox/Mat_test.f90 function=mat_test line=20 device=0 grid=2 block=256
  -0.2714285509926933       -0.7142857142857143        0.8899999856948853
  -0.2714285509926933       -0.7142857142857143         1.779999971389771
   0.7142857142857143

Accelerator Kernel Timing data
/home/wiersmar/codes/sandbox/Mat_test.f90
  mat_test  NVIDIA  devicenum=0
    time(us): 2,205
    20: compute region reached 1 time
        20: data copyin reached 4 times
             device time(us): total=407 max=367 min=10 avg=101
        20: kernel launched 1 time
            grid: [2]  block: [256]
             device time(us): total=1,435 max=1,435 min=1,435 avg=1,435
            elapsed time(us): total=1,452 max=1,452 min=1,452 avg=1,452
        42: data copyout reached 4 times
             device time(us): total=363 max=334 min=9 avg=90
crw8398 ~/codes/sandbox> pgaccelinfo
CUDA Driver Version:           5050
NVRM version:                  NVIDIA UNIX x86_64 Kernel Module  319.49  Tue Aug 13 21:15:53 PDT 2013

Device Number:                 0
Device Name:                   Tesla C2075
Device Revision Number:        2.0
Global Memory Size:            5636554752
Number of Multiprocessors:     14
Number of Cores:               448
Concurrent Copy and Execution: Yes
Total Constant Memory:         65536
Total Shared Memory per Block: 49152
Registers per Block:           32768
Warp Size:                     32
Maximum Threads per Block:     1024
Maximum Block Dimensions:      1024, 1024, 64
Maximum Grid Dimensions:       65535 x 65535 x 65535
Maximum Memory Pitch:          2147483647B
Texture Alignment:             512B
Clock Rate:                    1147 MHz
Execution Timeout:             No
Integrated Device:             No
Can Map Host Memory:           Yes
Compute Mode:                  default
Concurrent Kernels:            Yes
ECC Enabled:                   Yes
Memory Clock Rate:             1566 MHz
Memory Bus Width:              384 bits
L2 Cache Size:                 786432 bytes
Max Threads Per SMP:           1536
Async Engines:                 2
Unified Addressing:            Yes
Initialization time:           79292 microseconds
Current free memory:           5569896448
Upload time (4MB):             2613 microseconds (1410 ms pinned)
Download time:                 2322 microseconds (1270 ms pinned)
Upload bandwidth:              1605 MB/sec (2974 MB/sec pinned)
Download bandwidth:            1806 MB/sec (3302 MB/sec pinned)
PGI Compiler Option:           -ta=nvidia,cc20

Device Number:                 1
Device Name:                   Quadro 600
Device Revision Number:        2.1
Global Memory Size:            1072889856
Number of Multiprocessors:     2
Number of Cores:               64
Concurrent Copy and Execution: Yes
Total Constant Memory:         65536
Total Shared Memory per Block: 49152
Registers per Block:           32768
Warp Size:                     32
Maximum Threads per Block:     1024
Maximum Block Dimensions:      1024, 1024, 64
Maximum Grid Dimensions:       65535 x 65535 x 65535
Maximum Memory Pitch:          2147483647B
Texture Alignment:             512B
Clock Rate:                    1280 MHz
Execution Timeout:             Yes
Integrated Device:             No
Can Map Host Memory:           Yes
Compute Mode:                  default
Concurrent Kernels:            Yes
ECC Enabled:                   No
Memory Clock Rate:             800 MHz
Memory Bus Width:              128 bits
L2 Cache Size:                 131072 bytes
Max Threads Per SMP:           1536
Async Engines:                 1
Unified Addressing:            Yes
Initialization time:           79292 microseconds
Current free memory:           577249280
Upload time (4MB):             1545 microseconds ( 897 ms pinned)
Download time:                 2280 microseconds (1989 ms pinned)
Upload bandwidth:              2714 MB/sec (4675 MB/sec pinned)
Download bandwidth:            1839 MB/sec (2108 MB/sec pinned)
PGI Compiler Option:           -ta=nvidia,cc20
crw8398 ~/codes/sandbox>

Thanks,
Rob

What’s the right result? I get the same output with and without directives.

% pgf90 mat_test.F90
% a.out
   0.3642857245036534       -0.7142857142857143        0.8899999856948853
  -0.2714285509926933       -0.7142857142857143         1.779999971389771
   0.7142857142857143
% pgf90 mat_test.F90 -acc
% a.out
   0.3642857245036534       -0.7142857142857143        0.8899999856948853
  -0.2714285509926933       -0.7142857142857143         1.779999971389771
   0.7142857142857143


% gfortran mat_test.F90
% a.out
  0.36428572450365337      -0.71428571428571430       0.88999998569488525
 -0.27142855099269325      -0.71428571428571430        1.7799999713897705
  0.71428571428571430
% ifort mat_test.F90; a.out
  0.364285724503653      -0.714285714285714       0.889999985694885
 -0.271428550992693      -0.714285714285714        1.77999997138977
  0.714285714285714

You’ve got the right result. I don’t, for some reason (although I do without directives) :(.

Look at the first column in my output:

-0.2714285509926933 -0.7142857142857143 0.8899999856948853
-0.2714285509926933 -0.7142857142857143 1.779999971389771
0.7142857142857143

Hmm, it may be because I’m using the 13.10 pre-release compiler. With 13.9, I see it failing intermittently.

13.10 should be out very soon.

- Mat

Thanks Mat.

A few further questions then:

  • How long is the “very soon” for 13.10? Are we talking days, weeks, or months?
  • Is the choice to use the ‘parallel’ directive rather than the ‘kernels’ directive one that should have been obvious to me? If not, does that mean there is some degree of trial and error to using OpenACC?
  • I’m surprised this wasn’t noticed before … Do people normally use mainly scalars and small arrays as loop-dependent quantities?

Rob

How long is the “very soon” for 13.10? Are we talking days, weeks, or months?

Days. Hopefully this week.

Is the choice to use the ‘parallel’ directive rather than the ‘kernels’ directive one that should have been obvious to me? If not, does that mean there is some degree of trial and error to using OpenACC?

The general rule of thumb is to use “kernels” when you have tightly nested loops, many loops, or Fortran array syntax. “parallel” is best when the loops are not tightly nested (as in your case) or when you want more explicit control over the loop scheduling.
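
For example, here’s a minimal toy sketch of the two styles (hypothetical array X and sizes, not your code):

program rule_of_thumb
   implicit none
   integer :: i, j
   real*8  :: X(100, 100)

   X = 1.0d0

! "kernels": the compiler analyzes the region and chooses the schedule;
! good for tightly nested loops and array syntax
!$acc kernels
   do j = 1, 100
      do i = 1, 100
         X(i,j) = X(i,j) * 2.0d0
      end do
   end do
!$acc end kernels

! "parallel": you state the mapping explicitly; good when the loops are
! not tightly nested or you want control over the schedule
!$acc parallel loop gang
   do j = 1, 100
!$acc loop vector
      do i = 1, 100
         X(i,j) = X(i,j) * 2.0d0
      end do
   end do

   print *, X(1,1)
end program rule_of_thumb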

Here, the issue is the level, “gang” or “vector”, at which the “private” array is created. Because you want to privatize a large 260x260 array, it’s simply too big for every vector thread to have its own copy. Hence, you need to make sure the clause is applied to the “parallel” directive, which defaults to giving each “gang” a private copy that is shared amongst its “vectors”. Alternatively, applying it to a “!$acc loop gang private” directive should have the same effect, but it looks like we’re applying the “private” at the “vector” level when “kernels” is used. I’m investigating and will submit a problem report.
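
To make the placement concrete, here’s a minimal sketch of the gang-level spelling (same shape of D as yours, with a trivial loop body; per the spec, “kernels” with “!$acc loop gang private(D)” should behave the same way, but see the suspected bug above):

program private_sketch
   implicit none
   integer :: i, j
   real*8  :: D(260, 260), test(2)

! private(D) on the "parallel" construct: one copy of D per gang,
! shared by that gang's vector threads
!$acc parallel loop gang private(D)
   do i = 1, 2
!$acc loop vector
      do j = 1, 260
         D(:,j) = i * 1.0d0
      end do
      test(i) = D(5,5)
   end do

   print *, test
end program private_sketch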

I’m surprised this wasn’t noticed before … Do people normally use mainly scalars and small arrays as loop-dependent quantities?

Scalars are private by default. I do see a few codes that use small private arrays, but I wouldn’t classify a 260x260 array as small, given that its footprint gets multiplied by the number of vectors or gangs.
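
For instance (a minimal sketch; tmp is a hypothetical scalar temporary):

program scalar_private
   implicit none
   integer :: i
   real*8  :: tmp, A(100), B(100)

   A = 1.0d0

! tmp is a scalar, so each thread automatically gets its own copy;
! no private clause is needed
!$acc parallel loop
   do i = 1, 100
      tmp = 2.0d0 * A(i)
      B(i) = tmp + 1.0d0
   end do

   print *, B(1)
end program scalar_private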

- Mat

Gotcha. I eagerly await 13.10!

I recognize that it’s probably a large array. I was supposing that other people must use only small private arrays, since this apparently hadn’t been noticed before. Basically, I’m trying to figure out whether I’m doing something out of the ordinary, since that usually increases the number of issues I expect to run into.

Rob