Accelerator strange behaviour

Hello,
I have a problem accelerating a code. I managed to reproduce the strange behaviour with this small program:

program strange_behaviour
implicit none
integer, parameter :: nx=4,ny=4,nz=2
integer :: i,j,k
real, dimension(nx,ny,nz) :: a,b

a=10
call random_number(b)

!$acc kernels
do j=1,ny
do i=1,nx
!$acc do seq
   do k=1,nz
      a(i,j,k) = 3.
   enddo

!  IT DOES NOT WORK ON GPU
   b(i,j,1) = a(i,j,1)

!  WORKAROUND
!!$acc do seq
!   do k=1,1
!   b(i,j,k) = a(i,j,k)
!   enddo

enddo
enddo
!$acc end kernels

print*,'b: ',b
end program strange_behaviour

I can fix the problem using the workaround shown in the comments, but I would very much like to avoid it if possible. Is there anything I do not understand? I am using PGI 12.6.

Thanks for any help,
Francesco

Hi Francesco,

Can you please give more details on the problem you are seeing? When I run your code with and without -acc, I get the same answers.

% pgf90 strange.f90 -Minfo -V12.6
% a.out
 b:     3.000000        3.000000        3.000000        3.000000     
    3.000000        3.000000        3.000000        3.000000     
    3.000000        3.000000        3.000000        3.000000     
    3.000000        3.000000        3.000000        3.000000     
   0.6280164       0.6701866       0.6281718       0.5310344     
   0.5005002       0.4253310       0.3070166       0.2169546     
   0.6901000       0.8211479       0.8735071       0.9649668     
   0.8245004       0.4523637       0.2586277       0.3373762    
% pgf90 strange.f90 -Minfo -V12.6 -acc
strange_behaviour:
      7, Memory set idiom, array assignment replaced by call to pgf90_mset4
     10, Generating copyout(b(:,:,:1))
         Generating copyout(a(:,:,:))
         Generating compute capability 1.0 binary
         Generating compute capability 2.0 binary
     11, Loop is parallelizable
     12, Loop is parallelizable
         Accelerator kernel generated
         11, !$acc loop gang ! blockidx%y
         12, !$acc loop gang, vector(32) ! blockidx%x threadidx%x
             CC 1.0 : 10 registers; 32 shared, 4 constant, 0 local memory bytes
             CC 2.0 : 13 registers; 0 shared, 48 constant, 0 local memory bytes
     14, Loop is parallelizable
% a.out
 b:     3.000000        3.000000        3.000000        3.000000     
    3.000000        3.000000        3.000000        3.000000     
    3.000000        3.000000        3.000000        3.000000     
    3.000000        3.000000        3.000000        3.000000     
   0.6280164       0.6701866       0.6281718       0.5310344     
   0.5005002       0.4253310       0.3070166       0.2169546     
   0.6901000       0.8211479       0.8735071       0.9649668     
   0.8245004       0.4523637       0.2586277       0.3373762
  • Mat

Hi Mat,

I ran exactly the same commands as you, but my results are not correct. My device is a Tesla S2050, and unfortunately I cannot easily test on any other device.

% pgf90 strange.f90 -Minfo -V12.6
ella008:~/w/COSMO/CONVSTN_LUGLIO/TEST/PGI_BUG>./a.out 
 b:     3.000000        3.000000        3.000000        3.000000     
    3.000000        3.000000        3.000000        3.000000     
    3.000000        3.000000        3.000000        3.000000     
    3.000000        3.000000        3.000000        3.000000     
   0.6280164       0.6701866       0.6281718       0.5310344     
   0.5005002       0.4253310       0.3070166       0.2169546     
   0.6901000       0.8211479       0.8735071       0.9649668     
   0.8245004       0.4523637       0.2586277       0.3373762    
% pgf90 strange.f90 -Minfo -V12.6 -acc 
strange_behaviour:
      7, Memory set idiom, array assignment replaced by call to pgf90_mset4
     10, Generating copyout(b(:,:,:1))
         Generating copyout(a(:,:,:))
         Generating compute capability 1.0 binary
         Generating compute capability 2.0 binary
     11, Loop is parallelizable
     12, Loop is parallelizable
         Accelerator kernel generated
         11, !$acc loop gang ! blockidx%y
         12, !$acc loop gang, vector(32) ! blockidx%x threadidx%x
             CC 1.0 : 10 registers; 32 shared, 4 constant, 0 local memory bytes
             CC 2.0 : 13 registers; 0 shared, 48 constant, 0 local memory bytes
     14, Loop is parallelizable

%./a.out 
 b:   -1.9983972E+18  -1.9983972E+18  -1.9983972E+18  -1.9983972E+18 
  -1.9983972E+18  -1.9983972E+18  -1.9983972E+18  -1.9983972E+18 
  -1.9983972E+18  -1.9983972E+18  -1.9983972E+18  -1.9983972E+18 
  -1.9983972E+18  -1.9983972E+18  -1.9983972E+18  -1.9983972E+18 
   0.6280164       0.6701866       0.6281718       0.5310344     
   0.5005002       0.4253310       0.3070166       0.2169546     
   0.6901000       0.8211479       0.8735071       0.9649668     
   0.8245004       0.4523637       0.2586277       0.3373762

The code works using PGI 12.5.

thanks
Francesco

Hi Francesco,

Which CUDA driver version do you have (see the output from pgaccelinfo)? In 12.6 we switched to using CUDA 4.2 by default, so if your driver is old it may cause problems.

You can also try switching back to CUDA 4.0 by compiling with "-ta=nvidia,cuda4.0".

  • Mat

Hi Mat,

Thanks for your attention.

Our CUDA driver is 4.1. However, when I compile with cuda4.0 or cuda4.1, the error is still the same. Everything always works fine with PGI V12.5.

Something (maybe similar) happens in the large code I am working on at the moment.

Francesco

Hi Francesco,

After moving to a different system, I'm now able to recreate the error. Interestingly, the failure is intermittent, suggesting a synchronisation issue within the kernel. I have sent a report to our engineers (TPR#18854). I'll continue to look for a workaround.

  • Mat
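
Because the failure is intermittent, a single run may not show it. The following is a minimal sketch, not part of the original report, that repeats Francesco's reproducer several times and counts the trials in which b(:,:,1) differs from 3. The program name, ntrials and nbad are hypothetical additions for illustration. The directive is written as "!$acc loop seq", the OpenACC spelling, on the assumption that it behaves like the "!$acc do seq" in the original post; whether this variant triggers the same failure on an affected system has not been verified.

```fortran
program check_strange_behaviour
implicit none
integer, parameter :: nx=4,ny=4,nz=2
integer, parameter :: ntrials=20   ! hypothetical repeat count, not from the original post
integer :: i,j,k,it,nbad
real, dimension(nx,ny,nz) :: a,b

nbad = 0
do it=1,ntrials
   ! same initial state as the original reproducer
   a=10
   call random_number(b)

!$acc kernels
   do j=1,ny
   do i=1,nx
!$acc loop seq
      do k=1,nz
         a(i,j,k) = 3.
      enddo
      ! the statement reported to give wrong results on the GPU
      b(i,j,1) = a(i,j,1)
   enddo
   enddo
!$acc end kernels

   ! every element of the first k-plane of b should now be exactly 3.0
   if (any(b(:,:,1) /= 3.)) nbad = nbad + 1
enddo

print*,'failing trials: ',nbad,' of ',ntrials
end program check_strange_behaviour
```

Compiled without -acc, the directives are treated as comments and the count should be zero. A non-zero count when the same source is compiled with -acc would point to the problem discussed in this thread.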

TPR 18854 has been corrected in the 13.10 release.

Thanks for the report.

dave