Accelerator strange behaviour

franzisko · July 31, 2012, 9:04am

Hello,
I have a problem accelerating a code. I managed to reproduce the strange behaviour with this small program:

program strange_behaviour
  implicit none
  integer, parameter :: nx=4,ny=4,nz=2
  integer :: i,j,k
  real, dimension(nx,ny,nz) :: a,b

  a=10
  call random_number(b)

  !$acc kernels
  do j=1,ny
    do i=1,nx
      !$acc do seq
      do k=1,nz
        a(i,j,k) = 3.
      enddo

      ! IT DOES NOT WORK ON GPU
      b(i,j,1) = a(i,j,1)

      ! WORKAROUND
      !!$acc do seq
      ! do k=1,1
      !   b(i,j,k) = a(i,j,k)
      ! enddo

    enddo
  enddo
  !$acc end kernels

  print*,'b: ',b
end program strange_behaviour

I can fix the problem with the workaround shown in the code (commented out), but I would very much like to avoid it if possible. Is there something I am misunderstanding? I am using PGI 12.6.

Thanks for any help,
Francesco

MatColgrove · July 31, 2012, 11:51pm

Hi Francesco,

Can you please give more details on the problem you are seeing? When I run your code with and without -acc, I get the same answers.

% pgf90 strange.f90 -Minfo -V12.6
% a.out
 b:     3.000000        3.000000        3.000000        3.000000     
    3.000000        3.000000        3.000000        3.000000     
    3.000000        3.000000        3.000000        3.000000     
    3.000000        3.000000        3.000000        3.000000     
   0.6280164       0.6701866       0.6281718       0.5310344     
   0.5005002       0.4253310       0.3070166       0.2169546     
   0.6901000       0.8211479       0.8735071       0.9649668     
   0.8245004       0.4523637       0.2586277       0.3373762    
% pgf90 strange.f90 -Minfo -V12.6 -acc
strange_behaviour:
      7, Memory set idiom, array assignment replaced by call to pgf90_mset4
     10, Generating copyout(b(:,:,:1))
         Generating copyout(a(:,:,:))
         Generating compute capability 1.0 binary
         Generating compute capability 2.0 binary
     11, Loop is parallelizable
     12, Loop is parallelizable
         Accelerator kernel generated
         11, !$acc loop gang ! blockidx%y
         12, !$acc loop gang, vector(32) ! blockidx%x threadidx%x
             CC 1.0 : 10 registers; 32 shared, 4 constant, 0 local memory bytes
             CC 2.0 : 13 registers; 0 shared, 48 constant, 0 local memory bytes
     14, Loop is parallelizable
% a.out
 b:     3.000000        3.000000        3.000000        3.000000     
    3.000000        3.000000        3.000000        3.000000     
    3.000000        3.000000        3.000000        3.000000     
    3.000000        3.000000        3.000000        3.000000     
   0.6280164       0.6701866       0.6281718       0.5310344     
   0.5005002       0.4253310       0.3070166       0.2169546     
   0.6901000       0.8211479       0.8735071       0.9649668     
   0.8245004       0.4523637       0.2586277       0.3373762

Mat

franzisko · August 1, 2012, 10:55am

Hi Mat,

I ran it exactly as you did, but the results are not correct. My device is a Tesla S2050 and unfortunately I cannot easily test on any other device.

% pgf90 strange.f90 -Minfo -V12.6
ella008:~/w/COSMO/CONVSTN_LUGLIO/TEST/PGI_BUG>./a.out 
 b:     3.000000        3.000000        3.000000        3.000000     
    3.000000        3.000000        3.000000        3.000000     
    3.000000        3.000000        3.000000        3.000000     
    3.000000        3.000000        3.000000        3.000000     
   0.6280164       0.6701866       0.6281718       0.5310344     
   0.5005002       0.4253310       0.3070166       0.2169546     
   0.6901000       0.8211479       0.8735071       0.9649668     
   0.8245004       0.4523637       0.2586277       0.3373762    
% pgf90 strange.f90 -Minfo -V12.6 -acc 
strange_behaviour:
      7, Memory set idiom, array assignment replaced by call to pgf90_mset4
     10, Generating copyout(b(:,:,:1))
         Generating copyout(a(:,:,:))
         Generating compute capability 1.0 binary
         Generating compute capability 2.0 binary
     11, Loop is parallelizable
     12, Loop is parallelizable
         Accelerator kernel generated
         11, !$acc loop gang ! blockidx%y
         12, !$acc loop gang, vector(32) ! blockidx%x threadidx%x
             CC 1.0 : 10 registers; 32 shared, 4 constant, 0 local memory bytes
             CC 2.0 : 13 registers; 0 shared, 48 constant, 0 local memory bytes
     14, Loop is parallelizable

%./a.out 
 b:   -1.9983972E+18  -1.9983972E+18  -1.9983972E+18  -1.9983972E+18 
  -1.9983972E+18  -1.9983972E+18  -1.9983972E+18  -1.9983972E+18 
  -1.9983972E+18  -1.9983972E+18  -1.9983972E+18  -1.9983972E+18 
  -1.9983972E+18  -1.9983972E+18  -1.9983972E+18  -1.9983972E+18 
   0.6280164       0.6701866       0.6281718       0.5310344     
   0.5005002       0.4253310       0.3070166       0.2169546     
   0.6901000       0.8211479       0.8735071       0.9649668     
   0.8245004       0.4523637       0.2586277       0.3373762

The code works using PGI 12.5.

thanks
Francesco

MatColgrove · August 1, 2012, 9:27pm

Hi Francesco,

What CUDA driver version do you have (see the output from pgaccelinfo)? In 12.6 we switched to using CUDA 4.2 by default, so if your driver is old it may cause problems.

You can also try switching back to using CUDA 4.0 by compiling with “-ta=nvidia,cuda4.0”.

Mat

franzisko · August 2, 2012, 2:35pm

Hi Mat,

Thanks for your attention.

Our CUDA driver is 4.1. However, when I compile with cuda4.0 or cuda4.1 the error is still the same. Everything always works with PGI V12.5.

Something similar (maybe the same problem) happens in the big code I am working on at the moment.

Francesco

MatColgrove · August 2, 2012, 5:52pm

Hi Francesco,

After moving to a different system, I’m now able to recreate the error. Interestingly, the failure is intermittent, which suggests a synchronisation issue within the kernel. I have sent a report to our engineers (TPR#18854). I’ll continue to look for a work-around.

Mat
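
Because the failure is intermittent, a single run can pass by chance. The following is a minimal sketch, not taken from the thread, of a self-checking variant of the reproducer above: it repeats the accelerator region several times and counts the trials in which the first k-plane of b is not 3. The program name check_repro, the trial count ntrials and the counter nfail are hypothetical names introduced for illustration. The directives are copied unchanged from the original reproducer.

```fortran
program check_repro
  implicit none
  integer, parameter :: nx = 4, ny = 4, nz = 2
  integer, parameter :: ntrials = 100   ! hypothetical: number of repetitions
  integer :: i, j, k, trial, nfail
  real, dimension(nx, ny, nz) :: a, b

  nfail = 0
  do trial = 1, ntrials
     ! Reset the inputs exactly as the original reproducer does.
     a = 10.
     call random_number(b)

     ! Same region and directives as in the reproducer from this thread.
     !$acc kernels
     do j = 1, ny
        do i = 1, nx
           !$acc do seq
           do k = 1, nz
              a(i, j, k) = 3.
           enddo
           b(i, j, 1) = a(i, j, 1)
        enddo
     enddo
     !$acc end kernels

     ! After the region, every element of b(:,:,1) should equal 3.
     ! 3.0 is exactly representable, so an exact comparison is safe here.
     if (any(b(:, :, 1) /= 3.)) nfail = nfail + 1
  enddo

  print *, 'failing trials: ', nfail, ' of ', ntrials
end program check_repro
```

Compiled with the same command lines used earlier in the thread (with and without -acc), a non-zero count in the -acc build would indicate that the problem was reproduced on that system.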

tull · November 1, 2013, 11:14pm

TPR 18854 has been corrected in the 13.10 release.

thanks for the report.

dave
