cudaEventSynchronize returned status 4: unspecified launch failure

Hello,

I’m compiling the toy code below with the flags

-Mcuda -ta=nvidia,wait,time

If natoms is greater than 472 I get the following error message:

line 23: cudaEventSynchronize returned status 4: unspecified launch failure

Am I exceeding a resource limitation on the Tesla C2075?

  • Sarom
      program test
      
      implicit none
      
      integer irad, iang, iatom, jatom, igridpoint
      integer, parameter :: natoms = 473
      integer, parameter :: nradpt = 96
      integer, parameter :: nlebpt = 1202
      
      double precision, dimension(natoms) :: tempa
      double precision, dimension(natoms,natoms) :: tempb
      double precision, dimension(natoms,nradpt*nlebpt) :: tempc
      double precision, dimension(nradpt*nlebpt) :: tempd
      double precision valuea,valueb,valuec,valued,valuee
      
      do irad=1,nradpt
!$acc data region
!$acc> copyout(tempc(1:natoms,(irad-1)*nlebpt+1:(irad-1)*nlebpt+nlebpt))
!$acc region
!$acc do parallel, private(tempa(1:natoms),tempb(1:natoms,1:natoms))
        do iang=1,nlebpt
          igridpoint=(irad-1)*nlebpt+iang
          do iatom=1,natoms
            tempa(iatom)=1.0D+00
            do jatom=1,natoms
              tempb(jatom,iatom)=1.0D+00
              if (iatom .eq. jatom) cycle
              valuea=dble(jatom)/(22.0D+00/7.00D+00)
              valueb=sin(valuea)
              valuec=dble(iatom)/(22.0D+00/7.00D+00)
              valued=cos(valuec)
              valuee=abs(atan(valued/valueb))
              tempb(jatom,iatom)=valuee**4.0D+00
            enddo
            tempa(iatom)=product(tempb(1:natoms,iatom),1)
          enddo
          tempc(1:natoms,igridpoint)=tempa(1:natoms)
        enddo
!$acc end region
!$acc end data region
      enddo
     
      do irad=1,nradpt
        do iang=1,nlebpt
          igridpoint=(irad-1)*nlebpt+iang
          tempd(igridpoint)=sum(tempc(1:natoms,igridpoint),1)*
     >    dble(igridpoint)/dble(nlebpt*nradpt)
        enddo
      enddo
      
      write(*,*) 'SUM:= ', sum(tempd(1:nradpt*nlebpt),1)
      end

Build output

------ Build started: Project: test-loop-b, Configuration: Debug x64 ------
Compiling Project  ...
test-b.f
test:
     17, Generating copyout(tempc(:,(irad-1)*1202+1:(irad-1)*1202+1202))
     21, Loop is parallelizable
     23, Loop is parallelizable
         Accelerator kernel generated
         21, !$acc do parallel ! blockidx%y
             Using register for 'tempa'
         23, !$acc do parallel, vector(256) ! blockidx%x threadidx%x
     25, Loop is parallelizable
     35, product reduction inlined
         Loop is parallelizable
     37, Loop is parallelizable
         Accelerator kernel generated
         21, !$acc do parallel ! blockidx%y
         37, !$acc do parallel, vector(256) ! blockidx%x threadidx%x
     46, sum reduction inlined
     51, sum reduction inlined
Linking...
test-loop-b build succeeded.

Build log was saved at "file://C:\gamessVS\11.28.2011\test\test-loop-b\x64\Debug\BuildLog.htm"

Hi Sarom,

Private arrays are allocated in one large chunk of memory on the device, with each thread getting its own section. With 1202 threads, each holding a 473x473 array at 8 bytes per element, the total comes to just over 2GB. With natoms at 472, the total stays just under the 2GB mark.

Looking at the generated GPU code, we’re using “int” data types to calculate the address of tempb, so when the size goes over 2GB the index overflows the 32-bit data type. I’ve asked our engineers to take a look at what can be done. Most likely we’ll need to add a flag similar to “-Mlarge_arrays” for the GPU.
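
For what it’s worth, here is a small host-side sketch of the arithmetic (just an illustration, not the generated GPU code; the transfer trick assumes a little-endian machine):

      program privsize
      implicit none
      integer(kind=8) :: b473, b472
      integer(kind=4) :: wrapped
! 1202 private copies of tempb, each natoms x natoms doubles
      b473 = 1202_8 * 473_8 * 473_8 * 8_8
      b472 = 1202_8 * 472_8 * 472_8 * 8_8
! reinterpret the low 32 bits of the larger size as a signed int
      wrapped = transfer(b473, wrapped)
      write(*,*) 'natoms=473 : ', b473, ' bytes'
      write(*,*) 'natoms=472 : ', b472, ' bytes'
      write(*,*) '32-bit max : ', huge(wrapped)
      write(*,*) 'overflowed : ', wrapped
      end

The 473 case lands just past huge(wrapped), so the 32-bit byte offset wraps negative.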

The workaround is to use smaller arrays, a smaller data type, or to manually privatize your arrays. For global arrays we use a slightly different way of calculating addresses, which happens to work in this case. It’s not an ideal solution, though, since it wastes a lot of host memory and requires the medium memory model.

 % cat acc_12_21_11.f

      program test
     
      implicit none
     
      integer irad, iang, iatom, jatom, igridpoint
      integer, parameter :: natoms = 473
      integer, parameter :: nradpt = 96
      integer, parameter :: nlebpt = 1202
     
      double precision, dimension(nlebpt,natoms) :: tempa
      double precision, dimension(nlebpt,natoms,natoms) :: tempb
      double precision, dimension(natoms,nradpt*nlebpt) :: tempc
      double precision, dimension(nradpt*nlebpt) :: tempd
      double precision valuea,valueb,valuec,valued,valuee
     
      do irad=1,nradpt
!$acc data region
!$acc> copyout(tempc(1:natoms,(irad-1)*nlebpt+1:(irad-1)*nlebpt+nlebpt))
!$acc region local(tempb,tempa)
!$acc do parallel
!, private(tempa(1:natoms),tempb(1:natoms,1:natoms))
        do iang=1,nlebpt
          igridpoint=(irad-1)*nlebpt+iang
          do iatom=1,natoms
            tempa(iang,iatom)=1.0D+00
            do jatom=1,natoms
              tempb(iang,jatom,iatom)=1.0D+00
              if (iatom .eq. jatom) cycle
              valuea=dble(jatom)/(22.0D+00/7.00D+00)
              valueb=sin(valuea)
              valuec=dble(iatom)/(22.0D+00/7.00D+00)
              valued=cos(valuec)
              valuee=abs(atan(valued/valueb))
              tempb(iang,jatom,iatom)=valuee**4.0D+00
            enddo
            tempa(iang,iatom)=product(tempb(iang,1:natoms,iatom),1)
          enddo
          tempc(1:natoms,igridpoint)=tempa(iang,1:natoms)
        enddo
!$acc end region
!$acc end data region
      enddo
     
      do irad=1,nradpt
        do iang=1,nlebpt
          igridpoint=(irad-1)*nlebpt+iang
          tempd(igridpoint)=sum(tempc(1:natoms,igridpoint),1)*
     >    dble(igridpoint)/dble(nlebpt*nradpt)
        enddo
      enddo
     
      write(*,*) 'SUM:= ', sum(tempd(1:nradpt*nlebpt),1)
      end
 

% pgf90 -ta=nvidia -Mcuda -mcmodel=medium -fast acc_12_21_11.f
% a.out
 SUM:=     130948983393.7319   
%
  • Mat

Thanks for the response Mat.

Just so I am clear on this: the usage of ‘int’ data types to calculate the address of tempb is outside of my control? The usage of an -i8 flag would have no effect?

And how would I get your work around to work on Win64?

Just so I am clear on this: the usage of ‘int’ data types to calculate the address of tempb is outside of my control? The usage of an -i8 flag would have no effect?

I believe so. Then again, this is just my analysis and I may be wrong. I’ve asked Michael and his team to investigate once they are back from Winter Break.

And how would I get your work around to work on Win64?

You would need to dynamically allocate the arrays. Windows doesn’t support large static arrays, only dynamically allocated ones.
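
Something along these lines for the declarations and allocations should work (just a sketch, not tested; it assumes the same parameters as the example above):

      double precision, allocatable, dimension(:,:)   :: tempa
      double precision, allocatable, dimension(:,:,:) :: tempb
      double precision, allocatable, dimension(:,:)   :: tempc
      double precision, allocatable, dimension(:)     :: tempd
! allocate at run time so the data lives on the heap rather than in
! a static image section, avoiding the Win64 static-array limit
      allocate(tempa(nlebpt,natoms))
      allocate(tempb(nlebpt,natoms,natoms))
      allocate(tempc(natoms,nradpt*nlebpt))
      allocate(tempd(nradpt*nlebpt))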

  • Mat

Hi Mat,

Have the engineers offered a compiler flag to allow indexing of privatized arrays beyond 2GB?