Problem loading data to shared memory

Hi,

I’d like to make use of shared memory in CUDA Fortran, but I get an error that I don’t understand.

I want to pass data from a 2-d array in global memory (X_global) to
a 1-d array in shared memory (X_shared).

Each element of the arrays is of type “float4” (that I defined in a separate
module).

 
attributes(global) subroutine steepest_kernell(X_global,Nmin,Npart)
  integer, value :: Nmin, Npart
  type(float4), device, dimension(Npart,Nmin) :: X_global
  type(float4), shared, dimension(4) :: X_shared
  X_shared(threadidx%x) = X_global(threadidx%x,blockidx%x)
end subroutine steepest_kernell

But I get the following error when compiling:

PGF90-S-0000-Internal compiler error. unsupported operation     264 (module_minization_gpu.f90: 48)
PGF90-S-0000-Internal compiler error. unsupported operation     264 (module_minization_gpu.f90: 48)
  0 inform,   0 warnings,   2 severes, 0 fatal for steepest_kernell
/tmp/pgcudaforpcefTExtKbwL.gpu(23): error: expected an expression
/tmp/pgcudaforpcefTExtKbwL.gpu(23): error: expected an expression
2 errors detected in the compilation of "/tmp/pgnvdRdefbY8bwnCL.nv0".
PGF90-F-0000-Internal compiler error. pgnvd job exited with nonzero status code       0 (module_minization_gpu.f90: 48)

When I compile in emulation mode (-Mcuda=emu), everything works nicely.

Thanks,
Alberto.

Hi Alberto,

This means that the compiler is generating CPU code that has no GPU equivalent. We’re either generating the wrong code or not flagging a semantic error. Either way, it is a compiler error.

Can you please post the definition for ‘float4’? Also, which version of the compilers are you using?

Thanks,
Mat

Hi Mat,


I’m using the pgi/64/10.8 compiler and CUDA driver 3.1.

The definition of float4:

type float4
	real :: x  !!! X-position
	real :: y  !!! Y-position
	real :: z  !!! Z-position
	real :: d  !!! diameter
end type float4

Alberto

Hi Alberto,

It looks like the compiler is trying to apply a memory copy optimization which isn’t supported on the GPU. I have sent our engineers TPR#17223 detailing the problem. The workaround is to copy each member separately, as in the example below.

Thanks for the report!
Mat

attributes(global) subroutine steepest_kernell(X_global,Nmin,Npart)
  integer, value :: Nmin, Npart
  type(float4), device, dimension(Npart,Nmin) :: X_global
  type(float4), shared, dimension(4) :: X_shared
  X_shared(threadidx%x)%x = X_global(threadidx%x,blockidx%x)%x
  X_shared(threadidx%x)%y = X_global(threadidx%x,blockidx%x)%y
  X_shared(threadidx%x)%z = X_global(threadidx%x,blockidx%x)%z
  X_shared(threadidx%x)%d = X_global(threadidx%x,blockidx%x)%d
end subroutine steepest_kernell

Hi Mat,

When copying each component separately, the program does compile, but it doesn’t work properly.

In emulation mode it compiles and the results are correct. But on the GPU (without emulation) it also compiles (using the trick you mentioned), yet the results are completely wrong.

Can you confirm that the PGI CUDA Fortran compiler works with derived data types?

For example, this trivial program:

attributes(global) subroutine steepest_kernell(X_global,Nmin,Npart) 
  integer, value :: Nmin, Npart 
  type(float4), device, dimension(Npart,Nmin) :: X_global 
  type(float4), shared, dimension(4) :: X_shared 
  real :: temp 

  X_shared(threadidx%x)%x = X_global(threadidx%x,blockidx%x)%x 
  
  temp = X_shared(threadidx%x)%x + 1

  X_global(threadidx%x,blockidx%x)%x = temp
end subroutine steepest_kernell

It works nicely in emulation mode, but on the GPU it doesn’t add the “1” to the vector as it should.

Best regards,
Alberto.

Hi Alberto,

Can you please post a reproducible example? I tried with the code below and it works for me.

Thanks,
Mat

% cat test.cuf
module foo

use cudafor

type float4
   real :: x  !!! X-position
   real :: y  !!! Y-position
   real :: z  !!! Z-position
   real :: d  !!! diameter
end type float4

contains

attributes(global) subroutine steepest_kernell(X_global,Nmin,Npart)
  integer, value :: Nmin, Npart
  real :: temp
  type(float4), device, dimension(Npart,Nmin) :: X_global
  type(float4), shared, dimension(4) :: X_shared
  X_shared(threadidx%x)%x = X_global(threadidx%x,blockidx%x)%x
  temp = X_shared(threadidx%x)%x + 1
  X_global(threadidx%x,blockidx%x)%x = temp
end subroutine steepest_kernell

subroutine testme(Npart,Nmin)

  integer :: Npart
  integer :: Nmin

  type(float4), dimension(Npart,Nmin) :: X
  type(float4), device, dimension(Npart,Nmin) :: X_global

  X%x = 1.0
  X_global=X
  call steepest_kernell<<<32,4>>>(X_global,Nmin,Npart)
  X = X_global
  print *, X(1:4,1)%x

end subroutine testme
end module foo

program test

   use foo
   call testme(4,32)

end program test

% pgf90 -fast test.cuf -V10.8; a.out
    2.000000        2.000000        2.000000        2.000000

Hi Mat,

Your example works fine for me as well.

After several hours I have found where the error in my program comes from, but it seems very strange.

If I run the code below with -Mcuda=emu, I get the correct results.
If I run it without “emu”, the program gives wrong results.

But the really strange thing is that if I delete the first subroutine in the file (the one called “force_over_my_particule”), everything works without “emu”!

And this is even stranger, because as you can see, this subroutine is not even called from the kernel (the call is commented out)!

Thanks a lot for your help!
Alberto.

module foo 

use cudafor 

type float4 
   real :: x  !!! X-position 
   real :: y  !!! Y-position 
   real :: z  !!! Z-position 
   real :: d  !!! diameter 
end type float4


type float3 
   real :: x  !!! X-position 
   real :: y  !!! Y-position 
   real :: z  !!! Z-position
end type float3



contains


attributes(device) subroutine force_over_my_particule(X_shared,grad_shared,Npart)
type(float4), shared, dimension(8), intent(in) :: X_shared
type(float3), shared, dimension(8), intent(out) :: grad_shared
integer, value, intent(in) :: Npart


real :: force
real :: dx, dy, dz, r, invr
real :: diam, invdiam, ptemp, fr
integer :: i
type(float3) :: temp_grad

temp_grad%x = 0. !! Don't forget the "." at the end.
temp_grad%y = 0.
temp_grad%z = 0.

do i=1,Npart
   dx = X_Shared(threadidx%x)%x - X_Shared(i)%x
   dx = dx - nint(dx)
   dy = X_Shared(threadidx%x)%y - X_Shared(i)%y
   dy = dy - nint(dy)
   dz = X_Shared(threadidx%x)%z - X_Shared(i)%z
   dz = dz - nint(dz)
   r = sqrt(dx*dx+dy*dy+dz*dz)
   invr = 1.0D0/r
   diam = (X_Shared(threadidx%x)%d + X_Shared(i)%d)/2.0D0
   invdiam = 1.0D0/diam

   if ((r > 0) .AND. (r < diam)) then
      ptemp = (1.0D0 - r*invdiam)
      fr = invdiam*invr*ptemp

      temp_grad%x = temp_grad%x - fr*dx
      temp_grad%y = temp_grad%y - fr*dy
      temp_grad%z = temp_grad%z - fr*dz
   end if
enddo

grad_shared(threadidx%x)%x=temp_grad%x
grad_shared(threadidx%x)%y=temp_grad%y
grad_shared(threadidx%x)%z=temp_grad%z

end subroutine force_over_my_particule



attributes(global) subroutine steepest_kernell(X_global,Nmin,Npart,step_size) 
  integer, value :: Nmin, Npart
  real, value :: step_size
  type(float4), device, dimension(Npart,Nmin) :: X_global 
  type(float4), shared, dimension(8) :: X_shared
  type(float3), shared, dimension(8) :: grad_shared
  type(float3) :: temp

  !!! Copy to shared memory
  X_shared(threadidx%x)%x = X_global(threadidx%x,blockidx%x)%x
  X_shared(threadidx%x)%y = X_global(threadidx%x,blockidx%x)%y
  X_shared(threadidx%x)%z = X_global(threadidx%x,blockidx%x)%z
  X_shared(threadidx%x)%d = X_global(threadidx%x,blockidx%x)%d
  
  call syncthreads()

  !!! Compute the gradient
  grad_shared(threadidx%x)%x = step_size
  grad_shared(threadidx%x)%y = step_size
  grad_shared(threadidx%x)%z = step_size

!  call force_over_my_particule(X_shared,grad_shared,Npart)
  !!! end compute gradient

  call syncthreads()
  temp%x =  X_shared(threadidx%x)%x - step_size*grad_shared(threadidx%x)%x
  temp%y =  X_shared(threadidx%x)%y - step_size*grad_shared(threadidx%x)%y
  temp%z =  X_shared(threadidx%x)%z - step_size*grad_shared(threadidx%x)%z

  call syncthreads()
  X_global(threadidx%x,blockidx%x)%x = temp%x
  X_global(threadidx%x,blockidx%x)%y = temp%y
  X_global(threadidx%x,blockidx%x)%z = temp%z


end subroutine steepest_kernell






subroutine minimize_steepest_gpu(Npart,Nmin,X0,Xf,step_size) 

  integer :: Npart 
  integer :: Nmin
  real :: step_size

  type(float4), dimension(Npart,Nmin) :: X0, Xf
  type(float4), device, dimension(Npart,Nmin) :: X_global 

  X_global=X0 
  call steepest_kernell<<<Nmin,Npart>>>(X_global,Nmin,Npart,step_size) 
  Xf = X_global


end subroutine minimize_steepest_gpu



subroutine initialize_positions(X0)
   type(float4), dimension(:,:), intent(inout) :: X0
   integer :: i,j
   real :: ran

   do i=1, size(X0(1,:))
      do j=1,size(X0(:,1))
	   call random_number(ran)
	   X0(j,i)%x = ran-0.5
	   call random_number(ran)
	   X0(j,i)%y = ran-0.5
	   X0(j,i)%z = 0
      enddo
   enddo
end subroutine initialize_positions



subroutine initialize_diameters(X0,r,PHI)
  type(float4), dimension(:,:), intent(inout) :: X0
  real, intent(in) :: r,PHI

  real, parameter :: PI=3.14159265358979D0
  real :: d1,d2
  integer :: npart, nmin, i, j

  nmin = size(X0(1,:))
  npart = size(X0(:,1))

  d1 = sqrt(8*PHI/(PI*npart*(1+r*r)))
  d2 = d1*r

  do i=1,npart
     if (i<=npart/2) then
        X0(i,:)%d = d1
     else
        X0(i,:)%d = d2
     end if
  end do

end subroutine initialize_diameters



end module foo 

program test 
use foo
integer, parameter :: Nmin = 1
integer, parameter :: Npart = 8
real,parameter :: step_size = 1.0
type(float4), dimension(Npart,Nmin) :: X0, Xf  !!! Initial positions
real, parameter :: phi = 0.9  !!! Volume ratio
real, parameter :: r = 1.4 !!! Diameter ratio

call initialize_positions(X0)
call initialize_diameters(X0,r,phi)

write(*,*) 'x0'
write(*,*) X0(:,1)

call minimize_steepest_gpu(Npart,Nmin,X0,Xf,step_size)

write(*,*) 'xf'
write(*,*) Xf(:,1)

end program test

Hi Alberto,

I suspect something else is wrong, since when I run your program with and without -Mcuda=emu, I get the same results. This is true whether I leave force_over_my_particule commented out, remove the function altogether, or include the call.

Could you please add the following error checking code just after your kernel call? I’m wondering if it’s failing for some reason.

    call steepest_kernell<<<Nmin,Npart>>>(X_global,Nmin,Npart,step_size)
! Check for errors (declare "integer :: errCode" in the host subroutine)
    errCode = cudaGetLastError()
    if (errCode .gt. 0) then
       print *, cudaGetErrorString(errCode)
       stop
    endif

Another possibility is that I’m getting the same wrong answers for both. Below are the results I’m getting. Can you please verify that they are correct?

% pgf90 alberto1.cuf -Mcuda=emu -V10.9 -o emu109.out 
% pgf90 alberto1.cuf -Mcuda -V10.9 -o cuf109.out 
% cuf109.out
 x0
   0.4079230      -0.3093079        0.000000       0.3111003
  -0.4328347       0.3000845        0.000000       0.3111003
   0.2973146       0.1368300        0.000000       0.3111003
   8.8782728E-02  -0.3409344        0.000000       0.3111003
  -0.1793523      -5.1823914E-02    0.000000       0.4355405
   0.1579502       0.1075459        0.000000       0.4355405
   0.2622637      -6.2520266E-02    0.000000       0.4355405
  -3.6130935E-02   0.2004157        0.000000       0.4355405
 xf
  -0.5920770       -1.309308       -1.000000       0.3111003
   -1.432835      -0.6999155       -1.000000       0.3111003
  -0.7026854      -0.8631700       -1.000000       0.3111003
  -0.9112173       -1.340934       -1.000000       0.3111003
   -1.179352       -1.051824       -1.000000       0.4355405
  -0.8420498      -0.8924541       -1.000000       0.4355405
  -0.7377363       -1.062520       -1.000000       0.4355405
   -1.036131      -0.7995843       -1.000000       0.4355405
% emu109.out
 x0
   0.4079230      -0.3093079        0.000000       0.3111003
  -0.4328347       0.3000845        0.000000       0.3111003
   0.2973146       0.1368300        0.000000       0.3111003
   8.8782728E-02  -0.3409344        0.000000       0.3111003
  -0.1793523      -5.1823914E-02    0.000000       0.4355405
   0.1579502       0.1075459        0.000000       0.4355405
   0.2622637      -6.2520266E-02    0.000000       0.4355405
  -3.6130935E-02   0.2004157        0.000000       0.4355405
 xf
  -0.5920770       -1.309308       -1.000000       0.3111003
   -1.432835      -0.6999155       -1.000000       0.3111003
  -0.7026854      -0.8631700       -1.000000       0.3111003
  -0.9112173       -1.340934       -1.000000       0.3111003
   -1.179352       -1.051824       -1.000000       0.4355405
  -0.8420498      -0.8924541       -1.000000       0.4355405
  -0.7377363       -1.062520       -1.000000       0.4355405
   -1.036131      -0.7995843       -1.000000       0.4355405
Mat

Hi Mat,

Thanks a lot for your response.

Your results are the correct ones.

I inserted the error checking lines, and indeed there’s an error:

errCode = 8
cudaGetErrorString(errCode) = “invalid device function”

The call to “force_over_my_particule” is still commented out.

Do you know what can be wrong?

Alberto.

Hi Mat,

I think I’ve found where things go wrong:
apparently it doesn’t like the representation of the numbers, i.e.,
with “1.0D0” it doesn’t work, but with “1.” it does.
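
For example, in “force_over_my_particule” I changed lines like the following; only the constants change (a sketch of the kind of edit):

   ! With double-precision "D0" constants this fails on my card:
   invr = 1.0D0/r
   diam = (X_Shared(threadidx%x)%d + X_Shared(i)%d)/2.0D0

   ! With single-precision constants it works:
   invr = 1.0/r
   diam = (X_Shared(threadidx%x)%d + X_Shared(i)%d)/2.0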

Does this make any sense? I mean: why did it work before in emulation mode? And why was it saying “errCode=8”?

A.

Does your card support double precision? What NVIDIA CUDA driver do you have? If you don’t know, post the output from the PGI utility ‘pgaccelinfo’.

Note that “Error 8: invalid device function” means that you’re trying to run compute capability (CC) 1.3 code on a device that only supports CC1.2 or earlier. Double precision was first added in CC1.3 devices.
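
If pgaccelinfo isn’t handy, a small CUDA Fortran program can query the same information (a minimal sketch using the cudaGetDeviceProperties interface from the cudafor module):

program devcheck
  use cudafor
  type(cudadeviceprop) :: prop
  integer :: istat
  ! Query the properties of device 0
  istat = cudaGetDeviceProperties(prop, 0)
  print *, 'Device: ', trim(prop%name)
  print *, 'Compute capability: ', prop%major, prop%minor
end program devcheck

If the reported capability is below 1.3, any double precision in a kernel (such as “1.0D0” constants) will trigger exactly this “invalid device function” failure.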

Hi Mat,

Indeed, the card in my local machine is CC1.1.

I’m sorry for this silly mistake on my part!

Alberto.

Alberto,

TPR 17223 - "CUF: User code gives ICE unsupported operation when copying a derived type"

has been corrected in the 11.0 release, which is available now.

Thanks again for your submission.

Regards,
Dave