Using Fortran derived types and cuBLAS

I am trying to combine OpenACC with cuBLAS, but I am not sure how derived types need to be handled. Here’s part of the code that I have.

 subroutine interpolate_at_flux_points(mesh)
    use cublas
    use cudafor
    use openacc
    use input_module, only: chunksize
    type(mesh2d), intent(inout) :: mesh

    ! locals
    integer(c_int) :: index
    integer(c_int) :: j

    real(c_double), allocatable :: uflux_tmp(:, :)
    real(c_double), allocatable :: uelem_tmp(:, :)

    allocate(uelem_tmp(mesh%Np, mesh%Nvar*chunksize))
    allocate(uflux_tmp(mesh%Nflux, mesh%Nvar*chunksize))

    index = 1
    !$acc data copyin(mesh, mesh%u, mesh%Nflux, mesh%Nvar, mesh%Np, mesh%Lf(:, :)) &
    !$acc  create(uelem_tmp, uflux_tmp)
    !$acc parallel loop private(j) 
       do j = 1, chunksize
          uelem_tmp(:, (j-1)*mesh%Nvar+1:j*mesh%Nvar) = &
               mesh%u(:, :, (index-1)*chunksize+j)
       end do
    !$acc end parallel 


    !$acc host_data use_device(mesh, mesh%u, mesh%Nflux, mesh%Nvar, mesh%Np, mesh%Lf, &
    !$acc uelem_tmp, uflux_tmp)
    
       call cublasDGEMM('N', 'N', mesh%Nflux, mesh%Nvar*chunksize, mesh%Np, &
            ONE, mesh%Lf(:, :), mesh%Nflux, &
            uelem_tmp(:, :), mesh%Np, &
            ZERO, uflux_tmp, mesh%Nflux)
    !$acc end host_data

    !$acc end data
    deallocate(uelem_tmp)
    deallocate(uflux_tmp)

  end subroutine interpolate_at_flux_points

Compiling this gives the following error.

PGF90-S-0528-Argument number 7 to cublasdgemm: device attribute mismatch (libfr.f90: 240)

I am using the following flags

pgfortran -fast -gopt -O2 -acc -Minfo=accel -Mcuda

I first tried a similar example without derived types and it worked, so I think the problem has to do with the derived types.

Sorry about that. It turns out it was simply a linker issue: after adding -lcublas -Mcuda it compiles. But it was giving segfaults, so I modified the code to this.

subroutine interpolate_at_flux_points(mesh)
    use cublas
    use openacc
    use input_module, only: chunksize
    type(mesh2d), intent(inout) :: mesh

  integer(c_int) :: Nvar, Nflux, Np, index
  integer(c_int) :: istat

  real(c_double), allocatable :: A(:,:), B(:, :), C(:, :)
  integer           :: size_of_real, i, j, k

  allocate(A(mesh%Nflux, mesh%Nvar*chunksize))
  allocate(B(mesh%Np, mesh%Nvar*chunksize))
  allocate(C(mesh%Nflux, mesh%Nvar*chunksize))

  Nflux = mesh%Nflux
  Np = mesh%Np
  Nvar = mesh%Nvar

  index = 1

  !$acc enter data copyin(mesh)
  !$acc enter data copyin(mesh%Lf, mesh%u)
  !$acc enter data create(A, B, C) copyin(Nflux, Np, Nvar) 

  !$acc kernels copyin(index, chunksize)
  A(:, :) = mesh%Lf(:, :)

  do index = 1, chunksize
      do j = 1, chunksize
          B(:, (j-1)*Nvar+1:j*Nvar) = &
            mesh%u(:, :, (index-1)*chunksize+j)
      end do
  end do

  C(:, :)      = -0.5_c_double
  !$acc end kernels

  !$acc host_data use_device(A, B, C)
  call cublasDGEMM('N', 'N', Nflux, Nvar*chunksize, Np, &
      1.0_c_double,  A, Nflux, B, Np, 0.0_c_double, C, Nflux)
  !$acc end host_data

  !$acc update self(C)
  !$acc exit data delete(A, B, C, Nflux, Np, Nvar)
  !$acc exit data delete(mesh%Lf, mesh%u)
  !$acc exit data delete(mesh)


  deallocate(A)
  deallocate(B)
  deallocate(C)
end subroutine interpolate_at_flux_points

It compiles fine, but on running gives the following error.

Present table dump for device[1]: NVIDIA Tesla GPU 0, compute capability 3.0
host:0x8915c0 device:0x700680000 size:13256 presentcount:1 line:232 name:mesh
host:0x2b8514b24020 device:0x700880000 size:451584 presentcount:1 line:233 name:(null)
host:0x7fffbaac7c90 device:0x700980000 size:4 presentcount:1 line:234 name:nflux
host:0x7fffbaac7c94 device:0x700980200 size:4 presentcount:1 line:234 name:np
host:0x7fffbaac7ed8 device:0x700980400 size:4 presentcount:1 line:234 name:nvar
FATAL ERROR: data in update host clause was not found on device 1: name=c
 file:Projects/deepfry/examples/CNS/shocktube/../../..//src/libfr.f90 interpolate_at_flux_points line:258
call to cuMemFreeHost returned error 700: Illegal address during kernel execution

Hi vsingh96824,

The error indicates that “C” wasn’t created on the device, and from the present table dump, neither were “A” or “B”. Also, only one of “mesh%Lf” or “mesh%u” is being created; I’m not sure which, since it says “(null)” for the name.

This could be a compiler issue, but I think it’s more likely that either “chunksize” or “mesh%Nvar” is zero and you’re working with zero-sized arrays. Can you check the values of these variables? Also, is either “mesh%Lf” or “mesh%u” zero-sized?

If you do think it’s a compiler issue, can you please create a reproducing example?

On a side note, scalars are private by default, and by putting them in data regions you are making them global (shared). In the case of “index”, this is an error, since loop index variables must be private. For the other scalars, “Nflux”, “Np”, “Nvar”, and “chunksize”, it’s not wrong to put them in a data region, just not optimal: the data must now be loaded from global memory rather than, most likely, being kept in local registers, which will hurt your performance. Best practice is not to manage scalars unless you have a good reason to do so.
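A minimal sketch of the idea, using the names from your code (untested, just to show the shape): keep only the arrays in data clauses and let the scalars default to firstprivate/private.

  ! no copyin of Nflux, Np, Nvar, index or chunksize needed; the compiler
  ! passes these scalars in as firstprivate (and keeps the loop index private)
  !$acc enter data copyin(mesh)
  !$acc enter data copyin(mesh%Lf, mesh%u)
  !$acc enter data create(uelem_tmp, uflux_tmp)

  !$acc parallel loop
  do j = 1, chunksize
     uelem_tmp(:, (j-1)*Nvar+1:j*Nvar) = mesh%u(:, :, (index-1)*chunksize+j)
  end do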

Hope this helps,
Mat

Hi Mat,

Thanks for the pointers. I have modified the code to this.

 subroutine interpolate_at_flux_points(mesh)
    use cublas
    use openacc
    use input_module, only: chunksize
    type(mesh2d), intent(inout) :: mesh


  integer(c_int) :: Nvar, Nflux, Np, index
  integer(c_int) :: istat

  real(c_double), allocatable :: A(:,:), B(:, :), C(:, :)
    real(c_double), allocatable :: uflux_tmp(:, :)
    real(c_double), allocatable :: uelem_tmp(:, :)

  integer           :: size_of_real, i, j, k

  Nflux = mesh%Nflux
  Np = mesh%Np
  Nvar = mesh%Nvar

  index = 1

  allocate(A(Nflux, Np))
  allocate(B(Np, Nvar*chunksize))
  allocate(C(Nflux, Nvar*chunksize))
  
  print*, Nflux, Np, Nvar, chunksize
  print*, 'Lf'  
  print*, mesh%Lf(1, :)     
  print*, 'u'      
  print*, mesh%u(:, 1, 1)
  !$acc enter data copyin(mesh)
  !$acc enter data copyin(mesh%Lf)
  !$acc enter data copyin(mesh%u)
  !$acc enter data create(A, B, C) 

  !$acc kernels 
  A(:, :) = mesh%Lf(:, :)

  do index = 1, chunksize
      do j = 1, chunksize
          B(:, (j-1)*Nvar+1:j*Nvar) = &
              mesh%u(:, :, (index-1)*chunksize+j)
      end do
  end do

  C(:, :)      = -0.5_c_double
  !$acc end kernels

  !$acc host_data use_device(A, B, C)
  call DGEMM('N', 'N', Nflux, Nvar*chunksize, Np, &
      1.0_c_double,  A, Nflux, B, Np, 0.0_c_double, C, Nflux)

  !$acc end host_data
  !$acc update self(C)

  !$acc exit data delete(A, B, C)
  !$acc exit data delete(A, B, C)
  !$acc exit data delete(mesh%u)
  !$acc exit data delete(mesh%Lf)
  !$acc exit data delete(mesh)

  print*, C(1, :)

  deallocate(A)
  deallocate(B)
  deallocate(C)

Here’s the compile output.

 
    241, Generating enter data copyin(mesh)
    242, Generating enter data copyin(mesh%lf(:,:))
    243, Generating enter data copyin(mesh%u(:,:,:))
    244, Generating enter data create(c(:,:),b(:,:),a(:,:))
    246, Generating copyout(c(1:nflux,1:))
         Generating copy(mesh)
         Generating copyout(a(1:nflux,1:np))
         Generating copyin(mesh%lf(:,:))
         Generating copyout(b(1:np,:))
         Generating copyin(mesh%u(:,:,:))
    247, Loop is parallelizable
         Accelerator kernel generated
         Generating Tesla code
        247, !$acc loop gang ! blockidx%y
             !$acc loop gang, vector(128) ! blockidx%x threadidx%x
    249, Parallelization would require privatization of array b(1:np,:)
    250, Parallelization would require privatization of array b(1:np,:)
    251, Parallelization would require privatization of array b(1:np,:)
         Loop is parallelizable
         Accelerator kernel generated
         Generating Tesla code
        251, !$acc loop gang, vector(128) ! blockidx%x threadidx%x
    256, Loop is parallelizable
         Accelerator kernel generated
         Generating Tesla code
        256, !$acc loop gang ! blockidx%y
             !$acc loop gang, vector(128) ! blockidx%x threadidx%x
    264, Generating update self(c(:,:))
    266, Generating exit data delete(c(:,:),b(:,:),a(:,:))
    267, Generating exit data delete(mesh%u(:,:,:))
    268, Generating exit data delete(mesh%lf(:,:))
    269, Generating exit data delete(mesh)

And here’s the output

 
           16           16            4            8
 Lf
    1.526788125457267        1.6653345369377348E-016   4.1633363423443370E-017 
  -9.7144514654701197E-017  -0.8136324494869269       -1.1102230246251565E-016 
  -5.5511151231257827E-017   2.7755575615628914E-017   0.4007615203116500      
   5.5511151231257827E-017    0.000000000000000       -1.3877787807814457E-016 
  -0.1139171962819898        5.5511151231257827E-017  -1.8041124150158794E-016 
   1.1796119636642288E-016
 u
    1.000000000000000         1.000000000000000         1.000000000000000      
    1.000000000000000         1.000000000000000         1.000000000000000      
    1.000000000000000         1.000000000000000         1.000000000000000      
    1.000000000000000         1.000000000000000         1.000000000000000      
    1.000000000000000         1.000000000000000         1.000000000000000      
    1.000000000000000     
Present table dump for device[1]: NVIDIA Tesla GPU 0, compute capability 3.0
host:0x891e40 device:0x700680000 size:13256 presentcount:1 line:241 name:mesh
FATAL ERROR: data in update host clause was not found on device 1: name=c
 file:/home/vikram/Projects/deepfry/examples/CNS/shocktube/../../..//src/libfr.f90 interpolate_at_flux_points line:264
call to cuMemFreeHost returned error 700: Illegal address during kernel execution

This is just to confirm that Lf and u are not empty.

What is weird is that if I run it repeatedly, the error shows up on some runs but not on others.

The code is here.

https://bitbucket.org/vsingh001/deepfry/src/a8e5619447a7013078fe22bbd81434557d4f97da/src/?at=gpu

in deepfry / src / libfr.f90

I am running

deepfry / examples / CNS / shocktube /

make COMP=pgi

./main.Linux.pgi.atlas.exe shocktube.in

Thanks.

Hi Vsingh,

I was able to build the code and my runs did not encounter the runtime error. Given that the error only occurs sometimes for you, I suspect you have a UMR (uninitialized memory read) or other memory problem. Hence, I built the code using “-O -g” (host-only debugging) and ran it through valgrind.

Valgrind shows that line 361 of system.f90 triggers the following error:

==4982== Conditional jump or move depends on uninitialised value(s)
==4982==    at 0x44D547: system_module_get_common_solution_values_ (system.f90:361)
==4982==    by 0x45E6E8: system_module_get_gradients_ (system.f90:1366)
==4982==    by 0x4169F8: initialize_module_set_initial_values_ (initialize.f90:34)
==4982==    by 0x42C2DB: MAIN_ (main.f90:93)
==4982==    by 0x410D53: main (in /proj/qa/support/vsingh/deepfry/examples/CNS/shocktube/main.Linux.pgi.atlas.exe)
==4982==

After a bit of digging, the problem here is that “ul” is a NaN. The “ul” value is computed from Vxfluid and Vyfluid, which are also NaNs since they in turn are computed via a divide by zero at line 350.

rhofluid = mesh%uflux(left_indices(1) + (j-1), 1, left)
Vxfluid = mesh%uflux(left_indices(1) + (j-1), 2, left)/rhofluid
Vyfluid = mesh%uflux(left_indices(1) + (j-1), 3, left)/rhofluid

I have not investigated why rhofluid is zero, but your core issue may be that mesh%uflux isn’t getting properly initialized or the data is bad.

If the problem persists after you fix this issue, let me know and I’ll investigate further.

Best Regards,
Mat

Hi Mat,

Thanks a lot for going so deep.

The error only occurs sometimes on my machine, but on another machine it seems to occur every time, so I’m not sure what is happening.

Thanks for putting it through valgrind. Unfortunately, the modules you are pointing to will obviously give errors, because they depend on values that are generated in this subroutine, interpolate_at_flux_points(mesh). Those modules get called after it.

Just to confirm, I ran the original code without the openacc changes I have made, and the error does not pop up on valgrind.

Maybe it’s a problem with the data set?

Note that my runs did not include OpenACC and were run without optimization (-g -O0). The behavior you’re seeing is indicative of a UMR but unfortunately I don’t know the root cause.

Hi Mat,

The OpenACC part is heavily modified. Even without the directives, the data set for the subsequent modules won’t work until I have put all the required loops into this subroutine.

This is the starting point from which I start introducing OpenACC.

https://bitbucket.org/vsingh001/deepfry/src/76a7290f26d93a182e90d1d67da6564f339f1ee4/src/?at=limiters

I ran this in debug mode. It gave some errors regarding hpf_malloc, but with gfortran in debug mode even those errors are absent under valgrind.

I did some further testing.

https://bitbucket.org/vsingh001/deepfry/src/7a5d3a23935260098263b4197cf6e6067b2939b6?at=accTest

Now all I do is the following.

  integer(c_int) :: Nvar, Nflux, Np, index
  integer(c_int) :: istat

  real(c_double), allocatable :: A(:,:), B(:, :), C(:, :)
    real(c_double), allocatable :: uflux_tmp(:, :)
    real(c_double), allocatable :: uelem_tmp(:, :)

  integer           :: size_of_real, i, j, k

  Nflux = mesh%Nflux
  Np = mesh%Np
  Nvar = mesh%Nvar

  index = 1

  allocate(A(Nflux, Np))
  allocate(B(Np, Nvar*chunksize))
  allocate(C(Nflux, Nvar*chunksize))

  !$acc enter data copyin(mesh)
  !$acc enter data copyin(mesh%Lf)

  !$acc kernels 
  A(:, :) = mesh%Lf(:, :)
  !$acc end kernels

  !$acc update self(A)

  !$acc exit data delete(mesh%Lf)
  !$acc exit data delete(mesh)

mesh%Lf is a simple matrix defined only once and is very small. It is created in polynomials.f90, get_flux_point_vandermode_matrix.

It has nothing to do with u, which depends on the input HDF5 file. I still get the same error.

Present table dump for device[1]: NVIDIA Tesla GPU 0, compute capability 3.0
host:0x8854a0 device:0x700680000 size:13256 presentcount:1 line:235 name:mesh
host:0x1a0d1a0 device:0x700780000 size:2048 presentcount:1 line:236 name:(null)
FATAL ERROR: data in update host clause was not found on device 1: name=a
 file:/home/vikram/Projects/deepfry/examples/CNS/shocktube/../../..//src/libfr.f90 interpolate_at_flux_points line:242

I think this indicates the problem is in copying derived types.

Hi vsingh,

The error is correct. You’re trying to update “A” but “A” isn’t in a data region.

Try removing the “update self(A)”, and see if that fixes the problem.
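In other words, for that fragment either of these forms is consistent (a rough sketch, untested):

  ! Option 1: drop the update and let the kernels region manage A implicitly;
  ! the implicit copy brings A back to the host at "end kernels"
  !$acc kernels
  A(:, :) = mesh%Lf(:, :)
  !$acc end kernels

  ! Option 2: keep the update, but put A in a data region first
  !$acc enter data create(A)
  !$acc kernels
  A(:, :) = mesh%Lf(:, :)
  !$acc end kernels
  !$acc update self(A)
  !$acc exit data delete(A)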

  • Mat

Hi Mat,

Thanks.

Sorry for the late reply.

I changed the code to


  
  allocate(A(Nflux, Np))
  allocate(B(Np, Nvar*chunksize))
  allocate(C(Nflux, Nvar*chunksize))


  index = 1

  !$acc enter data copyin(mesh) 
  !$acc enter data copyin(mesh%Lf, mesh%u) 
  !$acc enter data create(A, B, C) 

  !$acc kernels 
  do i = 1, Np
      do k = 1, Nvar
          do j = 1, chunksize
              B(i, (j-1)*Nvar+k) = &
                  mesh%u(i, k, (index-1)*chunksize+j)
          end do
      end do
  end do
  A(1:Nflux, 1:Np) = mesh%Lf(1:Nflux, 1:Np)
  !$acc end kernels 

  !$acc host_data use_device(A, B, C)
     call DGEMM('N', 'N', Nflux, Nvar*chunksize, Np, &
          1.0_c_double, A, Nflux, &
          B, Np, &
          0.0_c_double, C, Nflux)
  !$acc end host_data

  !$acc update self(A, B, C)

  !$acc exit data delete(A, B, C) 
  !$acc exit data delete(mesh%Lf, mesh%u) 
  !$acc exit data delete(mesh)

  print*, B(1, 1), A(1, 1), C(1, 1)

  deallocate(C)
  deallocate(B)
  deallocate(A)

It seems to work now. All I did was make all array references explicit.

This is the accelerator output

    238, Generating enter data copyin(mesh)
    239, Generating enter data copyin(mesh%u(:,:,:),mesh%lf(:,:))
    240, Generating enter data create(c(:,:),b(:,:),a(:,:))
    242, Generating copyin(mesh%lf(1:nflux,1:np))
         Generating copyout(a(1:nflux,1:np))
         Generating copy(mesh)
         Generating copyout(b(1:np,:))
         Generating copyin(mesh%u(1:np,1:nvar,1:chunksize))
    243, Loop is parallelizable
    244, Loop carried dependence of b prevents parallelization
         Loop carried backward dependence of b prevents vectorization
    245, Loop carried dependence of b prevents parallelization
         Loop carried backward dependence of b prevents vectorization
         Inner sequential loop scheduled on accelerator
         Accelerator kernel generated
         Generating Tesla code
        243, !$acc loop gang, vector(128) ! blockidx%x threadidx%x
    251, Loop is parallelizable
         Accelerator kernel generated
         Generating Tesla code
        251, !$acc loop gang ! blockidx%y
             !$acc loop gang, vector(128) ! blockidx%x threadidx%x
    261, Generating update self(c(:,:),b(:,:),a(:,:))
    263, Generating exit data delete(c(:,:),b(:,:),a(:,:))
    264, Generating exit data delete(mesh%u(:,:,:),mesh%lf(:,:))
    265, Generating exit data delete(mesh)

Does this mean the loop for assigning B is not parallel? The previous notation was much more elegant, but I guess if that is the only way to make it work, I’ll have to use it.

Hi Vsingh,

Does this mean the loop for assigning B is not parallel?

It is parallel, but since you’re using a computed index in the second dimension, the compiler’s analysis can’t tell that it is. It’s easy, though, to give the compiler a hint by using the “independent” clause.

  !$acc kernels 
  !$acc loop independent collapse(3) 
  do i = 1, Np 
      do k = 1, Nvar 
          do j = 1, chunksize 
              B(i, (j-1)*Nvar+k) = & 
                  mesh%u(i, k, (index-1)*chunksize+j) 
          end do 
      end do 
  end do 
  A(1:Nflux, 1:Np) = mesh%Lf(1:Nflux, 1:Np) 
  !$acc end kernels

I’m glad that everything is working for you.

  • Mat

Hi Mat,

Thanks. The updated code is

  !$acc enter data copyin(mesh) 
  !$acc enter data copyin(mesh%Lf, mesh%u) 
  !$acc enter data create(A, uelem_tmp, C) 
  !$acc enter data copyin(mesh%uflux) 

  !$acc kernels
  A(1:Nflux, 1:Np) = mesh%Lf(1:Nflux, 1:Np)
  !$acc end kernels 

  do index = 1, mesh%Nelements/chunksize
    !$acc parallel loop  independent collapse(3)
    do j = 1, chunksize
        do k = 1, Nvar
            do i = 1, Np
                uelem_tmp(i, (j-1)*Nvar+k) = &
                    mesh%u(i, k, (index-1)*chunksize+j)
            end do
        end do
    end do
    !$acc end parallel

    !$acc host_data use_device(A, uelem_tmp, C)
       call DGEMM('N', 'N', Nflux, Nvar*chunksize, Np, &
            1.0_c_double, A, Nflux, &
            uelem_tmp, Np, &
            0.0_c_double, C, Nflux)
    !$acc end host_data

    !$acc parallel loop independent collapse(3)
    do j = 1, chunksize
        do k = 1, Nvar
            do i = 1, Nflux
                mesh%uflux(i, k, (index-1)*chunksize+j) = &
                    C(i, (j-1)*Nvar+k)
            end do
        end do
    end do
  end do

    !$acc update self(mesh%uflux)

    !$acc exit data delete(mesh%uflux) 
    !$acc exit data delete(A, uelem_tmp, C) 
    !$acc exit data delete(mesh%Lf, mesh%u) 
    !$acc exit data delete(mesh)

I ran the code through nvprof. This is the output I get.

==23663== Profiling result:
Time(%)      Time     Calls       Avg       Min       Max  Name
 35.29%  24.593ms      3629  6.7760us  6.5280us  10.400us  interpolate_at_flux_points_245_gpu
 34.06%  23.732ms      3630  6.5370us  6.2720us  9.1520us  interpolate_at_flux_points_263_gpu
 24.27%  16.913ms      3630  4.6590us  4.5760us  8.0320us  void gemm_kernel2x2_tile_multiple_core<double, bool=1, bool=0, bool=0, bool=0, bool=0>(double*, double const *, double const *, int, int, int, int, int, int, double*, double*, double, double, int)
  4.12%  2.8721ms       232  12.379us  1.0240us  43.135us  [CUDA memcpy HtoD]
  1.81%  1.2591ms        33  38.152us  36.959us  39.007us  [CUDA memcpy DtoH]
  0.44%  309.98us        33  9.3930us  8.9600us  9.9200us  interpolate_at_flux_points_241_gpu

What I want to ask is this: the CUDA memcpy time seems to be very low, yet interpolate_at_flux_points_245_gpu and interpolate_at_flux_points_263_gpu are taking significant time. These are just

B(i, (j-1)*Nvar+k) = & 
                  mesh%u(i, k, (index-1)*chunksize+j)

mesh%uflux(i, k, (index-1)*chunksize+j) = &
                    C(i, (j-1)*Nvar+k)

So why is simply assigning values taking so much time?

Hi Vsingh,

6us per call isn’t bad, but your point is: why is moving data more expensive than computing a DGEMM? Basically, computation is cheap; data movement, including movement on the device, is expensive.

How big are “Np” and “Nflux”? If they are over 128, you might be able to get better performance by putting an “acc loop vector” on the inner “i” loops instead of collapsing all three loops.

    !$acc parallel loop independent collapse(2) 
    do j = 1, chunksize 
        do k = 1, Nvar 
            !$acc loop vector independent
            do i = 1, Nflux 
                mesh%uflux(i, k, (index-1)*chunksize+j) = & 
                    C(i, (j-1)*Nvar+k) 
            end do 
        end do 
    end do

Another idea is to push more compute into each kernel by exchanging the i and k loops so that the k loop is executed sequentially on the device.

    !$acc parallel loop gang independent
    do j = 1, chunksize
        !$acc loop vector independent
        do i = 1, Nflux
            !$acc loop seq
            do k = 1, Nvar
                mesh%uflux(i, k, (index-1)*chunksize+j) = &
                    C(i, (j-1)*Nvar+k)
            end do
        end do
    end do

I have no idea if these will help since I don’t know the size of your loops, but it’s very easy to try different things!

Hope this helps,
Mat

Hi Mat,

Thanks for that.

Nflux and Nvar are not really very big, and usually much, much smaller than 128.

But I will definitely try to experiment based on your suggestions.

Right now, I am trying to manage data movement across subroutines. So, just to try, I moved mesh outside of the subroutine.

      mesh = system%mesh
      !$acc data copy(mesh)
      call interpolate_at_flux_points(mesh)
      !$acc end data

Then I keep the following code inside the subroutine.

 !$acc enter data copyin(mesh) 
  !$acc enter data copyin(mesh%Lf, mesh%u) 
  !$acc enter data create(A, uelem_tmp, C) 
  !$acc enter data copyin(mesh%uflux) 

  !$acc kernels
  A(1:Nflux, 1:Np) = mesh%Lf(1:Nflux, 1:Np)
  !$acc end kernels 

  do index = 1, mesh%Nelements/chunksize
    !$acc parallel loop  independent collapse(3)
    do j = 1, chunksize
        do k = 1, Nvar
            do i = 1, Np
                uelem_tmp(i, (j-1)*Nvar+k) = &
                    mesh%u(i, k, (index-1)*chunksize+j)
            end do
        end do
    end do
    !$acc end parallel

    !$acc host_data use_device(A, uelem_tmp, C)
       call DGEMM('N', 'N', Nflux, Nvar*chunksize, Np, &
            1.0_c_double, A, Nflux, &
            uelem_tmp, Np, &
            0.0_c_double, C, Nflux)
    !$acc end host_data

    !$acc parallel loop independent collapse(3)
    do j = 1, chunksize
        do k = 1, Nvar
            do i = 1, Nflux
                mesh%uflux(i, k, (index-1)*chunksize+j) = &
                    C(i, (j-1)*Nvar+k)
            end do
        end do
    end do
    !$acc end parallel 

  end do

    !$acc update self(mesh%uflux)

    !$acc exit data delete(mesh%uflux) 
    !$acc exit data delete(A, uelem_tmp, C) 
    !$acc exit data delete(mesh%Lf, mesh%u) 
    !$acc exit data

I assumed that the copyin for mesh would not matter if mesh had already been copied, but I get this error:

mesh lives at 0x8b09e0 size 13544 present
Present table dump for device[1]: NVIDIA Tesla GPU 0, compute capability 3.0
host:0x8b09e0 device:0x700680000 size:13544 presentcount:1 line:235 name:mesh
FATAL ERROR: variable in data clause was already present on device 1: name=mesh

Clearly my assumption is incorrect. But I don’t understand how to specify inside the subroutine that mesh is already present, since if I remove the data clauses for mesh, I get this error:

call to cuStreamSynchronize returned error 700: Illegal address during kernel execution
call to cuMemFreeHost returned error 700: Illegal address during kernel execution

Sorry for reviving an old thread, but I decided to create the most simplified form of the code that I could, to test the OpenACC parts. Since it is completely self-contained, there’s no possibility of data corruption.

program cuBLAS_example

  use iso_c_binding, only: c_double, c_int
  use cublas
  use openacc
  implicit none

  integer, parameter :: nelems = 50000
  integer, parameter :: Np = 16, Nvar = 4, Nflux = 16
  integer, parameter :: chunkSize = 10

  type mesh2d
      real(c_double) :: u(Np, Nvar, nelems)
      real(c_double) :: ux(Np, Nvar, nelems)
      real(c_double) :: uflux(Nflux, Nvar, nelems)
      real(c_double) :: ucommon(Nflux, Nvar, nelems)
  end type mesh2d

  real(c_double)   :: Lgrad(Np, Np), phi(Np, Nflux)
  real(c_double)   :: uelem_tmp(Np, Nvar*chunkSize), ucorr_tmp(Nflux, Nvar*chunkSize)
  real(c_double)   :: ugrad_tmp(Np, Nvar*chunkSize)

  type(mesh2d)     :: mesh

  integer(c_int)      :: i, j, k, iter


  do i = 1, nelems
      do j = 1, Nvar
          do k = 1, Np
              mesh%u(k, j, i) = i+j+k
          end do
          do k = 1, Nflux
              mesh%uflux(k, j, i) = i-j-k
              mesh%ucommon(k, j, i) = i-j+k
          end do
      end do
  end do

  do j = 1, Np
    do k = 1, Np
        Lgrad(j, k) = k*j
    end do
    do k = 1, Nflux
        phi(j, k) = k*j + k + j
    end do
  end do

  !$acc enter data copyin(mesh)
  !$acc enter data copyin(mesh%u, mesh%uflux, mesh%ucommon)
  !$acc enter data copyin(mesh%ux)
  !$acc enter data create(uelem_tmp, ucorr_tmp, ugrad_tmp, Lgrad, phi)
  do iter = 1, nelems/chunkSize

      !$acc parallel loop present(mesh, mesh%u, uelem_tmp)
      do i = 1, chunkSize
          do j = 1, Nvar
              do k = 1, Np
                  uelem_tmp(k, Nvar*(i-1) + j) = mesh%u(k, j, (iter-1)*chunkSize + i)
              end do
          end do
      end do
      !$acc end parallel

      !$acc parallel loop present(mesh, mesh%uflux, mesh%ucommon, ucorr_tmp)
      do i = 1, chunkSize
          do j = 1, Nvar
              do k = 1, Nflux
                  ucorr_tmp(k, Nvar*(i-1) + j) = mesh%uflux(k, j, (iter-1)*chunkSize + i) &
                                            - mesh%ucommon(k, j, (iter-1)*chunkSize + i)

              end do
          end do
      end do
      !$acc end parallel

     !$acc host_data use_device(Lgrad, uelem_tmp, ugrad_tmp)
     call cublasDGEMM('N', 'N', Np, Nvar*chunkSize, Np, &
         1.0_c_double, Lgrad, Np, uelem_tmp, Np, &
         0.0_c_double, ugrad_tmp, Np)
     !$acc end host_data


      !$acc host_data use_device(phi, uelem_tmp, ugrad_tmp)
      call cublasDGEMM('N', 'N', Np, Nvar*chunkSize, Np, &
         1.0_c_double, phi, Np, uelem_tmp, Np, &
         1.0_c_double, ugrad_tmp, Np)
      !$acc end host_data

      !$acc parallel loop present(mesh, mesh%ux, ugrad_tmp)      
      do i = 1, chunkSize
          do j = 1, Nvar
              do k = 1, Nflux
                  mesh%ux(k, j, (iter-1)*chunkSize + i) = ugrad_tmp(k, Nvar*(i-1) + j)
              end do
          end do
      end do
      !$acc end parallel

   end do

   !$acc update self(mesh%ux)

   !$acc exit data delete(mesh%ucommon, mesh%uflux, mesh%u, mesh%ux)
   !$acc exit data delete(uelem_tmp, ugrad_tmp, ucorr_tmp)
   !$acc exit data


end program cuBLAS_example

Here’s the compiler output

pgfortran -o speed -acc -Minfo=accel speedup.f90  -L/usr/local/cuda/lib64 -lcublas -Mcuda
cublas_example:
     49, Generating enter data copyin(mesh)
     50, Generating enter data copyin(mesh%ucommon(:,:,:),mesh%uflux(:,:,:),mesh%u(:,:,:))
     51, Generating enter data copyin(mesh%ux(:,:,:))
     52, Generating enter data create(phi(:,:),lgrad(:,:),ugrad_tmp(:,:),ucorr_tmp(:,:),uelem_tmp(:,:))
     55, Generating present(mesh,mesh%u(:,:,:),uelem_tmp(:,:))
         Accelerator kernel generated
         Generating Tesla code
         56, !$acc loop gang ! blockidx%x
         58, !$acc loop vector(128) ! threadidx%x
     57, Loop is parallelizable
     58, Loop is parallelizable
     65, Generating present(mesh,mesh%uflux(:,:,:),mesh%ucommon(:,:,:),ucorr_tmp(:,:))
         Accelerator kernel generated
         Generating Tesla code
         66, !$acc loop gang ! blockidx%x
         68, !$acc loop vector(128) ! threadidx%x
     67, Loop is parallelizable
     68, Loop is parallelizable
     92, Generating present(mesh,mesh%ux(:,:,:),ugrad_tmp(:,:))
         Accelerator kernel generated
         Generating Tesla code
         93, !$acc loop gang ! blockidx%x
         Accelerator restriction: scalar variable live-out from loop: mesh
     94, Accelerator restriction: scalar variable live-out from loop: mesh
     95, Accelerator restriction: scalar variable live-out from loop: mesh
    104, Generating update self(mesh%ux(:,:,:))
    106, Generating exit data delete(mesh%ux(:,:,:),mesh%u(:,:,:),mesh%uflux(:,:,:),mesh%ucommon(:,:,:))
    107, Generating exit data delete(ucorr_tmp(:,:),ugrad_tmp(:,:),uelem_tmp(:,:))

and here’s the output

(null) lives at 0x608500 size 25600000 present
Present table dump for device[1]: NVIDIA Tesla GPU 0, compute capability 3.0
host:0x608500 device:0x700680000 size:102400000 presentcount:1 line:49 name:mesh
FATAL ERROR: variable in data clause was already present on device 1: name=(null)
 file:/home/Experiments/Nvidia/OpenACC/acc_demo/speedup.f90 cublas_example line:50

I guess if all the errors can be sorted out at this level, then I can build everything else out of this.

A bump just in case nobody noticed this.

Here “mesh” is a fixed-size type whose array members’ sizes are also known at compile time. Hence, when you put mesh into the data region’s “copyin” clause, the whole structure is created and then copied over to the device. So when you put “mesh%u” in a data region, the runtime complains that it’s already there. Adding “mesh%u” is only necessary if “u” is an allocatable.
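For reference, a minimal sketch of the allocatable case, where the member clauses are needed (a hypothetical type, not your fixed-size test type):

  type mesh2d_alloc
      real(c_double), allocatable :: u(:, :, :)
  end type mesh2d_alloc

  ! with an allocatable member, copy the container first, then the member:
  !$acc enter data copyin(mesh)     ! the derived-type container itself
  !$acc enter data copyin(mesh%u)   ! needed here: creates u's data on the device and attaches it

  !$acc exit data delete(mesh%u)
  !$acc exit data delete(mesh)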

There is a compiler issue here in that it should detect that “mesh” is already present and do nothing. This was fixed in 16.4, and your code will run as is. Prior to 16.4, the solution is not to put the data members in a data region.

Note that your code will get incorrect answers since “Lgrad” and “Phi” are not initialized on the device. To fix, put these variables in a copyin clause instead of a create clause, or use the update directive to set the values.
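A quick sketch of the second option (create, then update the device copy):

  !$acc enter data create(Lgrad, phi)
  ! ... initialize Lgrad and phi on the host ...
  !$acc update device(Lgrad, phi)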

For better portability, I’d also recommend calling “DGEMM” instead of “cublasDGEMM”. If you pass DGEMM device arrays (such as from a host_data region), the compiler will use the cuBLAS version; otherwise it will use the host BLAS. This allows your code to run with or without OpenACC.

% cat test.F90
program cuBLAS_example

   use iso_c_binding, only: c_double, c_int
#ifdef _OPENACC
   use cublas
   use openacc
#endif
   implicit none

   integer, parameter :: nelems = 50000
   integer, parameter :: Np = 16, Nvar = 4, Nflux = 16
   integer, parameter :: chunkSize = 10

   type mesh2d
       real(c_double) :: u(Np, Nvar, nelems)
       real(c_double) :: ux(Np, Nvar, nelems)
       real(c_double) :: uflux(Nflux, Nvar, nelems)
       real(c_double) :: ucommon(Nflux, Nvar, nelems)
   end type mesh2d

   real(c_double)   :: Lgrad(Np, Np), phi(Np, Nflux)
   real(c_double)   :: uelem_tmp(Np, Nvar*chunkSize), ucorr_tmp(Nflux, Nvar*chunkSize)
   real(c_double)   :: ugrad_tmp(Np, Nvar*chunkSize)

   type(mesh2d)     :: mesh

   integer(c_int)      :: i, j, k, iter


   do i = 1, nelems
       do j = 1, Nvar
           do k = 1, Np
               mesh%u(k, j, i) = i+j+k
           end do
           do k = 1, Nflux
               mesh%uflux(k, j, i) = i-j-k
               mesh%ucommon(k, j, i) = i-j+k
           end do
       end do
   end do

   do j = 1, Np
     do k = 1, Np
         Lgrad(j, k) = k*j
     end do
     do k = 1, Nflux
         phi(j, k) = k*j + k + j
     end do
   end do

   !$acc enter data copyin(mesh)
   !$acc enter data copyin(Lgrad, phi)
   !$acc enter data create(uelem_tmp, ucorr_tmp, ugrad_tmp)
   do iter = 1, nelems/chunkSize

       !$acc parallel loop present(mesh, mesh%u, uelem_tmp)
       do i = 1, chunkSize
           do j = 1, Nvar
               do k = 1, Np
                   uelem_tmp(k, Nvar*(i-1) + j) = mesh%u(k, j, (iter-1)*chunkSize + i)
               end do
           end do
       end do
       !$acc end parallel

       !$acc parallel loop present(mesh, mesh%uflux, mesh%ucommon, ucorr_tmp)
       do i = 1, chunkSize
           do j = 1, Nvar
               do k = 1, Nflux
                   ucorr_tmp(k, Nvar*(i-1) + j) = mesh%uflux(k, j, (iter-1)*chunkSize + i) &
                                             - mesh%ucommon(k, j, (iter-1)*chunkSize + i)

               end do
           end do
       end do
       !$acc end parallel


      !$acc host_data use_device(Lgrad, uelem_tmp, ugrad_tmp)
      call DGEMM('N', 'N', Np, Nvar*chunkSize, Np, &
          1.0_c_double, Lgrad, Np, uelem_tmp, Np, &
          0.0_c_double, ugrad_tmp, Np)
      !$acc end host_data


       !$acc host_data use_device(phi, uelem_tmp, ugrad_tmp)
       call DGEMM('N', 'N', Np, Nvar*chunkSize, Np, &
          1.0_c_double, phi, Np, uelem_tmp, Np, &
          1.0_c_double, ugrad_tmp, Np)
       !$acc end host_data

       !$acc parallel loop present(mesh, mesh%ux, ugrad_tmp,phi)
       do i = 1, chunkSize
           do j = 1, Nvar
               do k = 1, Nflux
                   mesh%ux(k, j, (iter-1)*chunkSize + i) = ugrad_tmp(k, Nvar*(i-1) + j)
               end do
           end do
       end do
       !$acc end parallel

    end do

    !$acc update self(mesh%ux)
    !$acc exit data delete(mesh)
    !$acc exit data delete(uelem_tmp, ugrad_tmp, ucorr_tmp)
    do i = 1, 10
          print *, mesh%ux(1, 1, i)
    end do

 end program cuBLAS_example

% pgf90 -acc -Minfo=accel -Mcuda -Mcudalib:cublas test.F90 -V16.1; a.out
cublas_example:
     51, Generating enter data copyin(mesh)
     52, Generating enter data copyin(phi(:,:),lgrad(:,:))
     53, Generating enter data create(ugrad_tmp(:,:),ucorr_tmp(:,:),uelem_tmp(:,:))
     56, Generating present(mesh,mesh%u(:,:,:),uelem_tmp(:,:))
         Accelerator kernel generated
         Generating Tesla code
         57, !$acc loop gang ! blockidx%x
         59, !$acc loop vector(128) ! threadidx%x
     58, Loop is parallelizable
     59, Loop is parallelizable
     66, Generating present(mesh,mesh%uflux(:,:,:),mesh%ucommon(:,:,:),ucorr_tmp(:,:))
         Accelerator kernel generated
         Generating Tesla code
         67, !$acc loop gang ! blockidx%x
         69, !$acc loop vector(128) ! threadidx%x
     68, Loop is parallelizable
     69, Loop is parallelizable
     92, Generating present(mesh,mesh%ux(:,:,:),ugrad_tmp(:,:),phi(:,:))
         Accelerator kernel generated
         Generating Tesla code
         93, !$acc loop gang ! blockidx%x
         95, !$acc loop vector(128) ! threadidx%x
     94, Loop is parallelizable
     95, Loop is parallelizable
    104, Generating update self(mesh%ux(:,:,:))
    105, Generating exit data delete(mesh)
    106, Generating exit data delete(ucorr_tmp(:,:),ugrad_tmp(:,:),uelem_tmp(:,:))
    5472.000000000000
    5896.000000000000
    6320.000000000000
    6744.000000000000
    7168.000000000000
    7592.000000000000
    8016.000000000000
    8440.000000000000
    8864.000000000000
    9288.000000000000
% pgf90 -fast -lblas  test.F90 -V16.1 ; a.out
    5472.000000000000
    5896.000000000000
    6320.000000000000
    6744.000000000000
    7168.000000000000
    7592.000000000000
    8016.000000000000
    8440.000000000000
    8864.000000000000
    9288.000000000000

Hope this helps,
Mat

Thanks a lot Mat.

That seems to have sorted everything out.

I had read that you could just use DGEMM, but it didn’t work for me at first. It turns out I had to rename the files to *.F90 instead of *.f90.

Thanks again.

The difference between “.F90” and “.f90” is that preprocessing is automatically applied to “.F90” files, while you need to add the flag “-Mpreprocess” for “.f90” files to invoke the preprocessor.
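For example, compiling the lowercase-extension file with preprocessing would look something like:

% pgfortran -acc -Minfo=accel -Mcuda -Mcudalib:cublas -Mpreprocess test.f90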

I used preprocessing here to conditionally use the cublas module. The DGEMM selection is done via a generic interface, so it is unrelated to preprocessing.