Memory read error when using csrmv with transpose operation

Hi,

I am trying to use cusparse csrmv routine with CUSPARSE_OPERATION_TRANSPOSE. The section of the code that produces the error looks as the following:

status = cusparseDcsrmv(this%h_P,CUSPARSE_OPERATION_TRANSPOSE,ntb,K_dim,this%nnz,&
        1._wp,this%descr_P,this%P_Val_d,this%P_RowPtr_d,this%P_ColInd_d,this%Fx_d,0._wp,&
        this%F_meshPx_di)
if (status /= CUSPARSE_STATUS_SUCCESS) print'(" csrmv x-component Error: ",i)',status

When running under cuda-memcheck (mpirun cuda-memcheck --language fortran ./BDpack), I get many errors of the same type:

========= Invalid __shared__ read of size 8
=========     at 0x000008f0 in void csrMvT_hyb_kernel<double, double, double, int=7, int=2, int=8, int=5, int=0>(cusparseCsrMvParams<double, double, double>, int*)
=========     by thread (8,1,1) in block (23,1,1)
=========     Address 0xfffffe30 is out of bounds
=========     Saved host backtrace up to driver entry point at kernel launch time
=========     Host Frame:/usr/lib64/libcuda.so.1 (cuLaunchKernel + 0x2cd) [0x24d9dd]
=========     Host Frame:/opt/packages/cuda/9.0/lib64/libcusparse.so.9.0 [0x35cadb]
=========     Host Frame:/opt/packages/cuda/9.0/lib64/libcusparse.so.9.0 [0x37a23e]
=========     Host Frame:/opt/packages/cuda/9.0/lib64/libcusparse.so.9.0 [0x18d243]
=========     Host Frame:/opt/packages/cuda/9.0/lib64/libcusparse.so.9.0 [0x18ed51]
=========     Host Frame:/opt/packages/pgi/linux86-64/18.1/lib/libcudafor.so (cusparsedcsrmv_sethpm_ + 0xba) [0x1f881a]
=========     Host Frame:./BDpack [0x103f11]
=========     Host Frame:./BDpack [0x102ea2]
=========     Host Frame:./BDpack [0x107349]
=========     Host Frame:./BDpack [0x10a1f9]
=========     Host Frame:./BDpack [0xfecbc]
=========     Host Frame:./BDpack [0xfcac8]
=========     Host Frame:./BDpack [0x8e738]
=========     Host Frame:./BDpack [0xa94c]
=========     Host Frame:./BDpack [0xa7f4]
=========     Host Frame:/usr/lib/gcc/x86_64-redhat-linux/4.8.5/../../../../lib64/libc.so.6 (__libc_start_main + 0xf5) [0x21c05]
=========     Host Frame:./BDpack [0x80c9]

In my tests, the matrix dimension is {ntb, K_dim} = {2000, 125} and the number of non-zero element is 128000. I made sure that there is suffiecint memory before calling csrmv – When using status = cudaMemGetInfo( free, total ), I get:

 free              16483483648 total              17071734784

A few side notes:

1- The program produces the correct results and no error if I make csc layout and then use CUSPARSE_OPERATION_NON_TRANSPOSE, but unfortunately, the conversion is very slow for the typical large values of K_dim and I would prefer to try csr layout and use TRANSPOSE.
2- The program works fine for small values of K_dim. In some cases it runs but gives wrong results.
3- I am using pgi-18.1 and cuda 9.0
4- During compilation, I get many infos regarding possible copy in, copy out:

Possible copy in and copy out of p_colptr_d in call to cusparsedcsrmv_sethpm
Possible copy in and copy out of f_meshpx_di in call to cusparsedcsrmv_sethpm
Possible copy in and copy out of fx_d in call to cusparsedcsrmv_sethpm
Possible copy in and copy out of p_rowind_d in call to cusparsedcsrmv_s

I tried to make a MWE, but I couldn’t reproduce the error when calling csrmv in a simple program and on a matrix with similar dimensions. My code is on github, however, and I will try to provide any other information as needed.

I would greatly appreciate your help and advise.

Best,
Amir

Hi Amir,

Unfortunately, we’re not really sure what’s wrong. You call looks fine so we suspect somethings wrong with how the sparse matrix arrays are declared or possibly their use is inconsistent with the declarations.

The variable names ending with “_d” imply that you’re declaring these as CUDA device arrays, but you wouldn’t get the “Possible copy in …” warning messages if they we’re. (These messages would only occur with host arrays). How are you (if you are) managing the device data?

My code is on github

Can you post the link? We can take a look and see what we can determine.

-Mat

Hi Mat,

Many thanks for your reply.

I define device arrays inside the object, i.e., device arrays are type bound variables. In the call that I had in the first message body, the object “this” contained the sparse variables.

I am not sure if I quite understand your comment about the possible issue with the declaration of sparse arrays, but let me point out one more thing:
The exact same csrmv call (with all the variables being the same) is done later in the code (line 1276-1278) but with the CUSPARSE_OPERATION_NO_TRANSPOSE. I would not get any error if I use CSC layout (and no transpose) in the first calls and csr (no transpose) in the second call. This made me think that the issue may be related to CUSPARSE_OPERATION_TRANSPOSE. Since the issue is apparently coming from an internal routine (csrMvT_hyb_kernel), I wonder if there are other operations needed because of using transpose?

I hope that I could explain my thoughts, my apologies for any complication and if it is wrong.

If the copy in copy out messages only occur for host arrays, I am very confused about how this could have happened. They all have “device” attribute. Could this be because of using object and referring to the variable through the object (this%array_d)?

Thanks a lot again for your help.
Amir

Dear all,

I wonder how long I need to wait to get help on the issues presented here? My research really depends on this.

Do I need to post the question in another forum?

Thanks,
Amir

Hi Amir,

I was able to a take a look at your code. It took a bit of effort to get setup and I wasn’t sure which input to use but was able to reproduce the error using the “projects/semidilute_dumb_shear” input.

Note that I’m building with PGI 19.1 and CUDA 10.0’s cusparse.

I ran with both the CUSPARSE_OPERATION_TRANSPOSE and CUSPARSE_OPERATION_NON_TRANSPOSE version of the code (I uncommented out the lines around 1143 in diffcalc_cumod.cuf to get the CUSPARSE_OPERATION_NON_TRANSPOSE version). I also added a counter to how many times this second was called.

In both cases, the code failed with a cusparse execution error (error 6) after 10191 times through this section of code in the y and z components. This leads me to believe that the cuSparse codes are fine (either version) and something else in your code is causing the problem.

Error output from both versions:

 free              13714980864 total              16914055168
 BEFORE cusparseDcsrmv:         10191
 csrmv y-component Error:            6
 csrmv z-component Error:            6
 Error!!: Problem in FORWARD cuFFT.
  Error, status = 6
Warning: ieee_underflow is signaling
Warning: ieee_inexact is signaling
FORTRAN STOP

The “Possible copy in …” warning messages is the compiler warning telling you that it may need to create a temp array to pass in a non-unit 1 array (non-contiguous) array to a subroutine. Since these are pointers, the compiler must assume that they are non-contiguous and defer the this test if it is contiguous until runtime. If you add the “contiguous” attribute to the declaration of the pointers (assuming that they are indeed contiguous), the warnings will go away.

For example in hi_cumod.cuf:

complex(wp),device,pointer,contiguous :: F_mesh_do(:,:) => null()

Interestingly, when I add “contiguous” to all the pointer in the warnings, the code fails sooner (after 33 iterations) instead of at 10191. Maybe the arrays aren’t actually contiguous? In this case, it would problematic since the compiler must create a contiguous temp array to pass into a subroutine.

Unfortunately, I don’t have time to fully debug your program, but hopefully this gives you enough to go on.

-Mat

Hi Mat,

I really appreciate your time and effort. Your input was really insightful. I am trying to fully resolve the problem, and I wish to get your help.

I went back and did further debugging as you suggested and I could reproduce that the attribute “contiguous” changes the behavior of the program. So far, I was really thinking I am using them in a contiguous fashion. For example, in line 364-367 of “hi_cumod.cuf”, I use:

    allocate(this%F_mesh_di(0:K_dim-1,3))
    this%F_meshPx_di => this%F_mesh_di(:,1)
    this%F_meshPy_di => this%F_mesh_di(:,2)
    this%F_meshPz_di => this%F_mesh_di(:,3)

or in line 214-216, 225, 261, and 293 of “dcmp_cumod” https://github.com/amir-saadat/BDpack/blob/master/src/semidilute_bs/cuda/dcmp_cumod.cuf/url I have:

      VPt_d => V_d(:,1:m)
      e1Pt_d => e1_d(1:m)
      lamVPt => lamV(1:m)
      ...
      VP_d => V_d(:,k)
      ....
      VP_d => V_d(:,k-1)

Is it considered as non-contiguous?

Thanks,
Amir

Hi Amir,

Yes, these all look contiguous, so I still don’t understand where the failures could be coming from.

-Mat

Hi Mat,

Thanks again for your reply.

In the process of better understanding the pointer behavior, I wrote the following code:

program main
   use cudafor
   use cusparse
   implicit none

   real(8),allocatable :: arr_h(:,:),arr_test_h(:)
   real(8),allocatable,device,target :: arr_d(:,:)
   real(8),pointer,device :: ptr_d(:)

   allocate(arr_h(4,4),arr_d(4,4),arr_test_h(4))
   arr_h=0.0
   arr_h(1:4,1)=1.0

   arr_d=arr_h

   print*,'first column:'
   print'(4(f10.5,1x))',arr_h(:,1)
   print*,'second column:'
   print'(4(f10.5,1x))',arr_h(:,2)

   ptr_d => arr_d(:,1)
   arr_test_h=ptr_d
   print*,'supposed to be the first column (all ones):'
   print'(4(f10.5,1x))',arr_test_h
   ptr_d => arr_d(:,2)
   arr_test_h=ptr_d
   print*,'supposed to be the second column (all zeros):'
   print'(4(f10.5,1x))',arr_test_h

end program main

And I get the following:

first column:
   1.00000    1.00000    1.00000    1.00000
 second column:
   0.00000    0.00000    0.00000    0.00000
 supposed to be the first column (all ones):
   1.00000    1.00000    1.00000    1.00000
 supposed to be the second column (all zeros):
   1.00000    1.00000    1.00000    1.00000

Would you please take a look and see if this is expected.

Thanks again,
Amir

Hi Amir,

This does appear to be a compiler issue where the wrong bounds are being used for the second device-to-host copy. I have added a problem report, TPR#26958, and sent it to engineering for further investigation.

The work around would be to explicitly set the bounds of “ptr_d” when doing the copy. To be safe, you may also want to include the bounds when setting the pointer as well.

For example:

% cat test2.CUF
program main
   use cudafor
   use cusparse
   implicit none

   real(8),allocatable :: arr_h(:,:),arr_test_h(:)
   real(8),allocatable,device,target :: arr_d(:,:)
   real(8),pointer,device :: ptr_d(:)

   allocate(arr_h(4,4),arr_d(4,4),arr_test_h(4))
   arr_h=0.0
   arr_h(1:4,1)=1.0

   arr_d=arr_h

   print*,'first column:'
   print'(4(f10.5,1x))',arr_h(:,1)
   print*,'second column:'
   print'(4(f10.5,1x))',arr_h(:,2)

   ptr_d => arr_d(1:4,1)
   arr_test_h=ptr_d(1:4)
   print*,'supposed to be the first column (all ones):'
   print'(4(f10.5,1x))',arr_test_h
   ptr_d => arr_d(1:4,2)
#ifdef FAILS
   arr_test_h=ptr_d
#else
! WORKS
   arr_test_h=ptr_d(1:4)
#endif
   print*,'supposed to be the second column (all zeros):'
   print'(4(f10.5,1x))',arr_test_h

end program main

% pgfortran -V19.1 test2.CUF -DFAILS ; a.out
 first column:
   1.00000    1.00000    1.00000    1.00000
 second column:
   0.00000    0.00000    0.00000    0.00000
 supposed to be the first column (all ones):
   1.00000    1.00000    1.00000    1.00000
 supposed to be the second column (all zeros):
   1.00000    1.00000    1.00000    1.00000

% pgfortran -V19.1 test2.CUF ; a.out
 first column:
   1.00000    1.00000    1.00000    1.00000
 second column:
   0.00000    0.00000    0.00000    0.00000
 supposed to be the first column (all ones):
   1.00000    1.00000    1.00000    1.00000
 supposed to be the second column (all zeros):
   0.00000    0.00000    0.00000    0.00000

Hopefully the applying the workaround to your full source will also fix the issue.

-Mat