Memory read error when using csrmv with transpose operation

asaadat · February 19, 2019, 6:13pm

Hi,

I am trying to use cusparse csrmv routine with CUSPARSE_OPERATION_TRANSPOSE. The section of the code that produces the error looks as the following:

status = cusparseDcsrmv(this%h_P,CUSPARSE_OPERATION_TRANSPOSE,ntb,K_dim,this%nnz,&
        1._wp,this%descr_P,this%P_Val_d,this%P_RowPtr_d,this%P_ColInd_d,this%Fx_d,0._wp,&
        this%F_meshPx_di)
if (status /= CUSPARSE_STATUS_SUCCESS) print'(" csrmv x-component Error: ",i)',status

When running under cuda-memcheck (mpirun cuda-memcheck --language fortran ./BDpack), I get many errors of the same type:

========= Invalid __shared__ read of size 8
=========     at 0x000008f0 in void csrMvT_hyb_kernel<double, double, double, int=7, int=2, int=8, int=5, int=0>(cusparseCsrMvParams<double, double, double>, int*)
=========     by thread (8,1,1) in block (23,1,1)
=========     Address 0xfffffe30 is out of bounds
=========     Saved host backtrace up to driver entry point at kernel launch time
=========     Host Frame:/usr/lib64/libcuda.so.1 (cuLaunchKernel + 0x2cd) [0x24d9dd]
=========     Host Frame:/opt/packages/cuda/9.0/lib64/libcusparse.so.9.0 [0x35cadb]
=========     Host Frame:/opt/packages/cuda/9.0/lib64/libcusparse.so.9.0 [0x37a23e]
=========     Host Frame:/opt/packages/cuda/9.0/lib64/libcusparse.so.9.0 [0x18d243]
=========     Host Frame:/opt/packages/cuda/9.0/lib64/libcusparse.so.9.0 [0x18ed51]
=========     Host Frame:/opt/packages/pgi/linux86-64/18.1/lib/libcudafor.so (cusparsedcsrmv_sethpm_ + 0xba) [0x1f881a]
=========     Host Frame:./BDpack [0x103f11]
=========     Host Frame:./BDpack [0x102ea2]
=========     Host Frame:./BDpack [0x107349]
=========     Host Frame:./BDpack [0x10a1f9]
=========     Host Frame:./BDpack [0xfecbc]
=========     Host Frame:./BDpack [0xfcac8]
=========     Host Frame:./BDpack [0x8e738]
=========     Host Frame:./BDpack [0xa94c]
=========     Host Frame:./BDpack [0xa7f4]
=========     Host Frame:/usr/lib/gcc/x86_64-redhat-linux/4.8.5/../../../../lib64/libc.so.6 (__libc_start_main + 0xf5) [0x21c05]
=========     Host Frame:./BDpack [0x80c9]

In my tests, the matrix dimension is {ntb, K_dim} = {2000, 125} and the number of non-zero element is 128000. I made sure that there is suffiecint memory before calling csrmv – When using status = cudaMemGetInfo( free, total ), I get:

 free              16483483648 total              17071734784

A few side notes:

1- The program produces the correct results and no error if I make csc layout and then use CUSPARSE_OPERATION_NON_TRANSPOSE, but unfortunately, the conversion is very slow for the typical large values of K_dim and I would prefer to try csr layout and use TRANSPOSE.
2- The program works fine for small values of K_dim. In some cases it runs but gives wrong results.
3- I am using pgi-18.1 and cuda 9.0
4- During compilation, I get many infos regarding possible copy in, copy out:

Possible copy in and copy out of p_colptr_d in call to cusparsedcsrmv_sethpm
Possible copy in and copy out of f_meshpx_di in call to cusparsedcsrmv_sethpm
Possible copy in and copy out of fx_d in call to cusparsedcsrmv_sethpm
Possible copy in and copy out of p_rowind_d in call to cusparsedcsrmv_s

I tried to make a MWE, but I couldn’t reproduce the error when calling csrmv in a simple program and on a matrix with similar dimensions. My code is on github, however, and I will try to provide any other information as needed.

I would greatly appreciate your help and advise.

Best,
Amir

MatColgrove · February 20, 2019, 9:04pm

Hi Amir,

Unfortunately, we’re not really sure what’s wrong. You call looks fine so we suspect somethings wrong with how the sparse matrix arrays are declared or possibly their use is inconsistent with the declarations.

The variable names ending with “_d” imply that you’re declaring these as CUDA device arrays, but you wouldn’t get the “Possible copy in …” warning messages if they we’re. (These messages would only occur with host arrays). How are you (if you are) managing the device data?

My code is on github

Can you post the link? We can take a look and see what we can determine.

-Mat

asaadat · February 21, 2019, 3:29am

Hi Mat,

Many thanks for your reply.

I define device arrays inside the object, i.e., device arrays are type bound variables. In the call that I had in the first message body, the object “this” contained the sparse variables.

The section of the code that causes the problem is in BDpack/src/semidilute_bs/cuda at master · amir-saadat/BDpack · GitHub. I can provide further instruction about the installation as needed.
The main object is defined in hi_cumod.cuf line 51-56.
The problem happens in diffcalc_cumod.cuf, in line 1156 inside the subroutine calcDF_recip.

I am not sure if I quite understand your comment about the possible issue with the declaration of sparse arrays, but let me point out one more thing:
The exact same csrmv call (with all the variables being the same) is done later in the code (line 1276-1278) but with the CUSPARSE_OPERATION_NO_TRANSPOSE. I would not get any error if I use CSC layout (and no transpose) in the first calls and csr (no transpose) in the second call. This made me think that the issue may be related to CUSPARSE_OPERATION_TRANSPOSE. Since the issue is apparently coming from an internal routine (csrMvT_hyb_kernel), I wonder if there are other operations needed because of using transpose?

I hope that I could explain my thoughts, my apologies for any complication and if it is wrong.

If the copy in copy out messages only occur for host arrays, I am very confused about how this could have happened. They all have “device” attribute. Could this be because of using object and referring to the variable through the object (this%array_d)?

Thanks a lot again for your help.
Amir

asaadat · March 3, 2019, 10:37pm

Dear all,

I wonder how long I need to wait to get help on the issues presented here? My research really depends on this.

Do I need to post the question in another forum?

Thanks,
Amir

MatColgrove · March 4, 2019, 9:04pm

Hi Amir,

I was able to a take a look at your code. It took a bit of effort to get setup and I wasn’t sure which input to use but was able to reproduce the error using the “projects/semidilute_dumb_shear” input.

Note that I’m building with PGI 19.1 and CUDA 10.0’s cusparse.

I ran with both the CUSPARSE_OPERATION_TRANSPOSE and CUSPARSE_OPERATION_NON_TRANSPOSE version of the code (I uncommented out the lines around 1143 in diffcalc_cumod.cuf to get the CUSPARSE_OPERATION_NON_TRANSPOSE version). I also added a counter to how many times this second was called.

In both cases, the code failed with a cusparse execution error (error 6) after 10191 times through this section of code in the y and z components. This leads me to believe that the cuSparse codes are fine (either version) and something else in your code is causing the problem.

Error output from both versions:

 free              13714980864 total              16914055168
 BEFORE cusparseDcsrmv:         10191
 csrmv y-component Error:            6
 csrmv z-component Error:            6
 Error!!: Problem in FORWARD cuFFT.
  Error, status = 6
Warning: ieee_underflow is signaling
Warning: ieee_inexact is signaling
FORTRAN STOP

The “Possible copy in …” warning messages is the compiler warning telling you that it may need to create a temp array to pass in a non-unit 1 array (non-contiguous) array to a subroutine. Since these are pointers, the compiler must assume that they are non-contiguous and defer the this test if it is contiguous until runtime. If you add the “contiguous” attribute to the declaration of the pointers (assuming that they are indeed contiguous), the warnings will go away.

For example in hi_cumod.cuf:

complex(wp),device,pointer,contiguous :: F_mesh_do(:,:) => null()

Interestingly, when I add “contiguous” to all the pointer in the warnings, the code fails sooner (after 33 iterations) instead of at 10191. Maybe the arrays aren’t actually contiguous? In this case, it would problematic since the compiler must create a contiguous temp array to pass into a subroutine.

Unfortunately, I don’t have time to fully debug your program, but hopefully this gives you enough to go on.

-Mat

asaadat · March 7, 2019, 6:56pm

Hi Mat,

I really appreciate your time and effort. Your input was really insightful. I am trying to fully resolve the problem, and I wish to get your help.

I went back and did further debugging as you suggested and I could reproduce that the attribute “contiguous” changes the behavior of the program. So far, I was really thinking I am using them in a contiguous fashion. For example, in line 364-367 of “hi_cumod.cuf”, I use:

    allocate(this%F_mesh_di(0:K_dim-1,3))
    this%F_meshPx_di => this%F_mesh_di(:,1)
    this%F_meshPy_di => this%F_mesh_di(:,2)
    this%F_meshPz_di => this%F_mesh_di(:,3)

or in line 214-216, 225, 261, and 293 of “dcmp_cumod” https://github.com/amir-saadat/BDpack/blob/master/src/semidilute_bs/cuda/dcmp_cumod.cuf/url I have:

      VPt_d => V_d(:,1:m)
      e1Pt_d => e1_d(1:m)
      lamVPt => lamV(1:m)
      ...
      VP_d => V_d(:,k)
      ....
      VP_d => V_d(:,k-1)

Is it considered as non-contiguous?

Thanks,
Amir

MatColgrove · March 7, 2019, 7:11pm

Hi Amir,

Yes, these all look contiguous, so I still don’t understand where the failures could be coming from.

-Mat

asaadat · March 8, 2019, 8:32am

Hi Mat,

Thanks again for your reply.

In the process of better understanding the pointer behavior, I wrote the following code:

program main
   use cudafor
   use cusparse
   implicit none

   real(8),allocatable :: arr_h(:,:),arr_test_h(:)
   real(8),allocatable,device,target :: arr_d(:,:)
   real(8),pointer,device :: ptr_d(:)

   allocate(arr_h(4,4),arr_d(4,4),arr_test_h(4))
   arr_h=0.0
   arr_h(1:4,1)=1.0

   arr_d=arr_h

   print*,'first column:'
   print'(4(f10.5,1x))',arr_h(:,1)
   print*,'second column:'
   print'(4(f10.5,1x))',arr_h(:,2)

   ptr_d => arr_d(:,1)
   arr_test_h=ptr_d
   print*,'supposed to be the first column (all ones):'
   print'(4(f10.5,1x))',arr_test_h
   ptr_d => arr_d(:,2)
   arr_test_h=ptr_d
   print*,'supposed to be the second column (all zeros):'
   print'(4(f10.5,1x))',arr_test_h

end program main

And I get the following:

first column:
   1.00000    1.00000    1.00000    1.00000
 second column:
   0.00000    0.00000    0.00000    0.00000
 supposed to be the first column (all ones):
   1.00000    1.00000    1.00000    1.00000
 supposed to be the second column (all zeros):
   1.00000    1.00000    1.00000    1.00000

Would you please take a look and see if this is expected.

Thanks again,
Amir

MatColgrove · March 8, 2019, 5:32pm

Hi Amir,

This does appear to be a compiler issue where the wrong bounds are being used for the second device-to-host copy. I have added a problem report, TPR#26958, and sent it to engineering for further investigation.

The work around would be to explicitly set the bounds of “ptr_d” when doing the copy. To be safe, you may also want to include the bounds when setting the pointer as well.

For example:

% cat test2.CUF
program main
   use cudafor
   use cusparse
   implicit none

   real(8),allocatable :: arr_h(:,:),arr_test_h(:)
   real(8),allocatable,device,target :: arr_d(:,:)
   real(8),pointer,device :: ptr_d(:)

   allocate(arr_h(4,4),arr_d(4,4),arr_test_h(4))
   arr_h=0.0
   arr_h(1:4,1)=1.0

   arr_d=arr_h

   print*,'first column:'
   print'(4(f10.5,1x))',arr_h(:,1)
   print*,'second column:'
   print'(4(f10.5,1x))',arr_h(:,2)

   ptr_d => arr_d(1:4,1)
   arr_test_h=ptr_d(1:4)
   print*,'supposed to be the first column (all ones):'
   print'(4(f10.5,1x))',arr_test_h
   ptr_d => arr_d(1:4,2)
#ifdef FAILS
   arr_test_h=ptr_d
#else
! WORKS
   arr_test_h=ptr_d(1:4)
#endif
   print*,'supposed to be the second column (all zeros):'
   print'(4(f10.5,1x))',arr_test_h

end program main

% pgfortran -V19.1 test2.CUF -DFAILS ; a.out
 first column:
   1.00000    1.00000    1.00000    1.00000
 second column:
   0.00000    0.00000    0.00000    0.00000
 supposed to be the first column (all ones):
   1.00000    1.00000    1.00000    1.00000
 supposed to be the second column (all zeros):
   1.00000    1.00000    1.00000    1.00000

% pgfortran -V19.1 test2.CUF ; a.out
 first column:
   1.00000    1.00000    1.00000    1.00000
 second column:
   0.00000    0.00000    0.00000    0.00000
 supposed to be the first column (all ones):
   1.00000    1.00000    1.00000    1.00000
 supposed to be the second column (all zeros):
   0.00000    0.00000    0.00000    0.00000

Hopefully the applying the workaround to your full source will also fix the issue.

-Mat

Topic		Replies	Views
cusparseScsrmv transpose mode is not working CUDA Programming and Performance	17	1591	July 9, 2018
CUSPARSE conversion routines not working... cusparseSnnz and cusparseSdense2csr misbehaving... CUDA Programming and Performance	11	4212	February 28, 2011
Fortran CUSPARSE Bindings CUDA Programming and Performance	1	1994	May 22, 2012
Wrong result sparse matrix-vector multiplication Legacy PGI Compilers	5	3187	August 29, 2018
cusparse<t>csrmv_mp() when the operation is CUSPARSE_OPERATION_TRANSPOSE CUDA Programming and Performance	3	612	January 8, 2018
Problem with cusparseXcoo2csr and cuda-gdb CUDA Programming and Performance	9	1491	January 12, 2012
Sparse Matrix-Vector Multiplication on CUDA CUDA Programming and Performance	79	313782	November 22, 2010
CuSparse csrmv with transpose matrix and optimizations GPU-Accelerated Libraries	0	1361	September 26, 2014
cuSparse multiplication "Address out of bounds" error GPU-Accelerated Libraries	2	511	March 13, 2019
Problems allocating sparse Matrix CUDA Programming and Performance	3	2497	March 15, 2010

Memory read error when using csrmv with transpose operation

Related topics