Incorrect results from the function "cusolverDnXgetrs" when it is called a second time

Hi,

I am using the function “cusolverDnXgetrs” to solve a linear system of equations. The subroutine needs to be called multiple times. The first call gives the correct result, but every subsequent call gives an incorrect one: the result (the “gm” in my code) after calling the function is the same as it was before the call. In other words, it looks as if “gm” is never transferred from the CPU to the GPU and back to the CPU. It is puzzling that the result is correct the first time. I am sure that the “A” and “b” of “Ax=b” are updated correctly before they are copied to the device. Here is my code:

SUBROUTINE Calculate_Updating_model_Direct_GPU(conductivity)

    IMPLICIT NONE
    
    REAL(r8k), intent(inout):: conductivity( Model_Num )
    INTEGER(i4k):: i,j,k, moment, point
    INTEGER(i4k):: Hessian_row,Hessian_col,WtW_row, idx_tol
    INTEGER(i8k)::t1,t2,count_rate           !t1 denotes original cpu time , t2 denotes the end cpu time and t=t2-t1
    REAL(r8k)   ::t
    !============= Calculate normal equation APIs =============
    TYPE(cusolverDnHandle):: handle_LSE								! Handle to the cuSOLVER library context
    TYPE(cusolverDnParams):: params_LSE								! Handle to the cuSOLVER parameters structure
    INTEGER(i8k)::rows 											! The number of rows of the dense matrix
    INTEGER(i8k)::cols											! The number of columns of the dense matrix
    INTEGER(i8k):: ld, ldb 
    REAL(r8k), device :: dX(Model_Num)			! Values of the dense vector, i.e. unknown term. Array of size cols        
    integer(8)  :: workspaceInBytesOnDevice, workspaceInBytesOnHost
    integer(4), device :: devinfo
    INTEGER(i1k), DIMENSION(:), device, ALLOCATABLE :: bufferOnDevice	
    INTEGER(i1k), DIMENSION(:), ALLOCATABLE :: bufferOnHost
    INTEGER(i8k), device:: ipiv(Model_Num)								! Array of size at least min(rows,cols), containing pivot indices
    
    !============= Host problem definition =============
    rows = Model_Num
    cols = Model_Num
    ld   = Model_Num
    ldb  = Model_Num
    !============= Device memory management =============
    istat = cusolverDnCreate(handle_LSE)
    istat = cusolverDnSetStream(handle_LSE,acc_get_cuda_stream(1)) 
    istat = cusolverDnCreateParams(params_LSE)
    istat = cusolverDnSetAdvOptions(params_LSE, CUSOLVERDN_GETRF, CUSOLVER_ALG_0)
    !============= Record the computation time =============
    call system_clock(count_rate=count_rate)
    call system_clock (t1)  

    ! d_dense = 0
    !$OMP PARALLEL num_threads(Threads_Num)
    !$OMP DO PRIVATE(WtW_row,Hessian_col)
    do Hessian_row=1, Model_Num
        do WtW_row=Mat_Wm2%Pos_Row_Firstdata(Hessian_row), Mat_Wm2%Pos_Row_Firstdata(Hessian_row+1)-1
            Hessian_col = Mat_Wm2%Col_List(WtW_row)
            d_dense( (Hessian_row-1)*Model_Num + Hessian_col) = d_dense( (Hessian_row-1)*Model_Num + Hessian_col) + laida*Mat_Wm2%Val_List(WtW_row)
        enddo
    enddo
    !$OMP END DO

    !$OMP DO PRIVATE(idx_tol,Hessian_row,moment,point)	
    do Hessian_col=1, Model_Num
        do Hessian_row=1, Model_Num
            DO moment=1,Timegate_Num 
                DO point=1,Point_Num
                    idx_tol=Point_Num*(moment-1)+point
                    d_dense( (Hessian_row-1)*Model_Num + Hessian_col) = d_dense( (Hessian_row-1)*Model_Num + Hessian_col) + Jacobian_Tran(Hessian_row,idx_tol) /Dataerr_Variance(idx_tol)**2 * Jacobian_Tran(Hessian_col,idx_tol)
                ENDDO
            enddo
        ENDDO
    enddo
    !$OMP END DO
    !$OMP END PARALLEL

    !$acc data copyin (d_dense,Gm)

        !$acc kernels
        !$acc loop
        do i=1, Model_Num
            dX(i) = Gm(i)
        enddo   
        !$acc end kernels

        !============= Calculate the sizes needed for pre-allocated buffer =============
        istat = cusolverDnXgetrf_bufferSize( handle_LSE, params_LSE, rows, cols, cudaDataType(CUDA_R_64F), d_dense, ld,cudaDataType(CUDA_R_64F), workspaceInBytesOnDevice, workspaceInBytesOnHost )
        write(*,*)istat
        allocate(bufferOnDevice(workspaceInBytesOnDevice) , bufferOnHost(workspaceInBytesOnHost)  )
        !============= Solve the linear system of equations : AX=B =============
        istat = cusolverDnXgetrf( handle_LSE, params_LSE, rows, cols, cudaDataType(CUDA_R_64F), d_dense, ld,  ipiv, cudaDataType(CUDA_R_64F), bufferOnDevice,workspaceInBytesOnDevice,bufferOnHost, workspaceInBytesOnHost,devinfo )
        write(*,*)istat
        istat = cusolverDnXgetrs( handle_LSE, params_LSE,CUBLAS_OP_T, rows, 1, cudaDataType(CUDA_R_64F), d_dense, ld,  ipiv, cudaDataType(CUDA_R_64F), dX, ldb, devinfo )
        write(*,*)istat
        !============= Get the Results =============
        call system_clock (t2)
        t = REAL(t2 - t1) / REAL(count_rate)
        print*,'The computing time is:', t  

        ! write(*,*)
        ! write(*,*)"========Assign the result to the host==========="
        ! write(*,*)

        !$acc kernels
        !$acc loop
        do i=1, Model_Num
            Gm(i) = dX(i)
        enddo   
        !$acc end kernels

        !$acc update host(Gm)
    !$acc end data

!============= Free the buffers and destroy the cuSOLVER handles =============
    deallocate( bufferOnDevice,bufferOnHost )
    istat = cusolverDnDestroyParams(params_LSE)
    istat = cusolverDnDestroy(handle_LSE) 
    !============= Update the Conductivity =============

    conductivity=conductivity-Gm

 
    RETURN
END SUBROUTINE Calculate_Updating_model_Direct_GPU

Regards,
Amore.

Hi SkyCool,

Unfortunately, I’m not seeing anything obvious so will need to ask for a reproducing example in order to investigate.

-Mat

Here is code that you can run directly; the two files below need to be placed in the working directory.
A_dense.txt (4.2 MB)
A_GmG.txt (10.7 KB)

SUBROUTINE test_cusolverDnXgetrs
    IMPLICIT NONE

    !============= cuSOLVER APIs =============
    type(cusolverDnHandle):: handle_LSE								! Handle to the cuSOLVER library context
    type(cusolverDnParams) :: params_LSE								! Handle to the cuSOLVER parameters structure
    INTEGER(KIND=8)::rows 											! The number of rows of the dense matrix
    INTEGER(KIND=8)::cols 											! The number of columns of the dense matrix
    INTEGER(KIND=4):: istat											! This data type represents the status returned by the library functions and it can have the values from 0 to 31 
    INTEGER(KIND=8):: i,j,ld,ldb
    integer(8) :: workspaceInBytesOnDevice, workspaceInBytesOnHost
    integer(4), device :: devinfo(2)
    INTEGER(4) ::devinfo_temp
    INTEGER(KIND=1), DIMENSION(:), device, ALLOCATABLE :: bufferOnDevice	
    INTEGER(KIND=1), DIMENSION(:), ALLOCATABLE :: bufferOnHost
    INTEGER(KIND=8), device:: ipiv(3)								! Array of size at least min(rows,cols), containing pivot indices
    INTEGER(KIND=8) :: ipiv_temp(3)									! Used to Check the ipiv
    INTEGER(i4k)::info, NRHS
    REAL(r8k):: A(3,3), B(3), IPIVs( 3 )
    
    REAL(KIND=8),DEVICE :: d_dense(405*405)
    REAL(KIND=8) :: d_dense_temp(405*405)
    REAL(KIND=8),device :: dX(405)
    REAL(KIND=8) :: dY(405)

    OPEN(234,FILE="A_GmG.txt")
    do i=1, 405
        read(234,*)dY(i)
    enddo
    close(234)
    OPEN(234,FILE="A_dense.txt")
    do i=1, 405*405
        read(234,*)d_dense_temp(i)
    enddo
    close(234)

    !============= Host problem definition =============
    rows = 405
    cols = 405
    ld = 405
    ldb = 405
    !============= Allocating and assigning the value of RHS =============
    dX = dY
    d_dense = d_dense_temp
    !============= Device memory management =============
    ipiv = 1
    
    istat = cusolverDnCreate(handle_LSE)
    istat = cusolverDnSetStream(handle_LSE,acc_get_cuda_stream(44)) 
    istat = cusolverDnCreateParams(params_LSE)
    istat = cusolverDnSetAdvOptions(params_LSE, CUSOLVERDN_GETRF, CUSOLVER_ALG_0)

    istat = cusolverDnXgetrf_bufferSize( handle_LSE, params_LSE, rows, cols, cudaDataType(CUDA_R_64F), d_dense, ld,cudaDataType(CUDA_R_64F), workspaceInBytesOnDevice, workspaceInBytesOnHost )
    allocate(bufferOnDevice(workspaceInBytesOnDevice) , bufferOnHost(workspaceInBytesOnHost)  )
    istat = cusolverDnXgetrf( handle_LSE, params_LSE, rows, cols, cudaDataType(CUDA_R_64F), d_dense, ld,  ipiv, cudaDataType(CUDA_R_64F), bufferOnDevice,workspaceInBytesOnDevice,bufferOnHost, workspaceInBytesOnHost,devinfo(1) )
    istat = cusolverDnXgetrs( handle_LSE, params_LSE,CUBLAS_OP_T, rows, 1, cudaDataType(CUDA_R_64F), d_dense, ld,  ipiv, cudaDataType(CUDA_R_64F), dx, ldb, devinfo(2) )
    dY = dX

    write(*,*)"========The calculational results GPU==========="
    !============= Printing the result of X =============

    OPEN(1,file='A_gm.txt')
    do i=1, 405
        write(1,*)dY(i)
    enddo
    close(1)

    write(*,*)"========The End==========="
    pause
    
END SUBROUTINE test_cusolverDnXgetrs

After the program finishes, the result file "A_gm.txt" is created. At the moment I cannot get the correct results. Before I transplanted this code from my subroutine into this standalone test, the same code did give correct results.
I hope you can run this code a second time; on my system I cannot get the same, correct results on repeated runs.

Hi SkyCool,

I took a look at your ported-over code and was able to reproduce the bad behavior I think you were seeing. I reduced it down to a simple 5x5 case for quick turnaround and found that if I added the following lines after calculating dX (I put them on the line before dY = dX, but I think you just need them before the end of the subroutine)

istat = cusolverDnDestroyParams(params_LSE)
istat = cusolverDnDestroy(handle_LSE)

then I got the expected answer every subsequent time I called the subroutine. I’m not very experienced with the CUDA libraries, but it appears that the handle and params were sticking around and corrupting memory behavior, so it seems you just need to clean them up after your calculation. You also probably want to deallocate the allocatables you create, but that doesn’t appear to be as big an issue.

Hope that this helps resolve your issue. So at least this part of your code appears to work correctly - if you can pare down to a small reproducer that highlights the issue, we’ll try to help you with the larger subroutine. Also - it would make our lives easier if you dropped the dimensionality for the reproducer. Matrices on the order of 3x3 to 6x6 will make life much easier for debugging purposes in comparison to 405x405.
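For reference, a pared-down reproducer along these lines would do. This is only an untested sketch that mirrors the same cusolverDnXgetrf/Xgetrs call sequence on a made-up diagonal 3x3 system (diagonal so CUBLAS_OP_T versus CUBLAS_OP_N makes no difference, and the exact solution is x = 1), with the synchronize, deallocates, and destroy calls included:

program test_small_getrs
    ! Untested sketch of a tiny reproducer: solve a diagonal 3x3 system A x = b
    ! with the same call sequence used above.
    use cusolverdn
    use cudafor
    IMPLICIT NONE

    type(cusolverDnHandle):: handle_LSE
    type(cusolverDnParams):: params_LSE
    INTEGER(KIND=8):: rows, cols, ld, ldb
    INTEGER(KIND=4):: istat
    integer(8) :: workspaceInBytesOnDevice, workspaceInBytesOnHost
    integer(4), device :: devinfo(2)
    INTEGER(KIND=1), DIMENSION(:), device, ALLOCATABLE :: bufferOnDevice
    INTEGER(KIND=1), DIMENSION(:), ALLOCATABLE :: bufferOnHost
    INTEGER(KIND=8), device:: ipiv(3)                    ! size min(rows,cols) = 3
    REAL(KIND=8), device :: d_dense(3*3), dX(3)
    REAL(KIND=8) :: A(3*3), B(3), X(3)

    rows = 3; cols = 3; ld = 3; ldb = 3
    A = [2d0,0d0,0d0, 0d0,3d0,0d0, 0d0,0d0,4d0]          ! diagonal matrix, column-major
    B = [2d0,3d0,4d0]                                    ! exact solution is x = (1,1,1)
    d_dense = A
    dX = B

    istat = cusolverDnCreate(handle_LSE)
    istat = cusolverDnCreateParams(params_LSE)
    istat = cusolverDnXgetrf_bufferSize( handle_LSE, params_LSE, rows, cols, &
            cudaDataType(CUDA_R_64F), d_dense, ld, cudaDataType(CUDA_R_64F), &
            workspaceInBytesOnDevice, workspaceInBytesOnHost )
    allocate( bufferOnDevice(workspaceInBytesOnDevice), bufferOnHost(workspaceInBytesOnHost) )
    istat = cusolverDnXgetrf( handle_LSE, params_LSE, rows, cols, cudaDataType(CUDA_R_64F), &
            d_dense, ld, ipiv, cudaDataType(CUDA_R_64F), bufferOnDevice, workspaceInBytesOnDevice, &
            bufferOnHost, workspaceInBytesOnHost, devinfo(1) )
    istat = cusolverDnXgetrs( handle_LSE, params_LSE, CUBLAS_OP_T, rows, 1, cudaDataType(CUDA_R_64F), &
            d_dense, ld, ipiv, cudaDataType(CUDA_R_64F), dX, ldb, devinfo(2) )
    istat = cudaDeviceSynchronize()                      ! wait for the GPU before copying back
    X = dX
    write(*,*) 'x =', X                                  ! expect 1.0 1.0 1.0

    deallocate( bufferOnDevice, bufferOnHost )
    istat = cusolverDnDestroyParams(params_LSE)
    istat = cusolverDnDestroy(handle_LSE)
end program test_small_getrs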

Cheers,

Seth.

Hi Seth,

Thanks for your replies. I also obtain the correct result when the dimension of the coefficient matrix is reduced, e.g. dim=3. However, when dim=405 the result calculated by the function is incorrect and is equal to the RHS that was passed in as input. You can use the code I gave above directly; I run it on a V100 and the program only takes a few seconds.
When simulating the actual problem, the dimension of the linear system of equations becomes significantly larger.

regards,

Amore

Hi Mat,

I gave code that can be run. I find that the same program may give different answers for different dimensions of the coefficient matrix: for example, a correct result with a 3x3 matrix and a wrong result with a 400x400 matrix when the program is run only once. If the subroutine containing the function is put in a loop, the wrong result appears whenever the loop runs two or more times, while the result is correct if the loop runs only once.
I have tried many different approaches, but none of them have changed this fact. I hope you can help me. Thank you for your quick response.

regards,

Amore

Amore,

I think you have 2 issues in your code. First - I believe ipiv needs to be an array of size min(rows,cols), so here it should be an array of size 405. I have made that change in the sample code I share below. Second, I believe you have a race condition - I added a cudaDeviceSynchronize call prior to updating dY. It seems like you can get lucky on the first call through, but on subsequent calls the luck runs out and you update dY from dX before the device has finished writing its result into dX.

Below I have modified your code to be user-friendly for me, adding the necessary modules, deallocates, and destroy calls to have everything work correctly, whilst also adding the needed synchronize and increasing the size of ipiv to the correct value.

program test

    call test_cusolverDnXgetrs()
    call test_cusolverDnXgetrs()
    call test_cusolverDnXgetrs()

end program

SUBROUTINE test_cusolverDnXgetrs
    use cusolverdn
    use cudafor
    use openacc

    IMPLICIT NONE

    !============= cuSOLVER APIs =============
    type(cusolverDnHandle):: handle_LSE                               ! Handle to the cuSOLVER library context
    type(cusolverDnParams) :: params_LSE                            ! Handle to the cuSOLVER parameters structure
    INTEGER(KIND=8)::rows                                           ! The number of rows of the dense matrix
    INTEGER(KIND=8)::cols                                            ! The number of columns of the dense matrix
    INTEGER(KIND=4):: istat                                         ! This data type represents the status returned by the library functions and it can have the values from 0 to 31
    INTEGER(KIND=8):: i,j,ld,ldb
    integer(8) :: workspaceInBytesOnDevice, workspaceInBytesOnHost
    integer(4), device :: devinfo(2)
    INTEGER(4) ::devinfo_temp
    INTEGER(KIND=1), DIMENSION(:), device, ALLOCATABLE :: bufferOnDevice
    INTEGER(KIND=1), DIMENSION(:), ALLOCATABLE :: bufferOnHost
    INTEGER(KIND=8), device:: ipiv(405)                                        ! Array of size at least min(rows,cols), containing pivot indices

    REAL(KIND=8),DEVICE :: d_dense(405*405)
    REAL(KIND=8) :: d_dense_temp(405*405)
    REAL(KIND=8),device :: dX(405)
    REAL(KIND=8) :: dY(405)

    OPEN(234,FILE="A_GmG.txt")
    do i=1, 405
        read(234,*)dY(i)
    enddo
    close(234)
    OPEN(234,FILE="A_dense.txt")
    do i=1, 405*405
        read(234,*)d_dense_temp(i)
    enddo
    close(234)

    write(*,*) "Sum of b is"
    write(*,*) sum(dY)
    write(*,*) "------"

    !============= Host problem definition =============
    rows = 405
    cols = 405
    ld = 405
    ldb = 405
    !============= Allocating and assigning the value of RHS =============
    dX = dY
    d_dense = d_dense_temp

    !============= Device memory management =============
    ipiv = 1

    istat = cusolverDnCreate(handle_LSE)
    istat = cusolverDnSetStream(handle_LSE,acc_get_cuda_stream(44))
    istat = cusolverDnCreateParams(params_LSE)
    istat = cusolverDnSetAdvOptions(params_LSE, CUSOLVERDN_GETRF, CUSOLVER_ALG_0)

    istat = cusolverDnXgetrf_bufferSize( handle_LSE, params_LSE, rows, cols, cudaDataType(CUDA_R_64F), d_dense, ld,cudaDataType(CUDA_R_64F), workspaceInBytesOnDevice, workspaceInBytesOnHost )
    allocate(bufferOnDevice(workspaceInBytesOnDevice) , bufferOnHost(workspaceInBytesOnHost)  )
    istat = cusolverDnXgetrf( handle_LSE, params_LSE, rows, cols, cudaDataType(CUDA_R_64F), d_dense, ld,  ipiv, cudaDataType(CUDA_R_64F), bufferOnDevice,workspaceInBytesOnDevice,bufferOnHost, workspaceInBytesOnHost,devinfo(1) )
    istat = cusolverDnXgetrs( handle_LSE, params_LSE,CUBLAS_OP_T, rows, 1, cudaDataType(CUDA_R_64F), d_dense, ld,  ipiv, cudaDataType(CUDA_R_64F), dx, ldb, devinfo(2) )
    istat = cudaDeviceSynchronize()
    dY = dX
    istat = cusolverDnDestroyParams(params_LSE)
    istat = cusolverDnDestroy(handle_LSE)
    deallocate(bufferOnDevice, bufferOnHost)

    write(*,*)"Sum of X is"
    !============= Printing the result of X =============

    write(*,*) sum(dY)

    write(*,*)"========Subroutine cycle==========="

END SUBROUTINE test_cusolverDnXgetrs

If I compile your code via:

nvfortran test.F90 -acc -cuda -cudalibs -o test

Now I get very consistent results:

[scamp@ice4 405]$ ./test
 Sum of b is
    335.1042215138598
 ------
 Sum of X is
   -311.1725408200209
 ========Subroutine cycle===========
 Sum of b is
    335.1042215138598
 ------
 Sum of X is
   -311.1725408200209
 ========Subroutine cycle===========
 Sum of b is
    335.1042215138598
 ------
 Sum of X is
   -311.1725408200209
 ========Subroutine cycle===========

Whereas if I remove the synchronize, I get the right answer the first time through and then every subsequent time through, I see x being the same as b, which we know is wrong. Similarly, if I reduce the size of ipiv to 3 rather than 405, I just get x = b every time I call the subroutine.

I think these changes should help resolve the issues you’re seeing.


Thanks for your suggestion. After adding the cudaDeviceSynchronize call to my code, the code works correctly. However, I am still confused about what causes the race condition during the calculation. I just pass a matrix and a vector to the library and placed the code that outputs them after the library call; is that ordering unreasonable? When I solve several dense linear systems in parallel on the GPU using different CUDA streams, I will run into this issue again.

This is a fundamental issue that users have to master when coming to GPU programming - you launched a bunch of work on the GPU in cusolverDnXgetrs, but that’s not a blocking call, so your host code keeps marching forward unless you add your own blocking mechanism to wait for the GPU to finish. Had there been CPU host work that didn’t depend on the result of the call, you could have knocked it out whilst waiting for the GPU to return, by putting the cudaDeviceSynchronize at the latest possible point in your code - the point where it matters whether dX holds the GPU result or still holds the old value.
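To make that concrete, here is a hedged fragment of what that overlap could look like (variable names follow the 405x405 reproducer above; independent_host_work is just a hypothetical placeholder for CPU work that never touches dX):

    istat = cusolverDnXgetrs( handle_LSE, params_LSE, CUBLAS_OP_T, rows, 1, cudaDataType(CUDA_R_64F), &
            d_dense, ld, ipiv, cudaDataType(CUDA_R_64F), dX, ldb, devinfo(2) )   ! launches GPU work, returns immediately

    call independent_host_work()        ! hypothetical CPU work that does not read dX

    istat = cudaDeviceSynchronize()     ! block only once the GPU result is actually needed
    dY = dX                             ! now safe: dX holds the solution computed on the device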

I have been using kernels constructs to enforce sequential execution. For example:

!$acc data

   istat = cusolverDnXgetrs(parameters)
   
   !$acc kernels
        dY = dX
   !$acc end kernels

!$acc end data

If “async” is not used on the kernels construct, I expected the tasks to be executed sequentially; in other words, the statements inside the kernels construct should only run after cusolverDnXgetrs finishes. But the results are still wrong.

So I think you’ve effectively launched cusolverDnXgetrs asynchronously, because it’s a nonblocking call, which allows the kernels construct to be executed in a different GPU stream - leading to the race condition. If you launched the kernels construct into the same stream as the cusolverDnXgetrs call, then I think it would be effectively blocking.
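One hedged, untested sketch of that same-stream approach (names again follow the reproducer above; note the host still has to wait on the queue before it uses dY):

    istat = cusolverDnSetStream(handle_LSE, acc_get_cuda_stream(1))   ! cuSOLVER work goes onto queue 1's stream

!$acc data copy(dY)

    istat = cusolverDnXgetrs( handle_LSE, params_LSE, CUBLAS_OP_T, rows, 1, cudaDataType(CUDA_R_64F), &
            d_dense, ld, ipiv, cudaDataType(CUDA_R_64F), dX, ldb, devinfo(2) )

    ! async(1) puts the kernels construct on queue 1 as well, so it is ordered after the solve
    !$acc kernels async(1)
        dY = dX
    !$acc end kernels

    ! the host (and the copy-out at end data) still has to wait on queue 1
    !$acc wait(1)

!$acc end data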
