Are the streams from acc_get_cuda_stream() and async() the same?

SkyCool · January 16, 2025, 2:50pm

Hi,
I am trying to call a CUDA library function such as cusolverDnXgesvd. Before executing this function, a handle must be created and bound to a CUDA stream. For example:

    istat = cusolverDnCreate(handle)
    istat = cusolverDnSetStream(handle,acc_get_cuda_stream(22))

After cusolverDnXgesvd finishes, the resulting vector or matrix is usually copied from the device to the host. However, this step has to be executed on the device, i.e. inside a kernels region on a specified async queue. For example:

    !$acc kernels async(22) copy(S_c)
    !$acc loop 
        do I=1,cols
            S_c(i) = S(i)
        enddo
    !$acc end kernels

I am not sure whether the stream bound to the handle and the queue number in the async clause refer to the same stream. Will the two parts run in order or concurrently? Here is the code I used to test this.

SUBROUTINE test_cusolverDnXgesvd
    USE cudafor        ! CUDA Fortran runtime (device attribute, cudaStreamDestroy)
    USE cusolverDn     ! cuSOLVER dense interfaces
    USE openacc        ! acc_get_cuda_stream
    IMPLICIT NONE

    INTEGER(4)::i
    !============= cuSOLVER APIs =============
    type(cusolverDnHandle):: handle                 ! Handle to the cuSOLVER dense library context
    type(cusolverDnParams) :: params                ! Parameter structure for the 64-bit (X) cuSOLVER API
    INTEGER(KIND=8), parameter ::rows=2, cols=2     ! Number of rows and columns of the dense matrix
    INTEGER(KIND=4):: istat                         ! Status returned by the library functions; it can take values from 0 to 31
    INTEGER(KIND=8), parameter :: lda=rows, ldu=rows, ldvt=cols
    CHARACTER(1) :: jobu, jobvt
    integer(8) :: workspaceInBytesOnDevice, workspaceInBytesOnHost
    REAL(KIND=8), device:: A(rows,cols)
    REAL(KIND=8), device:: S(cols)
    REAL(KIND=8), device:: U(ldu,rows)
    REAL(KIND=8), device:: VT(ldvt,cols)
    INTEGER(1), DIMENSION(:), device, ALLOCATABLE :: bufferOnDevice	
    INTEGER(1), DIMENSION(:), ALLOCATABLE :: bufferOnHost
    integer(4), device :: devinfo

    REAL(KIND=8):: S_c(cols)

    istat = cusolverDnCreate(handle)
    istat = cusolverDnSetStream(handle,acc_get_cuda_stream(22)) 
    istat = cusolverDnCreateParams(params)
    istat = cusolverDnSetAdvOptions(params, CUSOLVERDN_GETRF, CUSOLVER_ALG_0)

    jobu  = "A"
    jobvt = "A"
    S_c = 0

    A(1,:) =(/1,-2/)
    A(2,:) =(/1,2/) 
    ! A(3,:) =(/2,1/)

    istat = cusolverDnXgesvd_bufferSize(handle,params,jobu,jobvt,rows,cols,cudaDataType(CUDA_R_64F),&
                                A,lda,cudaDataType(CUDA_R_64F),S,cudaDataType(CUDA_R_64F),U,ldu,&
                                cudaDataType(CUDA_R_64F),VT,ldvt,cudaDataType(CUDA_R_64F),&
                                workspaceInBytesOnDevice, workspaceInBytesOnHost)
    allocate(bufferOnDevice(workspaceInBytesOnDevice) , bufferOnHost(workspaceInBytesOnHost)  )

    istat = cusolverDnXgesvd(handle,params,jobu,jobvt,rows,cols,cudaDataType(CUDA_R_64F),&
            A,lda,cudaDataType(CUDA_R_64F),S,cudaDataType(CUDA_R_64F),U,ldu,&
            cudaDataType(CUDA_R_64F),VT,ldvt,cudaDataType(CUDA_R_64F),&
            bufferOnDevice,workspaceInBytesOnDevice,&
            bufferOnHost,workspaceInBytesOnHost,&
            devinfo)

    !$acc kernels copy(S_c)
    !$acc loop 
        do I=1,cols
            S_c(i) = S(i)
        enddo
    !$acc end kernels
        
    write(*,*)S_C
   
    istat = cusolverDnDestroy(handle)
    istat = cusolverDnDestroyParams(params)
    istat = cudaStreamDestroy(acc_get_cuda_stream(22))
    deallocate(bufferOnDevice,bufferOnHost)
    stop
END SUBROUTINE test_cusolverDnXgesvd

If you call this subroutine directly, it gives the correct results (2.828427124, 1.4142135623). However, the results are incorrect if you add the async(22) clause to the kernels directive.
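One possible explanation, offered as an assumption rather than a confirmed diagnosis: with async(22) the kernels region returns control to the host before the copy-out of S_c has completed, so the write statement may print stale values. Below is a minimal sketch of the same pattern with an explicit wait on queue 22. It uses a plain host array s in place of the cuSOLVER output; the program name async_wait_sketch and the array s are illustrative only and are not part of the test above.

```fortran
program async_wait_sketch
    implicit none
    integer, parameter :: n = 4
    real(8) :: s(n), s_c(n)
    integer :: i

    s   = (/ 1.0d0, 2.0d0, 3.0d0, 4.0d0 /)   ! stand-in for the singular values
    s_c = 0.0d0

    ! Enqueue the copy loop on async queue 22; the host does not wait here.
    !$acc kernels async(22) copyin(s) copy(s_c)
    !$acc loop
    do i = 1, n
        s_c(i) = s(i)
    end do
    !$acc end kernels

    ! Block the host until all work on queue 22 (kernel and copy-out) is done.
    !$acc wait(22)

    ! Only after the wait is it safe to read s_c on the host.
    if (any(s_c /= s)) error stop "s_c does not match s"
    write(*,*) s_c
end program async_wait_sketch
```

In the subroutine above, the equivalent change would be to place `!$acc wait(22)` between `!$acc end kernels` and `write(*,*)S_C`. This relies on acc_get_cuda_stream(22) returning the CUDA stream behind async queue 22, in which case the cuSOLVER call and the kernels region would run in order on that stream.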

Best wishes,
Amor
