Kernel call by CUBLAS or CUSPARSE library

Hi, I am new to CUDA. I would like to know whether a kernel is launched and terminated each time we use any of the library routines in CUBLAS or CUSPARSE, since these routines can only be called from host code. Consider an application that needs to make many such calls, for example the conjugate gradient routine provided in the SDK. Is there any way a speedup can be attained using the library routines?

Also, how do we know the number of threads involved in code that uses these libraries?
Please help.

The library assumes that the data is already in device memory, so there is no PCIe transfer overhead per call.

The conjugate gradient example in the SDK uses csrmv and several BLAS level-1 operations; PCIe transfers occur only before and after the conjugate gradient routine.

Second, you do not need to know the number of threads used inside the library.
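To make the first point concrete, here is a minimal sketch of the usual pattern (my own illustration, assuming the legacy pre-v2 CUBLAS API in cublas.h): copy the data to device memory once, call the library routines on device pointers, and copy only the result back, so PCIe traffic occurs only at the two copies. cublasSscal simply stands in for whatever csrmv/BLAS-1 calls your solver makes.

    #include <stdio.h>
    #include <stdlib.h>
    #include <cuda_runtime.h>
    #include <cublas.h>          /* legacy (pre-v2) CUBLAS API */

    int main(void)
    {
        const int n = 1 << 20;
        float *h_x = (float *) malloc(n * sizeof(float));
        float *d_x = 0;
        int i;

        for (i = 0; i < n; ++i) h_x[i] = 1.0f;

        cublasInit();                                   /* initialize legacy CUBLAS */
        cudaMalloc((void **) &d_x, n * sizeof(float));  /* device buffer            */

        /* one PCIe transfer: host -> device, before the solver */
        cudaMemcpy(d_x, h_x, n * sizeof(float), cudaMemcpyHostToDevice);

        /* library calls work on device pointers; no PCIe traffic here */
        cublasSscal(n, 2.0f, d_x, 1);   /* x = 2*x, stand-in for csrmv/blas1 calls */

        /* one PCIe transfer: device -> host, after the solver */
        cudaMemcpy(h_x, d_x, n * sizeof(float), cudaMemcpyDeviceToHost);

        printf("x[0] = %f\n", h_x[0]);  /* expect 2.0 */

        cudaFree(d_x);
        cublasShutdown();
        free(h_x);
        return 0;
    }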

Thank you. One more thing: my application requires iterations (say 100), which means I may have to call these routines many times in a loop. Each time one of the CUSPARSE routines is called, for example csrmv, the message "[Launch of CUDA Kernel 0 (scal_kernel) on Device 0] [Termination of Kernel 0 (scal_kernel) on Device 0]" appears. Does this mean that control returns to the host after execution on the device?
If so, a lot of time is spent on the switching that takes place. Is this interpretation correct? Is this the right way to call such library routines?

How did you obtain this message, from a library call? If so, which function did you call?

Second, the penalty of a global synchronization is roughly 5~10 microseconds.

For example, the conjugate gradient example in the SDK calls one csrmv and five BLAS level-1 functions per iteration,

so the penalty of global synchronization is about 6 x 10 = 60 microseconds per iteration.

If the time of csrmv is much larger than 60 microseconds, then you do not need to worry about the penalty of global synchronization.

Remember that you cannot use a single kernel to complete your application, especially when your application requires global synchronization.

The only way to do a global synchronization is through the driver, so you need to go back to the host side.
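If you want to check the 5~10 microsecond figure on your own card, here is a minimal timing sketch (my own, not from the SDK): it launches an empty kernel in a loop and synchronizes after every launch, so the average time per pass approximates the launch-plus-synchronization penalty that each library call also pays.

    #include <stdio.h>
    #include <cuda_runtime.h>

    __global__ void empty_kernel(void) { }

    int main(void)
    {
        const int iters = 1000;
        cudaEvent_t start, stop;
        float ms = 0.0f;
        int i;

        cudaEventCreate(&start);
        cudaEventCreate(&stop);

        empty_kernel<<<1, 1>>>();      /* warm up: first launch includes setup cost */
        cudaThreadSynchronize();       /* cudaDeviceSynchronize in newer toolkits   */

        cudaEventRecord(start, 0);
        for (i = 0; i < iters; ++i) {
            empty_kernel<<<1, 1>>>();  /* one kernel launch ...            */
            cudaThreadSynchronize();   /* ... and one global sync per pass */
        }
        cudaEventRecord(stop, 0);
        cudaEventSynchronize(stop);
        cudaEventElapsedTime(&ms, start, stop);

        printf("average launch + sync overhead: %f microseconds\n",
               ms * 1000.0f / iters);

        cudaEventDestroy(start);
        cudaEventDestroy(stop);
        return 0;
    }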

This message was obtained while debugging the code, at the point of the function cusparseScsrmv. Similar launch and termination messages appear during debugging when CUBLAS routines are encountered as well, and those kernels are launched automatically too. Since my application uses only the CUBLAS and CUSPARSE APIs, these APIs launch the kernels internally. How do I, as the programmer, use more than one kernel explicitly to complete code with iterations similar to the conjugate gradient example while using the library routines?

Thank you for your prompt reply.

  1. A kernel launch is non-blocking; however, a library API call may be blocking, and you have no information about the internal behaviour of the API.

  2. If you have only one Fermi card and still want concurrent kernel launches, you can bind each library to a different stream,
    for example with cusparseSetKernelStream in the CUSPARSE library (see the sketch after this list).

  3. If you have two or more Fermi cards in one PC, you can run CUBLAS on GPU0 and CUSPARSE on GPU1; however, one GPU cannot see another GPU's device memory,
    so you need to consider the communication overhead if you use multiple GPUs.
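A rough sketch of point 2 (assuming your toolkit also provides the legacy CUBLAS helper cublasSetKernelStream in addition to cusparseSetKernelStream; please check your CUBLAS documentation) might look like this:

    #include <cuda_runtime.h>
    #include <cublas.h>     /* legacy CUBLAS API */
    #include <cusparse.h>   /* CUSPARSE API      */

    int main(void)
    {
        cusparseHandle_t sp_handle = 0;
        cudaStream_t blas_stream, sparse_stream;

        cublasInit();
        cusparseCreate(&sp_handle);

        cudaStreamCreate(&blas_stream);
        cudaStreamCreate(&sparse_stream);

        /* subsequent CUBLAS kernels go into blas_stream
           (legacy helper; its availability is an assumption here) */
        cublasSetKernelStream(blas_stream);

        /* subsequent CUSPARSE kernels go into sparse_stream */
        cusparseSetKernelStream(sp_handle, sparse_stream);

        /* ... issue CUBLAS and CUSPARSE calls here; kernels that are truly
           independent of each other may then overlap on a Fermi-class GPU ... */

        /* wait for both streams before using results on the host */
        cudaStreamSynchronize(blas_stream);
        cudaStreamSynchronize(sparse_stream);

        cudaStreamDestroy(blas_stream);
        cudaStreamDestroy(sparse_stream);
        cusparseDestroy(sp_handle);
        cublasShutdown();
        return 0;
    }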

This is with regard to the above statement. Can you throw some light on how cusparseSetKernelStream works? Is there any help available on the use of such library routines (other than the SDK)? I am working on a GeForce 9400GT.

Sorry, maybe I am wrong. I followed section 3.2.7.5 of the programming guide and bound the CUSPARSE library to a different stream in the conjugate gradient example (SDK):

    cusparseHandle_t handle = 0;
    cusparseStatus_t status;

    /* create the CUSPARSE context */
    status = cusparseCreate(&handle);
    if (status != CUSPARSE_STATUS_SUCCESS) {
        fprintf(stderr, "!!!! CUSPARSE initialization error\n");
        return EXIT_FAILURE;
    }

    /* create a non-default stream and bind the CUSPARSE context to it,
       so that subsequent CUSPARSE kernels are launched into this stream */
    cudaStream_t stream;
    cudaStreamCreate(&stream);
    status = cusparseSetKernelStream(handle, stream);
    if (CUSPARSE_STATUS_SUCCESS != status) {
        fprintf(stderr, "Error: bind to different stream fails\n");
        exit(1);
    } else {
        printf("stream is %p\n", (void *) stream);  /* cast to void* for %p */
    }

The result is correct. This does not seem reasonable, because CUBLAS uses the default stream (stream 0) and the dependence between csrmv and the BLAS level-1 operations cannot be executed out of order; so if csrmv and the BLAS calls could really run concurrently, the result should be wrong.
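A possible explanation, which I am only guessing at from the stream semantics described in the programming guide, is that kernels launched into the default stream (stream 0) do not begin until previously issued work in other streams has completed, and vice versa; so the CUSPARSE kernels in the new stream and the CUBLAS kernels in stream 0 never actually overlap, which would keep the result correct. If you prefer the ordering to be explicit rather than implicit, a small sketch (my own, using only runtime API calls) is:

    #include <cuda_runtime.h>

    /* sketch: make the cross-stream dependence explicit before work in the
       default (NULL) stream consumes data produced in another stream       */
    static void wait_for_cusparse_stream(cudaStream_t stream)
    {
        /* block the host until everything issued into 'stream' (e.g. the
           csrmv launched by CUSPARSE) has finished                         */
        cudaStreamSynchronize(stream);
        /* after this returns, CUBLAS calls issued into stream 0 can safely
           read the vectors that csrmv wrote                                */
    }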
