Kernel call by CUBLAS or CUSPARSE library

Hi, I am new to CUDA. I would like to know whether a kernel is launched and terminated each time we use any of the library routines in CUBLAS or CUSPARSE, since these routines can only be called from host code. Consider an application that needs to make many such calls, for example the conjugate gradient routine provided in the SDK. Is there any way a speedup can be attained using the library routines?

Also, how do we know the number of threads involved in code that uses these libraries?
Please help.

The library assumes that the data is already in device memory, so there is no PCIe transfer overhead per call.

The conjugate gradient example in the SDK uses csrmv and several BLAS level-1 operations; PCIe transfers occur only before and after the conjugate gradient routine.

Second, you do not need to know the number of threads used inside the library.
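To make the first point concrete, here is a minimal sketch of the usual pattern (my own illustration, assuming the legacy pre-v2 CUBLAS API in cublas.h): copy the data to device memory once, call the library routines on device pointers, and copy only the result back, so PCIe traffic occurs only at the two copies. cublasSscal simply stands in for whatever csrmv/BLAS-1 calls your solver makes.

    #include <stdio.h>
    #include <stdlib.h>
    #include <cuda_runtime.h>
    #include <cublas.h>          /* legacy (pre-v2) CUBLAS API */

    int main(void)
    {
        const int n = 1 << 20;
        float *h_x = (float *) malloc(n * sizeof(float));
        float *d_x = 0;
        int i;

        for (i = 0; i < n; ++i) h_x[i] = 1.0f;

        cublasInit();                                   /* initialize legacy CUBLAS */
        cudaMalloc((void **) &d_x, n * sizeof(float));  /* device buffer            */

        /* one PCIe transfer: host -> device, before the solver */
        cudaMemcpy(d_x, h_x, n * sizeof(float), cudaMemcpyHostToDevice);

        /* library calls work on device pointers; no PCIe traffic here */
        cublasSscal(n, 2.0f, d_x, 1);   /* x = 2*x, stand-in for csrmv/blas1 calls */

        /* one PCIe transfer: device -> host, after the solver */
        cudaMemcpy(h_x, d_x, n * sizeof(float), cudaMemcpyDeviceToHost);

        printf("x[0] = %f\n", h_x[0]);  /* expect 2.0 */

        cudaFree(d_x);
        cublasShutdown();
        free(h_x);
        return 0;
    }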

Thank you. One more thing: my application requires iterations (say 100), which means I may have to call these routines many times in a loop. Each time one of the CUSPARSE routines is called, for example csrmv, the message "[Launch of CUDA Kernel 0 (scal_kernel) on Device 0] [Termination of Kernel 0 (scal_kernel) on Device 0]" appears. Does this mean that control returns to the host after execution on the device?
If so, a lot of time is spent on the switching that takes place. Is this interpretation correct? Is this the right way to call such library routines?

How did you obtain this message, from a library call? If so, which function did you call?

Second, the penalty of a global synchronization is roughly 5~10 microseconds.

For example, the conjugate gradient example in the SDK calls one csrmv and five BLAS level-1 functions per iteration,

so the penalty of global synchronization is about 6 x 10 = 60 microseconds per iteration.

If the time of csrmv is much larger than 60 microseconds, then you do not need to worry about the penalty of global synchronization.

Remember that you cannot use a single kernel to complete your application, especially when your application requires global synchronization.

The only way to do a global synchronization is through the driver, so you need to go back to the host side.
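If you want to check the 5~10 microsecond figure on your own card, here is a minimal timing sketch (my own, not from the SDK): it launches an empty kernel in a loop and synchronizes after every launch, so the average time per pass approximates the launch-plus-synchronization penalty that each library call also pays.

    #include <stdio.h>
    #include <cuda_runtime.h>

    __global__ void empty_kernel(void) { }

    int main(void)
    {
        const int iters = 1000;
        cudaEvent_t start, stop;
        float ms = 0.0f;
        int i;

        cudaEventCreate(&start);
        cudaEventCreate(&stop);

        empty_kernel<<<1, 1>>>();      /* warm up: first launch includes setup cost */
        cudaThreadSynchronize();       /* cudaDeviceSynchronize in newer toolkits   */

        cudaEventRecord(start, 0);
        for (i = 0; i < iters; ++i) {
            empty_kernel<<<1, 1>>>();  /* one kernel launch ...            */
            cudaThreadSynchronize();   /* ... and one global sync per pass */
        }
        cudaEventRecord(stop, 0);
        cudaEventSynchronize(stop);
        cudaEventElapsedTime(&ms, start, stop);

        printf("average launch + sync overhead: %f microseconds\n",
               ms * 1000.0f / iters);

        cudaEventDestroy(start);
        cudaEventDestroy(stop);
        return 0;
    }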

This message was obtained while debugging the code, at the point of the function cusparseScsrmv. Similar launch and termination messages appear during debugging when CUBLAS routines are encountered as well, and those kernels are launched automatically too. Since my application uses only the CUBLAS and CUSPARSE APIs, these APIs launch the kernels internally. How do I, as the programmer, use more than one kernel explicitly to complete code with iterations similar to the conjugate gradient example while using the library routines?

Thank you for your prompt reply.

  1. A kernel launch is non-blocking; however, a library API call may be blocking, and you have no information about the internal behaviour of the API.

  2. If you have only one Fermi card and still want concurrent kernel launches, you can bind each library to a different stream,
    for example with cusparseSetKernelStream in the CUSPARSE library (see the sketch after this list).

  3. If you have two or more Fermi cards in one PC, you can run CUBLAS on GPU0 and CUSPARSE on GPU1; however, one GPU cannot see another GPU's device memory,
    so you need to consider the communication overhead if you use multiple GPUs.
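A rough sketch of point 2 (assuming your toolkit also provides the legacy CUBLAS helper cublasSetKernelStream in addition to cusparseSetKernelStream; please check your CUBLAS documentation) might look like this:

    #include <cuda_runtime.h>
    #include <cublas.h>     /* legacy CUBLAS API */
    #include <cusparse.h>   /* CUSPARSE API      */

    int main(void)
    {
        cusparseHandle_t sp_handle = 0;
        cudaStream_t blas_stream, sparse_stream;

        cublasInit();
        cusparseCreate(&sp_handle);

        cudaStreamCreate(&blas_stream);
        cudaStreamCreate(&sparse_stream);

        /* subsequent CUBLAS kernels go into blas_stream
           (legacy helper; its availability is an assumption here) */
        cublasSetKernelStream(blas_stream);

        /* subsequent CUSPARSE kernels go into sparse_stream */
        cusparseSetKernelStream(sp_handle, sparse_stream);

        /* ... issue CUBLAS and CUSPARSE calls here; kernels that are truly
           independent of each other may then overlap on a Fermi-class GPU ... */

        /* wait for both streams before using results on the host */
        cudaStreamSynchronize(blas_stream);
        cudaStreamSynchronize(sparse_stream);

        cudaStreamDestroy(blas_stream);
        cudaStreamDestroy(sparse_stream);
        cusparseDestroy(sp_handle);
        cublasShutdown();
        return 0;
    }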

This is with regard to the above statement. Can you throw some light on how cusparseSetKernelStream works? Is there any help available on the use of such library routines (other than the SDK)? I am working on a GeForce 9400GT.

Sorry, maybe I am wrong. I followed section 3.2.7.5 of the programming guide and bound the CUSPARSE library to a different stream in the conjugate gradient example (SDK):

    cusparseHandle_t handle = 0;
    cusparseStatus_t status;

    /* create the CUSPARSE context */
    status = cusparseCreate(&handle);
    if (status != CUSPARSE_STATUS_SUCCESS) {
        fprintf(stderr, "!!!! CUSPARSE initialization error\n");
        return EXIT_FAILURE;
    }

    /* create a non-default stream and bind the CUSPARSE context to it,
       so that subsequent CUSPARSE kernels are launched into this stream */
    cudaStream_t stream;
    cudaStreamCreate(&stream);
    status = cusparseSetKernelStream(handle, stream);
    if (CUSPARSE_STATUS_SUCCESS != status) {
        fprintf(stderr, "Error: bind to different stream fails\n");
        exit(1);
    } else {
        printf("stream is %p\n", (void *) stream);  /* cast to void* for %p */
    }

The result is correct. This does not seem reasonable, because CUBLAS uses the default stream (stream 0) and the dependence between csrmv and the BLAS level-1 operations cannot be executed out of order; so if csrmv and the BLAS calls could really run concurrently, the result should be wrong.
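A possible explanation, which I am only guessing at from the stream semantics described in the programming guide, is that kernels launched into the default stream (stream 0) do not begin until previously issued work in other streams has completed, and vice versa; so the CUSPARSE kernels in the new stream and the CUBLAS kernels in stream 0 never actually overlap, which would keep the result correct. If you prefer the ordering to be explicit rather than implicit, a small sketch (my own, using only runtime API calls) is:

    #include <cuda_runtime.h>

    /* sketch: make the cross-stream dependence explicit before work in the
       default (NULL) stream consumes data produced in another stream       */
    static void wait_for_cusparse_stream(cudaStream_t stream)
    {
        /* block the host until everything issued into 'stream' (e.g. the
           csrmv launched by CUSPARSE) has finished                         */
        cudaStreamSynchronize(stream);
        /* after this returns, CUBLAS calls issued into stream 0 can safely
           read the vectors that csrmv wrote                                */
    }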
