Performance measurement and the cost to create handles (cuBLAS and cuSPARSE)

I want to compute the total time that a Conjugate Gradient solver, written
in CUDA (cuBLAS + cuSPARSE), spends to solve a sparse linear system.
In the first try, the program was set to print the total time needed to
solve an input sparse linear system only once. However, every time the
program was run for the same input system, the reported timing was
slightly different (probably due to timing fluctuations and
imprecision).

Thus, to get more accurate timing information, the program was modified
to solve the same system n times (the solver was called from within
a loop) and output the average of the n individual timings.

However, to my surprise, the time spent solving the linear system in the
first loop iteration was much larger than the time spent in the other
iterations. I thought I might have forgotten to clean up some data
generated by previous runs (the initial solution, for instance…), but that
was not the case. So I started stripping commands out of CGSolver()
until I ended up with the following:

Loop that calls CGSolver()

int num_tests = 3;

for (int test_count = 0; test_count < num_tests; test_count++ )
{
    ...
    CGSolver( ..., &elapsedTime );
    ...
    std::cout << elapsedTime << std::endl;    
}

CGSolver definition

void CGSolver( ..., float *elapsedTime )
{
    // timer initialization
    cudaEvent_t startSolver;
    cudaEvent_t stopSolver;
    cudaEventCreate( &startSolver );
    cudaEventCreate( &stopSolver );
    cudaEventRecord( startSolver, 0);   // <<<<< start capturing time
  
    // handle/descriptor creation
    cublasHandle_t cublasHandle = 0;
    cublasCreate( &cublasHandle ) ;
    cusparseHandle_t cusparseHandle = 0;
    cusparseCreate( &cusparseHandle ) ;
    cusparseMatDescr_t descr = 0;
    cusparseCreateMatDescr( &descr ) ;

    // handle/descriptor destruction
    cusparseDestroyMatDescr( descr ) ;
    cusparseDestroy( cusparseHandle ) ;
    cublasDestroy( cublasHandle );

    // compute total time
    cudaEventRecord( stopSolver, 0 );   // <<<<< stop capturing time
    cudaEventSynchronize( stopSolver );
    cudaEventElapsedTime( elapsedTime, startSolver, stopSolver );
}

In the program above, the loop just calls CGSolver() three times.
Every time CGSolver() is called, it just initializes the CUDA timer events,
creates the handles (for cuBLAS and cuSPARSE), destroys them, and computes
the total time spent. This is the actual output of the code above:

316.213
0.454592 
0.423904

I understand that handle creation should have some cost. However,
the handles are being created from scratch on each call to CGSolver(),
so the cost to create them should be embedded in each of the
measurements above.

So, these are my questions:

  1. Why is the first measurement so much larger?
  2. Is this the correct way to measure the execution time of programs that use cuSPARSE/cuBLAS functions? If not, what is the correct way?

Just out of curiosity, I modified the CGSolver() function so that
time capture starts right after the handle/descriptor initialization:

Modified CGSolver()

void CGSolver( ..., float *elapsedTime )
{
    // timer initialization
    cudaEvent_t startSolver;
    cudaEvent_t stopSolver;
    cudaEventCreate( &startSolver );
    cudaEventCreate( &stopSolver );
  
    // handle/descriptor creation
    cublasHandle_t cublasHandle = 0;
    cublasCreate( &cublasHandle ) ;
    cusparseHandle_t cusparseHandle = 0;
    cusparseCreate( &cusparseHandle ) ;
    cusparseMatDescr_t descr = 0;
    cusparseCreateMatDescr( &descr ) ;

    cudaEventRecord( startSolver, 0);   // <<<<< start capturing time

    // handle/descriptor destruction
    cusparseDestroyMatDescr( descr ) ;
    cusparseDestroy( cusparseHandle ) ;
    cublasDestroy( cublasHandle );

    // compute total time
    cudaEventRecord( stopSolver, 0 );   // <<<<< stop capturing time
    cudaEventSynchronize( stopSolver );
    cudaEventElapsedTime( elapsedTime, startSolver, stopSolver );
}

Now, I’ve got the following timings:

0.334752 
0.306208 
0.114816

The timings are almost equal now. Would this be the correct
way of measuring time in this context?

Thank you very much for the help!
Christian

The libraries in question have relatively large initialization costs. Much of this cost will be incurred in the create call (e.g. cublasCreate()).

There is probably no single definition of the “correct way”. It depends on what you are concerned about.

If you have an application that solves a problem on the CPU, and another application that solves the problem on the GPU, then the correct measurement method is probably just to measure walltime of application execution.

On the other hand, if you are benchmarking individual functions or routines, such as the sparse matrix-vector multiply routine in cuSPARSE that is commonly used in such a CG solver, then you probably don't want to include the library initialization times: that routine is called on every CG iteration, whereas the library initialization overhead is not incurred on each iteration.

Since you now know that the create calls can have significant overhead, a possible application design best practice is to minimize their use: e.g., call cublasCreate() once, at initialization, and then reuse the returned handle as needed. Don't create/destroy/create/destroy needlessly.
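Following that advice, one way to restructure the benchmark (a sketch only; the elided solver arguments are kept as placeholders, and error checking is omitted) is to hoist the create/destroy calls out of the loop and pass the handles into the solver:

```
// Create the handles/descriptor once, outside the timed region.
cublasHandle_t     cublasHandle   = 0;
cusparseHandle_t   cusparseHandle = 0;
cusparseMatDescr_t descr          = 0;

cublasCreate( &cublasHandle );          // one-time cost, paid up front
cusparseCreate( &cusparseHandle );
cusparseCreateMatDescr( &descr );

int num_tests = 3;
for (int test_count = 0; test_count < num_tests; test_count++)
{
    float elapsedTime = 0.0f;
    // CGSolver now receives the handles instead of creating its own.
    CGSolver( cublasHandle, cusparseHandle, descr, ..., &elapsedTime );
    std::cout << elapsedTime << std::endl;
}

// Destroy once, after the benchmark loop.
cusparseDestroyMatDescr( descr );
cusparseDestroy( cusparseHandle );
cublasDestroy( cublasHandle );
```

With this structure, each measured iteration reflects only the solve itself, and the one-time initialization cost shows up nowhere in the per-iteration timings.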

That was helpful txbob!

Some other questions... 

Could it be that the first call to these "create" functions (which together account for ~300 ms) is related to actually loading the corresponding libraries from disk? In that case, I would expect the OS to keep recently used libraries in memory (right after the first load) to reduce the loading time in subsequent program executions.

Christian

CUDA, the CUDA runtime, and most CUDA-based libraries have internal state that needs to be initialized. Downloading device code to the GPU is part of this state initialization. Beyond that, not many details about the initialization process are publicly available. As txbob said, you would want to make sure that contexts are created only once per application run.
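A common idiom (a sketch, not something stated in this thread) is to force context creation and pay the library initialization cost before the timed region begins, so that the first measured iteration is not polluted by it:

```
cudaFree( 0 );                  // innocuous call that forces CUDA context creation

cublasHandle_t cublasHandle = 0;
cublasCreate( &cublasHandle );  // library init cost paid here, before timing

// (optionally run one untimed "warm-up" call of the routine under test)

// ...start cudaEventRecord-based timing only after this point...
```

The choice of warm-up call depends on what is being benchmarked; the point is simply that all one-time costs land before the first timed measurement.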

Thank you njuffa!

Christian