EVSL Lib is 190 ms on Quadro P520, but 82 ms on Titan RTX2080

Something is clearly wrong here, I would expect a speedup of at least 10 to 20 x

RTX2080 = 2944 Turing cores @ 1.5GHz 448 GB/s memory bw
P520 = 384 Pascal cores @1.3 GHz 48 GB/s memory bw

This is running the EVSL eigenvalue solver from umn.edu. It’s a sparse matrix of size 8700x8700. To build it,

./configure --with-cuda

I am running TESTS/PLanN/MMPlanN.ex

Expecting application scaling proportional to things like GPU single-precision compute throughput or GPU memory bandwidth makes the assumption that the code in question is bound or limited by those characteristics. When you don’t find that kind of scaling, it’s a pretty solid disproof of the underlying assumption.

As a simple example, the code may do much of its work on the CPU. Another example would be a code that launches many very small kernels, so it ends up being bound by GPU kernel launch overhead, rather than anything else.

I started by downloading the zip file indicated here, i.e. EVSLv1.1.0.zip

That zip archive does not even have a configure script in it.

So then I grabbed the latest from github. When I did make, I got an error on cusparseHybMat_t usage. That usage was deprecated in CUDA 10.1 and is no longer available in recent versions of CUDA.

Switching to CUDA 10.2 I was able to get things to build. (it’s important to also set CUDA_HOME and EVSL_CUDA_SM env vars correctly) I then ran the indicated test:

EVSL-master/EVSL_1.1.1/TESTS/PLanN$ ./MMPLanN.ex
There are 8 devices supporting CUDA

Device 0: "Tesla V100-SXM2-32GB"
  Major revision number:          7
  Minor revision number:          0
  Total amount of global memory:  34.09 GB

Running on Device 0: "Tesla V100-SXM2-32GB"
MATRIX: stiff1...
 Timing (sec):
   Iteration time (tot)     :  3.018114
   - - - - - - - - - - - - - - - - -
   Pol(A)*v                 :  0.222485 (    1212, avg 0.000184)
   Reorthogonalization      :  0.256173
   LAPACK eig               :  2.428596
   Compute Ritz vectors     :  0.070712
 Timing (sec):
   Iteration time (tot)     :  2.365402
   - - - - - - - - - - - - - - - - -
   Pol(A)*v                 :  0.292050 (    1112, avg 0.000263)
   Reorthogonalization      :  0.172390
   LAPACK eig               :  1.803673
   Compute Ritz vectors     :  0.060340
 Timing (sec):
   Iteration time (tot)     :  3.378132
   - - - - - - - - - - - - - - - - -
   Pol(A)*v                 :  0.498017 (    1252, avg 0.000398)
   Reorthogonalization      :  0.203755
   LAPACK eig               :  2.559796
   Compute Ritz vectors     :  0.075044
 Timing (sec):
   Iteration time (tot)     :  2.031462
   - - - - - - - - - - - - - - - - -
   Pol(A)*v                 :  0.441989 (    1012, avg 0.000437)
   Reorthogonalization      :  0.151184
   LAPACK eig               :  1.351475
   Compute Ritz vectors     :  0.053499
 Timing (sec):
   Iteration time (tot)     :  1.723298
   - - - - - - - - - - - - - - - - -
   Pol(A)*v                 :  0.526416 (     912, avg 0.000577)
   Reorthogonalization      :  0.136493
   LAPACK eig               :  0.982696
   Compute Ritz vectors     :  0.045362

It wasn’t clear from that which numbers might correspond to your 190ms and 82ms numbers. Next, running under nvprof:

$ nvprof ./MMPLanN.ex
==30127== NVPROF is profiling process 30127, command: ./MMPLanN.ex
There are 8 devices supporting CUDA

Device 0: "Tesla V100-SXM2-32GB"
  Major revision number:          7
  Minor revision number:          0
  Total amount of global memory:  34.09 GB

Running on Device 0: "Tesla V100-SXM2-32GB"
MATRIX: stiff1...
 Timing (sec):
   Iteration time (tot)     :  3.068205
   - - - - - - - - - - - - - - - - -
   Pol(A)*v                 :  0.250324 (    1212, avg 0.000207)
   Reorthogonalization      :  0.229083
   LAPACK eig               :  2.429348
   Compute Ritz vectors     :  0.085626
 Timing (sec):
   Iteration time (tot)     :  2.460952
   - - - - - - - - - - - - - - - - -
   Pol(A)*v                 :  0.315166 (    1112, avg 0.000283)
   Reorthogonalization      :  0.207395
   LAPACK eig               :  1.800784
   Compute Ritz vectors     :  0.075038
 Timing (sec):
   Iteration time (tot)     :  3.502068
   - - - - - - - - - - - - - - - - -
   Pol(A)*v                 :  0.535242 (    1252, avg 0.000428)
   Reorthogonalization      :  0.240271
   LAPACK eig               :  2.564591
   Compute Ritz vectors     :  0.090022
 Timing (sec):
   Iteration time (tot)     :  2.145351
   - - - - - - - - - - - - - - - - -
   Pol(A)*v                 :  0.469250 (    1012, avg 0.000464)
   Reorthogonalization      :  0.186616
   LAPACK eig               :  1.354089
   Compute Ritz vectors     :  0.074912
 Timing (sec):
   Iteration time (tot)     :  1.854002
   - - - - - - - - - - - - - - - - -
   Pol(A)*v                 :  0.562847 (     912, avg 0.000617)
   Reorthogonalization      :  0.158828
   LAPACK eig               :  1.020996
   Compute Ritz vectors     :  0.058711
==30127== Profiling application: ./MMPLanN.ex
==30127== Profiling result:
            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
 GPU activities:   41.11%  1.35931s    196864  6.9040us  6.5600us  12.991us  void csrMv_kernel<double, double, double, int=128, int=3>(cusparseCsrMvParams<double, double, double>)
                    9.46%  312.88ms     28914  10.821us  8.8310us  15.711us  void nrm2_kernel<double, double, double, int=0, int=0, int=128>(cublasNrm2Params<double, double>)
                    9.46%  312.72ms    192732  1.6220us  1.5350us  13.023us  evsl_chebAv_kernel(int, int, double, double, double, double*, double*, double*, double*)
                    8.66%  286.22ms      7226  39.609us  2.6240us  79.422us  void gemvNSP_kernel<double, double, double, int=1, int=16, int=4, int=1024, cublasGemvParams<cublasGemvTensorStridedBatched<double const >, cublasGemvTensorStridedBatched<double>, double>>(double const )
                    6.08%  201.09ms    109257  1.8400us  1.5990us  13.151us  void dot_kernel<double, int=128, int=0, cublasDotParams<cublasGemvTensor<double const >, cublasGemvTensorStridedBatched<double>>>(double const )
                    6.01%  198.64ms    111003  1.7890us  1.5030us  7.5190us  void reduce_1Block_kernel<double, int=128, int=7, cublasGemvTensorStridedBatched<double>, cublasGemvTensorStridedBatched<double>>(double const *, double, double, int, double const *, double, cublasGemvTensorStridedBatched<double>, cublasPointerMode_t, cublasLtEpilogue_t, cublasGemvTensorStridedBatched<biasType<cublasGemvTensorStridedBatched<double>value_type, double>::type const >)
                    5.63%  186.16ms    123714  1.5040us  1.3430us  3.3910us  [CUDA memcpy DtoH]
                    5.31%  175.65ms      3940  44.580us  3.9040us  76.509us  void gemv2T_kernel_val<int, int, double, double, double, int=128, int=16, int=4, int=4, bool=0, bool=0, cublasGemvParams<cublasGemvTensorStridedBatched<double const >, cublasGemvTensorStridedBatched<double>, double>>(double const , double, double)
                    5.00%  165.32ms    114883  1.4390us     832ns  12.703us  void axpy_kernel_val<double, double>(cublasAxpyParamsVal<double, double, double>)
                    1.26%  41.575ms     14501  2.8670us  1.6960us  2.2576ms  [CUDA memcpy HtoD]
                    0.90%  29.751ms      1746  17.039us  1.6640us  35.166us  void dot_kernel<double, int=128, int=0, cublasDotParams<cublasGemvTensorStridedBatched<double const >, cublasGemvTensorStridedBatched<double>>>(double const )
                    0.54%  17.731ms     15163  1.1690us  1.0870us  1.9200us  void scal_kernel_val<double, double>(cublasScalParamsVal<double, double>)
                    0.36%  12.007ms     11000  1.0910us  1.0550us  2.5600us  [CUDA memcpy DtoD]
                    0.21%  7.0824ms      3135  2.2590us  1.9200us  7.1360us  void splitKreduce_kernel<double, double, double, double>(cublasSplitKParams<double>, double const *, double const *, double*, double const *, double const *, double const *)
                    0.00%  60.124us        42  1.4310us  1.3120us  3.0710us  void copy_kernel<double>(cublasCopyParams<double>)
                    0.00%  12.191us         6  2.0310us  1.8230us  2.7520us  void gemmk1_kernel<double, int=256, int=5, bool=0, bool=0, bool=0, bool=0, cublasGemvTensorStridedBatched<double const >, cublasGemvTensorStridedBatched<double>, double>(cublasGemmk1Params<double, double const , cublasGemvTensorStridedBatched<double const >, double, biasType<cublasGemvTensorStridedBatched<double const >value_type, double>::type>)
      API calls:   39.64%  3.31312s    784911  4.2210us  3.4570us  5.3350ms  cudaLaunchKernel
                   26.31%  2.19878s    138171  15.913us  4.7810us  61.292ms  cudaMemcpyAsync
                   17.15%  1.43343s        48  29.863ms     287ns  662.59ms  cudaFree
                    5.99%  500.40ms     23021  21.736us     844ns  291.86ms  cudaDeviceSynchronize
                    3.48%  291.11ms    138166  2.1060us  1.7030us  7.7628ms  cudaFuncGetAttributes
                    1.99%  166.61ms   1053811     158ns     112ns  564.24us  cudaGetLastError
                    1.23%  102.74ms    123714     830ns     735ns  534.01us  cudaStreamSynchronize
                    1.07%  89.388ms    128595     695ns     577ns  927.49us  cudaEventRecord
                    1.06%  88.486ms    128595     688ns     538ns  556.34us  cudaEventQuery
                    0.91%  75.958ms     11044  6.8770us  4.5640us  2.3981ms  cudaMemcpy
                    0.42%  34.699ms        40  867.46us  620.69us  2.2723ms  cuDeviceTotalMem
                    0.40%  33.058ms      3880  8.5200us     118ns  709.71us  cuDeviceGetAttribute
                    0.12%  9.6551ms        39  247.57us  4.8970us  1.1462ms  cudaMalloc
                    0.11%  8.8220ms        10  882.20us  674.56us  915.43us  cudaGetDeviceProperties
                    0.10%  8.3000ms     12918     642ns     402ns  473.02us  cudaOccupancyMaxActiveBlocksPerMultiprocessorWithFlags
                    0.04%  3.1489ms        40  78.722us  63.139us  172.80us  cuDeviceGetName
                    0.00%  71.926us       120     599ns     312ns  10.513us  cudaFuncSetAttribute
                    0.00%  39.925us        48     831ns     143ns  6.6270us  cuDeviceGet
                    0.00%  28.428us         8  3.5530us  1.8570us  8.8330us  cuDeviceGetPCIBusId
                    0.00%  15.939us         3  5.3130us  3.8260us  8.1880us  cudaGetDevice
                    0.00%  11.207us        18     622ns     363ns  2.7690us  cudaEventDestroy
                    0.00%  11.104us        18     616ns     372ns  1.8870us  cudaEventCreateWithFlags
                    0.00%  10.374us        22     471ns     251ns  1.8590us  cudaDeviceGetAttribute
                    0.00%  10.060us         4  2.5150us  2.2380us  2.7350us  cuInit
                    0.00%  9.8290us        40     245ns     124ns     850ns  cuDeviceGetUuid
                    0.00%  6.9860us         1  6.9860us  6.9860us  6.9860us  cudaSetDevice
                    0.00%  3.9080us         7     558ns     147ns  1.8920us  cuDeviceGetCount
                    0.00%  1.5500us         4     387ns     240ns     608ns  cuDriverGetVersion
                    0.00%     790ns         1     790ns     790ns     790ns  cudaGetDeviceCount

If we look at the top 3 items in the GPU activities list (accounting for 60% of the GPU timeline), we see that each corresponds to thousands or hundreds of thousands of kernel calls, where the average kernel execution duration is on the order of 2-10 microseconds. GPU kernel launch overhead is in the range of 2-10 microseconds, and we would not expect GPU kernel launch overhead to scale meaningfully going from one GPU to another. Therefore I would coin a term and say this code seems to be “kernel launch bound”, but that is just another way of saying the problem size is too small to make interesting use of larger GPUs. I don’t doubt that the original problem “size” may indeed seem large enough to be interesting on a GPU, but I’m suggesting the decomposition of work undertaken by the algorithm is creating individual problem sizes (i.e. kernel launch configs, etc.) that are too small to scale nicely on/to larger/better GPUs.

We can also observe that the overall application runtime appears to be about 15s, whereas the sum-total of GPU kernel activities from the profiler output appears to be less than 3.5s. Therefore on this particular GPU, approximately 75% of the application timeline (on Tesla V100) appears to consist of activities that we would not expect to be affected by the factors you mention, like number of CUDA cores and GPU memory bandwidth. We can use this data to estimate scaling behavior, but it should be evident that proportional scaling could not be expected here.

That is my initial read of the situation, and a possible contributing factor to the lack of expected scaling. (I’m not presenting any scaling data here.)

Here is the data that nvprof shows for my runs

column G shows the percent of time spent in launch overhead, assuming 2 microseconds per launch. It’s 1 percent in the P520 case and rises to 10 percent for the RTX2080.
The raw capacity of the 2080 device is 10x greater than the P520 yet we see only about 50% speedup. I don’t think 10% launch overhead can explain such a large discrepency.

Unless something has changed in the latest hardware, the fastest rate at which kernels can be issued to the GPU is 200,000 per second, i.e. minimum launch overhead is 5 microsecond. That lower limit has been stable for over a decade, so I’d be surprised if launch overhead of substantially less than 5 microseconds is achieved on any PCIe gen3 connected GPU.

On a Windows system with WDDM driver launch overhead can fluctuate quite widely due to the driver employing launch batching to (partially) overcome the high overhead of submitting work to the GPU inherent in WDDM. I have observed up to 25 microseconds of launch overhead in that scenario.

On my Quadro RTX 4000 (I think that is roughly equivalent to an RTX 2060?) under Windows 10 I measure just shy of 5 microsecond launch overhead with the simple program shown below.

#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>

// A routine to give access to a high precision timer on most systems.
#if defined(_WIN32)
#if !defined(WIN32_LEAN_AND_MEAN)
#include <windows.h>
double second (void)
    static double oofreq;
    static int checkedForHighResTimer;
    static BOOL hasHighResTimer;

    if (!checkedForHighResTimer) {
        hasHighResTimer = QueryPerformanceFrequency (&t);
        oofreq = 1.0 / (double)t.QuadPart;
        checkedForHighResTimer = 1;
    if (hasHighResTimer) {
        QueryPerformanceCounter (&t);
        return (double)t.QuadPart * oofreq;
    } else {
        return (double)GetTickCount() * 1.0e-3;
#elif defined(__linux__) || defined(__APPLE__)
#include <stddef.h>
#include <sys/time.h>
double second (void)
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return (double)tv.tv_sec + (double)tv.tv_usec * 1.0e-6;
#error unsupported platform

// null kernel
__global__ void kernel (void) { }

#define N  (1 << 16)
int main (void)
    double start, stop;
    start = second();
    for (int i = 0; i < N; i++) {
    stop = second();
    printf ("%d kernels launched in %.5f seconds (%.3f microseconds per kernel)\n", N, stop - start, ((stop-start)/N)*1e6);
    return EXIT_SUCCESS;

The output on my system looks like this:

65536 kernels launched in 0.32331 seconds (4.933 microseconds per kernel)

Note that this is using null-kernels. Kernel launch time for non-null kernels tends to be a somewhat higher due to the need to transmit more data to the GPU.

Robert Corvella’s profiler output in the post above show an average kernel execution time of 6.9040us for csrMv_kernel, so this does look like it’s is mostly bounded by launch overhead. Because launch overhead has roughly stayed the same for many years but GPUs have gotten much faster, the risk of code becoming launch-throughput bound due to short-running kernels has increased substantially.

I have not had a chance to work with a PCIe gen4 connected GPU yet, so I cannot say whether this achieves a reduction in launch overhead. It is prudent to not expect too much in this regard. There is an old saying “you can pay for bandwidth, but latency is forever” and that very much applies to common interconnects.

I think that is one factor of what we are seeing but it does not explain most of the discrepency. I plugged 6 usec into my spreadsheet as the kernel launch overhead. The result is that the 2080 spends 30% of its time in launch overhead. If that were the only factor at play here we would see a factor of 7 speedup, but I only see a factor of 0.43.

I addressed one particular aspect. I never claimed that this explains all observations. From my own experience modelling computer systems, I know that modelling is hard, in particular where DRAM is involved. Reducing modelling error to within 10% on a memory-intensive application like sparse matrix work would be an excellent result.

If you want a fairly accurate model, you would want to also consider, at minimum, computational throughput, memory throughput, and memory latency. In my experience a simple analytical model is unlikely to ever provide satisfactory accuracy. That held true twenty years ago and even more so today. The dynamic interactions in complex processors are just too complicated.

On modern GPUs, NVIDIA’s GPU profilers do allow capture of a wide variety of events that can point to various bottlenecks, which may well be different between two GPUs. I haven’t looked closely at this thread, but want to point out that it is important to perform controlled experiments where only one variable changes between experiments. So if you want to compare two GPUs, physically swap the GPUs in the system while keeping all other hardware and software exactly the same.

we appear to be measuring different things

I ran the suggested app, and measured hundreds of thousands of kernel calls. You appear to be measuring something else consisting of a few thousand kernel calls. Good luck!

I am running MMPlanN.ex for a nonstandard range of eigenvalues, that is why my job runs shorter than yours. The executable is the same and the input matrix is the same.

In that case, one thing you may want to check is whether the reduced problem size offers enough parallelism to fully utilize the RTX 2080.

It’s the same problem size with the same parallelism. The only difference is that it terminates sooner because it is finding fewer eigenvalues.

This is a classic instance of strong scaling, running a fixed size problem on varying numbers of computers. I understand the loss of 30% due to launch overheads. I don’t know where the rest of the performance goes.

This would be a good time to get further acquainted with the profiler. This should allow you to pinpoint where the bottlenecks are.

If you’re trying to explain a speedup of ~2.5x (190:82) the profiling data you have already captured will not do it, i.e. cannot be the explanation. The data you have already captured shows a ratio of ~6x (344:58) So clearly it is some other aspect of the application. If the total time (58ms) is a relatively small portion of the overall runtime for your workload, then a situation where the workload is still doing much of its work on the host rather than the device is still a possible contributing factor (i.e not ruled out by the data already presented).

It also seems evident that these numbers cannot be consistent. Although 58ms kernel time could conceivably be part of a 82ms “workload” time, 344ms could not be a component of a 190ms “workload” time. It doesn’t seem to add up. I’m unable to figure out what is being measured.