Expecting application scaling proportional to things like GPU single-precision compute throughput or GPU memory bandwidth assumes that the code in question is actually bound by those characteristics. When you don’t observe that kind of scaling, it’s fairly strong evidence that the assumption doesn’t hold.

As a simple example, the code may do much of its work on the CPU. Another example would be a code that launches many very small kernels, so it ends up being bound by GPU kernel launch overhead, rather than anything else.
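As a toy illustration of the second case (a sketch with made-up numbers, not measurements): suppose a fixed per-launch overhead of ~5 microseconds and a total of 1 ms of useful GPU work. The more finely that work is split across kernels, the more launch overhead dominates:

```python
launch_overhead_us = 5.0   # assumed fixed cost per kernel launch
total_work_us = 1000.0     # total useful GPU work, split across kernels

for n_kernels in (1, 10, 100, 1000):
    work_per_kernel = total_work_us / n_kernels
    total = n_kernels * (launch_overhead_us + work_per_kernel)
    overhead_frac = n_kernels * launch_overhead_us / total
    print(f"{n_kernels:>5} kernels: {total:7.0f} us total, "
          f"{overhead_frac:4.0%} launch overhead")
```

Note that a faster GPU only shrinks the work term; the launch-overhead term is unchanged, so in the 1000-kernel case almost nothing improves.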

I started by downloading the zip file indicated here, i.e. EVSLv1.1.0.zip

That zip archive does not even have a configure script in it.

So then I grabbed the latest from GitHub. When I ran `make`, I got an error on the `cusparseHybMat_t` usage. That type was deprecated in CUDA 10.1 and is no longer available in recent versions of CUDA.

Switching to CUDA 10.2, I was able to get things to build (it’s important to also set the `CUDA_HOME` and `EVSL_CUDA_SM` environment variables correctly). I then ran the indicated test:

```
EVSL-master/EVSL_1.1.1/TESTS/PLanN$ ./MMPLanN.ex
=========================================
There are 8 devices supporting CUDA
Device 0: "Tesla V100-SXM2-32GB"
Major revision number: 7
Minor revision number: 0
Total amount of global memory: 34.09 GB
...
Running on Device 0: "Tesla V100-SXM2-32GB"
=========================================
MATRIX: stiff1...
Timing (sec):
Iteration time (tot) : 3.018114
- - - - - - - - - - - - - - - - -
Pol(A)*v : 0.222485 ( 1212, avg 0.000184)
Reorthogonalization : 0.256173
LAPACK eig : 2.428596
Compute Ritz vectors : 0.070712
Timing (sec):
Iteration time (tot) : 2.365402
- - - - - - - - - - - - - - - - -
Pol(A)*v : 0.292050 ( 1112, avg 0.000263)
Reorthogonalization : 0.172390
LAPACK eig : 1.803673
Compute Ritz vectors : 0.060340
Timing (sec):
Iteration time (tot) : 3.378132
- - - - - - - - - - - - - - - - -
Pol(A)*v : 0.498017 ( 1252, avg 0.000398)
Reorthogonalization : 0.203755
LAPACK eig : 2.559796
Compute Ritz vectors : 0.075044
Timing (sec):
Iteration time (tot) : 2.031462
- - - - - - - - - - - - - - - - -
Pol(A)*v : 0.441989 ( 1012, avg 0.000437)
Reorthogonalization : 0.151184
LAPACK eig : 1.351475
Compute Ritz vectors : 0.053499
Timing (sec):
Iteration time (tot) : 1.723298
- - - - - - - - - - - - - - - - -
Pol(A)*v : 0.526416 ( 912, avg 0.000577)
Reorthogonalization : 0.136493
LAPACK eig : 0.982696
Compute Ritz vectors : 0.045362
$
```

It wasn’t clear from that output which numbers might correspond to your 190ms and 82ms figures. Next, I ran under nvprof:

```
$ nvprof ./MMPLanN.ex
==30127== NVPROF is profiling process 30127, command: ./MMPLanN.ex
=========================================
There are 8 devices supporting CUDA
Device 0: "Tesla V100-SXM2-32GB"
Major revision number: 7
Minor revision number: 0
Total amount of global memory: 34.09 GB
...
Running on Device 0: "Tesla V100-SXM2-32GB"
=========================================
MATRIX: stiff1...
Timing (sec):
Iteration time (tot) : 3.068205
- - - - - - - - - - - - - - - - -
Pol(A)*v : 0.250324 ( 1212, avg 0.000207)
Reorthogonalization : 0.229083
LAPACK eig : 2.429348
Compute Ritz vectors : 0.085626
Timing (sec):
Iteration time (tot) : 2.460952
- - - - - - - - - - - - - - - - -
Pol(A)*v : 0.315166 ( 1112, avg 0.000283)
Reorthogonalization : 0.207395
LAPACK eig : 1.800784
Compute Ritz vectors : 0.075038
Timing (sec):
Iteration time (tot) : 3.502068
- - - - - - - - - - - - - - - - -
Pol(A)*v : 0.535242 ( 1252, avg 0.000428)
Reorthogonalization : 0.240271
LAPACK eig : 2.564591
Compute Ritz vectors : 0.090022
Timing (sec):
Iteration time (tot) : 2.145351
- - - - - - - - - - - - - - - - -
Pol(A)*v : 0.469250 ( 1012, avg 0.000464)
Reorthogonalization : 0.186616
LAPACK eig : 1.354089
Compute Ritz vectors : 0.074912
Timing (sec):
Iteration time (tot) : 1.854002
- - - - - - - - - - - - - - - - -
Pol(A)*v : 0.562847 ( 912, avg 0.000617)
Reorthogonalization : 0.158828
LAPACK eig : 1.020996
Compute Ritz vectors : 0.058711
==30127== Profiling application: ./MMPLanN.ex
==30127== Profiling result:
Type Time(%) Time Calls Avg Min Max Name
GPU activities: 41.11% 1.35931s 196864 6.9040us 6.5600us 12.991us void csrMv_kernel<double, double, double, int=128, int=3>(cusparseCsrMvParams<double, double, double>)
9.46% 312.88ms 28914 10.821us 8.8310us 15.711us void nrm2_kernel<double, double, double, int=0, int=0, int=128>(cublasNrm2Params<double, double>)
9.46% 312.72ms 192732 1.6220us 1.5350us 13.023us evsl_chebAv_kernel(int, int, double, double, double, double*, double*, double*, double*)
8.66% 286.22ms 7226 39.609us 2.6240us 79.422us void gemvNSP_kernel<double, double, double, int=1, int=16, int=4, int=1024, cublasGemvParams<cublasGemvTensorStridedBatched<double const >, cublasGemvTensorStridedBatched<double>, double>>(double const )
6.08% 201.09ms 109257 1.8400us 1.5990us 13.151us void dot_kernel<double, int=128, int=0, cublasDotParams<cublasGemvTensor<double const >, cublasGemvTensorStridedBatched<double>>>(double const )
6.01% 198.64ms 111003 1.7890us 1.5030us 7.5190us void reduce_1Block_kernel<double, int=128, int=7, cublasGemvTensorStridedBatched<double>, cublasGemvTensorStridedBatched<double>>(double const *, double, double, int, double const *, double, cublasGemvTensorStridedBatched<double>, cublasPointerMode_t, cublasLtEpilogue_t, cublasGemvTensorStridedBatched<biasType<cublasGemvTensorStridedBatched<double>value_type, double>::type const >)
5.63% 186.16ms 123714 1.5040us 1.3430us 3.3910us [CUDA memcpy DtoH]
5.31% 175.65ms 3940 44.580us 3.9040us 76.509us void gemv2T_kernel_val<int, int, double, double, double, int=128, int=16, int=4, int=4, bool=0, bool=0, cublasGemvParams<cublasGemvTensorStridedBatched<double const >, cublasGemvTensorStridedBatched<double>, double>>(double const , double, double)
5.00% 165.32ms 114883 1.4390us 832ns 12.703us void axpy_kernel_val<double, double>(cublasAxpyParamsVal<double, double, double>)
1.26% 41.575ms 14501 2.8670us 1.6960us 2.2576ms [CUDA memcpy HtoD]
0.90% 29.751ms 1746 17.039us 1.6640us 35.166us void dot_kernel<double, int=128, int=0, cublasDotParams<cublasGemvTensorStridedBatched<double const >, cublasGemvTensorStridedBatched<double>>>(double const )
0.54% 17.731ms 15163 1.1690us 1.0870us 1.9200us void scal_kernel_val<double, double>(cublasScalParamsVal<double, double>)
0.36% 12.007ms 11000 1.0910us 1.0550us 2.5600us [CUDA memcpy DtoD]
0.21% 7.0824ms 3135 2.2590us 1.9200us 7.1360us void splitKreduce_kernel<double, double, double, double>(cublasSplitKParams<double>, double const *, double const *, double*, double const *, double const *, double const *)
0.00% 60.124us 42 1.4310us 1.3120us 3.0710us void copy_kernel<double>(cublasCopyParams<double>)
0.00% 12.191us 6 2.0310us 1.8230us 2.7520us void gemmk1_kernel<double, int=256, int=5, bool=0, bool=0, bool=0, bool=0, cublasGemvTensorStridedBatched<double const >, cublasGemvTensorStridedBatched<double>, double>(cublasGemmk1Params<double, double const , cublasGemvTensorStridedBatched<double const >, double, biasType<cublasGemvTensorStridedBatched<double const >value_type, double>::type>)
API calls: 39.64% 3.31312s 784911 4.2210us 3.4570us 5.3350ms cudaLaunchKernel
26.31% 2.19878s 138171 15.913us 4.7810us 61.292ms cudaMemcpyAsync
17.15% 1.43343s 48 29.863ms 287ns 662.59ms cudaFree
5.99% 500.40ms 23021 21.736us 844ns 291.86ms cudaDeviceSynchronize
3.48% 291.11ms 138166 2.1060us 1.7030us 7.7628ms cudaFuncGetAttributes
1.99% 166.61ms 1053811 158ns 112ns 564.24us cudaGetLastError
1.23% 102.74ms 123714 830ns 735ns 534.01us cudaStreamSynchronize
1.07% 89.388ms 128595 695ns 577ns 927.49us cudaEventRecord
1.06% 88.486ms 128595 688ns 538ns 556.34us cudaEventQuery
0.91% 75.958ms 11044 6.8770us 4.5640us 2.3981ms cudaMemcpy
0.42% 34.699ms 40 867.46us 620.69us 2.2723ms cuDeviceTotalMem
0.40% 33.058ms 3880 8.5200us 118ns 709.71us cuDeviceGetAttribute
0.12% 9.6551ms 39 247.57us 4.8970us 1.1462ms cudaMalloc
0.11% 8.8220ms 10 882.20us 674.56us 915.43us cudaGetDeviceProperties
0.10% 8.3000ms 12918 642ns 402ns 473.02us cudaOccupancyMaxActiveBlocksPerMultiprocessorWithFlags
0.04% 3.1489ms 40 78.722us 63.139us 172.80us cuDeviceGetName
0.00% 71.926us 120 599ns 312ns 10.513us cudaFuncSetAttribute
0.00% 39.925us 48 831ns 143ns 6.6270us cuDeviceGet
0.00% 28.428us 8 3.5530us 1.8570us 8.8330us cuDeviceGetPCIBusId
0.00% 15.939us 3 5.3130us 3.8260us 8.1880us cudaGetDevice
0.00% 11.207us 18 622ns 363ns 2.7690us cudaEventDestroy
0.00% 11.104us 18 616ns 372ns 1.8870us cudaEventCreateWithFlags
0.00% 10.374us 22 471ns 251ns 1.8590us cudaDeviceGetAttribute
0.00% 10.060us 4 2.5150us 2.2380us 2.7350us cuInit
0.00% 9.8290us 40 245ns 124ns 850ns cuDeviceGetUuid
0.00% 6.9860us 1 6.9860us 6.9860us 6.9860us cudaSetDevice
0.00% 3.9080us 7 558ns 147ns 1.8920us cuDeviceGetCount
0.00% 1.5500us 4 387ns 240ns 608ns cuDriverGetVersion
0.00% 790ns 1 790ns 790ns 790ns cudaGetDeviceCount
$
```

If we look at the top 3 items in the GPU activities list (accounting for about 60% of the GPU timeline), we see that each corresponds to thousands or hundreds of thousands of kernel calls, with average kernel execution durations on the order of 2-10 microseconds. GPU kernel launch overhead is also in the 2-10 microsecond range, and we would not expect launch overhead to improve meaningfully going from one GPU to another. So, to coin a term, this code seems to be “kernel launch bound”, which is just another way of saying the problem size is too small to make interesting use of larger GPUs. I don’t doubt that the original problem “size” may indeed seem large enough to be interesting on a GPU, but I’m suggesting that the decomposition of work performed by the algorithm creates individual work items (i.e. kernel launch configurations, etc.) that are too small to scale well to larger/faster GPUs.
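We can make that concrete with a rough per-kernel cost model using figures from the nvprof output above. As an approximation, I’ll take the ~4.2 us average `cudaLaunchKernel` time from the API calls table as the per-launch overhead; this is a sketch, not a precise accounting:

```python
launch_overhead_us = 4.2   # avg cudaLaunchKernel time from the API table

# (calls, avg execution time in us) for the top three GPU activities
top_kernels = [
    (196864, 6.9),    # csrMv_kernel
    (28914, 10.8),    # nrm2_kernel
    (192732, 1.6),    # evsl_chebAv_kernel
]

for calls, avg_us in top_kernels:
    exec_s = calls * avg_us / 1e6
    launch_s = calls * launch_overhead_us / 1e6
    frac = launch_s / (launch_s + exec_s)
    print(f"{calls:>7} calls: exec {exec_s:5.2f}s, "
          f"launch ~{launch_s:5.2f}s ({frac:4.0%} of combined cost)")
```

For the sub-2-microsecond kernels, the estimated launch cost exceeds the actual execution time, which is exactly the “kernel launch bound” pattern.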

We can also observe that the overall application runtime appears to be about 15s, whereas the sum total of GPU kernel activity from the profiler output appears to be less than 3.5s. Therefore, on this particular GPU (Tesla V100), approximately 75% of the application timeline appears to consist of activity that we would not expect to be affected by the factors you mention, such as the number of CUDA cores or GPU memory bandwidth. We can use this data to estimate scaling behavior, but it should be evident that proportional scaling could not be expected here.
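As a back-of-envelope Amdahl’s-law estimate (a sketch: the 15s and 3.5s values are the approximations above, and this ignores any overlap between CPU and GPU work), even an arbitrarily faster GPU would give only a modest overall speedup:

```python
total_s = 15.0        # approximate application wall-clock time
gpu_kernel_s = 3.5    # approximate sum of GPU kernel activity

gpu_fraction = gpu_kernel_s / total_s   # fraction that a GPU upgrade touches

def overall_speedup(gpu_speedup):
    """Amdahl's law: only the GPU-kernel portion gets faster."""
    return 1.0 / ((1.0 - gpu_fraction) + gpu_fraction / gpu_speedup)

print(f"GPU-bound fraction of timeline: {gpu_fraction:.0%}")
print(f"2x faster GPU       -> overall {overall_speedup(2):.2f}x")
print(f"infinitely fast GPU -> overall {overall_speedup(float('inf')):.2f}x")
```

Under these assumptions, the ceiling on overall speedup from any GPU upgrade is roughly 1.3x.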

That is my initial read of the situation, and a possible contributing factor to the lack of expected scaling. (I’m not presenting any scaling data here.)