SingleAsianOption Performance vs TensorFlow/CuPy

When I run this sample code on a V100 under a profiler, there are 3 kernel calls, and their combined duration is less than 1 ms. However, the reported “Time” ranges from 177 ms without the profiler to 381 ms with it. So although the sample code does report a “performance” figure, I doubt it is calculated sensibly for the comparison you are trying to make.

What I note in the profiler, in the run where 381 ms is reported, is a call to cudaFuncGetAttributes that takes 378 ms. This sort of activity is not necessary and should not be part of a careful performance benchmark, IMO. You can find the calls to cudaFuncGetAttributes in the source code to study how it is used. Some of this time may simply be CUDA start-up overhead that happens to land on that call. A careful benchmarking exercise (IMO) should do a warm-up run before taking any measurements. Granted, you are probably not doing a warm-up in your cupy code either. But I can definitely spot some problems in the comparison.
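To illustrate the warm-up idea, here is a minimal host-side sketch. Everything in it is hypothetical: `benchmark` and `FakeWorkload` are names I made up, and the 50 ms sleep is only a stand-in for one-time start-up cost, not a measurement of anything. For real GPU work you would pass a `sync` callable that blocks until the device has finished, otherwise you only time the launch.

```python
import time


def benchmark(fn, repeats=5, sync=lambda: None):
    """Time fn() once cold, then `repeats` times warm.

    Returns (first_call_ms, best_warm_ms). `sync` is called after every
    fn() so asynchronous work is finished before the clock is read.
    """
    # Cold call: includes any one-time initialization cost.
    t0 = time.perf_counter()
    fn()
    sync()
    first_call_ms = (time.perf_counter() - t0) * 1e3

    # Warm calls: the cold call above served as the warm-up run.
    warm_ms = []
    for _ in range(repeats):
        t0 = time.perf_counter()
        fn()
        sync()
        warm_ms.append((time.perf_counter() - t0) * 1e3)
    return first_call_ms, min(warm_ms)


class FakeWorkload:
    """Hypothetical stand-in: slow on the first call, fast afterwards."""

    def __init__(self):
        self.initialized = False

    def __call__(self):
        if not self.initialized:
            time.sleep(0.05)  # pretend one-time start-up overhead
            self.initialized = True
        sum(i * i for i in range(1000))  # the "real" work


first_call_ms, best_warm_ms = benchmark(FakeWorkload())
print(f"first call: {first_call_ms:.2f} ms, best warm call: {best_warm_ms:.3f} ms")
```

The number to compare across frameworks is the warm one. The cold number mostly tells you how long initialization took.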

So I’m fairly convinced what you’re doing is not an apples-to-apples comparison.

Here is a run of the sample code in nvprof:

$ nvprof /usr/local/cuda/samples/bin/x86_64/linux/release/MC_SingleAsianOptionP
Monte Carlo Single Asian Option (with PRNG)
===========================================

==9059== NVPROF is profiling process 9059, command: /usr/local/cuda/samples/bin/x86_64/linux/release/MC_SingleAsianOptionP
Pricing option on GPU (Tesla V100-PCIE-32GB)

Precision:      single
Number of sims: 100000

   Spot    |   Strike   |     r      |   sigma    |   tenor    |  Call/Put  |   Value    |  Expected  |
-----------|------------|------------|------------|------------|------------|------------|------------|
        40 |         35 |       0.03 |        0.2 |   0.333333 |       Call |    5.17083 |    5.16253 |

MonteCarloSingleAsianOptionP, Performance = 283991.68 sims/s, Time = 352.12(ms), NumDevsUsed = 1, Blocksize = 128
==9059== Profiling application: /usr/local/cuda/samples/bin/x86_64/linux/release/MC_SingleAsianOptionP
==9059== Profiling result:
            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
 GPU activities:   86.82%  779.19us         1  779.19us  779.19us  779.19us  initRNG(curandStateXORWOW*, unsigned int)
                    6.57%  58.977us         1  58.977us  58.977us  58.977us  void generatePaths<float>(float*, curandStateXORWOW*, AsianOption<float> const *, unsigned int, unsigned int)
                    6.13%  55.042us         1  55.042us  55.042us  55.042us  void computeValue<float>(float*, float const *, AsianOption<float> const *, unsigned int, unsigned int)
                    0.28%  2.4960us         1  2.4960us  2.4960us  2.4960us  [CUDA memcpy DtoH]
                    0.20%  1.7920us         1  1.7920us  1.7920us  1.7920us  [CUDA memcpy HtoD]
      API calls:   96.65%  349.33ms         3  116.44ms  2.6540us  349.32ms  cudaFuncGetAttributes  ********************************
                    1.29%  4.6503ms         4  1.1626ms  350.57us  3.2604ms  cuDeviceTotalMem
                    0.80%  2.8774ms       404  7.1220us     282ns  586.75us  cuDeviceGetAttribute
                    0.26%  936.12us         2  468.06us  464.23us  471.89us  cudaGetDeviceProperties
                    0.26%  925.14us         2  462.57us  35.554us  889.58us  cudaMemcpy
                    0.24%  872.45us        20  43.622us     797ns  221.09us  cudaDeviceGetAttribute
                    0.18%  648.45us         4  162.11us  7.1600us  239.08us  cudaMalloc
                    0.16%  588.48us         4  147.12us  16.129us  242.57us  cudaFree
                    0.14%  497.40us         4  124.35us  48.910us  232.13us  cuDeviceGetName
                    0.02%  72.498us         3  24.166us  8.7220us  51.983us  cudaLaunchKernel
                    0.01%  18.186us         4  4.5460us  2.9720us  7.2940us  cuDeviceGetPCIBusId
                    0.00%  10.438us         1  10.438us  10.438us  10.438us  cudaSetDevice
                    0.00%  10.355us         8  1.2940us     427ns  4.7330us  cuDeviceGet
                    0.00%  4.1130us         2  2.0560us     653ns  3.4600us  cudaGetDeviceCount
                    0.00%  3.0210us         4     755ns     493ns  1.1600us  cuDeviceGetUuid
                    0.00%  2.6480us         3     882ns     525ns  1.3450us  cuDeviceGetCount
$

Note that the reported time is 352 ms, and 349 ms of that is consumed by the first call to cudaFuncGetAttributes (the row marked with asterisks above).
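The arithmetic behind that observation can be checked directly from the numbers in the nvprof output above. The kernel-only throughput at the end is my own rough figure, not something the sample reports: it ignores the memcpy and launch overhead, so treat it as an optimistic upper bound.

```python
# Values copied from the nvprof run above.
num_sims = 100_000
reported_time_ms = 352.12          # "Time" printed by the sample
func_get_attributes_ms = 349.33    # total for cudaFuncGetAttributes
kernel_times_us = [779.19, 58.977, 55.042]  # initRNG, generatePaths, computeValue

# Share of the reported time spent in cudaFuncGetAttributes.
attr_share = func_get_attributes_ms / reported_time_ms

# Total GPU kernel time, and throughput if only kernel time counted.
kernel_total_us = sum(kernel_times_us)
reported_sims_per_s = num_sims / (reported_time_ms / 1e3)
kernel_only_sims_per_s = num_sims / (kernel_total_us / 1e6)

print(f"cudaFuncGetAttributes share of reported time: {attr_share:.1%}")
print(f"total kernel time: {kernel_total_us:.3f} us")
print(f"sims/s from reported time: {reported_sims_per_s:,.0f}")
print(f"sims/s from kernel time only: {kernel_only_sims_per_s:,.0f}")
```

The two throughput figures differ by a few hundred times, which is why the sample's number is not a fair basis for comparison against a warmed-up framework.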
