When I run this sample code on a V100 under a profiler, there are 3 kernel calls, and their combined duration is less than 1ms. However, the reported “Time” ranges from 177ms without the profiler to 381ms with the profiler. So although that sample code does report “performance”, I doubt it is calculated sensibly for the comparison you are trying to do.
What I note in the profiler, in the case where 381ms is being reported, is that there is a call to cudaFuncGetAttributes that is using 378ms. This sort of activity is not necessary and should not be part of a careful performance benchmark, IMO. You can find the calls to cudaFuncGetAttributes in the source code if you want to study how they are used. Some of this may simply be CUDA start-up overhead. A careful benchmarking exercise (IMO) should do a warm-up run before computing measured values. Yes, you are probably not doing that with your cupy code either, but I can definitely spot some problems in the comparison.
So I’m fairly convinced what you’re doing is not an apples-to-apples comparison.
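To illustrate the warm-up point, here is a minimal sketch of the pattern I have in mind (not taken from the sample; the kernel myKernel is just a placeholder): run the work once untimed so the start-up costs are absorbed, then time a subsequent run with CUDA events.

#include <cstdio>
#include <cuda_runtime.h>

__global__ void myKernel(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;
}

int main()
{
    const int n = 1 << 20;
    float *d_data = nullptr;
    cudaMalloc(&d_data, n * sizeof(float));

    // Warm-up run: absorbs context creation, module load, etc. Not timed.
    myKernel<<<(n + 255) / 256, 256>>>(d_data, n);
    cudaDeviceSynchronize();

    // Timed run, measured with CUDA events.
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start);
    myKernel<<<(n + 255) / 256, 256>>>(d_data, n);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("kernel time after warm-up: %.3f ms\n", ms);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d_data);
    return 0;
}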
Here is a run of the sample code in nvprof:
$ nvprof /usr/local/cuda/samples/bin/x86_64/linux/release/MC_SingleAsianOptionP
Monte Carlo Single Asian Option (with PRNG)
===========================================
==9059== NVPROF is profiling process 9059, command: /usr/local/cuda/samples/bin/x86_64/linux/release/MC_SingleAsianOptionP
Pricing option on GPU (Tesla V100-PCIE-32GB)
Precision: single
Number of sims: 100000
Spot | Strike | r | sigma | tenor | Call/Put | Value | Expected |
-----------|------------|------------|------------|------------|------------|------------|------------|
40 | 35 | 0.03 | 0.2 | 0.333333 | Call | 5.17083 | 5.16253 |
MonteCarloSingleAsianOptionP, Performance = 283991.68 sims/s, Time = 352.12(ms), NumDevsUsed = 1, Blocksize = 128
==9059== Profiling application: /usr/local/cuda/samples/bin/x86_64/linux/release/MC_SingleAsianOptionP
==9059== Profiling result:
Type Time(%) Time Calls Avg Min Max Name
GPU activities: 86.82% 779.19us 1 779.19us 779.19us 779.19us initRNG(curandStateXORWOW*, unsigned int)
6.57% 58.977us 1 58.977us 58.977us 58.977us void generatePaths<float>(float*, curandStateXORWOW*, AsianOption<float> const *, unsigned int, unsigned int)
6.13% 55.042us 1 55.042us 55.042us 55.042us void computeValue<float>(float*, float const *, AsianOption<float> const *, unsigned int, unsigned int)
0.28% 2.4960us 1 2.4960us 2.4960us 2.4960us [CUDA memcpy DtoH]
0.20% 1.7920us 1 1.7920us 1.7920us 1.7920us [CUDA memcpy HtoD]
API calls: 96.65% 349.33ms 3 116.44ms 2.6540us 349.32ms cudaFuncGetAttributes ********************************
1.29% 4.6503ms 4 1.1626ms 350.57us 3.2604ms cuDeviceTotalMem
0.80% 2.8774ms 404 7.1220us 282ns 586.75us cuDeviceGetAttribute
0.26% 936.12us 2 468.06us 464.23us 471.89us cudaGetDeviceProperties
0.26% 925.14us 2 462.57us 35.554us 889.58us cudaMemcpy
0.24% 872.45us 20 43.622us 797ns 221.09us cudaDeviceGetAttribute
0.18% 648.45us 4 162.11us 7.1600us 239.08us cudaMalloc
0.16% 588.48us 4 147.12us 16.129us 242.57us cudaFree
0.14% 497.40us 4 124.35us 48.910us 232.13us cuDeviceGetName
0.02% 72.498us 3 24.166us 8.7220us 51.983us cudaLaunchKernel
0.01% 18.186us 4 4.5460us 2.9720us 7.2940us cuDeviceGetPCIBusId
0.00% 10.438us 1 10.438us 10.438us 10.438us cudaSetDevice
0.00% 10.355us 8 1.2940us 427ns 4.7330us cuDeviceGet
0.00% 4.1130us 2 2.0560us 653ns 3.4600us cudaGetDeviceCount
0.00% 3.0210us 4 755ns 493ns 1.1600us cuDeviceGetUuid
0.00% 2.6480us 3 882ns 525ns 1.3450us cuDeviceGetCount
$
Note that the reported time is 352ms, and of that 349ms is consumed by the first call to cudaFuncGetAttributes.
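If you want to convince yourself that the 349ms belongs to CUDA start-up rather than to cudaFuncGetAttributes itself, a quick sketch like the following (dummyKernel is a placeholder, not from the sample) times the call twice; the first call should absorb the lazy initialization cost, while the repeat call should take only microseconds.

#include <cstdio>
#include <chrono>
#include <cuda_runtime.h>

__global__ void dummyKernel() { }

// Time one call to cudaFuncGetAttributes with a host timer.
static double timedCall()
{
    cudaFuncAttributes attr;
    auto t0 = std::chrono::steady_clock::now();
    cudaFuncGetAttributes(&attr, dummyKernel);
    auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(t1 - t0).count();
}

int main()
{
    printf("first  call: %8.3f ms\n", timedCall()); // pays the start-up cost
    printf("second call: %8.3f ms\n", timedCall()); // just the API call itself
    return 0;
}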