Hello,
I’m getting started with C++ CUDA and am running the SingleAsianOption CUDA sample in Visual Studio; it runs in around 140 ms on my NVIDIA GeForce GTX 1650.
If I run naive TensorFlow or CuPy code on Google Colab, I get the same result in around 2 ms. I would have expected the C++ CUDA code to run much faster than TensorFlow or CuPy.
Is there a simple explanation?
Thanks!
Please clarify the question. You are comparing two completely unrelated pieces of software running on two different hardware platforms. In addition, the tasks you are accomplishing with TensorFlow and CuPy are unspecified, and the CUDA sample codes are not meant to be used for benchmarking. There are definitely things you can do with TensorFlow and CuPy that take a lot longer than 2 ms.
I’m trying to get a general understanding.
I also ran the CUDA sample on Colab, and it ran in around 700 ms on a Tesla T4.
This is the CuPy code. With the same parameters, it gives the same result as the C++ sample code.
import cupy as cp

def asian_arithmetic_call_price_cp(S0, K, r, sigma, T, n_paths):
    n_time_steps = int(T * 261)  # ~261 trading days per year
    dt = T / n_time_steps
    # Simulate GBM paths, average each path over time, then discount the payoff
    log_steps = (r - 0.5 * sigma**2) * dt + sigma * cp.sqrt(dt) * cp.random.randn(n_paths, n_time_steps)
    paths = S0 * cp.cumprod(cp.exp(log_steps), axis=1)
    average_stock_price_for_paths = cp.mean(paths, axis=1)
    call_payoff_for_paths = cp.maximum(average_stock_price_for_paths - K, 0.0)
    return cp.mean(call_payoff_for_paths) * cp.exp(-r * T)

# Test
S0 = 40
K = 35
r = 0.03
sigma = 0.2
T = 1.0 / 3.0
n = int(1e5)
asian_arithmetic_call_price_cp(S0, K, r, sigma, T, n)
This code runs in around 3 ms.
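For anyone without a GPU who wants to sanity-check the pricer, here is a NumPy translation of the same computation (my addition, not part of the original posts); with a fixed seed and 100,000 paths it lands close to the sample output’s expected value of about 5.16:

```python
import numpy as np

def asian_arithmetic_call_price_np(S0, K, r, sigma, T, n_paths, seed=0):
    rng = np.random.default_rng(seed)
    n_time_steps = int(T * 261)          # ~261 trading days per year
    dt = T / n_time_steps
    # Log-Euler GBM steps: one row of normal draws per simulated path
    z = rng.standard_normal((n_paths, n_time_steps))
    paths = S0 * np.cumprod(
        np.exp((r - 0.5 * sigma**2) * dt + sigma * np.sqrt(dt) * z), axis=1
    )
    avg = paths.mean(axis=1)             # arithmetic average along each path
    payoff = np.maximum(avg - K, 0.0)    # call payoff on the average
    return np.exp(-r * T) * payoff.mean()

print(asian_arithmetic_call_price_np(40, 35, 0.03, 0.2, 1.0 / 3.0, 100_000))
```

With 100,000 paths the Monte Carlo standard error is roughly 0.01, so any seed should produce a value within a few cents of 5.16.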
Are you saying that the CUDA sample codes are not optimized to run fast?
CUDA sample codes are each designed to demonstrate the use of a particular design principle, language feature, or associated library. They are not designed for benchmarking purposes and will usually tell you that in the output.
The SingleAsianOption sample code no longer seems to exist in the current CUDA version, but using Google I could find a repository with an older version. The sample appears to have served as a demonstration of the use of the CURAND library.
I don’t know anything about option computations, but from a quick look at the source code, it seems to run a selectable number of simulations for a selectable number of timesteps. Presumably the sample’s runtime scales with these two parameters.
So now I am wondering how you compute Asian options with Tensorflow.
The samples are on GitHub now:
I used the same parameters in the CuPy implementation.
I know the samples are on GitHub, but I could not find SingleAsianOption there when I checked in a hurry. I guess it did not occur to me to look under 2_Concepts_and_Techniques.
I would think the example CuPy code you posted does something different from the CUDA sample code; in particular, I do not see it running a Monte Carlo simulation.
Both codes do the same thing as far as I can tell. The Monte Carlo part is the “cp.random.randn” call, which draws the random normal increments for the simulated paths.
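For context (my summary, not from either code base): the product inside cp.cumprod is the standard log-Euler discretization of geometric Brownian motion, with the randn draws supplying the $Z$ terms, and the price is the discounted expected payoff on the arithmetic path average $\bar{S}$:

$$
S_{t+\Delta t} = S_t \exp\!\Big(\big(r - \tfrac{1}{2}\sigma^2\big)\,\Delta t + \sigma\sqrt{\Delta t}\,Z\Big),
\qquad Z \sim \mathcal{N}(0,1),
\qquad
\text{price} \approx e^{-rT}\,\mathbb{E}\big[\max(\bar{S} - K,\, 0)\big]
$$

Averaging the discounted payoff over many simulated paths is exactly the Monte Carlo estimate both codes compute.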
When I run this sample code on a V100 under a profiler, there are 3 kernel calls, and the total duration of the three is less than 1 ms. However, the reported “Time” is 177 ms without the profiler and 381 ms with it. So although the sample code does report “performance”, I doubt it is sensibly calculated for the comparison you are trying to do.
What I note in the profiler, in the case where 381 ms is reported, is a call to cudaFuncGetAttributes that takes 378 ms. This sort of activity is not necessary and should not be part of a careful performance benchmark, IMO. You can find the calls to cudaFuncGetAttributes in the source code to study their usage. Some of this may simply be CUDA start-up overhead. A careful benchmarking exercise (IMO) should do a warm-up run before computing measured values. You are probably not doing this with your CuPy code either, but I can definitely spot some problems in the comparison.
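To illustrate the warm-up point, here is a generic timing sketch (a hypothetical helper, not from either code base): call the function once untimed so that one-time setup costs are paid, then time subsequent runs and keep the best.

```python
import time
from functools import lru_cache

def time_after_warmup(fn, n_runs=5):
    """Time fn after one untimed warm-up call; return the best of n_runs in seconds."""
    fn()  # warm-up: absorbs one-time costs (context creation, JIT, caches)
    best = float("inf")
    for _ in range(n_runs):
        t0 = time.perf_counter()
        fn()
        best = min(best, time.perf_counter() - t0)
    return best

# Example: the first call to a cached function is much slower than later ones
@lru_cache(maxsize=None)
def slow_setup():
    time.sleep(0.05)  # stands in for expensive one-time initialization
    return 42

print(time_after_warmup(slow_setup))  # tiny, because the 50 ms setup ran during warm-up
```

For GPU code there is an extra wrinkle: kernel launches are asynchronous, so you would also need to synchronize the device (e.g. cp.cuda.Device().synchronize() in CuPy) before reading the clock, or the measurement only captures the launch, not the execution.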
So I’m fairly convinced what you’re doing is not an apples-to-apples comparison.
Here is a run of the sample code in nvprof:
$ nvprof /usr/local/cuda/samples/bin/x86_64/linux/release/MC_SingleAsianOptionP
Monte Carlo Single Asian Option (with PRNG)
===========================================
==9059== NVPROF is profiling process 9059, command: /usr/local/cuda/samples/bin/x86_64/linux/release/MC_SingleAsianOptionP
Pricing option on GPU (Tesla V100-PCIE-32GB)
Precision: single
Number of sims: 100000
Spot | Strike | r | sigma | tenor | Call/Put | Value | Expected |
-----------|------------|------------|------------|------------|------------|------------|------------|
40 | 35 | 0.03 | 0.2 | 0.333333 | Call | 5.17083 | 5.16253 |
MonteCarloSingleAsianOptionP, Performance = 283991.68 sims/s, Time = 352.12(ms), NumDevsUsed = 1, Blocksize = 128
==9059== Profiling application: /usr/local/cuda/samples/bin/x86_64/linux/release/MC_SingleAsianOptionP
==9059== Profiling result:
Type Time(%) Time Calls Avg Min Max Name
GPU activities: 86.82% 779.19us 1 779.19us 779.19us 779.19us initRNG(curandStateXORWOW*, unsigned int)
6.57% 58.977us 1 58.977us 58.977us 58.977us void generatePaths<float>(float*, curandStateXORWOW*, AsianOption<float> const *, unsigned int, unsigned int)
6.13% 55.042us 1 55.042us 55.042us 55.042us void computeValue<float>(float*, float const *, AsianOption<float> const *, unsigned int, unsigned int)
0.28% 2.4960us 1 2.4960us 2.4960us 2.4960us [CUDA memcpy DtoH]
0.20% 1.7920us 1 1.7920us 1.7920us 1.7920us [CUDA memcpy HtoD]
API calls: 96.65% 349.33ms 3 116.44ms 2.6540us 349.32ms cudaFuncGetAttributes ********************************
1.29% 4.6503ms 4 1.1626ms 350.57us 3.2604ms cuDeviceTotalMem
0.80% 2.8774ms 404 7.1220us 282ns 586.75us cuDeviceGetAttribute
0.26% 936.12us 2 468.06us 464.23us 471.89us cudaGetDeviceProperties
0.26% 925.14us 2 462.57us 35.554us 889.58us cudaMemcpy
0.24% 872.45us 20 43.622us 797ns 221.09us cudaDeviceGetAttribute
0.18% 648.45us 4 162.11us 7.1600us 239.08us cudaMalloc
0.16% 588.48us 4 147.12us 16.129us 242.57us cudaFree
0.14% 497.40us 4 124.35us 48.910us 232.13us cuDeviceGetName
0.02% 72.498us 3 24.166us 8.7220us 51.983us cudaLaunchKernel
0.01% 18.186us 4 4.5460us 2.9720us 7.2940us cuDeviceGetPCIBusId
0.00% 10.438us 1 10.438us 10.438us 10.438us cudaSetDevice
0.00% 10.355us 8 1.2940us 427ns 4.7330us cuDeviceGet
0.00% 4.1130us 2 2.0560us 653ns 3.4600us cudaGetDeviceCount
0.00% 3.0210us 4 755ns 493ns 1.1600us cuDeviceGetUuid
0.00% 2.6480us 3 882ns 525ns 1.3450us cuDeviceGetCount
$
Note that the reported time is 352 ms, and of that, 349 ms is consumed by the first call to cudaFuncGetAttributes.