SingleAsianOption Performance vs TensorFlow/CuPy

I’m getting started with C++ CUDA, and the SingleAsianOption CUDA sample built in Visual Studio runs in around 140 ms on my NVIDIA GeForce GTX 1650.
If I run naive TensorFlow or CuPy code on Google Colab, I get the same result in around 2 ms. I would have expected the C++ CUDA code to run much faster than TensorFlow or CuPy.
Is there a simple explanation?

Please clarify the question. You are comparing two completely unrelated pieces of software running on two different hardware platforms. In addition, the tasks you are accomplishing with TensorFlow and CuPy are unspecified, and CUDA sample codes are not intended for benchmarking. There are certainly things you can do with TensorFlow and CuPy that take much longer than 2 ms.

I’m trying to get a general understanding.
I also ran the CUDA sample on Colab, and it ran in around 700 ms on a Tesla T4.

This is the CuPy code. With the same parameters as the C++ sample, it gives the same result.

import cupy as cp

def asian_arithmetic_call_price_cp(S0, K, r, sigma, T, n_paths):
    n_time_steps = int(T * 261)   # 261 trading days per year
    dt = T / n_time_steps

    # Simulate geometric Brownian motion paths and average each path's prices
    log_returns = (r - 0.5 * sigma**2) * dt \
                  + sigma * cp.sqrt(dt) * cp.random.randn(n_paths, n_time_steps)
    paths = S0 * cp.cumprod(cp.exp(log_returns), axis=1)
    average_stock_price_for_paths = cp.mean(paths, axis=1)

    # Discounted payoff of the arithmetic-average Asian call
    call_price_for_paths = cp.maximum(average_stock_price_for_paths - K, 0)
    return cp.mean(call_price_for_paths) * cp.exp(-r * T)

# Test
S0    = 40
K     = 35
r     = 0.03
sigma = 0.2
T     = 1.0 / 3.0
n     = int(1e5)
asian_arithmetic_call_price_cp(S0, K, r, sigma, T, n)

This code runs in around 3 ms.

Are you saying that the CUDA sample codes are not optimized to run fast?

CUDA sample codes are each designed to demonstrate the use of a particular design principle, language feature, or associated library. They are not designed for benchmarking purposes and will usually tell you that in the output.

The SingleAsianOption sample no longer seems to exist in the current CUDA version, but with some Googling I was able to find a repository with an older version. The sample appears to have served as a demonstration of the CURAND library.

I don’t know anything about option computations, but from a quick look at the source code, it seems to run a selectable number of simulations for a selectable number of timesteps. Presumably the sample’s runtime scales with these two parameters.
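For intuition only (this is not the sample’s actual implementation), a step-by-step NumPy sketch of such a simulation might look like the following; the function name and structure are my own:

```python
import numpy as np

def simulate_average_prices(S0, r, sigma, dt, n_paths, n_steps, rng):
    # Advance every path one timestep at a time; total work grows as
    # n_paths * n_steps, so runtime should scale roughly linearly in both.
    S = np.full(n_paths, float(S0))      # current price of each path
    running_sum = np.zeros(n_paths)      # accumulates prices for the average
    drift = (r - 0.5 * sigma**2) * dt
    vol = sigma * np.sqrt(dt)
    for _ in range(n_steps):
        S *= np.exp(drift + vol * rng.standard_normal(n_paths))
        running_sum += S
    return running_sum / n_steps         # arithmetic average price per path

rng = np.random.default_rng(0)
avg = simulate_average_prices(40.0, 0.03, 0.2, (1.0 / 3.0) / 87, 50_000, 87, rng)
```

Doubling either the number of simulations or the number of timesteps should roughly double the work done, which is presumably what the sample’s runtime reflects.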

So now I am wondering how you compute Asian options with Tensorflow.

The samples are on GitHub now:

I used the same parameters in the cupy implementation.

I know the samples are on GitHub, but I could not find SingleAsianOption there when I checked in a hurry. I guess it did not occur to me to look under 2_Concepts_and_Techniques.

I would think the example Cupy code you posted does something different from the CUDA sample code, in particular I do not see it running a Monte Carlo simulation.

Both codes do the same thing as far as I can tell. The Monte Carlo part is the `cp.random.randn` call.
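To make that concrete for readers without a GPU, here is a NumPy transcription of the same estimator (the function name is mine): the `standard_normal` draws are the Monte Carlo step, and with the sample’s parameters the result should land near its ~5.16 value.

```python
import numpy as np

def asian_arithmetic_call_price_np(S0, K, r, sigma, T, n_paths, rng):
    # Same Monte Carlo estimator as the CuPy version, on the CPU: draw all
    # random increments at once, build the paths with a cumulative product,
    # then average each path and discount the payoff.
    n_time_steps = int(T * 261)
    dt = T / n_time_steps
    z = rng.standard_normal((n_paths, n_time_steps))   # the Monte Carlo draws
    paths = S0 * np.cumprod(np.exp((r - 0.5 * sigma**2) * dt
                                   + sigma * np.sqrt(dt) * z), axis=1)
    avg = paths.mean(axis=1)                           # arithmetic average per path
    payoff = np.maximum(avg - K, 0.0)
    return payoff.mean() * np.exp(-r * T)

price = asian_arithmetic_call_price_np(40, 35, 0.03, 0.2, 1.0 / 3.0, 100_000,
                                       np.random.default_rng(42))
```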

When I run this sample code on a V100 under a profiler, there are 3 kernel calls, and their combined duration is less than 1 ms. However, the reported “Time” is 177 ms without the profiler and 381 ms with it. So although the sample does report “performance”, I doubt that number is sensibly calculated for the comparison you are trying to make.

What I note in the profiler, in the case where 381 ms is reported, is that a call to cudaFuncGetAttributes takes 378 ms. This sort of activity is not necessary and should not be part of a careful performance benchmark, IMO. You can find the calls to cudaFuncGetAttributes in the source code and study their usage. Some of this may simply be CUDA start-up overhead. A careful benchmarking exercise (IMO) should do a warm-up run before computing measured values. Granted, you are probably not doing that with your CuPy code either, but I can definitely spot problems in the comparison.
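A minimal warm-up-then-measure sketch (the helper name is mine, not from any library) could look like:

```python
import time

def benchmark(fn, *args, n_warmup=2, n_repeats=5):
    # Warm-up runs absorb one-time costs (context creation, JIT compilation,
    # caches) so they do not pollute the measurement; then report the best
    # of several timed repeats.
    for _ in range(n_warmup):
        fn(*args)
    timings = []
    for _ in range(n_repeats):
        start = time.perf_counter()
        fn(*args)
        # With a GPU library, synchronize here before stopping the clock,
        # e.g. cp.cuda.Device().synchronize() with CuPy; otherwise you may
        # time only the asynchronous kernel launch, not the work itself.
        timings.append(time.perf_counter() - start)
    return min(timings)
```

The synchronization point matters particularly for the CuPy comparison, since kernel launches are asynchronous and a naive `time.time()` measurement can return before the GPU has finished.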

So I’m fairly convinced what you’re doing is not an apples-to-apples comparison.

Here is a run of the sample code in nvprof:

$ nvprof /usr/local/cuda/samples/bin/x86_64/linux/release/MC_SingleAsianOptionP
Monte Carlo Single Asian Option (with PRNG)

==9059== NVPROF is profiling process 9059, command: /usr/local/cuda/samples/bin/x86_64/linux/release/MC_SingleAsianOptionP
Pricing option on GPU (Tesla V100-PCIE-32GB)

Precision:      single
Number of sims: 100000

   Spot    |   Strike   |     r      |   sigma    |   tenor    |  Call/Put  |   Value    |  Expected  |
        40 |         35 |       0.03 |        0.2 |   0.333333 |       Call |    5.17083 |    5.16253 |

MonteCarloSingleAsianOptionP, Performance = 283991.68 sims/s, Time = 352.12(ms), NumDevsUsed = 1, Blocksize = 128
==9059== Profiling application: /usr/local/cuda/samples/bin/x86_64/linux/release/MC_SingleAsianOptionP
==9059== Profiling result:
            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
 GPU activities:   86.82%  779.19us         1  779.19us  779.19us  779.19us  initRNG(curandStateXORWOW*, unsigned int)
                    6.57%  58.977us         1  58.977us  58.977us  58.977us  void generatePaths<float>(float*, curandStateXORWOW*, AsianOption<float> const *, unsigned int, unsigned int)
                    6.13%  55.042us         1  55.042us  55.042us  55.042us  void computeValue<float>(float*, float const *, AsianOption<float> const *, unsigned int, unsigned int)
                    0.28%  2.4960us         1  2.4960us  2.4960us  2.4960us  [CUDA memcpy DtoH]
                    0.20%  1.7920us         1  1.7920us  1.7920us  1.7920us  [CUDA memcpy HtoD]
      API calls:   96.65%  349.33ms         3  116.44ms  2.6540us  349.32ms  cudaFuncGetAttributes  ********************************
                    1.29%  4.6503ms         4  1.1626ms  350.57us  3.2604ms  cuDeviceTotalMem
                    0.80%  2.8774ms       404  7.1220us     282ns  586.75us  cuDeviceGetAttribute
                    0.26%  936.12us         2  468.06us  464.23us  471.89us  cudaGetDeviceProperties
                    0.26%  925.14us         2  462.57us  35.554us  889.58us  cudaMemcpy
                    0.24%  872.45us        20  43.622us     797ns  221.09us  cudaDeviceGetAttribute
                    0.18%  648.45us         4  162.11us  7.1600us  239.08us  cudaMalloc
                    0.16%  588.48us         4  147.12us  16.129us  242.57us  cudaFree
                    0.14%  497.40us         4  124.35us  48.910us  232.13us  cuDeviceGetName
                    0.02%  72.498us         3  24.166us  8.7220us  51.983us  cudaLaunchKernel
                    0.01%  18.186us         4  4.5460us  2.9720us  7.2940us  cuDeviceGetPCIBusId
                    0.00%  10.438us         1  10.438us  10.438us  10.438us  cudaSetDevice
                    0.00%  10.355us         8  1.2940us     427ns  4.7330us  cuDeviceGet
                    0.00%  4.1130us         2  2.0560us     653ns  3.4600us  cudaGetDeviceCount
                    0.00%  3.0210us         4     755ns     493ns  1.1600us  cuDeviceGetUuid
                    0.00%  2.6480us         3     882ns     525ns  1.3450us  cuDeviceGetCount

Note that the reported time is 352 ms, of which 349 ms is consumed by the first call to cudaFuncGetAttributes.

