Black-Scholes performance on G80: how many pricings/second can be done?


According to this document:…NVIDIA_Cuda.pdf

using CUDA on G80, one can price 4.7 billion options per second based on the Black-Scholes model.

Has anyone achieved this number? I ran the sample code from the CUDA SDK on one 8800 GTX with an Intel Kentsfield CPU and got 1.6 billion, which is very impressive but still lower than the number claimed in the above document.

Thanks in advance!!!

The sample code gives me 1.3b on an 8800 GTS, so perhaps a touch more than yours after adjusting for the hardware difference.

Don’t know about the 4.7 billion (it was before my time).
If you change expf and logf to the fast versions (__expf and __logf), you will get an additional boost (you will need to rename some of the host device functions).

Executing GPU kernel…
GPU time: 0.878000 msecs.
L1 norm: 5.986007E-08
Max absolute error: 1.525879E-05

Executing GPU kernel…
GPU time: 0.638000 msecs.
L1 norm: 5.986171E-08
Max absolute error: 1.525879E-05

The CUDA SDK BlackScholes sample currently has its input data size set to “only” 1,000,000 options, and execution time is less than 1 ms. With such tiny execution times, timing noise and launch overhead can, depending on CPU/chipset, reduce the measured rate. To observe real GPU performance, execution times should be at least 5-10 ms. Increasing the input data size to 10M-30M options should do it.
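To put the numbers in perspective, the rate is just the option count divided by the kernel time. Here is a rough sketch, assuming the two timings quoted earlier in the thread (0.878 ms and 0.638 ms) were taken with the default 1,000,000 options, and using the 1.6 billion/s figure reported in this thread to estimate how large the input has to be. The helper names are made up for illustration and are not part of the SDK.

```python
def options_per_second(num_options, kernel_ms):
    """Throughput in input options per second for one kernel run."""
    return num_options / (kernel_ms * 1e-3)


def options_needed(rate_per_s, target_ms):
    """Input size needed for the kernel to run for about target_ms."""
    return rate_per_s * target_ms * 1e-3


N = 1_000_000  # assumed: the sample's default input size

rate_slow = options_per_second(N, 0.878)
rate_fast = options_per_second(N, 0.638)
print(f"{rate_slow / 1e9:.2f} G options/s")  # 1.14 G options/s
print(f"{rate_fast / 1e9:.2f} G options/s")  # 1.57 G options/s

# Input sizes that keep a 1.6 G options/s kernel busy for 5 ms and 10 ms.
print(f"{options_needed(1.6e9, 5.0):,.0f}")   # 8,000,000
print(f"{options_needed(1.6e9, 10.0):,.0f}")  # 16,000,000
```

At that rate, 8M to 16M options lands in the 5-10 ms window, which is in line with the 10M-30M suggestion.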

Also note that ‘BlackScholes’ calculates both call and put option values.

Illegal seems a bit strong a word, but your point is well taken. The numbers go up to 1.6b on the 8800 GTS (15 million instead of 1 million options). And I remember that you can speed this code up some more by, for example, passing in the discount factor instead of the rate, etc. Big picture: it's very fast. Small picture: not sure you can get to 4b.
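For what it's worth, here is one reading of the discount factor idea, written as a plain Python sketch rather than the SDK kernel. It assumes the caller supplies df = exp(-r*T) as an input, which removes the exponential from the per-option work and folds r*T into the logarithm. The function and parameter names are hypothetical.

```python
import math


def norm_cdf(x):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def black_scholes(S, K, T, sigma, df):
    """European call and put prices from a precomputed discount factor.

    df = exp(-r*T) is an input, so no exp() is evaluated here:
    log(S/K) + r*T is rewritten as log(S / (K*df)).
    """
    vol_sqrt_t = sigma * math.sqrt(T)
    d1 = (math.log(S / (K * df)) + 0.5 * sigma * sigma * T) / vol_sqrt_t
    d2 = d1 - vol_sqrt_t
    call = S * norm_cdf(d1) - K * df * norm_cdf(d2)
    put = K * df * norm_cdf(-d2) - S * norm_cdf(-d1)
    return call, put


r, T = 0.02, 0.5
df = math.exp(-r * T)  # precomputed by the caller, not per option
call, put = black_scholes(S=100.0, K=100.0, T=T, sigma=0.30, df=df)

# Put-call parity is a quick sanity check: call - put == S - K*df.
print(call - put, 100.0 - 100.0 * df)
```

Whether this saves anything measurable on the GPU depends on how the inputs are laid out, so treat it as an illustration of the idea only.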

4.7 GOpt/s was given for the GeForce 8800 GTX, counting each input index as two options, since both call and put are calculated. Sorry for the confusion.

The BlackScholes CUDA SDK sample currently reports an effective 4.45 GOpt/s, which is slightly lower than 4.7 GOpt/s, but the primary goal of the sample is to be a simple demonstration of a streaming application.
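The two counting conventions differ by a factor of two, which a few lines of arithmetic make explicit. This sketch assumes the 1.6 billion figure reported earlier in the thread counted input indices (one call/put pair each) rather than individual prices. The helper name is made up.

```python
def to_gopt_per_s(indices_per_s, prices_per_index=2):
    """Convert input indices per second to GOpt/s, counting call and put."""
    return indices_per_s * prices_per_index / 1e9


# 1.6 billion indices/s expressed in the "call + put" convention.
print(to_gopt_per_s(1.6e9))  # 3.2

# The quoted figures expressed as input indices per second (in billions).
print(4.7 / 2)   # 2.35
print(4.45 / 2)  # 2.225
```

So under that assumption the 1.6 billion measurement corresponds to about 3.2 GOpt/s, against the 4.45 GOpt/s the sample reports.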

As for optimization, here are some general recommendations:

  1. Warp occupancy can be increased with the help of -po maxrregcount=<…>. Depending on how many threads per block you shoot for, the optimal register count may vary, so by default the compiler doesn’t try to minimize local register count as much as it can. With this option, the compiler is forced to stay within the budget. Be warned, however, that too low a maxrregcount can force local memory spills, in many cases slowing things down in spite of increased warp occupancy.
  2. Synchronization overhead and timing precision, depending on OS/CPU/chipset, can influence observed performance. So doing multiple iterations, measuring the total time, and dividing by the number of iterations should give more precise results. This is especially important on Linux, where, depending on the kernel build version, timer resolution (gettimeofday() system call) can be as coarse as 100 Hz (10 ms).
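The effect of a coarse timer in point 2 can be illustrated with a toy model. This is a sketch only: it assumes a timer that advances in 10 ms ticks, a measurement that starts exactly on a tick, and a kernel that always takes 878 microseconds (the first timing quoted in this thread). Nothing here calls a real timer or the GPU.

```python
def coarse_elapsed_us(true_elapsed_us, tick_us=10_000):
    """Elapsed time as reported by a timer that only advances every tick_us."""
    return (true_elapsed_us // tick_us) * tick_us


def measured_per_iter_us(kernel_us, iterations, tick_us=10_000):
    """Time the whole loop with the coarse timer, then divide by iterations."""
    total_us = coarse_elapsed_us(kernel_us * iterations, tick_us)
    return total_us / iterations


KERNEL_US = 878  # assumed true kernel time, in microseconds

print(measured_per_iter_us(KERNEL_US, 1))       # 0.0
print(measured_per_iter_us(KERNEL_US, 1000))    # 870.0
print(measured_per_iter_us(KERNEL_US, 10000))   # 878.0
```

A single run is invisible to a 10 ms timer, while the error of the averaged result shrinks as the tick size divided by the iteration count.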

Concerning timing, the profiler (enabled with the environment variable CUDA_PROFILE=1) reports kernel and memcpy times using high-precision internal GPU timers (in addition to CPU ones) without any need for a cudaThreadSynchronize() call in user programs. There are plans to expose these timers in the CUDA API.