I am implementing binomial option pricing on CUDA. I was benchmarking my implementation against the one that comes with NVIDIA’s SDK.
My kernel is a very simple one and uses just one warp and shared memory to generate the required output. I don’t use double-buffering, as I have only one warp. At any given reduction step, the threads in this warp share the burden of deriving the next level from the existing nodes. The level is processed 32 elements at a time, one chunk after another. I don’t access global memory at all.
For a 2048-level tree, I get nearly 8x performance compared to the CPU. I can easily multiply this number by 16, as the multiprocessors would process options in parallel, which brings me to a decent 128x. I have unrolled my loop 8 times to get a better speedup factor.
But STILL, I find that NVIDIA’s implementation beats my kernel by a non-negligible number of ticks. (I used QueryPerformanceCounter() to profile both my code and NVIDIA’s code, which I tweaked a bit.)
I went through NVIDIA’s implementation. I find that it accesses a lot of global memory and even performs redundant computation (the nodes in the CACHE_DELTA column in Figure 3 are calculated twice; all such boundaries are calculated twice). I don’t understand how this code outperforms mine.
Victor,
Does the NVIDIA SDK code have some optimizations that don’t meet the eye so easily? Thanks for any inputs.
I have a basic assumption in my code about the CUDA computing model.
If my kernel performs operations only in shared memory and has little or nothing to do with global memory, then having a single warp should suffice…
Because multiple warps are again time-shared on the multiprocessor. So it does NOT make sense to have multiple warps unless you intend to overlap computation with global memory accesses and other stalling operations. When all warps perform only computation, there is NO need for multiple warps.
Section 5.1.2.5 of the programming guide says you will still want multiple warps to avoid stalls when there are read-after-write register dependencies. If one instruction needs to read a register written by the immediately preceding instruction, then it will have to stall while the writing instruction propagates through the pipeline. The guide says these delays are no longer visible if you have at least 192 threads (so, 6 warps) running per block.
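For example (my own illustration, not code from the guide):

// A read-after-write register dependency inside a single thread: the second
// statement cannot issue until the first has cleared the arithmetic pipeline,
// so with too few warps resident the multiprocessor just waits out the gap.
__global__ void rawDependencyExample(float *out, float x, float y, float z)
{
    float a = x * y;          // writes register a
    float b = a + z;          // reads a immediately after it is written
    out[threadIdx.x] = b;
}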
There is also mention of register memory (not shared memory) bank conflicts, but I don’t know what those are, exactly. Avoiding them requires that the number of threads be divisible by 64.
Aaaah… Thanks a lot. I will give it a try. Actually, I was really clueless when I found that NVIDIA’s more complex implementation is faster than my most simplistic one. Thanks for the info.
Yeah, right. I ran a search for “register memory” and I could find it only in that place. The same section says that there is no need to pack data as float4 or int4. That indicates to me that it is actually talking only about “shared memory”.
And that section directly follows the “shared memory bank conflict” section.
Not sure. It could be something else too. Anyway, thanks for the info.
Hello, Sarnath. Thanks for the interest in my sample. As you have already been told, watch multiprocessor occupancy and thread block count: these numbers can heavily influence the observed performance. I didn’t really tune my implementation that much, and there is some room for optimization:
Set the thread block size equal to the shared memory “cache” size, avoiding the nasty innermost for(int pos = threadIdx.x; …; pos += blockDim.x) loops (see the sketch after this list).
Precalculate the per-option (per-thread-block) quotients, used in the main computation body, on the host side. Making the hardware do the same (rather complex) computations over and over again (by every thread of the thread block) is not very reasonable, so depending on the CPU you might see a minor speedup (or slowdown :) )
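A rough sketch of the first point (the kernel name, buffer layout and CACHE_SIZE value below are only illustrative, not the actual SDK code): when blockDim.x equals the cache size, each thread owns exactly one cache slot, so the strided loops collapse into single statements.

#define CACHE_SIZE 256   // hypothetical; launch the kernel with blockDim.x == CACHE_SIZE

__global__ void loadSliceExample(const float *p, float *out)
{
    __shared__ float dataA[CACHE_SIZE];
    const int tid = threadIdx.x;

    // was: for(int pos = threadIdx.x; pos < CACHE_SIZE; pos += blockDim.x) dataA[pos] = ...;
    dataA[tid] = p[blockIdx.x * CACHE_SIZE + tid];    // one load per thread, no inner loop
    __syncthreads();

    out[blockIdx.x * CACHE_SIZE + tid] = dataA[tid];  // stand-in for the real per-slice work
}

The second point just means computing the quotients (PuByR and PdByR in the kernel) once per option on the CPU and passing them in, instead of recomputing them in every thread of the block.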
Thanks for answering. I was looking to avoid double-buffering by having only one warp. But it spoilt my entire show. Now, I realize why there is a need to have at least 6 warps.
I am also trying to implement the following paper to see if it would give any improvement in performance:
Did you consider this paper while implementing the binomial tree? I am planning to use multiple blocks to perform the binomial tree calculation for the same option. I am planning to do this across multiple kernel launches. I am not sure if the kernel-launch overhead is going to kill the benefit. Let us see.
No, I haven’t yet read this article, but thanks for the link.
If you set the thread block size equal to the shared memory “cache” size, you can do the double buffering without the additional shared memory buffer! Just calculate the new node value in local registers and synchronize before writing the new value back to the “cache”, i.e. perform the double buffering in local registers. There is really no need for warp-sized thread blocks, since we don’t have to sync after each instruction, only before reading and writing shared memory in order to avoid read-after-write/write-after-read hazards.
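Roughly like this (a sketch only, with an assumed step count and kernel name, not the SDK kernel itself); the “second buffer” is just one register per thread:

#define N_STEPS 256   // hypothetical; launch with blockDim.x == N_STEPS

// One option per thread block: values[] holds the N_STEPS + 1 terminal prices,
// PuByR/PdByR play the same role as in the SDK kernel.
__global__ void reduceInRegistersSketch(float *values, float PuByR, float PdByR)
{
    __shared__ float cache[N_STEPS + 1];
    const int tid = threadIdx.x;

    cache[tid] = values[tid];
    if (tid == 0) cache[N_STEPS] = values[N_STEPS];
    __syncthreads();

    // Walk the tree backwards; the new value lives in a register until every
    // thread has finished reading the old level.
    for (int k = N_STEPS - 1; k >= 0; k--) {
        float newVal = 0.0f;
        if (tid <= k)
            newVal = PuByR * cache[tid + 1] + PdByR * cache[tid];
        __syncthreads();                 // all reads of the old level are done
        if (tid <= k)
            cache[tid] = newVal;         // overwrite the cache in place
        __syncthreads();                 // new level visible before the next read
    }

    if (tid == 0) values[0] = cache[0];  // option value at the root
}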
As for multiple blocks per single option, it will probably be advantageous for certain (big?) tree sizes, so it is definitely worth trying. You’ll need to benchmark it against the “thread block per option” kernel and have host-side heuristics to pick the best-performing kernel for every input “configuration”. In order to avoid unnecessary overhead it will be better to utilize both grid dimensions instead of multiple kernel launches: option indices along the Y axis, thread blocks processing the same option along the X axis.
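Something along these lines (the kernel, names and sizes are hypothetical, just to show the indexing):

// One launch covers every option and every cooperating block:
//   grid.x = blocks cooperating on one option, grid.y = option index.
__global__ void multiBlockPerOptionSketch(float *d_Partial, int blocksPerOption)
{
    const int opt   = blockIdx.y;                    // which option this block works on
    const int chunk = blockIdx.x;                    // which slice of that option's tree
    const int slot  = opt * blocksPerOption + chunk; // where this block writes its partial result
    if (threadIdx.x == 0)
        d_Partial[slot] = 0.0f;                      // stand-in for the real partial reduction
}

// Host side:
//   dim3 grid(blocksPerOption, optN);
//   multiBlockPerOptionSketch<<<grid, 128>>>(d_Partial, blocksPerOption);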
These are the ideas we had already implemented in the updated MonteCarlo sample from CUDA SDK 1.1.
Yes, I did try that option out. But there is a small difference: you have to __syncthreads() before WRITING to shared memory as well – warp boundaries pose a problem! It would work with a single sync only if there were a single warp. And with a single warp we have a performance problem anyway (192 threads are required to saturate computation within shared memory).
So I had to do two __syncthreads() calls for each reduction step. This comes with a performance penalty, and it was NOT worth it. That’s why I abandoned the whole single-buffer computation.
Anyway, thanks for sharing the idea. It is indeed a good one, just not the optimal one.
Sure, I am heading towards that idea. I am planning to implement the “paper” that I had quoted. It is a USENIX publication.
I am looking to compute binomial options for inputs with different timesteps so that it could be as generic as possible.
btw, do you know what kind of requirements banks have? Do they calculate a fixed time step for a bunch of options, or do they calculate options with different time steps? And what is the most commonly used time step among banks?
And what if banking applications require instant results? For example, if the banking application submits 60,000 options to me and wants me to process samples and send out results as soon as they are available, then the entire CUDA thing goes for a toss…
These are all things that are nagging me. I am yet to finalize the requirement spec for my program.
The only thing I know is that it depends. :) It’s pretty reasonable to expect requirements for varying time-step counts (for different options in the input batch), but a global (fixed) time-step count might still turn out to be acceptable.
Throughput vs. latency is a known issue. You need a big enough batch size in order to keep the machine as close as possible to its peak performance (peak computation throughput), but the bigger the batch size, the bigger the result latency, because from the point of view of the host, current devices are numb until the computations are 100% complete and have no means to report job status. So you have to identify the minimum batch size with an acceptable performance rate relative to the peak rate (with an “infinite” batch), say, 80-90%. I think that a throughput indicator (opt./s, etc.) above the desired limit and a latency indicator (Hz) below the desired limit is a rather common situation, so trading one for the other might also turn out to be acceptable, allowing you to further decrease the batch size and improve latency (though dropping the relative performance percentage to a lower but still acceptable number).
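Just for illustration (the batch size, launch configuration and host variable names below are assumed), splitting the input into batches makes results for earlier options available while later ones are still waiting:

// Each batch is priced and copied back before the next one is submitted, so the
// caller gets results every batchSize options instead of only at the very end.
void priceInBatches(float *h_CallResult, float *d_CallResult,
                    float *d_StockPrice, float *d_OptionStrike, float *d_OptionYears,
                    float riskfree, float volatility,
                    int optN, int batchSize, int numBlocks, int numThreads)
{
    for (int first = 0; first < optN; first += batchSize) {
        const int n = (batchSize < optN - first) ? batchSize : (optN - first);
        binomialOptionsGPU<<<numBlocks, numThreads>>>(
            d_CallResult + first, d_StockPrice + first,
            d_OptionStrike + first, d_OptionYears + first,
            riskfree, volatility, n);
        cudaMemcpy(h_CallResult + first, d_CallResult + first,
                   n * sizeof(float), cudaMemcpyDeviceToHost);  // deliver this batch now
    }
}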
I managed to implement the USENIX (or ACM?) paper by Alexandros on CUDA. I am able to get a speedup of around 1.4x to 1.45x over your implementation. And the implementation can process input batches with varying time steps as well.
This speedup is inclusive of the kernel launch, memory copying and some CPU house-keeping stuff as well. Actually these overheads are very minimal when compared to the GPU processing time.
I am not sure if it is appropriate to discuss the paper here. So, I am not going to give details about the paper without verifying if I am legally correct.
btw,
I use only 32 threads per block and I DON’T use __syncthreads() at all.
I will be profiling your code with this setup to see how well it scales. If you intend to do this yourself, please share your results. It will save some time for me :) Thanks.
It’s great that you are getting speedups! We’re eager to hear details of how.
What occupancy do you achieve? Unless you are able to run many thread blocks per multiprocessor, using only 32 threads will leave performance on the table (because you won’t be able to hide all the arithmetic pipeline and memory latency).
If you change Victor’s SDK code to use 32 threads it will run slower, so if you plan to compare your code to his, please run his code as is (unless you find a way to make it faster). :)
I am not sure if I can discuss the ACM paper here. I need to first know if I can before I can post the exact details.
The speedup I measured was for 1000 options each with 2048 timesteps. Victor’s implementation took around 262 millisecs. Mine takes 190 millisecs.
My occupancy is quite low… less than 0.333 (8/24 warps max). But again, as you mentioned, occupancy is not everything.
Since I use 32 threads per block, I need at least 6 blocks scheduled on a multiprocessor to hide the register latency. And my experiments showed me that 6 is more than enough to hide global memory latency as well. So I need to schedule at least 96 blocks on the GTX to hide latencies (6 blocks × 16 multiprocessors = 96). Since CUDA makes sense only for batch processing, 96 blocks is a very reasonable number. In my implementation, I spawn 4000 blocks for processing 1000 options each with 2047 timesteps.
On the contrary, I think making it 32 will rock. I can remove unnecessary __syncthreads() calls, I can get rid of double-buffering, I can get rid of the overhead associated with the for loops, and so on. So, let me give it a try.
Victor’s version with 32 threads (+ minor cleanup of the code to take advantage of the 32-thread setup) matches my performance. It takes 194 ms for 1000 options of 2048 timesteps. Previously the same program used to take 260 ms. I spawn 96 blocks to saturate the 8800 GTX (Victor’s original used 64).
I am attaching an XLS sheet that compares the execution time between victor’s original implementation and victor’s implementation tailored for 32 threads. ThirtyTwoThreads.xls (16.5 KB)
I don’t know what kind of algorithm this is and don’t have the examples here. Is this code very compute-intensive per global memory load? Because I can imagine that having only 32 threads per block could be beneficial in that case.
Yes, it is a very compute-intensive algorithm. Quite a lot of computation is being done. It is there in the “projects\binomialOptions” directory inside the SDK.
This is the modified version of the function that benefits from the 32-thread model. Apart from this, I spawn 96 blocks to saturate the 8800 GTX. Just search for “Sarnath” to locate the modified portions.
__global__ void binomialOptionsGPU(
    float *d_CallResult,
    float *d_StockPrice,
    float *d_OptionStrike,
    float *d_OptionYears,
    float Riskfree,
    float Volatility,
    int OptN
){
    //Prices cache
    __shared__ float dataA[CACHE_SIZE];
    // __shared__ float dataB[CACHE_SIZE]; // Sarnath
    //End prices array for current thread block
    float *p = &steps[blockIdx.x * (NUM_STEPS + 16)];

    for(int opt = blockIdx.x; opt < OptN; opt += gridDim.x){
        const float S = d_StockPrice[opt];
        const float X = d_OptionStrike[opt];
        const float T = d_OptionYears[opt];

        //Time per step
        const float dT    = T * (1.0f / NUM_STEPS);
        const float VsdT  = Volatility * sqrtf(dT);
        const float RdT   = Riskfree * dT;
        //Per-step cont. compounded riskless rate
        const float R     = expf(RdT);
        const float Rinv  = 1.0f / R;
        //Up move corresponding to given variance
        const float u     = expf(VsdT);
        //Corresponding down move
        const float d     = 1.0f / u;
        //Pseudo-probability of upward move
        const float Pu    = (R - d) / (u - d);
        //Pseudo-probability of downward move
        const float Pd    = 1.0f - Pu;
        //Compounded quotients
        const float PuByR = Pu * Rinv;
        const float PdByR = Pd * Rinv;

        ///////////////////////////////////////////////////////////////////////
        // Compute values at expiration date:
        // call option value at period end is V(T) = S(T) - X
        // if S(T) is greater than X, or zero otherwise.
        // The computation is similar for put options.
        ///////////////////////////////////////////////////////////////////////
        for(int i = threadIdx.x; i <= NUM_STEPS; i += blockDim.x){
            float price = S * expf(VsdT * (2.0f * i - NUM_STEPS));
            p[i] = fmaxf(price - X, 0);
        }

        ////////////////////////////////////////////////////////////////////////
        // Walk backwards up binomial tree.
        // Can't do in-place reduction, since warps are scheduled out of order.
        // So double-buffer and synchronize to avoid read-after-write hazards.
        ////////////////////////////////////////////////////////////////////////
        for(int i = NUM_STEPS; i > 0; i -= CACHE_DELTA)
            for(int c_base = 0; c_base < i; c_base += CACHE_STEP){
                //Start and end positions within shared memory cache
                int c_start = min(CACHE_SIZE - 1, i - c_base);
                int c_end   = c_start - CACHE_DELTA;

                //Read data(with apron) to shared memory cache
                // __syncthreads(); // Sarnath
                for(int k = threadIdx.x; k <= c_start; k += blockDim.x)
                    dataA[k] = p[c_base + k];

                //Calculations within shared memory
                for(int k = c_start - 1; k >= c_end; k--){ // Sarnath
                    //Compute discounted expected value
                    //__syncthreads(); // Sarnath
                    for(int l = threadIdx.x; l <= k; l += blockDim.x)
                        dataA[l] = PuByR * dataA[l + 1] + PdByR * dataA[l];
                    // k--; // Sarnath
                    //Compute discounted expected value
                    /* __syncthreads(); // Sarnath
                    for(int l = threadIdx.x; l <= k; l += blockDim.x)
                        dataA[l] = PuByR * dataB[l + 1] + PdByR * dataB[l];
                    k--;*/
                }

                //Flush shared memory cache
                // __syncthreads(); // Sarnath
                for(int k = threadIdx.x; k <= c_end; k += blockDim.x)
                    p[c_base + k] = dataA[k];
            }

        //Write the value at the top of the tree to destination buffer
        if(threadIdx.x == 0) d_CallResult[opt] = dataA[0];
    }
}
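For reference, I launch it roughly like this (the host variable names are only illustrative):

// 96 blocks of 32 threads each, as described above, to keep the 8800 GTX busy.
binomialOptionsGPU<<<96, 32>>>(d_CallResult, d_StockPrice, d_OptionStrike,
                               d_OptionYears, riskfree, volatility, optN);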
My code is based on this ACM paper. It is a very simple algorithm to implement.
Also note that the tailored 32-thread version of NVIDIA’s kernel (I have posted the code in a post above) performs as well as my algorithm. So you can follow NVIDIA’s algorithm as well.
However, NVIDIA’s algorithm has a certain amount of redundancy which my code does NOT have. So I should actually perform somewhat better than the 32-thread version of NVIDIA’s code. I am working on spawning a fixed number of blocks which would work on all the jobs, instead of spawning as many blocks as there are jobs.
I hope you can understand my stand on not sharing code. I will be glad to help you in all other possible ways.