Option Pricing on GPU - The Trinomial Tree Financial Algorithms

Sorry about the double-posting:

Is there anyone out there who has worked on speeding up financial algorithms?

I am working on the trinomial tree method of option pricing. I am getting speedups on the order of 55 to 65x, which feels a bit low to me…

The speedup was measured against an unoptimized CPU program running on a 2.42 GHz Intel CPU with 2 GB RAM. The GPU used was an 8800 GTX.

I have yet to analyse and optimize my GPU program.

All comments welcome!

Thank you
Best Regards,
Sarnath

Well, interestingly, I increased the number of blocks to equal the number of options and I am now getting an 85x speedup… No other change in the algorithm… Oops. Can you believe it?

Interesting…

I had a FOR loop like this:

// Block-stride loop: block blockIdx.x handles options blockIdx.x,
// blockIdx.x + gridDim.x, ... until all NUM_OPTIONS are processed.
for (int i = blockIdx.x; i < NUM_OPTIONS; i += gridDim.x)
{

     FetchOption(i);
     ProcessOption(i);
     StoreBackResult(i);

}
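
For anyone who wants to picture the surrounding structure, here is a stripped-down, compilable sketch of such a kernel. The Option layout, the ProcessOption stub and the kernel name priceOptions are made up purely for illustration; the actual trinomial-tree code (shared-memory usage, backward induction over the tree) is omitted.

#include <cuda_runtime.h>

// Hypothetical option layout, just for illustration.
struct Option { float S, K, T, r, sigma; };

// Placeholder: the real code does the backward induction over the
// trinomial tree here, with all 32 threads of the block cooperating.
__device__ float ProcessOption(const Option &opt)
{
    return opt.S;   // stub
}

// Block-stride loop over options: block blockIdx.x handles options
// blockIdx.x, blockIdx.x + gridDim.x, ... so the same kernel works
// whether the grid has 80 blocks or one block per option.
__global__ void priceOptions(const Option *options, float *results,
                             int numOptions)
{
    for (int i = blockIdx.x; i < numOptions; i += gridDim.x)
    {
        if (threadIdx.x == 0)
            results[i] = ProcessOption(options[i]);
    }
}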

I used 1000 options and 80 blocks (5 * 16 = 80) and got a 55x speedup.
I used 1000 options and 1000 blocks and got an 82x speedup…

I can't justify this in any way. It's crazy… But it gets me performance… That's good enough for me…

My shared memory usage is such that I can run 5 active blocks per MP. I have 16 multiprocessors, and hence I went for 80 blocks, which seemed to be the optimum number.

btw, I have just 32 threads per block.
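
To make the two configurations concrete, the host-side launches look roughly like this. This is only a sketch: priceOptions, d_options and d_results are the illustrative names from the kernel sketch above, and the factor of 5 resident blocks per MP comes from my shared-memory analysis, not from any API query.

const int NUM_OPTIONS = 1000;

Option *d_options = 0;
float  *d_results = 0;
cudaMalloc((void**)&d_options, NUM_OPTIONS * sizeof(Option));
cudaMalloc((void**)&d_results, NUM_OPTIONS * sizeof(float));
// ... copy the 1000 options into d_options ...

cudaDeviceProp prop;
cudaGetDeviceProperties(&prop, 0);    // multiProcessorCount is 16 on the 8800 GTX

// Configuration A: fill the machine exactly once
// (5 resident blocks per MP * 16 MPs = 80 blocks, 32 threads each) -> ~55x
int blocks = 5 * prop.multiProcessorCount;
priceOptions<<<blocks, 32>>>(d_options, d_results, NUM_OPTIONS);

// Configuration B: one block per option (1000 blocks); each block then does a
// single iteration of the loop and the scheduler keeps refilling the MPs -> ~82x
priceOptions<<<NUM_OPTIONS, 32>>>(d_options, d_results, NUM_OPTIONS);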

Well, more blocks means relatively less overhead. How long does each kernel run in the first case? If it is very short, you have relatively large kernel-launch overhead.

More blocks means less overhead??? Can you explain that a bit more?

To answer your first question, there is just one kernel launch involved. To process 1000 options it takes 138 milliseconds on the GPU and 11.17 seconds on the CPU, hence the speedup of roughly 80x. So this is clearly not a question of kernel-launch overhead.
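
In case anyone wants to reproduce the measurement: a kernel of this length can be timed with CUDA events around the single launch, something like the sketch below (again reusing the illustrative names from earlier). The launch overhead is on the order of microseconds, so it is lost in the noise against roughly 138 ms of work.

cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start, 0);
priceOptions<<<NUM_OPTIONS, 32>>>(d_options, d_results, NUM_OPTIONS);
cudaEventRecord(stop, 0);
cudaEventSynchronize(stop);               // wait for the kernel to finish

float ms = 0.0f;
cudaEventElapsedTime(&ms, start, stop);   // ~138 ms for the 1000 options
cudaEventDestroy(start);
cudaEventDestroy(stop);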

More blocks means relatively less overhead. If you have a kernel running for 10 usec and the launch overhead is 5 usec, the total is 15 usec, so a third of the time is overhead. If your kernel runs for 1 sec, the total is 1.000005 sec and the overhead is negligible.

But, as you state, when using 80 blocks you feed each MP only one batch of blocks. You need to overload them with blocks, so what you are seeing is probably the effect of latency hiding. If each MP can process 5 blocks at a time and you feed it only 5 blocks, only the start of the run has optimum latency hiding: as blocks finish, fewer active threads remain to hide the remaining latency, and no new blocks arrive to replace them. If you feed it 1005 blocks, the first 1001 blocks run with optimum latency hiding, because a fresh block is scheduled as soon as one finishes. In your case, 80 blocks over 16 MPs is exactly one wave of 5 blocks each, whereas 1000 blocks keeps every MP at 5 resident blocks until the tail of the run.

Now I understand your other post in this topic about grid-level synchronization. You were feeding the GPU only as much as it could chew at a time. The way to get optimal performance is to overfeed it :D

Aah… Right. This explanation makes sense to me. But NVIDIA's SDK sample on the binomial tree does NOT over-feed it… The more I think about it, the more confused I get. I think I will just take this for granted and be happy about it.

Thanks for your time.