I am implementing binomial option pricing on CUDA, and I have been benchmarking my implementation against the one that comes with NVIDIA's SDK.
My kernel is very simple: it uses just one warp and shared memory to generate the required output. I don't use double buffering, since I have only one warp. At any given reduction step, the threads in this warp share the work of deriving the next level from the existing nodes; each level is processed 32 elements at a time. I don't access global memory at all.
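To make this concrete, here is a simplified sketch of the approach (the identifiers, the probability parameters, and the fixed 2048-step size are illustrative, not my exact kernel):

```cuda
#include <cuda_runtime.h>

#define NUM_STEPS 2048   // depth of the binomial tree

__global__ void binomialSingleWarp(float *d_result,        // one price per option
                                   const float *d_leaves,  // (NUM_STEPS + 1) terminal payoffs per option
                                   float pu, float pd, float disc)
{
    __shared__ float level[NUM_STEPS + 1];
    const int tid = threadIdx.x;                            // 0..31, one warp per block/option
    const float *leaves = d_leaves + blockIdx.x * (NUM_STEPS + 1);

    // Load the terminal payoffs into shared memory, 32 elements at a time.
    for (int i = tid; i <= NUM_STEPS; i += 32)
        level[i] = leaves[i];
    __syncthreads();

    // Backward induction: step k reduces k + 1 nodes to k nodes,
    // again 32 elements at a time, entirely in shared memory.
    for (int k = NUM_STEPS; k > 0; --k) {
        for (int base = 0; base < k; base += 32) {
            const int i = base + tid;
            float v = 0.0f;
            if (i < k)
                v = disc * (pu * level[i + 1] + pd * level[i]);
            __syncthreads();                 // finish all reads of this level first
            if (i < k)
                level[i] = v;
            __syncthreads();
        }
    }

    if (tid == 0)
        d_result[blockIdx.x] = level[0];     // root of the tree = option price
}

// Launched with one 32-thread block per option, e.g.:
//   binomialSingleWarp<<<numOptions, 32>>>(d_result, d_leaves, pu, pd, disc);
```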
For a 2048-level tree, I get nearly 8x performance compared to the CPU. I should be able to scale that by 16, since the multiprocessors can each process an independent option in parallel, which brings me to a decent 128x. I have also unrolled my loop 8 times to get a better speedup factor.
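By "unrolled 8 times" I mean asking nvcc to partially unroll the strided inner pass, roughly like the fragment below. This is only illustrative; it uses separate input/output buffers purely to keep the example race-free, which is not how my actual single-buffer kernel is written.

```cuda
__device__ void stepLevel(const float *src, float *dst, int k, int lane,
                          float pu, float pd, float disc)
{
    #pragma unroll 8
    for (int i = lane; i < k; i += 32)   // one warp: 32 nodes per pass
        dst[i] = disc * (pu * src[i + 1] + pd * src[i]);
}
```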
But STILL, I find that NVIDIA's implementation beats my kernel by a non-negligible number of ticks. (I used QueryPerformanceCounter() to profile both my code and NVIDIA's code, which I tweaked a bit.)
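This is roughly how I take the timings (illustrative helper, not my exact harness). The cudaDeviceSynchronize() calls matter: kernel launches return immediately, so without them QueryPerformanceCounter() would mostly measure launch overhead rather than execution time.

```cpp
#include <windows.h>
#include <cuda_runtime.h>

// Time a kernel launch in milliseconds with QueryPerformanceCounter.
// F is anything callable that enqueues the kernel, e.g. a lambda.
template <typename F>
double timeKernelMs(F launchKernel)
{
    LARGE_INTEGER freq, t0, t1;
    QueryPerformanceFrequency(&freq);

    cudaDeviceSynchronize();        // drain any previously queued work
    QueryPerformanceCounter(&t0);
    launchKernel();                 // asynchronous launch
    cudaDeviceSynchronize();        // wait for the kernel to finish
    QueryPerformanceCounter(&t1);

    return 1000.0 * double(t1.QuadPart - t0.QuadPart) / double(freq.QuadPart);
}

// usage:
//   double ms = timeKernelMs([&] { myKernel<<<grid, block>>>(/* args */); });
```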
I went through NVIDIA's implementation. It accesses global memory a lot and even performs redundant computation: the nodes in the CACHE_DELTA column in Figure 3 are calculated twice, and all such window boundaries are calculated twice. I don't understand how this code outperforms mine.
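For reference, this is my (simplified) reading of the SDK scheme; the constants and names are illustrative, not the exact SDK source. The current tree level lives in global memory, and each pass pulls an overlapping window into shared memory, runs CACHE_DELTA reduction steps on it, and writes back only the interior. Adjacent windows overlap by CACHE_DELTA nodes, so those boundary nodes are computed twice (the redundant work I mentioned), but many windows and many options can run across all the multiprocessors at once.

```cuda
#include <cuda_runtime.h>

#define CACHE_SIZE  256                       // threads per block / window width
#define CACHE_DELTA  32                       // steps per global-memory round trip
#define CACHE_STEP  (CACHE_SIZE - CACHE_DELTA)

// One pass: reduce the level whose highest node index is topIndex by
// CACHE_DELTA backward-induction steps. The host launches
// ceil(topIndex / CACHE_STEP) blocks of CACHE_SIZE threads, then lowers
// topIndex by CACHE_DELTA and repeats until the root is reached.
__global__ void reduceDeltaSteps(float *d_level, int topIndex,
                                 float puByDf, float pdByDf)
{
    __shared__ float cache[CACHE_SIZE];
    const int tid  = threadIdx.x;
    const int base = blockIdx.x * CACHE_STEP;                 // left edge of this window

    const int cStart = min(CACHE_SIZE - 1, topIndex - base);  // highest index loaded
    const int cEnd   = cStart - CACHE_DELTA;                   // highest index still valid afterwards

    // Load the window, including the CACHE_DELTA halo shared with the
    // neighbouring block, from global memory.
    if (tid <= cStart)
        cache[tid] = d_level[base + tid];

    // CACHE_DELTA backward-induction steps entirely in shared memory.
    for (int k = cStart - 1; k >= cEnd; --k) {
        __syncthreads();
        float v = 0.0f;
        if (tid <= k)
            v = puByDf * cache[tid + 1] + pdByDf * cache[tid];
        __syncthreads();
        if (tid <= k)
            cache[tid] = v;
    }

    // Flush only the interior; the halo nodes are recomputed by the
    // neighbouring window -- the duplicated boundary work.
    __syncthreads();
    if (tid <= cEnd)
        d_level[base + tid] = cache[tid];
}
```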
Does the NVIDIA SDK code have some optimizations that don't meet the eye so easily? Thanks for any inputs.