4.7 GOpt/s was given for the GeForce 8800 GTX, counting each input index as two options, since both the call and the put are calculated. Sorry for the confusion.
The BlackScholes CUDA SDK sample currently reports an effective 4.45 GOpt/s, which is slightly lower than 4.7 GOpt/s, but the primary goal of the sample is to be a simple demonstration of a streaming application.
As for optimization, here are some general recommendations:
- Warp occupancy can be increased with the help of -po maxrregcount=<…>. Depending on how many threads per block you are shooting for, the optimal register count varies, so by default the compiler doesn't try to minimize the per-thread register count as much as it can; with this option it is forced to stay within the budget. Be warned, however, that too low a maxrregcount can force spills to local memory, which in many cases slows things down in spite of the increased warp occupancy.
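For illustration, a hypothetical build line capping registers at 16 per thread might look like the following (the exact flag spelling varies between toolkit versions, so treat this as a sketch and check nvcc --help for your release):

```
nvcc -o BlackScholes BlackScholes.cu --ptxas-options=-maxrregcount=16
```

A good workflow is to compile once without the cap, note the register count the compiler reports, and then lower the budget step by step while watching both occupancy and actual kernel time.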
- Synchronization overhead and timing precision, depending on the OS/CPU/chipset, can influence observed performance. Running multiple iterations, measuring the total time, and dividing by the number of iterations gives more precise results. This is especially important on Linux, where, depending on the kernel build, the resolution of the gettimeofday() system call can be as coarse as 10 ms (a 100 Hz timer tick).
Concerning timing: the profiler (enabled with the environment variable CUDA_PROFILE=1) reports kernel and memcpy times using high-precision internal GPU timers (in addition to CPU ones), without any need for cudaThreadSynchronize() calls in user programs. There are plans to expose these timers in the CUDA API.
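A typical invocation looks like the following sketch (the log file name is from memory and may differ between toolkit versions):

```
CUDA_PROFILE=1 ./BlackScholes
cat cuda_profile.log
```

The log lists each kernel launch and memcpy with its GPU-side time, which is handy for cross-checking your own CPU-side measurements.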