Differences in the Monte Carlo Option Pricing Multi-GPU code sample between CUDA 3.2 and CUDA 4.0

Hi,

I am currently running and timing some of the SDK code samples, and I noticed a major difference in the per-GPU times reported by the Monte Carlo Option Pricing Multi-GPU sample (MonteCarloMultiGPU) when it is run against CUDA 3.2 versus CUDA 4.0. This is the program output with CUDA 3.2:

[MonteCarloMultiGPU] starting...
main(): generating input data...
main(): starting 4 host threads...
main(): waiting for GPU results...
Resetting device 3
Resetting device 1
Resetting device 0
Resetting device 2
main(): GPU statistics, threaded
GPU #0
Options         : 64
Simulation paths: 262144
Total time (ms.): 1115.595947
Options per sec.: 229.473763
GPU #1
Options         : 64
Simulation paths: 262144
Total time (ms.): 14.766000
Options per sec.: 17337.126071
GPU #2
Options         : 64
Simulation paths: 262144
Total time (ms.): 568.174988
Options per sec.: 450.565416
GPU #3
Options         : 64
Simulation paths: 262144
Total time (ms.): 1677.020996
Options per sec.: 152.651637
main(): comparing Monte Carlo and Black-Scholes results...
Shutting down...
Test Summary...
L1 norm        : 2.979117E-06
Average reserve: 384.457409
[MonteCarloMultiGPU] test results...
PASSED

Now when I use CUDA 4.0, I get this:

[MonteCarloMultiGPU] starting...
main(): generating input data...
main(): starting 4 host threads...
main(): waiting for GPU results...
Resetting device 0
Resetting device 3
Resetting device 1
Resetting device 2
main(): GPU statistics, threaded
GPU #0
Options         : 64
Simulation paths: 262144
Total time (ms.): 5.523000
Options per sec.: 46351.622481
GPU #1
Options         : 64
Simulation paths: 262144
Total time (ms.): 5.845000
Options per sec.: 43798.119622
GPU #2
Options         : 64
Simulation paths: 262144
Total time (ms.): 9.681000
Options per sec.: 26443.549887
GPU #3
Options         : 64
Simulation paths: 262144
Total time (ms.): 4.173000
Options per sec.: 61346.755010
main(): comparing Monte Carlo and Black-Scholes results...
Shutting down...
Test Summary...
L1 norm        : 2.979117E-06
Average reserve: 384.457409
[MonteCarloMultiGPU] test results...
PASSED

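The "Total time" values are very different: under CUDA 3.2 they range from about 15 ms (GPU #1) to about 1677 ms (GPU #3), while under CUDA 4.0 every GPU reports less than 10 ms. The numerical results (L1 norm and average reserve) are identical in both runs. I would like to know the reason for this divergence, given that the total wall-clock time of the program remains the same. Any help is appreciated.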
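As a sanity check on how the two reported fields relate, here is a minimal Python sketch (the names `CUDA_3_2`, `CUDA_4_0` and `implied_option_counts` are my own and are not taken from the SDK sample). It multiplies each "Options per sec." value by the matching "Total time" converted to seconds, using only the figures from the two runs in this post. If the throughput were computed from the per-GPU option count, each product should be close to 64.

```python
# Figures copied from the two runs in this post:
# (total time in ms, options per sec) for GPU #0..#3.
CUDA_3_2 = [
    (1115.595947, 229.473763),
    (14.766000, 17337.126071),
    (568.174988, 450.565416),
    (1677.020996, 152.651637),
]
CUDA_4_0 = [
    (5.523000, 46351.622481),
    (5.845000, 43798.119622),
    (9.681000, 26443.549887),
    (4.173000, 61346.755010),
]


def implied_option_counts(runs):
    """Return rate * time (in seconds) for each GPU, i.e. the option
    count that the reported throughput implies."""
    return [rate * (time_ms / 1000.0) for time_ms, rate in runs]


if __name__ == "__main__":
    for label, runs in (("CUDA 3.2", CUDA_3_2), ("CUDA 4.0", CUDA_4_0)):
        for gpu, count in enumerate(implied_option_counts(runs)):
            print(f"{label} GPU #{gpu}: implied options = {count:.3f}")
```

With these figures every product comes out at roughly 256 (that is, 4 x 64) rather than 64. This suggests, though I have not confirmed it in the sample source, that "Options per sec." is the total option count across all four GPUs divided by each GPU's own "Total time". If so, the throughput lines carry no information beyond the "Total time" values, and the question reduces to why those times differ between the two toolkit versions.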

Thanks,

Sayan

Cross reference to thread on stackoverflow.com:

http://stackoverflow.com/questions/6474389/monte-carlo-multi-gpu-code-against-cuda-3-2-and-cuda-4-0