A CFD code on a GPU cluster runs faster when compiled with CUDA 4.0 than with CUDA 5.5

Hi, everyone

I recently developed a CFD code and ran it on a GPU cluster with up to 250 Tesla M2070 cards. The strange thing is that the code compiled with CUDA 4.0 runs faster than the same code compiled with CUDA 5.5. Both builds use the same source and the same compile options (-O3). The only difference is the runtime library: one build uses 4.0 and the other uses 5.5. The CUDA driver on the cluster is 5.5.

The CFD code consists of the following kernels:

Kernel 1 : about 85% of the total elapsed time.
Kernel 2 + Kernel 3 + Kernel 4 : about 15% of the total elapsed time.

++++++++++++++++++++++++++++++++++++++++++

  1. In the case of CUDA driver 5.5 + Runtime 4.0

    Kernel 1 : 215.95 sec.
    Kernel 2 : 20.89
    Kernel 3 : 9.865
    Kernel 4 : 1.996

  2. In the other case, CUDA driver 5.5 + Runtime 5.5

    Kernel 1 : 246.66 sec. (about 30 sec. slower than the first case)
    Kernel 2 : 20.88
    Kernel 3 : 10.16
    Kernel 4 : 2.15

I also have a small cluster of 4 nodes, each with 4 M2070 cards, running CUDA 6.0 (driver and runtime).

  3. On my small cluster, CUDA driver 6.0 + Runtime 6.0

    Kernel 1 : 229.1 sec. (about 13 sec. slower than the first case)
    Kernel 2 : 20.98
    Kernel 3 : 10.17
    Kernel 4 : 2.07

I don’t know why this happens. The compile command is "nvcc -O3 -c -arch=sm_20".

Are there any options or suggestions for improving performance with the newer CUDA runtimes?

Thanks.

Solved.

I rewrote the auto-tuning function in my code so that it accounts for the new features of CUDA 5.5/6.0. The performance is now:

Kernel 1 : 213.5 sec.
Kernel 2 : 20.84
Kernel 3 : 9.87
Kernel 4 : 1.974

That should be acceptable.