Hi, everyone
I recently developed a CFD code and ran it on a GPU cluster with up to 250 Tesla M2070 cards. The weird thing is that the code compiled with CUDA 4.0 runs faster than the same code compiled with CUDA 5.5. Both builds use the same source and the same compile options (-O3). The only difference is the runtime library: one is 4.0 and the other is 5.5. The CUDA driver on the cluster is 5.5.
The CFD code consists of the following kernels:
Kernel 1 : about 85% of the total elapsed time.
Kernel 2 + Kernel 3 + Kernel 4 : about 15% of the total elapsed time.
++++++++++++++++++++++++++++++++++++++++++
In the case of CUDA driver 5.5 + Runtime 4.0:
Kernel 1 : 215.95 sec.
Kernel 2 : 20.89 sec.
Kernel 3 : 9.865 sec.
Kernel 4 : 1.996 sec.
In the other case, CUDA driver 5.5 + Runtime 5.5:
Kernel 1 : 246.66 sec. (about 30 sec. slower than the first case)
Kernel 2 : 20.88 sec.
Kernel 3 : 10.16 sec.
Kernel 4 : 2.15 sec.
I also have a small cluster of 4 nodes, each with 4 M2070 cards, with CUDA 6.0 installed (both driver and runtime).
On the small cluster, CUDA driver 6.0 + Runtime 6.0:
Kernel 1 : 229.1 sec. (about 13 sec. slower than the first case)
Kernel 2 : 20.98 sec.
Kernel 3 : 10.17 sec.
Kernel 4 : 2.07 sec.
I don't know why this happens. The compile command is "nvcc -O3 -c -arch=sm_20".
Are there any options or suggestions to improve performance with the newer CUDA runtimes?
Thanks.