Hi everyone,
I have two Tesla C2070 GPUs running on a Linux (x64) system, but with different CUDA toolkit and SDK release versions:
- GPU 1: CUDA (toolkit and SDK) version 3.2.16
- GPU 2: CUDA (toolkit and SDK) version 4.0.17
I have executed the same kernels/programs on both GPUs (GPU 1 & GPU 2), each compiled, obviously, with the appropriate compiler. In every case I have tried, my kernels are slower on CUDA 4.0 than on CUDA 3.2.
For example, the reduction kernel provided in the CUDA samples gives these results:
GPU 1 with CUDA 3.2:
Reducing array of type int
16777216 elements
256 threads (max)
64 blocks
Reduction, Throughput = 115.8948 GB/s, Time = 0.00058 s, Size = 16777216 Elements, NumDevsUsed = 1, Workgroup = 256

GPU 2 with CUDA 4.0:
Reducing array of type int
16777216 elements
256 threads (max)
64 blocks
Reduction, Throughput = 82.4342 GB/s, Time = 0.00081 s, Size = 16777216 Elements, NumDevsUsed = 1, Workgroup = 256
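(As a sanity check, the sample's throughput figure is just the bytes processed divided by the elapsed time, so the two numbers above are consistent with each other; a quick sketch in plain Python, using the printed sizes and times:)

```python
def throughput_gbs(num_elements, elem_bytes, time_s):
    """Throughput in GB/s, as the reduction sample reports it:
    total bytes read divided by elapsed time, in units of 1e9 bytes."""
    return num_elements * elem_bytes / time_s / 1e9

# 16777216 ints of 4 bytes each, with the times printed above.
cuda32 = throughput_gbs(16777216, 4, 0.00058)  # ~115.7 GB/s (matches the 3.2 run)
cuda40 = throughput_gbs(16777216, 4, 0.00081)  # ~82.9 GB/s (matches the 4.0 run)
print(cuda32, cuda40, cuda40 / cuda32)  # the 4.0 run is ~28% lower on this sample
```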
For my own CUDA programs (which are more time-consuming), CUDA 4.0 is about 15% slower than CUDA 3.2.
Does anyone know where this slowdown comes from?
Thanks in advance for any help.