Slow CUDA kernels/programs in CUDA 4.0

Hi everyone,

I have two Tesla C2070 GPUs running on a Linux system (x64), but with different CUDA toolkit and SDK release versions:

  • GPU 1: CUDA (toolkit and SDK) version 3.2.16
  • GPU 2: CUDA (toolkit and SDK) version 4.0.17

I have executed the same kernels/programs on both GPUs (GPU 1 & GPU 2), each compiled with the corresponding toolkit. In every example I have tried, my kernels are slower with CUDA 4.0 than with CUDA 3.2.
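
For reference, here is a minimal sketch of one way to time a kernel with CUDA events when comparing toolkit versions (the dummyKernel below is only a placeholder, not one of the actual kernels from my tests):

    #include <cstdio>
    #include <cuda_runtime.h>

    // Placeholder kernel used only to illustrate the timing method.
    __global__ void dummyKernel(float *data, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) data[i] *= 2.0f;
    }

    int main()
    {
        const int n = 1 << 24;
        float *d_data = 0;
        cudaMalloc(&d_data, n * sizeof(float));

        // CUDA events measure elapsed time on the device, so the numbers
        // are comparable across toolkit versions and host systems.
        cudaEvent_t start, stop;
        cudaEventCreate(&start);
        cudaEventCreate(&stop);

        cudaEventRecord(start, 0);
        dummyKernel<<<(n + 255) / 256, 256>>>(d_data, n);
        cudaEventRecord(stop, 0);
        cudaEventSynchronize(stop);

        float ms = 0.0f;
        cudaEventElapsedTime(&ms, start, stop);
        printf("kernel time: %f ms\n", ms);

        cudaEventDestroy(start);
        cudaEventDestroy(stop);
        cudaFree(d_data);
        return 0;
    }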

For example, the reduction kernel provided in the CUDA SDK samples gives the following results:

  • GPU 1 with CUDA 3.2:
    Reducing array of type int
    16777216 elements
    256 threads (max)
    64 blocks
    Reduction, Throughput = 115.8948 GB/s, Time = 0.00058 s, Size = 16777216 Elements, NumDevsUsed = 1, Workgroup = 256

  • GPU 2 with CUDA 4.0:
    Reducing array of type int
    16777216 elements
    256 threads (max)
    64 blocks
    Reduction, Throughput = 82.4342 GB/s, Time = 0.00081 s, Size = 16777216 Elements, NumDevsUsed = 1, Workgroup = 256
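
As a sanity check on these numbers: the kernel reads 16777216 × 4 bytes ≈ 0.067 GB, so 0.067 GB / 0.00058 s ≈ 116 GB/s and 0.067 GB / 0.00081 s ≈ 83 GB/s. The drop in throughput therefore corresponds directly to the longer kernel time, not to a change in how the sample reports the figure.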

For my own CUDA programs (which are more time-consuming), CUDA 4.0 is about 15% slower than CUDA 3.2.
Does anyone know where this performance loss comes from?

Thanks in advance for any help.

Is the memory’s ECC mode configured identically on both GPUs? Also, try CUDA 4.1 RC2 if you have access to it; the LLVM-based compiler is promised to bring a 5-10% performance boost.
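
If it helps, here is a minimal sketch for checking the ECC state programmatically (it relies on the ECCEnabled field of cudaDeviceProp, exposed by recent CUDA runtimes); nvidia-smi reports the same information:

    #include <cstdio>
    #include <cuda_runtime.h>

    // Print whether ECC is enabled on each CUDA device in the system.
    int main()
    {
        int count = 0;
        cudaGetDeviceCount(&count);
        for (int dev = 0; dev < count; ++dev) {
            cudaDeviceProp prop;
            cudaGetDeviceProperties(&prop, dev);
            printf("Device %d (%s): ECC %s\n",
                   dev, prop.name, prop.ECCEnabled ? "enabled" : "disabled");
        }
        return 0;
    }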

Yes indeed, ECC is disabled on GPU 1 whereas it is enabled on GPU 2.

Now that ECC is configured the same way on both GPUs, the kernels on GPU 2 are as fast as on GPU 1.

Thanks for your help.