Slow CUDA kernels/programs in CUDA 4.0

Hi everyone,

I have two Tesla C2070 GPUs running on a Linux system (x64), but with different CUDA toolkit and SDK release versions:

  • GPU 1: CUDA (toolkit and SDK) version 3.2.16
  • GPU 2: CUDA (toolkit and SDK) version 4.0.17

I have executed the same kernels/programs on both GPUs (GPU 1 & GPU 2), each compiled with the corresponding toolkit. In every example I have tried, my kernels are slower with CUDA 4.0 than with CUDA 3.2.
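
For reference, here is a minimal sketch of one way to time a kernel with CUDA events when comparing toolkit versions (the dummyKernel below is only a placeholder, not one of the actual kernels from my tests):

    #include <cstdio>
    #include <cuda_runtime.h>

    // Placeholder kernel used only to illustrate the timing method.
    __global__ void dummyKernel(float *data, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) data[i] *= 2.0f;
    }

    int main()
    {
        const int n = 1 << 24;
        float *d_data = 0;
        cudaMalloc(&d_data, n * sizeof(float));

        // CUDA events measure elapsed time on the device, so the numbers
        // are comparable across toolkit versions and host systems.
        cudaEvent_t start, stop;
        cudaEventCreate(&start);
        cudaEventCreate(&stop);

        cudaEventRecord(start, 0);
        dummyKernel<<<(n + 255) / 256, 256>>>(d_data, n);
        cudaEventRecord(stop, 0);
        cudaEventSynchronize(stop);

        float ms = 0.0f;
        cudaEventElapsedTime(&ms, start, stop);
        printf("kernel time: %f ms\n", ms);

        cudaEventDestroy(start);
        cudaEventDestroy(stop);
        cudaFree(d_data);
        return 0;
    }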

For example, the reduction kernel provided in the CUDA SDK samples gives the following results:

  • GPU 1 with CUDA 3.2:
    Reducing array of type int
    16777216 elements
    256 threads (max)
    64 blocks
    Reduction, Throughput = 115.8948 GB/s, Time = 0.00058 s, Size = 16777216 Elements, NumDevsUsed = 1, Workgroup = 256

  • GPU 2 with CUDA 4.0:
    Reducing array of type int
    16777216 elements
    256 threads (max)
    64 blocks
    Reduction, Throughput = 82.4342 GB/s, Time = 0.00081 s, Size = 16777216 Elements, NumDevsUsed = 1, Workgroup = 256
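
As a sanity check on these numbers: the kernel reads 16777216 × 4 bytes ≈ 0.067 GB, so 0.067 GB / 0.00058 s ≈ 116 GB/s and 0.067 GB / 0.00081 s ≈ 83 GB/s. The drop in throughput therefore corresponds directly to the longer kernel time, not to a change in how the sample reports the figure.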

For my own CUDA programs (which are more time-consuming), CUDA 4.0 is about 15% slower than CUDA 3.2.
Does anyone know where this performance loss comes from?

Thanks in advance for any help.

Is the memory’s ECC mode configured identically on both GPUs? Also, try CUDA 4.1 RC2 if you have access to it; the LLVM-based compiler is promised to bring a 5-10% performance boost.
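
If it helps, here is a minimal sketch for checking the ECC state programmatically (it relies on the ECCEnabled field of cudaDeviceProp, exposed by recent CUDA runtimes); nvidia-smi reports the same information:

    #include <cstdio>
    #include <cuda_runtime.h>

    // Print whether ECC is enabled on each CUDA device in the system.
    int main()
    {
        int count = 0;
        cudaGetDeviceCount(&count);
        for (int dev = 0; dev < count; ++dev) {
            cudaDeviceProp prop;
            cudaGetDeviceProperties(&prop, dev);
            printf("Device %d (%s): ECC %s\n",
                   dev, prop.name, prop.ECCEnabled ? "enabled" : "disabled");
        }
        return 0;
    }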

Yes indeed, ECC is disabled on GPU 1 whereas it is enabled on GPU 2.

Now that ECC is configured the same way on both GPUs, the kernels on GPU 2 are as fast as on GPU 1.

Thanks for your help.