Poor CUBLAS performance after upgrading display driver

Hi All,

I’m using CUDA on Windows and recently installed the 258.96 WHQL display driver in order to be able to use the 3.1 toolkit. While testing the new driver with both the old 2.3 and the new 3.1 toolkit, CUBLAS showed poor performance compared to the old 197.47 driver. I use CUBLAS to solve a dense linear system by LU decomposition, closely following the reference LAPACK routine DGETRF and using CUDA to accelerate the BLAS calls. The following results were obtained on a GTX 260 using the 32-bit CUBLAS library. Calculations are performed in double precision, [font=“Courier New”]N[/font] denotes the problem size, and all times are given in ms.
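For readers unfamiliar with the structure of this factorization, here is a minimal host-only C sketch of an unblocked LU decomposition with partial pivoting, in the spirit of LAPACK's DGETF2 (the panel step inside DGETRF). It is not the code used for the measurements: [font=“Courier New”]idamax_host[/font] and [font=“Courier New”]lu_unblocked[/font] are hypothetical names, and the loops marked DSWAP, DSCAL and DGER are plain-C stand-ins for the BLAS calls that a CUBLAS-based version would presumably run on the GPU. The point it illustrates is that the pivot search and the read of the pivot value each happen once per column.

```c
#include <assert.h>
#include <math.h>
#include <stddef.h>

/* Host stand-in for BLAS IDAMAX (hypothetical helper): returns the 1-based
   position of the entry with the largest absolute value among n entries. */
int idamax_host(int n, const double *x, int incx)
{
    assert(n > 0 && incx > 0);
    int best = 1;
    double best_abs = fabs(x[0]);
    for (int i = 1; i < n; ++i) {
        double v = fabs(x[(size_t)i * (size_t)incx]);
        if (v > best_abs) {
            best_abs = v;
            best = i + 1;
        }
    }
    return best;
}

/* Unblocked LU with partial pivoting on a column-major n x n matrix, in place.
   On return, a holds L (unit diagonal, strictly below the diagonal) and U (on
   and above the diagonal); ipiv[j] is the 1-based row swapped with row j+1.
   Returns 0 on success, or j+1 for the first column with an exactly zero pivot. */
int lu_unblocked(int n, double *a, int lda, int *ipiv)
{
    assert(n > 0 && lda >= n);
    int info = 0;
    for (int j = 0; j < n; ++j) {
        double *col = a + (size_t)j * (size_t)lda;

        /* Pivot search: one IDAMAX call per column
           (this is where a GPU version would call cublasIdamax). */
        int p = j + idamax_host(n - j, col + j, 1) - 1; /* 0-based row */
        ipiv[j] = p + 1;

        /* Reading the pivot value: in a GPU version this would be a
           one-element download, e.g. via cublasGetVector. */
        double pivot = col[p];
        if (pivot == 0.0) {
            if (info == 0) info = j + 1;
            continue; /* sub-column is all zero, nothing to eliminate */
        }

        /* DSWAP stand-in: exchange rows j and p across all columns. */
        if (p != j) {
            for (int k = 0; k < n; ++k) {
                double *ck = a + (size_t)k * (size_t)lda;
                double t = ck[j];
                ck[j] = ck[p];
                ck[p] = t;
            }
        }

        /* DSCAL stand-in: divide the sub-column by the pivot. */
        for (int i = j + 1; i < n; ++i) col[i] /= pivot;

        /* DGER stand-in: rank-1 update of the trailing submatrix. */
        for (int k = j + 1; k < n; ++k) {
            double *ck = a + (size_t)k * (size_t)lda;
            double u = ck[j];
            for (int i = j + 1; i < n; ++i) ck[i] -= col[i] * u;
        }
    }
    return info;
}
```

In this sketch an N x N matrix triggers N pivot searches and N pivot reads, so any fixed per-call cost in those two steps is paid N times per factorization.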

[codebox]
          XP x64     XP x64     XP x64  Win 7 x64  Win 7 x64  Win 7 x64
        CUDA 2.3   CUDA 2.3   CUDA 3.1   CUDA 2.3   CUDA 2.3   CUDA 3.1
    N     197.47     258.96     258.96     197.47     258.96     258.96

  100         22         23         23         78         78         74
  200         48        137        139        169       1752       1762
  300         80        213        216        261       2650       2658
  400        120        297        302        339       3564       3567
  500        163        382        387        460       4473       4476
  600        247        507        513        609       5429       5419
  700        330        630        639        759       6366       6371
  800        429        770        779        921       7375       7349
  900        539        917        926       1045       8268       8284
 1000        679       1095       1108       1296       9256       9272
 1500       1460       2037       2055       2307      13714      13729
 2000       2481       3169       3198       3535      17320      17299
[/codebox]

On XP, the vast majority of time is spent in [font=“Courier New”]cublasIdamax[/font], which is used to find the pivot element. For [font=“Courier New”]N[/font] = 2000, the runtime with the 258.96 driver is about 28% higher than with the 197.47 driver (3169 ms vs. 2481 ms with CUDA 2.3).

On Windows 7, [font=“Courier New”]cublasIdamax[/font] also accounts for most of the factorization time when using the 197.47 driver. With the 258.96 driver, for both CUDA 2.3 and 3.1, a large share of the time is additionally spent in [font=“Courier New”]cublasGetVector[/font], which is used to download the pivot element to host memory. The exact times in ms are

[codebox]
Win 7, N = 2000     Total time   cublasIdamax   cublasGetVector

197.47, CUDA 2.3          3535           3171               247
258.96, CUDA 2.3         17320          10783              6426
258.96, CUDA 3.1         17325          10781              6430
[/codebox]

Since each call to [font=“Courier New”]cublasGetVector[/font] downloads only a single element, the pivot, the launch overhead often reported for Win 7 seems to have increased tremendously with the 258.96 driver.
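To put a number on that overhead, the accumulated times can be divided by the number of calls. The small C sketch below does this. The helper names are hypothetical, and the call count is an assumption: one pivot search and one pivot download per column, i.e. 2000 calls each for [font=“Courier New”]N[/font] = 2000.

```c
#include <assert.h>

/* Average time per call in ms, given the accumulated time and the call count.
   The call count is supplied by the caller; one call per matrix column is an
   assumption, not something measured in this thread. */
double per_call_ms(double total_ms, int calls)
{
    assert(calls > 0);
    return total_ms / calls;
}

/* Slowdown factor of a new measurement relative to an old one. */
double slowdown(double old_ms, double new_ms)
{
    assert(old_ms > 0.0);
    return new_ms / old_ms;
}
```

Under that assumption, the Win 7 figures for CUDA 2.3 give roughly 0.12 ms per [font=“Courier New”]cublasGetVector[/font] call with 197.47 (247 ms / 2000) versus about 3.2 ms with 258.96 (6426 ms / 2000), a factor of about 26. For [font=“Courier New”]cublasIdamax[/font] the corresponding values are about 1.6 ms versus 5.4 ms per call, a factor of about 3.4.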

The computed results are correct for all driver versions and toolkit releases. Handwritten kernels that I compile myself do not seem to be affected. Since the slowdown occurs in a library shipped as a binary, and the unchanged CUDA 2.3 build of it is affected as well, the driver appears to be the cause. Reverting to the old 197.47 release did indeed restore the previous performance, but of course rules out using newer toolkits.

I am aware that there are faster approaches than using a standard LAPACK routine with substituted BLAS calls. Nevertheless, CUBLAS was serving my purposes very well. What concerns me more is that installing a new, validated driver apparently causes a performance drop of this magnitude. Is this a known issue, are other systems affected, and what can be done to get the original performance back? Any hints and comments are highly appreciated.

Has anyone else noticed the performance issue I’m experiencing with CUBLAS and the most recent display driver?
Comments and suggestions are still highly appreciated.
