Any backward compatibility issues for CUDA 1.1?

Do we have to rewrite code that was developed under CUDA 1.0?

Not unless you hit new bugs. Of our ~100 kernels, one triggered what looks like a shared memory load/store optimization bug, and another triggered the floating-point negation bug. The other kernels are fine. The entire program compiles without modification, even though we make heavy use of both the driver and runtime APIs.
After we reverted the compilers (but not the driver) back to 1.0, everything runs fine.

Do you use the old cuda.dll, or the newer nvcuda.dll which is now integrated into the driver?

I have a program which detects GPU presence at runtime. With CUDA 1.0 everything was simple: link the program against cuda.lib, which puts cuda.dll into the import table of my exe, and redistribute cuda.dll. Now I can’t link against cuda.lib, as this puts nvcuda.dll into the import table, which is not present on non-CUDA systems and which I’m not allowed to redistribute… Is there any easy workaround for this (other than the LoadLibrary()/GetProcAddress() stuff)?

And one more: among your 100 kernels, have you noticed any changes in performance with 1.1?

You can always put the driver in your software requirement list…
Our ~100 kernels are split across two programs. Program one crashed outright due to the possible shared memory bug, and isolating it has proven too difficult. Program two runs despite the float negation bug, which only produced very ugly shading.
With program two, OpenGL interop became significantly faster and boosted the rendering fps by ~50%. However, we didn’t notice any performance change attributable to the compiler; driver overhead seems to be the same.
By reverting the compiler, we managed to get both programs running, and the GL interop performance is still there.

In the case of our application (just one frequently used kernel), updating the driver to 169.01 while keeping the old compiled kernel had no impact on kernel performance. Updating the toolkit to 1.1 beta and recompiling provided a 25% increase in performance. Most of this seems to be the result of more efficient register allocation in nvcc, which increased the occupancy.

If you link your app against nvcuda.dll with /delayload and surround your cuInit() call with a structured exception handling block, you can gracefully handle the exception that occurs if nvcuda.dll cannot be loaded (i.e. if CUDA is not present on the system).

This was cobbled together from MSDN and other sources, so if it works for you, please report back on the forums here :-)

I’m also rather interested in the performance of the new CUDA libraries. I didn’t find any substantial improvement in CUBLAS compared to version 1.0. For example, latencies were supposed to be much smaller, but I don’t see that. SGEMM was supposed to be faster, and apparently it is (a peak of 128 Gflop/s vs. the old 123 Gflop/s on an 8800 GTX), but is that it? Other BLAS3 routines also run as slowly as they did, though it seems rather trivial to improve them.

Has anyone found any substantial speedup in the new CUBLAS?

If you have code that runs faster than the CUBLAS BLAS3 routines, please post it here (along with perf results). It’s not as trivial as it may look.


I have written Python wrappers for CUDA and used SAXPY, SGEMM, and MatrixMulDrv for tests.

On an 8600 GTS, SGEMM (from CUBLAS) does around 35 GFlops. MatrixMulDrv used to do about 10 GFlops with CUDA 1.0, but now does 20 GFlops with 1.1 beta. The CPU (an Athlon X2 5600+), using AMD’s ACML 3.6.0 with Fortran code built by Intel’s ifort, does about 8.7 GFlops.

One change in syntax I did find:

cuda 1.0: __saturate
cuda 1.1: __saturatef

Note the extra ‘f’.


Yes, that works just fine! Thank you :thumbup:

In case anyone is interested, everything is done in a few simple steps:

  1. [font=“Courier”]#include <delayimp.h>[/font] and tell linker to link against delayimp.lib (Project Properties or [font=“Courier”]#pragma comment( lib, “delayimp.lib” )[/font] )

  2. Tell linker to mark nvcuda.dll as delay-loaded (Project Properties or [font=“Courier”]#pragma comment( linker, “/delayload:nvcuda.dll” )[/font] )

  3. Use __try/__except (NOT try-catch) to handle the exception if nvcuda.dll is not present. Or you can try to load all imported functions from nvcuda.dll at once (this simplifies error handling), but make sure it’s done before any call to cu* functions.

int esprGpuDetect()
{
    __try {
        /* force-resolve every delay-loaded import from nvcuda.dll now */
        if( FAILED( __HrLoadAllImportsForDll( "nvcuda.dll" ) ) )
            return 0;
    }
    __except( EXCEPTION_EXECUTE_HANDLER ) {
        return 0;   /* nvcuda.dll not present: no CUDA on this system */
    }

    /* ... get devices, check for compute capability etc. here */
    return 1;
}


Ok, here is an example with TRSM. I use the traditional (left-looking) blocked algorithm, which offloads a (1 - nb/n) fraction of the flops to GEMM:

void block_strsm( int m, int n, float *A, int lda, float *B, int ldb )
{
    const int nb = 32;
    for( int j = 0; j < n; j += nb )
    {
        int jb = min( nb, n-j );
        cublasSgemm( 'N', 'N', m, jb, j, -1, B, ldb, A+lda*j, lda, 1, B+ldb*j, ldb );
        cublasStrsm( 'R', 'U', 'N', 'U', m, jb, 1, A+lda*j+j, lda, B+ldb*j, ldb );
    }
}

It is equivalent to cublasStrsm( ‘R’, ‘U’, ‘N’, ‘U’, m, n, 1, A, lda, B, ldb ) but runs at 108 Gflop/s instead of 66 Gflop/s for m=n=4096.

Thank you for posting this; I will take a look.


I think I may be encountering the same “shared memory bug.” I upgraded to 1.1 and one of my programs crashes when trying to fill a shared memory array. Have you been able to resolve it?