latency with CUDA 4.0 and cublas


I am testing cublas with cuda 4.0 in a simple standalone program,
especially the dgemm routine.

I am quite accustomed to cuda and the cublas library, so I guess
my code should work fine. Plus I rechecked the cublas
documentation for the changes that could lead to problems.

To be sure, I tried both legacy and new api.

The problem, quite simply put:
. cublasInit() (or its equivalent, cublasCreate()) takes 3 seconds
. the first call to cublas_dgemm() takes 3 seconds

The subsequent calls to cublas_dgemm take the expected times.

Is this normal? Am I missing something?

Thanks for helping,


Yes, driver/context initialization and module load can be expensive in the released CUDA 4.0 RC1 driver (and in previous drivers).

The next CUDA 4.0 RC driver will include optimizations which will substantially decrease driver and context initialization time and module load time.

Ok, thanks for confirming the “issue”, I will wait until the next RC.