CUBLAS question: cublasGetVector() call

Hi All!

I have a question about using the CUBLAS library that I am hoping someone can answer. I am using this library for a real-time system, and its processing time is great! (My kernel takes about 1 ms to complete what takes 2-3 min in Matlab.) The main delay seems to be moving the data between the host and the device. I understand that data transfer between the CPU and the device is slow, but I have a 1024x1024 array of floats and the cublasGetVector() call is taking about 28 ms.
Does this time sound correct?
Is there any way this can be improved?
I am using an ASUS Commando motherboard with a GTX275 board. Are the Tesla boards quicker for memory transfer?
Thanks for your help!

That is 1024*1024*4/28e-3, or roughly 150 MB/s, which is improbably low for a full 16 lane PCI-e 1.0 slot. You can confirm the pinned and pageable bandwidth performance of your card/motherboard with the SDK bandwidth test, and I am guessing you will get numbers at least 10x the cublasGetVector() throughput you are quoting. But I have a feeling that what is really happening is that your timings are wrong. Your kernel is taking much longer than you think it is, and what you are attributing to memory copy time is really kernel running time. Remember that all kernel launches, including cublas calls, are asynchronous to the host, so the blocking copy that follows has to wait for them to finish.

Your GTX-275 should be about as good as it gets in host-device bandwidth. No current Tesla will be any faster.
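For anyone who wants to check the pinned versus pageable numbers without the SDK sample, here is a minimal sketch of the same kind of measurement. It is not the SDK bandwidth test itself; the helper names (check, d2hBandwidthMBs), the repetition count, and the buffer names are made up for illustration. The buffer size matches the 1024x1024 float array from the question.

```cuda
// Sketch: device-to-host copy bandwidth with pageable vs pinned host memory.
#include <cstdio>
#include <cstdlib>
#include <chrono>
#include <cuda_runtime.h>

// Hypothetical helper: abort with a message if a CUDA runtime call failed.
static void check(cudaError_t err, const char *what)
{
    if (err != cudaSuccess) {
        fprintf(stderr, "%s failed: %s\n", what, cudaGetErrorString(err));
        exit(EXIT_FAILURE);
    }
}

// Hypothetical helper: time `reps` blocking device-to-host copies and
// return the average bandwidth in MB/s (1 MB = 1e6 bytes).
static double d2hBandwidthMBs(void *h_dst, const void *d_src, size_t bytes, int reps)
{
    // Make sure no earlier GPU work is still pending before starting the clock.
    check(cudaDeviceSynchronize(), "cudaDeviceSynchronize");
    auto t0 = std::chrono::steady_clock::now();
    for (int i = 0; i < reps; ++i)
        check(cudaMemcpy(h_dst, d_src, bytes, cudaMemcpyDeviceToHost), "cudaMemcpy");
    auto t1 = std::chrono::steady_clock::now();
    double seconds = std::chrono::duration<double>(t1 - t0).count();
    return (double)bytes * reps / seconds / 1e6;
}

int main()
{
    const size_t bytes = 1024 * 1024 * sizeof(float);  // 4 MiB, as in the question
    const int reps = 20;                                // arbitrary repetition count

    float *d_buf = NULL;
    check(cudaMalloc((void **)&d_buf, bytes), "cudaMalloc");
    check(cudaMemset(d_buf, 0, bytes), "cudaMemset");

    // Ordinary (pageable) host memory.
    float *h_pageable = (float *)malloc(bytes);
    if (h_pageable == NULL) {
        fprintf(stderr, "malloc failed\n");
        return EXIT_FAILURE;
    }

    // Page-locked (pinned) host memory, which the GPU can DMA from directly.
    float *h_pinned = NULL;
    check(cudaMallocHost((void **)&h_pinned, bytes), "cudaMallocHost");

    printf("pageable: %.1f MB/s\n", d2hBandwidthMBs(h_pageable, d_buf, bytes, reps));
    printf("pinned:   %.1f MB/s\n", d2hBandwidthMBs(h_pinned, d_buf, bytes, reps));

    check(cudaFreeHost(h_pinned), "cudaFreeHost");
    free(h_pageable);
    check(cudaFree(d_buf), "cudaFree");
    return 0;
}
```

If the pinned figure comes out clearly higher, allocating the host buffer that cublasGetVector() writes into with cudaMallocHost() instead of malloc() is one thing worth trying.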


Thanks a lot for your response!

I get about 745 MB/s from host to device and about 685 MB/s from device to host.

Here is the way I am doing the time measurements:


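unsigned int timer = 0;
cutilCheckError(cutCreateTimer(&timer));  // assumed: timer creation is missing from the post
cutilCheckError(cutStartTimer(timer));    // assumed: timer start is missing from the post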

// d_back_average = (1/COLUMNS) * d_back_buff * d_unitary_vector
cublasSgemv('n', ROWS, COLUMNS, 1 / (float)COLUMNS,
            d_back_buff, ROWS, d_unitary_vector,
            1, 0.0f, d_back_average, 1);
status = cublasGetError();
if (status != CUBLAS_STATUS_SUCCESS) {
    fprintf(stderr, "!!!! kernel execution error.\n");
    return EXIT_FAILURE;
}


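// d_image_buff += -1.0 * d_back_average * d_unitary_vector^T (rank-1 update)
cublasSger(ROWS, COLUMNS, -1.0f, d_back_average,
           1, d_unitary_vector, 1, d_image_buff,
           ROWS);  // last argument (lda) assumed to be ROWS; the post is cut off here
status = cublasGetError();
if (status != CUBLAS_STATUS_SUCCESS) {
    fprintf(stderr, "!!!! kernel execution error.\n");
    return EXIT_FAILURE;
}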


// set up execution parameters
dim3 threads(BLOCK_SIZE, BLOCK_SIZE);
dim3 grid(COLUMNS / threads.x, ROWS / threads.y);
// execute the kernel
realToComplex<<< grid, threads >>>(d_image_buff, d_image_complex_buff, COLUMNS, ROWS);
// check if kernel execution generated an error
cutilCheckMsg("Kernel execution failed");

// alpha = 1 + 0i, beta = 0 + 0i
alphaParameterComplex.x = 1.0f;
alphaParameterComplex.y = 0.0f;
betaParameterComplex.x = 0.0f;
betaParameterComplex.y = 0.0f;

// d_result_buff = d_vander_Monde_buff * d_image_complex_buff
cublasCgemm('n', 'n', ROWS, COLUMNS,
            ROWS, alphaParameterComplex, d_vander_Monde_buff, ROWS,
            d_image_complex_buff, ROWS, betaParameterComplex,
            d_result_buff, ROWS);
status = cublasGetError();
if (status != CUBLAS_STATUS_SUCCESS) {
    fprintf(stderr, "!!!! kernel execution error.\n");
    return EXIT_FAILURE;
}


// set up execution parameters
dim3 threads2(BLOCK_SIZE, BLOCK_SIZE);
dim3 grid2(COLUMNS / threads2.x, ROWS / threads2.y);
// execute the kernel
abs_complex<<< grid2, threads2 >>>(d_image_buff, d_result_buff, COLUMNS);

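// stop and destroy timer
cutilCheckError(cutStopTimer(timer));    // assumed: the stop call is missing from the post
printf("Processing time: %f (ms) \n", cutGetTimerValue(timer));
cutilCheckError(cutDeleteTimer(timer));  // assumed, going by the comment above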


// Getting result back to create the complex image matrix
status = cublasGetVector(lSize, sizeof(d_image_buff[0]), d_image_buff, 1, h_image_buff, 1);



When I run the code as displayed above, I get around 0.6 ms to 1 ms.

However, when I stop the timer after the cublasGetVector() call, I get around 30 ms. When I put the timer only around the cublasGetVector() call, I get around 29 ms.

Is this the correct way of measuring the time?

Is there a way to ensure that all kernels have completed execution? (Can I use __syncthreads() outside a kernel?)

I put this code into a DLL and am using it for my real-time system (using LabVIEW for acquisition), but I am having problems. Sometimes the cublasSetVector() call will just fail, and I am assuming it is because other kernels that use this data have not completed yet. It would be great if someone could give me a hint or comment about how to ensure that the kernels have completed execution.

Thanks a lot in advance!

That is very slow. For a 16 lane PCI-e v1 slot, I would expect something closer to 2 GB/s. Is the CUDA card in the 16 lane or the 4 lane x16 slot?

There is a function, cudaThreadSynchronize(), which you can and should use for timing kernels or asynchronous operations (this includes cublas functions, but not copy operations). So to time a kernel execution you should do something like this (in pseudocode):

    start timer
    launch kernel (or call cublas function)
    cudaThreadSynchronize()
    stop timer

That should ensure that the host blocks until the kernel finishes and your timing is correct. I didn’t read your code, so I am not sure whether you are timing correctly or not. I find those scrolling code boxes to be intolerably hard to read, it is like trying to read a newspaper through a letterbox slot…
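To make the effect concrete, here is a small self-contained sketch of that timing pattern. The kernel (scaleKernel), its sizes, and the msSince helper are hypothetical stand-ins, not anything from the code in the question. It uses cudaDeviceSynchronize(), which replaced cudaThreadSynchronize() in later CUDA releases; on an older toolkit, substitute the older name. The first reading is taken right after the launch and the second after the synchronize, so the gap between them is roughly the kernel time that would otherwise get attributed to whatever blocking call comes next.

```cuda
// Sketch: host-side timing of an asynchronous kernel launch, with and without sync.
#include <cstdio>
#include <cstdlib>
#include <chrono>
#include <cuda_runtime.h>

// Hypothetical stand-in kernel: multiply every element by a constant.
__global__ void scaleKernel(float *data, int n, float factor)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] *= factor;
}

// Hypothetical helper: milliseconds elapsed since t0 on the host clock.
static double msSince(std::chrono::steady_clock::time_point t0)
{
    auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(t1 - t0).count();
}

int main()
{
    const int n = 1024 * 1024;
    const int threads = 256;
    const int blocks = (n + threads - 1) / threads;

    float *d_data = NULL;
    if (cudaMalloc((void **)&d_data, n * sizeof(float)) != cudaSuccess) {
        fprintf(stderr, "cudaMalloc failed\n");
        return EXIT_FAILURE;
    }
    cudaMemset(d_data, 0, n * sizeof(float));

    // Warm-up launch so one-time startup costs are not part of the measurement.
    scaleKernel<<<blocks, threads>>>(d_data, n, 2.0f);
    cudaDeviceSynchronize();  // start the measurement from an idle GPU

    auto t0 = std::chrono::steady_clock::now();
    scaleKernel<<<blocks, threads>>>(d_data, n, 2.0f);
    double launchMs = msSince(t0);  // launch has returned; the kernel may still be running

    cudaError_t err = cudaDeviceSynchronize();  // block the host until the kernel finishes
    double totalMs = msSince(t0);               // launch plus actual execution

    if (err != cudaSuccess) {
        fprintf(stderr, "kernel failed: %s\n", cudaGetErrorString(err));
        return EXIT_FAILURE;
    }

    printf("after launch only: %f ms\n", launchMs);
    printf("after synchronize: %f ms\n", totalMs);

    cudaFree(d_data);
    return 0;
}
```

The same idea applies to the cublas calls in the question: put the synchronize call after the last kernel or cublas call and before stopping the timer.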