CUBLAS question cublasGetVector() call

bluestorm · November 17, 2009, 7:53pm

Hi All!

I have a question about using the CUBLAS library that I am hoping someone can answer. I am using this library for a real-time system and its processing time is great! (My kernel takes about 1ms to complete what takes 2-3min in Matlab.) The main delay seems to be transferring the data between the host and the device. I understand that data transfer between the CPU and the device is slow, but I have a 1024x1024 array of floats and the cublasGetVector() call is taking about 28ms.
Does this time sound correct?
Is there any way this can be improved?
I am using an ASUS Commando motherboard with a GTX275 board. Are the Tesla boards quicker for memory transfer?
Thanks for your help!

avidday · November 17, 2009, 9:09pm

That is 1024*1024*4/28e-3 ≈ 150MB/s, which is improbably low for a full 16 lane PCI-e 1.0 slot. You can confirm the pinned and pageable bandwidth performance of your card/motherboard with the SDK bandwidth test, and I am guessing you will get numbers at least 10x the cublasGetVector() throughput you are quoting. But I have a feeling that what is really happening is that your timings are wrong. Your kernel is taking much longer than you think it is, and what you are attributing to memory copy time is really kernel running time (remember that all kernel launches, including cublas, are asynchronous to the host).

Your GTX-275 should be about as good as it gets in host-device bandwidth. No current Tesla will be any faster.
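For anyone who wants to check the pinned versus pageable numbers without the SDK sample, here is a minimal sketch of that kind of measurement. It is not the SDK bandwidth test itself; the helper names `mbPerSec` and `timeD2H` are made up for this example, and the buffer size is simply the 1024x1024 float array from the question. It times device-to-host `cudaMemcpy` calls with CUDA events, once into a `malloc` buffer and once into a `cudaMallocHost` (pinned) buffer.

```cuda
#include <cassert>
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical helper: convert bytes and elapsed milliseconds to MB/s (1 MB = 10^6 bytes).
static double mbPerSec(size_t bytes, float ms) {
    assert(ms > 0.0f);
    return (bytes / 1.0e6) / (ms / 1.0e3);
}

// Hypothetical helper: average time in ms of `reps` device-to-host copies, measured with CUDA events.
static float timeD2H(void *h_dst, const void *d_src, size_t bytes, int reps) {
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start, 0);
    for (int i = 0; i < reps; ++i)
        cudaMemcpy(h_dst, d_src, bytes, cudaMemcpyDeviceToHost);
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);  // wait until the last copy has finished
    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ms / reps;
}

int main() {
    const size_t bytes = 1024 * 1024 * sizeof(float);  // same size as in the question
    const int reps = 20;
    float *d_buf = NULL, *h_pinned = NULL;
    float *h_pageable = (float *)malloc(bytes);
    if (h_pageable == NULL ||
        cudaMalloc((void **)&d_buf, bytes) != cudaSuccess ||
        cudaMallocHost((void **)&h_pinned, bytes) != cudaSuccess) {
        fprintf(stderr, "allocation failed\n");
        return EXIT_FAILURE;
    }
    cudaMemset(d_buf, 0, bytes);
    timeD2H(h_pageable, d_buf, bytes, 1);  // warm-up copy, result discarded

    float msPageable = timeD2H(h_pageable, d_buf, bytes, reps);
    float msPinned = timeD2H(h_pinned, d_buf, bytes, reps);
    printf("pageable: %.3f ms per copy, %.1f MB/s\n", msPageable, mbPerSec(bytes, msPageable));
    printf("pinned:   %.3f ms per copy, %.1f MB/s\n", msPinned, mbPerSec(bytes, msPinned));

    cudaFreeHost(h_pinned);
    cudaFree(d_buf);
    free(h_pageable);
    return EXIT_SUCCESS;
}
```

If both figures come out far above 150MB/s, the 28ms is probably not copy time at all.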

bluestorm · November 19, 2009, 3:45pm

Hi!

Thanks a lot for your response!

I get about 745MB/s from host to device and about 685MB/s from device to host.

Here is the way I am doing the time measurements:

[codebox]
unsigned int timer = 0;
cutilCheckError(cutCreateTimer(&timer));
cutilCheckError(cutStartTimer(timer));

cublasSgemv ('n', ROWS, COLUMNS, 1/(float)COLUMNS,
			d_back_buff, ROWS, d_unitary_vector,
			1, 0.0f, d_back_average, 1);
status = cublasGetError();
if (status != CUBLAS_STATUS_SUCCESS) {
    fprintf (stderr, "!!!! kernel execution error.\n");
    return EXIT_FAILURE;
}

cublasSger (ROWS, COLUMNS, -1.0f, d_back_average,
			1, d_unitary_vector, 1, d_image_buff,
			ROWS);
status = cublasGetError();
if (status != CUBLAS_STATUS_SUCCESS) {
    fprintf (stderr, "!!!! kernel execution error.\n");
    return EXIT_FAILURE;
}

// setup execution parameters
dim3 threads(BLOCK_SIZE, BLOCK_SIZE);
dim3 grid(COLUMNS / threads.x, ROWS / threads.y);

// execute the kernel
realToComplex<<< grid, threads >>>(d_image_buff,d_image_complex_buff,COLUMNS,ROWS);

// check if kernel execution generated an error
cutilCheckMsg("Kernel execution failed");

alphaParameterComplex.x = 1.0f;
alphaParameterComplex.y = 0.0f;
betaParameterComplex.x = 0.0f;
betaParameterComplex.y = 0.0f;

cublasCgemm ('n', 'n', ROWS, COLUMNS,
			ROWS, alphaParameterComplex, d_vander_Monde_buff, ROWS,
			d_image_complex_buff, ROWS, betaParameterComplex,
			d_result_buff, ROWS);
status = cublasGetError();
if (status != CUBLAS_STATUS_SUCCESS) {
    fprintf (stderr, "!!!! kernel execution error.\n");
    return EXIT_FAILURE;
}

// setup execution parameters
dim3 threads2(BLOCK_SIZE, BLOCK_SIZE);
dim3 grid2(COLUMNS / threads2.x, ROWS / threads2.y);

// execute the kernel
abs_complex<<< grid2, threads2 >>>(d_image_buff,d_result_buff,COLUMNS);

// stop and destroy timer
cutilCheckError(cutStopTimer(timer));
printf("Processing time: %f (ms) \n", cutGetTimerValue(timer));
cutilCheckError(cutDeleteTimer(timer));

// Getting result back to create the complex image matrix
status = cublasGetVector(lSize, sizeof(d_image_buff[0]), d_image_buff, 1, h_image_buff, 1);
if (status != CUBLAS_STATUS_SUCCESS) {return ERR_CUDA_CUBLAS;}
[/codebox]

When I run the code as displayed above I get around 0.6ms - 1ms.

However, when I stop the timer after the cublasGetVector() call I get around 30ms. When I put the timer only around the cublasGetVector() call, I get around 29ms.

Is this the correct way of measuring the time?

Is there a way to ensure that all kernels have completed execution? (Can I use __syncthreads() outside a kernel?)

I put this code into a DLL and am using it for my real-time system (using Labview for acquisition), but I am having problems. Sometimes the cublasSetVector() call will just fail, and I am assuming it is because other kernels that are using this data have not yet completed. It would be great if someone could give me a hint/comment about how to ensure that the kernels have completed execution.

Thanks a lot in advance!

avidday · November 19, 2009, 3:59pm

That is very slow. For a 16 lane PCI-e v1, I would expect something closer to 2GB/s. Is the CUDA card in the 16 lane or 4 lane x16 slot?

There is a function cudaThreadSynchronize() which you can and should use for timing kernels or asynchronous operations (this includes cublas functions, but not copy operations). So to time a kernel execution you should do something like this (in pseudocode):

start_timer();
kernel<<<blocks,threads,memory>>>();
cudaThreadSynchronize();
stop_timer();

That should ensure that the host blocks until the kernel finishes and your timing is correct. I didn’t read your code, so I am not sure whether you are timing correctly or not. I find those scrolling code boxes to be intolerably hard to read, it is like trying to read a newspaper through a letterbox slot…
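A runnable version of that pattern might look like the sketch below. The kernel `busyScale` and the helpers `elapsedMs` and `timeLaunchMs` are hypothetical stand-ins, not anything from the code posted above, and the host timer is `std::chrono` rather than the cutil timer. It also calls `cudaDeviceSynchronize()`, which replaced the now-deprecated `cudaThreadSynchronize()` in later CUDA releases. The same launch is timed twice: once stopping the clock straight after the launch, and once after synchronizing.

```cuda
#include <cassert>
#include <chrono>
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical stand-in kernel: loops per element so the run time is easy to see.
__global__ void busyScale(float *data, int n, int iters) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float v = data[i];
        for (int k = 0; k < iters; ++k) v = v * 1.0001f + 0.5f;
        data[i] = v;
    }
}

// Host wall-clock milliseconds elapsed since t0.
static double elapsedMs(std::chrono::steady_clock::time_point t0) {
    return std::chrono::duration<double, std::milli>(
               std::chrono::steady_clock::now() - t0).count();
}

// Launch the kernel and return the host-side time in ms.
// If `sync` is false the clock stops right after the asynchronous launch returns.
static double timeLaunchMs(float *d_data, int n, bool sync) {
    assert(n > 0);
    const int threads = 256;
    const int blocks = (n + threads - 1) / threads;
    std::chrono::steady_clock::time_point t0 = std::chrono::steady_clock::now();
    busyScale<<<blocks, threads>>>(d_data, n, 2000);
    if (sync) cudaDeviceSynchronize();  // block the host until the kernel is done
    double ms = elapsedMs(t0);
    cudaDeviceSynchronize();            // drain the queue so the next measurement starts clean
    return ms;
}

int main() {
    const int n = 1024 * 1024;
    float *d_data = NULL;
    if (cudaMalloc((void **)&d_data, n * sizeof(float)) != cudaSuccess) {
        fprintf(stderr, "cudaMalloc failed\n");
        return EXIT_FAILURE;
    }
    cudaMemset(d_data, 0, n * sizeof(float));
    timeLaunchMs(d_data, n, true);  // warm-up, the first launch carries setup cost

    double launchOnly = timeLaunchMs(d_data, n, false);
    double withSync = timeLaunchMs(d_data, n, true);
    printf("timer stopped after launch only: %.3f ms\n", launchOnly);
    printf("timer stopped after synchronize: %.3f ms\n", withSync);

    cudaError_t err = cudaGetLastError();
    if (err != cudaSuccess)
        fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(err));
    cudaFree(d_data);
    return err == cudaSuccess ? EXIT_SUCCESS : EXIT_FAILURE;
}
```

The first figure should only reflect launch overhead, while the second includes the kernel's actual run time. Without the synchronize, that run time ends up charged to whatever blocking call comes next, such as a copy back to the host.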
