Time taken by cublasSetVector() makes my application worse

preetib · October 24, 2007, 5:33am

I am using cublasSetVector() in my application to copy data from CPU to GPU.
But it takes a long time to do this, approx. 12 ms (array size is 2,07,360). :o

If this is the performance, then there is no point in using an NVIDIA graphics card for me, as it makes my application’s performance worse :( .

Anyway, can anyone help me out with this issue?

sicb0161 · October 24, 2007, 8:42am

Are you sure about the 12 ms? For me it takes less than 2 ms, with a data transfer rate of about 2 GB/s. (I used cudaMallocHost for pinned-memory allocation.)

Ideally, you should design your algorithm so that you first allocate memory (as much as you need and, of course, as much as is available), then operate on the data without any transfers to or from the host's RAM, and send the results back after you have finished your computation. You might have to redesign your algorithm if your code uses too many host ↔ device transfer ops.

Cem

preetib · October 24, 2007, 9:26am

I want to add something to my previous question.

Yes, it takes 12 ms to copy 2 arrays, each of size 2,07,360, to GPU memory.

And in my application it’s necessary to copy these arrays to the GPU, as only after doing this will I be able to apply the CUBLAS APIs to them.

So what should I do about this?

Thanks in advance :)

sicb0161 · October 24, 2007, 10:19am

Hmm, I am not sure about the 12 ms. An array of N = 207360 elements needs 810 KB of memory if the element type is float.
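As a quick check of that figure, here is a minimal sketch of the size arithmetic. The helper names bufferBytes and bufferKiB are made up for illustration, and 4-byte floats are assumed.

```cpp
#include <cassert>
#include <cstddef>

// Bytes occupied by `count` elements of `elemSize` bytes each.
// (Hypothetical helper, not part of CUDA or CUBLAS.)
inline std::size_t bufferBytes(std::size_t count, std::size_t elemSize) {
    assert(elemSize > 0);
    return count * elemSize;
}

// The same size expressed in KiB (1 KiB = 1024 bytes).
inline double bufferKiB(std::size_t count, std::size_t elemSize) {
    return static_cast<double>(bufferBytes(count, elemSize)) / 1024.0;
}
```

With count = 207360 and elemSize = 4 this gives 829440 bytes, i.e. 810 KiB per array, so two such arrays are about 1.6 MB in total.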

Here are some results I get when I use cublasSetVector:

[Figure: transfer time, cublasSetVector]

[Figure: data rate, cublasSetVector]

Of course the transfer time depends on your PCI Express interface and memory latency. If your motherboard supports only PCIe x1, then you have a maximum transfer rate of 250 MB/s in one direction. I have PCIe x16, so I might theoretically get rates of 4 GB/s.

If your code is computationally intensive (e.g. many BLAS level 3 operations), then you can be sure that you will get a great speed-up.

vvolkov · October 24, 2007, 2:54pm

On my machine cublasSetVector has 13 ms latency (GeForce 8800 GTX, Windows XP). The same goes for all CUBLAS calls: they have ~10 ms overhead. However, NVIDIA promises to reduce these overheads substantially in the next release of CUDA, which is coming soon.

I wonder how you could get less than 1 ms, as in the graph, with the current version of CUDA. Maybe it depends on the operating system?

UPD: I mean microseconds, not ms! 13 us latency, ~10 us overhead and no wondering any more. Sorry for the “typo”. I was misguided by number 12, that looks so painfully familiar.

MisterAnderson42 · October 24, 2007, 3:40pm

Are you timing the first copy? The first call to any CUDA function will initialize the driver context, copy stuff to the GPU, etc… and takes a while. After the first call, the overhead for a copy should only be ~20 microseconds.
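One way to separate the first call from the rest is sketched below. CallTiming and timeCalls are hypothetical names, and `op` is a stand-in for whatever copy is being measured (for example a wrapper around the cublasSetVector call). The sketch uses a host clock, so it assumes `op` has finished by the time it returns; for asynchronous work you would have to synchronize inside `op`.

```cpp
#include <cassert>
#include <chrono>

// Timing of a repeated operation: the first (warm-up) call is reported
// separately from the average of the later calls, both in microseconds.
struct CallTiming {
    double firstUs;
    double steadyAvgUs;
};

// Calls `op` once as a warm-up, then `reps` more times, timing both phases.
// `op` is a hypothetical stand-in for the host-to-device copy under test.
template <typename Op>
CallTiming timeCalls(Op op, int reps) {
    assert(reps > 0);
    using Clock = std::chrono::steady_clock;
    using Us = std::chrono::duration<double, std::micro>;

    auto t0 = Clock::now();
    op();  // first call: may include one-time driver/context initialization
    auto t1 = Clock::now();
    for (int i = 0; i < reps; ++i) op();
    auto t2 = Clock::now();

    return {Us(t1 - t0).count(), Us(t2 - t1).count() / reps};
}
```

If firstUs is in the millisecond range while steadyAvgUs is far smaller, the reported 12 ms is most likely initialization cost rather than transfer time.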

sicb0161 · October 24, 2007, 4:35pm

I don't know why I am under 1 ms, but I just tested it again. Maybe it's because I am using pinned memory allocation? The transfer times are below 1 ms.

vvolkov · October 24, 2007, 7:44pm

Oh my, I said ms! Of course I meant microseconds, not milliseconds. I’m sorry for the confusion.

In that case 12 ms is unusual. In this time you should be able to transfer ~36 MB at PCIe x16 rate. If you don’t use pinned memory, you may get half of it, ~18 MB.
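The arithmetic behind these numbers can be sketched as follows. The helper names are made up for illustration; the 3 GB/s rate is simply what ~36 MB in 12 ms implies, and per-call overhead is ignored.

```cpp
#include <cassert>

// Estimated time in milliseconds to move `bytes` at `bytesPerSecond`,
// ignoring any fixed per-call overhead. (Hypothetical helper.)
inline double transferMs(double bytes, double bytesPerSecond) {
    assert(bytesPerSecond > 0.0);
    return bytes / bytesPerSecond * 1000.0;
}

// Amount of data in bytes that can be moved in `ms` milliseconds
// at `bytesPerSecond`.
inline double bytesInMs(double ms, double bytesPerSecond) {
    return bytesPerSecond * ms / 1000.0;
}
```

For example, bytesInMs(12.0, 3.0e9) is 36e6 bytes, while transferMs(1658880.0, 3.0e9) for two arrays of 207360 floats comes to roughly 0.55 ms, or about 1.1 ms at half that rate. Either way that is well below the 12 ms being reported.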

Does size 2,07,360 mean 207360 floats?

preetib · October 25, 2007, 9:40am

Yes, my array contains float data (207360 elements). And it takes 12 milliseconds to copy the data from CPU to GPU.

From your results it seems that something is going wrong on my side.

Can you please tell me in detail how you are using this API?

And what is pinned memory?

Thanks in advance. :)

sicb0161 · October 25, 2007, 11:49am

Okay, I was a bit confused by the notation. So ms really means milliseconds and us means microseconds.

Well, I am using the API as it is described in the CUBLAS guide. I don't know what one can do wrong … maybe you should post your code?

Pinned (page-locked) memory means that you force the data to stay in physical RAM, so it cannot be paged out to disk. This type of allocation allows better transfer rates because the GPU can read it directly via DMA, without an extra staging copy.

However, one should be careful with this type of allocation, because it might crash your system due to a lack of memory, which is needed by your OS and the other processes running on the system.
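To make the difference concrete, here is a minimal sketch (not the code used for the results posted earlier in this thread) that copies the same array with cublasSetVector from ordinary and from pinned host memory. It assumes a current CUDA toolkit with cublas_v2.h and linking against cublas (e.g. nvcc with -lcublas); the function name avgSetVectorMs, the repeat count and the fill value are illustrative only.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>
#include <cublas_v2.h>

// Average milliseconds per cublasSetVector call over `reps` copies of n floats.
// One untimed warm-up copy absorbs any one-time initialization cost.
// Returns a negative value if the warm-up copy fails.
static float avgSetVectorMs(const float* host, float* dev, int n, int reps) {
    if (cublasSetVector(n, sizeof(float), host, 1, dev, 1) != CUBLAS_STATUS_SUCCESS)
        return -1.0f;

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start, 0);
    for (int i = 0; i < reps; ++i)
        cublasSetVector(n, sizeof(float), host, 1, dev, 1);
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ms / reps;
}

int main() {
    const int n = 207360;   // array length from this thread
    const int reps = 10;    // illustrative repeat count
    const size_t bytes = n * sizeof(float);

    float* pageable = static_cast<float*>(std::malloc(bytes));  // ordinary host memory
    float* pinned = nullptr;                                    // page-locked host memory
    float* dev = nullptr;                                       // device buffer
    if (!pageable ||
        cudaMallocHost(reinterpret_cast<void**>(&pinned), bytes) != cudaSuccess ||
        cudaMalloc(reinterpret_cast<void**>(&dev), bytes) != cudaSuccess) {
        std::fprintf(stderr, "allocation failed\n");
        return 1;
    }
    for (int i = 0; i < n; ++i) { pageable[i] = 1.0f; pinned[i] = 1.0f; }

    std::printf("pageable: %.3f ms per copy\n", avgSetVectorMs(pageable, dev, n, reps));
    std::printf("pinned:   %.3f ms per copy\n", avgSetVectorMs(pinned, dev, n, reps));

    cudaFree(dev);
    cudaFreeHost(pinned);   // pinned memory must be released with cudaFreeHost
    std::free(pageable);
    return 0;
}
```

The actual numbers will depend on the GPU, the PCIe slot and the driver, so none are shown here. Keeping pinned allocations small and freeing them promptly limits the memory-pressure risk mentioned above.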

MisterAnderson42 · October 25, 2007, 12:51pm

Rules of troubleshooting: search for root cause.

Run the bandwidth test from the SDK and see what kind of transfer rates you get. You should be getting ~1.5GiB/s for normal memory and ~3.0GiB/s in the pinned mode. Please post the output here.
You never answered my question if you were timing the first call or not. Modify your code to make the setVector call 10 times in a row. Then set the CUDA_PROFILE environment variable to 1 (export CUDA_PROFILE=1 if you are using bash). Then run your program. Post the output of cuda_profile.log here.
