Time taken by cublasSetVector() makes my application worse

I am using cublasSetVector() in my application to copy data from CPU to GPU.
But it takes a lot of time to do this, approx. 12 ms (array size is 2,07,360).

If this is the performance, then there is no point in using an NVIDIA graphics card for me, as it makes my application’s performance worse :(

Anyway, can anyone help me out with this issue?

Are you sure about the 12 ms? For me it takes less than 2 ms, with a data transfer rate of about 2 GB/s. (I used cudaMallocHost for pinned-memory allocation.)

Ideally, you should design your algorithm so that you first allocate memory (as much as you need and, of course, as much as is available), then operate on the data without any transfers to or from the host's RAM, and send the results back only after you have finished your computation. You might have to redesign your algorithm if your code uses too many host <-> device transfer ops.
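To make the "upload once, compute many times, download once" advice concrete, here is a minimal sketch. It assumes the handle-based cuBLAS API (cublas_v2.h) rather than whatever CUBLAS version is used in this thread, and the function name run_saxpy_on_device, the SAXPY workload and the iteration count are made up purely for illustration.

```cuda
// Build (assumption): nvcc keep_on_device.cu -lcublas
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>
#include <cublas_v2.h>

// Hypothetical helper: returns y[0] after `iters` SAXPY updates,
// or -1.0f if any CUDA/cuBLAS call fails.
static float run_saxpy_on_device(int n, int iters) {
    std::vector<float> x(n, 1.0f), y(n, 2.0f);
    const size_t bytes = static_cast<size_t>(n) * sizeof(float);
    const int elem = static_cast<int>(sizeof(float));

    cublasHandle_t handle;
    if (cublasCreate(&handle) != CUBLAS_STATUS_SUCCESS) return -1.0f;

    // 1. Allocate device memory once.
    float *dx = nullptr, *dy = nullptr;
    bool ok = cudaMalloc((void**)&dx, bytes) == cudaSuccess &&
              cudaMalloc((void**)&dy, bytes) == cudaSuccess;

    // 2. Upload the inputs once.
    ok = ok && cublasSetVector(n, elem, x.data(), 1, dx, 1) == CUBLAS_STATUS_SUCCESS;
    ok = ok && cublasSetVector(n, elem, y.data(), 1, dy, 1) == CUBLAS_STATUS_SUCCESS;

    // 3. Run many BLAS calls on data that stays on the GPU.
    const float alpha = 0.5f;
    for (int i = 0; ok && i < iters; ++i)
        ok = cublasSaxpy(handle, n, &alpha, dx, 1, dy, 1) == CUBLAS_STATUS_SUCCESS;  // y = alpha*x + y

    // 4. Download only the final result.
    ok = ok && cublasGetVector(n, elem, dy, 1, y.data(), 1) == CUBLAS_STATUS_SUCCESS;

    cudaFree(dx);
    cudaFree(dy);
    cublasDestroy(handle);
    return ok ? y[0] : -1.0f;
}

int main() {
    const int n = 207360;  // array length from this thread
    printf("y[0] after 100 updates: %f\n", run_saxpy_on_device(n, 100));
    return 0;
}
```

With this layout the two uploads and one download are paid once, no matter how many BLAS calls run in between.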


I want to add something to my previous question.

Yes, it takes 12 ms to copy 2 arrays, each of size 2,07,360, to GPU memory.

And in my application it’s necessary to copy these arrays to the GPU, since only after that can I apply the CUBLAS APIs to them.

So what should I do about this?

Thanks in advance :)

Hmm, I am not sure about the 12 ms. An array of size N = 207360 needs 810 KB of memory if the element type is float.

Here are some results I get when I use cublasSetVector:

[Graph: transfer time, cublasSetVector]

[Graph: data rate, cublasSetVector]

Of course the transfer time depends on your PCI Express interface and memory latency. If your motherboard supports only PCIe x1, then you have a maximum transfer rate of 250 MB/s in one direction. I have PCIe x16, so I can theoretically reach rates of 4 GB/s.

If your code is computationally intensive (e.g. many BLAS 3 operations), then you can be sure that you will get a great speed-up.

On my machine cublasSetVector has 13 ms latency (GeForce 8800 GTX, Windows XP). The same goes for all CUBLAS calls: they have ~10 ms overhead. However, NVIDIA promises to reduce these overheads substantially in the next release of CUDA, which is coming soon.

I wonder how you could get less than 1 ms, as in the graph, with the current version of CUDA. Maybe it depends on the operating system?

UPD: I mean microseconds, not ms! 13 us latency, ~10 us overhead and no wondering any more. Sorry for the “typo”. I was misguided by number 12, that looks so painfully familiar.

Are you timing the first copy? The first call to any cuda function will initialize the driver context, copy stuff to the GPU, etc… and takes a while. After the first call, the overhead for a copy should only be ~20 microseconds.
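A rough way to check this is to time the initialization separately and then time each copy on its own. The sketch below is only an illustration: it assumes cublas_v2.h and a host-side wall clock, uses cudaFree(0) as a common idiom to force context creation, and the helper us_since is a made-up name.

```cuda
// Build (assumption): nvcc first_call.cu -lcublas
#include <chrono>
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>
#include <cublas_v2.h>

using Clock = std::chrono::steady_clock;

// Hypothetical helper: microseconds elapsed since t0 on the host clock.
static double us_since(Clock::time_point t0) {
    return std::chrono::duration<double, std::micro>(Clock::now() - t0).count();
}

int main() {
    const int n = 207360;  // array length from this thread
    std::vector<float> host(n, 1.0f);

    // Force context creation up front and time it on its own.
    Clock::time_point t0 = Clock::now();
    cudaFree(0);
    printf("context init: %.0f us\n", us_since(t0));

    float* dev = nullptr;
    if (cudaMalloc((void**)&dev, n * sizeof(float)) != cudaSuccess) {
        printf("cudaMalloc failed\n");
        return 1;
    }

    // Time ten identical copies; compare the first with the rest.
    for (int i = 0; i < 10; ++i) {
        t0 = Clock::now();
        cublasStatus_t st = cublasSetVector(n, (int)sizeof(float),
                                            host.data(), 1, dev, 1);
        cudaDeviceSynchronize();  // make sure the copy has really finished
        printf("copy %d: %.0f us%s\n", i, us_since(t0),
               st == CUBLAS_STATUS_SUCCESS ? "" : " (failed)");
    }

    cudaFree(dev);
    return 0;
}
```

If only the first line or two are slow and the rest are fast, the 12 ms is start-up cost rather than transfer cost.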

I don't know why I am under 1 ms, but I just tested it again. Maybe it's because I am using pinned memory allocation? The transfer times are below 1 ms.

Oh my, I said ms! Of course I meant microseconds, not milliseconds. I’m sorry for the confusion.

In that case 12 ms is unusual. In this time you should be able to transfer ~36 MB at PCIe x16 rate. If you don’t use pinned memory, you may get half of it, ~18 MB.

Does size 2,07,360 mean 207360 floats?

Yes, my array contains float data (207360 elements). And it takes 12 milliseconds to copy the data from CPU to GPU.

From your results it seems that something is going wrong on my side.

Can you please tell me in detail how you are using this API?

And what is pinned memory?

Thanks in advance. :)

Okay, I was a bit confused because of the notation ms. So ms really means milliseconds and us means microseconds.

Well, I am using the API as it is described in the CUBLAS guide. I don't know what one can do wrong there … maybe you should post your code?

Pinned (page-locked) memory means that the data is forced to stay in physical RAM and can never be paged out to disk. This type of allocation allows better transfer rates, because the GPU can read it directly by DMA without an extra staging copy.

However, one should be careful with this type of allocation: pinning too much can destabilize your system, because it takes away memory that is needed by your OS and the other processes running on the system.
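Here is a small sketch comparing a pageable buffer (plain malloc) with a pinned one (cudaMallocHost) for the same cublasSetVector call. It is an illustration under assumptions, not a proper benchmark: it assumes cublas_v2.h, times with CUDA events, and the helper name time_copy_ms and the repetition count are made up.

```cuda
// Build (assumption): nvcc pinned_vs_pageable.cu -lcublas
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>
#include <cublas_v2.h>

// Hypothetical helper: average milliseconds per host-to-device copy over `reps` copies.
static float time_copy_ms(const float* host, float* dev, int n, int reps) {
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    for (int i = 0; i < reps; ++i)
        cublasSetVector(n, (int)sizeof(float), host, 1, dev, 1);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ms / reps;
}

int main() {
    const int n = 207360, reps = 100;  // array length from this thread
    const size_t bytes = static_cast<size_t>(n) * sizeof(float);

    float* pageable = static_cast<float*>(malloc(bytes));  // ordinary host memory
    float* pinned = nullptr;                               // page-locked host memory
    float* dev = nullptr;
    if (!pageable ||
        cudaMallocHost((void**)&pinned, bytes) != cudaSuccess ||
        cudaMalloc((void**)&dev, bytes) != cudaSuccess) {
        printf("allocation failed\n");
        return 1;
    }
    for (int i = 0; i < n; ++i) pageable[i] = pinned[i] = 1.0f;

    time_copy_ms(pageable, dev, n, 1);  // warm-up, result discarded

    const float t_pageable = time_copy_ms(pageable, dev, n, reps);
    const float t_pinned = time_copy_ms(pinned, dev, n, reps);
    // bytes / (ms * 1e6) converts bytes per millisecond into GB/s.
    printf("pageable: %.3f ms/copy (%.2f GB/s)\n", t_pageable, bytes / (t_pageable * 1e6));
    printf("pinned:   %.3f ms/copy (%.2f GB/s)\n", t_pinned, bytes / (t_pinned * 1e6));

    cudaFree(dev);
    cudaFreeHost(pinned);  // pinned memory must be released with cudaFreeHost
    free(pageable);
    return 0;
}
```

Only the buffers that are actually transferred need to be pinned, which keeps the amount of locked RAM small.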

Rules of troubleshooting: search for root cause.

  1. Run the bandwidth test from the SDK and see what kind of transfer rates you get. You should be getting ~1.5GiB/s for normal memory and ~3.0GiB/s in the pinned mode. Please post the output here.

  2. You never answered my question about whether you were timing the first call. Modify your code to make the setVector call 10 times in a row. Then set the CUDA_PROFILE environment variable to 1 (export CUDA_PROFILE=1 if you are using bash). Then run your program. Post the output of cuda_profile.log here.