Memory Latencies for Small Data Transfers


I am currently using CUDA for supplementing my CPU in calculation of several numerical integrations. For this purpose, I have only a small amount of data, which has to be transferred between the host and the device, i.e. 7 floats from h2d and the results (~30 integral values) from d2h.
During benchmark of my program I realized the time for copying the data seems to be quite relevant. To check this timing (latency + copy time) I wrote a small program - it just copies 336 bytes from d2h (non async). So these are my results:

Minimal latency ~28us, mean latency ~38us

Here is what the program does:

cudaStatus = cudaMallocPitch(&dev_IntegralValues, &pitch, sizeof(float) * NO_OF_INTEGRALS, NO_OF_VARIABLES);

for(int i = 0; i < 100000; i++)
  t1 = high_resolution_clock::now();

  cudaStatus = cudaMemcpy2D(IntegralValues, sizeof(float) * NO_OF_INTEGRALS, dev_IntegralValues, pitch, sizeof(float) * NO_OF_INTEGRALS, NO_OF_VARIABLES, cudaMemcpyDeviceToHost);

  t2 = high_resolution_clock::now();
  time_tmp = duration_cast<duration<double>>(t2 - t1).count();
  if(time_tmp < time_min)
    time_min = time_tmp;
  if(time_tmp > time_max)
    time_max = time_tmp;
  time_mean += time_tmp;

After some reading I noticed time mentioned for latency in small data copies should be around 10us. Does anybody have any suggestion why the latencies are that large in my case? I already tried different memory copy options (with and without “unified” memory, page lock and non page locked, other GPU (GTX650)) but timings do not vary much. I guess it is some kind of Windows driver problem, but am not sure about that.

Following is my setup:
Windows 7 64 Bit
GTX 750Ti @PCIe 16x
Intel i7 2600K
Gigabyte Z68AP-D3

Any help is kindly appreciated!

cudaMemcpy2D is going to be slower than an ordinary cudaMemcpy in many cases, for the same amount of data transferred. cudaMemcpy2D is scheduling multiple (NO_OF_VARIABLES) DMA operations under the hood, whereas an ordinary cudaMemcpy with the same amount of data to be transferred can schedule the entire operation with a single DMA transfer, because the data is contiguous. The magic performed by cudaMemcpy2D is not free, in terms of timing. If it’s only a few values, and it’s convenient to do so, you will probably see better results if you can group those values contiguously to facilitate an ordinary cudaMemcpy to the host.

And yes, windows may be interfering, due to batching of WDDM commands. If you search elsewhere on this board, you’ll find comments from Greg@NV about this effect and possible workarounds for WDDM devices. If you use the latest nsight VSE, you can get an idea of the WDDM command queue at any point in time.

Thanks for the fast response! Honestly, I feel a littlebit like a newbee not recognizing the WDDM facts :( Anyway, I replaced the cudaMemcpy2D by cudaMemcpy not really resulting in a better performance. I guess WDDM (in combination with queuing) is responsible for large delays. Is there any chance to get rid of these problems (w/o explicit triggering of events) except buying a Quadro or Tesla card?
As an alternative, one could surely use Linux - does anyone have checked the latencies of memory copies for this OS?

This is what I get on CUDA 6.0, RHEL 5.5, Quadro5000 GPU (PCIE 2.0, cc2.0):

$ cat
#include <stdio.h>
#include <stdlib.h>

#include <time.h>
#include <sys/time.h>

#define DSIZE 32
#define LOOPS 100000
#define uS_PER_SEC 1000000

int main(){

  int *d_data, *h_data;
  timeval t1, t2;
  h_data = (int *)malloc(DSIZE*sizeof(int));
  cudaMalloc(&d_data, DSIZE*sizeof(int));

  gettimeofday(&t1, NULL);
  for (int i=0; i < LOOPS; i++)
    cudaMemcpy(h_data, d_data, DSIZE*sizeof(int), cudaMemcpyDeviceToHost);
  gettimeofday(&t2, NULL);
  float t_diff = (float)(((t2.tv_sec*uS_PER_SEC)+t2.tv_usec)-((t1.tv_sec*uS_PER_SEC)+t1.tv_usec));
  float t_per = t_diff/(float)LOOPS;
  printf("average time per copy: %fus\n", t_per);
  return 0;

$ nvcc -arch=sm_20 -o t462
$ ./t462
average time per copy: 9.841300us

Have many thanks! I will try to port my programm to linux in the next days - after some further investigations I saw the batch queue was filled in my original program and such I will suffer from delays in ~100us to 1ms, which is not acceptable for my application. I will post results as soon as I collected them!

You can insert cudaStreamQuery(0) to force immediate submission of the current batch.

Also, can’t you just turn the 7 floats transferred to the device into parameters to the kernel?

Of course, that is right - but unfortunately this does not really solve my problem. Transferring the floats as parameters seems to be a suitable idea at the moment, but I do not know how many parameters this will be in the future… Anyway, thanks for the cudaStreamQuery hint, I read something about cudaStreamQuery before and wondered why this did not work.


in the meantime I installed Linux and ported my project, resulting in dramatically reduced latencies. I now have app. 10µs per memory/kernel access, which seems acceptable to me. Drivers and IDE are working without problems, I was astonished how easy the porting process was!