Memory Latencies for Small Data Transfers

UberSycko · July 7, 2014, 9:15am

Hi,

I am currently using CUDA for supplementing my CPU in calculation of several numerical integrations. For this purpose, I have only a small amount of data, which has to be transferred between the host and the device, i.e. 7 floats from h2d and the results (~30 integral values) from d2h.
During benchmark of my program I realized the time for copying the data seems to be quite relevant. To check this timing (latency + copy time) I wrote a small program - it just copies 336 bytes from d2h (non async). So these are my results:

Minimal latency ~28us, mean latency ~38us

Here is what the program does:

cudaStatus = cudaMallocPitch(&dev_IntegralValues, &pitch, sizeof(float) * NO_OF_INTEGRALS, NO_OF_VARIABLES);

for(int i = 0; i < 100000; i++)
{
  t1 = high_resolution_clock::now();

  cudaStatus = cudaMemcpy2D(IntegralValues, sizeof(float) * NO_OF_INTEGRALS, dev_IntegralValues, pitch, sizeof(float) * NO_OF_INTEGRALS, NO_OF_VARIABLES, cudaMemcpyDeviceToHost);

  t2 = high_resolution_clock::now();
  time_tmp = duration_cast<duration<double>>(t2 - t1).count();
  if(time_tmp < time_min)
  {
    time_min = time_tmp;
  }
  if(time_tmp > time_max)
  {
    time_max = time_tmp;
  }
  time_mean += time_tmp;
}

After some reading I noticed time mentioned for latency in small data copies should be around 10us. Does anybody have any suggestion why the latencies are that large in my case? I already tried different memory copy options (with and without “unified” memory, page lock and non page locked, other GPU (GTX650)) but timings do not vary much. I guess it is some kind of Windows driver problem, but am not sure about that.

Following is my setup:
Windows 7 64 Bit
CUDA 6
GTX 750Ti @PCIe 16x
Intel i7 2600K
Gigabyte Z68AP-D3

Any help is kindly appreciated!

Robert_Crovella · July 7, 2014, 12:57pm

cudaMemcpy2D is going to be slower than an ordinary cudaMemcpy in many cases, for the same amount of data transferred. cudaMemcpy2D is scheduling multiple (NO_OF_VARIABLES) DMA operations under the hood, whereas an ordinary cudaMemcpy with the same amount of data to be transferred can schedule the entire operation with a single DMA transfer, because the data is contiguous. The magic performed by cudaMemcpy2D is not free, in terms of timing. If it’s only a few values, and it’s convenient to do so, you will probably see better results if you can group those values contiguously to facilitate an ordinary cudaMemcpy to the host.

And yes, windows may be interfering, due to batching of WDDM commands. If you search elsewhere on this board, you’ll find comments from Greg@NV about this effect and possible workarounds for WDDM devices. If you use the latest nsight VSE, you can get an idea of the WDDM command queue at any point in time.

UberSycko · July 7, 2014, 2:40pm

Thanks for the fast response! Honestly, I feel a littlebit like a newbee not recognizing the WDDM facts :( Anyway, I replaced the cudaMemcpy2D by cudaMemcpy not really resulting in a better performance. I guess WDDM (in combination with queuing) is responsible for large delays. Is there any chance to get rid of these problems (w/o explicit triggering of events) except buying a Quadro or Tesla card?
As an alternative, one could surely use Linux - does anyone have checked the latencies of memory copies for this OS?

Robert_Crovella · July 7, 2014, 4:39pm

This is what I get on CUDA 6.0, RHEL 5.5, Quadro5000 GPU (PCIE 2.0, cc2.0):

$ cat t462.cu
#include <stdio.h>
#include <stdlib.h>

#include <time.h>
#include <sys/time.h>

#define DSIZE 32
#define LOOPS 100000
#define uS_PER_SEC 1000000

int main(){

  int *d_data, *h_data;
  timeval t1, t2;
  h_data = (int *)malloc(DSIZE*sizeof(int));
  cudaMalloc(&d_data, DSIZE*sizeof(int));

  gettimeofday(&t1, NULL);
  for (int i=0; i < LOOPS; i++)
    cudaMemcpy(h_data, d_data, DSIZE*sizeof(int), cudaMemcpyDeviceToHost);
  gettimeofday(&t2, NULL);
  float t_diff = (float)(((t2.tv_sec*uS_PER_SEC)+t2.tv_usec)-((t1.tv_sec*uS_PER_SEC)+t1.tv_usec));
  float t_per = t_diff/(float)LOOPS;
  printf("average time per copy: %fus\n", t_per);
  return 0;
}

$ nvcc -arch=sm_20 -o t462 t462.cu
$ ./t462
average time per copy: 9.841300us
$

UberSycko · July 7, 2014, 9:11pm

Have many thanks! I will try to port my programm to linux in the next days - after some further investigations I saw the batch queue was filled in my original program and such I will suffer from delays in ~100us to 1ms, which is not acceptable for my application. I will post results as soon as I collected them!

tera · July 7, 2014, 10:05pm

You can insert cudaStreamQuery(0) to force immediate submission of the current batch.

Also, can’t you just turn the 7 floats transferred to the device into parameters to the kernel?

UberSycko · July 8, 2014, 3:53pm

Of course, that is right - but unfortunately this does not really solve my problem. Transferring the floats as parameters seems to be a suitable idea at the moment, but I do not know how many parameters this will be in the future… Anyway, thanks for the cudaStreamQuery hint, I read something about cudaStreamQuery before and wondered why this did not work.

UberSycko · July 15, 2014, 4:12pm

Hi,

in the meantime I installed Linux and ported my project, resulting in dramatically reduced latencies. I now have app. 10µs per memory/kernel access, which seems acceptable to me. Drivers and IDE are working without problems, I was astonished how easy the porting process was!

Topic		Replies	Views
bets way to return a float value sync or assync CUDA Programming and Performance	26	10310	May 7, 2009
Memory copy very slow memory copy, image CUDA Programming and Performance	10	12485	April 7, 2011
how to speed up? data transfer CUDA Programming and Performance	22	3720	April 5, 2011
Handful of Slow Memory Transfers CUDA Programming and Performance	7	813	June 17, 2016
Very slow memory transfer problem Simple program executes very slowly, bandwidth test shows normal r CUDA Programming and Performance	2	907	February 7, 2011
Why it is so slow to use cudamemcpy（cudaMemcpyHostToHost）on tx2 Jetson TX2 cuda	5	1166	October 18, 2021
massive hiccups when transferring flags back to host challenging the commonly quoted 2-10us latency CUDA Programming and Performance	14	38113	February 1, 2011
Overhead using cudaMemcpyAsync CUDA Programming and Performance	5	3191	September 1, 2009
A few questions on CUDA performance with pictures! CUDA Programming and Performance	6	3349	January 10, 2009
cudaMemcpy2D() and a few gray hairs It's very slow CUDA Programming and Performance	8	4530	February 13, 2009

Memory Latencies for Small Data Transfers

Related topics