some memcopy questions async, ping pong buffering, streaming


I’m trying to do some streaming like signal processing on my Tesla C870 and therefore have some memcopy questions:

I would like to use ping pong buffering.

  1. Is it possible to “hide” memcopies (at least host <-> device) behind processing algorithms? If a memcopy sets up a DMA this should be able to copy on its own.

  2. If so, how can I make sure a memcopy is finished before I use the copied data?

  3. Maybe the keyword here is “asynchronous” memcopy. But I don’t really know, how to deal with streams. Could anyone please explain briefly how to use them?

  4. If I use the profiler output of my program so far there is a lot of idle time before a memcopy device to host. This delay is depending on the size of the data to be copied back to host memory. For 280MB I see an idle time of about 0.31s which is more than the complete “busy” time of my program. I have already put the malloc to the initialization of the program, but this didn’t solve the problem. Could this be a latency connected to the DRAM on my host?



  1. See cudaMemcpyAsync in the programming guide.
  2. By running the kernel command in the same stream as the memcpy.
  3. Simply execute all *Async commands and kernels that must be run consecutively in the same stream.
  4. Can you provide an example code of how you are performing the timing? It is unclear from your description what the problem could be.

One additional note here: The GPU in the Tesla cards cannot overlap GPU execution and host<->device memory transfer. Newer GPUs have this capability, though. All GPUs, including Tesla, can overlap GPU execution and memory copies with CPU execution as MisterAnderson42 describes.

Thank you both for the quick responses to my questions.

Questions 2 and 3 are clear to me now.

I have profiler output timings that try to tell me the opposite of your statement:

the code does the following:

  1. Step: memcopy (not async!) host to device (~500kB of random data)

  2. Step: own algorithm to reallign the data (works fine with just emulating a memcopy device to device)

Profiler output:

timestamp=[ 3354.577 ] method=[ memcopy ] gputime=[ 207.712 ]

timestamp=[ 3396.815 ] method=[ memSort ] gputime=[ 18.816 ] cputime=[ 35.050 ] occupancy=[ 0.667 ]

==> The timings (if I read them correctly) look like an overlap between the memcopy and the memSort!?! (The visual profiler shows an overlap too)

To question 4:

the program:

  1. Step: random data to host to device (~500kB)

  2. Step: reallign and multiplicate data

  3. Step: FFT on data

  4. Step: transfer data back to host (~280MB)

// Copy host memory to device

CUDA_SAFE_CALL(cudaMemcpy(d_signal, h_signal, mem_size, cudaMemcpyHostToDevice));

// CUFFT plan

cufftHandle plan;

CUFFT_SAFE_CALL(cufftPlan1d(&plan, WINDOW_LENGTH, CUFFT_C2C, FFT_plan_length));

// block and grid dimensions

dim3 dimBlock(BLOCK_SIZE, 1);

dim3 dimGrid((SIGNAL_LENGTH - 1) / WINDOW_LENGTH, 1);

// memory alignment

memSort<<<dimGrid, dimBlock>>>(d_signal, aligned_signal);

// FFT

CUFFT_SAFE_CALL(cufftExecC2C(plan, (cufftComplex *) aligned_signal, (cufftComplex *) transformed_signal, CUFFT_FORWARD));

// copy data from device to host

CUFFT_SAFE_CALL(cudaMemcpy(r_signal, transformed_signal, mem_size_device, cudaMemcpyDeviceToHost));


timestamp=[ 144802.656 ] method=[ memcopy ] gputime=[ 208.960 ] 

timestamp=[ 144874.719 ] method=[ memSort ] gputime=[ 20987.617 ] cputime=[ 11.647 ] occupancy=[ 0.667 ]

timestamp=[ 144948.203 ] method=[ c2c_radix2 ] gputime=[ 49087.650 ] cputime=[ 20958.451 ] occupancy=[ 0.667 ]

timestamp=[ 551282.688 ] method=[ memcopy ] gputime=[ 148546.500 ]

Any ideas?



Edit: I just realized, that my memSort and the FFT “overlap” too. How do I prevent this? Also using “streams”? Or is there another way to make sure (from host!) that the kernel is finished?

Edit2: Forget the edit, cudaTreadSynchronize() does the trick.

Your code does not need cudaThreadSynchronize to be correct. Adding it could lower your performance, especially if you are wanting to use streams (though, I’m not sure if you can control the stream that cufft uses…)

The issue with the apparent overlaps you noticed is that the timestamp in the profiler appears to record the time when the operation was added to the queue: not when the operation actually executes on the GPU.

Alright, I was beginning to think so, thanks!

If so, then why does the host call the memcpyDeviceToHost that late? Even if I “serialize” the timings the call timestamp of the last transfer makes no sense.

Other point: If we cannot rely on the timestamps, is there any other option for direct performance measuring?