Some memcopy questions: async, ping-pong buffering, streaming

Hi,

I’m trying to do some streaming-style signal processing on my Tesla C870 and therefore have some memcopy questions:

I would like to use ping-pong buffering.

  1. Is it possible to “hide” memcopies (at least host <-> device) behind processing algorithms? If a memcopy sets up a DMA transfer, it should be able to run on its own.

  2. If so, how can I make sure a memcopy is finished before I use the copied data?

  3. Maybe the keyword here is “asynchronous” memcopy. But I don’t really know how to deal with streams. Could anyone please explain briefly how to use them?

  4. Looking at the profiler output of my program so far, there is a lot of idle time before a device-to-host memcopy. This delay depends on the size of the data to be copied back to host memory. For 280 MB I see an idle time of about 0.31 s, which is more than the complete “busy” time of my program. I have already moved the malloc into the initialization of the program, but this didn’t solve the problem. Could this be a latency connected to the DRAM on my host?

Thanks,

Vra

  1. See cudaMemcpyAsync in the programming guide.
  2. By launching the kernel in the same stream as the memcpy; operations within one stream execute in order.
  3. Simply issue all *Async commands and kernels that must run consecutively in the same stream (see the sketch after this list).
  4. Can you provide example code showing how you are performing the timing? It is unclear from your description what the problem could be.
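
Putting answers 1–3 together, here is a minimal two-stream ping-pong sketch. Everything in it (CHUNK_ELEMS, NUM_CHUNKS, the process kernel, the launch configuration) is made up for the example, and note that cudaMemcpyAsync requires page-locked host memory from cudaMallocHost. (Also see the note below about what the Tesla hardware itself can actually overlap.)

#include <cuda_runtime.h>

#define CHUNK_ELEMS (1 << 18)
#define CHUNK_BYTES (CHUNK_ELEMS * sizeof(float))
#define NUM_CHUNKS 8

__global__ void process(float *d)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    d[i] *= 2.0f;  // placeholder for the real signal-processing step
}

int main(void)
{
    cudaStream_t stream[2];
    float *h_buf[2], *d_buf[2];

    for (int i = 0; i < 2; ++i) {
        cudaStreamCreate(&stream[i]);
        cudaMallocHost((void **)&h_buf[i], CHUNK_BYTES);  // pinned memory: required for async copies
        cudaMalloc((void **)&d_buf[i], CHUNK_BYTES);
    }

    for (int c = 0; c < NUM_CHUNKS; ++c) {
        int s = c & 1;                     // ping-pong between the two buffer/stream pairs
        cudaStreamSynchronize(stream[s]);  // make sure buffer s is free before refilling it
        // ... refill h_buf[s] with the next chunk on the CPU ...
        cudaMemcpyAsync(d_buf[s], h_buf[s], CHUNK_BYTES,
                        cudaMemcpyHostToDevice, stream[s]);
        process<<<CHUNK_ELEMS / 256, 256, 0, stream[s]>>>(d_buf[s]);  // ordered after the copy within stream s
        cudaMemcpyAsync(h_buf[s], d_buf[s], CHUNK_BYTES,
                        cudaMemcpyDeviceToHost, stream[s]);
    }

    cudaThreadSynchronize();  // wait for both streams before touching the results
    return 0;
}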

One additional note here: The GPU in the Tesla cards cannot overlap GPU execution and host<->device memory transfer. Newer GPUs have this capability, though. All GPUs, including Tesla, can overlap GPU execution and memory copies with CPU execution as MisterAnderson42 describes.
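
To illustrate that last point: a kernel launch returns control to the CPU immediately, so the host can do useful work while the GPU computes. The kernel and function names here are placeholders, not anything from the code in this thread:

myKernel<<<grid, block>>>(d_data);  // launch is asynchronous: returns immediately
prepareNextChunk(h_next);           // CPU does useful work while the GPU executes the kernel
cudaThreadSynchronize();            // block the CPU until the GPU has finished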

Thank you both for the quick responses to my questions.

Questions 2 and 3 are clear to me now.

I have profiler output timings that seem to tell me the opposite of your statement.

The code does the following:

  1. Memcopy (not async!) host to device (~500 kB of random data)

  2. My own algorithm to realign the data (works fine when it just emulates a device-to-device memcopy)

Profiler output:

timestamp=[ 3354.577 ] method=[ memcopy ] gputime=[ 207.712 ]

timestamp=[ 3396.815 ] method=[ memSort ] gputime=[ 18.816 ] cputime=[ 35.050 ] occupancy=[ 0.667 ]

==> The timings (if I read them correctly) look like an overlap between the memcopy and the memSort!? (The Visual Profiler shows an overlap too.)

To question 4:

The program does the following:

  1. Copy random data from host to device (~500 kB)

  2. Realign and multiply the data

  3. FFT on the data

  4. Transfer the data back to the host (~280 MB)

// Copy host memory to device
CUDA_SAFE_CALL(cudaMemcpy(d_signal, h_signal, mem_size, cudaMemcpyHostToDevice));

// CUFFT plan
cufftHandle plan;
CUFFT_SAFE_CALL(cufftPlan1d(&plan, WINDOW_LENGTH, CUFFT_C2C, FFT_plan_length));

// block and grid dimensions
dim3 dimBlock(BLOCK_SIZE, 1);
dim3 dimGrid((SIGNAL_LENGTH - 1) / WINDOW_LENGTH, 1);

// memory alignment
memSort<<<dimGrid, dimBlock>>>(d_signal, aligned_signal);

// FFT
CUFFT_SAFE_CALL(cufftExecC2C(plan, (cufftComplex *) aligned_signal, (cufftComplex *) transformed_signal, CUFFT_FORWARD));

// copy data from device to host (a CUDA call, so CUDA_SAFE_CALL rather than CUFFT_SAFE_CALL)
CUDA_SAFE_CALL(cudaMemcpy(r_signal, transformed_signal, mem_size_device, cudaMemcpyDeviceToHost));

Timings:

timestamp=[ 144802.656 ] method=[ memcopy ] gputime=[ 208.960 ] 

timestamp=[ 144874.719 ] method=[ memSort ] gputime=[ 20987.617 ] cputime=[ 11.647 ] occupancy=[ 0.667 ]

timestamp=[ 144948.203 ] method=[ c2c_radix2 ] gputime=[ 49087.650 ] cputime=[ 20958.451 ] occupancy=[ 0.667 ]

timestamp=[ 551282.688 ] method=[ memcopy ] gputime=[ 148546.500 ]

Any ideas?

Thanks!

Vra

Edit: I just realized that my memSort and the FFT “overlap” too. How do I prevent this? Also by using streams? Or is there another way to make sure (from the host!) that the kernel is finished?

Edit 2: Forget the edit, cudaThreadSynchronize() does the trick.

Your code does not need cudaThreadSynchronize() to be correct. Adding it could lower your performance, especially if you want to use streams (though I’m not sure whether you can control the stream that cufft uses…).
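
For what it’s worth, later CUFFT releases do provide cufftSetStream for exactly this. A one-line sketch, reusing the plan from the code above and assuming a stream created earlier:

cudaStream_t stream;
cudaStreamCreate(&stream);
CUFFT_SAFE_CALL(cufftSetStream(plan, stream));  // subsequent cufftExec* calls on this plan run in 'stream'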

The issue with the apparent overlaps you noticed is that the timestamp in the profiler appears to record the time when the operation was added to the queue, not when the operation actually executes on the GPU.

Alright, I was beginning to think so, thanks!

If so, then why does the host issue the device-to-host memcpy that late? Even if I “serialize” the timings, the call timestamp of the last transfer makes no sense.

Another point: if we cannot rely on the timestamps, is there any other option for measuring performance directly?
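
For reference, one common option is CUDA events, which are timestamped on the GPU when the preceding work actually completes rather than when it is enqueued. A minimal sketch, timing the final transfer from the code above:

cudaEvent_t start, stop;
float ms;

cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start, 0);  // recorded on the GPU once prior work in stream 0 is done
CUDA_SAFE_CALL(cudaMemcpy(r_signal, transformed_signal, mem_size_device, cudaMemcpyDeviceToHost));
cudaEventRecord(stop, 0);
cudaEventSynchronize(stop);                // wait until 'stop' has actually been reached

cudaEventElapsedTime(&ms, start, stop);    // elapsed GPU time in milliseconds

cudaEventDestroy(start);
cudaEventDestroy(stop);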