Transfer data between host and device dynamically? Maybe it's a problem.

Hi, guys.
What I want to do is transfer data between the host (CPU) and the device (GPU) dynamically, using a producer-consumer model. Since the blocks work in parallel, each block has a fixed buffer area. If its buffer is not empty, a block can take data out and compute on it; otherwise it blocks. The host reads data from a file; if the relevant buffer is full, it blocks, otherwise it puts the data into that buffer. This way the CPU and GPU can work in parallel to some extent. Is it feasible? If not, can someone suggest another way to achieve this?
Thanks in advance!

You cannot do exactly what you describe here, but you can use streams, e.g.:

for each full block you have received on the CPU, put a memcpy in a stream, then a kernel call, then another memcpy.
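That pattern might look roughly like this (a sketch only, in CUDA runtime-API style; `process`, the buffer names, and the sizes are placeholders, not from this thread, and real copy/compute overlap additionally needs pinned host memory and capable hardware):

```
// For each full batch received on the CPU, enqueue
// copy-in -> kernel -> copy-out into one stream.
cudaStream_t stream;
cudaStreamCreate(&stream);
for (int i = 0; i < numBatches; ++i) {
    cudaMemcpyAsync(dIn, hIn[i], bytes, cudaMemcpyHostToDevice, stream);
    process<<<grid, threads, 0, stream>>>(dOut, dIn, n);   // placeholder kernel
    cudaMemcpyAsync(hOut[i], dOut, bytes, cudaMemcpyDeviceToHost, stream);
}
cudaStreamSynchronize(stream);   // wait for everything queued above
```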

I can’t quite follow what you said. Using streams? Can you give a specific description of how the streams should be used, how many streams should be created, and what each stream would be responsible for? Thanks.

DenisR means that you can make a pipeline using the “Streaming API” (see the CUDA programming guide). In this way, you can start processing, say, 200 blocks of data. While the kernel for those 200 is running in stream 0, you can copy the next 200 in line to the device in stream 1 (pinned host memory & compute 1.1 hardware only).

Basically, you have the same producer consumer model on the CPU, it is just that your consumer needs to queue up large batches of data to process on the GPU for each kernel call.

Yeah! Thanks for the additional explanation. After looking at simpleStreams in the SDK, I think I understand what DenisR said a bit better.

This is the first approach I considered. But it takes a lot of time to read a large batch of data from the file, and I am also afraid there is not enough space to store the data. That is why I proposed what I described in my first post.

If nVidia could add a buffer between host and device that both the host and the device can operate on, I think the computation could be sped up somewhat.

Am I right?


The GPU doesn’t even start to flex its muscles until you feed it “medium” sized batches of at least 50 blocks. You need 100 or more before you get into the fully linear performance region. Since everything is done in parallel, processing anywhere from 1 to ~16 blocks takes roughly the same amount of time (speaking loosely; this number can change based on block dimensions, occupancy, memory access patterns, …). Then there is a little step, and 17-32 blocks take only a little longer than 16.

It is an unfortunate aspect of the architecture. But if you are limited by disk I/O, it doesn’t matter whether you have a GPU or a CPU; the only way to speed up your application is to increase disk I/O performance, using RAID or going to the extreme of parallel I/O :)

Edit: how much data are you reading? And how much would you process in one block? For optimal performance, each block doesn’t need to process much. Whole kernel calls that only read/write a few hundred kilobytes can attain maximum memory throughput on the device.

The files are always very large; the smallest is several megabytes. What I am researching now is accelerating cache simulation, so I want to reduce the time spent reading data and maximize the parallel execution of the CPU and GPU. How much data goes into one block depends on the input data and the structure of my simulated cache.

Using streams can partly increase the CPU/GPU parallelism, I think.

MisterAnderson42, there are still some things I don’t understand. Can the two streams (in the simple case) be executed fully in parallel? Maybe only partly, since the two streams must be synchronized at some points, i.e. one stream must wait for the other to finish its work.

Why “pinned host memory & compute 1.1 hardware only”? I tried the simpleStreams program on an 8800 GTX (compute capability 1.0), but it took as much time as the no-stream version. I found only two sections about streams in programming guide 1.1. Where can I find more details about streams?

Sorry for my poor knowledge of this. Thanks!

I’ve never used the streams myself, and everything I’m telling you I got from the programming guide (ok, it turns out one thing I got from the forums).

Streams are conceptually computed in parallel, but they are not truly executed in parallel in every respect. For instance, only one kernel runs on the GPU at a time. But you can overlap a host-to-device memcpy with a kernel execution if the hardware supports it. If your application requires some kind of synchronization among streams, you can accomplish it with the event API.

The pinned memory requirement is listed in the documentation for every cudaMemcpy*Async function call.

And section 4.5.1.5 mentions

It has been said on the forums that only compute 1.1 hardware has the CU_DEVICE_ATTRIBUTE_GPU_OVERLAP feature.
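If you want to check your own card, that attribute can be queried directly. Here is a small sketch using the driver API (it needs the CUDA toolkit and a device to run, so take it as illustration only):

```
/* Query whether the device can overlap an async memcpy with kernel
   execution, via the CU_DEVICE_ATTRIBUTE_GPU_OVERLAP attribute. */
#include <cuda.h>
#include <stdio.h>

int main(void) {
    CUdevice dev;
    int overlap = 0;
    cuInit(0);
    cuDeviceGet(&dev, 0);
    cuDeviceGetAttribute(&overlap, CU_DEVICE_ATTRIBUTE_GPU_OVERLAP, dev);
    printf("copy/kernel overlap supported: %s\n", overlap ? "yes" : "no");
    return 0;
}
```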

Thanks for your patient explanation. I think I get it… I should read the programming guide to learn more about the stream mechanism. Thanks again. :shifty:

Since only two devices are present, you can get at most 2x performance by perfectly overlapping their execution. You should be able to ping-pong between two sets of buffers/streams and use cuStreamSynchronize() for the needed CPU/GPU synchronization.

The first lines of code “prime the pump” to get the GPU going while the CPU remains free to prepare the input for the next block of data. This “example” assumes you are doing something like video decoding, constructing something big in device memory with no copies back.

// CPU: prepare srcData[0]
cuMemcpyHtoDAsync( srcData[0] );
Launch kernel to process dstData[0] <- srcData[0]
i = 1;
pingpong = 1;
while ( !done ) {
    cuStreamSynchronize(stream[pingpong]);
    // CPU: prepare srcData[pingpong]
    cuMemcpyHtoDAsync( srcData[pingpong] );
    Launch kernel to process dstData[i] <- srcData[pingpong]
    i++;
    pingpong = 1 - pingpong;
}

Because the memcpy and launch are asynchronous, the CPU stays free; the loop should then get a speedup commensurate with how balanced the workload is between CPU and GPU.

In this example, the streams are not trying to execute concurrently. They serve as a CPU/GPU synchronization mechanism to make sure you aren’t overwriting an input buffer before the GPU is done copying from it. This synchronization also could be done with a pair of GPU events by putting a cuEventRecord call after the async memcpy calls and replacing cuStreamSynchronize by cuEventSynchronize.
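Concretely, the event variant might look like this fragment (a sketch in driver-API style; `devBuf`, `bytes`, and the event names are placeholders, not from the code above):

```
/* Record an event right after each async copy, then wait on that event
   (instead of the whole stream) before the CPU reuses the source buffer. */
CUevent copyDone[2];
cuEventCreate(&copyDone[0], CU_EVENT_DEFAULT);
cuEventCreate(&copyDone[1], CU_EVENT_DEFAULT);

/* inside the ping-pong loop, replacing cuStreamSynchronize: */
cuEventSynchronize(copyDone[pingpong]);
/* CPU: prepare srcData[pingpong] */
cuMemcpyHtoDAsync(devBuf[pingpong], srcData[pingpong], bytes, stream[pingpong]);
cuEventRecord(copyDone[pingpong], stream[pingpong]);
/* launch kernel in stream[pingpong] as before */
```

Note that waiting on a freshly created, never-recorded event returns immediately, so the first pass through the loop does not stall.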

Also note that the speedups due to CPU/GPU concurrency are available on all CUDA hardware.

Thanks for the useful information. I have also considered using ping-pong buffers.

And I have another question. I only have an 8800 GTX, with compute capability 1.0. Can streams still speed up performance if I use them to handle the transfer and processing of the data? If not, can I use events to achieve it?

Yes, CPU/GPU concurrency can be exploited on any CUDA card, not just compute 1.1-capable cards.

Events and streams are both suitable ways to do the needed synchronization.