Transfer data between host and device dynamically? Maybe it's a problem.

What I want to do is transfer data between the host (CPU) and the device (GPU) dynamically, using a producer-consumer model. Since the blocks work in parallel, each block gets its own fixed buffer area. If its buffer is not empty, the block takes the data out and processes it; otherwise it blocks. The host reads a piece of data from a file; if the relevant buffer is full, the host blocks, otherwise it puts the data into that buffer. That way the CPU and GPU can work in parallel to some extent. Is this feasible? If not, can someone suggest another way to achieve it?
Thanks in advance!

You cannot do what you describe here, but you can use streams, e.g.:

For each full block you have received on the CPU, put a memcpy in a stream, then a kernel call, then another memcpy.

I don’t quite understand what you said. Using streams? Can you describe specifically how the streams should be used, how many streams should be created, and what each stream is responsible for? Thanks.

DenisR means that you can make a pipeline using the “Streaming API” (see the CUDA programming guide). In this way, you can start processing, say, 200 blocks of data. While the kernel for those 200 is running in stream 0, you can copy the next 200 in line to the device in stream 1 (pinned host memory & compute 1.1 hardware only).

Basically, you have the same producer consumer model on the CPU, it is just that your consumer needs to queue up large batches of data to process on the GPU for each kernel call.
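To make the pipeline idea concrete, here is a minimal sketch using the CUDA runtime API. It is only an illustration under stated assumptions: the kernel processBatch, the batch size, and the number of batches are hypothetical, and all input is prepared up front instead of being read from a file. Whether the copy in one stream really overlaps the kernel in the other depends on the hardware.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel: doubles every element of one batch.
__global__ void processBatch(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;
}

int main() {
    const int kBatch = 1 << 20;    // elements per batch (assumed size)
    const int kNumBatches = 8;     // assumed number of batches
    const int kStreams = 2;
    const size_t batchBytes = sizeof(float) * kBatch;

    // Async copies need page-locked (pinned) host memory.
    float *hostBuf = nullptr;
    if (cudaMallocHost((void **)&hostBuf, batchBytes * kNumBatches) != cudaSuccess) {
        fprintf(stderr, "pinned allocation failed\n");
        return 1;
    }
    for (size_t i = 0; i < (size_t)kBatch * kNumBatches; ++i) hostBuf[i] = 1.0f;

    float *devBuf[kStreams];
    cudaStream_t stream[kStreams];
    for (int s = 0; s < kStreams; ++s) {
        cudaMalloc((void **)&devBuf[s], batchBytes);
        cudaStreamCreate(&stream[s]);
    }

    // Alternate batches between the two streams. Work inside one stream
    // runs in issue order, so reusing devBuf[s] for the next batch is safe.
    for (int b = 0; b < kNumBatches; ++b) {
        int s = b % kStreams;
        float *h = hostBuf + (size_t)b * kBatch;
        cudaMemcpyAsync(devBuf[s], h, batchBytes, cudaMemcpyHostToDevice, stream[s]);
        processBatch<<<(kBatch + 255) / 256, 256, 0, stream[s]>>>(devBuf[s], kBatch);
        cudaMemcpyAsync(h, devBuf[s], batchBytes, cudaMemcpyDeviceToHost, stream[s]);
    }
    cudaDeviceSynchronize();   // wait for both streams to drain

    printf("first = %f, last = %f\n",
           hostBuf[0], hostBuf[(size_t)kBatch * kNumBatches - 1]);

    for (int s = 0; s < kStreams; ++s) {
        cudaStreamDestroy(stream[s]);
        cudaFree(devBuf[s]);
    }
    cudaFreeHost(hostBuf);
    return 0;
}
```

In a real application the loop body would be fed by the CPU-side producer, which hands over one large batch per iteration.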

Yes! Thanks for the additional explanation. After looking at the simpleStreams sample in the SDK, I think I understand better what DenisR meant.

This is the first approach I considered. But reading a large batch of data from the file takes a lot of time, and I am also afraid there is not enough space to store the data. That is why I proposed the scheme in my first post.

If NVIDIA could add a buffer between host and device that both the host and the device can operate on, I think the computation could be sped up to some extent.

Am I right?


The GPU doesn’t even start to flex its muscles until you reach “medium” sized batches of at least 50 blocks. You need 100 or more before you get into the fully linear performance region. Since everything is done in parallel, processing 1 to ~16 blocks takes roughly the same amount of time (speaking loosely; this number can change based on block dimensions, occupancy, memory access patterns…). Then there is a little step, and 17-32 blocks take only a little longer than 16.
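One way to see this behavior on your own card is to time the same kernel with different grid sizes. The following is a rough sketch, not a rigorous benchmark: the kernel fixedWorkPerBlock and the iteration count are made up for illustration, and where the steps appear depends on the GPU.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel: every block performs the same fixed amount of arithmetic.
__global__ void fixedWorkPerBlock(float *out, int iters) {
    float x = (float)threadIdx.x;
    for (int i = 0; i < iters; ++i) x = x * 1.0001f + 0.5f;
    out[blockIdx.x * blockDim.x + threadIdx.x] = x;
}

int main() {
    const int kThreads = 256;
    const int kMaxBlocks = 256;
    const int kIters = 100000;     // assumed amount of work per thread

    float *out = nullptr;
    cudaMalloc((void **)&out, sizeof(float) * kThreads * kMaxBlocks);

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    // Warm-up launch so the first timed run does not include startup cost.
    fixedWorkPerBlock<<<1, kThreads>>>(out, kIters);
    cudaDeviceSynchronize();

    for (int blocks = 1; blocks <= kMaxBlocks; blocks *= 2) {
        cudaEventRecord(start, 0);
        fixedWorkPerBlock<<<blocks, kThreads>>>(out, kIters);
        cudaEventRecord(stop, 0);
        cudaEventSynchronize(stop);          // wait for the kernel to finish

        float ms = 0.0f;
        cudaEventElapsedTime(&ms, start, stop);
        printf("%4d blocks: %.3f ms\n", blocks, ms);
    }

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(out);
    return 0;
}
```

If the time stays nearly flat over the first few grid sizes, those launches are too small to fill the GPU.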

It is an unfortunate aspect of the architecture. But, if you are limited by disk I/O it doesn’t matter if you have a GPU or a CPU, the only way to speed up your application would be to increase disk I/O performance using RAID or going to the extreme of using parallel I/O :)

Edit: how much data are you reading? And how much would you process in one block? For optimal performance, each block doesn’t need to process much. Whole kernel calls that only read/write a few hundred kilobytes can attain maximum memory throughput on the device.

The files are always very large; the smallest one is several megabytes. What I am researching now is accelerating cache simulation, so I want to reduce the time spent reading data and maximize the parallel execution of the CPU and GPU. How much data goes into one block depends on the input data and on the structure of my simulated cache.

I think using streams can partly increase the parallelism between CPU and GPU.

MisterAnderson42, there are still some things I don’t understand. Can the two streams (in the simplest case) be executed completely in parallel? Maybe only partly in parallel, since the two streams must be synchronized at some points, i.e. one stream must wait for the other to finish its work.

Why “pinned host memory & compute 1.1 hardware only”? I tried the simpleStreams program on a GeForce 8800 GTX (compute capability 1.0), but it took as much time as the non-streamed version. I find only two sections about streams in the programming guide 1.1. Where can I find more details about streams?

Sorry for my limited knowledge about this. Thanks!

I’ve never used the streams myself, and everything I’m telling you I got from the programming guide (ok, it turns out one thing I got from the forums).

Streams are conceptually computed in parallel, but they are not truly executed in parallel. For instance, only one kernel runs on the GPU at a time. But you can overlap a host-to-device memcpy with a kernel execution if the hardware supports it. If your application requires some kind of synchronization among streams, you can accomplish it with the event API.

The pinned memory requirement is listed in the documentation for every cudaMemcpy*Async function call.
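As a small illustration of that requirement, the sketch below allocates the host buffer with cudaMallocHost instead of malloc and then issues an asynchronous copy. The buffer size and the CHECK macro are assumptions made for the example.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical helper: abort with a message if a runtime call fails.
#define CHECK(call)                                                   \
    do {                                                              \
        cudaError_t err_ = (call);                                    \
        if (err_ != cudaSuccess) {                                    \
            fprintf(stderr, "%s failed: %s\n", #call,                 \
                    cudaGetErrorString(err_));                        \
            exit(1);                                                  \
        }                                                             \
    } while (0)

int main() {
    const size_t kBytes = 4 << 20;   // 4 MiB, an arbitrary example size

    // Page-locked (pinned) host memory: the copy engine can read it
    // directly, which is what lets the copy run asynchronously.
    float *hPinned = nullptr;
    float *dBuf = nullptr;
    CHECK(cudaMallocHost((void **)&hPinned, kBytes));
    CHECK(cudaMalloc((void **)&dBuf, kBytes));
    for (size_t i = 0; i < kBytes / sizeof(float); ++i) hPinned[i] = 1.0f;

    cudaStream_t stream;
    CHECK(cudaStreamCreate(&stream));

    // Returns to the CPU immediately; the transfer proceeds in the stream.
    CHECK(cudaMemcpyAsync(dBuf, hPinned, kBytes, cudaMemcpyHostToDevice, stream));

    // ... the CPU could prepare the next buffer here ...

    // Do not touch hPinned again until the copy has completed.
    CHECK(cudaStreamSynchronize(stream));
    printf("async copy finished\n");

    CHECK(cudaStreamDestroy(stream));
    CHECK(cudaFree(dBuf));
    CHECK(cudaFreeHost(hPinned));
    return 0;
}
```

Pinned memory is a limited resource, so it is usually better to allocate a few reusable staging buffers than to pin the whole input.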

And section mentions

It has been said on the forums that only compute 1.1 hardware has the CU_DEVICE_ATTRIBUTE_GPU_OVERLAP feature.
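Rather than relying on the compute capability number, you can ask the device directly. The sketch below uses the runtime API, which exposes the same flag as cudaDevAttrGpuOverlap; note that cudaDeviceGetAttribute is from toolkits newer than the one discussed in this thread, so treat it as an assumption about your installed version. With the driver API the equivalent query is cuDeviceGetAttribute with CU_DEVICE_ATTRIBUTE_GPU_OVERLAP.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int count = 0;
    if (cudaGetDeviceCount(&count) != cudaSuccess || count == 0) {
        fprintf(stderr, "no CUDA device found\n");
        return 1;
    }

    for (int dev = 0; dev < count; ++dev) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, dev);

        // 1 if the device can copy memory and run a kernel concurrently.
        int overlap = 0;
        cudaDeviceGetAttribute(&overlap, cudaDevAttrGpuOverlap, dev);

        // Number of asynchronous copy engines (0 means no overlap).
        int engines = 0;
        cudaDeviceGetAttribute(&engines, cudaDevAttrAsyncEngineCount, dev);

        printf("device %d: %s, compute %d.%d, copy/kernel overlap: %s, "
               "async engines: %d\n",
               dev, prop.name, prop.major, prop.minor,
               overlap ? "yes" : "no", engines);
    }
    return 0;
}
```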

Thanks for your patient explanation. I think I get it… I should read the programming guide to learn more about how streams work. Thanks again. :shifty:

Since only two devices are present, you can get at most 2x performance by perfectly overlapping their execution. You should be able to ping-pong between two sets of buffers/streams and use cuStreamSynchronize() for the needed CPU/GPU synchronization.

The first lines of code “prime the pump” to get the GPU going while the CPU remains free to prepare the input for the next block of data. This “example” assumes you are doing something like video decoding, constructing something big in device memory with no copies back.

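// CPU: prepare srcData[0]
cuMemcpyHtoDAsync( srcData[0] );                     // issued in stream[0]
Launch kernel to process dstData[0]<-srcData[0]      // issued in stream[0]

i = 1;          // index of the next output block
pingpong = 1;   // which of the two buffers/streams to use next

while ( ! done ) {
    // Wait until the GPU has finished with this buffer's previous contents
    cuStreamSynchronize(stream[pingpong]);

    // CPU: prepare srcData[pingpong]
    cuMemcpyHtoDAsync( srcData[pingpong] );          // issued in stream[pingpong]
    Launch kernel to process dstData[i]<-srcData[pingpong]

    i++;
    pingpong = 1 - pingpong;
}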

The first few lines “prime the pump” and get the GPU going; the CPU is still free because the memcpy and launch were asynchronous; then the loop should get a speedup commensurate with how balanced the workload is between CPU and GPU.

In this example, the streams are not trying to execute concurrently. They serve as a CPU/GPU synchronization mechanism to make sure you aren’t overwriting an input buffer before the GPU is done copying from it. This synchronization also could be done with a pair of GPU events by putting a cuEventRecord call after the async memcpy calls and replacing cuStreamSynchronize by cuEventSynchronize.
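Here is a sketch of the event-based variant using the runtime API equivalents (cudaEventRecord and cudaEventSynchronize). It is an illustration only: the kernel scaleInto, the prepareInput function standing in for the file read, and all sizes are hypothetical. Each step writes to its own slice of the output, so kernels in the two streams never touch the same memory.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel: dst <- 2 * src for one block of input.
__global__ void scaleInto(float *dst, const float *src, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) dst[i] = 2.0f * src[i];
}

// Hypothetical stand-in for the CPU work of reading/decoding the next block.
static void prepareInput(float *buf, int n, int step) {
    for (int i = 0; i < n; ++i) buf[i] = (float)step;
}

int main() {
    const int kN = 1 << 20;     // elements per block (assumed)
    const int kSteps = 10;      // number of blocks to process (assumed)
    const size_t bytes = sizeof(float) * kN;

    float *hSrc[2], *dSrc[2], *dDst = nullptr;
    cudaStream_t stream[2];
    cudaEvent_t copied[2];      // signals "GPU is done reading hSrc[p]"
    for (int p = 0; p < 2; ++p) {
        cudaMallocHost((void **)&hSrc[p], bytes);   // pinned staging buffer
        cudaMalloc((void **)&dSrc[p], bytes);
        cudaStreamCreate(&stream[p]);
        cudaEventCreate(&copied[p]);
    }
    cudaMalloc((void **)&dDst, bytes * kSteps);

    for (int step = 0; step < kSteps; ++step) {
        int p = step & 1;       // ping-pong between the two buffer sets

        // Block until the previous copy out of hSrc[p] has completed.
        // Returns immediately if the event has never been recorded.
        cudaEventSynchronize(copied[p]);

        prepareInput(hSrc[p], kN, step);   // CPU work overlaps GPU work

        cudaMemcpyAsync(dSrc[p], hSrc[p], bytes, cudaMemcpyHostToDevice, stream[p]);
        cudaEventRecord(copied[p], stream[p]);
        scaleInto<<<(kN + 255) / 256, 256, 0, stream[p]>>>(
            dDst + (size_t)step * kN, dSrc[p], kN);
    }
    cudaDeviceSynchronize();

    float last = 0.0f;          // spot-check one value from the final block
    cudaMemcpy(&last, dDst + (size_t)(kSteps - 1) * kN, sizeof(float),
               cudaMemcpyDeviceToHost);
    printf("last block value: %f\n", last);

    for (int p = 0; p < 2; ++p) {
        cudaEventDestroy(copied[p]);
        cudaStreamDestroy(stream[p]);
        cudaFree(dSrc[p]);
        cudaFreeHost(hSrc[p]);
    }
    cudaFree(dDst);
    return 0;
}
```

Recording the event right after the copy, not after the kernel, lets the CPU start refilling the host buffer as soon as the transfer is done, even while the kernel is still running.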

Also note that the speedups due to CPU/GPU concurrency are available on all CUDA hardware.

Thanks for the useful information. I had also considered using ping-pong buffers.

I have one more question. All I have on hand is a GeForce 8800 GTX, with compute capability 1.0. Can I still speed things up if I use streams to handle the transfer and the processing of the data? If not, can I use events to achieve it?

Yes, CPU/GPU concurrency can be exploited on any CUDA card, not just compute 1.1-capable cards.

Events and streams are both suitable ways to do the needed synchronization.