streams vs. direct use of zero copy memory

Assume 1 gpu device. I am using Fermi with C2050. windows 7 64 bit. cpu with 8 cores.
The two options I have are
—First option------------

  1. prepare windows threads, one owns the gpu, and one writes GPU output to disk and
    one thread to read input from disk into zero copy mem.
  2. using windows thread api- mutex etc, overlap read and write and gpu kernel launch threads.
    GPU sees only zero copy host memory. There is no use of streams at all. The GPU waits
    for a windows mutex before it proceeds. There are several input and output buffers
    so their cpu side processing can proceed while one buffer set is tied up with the gpu.

----Second Option----
Using gpu streams, I can overlap gpu computation with copys from gpu memory to page locked host
memory, but I still need cpu threads to read and write the page locked host memory to and from

Only if the gpu runs lots faster with gpu memory does the second streams options make sense to me.
I understand that using a zerocopy gpu memory pointer to the pinned host memory might be problematic
if the gpu does multiple reads
/writes but they are not. the GPU Reads once and Writes once- both input and output buffers.

Is there any advantage to using streams (option 2) in this case, over option 1? Will cuda (nvcc) play ball nice with
standard win-32 thread api’s ?

It all boils down to the question how much latency your kernel can tolerate. A major influence is how far the compiler can move the read ahead of the actual use of the value read. Loop unrolling can help there, or you can manually move the load forward.
Occupancy is another factor. The higher the occupancy, the more latency can be hidden with calculations from different threads.

If your kernel can tolerate a lot of latency and is memory bandwidth bound, zero copy can actually be faster because the PCIe bandwidth is added to the global mem bandwidth.

I have occupancy HUGE. Thread blocks full size 1024, with 1024 blocks. no interthread block communication necessary. It takes 10 seconds to run such a kernel. and 3 seconds more of cpu time to process the output from a single kernel run. This is using zero-copy mem. If I switched to device device memory, and used streams to copy concurrently to some pinned host memory, you think this does not make sense, cause I can just use the zero-copy pinned memory directly?
I will push the device memory to its size limits if I decide to use this, but the cpu has lots of memory.

I will need to run win-32 threads to feed kernel input and process kernel output cycles in any case. This means I will need several copies of kernel input-output memory malloced in the cpu, but
I have plenty of memory.
I will use the win32 thread stuff then, and nvcc will be ok with that?

If I move to two nvidia 2050 cards on the same host, then I assume one gpu can be owned by one cpu thread and the second gpu owned by another cpu thread, and I can use a producer consumer queue to
feed input and process output from them both. Is there anything I should know about this plan?

you do know that you can use streams and zero-zopy ?

You know that your zero-copy is finished once the stream is synchronized.

for each stream
kernel<<<…, stream>>>(…)


ok_to_read_buffer = 1;


really? is there any documentation or sample code?

Unfortunately I haven’t been able to find any good documentation on it. But to me It’s a very basic feature to be able to determine when a zero-copy buffer (ex multi threaded application with portable buffers) is safe for reading or writing.

Since it is possible to use streams and zero-copy and the documentation says cudaStreamSynchronize() - “Blocks until stream has completed all operations.” it makes sense.

If I do decide to have several copies, say three, of the gpu data input and output, and these lies in zero copy cpu memory, I don’t have to move stuff into device memory, and one cpu thread can be working on preparing the next cuda kernel cycle input data, while one cpu thread hosts the gpu kernel and is working on one zero copy input image, and another cpu thread is busy processing the result of a completed kernel.
I am proceeding on this plan. How could streams help, given that there is no need to move data from the cpu into device memory, on the kernel input or output side, with zero copy memory. What good is a stream if a have a single gpu and I am not using device memory for the main input and output buffers anyway, and the main purpose of streams in a one device context is to hide latency involved in loading device memory. Please tell me if I am mistaken here and fill in a little more detail about what streams in a single gpu context with no heavy use of device memory will buy me.

Streams would help you to figure out when exactly a given buffer is safe for reading and writing IF you have a multi-threaded approach on the CPU side. If this is not an issue and you are using a single threaded approach anyways then you don’t need to worry about using streams for sequencing your zero-copies.

I think I may be missing something basic in the understanding of what streams can do. Of course I have to use a cudaThreadSychronize() after Kernel call cause the kernel call is async and returns immediately, and I have to
wait until the kernel is done doing whatever it is doing before I can proceed to process the output it has made and prepare for another kernel cycle. If my cpu side does not have to copy from device memory to cpu memory, cause I am using zero copy memory, then any notification that a stream could give me that the kernel is done banging on memory at time T1 is not useful if it coincides with the same time T2 at which I would fall off the wait from cudaThreadSychronize(). I think T1=T2 with zero copy memory. If this is not so, I need some education. Thanks for your patience. My producer consumer queue idiom is working nicely, so I can hide the pre and post kernel work with cpu threads, but I have to put the gpu kernel call into it now. I hope cuda is ok with win32 threads etc. I will find out real soon. The system is set up with a thread to feed and prepare input, one or more worker threads, (one if I have only one gpu for now but more than one for the cpu version) and a thread to dispose of the output from a given kernel execution. If I buy a system with multiple gpu’s, I am planning to have one cpu execution thread per gpu device, and it would work just like the current cpu version with more than one concurrent worker threads, a feeder thread, and a output-disposition thread.

If one has light weight kernels and several of them can be profitably run on the same GPU hardware, then streams make sense. One is not trying to overlap gpu kernel work and the copy from gpu memory to host memory, but rather to overlap kernel work. Is this right? I don’t have enough cuda experience to imagine what kind of problem would result in lots of lightweight cuda kernels that could profitably run concurrently. I little example would be appreciated. My particular situation is a very heavyweight cuda kernel with high thread occupancy and no cuda thread dependency. If I ran more than one of these monsters concurrently there is no more juice to be had. If you had the gpu memory in place already, then maybe multiple kernels could modify it quickly and leave, but that would mean atomic locks so the kernels could safely work out of the same device memory?

I think I focussed too much on streams vs. zerocopy in my previous posts. In reality if your goal is to move data from disk to GPU back to disk, the host <-> device copy is almost irrelevant, as it is about two orders of magnitude faster than the disk is.

The important part is to break up your single big kernel into multiple invocations of a kernel working on smaller blocks of data. And then launch the first kernel as soon as the first block of data has arrived from disk, before the second block is read. Then, once all data is read from disk, write out the first block of data back to disk as soon as the kernel processing it has finished.

Don’t worry about host <-> device copy, streams and zerocopy. Sticking to your current zerocopy implementation is just fine. Think how you properly pipeline disk read, kernel launch, and disk write. You don’t need threads or streams for that, as kernel invocations are asynchronous by default.

Something like this should do:

fread(block 0);

    for (int i=0; i<N; i++) {

        kernel<<<...,...>>>(block i);

        if (i+1<N)

            fread]block i+1);

        if (i>0)

            fwrite(block i-1);



   fwrite(block N-1);

Focusing instead on disk read/write throughput should in that case be relevant:

//This simple scheme will work fine for basic buffering.
//using async kernel launches.

// overlap of kernel and cpu read/process
B[0].in and B[0].out
B[1].in and B[1].out
for (int i=0; !(eof=cpuRead(B[ (i%2) ].in)) ; i++)
if ( i > 0)
wait for async Kernel to be free
cpuWrite B[ (i-1)%2] ).out
Launch Async gpu Kernel on B[ (i%2) ]

  1. it will not overlap cpuRead and and cpuWrite
  2. It will not perform buffering well
    in the face of unequal processing for contiguous
    buffers, as there is only a look-ahead of 1
    You are right that the pcq solution is too much work if you
    do not need 1 or 2.
    Thanks for bringing this to my attention.

The real reason I am going with pcq is not just buffering. My app will use either cpu or gpu processing threads, and with a cpu machine with many cores, I want multiple workers reading and writing from the common pool of input and output buffers. This is working fine now for the cpu case, so I can make use of 8 core cpu’s etc, but when I get a multiple gpu box, I want to do the same thing. If a single worker thread calls its own specific gpu device, is there anything I should know to make this work in the gpu case? I cant see a role for streams here. The gpu is tied to a particular cpu thread that does the kernel launches for its associated gpu device.
In my case, the gpu kernel can do a buffer in 20 seconds. (1000 kernel blocks in use in a kernel launch). and the output side cpu processing takes 3 seconds, so overlap is gravy, and I can hide the cpu output side and input sides completely but the real benefit of this pcq archetecture only is clear when the pcq has multiple worker threads, so in the same 20 kernel seconds, I can do multiples of that work for as many gpu’s as I have. (It takes the cpu 250 seconds to do what the gpu can do in 20).

The fist version was lame. It did not overlap write and kernel processing.
This one is correct I think.

Version 2

1 allocate B[0].in,B[0].out
2 allocate B[1].in,B[1].out
3 i=0
4 eof=0
5 for ( i=0; eof==0;i++)
6 {
7 eof=Read B[ (i%2) ].in
8 if ( i > 0 )
9 ___ wait for kernel to finish B[ (i-1)%2]
10 if ( eof == 0 )
11 ___ Launch async gpu Kernel on B[ (i%2) ]
12 if ( i > 0 )
13 ___ Write B[ (i-1)%2 ].out
14 }

Lets test this assuming exist three records,
1 line 5 i=0
2 line 7 read B[0] RECORD1
3 line 11 Kernel processes B[0] RECORD1
5 line 5 i=1
6 line 7 Read B[1] RECORD2
7 line 9 wait for Kernel to finish B[0] RECORD1
8 line 11 Kernel processes B[1] RECORD2
9 line 13 Write B[0].out RECORD1
11 line 5 i=2
12 line 7 Read B[0] RECORD3
13 line 8 Wait for Kernel to finish B[1] RECORD2
14 line 11 Kernel processes B[0] RECORD3
15 line 13 Write B[1].out RECORD2
17 Line 5 i=3
18 Line 7 Read attempt but get eof=1
19 Line 9 Wait for Kernel to finish B[0] RECORD3
20 Lin 13 Write B[0].out RECORD3