Assume one GPU device: a Fermi-class Tesla C2050, on Windows 7 64-bit, with an 8-core CPU.
The two options I have are:

Option 1: Win32 threads with zero-copy memory. Prepare Windows threads: one owns the GPU, one writes GPU output to disk, and one reads input from disk into zero-copy memory. Using the Windows thread API (mutexes, etc.), the read, write, and kernel-launch threads overlap. The GPU sees only zero-copy host memory; there is no use of streams at all. The GPU thread waits on a Windows mutex before it proceeds. There are several input and output buffers, so their CPU-side processing can proceed while one buffer set is tied up with the GPU.
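For reference, the memory setup behind option 1 might look roughly like this. This is only a minimal sketch; the buffer size, kernel body, and names (`process`, `N`) are placeholders, not from the original post. The point is that `cudaHostAllocMapped` plus `cudaHostGetDevicePointer` gives the kernel a device pointer that reads and writes host memory directly over PCIe:

```cuda
// Sketch of option 1's zero-copy setup (illustrative sizes and names).
#include <cuda_runtime.h>

#define N (1 << 20)   // example element count, not from the original post

__global__ void process(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = in[i] * 2.0f;   // placeholder for the real computation
}

int main(void)
{
    float *h_in, *h_out;   // host (zero-copy) pointers
    float *d_in, *d_out;   // device aliases of the same memory

    // Must be set before any mapped allocation.
    cudaSetDeviceFlags(cudaDeviceMapHost);

    cudaHostAlloc(&h_in,  N * sizeof(float), cudaHostAllocMapped);
    cudaHostAlloc(&h_out, N * sizeof(float), cudaHostAllocMapped);
    cudaHostGetDevicePointer(&d_in,  h_in,  0);
    cudaHostGetDevicePointer(&d_out, h_out, 0);

    // ... the reader thread fills h_in from disk here ...

    process<<<(N + 255) / 256, 256>>>(d_in, d_out, N);
    cudaDeviceSynchronize();   // kernel writes land directly in h_out

    // ... the writer thread flushes h_out to disk here ...

    cudaFreeHost(h_in);
    cudaFreeHost(h_out);
    return 0;
}
```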
Option 2: CUDA streams. Using streams, I can overlap GPU computation with copies between device memory and page-locked host memory, but I still need CPU threads to read and write the page-locked host memory to and from disk. This second, streams-based option only makes sense to me if the GPU runs much faster out of device memory than out of zero-copy host memory.
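For comparison, option 2 would look roughly like the sketch below: two streams ping-ponging over pinned host buffers so the copy engine and the SMs can overlap. Again, the sizes, chunk count, and the `process` kernel are illustrative assumptions, not anything from the original post:

```cuda
// Sketch of option 2: async copies plus kernels on alternating streams.
#include <cuda_runtime.h>

#define N       (1 << 20)   // elements per chunk (example value)
#define NCHUNKS 8           // number of chunks to process (example value)

__global__ void process(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = in[i] * 2.0f;   // placeholder for the real computation
}

int main(void)
{
    float *h_in, *h_out, *d_in[2], *d_out[2];
    cudaStream_t stream[2];

    // Pinned (page-locked) host memory is required for truly async copies.
    cudaHostAlloc(&h_in,  NCHUNKS * N * sizeof(float), cudaHostAllocDefault);
    cudaHostAlloc(&h_out, NCHUNKS * N * sizeof(float), cudaHostAllocDefault);
    for (int s = 0; s < 2; ++s) {
        cudaMalloc(&d_in[s],  N * sizeof(float));
        cudaMalloc(&d_out[s], N * sizeof(float));
        cudaStreamCreate(&stream[s]);
    }

    for (int c = 0; c < NCHUNKS; ++c) {
        int s = c % 2;   // alternate between the two streams
        float *in  = h_in  + (size_t)c * N;
        float *out = h_out + (size_t)c * N;
        cudaMemcpyAsync(d_in[s], in, N * sizeof(float),
                        cudaMemcpyHostToDevice, stream[s]);
        process<<<(N + 255) / 256, 256, 0, stream[s]>>>(d_in[s], d_out[s], N);
        cudaMemcpyAsync(out, d_out[s], N * sizeof(float),
                        cudaMemcpyDeviceToHost, stream[s]);
        // A CPU writer thread would wait on cudaStreamSynchronize (or a
        // cudaEvent) before touching h_out for this chunk.
    }
    cudaDeviceSynchronize();
    // ... cleanup (cudaFree, cudaFreeHost, cudaStreamDestroy) omitted ...
    return 0;
}
```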
I understand that using a zero-copy device pointer to pinned host memory can be problematic if the GPU does multiple reads/writes, but that is not the case here: the GPU reads each input buffer once and writes each output buffer once.
Is there any advantage to using streams (option 2) in this case over option 1? And will CUDA (nvcc) play nicely with the standard Win32 thread APIs?