Streaming Data to the GPU

We are experimenting with streaming data to the GPU for signal processing, and we are running into speed issues that we presume are related to memory-copy latency. We are trying to stream data continuously at as high a rate as we can; the data arrives on a 10-gigabit NIC, so the theoretical maximum is 1.25 GB/s. We know we can't reach speeds of that order, but we are only getting about 120 MB/s at best, measured from the point the data enters the NIC to the point it comes back onto the host from the GPU (we send the data directly to the GPU, perform an FFT, and copy the result off).

We have tried streams, but our kernels weren't heavy enough to hide the latency of the memory copies, so we saw no speedup there. We also don't know how to kick off the FFT asynchronously; if anyone knows how to do that, maybe streams could be an option after all. We are seeing some speedup with pinned memory, but the results are mixed: sometimes pageable memory turns out to be faster, and we can't figure out why. Any thoughts would be greatly appreciated if anyone has had some luck doing something similar. Thanks!
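For concreteness, here is a minimal sketch of the kind of pipeline described above (pinned host buffer, host-to-device copy, cuFFT, device-to-host copy), assuming single-precision complex samples; the buffer size is illustrative and error checking is omitted for brevity:

```cpp
#include <cuda_runtime.h>
#include <cufft.h>

int main() {
    const int N = 1 << 20;                           // samples per batch (illustrative)
    cufftComplex *h_buf, *d_buf;

    // Pinned (page-locked) host memory: required for truly asynchronous
    // copies and usually faster than pageable memory for large transfers.
    cudaHostAlloc((void **)&h_buf, N * sizeof(cufftComplex), cudaHostAllocDefault);
    cudaMalloc((void **)&d_buf, N * sizeof(cufftComplex));

    cufftHandle plan;
    cufftPlan1d(&plan, N, CUFFT_C2C, 1);

    // ... fill h_buf with samples from the NIC ...

    cudaMemcpy(d_buf, h_buf, N * sizeof(cufftComplex), cudaMemcpyHostToDevice);
    cufftExecC2C(plan, d_buf, d_buf, CUFFT_FORWARD); // in-place forward FFT
    cudaMemcpy(h_buf, d_buf, N * sizeof(cufftComplex), cudaMemcpyDeviceToHost);

    cufftDestroy(plan);
    cudaFree(d_buf);
    cudaFreeHost(h_buf);
    return 0;
}
```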

If you have your own FFT kernels and they do not use all of the resources of the GPU, then you can try using streams and concurrent kernels together (asynchronous copies plus concurrent, asynchronous kernel launches). This should give you a large speedup in streaming. Whether just to see what is possible in terms of performance, or to adopt if you like it enough, the Kappa framework (which I wrote and sell) lets you easily stream data to the GPU in parallel on different, overlapping streams with concurrent (asynchronous, overlapping) kernel launches (the URL is psilambda.com). Check the Quick Start Guide on the web site for your platform; there should be enough there to get you going, and if not, feel free to contact me.
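On the asynchronous-FFT part of the original question: cuFFT runs on whichever stream is set with cufftSetStream(), so the exec call returns immediately and its kernels can overlap with copies queued on other streams. A rough sketch of that pattern, with illustrative stream count and chunk size and with error checking omitted:

```cpp
#include <cuda_runtime.h>
#include <cufft.h>

int main() {
    const int NSTREAMS = 4;                          // illustrative
    const int N = 1 << 18;                           // samples per chunk (illustrative)
    cudaStream_t stream[NSTREAMS];
    cufftHandle plan[NSTREAMS];
    cufftComplex *h_buf, *d_buf;

    // Pinned host memory is required for cudaMemcpyAsync to actually overlap.
    cudaHostAlloc((void **)&h_buf, NSTREAMS * N * sizeof(cufftComplex), cudaHostAllocDefault);
    cudaMalloc((void **)&d_buf, NSTREAMS * N * sizeof(cufftComplex));

    for (int i = 0; i < NSTREAMS; ++i) {
        cudaStreamCreate(&stream[i]);
        cufftPlan1d(&plan[i], N, CUFFT_C2C, 1);
        cufftSetStream(plan[i], stream[i]);          // cuFFT kernels go on this stream
    }

    // Queue H2D copy, FFT, and D2H copy on each stream; work queued on one
    // stream can overlap with copies and kernels running on the others.
    for (int i = 0; i < NSTREAMS; ++i) {
        cufftComplex *h = h_buf + i * N, *d = d_buf + i * N;
        cudaMemcpyAsync(d, h, N * sizeof(cufftComplex), cudaMemcpyHostToDevice, stream[i]);
        cufftExecC2C(plan[i], d, d, CUFFT_FORWARD);  // asynchronous: returns immediately
        cudaMemcpyAsync(h, d, N * sizeof(cufftComplex), cudaMemcpyDeviceToHost, stream[i]);
    }
    cudaDeviceSynchronize();                         // wait for all streams to finish

    for (int i = 0; i < NSTREAMS; ++i) {
        cufftDestroy(plan[i]);
        cudaStreamDestroy(stream[i]);
    }
    cudaFree(d_buf);
    cudaFreeHost(h_buf);
    return 0;
}
```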

Another option: avoid memcpy. Read/write the host memory directly via mapped memory.
As long as the kernel has enough calculation to hide the latencies, it should be almost “for free”.

You just need two sets of buffers (one is filled by the NIC while the other is processed by the GPU).
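A minimal sketch of that mapped-memory (zero-copy) double-buffer idea, assuming a custom kernel (here a placeholder named process) that reads its input directly from host memory; names and sizes are illustrative:

```cpp
#include <cuda_runtime.h>

// Placeholder kernel: reads input straight from mapped host memory.
// Replace the body with real signal-processing work.
__global__ void process(const float2 *in, float2 *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i];
}

int main() {
    const int N = 1 << 20;                           // samples per buffer (illustrative)
    float2 *h_in[2], *d_in[2], *h_out[2], *d_out[2];

    cudaSetDeviceFlags(cudaDeviceMapHost);           // must precede context creation

    for (int b = 0; b < 2; ++b) {
        // Mapped pinned memory: the GPU reads/writes host RAM over PCIe,
        // so no explicit cudaMemcpy is needed at all.
        cudaHostAlloc((void **)&h_in[b],  N * sizeof(float2), cudaHostAllocMapped);
        cudaHostAlloc((void **)&h_out[b], N * sizeof(float2), cudaHostAllocMapped);
        cudaHostGetDevicePointer((void **)&d_in[b],  h_in[b],  0);
        cudaHostGetDevicePointer((void **)&d_out[b], h_out[b], 0);
    }

    int cur = 0;
    for (int iter = 0; iter < 2; ++iter) {           // streaming loop (two passes shown)
        // ... the NIC fills h_in[1 - cur] while the GPU processes h_in[cur] ...
        process<<<(N + 255) / 256, 256>>>(d_in[cur], d_out[cur], N);
        cudaDeviceSynchronize();                     // results now visible in h_out[cur]
        cur = 1 - cur;                               // swap buffers
    }

    for (int b = 0; b < 2; ++b) {
        cudaFreeHost(h_in[b]);
        cudaFreeHost(h_out[b]);
    }
    return 0;
}
```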
