Questions on GPUs for software-defined radios

I’m interested in GPU computing for the signal processing in software-defined radios, which has traditionally relied on FPGAs. The idea would be to have the RF hardware and analog/digital converters connected to the compute chassis housing the CPU and GPU via 10 GigE. Each received burst of data might have a few MB of samples that need to be demodulated, deinterleaved, error corrected, and decrypted. Then reverse that process for transmission, which is much less computationally intensive.

Copying data between CPU and GPU memory is a performance killer that I’m looking to minimize.

If I’m following Nvidia’s info correctly, GPUDirect RDMA would let me move data directly between the Ethernet card and GPU memory without CPU intervention. Definitely a good thing. Is this feasible for RF samples?

I’m less clear on NVLink, Pascal, and unified memory. NVLink looks like a high-speed replacement for PCIe when connecting Pascal GPUs and compatible CPUs. What are the compatible CPUs, and when will this hardware be available? Is the transfer DMA, with an interrupt to signal completion?

Unified memory looks like a programming model that treats CPU and GPU memory as a single address space. As best I can tell, when coding in C++, one could allocate a unique_ptr to a buffer, fill it with CPU data (e.g., data to transmit), and then std::move it to GPU processing. The compiler would automagically handle the copying. Is this correct? Would this code take advantage of NVLink when that ships?

Lastly, is there a roadmap to physically unifying memory, such as connecting the CPU and GPU to the same memory via different controllers?

This must be an interesting kind of software-defined radio, because I am surprised that PCIe bandwidth is a major concern. PCIe gen3 is a full-duplex interconnect that allows simultaneous upload/download at a rate of about 12 GB/sec in each direction over an x16 link, as long as you use a GPU with dual DMA engines, such as a Tesla GPU. My experience is that block transfers of 16 MB or larger will essentially get you the maximum possible PCIe transmission throughput, so this seems to jibe with your use case.
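To put rough numbers on this, here is a minimal Python sketch of the ideal time to move one burst over each link. The 12 GB/sec figure is the one quoted above; the 10 GigE rate is the raw line rate with protocol overhead ignored, and the 4 MB burst size is only an assumed stand-in for "a few MB". It ignores per-transfer setup cost, so small transfers will be slower in practice than this suggests.

```python
# Ideal (overhead-free) transfer times for one burst of samples.
PCIE_GEN3_BYTES_PER_S = 12e9      # practical PCIe gen3 rate quoted in this thread
TEN_GIGE_BYTES_PER_S = 10e9 / 8   # raw 10 GigE line rate; protocol overhead ignored

def transfer_time_ms(num_bytes, bytes_per_s):
    """Time in milliseconds to move num_bytes at a sustained rate."""
    return num_bytes / bytes_per_s * 1e3

# Assumption: a 4 MB burst, chosen only for illustration.
burst_bytes = 4 * 1024**2

for name, rate in (("10 GigE", TEN_GIGE_BYTES_PER_S),
                   ("PCIe gen3", PCIE_GEN3_BYTES_PER_S)):
    print(f"{name}: {transfer_time_ms(burst_bytes, rate):.2f} ms per burst")
```

Under these assumptions the Ethernet hop takes roughly ten times longer than the PCIe copy, which suggests the network link, not PCIe, would set the floor on per-burst transfer time.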

Since the PCIe transmissions can be overlapped with kernel executions on the GPU itself, in the best case you could build a nice pipeline that processes a continuous stream of data at close to 12 GB/sec, or about 100 Gb/sec. Not knowing the details of your use case, I would suspect that latency is a major concern, and you would want to research this in detail if that is indeed so. GPUs are designed as throughput engines, so meeting tight latency requirements can be hard.
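A back-of-the-envelope model of that pipeline, as a Python sketch. The stage times are made-up placeholders, not measurements, and the model assumes perfect overlap of upload, kernel, and download, which real hardware only approximates.

```python
def gbytes_to_gbits(gbytes_per_s):
    """Convert GB/sec to Gb/sec (8 bits per byte)."""
    return gbytes_per_s * 8

def time_per_burst_ms(copy_in_ms, kernel_ms, copy_out_ms, overlapped):
    """Steady-state time between finished bursts.

    Without overlap the three stages run back to back, so the times add.
    With full overlap, the slowest stage sets the pace of the pipeline.
    """
    stages = (copy_in_ms, kernel_ms, copy_out_ms)
    return max(stages) if overlapped else sum(stages)

print(f"12 GB/sec = {gbytes_to_gbits(12)} Gb/sec")

# Hypothetical stage times in milliseconds, for illustration only.
copy_in, kernel, copy_out = 0.35, 1.0, 0.35

serial = time_per_burst_ms(copy_in, kernel, copy_out, overlapped=False)
piped = time_per_burst_ms(copy_in, kernel, copy_out, overlapped=True)
print(f"no overlap:   {serial:.2f} ms per burst")
print(f"full overlap: {piped:.2f} ms per burst")
```

Note that overlap only improves throughput. Any single burst still has to pass through all three stages, so its latency stays at the sum of the stage times, which is why the latency question deserves separate attention.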

GPUDirect RDMA is outside my area of expertise, but I am fairly certain that one requirement is that your data source (here: the networking card) offers an appropriate driver for it. To my knowledge there aren’t many devices yet that offer such a driver. I think Mellanox offers some such products, but I am not sure.

Historical experience shows that NVIDIA is tight-lipped about technical details of future hardware (here: Pascal and beyond), outside of rough performance parameters spelled out in marketing slides. If your employing entity is a big customer, or has a research relationship with NVIDIA, there is possibly a dedicated contact and if so, you might want to check whether you can get advance information through that channel.

[Later:] This paper (http://www.ece.rice.edu/~gw2/pdf/asilomar2014_gpu_basestation.pdf) describing a GPU-based software-defined base station mentions end-to-end latency of 3 milliseconds but throughput of only 50 Mbps, while apparently using multiple GPUs. RDMA does not seem to be used. I have not read the paper in detail, but it seems application-level performance may be limited by FFT performance, which in turn is mostly a question of GPU memory bandwidth.

The latency concern comes from channel access at the MAC layer. Something like CSMA/CA in 802.11 has very tight latency requirements while a time-slotted (TDMA) approach is more tolerant of latency. It’s a matter of scoping out the areas where GPUs would be useful.

I read that same paper a few weeks ago, and the 3 ms latency got me re-interested in looking at GPUs for SDR signal processing. I got the impression that copying data between CPU and GPU memory was a significant bottleneck based on what I now see is a mis-reading of one section. That transfer is actually a minor factor.

Thanks for pointing out those facts on different flavors of PCIe and DMA. That’s something I need to dig into more before refining my rough calculations. I will ask around here about Nvidia reps…

Yes, Mellanox was the vendor I had in mind for pairing with GPUDirect RDMA.