CUDA + DMA + DAQ

I need to process a very large volume of data in real time.
At first I will acquire the data through a DAQ card that has four channels running at 20 MS/s each.
The card can push the data over DMA using almost the entire PCI bus bandwidth.

I just discovered CUDA, and before I start running any tests I wanted to know whether anyone has already tried to use CUDA for real-time processing
of data coming from a high-speed data acquisition card.

Does anyone think this is feasible?
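As a quick back-of-the-envelope check (assuming 16-bit samples, which is a guess on my part; check the card's datasheet for the actual sample width), the sustained rate works out as:

```c
/* Sustained acquisition rate in MB/s, assuming 16-bit (2-byte) samples.
 * The sample width is an assumption; adjust for your card. */
long sustained_mb_per_s(void) {
    const long channels = 4;
    const long samples_per_s = 20L * 1000 * 1000;  /* 20 MS/s per channel */
    const long bytes_per_sample = 2;               /* 16-bit samples */
    return channels * samples_per_s * bytes_per_sample / (1000 * 1000);
}
```

That comes to 160 MB/s sustained, which would saturate legacy 32-bit/33 MHz PCI (about 133 MB/s theoretical) but is easy for PCI-X or PCI Express.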


What do you mean by real-time? Do you mean that you can process the input without being overloaded, or that you can also move samples in and out with low latency? If your scenario can tolerate buffering up data and processing it in large blocks (perhaps 1k-1M samples at a time), I would imagine the CUDA option is worth investigating. It's a fast processing system, but as far as I am aware the memory transfers can slow everything down if you are not able to process data in large-ish blocks.
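A minimal host-side sketch of that block-pipelining idea (the names and the trivial "processing" stage are made up for illustration; with CUDA the acquire/process overlap would use two pinned buffers, cudaMemcpyAsync, and streams):

```c
#include <stddef.h>

#define BLOCK 4096   /* samples per block; real apps might use 1k-1M */

/* Hypothetical acquisition stage: fills buf with the next block. */
static void acquire_block(short *buf, size_t n, int block_no) {
    for (size_t i = 0; i < n; i++) buf[i] = (short)(block_no + (int)i);
}

/* Hypothetical processing stage: here, just a running sum. */
static long long process_block(const short *buf, size_t n) {
    long long sum = 0;
    for (size_t i = 0; i < n; i++) sum += buf[i];
    return sum;
}

/* Double-buffered loop: block N is processed while block N+1 is acquired.
 * In a real CUDA pipeline the acquire (DMA from the DAQ card) and the
 * processing (host-to-device copy plus kernel) would run concurrently;
 * here they alternate serially just to show the buffer handoff. */
long long run_pipeline(int nblocks) {
    static short bufs[2][BLOCK];
    long long total = 0;
    acquire_block(bufs[0], BLOCK, 0);          /* prime the pipeline */
    for (int b = 0; b < nblocks; b++) {
        short *cur = bufs[b & 1];
        short *next = bufs[(b + 1) & 1];
        if (b + 1 < nblocks)
            acquire_block(next, BLOCK, b + 1); /* overlaps in the real thing */
        total += process_block(cur, BLOCK);
    }
    return total;
}
```

The point of the two buffers is that the DAQ card only ever writes into the buffer that is not currently being processed, so neither side waits on the other.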

I would also be greatly interested in this: performing GPU FFTs on data gathered by another PCI express card, an FPGA analog-to-digital converter. The data transfer to the GPU would need to be done at high speed (but the FFTs themselves can be done in large blocks as long as the data to transform is already in GPU memory). I was toying with the idea of programming the FPGA to use some kind of SLI-like protocol, having it ‘pretend’ to be another video card. I am not familiar with the particulars though, so don’t know if this would work.

Generally, does anyone know of any way to transfer data quickly over the PCI Express bus without involving the CPU and main memory?

We’re investigating how feasible this is.

We could really make use of this kind of thing at work, so when you've got some thoughts (probably tmurray, or anybody else who tries) I'd be really interested in the output of your feasibility study :) Even if it only turns out to be practical on a full Tesla device, it would still be useful for this kind of application, where cost is often less of an issue.

One thing that would help us are specific devices you guys would want to use this with. If you can provide us with a list of specific cards you’re interested in, that would help us a lot.

For example, I would like to use this family of cards:

http://www.spectrum-instrumentation.com/m2i2020-exp.html
http://sine.ni.com/nips/cds/view/p/lang/en/nid/203361
http://www.adlinktech.com/PD/web/PD_detail…mp;id=&sid=

I have an application very similar to nervestaple's.
I would like to transfer a block of data from the A/D card to the GPU without going through the CPU or main memory.
Then I can process the data; my only constraint is that I need to process one block while the next block is being acquired.

I'm working on a visualization application. It's a kind of oscilloscope.

In our case, it’s a home-grown pci-express ADC card based around a Lattice ECP2M FPGA. We need extremely fast samples (and relatively low cost) which is why we weren’t able to use any of the usual commercial options - and why NVIDIA GPUs are attractive to us.

Ditto, sort of. The thing I have in mind would eventually replace or complement a bespoke hardware solution, so if we are guided down a route with specific capture cards NVIDIA recommends, that may not be an issue at this stage. These are early research ideas: do it before the competition does, etc.

Doing direct DMA may seem elegant, but I'm not sure it's really worthwhile. You can copy into host memory and then on to the GPU at very high bandwidth (much higher than almost any input card can sustain), and do it in parallel. It does add a bit of latency, yes, but not much compared to the latency a CUDA solution normally entails.
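To put a rough number on that extra hop (the figures are assumptions: an effective pinned-memory host-to-GPU bandwidth of around 6 GB/s, and 1 MiB blocks):

```c
/* Extra latency (in microseconds) added by one staging copy of a block,
 * given an assumed effective bandwidth in bytes per second. */
double extra_copy_latency_us(double block_bytes, double bw_bytes_per_s) {
    return block_bytes / bw_bytes_per_s * 1e6;
}
```

At those assumed figures, one extra copy of a 1 MiB block costs roughly 175 microseconds, which gives a feel for how the staging copy compares to the rest of the pipeline at your block size.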

Direct card-to-card transfers are impeded by security concerns. But it shouldn't be too hard for the runtime to expose the bus address of a cudaMalloc'd buffer. Everything else would have to be done by the capture card itself. (NVIDIA, don't worry about opening up programmability of the GPU's DMA engine or anything like that.)

Any news on this issue? I'm interested in the possibility of transferring data between a capture card and the GPU directly.

Have you had a chance to check out GPUDirect? https://developer.nvidia.com/gpudirect

Hi,

Is there any chance that GPUDirect might be available for affordable cards?
Like under $1000 USD?

What kind of cards? Which vendors have you contacted?

I would think there aren’t all that many use cases for which going through system memory is a major performance obstacle. Off the top of my head, I am only aware of Infiniband adapters and video frame grabbers with support for NVIDIA GPUDirect.

By "affordable cards" I meant GPU cards.

If it only works with Tesla or Quadro cards, there is absolutely no sense in building a “home-grown pci-express ADC card based around a Lattice ECP2M FPGA” or the like, that nervestaple and Hill_Matthew are talking about.

A bigger problem with a “home-grown pci-express ADC card based around a Lattice ECP2M FPGA” might be that you will have to provide a GPUDirect enabled Linux driver for it.

There are certainly Quadro cards under $1000, but whether any of them support GPUDirect I do not know. I would suggest you inquire with NVIDIA. FWIW, I think it is unrealistic to expect cheap consumer products to provide all the benefits of professional solutions, in particular when the consumer products already come with support for tons of goodies included.

GPUDirect's main benefit is eliminating a system-memory-to-system-memory copy, which reduces latency and power consumption. With the advent of multi-channel DDR4 memory subsystems, I would expect both advantages to have diminished somewhat. I would suggest measuring current end-to-end latencies for your use case. From that, determine how much of an issue the extra copy really is before deciding that you definitely need GPUDirect.