I need to process a very large volume of data in real time.
First, I will acquire the data through a DAQ card that has four channels, each sampling at 20 MSamples/s.
The card can transfer the data by DMA, using almost the entire PCI bus bandwidth.
I just discovered CUDA and, before I start running any tests, I wanted to know whether anyone has already tried using CUDA for real-time processing
of data coming from a high-speed data acquisition card.
Does anyone think my idea is even feasible?
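For a sense of scale, the raw data rate in the question can be worked out directly. The sketch below is only an estimate: the post does not give the ADC resolution, so `BYTES_PER_SAMPLE` is an assumption, and the bus figure is the theoretical peak of a conventional 32-bit, 33 MHz PCI bus rather than a measured value for this card.

```python
CHANNELS = 4
SAMPLE_RATE_HZ = 20_000_000          # 20 MSamples/s per channel, from the post
BYTES_PER_SAMPLE = 2                 # assumption: samples stored as 16-bit words
PCI_PEAK_BYTES_PER_S = 4 * 33_333_333  # 32-bit bus at ~33.33 MHz, about 133 MB/s


def raw_rate(channels, sample_rate_hz, bytes_per_sample):
    """Aggregate acquisition rate in bytes per second."""
    return channels * sample_rate_hz * bytes_per_sample


rate_16bit = raw_rate(CHANNELS, SAMPLE_RATE_HZ, BYTES_PER_SAMPLE)
rate_8bit = raw_rate(CHANNELS, SAMPLE_RATE_HZ, 1)

for label, rate in (("16-bit", rate_16bit), ("8-bit", rate_8bit)):
    share = rate / PCI_PEAK_BYTES_PER_S
    print(f"{label}: {rate / 1e6:.0f} MB/s, {share:.0%} of theoretical PCI peak")
```

Under these assumptions, 16-bit samples would exceed what a plain PCI bus can carry, while 8-bit samples would use a bit more than half of the theoretical peak, so the actual sample width matters a great deal.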
What do you mean by real-time? Do you mean that you can process the input without being overloaded, or that you can do that and also process samples in/out without latency? If your scenario can tolerate buffering up data and processing it in large blocks (perhaps 1k-1M samples at a time), I would imagine it's worth investigating the CUDA option. It's a fast processing system but, as far as I am aware, the memory transfers can slow it all down if you are not able to process data in large-ish blocks.
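To put numbers on the buffering trade-off: the time needed to fill one block is a lower bound on the latency that block-wise processing adds, because the first sample in a block has to wait until the block is complete. A rough sketch, assuming the 20 MSamples/s per-channel rate from the original question and reading "1k" and "1M" as 1024 and 1024 * 1024 samples:

```python
SAMPLE_RATE_HZ = 20_000_000  # per-channel rate from the original question


def block_duration_s(samples_per_block, sample_rate_hz=SAMPLE_RATE_HZ):
    """Time to acquire one block on one channel, in seconds."""
    return samples_per_block / sample_rate_hz


small = block_duration_s(1024)          # "1k" samples
large = block_duration_s(1024 * 1024)   # "1M" samples

print(f"1k-sample block: {small * 1e6:.1f} microseconds")
print(f"1M-sample block: {large * 1e3:.1f} milliseconds")
```

So the block size alone spans roughly 51 microseconds to 52 milliseconds of added latency per channel, before any transfer or kernel time is counted.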
I would also be greatly interested in this: performing GPU FFTs on data gathered by another PCI express card, an FPGA analog-to-digital converter. The data transfer to the GPU would need to be done at high speed (but the FFTs themselves can be done in large blocks as long as the data to transform is already in GPU memory). I was toying with the idea of programming the FPGA to use some kind of SLI-like protocol, having it "pretend" to be another video card. I am not familiar with the particulars though, so I don't know if this would work.
Generally, does anyone know of any way to transfer data quickly over the PCI Express bus without involving the CPU and main memory?
We could really make use of this kind of thing at work, so when you've got some thoughts (probably tmurray or anybody else that tries) I'd be really interested in your feasibility study output :) If it only turns out to be practical using a full Tesla device, it would still be useful for this kind of application, where cost is often less of an issue.
One thing that would help us is knowing the specific devices you guys would want to use this with. If you can provide us with a list of specific cards you're interested in, that would help us a lot.
I have an application very similar to nervestaple's.
I would like to transfer a block of information from this A/D card to the GPU without using the CPU or the main memory.
Then I can process the data; my only constraint is that I need to process each block while the next data block is being acquired.
I'm working on a visualization application. It's a kind of oscilloscope.
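The constraint described here, processing one block while the next is being acquired, is the classic double-buffering (ping-pong) pattern. The following is a minimal host-side sketch of that pattern only. It is not CUDA code, the `run_double_buffered` name is hypothetical, and the byte copy merely stands in for the card's DMA transfer.

```python
import queue
import threading


def run_double_buffered(blocks, process):
    """Process each block while the next one is being 'acquired'."""
    free = queue.Queue()      # buffers the acquisition side may fill
    filled = queue.Queue()    # buffers ready for processing
    for _ in range(2):        # two buffers: one filling, one being processed
        free.put(bytearray())

    def acquire():
        for block in blocks:
            buf = free.get()  # waits here if processing has fallen behind
            buf[:] = block    # stand-in for the DMA transfer from the card
            filled.put(buf)
        filled.put(None)      # sentinel: acquisition finished

    producer = threading.Thread(target=acquire)
    producer.start()

    results = []
    while True:
        buf = filled.get()
        if buf is None:
            break
        results.append(process(bytes(buf)))
        free.put(buf)         # hand the buffer back so it can be refilled
    producer.join()
    return results


blocks = [bytes([i] * 4) for i in range(5)]
results = run_double_buffered(blocks, sum)
print(results)
```

In this sketch the acquisition thread simply waits when no buffer is free. A real acquisition card cannot wait, so in practice the processing time per block has to stay below the acquisition time per block, or samples are lost.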
In our case, it's a home-grown pci-express ADC card based around a Lattice ECP2M FPGA. We need extremely fast sampling (and relatively low cost), which is why we weren't able to use any of the usual commercial options, and why NVIDIA GPUs are attractive to us.
Ditto, sorta. The thing I have in mind would eventually replace or complement a bespoke hardware solution, so if we are guided down a route with specific capture cards that NVIDIA recommends, it may not be an issue at this stage. These are early research ideas, do it before the competition does, etc.
Doing direct DMA may seem elegant, but I'm not sure if it's really worthwhile. You can copy into memory and then back to the GPU at very high bandwidth (much higher bandwidth than almost any input card can sustain), and do it in parallel. It does add a bit of latency, yes, but not much compared to the latency a CUDA solution normally entails.
Doing direct card-to-card is impeded by concerns for security. But it shouldn't be too hard for the runtime to spit out the bus address of a cudaMalloc'd buffer. Everything else would have to be done by the capture card itself. (NVIDIA, don't worry about opening up programmability of the GPU's DMA engine or anything else like that.)
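The argument for staging through system memory can be checked with simple arithmetic: if the host-to-GPU copy of a block finishes well before the card has delivered the next block, the extra hop costs some latency but no throughput. A rough sketch follows; the 160 MB/s input rate and the 3 GB/s host-to-GPU rate are purely illustrative assumptions, not measurements of any particular card or GPU.

```python
def staged_copy_check(block_bytes, input_bytes_per_s, host_to_gpu_bytes_per_s):
    """Compare the time to fill a block with the time to copy it to the GPU."""
    fill_time = block_bytes / input_bytes_per_s
    copy_time = block_bytes / host_to_gpu_bytes_per_s
    return {
        "fill_time_s": fill_time,           # card -> system memory
        "copy_time_s": copy_time,           # system memory -> GPU (added latency)
        "keeps_up": copy_time <= fill_time, # copy can overlap the next fill
        "link_busy_fraction": copy_time / fill_time,
    }


# Illustrative assumptions: 1 MiB blocks, 160 MB/s input, 3 GB/s host-to-GPU copy.
check = staged_copy_check(1 << 20, 160e6, 3e9)
for name, value in check.items():
    print(name, value)
```

With these assumed figures the staged copy occupies only about 5% of each block period and adds roughly a third of a millisecond of latency per block, which is the kind of margin the post above is describing.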
What kind of cards? Which vendors have you contacted?
I would think there aren't all that many use cases for which going through system memory is a major performance obstacle. Off the top of my head, I am only aware of Infiniband adapters and video frame grabbers with support for NVIDIA GPUDirect.
If it only works with Tesla or Quadro cards, there is absolutely no sense in building a "home-grown pci-express ADC card based around a Lattice ECP2M FPGA" or the like, that nervestaple and Hill_Matthew are talking about.
A bigger problem with a "home-grown pci-express ADC card based around a Lattice ECP2M FPGA" might be that you will have to provide a GPUDirect enabled Linux driver for it.
There are certainly Quadro cards under $1000, but whether any of them support GPUDirect I do not know. I would suggest you inquire with NVIDIA. FWIW, I think it is unrealistic to expect cheap consumer products to provide all the benefits of professional solutions, in particular when the consumer products already come with support for tons of goodies included.
GPUDirect's main benefit is in eliminating a system memory to system memory copy, which reduces latency and power consumption. With the advent of multi-channel DDR4 memory subsystems, I would expect both advantages to have diminished somewhat. I would suggest measuring current end-to-end latencies for your use case. From that, determine how much of an issue it is before deciding that you definitely need GPUDirect.