Some of the systems use the Mobile PCI Express Module (MXM) form factor which is typically used in laptops. You typically don’t find the latest and greatest architectures and cards available in this form factor, which is a concern. Other systems have the GPU chip(s) mounted directly on the processing card, precluding the possibility of replacing the GPUs with newer versions as they are released.
All of the systems rely on a host microprocessor to control the GPU and to transfer data to and from the GPU memory, just as a PC-based system does. The transfers go over the PCIe bus via a DMA engine internal to the GPU chip. This presents bottlenecks and can be a problem for high-performance real-time embedded systems.
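Roughly, the transfer path I'm describing looks like this with the CUDA runtime (only a sketch; the process() kernel and the buffer size are placeholders):

```
#include <cuda_runtime.h>

// Placeholder kernel standing in for the real processing.
__global__ void process(float *buf, int n) { }

int main(void)
{
    const int n = 1 << 20;
    const size_t bytes = n * sizeof(float);

    float *h_buf;                               // staging buffer in host RAM
    cudaHostAlloc((void **)&h_buf, bytes, cudaHostAllocDefault);

    float *d_buf;                               // GDDR on the card
    cudaMalloc((void **)&d_buf, bytes);

    // 1. The acquisition hardware (or the CPU) fills h_buf in host memory.

    // 2. The GPU's on-chip DMA engine pulls it across the PCIe bus.
    cudaMemcpy(d_buf, h_buf, bytes, cudaMemcpyHostToDevice);

    process<<<n / 256, 256>>>(d_buf, n);

    // 3. Results travel back over the same PCIe link.
    cudaMemcpy(h_buf, d_buf, bytes, cudaMemcpyDeviceToHost);

    cudaFree(d_buf);
    cudaFreeHost(h_buf);
    return 0;
}
```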
My question is really twofold:
Does anyone know of any other Embedded Computing Solutions for GPGPU and in particular solutions which provide expansion slots for normal PCIe form factor cards?
Are there any Embedded solutions or developments out there that provide for something along the lines of dual-ported GPU memory where a device like an FPGA can be used to transfer data to/from the GPU memory directly, instead of doing it via the host memory over the PCIe bus by the host processor?
You might like to look at Imagination Technologies’ PowerVR GPU cores; these support OpenCL. Texas Instruments makes a few chips (e.g. the OMAP4) which combine the PowerVR with an ARM CPU.
Thanks david_jones, I’ll have a look at those chips as well.
There are a lot of other architectures that are commonly used as well, but it’s not really a limitation, no. All the embedded GPU computing solutions that I’ve seen so far use single-board computers with Intel processors running some or other Linux distro. With the second part of my question I was thinking more along the lines of the performance limitations fundamentally imposed by the way you transfer data to and from the GPU memory. It seems like you could do better if you had direct access to write to and read from the GDDR with something like an FPGA, although it would be a complicated solution.
I’m not sure either whether there is support for Windows CE, actually. I’m really looking at this more from a high-speed digital signal processing angle for embedded military applications, rather than mobile handheld devices like PDAs.
The GDDR controller is integrated within the GPU die. The only other I/O pins available on the GPU are the PCIe link and the SLI link.
So even if you could get a custom driver and custom hardware, you cannot get much more bandwidth than you will get from a PC architecture (SLI is fairly low-bandwidth, I think).
Important things to care about are:
that the GPU is really plugged into an x16 PCIe 2.0 interconnect, which can deliver 8 GB/s all the way, as the back-of-the-envelope numbers after this list show (the first product you linked shares a tiny x4 PCIe 1.0 link between 2 GPUs!),
that you can DMA directly from/to other PCIe clients like a 10 Gbit Ethernet NIC without going through system memory. This requires a custom driver, but there is at least one company who claims to support it…
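The back-of-the-envelope numbers, taking the usual effective rates of about 250 MB/s per lane per direction for PCIe 1.x and 500 MB/s for PCIe 2.0 after 8b/10b encoding (trivial C, just to make the comparison explicit):

```
#include <stdio.h>

/* Effective per-lane, per-direction bandwidth after 8b/10b encoding. */
#define PCIE1_MB_PER_LANE 250.0
#define PCIE2_MB_PER_LANE 500.0

static double gb_per_s(int lanes, double mb_per_lane)
{
    return lanes * mb_per_lane / 1000.0;
}

int main(void)
{
    double x4_gen1  = gb_per_s(4, PCIE1_MB_PER_LANE);
    double x16_gen2 = gb_per_s(16, PCIE2_MB_PER_LANE);

    printf("x4  PCIe 1.0: %.1f GB/s  (%.2f GB/s per GPU when 2 GPUs share it)\n",
           x4_gen1, x4_gen1 / 2.0);
    printf("x16 PCIe 2.0: %.1f GB/s\n", x16_gen2);
    return 0;
}
```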
I understand your point. In general it would also need to be considerably faster to make the complex hardware and driver development effort worthwhile. I’m still wondering though if there isn’t some way to have dual-ported GDDR memory that is accessible from the GPU and on another interface. You’d need some synchronisation of course, since up to two controllers can read/write the memory simultaneously. It might mean that the actual GPU chip architecture needs to change before this is viable, so probably not worth pursuing currently in my opinion.
The speed of the bus is critically important, yes. The solution you are referring to doesn’t provide a lot of bandwidth, but that’s limited by the number of high-speed serial lanes that the PCIe bus runs over on the backplane. There are only 8 lanes per slot in the rack. You could also use only one of the VXS-GSC5200 carrier cards along with the host card; then you can have x8 PCIe shared by the 2 GPUs. Overall the bandwidth still seems quite low compared to what you get in a PC though. Also, I understand PCIe is full duplex, so for an x4 PCIe link you could get 1 GB/s of transfer in each direction simultaneously?
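Something like this is what I have in mind for using the full duplex link (just a sketch; the copies only really run concurrently if the host buffers are pinned and the GPU has a copy engine available for each direction):

```
#include <cuda_runtime.h>

int main(void)
{
    const size_t bytes = 64 << 20;      // 64 MB per buffer, arbitrary
    float *h_in, *h_out, *d_in, *d_out;

    // Pinned host buffers are required for the copies to be asynchronous.
    cudaHostAlloc((void **)&h_in,  bytes, cudaHostAllocDefault);
    cudaHostAlloc((void **)&h_out, bytes, cudaHostAllocDefault);
    cudaMalloc((void **)&d_in,  bytes);
    cudaMalloc((void **)&d_out, bytes);

    cudaStream_t up, down;
    cudaStreamCreate(&up);
    cudaStreamCreate(&down);

    // One copy per direction of the link, both in flight at the same time.
    cudaMemcpyAsync(d_in,  h_in,  bytes, cudaMemcpyHostToDevice, up);
    cudaMemcpyAsync(h_out, d_out, bytes, cudaMemcpyDeviceToHost, down);

    cudaStreamSynchronize(up);
    cudaStreamSynchronize(down);

    cudaStreamDestroy(up);
    cudaStreamDestroy(down);
    cudaFree(d_in);
    cudaFree(d_out);
    cudaFreeHost(h_in);
    cudaFreeHost(h_out);
    return 0;
}
```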
I’ve been wondering how easy it would be to do this. The DMA controller used by the GPU is also integrated into the chip in the latest architectures, as far as I know. I would imagine that you’d still need the driver to use the proper commands to set up that DMA controller, but to DMA from a device address that you specify manually instead of an address in host memory. I found some notes and a forum post about transferring frames from a framegrabber directly into GPU memory without passing through host memory:
I see what you mean. From a technical standpoint, it might be doable. But economically, I do not think so.
GDDR5 is high-volume cutting-edge technology, not really something from which you can get custom derivatives.
Usually, when designers need to share memories, they just use faster commodity DRAMs (or put more chips in parallel), and share the memory controller.
That is correct. Also, all PCIe links in this product are 1.0 generation, which are half the speed of the PCIe 2.0 links used by current GPUs.
The IPN250 from Systerra looks much better in this regard. All their links are x16 PCIe 2.0.
(Apart from the fact that I am slightly worried to see “DDR2” listed next to their GT 240 GPUs. Hope it is a typo…)
Sorry, I got confused. I was thinking about [topic=“170188”]this[/topic]. But I see that it still needs data to go through system memory. The improvement is just that it saves one intermediate memcpy.
Out of curiosity, what kind of embedded applications need to push more than 8GB/s of raw data? Can it be easily split into multiple streams, each handled by a separate GPU?
I see, yes; it’s called GPUDirect, which basically just allows the InfiniBand adapter and the GPU to share a pinned memory buffer in host memory, but both the CPU and host memory are still involved.
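As I understand it, the shared-buffer arrangement looks roughly like this (only a sketch; the registration on the NIC side is left as a comment because it depends on the adapter’s own API, e.g. InfiniBand verbs):

```
#include <cuda_runtime.h>

int main(void)
{
    const size_t bytes = 8 << 20;
    void *shared;

    // One pinned buffer in host memory; 'portable' so any CUDA context
    // (and, via its own driver, the NIC) can DMA against the same pages.
    cudaHostAlloc(&shared, bytes, cudaHostAllocPortable);

    // NIC side (pseudocode, adapter-specific):
    //   mr = ibv_reg_mr(pd, shared, bytes, IBV_ACCESS_LOCAL_WRITE);
    //   ... HCA receives incoming data straight into 'shared' ...

    void *d_buf;
    cudaMalloc(&d_buf, bytes);

    // GPU side: its DMA engine reads the same pinned pages over PCIe,
    // so the extra host-to-host memcpy disappears, but the data still
    // lands in host DRAM on the way.
    cudaMemcpy(d_buf, shared, bytes, cudaMemcpyHostToDevice);

    cudaFree(d_buf);
    cudaFreeHost(shared);
    return 0;
}
```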
The system has 3 analog input channels which get sampled to form 3 channels of data at about 5 Gbps (625 MB/s) each, for a total of 15 Gbps (1.875 GB/s). The data then gets processed in real time. The channels can be processed largely in parallel, since the results of the 3 channels are only combined near the end of the processing chain. The total data throughput is therefore less than 8 GB/s, although we still need to get data into and out of the system. I understand that you need to write your code really carefully to overlap data transfer and processing to the extent where you’d get the full 8 GB/s without any dead time. For that reason I’d want it to be as fast as possible, in case there are additional delays or things that slow it down.
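For the overlap, I mean something along these lines, with one stream per channel (a sketch only; the kernel and chunk size are placeholders, and in the real code each channel would loop over chunks):

```
#include <cuda_runtime.h>

#define CHANNELS 3

// Placeholder for the real per-channel processing.
__global__ void process_chunk(float *d, int n) { }

int main(void)
{
    const int n = 1 << 20;              // samples per chunk (placeholder)
    const size_t bytes = n * sizeof(float);

    cudaStream_t s[CHANNELS];
    float *h[CHANNELS], *d[CHANNELS];

    for (int c = 0; c < CHANNELS; ++c) {
        cudaStreamCreate(&s[c]);
        cudaHostAlloc((void **)&h[c], bytes, cudaHostAllocDefault);
        cudaMalloc((void **)&d[c], bytes);
    }

    // Each channel queues its copy-in, kernel and copy-out in its own
    // stream, so the transfers of one channel overlap the compute of
    // another (and, with full duplex, the copies in opposite directions
    // overlap each other too).
    for (int c = 0; c < CHANNELS; ++c) {
        cudaMemcpyAsync(d[c], h[c], bytes, cudaMemcpyHostToDevice, s[c]);
        process_chunk<<<n / 256, 256, 0, s[c]>>>(d[c], n);
        cudaMemcpyAsync(h[c], d[c], bytes, cudaMemcpyDeviceToHost, s[c]);
    }
    cudaDeviceSynchronize();

    for (int c = 0; c < CHANNELS; ++c) {
        cudaStreamDestroy(s[c]);
        cudaFreeHost(h[c]);
        cudaFree(d[c]);
    }
    return 0;
}
```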
The motivation behind using an FPGA that can access the GDDR directly was:
You aren’t limited by the PCIe bandwidth, which is still a lot lower than the GDDR bandwidth (even though 8 GB/s is not exactly slow).
You can bypass the step of feeding data into and out of host memory entirely.
It seems like you could make the data transfer mechanism more deterministic, which is important for real-time systems, since you have fine-grained control over it.
You should be able to do that, yes, and PCIe supports it. I think the question I really want to ask is: is it as easy as just using the existing device driver for the 2nd device along with the NVIDIA CUDA device driver for the GPU, or do you need to write your own driver? I guess it will depend on support in the drivers themselves. I’m seeing hints in a lot of places that it’s possible, but I don’t see any good examples or clear indications of whether it can be done without rewriting drivers.
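The closest thing I can find exposed at the CUDA level is peer access between two CUDA GPUs, along the lines of the sketch below; whether a third-party PCIe device can take part in transfers like this presumably still comes down to its driver, so treat it as an analogy rather than an answer:

```
#include <cuda_runtime.h>
#include <stdio.h>

int main(void)
{
    int can01 = 0;
    cudaDeviceCanAccessPeer(&can01, 0, 1);   // can device 0 read device 1?
    printf("peer access 0 -> 1: %s\n", can01 ? "yes" : "no");

    if (can01) {
        const size_t bytes = 16 << 20;
        void *d0, *d1;

        cudaSetDevice(0);
        cudaMalloc(&d0, bytes);
        cudaDeviceEnablePeerAccess(1, 0);    // flags must be 0

        cudaSetDevice(1);
        cudaMalloc(&d1, bytes);

        // Direct PCIe peer-to-peer copy, no trip through host memory.
        cudaMemcpyPeer(d0, 0, d1, 1, bytes);

        cudaFree(d1);
        cudaSetDevice(0);
        cudaFree(d0);
    }
    return 0;
}
```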
I saw a comment from Greg Pfister in a post via the link Sylvain posted as well:
which basically states that PCIe peer-to-peer transfers to the GPU have been around for a while, but are not necessarily used. He also indicated there that you could transfer data from an SSD directly to a Tesla card, for example. I couldn’t find much other evidence of it after a quick search.
As far as I understand, the device drivers need to co-operate… You may need to have a fused device driver (?) to abstract it from the OS, which may not like all of this…
I often fantasise about a custom Linux kernel with a co-operating network card driver and NV device driver for memory transfers…