Circumventing the PCI-E Bus: Finance Application in High-Frequency Trading

In a seminar I attended last week, there was a short discussion on why GPUs could not be used for real-time analytical apps in HFT. I was under the impression that there were commercial apps in offline risk management and pricing applications. As many know, the big limitation for now is the latency of the CPU<–>GPU link. Are there in fact software/hardware workarounds for this problem on the drawing board, even if years away? Moreover, is anyone aware of existing apps they could reference? I don’t know much about FPGAs, so why have they been incorporated into HF work despite similar barriers (or so I have heard), while GPUs supposedly cannot be?


“Are there in fact software/hardware workarounds for this problem on the drawing board, even if years away?”

Yes, I’ve heard talk of such work/research being done. Being able to put low-latency jobs on the GPU is of tremendous interest to many industries. In the future, I’m guessing GPUs will move more and more into the realm of FPGAs (in some areas they already have).

This is of course very loose speculation.

What kind of latency requirements do you have in HFT? Why would this not already be fulfilled using a PCI-Express bus?

Launching a CUDA kernel takes a few microseconds. Getting data onto the GPU can be very quick as well, but it depends a lot on how much data you need to implement your business logic.
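For anyone who wants to check the launch figure on their own hardware, here is a minimal sketch that times an empty kernel. The kernel name and the iteration count are made up for illustration, and the result will vary with GPU, driver, and OS, so treat it as a rough measurement rather than a benchmark.

```cuda
#include <chrono>
#include <cstdio>
#include <cuda_runtime.h>

// Empty kernel: any time measured is launch and synchronization overhead.
__global__ void emptyKernel() {}

int main() {
    const int iterations = 1000;  // arbitrary sample count (assumption)

    // Warm up so one-time context creation is not included in the timing.
    emptyKernel<<<1, 1>>>();
    if (cudaDeviceSynchronize() != cudaSuccess) {
        std::fprintf(stderr, "CUDA warm-up failed\n");
        return 1;
    }

    auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < iterations; ++i) {
        emptyKernel<<<1, 1>>>();
        cudaDeviceSynchronize();  // wait, so each launch is timed round trip
    }
    auto stop = std::chrono::steady_clock::now();

    double totalUs =
        std::chrono::duration<double, std::micro>(stop - start).count();
    std::printf("average launch + sync: %.2f us\n", totalUs / iterations);
    return 0;
}
```

This measures launch plus synchronization from the host side, which is the number that matters if the CPU has to wait for the result before acting on it.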


I was speaking more generically re: the hardware issue. I don’t do HFT myself, which requires streaming time-series data and decision-support systems that operate on millisecond and microsecond time scales, and perhaps picoseconds someday. Perhaps others more knowledgeable will contribute (or not, given competitive issues). V.

I don’t know anything about the HFT field (and I’m surprised that the latency of a PCI-Express bus is at all comparable to the time scales of market pricing), but I do know that one company is/was selling an FPGA that plugs into a CPU socket on a dual-socket Opteron motherboard and communicates with the CPU over the HyperTransport link. That’s about as high-bandwidth and low-latency as you can get without putting the GPU on the same die as the CPU (see AMD in 2011).

I have not heard of anyone even thinking of putting a CUDA device into a CPU socket, but that’s about the only option left. (Or hope that PCI-Express 3.0 has lower latency than 2.0.) One disadvantage of using a CPU socket for a CUDA device is the loss of the wide, dedicated memory bus. CUDA devices generally have 10-20x the memory bandwidth of the CPU, but that advantage would be gone if the device were put into a standard CPU socket.

Here is an example of a socket 940 FPGA coprocessor:…6&Itemid=60

Sorry if this is off-topic…

I know you’re not the one making this claim, but I’m pretty sure this isn’t really a hardware issue as much as an issue with architecting a solution that gets away from this traditional approach:


copy data to device

launch kernel

wait for kernel completion

copy results from device

Now that CUDA supports page-locked/write-combining/mapped/overlapped memory and 2.0 supports concurrent data transfers you can quickly imagine some architectures and long-running kernels that might help you avoid the approach listed above and drive compute latencies downward.
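As a rough illustration of one of those options, here is a minimal sketch using mapped (zero-copy) page-locked memory, so the kernel reads its input and writes its result directly in host memory and the two explicit copy steps disappear. The kernel scalePrices, the buffer size, and the scale factor are hypothetical placeholders, not anything from a real trading system, and whether this actually lowers latency depends on data size and access pattern.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Bail out of main with a message if a CUDA runtime call fails.
#define CHECK(call)                                                   \
    do {                                                              \
        cudaError_t err_ = (call);                                    \
        if (err_ != cudaSuccess) {                                    \
            std::fprintf(stderr, "%s failed: %s\n", #call,            \
                         cudaGetErrorString(err_));                   \
            return 1;                                                 \
        }                                                             \
    } while (0)

// Hypothetical stand-in for business logic: scale each input value.
__global__ void scalePrices(const float* in, float* out, int n, float factor) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i] * factor;
}

int main() {
    const int n = 1024;           // illustrative buffer size (assumption)
    const float factor = 2.0f;    // illustrative scale factor (assumption)

    // Must be set before the CUDA context is created.
    CHECK(cudaSetDeviceFlags(cudaDeviceMapHost));

    // Page-locked host buffers mapped into the device address space.
    // The input is write-combined because the host only writes to it.
    float *hIn = nullptr, *hOut = nullptr;
    CHECK(cudaHostAlloc((void**)&hIn, n * sizeof(float),
                        cudaHostAllocMapped | cudaHostAllocWriteCombined));
    CHECK(cudaHostAlloc((void**)&hOut, n * sizeof(float), cudaHostAllocMapped));

    for (int i = 0; i < n; ++i) hIn[i] = static_cast<float>(i);

    // Device-side aliases of the same host memory: no cudaMemcpy needed.
    float *dIn = nullptr, *dOut = nullptr;
    CHECK(cudaHostGetDevicePointer((void**)&dIn, hIn, 0));
    CHECK(cudaHostGetDevicePointer((void**)&dOut, hOut, 0));

    scalePrices<<<(n + 255) / 256, 256>>>(dIn, dOut, n, factor);
    CHECK(cudaGetLastError());
    CHECK(cudaDeviceSynchronize());  // results are visible to the host after this

    // Check the results that the kernel wrote straight into host memory.
    int mismatches = 0;
    for (int i = 0; i < n; ++i) {
        if (hOut[i] != static_cast<float>(i) * factor) ++mismatches;
    }
    std::printf("mismatches: %d\n", mismatches);

    CHECK(cudaFreeHost(hIn));
    CHECK(cudaFreeHost(hOut));
    return mismatches == 0 ? 0 : 1;
}
```

This still launches one kernel per batch and waits for it. A long-running kernel that polls a mapped buffer would remove the launch as well, but it needs careful memory-ordering work and is left out of this sketch.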

Having worked with both high-speed market data feeds and the entire trading workflow, I can think of GPU applications ranging from mundane processing tasks all the way up to HFT. I’m also sure that there are plenty of people who have already done this and aren’t talking about it.

A good PDF here describing PCIe latencies:…50226-Gupta.pdf

Note that light travels ~1000 feet in 1 microsecond (c ≈ 3×10^8 m/s, so 1 µs ≈ 300 m ≈ 984 ft). Signals on Ethernet travel slower than c, and switches and network stacks all add latency on top of that. There are thousands of discussions on latency out there.

Rackable SGI were making noises about a year ago that their new ‘UltraViolet’ systems would enable accelerators to be first class citizens on the NUMA backbone. I’m not sure how that wound up, though.

WSJ: Trading Firms Turn To Videogame Chips To Get Even Faster : link

It’d be an interesting idea to put a DisplayPort 1.2 connection on Teslas and/or Quadros… since DP 1.2 supports Ethernet! Thus potentially giving a direct data link straight into the GPU.

I doubt we’ll see such Ethernet support, though it’s fun to think about. I bet it’d be a real challenge to implement effectively, especially exposing the hardware via a driver/firmware and then to CUDA. But it’s an interesting idea nonetheless, since it’d be an end-run around CPU/PCIe connection latencies and bandwidth. It would instead be limited by Ethernet speed, which is quite slow and laggy by itself, but it’s still a direct output channel with no middlemen.

Cram a 64-bit x86 Atom-like supervisory core onto that GPU (return of the Cell processor!), and now you have a totally autonomous device with crazy huge memory bandwidth and an Ethernet jack. :)

I think most HFT systems use in-memory data grid (IMDG) middleware such as GemFire or GigaSpaces to reduce database lookups and get pub-sub services, among other things. I don’t see why we can’t integrate GPUs into such applications. It should work, and be much, much faster.

I’d like to see the GeForce/Quadro/Tesla drivers eventually support direct transfers from PCIe endpoint-to-endpoint.

There have been many threads about this in the forums and I think the only “pure” example of such a transfer is NVIDIA’s Quadro SDI Capture card (specs here).

Announcements by Mellanox imply that their InfiniBand driver will minimize use of the CPU by using pinned memory. That’s not quite a direct endpoint-to-endpoint transfer, but it’s probably the best you can do right now.

I think there was a statement elsewhere in these forums that said the hardware can do it but the driver work is not currently a priority – fair enough. Pinned memory techniques are probably fine for now.

Hi, allanmac,

I want to test endpoint-to-endpoint transfers on two Jetson Xavier boards, but I don’t know how to configure this in software. Could you give me some information?