Circumventing the PCI-E BUS-- Finance Application in High Frequency Trading

vfulco · April 21, 2010, 12:07pm

In a seminar I attended last week, there was a short discussion of why GPUs cannot be used for real-time analytical apps in HFT. I was under the impression that commercial apps already exist for offline risk management and pricing. As many know, the big limitation for now is the latency of the CPU<–>GPU link. Are there in fact software/hardware workarounds for this problem on the drawing board, even if years away? Moreover, is anyone aware of existing apps they could reference? I don’t know much about FPGAs: why have they been incorporated into HF work despite similar barriers (or so I have heard), while GPUs supposedly cannot be?

TIA, V.

Jimmy_Pettersson · April 21, 2010, 12:35pm

“Are there in fact software/hardware workarounds for this problem on the drawing board, even if years away?”

Yes, I’ve heard talk of such work/research being done. Being able to put low-latency jobs on the GPU is of tremendous interest to many industries. I’m guessing that in the future GPUs will move more and more into the realm of FPGAs (in some areas they already have).

This is of course very loose speculation.

cbuchner1 · April 21, 2010, 1:53pm

What kind of latency requirements do you have in HFT? Why would this not already be fulfilled using a PCI-Express bus?

Launching a CUDA kernel takes a few microseconds. Getting data onto the GPU can be very quick as well, but it depends a lot on how much data you need to implement your business logic.

Christian

vfulco · April 21, 2010, 2:23pm

I was speaking more generically about the hardware issue; I don’t do HFT myself. HFT requires streaming time-series data and decision support systems that work on millisecond and microsecond time scales, and perhaps picoseconds someday. Perhaps others more knowledgeable will contribute (or not, given competitive issues). V.

seibert · April 21, 2010, 2:27pm

I don’t know anything about the HFT field (and I’m surprised that the latency of a PCI-Express bus is at all comparable to the time scales of market pricing), but I do know that one company is/was selling an FPGA that plugs into a CPU socket on a dual-socket Opteron motherboard and communicates with the CPU over the HyperTransport link. That’s about as high bandwidth and low latency as you can get without putting the GPU on the same die as the CPU (see AMD in 2011).

I have not heard of anyone even thinking of putting a CUDA device into a CPU socket, but that’s about the only option left. (Or hope that PCI-Express 3.0 has lower latency than 2.0.) One disadvantage of using a CPU socket for a CUDA device is the loss of the wide, dedicated memory bus. CUDA devices generally have 10-20x the memory bandwidth of the CPU, but that advantage would be gone if the device were put into a standard CPU socket.

seibert · April 21, 2010, 3:13pm

Here is an example of a socket 940 FPGA coprocessor:

http://old.xtremedatainc.com/index.php?option=com_content&view=article&id=106&Itemid=60

allanmac · April 21, 2010, 3:21pm

Sorry if this is off-topic…

I know you’re not the one making this claim, but I’m pretty sure this isn’t really a hardware issue as much as an issue with architecting a solution that gets away from this traditional approach:

1. copy data to device
2. launch kernel
3. wait for kernel completion
4. copy results from device

Now that CUDA supports page-locked/write-combining/mapped/overlapped memory, and 2.0 supports concurrent data transfers, you can quickly imagine architectures built around long-running kernels that avoid the approach listed above and drive compute latencies downward.

Having worked with both high-speed market data feeds and the entire trading workflow, I can think of GPU applications ranging from mundane processing tasks all the way up to HFT. I’m also sure that there are plenty of people who have already done this and aren’t talking about it.

A good PDF here describing PCIe latencies: http://www.dell.com/downloads/global/power…50226-Gupta.pdf

Note that 1 microsecond at the speed of light is ~1000 feet. Ethernet will be slower than c, switches add latency, network stacks add latency, etc. etc. There are thousands of discussions on latency out there.
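As a rough check of the "~1000 feet" figure above, here is a minimal Python sketch of the arithmetic. The function name `distance_feet` and the `velocity_factor` parameter are made up for illustration, and the 0.66 value used for a cable is only an assumed, typical-looking number, not a measurement from this thread.

```python
# Back-of-the-envelope check: how far does a signal travel in one microsecond?

SPEED_OF_LIGHT_M_PER_S = 299_792_458  # exact by SI definition
METRES_PER_FOOT = 0.3048              # exact (international foot)


def distance_feet(seconds: float, velocity_factor: float = 1.0) -> float:
    """Distance in feet covered in `seconds` at `velocity_factor` times c."""
    return SPEED_OF_LIGHT_M_PER_S * velocity_factor * seconds / METRES_PER_FOOT


vacuum_ft = distance_feet(1e-6)
# ASSUMPTION: 0.66 is an illustrative velocity factor for a cable or fibre,
# not a figure taken from this thread.
cable_ft = distance_feet(1e-6, velocity_factor=0.66)

print(f"vacuum: {vacuum_ft:.0f} ft per microsecond")
print(f"cable (assumed 0.66 c): {cable_ft:.0f} ft per microsecond")
```

In vacuum this works out to a little under 1000 feet per microsecond, so the rule of thumb holds; any real link covers less distance in the same time, before switch and network-stack delays are even counted.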

YDD · April 21, 2010, 3:27pm

Rackable SGI were making noises about a year ago that their new ‘UltraViolet’ systems would enable accelerators to be first class citizens on the NUMA backbone. I’m not sure how that wound up, though.

allanmac · April 28, 2010, 1:06am

WSJ: Trading Firms Turn To Videogame Chips To Get Even Faster

SPWorley · April 28, 2010, 2:36am

It’d be an interesting idea to put a DisplayPort 1.2 connection on Teslas and/or Quadros… since DP 1.2 supports Ethernet! Thus potentially giving a direct data link straight into the GPU.

I doubt we’ll see such Ethernet support, though it’s fun to think about. I bet it’d be a real challenge to implement effectively, especially exposing the hardware through a driver/firmware and then to CUDA. But it’s an interesting idea nonetheless, since it’d be an end run around CPU/PCIe connection latencies and bandwidth (and instead be limited by Ethernet speed, which is quite slow and laggy by itself, but it’s still a direct output channel with no middlemen).

seibert · April 28, 2010, 3:10am

Cram a 64-bit x86 Atom-like supervisory core onto that GPU (return of the Cell processor!), and now you have a totally autonomous device with crazy huge memory bandwidth and an Ethernet jack. :)

Sarnath · April 28, 2010, 5:08am

I think most HFT systems use in-memory data grid (IMDG) middleware such as GemFire or GigaSpaces to reduce database lookups and to get pub-sub services, among other things. I don’t see why we can’t integrate GPUs into such applications. It should work, and much, much faster.

allanmac · April 28, 2010, 12:57pm

I’d like to see the GeForce/Quadro/Tesla drivers eventually support direct transfers from PCIe endpoint-to-endpoint.

There have been many threads about this in the forums, and I think the only “pure” example of such a transfer is NVIDIA’s Quadro SDI Capture card.

Announcements by Mellanox imply that their InfiniBand driver will minimize use of the CPU by using pinned memory. That is not quite a direct endpoint-to-endpoint transfer, but it’s probably the best you can do right now.

I think there was a statement elsewhere in these forums that said the hardware can do it but the driver work is not currently a priority – fair enough. Pinned memory techniques are probably fine for now.

william.fang · June 24, 2021, 1:05pm

Hi, allanmac,

I want to test endpoint-to-endpoint transfers on two Jetson Xavier boards, but I don’t know how to configure this in software. Could you give me some information?
