Computing design question: CUDA, streaming

Dear all,

I have a general question about how to design my application. I have read the CUDA documentation but still don’t know what I should look into. I would really appreciate it if someone could shed some light on it.

I want to do some real-time analytics on stocks, say 100 stocks. I have a real-time market data feed that streams updated market prices. What I want to do is:

  1. Pre-allocate a memory block for each stock on the CUDA card and keep it allocated during the day.
  2. When new data comes in, directly update the corresponding memory on the CUDA card.
  3. After the update, issue a signal or trigger an event to start the analytical calculation.
  4. When the calculation is done, write the result back to CPU memory.

Here are my questions:

  1. What is the most efficient way to stream data from CPU memory to GPU memory? Because I want it in real time, copying a memory snapshot from the CPU to the GPU every second is not acceptable.
  2. I may need to allocate a memory block for each of the 100 stocks on both the CPU and the GPU. How do I map each CPU memory cell to the corresponding GPU memory cell?
  3. How do I trigger the analytics calculation when new data arrives on the CUDA card?

I am using a Tesla C1060 with CUDA 3.2 on Windows XP.

Thank you very much for any suggestions.

I would suggest keeping a kernel running continuously on your GPU that exchanges data in real time with your CPU through pinned mapped memory. Create two independent circular queues: one for the host to write stock data and requests, and one for the GPU to write results. No lock is involved in either case, because in each queue the write pointer and the read pointer are each maintained by only one side (CPU or GPU), never both.


The host writes requests and/or stock data to the input queue.

The GPU kernel polls the input queue continuously until data is available (typically when write_ptr != read_ptr).

The GPU processes the data internally.

The GPU writes the result to the output queue.

While the GPU is computing, the CPU can add the next data to the input queue and read results from the output queue as soon as they are available.
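
As a rough illustration of the flow above, here is a minimal host-only sketch in plain C. It is not CUDA code: worker_poll stands in for the persistent kernel, and in a real setup the two queues would presumably live in pinned mapped memory (for example allocated with cudaHostAlloc and the cudaHostAllocMapped flag), with the indices read as volatile. The Tick and Result structs, QUEUE_LEN, the function names, and the "price * 2" analytics are all hypothetical placeholders, not something from the posts above.

```c
#include <assert.h>
#include <stddef.h>

#define QUEUE_LEN 8  /* hypothetical size; one slot always stays empty */

/* Hypothetical message types, chosen only for illustration. */
typedef struct { int stock_id; float price; } Tick;
typedef struct { int stock_id; float value; } Result;

typedef struct { Tick   slot[QUEUE_LEN]; unsigned read, write; } InQueue;
typedef struct { Result slot[QUEUE_LEN]; unsigned read, write; } OutQueue;

/* Host side: publish one tick. Only the host advances in->write.
 * Returns 1 on success, 0 if the input queue is full. */
static int host_submit(InQueue *in, Tick t) {
    unsigned next = (in->write + 1) % QUEUE_LEN;
    if (next == in->read) return 0;
    in->slot[in->write] = t;
    in->write = next;
    return 1;
}

/* Stand-in for the persistent kernel: drain the input queue, compute,
 * and write results. Only this side advances in->read and out->write.
 * Returns the number of ticks processed in this pass. */
static int worker_poll(InQueue *in, OutQueue *out) {
    int done = 0;
    while (in->read != in->write) {
        unsigned next = (out->write + 1) % QUEUE_LEN;
        if (next == out->read) break;               /* output full, retry later */
        Tick t = in->slot[in->read];
        Result r = { t.stock_id, t.price * 2.0f };  /* placeholder analytics */
        out->slot[out->write] = r;
        out->write = next;
        in->read = (in->read + 1) % QUEUE_LEN;
        done++;
    }
    return done;
}

/* Host side: fetch one result. Only the host advances out->read.
 * Returns 1 if a result was copied into *r, 0 if none is ready. */
static int host_collect(OutQueue *out, Result *r) {
    if (out->read == out->write) return 0;
    *r = out->slot[out->read];
    out->read = (out->read + 1) % QUEUE_LEN;
    return 1;
}
```

Each index has exactly one writer: the host owns in->write and out->read, and the worker owns in->read and out->write. That is what lets the scheme work without a lock.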

Parallelis, thank you for your reply.

Do you know of any example code for this input/output queue design?

It’s pretty basic (except for the pinned mapped memory part, naturally). Here’s an example in pseudo-C, where whatever is the struct you want to read or write, preferably one that fits in a single 64-bit or 128-bit write (i.e. using a vector type).

whatever *queue;  // buffer of queue_length elements

int queue_length = n, queue_read = 0, queue_write = 0;

To read (element):

if (queue_read != queue_write) {
  whatever element = queue[queue_read];
  // Only the reader ever modifies queue_read
  queue_read = (queue_read + 1) % queue_length;
} else {
  // Queue is empty!
}

To write (element):

// One slot always stays empty, so the queue holds queue_length - 1 elements
if (queue_read != (queue_write + 1) % queue_length) {
  queue[queue_write] = element;
  // Only the writer ever modifies queue_write
  queue_write = (queue_write + 1) % queue_length;
} else {
  // Queue is full, we will have to wait!
}
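
One thing the pseudo-C leaves out is ordering: the element has to be visible to the other side before the index that publishes it. Below is a minimal host-only sketch of the same queue in C11, using stdatomic.h acquire/release operations to express that. It assumes one producer and one consumer, and the names Ring, RING_LEN, ring_init, ring_push and ring_pop are hypothetical. It runs on the CPU only. Between a CPU and a GPU kernel the ordering would have to come from CUDA's own mechanisms (volatile accesses and memory fences), which this sketch does not cover.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

#define RING_LEN 4  /* hypothetical size; holds RING_LEN - 1 elements */

typedef struct {
    uint64_t    slot[RING_LEN];  /* one 64-bit word per element */
    atomic_uint read;            /* advanced only by the consumer */
    atomic_uint write;           /* advanced only by the producer */
} Ring;

static void ring_init(Ring *q) {
    atomic_init(&q->read, 0);
    atomic_init(&q->write, 0);
}

/* Producer side. Returns 1 on success, 0 if the queue is full. */
static int ring_push(Ring *q, uint64_t v) {
    unsigned w = atomic_load_explicit(&q->write, memory_order_relaxed);
    unsigned next = (w + 1) % RING_LEN;
    if (next == atomic_load_explicit(&q->read, memory_order_acquire))
        return 0;
    q->slot[w] = v;
    /* Release: the slot write above becomes visible before the new index. */
    atomic_store_explicit(&q->write, next, memory_order_release);
    return 1;
}

/* Consumer side. Returns 1 and fills *v on success, 0 if the queue is empty. */
static int ring_pop(Ring *q, uint64_t *v) {
    unsigned r = atomic_load_explicit(&q->read, memory_order_relaxed);
    /* Acquire: pairs with the producer's release store on write. */
    if (r == atomic_load_explicit(&q->write, memory_order_acquire))
        return 0;
    *v = q->slot[r];
    /* Release: the slot is handed back only after it has been read. */
    atomic_store_explicit(&q->read, (r + 1) % RING_LEN, memory_order_release);
    return 1;
}
```

A caller would typically spin or back off when ring_push returns 0 (queue full) or ring_pop returns 0 (queue empty), matching the two else branches in the pseudo-C.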