CPU vs GPU compute accuracy + sync threads

I have been trying to develop an embarrassingly parallel system that was first used on historical sensor data and is now being applied to some financial data. The idea of putting it on the GPU is that I can test a lot of permutations/combinations at once (versus the 16-core CPU I have), i.e. use the same historical data and run one small backtest instance per thread. The C++ code I have currently works fully on the CPU (with an extensive test suite).

I also understand overfitting etc., but I have been puzzled as to why I can’t make this system work on the GPU. My idea was to just call the C++ code on the device (I am not allocating memory, just tracking the current profit/loss as I iterate through all the time-series data, and I use no external libraries, so it really should be a simple drop-in). The data I currently have:

  • 5 million rows of OHLC data (float32)
  • A C++ class “MyTrader” which simply iterates through the rows and tracks profit/loss based on signals. This is fully working on the CPU side. The class exists entirely on the stack, is around ~290 bytes in size, and does no new allocations. It does not nest very deep (maybe ~5 nested calls), so I don’t think I’m blowing up the stack.
  • It uses no local memory or fancy CUDA features. It probably isn’t the theoretically fastest implementation, but the idea is for each thread to compute one permutation for me (roughly as sketched below).
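
Conceptually the mapping is one parameter combination per thread, each walking the same read-only OHLC arrays. Very roughly (illustrative names only; the MyTrader constructor and the on_row()/profit() members here are placeholders, not my real API):

  // Rough sketch, not the real kernel: one thread = one parameter combination,
  // all threads reading the same global OHLC/timestamp arrays.
  __global__ void backtest_all(const float *ohlc,            // 4 floats per row (O, H, L, C)
                               const long long *timestamps,  // int64 nanoseconds
                               unsigned int n_rows,
                               const float *param_a,         // one entry per variation
                               const float *param_b,
                               unsigned int n_params,
                               float *out_profit)
  {
    const unsigned int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid >= n_params)
      return;

    MyTrader trader(param_a[tid], param_b[tid]);    // ~290 bytes, automatic storage
    for (unsigned int i = 0; i < n_rows; ++i)
      trader.on_row(&ohlc[4 * i], timestamps[i]);   // same data for every thread

    out_profit[tid] = trader.profit();
  }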

To adapt the class, I created a helper macro to modify the normal C++ class so it can run on the device, and added it to my class functions, i.e.:

#ifdef __CUDACC__
#define FUNC_PREFIX __device__
#else
#define FUNC_PREFIX
#endif

FUNC_PREFIX class TradeArrayMetrics
{
public:
  FUNC_PREFIX TradeArrayMetrics(const float dollars_per_point, float const fees_dollars_per_trade, float const points_slippage)
  {

    _multiplier = dollars_per_point;
    _fees_dollars_per_trade = fees_dollars_per_trade;
    _points_slippage = points_slippage;
  }

  // Pass r-value to prevent extra copy
  FUNC_PREFIX virtual void push_back(TRADE &&item)
  {
    // TODO: Can I immediately update that item?
    this->update_metrics(_current_trade); // Update metrics first if closed
    _current_trade = item;
    _counter++;
  }
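
A common variant of this macro (just a possibility, not what I currently use) also adds __host__ so the exact same member functions keep compiling and running under the CPU test suite:

  #ifdef __CUDACC__
  // When compiled by nvcc, mark each decorated function for both host and device.
  #define FUNC_PREFIX __host__ __device__
  #else
  #define FUNC_PREFIX
  #endif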

Problem I can’t figure out: when I run the computation, there is one section of code where I need to call __syncthreads() or printf(...) when running on the GPU. I don’t understand why this is needed (I am using no shared memory, only global memory). When I call __syncthreads() (or even a simple printf("") statement), the output is correct. When I remove the __syncthreads() or the printf statement, the result is wrong. I think this is one of those Heisenbugs, but I’m entirely confused about what I am missing and how to debug this issue. The only modification I made to the C++ functions is the FUNC_PREFIX macro to make the code execute on the GPU.

The offending code section is:

  /* Here we actually walk the OHLC */
  FUNC_PREFIX void next()
  {
    if (this->_active_positions.size() > 0)
    {
      TRADE &cur_pos = this->_active_positions.back();
      if (cur_pos.trade_state == TRADE_STATE::DONE)
        return;

      cur_pos.update(this->_current_ohlc, this->_current_time);
      cuda_SYNCTHREADS(); // This or printf, otherwise calculation is wrong?
    }
  }

Note: the .back() function does not come from a dynamically allocated structure (i.e. not a std::vector); it is just:

  FUNC_PREFIX TRADE &back()
  {
    return _current_trade;
  }

For debugging purposes, I am launching with threads_per_block=1 and blocks_per_grid=1 (so there are not multiple threads running), and I verify inside the kernel that only the first thread does any work.

Below is the PyCUDA code used to upload the data and launch the kernel on the GPU. I checked the dtypes of the arrays and all the signatures appear to match.

    entry_limit_max_wait_nanoseconds_gpu = gpuarray.to_gpu(df_parameter_variations['entry_limit_max_wait_seconds'].to_numpy().astype(np.int64) * np.int64(1_000_000_000))
    start_time_gpu = gpuarray.to_gpu(df_parameter_variations['start_time'].to_numpy().astype(np.int16))
    end_time_gpu = gpuarray.to_gpu(df_parameter_variations['end_time'].to_numpy().astype(np.int16))
    parameter_length = np.uint32(len(df_parameter_variations))
    print(f'{end_time_gpu} {end_time_gpu.dtype}')
    print(f'{entry_limit_max_wait_nanoseconds_gpu} {entry_limit_max_wait_nanoseconds_gpu.dtype}')
    print(len(df_parameter_variations))
    # autopep8: on

    # Define how it will be organized on the GPU
    threads_per_block = 1
    blocks_per_grid = 1  # for a full run: math.ceil(len(df_parameter_variations) / threads_per_block)

    num_elements = threads_per_block * blocks_per_grid

    # Declare output arrays
    out_profit = np.empty(len(df_parameter_variations), dtype=np.float32)
    out_drawdown = np.empty(len(df_parameter_variations), dtype=np.float32)
    out_numtrades = np.empty(len(df_parameter_variations), dtype=np.uint32)
    out_tradefees = np.empty(len(df_parameter_variations), dtype=np.float32)

    print(f"copied arrays to device, testing: {
          len(df_parameter_variations)} variations on {num_elements} threads")

    start_time = time.perf_counter()
    backtester_main(ohlc_gpu,
                    timestamp_gpu,
                    signal_gpu,
                    np.uint32(len(timestamp_np)),
                    entry_limit_max_wait_nanoseconds_gpu,
                    np.float32(20.0),  # DOLLARS_PER_POINT,
                    np.float32(1.5),  # FEES_DOLLARS_PER_TRADE
                    np.float32(1.0),  # POINTS_SLIPPAGE
                    start_time_gpu,
                    end_time_gpu,
                    parameter_length,
                    driver.Out(out_profit),
                    driver.Out(out_drawdown),
                    driver.Out(out_numtrades),
                    driver.Out(out_tradefees),
                    block=(threads_per_block, 1, 1), grid=(blocks_per_grid,))
    end_time = time.perf_counter()

__syncthreads() and printf() can both affect the visibility of global data. If you have any sharing of global space between threads (one thread writing to a location that another thread might read; not something I would expect to see when “embarrassingly parallel” is used as a descriptor), then that could be a concern. I wouldn’t expect this interaction with only a single thread, so another possibility is that they affect the compiler’s code generation, i.e. the order/sequence of machine instructions.
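
One way to probe the code-generation angle, offered only as a sketch (the buffer name is made up, not taken from your code), is to mark the global pointer the kernel writes through as volatile, which forces every load/store to be a real memory instruction rather than a value cached in a register:

  // Hypothetical sketch: "results" stands in for whatever global buffer the kernel
  // actually writes. With a volatile pointer the compiler must emit a real store
  // for every write, so reordering/caching effects can be ruled in or out.
  __global__ void backtester_probe(volatile float *results /*, ...other args... */)
  {
    const unsigned int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid != 0)
      return; // single-thread debug launch, as in your setup

    results[tid] = 0.0f; // every such write now reaches device memory immediately
  }

Building with nvcc’s -G flag (device debug, optimizations off) is another quick check of whether the symptom tracks the optimizer.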

If printf() ruins your debugging prospect, then you’ll need to find a way to debug without it. One possible method would be, instead of printf(), to write “breadcrumbs” to global space and then read them out from host code after kernel completion to decipher where things have become upset. Of course that may also ruin your prospect, I’m not sure. Another common debugging suggestion is to reduce the complexity of the code until the observation disappears. This may give some ideas by itself, or could make further debugging easier.
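
A minimal sketch of the breadcrumb idea, with made-up names (debug_out and the arguments here are not from your code):

  // "debug_out" is a small extra global array, passed from PyCUDA the same way as
  // the existing output arrays (e.g. with driver.Out), and printed after the launch.
  __global__ void backtest_kernel(const float *ohlc, unsigned int n_rows,
                                  float *out_profit, float *debug_out)
  {
    const unsigned int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid != 0)
      return; // single-thread debug launch

    float profit = 0.0f;
    for (unsigned int i = 0; i < n_rows; ++i)
    {
      // ... per-row trading logic ...
      if (i == n_rows / 2)
        debug_out[0] = profit; // breadcrumb: state halfway through the walk
    }

    debug_out[1] = profit;    // breadcrumb: state just before the final write-out
    out_profit[tid] = profit;
  }

Reading the breadcrumbs back on the host shows which checkpoints were reached and what the intermediate values were, without perturbing the kernel the way printf() apparently does.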

You may find some other debugging ideas in unit 12 of this online series. I’m quite confident all the tools and methods described there work in a PyCUDA environment, perhaps with some effort. I believe PyCUDA does the rigorous error checking for you, but the compute-sanitizer suggestion (along with its sub-tools), as well as the debugger, may be things to look at.

I sometimes suggest to people looking for debugging assistance that providing a complete code, which someone else could actually run, test, and observe the issue with, may help. However, this code already looks complicated enough that I don’t know whether anyone would have the appetite. The suggestion about simplifying your code could also serve this purpose.

Perhaps unrelated:

When you say something exists on the stack, do you mean it exists as a local (= scoped = automatic) variable, as opposed to on the heap?

In CUDA one tries to keep data in registers or at least shared memory.

The actual stack would be in CUDA local memory, which physically resides in device memory (the same as global memory) and is quite slow if not cached well.
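
As an illustration only (the names are invented): small scalar members of such a class usually end up in registers, whereas per-thread arrays indexed with runtime values, or anything that does not fit, spill to local memory:

  __device__ float walk(const float *ohlc, unsigned int n)
  {
    float profit = 0.0f; // scalar automatic variable: normally kept in a register

    float window[64];    // per-thread array indexed dynamically: typically placed
                         // in local memory (device DRAM, cached in L1/L2)
    for (unsigned int i = 0; i < n; ++i)
    {
      window[i % 64] = ohlc[i];
      profit += window[i % 64];
    }
    return profit;
  }

Compiling with nvcc -Xptxas -v then reports, per kernel, the register count, stack frame size and spill loads/stores, so you can see whether a ~290-byte class actually stays in registers or goes through local memory.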

Hi, yes, everything is scoped/local and automatically allocated, as a C++ class would be. The traditional stack paradigm on a CPU might not carry over exactly to GPUs? I would have thought this would be more of an issue with multiple threads running concurrently, but I limited it to a single thread and a single block, and the input/output arrays are as simple as they come (so no race conditions).

While the memory footprint of the active on-device computation in a thread is small (versus the access to the large global arrays), I’m wondering if something might be going on with how it allocates memory to registers that is conceptually different from a CPU?

Just bonking my head against the wall trying to think through what is going on here.

I’m also taking the other post’s suggestions and seeing if I can narrow down and reproduce this problem.

Hello! I think this thread can be closed. I was on CUDA 12.0 (I think PyCUDA somehow installed that based on my nvcc version). I removed it, went to 12.8, and all of the calculations work now.

You can close this, but I did add compute-sanitizer to my pipeline. I think the video series you linked is incredibly valuable, and I have at least added some initial automated checking during the build to make sure the sanitizer shows no potential errors.
