I have been developing an embarrassingly parallel system that was first used on historical sensor data and is now being applied to some financial data. The idea of putting it on the GPU is that I can test many permutations/combinations at once (versus the 16-core CPU I have), i.e. use the same historical data and just run one small backtest instance per thread. The C++ code currently works fully on the CPU (with an extensive test suite).
I also understand overfitting etc., but I have been puzzled as to why I can't make this system work on the GPU. My idea was to just call the C++ code on the device (I am not allocating memory, just tracking the current profit/loss as I iterate through the time-series data, and I use no external libraries, so it really should be a simple drop-in). The data I currently have:
- 5 million rows of OHLC (float32 data)
- A C++ class “MyTrader” which simply iterates through the rows and tracks profit/loss based on signals. This is fully working on the CPU side. The class lives entirely on the stack, is ~290 bytes in size and performs no heap allocations. Its call chain is not deep (maybe ~5 nested calls), so I don't think I'm blowing up the stack.
- It uses no shared memory or other CUDA-specific features. It probably isn't the theoretically fastest implementation, but the idea is for each thread to compute one parameter permutation for me.
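Conceptually, the mapping I am aiming for is: every thread reads the same read-only OHLC series from global memory and keeps its own running profit/loss in local variables, with one parameter permutation per thread. A toy version of that pattern (placeholder names and trivial logic, not my actual trader) looks like this:

__global__ void toy_backtest(const float *close,          // one close price per row, read-only, shared by all threads
                             unsigned int num_rows,
                             const float *position_size,  // one parameter value per permutation
                             unsigned int num_permutations,
                             float *out_profit)           // one output slot per permutation
{
    const unsigned int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx >= num_permutations)
        return;                          // surplus threads do nothing

    float profit = 0.0f;                 // per-thread state only, no shared memory
    for (unsigned int row = 1; row < num_rows; ++row)
        profit += position_size[idx] * (close[row] - close[row - 1]);

    out_profit[idx] = profit;            // each thread writes only its own element
}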
To adapt the class for the device, I created a small macro that marks the member functions as __device__ when compiled with nvcc, and prefixed my class functions with it, i.e.
#ifdef __CUDACC__
#define FUNC_PREFIX __device__
#else
#define FUNC_PREFIX
#endif
FUNC_PREFIX class TradeArrayMetrics
{
public:
    FUNC_PREFIX TradeArrayMetrics(const float dollars_per_point, float const fees_dollars_per_trade, float const points_slippage)
    {
        _multiplier = dollars_per_point;
        _fees_dollars_per_trade = fees_dollars_per_trade;
        _points_slippage = points_slippage;
    }

    // Pass r-value to prevent extra copy
    FUNC_PREFIX virtual void push_back(TRADE &&item)
    {
        // TODO: Can I immediately update that item?
        this->update_metrics(_current_trade); // Update metrics first if closed
        _current_trade = item;
        _counter++;
    }
    // ... (rest of the class unchanged, omitted for brevity)
};
The problem I can't figure out: when I run the computation on the GPU, there is one section of code where I need to call __syncthreads() or printf(...) for the output to be correct. I don't understand why this is needed, as I am not using any shared memory, only global memory. When I call __syncthreads() (or even a simple printf("") statement), the output is correct. When I remove the __syncthreads() or the printf, the result is wrong. It feels like one of those Heisenbugs, but I'm entirely confused about what I am missing and how to debug the issue. The only modification I made to the C++ code is adding the FUNC_PREFIX macro so it can execute on the GPU.
The offending code section is:
/* Here we actually walk the OHLC */
FUNC_PREFIX void next()
{
    if (this->_active_positions.size() > 0)
    {
        TRADE &cur_pos = this->_active_positions.back();
        if (cur_pos.trade_state == TRADE_STATE::DONE)
            return;

        cur_pos.update(this->_current_ohlc, this->_current_time);
        cuda_SYNCTHREADS(); // This or printf, otherwise calculation is wrong?
    }
Note: .back() does not come from a dynamically allocated structure (i.e. it is not a std::vector); it is just:
FUNC_PREFIX TRADE &back()
{
    return _current_trade;
}
For debugging purposes I am launching with threads_per_block = 1 and blocks_per_grid = 1 (so there is only a single thread running), and I verify in the kernel that only the first thread does any work. Below is the PyCUDA code I use to upload the data and launch the kernel that calls the C++ code. I have checked the dtypes of all the arrays and the signatures appear to match.
entry_limit_max_wait_nanoseconds_gpu = gpuarray.to_gpu(df_parameter_variations['entry_limit_max_wait_seconds'].to_numpy().astype(np.int64) * np.int64(1_000_000_000))
start_time_gpu = gpuarray.to_gpu(df_parameter_variations['start_time'].to_numpy().astype(np.int16))
end_time_gpu = gpuarray.to_gpu(df_parameter_variations['end_time'].to_numpy().astype(np.int16))
parameter_length = np.uint32(len(df_parameter_variations))
print(f'{end_time_gpu} {end_time_gpu.dtype}')
print(f'{entry_limit_max_wait_nanoseconds_gpu} {entry_limit_max_wait_nanoseconds_gpu.dtype}')
print(len(df_parameter_variations))
# autopep8: on
# Define how it will be organized on the GPU
threads_per_block = 1
# math.ceil(len(df_parameter_variations) / threads_per_block)
blocks_per_grid = 1
num_elements = threads_per_block * blocks_per_grid
# Declare output arrays
out_profit = np.empty(len(df_parameter_variations), dtype=np.float32)
out_drawdown = np.empty(len(df_parameter_variations), dtype=np.float32)
out_numtrades = np.empty(len(df_parameter_variations), dtype=np.uint32)
out_tradefees = np.empty(len(df_parameter_variations), dtype=np.float32)
print(f"copied arrays to device, testing: {
len(df_parameter_variations)} variations on {num_elements} threads")
start_time = time.perf_counter()
backtester_main(ohlc_gpu,
                timestamp_gpu,
                signal_gpu,
                np.uint32(len(timestamp_np)),
                entry_limit_max_wait_nanoseconds_gpu,
                np.float32(20.0),  # DOLLARS_PER_POINT
                np.float32(1.5),   # FEES_DOLLARS_PER_TRADE
                np.float32(1.0),   # POINTS_SLIPPAGE
                start_time_gpu,
                end_time_gpu,
                parameter_length,
                driver.Out(out_profit),
                driver.Out(out_drawdown),
                driver.Out(out_numtrades),
                driver.Out(out_tradefees),
                block=(threads_per_block, 1, 1), grid=(blocks_per_grid,))
end_time = time.perf_counter()
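For reference, the extern "C" kernel signature on the device side has to line up with that argument list; a sketch is below (body omitted, and the element types for the ohlc/timestamps/signals arrays are only illustrative here since those uploads are not shown above):

extern "C" __global__ void backtester_main(const float *ohlc,                        // float32 OHLC rows
                                           const long long *timestamps,              // illustrative: int64 timestamps
                                           const int *signals,                       // illustrative: must match signal_gpu's dtype
                                           unsigned int num_rows,
                                           const long long *entry_limit_max_wait_ns, // int64, one per parameter variation
                                           float dollars_per_point,
                                           float fees_dollars_per_trade,
                                           float points_slippage,
                                           const short *start_time,                  // int16, one per variation
                                           const short *end_time,                    // int16, one per variation
                                           unsigned int parameter_length,
                                           float *out_profit,
                                           float *out_drawdown,
                                           unsigned int *out_numtrades,
                                           float *out_tradefees)
{
    // ... each thread picks one parameter variation and runs one backtest instance ...
}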