Architecture design for a huge data load that can't be partitioned

I need to process a huge load of time-ordered data that cannot be partitioned across several threads, because there is a sequential dependency between the beginning and the end. It's financial data: the software opens positions based on certain conditions, and if the data were split up, a given thread would have no way of knowing which positions had been opened earlier in the stream.

So my initial design attempt is to load this data (> 2 GB) into global memory, launch several blocks/threads, and have each thread process the entire 2 GB. Each thread would process the same data with a different set of parameters; around 150 combinations should be enough.
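For illustration, here is a minimal CPU sketch (plain C++, all type and field names hypothetical) of the per-thread logic this design implies: each of the ~150 parameter combinations scans the same time-ordered stream from start to finish, carrying its own open-position state. On the GPU, this loop body would be one thread's work over the 2 GB in global memory.

```cpp
#include <vector>

// Hypothetical tick record; the real layout depends on your data feed.
struct Tick {
    double price;
    long long timestamp;
};

// One of the ~150 parameter combinations (hypothetical rule).
struct Params {
    double openAbove;   // open a position when price rises above this
    double closeBelow;  // close it when price falls below this
};

// State each thread carries: open-position count and realized P&L.
// A real strategy would track entry prices, sizes, etc.
struct ScanState {
    int openPositions = 0;
    double lastEntry = 0.0;
    double pnl = 0.0;
};

// Sequential scan over the whole stream for one parameter set.
ScanState scanAll(const std::vector<Tick>& ticks, const Params& p) {
    ScanState s;
    for (const Tick& t : ticks) {
        if (s.openPositions == 0 && t.price > p.openAbove) {
            s.openPositions = 1;        // open: depends on all prior ticks
            s.lastEntry = t.price;
        } else if (s.openPositions == 1 && t.price < p.closeBelow) {
            s.openPositions = 0;        // close: realize the profit/loss
            s.pnl += t.price - s.lastEntry;
        }
    }
    return s;
}
```

The sequential dependency is visible in the `if`/`else`: whether a tick opens or closes anything depends on state accumulated from every earlier tick, which is why a single parameter combination can't simply be split across threads.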

Any thoughts on this design?

Would a better approach be to load smaller pieces (say 20 MB, 100 times), pass along the positions that have already been opened, and launch the 150 threads to process each piece?

Would there be any gain, or would the 100 × 20 MB allocations and reads take the same time as reading the 2 GB at once?
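The chunked variant works precisely because the scan state at a chunk boundary fully summarizes the past. A self-contained sketch of that idea (plain C++, hypothetical names): carry the small state struct from one chunk launch into the next, and the chained result matches the single-pass result exactly.

```cpp
#include <vector>

// Hypothetical tick and per-combination state; real layouts will differ.
struct Tick { double price; };
struct Params { double openAbove; double closeBelow; };
struct State {
    int open = 0;       // currently open position? (0 or 1 here)
    double entry = 0.0; // entry price of the open position
    double pnl = 0.0;   // realized profit so far
};

// Process one ~20 MB chunk, starting from the state left by the previous
// chunk. Copying this small State in/out of device memory between kernel
// launches is what "pass positions already opened" amounts to.
State processChunk(State s, const std::vector<Tick>& chunk, const Params& p) {
    for (const Tick& t : chunk) {
        if (s.open == 0 && t.price > p.openAbove) {
            s.open = 1; s.entry = t.price;
        } else if (s.open == 1 && t.price < p.closeBelow) {
            s.open = 0; s.pnl += t.price - s.entry;
        }
    }
    return s;
}

// Chaining the 100 chunks reproduces the 2 GB single-pass result,
// since each boundary State is a complete summary of the prefix.
State processAllChunks(const std::vector<std::vector<Tick>>& chunks,
                       const Params& p) {
    State s;
    for (const auto& c : chunks) s = processChunk(s, c, p);
    return s;
}
```

On the transfer-cost question: 100 × 20 MB moves the same bytes as one 2 GB copy, and per-transfer overhead is small at 20 MB granularity, so raw copy time should be similar. The real win of chunking is that you can overlap the next chunk's host-to-device copy with the current chunk's compute (e.g. CUDA streams with `cudaMemcpyAsync` into pinned buffers), hiding much of the transfer time, and you avoid needing 2 GB of device memory resident at once.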

Any other ideas I'm missing?

Thanks in advance.

If you end up running only 150 threads on the device, you'll be very disappointed with the performance.

100x150 threads would be a much better approach
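One reading of that 100x150 suggestion — and this is an assumption, it only applies if the data actually decomposes this way — is one thread per (series, parameter set) pair: if the 2 GB really contains ~100 independent series (e.g. one per instrument), the sequential dependency only holds *within* a series, giving 100 × 150 = 15,000 independent tasks, enough to keep a GPU busy. A CPU sketch of that grid (all names hypothetical; on the device, `blockIdx`/`threadIdx` would select the pair instead of the two loops):

```cpp
#include <cstddef>
#include <vector>

struct Tick { double price; };
struct Params { double openAbove; double closeBelow; };

// Sequential scan of ONE independent series for one parameter set;
// this is the unit of work a single GPU thread would own.
double scanSeries(const std::vector<Tick>& s, const Params& p) {
    int open = 0; double entry = 0.0, pnl = 0.0;
    for (const Tick& t : s) {
        if (open == 0 && t.price > p.openAbove) { open = 1; entry = t.price; }
        else if (open == 1 && t.price < p.closeBelow) { open = 0; pnl += t.price - entry; }
    }
    return pnl;
}

// CPU stand-in for a 2-D launch over (series, params) pairs.
std::vector<double> runGrid(const std::vector<std::vector<Tick>>& series,
                            const std::vector<Params>& params) {
    std::vector<double> out(series.size() * params.size());
    for (std::size_t i = 0; i < series.size(); ++i)
        for (std::size_t j = 0; j < params.size(); ++j)
            out[i * params.size() + j] = scanSeries(series[i], params[j]);
    return out;
}
```

The why behind the reply: threads are scheduled in warps of 32, and a GPU hides memory latency by keeping thousands of threads in flight, so 150 threads leaves nearly all of the machine idle regardless of how the data is staged.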