Is there a best way or guide to optimize for many small operations?
I have an implementation of a Counterfactual Regret Minimization (CFR) algorithm: think of a weighted tree/graph structure where, whenever traversal hits a leaf (terminal node), the weights are recalculated. Currently, many CPU threads run calculations over the tree, stopping once an accuracy checkpoint is met.
The individual calculations are simple and small (an array of ~1400 elements at each node). However, there are 2M+ nodes in the tree, and the algorithm will perform billions of comparisons overall.
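For concreteness, the per-node work I have in mind looks roughly like the sketch below (a regret-matching style update). This is only an illustration: `regret`, `strategy`, `node_ids`, and `N_ACTIONS` are placeholders for my actual data, and the scheme of one block per node with threads striding over the ~1400-element arrays is just one way I could map it to the GPU.

```cuda
// Hypothetical sketch of the per-node update: one thread block handles one
// touched node; threads stride over the ~1400-element arrays.
#define N_ACTIONS 1400

__global__ void update_nodes(float* regret, float* strategy,
                             const int* node_ids, int n_nodes)
{
    int node = blockIdx.x;                      // one block per touched node
    if (node >= n_nodes) return;

    float* r = regret   + (size_t)node_ids[node] * N_ACTIONS;
    float* s = strategy + (size_t)node_ids[node] * N_ACTIONS;

    // Regret matching: new strategy proportional to positive regret.
    __shared__ float pos_sum;
    if (threadIdx.x == 0) pos_sum = 0.0f;
    __syncthreads();

    float local = 0.0f;
    for (int a = threadIdx.x; a < N_ACTIONS; a += blockDim.x)
        local += fmaxf(r[a], 0.0f);
    atomicAdd(&pos_sum, local);                 // simple block-level reduction
    __syncthreads();

    for (int a = threadIdx.x; a < N_ACTIONS; a += blockDim.x)
        s[a] = (pos_sum > 0.0f) ? fmaxf(r[a], 0.0f) / pos_sum
                                : 1.0f / N_ACTIONS;
}
```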
While learning CUDA's programming model, I'm trying to benchmark CPU vs. GPU with proper device memory management, unified memory, etc. However, I cannot seem to construct a benchmark or scenario where the GPU outperforms the CPU on this workload.
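The GPU side of my benchmark is shaped roughly like this (simplified; `run_gpu_pass`, `n_touched`, and the 256-thread block size are illustrative, and it assumes the `update_nodes` kernel sketched above with `regret`/`strategy` allocated via `cudaMallocManaged` so the same pointers work on host and device):

```cuda
#include <cuda_runtime.h>

// One timed GPU pass: each launch covers a whole batch of touched nodes,
// rather than launching per node.
void run_gpu_pass(float* regret, float* strategy,
                  const int* node_ids, int n_touched)
{
    int threads = 256;   // ~6 strided iterations to cover 1400 actions
    update_nodes<<<n_touched, threads>>>(regret, strategy, node_ids, n_touched);
    cudaDeviceSynchronize();   // keep the sync inside the timed region
}
```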
Before going too far down the path of a CUDA re-implementation, I thought I'd ask whether any of this makes sense, or whether there is an intelligent design or model I can follow.