Optimize - Many small operations (CPU is faster for now?)

Is there a best way or guide to optimize for many small operations?

I have an implementation of a Counterfactual Regret Minimization (CFR) algorithm; think of a weighted tree/graph structure in which, whenever traversal reaches a leaf (terminal node), the weights are recalculated. Currently, many CPU threads run calculations over the tree, and the process stops when an accuracy checkpoint is met.

The calculations are simple and small (~1400 elements in an array at each node). However, there are 2M+ nodes in the tree, and the algorithm performs billions of comparisons overall.
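
For context, the per-node math is essentially a regret-matching style update over the node's action array. The sketch below is illustrative only (names simplified, not my actual code), but it shows the scale of the arithmetic:

// Illustrative only: roughly the kind of per-node update involved.
// 'numActions' is on the order of 1400 in my case.
void regretMatch(const float *regret, float *strategy, int numActions)
{
    float positiveSum = 0.0f;
    for (int a = 0; a < numActions; a++)
        positiveSum += regret[a] > 0.0f ? regret[a] : 0.0f;

    for (int a = 0; a < numActions; a++)
    {
        float r = regret[a] > 0.0f ? regret[a] : 0.0f;
        strategy[a] = positiveSum > 0.0f ? r / positiveSum : 1.0f / numActions;
    }
}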

While learning CUDA’s programming model, I’m trying to benchmark CPU vs. GPU, proper device memory management, unified memory, etc. However, I cannot seem to create a benchmark or scenario where the GPU outperforms the CPU for this workload.

Before going too far down the path of re-implementing this in CUDA, I thought I’d ask whether any of this makes sense, or whether there is an intelligent design or model I can follow.

Google Scholar indicates multiple articles about the implementation of counterfactual regret algorithms on GPU published in the past four years; have you done a thorough literature search to familiarize yourself with the state of the art?

The typical way to accelerate many small operations, each of which does not offer enough parallelism to fill the GPU by itself, is to submit these operations in batches. Since GPUs are throughput-optimized processors, whereas CPUs are typically latency-optimized processors, this is a very appropriate adaptation; however, it may not be a good fit for particular use cases where latency matters (high-frequency trading of securities, for example).
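
As a sketch of what batching might look like here (purely illustrative; it assumes the per-node arrays are packed back-to-back in device memory and that the nodes in a batch can be updated independently), one kernel launch could process an entire batch of nodes, for example one thread block per node:

// Hypothetical batched update: one thread block per node, one launch per batch.
// 'values' holds batchSize node arrays back-to-back, each elemsPerNode floats long.
__global__ void updateNodeBatch(float *values, int elemsPerNode, int batchSize)
{
    int node = blockIdx.x;                     // one block handles one node
    if (node >= batchSize) return;
    float *v = values + (size_t)node * elemsPerNode;

    // The block's threads stride across the node's ~1400 elements.
    for (int i = threadIdx.x; i < elemsPerNode; i += blockDim.x)
        v[i] = fmaxf(v[i], 0.0f);              // stand-in for the real per-element update
}

// Launch example: updateNodeBatch<<<batchSize, 256>>>(d_values, 1400, batchSize);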

Depending on how the graph is implemented and what the traversal patterns are, the memory access pattern might be a bad fit for the GPU's memory subsystem, which provides high bandwidth for reasonably regular access patterns. What does the CUDA profiler indicate about where the bottlenecks are?

You may need to rethink the data structure implementation for the GPU. For example, it is my understanding that GPU-accelerated in-memory databases derive their speed mostly from the fact that they have abandoned classical database organizations and instead turned to brute-force searches over arrays, something GPUs are good at because their memory bandwidth typically exceeds CPU system memory bandwidth by a factor of about 5.
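
As a toy illustration of the brute-force style (not taken from any particular database engine), a flat grid-stride scan like the one below keeps every SM busy and the memory accesses coalesced:

// Toy brute-force scan: count the elements of a flat array equal to 'key'.
// A grid-stride loop covers arrays of any size with full occupancy.
__global__ void countMatches(const int *data, size_t n, int key, unsigned long long *count)
{
    unsigned long long local = 0;
    for (size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
         i < n;
         i += (size_t)gridDim.x * blockDim.x)
    {
        if (data[i] == key) local++;
    }
    if (local) atomicAdd(count, local); // one atomic per thread, not per element
}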

From my work on embedded systems I know that hybrid data structures can also be beneficial, where a list or tree is implemented as an overlay on top of an array. This allows either classical node traversal or plain array traversal, whichever fits the task at hand.
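
A minimal sketch of such an overlay (a hypothetical layout, not tied to any particular codebase): the nodes live in one flat array and the tree is just index fields within that array, so the same data can be walked via parent/child links or swept linearly, whichever suits the task:

// Hybrid layout: a tree overlaid on a flat array of nodes.
// Indices replace pointers (-1 means "none"), so the structure is trivially
// copyable to the GPU and can also be processed as a plain array.
struct Node {
    int   parent;       // index of parent node, or -1 for the root
    int   firstChild;   // index of first child, or -1 for a leaf
    int   nextSibling;  // index of next sibling, or -1
    float weight;
};

// Classical traversal: follow the indices like pointers.
float subtreeWeight(const Node *nodes, int i)
{
    float sum = nodes[i].weight;
    for (int c = nodes[i].firstChild; c != -1; c = nodes[c].nextSibling)
        sum += subtreeWeight(nodes, c);
    return sum;
}

// Array traversal: sweep all nodes linearly, ignoring the links entirely.
float totalWeight(const Node *nodes, int numNodes)
{
    float sum = 0.0f;
    for (int i = 0; i < numNodes; i++)
        sum += nodes[i].weight;
    return sum;
}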

I think I went through this just last week. I am also new to CUDA and pieces are starting to fall into place for me. There’s a reply I made in this thread (hopefully I’m not too incorrect with it): https://devtalk.nvidia.com/default/topic/1056962/cuda-programming-and-performance/-how-do-i-sort-using-cuda-/

The epiphany came when I realized (or at least, I think I realized) how CUDA works internally. I haven’t benchmarked things to confirm my theory, but I believe that CUDA functions (__global__ kernels and __device__ functions) containing branching or looping constructs will take as long as the slowest thread in the warp. Consider the following:

// Really inefficient code example:
#include <cstdint>

__global__ void compoundAddNumbers(uint32_t *result)
{
    // Global thread index; each thread also uses it as its loop bound.
    unsigned idx = blockDim.x * blockIdx.x + threadIdx.x;
    // For short loops with a fixed iteration count, "#pragma unroll" here is supposed to
    // improve performance. Then again, a fixed count would nullify this example.
    uint32_t total = 0; // use a register for the loop
    if (idx % 2 == 0)
    {
        for (unsigned i = 0; i < idx; i++)
        {
            // Accumulate in the register to avoid an atomicAdd (and its synchronization
            // and waiting) on every iteration, so this part is a bit of efficiency.
            total += 1;
        }
    }
    else
    {
        for (unsigned i = 0; i < idx; i++)
        {
            total += i;
        }
    }
    atomicAdd(result, total); // a single atomic per thread
}

The way I understand it is that, within a warp, the execution time of the above will be hurt by the following issues (assuming the compiler is not smart enough to eliminate the branching); a divergence-free rewrite is sketched after the list:

  • Within the warp, the thread with the shortest loop takes just as long as the thread with the longest loop, because the warp advances in lockstep.
  • Because even and odd threads take different branches, the warp executes both halves of the if statement, so the total execution time is roughly the sum of the two branches.
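
If my understanding is right, the fix for the second point is to collapse the two branches into one uniform loop body; again, just a sketch of the idea, not something I have benchmarked:

// Same result as the example above, but the even/odd choice becomes a small
// per-iteration select instead of two divergent loop bodies. The warp still
// waits for its longest loop, but the duplicated branch work is gone.
__global__ void compoundAddNumbersUniform(uint32_t *result)
{
    unsigned idx = blockDim.x * blockIdx.x + threadIdx.x;
    bool even = (idx % 2 == 0);
    uint32_t total = 0;
    for (unsigned i = 0; i < idx; i++)
        total += even ? 1u : i;
    atomicAdd(result, total);
}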

I am still an amateur at this, but so far my checklist of things to avoid is as follows:

  • Avoid global (off-chip) memory access, and minimize reads/writes to global variables and pointers
  • If global memory must be accessed, contiguous accesses are faster than random reads, because memory is fetched in cache-line-sized blocks. If one thread accesses r[0] and a neighboring thread accesses r[1], performance will be better than if they access r[0] and r[200] (see the access-pattern sketch after this list)
  • Avoid looping constructs, especially loops whose iteration counts differ between threads in the same warp
  • Avoid branching statements, especially when there are large blocks of code in each half of the branch
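
To illustrate the contiguous-versus-strided point from the list above (a toy example, not from my actual code):

// Coalesced: consecutive threads touch consecutive addresses, so a warp's
// 32 accesses collapse into a few wide memory transactions.
__global__ void copyCoalesced(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = in[i];
}

// Strided: consecutive threads touch addresses 'stride' elements apart
// (e.g. 200), so each access lands on a different cache line and most of
// the fetched bytes are wasted.
__global__ void copyStrided(const float *in, float *out, int n, int stride)
{
    int i = (blockIdx.x * blockDim.x + threadIdx.x) * stride;
    if (i < n)
        out[i] = in[i];
}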

CUDA cores and threads are best not thought of as “task parallel”, where each core and thread does something different and can finish at a different time, as on a multicore CPU. It’s much better to think of them as “instruction parallel” (lockstep, SIMT-style), where you’re performing the exact same operation on multiple “parallel” memory addresses at the same time.