When to use Serial CPU, CUDA, OpenMP and MPI?

Hello All

  Can some experts please share their wisdom and experience on when/how to use serial CPU, CUDA, OpenMP and MPI to get the best results? I know there are qualitative differences between all the approaches. So let's assume I have a cluster with multiple processors (i.e. cores) per node and one or more GPUs per node. How do I get the maximum performance?

  Recently I have become interested in GPGPU for scientific applications. I am a research scientist who does a lot of simulation work. Before I started re-writing my code for GPGPU, I wanted to understand the conditions under which the GPU is the best approach. So I performed my own benchmarks to help answer my questions, and I wanted to post my findings here for other people's benefit.

  Now, I am ignoring MPI parallelism here because MPI can be combined with either OpenMP or CUDA (i.e. hybrid parallelism), whereas CUDA is in more direct competition with OpenMP. I also performed serial execution on the CPU as a reference (some problems are too small for parallelism).

  All my testing was on my personal computer. It has an Intel Core i7-5820K processor (6 cores, 12 threads with SMT) and 32 GB of RAM. My graphics card is a lowly NVIDIA GT 730, a Kepler GPU with 2 GB of RAM. So this is a relatively powerful CPU paired with a low-end GPU, and the hardware should favour CPU parallelism.

  The test case I chose is simple; it is basically derived from the vectorAdd sample in the CUDA Toolkit. The benchmarking is based on a collection of functions that all perform the same vector operations. These functions have two parameters, the number of elements and the number of operations applied to each element. This gives me a sense of how both data size and code complexity affect performance.

  Another important point is that CUDA performance is quite sensitive to how memory is managed. The first time I ran my benchmark, my CUDA kernel used global memory for nearly everything, and in that case there was not much advantage to using CUDA. So I developed a CUDA kernel that avoided global memory, which dramatically improved the performance.

  So let me show the code for each function. Serial CPU version:
void testSingle(float* h_A, float* h_B, float* h_C, unsigned long numElements, unsigned long numOperations)
{   
    for(unsigned long i = 0; i < numElements; ++i)
    {   
        h_C[i]=h_A[i]+h_B[i];
        for(unsigned long j=1;j<numOperations;j++)
            h_C[i]+=h_A[i]+h_B[i];
    }
}
  OpenMP CPU version:
void testOpenMP(float* h_A, float* h_B, float* h_C, unsigned long numElements, unsigned long numOperations)
{
    #pragma omp parallel for 
    for(unsigned long i = 0; i < numElements; ++i)
    {   
        h_C[i]=h_A[i]+h_B[i];
        for(unsigned long j=1;j<numOperations;j++)
            h_C[i]+=h_A[i]+h_B[i];
    }
}
  Slow CUDA implementation:
__global__ void vectorAdd(const float *A, const float *B, float *C, unsigned long numElements, unsigned long numOperations)
{ 
    int i = blockDim.x * blockIdx.x + threadIdx.x;
  
    if (i < numElements)       
    {
        C[i] = A[i] + B[i];    
        for(unsigned long j=1;j<numOperations;j++)
            C[i] = C[i] + A[i] + B[i];      
    }
}
  I had two improved CUDA kernels, one based on local variables that I assume are stored in registers and another based on shared memory. They both had similar performance, so I will only show the local-variable version:
__global__ void vectorAddReg(const float *A, const float *B, float *C, unsigned long numElements, unsigned long numOperations)
{ 
    int i = blockDim.x * blockIdx.x + threadIdx.x;

    if (i < numElements)
    {
        float rC;
        float rA=A[i];
        float rB=B[i];
        rC = rA + rB;
        for(unsigned long j=1;j<numOperations;j++)
            rC += rA + rB;
        C[i]=rC;
    }
}
  All of these implementations were wrapped in a for loop and timing gates as shown below. I show the CUDA version to make clear that GPU-host communication is included in the timing. The for loop runs the test 4 times (TEST_COUNT) to average out run-to-run variation.
    // prepare some data for the CUDA call
    int threadsPerBlock = 256;
    int blocksPerGrid = (numElements + threadsPerBlock - 1) / threadsPerBlock;
    // Get the starting time
    gettimeofday(&start, NULL);
    // Time the CUDA version, including host<->device transfers
    for(unsigned long i = 0; i < TEST_COUNT; i++)
    {
        // send input data
        sendToGPU(h_A, h_B, numElements);
        // Launch the vector add CUDA kernel
        vectorAddReg<<<blocksPerGrid, threadsPerBlock>>>(d_A, d_B, d_C, numElements, numOperations);
        // receive results
        recvFrGPU(h_C, numElements);
    }
    // now collect the end time
    gettimeofday(&end, NULL);
    std::cout << "The time for a cudaSC for loop is "
              << double((end.tv_sec * 1000000 + end.tv_usec) - (start.tv_sec * 1000000 + start.tv_usec)) / double(TEST_COUNT)
              << " microseconds" << std::endl;
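  For completeness, sendToGPU and recvFrGPU above are just thin wrappers around cudaMemcpy. A sketch of what they might look like (assuming d_A, d_B and d_C are device buffers allocated earlier with cudaMalloc):
// Hypothetical sketch of the transfer helpers: plain cudaMemcpy wrappers.
// d_A, d_B and d_C are assumed to be device buffers allocated earlier with cudaMalloc.
void sendToGPU(const float* h_A, const float* h_B, unsigned long numElements)
{
    cudaMemcpy(d_A, h_A, numElements * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(d_B, h_B, numElements * sizeof(float), cudaMemcpyHostToDevice);
}

void recvFrGPU(float* h_C, unsigned long numElements)
{
    // cudaMemcpy blocks until the kernel has finished, so the result is valid here
    cudaMemcpy(h_C, d_C, numElements * sizeof(float), cudaMemcpyDeviceToHost);
}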
  I had two compilation units, one with pure CPU code compiled with g++ and the other with CUDA code compiled with nvcc. Both were compiled with -O3 optimization. I did not use -march=native on the CPU code, so I cannot be sure whether the CPU code is using specialized vector instructions.

  I ran my benchmark from 10 to 10^6 elements with 10 to 10^6 operations per element, and every combination therein. The benchmarking occurred when my computer was not running a GUI, so neither the CPU nor the GPU was running any other tasks. Thus, system load was minimal and the benchmarks relatively unspoiled.

  Parallel code provides the greatest advantage when the parallel workload is very large compared to the serial work. In this case the serial work is the overhead of communication and of spawning parallel workflows. So to get a raw measure of speed I looked at the biggest problem, 10^6 elements with 10^6 operations per element. There I get the following speed-ups:
OpenMP vs Serial CPU:    7.1x faster
CudaSlow vs Serial CPU: 10.5x faster
CudaFast vs Serial CPU: 82.8x faster
CudaSlow vs OpenMP:      1.5x faster
CudaFast vs OpenMP:     11.7x faster
  Clearly the fast CUDA implementation is superior, with an 82.8x speed-up over serial code.

  Back to the original question: under what conditions is it better to use one approach over another? In general, serial code was best for small problems. It appears the overhead of both OpenMP and CUDA is significant, and there is additional communication overhead with CUDA compared with OpenMP (PCIe transfers vs. CPU caches and the memory bus). So I have listed the conditions where one approach is better than the others:
Num Elm   Num Opr      Best Approach
10        All          Serial CPU
100       10-100       Serial CPU
100       10^3-10^5    Fast CUDA
100       10^6         OpenMP
1000      10           Serial CPU
1000      100+         Fast CUDA
All other cases        Fast CUDA
  The results show that in most cases the choice is between serial CPU and CUDA. However, there are two cut-offs: the number of elements and the number of operations per element. In general, if the number of elements is around 10, the CPU is superior. For medium-sized problems (100-1000 elements), you need at least 1000 operations per element to beat serial CPU performance. Beyond 1000 elements, CUDA is the best.

  I suspect the CPU is best for small problems because it can keep all the data in fast cache and use vector instructions. Once the data is larger than the cache, the GPU appears to be better. The benefit of cache does help OpenMP in one instance, but it shows that many operations per element are required before the overhead of spawning threads is compensated by parallel throughput. Even in that case OpenMP was only about 2x faster than CUDA, so there is little advantage in mixing OpenMP and CUDA parallelism for the rare conditions where OpenMP has an edge.

  An important difference between GPU parallelism and OpenMP is that CUDA is essentially Single Instruction Multiple Data (SIMD), while OpenMP is Multiple Instruction Multiple Data (MIMD). So complicated workflows with a lot of branching and a heterogeneous mix of algorithms are simply not appropriate for CUDA; in those cases OpenMP is the only solution.
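
  To make that concrete, here is a small illustrative kernel (not part of my benchmark) where adjacent threads take different branches. Within a warp the two paths are executed one after the other, so the divergent work is serialized, whereas OpenMP threads on separate cores would run the two branches truly independently:
__device__ float pathA(float x) { float r = x; for (int k = 0; k < 64; ++k) r = r * 1.0001f + 0.5f; return r; }
__device__ float pathB(float x) { float r = x; for (int k = 0; k < 64; ++k) r = sqrtf(r + 2.0f);   return r; }

__global__ void divergentKernel(const float *A, float *C, unsigned long numElements)
{
    unsigned long i = blockDim.x * blockIdx.x + threadIdx.x;
    if (i < numElements)
    {
        // Neighbouring threads of the same warp take different branches, so
        // the warp has to execute both paths one after the other (divergence).
        if (i % 2 == 0)
            C[i] = pathA(A[i]);
        else
            C[i] = pathB(A[i]);
    }
}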

  So when to use serial CPU, OpenMP and CUDA? 
  -------------------------------------------

  In general, small problems (<100 elements), or medium-sized problems (100-1000 elements) with simple calculations (<10^5 total operations), are best executed in serial on the CPU. For larger SIMD problems, CUDA is superior to OpenMP. However, CUDA only offers a significant advantage when the kernels are optimized (i.e. more development effort is required with CUDA). So OpenMP should be chosen when you either have a large MIMD problem or want to parallelize a code quickly.

  So what is the best solution for cluster computing?
  ---------------------------------------------------

  If your problem is very large, partition the memory across multiple nodes and perform MPI parallelism at a high level within your code. In this case you should run only one MPI process per node, and then use a combination of OpenMP and CUDA within the node.
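
  A minimal skeleton of that layout might look like this (names and details are illustrative, not a complete program; it assumes the job launcher starts exactly one rank per node):
#include <mpi.h>
#include <cuda_runtime.h>

int main(int argc, char** argv)
{
    // One MPI rank per node; MPI_THREAD_FUNNELED lets the rank spawn
    // OpenMP threads as long as only the master thread calls MPI.
    int provided = 0;
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);

    int my_rank = 0, num_ranks = 1;
    MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);
    MPI_Comm_size(MPI_COMM_WORLD, &num_ranks);

    int num_gpus = 0;
    cudaGetDeviceCount(&num_gpus);   // GPUs visible on this node (may be 0)

    // my_rank selects this node's slice of the global problem; within the
    // slice the work is split between CUDA kernels and OpenMP loops.
    // run_node_local_work(my_rank, num_ranks, num_gpus);   // hypothetical

    MPI_Finalize();
    return 0;
}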

  Next, focus on the code that runs on a single node. Identify the areas of your code that are solving large SIMD problems exceeding the serial CPU cut-off; these parts should be sent to the GPU. Finally, use OpenMP to parallelize any remaining parts that can be executed in parallel. If you want results quickly, you could start by executing all single-node parallel parts purely with OpenMP, then over time move more and more of them to the GPU. This also makes for a fast, portable code, because not all systems will have a GPU and you can always fall back to your OpenMP parallel code in those cases.
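
  Such a fallback can be decided at run time. Here is a sketch using the routines shown earlier (the cut-offs are rough values taken from my table above and should be tuned for your own hardware):
// Run-time choice between the GPU path and the OpenMP fallback for the
// SIMD-heavy part. vectorAddReg, testOpenMP, sendToGPU and recvFrGPU are
// the routines shown earlier; the thresholds are illustrative.
void runVectorWork(float* h_A, float* h_B, float* h_C,
                   unsigned long numElements, unsigned long numOperations)
{
    int num_gpus = 0;
    bool gpu_ok = (cudaGetDeviceCount(&num_gpus) == cudaSuccess) && (num_gpus > 0);

    bool big_enough = (numElements > 1000) ||
                      (numElements >= 100 && numOperations >= 1000);

    if (gpu_ok && big_enough)
    {
        int threadsPerBlock = 256;
        int blocksPerGrid = (numElements + threadsPerBlock - 1) / threadsPerBlock;
        sendToGPU(h_A, h_B, numElements);
        vectorAddReg<<<blocksPerGrid, threadsPerBlock>>>(d_A, d_B, d_C, numElements, numOperations);
        recvFrGPU(h_C, numElements);
    }
    else
    {
        testOpenMP(h_A, h_B, h_C, numElements, numOperations);   // CPU/OpenMP fallback
    }
}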

  An optimal hybrid solution may look like the code below. Note that the 'my_rank' argument indicates the code is only operating on the data partition assigned to 'my_rank'. Since spawning GPU kernels is non-blocking, you can execute SIMD work on the GPU and perform additional CPU work in parallel.
void run_simulation_algorithm()
{
    // serial part
    execute_initial_serial_part();
    // MPI coordination for starting the parallel execution over MPI:
    start_mpi_parallel_execution(my_rank);
    // spawn any GPU kernels:
    asynchronously_spawn_large_SIMD_problems_to_GPU(my_rank);
    // execute more complicated problems:
    execute_large_MIMD_problems_with_OpenMP(my_rank);
    // execute any remaining serial parts
    execute_small_problems_in_serial(my_rank);
    // collect the GPU results:
    collect_GPU_results(my_rank);
    // MPI coordination for resuming serial part of the code
    coordinate_mpi_execution_to_resume_serial_execution();
    // Resume serial execution
    resume_serial_execution();
}
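  For the GPU/CPU overlap mentioned above, the key point is that kernel launches and cudaMemcpyAsync calls on a stream return immediately, so OpenMP work can proceed until an explicit synchronization. A minimal sketch of that pattern (all names are illustrative, and pinned host memory is assumed for the async copies):
// Sketch: queue the GPU work asynchronously, do OpenMP work meanwhile, then
// synchronize. h_* are pinned host buffers, d_* device buffers, and
// do_cpu_work() stands in for the MIMD part (all names illustrative).
static void do_cpu_work(long k) { (void)k; /* the MIMD work would go here */ }

void overlapGpuAndCpu(const float* h_A, const float* h_B, float* h_C,
                      float* d_A, float* d_B, float* d_C,
                      unsigned long numElements, unsigned long numOperations,
                      long cpuWorkItems)
{
    int threadsPerBlock = 256;
    int blocksPerGrid = (numElements + threadsPerBlock - 1) / threadsPerBlock;

    cudaStream_t stream;
    cudaStreamCreate(&stream);

    // These calls only queue work on the stream and return immediately.
    cudaMemcpyAsync(d_A, h_A, numElements * sizeof(float), cudaMemcpyHostToDevice, stream);
    cudaMemcpyAsync(d_B, h_B, numElements * sizeof(float), cudaMemcpyHostToDevice, stream);
    vectorAddReg<<<blocksPerGrid, threadsPerBlock, 0, stream>>>(d_A, d_B, d_C,
                                                                numElements, numOperations);
    cudaMemcpyAsync(h_C, d_C, numElements * sizeof(float), cudaMemcpyDeviceToHost, stream);

    // Meanwhile the CPU cores are free for the MIMD/OpenMP part.
    #pragma omp parallel for
    for (long k = 0; k < cpuWorkItems; ++k)
        do_cpu_work(k);

    // Block only when the GPU results are actually needed.
    cudaStreamSynchronize(stream);
    cudaStreamDestroy(stream);
}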
  As with any parallel speed-up, this will allow you to either solve the same problem faster or solve larger problems with the same resources (assuming highly scalable code). However, utilizing the GPU gives you more options for partitioning your problem for economical cluster utilization. In my test the GPU gave roughly a 10x speed-up over OpenMP parallel code; that would let you partition your problem into chunks larger by a factor of 10 and use a tenth of the nodes you would need in a pure CPU implementation. This will help you run more jobs on a busy cluster, so one key advantage of GPU programming may be an overall increase in throughput on a busy cluster.

If you can reveal the general nature of the simulations, it is more likely that you will get targeted feedback from people familiar with that application area.

One fundamental difference is that OpenMP on CPUs offers O(100) parallelism, whereas CUDA on GPUs offers O(10,000) parallelism, and in fact requires that for good performance. If your simulations can take advantage of the massive parallelism of the GPU, you should be able to achieve nice speedups. In addition to massive FLOPS, another advantage of GPUs is the improved memory bandwidth, around a factor of 5x compared to CPU-based systems (350 GB/s vs 70 GB/s). Taking advantage of that requires (1) problems that mostly fit into the limited GPU memory, (2) avoiding frequent data transfers to/from the GPU, and (3) reasonably regular access patterns.

Beyond the pure performance aspects, GPUs also offer significantly better perf/watt and perf/$ than CPU-only systems. The former is usually of interest only if you operate large clusters or where electricity is expensive, the latter probably matters to every researcher. If your computations are mostly single precision, you may be able to use cheap consumer-grade GPUs.

GPUs have been around long enough (about ten years) that they have spread into just about every area of science, with tens of thousands of papers published. Unless your simulations are of an unusual nature, chances are there are already several applications using different approaches, with published performance data. Many performance claims have to be deconstructed carefully though: often the comparison is to single-threaded or non-vectorized CPU code, or they report the speedup for just one part of a large-ish simulation. If there is sufficient inherent parallelism in a simulation for a GPU implementation to make sense, realistic expectations for the application-level performance advantage for GPU-accelerated code compared to well-optimized CPU code would be in the 2x to 10x range, with an average of around 5x.

Hi njuffa

    Thanks a lot for your reply. I have not read much of the academic literature on the subject. Playing with things manually has the added benefit of learning new skills, but I will also take a look at what is out there.

    To answer your question about applications, I personally perform multi-fidelity multidisciplinary design optimization of wind turbines. So my high-fidelity calculations are typically RANS CFD, vortex calculations and various structural multi-body FEM models. Some of the codes have analytic gradients. Typically these codes are coupled together and with optimizers and iterative uncertainty quantification algorithms (i.e. Dakota), so there is a lot of scope for multi-level parallelism. In summary, I use a wide range of tools.

    There is a second goal to these investigations. My institution is looking at acquiring a new cluster and I am trying to figure out whether we should have Tesla cards on some of the nodes and, if so, how many. To get the Tesla cards I have to convince my colleagues that GPU programming is a good idea. That is why I am trying to figure out the benefits, the work required to port CPU code, and best practices.

    Our biggest cluster users are typically running LES and large-scale structural topology optimization. In all cases we rely mostly on our own tools, so porting things to GPUs is possible. I think the topology optimization people could benefit the most from GPUs because their problems are basically humongous FEM simulations and are easy to parallelize.

    So in summary the applications I am concerned with are a variety of structural and aerodynamic simulations wrapped within iterative algorithms.

Take Care
Mike

CFD is outside my area of expertise. Is your LES code LBM based?

I am all for experimental approaches, and in fact often encourage them. Unfortunately, in industrial settings, they are sometimes prohibitive from a time utilization perspective. A comprehensive literature search is usually a valuable complement to running relevant experiments.

Hi njuffa

      Once again thanks for considering my requests. I personally do not use LES, so I cannot say for certain, but I think the LES used here is based on the Navier-Stokes equations with SGS models. The LES usually looks at larger domains; not quite full atmospheric scale, but the atmospheric boundary layer is the largest.

Take Care
Mike

What is this “test” code actually testing ?

__global__ void vectorAdd(const float *A, const float *B, float *C, unsigned long numElements, unsigned long numOperations)
{
    int i = blockDim.x * blockIdx.x + threadIdx.x;

    if (i < numElements)       
    {
        C[i] = A[i] + B[i];    
        for(unsigned long j=1;j<numOperations;j++)
            C[i] = C[i] + A[i] + B[i];      
    }
}

So far it seems to be reading an element from an array called A at index i.
So far it seems to be reading an element from an array called B at index i.
So far it seems to be adding those two elements together and storing the result in an array called C at index i.

So far so good, then it does something strange:

It starts a second loop with a J index.

Then the code repeats itself with index i. Was this a typo or is it intentional? It doesn't seem like a typo, since all the indexes use i the second time as well.

So the first piece of code seems to be testing the threading, and the second loop seems to be testing the looping?

Is this a cache test? A compiler optimization test? Or is it a hardware test?

Basically the second loop makes each thread walk through memory more or less at its own pace, though there will probably be some lock-stepping going on; it might depend on the resources available.

So far this seems to be more like a memory test than a test of math/calculations/logic execution.

Graphics cards are not strong at random access, but they do have a lot of execution capability for same-code/same-data with slightly varying data, like SSE; small pieces of code so as not to overflow the available registers, it seems.

Now that I have read the rest of the original posting, it seems the question is mostly answered by the posting itself: it is a test to find out at what scale which hardware should be chosen. However, this test may not cover a "mix" of code/logic scenario, so some conclusions are being drawn without an actual test of that?

Hi Skybuck

   Yes, I agree it is a silly piece of code. Your final understanding of its purpose is correct. I was just trying to understand the conditions where the GPU will provide an advantage. I don't want to spend a pile of time porting my code to GPUs only to find out the overhead is too much.

   Your concern about a mix of code scenarios is valid. I have been working on a second test program; however, it is not testing speed, just program architecture. My code is not well structured for GPU programming, so I am thinking about how to migrate my current architecture to something more appropriate. So slowly my test programs are evolving towards more realistic scenarios.

Thanks
Mike

May I ask a follow-up question? So far I am using combined CPU-GPU coding, CUDA and C++. One node has 4 GPU cards and 24 CPU cores or more. The current code has a CPU part and a GPU part, with one thread associated with each GPU, so I use 4 CPU threads because I have 4 GPU cards. My question is: can I use the remaining 20 cores (suppose I use one thread per core) to do for-loop parallelization in the CPU part of the code?

Yes, you can use the CPU threads however you see fit.
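
For example (just a sketch, the names and thread counts are illustrative): keep your 4 GPU-driver threads as they are and give the OpenMP loops in the CPU part the remaining cores, either via the OMP_NUM_THREADS environment variable or explicitly in code:
#include <omp.h>

// Illustrative: reserve 4 cores for the GPU-driving threads and use the
// remaining 20 for OpenMP loops in the CPU part of the code.
void cpuLoopPart(float* data, long n)
{
    omp_set_num_threads(20);            // 24 cores minus the 4 GPU threads
    #pragma omp parallel for
    for (long i = 0; i < n; ++i)
        data[i] = 2.0f * data[i];       // stand-in for the real CPU work
}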