Hello All
Can some experts please share their wisdom and experience one when/how to use Serial CPU, Cuda, OpenMP and MPI to get the best results. I know there are qualitative differences between all the approaches. So lets assume I have a cluster with multiple processors (i.e. cores) per node and 1 or more GPU's per node. How to I get the maximum performance?
Recently I have become interested in GPGPU for scientific applications. I am a research scientist who does a lot of simulation work. Before I started re-writing my code for GPGPU I wanted to understand the conditions where the GPU is the best approach. So I have performed my own benchmarks to help answer my questions. I wanted to post my findings here to share my results for other peoples benefits.
Now I am ignoring MPI parallelism here because I think MPI can be combined with either OpenMP or CUDA (i.e. hybrid parallelism). Where as Cuda is in more direct competition with OpenMP. I also performed serial execution on the CPU as a reference (some problems are too small for parallelism).
All my testing was on my personal computer. It has a Intel Core i7 5820K processor, 6 cores and with SMT 12 threads and 32 GB or RAM. My graphics card is lowly NVidia GT 730, it is a Keppler GPU with 2GB of RAM. So a relatively powerful CPU with a low-end GPU. So the hard-ware should favour CPU parallelism.
The test case I chose is simple, it is basically derived from the vectorAdd sample given in the Cuda tool-kit. The benchmarking is based on a collection of functions that all perform the same vector operations. These functions have two parameters, the number of elements and the number of operations applied to each element. So this gives me a sense of both data size and code complexity effects on performance.
Another important point is Cuda performance is quite sensitive to how memory is managed. The first time I ran my benchmark, my Cuda kernel used the global memory for nearly everything. In this case there were not a lot of advantages to using Cuda. So I developed a Cuda kernel that avoided global memory. That dramatically improved the performance.
So let me show the code for each function. Serial CPU version:
void testSingle(float* h_A, float* h_B, float* h_C, unsigned long numElements, unsigned long numOperations)
{
for(unsigned long i = 0; i < numElements; ++i)
{
h_C[i]=h_A[i]+h_B[i];
for(unsigned long j=1;j<numOperations;j++)
h_C[i]+=h_A[i]+h_B[i];
}
}
OpenMP CPU version:
void testOpenMP(float* h_A, float* h_B, float* h_C, unsigned long numElements, unsigned long numOperations)
{
#pragma omp parallel for
for(unsigned long i = 0; i < numElements; ++i)
{
h_C[i]=h_A[i]+h_B[i];
for(unsigned long j=1;j<numOperations;j++)
h_C[i]+=h_A[i]+h_B[i];
}
}
Slow Cuda implementation:
__global__ void vectorAdd(const float *A, const float *B, float *C, unsigned long numElements, unsigned long numOperations)
{
int i = blockDim.x * blockIdx.x + threadIdx.x;
if (i < numElements)
{
C[i] = A[i] + B[i];
for(unsigned long j=1;j<numOperations;j++)
C[i] = C[i] + A[i] + B[i];
}
}
I had two improved cuda kernels, one based on local variables that I assume are stored in registers and another based on shared memory. They both had similar performance so I will only show the local variable version:
__global__ void vectorAddReg(const float *A, const float *B, float *C, unsigned long numElements, unsigned long numOperations)
{
int i = blockDim.x * blockIdx.x + threadIdx.x;
if (i < numElements)
{
float rC;
float rA=A[i];
float rB=B[i];
rC = rA + rB;
for(unsigned long j=1;j<numOperations;j++)
rC += rA + rB;
C[i]=rC;
}
}
Now all these different implementations were wrapped in a 'for loop' and timing gates as shown below. I show the CUDA version just to show that I include GPU-host communication in the timing. The 'for loop' runs the test 4 times just to obtain more average conditions.
// prepare some data for the cuda call
threadsPerBlock=256;
int blocksPerGrid =(numElements + threadsPerBlock - 1) / threadsPerBlock;
// Get the starting time
gettimeofday(&start, NULL);
// Lets test the openmp thread time
for(unsigned long i=0;i<TEST_COUNT;i++)
{
// send input data
sendToGPU( h_A, h_B, numElements);
// Launch the Vector Add CUDA Kernel
vectorAddReg<<<blocksPerGrid, threadsPerBlock>>>(d_A, d_B, d_C, numElements, numOperations);
// receive results
recvFrGPU( h_C, numElements);
}
// now collect the end
gettimeofday(&end, NULL);
std::cout << "The time for a cudaSC for loop is " << double((end.tv_sec * 1000000 + end.tv_usec) - (start.tv_sec * 1000000 + start.tv_usec))/double(TEST_COUNT) << " micro seconds" << std::endl;
I had two compilation units, one with pure CPU code compiled with g++, the other with cuda code compiled with nvcc. All of these versions were compiled with -O3 optimization. I didn't use -march=native on the cpu code so I cannot be sure if my code is using specialized vector instructions in the CPU.
I ran my benchmark from 10-10^6 elements with 10-10^6 operations per element and every combination therein. The benchmarking occurred when my computer was <b>not</b> running a GUI so both the CPU and GPU were not running any other tasks. Thus, system load was minimal and benchmark's relatively unspoiled.
Parallel code provides the greatest advantage when the parallel workload is very large compared to the serial work. In this case the serial work is the overhead of communication and spawning parallel work flows. So to get a raw measure of speed I looked at the biggest problem, 10^6 elements with 10^6 operations. There I get the following speed up:
OpenMP vs Serial CPU: 7.1x faster
CudaSlow vs Serial CPU: 10.5x faster
CudaFast vs Serial CPU: 82.8x faster
CudaSlow vs OpenMP: 1.5x faster
CudaFast vs OpenMP: 11.7x faster
Clearly the fast implementation of cuda is superior with a 82.8x speed up over serial code.
Back to the original question, under what conditions is it better to use one approach over the other? In general serial code was best for small problems. It appears the over-head of both OpenMP and Cuda is significant. Furthermore there is increased communication overhead with Cuda compared with OpenMP (PCI vs. CPU cache and FSB). So I have listed the conditions where one approach is better than the other:
Num Elm Num Opr Best Approach
10 All Serial CPU
100 10-100 Serial CPU
100 10^3-10^5 Fast Cuda
100 10^6 OpenMP
1000 10 Serial CPU
1000 100+ Fast Cuda
All other cases Fast Cuda
The results show that in most cases the choice is between serial CPU and Cuda. However there are two cut-offs, the number of elements and the number of operations per element. In general if the number of elements is 10, then it seems the CPU is superior. For medium size problems 100-1000, then you need to have at least 1000 operations per element to beat serial CPU performance. Beyond 1000 elements, cuda is the best.
I suspect the CPU is best for small problems because it can keep all the data in fast cache and use vectored instructions. Once the data is larger than cache, then the GPU appears to be better. The benefit of cache does help OpenMP in one instance, however it shows that many operations per element are required before the overhead of spawning threads is compensated by parallel throughput. In that case OpenMP was only 2x faster than Cuda. So there is little advantage in mixing OpenMP and Cuda parallelism for these rare conditions where OpenMP has an advantage.
An important quality of GPU parallelism vs. OpenMP is Cuda is Single Instruction Multiple Data (SIMD) and OpenMP is Multiple Instruction Multiple Data (MIMD). So complicated work-flows with a lot of branching and heterogeneous mix of algorithms are simply not appropriate for CUDA. In this case OpenMP is the only solution.
So when to use serial CPU, OpenMP and CUDA?
-------------------------------------------
In general small problems (<100 elements), or medium size problems (100-1000 elements) with a simple calculations (<10^5 total operations) are best executed in serial on the CPU. For larger SIMD problems, Cuda is superior over OpenMP. However, Cuda only offers a significant advantage when the kernels are optimized (i.e. more development effort is required with Cuda). So OpenMP should be chosen when you either have a large MIMD problem or you want to parallel a code quickly.
So what is the best solution for cluster computing?
---------------------------------------------------
If your problem is very large, then partition the memory across multiple nodes and perform MPI parallelism at a high level within your code. In this case you should only run 1 MPI process per node, and then use a combination of OpenMP and Cuda on the node.
Next you should focus on the code that runs on a single node. You need to identify the areas of your code that are solving large SIMD problems that exceed the serial CPU cut-off. These parts of the code should be sent to a GPU. Finally, use OpenMP to parallel any remaining parts that can be executed in parallel. If you want results quickly, you could start to execute all single node parallel parts purely with OpenMP, then in time move more and more to the GPU. This solution will make a fast portable code because not all systems will have a GPU, so you can always fallback to your OpenMP parallel code in these cases.
An optimal hybrid solution may look like the code below. Note, the 'my_rank' argument indicates that the code is only operating on partitioned data indicated by 'my_rank'. Since spawning GPU processing is non-blocking, you can execute SIMD on the GPU and perform addition CPU work in parallel.
void run_simulation_algorithm()
{
// serial part
execute_initial_serial_part();
// MPI coordination for starting the parallel execution over MPI:
start_mpi_parallel_execution(my_rank);
// spawn any GPU kernels:
asynchronously_spawn_large_SIMD_problems_to_GPU(my_rank);
// execute more complicated problems:
execute_large_MIMD_problems_with_OpenMP(my_rank);
// execute any remaining serial parts
execute_small_problems_in_serial(my_rank);
// collect the GPU results:
collect_GPU_results(my_rank);
// MPI coordination for resuming serial part of the code
coordinate_mpi_execution_to_resume_serial_execution();
// Resume serial execution
resume_serial_execution();
}
As with any parallel speed-up it will allow you to either solve the same problem faster, or solve larger problems with the same resources (assuming highly scalable code). However utilizing the GPU gives you more options in partitioning your problem for economic cluster utilization. Utilizing the GPU gives you a 10x speed-up over OpenMP parallel code. This would allow you to partition your problem into larger chunks by a factor of 10 and use 1/10 of the nodes you would need in a pure CPU implementation. This will help you run more jobs on a busy cluster. So one key advantage to GPU programming may be overall increase in throughput on a busy cluster.