Dear all, I am parallelizing an existing numerical model written in C using CUDA. The existing model supports multithreaded computation, so I assume it has the potential to be ported to CUDA. However, with multiple threads the speedup shows only marginal returns once the thread count goes above 10. This makes me wonder whether the computation can be sped up with CUDA.
To implement CUDA, I am considering the following steps. Could you please provide your comments?
a) identify the kernel(s) and add __global__. Certain kernels currently return values in the model, and I will change them to return void
b) identify the device functions that are called within the kernel(s) and add __device__
c) identify the global variables that are allocated with malloc AND used by any device function, and change them to cudaMallocManaged. I noted that in previous CUDA versions there may have been some issues with cudaMallocManaged. I wonder about its performance in version 9.2. Will it manage unified memory correctly, so that I do not need to worry about transferring data between GPU and CPU?
These steps will involve a significant number of hours, so I would greatly appreciate your input before I jump into the unknown. Thanks!
Those mechanical steps are necessary to port a code to CUDA, but by themselves are not sufficient to guarantee any performance improvement.
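For reference, here is a minimal sketch of what steps (a) to (c) look like on a toy problem. The names update_cells, square_plus_one and run_update are hypothetical stand-ins, not anything from your model, and the sketch assumes a GPU that supports managed memory.

```cuda
#include <cuda_runtime.h>

// Step (b): a helper called from the kernel is marked __device__.
// square_plus_one is a hypothetical stand-in for a model routine.
__device__ double square_plus_one(double x) { return x * x + 1.0; }

// Step (a): the kernel is marked __global__ and must return void;
// results are written through a pointer instead of being returned.
__global__ void update_cells(double *state, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) state[i] = square_plus_one(state[i]);
}

// Step (c): the array is allocated with cudaMallocManaged instead of malloc.
// Returns the sum of all updated cells, or -1.0 on any CUDA error.
double run_update(int n)
{
    double *state = nullptr;
    if (cudaMallocManaged(&state, n * sizeof(double)) != cudaSuccess) return -1.0;
    for (int i = 0; i < n; ++i) state[i] = 2.0;   // host initializes the data

    int threads = 256;
    int blocks = (n + threads - 1) / threads;     // round up to cover all n cells
    update_cells<<<blocks, threads>>>(state, n);

    // The host must synchronize before it reads managed memory again.
    bool ok = (cudaGetLastError() == cudaSuccess) &&
              (cudaDeviceSynchronize() == cudaSuccess);

    double sum = 0.0;
    if (ok) {
        for (int i = 0; i < n; ++i) sum += state[i];
    }
    cudaFree(state);
    return ok ? sum : -1.0;
}
```

Calling run_update(n) from main exercises all three steps. Note that nothing in this sketch says anything about performance: it only shows that the code is mechanically valid CUDA.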
Most existing scientific and technical computing codes that are written to be multi-threaded cannot exploit more than a few dozen threads at most. That level of thread parallelism is insufficient to get any useful benefit from CUDA. Furthermore, such codes often don’t have memory access characteristics that readily fall into the desired patterns for optimal memory access in CUDA.
These are the two most important objectives to achieve performance in a CUDA code:
- expose enough parallelism (roughly, be able to use ~10,000 or more threads)
- make efficient use of the memory subsystem(s) (for global memory, strive for coalesced loads/stores)
Neither of these objectives is guaranteed to be satisfied by the mechanical steps you have laid out.
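To illustrate both objectives, here is a hedged sketch that scales a row-major matrix in two ways. All names (scale_row_per_thread, scale_elem_per_thread, kernels_agree) are hypothetical. The first kernel mirrors a typical CPU-thread decomposition; the second uses one thread per element, which exposes far more threads and lets adjacent threads touch adjacent addresses.

```cuda
#include <cuda_runtime.h>

// CPU-style decomposition: one thread per row, looping over the columns.
// Only `rows` threads exist, and neighbouring threads access addresses
// `cols` elements apart, so their loads and stores are not coalesced.
__global__ void scale_row_per_thread(float *a, int rows, int cols, float s)
{
    int r = blockIdx.x * blockDim.x + threadIdx.x;
    if (r < rows) {
        for (int c = 0; c < cols; ++c) a[r * cols + c] *= s;
    }
}

// GPU-style decomposition: one thread per element.
// rows * cols threads exist, and neighbouring threads access neighbouring
// addresses, which is the pattern that coalesces.
__global__ void scale_elem_per_thread(float *a, int rows, int cols, float s)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < rows * cols) a[i] *= s;
}

// Runs both kernels on identical data; returns true if the results agree.
bool kernels_agree(int rows, int cols)
{
    int n = rows * cols;
    float *a = nullptr, *b = nullptr;
    if (cudaMallocManaged(&a, n * sizeof(float)) != cudaSuccess) return false;
    if (cudaMallocManaged(&b, n * sizeof(float)) != cudaSuccess) {
        cudaFree(a);
        return false;
    }
    for (int i = 0; i < n; ++i) a[i] = b[i] = float(i % 7);

    int threads = 256;
    scale_row_per_thread<<<(rows + threads - 1) / threads, threads>>>(a, rows, cols, 2.0f);
    scale_elem_per_thread<<<(n + threads - 1) / threads, threads>>>(b, rows, cols, 2.0f);
    bool ok = (cudaDeviceSynchronize() == cudaSuccess);

    for (int i = 0; ok && i < n; ++i) {
        ok = (a[i] == b[i]) && (a[i] == 2.0f * float(i % 7));
    }
    cudaFree(a);
    cudaFree(b);
    return ok;
}
```

Both kernels compute the same result, so the difference only shows up in a profiler. I would expect the per-element version to run considerably faster on most GPUs, but that should be measured rather than assumed.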
Rather than a brief, passing familiarity with CUDA that can be gained in ~1 hour, I recommend a minimum of about 8 hours of careful study of CUDA principles before trying your hand at porting of a basic code.
Thank you, these are very helpful comments! Do you have a specific document in mind when you mention CUDA principles? Can you recommend a reference for your item 2? It will obviously take a fair amount of work to change the existing memory model (e.g. when and where to allocate/free memory). From your experience, should I start parallelizing based on the existing memory model? I do not want to completely change the memory model unless I have to. I assume the reference on CUDA principles should help with this.
For basic CUDA understanding I would recommend full familiarity with these two presentations:
The two optimization principles are covered in some detail in the 2nd presentation. These are not the only 2 examples or possibilities. There are many presentations, blogs, etc. that cover similar content. But these cover the necessary material to give a basic understanding of CUDA programming along with key optimization principles that are usually necessary to adhere to, for good performance.
In the second presentation above, refer to slide 121 for a summary statement pointing out the 2 items I mentioned for good performance:
• What you need for good GPU performance
  – Expose sufficient parallelism to keep GPU busy
    • General recommendations:
      – 1000+ threadblocks per GPU
      – 1000+ concurrent threads per SM (32+ warps)
  – Maximize memory bandwidth utilization
    • Pay attention to warp address patterns
    • Have sufficient independent memory accesses to saturate the bus
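As a rough way to compare a planned launch against those rules of thumb, the following sketch queries the runtime for the SM count and for how many threads of a given kernel can be resident on one SM. The kernel dummy_kernel, the struct LaunchReport and the function report_launch are hypothetical names for illustration; device 0 is assumed.

```cuda
#include <cuda_runtime.h>

// Hypothetical placeholder kernel; substitute the kernel being tuned.
__global__ void dummy_kernel(float *x, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] += 1.0f;
}

struct LaunchReport {
    int blocks;                  // thread blocks in the grid
    int sm_count;                // SMs on the device
    int resident_threads_per_sm; // threads of this kernel that fit on one SM
};

// Fills `out` for a launch covering n elements; returns false on a CUDA error.
bool report_launch(int n, int threads_per_block, LaunchReport *out)
{
    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, 0) != cudaSuccess) return false;

    // Ask the runtime how many blocks of this kernel can be active per SM
    // at the chosen block size (no dynamic shared memory).
    int blocks_per_sm = 0;
    if (cudaOccupancyMaxActiveBlocksPerMultiprocessor(
            &blocks_per_sm, dummy_kernel, threads_per_block, 0) != cudaSuccess) {
        return false;
    }

    out->blocks = (n + threads_per_block - 1) / threads_per_block;
    out->sm_count = prop.multiProcessorCount;
    out->resident_threads_per_sm = blocks_per_sm * threads_per_block;
    return true;
}
```

With n = 1 << 20 and 256 threads per block, the grid has 4096 blocks, comfortably above the 1000+ threadblock guideline. Whether resident_threads_per_sm reaches 1000+ depends on the device and on the kernel's register and shared memory use.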
Point (2) is a natural consequence of the fact that modern GPUs offer such a high computational throughput that most real-life use cases are limited by memory throughput rather than by computational throughput, e.g. when analyzed with the help of the roofline model (*).
If you go back thirty years, people considered one floating-point operation per second for every byte per second of memory throughput to be the holy grail of balanced supercomputers. For comparison, Tesla V100 offers 7.8 DP TFLOPS with 0.9 TB/s memory throughput, and the ratios for some consumer GPUs are even higher.
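As a worked example of the roofline bound using the V100 numbers above, consider double-precision DAXPY. DAXPY is my choice of illustration here, not something from your model; P is throughput, B is memory bandwidth and I is arithmetic intensity.

```latex
\begin{align*}
P_{\text{attainable}} &= \min\left(P_{\text{peak}},\; B \cdot I\right) \\
\text{Tesla V100 balance point: } \frac{P_{\text{peak}}}{B}
  &= \frac{7.8\ \text{TFLOP/s}}{0.9\ \text{TB/s}} \approx 8.7\ \text{FLOP/byte} \\
\text{DAXPY } (y_i \leftarrow a x_i + y_i): \quad
  I &= \frac{2\ \text{FLOP}}{3 \times 8\ \text{bytes}} = \frac{1}{12}\ \text{FLOP/byte} \\
P_{\text{attainable}} &= \min\left(7.8,\; 0.9 \times \tfrac{1}{12}\right)\ \text{TFLOP/s}
  = 0.075\ \text{TFLOP/s}
\end{align*}
```

Under these assumptions the kernel is limited to roughly 1% of peak double-precision throughput, no matter how well the arithmetic is optimized. Any kernel with an intensity below about 8.7 FLOP/byte is memory-bound on this GPU.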
(*) Samuel Williams, Andrew Waterman, David Patterson, “Roofline: an insightful visual performance model for multicore architectures”, Communications of the ACM, Volume 52, Issue 4, April 2009, pp. 65-76