Other than the Nsight-Compute docs themselves, I’m not aware of other docs. Maybe they are out there, but I would think they would be difficult to write since so much is contextual based on the specifics of the program; it would be hard to generalize. Though the Nsight-Compute forum might be a good resource since you can see questions asked by others, which may be helpful or at least give you ideas. Most folks post CUDA C questions, but the strategies would apply to Fortran as well.
For the long scoreboard, this may or may not impact performance, depending on whether other warps can be scheduled while some are waiting for memory. If it is impactful, then it might be a case where there’s not enough time for the fetch to resolve before the data is used. Shuffling operations might help here, so the threads do something else before using the data.
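Here’s a rough sketch of what I mean (the loop and variable names are made up just to illustrate the pattern; the compiler will often do this kind of reordering for you):

```fortran
! Made-up loop: issue the global load early, do independent work while the
! load is in flight, and only consume the value afterwards, so the
! long-scoreboard dependency has more time to resolve.
!$cuf kernel do(1) <<<*,*>>>
do i = 1, n
   aval = a(i)            ! global load issued here; the wait begins
   tmp  = scale * real(i) ! independent arithmetic while the load is in flight
   c(i) = tmp + aval      ! first use of aval, after the latency has had time to clear
end do
```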
Occupancy is more about balancing the shared resources of the device, so having a high occupancy does not necessarily improve performance. Generally a 50% occupancy is considered very good, especially for Fortran, since these are more often scientific codes with larger kernels that use more resources per warp.
The first thing to ask is whether the kernel is fully utilizing the GPU. Each multiprocessor (SM) at 100% occupancy can run 2048 concurrent threads, so an H100 with 114 SMs needs 233,472 threads, or 116,736 at 50% occupancy. Since you’re using CUF kernels and, I presume, letting the compiler set the schedule (i.e. using "<<<*,*>>>"), it will use the loop bounds to set the number of blocks and threads per block. In other words, the product of the loop iterations determines how many threads are used. If there are too few iterations, there’s simply not enough work, which gets reflected as a low occupancy. The long scoreboard also becomes a factor since there aren’t enough warps to hide the memory latency.
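To make that concrete with a CUF kernel sketch (the loop body is just a placeholder):

```fortran
! With <<<*,*>>> the compiler derives the launch configuration from the loop
! bounds, so roughly nx*ny iterations -> nx*ny threads. On an H100 (114 SMs)
! you'd want around 233,472 iterations for 100% occupancy, ~116,736 for 50%.
!$cuf kernel do(2) <<<*,*>>>
do j = 1, ny
   do i = 1, nx
      a(i,j) = b(i,j) + scale * c(i,j)
   end do
end do
```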
If the kernel does have enough work, then register usage is often the main limiter to occupancy. There are 65,536 registers per SM, so to reach 100%, each of the 2048 threads can use a max of only 32 registers. If the threads use more registers, fewer of them can run. The difficult part is that register usage is dictated by the local variables used plus scratch space for things like intermediate computations and holding addresses. So for larger scientific kernels, the primary method is to split the kernel into multiple kernels, so long as the algorithm allows it and the cost of launching more kernels is less than the speed-up from the better occupancy. For example, at a hackathon I mentored, the code was ported from an older Fortran code that had a generic routine to handle seven cases. We split this into seven specialized kernels. While there was some repeated code, it was enough to halve the register usage.
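Roughly what that split looked like, with made-up names and only two of the seven cases shown:

```fortran
! Before (sketch): one generic loop that branches on a case selector, so every
! thread carries the registers needed for the temporaries of all the cases.
!$cuf kernel do(1) <<<*,*>>>
do i = 1, n
   select case (icase)
   case (1)
      a(i) = b(i) + c(i)
   case (2)
      a(i) = b(i) * c(i) + d(i)
   ! ... cases 3 through 7
   end select
end do

! After (sketch): one specialized loop per case, launched only when that case
! is active, so each kernel needs only its own temporaries.
select case (icase)
case (1)
   !$cuf kernel do(1) <<<*,*>>>
   do i = 1, n
      a(i) = b(i) + c(i)
   end do
case (2)
   !$cuf kernel do(1) <<<*,*>>>
   do i = 1, n
      a(i) = b(i) * c(i) + d(i)
   end do
! ... cases 3 through 7
end select
```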
We do have a flag, "-gpu=maxregcount:<n>", which sets the maximum number of registers per thread. This can be used to help occupancy. However, the data that once was stored in registers needs to go someplace, so it is instead stored in “local memory”, which is just private space in main memory. Registers then “spill” to local memory so other local data can be stored in registers. This often leads to lower performance since the extra data movement offsets the gains in occupancy. I very rarely use this flag unless the register count is borderline, like 33 or 129. But it’s interesting to experiment with to see how register usage affects occupancy.
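For example (file name made up), you can pair it with "-gpu=ptxinfo" to see the per-kernel register counts before and after:

```
nvfortran -cuda -gpu=cc90,ptxinfo mykernels.cuf
nvfortran -cuda -gpu=cc90,ptxinfo,maxregcount:32 mykernels.cuf
```

In the capped build, the ptxinfo output should also report local memory and spill loads/stores, which is where the displaced data went.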