CUDA Fortran optimization strategies

Hi…I am relatively new to using CUDA Fortran. I've successfully managed to port a legacy Fortran numerical model code to run on a GPU by using kernel loop directives and device global memory to save the variables used in the computations every timestep. Although the performance is good relative to running on a CPU (~40 times faster), I suspect there is more to be gained based on the diagnostic output of Nsight Compute. Apparently, resolving/improving long scoreboard stalls and warp occupancy could mean unlocking more speed gains. However, I am struggling to find resources and strategies for addressing these issues. Some approaches I tried (coalesced data access and shared memory) do not seem to help at all. Are there any good resources out there (including code examples, preferably in Fortran) to help with further optimizing GPU performance based on the diagnostic results of Nsight Compute? Any references will be appreciated.

Thanks!

Other than the Nsight Compute docs themselves, I'm not aware of other documentation. Maybe it's out there, but I would think it would be difficult to write since so much is contextual, based on the specifics of the program, so it would be hard to generalize. The Nsight Compute forum might be a good resource, though, since you can see questions asked by others, which may be helpful or at least give you ideas. Most folks post CUDA C questions, but the strategies apply to Fortran as well.

For the long scoreboard stalls, these may or may not impact performance, depending on whether other warps can be scheduled while some are waiting on memory. If they are impactful, it might be a case where there's not enough time for the fetch to resolve before the data is used. Reordering operations might help here, so the threads have other, independent work to do between issuing the load and consuming the data.
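As a hedged sketch of that idea (hypothetical kernel and array names, not your code): issue the global load early, then give the warp independent arithmetic to execute while the fetch is in flight, so the dependent use lands after the latency window.

```fortran
module stall_example
   use cudafor
contains
   attributes(global) subroutine reorder_kernel(a, b, n)
      real :: a(*), b(*)           ! device arrays (implicit for global routines)
      integer, value :: n
      integer :: i
      real :: tmp, acc
      i = (blockIdx%x - 1) * blockDim%x + threadIdx%x
      if (i <= n) then
         tmp = b(i)                ! global load issued here...
         acc = 0.5 * real(i)       ! ...independent work overlaps the fetch
         acc = acc + sin(acc)      ! still no dependence on tmp
         a(i) = tmp * acc          ! tmp consumed last, after the latency window
      end if
   end subroutine reorder_kernel
end module stall_example
```

The compiler will often do this scheduling on its own, but manually hoisting loads above independent work can help when it doesn't.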

Occupancy is more about balancing the shared resources of the device, so a high occupancy does not necessarily improve performance. Generally, 50% occupancy is considered very good, especially for Fortran, since these are more often scientific codes with larger kernels that use more resources per warp.

The first thing to ask is whether the kernel is fully utilizing the GPU. Each multiprocessor (SM) at 100% occupancy can run 2048 concurrent threads, so an H100 with 114 SMs needs 233,472 threads, or 116,736 at 50% occupancy. Since you're using CUF kernels and, I presume, letting the compiler set the schedule (i.e., using "<<<*,*>>>"), it will use the loop bounds to set the number of blocks and threads per block. In other words, the product of the loop iterations determines how many threads are used. If there are too few iterations, there's simply not enough work, which gets reflected as low occupancy. The long scoreboard also becomes a factor, since there aren't enough warps to hide the memory latency.
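As an illustration (hypothetical arrays and sizes), here is a minimal CUF kernel with a compiler-chosen schedule; the total thread count comes from the product of the loop bounds (n*m), so if that product falls well below the ~233K threads an H100 can keep in flight, the profiler will report low occupancy no matter what else you tune.

```fortran
program cuf_schedule
   use cudafor
   implicit none
   integer, parameter :: n = 1024, m = 512   ! n*m = 524,288 threads' worth of work
   real, device, allocatable :: a_d(:,:), b_d(:,:)
   real, allocatable :: a(:,:)
   integer :: i, j

   allocate(a(n,m), a_d(n,m), b_d(n,m))
   a_d = 1.0
   b_d = 2.0

   ! <<<*,*>>> lets the compiler derive the number of blocks and
   ! threads per block from the bounds of the tightly nested loops.
   !$cuf kernel do(2) <<<*,*>>>
   do j = 1, m
      do i = 1, n
         a_d(i,j) = a_d(i,j) + b_d(i,j)
      end do
   end do

   a = a_d                        ! copy result back to host
   print *, 'a(1,1) =', a(1,1)    ! expect 3.0
end program cuf_schedule
```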

If the kernel does have enough work, then register usage is often the main limiter on occupancy. There are 65,536 registers per SM, so to reach 100%, each of the 2048 threads can use a max of only 32 registers. If the threads use more registers, fewer of them can run. The difficult part is that register usage is dictated by the local variables used plus scratch space for things like intermediate computations and holding addresses. So for larger scientific kernels, the primary method is to split the kernel into multiple kernels, so long as the algorithm allows it and the cost of launching more kernels is less than the speed-up from the better occupancy. For example, at a hackathon I mentored, the code was ported from an older Fortran code that had one generic routine handling seven cases. We split this into seven specialized kernels; see the sketch below. While there was some repeated code, it was enough to halve the register usage.
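As a hedged sketch of that pattern (hypothetical names, not the hackathon code): a generic kernel that branches on a case flag is allocated registers for its worst-case branch, while specialized kernels each carry only their own.

```fortran
module split_example
   use cudafor
contains
   ! Before: one generic kernel; the register allocation covers the most
   ! expensive branch even when a cheap case is the one that runs.
   attributes(global) subroutine generic_kernel(a, n, icase)
      real :: a(*)
      integer, value :: n, icase
      integer :: i
      i = (blockIdx%x - 1) * blockDim%x + threadIdx%x
      if (i > n) return
      select case (icase)
      case (1)
         a(i) = sqrt(a(i))                          ! cheap case
      case (2)
         a(i) = exp(a(i)) * log(abs(a(i)) + 1.0)    ! register-hungry case
      end select
   end subroutine generic_kernel

   ! After: one specialized kernel per case, launched from the host based
   ! on icase; each compiles to only the registers its own case needs.
   attributes(global) subroutine case1_kernel(a, n)
      real :: a(*)
      integer, value :: n
      integer :: i
      i = (blockIdx%x - 1) * blockDim%x + threadIdx%x
      if (i <= n) a(i) = sqrt(a(i))
   end subroutine case1_kernel
end module split_example
```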

We do have a flag, "-gpu=maxregcount:<n>", which sets the maximum number of registers per thread. This can be used to help occupancy. However, the data that was once stored in registers needs to go someplace, so it is instead stored in "local memory", which is just per-thread private space in main memory. Registers then "spill" to local memory so other local data can be kept in registers. This often leads to lower performance, since the extra data movement offsets the gains in occupancy. I very rarely use this flag unless the register count is just over a boundary, like 33 or 129, but it's interesting to experiment with to see how register usage affects occupancy.
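For example (assuming nvfortran; "model.f90" is a hypothetical source file), you can compare the reported per-kernel register counts with and without the cap; the "-gpu=ptxinfo" sub-option prints them at compile time:

```console
$ nvfortran -cuda -gpu=ptxinfo model.f90                   # report registers per kernel
$ nvfortran -cuda -gpu=ptxinfo,maxregcount:32 model.f90    # cap at 32 registers/thread
```

Watch that output for spill stores/loads; if the cap introduces spills, the occupancy gain is often a net loss.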

Thanks for the reply. I will look into the Nsight Compute documents and the aspects you mentioned.