I’m working on porting a Fortran CPU code to GPUs. Data parallelization of this particular code is challenging: the data structures are not regular, memory access can’t really be coalesced, and the “unit of work” is too large for a single thread and too small for a large block of 256+ threads. I’m considering launching many (~2 million) blocks of 64 threads each instead of trying to use larger blocks, and then fusing as many kernels as possible to keep data in shared memory, where it can be coalesced. Is there a performance or other issue that comes with launching so many small blocks? A grid-stride loop is an option as well, if launching that number of blocks is a problem but their size isn’t. I anticipate about 20KB of shared memory usage per block and 150-200 registers per thread.
Focusing just on the block size (not the other concerns you have indicated, such as uncoalesced access), the main thing you would want to aim for is the possibility of full occupancy. This depends to some degree on the GPU you intend to run on. What we are looking for is for the product of the block size in warps and the maximum blocks per SM to be equal to or greater than the maximum warps per SM, pulling the relevant data from the table in the programming guide.
64 threads per block corresponds to 2 warps per block.
Example, for cc 6.1 GPU:
2 warps/block x 32 blocks/SM = 64 warps/SM (which matches what is in the table)
So 64 threads per block is a reasonable “minimum” size in this case. If we chose 32 threads per block in the same scenario, that would be 1 warp/block x 32 blocks/SM = a maximum of 32 warps/SM, which would not match the hardware maximum of 64 warps/SM, i.e. you could only achieve a maximum of 50% occupancy.
You can do a similar analysis depending on your specific GPU, and for most GPUs I think you will find that 64 threads per block is enough to achieve maximum occupancy looking strictly at the block size.
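As an aside, rather than working through the table by hand, you can ask the runtime for the same answer. A minimal sketch (mykernel is a hypothetical stand-in for your real kernel, and the 20KB figure from your post is passed here as dynamic shared memory per block):

#include <cstdio>
#include <cuda_runtime.h>

__global__ void mykernel(float *data) { /* stand-in for your real kernel */ }

int main()
{
    int blocksPerSM = 0;
    // how many 64-thread blocks, each with 20KB of shared memory, can be resident per SM
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocksPerSM, mykernel, 64, 20 * 1024);
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    int maxWarpsPerSM = prop.maxThreadsPerMultiProcessor / prop.warpSize;
    printf("%d blocks/SM = %d warps/SM (hardware max: %d warps/SM)\n",
           blocksPerSM, blocksPerSM * 2, maxWarpsPerSM);
    return 0;
}

Note that a trivial stand-in kernel won’t reflect your real register usage, so this is most useful once you have a working kernel to point it at.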
Other limiters to occupancy might include shared memory usage (you mention using shared memory), but this is not necessarily connected to block size or number of blocks. Typical shared memory usage tends to be predicated on threads, and threads per SM does not vary with how you divide the available warps among threadblocks, as long as you have taken into account the treatment above for occupancy based on warps.

Some folks will also mention a desire to have a minimum of 4 warps per threadblock. Whatever that is connected to, I would consider it a second-order concern, if any. There might be some performance impact/benefit to having 128 threads per threadblock, but I’d be surprised if it is measurable at more than a few percent difference; that is what I mean by “second-order”. OTOH, a 50% occupancy “mistake” could possibly cut your performance by more than a few percent, in my experience.
Your stated shared memory usage (20KB per block) and register usage (150-200 per thread) are going to be huge limiters to occupancy (and therefore, possibly, performance) compared to any concern about the number of threads per block taken by itself.
20KB of shared memory per 2-warp block means 10KB per warp, which is going to limit you to some number of warps in the range of 4-16 or so, depending on how much shared memory per SM your GPU provides, and these numbers are all below the 32-64 warps that are possible in a modern GPU SM.
The 64K registers available in most modern GPU SMs, divided by 150 registers/thread, will limit you to around 436 threads, or about 13 warps, again less than the 32-64 warps that are possible.
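To make that arithmetic concrete, here is a sketch that queries the actual per-SM limits from the device and computes the resulting warp counts (the 10KB/warp and 150 registers/thread figures are the ones from your post):

#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);

    const int smemPerWarp = 10 * 1024; // 20KB per 2-warp block -> 10KB/warp
    const int regsPerThread = 150;

    int maxWarps = prop.maxThreadsPerMultiProcessor / prop.warpSize;
    int smemWarps = (int)(prop.sharedMemPerMultiprocessor / smemPerWarp);
    int regWarps = prop.regsPerMultiprocessor / (regsPerThread * prop.warpSize);

    printf("hardware max:        %d warps/SM\n", maxWarps);
    printf("shared memory limit: %d warps/SM\n", smemWarps);
    printf("register limit:      %d warps/SM\n", regWarps);
    return 0;
}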
It’s generally a good idea (in my view) to get a proper working version of the code for benchmarking and baseline comparisons before allowing the considerations I have raised here to have a large impact on your initial coding direction. But it is probably wise to keep in mind that at some point you may discover occupancy to be a performance limiter (based on shared memory and register usage, not based on a block size of 64 threads per block). Furthermore, if you have an alternate design path that allows you to achieve full occupancy rather than 50% or less, it may be worth keeping that in your back pocket for comparison later.
I have certainly worked on codes where substantial refactoring effort to try to increase occupancy actually resulted in a performance decrease. In one particular case I recall, as I increased occupancy, my data working set grew beyond the L2 cache size, and the resulting significant negative performance hit counteracted any gains I might have gotten from better occupancy. In my experience there is no ideal alternative to some experimentation when you have alternate design ideas; to my mind, at least, the GPU is too complex to predict all side-effects ahead of time.
There is no first-order concern with launching a grid with a large number of blocks. A grid-stride loop may provide a performance benefit, but in my experience it is rarely more than a few percent. The primary purpose of a grid-stride loop (in my view) is to provide flexibility by decoupling the grid size from the work or data-set size. Some of the reasons to do this decoupling may be with an eye towards performance, but by itself a grid-stride loop realization of a basic transform-type operation, compared to a naive 1-thread-per-data-element realization, would rarely show more than a few percent difference in performance, and I promise you I have run across cases where the 1-thread-per-element realization (with its correspondingly larger grid size) actually ran slightly faster than the grid-stride loop realization. With a bit of searching you can find people who have posted examples of such.
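For reference, a minimal sketch of the two realizations being compared, using a made-up element-wise scale operation:

__global__ void scale_per_element(float *x, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x; // 1 thread per element
    if (i < n) x[i] *= 2.0f;
}

__global__ void scale_grid_stride(float *x, int n)
{
    // grid size is decoupled from n; each thread strides through the data
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n; i += gridDim.x * blockDim.x)
        x[i] *= 2.0f;
}

The first would be launched with roughly n/64 blocks of 64 threads each; the second with whatever grid size you choose, often just enough blocks to fill the machine.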
If your threads need to access individual data items by index (rather than exchanging data between threads), you can make the lowest dimension of an array in shared memory (the one responsible for bank conflicts) a [32] dimension, into which you put the lane number (assuming the elements have a size of 4 bytes).
Then you never get bank conflicts, regardless of the other indices.
E.g. (in C, which has reverse array index order compared to Fortran)
__shared__ volatile float array[80][2][32]; // 20 KiB total
int warp = threadIdx.y; // 0..1
int lane = threadIdx.x; // 0..31
int elem = 79; // example index, 0..79
float res = array[elem][warp][lane]; // lane indexes the lowest dimension, so each thread of a warp hits a distinct bank
It is also possible to change the ‘logical block size’ during a kernel run.
E.g. your kernel comprises steps A, B and C.
Then during A your 64 threads work on 4 units of work in a 4-iteration for loop; during B they split up, and 4 groups of 16 threads each do 1 unit of work simultaneously; and during C you have a 2-iteration for loop with 2 groups of 32 threads each doing 1 unit of work, as sketched below.
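E.g. a rough sketch in CUDA C++ (the per-unit work is elided, and the group arithmetic assumes blockDim.x == 64):

__global__ void phased_kernel(float *data)
{
    int tid = threadIdx.x; // 0..63

    // Phase A: all 64 threads cooperate on each of 4 units, one per iteration
    for (int unit = 0; unit < 4; ++unit) {
        // ... all 64 threads work on unit 'unit' ...
    }
    __syncthreads();

    // Phase B: regroup as 4 groups of 16 threads, one unit per group
    int groupB = tid / 16; // 0..3 -> which unit this thread works on
    int laneB = tid % 16;  // 0..15 -> position within the group
    // ... group 'groupB' works on its unit using laneB ...
    __syncthreads();

    // Phase C: regroup as 2 groups of 32 threads, 2 units per iteration
    int groupC = tid / 32; // 0..1
    int laneC = tid % 32;  // 0..31
    for (int it = 0; it < 2; ++it) {
        int unit = it * 2 + groupC; // units 0..3 across the 2 iterations
        // ... group 'groupC' works on unit 'unit' using laneC ...
    }
}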
Thank you for the detailed answer. Unfortunately, the shared memory usage is actually a hard requirement, and 20KB is the lowest likely limit: I’m loading a specific set of data into shared memory, and loading a smaller or bigger set isn’t really feasible at the current stage without creating an unmanageable level of complexity. Even the perfectly coalesced version of this code with regular data structures is typically limited by register usage, so I think I’ll just have to accept less-than-ideal occupancy. I’ll try experimenting with different block sizes to see what works best.
There are some possibilities to make low-occupancy code more feasible:
- Avoid operations with long latency (no long math calculations, not too many accesses to shared memory, few instructions for loading from global memory)
- Up until Turing, you could have the instructions for loading from global memory separated into their own ‘logical block’ per SM, e.g. having 9 blocks (8 for computations, 1 dedicated to loading from memory). The block used for loading from memory would do this work for the other blocks asynchronously, so they can continue to run without blocking (saving latency). As each of the 9 blocks would have 20 KiB of shared memory, this 9th block would switch its shared memory with one of the other blocks and then load the data for the next computation block. You would synchronize the blocks individually by using 8 named barriers. To be able to switch shared memory, all 9 ‘logical’ blocks would be inside one actual thread block with 9*64 threads.
- Alternatively, starting with Ampere you can directly load asynchronously from global into shared memory (see the asynchronous copy section of the CUDA C++ Programming Guide, and the sketch below).
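A minimal CUDA C++ sketch of the Ampere-style asynchronous copy via the cooperative groups memcpy_async API (the tile size and kernel shape are made up for the example):

#include <cooperative_groups.h>
#include <cooperative_groups/memcpy_async.h>
namespace cg = cooperative_groups;

__global__ void staged_kernel(const float *gdata)
{
    __shared__ float tile[5120]; // 20 KiB staging buffer
    auto block = cg::this_thread_block();
    // all threads in the block cooperatively issue the asynchronous copy
    cg::memcpy_async(block, tile, gdata + blockIdx.x * 5120, sizeof(float) * 5120);
    cg::wait(block); // wait for the copy to land in shared memory
    // ... compute on tile ...
}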
Unfortunately, this code is pretty much nothing but long math calculations and many accesses to shared or global memory. I’m not aiming for perfect performance; that will never happen with this kind of code. I’m mostly aiming for a significant speedup compared to the CPU version.
I’d love to use cuda::memcpy_async, but it’s not available in CUDA Fortran. Switching the CUDA portions of the code to C++ would be my preference, but I’m not in a position to dictate language choice in this project.
As far as I can tell, named barriers are also not supported in CUDA Fortran.
Just for future reference: there is a separate forum that some CUDA Fortran experts patrol, and detailed CUDA Fortran questions will probably get a better response there. But I don’t think any of the general statements I’ve made around occupancy, shared memory usage, or register usage would be any different in a CUDA Fortran setting (vs. CUDA C++).