My question is: when setting the grid size, is it better to set it to 110 instead of 25, for example?
Quick Answer: It depends
Better Answer: Trial and error is the best solution.
computations are performed in warps (32 threads each), and optimal SM load usually requires a block size of 128 or 256 threads. if you have less work than that, look for ways to combine multiple work items into a single thread block
next, a single SM can run up to 2048 threads simultaneously, so with 128 threads per block, each SM can run up to 2048/128 = 16 blocks
next, note that each block runs on a single SM, so you need at least as many blocks as you have SMs to fill the entire GPU, and you may need 16x more than that to fill all GPU resources if each block is 128 threads wide
finally, some SMs will finish their work earlier than others, so to keep GPU utilization near 100%, you need at least 10-20x more work than is required to fill the GPU at any particular moment. Alternatively, you can use multiple streams to push work to the GPU concurrently and avoid the "tail effect"
so, overall, you need at least ~100k threads to fill the GPU. With only 10 threads, your CPU will definitely be faster
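The arithmetic above can be sketched on the host by querying the device properties (the 128-thread block size and the 10x oversubscription factor are just the rules of thumb from this thread, not fixed constants):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);

    const int blockSize = 128;  // threads per block, as discussed above
    // Blocks needed just to occupy every SM at its max resident thread count:
    int blocksToFill = prop.multiProcessorCount *
                       (prop.maxThreadsPerMultiProcessor / blockSize);
    // Rule-of-thumb oversubscription to hide the "tail effect":
    int suggested = blocksToFill * 10;

    printf("SMs: %d, blocks to fill GPU: %d, suggested grid: ~%d blocks (~%d threads)\n",
           prop.multiProcessorCount, blocksToFill, suggested,
           suggested * blockSize);
    return 0;
}
```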
Has anybody tried using cudaOccupancyMaxActiveBlocksPerMultiprocessor to configure the grid?
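For reference, a sketch of how that API can be used; `myKernel` is a placeholder kernel, but the call pattern is the standard CUDA occupancy API, which accounts for the kernel's actual register and shared-memory usage:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__global__ void myKernel(float *data, int n) {  // placeholder kernel
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;
}

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);

    const int blockSize = 128;
    int blocksPerSM = 0;
    // How many resident blocks of myKernel fit on one SM at this block size:
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocksPerSM, myKernel,
                                                  blockSize,
                                                  0 /* dynamic smem bytes */);
    int gridSize = blocksPerSM * prop.multiProcessorCount;
    printf("blocks/SM: %d -> a grid of %d blocks fills the GPU\n",
           blocksPerSM, gridSize);
    return 0;
}
```

There is also cudaOccupancyMaxPotentialBlockSize, which additionally suggests a block size instead of taking one as input.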
No matter what, hopefully the code was written using grid-stride loops :P
The only problem I’ve ever run into when using those, though, is typing `return` instead of `continue`. Don’t do that lol XD
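For anyone unfamiliar, a grid-stride loop looks like this (a sketch with a hypothetical kernel; the comment marks the `return`-vs-`continue` pitfall mentioned above):

```cuda
__global__ void scale(float *data, int n, float s) {
    // Grid-stride loop: correct for any grid size, even one smaller than n
    for (int i = blockIdx.x * blockDim.x + threadIdx.x;
         i < n;
         i += blockDim.x * gridDim.x) {
        if (data[i] == 0.0f)
            continue;  // skip this element only; a `return` here would make the
                       // thread abandon ALL of its remaining loop iterations
        data[i] *= s;
    }
}
```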