I’m new to the forum, nice to meet you all. I’d like to ask about getting maximum performance when dispatching tasks in DirectCompute on an NVIDIA GTX 980. As far as I know, to achieve maximum performance it is quite important to know the architecture of the graphics card you are using. In my case, the GTX 980 has 16 SMs, each with 4 WSs containing 32 CUDA cores per WS (a total of 16 × 4 × 32 = 2,048 CUDA cores).
Let’s say that my simple aim is to visualize 2,048 points (one point per core), all moving around the screen (I use Unity3D with DirectCompute). To do this, I assign a single point to each core and then perform some basic calculations to update the point’s position. My question is how to size the dispatch, i.e. the number of groups (a.k.a. blocks in CUDA) and the number of threads per group, to maximize the FPS.
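To make the sizing question concrete, here is a small Python sketch of how I work out the number of groups for a given group size. It only does the arithmetic; `groups_needed` is a hypothetical helper of mine, not part of DirectCompute or Unity, and it assumes one thread per point.

```python
import math

POINTS = 2048  # one point per thread, as described in this post


def groups_needed(points, threads_per_group):
    """Smallest number of groups that gives every point its own thread."""
    return math.ceil(points / threads_per_group)


# Compare the two group sizes I am considering.
for threads_per_group in (128, 256):
    print(threads_per_group, groups_needed(POINTS, threads_per_group))
```

Rounding up matters when the point count is not an exact multiple of the group size; in that case the shader would also need a bounds check on the thread index.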
As far as I know, each group of threads is assigned to an SM, and (in the case of the GTX 980) each SM can run 128 tasks simultaneously, in a single clock cycle (4 WSs × 32 threads each). So, given the GTX 980 architecture, which of the following options do you think gives better FPS?
1- If I dispatch 16 groups (e.g. [4,4,1]), each with 128 threads ([128,1,1]), then this will be done in 1 clock cycle, because each group will run on its own SM (and each SM can run 128 tasks at once). So the 2,048 tasks (16 SMs × 4 WSs × 32 threads) are all done in one go.
2- If I dispatch 8 groups (e.g. [4,2,1]), each with 256 threads ([256,1,1]), then this will be done in 2 clock cycles, because the 8 groups will run on 8 SMs (and each SM can run ‘only’ 128 of its 256 tasks at once). So, in the first clock cycle the 8 groups process 1,024 tasks (8 SMs × 4 WSs × 32 threads), and in the second cycle the remaining 1,024 tasks are done, for a total of 2 clock cycles.
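Here is the same reasoning as a short Python sketch, so the model I have in mind is explicit. It assumes exactly what I described above (one group per SM at a time, 128 threads per SM per pass); `naive_passes` is a hypothetical name, and this is only my simplified mental model, not necessarily how the hardware really schedules work.

```python
import math

SM_COUNT = 16       # SMs on the GTX 980, as stated in this post
CORES_PER_SM = 128  # 4 WSs x 32 cores each


def naive_passes(dispatch, threads_per_group):
    """Passes needed under my simplified model, not real hardware timing."""
    groups = math.prod(dispatch)  # e.g. [4, 4, 1] -> 16 groups
    passes_per_group = math.ceil(threads_per_group / CORES_PER_SM)
    rounds_of_groups = math.ceil(groups / SM_COUNT)
    return passes_per_group * rounds_of_groups


option_1 = naive_passes([4, 4, 1], 128)  # 16 groups of 128 threads
option_2 = naive_passes([4, 2, 1], 256)  # 8 groups of 256 threads
print(option_1, option_2)
```

Under this model option 1 takes one pass and option 2 takes two, which is the comparison I am asking about.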
Is this correct? Is the first option better in performance than the second one?
If yes, can I conclude that it is generally better to keep all of your SMs busy (in the case of the GTX 980, by dispatching a number of groups that is a multiple of 16), instead of dispatching more threads per group across fewer groups?
Thank you very much,
You’d be helping me a lot,