Dispatching tasks properly on NVIDIA 980 GTX [DirectCompute]

Hi all,

I’m new to the forum, nice to meet you all. I’d like to ask you something about getting maximum performance when dispatching tasks in DirectCompute on an NVIDIA 980 GTX. As far as I know, to achieve maximum performance it is quite important to know the architecture of the graphics card you are using. In my case, I use an NVIDIA 980 GTX that has 16 SMs, each one with 4 warp schedulers (WS), each WS containing 32 CUDA cores (16 × 4 × 32 = 2,048 CUDA cores in total).

Let’s say that my simple aim is to visualize 2,048 points (one point per core), all moving around the screen (I use Unity3D with DirectCompute). To do this, I assign one point to each core and then perform some basic calculations to change the point’s position. My question is how to dispatch (or scale) groups (a.k.a. blocks in CUDA) and threads to maximize the fps.

As far as I know, each group of threads is assigned to an SM, and (in the case of an NVIDIA 980) each SM can run 128 tasks simultaneously, in a single clock cycle (4 WS × 32 threads each). So, given the NVIDIA 980 architecture, my question is: which option do you think is better in terms of FPS?

1- If I dispatch 16 groups (e.g. [4,4,1]), each one of 128 threads ([128,1,1]), then this will be done in 1 clock cycle, because each group will be run on one SM (and each one can run 128 tasks instantaneously). So, the 2,048 tasks (16 SMs × 4 WS × 32 threads) are all done at once.
2- If I dispatch 8 groups (e.g. [4,2,1]), each one of 256 threads ([256,1,1]), then this will be done in 2 clock cycles, because the 8 groups will run on 8 SMs (and each one can run ‘only’ 128 of its 256 tasks instantaneously). So, in the first clock cycle the 8 groups process 1,024 tasks (8 SMs × 4 WS × 32 threads), and in the second cycle the remaining 1,024 tasks are done. Total of 2 clock cycles.
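
For reference, the arithmetic behind the two options can be checked with a small sketch. It is plain Python used only as a calculator, not DirectCompute code. The hardware figures are the ones quoted in this post, the helper names are made up for illustration, and nothing here models actual execution time.

```python
# Figures quoted in the post for the GTX 980 (not measured here).
SM_COUNT = 16
SCHEDULERS_PER_SM = 4
WARP_SIZE = 32  # threads per warp on NVIDIA hardware


def total_threads(groups, threads_per_group):
    """Threads launched by a dispatch of `groups`, each of `threads_per_group`."""
    gx, gy, gz = groups
    tx, ty, tz = threads_per_group
    return gx * gy * gz * tx * ty * tz


def warps_per_group(threads_per_group):
    """Number of 32-thread warps needed for one group (rounded up)."""
    tx, ty, tz = threads_per_group
    return -(-(tx * ty * tz) // WARP_SIZE)


cores = SM_COUNT * SCHEDULERS_PER_SM * WARP_SIZE

# Option 1: 16 groups of 128 threads. Option 2: 8 groups of 256 threads.
option_1 = total_threads((4, 4, 1), (128, 1, 1))
option_2 = total_threads((4, 2, 1), (256, 1, 1))

print(cores, option_1, option_2)  # 2048 2048 2048
print(warps_per_group((128, 1, 1)), warps_per_group((256, 1, 1)))  # 4 8
```

Both options launch the same 2,048 threads; they differ only in how those threads are packed into groups (4 warps per group versus 8).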

Is this correct? Is the first option better in performance than the second one?
If yes, can I conclude that it is generally better to keep all your SMs working together (in the case of the NVIDIA 980 GTX, by dispatching a number of groups that is a multiple of 16), instead of dispatching more threads per group across fewer groups?

Thank you very much,
You’d be helping me a lot,
Kind regards
Antonio

i suppose by groups you mean blocks

i would agree that, in general, knowing how much work you have, and the kind of work, it is better to set up and distribute blocks between sm’s as best as possible
but this is ‘in general’

depending on the work done, you may or may not see a significant difference in execution time between 1) and 2), because your clock cycle calculations rest on assumptions and are ‘optimistic’, and 2) can easily ‘interleave’ its work such that it appears as 1)

blocks of kernels are also distributed such that you cannot truly know on which sm a block is likely to run, or how the blocks will be distributed in the end

Many folks approach GPU programming from this perspective, but it is the wrong perspective.

Your goal should be to oversubscribe the GPU. It wants and needs many more threads than “cores”, so on a 980GTX your goal would be to dispatch work that has many more than 2048 threads in it. Think 100,000 threads or more.
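
As a rough sketch of what this means for the dispatch size, the group count for a given number of points is just a rounded-up division. The snippet below is plain Python used only for the arithmetic; the 128-thread group size and the helper name `group_count` are assumptions for illustration, and the group size would have to match whatever the compute shader actually declares.

```python
def group_count(num_items, threads_per_group):
    """Groups needed so that every item gets a thread (ceiling division)."""
    return (num_items + threads_per_group - 1) // threads_per_group


THREADS_PER_GROUP = 128  # assumed group size, must match the shader's declaration

for n in (2048, 100_000):
    groups = group_count(n, THREADS_PER_GROUP)
    spare = groups * THREADS_PER_GROUP - n  # threads in the last group with no item
    print(n, groups, spare)
# 2048 16 0
# 100000 782 96
```

When the item count is not a multiple of the group size, the last group contains spare threads, so the shader needs a bounds check on the thread index before touching the data.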

Stated another way, if you have at most 2048 threads of work, the problem is poorly suited to your 980GTX GPU, and the average performance will be disappointing.

You don’t get to choose how the work is distributed to SMs, so there is little point in worrying about it.