Hi, I was wondering if someone could explain how work groups and work units get executed on Nvidia hardware, in terms of kicking off threads (a thread being the smallest instance of execution, i.e. a single instance of the kernel)?
Say I had n work groups, each with m work units, operating on an array of length K (just a simple 1-to-1 copy in the kernel, no criss-crossing between global index and array index). How many threads would be executing at the same time on, say, a Quadro FX 580 (which has 4 compute units), and would the compute units each work on one work group, or how is the work divided?
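To make the setup concrete, here is a minimal sketch in plain Python (not OpenCL) of the indexing I have in mind. It runs sequentially on the host, so it says nothing about how the hardware actually schedules anything. The function name `emulate_copy` and the assumption that K = n * m are mine, purely for illustration.

```python
def emulate_copy(src, n, m):
    """Sequentially emulate a 1-to-1 copy kernel over n work groups of m work units.

    Assumes len(src) == n * m (an assumption made for this sketch).
    """
    assert len(src) == n * m
    dst = [None] * len(src)
    for group_id in range(n):        # work group index
        for local_id in range(m):    # work unit index inside the group
            # Same value get_global_id(0) would give with no global offset
            global_id = group_id * m + local_id
            # 1-to-1 copy: array index is the global index
            dst[global_id] = src[global_id]
    return dst


if __name__ == "__main__":
    data = list(range(12))
    print(emulate_copy(data, n=3, m=4) == data)
```

What I would like to understand is how the two loops above are actually spread over the compute units and threads on the real device.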
Basically, it would be good if someone could write out a chronology/timeline of execution, explaining how the threads map to the work units and work groups.
For example:
Execute instance 0: the 4 compute units work on work groups 0-3 → thread 0 in ComputeUnit0 works on WorkUnit0 in WorkGroup0, t1 in CU0 works on WU1 in WG0, and so on…
Execute instance 1: the 4 compute units work on work groups 4-7 → thread 0 in ComputeUnit0 works on WorkUnit0 in WorkGroup4 …
…
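Written as code, the kind of timeline I am guessing at would look like the sketch below. This is only my guess, not a claim about what the hardware does: it assumes each compute unit takes exactly one work group per round and that groups are handed out consecutively, four at a time (so the second round would start at work group 4). The name `guessed_timeline` is made up for illustration.

```python
def guessed_timeline(num_groups, num_compute_units=4):
    """Return a list of rounds; each round maps compute unit -> work group.

    Hypothetical schedule: one work group per compute unit per round,
    with work groups assigned in consecutive order.
    """
    rounds = []
    for first in range(0, num_groups, num_compute_units):
        last = min(first + num_compute_units, num_groups)
        # Compute unit 0 gets the first group of the round, CU 1 the next, ...
        rounds.append({cu: wg for cu, wg in enumerate(range(first, last))})
    return rounds


if __name__ == "__main__":
    for step, assignment in enumerate(guessed_timeline(num_groups=10)):
        for cu, wg in assignment.items():
            print(f"instance {step}: CU{cu} -> WG{wg}")
```

If the real scheduling differs from this (for example, several work groups resident on one compute unit at once), that is exactly the part I would like explained.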
I hope I am not asking for too much :)