Work Groups and Work Units: How do they correspond to actual GPU threads?

Hi, I was wondering if someone could explain how work groups and work units get executed on Nvidia hardware, in terms of kicking off threads (a thread being the smallest instance of execution, i.e. a single instance of the kernel).

Say I had n work groups, each with m work units, operating on an array of length K (just a simple 1-1 copy in the kernel, no criss-crossing between global index and array index). How many threads would execute at the same time on, say, a Quadro FX 580 (which has 4 compute units), and would they each work on 1 work group, or how?
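To make the setup concrete, here is a minimal Python sketch of the index arithmetic, assuming a one-dimensional range with no global offset. The names n, m and K come from the question; `global_id` and `copy_kernel_model` are hypothetical helpers for illustration only. In that case `global_id` matches what OpenCL's `get_global_id(0)` returns, namely `get_group_id(0) * get_local_size(0) + get_local_id(0)`.

```python
def global_id(group_id, local_id, m):
    """Global index of a work unit, assuming a 1-D range with no global offset."""
    return group_id * m + local_id


def copy_kernel_model(src, n, m):
    """Model of a 1-1 copy: each work unit copies the element at its own global index."""
    dst = [None] * len(src)
    for g in range(n):              # every work group
        for l in range(m):          # every work unit inside that group
            i = global_id(g, l, m)
            if i < len(src):        # guard in case n * m is larger than K
                dst[i] = src[i]
    return dst
```

The loops here run one after another only because this is a plain Python model; on a device the work units are independent and may run in any order or in parallel.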

Basically, it would be good if someone could write a chronology/timeline of execution explaining how the threads are connected to the work units and work groups.

For example:

Execute instance 0: 4 compute units work on work groups 0-3 -> thread 0 in ComputeUnit0 works on WorkUnit0 in WorkGroup0, t1 in CU0 works on WU1 in WG0, and so on…
Execute instance 1: 4 compute units work on work groups 4-7 -> thread 0 in ComputeUnit0 works on WorkUnit0 in WorkGroup4 …
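The kind of timeline sketched above can be modelled in a few lines of Python. This is a deliberately simplified sketch, not a description of what the hardware actually does: it assumes each compute unit runs exactly one work group at a time and that groups are handed out in order. Real devices may keep several work groups resident on one compute unit, and OpenCL does not guarantee any execution order between work groups. The function name `schedule_waves` is hypothetical.

```python
def schedule_waves(num_groups, num_compute_units=4):
    """Return a list of steps; each step maps compute unit index -> work group index.

    Assumption: one work group per compute unit per step, assigned in order.
    """
    waves = []
    for first in range(0, num_groups, num_compute_units):
        wave = {}
        for cu in range(num_compute_units):
            group = first + cu
            if group < num_groups:      # the last step may be only partly filled
                wave[cu] = group
        waves.append(wave)
    return waves


# Example: 10 work groups on a device with 4 compute units.
for step, wave in enumerate(schedule_waves(10)):
    print(f"step {step}: {wave}")
```

Under these assumptions the only thing that is fixed is that all work units of a given work group run on the same compute unit; everything else in the model is a simplification.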

I hope I am not asking for too much :)

I was going to point to a page that had links to Nvidia’s OpenCL Programming Guide and Best Practices Guide, but they no longer seem to be on the page:

http://www.nvidia.com/object/cuda_opencl_new.html

These PDFs, especially the Best Practices Guide, assume you know CUDA, and even use CUDA terminology in place of OpenCL’s. You practically need to do a translation in your head. It is terrible. That said, they are still the best place to start.
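For what it is worth, the translation mostly comes down to a handful of terms. Below is a small lookup sketch in Python; the pairs are the commonly cited closest equivalents, not exact one-to-one definitions, and the names `CUDA_TO_OPENCL` and `to_opencl` are made up for illustration. Note that OpenCL's official name for what this thread calls a "work unit" is "work-item".

```python
# CUDA term -> closest OpenCL term (approximate equivalents, not exact definitions)
CUDA_TO_OPENCL = {
    "thread": "work-item",
    "thread block": "work-group",
    "grid": "NDRange",
    "streaming multiprocessor": "compute unit",
    "shared memory": "local memory",
    "local memory": "private memory",
    "__syncthreads()": "barrier()",
}


def to_opencl(term):
    """Return the closest OpenCL term, or the input unchanged if it is not in the table."""
    return CUDA_TO_OPENCL.get(term.lower(), term)
```

The easy one to trip over is "local memory", which means per-thread memory in CUDA but per-work-group memory in OpenCL.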

If this removal was unintended, Nvidia, please put them back. A rewrite that uses OpenCL terminology would be really good.

I just noticed a sticky topic on AMD’s OpenCL forum that links to a video series. I watched about 3 minutes of the first video. It is too introductory for me at this point, but it might be good for someone who has not written their first kernel yet. I do not have an hour to sit through material I probably already know in the hope that there is something I do not.