Global Timing and Kernels

[Stop me when I go wrong]

I understand that my GT 240 has 12 multiprocessors (SMs), each with 8 SPs (giving me 12 x 8 = 96 CUDA cores), compute capability 1.2.

I’m trying to gain an understanding of how warps are allocated and scheduled on SPs, and the timing of such. For understanding, rather than practical application.

I’m using the clock() function in the kernel and note from the CUDA reference manual that it “returns the value of a per-multiprocessor counter that is incremented every clock cycle”.

Q1) Are the clock counters on all Multiprocessors synchronised, or can simultaneous calls to clock() from different SMs return wildly different values?

E.g. if Warp 0 (say thread 0) in Block 0 gets (say) 2345 when calling clock(), and Warp 0 in Block 1 gets 2347, and the blocks/warps are running on different multiprocessors – can I infer that W0B0 started 2 clock cycles before W0B1 in real time? (I/O latency ignored.)
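
For concreteness, what I’m doing is roughly the following (a minimal sketch, not my actual code; the kernel and variable names are just illustrative):

#include <cstdio>
#include <cuda_runtime.h>

// Thread 0 of each block samples clock() on kernel entry; the host then
// compares the recorded start values across blocks.
__global__ void record_start(unsigned int *start)
{
    if (threadIdx.x == 0)
        start[blockIdx.x] = (unsigned int)clock();   // per-SM cycle counter
}

int main()
{
    const int blocks = 2;
    unsigned int *d_start, h_start[blocks];
    cudaMalloc(&d_start, blocks * sizeof(unsigned int));
    record_start<<<blocks, 32>>>(d_start);
    cudaMemcpy(h_start, d_start, sizeof(h_start), cudaMemcpyDeviceToHost);
    printf("block 0 start: %u, block 1 start: %u\n", h_start[0], h_start[1]);
    cudaFree(d_start);
    return 0;
}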

AND Q2)

Is there any way for a block/warp/thread to determine which Multiprocessor (1 of 12) it has been allocated to? Likewise which actual core it is scheduled on?

Thanks
Charlie

Hi Charlie,

The answer to your second question will help you answer the first one: you can determine the ID of the SM you are running on via the %smid register, e.g. as described in the thread “using %smid? (compiling a ptx?)” on the CUDA Programming and Performance forum.
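
In short it boils down to reading the register with a bit of inline PTX, something like this sketch:

// Returns the ID of the SM the calling thread is currently running on.
__device__ unsigned int get_smid()
{
    unsigned int smid;
    asm("mov.u32 %0, %%smid;" : "=r"(smid));
    return smid;
}

The PTX ISA also notes that SM identifier numbering is not guaranteed to be contiguous.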

Top banana

Thank you.

Charlie

Good grief, now what’s going on?

I’ve pasted in the code from the above thread – great. And used that to extract an SM id.

But it’s returning values between 0 and 14. Which would be fine, except I’m running on a GT 240 that allegedly only has 12 SMs, not 15.

deviceQuery.exe reports 12 SMs x 8 cores, which gives me the 96 cores as it says on the tin. So why am I getting %smid values of 12, 13 and 14?

Do I actually have 15 physical SMs, and perhaps the GPU only uses 12 concurrently? Or has the smid value got a bit more to it than just the SM number?

[Edit: Scrap that. I’m only getting 12 distinct values returned - 0,1,2, 4,5,6, 8,9,10, 12,13,14. The values 3,7,11 are missing. D’oh! Presumably there’s some underlying hardware reason why SMs are not identified sequentially]

Charlie

Yeah, I would suspect that the GT 240 is die-harvested, with SMs that fail testing getting disabled. The other option is that the die was designed by deleting SMs from a bigger chip design, and the hard-coded SM numbers were left as-is.

Page 191 of the PTX ISA doc states that %nsmid can be greater than the number of SMs you can actually access.
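
(%nsmid can be read the same way as %smid, e.g. something like this sketch:)

// Returns %nsmid, the maximum number of SM identifiers -- this may exceed
// the number of SMs actually enabled on the device.
__device__ unsigned int get_nsmid()
{
    unsigned int nsmid;
    asm("mov.u32 %0, %%nsmid;" : "=r"(nsmid));
    return nsmid;
}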

My guess is that each SM has a hardwired SMID and that the “missing” SMIDs were disabled during manufacturing.

Or, I could be wrong, and it could be far more volatile of an identifier than that!

Just for kicks I made a test program. The gist is here.
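
(The gist isn’t reproduced here, but the idea is roughly the sketch below; the names and sizes are illustrative rather than the original code.)

#include <cstdio>
#include <cuda_runtime.h>

// Each block marks the %smid it ran on; the host then prints which SM
// identifiers were actually observed.
__global__ void record_smid(unsigned int *seen)
{
    unsigned int smid;
    asm("mov.u32 %0, %%smid;" : "=r"(smid));
    if (threadIdx.x == 0)
        seen[smid] = 1;
}

int main()
{
    const int MAX_SMID = 64;                      // generous upper bound
    unsigned int *d_seen, h_seen[MAX_SMID];
    cudaMalloc(&d_seen, MAX_SMID * sizeof(unsigned int));
    cudaMemset(d_seen, 0, MAX_SMID * sizeof(unsigned int));

    record_smid<<<1024, 32>>>(d_seen);            // plenty of blocks so every SM gets work
    cudaMemcpy(h_seen, d_seen, sizeof(h_seen), cudaMemcpyDeviceToHost);

    for (int i = 0; i < MAX_SMID; ++i)
        if (h_seen[i]) printf("%3d,", i); else printf(" --,");
    printf("\n");

    cudaFree(d_seen);
    return 0;
}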

Some results:

Tesla   K20c    (13) [  0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12 ]
GeForce GT 240  (12) [  0,  1,  2, --,  4,  5,  6, --,  8,  9, 10, --, 12, 13, 14, ... ]
GeForce GTX 680 ( 8) [  0,  1,  2,  3,  4,  5,  6,  7 ]
GeForce GT 545  ( 3) [  0,  1,  2 ]
GeForce 9400 GT ( 4) [  0,  1, --, --,  4,  5, ... ]

Note that pre-Fermi devices don’t support %nsmid.

The conclusion from this sample seems to be that only pre-Fermi devices have SMIDs that aren’t always in the range (0 … multiProcessorCount-1).

Dear Allan,
Thanks for the code:-)
Here are results for a few more GPUs:

GeForce GTX 580 (16)     [  0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15 ]
Tesla C2050 (14)         [  0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13 ]
Tesla T10 Processor (30) [  0,  1,  2, --,  4,  5,  6, --,  8,  9, 10, --, 12, 13, 14, --, 16, 17, 18, --, 20, 21, 22, --, 24, 25, 26, --, 28, 29 ]
GeForce GTX 295 (30)     [  0,  1,  2, --,  4,  5,  6, --,  8,  9, 10, --, 12, 13, 14, --, 16, 17, 18, --, 20, 21, 22, --, 24, 25, 26, --, 28, 29 ]
Quadro NVS 290 ( 2)      [  0,  1, --, --, --, --, --, --, --, --, --, --, --, --, --, --, --, --, --, --, --, --, --, --, --, --, --, --, --, -- ]

PS: the Tesla T10 Processor (30) and GeForce GTX 295 (30) also use smid 30, 32-34 and 36-38.

Tesla T10 Processor (30) [  0,  1,  2, --,  4,  5,  6, --,  8,  9, 10, --, 12, 13, 14, --, 16, 17, 18, --, 20, 21, 22, --, 24, 25, 26, --, 28, 29, 30, --, 32, 33, 34, --, 36, 37, 38 ]

Oops, a bug in the code snippet. Looks like the g_smid array should’ve been larger. I updated the gist.