Global Timing and Kernels

[Stop me when I go wrong]

I understand that my GT 240 has 12 multiprocessors (SMs), each with 8 SPs (giving me 12 x 8 = 96 CUDA cores), compute capability 1.2.

I’m trying to gain an understanding of how warps are allocated and scheduled on SPs, and the timing of such. For understanding, rather than practical application.

I’m using the clock() function in the kernel and note from the CUDA reference manual that it “returns the value of a per-multiprocessor counter that is incremented every clock cycle”.

Q1) Are the clock counters on all Multiprocessors synchronised, or can simultaneous calls to clock() from different SMs return wildly different values?

E.g. if Warp 0 (say thread 0) in Block 0 gets (say) 2345 when calling clock(), and Warp 0 in Block 1 gets 2347, and the blocks/warps are running on different multiprocessors – can I infer that W0B0 started 2 clock cycles before W0B1 in real time? (I/O latency ignored.)
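
For concreteness, what I’m doing is roughly the following (a minimal sketch, not my actual code; the kernel and variable names are just illustrative):

#include <cstdio>
#include <cuda_runtime.h>

// Thread 0 of each block samples clock() on kernel entry; the host then
// compares the recorded start values across blocks.
__global__ void record_start(unsigned int *start)
{
    if (threadIdx.x == 0)
        start[blockIdx.x] = (unsigned int)clock();   // per-SM cycle counter
}

int main()
{
    const int blocks = 2;
    unsigned int *d_start, h_start[blocks];
    cudaMalloc(&d_start, blocks * sizeof(unsigned int));
    record_start<<<blocks, 32>>>(d_start);
    cudaMemcpy(h_start, d_start, sizeof(h_start), cudaMemcpyDeviceToHost);
    printf("block 0 start: %u, block 1 start: %u\n", h_start[0], h_start[1]);
    cudaFree(d_start);
    return 0;
}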

AND Q2)

Is there any way for a block/warp/thread to determine which Multiprocessor (1 of 12) it has been allocated to? Likewise which actual core it is scheduled on?

Thanks
Charlie

Hi Charlie,

The answer to your second question will help you answer the first one: you can determine the ID of the SM you are running on via the %smid register, e.g. as described in the thread “using %smid? (compiling a ptx?)” on the CUDA Programming and Performance forum.
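
In short it boils down to reading the register with a bit of inline PTX, something like this sketch:

// Returns the ID of the SM the calling thread is currently running on.
__device__ unsigned int get_smid()
{
    unsigned int smid;
    asm("mov.u32 %0, %%smid;" : "=r"(smid));
    return smid;
}

The PTX ISA also notes that SM identifier numbering is not guaranteed to be contiguous.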

Top banana

Thank you.

Charlie

Good grief, now what’s going on?

I’ve pasted in the code from the above thread – great. And used that to extract an SM id.

But it’s returning values between 0 and 14. Which would be fine, except I’m running on a GT 240 that allegedly only has 12 SMs, not 15.

deviceQuery.exe reports 12 SMs x 8 cores, which gives me the 96 cores as it says on the tin. So why am I getting %smid values of 12, 13 and 14?

Do I actually have 15 physical SMs, and perhaps the GPU only uses 12 concurrently? Or has the smid value got a bit more to it than just the SM number?

[Edit: Scrap that. I’m only getting 12 distinct values returned - 0,1,2, 4,5,6, 8,9,10, 12,13,14. The values 3,7,11 are missing. D’oh! Presumably there’s some underlying hardware reason why SMs are not identified sequentially]

Charlie

Yeah, I would suspect that the GT 240 is die-harvested, with SMs that fail testing getting disabled. The other option is that the die was designed by deleting SMs from a bigger chip design, and the hard-coded SM numbers were left as-is.

Page 191 of the PTX ISA doc states that %nsmid can be greater than the number of SMs you can actually access.
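
(%nsmid can be read the same way as %smid, e.g. something like this sketch:)

// Returns %nsmid, the maximum number of SM identifiers -- this may exceed
// the number of SMs actually enabled on the device.
__device__ unsigned int get_nsmid()
{
    unsigned int nsmid;
    asm("mov.u32 %0, %%nsmid;" : "=r"(nsmid));
    return nsmid;
}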

My guess is that each SM has a hardwired SMID and that the “missing” SMIDs were disabled during manufacturing.

Or, I could be wrong, and it could be far more volatile of an identifier than that!

Just for kicks I made a test program. The gist is here.
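
(The gist isn’t reproduced here, but the idea is roughly the sketch below; the names and sizes are illustrative rather than the original code.)

#include <cstdio>
#include <cuda_runtime.h>

// Each block marks the %smid it ran on; the host then prints which SM
// identifiers were actually observed.
__global__ void record_smid(unsigned int *seen)
{
    unsigned int smid;
    asm("mov.u32 %0, %%smid;" : "=r"(smid));
    if (threadIdx.x == 0)
        seen[smid] = 1;
}

int main()
{
    const int MAX_SMID = 64;                      // generous upper bound
    unsigned int *d_seen, h_seen[MAX_SMID];
    cudaMalloc(&d_seen, MAX_SMID * sizeof(unsigned int));
    cudaMemset(d_seen, 0, MAX_SMID * sizeof(unsigned int));

    record_smid<<<1024, 32>>>(d_seen);            // plenty of blocks so every SM gets work
    cudaMemcpy(h_seen, d_seen, sizeof(h_seen), cudaMemcpyDeviceToHost);

    for (int i = 0; i < MAX_SMID; ++i)
        if (h_seen[i]) printf("%3d,", i); else printf(" --,");
    printf("\n");

    cudaFree(d_seen);
    return 0;
}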

Some results:

Tesla   K20c    (13) [  0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12 ]
GeForce GT 240  (12) [  0,  1,  2, --,  4,  5,  6, --,  8,  9, 10, --, 12, 13, 14, ... ]
GeForce GTX 680 ( 8) [  0,  1,  2,  3,  4,  5,  6,  7 ]
GeForce GT 545  ( 3) [  0,  1,  2 ]
GeForce 9400 GT ( 4) [  0,  1, --, --,  4,  5, ... ]

Note that pre-Fermi devices don’t support %nsmid.

The conclusion from this sample seems to be that only pre-Fermi devices have SMIDs that aren’t always in the range (0 … multiProcessorCount-1).

Dear Allan,
Thanks for the code:-)
Here are results for a few more GPUs:

GeForce GTX 580 (16)     [  0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15 ]
Tesla C2050 (14)         [  0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13 ]
Tesla T10 Processor (30) [  0,  1,  2, --,  4,  5,  6, --,  8,  9, 10, --, 12, 13, 14, --, 16, 17, 18, --, 20, 21, 22, --, 24, 25, 26, --, 28, 29 ]
GeForce GTX 295 (30)     [  0,  1,  2, --,  4,  5,  6, --,  8,  9, 10, --, 12, 13, 14, --, 16, 17, 18, --, 20, 21, 22, --, 24, 25, 26, --, 28, 29 ]
Quadro NVS 290 ( 2)      [  0,  1, --, --, --, --, --, --, --, --, --, --, --, --, --, --, --, --, --, --, --, --, --, --, --, --, --, --, --, -- ]

PS: the Tesla T10 Processor (30) and GeForce GTX 295 (30) also use smid 30, 32-34 and 36-38.

Tesla T10 Processor (30) [  0,  1,  2, --,  4,  5,  6, --,  8,  9, 10, --, 12, 13, 14, --, 16, 17, 18, --, 20, 21, 22, --, 24, 25, 26, --, 28, 29, 30, --, 32, 33, 34, --, 36, 37, 38 ]

Oops, a bug in the code snippet. Looks like the g_smid array should’ve been larger. I updated the gist.