[Stop me when I go wrong]
I understand that my GT240 has 12 multiprocessors (SMs), each with 8 SPs (giving me 12 x 8 = 96 CUDA cores). Compute capability 1.2.
I’m trying to understand how warps are allocated to and scheduled on SPs, and the timing involved. This is for understanding rather than practical application.
I’m using the clock() function in the kernel and note from the CUDA reference manual that it “returns the value of a per-multiprocessor counter that is incremented every clock cycle”.
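For reference, here is a minimal sketch of the kind of test I mean. The kernel name recordStart, the array names, and the launch configuration (12 blocks of one warp each) are just my choices for illustration, and I’m assuming, not guaranteeing, that the blocks end up on different SMs.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel: thread 0 of each block stores the value of the
// per-multiprocessor counter as read by clock() when the block starts.
__global__ void recordStart(clock_t *start)
{
    if (threadIdx.x == 0)
        start[blockIdx.x] = clock();
}

int main()
{
    const int numBlocks = 12;        // assumption: one block per SM on a GT240
    const int threadsPerBlock = 32;  // a single warp per block

    clock_t hostStart[numBlocks];
    clock_t *devStart = NULL;

    if (cudaMalloc((void **)&devStart, sizeof(hostStart)) != cudaSuccess) {
        fprintf(stderr, "cudaMalloc failed\n");
        return 1;
    }

    recordStart<<<numBlocks, threadsPerBlock>>>(devStart);

    // cudaMemcpy waits for the kernel to finish before copying.
    cudaError_t err = cudaMemcpy(hostStart, devStart, sizeof(hostStart),
                                 cudaMemcpyDeviceToHost);
    if (err != cudaSuccess) {
        fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(err));
        cudaFree(devStart);
        return 1;
    }

    // Print the raw counter value seen by each block.
    for (int b = 0; b < numBlocks; ++b)
        printf("block %d: clock() = %ld\n", b, (long)hostStart[b]);

    cudaFree(devStart);
    return 0;
}
```

The values printed are raw counter readings; whether they can be compared across blocks is exactly what I’m asking below.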
Q1) Are the clock counters on all Multiprocessors synchronised, or can simultaneous calls to clock() from different SMs return wildly different values?
E.g. if warp 0 (say thread 0) in block 0 gets (say) 2345 when calling clock(), and warp 0 in block 1 gets 2347, and the two blocks are running on different multiprocessors, can I infer that W0B0 started 2 clock cycles before W0B1 in real time? (Ignoring I/O latency.)
Q2) Is there any way for a block/warp/thread to determine which multiprocessor (1 of 12) it has been allocated to? Likewise, can it determine which actual core (SP) it is scheduled on?