How many kernel instances are actually running?

Hello. Consider the Tesla C1060. It has 30 multiprocessors and 240 cores.

  1. Does that mean there are 8 cores per multiprocessor (likely)? Or 240 cores per multiprocessor?
  2. Can each core run up to one warp (32 threads) or one half-warp (16 threads) simultaneously?
  3. Finally: what is the total number of instances of a kernel that could run simultaneously on this device?

Thank you in advance.

It’s 30 multiprocessors with 8 cores each.

Each core runs a full warp in 4 clock cycles, one quarter-warp per cycle, since all threads in a warp share one instruction counter. The memory controller works on 16-thread requests (either the first or the second half-warp).

Each multiprocessor can handle a limited number of blocks and a limited number of warps (the numbers depend on the compute capability; if memory serves, it’s 8 blocks and 32 warps on the C1060).

There is one kernel active at a time on the entire card. I’m not sure what you mean by number of instances, though: the kernel is just the device code that each thread runs, so theoretically you could have 30*1024 = 30720 “instances” at a time, or 30*32 = 960 if we count a vectorized unit (a warp) as a single instance.

Correction: each multiprocessor can issue a single instruction for a full warp in 4 clock cycles (assuming C1060 as mentioned above).

Each core can be thought of as working on a single thread in any given clock cycle, which is why it takes 4 clock cycles to get through a single instruction for 32 threads with 8 cores.
