I’m working with a Quadro 2000M.
The device properties tell me there are 4 multiprocessors, with a warp size of 32.
The specs say this chip has 192 cores.
4 * 32 = 128
Where do the other 64 cores come from?
It is a compute capability 2.1 device, so each SM has 48 CUDA cores: 4 SMs x 48 cores = 192.
Warp size has nothing to do with the number of cores.
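To make the arithmetic concrete, here is a minimal sketch in plain C++ of how the core count follows from the compute capability rather than the warp size. The function names `cores_per_sm` and `total_cores` are made up for illustration, and the table only covers the few compute capabilities listed in the comments. In a real program the inputs would come from the `major`, `minor` and `multiProcessorCount` fields that `cudaGetDeviceProperties` fills in.

```cpp
#include <cassert>

// CUDA cores per SM for a handful of compute capabilities.
// Hypothetical helper for illustration; returns -1 for anything not listed.
int cores_per_sm(int major, int minor) {
    if (major == 1) return 8;                      // compute capability 1.x
    if (major == 2) return minor == 0 ? 32 : 48;   // Fermi: 2.0 -> 32, 2.1 -> 48
    if (major == 3) return 192;                    // Kepler 3.x
    return -1;                                     // unknown to this sketch
}

// Total cores = number of SMs times cores per SM. Warp size is not involved.
int total_cores(int sm_count, int major, int minor) {
    assert(sm_count >= 0);
    int per_sm = cores_per_sm(major, minor);
    return per_sm < 0 ? -1 : sm_count * per_sm;
}
```

For the Quadro 2000M in this thread, `total_cores(4, 2, 1)` gives 192, which matches the spec sheet.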
Here, each SM has 48 HW-cores.
The SM can only run one kernel at a time. That would mean that every HW-core in the SM is running the same kernel. (Not some cores running kernelA, while others run kernelB.) True?
The warp scheduler issues work to the SM in units of warps. (Here a warp is 32 threads.) True?
How can I keep all 48 HW-cores busy if work is issued in units of 32?
What am I missing?
Fermi devices issue warp instructions to groups of 16 CUDA cores at a time, not all 32. On compute capability 2.1, there are two warp schedulers and one of them can issue two independent instructions from the same warp. So in general you will keep 32 cores busy if you have 2 warps ready to run all the time, and sometimes you will have 48 busy cores if there are warps with independent instructions. And remember that modern compute hardware is pipelined, so there are something like 10 warp instructions in the process of being executed by each group of 16 CUDA cores at any given time. This is why the CUDA programming guide recommends that you have a lot of warps available to maximize utilization.
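As a rough back-of-the-envelope model of the numbers above (not a description of the real scheduler), you can treat a compute capability 2.1 SM as 3 groups of 16 cores, with each issued warp instruction occupying one group. The names `kCoresPerGroup`, `kGroupsPerSM` and `busy_cores` are made up for this sketch, and it ignores pipelining and the other execution units entirely.

```cpp
#include <algorithm>
#include <cassert>

// Assumed toy model of one compute capability 2.1 SM:
// 48 CUDA cores arranged as 3 groups of 16.
const int kCoresPerGroup = 16;
const int kGroupsPerSM = 3;

// Cores kept busy when the schedulers issue this many warp instructions at
// once. Each instruction occupies one group of 16 cores, and issuing more
// instructions than there are groups cannot use more than all 48 cores.
int busy_cores(int warp_instructions_issued) {
    assert(warp_instructions_issued >= 0);
    return std::min(warp_instructions_issued, kGroupsPerSM) * kCoresPerGroup;
}
```

With two warps ready and one instruction issued from each, `busy_cores(2)` is 32. When one warp also supplies a second independent instruction, `busy_cores(3)` is 48.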