CUDA Cores, Multiprocessors and OpenCL

Hello,

I’m reading the OpenCL Programming Guide, and I’m realizing that newer Nvidia GPUs have fewer multiprocessors than previous GPUs:

GTX 480 = 15 vs GTX 280 = 30

So my first question is: does that mean the way to improve scalability in my OpenCL app for the future is not to count on having more multiprocessors?

But I can see that the number of CUDA cores is higher, and I don’t find any mapping between CUDA cores and threads, thread blocks, etc. in the OpenCL Programming Guide.

Could you clarify for me what a CUDA core is in OpenCL terms? Or what it contains?

I’m assuming that CUDA cores are part of multiprocessors, right? And the way Nvidia improved this new GPU generation was by increasing the number of cores inside each multiprocessor. Is that true?

Thank you

There is the option to completely ignore the number of multiprocessors (and thus any translation to CUDA cores) when determining worksize/workgroup settings. Writing generic calibration routines that find the best settings by observation seems to me both more future-proof and more vendor-proof. The number of processors may not always be the limiting factor for best performance, especially when many registers are used or for image I/O. It takes some experimenting to get the calibrator to return the same, or a very close, value on consecutive calibration runs, but I found it possible.

It is not that things like the number of registers cannot be controlled, but with so many moving parts, getting actual timings at all possible settings seems like the ultimate test. This process also produces a very useful artifact: the best time. Knowing what that is before and after making changes can be very instructive.

Just store the device and software versions of the last calibration, and check every time that nothing has changed since then. If something has changed, recalibrate first. OpenCL 1.1 adds a new kernel info query that tells you the increment to use when stepping up the workgroup size. This means code that used to guess this number on platforms like OS X and ATI can soon be retired.