I was looking at a chart on Wikipedia comparing the features of the different compute capabilities. From 3.5 onward, every compute capability appears to have the same feature set. Why is this?
I want to buy a GPU with compute capability 3.5, but if there is a significant difference between that and 5.0, I want to know.
OK, so it looks like the differences are in performance, not in actual language features. Right now I have a GTX 960, which is compute capability 5.2. I am running my display off it, which makes it hard or impossible to debug and run some CUDA programs. I wanted to buy a weaker GPU like a GT 730 just for debugging and development purposes. The 730 is compute capability 3.5, which I think should be fine.
Kepler's SMX units have been simplified and streamlined somewhat to become Maxwell's SMMs.
Anything listed as “Multiple instructions” in the “5.x” column of Table 2 in section 5.4.1, but given a fixed number of cycles for earlier compute capabilities, was removed in the transition from SMX to SMM.
SIMD video instructions were removed as a hardware feature. The hardware instructions for count of leading zeros and most significant non-sign bit were removed as well, as were those for 32-bit integer multiply, multiply-add, and extended-precision multiply-add.
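To make “multiple instructions” concrete for the 32-bit integer multiply case, here is a minimal host-side C++ sketch of how the low 32 bits of a product can be assembled from 16×16-bit multiply-adds. The helper names `mad16` and `mul32_from_16bit_parts` are made up for illustration, and this is only the general idea, not the instruction sequence the compiler actually emits for compute capability 5.x.

```cpp
#include <cstdint>

// Hypothetical helper: 16-bit x 16-bit multiply with a 32-bit addend.
// Only the low 16 bits of a16 and b16 take part in the product.
std::uint32_t mad16(std::uint32_t a16, std::uint32_t b16, std::uint32_t c) {
    return (a16 & 0xFFFFu) * (b16 & 0xFFFFu) + c;  // wraps modulo 2^32
}

// Low 32 bits of a * b, built from three 16x16-bit multiply-adds.
// The a_hi * b_hi term is shifted out entirely, so it is never computed.
std::uint32_t mul32_from_16bit_parts(std::uint32_t a, std::uint32_t b) {
    const std::uint32_t a_lo = a & 0xFFFFu, a_hi = a >> 16;
    const std::uint32_t b_lo = b & 0xFFFFu, b_hi = b >> 16;

    // Sum of the two cross terms; only its low 16 bits survive the shift.
    std::uint32_t cross = mad16(a_lo, b_hi, 0u);
    cross = mad16(a_hi, b_lo, cross);

    // Low partial product plus the shifted cross terms.
    return mad16(a_lo, b_lo, cross << 16);
}
```

The result matches a native `uint32_t` multiplication for every input, so the language-level behavior is unchanged; only the number of instructions issued differs.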
Also, the instruction scheduler was greatly simplified. The compiler now has to provide extra control words indicating data dependencies and timings, instead of complex circuitry on the GPU working this out at run time with methods such as scoreboarding.
Could you point to the part of the CUDA documentation that you are referring to? Sometimes documentation is buggy, too, and there is neither a debugger nor a regression test one can run on documentation …
The table clearly lists the number of operations per clock cycle as 32 for compute capability 3.x and as “Multiple instructions” for compute capability 5.x. This seems to imply that a hardware instruction was replaced by a software emulation requiring multiple instructions.
“count of leading zeros”, a.k.a. __clz(), has been a two-instruction sequence ever since the FLO instruction was added in sm_20. Prior to that it was a five-instruction sequence if I recall correctly.
So “multiple instructions” would be correct for count leading zeros, across all currently supported architectures. “Most significant non-sign bit” would appear to be a reference to the FLO instruction itself? If so, that is obviously a single instruction. I do not know how the throughput of FLO has changed between architectures; it is certainly possible that its throughput has been lowered in sm_5x.
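The two-instruction sequence for count leading zeros can be sketched on the host in C++: one “find leading one” step followed by one subtraction. The function `find_leading_one` is a hypothetical software model that follows the convention of PTX `bfind.u32` (bit index of the most significant set bit, or 0xFFFFFFFF when the input is zero); that the hardware FLO instruction behaves the same way is an assumption here.

```cpp
#include <cstdint>

// Hypothetical model of a "find leading one" operation, following the
// PTX bfind.u32 convention: returns the bit index (0..31) of the most
// significant set bit, or 0xFFFFFFFF if no bit is set.
std::uint32_t find_leading_one(std::uint32_t x) {
    if (x == 0u) return 0xFFFFFFFFu;
    std::uint32_t pos = 0u;
    while (x >>= 1) ++pos;  // count shifts until only the top bit is gone
    return pos;
}

// Count leading zeros as "find leading one" plus one subtraction.
// For x == 0 the unsigned wrap-around gives 31 - 0xFFFFFFFF == 32.
std::uint32_t clz_via_flo(std::uint32_t x) {
    return 31u - find_leading_one(x);
}
```

With this convention the zero-input case needs no extra branch in `clz_via_flo`, which is why a single subtraction is enough as the second step.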
While the document does not indicate that hardware support was removed, NVIDIA may want to clarify the description to avoid misunderstandings, since functionality does not seem to have changed all the way from sm_20 to sm_52 with regard to these operations (performance may well have changed).
I’ve filed a bug internally at NVIDIA regarding this. Having said that, I’m not entirely sure what the “Multiple instructions” entries listed under cc5.x refer to, so there may be some logic here, although it’s not evident.