What is the difference between CUDA compute level 3.5 and 5.0?

I was looking at a chart on Wikipedia about the features of the different compute levels. After 3.5 it appears that every following compute level is the same. Why is this?

I want to buy a GPU that is compute level 3.5, but if there is some significant difference between that and a 5.0 I want to know.

Maybe you should look at the NVIDIA docs:


Here are some tables that list differences – 3.5 and 5.x look different to me:


especially Table 12.

OK, it just looks like differences in performance but not in actual language features. Right now I have a GTX 960, which is of compute level 5.2. I am running my display off of it, which makes it hard or impossible to debug and run some CUDA programs. I wanted to buy a weaker GPU like a GT 730 just for debugging and development purposes. The 730 is of compute level 3.5, which I think should be fine.

Kepler’s SMX units have been simplified and streamlined somewhat to become Maxwell’s SMMs.

Anything listed as requiring “Multiple instructions” in the “5.x” column of Table 2 in section 5.4.1, but which had a fixed throughput in previous compute capabilities, was removed in the transition from SMX to SMM.


SIMD video instructions were removed as a hardware feature, as were the hardware instructions for count of leading zeros, most significant non-sign bit, 32-bit integer multiply, multiply-add, and extended-precision multiply-add.

Also, the instruction scheduler was simplified greatly: the compiler now has to provide extra control words indicating data dependences and timings, instead of complex circuitry on the GPU figuring this out at run time using methods such as scoreboarding.

I don’t see that. The FLO instruction seems to be very much alive. Given the following kernel, instantiated for ‘int’:

template <class T>
__global__ void kernel (T *out, const T *in)
{
    *out = __clz (*in);
}

the SASS for sm_35 is

code for sm_35
                Function : _Z6kernelIiEvPT_PKS0_
        .headerflags    @"EF_CUDA_SM35 EF_CUDA_PTX_SM(EF_CUDA_SM35)"
                                                           /* 0x08a010dc10a01000 */
        /*0008*/                   MOV R1, c[0x0][0x44];   /* 0x64c03c00089c0006 */
        /*0010*/                   MOV R4, c[0x0][0x148];  /* 0x64c03c00291c0012 */
        /*0018*/                   MOV R5, c[0x0][0x14c];  /* 0x64c03c00299c0016 */
        /*0020*/                   LD.E R0, [R4];          /* 0xc4800000001c1000 */
        /*0028*/                   MOV R2, c[0x0][0x140];  /* 0x64c03c00281c000a */
        /*0030*/                   MOV R3, c[0x0][0x144];  /* 0x64c03c00289c000e */
        /*0038*/                   FLO.U32 R0, R0;         /* 0xe1800000001c0002 */
                                                           /* 0x0800000000b810a0 */
        /*0048*/                   ISUB R0, 0x1f, R0;      /* 0xc09000000f9c0001 */
        /*0050*/                   ST.E [R2], R0;          /* 0xe4800000001c0800 */
        /*0058*/                   EXIT;                   /* 0x18000000001c003c */
        /*0060*/                   BRA 0x60;               /* 0x12007ffffc1c003c */
        /*0068*/                   NOP;                    /* 0x85800000001c3c02 */
        /*0070*/                   NOP;                    /* 0x85800000001c3c02 */
        /*0078*/                   NOP;                    /* 0x85800000001c3c02 */

while the SASS for sm_52 is

code for sm_52
                Function : _Z6kernelIiEvPT_PKS0_
        .headerflags    @"EF_CUDA_SM52 EF_CUDA_PTX_SM(EF_CUDA_SM52)"
                                                           /* 0x001f8800fe2007e6 */
        /*0008*/                   MOV R1, c[0x0][0x20];   /* 0x4c98078000870001 */
        /*0010*/                   MOV R2, c[0x0][0x148];  /* 0x4c98078005270002 */
        /*0018*/                   MOV R3, c[0x0][0x14c];  /* 0x4c98078005370003 */
                                                           /* 0x001fc401fe2000bd */
        /*0028*/                   LDG.E R0, [R2];         /* 0xeed4200000070200 */
        /*0030*/                   MOV R2, c[0x0][0x140];  /* 0x4c98078005070002 */
        /*0038*/                   MOV R3, c[0x0][0x144];  /* 0x4c98078005170003 */
                                                           /* 0x001fc401fc41071d */
        /*0048*/                   FLO.U32 R0, R0;         /* 0x5c30000000070000 */
        /*0050*/                   IADD R0, -R0, 0x1f;     /* 0x3812000001f70000 */
        /*0058*/                   STG.E [R2], R0;         /* 0xeedc200000070200 */
                                                           /* 0x001f8000ffe007ff */
        /*0068*/                   EXIT;                   /* 0xe30000000007000f */
        /*0070*/                   BRA 0x70;               /* 0xe2400fffff87000f */
        /*0078*/                   NOP;                    /* 0x50b0000000070f00 */

Hey, I only paraphrased the official CUDA documentation on arithmetic instructions ;)

Who am I to judge its correctness?

Could you point to the part of the CUDA documentation that you are referring to? Sometimes documentation is buggy, too, and there is neither a debugger nor a regression test one can run on documentation …

Section 5.4.1, Table 2, column “5.x”, row 11 (not counting the header), titled “count of leading zeros, most significant non-sign bit”


clearly lists the number of operations per clock cycle as 32 for compute capability 3.x and as “Multiple Instructions” for compute capability 5.x. This seems to imply that a hardware instruction was replaced by a software emulation requiring multiple instructions.

“count of leading zeros”, a.k.a. __clz(), has been a two-instruction sequence ever since the FLO instruction was added in sm_20; prior to that it was a five-instruction sequence, if I recall correctly.

So “multiple instructions” would be correct for count leading zeros, across all currently supported architectures. “most significant non-sign bit” would appear to be a reference to the FLO instruction itself? If so, that is obviously a single instruction. I do not know how the throughput of FLO has changed between architectures, it is certainly possible that its throughput has been lowered in sm_5x.
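The two-instruction sequence visible in both SASS listings (FLO.U32 followed by a subtract from 0x1f) can be modeled in plain C as follows. The loop in `flo_u32` stands in for the single hardware FLO instruction and is purely illustrative; the names are my own, not CUDA API names.

```c
#include <stdint.h>

/* C model of the FLO + subtract pair: FLO.U32 yields the bit position
   of the most significant set bit (and -1, i.e. 0xFFFFFFFF, for zero
   input), then clz(x) = 31 - flo(x) turns that into a count of
   leading zeros, matching the ISUB/IADD in the SASS above. */
static int flo_u32(uint32_t x)
{
    int pos = -1;                  /* stand-in for the hardware FLO */
    for (int i = 0; i < 32; i++)
        if (x & (1u << i))
            pos = i;
    return pos;
}

static int clz_emul(uint32_t x)
{
    return 31 - flo_u32(x);        /* the second instruction of the pair */
}
```

Note that the zero case falls out naturally: flo returns -1, so clz(0) comes out as 32, which is what __clz(0) returns.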

While the document does not indicate that hardware support was removed, NVIDIA may want to clarify the description to avoid misunderstandings, since functionality does not seem to have changed all the way from sm_20 to sm_52 with regard to these operations (performance may well have changed).

I’ve filed a bug internally at NVIDIA to track this. Having said that, I’m not entirely sure what is being referenced by the “Multiple instructions” entry under cc 5.x, so it may be that there is some logic here, although it’s not entirely evident.