What's new in Compute Capability 2.1

With GF104, NVIDIA introduces a new compute capability version, 2.1, but I haven't found any formal description of it in the latest CUDA programming guide. Besides the so-called "ILP" mechanism that looks at a two-instruction window (which, I assume, is the mechanism that lets one instruction stream carry out an execution and a load/store at the same time?), are there any other updates? Thanks.

@nvidia (or @administrator)

I strongly criticize NVIDIA :angry: because the GF10x parts come with a nasty surprise. NVIDIA is acting like a pirate! I just bought a GTX480 for its excellent performance, but it misses the new CC 2.1. :angry: :(

I object to this kind of surprise and insist on a single compute capability across EACH generation of video cards!! - e.g.:
-GF10x (all of entry, mainstream and high end) - dx11 and cc2.0, but gf104, 106 and 108 have cc2.1
-GF11x - dx11 and cc2.2(?)
-GF12x - dx12(?) and cc3.0(?)

I can't follow everything you're saying, but where did you read that cc2.1 hardware has better performance than cc2.0? As far as I have seen so far, it is not the case.

Compute 2.1 doesn’t give any extra abilities to CUDA. Yes, the architecture is improved underneath, but it’s wholly transparent to CUDA apps.

I would guess the only reason there's a 2.1 compute label bump at all is for the compiler. Passing the compute 2.1 -arch flag to the compiler hints that it can tune its peephole optimizer to order instructions for ILP opportunities that GF104, with its two execution units per scheduler, can take advantage of and that GF100, with its single execution unit, cannot.
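
For illustration, here is a minimal sketch (mine, not from NVIDIA's docs) of the kind of independent work such a compiler pass could reorder: two dependency chains per thread that a dual-issue scheduler could pair up. The kernel and its names are hypothetical; you would compile with something like nvcc -arch=sm_21 where the toolkit accepts that flag (see the discussion further down).

```
// Hypothetical example: a grid-stride axpy with two independent
// multiply-add chains per thread. The computations of `a` and `b` have
// no data dependence on each other, so a superscalar scheduler (cc 2.1)
// is free to issue them back to back.
__global__ void ilpAxpy(const float *x, float *y, float scale, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int stride = gridDim.x * blockDim.x;

    for (; i + stride < n; i += 2 * stride) {
        float a = scale * x[i] + y[i];
        float b = scale * x[i + stride] + y[i + stride];
        y[i] = a;
        y[i + stride] = b;
    }
    if (i < n)                  // tail: at most one element left per thread
        y[i] = scale * x[i] + y[i];
}
```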

NVIDIA is acting like a pirate: the gf100 is expensive, yet it may be limited in running the CUDA samples, and the SAME generation of video cards has both cc2.0 and cc2.1 - two!?? I object to this! So this is bad news for the gtx480.

Sorry for my bad English.

To me it seems 2.0 is the safer bet for performance. From my understanding, you need twice as many instructions in flight to hide latency on 2.1 compared to 2.0 (ref: CUDA PG 5.2.3, Multiprocessor Level).
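
To put rough numbers on that (my own back-of-the-envelope reading of that section; the ~22-cycle arithmetic latency is the figure I believe the guide quotes for compute 2.x, so treat these as illustrative):

```
% instructions in flight needed to hide an arithmetic latency of L cycles
I_{2.0} = L \approx 22, \qquad I_{2.1} = 2L \approx 44
```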

I think 2.0 is a bit better than 2.1 :)

Note though that gf104 has 48 cores per SM, rather than 32, so it’s 3 half warps at a time rather than two, and shared/L1 is shared between more cores. Probably a change for the worse for CUDA code rather than for the better.
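
To make the "shared between more cores" point concrete, a back-of-envelope using the usual Fermi figures (64 KB of shared/L1 per SM, 48 KB shared in the larger split), which I am assuming apply here:

```
\frac{48\,\text{KB}}{32\,\text{cores}} = 1.5\,\text{KB/core (GF100)}
\qquad
\frac{48\,\text{KB}}{48\,\text{cores}} = 1.0\,\text{KB/core (GF104)}
```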

The improvement is that there are more texture units per core.

Have I missed something? It was previously possible to pass compute 2.1 to the compiler with the -arch flag, but I don't think there was any evidence it actually did anything. With CUDA 3.2 RC3, it is actively rejected by the compiler.
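
As an aside, whatever the compiler accepts, the runtime reports what the hardware supports; a minimal sketch using the standard runtime API (nothing GF104-specific assumed):

```
#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    int count = 0;
    cudaGetDeviceCount(&count);
    for (int dev = 0; dev < count; ++dev) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, dev);
        // A GF104 board (e.g. a GTX 460) reports 2.1 here, while a
        // GF100 board such as the GTX 480 reports 2.0.
        printf("device %d: %s, compute capability %d.%d\n",
               dev, prop.name, prop.major, prop.minor);
    }
    return 0;
}
```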

Except the texture units don’t work properly (at least not in CUDA).

With that 3rd core they're adding a minimal amount of superscalar logic (see Superscalar processor - Wikipedia) to the processor.

It's smart to add only a little of this, rather than a lot as on a CPU, because as you add more, the payoff per unit of circuitry drops off quickly, and the cores already fill their execution pipes quite well with temporal multithreading.

But yeah, as I understand it that's the main difference: the addition of a little bit of superscalar logic. It's transparent to the apps.
