What's new in Compute Capability 2.1

With GF104, NVIDIA introduces a new compute capability version, 2.1, but I haven't found any formal description of it in the latest CUDA programming guide. Besides the so-called "ILP" mechanism that looks at a two-instruction window (which, I assume, is the mechanism that lets one instruction stream carry out an execution and a load/store at the same time?), are there any other updates? Thanks.

@nvidia (or @administrator)

I strongly criticize NVIDIA :angry: because the GF10x parts come with a nasty surprise. NVIDIA is acting like a pirate! I just bought a GTX480 for its excellent performance, but it misses the new CC 2.1. :angry: :(

I object to this kind of surprise and insist on a single compute capability across EACH generation of video cards!! - e.g.:
-GF10x (all of entry, mainstream and high end) - dx11 and cc2.0, but gf104, 106 and 108 have cc2.1
-GF11x - dx11 and cc2.2(?)
-GF12x - dx12(?) and cc3.0(?)

I can't follow everything you're saying, but where did you read that cc2.1 hardware has better performance than cc2.0? As far as I have seen so far, it is not the case.

Compute 2.1 doesn’t give any extra abilities to CUDA. Yes, the architecture is improved underneath, but it’s wholly transparent to CUDA apps.

I would guess the only reason there's a 2.1 compute label bump at all is for the compiler. Passing the compute 2.1 -arch flag to the compiler hints that it can tune its peephole optimizer to order instructions for ILP opportunities that GF104, with its two execution units per scheduler, can take advantage of and that GF100, with its single execution unit, cannot.
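
For illustration, here is a minimal sketch (mine, not from NVIDIA's docs) of the kind of independent work such a compiler pass could reorder: two dependency chains per thread that a dual-issue scheduler could pair up. The kernel and its names are hypothetical; you would compile with something like nvcc -arch=sm_21 where the toolkit accepts that flag (see the discussion further down).

```
// Hypothetical example: a grid-stride axpy with two independent
// multiply-add chains per thread. The computations of `a` and `b` have
// no data dependence on each other, so a superscalar scheduler (cc 2.1)
// is free to issue them back to back.
__global__ void ilpAxpy(const float *x, float *y, float scale, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int stride = gridDim.x * blockDim.x;

    for (; i + stride < n; i += 2 * stride) {
        float a = scale * x[i] + y[i];
        float b = scale * x[i + stride] + y[i + stride];
        y[i] = a;
        y[i + stride] = b;
    }
    if (i < n)                  // tail: at most one element left per thread
        y[i] = scale * x[i] + y[i];
}
```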

NVIDIA is acting like a pirate: the gf100 is expensive, yet it may be limited in running the CUDA samples, and the SAME generation of video cards has both cc2.0 and cc2.1 - two!?? I object to this! So this is bad news for the gtx480.

Sorry for my bad English.

To me it seems 2.0 is the safer bet for performance. From my understanding, you need twice as many instructions in flight to hide latency on 2.1 compared to 2.0 (ref: CUDA PG 5.2.3, Multiprocessor Level).
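
To put rough numbers on that (my own back-of-the-envelope reading of that section; the ~22-cycle arithmetic latency is the figure I believe the guide quotes for compute 2.x, so treat these as illustrative):

```
% instructions in flight needed to hide an arithmetic latency of L cycles
I_{2.0} = L \approx 22, \qquad I_{2.1} = 2L \approx 44
```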

I think 2.0 is a bit better than 2.1 :)

Note though that gf104 has 48 cores per SM, rather than 32, so it’s 3 half warps at a time rather than two, and shared/L1 is shared between more cores. Probably a change for the worse for CUDA code rather than for the better.
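
To make the "shared between more cores" point concrete, a back-of-envelope using the usual Fermi figures (64 KB of shared/L1 per SM, 48 KB shared in the larger split), which I am assuming apply here:

```
\frac{48\,\text{KB}}{32\,\text{cores}} = 1.5\,\text{KB/core (GF100)}
\qquad
\frac{48\,\text{KB}}{48\,\text{cores}} = 1.0\,\text{KB/core (GF104)}
```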

The improvement is that there are more texture units per core.

Have I missed something? It was previously possible to pass compute 2.1 to the compiler with the -arch flag, but I don't think there was any evidence it actually did anything. With CUDA 3.2 RC3, it is actively rejected by the compiler.
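
As an aside, whatever the compiler accepts, the runtime reports what the hardware supports; a minimal sketch using the standard runtime API (nothing GF104-specific assumed):

```
#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    int count = 0;
    cudaGetDeviceCount(&count);
    for (int dev = 0; dev < count; ++dev) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, dev);
        // A GF104 board (e.g. a GTX 460) reports 2.1 here, while a
        // GF100 board such as the GTX 480 reports 2.0.
        printf("device %d: %s, compute capability %d.%d\n",
               dev, prop.name, prop.major, prop.minor);
    }
    return 0;
}
```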

Except the texture units don’t work properly (at least not in CUDA).

With that 3rd core they're adding a minimal amount of superscalar logic (see Superscalar processor - Wikipedia) to the processor.

It's smart to add only a little of this, rather than a lot as on a CPU, because as you add more, the payoff per unit of circuitry drops off quickly, and the cores already fill their execution pipes quite well with temporal multithreading.

But yeah, as I understand it that's the main difference: the addition of a little bit of superscalar logic. It's transparent to the apps.
