What is the compute capability of the RTX 40 series?

The compute capability of the 30 series is 8.6 (or 8.9?), but I cannot find the value for the 40 series on the website. I now need to compile TensorFlow with CUDA 12 for an RTX 40 series GPU; which compute capability should I choose, 8.9 or higher?

RTX 40 series GPUs are compute capability 8.9

I know what you mean, and I also got that value from an earlier TensorFlow version. But I am confused by the 40 series at 8.9 vs. the 30 series at 8.6; that increment seems too small to be right. The 10 series is 6.1, the 20 series is 7.5, and the 30 series is 8.6, so I would expect the 40 series to be at least 9.6.

Maybe 8.9 is just the highest value TensorFlow could support before the 40 series existed.

It’s OK if you don’t believe me.

Try running the deviceQuery CUDA sample code on your 40 series GPU.
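
Or, if you don't want to build the samples, a few lines against the CUDA runtime API report the same thing. A minimal sketch of my own (not the deviceQuery sample itself), using cudaGetDeviceProperties:

#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    int count = 0;
    cudaGetDeviceCount(&count);
    for (int dev = 0; dev < count; ++dev) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, dev);
        // prop.major / prop.minor is the compute capability, e.g. 8.9 for Ada
        printf("Device %d: %s, compute capability %d.%d\n",
               dev, prop.name, prop.major, prop.minor);
    }
    return 0;
}

Compile it with nvcc and run it; on an RTX 40 series card it should print 8.9.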

Good luck!

The increments between compute capabilities are not of a prescribed fixed size. The only requirement is that a new GPU architecture is assigned a higher number than any previous architecture. For example, the Ada Lovelace architecture could have been assigned the number 8.8 instead of 8.9 if NVIDIA had wanted that instead. A useful overview of architectures and associated GPUs can be found in the Wikipedia article on CUDA.

NVIDIA appears to be in the habit of assigning a new major version number (integer) to each “major” architecture, and assigning some minor number (fraction) to each architecture derived from such a major architecture. What NVIDIA considers a “major” architecture, and how it picks the numbers to enumerate the derived architectures is a detail we are not privy to and that does not matter for CUDA programmers.

8.9 ~=9 #amIRight? FP4Life!

Tables 14 and 15 here outline some quite significant differences between the two.

Depends on what you mean by significant. I don't do a lot with tensor cores, and I'm still waiting on 64-bit signed integer atomic adds, 17 years into this. I can work around it with:

atomicAdd((unsigned long long int*)pInt64, llitoulli(a));

Where:

// Reinterpret a signed 64-bit integer as unsigned without changing its bits.
static __device__ inline unsigned long long int llitoulli(long long int l)
{
    unsigned long long int u;
    asm("mov.b64 %0, %1;" : "=l"(u) : "l"(l));
    return u;
}

But they did this with Volta and Turing, and now with Ampere and Ada. For me, it's just another architecture to tune L1/SMEM utilization and grid dimensions for. Before that, Celsius to Fermi to Kepler to Maxwell to Pascal to Volta was unambiguous.
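
To make that atomicAdd workaround concrete, here is a minimal self-contained sketch of my own (the kernel name, launch configuration, and test values are just for illustration). Two's-complement addition produces the same bit pattern whether the operands are read as signed or unsigned, so reinterpreting through llitoulli and adding as unsigned yields the correct signed sum:

#include <cstdio>
#include <cuda_runtime.h>

static __device__ inline unsigned long long int llitoulli(long long int l)
{
    unsigned long long int u;
    asm("mov.b64 %0, %1;" : "=l"(u) : "l"(l));
    return u;
}

// Each thread atomically adds a (possibly negative) 64-bit value to *pInt64.
__global__ void accumulate(long long int* pInt64, long long int a)
{
    atomicAdd((unsigned long long int*)pInt64, llitoulli(a));
}

int main()
{
    long long int* d_sum = nullptr;
    cudaMalloc(&d_sum, sizeof(long long int));
    cudaMemset(d_sum, 0, sizeof(long long int));

    accumulate<<<1, 256>>>(d_sum, -3);   // 256 threads each add -3

    long long int h_sum = 0;
    cudaMemcpy(&h_sum, d_sum, sizeof(long long int), cudaMemcpyDeviceToHost);
    printf("sum = %lld\n", h_sum);       // expected: -768
    cudaFree(d_sum);
    return 0;
}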

More than double the shared memory per block on 9.0 is something I'd find pretty useful.

Except that shared memory is a crippled shadow of what it once was pre-7.x, and all my high-performance code now relies almost entirely on the register file and synchronous warp collectives, relegating SMEM to L1 instead. Six years in, I have yet to see the benefit of race conditions within warps in real code. To me, H100's value proposition is distributed computing on a single DGX server or a cluster of them, rather than minor differences in CUDA device properties and a few bespoke instructions.
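
As an illustration of what I mean by keeping everything in the register file with synchronous warp collectives, here is a minimal sketch of my own (names and sizes are just for the example): a warp-level sum reduction built on __shfl_down_sync that never touches shared memory, with one atomic per warp to combine results:

#include <cstdio>
#include <cuda_runtime.h>

// Sum a value across the 32 threads of a warp using only registers.
__device__ inline int warpReduceSum(int v)
{
    // 0xffffffff: all lanes participate; the _sync variant avoids the
    // implicit warp-synchronous races that were tolerated before Volta.
    for (int offset = 16; offset > 0; offset >>= 1)
        v += __shfl_down_sync(0xffffffff, v, offset);
    return v;  // lane 0 now holds the warp total
}

__global__ void sumKernel(const int* in, int* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int v = (i < n) ? in[i] : 0;
    v = warpReduceSum(v);
    if ((threadIdx.x & 31) == 0)   // one atomic per warp, not per thread
        atomicAdd(out, v);
}

int main()
{
    const int n = 1024;
    int h_in[n];
    for (int i = 0; i < n; ++i) h_in[i] = 1;

    int *d_in, *d_out;
    cudaMalloc(&d_in, n * sizeof(int));
    cudaMalloc(&d_out, sizeof(int));
    cudaMemcpy(d_in, h_in, n * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemset(d_out, 0, sizeof(int));

    sumKernel<<<(n + 255) / 256, 256>>>(d_in, d_out, n);

    int h_out = 0;
    cudaMemcpy(&h_out, d_out, sizeof(int), cudaMemcpyDeviceToHost);
    printf("sum = %d\n", h_out);   // expected: 1024
    cudaFree(d_in);
    cudaFree(d_out);
    return 0;
}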

I believe you, for sure, and I did get 8.9 from the TensorFlow device list. I have now compiled TensorFlow 2.12 with CUDA 12 and cuDNN 8.8. Thanks!

Thanks, this gave me a deeper understanding of compute capability!