GF100 vs GF104 Performance question

hi all,
I was planning to buy an inexpensive graphics card with the Fermi architecture.
After reading about the GTX 470, GTX 465, and GTX 460, I figured out that the GTX 460 is built on the GF104 chip, which was redesigned as a cheaper solution for games. I couldn't find any spec sheet or whitepaper for GF104. I found one diagram of the GF104 architecture, which contains only 2 graphics processing clusters vs. 4 in GF100, and in GF104 each SM contains 48 SPs, while in GF100 each SM contains 32 SPs.

my questions are:

  1. Are there any specs or whitepapers from NVIDIA about GF104?
  2. Does the change in the number of SPs per SM change the number of threads that can be resident on an SM?
  3. Is the shared memory available to each SM the same, or was it increased because of the extra SPs?
  4. What is the difference between the GTX 470 and the GTX 460 (1GB) in terms of compute performance (FLOPS)?

Finally, do you think I should get both cards so I can do more testing?

I am using this for academic purposes, for my thesis in computer science.

The detailed CUDA specs will come with CUDA 3.2 (3.1 predates GF104).

For a theoretical peak perf comparison, just look at the NVIDIA specs page:
http://www.nvidia.com/object/product_geforce_gtx_470_us.html
http://www.nvidia.com/object/product-geforce-gtx-460-us.html

They list memory bandwidth, and you can get peak FLOPS by multiplying the shader clock rate by the number of cores and by the number of flops per clock (2 for FMA).
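To make that formula concrete, here is the arithmetic using the core counts and shader clocks from those spec pages (448 cores at 1215 MHz for the GTX 470, 336 cores at 1350 MHz for the GTX 460 1GB; treat the numbers as my reading of the pages, so double-check them):

```python
# Peak single-precision FLOPS = cores * shader clock * flops per clock.
# An FMA (fused multiply-add) counts as 2 flops per clock.
def peak_gflops(cores, shader_clock_ghz, flops_per_clock=2):
    return cores * shader_clock_ghz * flops_per_clock

print(peak_gflops(448, 1.215))  # GTX 470:      ~1088.6 GFLOPS
print(peak_gflops(336, 1.350))  # GTX 460 1GB:  ~907.2 GFLOPS
```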

As far as real application benchmarking goes, I'll post my apps' benchmarks on Thursday when my GF104 arrives :) (I already have a GTX 480 for work).

thanks a lot for the information, it was very useful to me,
and I am so excited to see your app benchmarks :)

right now all I have to work with is an 8800 GTX, a GTX 295, and a Tesla C1070. I am so excited to use GF100 and GF104

I’m eager to hear how you get on with it. I’m not particularly fond of mine (and I bought it with my own money too…).

I'll add some questions:

What is the maximum number of concurrent kernels? (I'm guessing 8 instead of 16.)

Do we optimise instruction order ourselves, or will an updated CUDA compiler do it for us?

On a 2GB GTX 460, what does OpenCL report for MAX_MEM_ALLOC_SIZE? 512MB?

I'll note that in diagrams of the GF104 SMs the register file is the same size as on GF100 (32768 x 32-bit) but shared between 48 cores instead of 32.
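To put that in perspective: if the maximum resident thread count per SM is unchanged at 1536 (an assumption until the 3.2 docs confirm it), the same register file gives:

```python
# Registers available per thread at full occupancy, assuming the
# compute 2.x limit of 1536 resident threads per SM still applies.
register_file = 32768       # 32-bit registers per SM (same on GF100 and GF104)
max_threads_per_sm = 1536   # assumed unchanged from GF100
regs_per_thread = register_file // max_threads_per_sm
print(regs_per_thread)      # 21 registers/thread before occupancy starts to drop
```

So the per-thread register budget at full occupancy is the same as on GF100; it is the per-core share that shrinks.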

I find it discouraging that specifications aren't released at the same time as the chipset.

I have the impression that the GPGPU team at NVIDIA is playing catch-up with the hardware/driver teams.

As I understand it, the GTX 460 should be thought of as half a GTX 480 (i.e. 8 multiprocessors instead of 16) with the same amount of registers and shared memory per multiprocessor. Again, the maximum number of threads per multiprocessor remains the same. In some circumstances actual compute performance (FLOPS) will be up to 50% higher per multiprocessor due to its ability to extract instruction-level parallelism. Texture fill rate is in theory pretty high (GF104 has twice as many texture units per multiprocessor as GF100), but this seems to be almost impossible to achieve.

I'm assuming that the optimiser/assembler is going to be responsible for optimising the instruction order, since it's pretty much impossible to do this manually with the NVIDIA tools. What I don't understand is that presumably the optimiser/assembler is the same for CUDA and for Direct3D and is part of the video driver rather than the CUDA toolkit. So does that mean this is as good as it's ever going to get?
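The 50% figure falls straight out of the core counts: each GF104 SM has 48 cores but still schedules 32-wide warps, so hitting peak requires dual-issuing a second independent instruction. A back-of-envelope check (assuming warp width is unchanged):

```python
# Per-SM throughput headroom of GF104 over GF100, assuming 32-wide warps
# on both, so the extra 16 cores are only fed when ILP allows dual-issue.
gf100_cores_per_sm = 32
gf104_cores_per_sm = 48
headroom = gf104_cores_per_sm / gf100_cores_per_sm - 1
print(f"{headroom:.0%}")  # 50% extra per-SM throughput, but only with enough ILP
```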

hoomd performance for 2 very different benchmarks on gtx 460 1GB / gtx 480
benchmark 1: 490 / 824 ~= 59% of a gtx 480
benchmark 2: 920 / 1552 ~= 59% of a gtx 480

At stock clocks, a gtx 460 1GB has 64.9% of the memory bandwidth of a 480 and 67.4% of the FLOPS. As hoomd is memory-bandwidth bound, these numbers are right where they should be - the drop from ~64% to 59% is possibly due to the smaller L2 cache compared to the 480.

All in all, I’m pleased. The 460 is an inexpensive but fairly fast compute 2.x development card, runs cool and quiet, and will no doubt run games much smoother than the 8800 GT it upgraded :)

As for the max number of blocks and/or threads in flight? I’m guessing the same as everyone else here that those haven’t changed - will find out for sure when the 3.2 programming guide is out.

Any idea how much level 1 instruction cache there is?

This used to be 4K, but with 512 threads now being optimal (rather than 256), more diverging warps on the same MP might hit the limit and miss the cache?

Where do you find that 512 threads is optimal? I find via benchmarking that the optimum depends on the kernel & hardware:

_default_block_size_db['2.1'] = {
    'improper.harmonic': 64, 'pair.lj': 256, 'dihedral.harmonic': 128,
    'pair.dpd': 192, 'angle.cgcmm': 96, 'nlist.filter': 256,
    'pair.dpd_conservative': 160, 'pair.table': 128, 'pair.cgcmm': 160,
    'pair.slj': 128, 'pair.morse': 192, 'nlist': 544, 'bond.harmonic': 416,
    'pair.yukawa': 160, 'bond.fene': 160, 'angle.harmonic': 160,
    'pair.gauss': 192}
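A table like that is typically produced by brute-force timing each kernel over candidate block sizes. A minimal sketch of the idea (the `time_kernel` callable is hypothetical - in real code it would launch the actual kernel, e.g. bracketed by CUDA events, and return the elapsed time):

```python
def best_block_size(time_kernel, candidates=range(32, 1025, 32)):
    """Return the block size with the lowest measured runtime.

    time_kernel(block_size) is assumed to launch the kernel at that
    block size and return the elapsed time in seconds.
    """
    return min(candidates, key=time_kernel)

# Usage with a fake timing model standing in for real measurements:
fake_time = lambda bs: abs(bs - 160) + 1   # pretend 160 is the optimum
print(best_block_size(fake_time))          # 160
```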

Ah! Sorry …

Anyway, how much level 1 instruction cache is there per MP in the GF104?
