GF100 vs GF104 Performance question

hi all,
I was planning to buy an inexpensive graphics card with the Fermi architecture.
After reading about the GTX 470, GTX 465, and GTX 460, I figured out that the GTX 460 is built on the GF104 chip, which was redesigned as a cheaper solution for games. I couldn't find any spec sheet or whitepaper for GF104. I found one diagram of the GF104 architecture, which contains only 2 graphics processing clusters vs. 4 in GF100, and in GF104 each SM contains 48 SPs, while in GF100 each SM contains 32 SPs.

my questions are:

  1. Are there any specs or whitepapers from NVIDIA about GF104?
  2. Does the change in the number of SPs per SM change the number of threads that can be resident on an SM?
  3. Is the shared memory available to each SM the same, or was it increased because of the extra SPs?
  4. What is the difference between the GTX 470 and the GTX 460 (1GB) in terms of compute performance (FLOPS)?

Finally, do you think I should get both cards so I can do more testing?

I am using this for academic purposes, for my thesis in computer science.

The detailed CUDA specs will come with CUDA 3.2 (3.1 predates GF104).

For a theoretical peak perf comparison, just look at the NVIDIA specs page:
http://www.nvidia.com/object/product_geforce_gtx_470_us.html
http://www.nvidia.com/object/product-geforce-gtx-460-us.html

They list memory bandwidth, and you can get peak FLOPS by multiplying the shader clock rate by the number of cores and by the number of flops per clock (2 for FMA).
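To make that formula concrete, here is the arithmetic using the core counts and shader clocks from those spec pages (448 cores at 1215 MHz for the GTX 470, 336 cores at 1350 MHz for the GTX 460 1GB; treat the numbers as my reading of the pages, so double-check them):

```python
# Peak single-precision FLOPS = cores * shader clock * flops per clock.
# An FMA (fused multiply-add) counts as 2 flops per clock.
def peak_gflops(cores, shader_clock_ghz, flops_per_clock=2):
    return cores * shader_clock_ghz * flops_per_clock

print(peak_gflops(448, 1.215))  # GTX 470:      ~1088.6 GFLOPS
print(peak_gflops(336, 1.350))  # GTX 460 1GB:  ~907.2 GFLOPS
```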

As far as real application benchmarking goes, I'll post my apps' benchmarks on Thursday when my GF104 arrives :) (I already have a GTX 480 for work).

thanks a lot for the information, it was very useful to me,
and I am so excited to see your app benchmarks :)

right now all I have to work with is an 8800 GTX, a GTX 295, and a Tesla C1070. I am so excited to use GF100 and GF104

I’m eager to hear how you get on with it. I’m not particularly fond of mine (and I bought it with my own money too…).

I'll add some questions:

What is the maximum number of concurrent kernels? (I'm guessing 8 instead of 16.)

Do we optimise instruction order ourselves, or will an updated CUDA compiler do it for us?

On a 2GB GTX 460, what does OpenCL report for MAX_MEM_ALLOC_SIZE? 512MB?

I'll note that in diagrams of the GF104 SMs the register file is the same size as on GF100 (32768 x 32-bit) but shared between 48 cores instead of 32.
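To put that in perspective: if the maximum resident thread count per SM is unchanged at 1536 (an assumption until the 3.2 docs confirm it), the same register file gives:

```python
# Registers available per thread at full occupancy, assuming the
# compute 2.x limit of 1536 resident threads per SM still applies.
register_file = 32768       # 32-bit registers per SM (same on GF100 and GF104)
max_threads_per_sm = 1536   # assumed unchanged from GF100
regs_per_thread = register_file // max_threads_per_sm
print(regs_per_thread)      # 21 registers/thread before occupancy starts to drop
```

So the per-thread register budget at full occupancy is the same as on GF100; it is the per-core share that shrinks.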

I find it discouraging that specifications aren't released at the same time as the chipset.

I have the impression that the GPGPU team at NVIDIA is playing catch-up with the hardware/driver teams.

As I understand it, the GTX 460 should be thought of as half a GTX 480 (i.e. 8 multiprocessors instead of 16) with the same amount of registers and shared memory per multiprocessor. Again, the maximum number of threads per multiprocessor remains the same. In some circumstances actual compute performance (FLOPS) will be up to 50% higher per multiprocessor due to its ability to extract instruction-level parallelism. Texture fill rate is in theory pretty high (GF104 has twice as many texture units per multiprocessor as GF100), but this seems to be almost impossible to achieve.

I'm assuming that the optimiser/assembler is going to be responsible for optimising the instruction order, since it's pretty much impossible to do this manually with the NVIDIA tools. What I don't understand is that presumably the optimiser/assembler is the same for CUDA and for Direct3D and is part of the video driver rather than the CUDA toolkit. So does that mean this is as good as it's ever going to get?
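The 50% figure falls straight out of the core counts: each GF104 SM has 48 cores but still schedules 32-wide warps, so hitting peak requires dual-issuing a second independent instruction. A back-of-envelope check (assuming warp width is unchanged):

```python
# Per-SM throughput headroom of GF104 over GF100, assuming 32-wide warps
# on both, so the extra 16 cores are only fed when ILP allows dual-issue.
gf100_cores_per_sm = 32
gf104_cores_per_sm = 48
headroom = gf104_cores_per_sm / gf100_cores_per_sm - 1
print(f"{headroom:.0%}")  # 50% extra per-SM throughput, but only with enough ILP
```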

hoomd performance for 2 very different benchmarks on gtx 460 1GB / gtx 480
benchmark 1: 490 / 824 ~= 59% of a gtx 480
benchmark 2: 920 / 1552 ~= 59% of a gtx 480

At stock clocks, a gtx 460 1GB has 64.9% of the memory bandwidth of a 480 and 67.4% of the FLOPS. As hoomd is memory-bandwidth bound, these numbers are right where they should be - the drop from ~64% to 59% is possibly due to the smaller L2 cache compared to the 480.

All in all, I’m pleased. The 460 is an inexpensive but fairly fast compute 2.x development card, runs cool and quiet, and will no doubt run games much smoother than the 8800 GT it upgraded :)

As for the max number of blocks and/or threads in flight? I’m guessing the same as everyone else here that those haven’t changed - will find out for sure when the 3.2 programming guide is out.

Any idea how much level 1 instruction cache there is?

This used to be 4K, but with 512 threads now being optimal (rather than 256), more diverging warps on the same MP might hit the limit and miss the cache?

Where do you find that 512 threads is optimal? I find via benchmarking that the optimum depends on the kernel & hardware:

_default_block_size_db['2.1'] = {
    'improper.harmonic': 64, 'pair.lj': 256, 'dihedral.harmonic': 128,
    'pair.dpd': 192, 'angle.cgcmm': 96, 'nlist.filter': 256,
    'pair.dpd_conservative': 160, 'pair.table': 128, 'pair.cgcmm': 160,
    'pair.slj': 128, 'pair.morse': 192, 'nlist': 544, 'bond.harmonic': 416,
    'pair.yukawa': 160, 'bond.fene': 160, 'angle.harmonic': 160,
    'pair.gauss': 192}
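A table like that is typically produced by brute-force timing each kernel over candidate block sizes. A minimal sketch of the idea (the `time_kernel` callable is hypothetical - in real code it would launch the actual kernel, e.g. bracketed by CUDA events, and return the elapsed time):

```python
def best_block_size(time_kernel, candidates=range(32, 1025, 32)):
    """Return the block size with the lowest measured runtime.

    time_kernel(block_size) is assumed to launch the kernel at that
    block size and return the elapsed time in seconds.
    """
    return min(candidates, key=time_kernel)

# Usage with a fake timing model standing in for real measurements:
fake_time = lambda bs: abs(bs - 160) + 1   # pretend 160 is the optimum
print(best_block_size(fake_time))          # 160
```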

Ah! Sorry …

Anyway, how much level 1 instruction cache is there per MP in the GF104?
