hi all,
I am planning to buy an inexpensive graphics card with the Fermi architecture.
After comparing the GTX 470, GTX 465, and GTX 460, I figured out that the GTX 460 is built on the GF104 chip, which was redesigned to be a cheaper solution for games. I couldn't find any spec sheet or white paper for the GF104 architecture. I did find one diagram of the GF104, which shows only 2 graphics processing clusters vs. 4 in the GF100, and in GF104 each SM contains 48 SPs, while in GF100 each SM contains 32 SPs.
My questions are:
Are there any specs or white papers from NVIDIA about GF104?
Does the change in the number of SPs per SM change the number of threads that can be resident on an SM?
Is the shared memory available per SM the same, or has it increased because of the extra SPs?
What is the difference between the GTX 470 and the GTX 460 (1GB) in terms of compute performance (FLOPS)?
Finally, do you think I should get both cards so I can do more testing?
I am using this for academic purposes, for my thesis in computer science.
They list memory bandwidth, and you can get peak FLOPS by multiplying the shader clock rate by the number of cores and by the number of flops per clock (2 for FMA).
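For example, a back-of-the-envelope version of that calculation (the stock shader clocks and core counts below are my own assumptions - double-check them against the actual boards):

    /* Rough peak single-precision FLOPS from cores x shader clock x 2 (FMA).
       Assumed stock specs: GTX 460 1GB = 336 cores @ 1350 MHz,
       GTX 480 = 480 cores @ 1401 MHz. */
    #include <stdio.h>

    int main(void)
    {
        double flops_460 = 336 * 1.350e9 * 2.0;   /* 2 flops per clock for FMA */
        double flops_480 = 480 * 1.401e9 * 2.0;

        printf("GTX 460 1GB peak: %.0f GFLOPS\n", flops_460 / 1e9);   /* ~907  */
        printf("GTX 480 peak:     %.0f GFLOPS\n", flops_480 / 1e9);   /* ~1345 */
        return 0;
    }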
As far as real application benchmarking goes, I'll post my app's benchmarks on Thursday when my GF104 arrives :) (I already have a GTX 480 for work).
As I understand it, the GTX 460 should be thought of as roughly half a GTX 480 (the full GF104 has 8 multiprocessors vs. 16 on the full GF100; the shipping cards have 7 and 15 enabled, respectively), with the same amount of registers and shared memory per multiprocessor. Again, the maximum number of threads per multiprocessor remains the same. In some circumstances actual compute performance (FLOPS) will be up to 50% higher per multiprocessor due to its ability to extract instruction level parallelism. Texture fill rate is in theory pretty high (GF104 has twice as many texture units per multiprocessor as GF100), but this seems to be almost impossible to achieve.

I'm assuming that the optimiser/assembler is going to be responsible for optimising the instruction order, since it's pretty much impossible to do this manually with the nVidia tools. What I don't understand is that presumably the optimiser/assembler is the same for CUDA and for Direct3D, and is part of the video driver rather than the CUDA toolkit. So does that mean this is as good as it's ever going to get?
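To make the instruction level parallelism point concrete, here is a toy kernel sketch (the kernel name and constants are made up, not a real benchmark): each thread keeps several independent accumulators in flight, which gives the GF104 scheduler independent FMAs it can issue back to back from the same warp.

    /* Illustrative only: four independent dependency chains per thread,
       so the extra 16 cores per SM have something to chew on.
       Launch with e.g. ilp_fma<<<n/256, 256>>>(out, in, n). */
    __global__ void ilp_fma(float *out, const float *in, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= n) return;

        float x = in[i];
        float a0 = 1.0f, a1 = 2.0f, a2 = 3.0f, a3 = 4.0f;
        for (int k = 0; k < 64; ++k) {
            a0 = a0 * x + 0.5f;   /* these four FMAs do not depend on each other */
            a1 = a1 * x + 0.5f;
            a2 = a2 * x + 0.5f;
            a3 = a3 * x + 0.5f;
        }
        out[i] = a0 + a1 + a2 + a3;
    }

Whether the compiler's instruction scheduling actually exploits that is exactly the optimiser/assembler question above.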
hoomd performance for 2 very different benchmarks on gtx 460 1GB / gtx 480
benchmark 1: 490 / 824 ≈ 59% of a gtx 480
benchmark 2: 920 / 1552 ≈ 59% of a gtx 480
At stock clocks, a gtx 460 1GB has 64.9% of the memory bandwidth of a gtx 480 and 67.4% of the peak FLOPS. As hoomd is memory bandwidth bound, these numbers are right where they should be; the drop from ~65% to ~59% possibly results from the smaller L2 cache compared to the 480.
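For reference, here is that arithmetic spelled out (the stock-clock bandwidth and peak FLOPS figures are assumptions based on the reference specs; overclocked boards will differ):

    /* Assumed stock figures: GTX 460 1GB = 115.2 GB/s, ~907 GFLOPS;
       GTX 480 = 177.4 GB/s, ~1345 GFLOPS. */
    #include <stdio.h>

    int main(void)
    {
        printf("bandwidth ratio: %.1f%%\n", 100.0 * 115.2 / 177.4);    /* ~64.9 */
        printf("flops ratio:     %.1f%%\n", 100.0 * 907.0  / 1345.0);  /* ~67.4 */
        printf("benchmark 1:     %.1f%%\n", 100.0 * 490.0  / 824.0);   /* ~59.5 */
        printf("benchmark 2:     %.1f%%\n", 100.0 * 920.0  / 1552.0);  /* ~59.3 */
        return 0;
    }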
All in all, I'm pleased. The 460 is an inexpensive but fairly fast compute 2.x development card, runs cool and quiet, and will no doubt run games much more smoothly than the 8800 GT it replaced :)
As for the max number of blocks and/or threads in flight: I'm guessing, like everyone else here, that those haven't changed - we'll find out for sure when the 3.2 programming guide is out.
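If it helps while waiting for the updated programming guide, the runtime will report most of the per-SM limits directly once a card is in hand; a minimal deviceQuery-style sketch (device 0 assumed, no error checking):

    #include <stdio.h>
    #include <cuda_runtime.h>

    int main(void)
    {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, 0);                 /* query device 0 */

        printf("%s (compute %d.%d)\n", prop.name, prop.major, prop.minor);
        printf("multiprocessors:        %d\n", prop.multiProcessorCount);
        printf("shader clock:           %d kHz\n", prop.clockRate);
        printf("shared mem per block:   %d bytes\n", (int)prop.sharedMemPerBlock);
        printf("32-bit regs per block:  %d\n", prop.regsPerBlock);
        printf("max threads per block:  %d\n", prop.maxThreadsPerBlock);
        return 0;
    }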
Any idea how much level 1 instruction cache there is?
This used to be 4 KB, but with 512 threads now being optimal (rather than 256), more diverging warps on the same MP might hit the limit and miss the cache?