I’m going to buy a GTX 460 in about a month, so I’ve started reading up on it. I realize it uses the GF104, which differs from the GF100, so I have some questions. Thanks a lot if you can help me with any of them!
Each SM in GF104 has 2 warp schedulers and 4 dispatch units. Since 2 dispatch units fall under each warp scheduler, I assume the two dispatch units issue instructions from the same warp at the same time. I guess the two dispatch units could issue non-dependent instructions from the same half-warp to achieve ILP. Could the 2 dispatch units under the same scheduler instead issue the same instruction to each half-warp, so that all 32 threads in the warp execute concurrently rather than the usual 16?
It seems that the warp schedulers and the dispatch units run at half the clock of the SPs, so on average two instructions could be issued per SP cycle. But each instruction only works on 16 SPs, right? Wouldn’t that mean the schedulers could only keep 2/3 of the SPs on a GF104 busy?
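To make the arithmetic behind that 2/3 figure explicit, here is a minimal Python sketch. Every number in it is the poster's assumption (or, for the 48 SPs per SM, the figure implied by the 2/3), not something confirmed in this thread, and the function name `sp_utilization` is made up for illustration.

```python
# All constants below are assumptions taken from the question, not confirmed facts.
SPS_PER_SM = 48              # SPs in one GF104 SM (implied by the 2/3 figure)
DISPATCH_UNITS = 4           # dispatch units per SM
SCHED_CLOCK_RATIO = 0.5      # scheduler/dispatch clock relative to the SP clock
LANES_PER_INSTRUCTION = 16   # SPs kept busy by one issued instruction


def sp_utilization(dispatch_units, sched_clock_ratio, lanes_per_instruction, sps_per_sm):
    """Fraction of SPs kept busy under the simple issue model in the question."""
    # Instructions issued per SP clock cycle.
    issued_per_sp_cycle = dispatch_units * sched_clock_ratio
    # SPs those instructions can occupy, capped at what the SM actually has.
    busy_sps = min(issued_per_sp_cycle * lanes_per_instruction, sps_per_sm)
    return busy_sps / sps_per_sm


if __name__ == "__main__":
    u = sp_utilization(DISPATCH_UNITS, SCHED_CLOCK_RATIO,
                       LANES_PER_INSTRUCTION, SPS_PER_SM)
    print(f"Estimated SP utilization: {u:.3f}")
```

Under these assumptions, 2 instructions per SP cycle times 16 SPs each covers 32 of the 48 SPs. If any one assumption is off (for example, if an issued instruction occupies its 16 SPs for more than one SP cycle), the estimate changes, so treat this as a way to check the reasoning rather than a statement about the hardware.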
What are the other units that run at half the clock of the SPs?
I’m still not sure about the latency of various operations…
Uncached global memory access takes 400 cycles, right?
It takes 400 cycles to access the texture cache, and whether the access pattern is coalesced does not matter at all, correct?
Is L1 cache affected by bank conflicts too?
Shared memory takes 0 cycles to access as long as there are no bank conflicts, is this right?
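For the bank-conflict part of the question, here is a small Python sketch of the model usually described for Fermi-class shared memory: 32 banks, each one 32-bit word wide, with threads that read the same word served by a broadcast. Those parameters are assumptions here, and the helper names `conflict_degree` and `strided_addresses` are hypothetical. The sketch only counts how many serialized accesses one warp would need; it says nothing about latency in cycles.

```python
from collections import defaultdict

NUM_BANKS = 32    # assumption: Fermi-class shared memory has 32 banks
BANK_WIDTH = 4    # assumption: each bank is one 32-bit (4-byte) word wide


def conflict_degree(byte_addresses):
    """Largest number of distinct words that map to a single bank.

    1 means conflict-free; n means the access is split into n serialized passes.
    Threads hitting the same word count once (broadcast assumption).
    """
    words_per_bank = defaultdict(set)
    for addr in byte_addresses:
        word = addr // BANK_WIDTH
        words_per_bank[word % NUM_BANKS].add(word)
    return max(len(words) for words in words_per_bank.values())


def strided_addresses(stride_words, threads=32):
    """Byte addresses touched by a warp where thread t reads word t * stride_words."""
    return [t * stride_words * BANK_WIDTH for t in range(threads)]


if __name__ == "__main__":
    for stride in (1, 2, 3, 32, 33):
        print(stride, conflict_degree(strided_addresses(stride)))
```

Under this model, odd strides stay conflict-free while a stride of 32 words sends every thread to the same bank, which is why padding an array row from 32 to 33 words is a common workaround.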
I think I still have many more questions… but maybe I shouldn’t put too many in one topic. Thanks a lot if you could help me answer some of the questions above!
No answer because NVIDIA does not disclose much of this information in a clear way. You’re stuck reading hardware review sites (who presumably either get special briefings or are just throwing out plausible guesses) or trying to infer it from papers and talks by NVIDIA employees.
Well, if you’re in the Northern Hemisphere, winter is coming up, and the 470 puts out quite a bit more heat than the 460, …
Seriously, better compute performance, and the prices of the 470 and 480 are dropping a lot now that the GTX 580 has hit Newegg and other vendors. I saw a used GTX 480 with the high flow back plate sell on eBay for $305. The 470 can be found for around $250 if you shop around (vendors or eBay - I think many eBay resellers are trying to get their value out and aren’t selling at market prices.) The one advantage of the 460 is that you can (sometimes) get one with 2GB of memory, if your problems demand that. The price savings is small relative to the performance you give up, especially if you include the effective cost of the slot in the computer you install it into.
Yes, it’s the difference between GF100 and GF104. In my case I have a CUDA application that is normally limited by texture fill rate. On paper the GTX 460 should soundly beat the GTX 470, but for some unknown reason it manages only 50% of its theoretical performance (vs. 100% for GF100). NVIDIA doesn’t take GF104 seriously for CUDA applications, so it’s impossible to get any help from them.