New guy with loads of questions, mostly regarding the GF104

I’m going to buy a GTX 460 in a month, so I’ve just started reading up on it. I realize it uses the GF104, which differs from the GF100, so I have some questions. Thanks a lot if you could help me with some of them!

  1. Each SM in GF104 has 2 warp schedulers and 4 dispatch units. Since 2 dispatch units fall under each warp scheduler, I assume that the two dispatch units issue instructions from the same warp at the same time. I guess the two dispatch units could issue non-dependent instructions from the same half-warp to achieve ILP. Could the 2 dispatch units under the same scheduler instead issue the same instruction to each half-warp, so that all 32 threads in the warp execute concurrently rather than the normal 16?

  2. It seems that the warp schedulers and the dispatch units run at half the clock of the SPs, so on average two instructions could be issued per SP cycle. But each instruction only works on 16 SPs, right? Wouldn’t that mean the schedulers could only keep 2/3 of the SPs on a GF104 fully occupied?

  3. What are the other units that run at half the clock of the SPs?

  4. I’m still not so sure about the latency of various operations…
    Uncached global memory access takes 400 cycles, right?
    It takes 400 cycles to access the texture cache, and the coalesced access pattern does not matter at all, correct?
    Is the L1 cache affected by bank conflicts too?
    Shared memory takes 0 cycles to access as long as there are no bank conflicts, is this right?

I think I still have many more questions… but maybe I shouldn’t put too many in one topic. Thanks a lot if you could help me answer some of the questions above!
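The arithmetic behind question 2 can be sketched roughly as follows. This is only a back-of-the-envelope model, not a description of how the hardware actually schedules work. It assumes 48 SPs per SM (the figure implied by the "2/3" in the question), 32 threads per warp, SPs clocked at twice the scheduler rate (the premise of the question), and that every issued instruction targets the SPs. The function name `sp_utilization` and the constants are made up for illustration.

```python
from fractions import Fraction

WARP_SIZE = 32        # threads per warp
SPS_PER_SM = 48       # assumption: implied by the 2/3 figure in question 2
HOT_CLOCK_RATIO = 2   # assumption: SPs run at twice the scheduler clock


def sp_utilization(warp_instr_per_sched_cycle):
    """Fraction of SP lanes kept busy, given how many warp-wide
    instructions are issued per scheduler cycle (hypothetical model)."""
    # Each warp instruction needs one lane-operation per thread.
    lane_ops_needed = warp_instr_per_sched_cycle * WARP_SIZE
    # Each SP can do one lane-operation per SP cycle, and there are
    # HOT_CLOCK_RATIO SP cycles in every scheduler cycle.
    lane_ops_available = SPS_PER_SM * HOT_CLOCK_RATIO
    # Utilization cannot exceed 100%, so cap the ratio at 1.
    return min(Fraction(1), Fraction(lane_ops_needed, lane_ops_available))


# 2 = one instruction from each scheduler, 4 = every dispatch unit busy.
for n in (2, 3, 4):
    print(n, sp_utilization(n))
```

Under these assumptions, two instructions per scheduler cycle (one per scheduler) fill 2/3 of the SPs, and a third independent instruction is enough to fill them all, which is where the second dispatch unit per scheduler would come in.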

No one answering? Is it because you guys don’t know the answers either, or is it that you just can’t be bothered?

Somebody please just throw me any random answer!

No answer because NVIDIA does not disclose much of this information in a clear way. You’re stuck reading hardware review sites (who presumably either get special briefings or are just throwing out plausible guesses) or trying to infer it from papers and talks by NVIDIA employees.

Thanks a lot seibert! Now at least I know I’d have to figure out many of those things myself.

About GF104 superscalar:
http://www.anandtech.com/show/3809/nvidias-geforce-gtx-460-the-200-king/2

Slightly outdated, but still provides some insight about latencies etc.:
http://www.eecg.toronto.edu/~myrto/gpuarch-ispass2010.pdf

If you have any interest in CUDA, do yourself a massive favour and get a GTX 470 rather than a GTX 460.

I was thinking of doing something like what the authors of that PDF did. Thanks a lot nighthawk!

And shawkie, why do you say that? Is it just because of the difference in configuration between GF100 and GF104?

Well, if you’re in the Northern Hemisphere, winter is coming up, and the 470 puts out quite a bit more heat than the 460, …

Seriously: better compute performance, and the prices of the 470 and 480 are dropping a lot now that the GTX 580 has hit Newegg and other vendors. I saw a used GTX 480 with the high flow back plate sell on eBay for $305. The 470 can be found for around $250 if you shop around (vendors or eBay; I think many eBay resellers are trying to get their value out and aren’t selling at market prices). The one advantage of the 460 is that you can (sometimes) get one with 2GB of memory, if your problems demand that. The price savings are small relative to the performance you give up, especially if you include the effective cost of the slot in the computer you install it into.

Regards,

Martin

Yes, it’s the difference between GF100 and GF104. In my case I have a CUDA application that is normally limited by texture fill rate. On paper the GTX 460 should soundly beat the GTX 470, but for some unknown reason it manages only 50% of its theoretical performance (vs. 100% for GF100). NVIDIA doesn’t take GF104 seriously for CUDA applications, so it’s impossible to get any help from them.