How can you compute the processing power of a GPU? For example, I have two 9800 GTs. I’ve found sources that say a 9800GT has 336GFLOPS, and sources that say 504GFLOPS.
Here’s my take on it:
When I multiply the frequency by the number of processors I get 1.5GHz * 112SP = 168GFLOPS. So if I take the data from different sources, the GPU should be able to execute either 2 floating point operations per cycle per SP (giving 336GFLOPS) or 3 (giving 504GFLOPS).
So which is the best way to find the theoretical performance?
Exactly the way you are doing it. Compute 1.0 and 1.1 devices can do a MADD in one clock, so the factor of 2 is correct for an absolute theoretical peak performance of your 9800 GT. (there is a benchmark somewhere on these forums that gets very close to reaching this).
Compute 1.3 devices (G200) can do a fused MADD and MUL in one clock so the factor of 3 is correct for those devices.
There is some confusion among sources because G8x was supposed to be able to do a fused MADD and MUL in one clock, but something went wrong in the design or implementation.
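For anyone who wants the arithmetic spelled out, here is a minimal Python sketch of the peak calculation described above. The function and constant names are made up for illustration; the clock, SP count, and per-clock factors are the ones quoted in this thread.

```python
# Theoretical peak = shader clock * number of SPs * flops issued per clock per SP.
# Names below are illustrative only, not from any NVIDIA tool or API.

FLOPS_PER_CLOCK_MAD = 2          # one multiply-add counts as 2 flops
FLOPS_PER_CLOCK_MAD_PLUS_MUL = 3 # multiply-add plus an extra multiply


def peak_gflops(shader_clock_ghz, stream_processors, flops_per_clock):
    """Return the theoretical single-precision peak in GFLOPS."""
    return shader_clock_ghz * stream_processors * flops_per_clock


if __name__ == "__main__":
    # 9800 GT figures as quoted in the question: 1.5 GHz, 112 SPs
    print(peak_gflops(1.5, 112, FLOPS_PER_CLOCK_MAD))           # 336.0
    print(peak_gflops(1.5, 112, FLOPS_PER_CLOCK_MAD_PLUS_MUL))  # 504.0
```

The two results match the two figures from the question, so the disagreement between sources comes down to which per-clock factor they assumed.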
Thanks! That’s the best answer I got. I was confused on how much performance I can expect out of my kernels. I think that 250 to 300 GFLOPS should be the expected performance of a good kernel on my 9800 GT.
I believe it can in some (very rare) circumstances. G80 was honestly before my time at NVIDIA, but David Kirk and others definitely assured me that the number wasn’t a complete marketing fabrication during the Tesla 8-series launch. (This was after my old employer, Rys Sommefeldt at Beyond3D, figured out that the MUL wasn’t doing anything in graphics, so we came in very skeptical about this claim.)
Err… no. Very, very few kernels approach the theoretical GFLOPs, not even the “good” ones. The majority of kernels are memory bandwidth limited, and fall short of the figure by one or two orders of magnitude. The others simply don’t do multiply-add after multiply-add, endlessly.
Oh, yeah, I definitely agree with the memory limitation, but I don’t think it’s that bad. Latency seems to be more of a pain than bandwidth, at least for the calculations I’m doing. I wrote a kernel that does about 135 GFLOPS, and it was a pain to optimize memory access. It may be that the kernel is just extremely light, but I don’t see why it would not be possible to get within 70-80% of the peak theoretical value.
It calculates the electrostatic field lines given by any number of charges. It does about 18 floating point instructions per charge per thread, with a ratio of multiply to add very close to 1:1. I’m getting over 130GFLOPS when I have at least a few thousand threads.
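A rough way to turn an operation count like that into a GFLOPS figure is sketched below in Python. This is only an assumption about how such a number could be estimated, not how it was actually measured here; the function names and the thread count, charge count, and timing in the example are hypothetical. Only the 18 flops per charge per thread, the 135 GFLOPS, and the 336 GFLOPS peak come from this thread.

```python
# Estimate achieved GFLOPS from an operation count and a measured run time.
# All names and the example workload sizes are hypothetical.

def achieved_gflops(threads, charges, flops_per_charge, seconds):
    """Total floating point operations divided by elapsed time, in GFLOPS."""
    total_flops = threads * charges * flops_per_charge
    return total_flops / seconds / 1e9


def fraction_of_peak(achieved, peak):
    """Achieved throughput as a fraction of the theoretical peak."""
    return achieved / peak


if __name__ == "__main__":
    # Hypothetical run: 4096 threads, 1000 charges, 18 flops each, 0.55 ms
    print(achieved_gflops(4096, 1000, 18, 0.00055))
    # Figures from the thread: about 135 GFLOPS against a 336 GFLOPS peak
    print(round(fraction_of_peak(135, 336) * 100))  # 40
```

By this measure the 135 GFLOPS kernel sits at roughly 40% of the MADD-only peak.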
Thank you for the clarification. In this light, it appears that the theoretical FLOP/s are off for G80, G80 Ultra and G92 - It’s figure 1.1 in the Programming Guide that appears to be incorrect (at least for CUDA applications) as they obviously have used a multiplier of 3 instead of 2 for those architectures.
I was just comparing the original figure from the 1.0 programming guide. There the GF8800 GTX had the expected performance of 2 * 1350e6 * 128 = 345.6 GFLOP/s, compared to what now appears
Heh, you are right. The new 2.0 guide does get it wrong. If you look at the old 0.8/1.0/1.1 guides, they do mark it at just over 300.
The funny thing is, the marketing would look better (at least in my eyes) if the G80/G92 numbers were lower. That would make for a much steeper leap up to the G200 level and make it look like G200 is really revolutionary.
The CUDA improvements are pretty sweet on that hardware. Double the registers and more relaxed coalescing requirements mean you can get away with running slightly under-tuned code without losing much. That and 30 MPs w/ 140 GiB/s of mem bandwidth is insane performance.
Definitely. I love the loose coalescing requirements. In fact, I’ve even stopped overoptimizing code, figuring that it won’t make a difference once I get a GTX280 (or 380, considering I won’t upgrade GPU’s in the near future). If the GTX280 were reasonably priced (under $300), I would most certainly get at least one. Unfortunately, affordability comes into play.
About coalescing, I am still confused about whether, to coalesce on the G200, the threads have to access 8-, 16-, 32-, or 64-bit words exclusively, or whether threads accessing a 12-byte word each, for example, would still be coalesced. Judging from the example diagrams, I’m assuming it should always be coalesced if data falls within the same memory segment, so in my example the device should read a block of 128 bytes, and a block of 64 bytes. Is that correct?
A single cubin instruction can only fetch either 8, 16, 32, or 64 bits. All threads in a half-warp have to issue the same instruction, and coalescing can only happen on that one issued instruction. If a thread wants to fetch 12 bytes, it’ll actually have to issue two instructions (or three). The first instruction will fetch 8 bytes with a 12 byte stride, causing two transactions, and the second instruction will fetch 4 bytes with a 12 byte stride, causing those same two transactions to happen again.
Unfortunately, the new coalescing rules still ain’t as good as a cache.
Why don’t you buy a GTX260? I see I can order one for $200 on newegg, after rebate. (wow) (you know what, I just did)
Thanks for the clarification on coalescing. So, if I just read a float3 per thread, it will automatically generate 4 or 6 memory transactions, but if I read sequential 8 bytes, and then 4 bytes, it will only generate 2 (128 and 64 bytes) memory transactions. Is that correct?
I got a pair of 9800GT’s a couple of months ago, and I don’t think a GTX260 would provide that much extra performance to justify the cost. Plus, there’s also the personal preference that I never was a big fan of 600 cards. The only 600 I ever had was an FX5600, after which I leapt to 6800/6800SLI/9800SLI.
And the kernels that I run only take a few seconds to finish, so I don’t mind the wait.
Add to that the fact that I’m running a S939 Athlon64X2 3800+ on a nForce 4 SLI, I’d much rather change the platform first.
Oh, and I won’t be able to run 2 260’s in SLI with my current power supply (Which I most likely would want to do if I were to get a 260 instead of a 280).
Now that I’ve been thinking more about it, I’d rather wait for the 380 (with a new PC), with the hope that it will be less power hungry. The GeForce 7 was less hungry than the 6; the same applies for the GeForce 9 with respect to 8.
The 260 is not a “x600” card. Those are usually 1/4x the performance of the flagship (really a disappointment and marketing gimmick, and jeez, the 5600 was the worst). The 260 is like an “x800 GT” (ie, 7/8x or 3/4x the performance compared to the flagship). That said, it’s not really faster than a 9800GT for games, but for CUDA it’s got the new tweaks. You said you’re now optimizing for g200’s coalescing rules, but don’t have a g200. I dunno… you can’t really check that you’re doing the right thing.
I’m not sure what you’re saying with the ‘380’, but yeah, there’s gonna be a low-power die shrink of 280 real soon now…
Btw, yeah, if you read 8 bytes with 8 byte stride coalesced, then 4 bytes with 4 byte stride, you’ll have 2 transactions total.
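The transaction counts discussed in this exchange can be checked with a small Python model. It is a deliberately simplified sketch: it only counts how many aligned 128-byte segments a half-warp touches per load instruction, and it ignores alignment requirements and the hardware shrinking a transaction to 64 bytes. The function name and parameters are made up for illustration.

```python
# Simplified coalescing model: one transaction per distinct aligned
# 128-byte segment touched by a half-warp for a single load instruction.
# This is an assumption for illustration, not the exact hardware algorithm.

SEGMENT_BYTES = 128
HALF_WARP = 16


def segments_touched(base, stride, width,
                     threads=HALF_WARP, segment=SEGMENT_BYTES):
    """Count distinct segments hit when thread t reads `width` bytes
    starting at byte address base + t * stride."""
    touched = set()
    for t in range(threads):
        first = base + t * stride
        last = first + width - 1
        touched.update(range(first // segment, last // segment + 1))
    return len(touched)


# Each thread reads its own 12-byte element directly:
# an 8-byte load and a 4-byte load, both with a 12-byte stride.
strided = segments_touched(0, 12, 8) + segments_touched(8, 12, 4)

# The same 192 bytes read as contiguous 8-byte words (first 128 bytes),
# then contiguous 4-byte words (remaining 64 bytes).
contiguous = segments_touched(0, 8, 8) + segments_touched(128, 4, 4)

if __name__ == "__main__":
    print(strided, contiguous)  # 4 2
```

Under this model the strided float3-style read costs 4 transactions per half-warp and the contiguous 8-byte-then-4-byte read costs 2, which agrees with the numbers given in the posts above.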
I do game, and not seldom, and I don’t want to run the risk of getting lower gaming performance after spending a few hundred dollars on a GTX260. Being a student, money is not an abundant resource for me. I’ve been gathering some savings, but those are for a completely new PC, not just graphics.
The way I see it, optimizing for the G200 is a partial optimization for previous architectures; you just don’t need to go as extreme to get the best performance. So, even though I’m developing on a 9800GT, in some cases I should be able to see some benefits. I agree that I really can’t tell whether or not that’s the best case until I get my hands on a G200 or later, but seeing small improvements on my current platform is satisfactory enough for me.
I’d like a GPU at least as fast as the GTX280 but not that power hungry. I suspect it would be the ‘GT/GTX 380’. And since my CPU is so outdated, I would also update it when getting a ‘380’.
If a low power die shrink is on the way, I would at least wait for that.
BTW, what’s up with the 216 streaming processor GTX260s? Are those just GTX280s with three disabled multiprocessors?
All 260s are 280s with disabled SMs and DRAM channels. (They use the same actual die, the G200.)
The thing about optimizing for G200 coalescing rules, is that you won’t see any performance improvements on a 9800, and you can’t be sure that you’ll see them once you get a G200. I dunno, I think you should just try to optimize for 9800 rules whenever it’s not too difficult, because the old rules are still the best rules on G200.
You are right. If it doesn’t glitch up the code too much, I should optimize for the 9800.
The only non-coalesced read and write is that of the 12-byte vector I mentioned earlier, where I’m reading/writing straight to/from registers. I guess I could sacrifice a bit of shared memory and coalesce reads/writes. I’ll see how it turns out.