In an NVIDIA slide deck named “CUDA and Fermi Update” from August 2010 (sorry, I can’t find the link now, but it was posted here on the forums), they stated that the optimal ratio between instructions and gmem accesses for the C2050 would be 3.5. Below that you are bandwidth bound; above it you are instruction bound.
I’m aware this is only a rough estimate, but it would come in handy for backing up the results in my final thesis, e.g. when a kernel deviates seriously from this ideal number and is therefore clearly bandwidth bound. Now how would you calculate this (ideal) ratio for any other card (say my GTX 480)? It looks to be peak GFLOPS divided by (gmem bandwidth in GB/s * 2). But I’m not sure if this is correct, especially the “times 2”, because when calculating the bandwidth you already use a factor of 2 due to the DRAM (allowing bi-directional transfers). And for the instructions, is it correct to count the computational operations in a kernel (approximately, because of e.g. modulo or FMAD) in order to get a number to compare against the ideal ratio? Maybe someone knows of another paper on this topic. This was the first time I saw an actual number covering this. Knowing more about these estimates would make it easier to show someone else hard facts about why a kernel behaves as it does.
I’d assume that the 2 in the denominator comes from the fact that peak instruction throughput is half of the peak GFLOP/s value (1 MAD/FMAD instruction = 2 FLOP). Bidirectional transfers are possible over PCIe, but not in device memory.
I think you might be right about the MAD. And of course you are right about the single-way transfers in gmem :) The first point looks right but is still an assumption. Maybe someone can clear this up?
The ratio is not between instructions and memory accesses (as the first post assumes); it’s between instructions and bytes accessed. It’s simply the ratio between instruction throughput and memory throughput. Instruction throughput is the peak instruction issue rate (counting instructions, not flops); likewise, memory bandwidth is the peak bus bandwidth. There is no factor of 2 for the memory, and you definitely wouldn’t add one to account for bi-directional transfers, since the bus isn’t duplex. I’m not sure where it came from.
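For what it’s worth, here is a rough sketch of that calculation in Python. The card specs (core count, shader clock, memory bandwidth) are my assumption from the published spec sheets, not from the slide, so please double-check them for your board; the function names are made up for illustration. It treats peak instruction throughput as cores × shader clock (one instruction per core per clock, i.e. half the peak GFLOP/s figure) and divides by peak memory bandwidth in GB/s. With these numbers the C2050 comes out at roughly 3.6, which is in line with the 3.5 quoted from the slide, and the GTX 480 at roughly 3.8.

```python
def balanced_ratio(cores, shader_clock_ghz, mem_bandwidth_gb_s):
    """Instructions per byte at which a kernel is neither clearly
    instruction bound nor clearly bandwidth bound.

    Assumes each core issues one instruction per shader clock, so the
    peak instruction rate is half the peak GFLOP/s (1 FMAD = 2 FLOP).
    """
    peak_ginstr_per_s = cores * shader_clock_ghz
    return peak_ginstr_per_s / mem_bandwidth_gb_s


def classify(kernel_instructions, kernel_bytes, ideal_ratio):
    """Compare a kernel's own instruction:byte ratio with the ideal one."""
    kernel_ratio = kernel_instructions / kernel_bytes
    if kernel_ratio < ideal_ratio:
        return "bandwidth bound"
    return "instruction bound"


# Assumed spec-sheet values (verify for your own card):
# (cores, shader clock in GHz, peak memory bandwidth in GB/s)
CARDS = {
    "Tesla C2050": (448, 1.15, 144.0),
    "GTX 480": (480, 1.401, 177.4),
}

if __name__ == "__main__":
    for name, (cores, clock, bw) in CARDS.items():
        print(f"{name}: {balanced_ratio(cores, clock, bw):.2f} instructions per byte")
```

To compare a kernel against this, you would count the instructions executed per thread and the bytes of global memory each thread reads and writes, then pass both to `classify`. A kernel doing 8 instructions per 16 bytes (ratio 0.5) sits far below either card’s number, so it would be reported as bandwidth bound. This is still only a rough estimate, since it ignores latency, caching and uncoalesced accesses.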