Hi, sorry for flooding the forum with questions lately.
I think I understand the bandwidth-limited case: if the application is bandwidth-limited, performance suffers because the SMs stall while waiting for operands to arrive from DRAM.
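To make the bandwidth-limited case concrete, here is a rough Python sketch of how I picture it, in the style of a roofline estimate. The 300 GFLOPS peak is the figure quoted in this post; the 80 GB/s DRAM bandwidth, the function names, and the SAXPY-style byte counts are assumptions made up for illustration, not measurements of any particular card.

```python
# Rough roofline-style estimate. Only PEAK_GFLOPS comes from this post;
# PEAK_GBPS is an ASSUMED placeholder, not a measured value.
PEAK_GFLOPS = 300.0  # peak compute throughput in GFLOP/s
PEAK_GBPS = 80.0     # assumed DRAM bandwidth in GB/s (hypothetical)


def attainable_gflops(flops_per_byte, peak_gflops=PEAK_GFLOPS, peak_gbps=PEAK_GBPS):
    """Upper bound on throughput for a kernel with this arithmetic intensity."""
    # The kernel cannot run faster than the compute peak, nor faster than
    # the rate at which DRAM can feed it operands.
    return min(peak_gflops, flops_per_byte * peak_gbps)


def classify(flops_per_byte, peak_gflops=PEAK_GFLOPS, peak_gbps=PEAK_GBPS):
    """Label a kernel by comparing its intensity to the machine balance."""
    balance = peak_gflops / peak_gbps  # flops per byte needed to reach peak
    if flops_per_byte >= balance:
        return "compute-limited"
    return "bandwidth-limited"


if __name__ == "__main__":
    # Hypothetical SAXPY-like kernel: 2 flops per element, 12 bytes moved
    # (two 4-byte loads and one 4-byte store).
    saxpy_intensity = 2 / 12
    print(classify(saxpy_intensity), attainable_gflops(saxpy_intensity))
```

Under these assumed numbers, a kernel needs at least 300 / 80 = 3.75 flops per byte of DRAM traffic before the compute peak, rather than memory bandwidth, becomes the ceiling.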
However, I am less clear about the computation-limited case.
It seems that no real application can truly use the GPU's full 300+ GFLOPS. Is there any problem or issue with being computation-limited?
If there is, how can you determine whether a benchmark is computation-limited?
Thanks.