I'm on Fermi. Are there any DRAM-intensive benchmarks, or any suggestions on how to get mostly cache misses?
It gets complicated because all the threads are accessing the cache structures simultaneously.
Also, the profiler seems to suggest that an L1 transaction is always 128 bytes, but if L1 misses, the request to L2 becomes four 32-byte transactions.
Is this true?
Thanks
How about memcpy?
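Something along these lines is what I have in mind: a rough sketch that times device-to-device cudaMemcpy with CUDA events. The 128 MiB buffer size, the repetition count and the CHECK macro are my own choices, not anything prescribed; the only requirement is that the buffers are much larger than the L2 cache so that nearly all traffic goes to DRAM.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical helper: abort with a message if a runtime call fails.
#define CHECK(call)                                                   \
    do {                                                              \
        cudaError_t err_ = (call);                                    \
        if (err_ != cudaSuccess) {                                    \
            fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__,        \
                    cudaGetErrorString(err_));                        \
            exit(EXIT_FAILURE);                                       \
        }                                                             \
    } while (0)

int main() {
    const size_t bytes = (size_t)128 << 20;  // 128 MiB per buffer (assumed size)
    const int reps = 10;                     // assumed repetition count

    char *src = NULL, *dst = NULL;
    CHECK(cudaMalloc(&src, bytes));
    CHECK(cudaMalloc(&dst, bytes));
    CHECK(cudaMemset(src, 1, bytes));

    cudaEvent_t start, stop;
    CHECK(cudaEventCreate(&start));
    CHECK(cudaEventCreate(&stop));

    // Warm-up copy so one-time setup costs are not included in the timing.
    CHECK(cudaMemcpy(dst, src, bytes, cudaMemcpyDeviceToDevice));

    CHECK(cudaEventRecord(start));
    for (int i = 0; i < reps; ++i) {
        CHECK(cudaMemcpy(dst, src, bytes, cudaMemcpyDeviceToDevice));
    }
    CHECK(cudaEventRecord(stop));
    CHECK(cudaEventSynchronize(stop));

    float ms = 0.0f;
    CHECK(cudaEventElapsedTime(&ms, start, stop));

    // Each copy reads `bytes` and writes `bytes`, so count the traffic twice.
    double gbps = 2.0 * (double)bytes * reps / (ms * 1e-3) / 1e9;
    printf("device-to-device copy: %.3f ms total, %.2f GB/s\n", ms, gbps);

    CHECK(cudaEventDestroy(start));
    CHECK(cudaEventDestroy(stop));
    CHECK(cudaFree(src));
    CHECK(cudaFree(dst));
    return 0;
}
```

This only gives you a fully coalesced streaming number, so treat it as an upper bound rather than a picture of how your own access pattern will behave.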
You’re right about the L1 and L2 line sizes.
For the memory benchmark, it depends on whether you want to measure peak memory bandwidth (say 100% reads, 100% writes, or mixed read+write), or whether you want to benchmark random 32-bit, 64-bit, and 128-bit memory reads/writes and latency, with or without coalescing.
I strongly suggest that you create a prototype of your application with the same memory I/O pattern, then check the performance it achieves.
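As a starting point for such a prototype, here is a minimal sketch of a 32-bit read benchmark with and without coalescing. The kernel name stridedRead, the 256 MiB array, the 64x256 launch configuration and the CHECK macro are all my own assumptions; swap in your application's real access pattern. With a stride of 32 words, every thread in a warp touches a different 128-byte line and no line is ever reused, so almost every load should miss in cache.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical helper: abort with a message if a runtime call fails.
#define CHECK(call)                                                   \
    do {                                                              \
        cudaError_t err_ = (call);                                    \
        if (err_ != cudaSuccess) {                                    \
            fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__,        \
                    cudaGetErrorString(err_));                        \
            exit(EXIT_FAILURE);                                       \
        }                                                             \
    } while (0)

// Each thread reads words tid*stride, (tid+T)*stride, ... where T is the
// total thread count. stride == 1 is fully coalesced; stride == 32 puts
// every thread of a warp on its own 128-byte line.
__global__ void stridedRead(const int *in, int *out, size_t n, size_t stride) {
    size_t tid = (size_t)blockIdx.x * blockDim.x + threadIdx.x;
    size_t total = (size_t)gridDim.x * blockDim.x;
    int acc = 0;
    for (size_t i = tid * stride; i < n; i += total * stride) {
        acc += in[i];
    }
    out[tid] = acc;  // store the sum so the loads are not optimized away
}

int main() {
    const size_t n = (size_t)1 << 26;       // 64 Mi ints = 256 MiB (assumed size)
    const int blocks = 64, threads = 256;   // assumed launch configuration

    int *in = NULL, *out = NULL;
    CHECK(cudaMalloc(&in, n * sizeof(int)));
    CHECK(cudaMalloc(&out, (size_t)blocks * threads * sizeof(int)));
    CHECK(cudaMemset(in, 0, n * sizeof(int)));

    cudaEvent_t start, stop;
    CHECK(cudaEventCreate(&start));
    CHECK(cudaEventCreate(&stop));

    for (size_t stride = 1; stride <= 32; stride *= 2) {
        stridedRead<<<blocks, threads>>>(in, out, n, stride);  // warm-up run
        CHECK(cudaGetLastError());

        CHECK(cudaEventRecord(start));
        stridedRead<<<blocks, threads>>>(in, out, n, stride);
        CHECK(cudaEventRecord(stop));
        CHECK(cudaEventSynchronize(stop));
        CHECK(cudaGetLastError());

        float ms = 0.0f;
        CHECK(cudaEventElapsedTime(&ms, start, stop));

        // Bytes the threads actually asked for, not bytes moved by the hardware.
        double requested = (double)(n / stride) * sizeof(int);
        printf("stride %2zu words: %8.3f ms, %7.2f GB/s of requested data\n",
               stride, ms, requested / (ms * 1e-3) / 1e9);
    }

    CHECK(cudaEventDestroy(start));
    CHECK(cudaEventDestroy(stop));
    CHECK(cudaFree(in));
    CHECK(cudaFree(out));
    return 0;
}
```

The GB/s figure counts only the requested bytes, so the drop from stride 1 to stride 32 gives a rough idea of how much bandwidth is spent on data you never use. To compare against the 32-byte L2 transactions, you could rebuild with nvcc's -Xptxas -dlcm=cg option, which makes global loads bypass L1 on Fermi.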