Question about global mem throughput


I saw from cudaprof that my global read throughput is about 40 GB/s and my write throughput is 400 GB/s!

The card I use is a Tesla C1060.

Does it mean that the reads are not coalesced but the writes are?

Thanks in advance.

I think that means the measurement is wrong. No CUDA device can write 400 GB/sec.
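
As a rough sanity check, here is the theoretical peak for the C1060. It assumes the commonly quoted board specs (512-bit GDDR3 interface, 800 MHz memory clock, double data rate), which are not stated in this thread:

```latex
\[
B_{\text{peak}}
  = \frac{512\ \text{bit}}{8\ \text{bit/byte}} \times 800 \times 10^{6}\ \text{Hz} \times 2
  = 102.4 \times 10^{9}\ \text{B/s}
  \approx 102\ \text{GB/s}
\]
```

Under those assumptions a reported 400 GB/s is about four times what the memory bus can physically deliver, so the profiler figure cannot be a real DRAM write rate.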

This is certainly true of the C1060, and in fact of all G80/G90/G200 GPUs. But be careful: on Fermi, you can get even better than 1 TB/sec with global writes, if they stay in cache. It would be interesting to measure this, even crudely, by restricting access to perhaps 1 KB regions (to test L1 cache speed) and 200 KB regions (to test L2 cache speed).
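
A crude version of that measurement could look like the sketch below. It is only an illustration: the kernel name `writeRegion`, the helper `gbPerSec`, the launch configuration, the iteration count and the extra 64 MB case are all made up here. The 1 KB and 200 KB sizes are the ones suggested above. It reports the rate of stores issued by the kernel, not DRAM traffic, and it does not establish which cache level actually services the writes.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Abort with a message if a CUDA runtime call fails.
#define CHECK(call)                                                   \
    do {                                                              \
        cudaError_t err_ = (call);                                    \
        if (err_ != cudaSuccess) {                                    \
            fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__,        \
                    cudaGetErrorString(err_));                        \
            exit(EXIT_FAILURE);                                       \
        }                                                             \
    } while (0)

// Hypothetical kernel: every thread issues `iters` 4-byte stores, all of
// which land inside the first `regionWords` words of `buf`. The pointer is
// volatile so the compiler cannot drop the repeated stores.
__global__ void writeRegion(volatile unsigned int *buf, size_t regionWords, int iters)
{
    size_t tid = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    for (int i = 0; i < iters; ++i) {
        size_t idx = (tid + (size_t)i * blockDim.x) % regionWords;
        buf[idx] = (unsigned int)i;
    }
}

// Convert a byte count and a time in milliseconds to GB/s (1 GB = 1e9 bytes).
static double gbPerSec(double bytes, float ms)
{
    return bytes / (ms * 1e-3) / 1e9;
}

int main()
{
    // Region sizes: 1 KB and 200 KB as suggested in the post, plus an
    // assumed 64 MB case that should not fit in any cache.
    const size_t regionBytes[] = { 1u << 10, 200u << 10, 64u << 20 };
    const int blocks = 256, threads = 256, iters = 4096;   // arbitrary choices

    unsigned int *d_buf = NULL;
    CHECK(cudaMalloc((void **)&d_buf, 64u << 20));

    cudaEvent_t start, stop;
    CHECK(cudaEventCreate(&start));
    CHECK(cudaEventCreate(&stop));

    for (int r = 0; r < 3; ++r) {
        size_t words = regionBytes[r] / sizeof(unsigned int);

        // Warm-up launch so one-time startup costs are not timed.
        writeRegion<<<blocks, threads>>>(d_buf, words, 1);
        CHECK(cudaGetLastError());

        CHECK(cudaEventRecord(start));
        writeRegion<<<blocks, threads>>>(d_buf, words, iters);
        CHECK(cudaGetLastError());
        CHECK(cudaEventRecord(stop));
        CHECK(cudaEventSynchronize(stop));

        float ms = 0.0f;
        CHECK(cudaEventElapsedTime(&ms, start, stop));

        double bytes = (double)blocks * threads * iters * sizeof(unsigned int);
        printf("region %9zu bytes: %8.3f ms, %8.1f GB/s of issued stores\n",
               regionBytes[r], ms, gbPerSec(bytes, ms));
    }

    CHECK(cudaEventDestroy(start));
    CHECK(cudaEventDestroy(stop));
    CHECK(cudaFree(d_buf));
    return 0;
}
```

Build with something like `nvcc -O2 writebw.cu -o writebw` (the file name is arbitrary). The modulo in the inner loop and the small grid make this a rough probe rather than a tuned benchmark, so treat the numbers as relative between region sizes, not as absolute cache bandwidths.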

This bug, amongst other regressions, appeared in the 3.0 release profiler. I reported them when it was released.