Revisiting this topic…
If you’re writing to a CUDA surface, it’s worth looking at the “l2_tex_write_throughput” metric alongside “dram_write_throughput”.
I experimented with writing to an OpenGL RBO GL_RGBA8 surface with a kernel that maps the grid’s thread idx to either a surface row (x-order) or a column (y-order) write. The opposing coordinate is, respectively, modulo the surface width or height — i.e. surf2DWrite(rgbx, surf, x*sizeof(rgbx), y, …).
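For concreteness, the two access patterns look roughly like this — a sketch only; the surface object name, pixel payload, and launch geometry are my assumptions, not the actual test code:

```cuda
// x-order: consecutive threads write consecutive x positions along a row.
__global__ void write_x_order(cudaSurfaceObject_t surf, int width)
{
    const int idx = blockIdx.x * blockDim.x + threadIdx.x;
    const int x   = idx % width;   // fastest-varying coordinate
    const int y   = idx / width;
    const uchar4 rgbx = make_uchar4(0xFF, 0x00, 0x00, 0xFF); // placeholder pixel
    surf2Dwrite(rgbx, surf, x * sizeof(rgbx), y);  // x is a byte offset
}

// y-order: consecutive threads write consecutive y positions down a column.
__global__ void write_y_order(cudaSurfaceObject_t surf, int height)
{
    const int idx = blockIdx.x * blockDim.x + threadIdx.x;
    const int y   = idx % height;  // fastest-varying coordinate
    const int x   = idx / height;
    const uchar4 rgbx = make_uchar4(0xFF, 0x00, 0x00, 0xFF);
    surf2Dwrite(rgbx, surf, x * sizeof(rgbx), y);
}
```

Note the runtime API spells it `surf2Dwrite`, and the x coordinate is in bytes, not texels.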
The FBO containing the RBO surface is then blitted to the default framebuffer and the loop repeats.
The y-order kernel exhibits an impressive l2_tex_write_throughput, but its overall execution time is slower than the x-order kernel’s, and its l2_tex_write_throughput : dram_write_throughput ratio is 4:1.
The x-order kernel’s throughput ratio is 2:1 and its execution time is faster.
Additionally, the overall dram_write_throughput (GB/s) of the x-order kernel is higher than that of the y-order kernel.
It’s unclear whether nvprof is also measuring the FBO-to-default-framebuffer blit.
So does anyone here know whether writing to OpenGL RBOs on Maxwell via a CUDA surface is best performed in a specific tiled/swizzled order?
I suppose I could try some more experiments but I’m looking to shortcut that effort. :)