Writing Performance of Surface in CUDA

Surfaces provide the ability to read from and write to global memory.
I am wondering about the write performance of surfaces.

Since surfaces use the read-only texture cache, is there a mechanism by which a block of data in the cache can be written back to global memory together? For example, I have many 100-element arrays, each stored contiguously in global memory, and each thread needs to write one array through a surface (a different thread corresponding to each array). Can this reduce the time compared to writing directly to global memory without a surface?
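Roughly, the direct-global-memory baseline I want to compare against looks like this (just a sketch; the names, values and launch configuration are made up for illustration):

#define DIM 100

// Each thread owns one contiguous 100-element array in global memory
// and writes it element by element.
__global__ void write_direct(float *out, int num_arrays)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= num_arrays) return;

    float *my_array = out + i * DIM;   // array i starts at out + i*100
    for (int d = 0; d < DIM; d++)
        my_array[d] = (float)d;        // placeholder value
}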

Thanks a lot.

I wouldn’t expect writing to a surface to be faster than an ordinary global write. Writes should be pretty quick in any event (they shouldn’t have a long latency like a read), assuming you don’t overload the LD/ST units. The best way to avoid overloading the LD/ST units is to do efficient (e.g. coalesced) writes.
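To illustrate with your example (just a sketch; the names and layout are illustrative, not your actual code): if each thread writes its own contiguous 100-element array of floats, adjacent threads in a warp store to addresses 400 bytes apart, which won't coalesce. Laying the data out so that adjacent threads write adjacent elements keeps each warp's stores coalesced:

#define DIM 100

// "Transposed" layout: element d of array i lives at out[d * num_arrays + i],
// so on each loop iteration a warp writes one contiguous chunk of memory.
__global__ void write_coalesced(float *out, int num_arrays)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= num_arrays) return;

    for (int d = 0; d < DIM; d++)
        out[d * num_arrays + i] = (float)d;
}

Whether a transposed layout is acceptable of course depends on how the arrays are consumed afterwards.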

Thanks, txbob!

Adding to @txbob’s comments:

My benchmarking shows that global stores are faster than surf2Dwrite() operations… but for my application it was a tiny difference in performance.

This is no surprise since the surface instructions also support bounds checking on the store coordinates. The checking appears to be implemented in hardware or microcode since I didn’t spot any bounds checking in the SASS.
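For reference, the comparison was roughly of this shape (a simplified sketch, not my actual kernels):

// Plain global store into a pitched linear RGBA8 buffer.
__global__ void store_global(uchar4 *buf, size_t pitch, int width, int height)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x < width && y < height) {
        uchar4 *row = (uchar4 *)((char *)buf + y * pitch);
        row[x] = make_uchar4(255, 0, 0, 255);
    }
}

// The same store through a surface object; surf2Dwrite() takes a byte
// offset in x and carries the coordinate bounds checking mentioned above.
__global__ void store_surface(cudaSurfaceObject_t surf, int width, int height)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x < width && y < height)
        surf2Dwrite(make_uchar4(255, 0, 0, 255), surf, x * sizeof(uchar4), y);
}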

If you want to measure your app I believe the proper nvprof metric to look at is “l2_tex_write_throughput”:

> nvprof -m l2_tex_write_throughput interop.exe
==2836== NVPROF is profiling process 2836, command: interop
GL   : GeForce GTX 750 Ti       ( 5)
CUDA : GeForce GTX 750 Ti       ( 5)
==2836== Profiling application: interop
==2836== Profiling result:
==2836== Metric result:
Invocations   Metric Name               Metric Description              Min         Max         Avg
Device "GeForce GTX 750 Ti (2)"
        Kernel: pxl_kernel
        444   l2_tex_write_throughput   L2 Throughput (Texture Writes)  37.181GB/s  38.696GB/s  38.540GB/s

A GTX 980 has the following throughput:

Invocations   Metric Name               Metric Description              Min         Max         Avg
Device "GeForce GTX 980 (0)"
        Kernel: pxl_kernel
        267   l2_tex_write_throughput   L2 Throughput (Texture Writes)  146.19GB/s  152.40GB/s  151.85GB/s

Revisiting this topic…

If you’re writing to a CUDA surface, it’s probably worth looking at “dram_write_throughput” alongside “l2_tex_write_throughput”.
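Something like (substitute your own executable for interop.exe):

> nvprof -m l2_tex_write_throughput,dram_write_throughput interop.exe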

I experimented with writing to an OpenGL RBO GL_RGBA8 surface with a kernel that maps a grid’s thread idx to either a surface row (x-order) or a column (y-order) write. The opposing coordinate is, respectively, modulo the surface width or height, i.e. surf2Dwrite(rgbx, surf, x*sizeof(rgbx), y, …).
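Concretely, the two mappings look something like this (simplified; the real pxl_kernel writes actual pixel values, the index arithmetic is abridged, and the kernel names here are illustrative):

// x-order: consecutive thread indices walk along a row.
__global__ void pxl_kernel_x(cudaSurfaceObject_t surf, int width, int height)
{
    unsigned int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx >= (unsigned int)(width * height)) return;
    int x = idx % width;
    int y = idx / width;
    surf2Dwrite(make_uchar4(255, 0, 0, 255), surf, x * sizeof(uchar4), y);
}

// y-order: consecutive thread indices walk down a column.
__global__ void pxl_kernel_y(cudaSurfaceObject_t surf, int width, int height)
{
    unsigned int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx >= (unsigned int)(width * height)) return;
    int y = idx % height;
    int x = idx / height;
    surf2Dwrite(make_uchar4(255, 0, 0, 255), surf, x * sizeof(uchar4), y);
}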

The FBO containing the RBO surface is then blitted to the default framebuffer and the loop repeats.
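For completeness, the interop plumbing around that is roughly the usual register/map/launch/unmap/blit sequence (abridged sketch; error checking omitted, <cuda_gl_interop.h> and GL headers assumed included, and rbo, fbo, blocks, threads, width and height are placeholders from my setup):

// Once: register the GL renderbuffer for surface load/store access.
cudaGraphicsResource_t res;
cudaGraphicsGLRegisterImage(&res, rbo, GL_RENDERBUFFER,
                            cudaGraphicsRegisterFlagsSurfaceLoadStore);

// Per frame: map the resource and wrap its array in a surface object.
cudaGraphicsMapResources(1, &res, 0);

cudaArray_t array;
cudaGraphicsSubResourceGetMappedArray(&array, res, 0, 0);

cudaResourceDesc desc = {};
desc.resType = cudaResourceTypeArray;
desc.res.array.array = array;

cudaSurfaceObject_t surf;
cudaCreateSurfaceObject(&surf, &desc);

pxl_kernel_x<<<blocks, threads>>>(surf, width, height);   // or pxl_kernel_y

cudaDestroySurfaceObject(surf);
cudaGraphicsUnmapResources(1, &res, 0);

// Blit the FBO (with the RBO attached) to the default framebuffer.
glBindFramebuffer(GL_READ_FRAMEBUFFER, fbo);
glBindFramebuffer(GL_DRAW_FRAMEBUFFER, 0);
glBlitFramebuffer(0, 0, width, height, 0, 0, width, height,
                  GL_COLOR_BUFFER_BIT, GL_NEAREST);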

The y-order kernel exhibits an impressive l2_tex_write_throughput, but its overall kernel execution time is slower than the x-order kernel’s, and its l2_tex_write_throughput : dram_write_throughput ratio is 4:1.

The x-order kernel’s throughput ratio is 2:1 and its kernel execution time is faster.

Additionally, the overall dram_write_throughput (GB/s) of the x-order kernel is higher than that of the y-order kernel.

It’s unclear if nvprof is measuring the FBO-to-default blit.

So who here knows if writing to OpenGL RBOs on Maxwell via a CUDA surface is best performed using a specific tiled/swizzled order?

I suppose I could try some more experiments but I’m looking to shortcut that effort. :)