Hey guys, I'd like to put my current findings up for discussion:
Below are some screenshots from my testbed for imaging application performance in CUDA. The testbed implements two high dynamic range tone mapping operators, the second one in two variants:
1. A very cheap operator, consisting of just some log and pow for every pixel
2. A more expensive adaptive operator that computes the result according to a global model for every pixel
2a. The same operator as 2, but this time the adaptation model is rebuilt for every pixel, which means it considers a 9-point stencil around the pixel
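For readers who want something concrete for operator 1, here is a minimal CPU-side sketch of what a cheap log/pow operator might look like. The formula (log compression followed by a gamma curve), the function name `toneMapLogPow` and the parameters `yMax` and `gamma` are all assumptions for illustration, not the operator actually used in the testbed.

```cpp
#include <algorithm>
#include <cmath>

// Hypothetical "cheap" operator: log compression followed by a gamma curve.
// y     : scene luminance (e.g. the Y channel of the XYZ input), clamped to >= 0
// yMax  : brightest luminance in the image, must be > 0
// gamma : display gamma (assumed value, not taken from the post)
// Returns display luminance in [0, 1].
float toneMapLogPow(float y, float yMax, float gamma = 2.2f) {
    float compressed = std::log1p(std::max(y, 0.0f)) / std::log1p(yMax);
    return std::pow(std::min(compressed, 1.0f), 1.0f / gamma);
}
```

One log and one pow per pixel is about the arithmetic load described for operator 1; a real kernel would additionally convert XYZ to RGB and pack the result to RGBA8.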
This application is particularly well suited for CUDA. All operators run at 100% occupancy, have a decent amount of arithmetic to do and use fully coalesced memory accesses. The screenshots actually show a greyscale image of the CUDA kernel clock() timings for every pixel. They have been computed as follows:
screenshot1: Operator 1 using device memory read/write
screenshot2: Operator 2 using texfetches, device memory write
screenshot3: Operator 2a using texfetches, device memory write
screenshot4: Operator 2 using device memory read/write
screenshot5: Operator 2a using device memory read/write
The grey values have been scaled to min/max per screenshot, so the absolute time is not visible in the shading (yes, operator 1 is faster than 2). What is interesting is how the timings vary across the image (1k x 1k x XYZ x 32-bit float input, RGBA8 output).
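The min/max scaling mentioned above could be sketched on the host roughly as follows. This assumes the per-pixel clock() differences were copied back as unsigned 32-bit tick counts; the function name `timingsToGrey` and the buffer layout are hypothetical, not taken from the testbed.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

// Map raw per-pixel tick counts to 8-bit grey values.
// The smallest timing becomes 0 (black), the largest 255 (white),
// so only relative variation within one image is visible.
std::vector<std::uint8_t> timingsToGrey(const std::vector<unsigned int>& ticks) {
    std::vector<std::uint8_t> grey(ticks.size(), 0);
    if (ticks.empty()) return grey;

    auto mm = std::minmax_element(ticks.begin(), ticks.end());
    const unsigned int tMin = *mm.first;
    const unsigned int tMax = *mm.second;
    if (tMin == tMax) return grey;  // flat timing image: avoid division by zero

    const float scale = 255.0f / static_cast<float>(tMax - tMin);
    for (std::size_t i = 0; i < ticks.size(); ++i) {
        float v = static_cast<float>(ticks[i] - tMin) * scale;
        grey[i] = static_cast<std::uint8_t>(v + 0.5f);  // round to nearest
    }
    return grey;
}
```

Because every image is normalised to its own range, a mostly mid-grey image means low variation in timings, not a particular absolute speed.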
It looks like there can be huge variations when using device memory accesses, and as the bright line in the upper left corner suggests, the G80 has a hard time starting up (see screenshots 1, 2 and 4).
Texfetches really only help if you can make use of the texture cache. Screenshot 3 shows a more average grey, which means that the timings have less variation. The texfetches do not help in screenshot 2, as this variant reads only a single input pixel.
What is also nice is that screenshot 5 is relatively smooth as well. It looks like the device memory fetches in the stencil contribute some averaging too, much like the texture cache does.
The oddly low start-up performance directly means that you need a massive number of threads to amortize it.
Looking forward to your replies.