Improved texfetch to exploit all of the texture hardware

Currently, the texture fetch in CUDA is rather poor in functionality compared to what one can do in the shader languages. Are there plans to support the following in a future release?

  • special textures (3D already seems to work out of spec now; what about cube maps?)
  • mip-mapping
  • anisotropic filtering
  • projective mappings (homogeneous/shadow)
  • S3TC formats
  • multisample buffers

Given the incredible performance for potentially very expensive stuff like compressed mipmaps, I suspect that there is dedicated hardware for that anyway on the chip. Will we be able to access these modes of the texunits in CUDA soon? Thanks.

Peter

Yes, we are considering adding support for more texture features to CUDA in future releases.

The syntax is also likely to change to be more like the Cg/GLSL texture functions.

In what order would you prioritize these features?

Mipmaps and aniso would only work if you explicitly specified the LOD / derivatives.

What do you mean by multisample buffers exactly?

Any chance we’ll be seeing CUDA and Cg converge to the same programming environments in general?

I would like to see one texture type made “feature complete” first before introducing new types. That is:

  1. mipmap + S3TC

  2. projective mappings + aniso

  3. 3D + cube maps

The reason is that doing (trilinear) mipmap + decompression by hand in CUDA is just awful, especially when you know that the GPU has hardware for it that is also faster than your own crappy implementation :)

Yeah, that is obvious. I wouldn’t mind too much if not all functionality is reachable through the texfetch command. Cg-style command-name augmentation is fine.

I mean attaching an OpenGL ARB_multisample buffer as a texture to CUDA.

Peter

3-D (volumetric) texture maps are of critical importance to us…

I believe MIP mapping of 3-D (volumetric) textures would also be quite useful for one of our numerical methods.

John

3D textures are convenient for getting trilinear interpolation done. Everything else can be done equally well with 2D tiled textures (“flat volumes”). The texture cache is 2D only anyway. Plus, using image-processing libraries to build high-quality mip levels and storing the result in the usual (losslessly compressing) image formats is straightforward. That also goes for floating-point formats (using Vigra, for example).

Peter

Indeed, trilinear interpolation is precisely what I’m after…

Our application isn’t image processing; we want to use these features for other purposes. :)

John

Interesting question. I think this is unlikely, since CUDA and Cg have different goals and different target audiences, but we’re interested in hearing developers’ feedback on this.

Cg and CUDA are similar languages in that they’re both based on C, but Cg is much more graphics-oriented - it has different program types (vertex/fragment), semantics for interpolated values etc., whereas CUDA is more like regular C.

If you’re talking about the APIs, note that the CUDA driver API (as opposed to the runtime API that most of the SDK samples use) is more like the Cg runtime, in that you have more control over program loading and execution, explicitly setting parameters etc.
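
For example, a driver-API launch looks roughly like this (a sketch only, with error checking omitted; the module name, kernel name, and launch dimensions are just placeholders):

    #include <cuda.h>

    int main(void)
    {
        CUdevice    dev;
        CUcontext   ctx;
        CUmodule    mod;
        CUfunction  fn;
        CUdeviceptr d_data;

        cuInit(0);
        cuDeviceGet(&dev, 0);
        cuCtxCreate(&ctx, 0, dev);

        /* Load a compiled module and get a handle to the kernel by name,
           much like loading a Cg program and getting a program handle. */
        cuModuleLoad(&mod, "kernel.cubin");
        cuModuleGetFunction(&fn, mod, "myKernel");

        cuMemAlloc(&d_data, 1024 * sizeof(float));

        /* Parameters and launch configuration are set explicitly. */
        cuFuncSetBlockShape(fn, 256, 1, 1);
        cuParamSeti(fn, 0, d_data);
        cuParamSetSize(fn, 4);
        cuLaunchGrid(fn, 4, 1);

        cuMemFree(d_data);
        cuCtxDetach(ctx);
        return 0;
    }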

It would be interesting to find out how many developers here are porting existing Cg GPGPU code to CUDA vs. how many are new GPU programmers porting C code.

I’ll cast another vote for 3D texture maps. I could really use trilinear filtering and 3D texmap addressing.

I’m porting Fortran and a little bit of C to CUDA.

I’m for prioritizing 3-D texture maps and addressing as well.

Guys, seriously, what do you need 3D textures for if you don’t use compressed formats? Or do you use DXT? In that case you should vote for that first.

Trilinear interpolation in a flat volume is very easy to implement: use the bilinear interpolation hardware + 2D texture cache on the slice and do a simple linear interpolation for the missing dimension. In my experience this is usually as fast as the 3D tex lookup, and it can be much faster if you can exploit locality (= cache coherence) when you know you need to fetch neighboring samples. A minimal sketch follows.
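
Something along these lines (illustrative only: it assumes the slices are tiled side by side in one 2D texture with linear filtering enabled, ignores the border at the last slice, and uses the current texfetch syntax):

    /* flatVol holds all depth slices of the volume side by side:
       width = numSlices * sliceW, height = sliceH, filterMode = linear. */
    texture<float, 2, cudaReadModeElementType> flatVol;

    __device__ float fetchTrilinear(float x, float y, float z, int sliceW)
    {
        int   z0 = (int)floorf(z);      /* lower slice index        */
        float fz = z - (float)z0;       /* fractional part along z  */

        /* Bilinear filtering within each slice is done by the texture
           hardware; we only add the linear blend between the two slices. */
        float s0 = texfetch(flatVol, x + z0 * sliceW, y);
        float s1 = texfetch(flatVol, x + (z0 + 1) * sliceW, y);
        return s0 + fz * (s1 - s0);
    }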

Peter

Peter,

If the texture unit already supports 3-D texturing and addressing, why on earth would I want to re-implement it myself, using up more of the precious registers and adding additional addressing arithmetic? Perhaps for your use cases there are plenty of registers sitting around unused, but that’s not the case for everyone. I can imagine there are plenty of cases where you could make use of the 2-D locality pattern when writing your own interpolation, and I agree that is a good strategy if doing it yourself pays off. That said, if you’re already up against the wall on the number of registers and amount of shared memory you’re using, doing the interpolation yourself won’t help matters. I think everyone knows how to implement the last-dimension interpolation themselves, but it is just throwing registers and FP ops out the window to do so when the texture unit could be doing it for us. If the G80 doesn’t actually implement 3-D texturing in hardware, that would be a reason to do it ourselves, but I’ve been assured that it does…

Cheers,

John

I agree with you, John; 3D textures would be useful in terms of performance for me too.

John, if you read my initial request carefully, you’ll see that I do want 3D texturing. The question was what priority to give it. Implementing DXT (see the SDK) + mipmapping costs a lot more registers than a lerp, which is easily written incrementally.

So don’t get me wrong, I would like NVIDIA to let CUDA access all the texture modes the texunit supports!

Peter

Peter,

Ok, I misunderstood the tone of your previous note. Yes, I fully understand your preference for prioritizing these other features, since they aren’t practical to do yourself.

I have to be honest, though, and say that for the non-graphical applications we’ve been working on, things like texture compression aren’t a priority. (These aren’t graphical data that we’re fetching and interpolating, and compression would not be acceptable.) A few of the computational kernels we’re working with are chewing through registers like there’s no tomorrow. Some of this results from weaknesses in the beta compiler and might improve “for free” down the road; in other cases there’s likely no escape and the algorithm is just that nasty. Splitting the kernel into multiple passes can work quite well in cases where there’s a natural division in the algorithm. For the others, anything we can get by offloading work to the texture unit would be a great help.

I’ll let the NVIDIA guys decide which features will help the most CUDA applications the soonest. I can only represent my own needs and priorities. There are so many different interesting CUDA projects in the works all over the world right now that it’s really hard to guess which features will make the biggest impact. My own feeling is that in order to woo the computational community to CUDA, they may want to initially focus on features that are not provided by the existing APIs and shading languages, to bring more of the number-crunching crowd to CUDA. I think in the end we probably all want the same things, though, and I respect that your short-term needs are different from mine.

Cheers,

John

I am porting (and refining) an existing raytracer. I might also port some PDEs in the future.

John,

I totally understand what you are saying. I also did some work on non-graphics stuff lately and I looked at texfetch for it as well. Regarding register pressure, however, I found that using texfetch actually increases it. If you look at the .ptx, the compiler needs to set up a lot of registers for the tex call, which the texunit uses as configuration registers. Doing a ld.global is pretty lean in contrast. If the compiler improves with regard to register optimization, only the global memory access approach might benefit from it.
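
For illustration, the two variants look roughly like this (names are made up; the point is only the tex setup vs. the plain ld.global):

    texture<float, 1, cudaReadModeElementType> texIn;   /* bound to the same data as d_in */

    /* Reads through the texture unit: the tex instruction in the .ptx needs
       several registers set up as configuration for the texunit. */
    __global__ void viaTexture(float *d_out, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            d_out[i] = texfetch(texIn, i);
    }

    /* Reads straight from device memory: compiles to a lean ld.global. */
    __global__ void viaGlobal(const float *d_in, float *d_out, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            d_out[i] = d_in[i];
    }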

Peter

BTW: for minimizing register usage, I currently force certain variables into shared mem. This works at the expense of some ld.shared / st.shared in most cases. The code runs only slightly slower (per thread, that is), but if you need to shrink the register requirements because your occupancy is bounded by them, this can make a huge difference overall. A small sketch of what I mean is below. Any other techniques, somebody?
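
Roughly like this (heavily simplified; in the real kernels there is a lot more work between the store and the load):

    __global__ void kernel(const float *in, float *out, int n)
    {
        /* One slot per thread for a value that would otherwise sit in a
           register for the whole kernel (assumes blockDim.x == 256). */
        __shared__ float spill[256];

        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= n)
            return;

        spill[threadIdx.x] = in[i];       /* st.shared instead of a register   */

        /* ... lots of register-hungry arithmetic here ... */

        out[i] = spill[threadIdx.x];      /* ld.shared when it is needed again */
    }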

Peter,

Indeed the current beta does consume some registers when doing texfetch, but when you need texturing with interpolation there may not be a convenient alternative. Since we were talking about emulating 3-D texturing by doing the remaining interpolation ourselves, the question at hand is more whether using the hardware to do the interpolation costs more registers than doing it ourselves. I agree that, as a whole, doing any texturing currently eats several registers. For my purposes, the important question is whether a built-in 3-D texfetch would cost more registers than implementing it ourselves with two 2-D texfetches. I suspect not, and that was my initial point of concern. When I get a little free time I’ll see how lean I can make a pure software implementation and get back to you on how many registers it uses and how fast it’ll run.

John

Yeah, good point. Using ld.global together with forcing variables into shared mem, however, I can work around the register pressure quite well.

Cool. I would like to see your findings.

Peter

Hey guys, just to put my current findings up for discussion:

Below are some screenshots of my testbed for imaging-application performance in CUDA. They show two high dynamic range tone mapping operators:

  1. A very cheap operator, consisting of just some log and pow for every pixel (a rough sketch of its shape follows the list)
  2. A more expensive adaptive operator that computes the result according to a global model for every pixel
    2a. The same operator as 2, but this time the adaptation model is rebuilt for every pixel, which means it considers a 9-point stencil around the pixel
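
For reference, operator 1 has roughly this shape (heavily simplified; the actual operator, its constants, and the parameter names are not reproduced here):

    /* Per-pixel global tone mapping: read XYZ float input, compress the
       luminance with a log, apply a pow for gamma, write RGBA8 output.
       'exposure' and 'gamma' stand in for the real operator parameters. */
    __global__ void tonemapSimple(const float4 *in, uchar4 *out,
                                  float exposure, float gamma, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= n)
            return;

        float4 xyz = in[i];                          /* coalesced read      */
        float  L   = logf(1.0f + exposure * xyz.y);  /* log of luminance Y  */
        float  v   = powf(L, 1.0f / gamma);          /* pow per pixel       */

        unsigned char c = (unsigned char)(255.0f * fminf(v, 1.0f));
        out[i] = make_uchar4(c, c, c, 255);          /* coalesced write     */
    }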

This application is particularly well suited for CUDA. All operators run at 100% occupancy, have decent arithmetic to do and use fully coalesced memory access. The screenshots actually show a greyscale image of the CUDA kernel clock() timings for every pixel. They have been computed as follows:

screenshot1: Operator 1 using device memory read/write
screenshot2: Operator 2 using texfetches, device memory write
screenshot3: Operator 2a using texfetches, device memory write
screenshot4: Operator 2 using device memory read/write
screenshot5: Operator 2a using device memory read/write
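
The per-pixel timing capture works roughly like this (simplified; the operator body is omitted and the min/max scaling happens on the host afterwards):

    /* Record the clock() delta spent on each pixel; the deltas are later
       scaled to min/max on the host and written out as a greyscale image. */
    __global__ void timedOperator(const float4 *in, uchar4 *out,
                                  unsigned int *timings, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= n)
            return;

        unsigned int start = (unsigned int)clock();

        /* ... the actual tone mapping operator for pixel i ... */
        out[i] = make_uchar4(0, 0, 0, 255);

        timings[i] = (unsigned int)clock() - start;
    }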

The grey values have been scaled to min/max, so the absolute time is not visible in the shading (yes, operator 1 is faster than 2). What is funny is how the timings vary across the image (1k x 1k x XYZ x 32-bit float input, RGBA8 output).

  • Looks like when using device memory accesses there can be huge variations, and as the bright line in the upper left corner suggests, the G80 has a hard time starting up. See screenshots 1, 2, and 4.

  • Texfetches really help only if you can make use of the cache. Screenshot 3 shows a more uniform grey, which means the timings have less variation. Texfetches do not help in screenshot 2, as that variant reads only a single input pixel.

  • What is also nice is that screenshot 5 is relatively smooth. Looks like the device mem fetches in the stencil can also contribute some averaging, as the texcache does.

  • The oddly slow start-up directly means that you need a massive number of threads to amortize it.

Looking forward to your replies.

Peter