improved texfetch to exploit all of texture hardware

prkipfer · April 10, 2007, 1:17pm

Currently, the texture fetch in CUDA is rather poor in functionality compared to what one can do in all shader languages. Are there plans to support the following in a future release?

special textures (3D already seems to work out-of-spec now, what about cube maps)
mip-mapping
anisotropic filtering
projective mappings (homogeneous/shadow)
S3TC formats
multisample buffers

Given the incredible performance for potentially very expensive stuff like compressed mipmaps, I suspect that there is dedicated hardware for that anyway on the chip. Will we be able to access these modes of the texunits in CUDA soon? Thanks.

Peter

Simon_Green · April 10, 2007, 6:45pm

Yes, we are considering adding support for more texture features to CUDA in future releases.

The syntax is also likely to change to be more like the Cg/GLSL texture functions.

In what order would you prioritize these features?

Mipmaps and aniso would only work if you explicitly specified the LOD / derivatives.

What do you mean by multisample buffers exactly?

JaredHoberock · April 10, 2007, 6:56pm

Any chance we’ll be seeing CUDA and Cg converge to the same programming environments in general?

prkipfer · April 11, 2007, 1:25pm

I would like to see a texture type “feature complete” first before introducing new types. That is

mipmap + S3TC
projective mappings + aniso
3D + cube maps

The reason is that doing (trilinear) mipmap + decompression by hand in CUDA is just awful, especially when you know that the GPU has hardware for it that is also faster than your own crappy implementation :)

Yeah, that is obvious. I wouldn’t mind too much if not all functionality is reached with the texfetch command. Cg-style command name augmentation is fine.

I mean attaching an OpenGL ARB_multisample buffer as a texture to CUDA.

Peter

tachyon_john · April 11, 2007, 2:42pm

3-D (volumetric) texture maps are of critical importance to us…

I believe MIP mapping of 3-D (volumetric) textures would also be quite useful for one of our numerical methods.

John

prkipfer · April 11, 2007, 3:14pm

3D textures are convenient for getting trilinear interpolation done. Everything else can be done equally well with 2D tiled textures (“flat volumes”). The texture cache is 2D only anyway. Plus, using image processing libs to generate high-quality mip levels and storing the result in the usual (losslessly compressed) image formats is straightforward. That also goes for floating point formats (using Vigra, for example).

Peter

tachyon_john · April 11, 2007, 6:19pm

Indeed, trilinear interpolation is precisely what I’m after…

Our application isn’t image processing, we want to use these features for other purposes. :)

John

Simon_Green · April 12, 2007, 1:46pm

Interesting question. I think this is unlikely, since CUDA and Cg have different goals and different target audiences, but we’re interested in hearing developers’ feedback on this.

Cg and CUDA are similar languages in that they’re both based on C, but Cg is much more graphics-oriented - it has different program types (vertex/fragment), semantics for interpolated values etc., whereas CUDA is more like regular C.

If you’re talking about the APIs, note that the CUDA driver API (as opposed to the runtime API that most of the SDK samples use) is more like the Cg runtime, in that you have more control over program loading and execution, explicitly setting parameters etc.

It would be interesting to find out how many developers here are porting existing Cg GPGPU code to CUDA vs. how many are new GPU programmers porting C code.

jimh · April 23, 2007, 8:51pm

I’ll cast another vote for 3D texture maps. I could really use trilinear filtering and 3D texmap addressing.

I’m porting Fortran and a little bit of C to CUDA.

wumpus · April 25, 2007, 7:47am

I’m for prioritizing 3-D texture maps and addressing as well

prkipfer · April 25, 2007, 10:22am

Guys, seriously what do you need 3D textures for if you don’t use compressed formats? Or do you use DXT? In that case you should vote for that first.

Trilinear interpolation in a flat volume is very easy to implement. Use the bilinear interpolation hardware + 2D texture cache on the slice and do a simple linear interpolation for the missing dimension. In my experience this is usually as fast as the 3D tex lookup, and it can be much faster if you can exploit parallelism (= cache coherence) when you know that you need to fetch known neighboring samples.
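For readers who have not seen the flat-volume trick, here is a minimal CPU-side sketch in C++ of the idea described above: two bilinear lookups in adjacent slices, then one linear interpolation along the missing dimension. The names `FlatVolume`, `mix`, `bilinear` and `trilinear` are made up for illustration, and `bilinear` only stands in for what the texture unit would do in hardware. It assumes clamp-to-edge addressing and texel-unit coordinates, and is not taken from any SDK sample.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical flat volume: `depth` slices of width x height texels stored
// back to back, so slice z starts at row z * height of one tall 2D image.
struct FlatVolume {
    int width, height, depth;
    std::vector<float> texels;

    // Clamp-to-edge texel read from the tiled 2D layout.
    float at(int x, int y, int z) const {
        assert(texels.size() == static_cast<std::size_t>(width) * height * depth);
        x = std::min(std::max(x, 0), width - 1);
        y = std::min(std::max(y, 0), height - 1);
        z = std::min(std::max(z, 0), depth - 1);
        return texels[(static_cast<std::size_t>(z) * height + y) * width + x];
    }
};

// Linear interpolation between a and b with weight t in [0, 1].
inline float mix(float a, float b, float t) { return a + t * (b - a); }

// Stand-in for the bilinear filtering the texture unit does within one slice.
float bilinear(const FlatVolume& v, float x, float y, int z) {
    const int x0 = static_cast<int>(std::floor(x));
    const int y0 = static_cast<int>(std::floor(y));
    const float fx = x - x0;
    const float fy = y - y0;
    const float top = mix(v.at(x0, y0, z), v.at(x0 + 1, y0, z), fx);
    const float bottom = mix(v.at(x0, y0 + 1, z), v.at(x0 + 1, y0 + 1, z), fx);
    return mix(top, bottom, fy);
}

// Two bilinear fetches plus one lerp for the missing (z) dimension.
float trilinear(const FlatVolume& v, float x, float y, float z) {
    const int z0 = static_cast<int>(std::floor(z));
    const float fz = z - z0;
    return mix(bilinear(v, x, y, z0), bilinear(v, x, y, z0 + 1), fz);
}
```

One caveat with a real tiled 2D texture: hardware bilinear filtering will blend across tile borders unless the slices are padded or the coordinates are clamped per slice, which this sketch sidesteps by clamping inside `at`.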

Peter

tachyon_john · April 25, 2007, 6:14pm

Peter,

If the texture unit already supports 3-D texturing and addressing, why on earth would I want to re-implement it myself, using up more of the precious registers and adding additional addressing arithmetic? Perhaps for your use cases there are plenty of registers sitting around unused, but that’s not the case for everyone. I can imagine there are plenty of cases where you could make use of the 2-D locality pattern when writing your own interpolation, and I agree that is a good strategy if doing it yourself pays off. That said, if you’re already up against the wall on the number of registers and shared memory you’re using, doing interpolations for yourself won’t help matters. I think everyone knows how to implement the last dimension interpolation themselves, but it is just throwing registers and FP ops out the window to do so if the texture unit could be doing this for us. If the G80 doesn’t actually implement 3-D texturing in hardware, that would be a reason to do it ourselves, but I’ve been assured that it does…

Cheers,

John

benoit · April 26, 2007, 1:44am

I agree with you, John. 3D textures would be useful in terms of performance for me too.

prkipfer · April 26, 2007, 1:36pm

John, if you read my initial request carefully you see that I do want 3D texturing. The question was what priority to give it. Implementing DXT (see SDK) + mipmapping costs a lot more registers than lerp, which is easily written incrementally.

So don’t get me wrong, I would like NVIDIA to allow CUDA to access all texture modes the texunit supports!

Peter

tachyon_john · April 26, 2007, 2:25pm

Peter,

Ok, I misunderstood the tone of your previous note. Yes, I fully understand your preference prioritizing these other features since they aren’t practical to do yourself.

I have to be honest though and say that for the non-graphical applications we’ve been working on, things like texture compression aren’t a priority. (these aren’t graphical data that we’re fetching, interpolating, etc, and compression would not be acceptable) A few of the computational kernels we’re working with are chewing through registers like there’s no tomorrow. Some of this results from weaknesses in the beta compiler and might improve “for free” down the road, in other cases there’s likely no escape and the algorithm is just that nasty. Splitting the kernel into multiple passes can work quite well in cases where there’s a natural division in the algorithm. For the others, anything we can get by offloading work to the texture unit would be a great help.

I’ll let the NVIDIA guys decide what features will help the most CUDA applications the soonest. I can only represent my own needs and priorities. There are so many different interesting CUDA projects in the works all over the world right now it’s really hard to guess what features will make the biggest impact. My own feeling is that in order to woo the computational community to CUDA, they may want to initially focus on features that are not provided by the existing APIs and shading languages, to bring more of the number crunching crowd to CUDA. I think in the end we probably all want the same things though, and I respect that your short-term needs are different from mine.

Cheers,

John

Johan.Seland · April 27, 2007, 8:04am

I am porting (and refining) an existing raytracer. I might also port some PDEs in the future.

prkipfer · April 30, 2007, 1:41pm

John,

I totally understand what you are saying. I also did some work on non-graphics stuff lately and looked at texfetch for it. Regarding the register pressure, however, I found that using texfetch actually increases it. If you look at the .ptx, it needs to set up a lot of registers for the tex call, which the texunit uses as configuration registers. Doing a ld.global is pretty lean in contrast. If the compiler improves with regard to register optimization, only the global mem access approach might benefit from it.

Peter

BTW: For minimizing register usage, I currently force certain variables into shared mem. This works at the expense of some ld.shared / st.shared in most cases. The code runs only slightly slower (per thread that is) but if you need to shrink the reg requirements because your occupancy is bounded by them, this can make a huge difference overall. Any other techniques, somebody?
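To put rough numbers on why a register or two can matter so much, here is a back-of-the-envelope sketch in C++. It assumes the G80 (compute capability 1.0) limits of 8192 registers and at most 768 resident threads per multiprocessor, and it deliberately ignores shared-memory limits, the cap on resident blocks, and the hardware's register allocation granularity. The function names are hypothetical; treat it as an estimate, not a replacement for the occupancy calculator.

```cpp
#include <algorithm>
#include <cassert>

// Assumed G80 (compute capability 1.0) per-multiprocessor limits.
constexpr int kRegistersPerSM = 8192;
constexpr int kMaxThreadsPerSM = 768;

// Resident threads per multiprocessor when registers and the thread limit
// are the only constraints. Only whole blocks can be resident, so the
// result is a multiple of the block size.
int residentThreads(int regsPerThread, int threadsPerBlock) {
    assert(regsPerThread > 0 && threadsPerBlock > 0);
    const int blocksByRegs = kRegistersPerSM / (regsPerThread * threadsPerBlock);
    const int blocksByThreads = kMaxThreadsPerSM / threadsPerBlock;
    return std::min(blocksByRegs, blocksByThreads) * threadsPerBlock;
}

// Fraction of the maximum resident threads that actually fit.
double occupancy(int regsPerThread, int threadsPerBlock) {
    return static_cast<double>(residentThreads(regsPerThread, threadsPerBlock)) /
           kMaxThreadsPerSM;
}
```

Under these assumptions, with 256-thread blocks, 11 registers per thread leaves room for only two blocks (512 threads, about 67% occupancy), while 10 registers per thread fits three blocks (768 threads, 100%). That is the kind of step that moving a single variable into shared memory can buy.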

tachyon_john · April 30, 2007, 4:27pm

Peter,

Indeed, the current beta does consume some registers when doing texfetch, but when you need to use texturing with interpolation there may not be a convenient alternative. Since we were talking about emulating 3-D texturing by doing the remaining interpolation ourselves, the question at hand is whether using the hardware to do the interpolation costs more registers than doing it ourselves. I agree that, as a whole, doing any texturing currently eats several registers. For my purposes, the important question was whether a built-in 3-D texfetch would cost more registers than implementing it ourselves with two 2-D texfetches. I suspect not, and that was my initial point of concern. When I get a little free time I’ll see how lean I can make a pure software implementation and get back to you on how many registers it uses and how fast it runs.

John

prkipfer · April 30, 2007, 4:52pm

Yeah, good point. Using ld.global however together with forcing the variables to shared mem, I can get around the register spill very well.

Cool. I would like to see your findings.

Peter

prkipfer · May 3, 2007, 4:47pm

Hey guys, just to put my current findings to discussion:

Below are some screenshots of my testbed for imaging application performance in CUDA. They are two high dynamic range tone mapping operators.

1. A very cheap operator, consisting just of some log and pow for every pixel
2. A more expensive adaptive operator that computes the result according to a global model for every pixel
2a. The same operator as 2, but this time the adaptation model is rebuilt for every pixel, which means it considers a 9 point stencil around the pixel

This application is particularly well suited for CUDA. All operators run at 100% occupancy, have decent arithmetic to do and use fully coalesced memory access. The screenshots actually show a greyscale image of the CUDA kernel clock() timings for every pixel. They have been computed as follows:

screenshot1: Operator 1 using device memory read/write
screenshot2: Operator 2 using texfetches, device memory write
screenshot3: Operator 2a using texfetches, device memory write
screenshot4: Operator 2 using device memory read/write
screenshot5: Operator 2a using device memory read/write

The grey values have been scaled to min/max so the absolute time is not visible in the shading (yes operator 1 is faster than 2). What is funny is how the timings vary across the image (1k x 1k x XYZ x 32bit float input, RGBA8 output).
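In case the min/max scaling is unclear, here is a minimal C++ sketch of how per-pixel clock() tick counts could be mapped to 8-bit grey values. The function name `toGrey` and the use of `unsigned` tick counts are assumptions for illustration; the actual testbed code is not shown in this thread.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Map per-pixel tick counts to 0..255, where 0 is the fastest pixel and
// 255 the slowest. Absolute times are lost; only the relative spread
// within one image survives.
std::vector<std::uint8_t> toGrey(const std::vector<unsigned>& ticks) {
    assert(!ticks.empty());
    const auto mm = std::minmax_element(ticks.begin(), ticks.end());
    const unsigned lo = *mm.first;
    const unsigned hi = *mm.second;
    std::vector<std::uint8_t> grey(ticks.size(), 0);
    if (hi == lo) return grey;  // uniform timings: leave everything black
    for (std::size_t i = 0; i < ticks.size(); ++i) {
        const std::uint64_t scaled =
            static_cast<std::uint64_t>(ticks[i] - lo) * 255u / (hi - lo);
        grey[i] = static_cast<std::uint8_t>(scaled);
    }
    return grey;
}
```

Because every image is normalized on its own, the same grey level means different tick counts in different screenshots, which is why the shading cannot be used to compare operator 1 against operator 2.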

Looks like when using the device memory accesses, there can be huge variations, and as the bright line in the upper left corner suggests, the G80 has a hard time starting up. See screenshots 1, 2, 4.
Texfetches really only help if you can make use of the cache. Screenshot 3 shows a more average grey, which means that the timings have less variation. The texfetches do not help in screenshot 2, as this variant reads only a single input pixel.
What is also nice is that screenshot 5 is relatively smooth. Looks like the device mem fetch in the stencil can also contribute some averaging, as the texcache does.
The funny low start-up performance directly means that you need a massive amount of threads to amortize it.

Looking forward to your replies.

Peter
