Multiple textures vs. single multichannel texture: which is faster?

Hi there,

My code has to read from three (equally-sized) matrices simultaneously, and I’ve chosen to map them to textures. Now I’m facing a choice between using one texture with multiple channels and multiple textures with a single channel each. The access pattern is such that I could use all of the data from a multichannel fetch. My gut feeling is that the multichannel way should be faster. Am I right?

As an additional consideration, I would need three channels, but only 1, 2, or 4 channels seem to be supported, leaving one unused. How much am I hurting myself there, bandwidth-wise?

Thanks,
Andreas

In my application, a single float4 texture fetch is faster than three separate float texture fetches. You may find the same in yours. I would suggest writing a quick microbenchmark that times kernels using both methods on a representative data set.
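
Something along these lines is the kind of microbenchmark I mean. It’s only a sketch: it uses the old texture reference API, made-up sizes and placeholder kernel bodies, and skips data initialization and error checking, so adapt the access pattern to match your real kernels.

#include <cstdio>
#include <cuda_runtime.h>

// Texture references must be file-scope globals (old texture reference API).
texture<float,  1, cudaReadModeElementType> texX;
texture<float,  1, cudaReadModeElementType> texY;
texture<float,  1, cudaReadModeElementType> texZ;
texture<float4, 1, cudaReadModeElementType> texXYZW;  // fourth channel is padding

// Variant A: three single-channel fetches per element.
__global__ void kernel3x1(float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = tex1Dfetch(texX, i) + tex1Dfetch(texY, i) + tex1Dfetch(texZ, i);
}

// Variant B: one four-channel fetch per element.
__global__ void kernel1x4(float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float4 v = tex1Dfetch(texXYZW, i);
        out[i] = v.x + v.y + v.z;   // .w is ignored
    }
}

int main()
{
    const int n = 1 << 22;   // made-up element count
    float *x, *y, *z, *out;
    float4 *xyzw;
    cudaMalloc((void**)&x,    n * sizeof(float));
    cudaMalloc((void**)&y,    n * sizeof(float));
    cudaMalloc((void**)&z,    n * sizeof(float));
    cudaMalloc((void**)&xyzw, n * sizeof(float4));
    cudaMalloc((void**)&out,  n * sizeof(float));

    cudaBindTexture(0, texX,    x,    n * sizeof(float));
    cudaBindTexture(0, texY,    y,    n * sizeof(float));
    cudaBindTexture(0, texZ,    z,    n * sizeof(float));
    cudaBindTexture(0, texXYZW, xyzw, n * sizeof(float4));

    dim3 block(256);
    dim3 grid((n + block.x - 1) / block.x);
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    // Time the three-texture version.
    cudaEventRecord(start, 0);
    kernel3x1<<<grid, block>>>(out, n);
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);
    float ms3, ms4;
    cudaEventElapsedTime(&ms3, start, stop);

    // Time the float4 version.
    cudaEventRecord(start, 0);
    kernel1x4<<<grid, block>>>(out, n);
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);
    cudaEventElapsedTime(&ms4, start, stop);

    printf("3 x float fetches: %.3f ms   1 x float4 fetch: %.3f ms\n", ms3, ms4);
    return 0;
}

For a fair comparison you would also want a warm-up launch for each kernel and an average over several runs.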

Same for us; it looks as though, in hardware, a single-channel texture read is implemented as cutting one channel out of a multi-channel read.

So we would suggest a multi-channel texture rather than multiple textures.
Also consider packing multi-dimensional data into a single texture with offsets.
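
To make the offset idea concrete, here is a rough sketch of what we mean (texture reference API again; names and sizes are invented): all three matrices sit back to back in one linear buffer bound to a single texture, and each fetch just adds that matrix’s base offset.

#include <cuda_runtime.h>

// One linear texture holding matrices A, B and C back to back.
texture<float, 1, cudaReadModeElementType> texPacked;

__global__ void sumPacked(float *out, int n, int offB, int offC)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = tex1Dfetch(texPacked, i)           // A[i]
               + tex1Dfetch(texPacked, offB + i)    // B[i]
               + tex1Dfetch(texPacked, offC + i);   // C[i]
}

int main()
{
    const int n = 1 << 20;                               // elements per matrix (made up)
    float *packed, *out;
    cudaMalloc((void**)&packed, 3 * n * sizeof(float));  // A | B | C
    cudaMalloc((void**)&out, n * sizeof(float));
    cudaBindTexture(0, texPacked, packed, 3 * n * sizeof(float));

    sumPacked<<<(n + 255) / 256, 256>>>(out, n, n, 2 * n);
    cudaThreadSynchronize();
    return 0;
}

The price is one extra integer add per fetch, but everything stays within a single texture reference.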

To my knowledge, there is no particular overhead in using a single-channel texture. I have written kernels using single-channel textures that achieved the full memory bandwidth available on the device.

When it comes to multiple single-channel texture reads vs. one multi-channel read, I think there is some extra overhead simply because multiple texture reads are being issued.

You’re right, I did not say that reading a single-channel texture is slow. It’s just that reading a multi-channel texture is as fast as reading a single-channel one. So the question is: in your code that reaches the theoretical bandwidth limit, can you try swapping the single-channel reads for multi-channel ones?

Maybe you’d then get 3x the theoretical bandwidth, eh? :)

And one more thing: when using only textures, one should get better performance than memory-limited algorithms, because textures are cached, so the peak memory bandwidth limit doesn’t really apply to texture reads.

No, you do not understand bandwidth. A single-channel texture read can achieve close to 80 GiB/s. A multi-channel texture read can also reach close to 80 GiB/s (in fact a little closer than single-channel). Multi-channel is as fast as single-channel in throughput, but reading one float4 takes (almost) four times as long as reading one float.

Also, peak memory bandwidth applies just the same to textures. If you are hitting the cache all the time, you should instead have read the data coalesced, stored it in shared memory, and performed your calculations on shared memory.
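
For what it’s worth, the pattern I mean looks roughly like this. It’s a toy 1D three-point stencil, not anyone’s real application; the point is that each element is read from global memory exactly once, coalesced, and the neighbouring values come out of shared memory instead of repeated global or texture reads.

#include <cuda_runtime.h>

#define BLOCK 256

__global__ void stencilShared(const float *in, float *out, int n)
{
    __shared__ float tile[BLOCK + 2];                 // one halo element on each side

    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int t = threadIdx.x + 1;                          // position inside the tile

    tile[t] = (i < n) ? in[i] : 0.0f;                 // coalesced load, zero-padded past the end
    if (threadIdx.x == 0)
        tile[0] = (i > 0) ? in[i - 1] : 0.0f;         // left halo
    if (threadIdx.x == blockDim.x - 1)
        tile[t + 1] = (i + 1 < n) ? in[i + 1] : 0.0f; // right halo
    __syncthreads();

    if (i < n)                                        // neighbours come from shared memory
        out[i] = 0.25f * tile[t - 1] + 0.5f * tile[t] + 0.25f * tile[t + 1];
}

int main()
{
    const int n = 1 << 20;
    float *in, *out;
    cudaMalloc((void**)&in,  n * sizeof(float));
    cudaMalloc((void**)&out, n * sizeof(float));
    stencilShared<<<(n + BLOCK - 1) / BLOCK, BLOCK>>>(in, out, n);
    cudaThreadSynchronize();
    return 0;
}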

I’ve never seen the cache provide more than ~70 GiB/s. Consider this: the cache has enough room for 2000 floats, and a multiprocessor is capable of executing 768 threads concurrently. If every thread reads a float4, that is 3072 floats read, so values are likely to be flushed from the cache as warps work their way through the scheduler.

The cache acts more as an “almost coalesced memory reader”. If all threads within a warp read spatially local values with a texture, you will achieve 70 GiB/s. Temporal locality, and spatial locality between warps, matter little.

Has anyone compared reading a small (< 1000-element) lookup table from the texture cache (and/or the constant cache) with reading the same table from shared memory? I expect shared memory to be faster, but if the texture cache is within a factor of 2 of shared memory, that could be a reasonable tradeoff if it frees up shared memory for other uses.
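
I haven’t benchmarked it myself, but the two kernels I have in mind look roughly like this (the table size and index pattern are just placeholders; the shared-memory version stages the table once per block):

#include <cuda_runtime.h>

#define TABLE_SIZE 256    // small lookup table, fits easily in shared memory

texture<float, 1, cudaReadModeElementType> texTable;

// Every lookup goes through the texture cache.
__global__ void lookupTex(const int *idx, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = tex1Dfetch(texTable, idx[i]);
}

// The table is copied into shared memory once per block,
// and all lookups are then served from shared memory.
__global__ void lookupShared(const float *table, const int *idx, float *out, int n)
{
    __shared__ float sTable[TABLE_SIZE];
    for (int t = threadIdx.x; t < TABLE_SIZE; t += blockDim.x)
        sTable[t] = table[t];
    __syncthreads();

    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = sTable[idx[i]];
}

I’d bind the table with cudaBindTexture(0, texTable, table, TABLE_SIZE * sizeof(float)) and time both kernels with events, the same way as in the benchmark sketch above. The shared-memory version pays the staging cost once per block and, with scattered indices, can also run into bank conflicts.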

Also, does anyone know if the cache is shared between all the blocks on the multiprocessor? That is, if all of the blocks are reading similar regions (say from a lookup table), can one block obtain cached values that were fetched by another block? That would be another reason to consider putting lookup tables in textures/const memory.

I made that comparison a while ago. http://forums.nvidia.com/index.php?showtop…76&#entry256376

Looking back at it now, the memory access pattern is a little weird. A more application-specific access pattern may give different results.