Texture Memory? How do you use it?

I’m fairly new to OpenCL so please bear with me.

In the first iteration of my code, I used basic memory buffers for large datasets and declared them global. Now that I’m looking to improve the timing, I wanted to use texture memory for this. In the CUDA version, we use cudaBindTexture and tex1Dfetch to fetch data from a large 1D float array. From my reading of the specification, texture memory is the same thing as image memory. However, since there are only 2D and 3D image objects, each with a maximum height and width, I run into some issues: my array is larger than the max height/width, but smaller than max height × max width. Must I convert my 1D array into 2D? Or is there a better way to do it?

Or am I completely off?

I did read http://forums.nvidia.com/index.php?showtopic=151743 and http://forums.nvidia.com/index.php?showtopic=150454, but they weren’t exactly conclusive about whether the texture memory referred to in the Best Practices and Programming Guides is in fact image objects.

Thanks and any help/suggestions are greatly welcome!

Yes, conversion to 2D or 3D is required. You can just wrap your 1D addresses into 2D. You may wish to write a routine which takes your prior 1D address and converts it to either an int2 / int4 or a float2 / float4, to keep your main logic readable.

The terms texture and image are basically interchangeable. Texture implies a small swatch used to put a skin over a fragment, in OpenGL terminology. Neither needs to actually contain image data.

In performance terms, you really need to be reading more than one value, in multiples of 4, in each work-item to feel the biggest advantage. Of course, you need to organize your texels such that data that belongs together sits in the same texel to get that 4x throughput. I am not sure whether textures also provide more throughput when a work-item writes more than one value, but I would not be surprised.

Reading the Best Practices Guide, they talk about getting up to 16x with global memory, but your data access has to be pretty precise for that. Not every problem fits that pattern so neatly, and any resulting kernel is likely to be heavily NVidia-optimized. I am not even sure it would do anything on the same NVidia hardware on OSX, as opposed to a more random access pattern. Everybody has textures, though, and they are likely to work the same way everywhere, so it may be better to get 4x on every platform, unless you do not care about other implementations.

If you are only reading/writing 1 value per work-item, then you are pinning most of your performance hopes on caching. You do escape the sequential/aligned access restrictions, but this is not the texture sweet spot.