What's the instruction throughput for texture fetches? How to refresh texture object without recopying?

I have some scattered memory accesses (actually most can be coalesced), and sometimes linear interpolation is needed, so texture memory may be a good and easy choice. But some parts of my kernel are not really bound by memory throughput, and I'm not sure whether texture instructions have significantly lower instruction throughput than other high-throughput instructions (such as FP32 ADD), so I don't know how to strike a good balance between the memory-bound parts and the instruction-bound parts. Could anyone be so kind as to provide a reference for the instruction throughput of texture fetch instructions?

The other problem is how to refresh the contents of a texture object without recopying. As far as I know, once a texture reference is bound to some memory, that memory should not be changed during kernel execution. But across different kernel calls, the texture is updated automatically once the bound memory contents have changed. For example, with array A bound to TexA and array B bound to TexB, the first kernel call can use TexA to update B, and the second kernel call can use TexB to update A. But for texture objects, the contents are copied into a cudaArray, which does not seem to be updated when the original memory contents change. And a cudaArray does not even seem to be writable from kernels, so I would have to re-memcpy the modified contents to update it. Is it possible to do this without recopying?
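
For clarity, the ping-pong I have in mind looks roughly like this (just a sketch with float data and made-up names TexA/TexB, using the old texture reference API):

#include <cuda_runtime.h>

// Legacy texture references, declared at file scope
texture<float, 1, cudaReadModeElementType> TexA;
texture<float, 1, cudaReadModeElementType> TexB;

__global__ void updateB(float *B, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) B[i] = 2.0f * tex1Dfetch(TexA, i);   // read A through TexA, write B
}

__global__ void updateA(float *A, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) A[i] = 2.0f * tex1Dfetch(TexB, i);   // read B through TexB, write A
}

void pingPong(float *d_A, float *d_B, int n)
{
    int block = 256, grid = (n + block - 1) / block;
    // bind once; the references see the updated linear memory on the next launch
    cudaBindTexture(0, TexA, d_A, n * sizeof(float));
    cudaBindTexture(0, TexB, d_B, n * sizeof(float));
    updateB<<<grid, block>>>(d_B, n);
    updateA<<<grid, block>>>(d_A, n);
}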

Thanks very much~

Texture objects support linear memory (1D), pitch linear memory (2D) and cudaArrays (1D, 2D, 3D) as data sources.

Only the latter (cudaArray) may require another explicit data copy, as the data needs to be reordered into a space-filling-curve layout for optimized access.
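
For example, a texture object over plain linear memory just wraps an existing device pointer, so no extra copy is involved. A minimal sketch (assuming a float buffer d_buf of n elements that was allocated with cudaMalloc; the names are mine):

#include <cuda_runtime.h>
#include <cstring>

__global__ void copyThroughTex(cudaTextureObject_t tex, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = tex1Dfetch<float>(tex, i);   // cached read via the texture path
}

cudaTextureObject_t makeLinearTex(float *d_buf, size_t n)
{
    cudaResourceDesc resDesc;
    memset(&resDesc, 0, sizeof(resDesc));
    resDesc.resType                = cudaResourceTypeLinear;
    resDesc.res.linear.devPtr      = d_buf;                       // existing allocation, no copy
    resDesc.res.linear.desc        = cudaCreateChannelDesc<float>();
    resDesc.res.linear.sizeInBytes = n * sizeof(float);

    cudaTextureDesc texDesc;
    memset(&texDesc, 0, sizeof(texDesc));
    texDesc.readMode = cudaReadModeElementType;

    cudaTextureObject_t tex = 0;
    cudaCreateTextureObject(&tex, &resDesc, &texDesc, NULL);
    return tex;
}

Note that linear-memory textures only support tex1Dfetch() with integer indices (no filtering); if you need hardware interpolation, use pitch linear memory (2D) or a cudaArray instead.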

Maybe look into CUDA surfaces if you need both write access and cached read access.
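
Roughly, that route looks like this (my own sketch, assuming a float 2D cudaArray created with the surface load/store flag; a texture object could be created from the same resource descriptor for cached/filtered reads):

#include <cuda_runtime.h>
#include <cstring>

__global__ void scaleViaSurface(cudaSurfaceObject_t surf, int w, int h, float factor)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x < w && y < h) {
        float v;
        surf2Dread(&v, surf, x * sizeof(float), y);        // x coordinate is in bytes for surfaces
        surf2Dwrite(v * factor, surf, x * sizeof(float), y);
    }
}

cudaSurfaceObject_t makeSurface(int w, int h, cudaArray_t *outArr)
{
    cudaChannelFormatDesc ch = cudaCreateChannelDesc<float>();
    cudaMallocArray(outArr, &ch, w, h, cudaArraySurfaceLoadStore);   // writable cudaArray

    cudaResourceDesc resDesc;
    memset(&resDesc, 0, sizeof(resDesc));
    resDesc.resType = cudaResourceTypeArray;
    resDesc.res.array.array = *outArr;

    cudaSurfaceObject_t surf = 0;
    cudaCreateSurfaceObject(&surf, &resDesc);
    return surf;
}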

There is no need for recopying: just use linear memory for 1D vectors and pitch linear memory for matrices (or images) and bind the texture object to them. I am not sure how useful cudaArrays really are for the latest GPU architectures. I have never used them so far due to the complications (additional copying) you described.
It is even possible to implement certain 'in-place' (read/write) functions with texture objects if one takes some care, e.g. multiplying all values of an array by a constant factor.
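
To illustrate that last point, here is a rough sketch (my example, assuming pitch linear memory from cudaMallocPitch) of an in-place scale: each thread reads its element through the texture object and writes the result straight back to the underlying pointer, which is safe because every element is touched by exactly one thread.

#include <cuda_runtime.h>
#include <cstring>

__global__ void scaleInPlace(cudaTextureObject_t tex, float *d_img, size_t pitch,
                             int w, int h, float factor)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x < w && y < h) {
        float v = tex2D<float>(tex, x + 0.5f, y + 0.5f);           // cached read at the texel center
        float *row = (float *)((char *)d_img + y * pitch);
        row[x] = v * factor;                                        // write back to the same memory
    }
}

cudaTextureObject_t makePitch2DTex(float *d_img, size_t pitch, int w, int h)
{
    cudaResourceDesc resDesc;
    memset(&resDesc, 0, sizeof(resDesc));
    resDesc.resType                  = cudaResourceTypePitch2D;
    resDesc.res.pitch2D.devPtr       = d_img;                       // from cudaMallocPitch
    resDesc.res.pitch2D.desc         = cudaCreateChannelDesc<float>();
    resDesc.res.pitch2D.width        = w;
    resDesc.res.pitch2D.height       = h;
    resDesc.res.pitch2D.pitchInBytes = pitch;

    cudaTextureDesc texDesc;
    memset(&texDesc, 0, sizeof(texDesc));
    texDesc.readMode         = cudaReadModeElementType;
    texDesc.filterMode       = cudaFilterModePoint;
    texDesc.addressMode[0]   = cudaAddressModeClamp;
    texDesc.addressMode[1]   = cudaAddressModeClamp;
    texDesc.normalizedCoords = 0;

    cudaTextureObject_t tex = 0;
    cudaCreateTextureObject(&tex, &resDesc, &texDesc, NULL);
    return tex;
}

With unnormalized coordinates and point filtering, the +0.5f offsets simply hit the texel centers, so each thread reads exactly its own element before overwriting it.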

For more information regarding the usage of texture objects, search for 'texture' on the GTC on-demand website (Search | NVIDIA On-Demand).
My GTC 2018 presentation with ID "S8111" will also be available there in the next few weeks.

That would be my recommendation as well. I have never encountered a use case where cudaArrays provided a noticeable benefit (which doesn’t mean there couldn’t be cases where they provide a benefit).

If you use one of the newer architectures (Maxwell, Pascal) it is worth exploring whether using textures provides any performance benefit at all. If you want “free” fp16 → fp32 conversion or can make do with low-precision linear interpolation, the use of textures may still provide benefits.
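
For the fp16 → fp32 case, a minimal sketch (my example; assuming the data is stored as __half in linear memory and described with a 16-bit float channel format, so the texture unit promotes the values to fp32 before they reach the kernel):

#include <cuda_runtime.h>
#include <cuda_fp16.h>
#include <cstring>

__global__ void halfToFloat(cudaTextureObject_t tex, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = tex1Dfetch<float>(tex, i);   // fp16 -> fp32 conversion done by the texture unit
}

cudaTextureObject_t makeHalfTex(__half *d_half, size_t n)
{
    cudaResourceDesc resDesc;
    memset(&resDesc, 0, sizeof(resDesc));
    resDesc.resType                = cudaResourceTypeLinear;
    resDesc.res.linear.devPtr      = d_half;
    // 16-bit float channel; fetches return promoted fp32 values
    resDesc.res.linear.desc        = cudaCreateChannelDesc(16, 0, 0, 0, cudaChannelFormatKindFloat);
    resDesc.res.linear.sizeInBytes = n * sizeof(__half);

    cudaTextureDesc texDesc;
    memset(&texDesc, 0, sizeof(texDesc));
    texDesc.readMode = cudaReadModeElementType;

    cudaTextureObject_t tex = 0;
    cudaCreateTextureObject(&tex, &resDesc, &texDesc, NULL);
    return tex;
}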

Thanks to cbuchner1, HannesF99 and njuffa ~ I'll try pitch linear memory~

After checking the API, it seems only four types are available as texture object data sources. Does cudaResourceTypePitch2D also work for 3D cases?

enum cudaResourceType {
    cudaResourceTypeArray          = 0x00,
    cudaResourceTypeMipmappedArray = 0x01,
    cudaResourceTypeLinear         = 0x02,
    cudaResourceTypePitch2D        = 0x03
};

I think 3D access requires CUDA arrays (cudaMalloc3DArray(), cudaMemcpy3DParms, cudaMemcpy3D() … ). Would you need trilinear interpolation in the 3D texture? Otherwise you could use a 2D texture, but place the individual layers side by side…
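
If trilinear interpolation is needed, the cudaArray route would look roughly like this (sketch with made-up names: a float volume of size nx × ny × nz copied from the host array h_vol):

#include <cuda_runtime.h>
#include <cstring>

cudaTextureObject_t makeVolumeTex(const float *h_vol, int nx, int ny, int nz)
{
    cudaExtent extent = make_cudaExtent(nx, ny, nz);      // in elements for cudaArrays
    cudaChannelFormatDesc ch = cudaCreateChannelDesc<float>();

    cudaArray_t volArray;
    cudaMalloc3DArray(&volArray, &ch, extent);

    cudaMemcpy3DParms p;                                   // the explicit copy is unavoidable here
    memset(&p, 0, sizeof(p));
    p.srcPtr   = make_cudaPitchedPtr((void *)h_vol, nx * sizeof(float), nx, ny);
    p.dstArray = volArray;
    p.extent   = extent;
    p.kind     = cudaMemcpyHostToDevice;
    cudaMemcpy3D(&p);

    cudaResourceDesc resDesc;
    memset(&resDesc, 0, sizeof(resDesc));
    resDesc.resType = cudaResourceTypeArray;
    resDesc.res.array.array = volArray;

    cudaTextureDesc texDesc;
    memset(&texDesc, 0, sizeof(texDesc));
    texDesc.readMode       = cudaReadModeElementType;
    texDesc.filterMode     = cudaFilterModeLinear;         // trilinear interpolation
    texDesc.addressMode[0] = cudaAddressModeClamp;
    texDesc.addressMode[1] = cudaAddressModeClamp;
    texDesc.addressMode[2] = cudaAddressModeClamp;

    cudaTextureObject_t tex = 0;
    cudaCreateTextureObject(&tex, &resDesc, &texDesc, NULL);
    return tex;
    // in a kernel: float v = tex3D<float>(tex, x + 0.5f, y + 0.5f, z + 0.5f);
}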