Global load optimization using texture

Hi,
In my kernel, I need to load data from global memory to shared memory in a strided access pattern.
I have an array of 1024 x 1024 x 64 elements of type short2.
Every thread block needs to load 1024 elements into shared memory and convert them to float2.
The elements that each thread block needs to load are located in a strided pattern in global memory: the thread block loads every 64th element of the array (meaning the distance between two consecutively loaded elements is 64 x sizeof(short2)).
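A minimal sketch of the load described above, without textures, might look like this. The names (loadStrided, countMismatches), the mapping from block index to tile and column, and the small test size are illustrative assumptions rather than the actual kernel.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

constexpr int kStride   = 64;    // distance between loaded elements, in short2 units
constexpr int kPerBlock = 1024;  // elements loaded by each thread block

// Assumed mapping: block b reads column (b % 64) of tile (b / 64),
// where a tile is 1024*64 consecutive short2 elements.
__global__ void loadStrided(const short2* __restrict__ in, float2* __restrict__ out)
{
    __shared__ float2 tile[kPerBlock];
    const size_t base = (size_t)(blockIdx.x / kStride) * kPerBlock * kStride
                      + (blockIdx.x % kStride);
    const int t = threadIdx.x;

    // Neighbouring threads read addresses 64 * sizeof(short2) = 256 bytes apart.
    const short2 v = in[base + (size_t)t * kStride];
    tile[t] = make_float2((float)v.x, (float)v.y);
    __syncthreads();

    // Placeholder for the real work: copy the staged tile back out.
    out[(size_t)blockIdx.x * kPerBlock + t] = tile[t];
}

// Runs the kernel on a small input; returns the number of mismatches, or -1 on a CUDA error.
int countMismatches(int numTiles)
{
    const size_t n = (size_t)numTiles * kPerBlock * kStride;
    std::vector<short2> h_in(n);
    for (size_t i = 0; i < n; ++i) {
        const int s = (int)(i % 30000);
        h_in[i] = make_short2((short)s, (short)(-s));
    }

    short2* d_in  = nullptr;
    float2* d_out = nullptr;
    if (cudaMalloc(&d_in, n * sizeof(short2)) != cudaSuccess) return -1;
    if (cudaMalloc(&d_out, n * sizeof(float2)) != cudaSuccess) { cudaFree(d_in); return -1; }
    cudaMemcpy(d_in, h_in.data(), n * sizeof(short2), cudaMemcpyHostToDevice);

    const int blocks = numTiles * kStride;  // one block per (tile, column) pair
    loadStrided<<<blocks, kPerBlock>>>(d_in, d_out);

    std::vector<float2> h_out(n);
    const cudaError_t err =
        cudaMemcpy(h_out.data(), d_out, n * sizeof(float2), cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);
    if (err != cudaSuccess) return -1;

    int bad = 0;
    for (int b = 0; b < blocks; ++b) {
        for (int t = 0; t < kPerBlock; ++t) {
            const size_t src = (size_t)(b / kStride) * kPerBlock * kStride
                             + (b % kStride) + (size_t)t * kStride;
            const float2 r = h_out[(size_t)b * kPerBlock + t];
            if (r.x != (float)h_in[src].x || r.y != (float)h_in[src].y) ++bad;
        }
    }
    return bad;
}

int main()
{
    const int bad = countMismatches(2);  // 2 tiles instead of the full array
    printf("mismatches: %d\n", bad);
    return bad == 0 ? 0 : 1;
}
```

With this layout, the 32 threads of a warp read 32 addresses that are each 256 bytes apart, so no two of them fall in the same 32-byte memory sector.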
I am trying to improve the performance of the data loading by using texture memory, but so far the results with textures are almost identical to the results without them (23 GB/s on an RTX 5000).
I have tried dividing the array into tiles of 1024*64 elements and using a 1D or 2D texture for each tile, but both 1D and 2D texture fetches give the same results as not using textures at all.
As I understand it, the texture hardware is supposed to have a cache that can benefit spatially local access patterns to global memory (like the one in my kernel), and I want to take advantage of this cache.
Is there a way to use the texture cache to improve the performance of such a loading pattern?
In addition, I found in the Texture Object API documentation that tex1Dfetch “may optionally promote the integer to single precision floating point”. As I said, in my program I need to convert the array elements from short2 to float2. How can I use the texture hardware for that purpose?
Thank you in advance for your time,
Ori

Bullet point 4 here may be of use.

Thank you. Is there a way to perform the conversion to float without normalizing?
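As far as I know, the texture unit's integer-to-float promotion (cudaReadModeNormalizedFloat) always maps the value into [-1.0, 1.0] for signed 16-bit data, so the result would have to be scaled back in the kernel. One alternative is to read the raw short2 with cudaReadModeElementType and cast it in the kernel. Below is a minimal sketch of that approach on a linear-memory texture object; the names fetchAsFloat and countMismatches and the test size are hypothetical, and it makes no claim about performance.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Fetches raw short2 texels and converts them to float2 without normalization.
__global__ void fetchAsFloat(cudaTextureObject_t tex, float2* out, int n)
{
    const int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        const short2 v = tex1Dfetch<short2>(tex, i);    // element-type read, no promotion
        out[i] = make_float2((float)v.x, (float)v.y);   // plain cast, e.g. -5 becomes -5.0f
    }
}

// Returns the number of mismatches against a host-side cast, or -1 on a CUDA error.
int countMismatches(int n)
{
    std::vector<short2> h_in(n);
    for (int i = 0; i < n; ++i)
        h_in[i] = make_short2((short)(i - n / 2), (short)(n / 2 - i));

    short2* d_in  = nullptr;
    float2* d_out = nullptr;
    if (cudaMalloc(&d_in, n * sizeof(short2)) != cudaSuccess) return -1;
    if (cudaMalloc(&d_out, n * sizeof(float2)) != cudaSuccess) { cudaFree(d_in); return -1; }
    cudaMemcpy(d_in, h_in.data(), n * sizeof(short2), cudaMemcpyHostToDevice);

    // Describe the linear device buffer as a 1D texture of short2 elements.
    cudaResourceDesc res = {};
    res.resType                = cudaResourceTypeLinear;
    res.res.linear.devPtr      = d_in;
    res.res.linear.desc        = cudaCreateChannelDesc<short2>();
    res.res.linear.sizeInBytes = n * sizeof(short2);

    cudaTextureDesc td = {};
    td.readMode = cudaReadModeElementType;  // return the stored integers unchanged

    cudaTextureObject_t tex = 0;
    if (cudaCreateTextureObject(&tex, &res, &td, nullptr) != cudaSuccess) {
        cudaFree(d_in);
        cudaFree(d_out);
        return -1;
    }

    const int threads = 256;
    fetchAsFloat<<<(n + threads - 1) / threads, threads>>>(tex, d_out, n);

    std::vector<float2> h_out(n);
    const cudaError_t err =
        cudaMemcpy(h_out.data(), d_out, n * sizeof(float2), cudaMemcpyDeviceToHost);
    cudaDestroyTextureObject(tex);
    cudaFree(d_in);
    cudaFree(d_out);
    if (err != cudaSuccess) return -1;

    int bad = 0;
    for (int i = 0; i < n; ++i)
        if (h_out[i].x != (float)h_in[i].x || h_out[i].y != (float)h_in[i].y) ++bad;
    return bad;
}

int main()
{
    const int bad = countMismatches(4096);
    printf("mismatches: %d\n", bad);
    return bad == 0 ? 0 : 1;
}
```

The cast is a single conversion instruction per component, so doing it in the kernel rather than in the texture unit is unlikely to be the bottleneck for a load that is limited by memory bandwidth.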