Texture or surface memory with byte addressing: getting a short4 (or int4 / float4) in a single read

I have searched for a solution or an example for this problem, but I still can’t figure it out.

The CUDA C Programming Guide says “Unlike texture memory, surface memory uses byte addressing”. So I compute the target position in bytes, and the target index does not have to be a multiple of 4. Then, reading a short4 at that position, I expect to get the 4 consecutive short elements starting at the target index. But it does not work properly.
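To make the setup concrete, here is a minimal sketch of the aligned case only. It assumes a 1D CUDA array of short4 accessed through a surface object; the kernel name readShort4 and the test data are made up for illustration. The x coordinate passed to surf1Dread is a byte offset, so element i sits at i * sizeof(short4). The sketch does not show whether an offset that is not a multiple of sizeof(short4) is allowed, which is the open question here.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel: each thread reads one short4 from a 1D surface.
__global__ void readShort4(cudaSurfaceObject_t surf, short4 *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        short4 v;
        // Surface x coordinate is in bytes: element index times element size.
        surf1Dread(&v, surf, i * (int)sizeof(short4));
        out[i] = v;
    }
}

int main()
{
    const int n = 4;
    short4 h_in[n], h_out[n];
    for (int i = 0; i < n; ++i)
        h_in[i] = make_short4(4 * i, 4 * i + 1, 4 * i + 2, 4 * i + 3);

    // 1D CUDA array of short4, allocated for surface load/store.
    cudaChannelFormatDesc desc = cudaCreateChannelDesc<short4>();
    cudaArray_t arr;
    cudaMallocArray(&arr, &desc, n, 0, cudaArraySurfaceLoadStore);
    cudaMemcpy2DToArray(arr, 0, 0, h_in, n * sizeof(short4),
                        n * sizeof(short4), 1, cudaMemcpyHostToDevice);

    // Wrap the array in a surface object.
    cudaResourceDesc res = {};
    res.resType = cudaResourceTypeArray;
    res.res.array.array = arr;
    cudaSurfaceObject_t surf = 0;
    cudaCreateSurfaceObject(&surf, &res);

    short4 *d_out;
    cudaMalloc(&d_out, n * sizeof(short4));
    readShort4<<<1, n>>>(surf, d_out, n);
    cudaMemcpy(h_out, d_out, sizeof(h_out), cudaMemcpyDeviceToHost);

    // Report any launch or runtime error before printing the values.
    cudaError_t err = cudaGetLastError();
    if (err != cudaSuccess)
        printf("CUDA error: %s\n", cudaGetErrorString(err));
    for (int i = 0; i < n; ++i)
        printf("%d: %d %d %d %d\n", i, h_out[i].x, h_out[i].y,
               h_out[i].z, h_out[i].w);

    cudaDestroySurfaceObject(surf);
    cudaFreeArray(arr);
    cudaFree(d_out);
    return 0;
}
```

What I would like instead is to pass a byte offset such as (4 * i + 1) * sizeof(short), so that the short4 starts in the middle of an element.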

Can you give me a hand, please?