I have searched for a solution or an example for this problem, but I still can’t figure it out.
According to the CUDA C Programming Guide, “Unlike texture memory, surface memory uses byte addressing”. So I address the target index in bytes, and the target index is not necessarily a multiple of 4. Then, using short4, I try to read 4 consecutive elements starting from the target index. But it does not work properly.
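To make the byte addressing concrete, here is a minimal sketch, not my actual code. It assumes a 1D CUDA array of `short4` accessed through a surface object, and it assumes (as far as I understand) that the byte offset given to a surface read has to be a multiple of the size of the type being read. Under that assumption, a window starting at an arbitrary short index is built from two aligned `short4` reads. The names `short4ByteOffset`, `stitch`, `readWindow`, `windowKernel` and `runDemo` are hypothetical.

```cuda
#include <cuda_runtime.h>
#include <cstdio>

// Hypothetical helper: byte offset of the short4 element at index `elem`.
// Surface x coordinates are in bytes, and sizeof(short4) is 8.
__host__ __device__ inline int short4ByteOffset(int elem) {
    return elem * static_cast<int>(sizeof(short4));
}

// Hypothetical helper: given two neighbouring short4 elements, return the
// four shorts that start `shift` (0..3) shorts into the first element.
__host__ __device__ inline short4 stitch(short4 a, short4 b, int shift) {
    short s[8] = {a.x, a.y, a.z, a.w, b.x, b.y, b.z, b.w};
    return make_short4(s[shift], s[shift + 1], s[shift + 2], s[shift + 3]);
}

// Reads four consecutive shorts starting at short index `start`, which need
// not be a multiple of 4, using only aligned short4 surface reads.
// cudaBoundaryModeZero makes an out-of-range read return zeros, not trap.
__device__ short4 readWindow(cudaSurfaceObject_t surf, int start) {
    int elem = start / 4;
    int shift = start % 4;
    short4 a = surf1Dread<short4>(surf, short4ByteOffset(elem), cudaBoundaryModeZero);
    short4 b = surf1Dread<short4>(surf, short4ByteOffset(elem + 1), cudaBoundaryModeZero);
    return stitch(a, b, shift);
}

__global__ void windowKernel(cudaSurfaceObject_t surf, short4* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = readWindow(surf, i);
}

// Host-side demo: fills 16 shorts with 0..15, reads every window that starts
// at short index 0..12, and returns the number of mismatching windows
// (or -1 if the setup fails).
int runDemo() {
    const int numElems = 4;            // 4 short4 elements = 16 shorts
    const int n = 13;                  // window starts 0..12
    short host[numElems * 4];
    for (int i = 0; i < numElems * 4; ++i) host[i] = static_cast<short>(i);

    cudaChannelFormatDesc desc = cudaCreateChannelDesc<short4>();
    cudaArray_t arr = nullptr;
    if (cudaMallocArray(&arr, &desc, numElems, 0, cudaArraySurfaceLoadStore) != cudaSuccess)
        return -1;
    cudaMemcpy2DToArray(arr, 0, 0, host, sizeof(host), sizeof(host), 1,
                        cudaMemcpyHostToDevice);

    cudaResourceDesc res = {};
    res.resType = cudaResourceTypeArray;
    res.res.array.array = arr;
    cudaSurfaceObject_t surf = 0;
    if (cudaCreateSurfaceObject(&surf, &res) != cudaSuccess) return -1;

    short4* dOut = nullptr;
    cudaMalloc(&dOut, n * sizeof(short4));
    windowKernel<<<1, 32>>>(surf, dOut, n);

    short4 out[n];
    if (cudaMemcpy(out, dOut, sizeof(out), cudaMemcpyDeviceToHost) != cudaSuccess)
        return -1;

    int bad = 0;
    for (int i = 0; i < n; ++i)
        if (out[i].x != i || out[i].y != i + 1 || out[i].z != i + 2 || out[i].w != i + 3)
            ++bad;

    cudaFree(dOut);
    cudaDestroySurfaceObject(surf);
    cudaFreeArray(arr);
    return bad;
}
```

Calling `runDemo()` from `main` and printing its return value should show whether the windows come back as expected on a given GPU. Two aligned reads are used here only because of the alignment assumption above; if unaligned `short4` reads are in fact allowed, a single `surf1Dread<short4>` at `start * sizeof(short)` would be simpler.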
Could you give me a hand, please?