I have an array of unsigned short. Each thread of my kernel needs to read 8 consecutive elements from this array. I think it would be more efficient to read these eight 16-bit elements as 64-bit chunks (two chunks of four elements each). If I read the 8 elements one by one, the adjacent elements may have been evicted from the cache by the time the thread comes back to read them, causing cache misses.
So, to improve the cache hit rate, I plan to read four 16-bit elements in one go. For that, is it OK to define the texture channel as 64-bit? And can I apply normal pointer arithmetic (offsets etc.) while reading 64 bits at a time?
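A minimal sketch of the offset arithmetic, written as host C++ so it can be run and checked. Once the data is viewed in 64-bit units, advancing by one chunk moves 8 bytes (four elements), so chunk i starts at element 4*i and thread t's 8 elements are chunks 2*t and 2*t+1. The helper names `load_chunk` and `load_thread_chunks` are hypothetical. The sketch uses `memcpy` because casting a `uint16_t*` to a `uint64_t*` breaks strict aliasing in standard C++. In a CUDA kernel one would more typically cast the pointer (e.g. to `const uint2*` or `const unsigned long long*`). That cast requires the address to be 8-byte aligned. This holds when the array starts at the beginning of a `cudaMalloc` allocation, which is aligned to at least 256 bytes. With a texture instead of a plain pointer, my understanding is that a 64-bit texel is usually declared as a multi-component type such as `ushort4` rather than a single 64-bit channel, but that is worth checking against the CUDA Programming Guide.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical helper: load the i-th 64-bit chunk of a uint16_t array,
// i.e. elements 4*i .. 4*i+3 (4 elements * 2 bytes = 8 bytes).
std::uint64_t load_chunk(const std::uint16_t* data, std::size_t i) {
    assert(data != nullptr);
    std::uint64_t word;
    std::memcpy(&word, data + 4 * i, sizeof word);
    return word;
}

// Hypothetical helper: thread t owns elements 8*t .. 8*t+7,
// which are exactly the two 64-bit chunks 2*t and 2*t+1.
void load_thread_chunks(const std::uint16_t* data, std::size_t t,
                        std::uint64_t& lo, std::uint64_t& hi) {
    lo = load_chunk(data, 2 * t);      // elements 8*t   .. 8*t+3
    hi = load_chunk(data, 2 * t + 1);  // elements 8*t+4 .. 8*t+7
}
```

For thread 1 of a 16-element array this returns the chunks holding elements 8..11 and 12..15.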
Once the 64-bit data has been read in, I would extract the 16-bit elements from it using the shift and AND operators. Does that work?
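A short sketch of the shift-and-mask extraction. The same expression compiles unchanged inside a CUDA `__device__` function. It assumes the 64-bit value was loaded from memory on a little-endian device (NVIDIA GPUs are little-endian). Under that assumption, lane k (the k-th of the four elements) occupies bits 16*k to 16*k+15. The names `extract16` and `unpack4` are made up for illustration.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

// Extract 16-bit lane k (0..3) of a 64-bit word: shift it down, then mask.
// Lane 0 is the least significant 16 bits.
std::uint16_t extract16(std::uint64_t word, unsigned k) {
    assert(k < 4);
    return static_cast<std::uint16_t>((word >> (16 * k)) & 0xFFFFu);
}

// Unpack all four lanes of one 64-bit chunk, lowest lane first.
std::array<std::uint16_t, 4> unpack4(std::uint64_t word) {
    return {extract16(word, 0), extract16(word, 1),
            extract16(word, 2), extract16(word, 3)};
}
```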
If all of the above makes sense, then how are the 16-bit elements arranged within a single 64-bit element? If I read the 1st to 4th elements, will the 1st element lie in the leftmost or the rightmost part of the 64-bit element?
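A small sketch that shows the layout directly. It copies four consecutive 16-bit elements into one 64-bit value, the way a single 64-bit load would see them. On a little-endian machine, the 1st element (lowest address) lands in bits 0 to 15, the least significant end. That is the "rightmost" part when the value is written in hex. The 4th element lands in bits 48 to 63. NVIDIA GPUs and common x86/ARM hosts are little-endian. The test below assumes a little-endian host. The name `pack_as_loaded` is hypothetical.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Reinterpret four consecutive 16-bit elements as one 64-bit value,
// byte for byte, like a single 64-bit load from that address would.
std::uint64_t pack_as_loaded(const std::uint16_t elems[4]) {
    assert(elems != nullptr);
    std::uint64_t word;
    std::memcpy(&word, elems, sizeof word);
    return word;
}
```

With elements {0x1111, 0x2222, 0x3333, 0x4444} in memory, a little-endian load yields 0x4444333322221111, so the 1st element is in the lowest 16 bits.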