This was my initial guess. Thank you. The solution: I used a single-channel texture and then, within the kernel, used three sequential texture accesses. I think it’s slower than a single 4-byte read…
however, I suspect that repacking the host RGB image into a device RGBA texture would take even longer.
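For reference, the host-side repacking step being weighed here could look roughly like the sketch below. It is only an illustration: `repackRgbToRgba` is a made-up helper name, the input is assumed to be tightly packed 8-bit RGB with no row padding, and the upload to the device is left out.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper (not from any library): expand tightly packed RGB
// (3 bytes per pixel) into RGBA (4 bytes per pixel) on the host.
// The fourth byte of every pixel is filled with `pad`.
std::vector<std::uint8_t> repackRgbToRgba(const std::vector<std::uint8_t>& rgb,
                                          std::uint8_t pad = 0) {
    const std::size_t pixels = rgb.size() / 3;  // a trailing partial pixel is ignored
    std::vector<std::uint8_t> rgba(pixels * 4);
    for (std::size_t i = 0; i < pixels; ++i) {
        rgba[4 * i + 0] = rgb[3 * i + 0];  // R
        rgba[4 * i + 1] = rgb[3 * i + 1];  // G
        rgba[4 * i + 2] = rgb[3 * i + 2];  // B
        rgba[4 * i + 3] = pad;             // unused fourth channel
    }
    return rgba;
}
```

This is one extra pass over the whole image on the CPU before the copy, which is the cost being compared against the three texture reads per pixel.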
On the other hand, if each thread accesses three sequential texture bytes, does that mean the access is non-coalesced?
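To make the access pattern concrete, here is a small host-side sketch of the index arithmetic, assuming the RGB image is bound as a single-channel 8-bit texture that is 3*W texels wide and that thread `px` handles pixel column `px`. Both function names are hypothetical. Note that, as far as the usual documentation goes, texture fetches are served through the texture cache, so the global-memory coalescing rules may not apply to them in the same way.

```cpp
#include <array>
#include <cstddef>

// Hypothetical helper: x-coordinates of the three single-channel texels
// (R, G, B) that the thread handling pixel column `px` would fetch.
std::array<std::size_t, 3> channelTexels(std::size_t px) {
    return {3 * px, 3 * px + 1, 3 * px + 2};
}

// Number of bytes in one row covered by `threads` consecutive threads
// starting at pixel column `firstPx` (requires threads >= 1).
std::size_t spanBytes(std::size_t firstPx, std::size_t threads) {
    const std::size_t first = channelTexels(firstPx)[0];
    const std::size_t last = channelTexels(firstPx + threads - 1)[2];
    return last - first + 1;
}
```

Under these assumptions, a warp of 32 threads working on adjacent pixels touches one contiguous 96-byte run of the row, so the reads are at least spatially local even though each thread issues three fetches.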
Found this thread while working on a problem of my own…
I don’t think using a 4-tuple texture for an RGB image is a good idea, because the tex2D function will read the 3-tuple data as 4-tuples, so the indexes will get messy.
If you mean transforming the image and putting a zero value in the fourth element, I think that wastes quite a lot of memory, especially if you deal with large images.
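The memory cost of the padded layout is easy to estimate. The sketch below assumes 8-bit channels and ignores any row pitch or alignment padding the driver may add; the function names are made up for illustration.

```cpp
#include <cstddef>

// Bytes needed for a tightly packed image with 8-bit channels.
std::size_t imageBytes(std::size_t width, std::size_t height, std::size_t channels) {
    return width * height * channels;
}

// Extra bytes spent on the unused fourth channel when RGB is stored as RGBA.
std::size_t paddingOverheadBytes(std::size_t width, std::size_t height) {
    return imageBytes(width, height, 4) - imageBytes(width, height, 3);
}
```

For a 4096 x 4096 image this gives 50,331,648 bytes as RGB and 67,108,864 bytes as RGBA, so the padding costs 16,777,216 bytes (16 MiB), one third more than the original.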