I am profiling my program and found that using texture memory actually slows it down slightly. I checked the texture cache hit ratio, and it is quite high (~80%). Does anyone have an idea why using textures would slow down overall performance? What should I check in the profiler?
You could also use constant memory, which has low latency (similar to the L1 cache), stays compatible with pre-Fermi CUDA hardware, and performs well for reading constant data.
Note that “constant” is relative to kernel execution: the data only has to stay unchanged while a kernel is running. So you can launch your kernel many times (or launch other kernels) in the same context to build the data, copy it into constant memory, and then use it for processing.
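Here is a minimal sketch of that pattern, assuming a small coefficient table that fits in constant memory. The names `coeffs`, `apply_poly`, and `eval_poly` are made up for illustration, and error checking is left out to keep it short. The table is uploaded with `cudaMemcpyToSymbol` before each launch, so it can change between launches while staying read-only inside the kernel.

```cuda
#include <vector>
#include <cuda_runtime.h>

// Hypothetical coefficient table; the name and size are illustrative only.
__constant__ float coeffs[4];

// Evaluates c0 + c1*x + c2*x^2 + c3*x^3 with Horner's rule.
// In each iteration every thread reads the same coeffs[k],
// so the constant read is a broadcast across the warp.
__global__ void apply_poly(const float* x, float* y, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float acc = 0.0f;
        for (int k = 3; k >= 0; --k) {
            acc = acc * x[i] + coeffs[k];
        }
        y[i] = acc;
    }
}

// Host helper (hypothetical): upload coefficients, launch, copy the result back.
std::vector<float> eval_poly(const float (&c)[4], const std::vector<float>& x) {
    int n = static_cast<int>(x.size());
    if (n == 0) return {};

    float* d_x = nullptr;
    float* d_y = nullptr;
    cudaMalloc(&d_x, n * sizeof(float));
    cudaMalloc(&d_y, n * sizeof(float));
    cudaMemcpy(d_x, x.data(), n * sizeof(float), cudaMemcpyHostToDevice);

    // Constant memory is written from the host between kernel launches.
    cudaMemcpyToSymbol(coeffs, c, sizeof(c));

    apply_poly<<<(n + 255) / 256, 256>>>(d_x, d_y, n);

    std::vector<float> y(n);
    cudaMemcpy(y.data(), d_y, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_x);
    cudaFree(d_y);
    return y;
}
```

Calling `eval_poly` twice with different coefficient arrays reuses the same kernel with new constant data each time.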
One word of warning: constant memory is designed to broadcast the same word to all threads in a warp. If different threads access different constant data, the reads get serialized, which is kind of like having automatic bank conflicts in shared memory.
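To make the two access patterns concrete, here is a rough sketch with two kernels. The names `table`, `read_uniform`, `read_divergent`, and `run_reads` are hypothetical, and error checking is omitted. Both kernels return correct values; the difference is only in how the constant cache is expected to serve the warp, so any slowdown would have to be confirmed by timing or profiling on your own hardware.

```cuda
#include <vector>
#include <cuda_runtime.h>

// Hypothetical 32-entry lookup table in constant memory.
__constant__ float table[32];

// Uniform access: every thread in the warp reads the same word table[k],
// which is the broadcast case constant memory is designed for.
__global__ void read_uniform(float* out, int k) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    out[i] = table[k];
}

// Divergent access: each lane reads a different word, so the warp's
// request is split by distinct address instead of being one broadcast.
__global__ void read_divergent(float* out) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    out[i] = table[threadIdx.x % 32];
}

// Host helper (hypothetical): fills table[j] = j, runs one 32-thread block,
// and returns what each thread read. k must be in [0, 31].
std::vector<float> run_reads(bool divergent, int k) {
    float h_table[32];
    for (int j = 0; j < 32; ++j) h_table[j] = static_cast<float>(j);
    cudaMemcpyToSymbol(table, h_table, sizeof(h_table));

    float* d_out = nullptr;
    cudaMalloc(&d_out, 32 * sizeof(float));
    if (divergent) {
        read_divergent<<<1, 32>>>(d_out);
    } else {
        read_uniform<<<1, 32>>>(d_out, k);
    }

    std::vector<float> out(32);
    cudaMemcpy(out.data(), d_out, 32 * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_out);
    return out;
}
```

If your indexing looks like the second kernel, shared memory or plain global loads are probably worth trying instead of constant memory.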