I ported my OpenCL kernel to use texture memory instead of global memory.
Unfortunately, the new kernel doesn’t seem to be any faster than the kernel using global memory.
The NVIDIA profiler says the new kernel performs only about 20% of the global memory accesses done by the older kernel version.
However, the instruction and branch counters increased to about 4-5 times what they were before.
How should I interpret these numbers? They don’t make sense to me at all.
Any advice appreciated. Thanks in advance.