I have written a kernel for a GeForce 8800 GT card implementing a 2D image convolution (actually correlation), with the image residing in texture memory and the filter coefficients in constant memory. The image is segmented into blocks where each thread computes one output pixel (the inner product between the coefficients and the pixel values) in a serial O(n^2) loop, usually with n = 15.
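To make the structure concrete, here is a minimal sketch of what such a kernel looks like; the names, the texture/constant declarations, and the addressing details are illustrative, not my actual code:

```cuda
#define MAX_N 15

// Filter coefficients in constant memory (illustrative name)
__constant__ float d_filter[MAX_N * MAX_N];

// Image in texture memory (old texture-reference API, as on the 8800 GT)
texture<float, 2, cudaReadModeElementType> texImage;

__global__ void correlate2D(float *out, int width, int height, int n)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= width || y >= height) return;

    int r = n / 2;
    float sum = 0.0f;
    // Serial O(n^2) loop: inner product of coefficients and neighborhood.
    // Correlation, so the filter is not flipped.
    for (int j = 0; j < n; ++j)
        for (int i = 0; i < n; ++i)
            sum += d_filter[j * n + i] *
                   tex2D(texImage, x + i - r + 0.5f, y + j - r + 0.5f);

    out[y * width + x] = sum;  // one output pixel per thread
}
```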
No matter how many threads I assign to each block (and thereby how many blocks I launch), the occupancy as computed by the visual profiler never exceeds 50%. This seems really low for a problem that is so inherently parallel.
My question is: how can this be? Is it because of read conflicts in the texture memory, the amount of serial work in each thread, or something different? I have seen that NVIDIA has published a white paper on 2D image convolution using shared memory instead of texture memory. Is this really so much faster than texture memory in this case, where so much of the same data is used by all threads?
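For comparison, my understanding of the shared-memory approach is roughly the following: each block first stages its output tile plus an apron of r border pixels into shared memory, and all threads then read their neighborhoods from there. The tile size, radius, and border clamping below are assumptions of mine, not taken from the white paper:

```cuda
#define TILE 16
#define R 7  // filter radius, so n = 2*R + 1 = 15

__constant__ float d_filter[(2 * R + 1) * (2 * R + 1)];

__global__ void correlateShared(const float *in, float *out,
                                int width, int height)
{
    // Tile plus apron: (16 + 14)^2 floats = ~3.6 KB, fits in 16 KB shared memory
    __shared__ float tile[TILE + 2 * R][TILE + 2 * R];

    int x = blockIdx.x * TILE + threadIdx.x;
    int y = blockIdx.y * TILE + threadIdx.y;

    // Cooperatively load tile + apron, clamping reads at the image borders
    for (int dy = threadIdx.y; dy < TILE + 2 * R; dy += blockDim.y)
        for (int dx = threadIdx.x; dx < TILE + 2 * R; dx += blockDim.x) {
            int gx = min(max((int)(blockIdx.x * TILE) + dx - R, 0), width - 1);
            int gy = min(max((int)(blockIdx.y * TILE) + dy - R, 0), height - 1);
            tile[dy][dx] = in[gy * width + gx];
        }
    __syncthreads();

    if (x >= width || y >= height) return;

    float sum = 0.0f;
    // Same serial inner product, but neighbors are reused from shared memory
    for (int j = 0; j < 2 * R + 1; ++j)
        for (int i = 0; i < 2 * R + 1; ++i)
            sum += d_filter[j * (2 * R + 1) + i] *
                   tile[threadIdx.y + j][threadIdx.x + i];

    out[y * width + x] = sum;
}
```

The point, as I understand it, is that neighboring threads reuse pixels already loaded by the block instead of each fetching them again, but I don't know whether that actually beats the texture cache here.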
Thanks in advance for any comments.