Low occupancy ratio in image correlation using texture memory


I have written a kernel for a GeForce 8800 GT card implementing a 2D image convolution (actually a correlation), with the image residing in texture memory and the filter coefficients in constant memory. The image is segmented into blocks, where each thread computes one output pixel (the inner product between coefficients and pixel values) in an O(n^2), usually n = 15, serial loop.
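For reference, the structure described is roughly the following (a sketch with illustrative names, not the actual code; it assumes point filtering and unnormalized texture coordinates):

```cuda
// Image in a 2D texture, filter coefficients in constant memory.
texture<float, 2, cudaReadModeElementType> texImage;
__constant__ float d_coeffs[15 * 15];

// One thread per output pixel; serial O(n^2) inner-product loop.
__global__ void correlate(float *out, int width, int height, int n)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= width || y >= height) return;

    float sum = 0.0f;
    for (int j = 0; j < n; j++)
        for (int i = 0; i < n; i++)
            sum += d_coeffs[j * n + i] *
                   tex2D(texImage, x + i - n / 2, y + j - n / 2);
    out[y * width + x] = sum;
}
```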

No matter how many threads I assign to each block (and thereby how many blocks I launch), the occupancy as computed by the visual profiler never exceeds 50%. This seems really low for a problem that is so inherently parallel.

My question is, how can this be? Is it because of read conflicts in texture memory, the amount of serial work in each thread, or something else? I have seen that NVIDIA has published a white paper on 2D image convolution using shared memory instead of texture memory. Is that really so much faster than texture memory in this case, where so much of the same data is used by all threads?

Thanks in advance for any comments.

Occupancy has nothing to do with how “inherently parallel” an algorithm is. It is determined by the number of registers, the amount of shared memory, and the block size used by the kernel.

And 50% is a pretty good occupancy, especially for a bandwidth-bound kernel. Typically, a boost from 50% to 66% occupancy increases performance by only a few percent, and a further boost to 100% rarely helps beyond that.
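For concreteness, here is a hypothetical calculation of how a 50% ceiling can arise. The resource limits are the published figures for compute capability 1.1 (which the 8800 GT is); the register count and block size are made up for illustration:

```
registers per multiprocessor:     8192
max resident threads per SM:      768 (24 warps)

kernel using 21 registers/thread, blocks of 192 threads:
  threads allowed by registers = 8192 / 21 ≈ 390
  -> 2 resident blocks of 192 = 384 threads (a 3rd block would need 576)
  occupancy = 384 / 768 = 50%
```

In other words, register pressure alone can cap occupancy at 50% regardless of how many blocks are launched.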

Using shared memory is preferable to texture memory here because each pixel is guaranteed to be read from global memory only once. With texture memory, you are at the mercy of the small texture cache, so each pixel may be read from slow global memory many times.
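A minimal sketch of the shared-memory approach, assuming 16x16 blocks and a fixed filter radius (the NVIDIA convolution sample is more refined, e.g. it avoids uncoalesced apron loads):

```cuda
#define RADIUS 7   // for n = 15
#define TILE  16

__constant__ float d_coeffs[(2 * RADIUS + 1) * (2 * RADIUS + 1)];

__global__ void correlateShared(const float *in, float *out,
                                int width, int height)
{
    // Tile plus apron, staged in shared memory: each pixel is fetched
    // from global memory once per block instead of once per thread.
    __shared__ float tile[TILE + 2 * RADIUS][TILE + 2 * RADIUS];

    int gx = blockIdx.x * TILE + threadIdx.x;
    int gy = blockIdx.y * TILE + threadIdx.y;

    // Cooperative load: threads stride over the (TILE+2R)^2 region,
    // clamping coordinates at the image border.
    for (int dy = threadIdx.y; dy < TILE + 2 * RADIUS; dy += TILE)
        for (int dx = threadIdx.x; dx < TILE + 2 * RADIUS; dx += TILE) {
            int sx = min(max(blockIdx.x * TILE + dx - RADIUS, 0), width - 1);
            int sy = min(max(blockIdx.y * TILE + dy - RADIUS, 0), height - 1);
            tile[dy][dx] = in[sy * width + sx];
        }
    __syncthreads();

    if (gx < width && gy < height) {
        float sum = 0.0f;
        for (int j = 0; j < 2 * RADIUS + 1; j++)
            for (int i = 0; i < 2 * RADIUS + 1; i++)
                sum += d_coeffs[j * (2 * RADIUS + 1) + i] *
                       tile[threadIdx.y + j][threadIdx.x + i];
        out[gy * width + gx] = sum;
    }
}
```

Note that with a 7-pixel radius the apron is large relative to a 16x16 tile, so the per-block reuse is what pays for the extra loads.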

Since you are only computing inner products, your kernel is almost certainly bandwidth bound, and your threads spend most of their time waiting on memory reads anyway. Increasing the occupancy only increases the number of threads that can be waiting at the same time.

I agree. In my opinion, absolutely make sure you have at least 30% occupancy; after that, optimizing other things is likely to be a better use of your time (just my experience, mostly with operations involving matrices).

Look at the register usage, though; it may be quite high (in most cases I try to stay under 16). It can often be reduced by making rarely accessed values, e.g. the size variable if you have

for (i = 0; i < size; i++)

as your outer loop, a “volatile shared” variable.
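The trick described above could be sketched like this (illustrative names; the idea is that a value kept in shared memory is re-read when needed rather than pinned in a register for the whole kernel):

```cuda
__global__ void example(const float *in, float *out, int size)
{
    // Loop bound lives in shared memory, not a register; volatile
    // forces the compiler to reload it instead of caching it.
    volatile __shared__ int s_size;
    if (threadIdx.x == 0) s_size = size;
    __syncthreads();

    float acc = 0.0f;
    for (int i = 0; i < s_size; i++)
        acc += in[i];
    out[threadIdx.x] = acc;
}
```

Whether this actually saves a register depends on the compiler version, so it is worth checking the per-kernel register count (e.g. with the -cubin or --ptxas-options=-v output) before and after.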