I have written a kernel for a GeForce 8800 GT card implementing a 2D image convolution (actually correlation), with the image residing in texture memory and the filter coefficients in constant memory. The image is segmented into blocks where each thread computes one output pixel (the inner product between the coefficients and the pixel values) in a serial O(n^2) loop, usually with n = 15.
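To make the structure concrete, here is a minimal sketch of what such a kernel looks like; the names, the texture/constant declarations, and the addressing details are illustrative, not my actual code:

```cuda
#define MAX_N 15

// Filter coefficients in constant memory (illustrative name)
__constant__ float d_filter[MAX_N * MAX_N];

// Image in texture memory (old texture-reference API, as on the 8800 GT)
texture<float, 2, cudaReadModeElementType> texImage;

__global__ void correlate2D(float *out, int width, int height, int n)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= width || y >= height) return;

    int r = n / 2;
    float sum = 0.0f;
    // Serial O(n^2) loop: inner product of coefficients and neighborhood.
    // Correlation, so the filter is not flipped.
    for (int j = 0; j < n; ++j)
        for (int i = 0; i < n; ++i)
            sum += d_filter[j * n + i] *
                   tex2D(texImage, x + i - r + 0.5f, y + j - r + 0.5f);

    out[y * width + x] = sum;  // one output pixel per thread
}
```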
No matter how many threads I assign to each block (and thereby how many blocks I launch), the occupancy as computed by the visual profiler never exceeds 50%. This seems really low for a problem that is so inherently parallel.
My question is: how can this be? Is it because of read conflicts in the texture memory, the amount of serial work in each thread, or something different? I have seen that NVIDIA has published a white paper on 2D image convolution using shared memory instead of texture memory. Is this really so much faster than texture memory in this case, where so much of the same data is used by all threads?
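For comparison, my understanding of the shared-memory approach is roughly the following: each block first stages its output tile plus an apron of r border pixels into shared memory, and all threads then read their neighborhoods from there. The tile size, radius, and border clamping below are assumptions of mine, not taken from the white paper:

```cuda
#define TILE 16
#define R 7  // filter radius, so n = 2*R + 1 = 15

__constant__ float d_filter[(2 * R + 1) * (2 * R + 1)];

__global__ void correlateShared(const float *in, float *out,
                                int width, int height)
{
    // Tile plus apron: (16 + 14)^2 floats = ~3.6 KB, fits in 16 KB shared memory
    __shared__ float tile[TILE + 2 * R][TILE + 2 * R];

    int x = blockIdx.x * TILE + threadIdx.x;
    int y = blockIdx.y * TILE + threadIdx.y;

    // Cooperatively load tile + apron, clamping reads at the image borders
    for (int dy = threadIdx.y; dy < TILE + 2 * R; dy += blockDim.y)
        for (int dx = threadIdx.x; dx < TILE + 2 * R; dx += blockDim.x) {
            int gx = min(max((int)(blockIdx.x * TILE) + dx - R, 0), width - 1);
            int gy = min(max((int)(blockIdx.y * TILE) + dy - R, 0), height - 1);
            tile[dy][dx] = in[gy * width + gx];
        }
    __syncthreads();

    if (x >= width || y >= height) return;

    float sum = 0.0f;
    // Same serial inner product, but neighbors are reused from shared memory
    for (int j = 0; j < 2 * R + 1; ++j)
        for (int i = 0; i < 2 * R + 1; ++i)
            sum += d_filter[j * (2 * R + 1) + i] *
                   tile[threadIdx.y + j][threadIdx.x + i];

    out[y * width + x] = sum;
}
```

The point, as I understand it, is that neighboring threads reuse pixels already loaded by the block instead of each fetching them again, but I don't know whether that actually beats the texture cache here.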
Thanks in advance for any comments.