CUDA occupancy - beginner question

Hi All,

I’m a beginner at CUDA programming, so please forgive any misunderstandings. Thanks!

First, the algorithm I am trying to implement works as follows.


  • transfer two images (roughly 10 MP each) from CPU to GPU
  • from given seeds in the two images, extract two NxN sub-images (say N=32)
  • perform element-wise multiplication between the two sub-images and store the temporary result in shared memory (shared memory usage = N*N*sizeof(float))
  • perform a box filter (once in the x direction, once in the y direction) on the temporary result (note that this constrains me to thread blocks of N threads, or at least a multiple of N threads, to keep all threads busy)
  • compute a correlation from the filtered result, using a formula that is not relevant here
  • copy the result from the GPU back to the host

(One would have to handle border effects as well.)
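
To make the per-tile steps concrete, here is a minimal CPU reference sketch in C++ of the multiply and separable box filter stages for one tile. It is only an illustration under assumptions, not a GPU implementation: the names `Tile`, `multiply`, `boxPass` and `filteredProduct` are hypothetical, the filter radius of 1 is an arbitrary choice, and the border handling (skipping taps outside the tile and averaging over the taps that remain) is just one possible convention.

```cpp
#include <vector>

// Hypothetical CPU reference for one N x N tile: element-wise product
// followed by a separable box filter (x pass, then y pass).
constexpr int N = 32;      // tile size used as the example in the post
constexpr int RADIUS = 1;  // assumed box radius (3-tap filter)

using Tile = std::vector<float>;  // row-major, N * N elements

// Element-wise product of two tiles.
Tile multiply(const Tile& a, const Tile& b) {
    Tile out(N * N);
    for (int i = 0; i < N * N; ++i) out[i] = a[i] * b[i];
    return out;
}

// One 1-D box pass, along x if alongX is true, otherwise along y.
// Assumed border rule: taps outside the tile are skipped and the sum
// is divided by the number of taps actually used.
Tile boxPass(const Tile& in, bool alongX) {
    Tile out(N * N);
    for (int y = 0; y < N; ++y) {
        for (int x = 0; x < N; ++x) {
            float sum = 0.0f;
            int taps = 0;
            for (int d = -RADIUS; d <= RADIUS; ++d) {
                int xx = alongX ? x + d : x;
                int yy = alongX ? y : y + d;
                if (xx < 0 || xx >= N || yy < 0 || yy >= N) continue;
                sum += in[yy * N + xx];
                ++taps;
            }
            out[y * N + x] = sum / taps;
        }
    }
    return out;
}

// Multiply, then filter in x, then filter in y.
Tile filteredProduct(const Tile& a, const Tile& b) {
    return boxPass(boxPass(multiply(a, b), true), false);
}
```

On the GPU, the intermediate tile would live in shared memory and each output element of a pass would map naturally to one thread, which is where the N-threads-per-block constraint comes from. A reference like this is mainly useful for checking a kernel's output on small inputs.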

My main issue here is that the amount of shared memory per block is proportional to N^2, which will result in extremely poor occupancy…

How would you approach this kind of problem? Or is this kind of algorithm not well suited to CUDA?
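
To put rough numbers on this, here is a small host-side C++ sketch that estimates occupancy from the shared memory footprint alone. The hardware limits are assumptions (48 KB of shared memory, 64 resident warps and 32 resident blocks per multiprocessor, 32 threads per warp); actual values depend on the GPU. The names `SmLimits`, `residentBlocks` and `occupancy` are hypothetical, and register usage is ignored.

```cpp
#include <algorithm>
#include <cstddef>

// Assumed per-multiprocessor limits; real values depend on the GPU
// (they can be queried with cudaGetDeviceProperties).
struct SmLimits {
    std::size_t sharedBytes;  // shared memory per multiprocessor
    int maxWarps;             // max resident warps per multiprocessor
    int maxBlocks;            // max resident blocks per multiprocessor
    int warpSize;             // threads per warp
};

// Resident blocks per multiprocessor, considering only shared memory
// and the block/warp caps (register pressure is ignored here).
int residentBlocks(const SmLimits& sm, int n, int threadsPerBlock) {
    std::size_t tileBytes = static_cast<std::size_t>(n) * n * sizeof(float);
    int warpsPerBlock = (threadsPerBlock + sm.warpSize - 1) / sm.warpSize;
    int bySharedMem = static_cast<int>(sm.sharedBytes / tileBytes);
    int byWarps = sm.maxWarps / warpsPerBlock;
    return std::min({bySharedMem, byWarps, sm.maxBlocks});
}

// Occupancy = resident warps / maximum resident warps.
double occupancy(const SmLimits& sm, int n, int threadsPerBlock) {
    int warpsPerBlock = (threadsPerBlock + sm.warpSize - 1) / sm.warpSize;
    int warps = residentBlocks(sm, n, threadsPerBlock) * warpsPerBlock;
    return static_cast<double>(warps) / sm.maxWarps;
}
```

Under these assumed limits, N=32 with 32 threads per block gives a 4096-byte tile, 12 resident blocks and an occupancy of 12/64 = 0.1875. With 256 threads per block (a multiple of N) sharing the same tile, the estimate rises to 1.0, because the shared memory cost per block stays fixed while the number of warps per block grows.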

Any help would be greatly appreciated. Thanks



Poor occupancy does not always imply poor performance. High occupancy means the device is better able to overlap global memory reads with computation across different warps. Since your algorithm performs a significant number of computations entirely in shared memory, it will most likely not be bound by global memory access, so the effect of occupancy on performance should be minimal.