CrossCorr Valid increase thread block

I’m using CUDA 9.2 on a GTX 1050 Ti under Windows 7.

I’m computing a 1D cross-correlation between two images, each column against its corresponding column, to find each column’s shift. GPU usage is 5.2%, while the theoretical figure is 100%.

sz_dst.width = 639;
sz_dst.height = 1;
sz_src_new.width = 1278;
sz_src_new.height = 33;
sz_tpl_new.width = 640;
sz_tpl_new.height = 33;

nppsts = nppiCrossCorrValid_NormLevel_32f_C1R(m_d_Src_new, sz_src_new.width * sizeof(float), sz_src_new,
			m_d_Tpl, sz_tpl_new.width * sizeof(float), sz_tpl_new,
			m_d_Dst, sz_dst.width * sizeof(float), m_pScratchBuffer1);

On profiling, grid size = 20,1,1 and block size = 32,8,1.
https://www.dropbox.com/s/j4w98q2bqwuv4e6/CrossCorr.bmp?dl=0

Why is the block size 256? I’m compiling with the compute_61,sm_61 code generation flags.
Can I increase the block size with a flag?

Should I add more streams?

Correction:
2D correlation, not 1D.

You don’t have any direct control over threadblock sizes. I doubt more streams would help, although you haven’t really indicated what you mean by that.

The overall size of the grid here will be at least partially determined by your work sizes, which may be small.

It’s possible you don’t really understand the GPU utilization figure. You might wish to research that, e.g. here:

“nvidia-smi Volatile GPU-Utilization explanation?” on Stack Overflow

It tells you nothing at all about how many threads or blocks are in a kernel launch. I can devise code that launches 1 block of 1 thread and has nearly 100% utilization. It’s essentially just a statement of what portion of the overall timeline had one or more kernels running.

Hi Robert,

Thanks for getting back quickly.

There are over 600 images, and I need to find the correlation between each adjacent pair. I intended to create 599 streams to compute the correlations between adjacent images concurrently.

Thanks for the post. Agree that GPU utilization is not the best performance indicator.

Theoretical occupancy = 100%, achieved occupancy = 5.2%. How can I increase the achieved occupancy, given that the theoretical figure is 100%?

Here are a few more screenshots.
Dropbox link (file since deleted)

The suggestions nvvp provides in your first screenshot give, in my opinion, a good overview of the issue:

“Grid size too small to hide compute and memory latency”

“Occupancy is not limiting kernel performance”

In other words:

  • your problem size is too small
  • attempts to ignore this and focus on achieved occupancy may be misguided (“misguided” == “may not lead to expected or desired results”)

If the work associated with a single image (pair) correlation is too small, then your motivation as a CUDA programmer should be to seek to expose more parallel work to the GPU.

The usual approach to do this would be to process batches of images, rather than a single image. I don’t know that the npp library has any such facility, but I haven’t studied it carefully.

In any event, you don’t have direct control over threadblock size or grid size when using a library routine like this one. These determinations may be arbitrarily chosen by the library, and/or by the size of the problem you are giving to the library routine.

To increase efficiency, it might be necessary to find another library routine (I don’t have any suggestion) or else write your own CUDA kernel that exposes more parallel work, perhaps by processing images in batches.

Writing your own cross-correlation code may be somewhat involved, but doesn’t sound terribly difficult to me, and the operation is well specified in a variety of places (plus you can compare results to “golden” output from the library function). It looks like a fairly standard 2D stencil problem to me. I note from your profiler output that the npp library routine does not appear to be using shared memory, so it may not be super-optimized anyway. There might be substantial performance gains possible by carefully crafting a batch processing kernel of your own.

And of course you may wish to try the streams approach.