Very poor performance with NPP CrossCorrValid

The issue reported in this thread was addressed in the CUDA 8.0 timeframe. Note the statement about very large template sizes.

If you wish to report an issue and have a complete test case, you can file a bug.