Help me help you with modern CMake and CUDA: MWEs for NPP

Hello,

I am working on resolving a CMake issue, to merge a module into CMake proper that will let us all use the various (and glorious) NVIDIA libraries, such as cuRAND, cuSOLVER, etc., from modern CMake. I have been able to scrape together working examples from the docs for almost every library (there is a small bug with cufft_static, but I can figure that one out), so that in your own projects you will be able to do

find_package(CUDALibs REQUIRED COMPONENTS ...)
# ...
target_link_libraries(myexe CUDA::cusolver)
# or
target_link_libraries(myexe CUDA::cusolver_static)

This is still at an early stage, but my aim is to have it ready by 3.13-rc1. I have not used the NPP library before and do not have time to learn all of its glory. The main docs page lists what the various libraries are for: https://docs.nvidia.com/cuda/npp/index.html

What I do not understand is how to create a small minimal working example for each individual library, e.g., one calling functions exclusive to nppi. They can be very simple, but each needs to be tested individually to make sure the packaging and dependencies are mapped correctly, both dynamically and statically.

The samples in the CUDA SDK are a little too convoluted and bring in far more extra code than is needed.

If you work on or with NPP, please consider helping me get started supporting this library in CMake. I believe increased usage will come with this support, since currently you have to specify linker flags manually, which is not ideal.

Note that the current minimal working examples for all other libraries are pure C++ (no CUDA code) and compile and link against the libraries using just the API calls. Our goal is to ensure we do not require enable_language(CUDA). This module will operate independently of CUDA as a language and of FindCUDA.cmake, enabling non-CUDA applications to still use and link against these libraries.
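
As a concrete illustration, here is how a consumer project might use the proposed targets (a sketch only; `CUDALibs` and the `CUDA::` target names are the proposal above, not something CMake ships yet):

```cmake
# Hypothetical usage of the proposed module (names not yet in CMake proper)
cmake_minimum_required(VERSION 3.12)
project(npp_demo CXX)  # note: no enable_language(CUDA) needed

find_package(CUDALibs REQUIRED COMPONENTS nppial)

add_executable(ut_nppial ut_nppial.cpp)
# dynamic linking:
target_link_libraries(ut_nppial CUDA::nppial)
# or static linking:
# target_link_libraries(ut_nppial CUDA::nppial_static)
```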

Thank you in advance for any help / pointers / explanation (e.g., even if it’s just to say “NPP actually must be compiled in CUDA source files”). I look forward to getting support into CMake for all of these libraries!

Here is a minimal test case for libnppial:

$ cat ut_nppial.cpp
#include <nppi_arithmetic_and_logical_operations.h>
#include <cuda_runtime_api.h>
#include <assert.h>

int main(){

/**
 * One 8-bit unsigned char channel in place image add constant, scale, then clamp to saturated value.
 */
  const int imgrows = 32;
  const int imgcols = 32;
  const Npp8u nConstant = 4;
  Npp8u *d_pSrcDst;
  const int nScaleFactor = 2;
  const int nResult = nConstant >> nScaleFactor;
  NppiSize oSizeROI;  oSizeROI.width = imgcols;  oSizeROI.height = imgrows;
  const int imgsize = imgrows*imgcols*sizeof(d_pSrcDst[0]);
  const int imgpix  = imgrows*imgcols;
  const int nSrcDstStep = imgcols*sizeof(d_pSrcDst[0]);
  cudaError_t err = cudaMalloc((void **)&d_pSrcDst, imgsize);
  assert(err == cudaSuccess);
  // set image to 0 initially
  err = cudaMemset(d_pSrcDst, 0, imgsize);
  assert(err == cudaSuccess);
  // add nConstant to each pixel, then multiply each pixel by 2^-nScaleFactor
  NppStatus ret =  nppiAddC_8u_C1IRSfs(nConstant, d_pSrcDst, nSrcDstStep, oSizeROI, nScaleFactor);
  assert(ret == NPP_NO_ERROR);
  Npp8u *h_imgres = new Npp8u[imgpix];
  err = cudaMemcpy(h_imgres, d_pSrcDst, imgsize, cudaMemcpyDeviceToHost);
  assert(err == cudaSuccess);
  // test that result = nConstant * 2^-nScaleFactor
  for (int i = 0; i < imgpix; i++) assert(h_imgres[i] == nResult);
  return 0;
}

$ cat bld_nppial
#!/bin/bash
# static linking to CUDA libraries
g++ -I/usr/local/cuda/include ut_nppial.cpp -L/usr/local/cuda/lib64 -lnppial_static -lnppc_static -lculibos -lcudart_static -lpthread -ldl -lrt -o ut_nppial_static
# dynamic linking
g++ -I/usr/local/cuda/include ut_nppial.cpp -L/usr/local/cuda/lib64 -lnppial -lcudart  -o ut_nppial_dynamic

$ ./bld_nppial
$ ./ut_nppial_static
$ ./ut_nppial_dynamic
$

Tested on CUDA 9.2, CentOS 7, g++ 4.8.5-4
https://docs.nvidia.com/cuda/npp/index.html
This assumes a standard, proper CUDA install.
This is a very simple test case to demonstrate minimal library linking. For more complete NPP sample codes, refer to the CUDA sample codes:

https://docs.nvidia.com/cuda/cuda-samples/index.html#cudalibraries

Here is a minimal test case for libnppicc:

$ cat ut_nppicc.cpp
#include <nppi_color_conversion.h>
#include <cuda_runtime_api.h>
#include <assert.h>

int main(){

/**
 * 3 channel 8-bit unsigned packed RGB to 1 channel 8-bit unsigned packed Gray conversion.
 */

  const int imgrows = 32;
  const int imgcols = 32;
  Npp8u *d_pSrc, *d_pDst;
  const int pixval = 9;
  float R, G, B;
  R = B = G = pixval;
  const int nGray =  (int)(0.299F * R + 0.587F * G + 0.114F * B);
  NppiSize oSizeROI;  oSizeROI.width = imgcols;  oSizeROI.height = imgrows;
  const int srcimgsize = imgrows*imgcols*3*sizeof(Npp8u);
  const int dstimgsize = imgrows*imgcols*sizeof(Npp8u);
  const int imgpix  = imgrows*imgcols;
  const int nSrcStep = imgcols*3*sizeof(Npp8u);
  const int nDstStep = imgcols*sizeof(Npp8u);
  cudaError_t err = cudaMalloc((void **)&d_pSrc, srcimgsize);
  assert(err == cudaSuccess);
  err = cudaMalloc((void **)&d_pDst, dstimgsize);
  assert(err == cudaSuccess);
  // set image (all components) to pixval initially
  err = cudaMemset(d_pSrc, pixval, srcimgsize);
  assert(err == cudaSuccess);
  // convert image to gray
  NppStatus ret = nppiRGBToGray_8u_C3C1R(d_pSrc, nSrcStep, d_pDst, nDstStep, oSizeROI);
  assert(ret == NPP_NO_ERROR);
  Npp8u *h_imgres = new Npp8u[imgpix];
  err = cudaMemcpy(h_imgres, d_pDst, dstimgsize, cudaMemcpyDeviceToHost);
  assert(err == cudaSuccess);
  // test result
  for (int i = 0; i < imgpix; i++) assert(h_imgres[i] == nGray);
  return 0;
}

$ cat bld_nppicc
#!/bin/bash
# static linking to CUDA libraries
g++ -I/usr/local/cuda/include ut_nppicc.cpp -L/usr/local/cuda/lib64 -lnppicc_static -lnppc_static -lculibos -lcudart_static -lpthread -ldl  -lrt -o ut_nppicc_static

# dynamic linking
g++ -I/usr/local/cuda/include ut_nppicc.cpp -L/usr/local/cuda/lib64 -lnppicc -lcudart -o ut_nppicc_dynamic

$ ./bld_nppicc
$ ./ut_nppicc_static
$ ./ut_nppicc_dynamic
$

Tested on CUDA 9.2, CentOS 7, g++ 4.8.5-4
https://docs.nvidia.com/cuda/npp/index.html
This assumes a standard, proper CUDA install.
This is a very simple test case to demonstrate minimal library linking. For more complete NPP sample codes, refer to the CUDA sample codes:

https://docs.nvidia.com/cuda/cuda-samples/index.html#cudalibraries

Here is a minimal test case for libnppicom:

$ cat ut_nppicom.cpp
#include <nppi_compression_functions.h>
#include <assert.h>

int main(){

/**
 * Query the size of the NppiDecodeHuffmanSpec structure.
 */
  int pSize = 0;
  NppStatus ret = nppiDecodeHuffmanSpecGetBufSize_JPEG(&pSize);
  assert(ret == NPP_NO_ERROR);
  assert(pSize > 0);
  return 0;
}


$ cat bld_nppicom
#!/bin/bash
# static linking to CUDA libraries
g++ -I/usr/local/cuda/include ut_nppicom.cpp -L/usr/local/cuda/lib64 -lnppicom_static -o ut_nppicom_static

# dynamic linking
g++ -I/usr/local/cuda/include ut_nppicom.cpp -L/usr/local/cuda/lib64 -lnppicom -o ut_nppicom_dynamic

$ ./bld_nppicom
$ ./ut_nppicom_static
$ ./ut_nppicom_dynamic
$

Tested on CUDA 9.2, CentOS 7, g++ 4.8.5-4
https://docs.nvidia.com/cuda/npp/index.html
This assumes a standard, proper CUDA install.

This is obviously a very minimal test case for purposes of demonstrating minimal library linking. If you desire a more complete test case of JPEG compression functionality, I refer you to the relevant CUDA sample codes:

https://docs.nvidia.com/cuda/cuda-samples/index.html#jpeg-encode-decode-and-resize-with-npp

Here is a minimal test case for libnppidei:

$ cat ut_nppidei.cpp
#include <nppi_data_exchange_and_initialization.h>
#include <cuda_runtime_api.h>
#include <assert.h>

int main(){


/**
 * 8-bit image copy.
 */

  const int imgrows = 32;
  const int imgcols = 32;
  Npp8s *d_pSrc, *d_pDst;
  NppiSize oSizeROI;  oSizeROI.width = imgcols;  oSizeROI.height = imgrows;
  const int imgsize = imgrows*imgcols*sizeof(d_pSrc[0]);
  const int imgpix  = imgrows*imgcols;
  const int nSrcStep = imgcols*sizeof(d_pSrc[0]);
  const int nDstStep = imgcols*sizeof(d_pDst[0]);
  const int pixval = 1;
  cudaError_t err = cudaMalloc((void **)&d_pSrc, imgsize);
  assert(err == cudaSuccess);
  err = cudaMalloc((void **)&d_pDst, imgsize);
  assert(err == cudaSuccess);
  // set image to pixval initially
  err = cudaMemset(d_pSrc, pixval, imgsize);
  assert(err == cudaSuccess);
  err = cudaMemset(d_pDst, 0, imgsize);
  assert(err == cudaSuccess);
  // copy src to dst
  NppStatus ret =  nppiCopy_8s_C1R(d_pSrc, nSrcStep, d_pDst, nDstStep, oSizeROI);
  assert(ret == NPP_NO_ERROR);
  Npp8s *h_imgres = new Npp8s[imgpix];
  err = cudaMemcpy(h_imgres, d_pDst, imgsize, cudaMemcpyDeviceToHost);
  assert(err == cudaSuccess);
  // test that dst = pixval
  for (int i = 0; i < imgpix; i++) assert(h_imgres[i] == pixval);
  return 0;
}


$ cat bld_nppidei
#!/bin/bash
# static linking to CUDA libraries
g++ -I/usr/local/cuda/include ut_nppidei.cpp -L/usr/local/cuda/lib64 -lnppidei_static -lnppc_static -lculibos -lcudart_static -lpthread -ldl -lrt -o ut_nppidei_static
# dynamic linking
g++ -I/usr/local/cuda/include ut_nppidei.cpp -L/usr/local/cuda/lib64 -lnppidei  -lcudart -o ut_nppidei_dynamic
$ ./bld_nppidei
$ ./ut_nppidei_static
$ ./ut_nppidei_dynamic
$

Tested on CUDA 9.2, CentOS 7, g++ 4.8.5-4
https://docs.nvidia.com/cuda/npp/index.html
This assumes a standard, proper CUDA install.
This is a very simple test case to demonstrate minimal library linking. For more complete NPP sample codes, refer to the CUDA sample codes:

https://docs.nvidia.com/cuda/cuda-samples/index.html#cudalibraries

Here is a minimal test case for libnppif:

$ cat ut_nppif.cpp
#include <nppi_filtering_functions.h>
#include <cuda_runtime_api.h>
#include <assert.h>

int main(){

/**
 * 8-bit unsigned single-channel 1D row convolution.
 */

  const int simgrows = 32;
  const int simgcols = 32;
  Npp8u *d_pSrc, *d_pDst;
  const int nMaskSize = 3;
  NppiSize oROI;  oROI.width = simgcols - nMaskSize;  oROI.height = simgrows;
  const int simgsize = simgrows*simgcols*sizeof(d_pSrc[0]);
  const int dimgsize = oROI.width*oROI.height*sizeof(d_pSrc[0]);
  const int simgpix  = simgrows*simgcols;
  const int dimgpix  = oROI.width*oROI.height;
  const int nSrcStep = simgcols*sizeof(d_pSrc[0]);
  const int nDstStep = oROI.width*sizeof(d_pDst[0]);
  const int pixval = 1;
  const int nDivisor = 1;
  const Npp32s h_pKernel[nMaskSize] = {pixval, pixval, pixval};
  Npp32s *d_pKernel;
  const Npp32s nAnchor = 2;
  cudaError_t err = cudaMalloc((void **)&d_pSrc, simgsize);
  assert(err == cudaSuccess);
  err = cudaMalloc((void **)&d_pDst, dimgsize);
  assert(err == cudaSuccess);
  err = cudaMalloc((void **)&d_pKernel, nMaskSize*sizeof(d_pKernel[0]));
  assert(err == cudaSuccess);
  // set image to pixval initially
  err = cudaMemset(d_pSrc, pixval, simgsize);
  assert(err == cudaSuccess);
  err = cudaMemset(d_pDst, 0, dimgsize);
  assert(err == cudaSuccess);
  err = cudaMemcpy(d_pKernel, h_pKernel, nMaskSize*sizeof(d_pKernel[0]), cudaMemcpyHostToDevice);
  assert(err == cudaSuccess);
  // 1D row convolution: each output pixel is the weighted sum of 3 neighbors
  NppStatus ret =  nppiFilterRow_8u_C1R(d_pSrc, nSrcStep, d_pDst, nDstStep, oROI, d_pKernel, nMaskSize, nAnchor, nDivisor);
  assert(ret == NPP_NO_ERROR);
  Npp8u *h_imgres = new Npp8u[dimgpix];
  err = cudaMemcpy(h_imgres, d_pDst, dimgsize, cudaMemcpyDeviceToHost);
  assert(err == cudaSuccess);
  // test for filtering
  for (int i = 0; i < dimgpix; i++) assert(h_imgres[i] == (pixval*pixval*nMaskSize));
  return 0;
}

$ cat bld_nppif
#!/bin/bash
# static linking to CUDA libraries
g++ -I/usr/local/cuda/include ut_nppif.cpp -L/usr/local/cuda/lib64 -lnppif_static -lnppc_static -lculibos -lcudart_static -lpthread -ldl -lrt -o ut_nppif_static
# dynamic linking
g++ -I/usr/local/cuda/include ut_nppif.cpp -L/usr/local/cuda/lib64 -lnppif  -lcudart -o ut_nppif_dynamic
$ ./bld_nppif
$ ./ut_nppif_static
$ ./ut_nppif_dynamic
$

Tested on CUDA 9.2, CentOS 7, g++ 4.8.5-4
https://docs.nvidia.com/cuda/npp/index.html
This assumes a standard, proper CUDA install.
This is a very simple test case to demonstrate minimal library linking. For more complete NPP sample codes, refer to the CUDA sample codes:

https://docs.nvidia.com/cuda/cuda-samples/index.html#cudalibraries

Here is a minimal test case for libnppig:

$ cat ut_nppig.cpp
#include <nppi_geometry_transforms.h>
#include <cuda_runtime_api.h>
#include <assert.h>

int main(){

/**
 * 1 channel 8-bit unsigned image mirror.
 */
  const int simgrows = 32;
  const int simgcols = 32;
  Npp8u *d_pSrc, *d_pDst;
  NppiSize oROI;  oROI.width = simgcols;  oROI.height = simgrows;
  const int simgsize = simgrows*simgcols*sizeof(d_pSrc[0]);
  const int dimgsize = oROI.width*oROI.height*sizeof(d_pSrc[0]);
  const int simgpix  = simgrows*simgcols;
  const int dimgpix  = oROI.width*oROI.height;
  const int nSrcStep = simgcols*sizeof(d_pSrc[0]);
  const int nDstStep = oROI.width*sizeof(d_pDst[0]);
  const NppiAxis flip = NPP_VERTICAL_AXIS;
  Npp8u *h_img = new Npp8u[simgpix];
  for (int i = 0; i < simgrows; i++)
    for (int j = 0; j < simgcols; j++) h_img[i*simgcols+j] = simgcols-j-1;
  cudaError_t err = cudaMalloc((void **)&d_pSrc, simgsize);
  assert(err == cudaSuccess);
  err = cudaMalloc((void **)&d_pDst, dimgsize);
  assert(err == cudaSuccess);
  err = cudaMemcpy(d_pSrc, h_img, simgsize, cudaMemcpyHostToDevice);
  assert(err == cudaSuccess);
  // clear the destination image
  err = cudaMemset(d_pDst, 0, dimgsize);
  assert(err == cudaSuccess);
  // perform mirror op
  NppStatus ret =  nppiMirror_8u_C1R(d_pSrc, nSrcStep, d_pDst, nDstStep, oROI, flip);
  assert(ret == NPP_NO_ERROR);
  err = cudaMemcpy(h_img, d_pDst, dimgsize, cudaMemcpyDeviceToHost);
  assert(err == cudaSuccess);
  // test for R to L flip
  for (int i = 0; i < oROI.height; i++)
    for (int j = 0; j < oROI.width; j++) assert(h_img[i*oROI.width+j] == j);
  return 0;
}

$ cat bld_nppig
#!/bin/bash
# static linking to CUDA libraries
g++ -I/usr/local/cuda/include ut_nppig.cpp -L/usr/local/cuda/lib64 -lnppig_static -lnppc_static -lculibos -lcudart_static -lpthread -ldl -lrt -o ut_nppig_static
# dynamic linking
g++ -I/usr/local/cuda/include ut_nppig.cpp -L/usr/local/cuda/lib64 -lnppig  -lcudart -o ut_nppig_dynamic
$ ./bld_nppig
$ ./ut_nppig_static
$ ./ut_nppig_dynamic
$

Tested on CUDA 9.2, CentOS 7, g++ 4.8.5-4
https://docs.nvidia.com/cuda/npp/index.html
This assumes a standard, proper CUDA install.
This is a very simple test case to demonstrate minimal library linking. For more complete NPP sample codes, refer to the CUDA sample codes:

https://docs.nvidia.com/cuda/cuda-samples/index.html#cudalibraries

Here is a minimal test case for libnppim:

$ cat ut_nppim.cpp
#include <nppi_morphological_operations.h>
#include <cuda_runtime_api.h>
#include <assert.h>

int main(){

/**
 * Single-channel 8-bit unsigned integer 3x3 dilation.
 */

  const int simgrows = 32;
  const int simgcols = 32;
  const int maxval = 5;
  Npp8u *d_pSrc, *d_pDst;
  NppiSize oROI;  oROI.width = simgcols-2;  oROI.height = simgrows-2;
  const int simgsize = simgrows*simgcols*sizeof(d_pSrc[0]);
  const int dimgsize = oROI.width*oROI.height*sizeof(d_pSrc[0]);
  const int simgpix  = simgrows*simgcols;
  const int dimgpix  = oROI.width*oROI.height;
  const int nSrcStep = simgcols*sizeof(d_pSrc[0]);
  const int nDstStep = oROI.width*sizeof(d_pDst[0]);
  Npp8u *h_img = new Npp8u[simgpix];
  for (int i = 0; i < simgpix; i++) h_img[i] = (i%2)*maxval;
  cudaError_t err = cudaMalloc((void **)&d_pSrc, simgsize);
  assert(err == cudaSuccess);
  err = cudaMalloc((void **)&d_pDst, dimgsize);
  assert(err == cudaSuccess);
  err = cudaMemcpy(d_pSrc, h_img, simgsize, cudaMemcpyHostToDevice);
  assert(err == cudaSuccess);
  err = cudaMemset(d_pDst, 0, dimgsize);
  assert(err == cudaSuccess);
  // do 3x3 max finding
  // offset src by one row (simgcols pixels) plus one column so the 3x3
  // window stays in bounds (the original simgrows offset only worked
  // because the image is square)
  NppStatus ret =  nppiDilate3x3_8u_C1R(d_pSrc+simgcols+1, nSrcStep, d_pDst, nDstStep, oROI);
  assert(ret == NPP_NO_ERROR);
  err = cudaMemcpy(h_img, d_pDst, dimgsize, cudaMemcpyDeviceToHost);
  assert(err == cudaSuccess);
  // test for all pixels at maxval
  for (int i = 0; i < dimgpix; i++) assert(h_img[i] == maxval);
  return 0;
}

$ cat bld_nppim
#!/bin/bash
# static linking to CUDA libraries
g++ -I/usr/local/cuda/include ut_nppim.cpp -L/usr/local/cuda/lib64 -lnppim_static -lnppc_static -lculibos -lcudart_static -lpthread -ldl -lrt -o ut_nppim_static
# dynamic linking
g++ -I/usr/local/cuda/include ut_nppim.cpp -L/usr/local/cuda/lib64 -lnppim  -lcudart -o ut_nppim_dynamic
$ ./bld_nppim
$ ./ut_nppim_static
$ ./ut_nppim_dynamic
$

Tested on CUDA 9.2, CentOS 7, g++ 4.8.5-4
https://docs.nvidia.com/cuda/npp/index.html
This assumes a standard, proper CUDA install.
This is a very simple test case to demonstrate minimal library linking. For more complete NPP sample codes, refer to the CUDA sample codes:

https://docs.nvidia.com/cuda/cuda-samples/index.html#cudalibraries

Here is a minimal test case for libnppist:

$ cat ut_nppist.cpp
#include <nppi_statistics_functions.h>
// note that functions from nppi_linear_transforms.h would be built/linked similarly
#include <cuda_runtime_api.h>
#include <assert.h>

int main(){

/**
 * One-channel 8-bit unsigned image sum.
 */

  const int simgrows = 32;
  const int simgcols = 32;
  const int pixval = 1;
  Npp8u *d_pSrc, *d_pBuf;
  NppiSize oROI;  oROI.width = simgcols;  oROI.height = simgrows;
  const int simgsize = simgrows*simgcols*sizeof(d_pSrc[0]);
  const int simgpix  = simgrows*simgcols;
  const int nSrcStep = simgcols*sizeof(d_pSrc[0]);
  Npp64f *d_pSum, h_Sum;
  cudaError_t err = cudaMalloc((void **)&d_pSrc, simgsize);
  assert(err == cudaSuccess);
  // deliberately oversized scratch buffer; production code should query the
  // required size with nppiSumGetBufferHostSize_8u_C1R
  err = cudaMalloc((void **)&d_pBuf, 8*simgsize);
  assert(err == cudaSuccess);
  err = cudaMalloc((void **)&d_pSum, sizeof(h_Sum));
  assert(err == cudaSuccess);
  err = cudaMemset(d_pSrc, pixval, simgsize);
  assert(err == cudaSuccess);
  // find sum of all pixels
  NppStatus ret =  nppiSum_8u_C1R(d_pSrc, nSrcStep, oROI, d_pBuf, d_pSum);
  assert(ret == NPP_NO_ERROR);
  err = cudaMemcpy(&h_Sum, d_pSum, sizeof(h_Sum), cudaMemcpyDeviceToHost);
  assert(err == cudaSuccess);
  // test for proper sum
  assert(h_Sum == pixval*simgrows*simgcols);
  return 0;
}

$ cat bld_nppist
#!/bin/bash
# static linking to CUDA libraries
g++ -I/usr/local/cuda/include ut_nppist.cpp -L/usr/local/cuda/lib64 -lnppist_static -lnppc_static -lculibos -lcudart_static -lpthread -ldl -lrt -o ut_nppist_static
# dynamic linking
g++ -I/usr/local/cuda/include ut_nppist.cpp -L/usr/local/cuda/lib64 -lnppist  -lcudart -o ut_nppist_dynamic
$ ./bld_nppist
$ ./ut_nppist_static
$ ./ut_nppist_dynamic
$

Tested on CUDA 9.2, CentOS 7, g++ 4.8.5-4
https://docs.nvidia.com/cuda/npp/index.html
This assumes a standard, proper CUDA install.
This is a very simple test case to demonstrate minimal library linking. For more complete NPP sample codes, refer to the CUDA sample codes:

https://docs.nvidia.com/cuda/cuda-samples/index.html#cudalibraries

Here is a minimal test case for libnppisu:

$ cat ut_nppisu.cpp
#include <nppi_support_functions.h>
#include <assert.h>

int main(){

/**
 * One-channel 8-bit unsigned allocation
 */
  const int simgrows = 32;
  const int simgcols = 32;
  int pitch;
  Npp8u *d_ptr = NULL;
  d_ptr = nppiMalloc_8u_C1(simgcols, simgrows, &pitch);
  assert(d_ptr != NULL);
  return 0;
}

$ cat bld_nppisu
#!/bin/bash
# static linking to CUDA libraries
g++ -I/usr/local/cuda/include ut_nppisu.cpp -L/usr/local/cuda/lib64 -lnppisu_static -lcudart_static -lpthread -ldl -lrt -o ut_nppisu_static
# dynamic linking
g++ -I/usr/local/cuda/include ut_nppisu.cpp -L/usr/local/cuda/lib64 -lnppisu  -o ut_nppisu_dynamic
$ ./bld_nppisu
$ ./ut_nppisu_static
$ ./ut_nppisu_dynamic
$

Tested on CUDA 9.2, CentOS 7, g++ 4.8.5-4
https://docs.nvidia.com/cuda/npp/index.html
This assumes a standard, proper CUDA install.
This is a very simple test case to demonstrate minimal library linking. For more complete NPP sample codes, refer to the CUDA sample codes:

https://docs.nvidia.com/cuda/cuda-samples/index.html#cudalibraries

Here is a minimal test case for libnppitc:

$ cat ut_nppitc.cpp
#include <nppi_threshold_and_compare_operations.h>
#include <cuda_runtime_api.h>
#include <assert.h>
int main(){

/**
 * 1 channel 8-bit unsigned char image compare.
 * Compare pSrc1's pixels with corresponding pixels in pSrc2.
 */
  const int simgrows = 32;
  const int simgcols = 32;
  const int pixval = 1;
  Npp8u *d_pSrc1, *d_pSrc2, *d_pDst;
  NppiSize oROI;  oROI.width = simgcols;  oROI.height = simgrows;
  const int simgsize = simgrows*simgcols*sizeof(d_pSrc1[0]);
  const int simgpix  = simgrows*simgcols;
  const int nSrcStep = simgcols*sizeof(d_pSrc1[0]);
  cudaError_t err = cudaMalloc((void **)&d_pSrc1, simgsize);
  assert(err == cudaSuccess);
  err = cudaMalloc((void **)&d_pSrc2, simgsize);
  assert(err == cudaSuccess);
  err = cudaMalloc((void **)&d_pDst, simgsize);
  assert(err == cudaSuccess);
  err = cudaMemset(d_pSrc1, 0, simgsize);
  assert(err == cudaSuccess);
  err = cudaMemset(d_pSrc2, pixval, simgsize);
  assert(err == cudaSuccess);
  err = cudaMemset(d_pDst, 0, simgsize);
  assert(err == cudaSuccess);
  NppCmpOp eCompOp = NPP_CMP_LESS;
  // compare images
  NppStatus ret =  nppiCompare_8u_C1R(d_pSrc1, nSrcStep, d_pSrc2, nSrcStep, d_pDst,  nSrcStep, oROI, eCompOp);
  assert(ret == NPP_NO_ERROR);
  Npp8u *h_img = new Npp8u[simgpix];
  err = cudaMemcpy(h_img, d_pDst, simgsize, cudaMemcpyDeviceToHost);
  assert(err == cudaSuccess);
  // test for proper compare
  for (int i = 0; i < simgpix; i++) assert(h_img[i]);
  return 0;
}

$ cat bld_nppitc
#!/bin/bash
# static linking to CUDA libraries
g++ -I/usr/local/cuda/include ut_nppitc.cpp -L/usr/local/cuda/lib64 -lnppitc_static -lnppc_static -lculibos -lcudart_static -lpthread -ldl -lrt -o ut_nppitc_static
# dynamic linking
g++ -I/usr/local/cuda/include ut_nppitc.cpp -L/usr/local/cuda/lib64 -lnppitc -lcudart -o ut_nppitc_dynamic

$ ./bld_nppitc
$ ./ut_nppitc_static
$ ./ut_nppitc_dynamic
$

Tested on CUDA 9.2, CentOS 7, g++ 4.8.5-4
https://docs.nvidia.com/cuda/npp/index.html
This assumes a standard, proper CUDA install.
This is a very simple test case to demonstrate minimal library linking. For more complete NPP sample codes, refer to the CUDA sample codes:

https://docs.nvidia.com/cuda/cuda-samples/index.html#cudalibraries

All npps functions are currently included in a single library.
Here is a minimal test case for libnpps:

$ cat ut_npps.cpp
#include <npps.h>
// the umbrella header above pulls in all of the following, and functions
// from any of them build and link the same way:
//
// npps_arithmetic_and_logical_operations.h  npps_support_functions.h
// npps_conversion_functions.h               npps_initialization.h
// npps_filtering_functions.h                npps_statistics_functions.h
//
#include <cuda_runtime_api.h>
#include <assert.h>
int main(){

/**
 * 1 channel 8-bit unsigned char zero vector
 */

  const int len = 32;
  Npp8u *d_pDst;
  const int vsize = len*sizeof(d_pDst[0]);
  cudaError_t err = cudaMalloc((void **)&d_pDst, vsize);
  assert(err == cudaSuccess);
  err = cudaMemset(d_pDst, 1, vsize);
  assert(err == cudaSuccess);
  NppStatus ret = nppsZero_8u(d_pDst, len);
  assert(ret == NPP_NO_ERROR);
  Npp8u *h_img = new Npp8u[len];
  err = cudaMemcpy(h_img, d_pDst, vsize, cudaMemcpyDeviceToHost);
  assert(err == cudaSuccess);
  // test for zeroing
  for (int i = 0; i < len; i++) assert(!(h_img[i]));
  return 0;
}

$ cat bld_npps
#!/bin/bash
# static linking to CUDA libraries
g++ -I/usr/local/cuda/include ut_npps.cpp -L/usr/local/cuda/lib64 -lnpps_static -lnppc_static -lculibos -lcudart_static -lpthread -ldl -lrt -o ut_npps_static
# dynamic linking
g++ -I/usr/local/cuda/include ut_npps.cpp -L/usr/local/cuda/lib64 -lnpps  -lcudart -o ut_npps_dynamic
$ ./bld_npps
$ ./ut_npps_static
$ ./ut_npps_dynamic
$

Tested on CUDA 9.2, CentOS 7, g++ 4.8.5-4
https://docs.nvidia.com/cuda/npp/index.html
This assumes a standard, proper CUDA install.
This is a very simple test case to demonstrate minimal library linking. For more complete NPP sample codes, refer to the CUDA sample codes:

https://docs.nvidia.com/cuda/cuda-samples/index.html#cudalibraries

Dear @txbob,

THANK YOU SO MUCH!!! This is incredibly helpful; I have all of these linkage-test programs incorporated now. I am very grateful that you took the time to include complete examples for every single one.

I have a lot more work to do on this CMake package, so your going as far as completing the NPP tests for me is very, very, VERY helpful. Thank you again!

All of these samples, and almost every other CUDA Toolkit library, are working on Linux now; I will begin platform testing for Windows and OSX soon.

https://github.com/svenevs/cmake-cuda-targets

Slowly but surely this will become a valid CMake find module with COMPONENTS etc. So as a follow-up: do you think we should make a “metatarget” called CUDA::NPP that just links against everything? I think that would make people’s lives easier, but I am not really sure how appropriate that is.
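
For concreteness, one possible shape for such a metatarget is an INTERFACE imported library that forwards to every per-library target (a sketch; all `CUDA::` names here assume the proposed module, nothing shipped):

```cmake
# Hypothetical CUDA::NPP metatarget aggregating the per-library targets
add_library(CUDA::NPP INTERFACE IMPORTED)
set_target_properties(CUDA::NPP PROPERTIES
  INTERFACE_LINK_LIBRARIES
    "CUDA::nppial;CUDA::nppicc;CUDA::nppicom;CUDA::nppidei;CUDA::nppif;CUDA::nppig;CUDA::nppim;CUDA::nppist;CUDA::nppisu;CUDA::nppitc;CUDA::npps")
```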

It has the advantage of simplicity. Some disadvantages are:

  • code/executable bloat
  • library initialization/startup time
  • potentially triggering extra long JIT delays, in some circumstances

I can’t offer a judgement, and probably there will be some folks who prefer one method over another.