Problem with a JPEG encoder built on the NPP library

Hello guys.

I’m trying to build a JPEG encoder (YUV420 planar to JPEG) using the NPP library.
I have been practicing with the jpegNPP sample code, which covers JPEG decoding and JPEG encoding.

The problem is that nppiEncodeOptimizeHuffmanScan_JPEG_8u16s_P3R fails and returns [-1000], which means NPP_CUDA_KERNEL_EXECUTION_ERROR.

  • I have checked that the pointers I pass are non-null and that they point into GPU memory.

The log at the end of this post is the output of cuda-memcheck.
Although the information I have given is very limited, could you tell me why nppiEncodeOptimizeHuffmanScan_JPEG_8u16s_P3R returns [-1000]?

(*Additional) To be more specific, I call the NPP API in the order shown below, and steps 1 and 2 complete without error.

  1. (DCT, Quantization) nppiDCTQuantFwd8x8LS_JPEG_8u16s_C1R_NEW
  2. (Huffman encoder initialization) nppiEncodeHuffmanSpecInitAlloc_JPEG
  3. (Huffman encoding) nppiEncodeOptimizeHuffmanScan_JPEG_8u16s_P3R
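One thing that might be worth cross-checking between steps 1 and 3 is the per-plane geometry. The sketch below assumes 4:2:0 sampling with a 16x16-pixel MCU and computes the block-padded size of each plane plus the minimum number of bytes in one row of 8x8 DCT blocks. PlaneGeometry, divUp, and planeGeometry are hypothetical helper names, not NPP API, and the layout of 64 16-bit coefficients per block is an assumption based on my reading of the jpegNPP sample, not a statement of what NPP requires.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical helper type (not part of NPP): block-padded plane geometry.
struct PlaneGeometry {
    int blocksW;              // 8x8 blocks per row
    int blocksH;              // rows of 8x8 blocks
    int widthPx;              // padded width in pixels  (blocksW * 8)
    int heightPx;             // padded height in pixels (blocksH * 8)
    std::size_t dctRowBytes;  // minimum bytes for one row of DCT blocks
};

// Integer division rounding up.
inline int divUp(int a, int b) { return (a + b - 1) / b; }

// hFactor/vFactor: sampling factors of this component
//                  (2,2 for luma and 1,1 for chroma in 4:2:0).
// maxH/maxV:       largest sampling factors over all components.
inline PlaneGeometry planeGeometry(int imageW, int imageH,
                                   int hFactor, int vFactor,
                                   int maxH, int maxV) {
    assert(imageW > 0 && imageH > 0);
    assert(hFactor > 0 && vFactor > 0 && maxH >= hFactor && maxV >= vFactor);

    // An MCU covers (8 * maxH) x (8 * maxV) pixels; the image is padded up
    // to a whole number of MCUs.
    const int mcusW = divUp(imageW, 8 * maxH);
    const int mcusH = divUp(imageH, 8 * maxV);

    PlaneGeometry g;
    g.blocksW = mcusW * hFactor;
    g.blocksH = mcusH * vFactor;
    g.widthPx = g.blocksW * 8;
    g.heightPx = g.blocksH * 8;
    // Assumption: each 8x8 block is stored as 64 consecutive 16-bit values.
    g.dctRowBytes = static_cast<std::size_t>(g.blocksW) * 64 * sizeof(std::int16_t);
    return g;
}
```

For a 1920x1080 frame, planeGeometry(1920, 1080, 2, 2, 2, 2) gives a luma plane of 240x136 blocks (1920x1088 pixels), and planeGeometry(1920, 1080, 1, 1, 2, 2) gives chroma planes of 120x68 blocks. A mismatch between sizes like these and the sizeOfROI or dstDCTStep values handed to step 3 is one thing to rule out.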

According to the API documentation (https://docs.nvidia.com/cuda/npp/group__image__compression.html#ga57fc16fbd6dd3e6b5d441eeb1b2c4332),
the return values listed for the ‘nppiEncodeOptimizeHuffmanScan_JPEG_8u16s_P3R’ function do not include ‘NPP_CUDA_KERNEL_EXECUTION_ERROR’.
I would like to know what I should do to fix or avoid this ‘NPP_CUDA_KERNEL_EXECUTION_ERROR’.

My machine spec is below:

  • Nvidia Tesla P4
  • CentOS Linux release 7.4.1708 (Core)
  • NVIDIA-SMI 396.26
  • cuda-9.0
  • Intel(R) Xeon(R) CPU E5-2630 v4 @ 2.20GHz * 40
  • MemTotal: 131727700 kB

Thanks a lot.

========= Invalid global write of size 4
========= at 0x00001338 in void ACKernel<int=8, WARP_COM=0>(bool, unsigned int*, int*, int, int, short const *, int, int, uint2 const *, int, int, int)
========= by thread (11,5,0) in block (39,0,0)
========= Address 0x7ff212502474 is out of bounds
========= Saved host backtrace up to driver entry point at kernel launch time
========= Host Frame:/lib64/libcuda.so.1 (cuLaunchKernel + 0x2cd) [0x24c3ad]
========= Host Frame:/usr/local/cuda/lib64/libnppicom.so.9.0 [0x48eeb]
========= Host Frame:/usr/local/cuda/lib64/libnppicom.so.9.0 [0x6664e]
========= Host Frame:/usr/local/cuda/lib64/libnppicom.so.9.0 [0x21c45]
========= Host Frame:/usr/local/cuda/lib64/libnppicom.so.9.0 (nppiEncodeOptimizeHuffmanScan_JPEG_8u16s_P3R + 0xb08) [0x27ec8]
========= Host Frame:./PymTest (jpegenc_encode + 0xe76) [0xacd2]
========= Host Frame:./PymTest (transcodeHevcToYuv + 0xc6) [0x3378]
========= Host Frame:./PymTest (hevcToYUV + 0xf9) [0x2d83]
========= Host Frame:./PymTest (test + 0x20a) [0xbc55]
========= Host Frame:/lib64/libpthread.so.0 [0x7e25]
========= Host Frame:/lib64/libc.so.6 (clone + 0x6d) [0xf834d]

========= Invalid global write of size 4
========= at 0x00001338 in void ACKernel<int=8, WARP_COM=0>(bool, unsigned int*, int*, int, int, short const *, int, int, uint2 const *, int, int, int)
========= by thread (10,5,0) in block (39,0,0)
========= Address 0x7ff212502474 is out of bounds
========= Saved host backtrace up to driver entry point at kernel launch time
========= Host Frame:/lib64/libcuda.so.1 (cuLaunchKernel + 0x2cd) [0x24c3ad]
========= Host Frame:/usr/local/cuda/lib64/libnppicom.so.9.0 [0x48eeb]
========= Host Frame:/usr/local/cuda/lib64/libnppicom.so.9.0 [0x6664e]
========= Host Frame:/usr/local/cuda/lib64/libnppicom.so.9.0 [0x21c45]
========= Host Frame:/usr/local/cuda/lib64/libnppicom.so.9.0 (nppiEncodeOptimizeHuffmanScan_JPEG_8u16s_P3R + 0xb08) [0x27ec8]
========= Host Frame:./PymTest (jpegenc_encode + 0xe76) [0xacd2]
========= Host Frame:./PymTest (transcodeHevcToYuv + 0xc6) [0x3378]
========= Host Frame:./PymTest (hevcToYUV + 0xf9) [0x2d83]
========= Host Frame:./PymTest (test + 0x20a) [0xbc55]
========= Host Frame:/lib64/libpthread.so.0 [0x7e25]
========= Host Frame:/lib64/libc.so.6 (clone + 0x6d) [0xf834d]

========= Invalid global write of size 4
========= at 0x00001338 in void ACKernel<int=8, WARP_COM=0>(bool, unsigned int*, int*, int, int, short const *, int, int, uint2 const *, int, int, int)
========= by thread (9,5,0) in block (39,0,0)
========= Address 0x7ff212502474 is out of bounds
========= Saved host backtrace up to driver entry point at kernel launch time
========= Host Frame:/lib64/libcuda.so.1 (cuLaunchKernel + 0x2cd) [0x24c3ad]
========= Host Frame:/usr/local/cuda/lib64/libnppicom.so.9.0 [0x48eeb]
========= Host Frame:/usr/local/cuda/lib64/libnppicom.so.9.0 [0x6664e]
========= Host Frame:/usr/local/cuda/lib64/libnppicom.so.9.0 [0x21c45]
========= Host Frame:/usr/local/cuda/lib64/libnppicom.so.9.0 (nppiEncodeOptimizeHuffmanScan_JPEG_8u16s_P3R + 0xb08) [0x27ec8]
========= Host Frame:./PymTest (jpegenc_encode + 0xe76) [0xacd2]
========= Host Frame:./PymTest (transcodeHevcToYuv + 0xc6) [0x3378]
========= Host Frame:./PymTest (hevcToYUV + 0xf9) [0x2d83]
========= Host Frame:./PymTest (test + 0x20a) [0xbc55]
========= Host Frame:/lib64/libpthread.so.0 [0x7e25]
========= Host Frame:/lib64/libc.so.6 (clone + 0x6d) [0xf834d]

========= Invalid global write of size 4
========= at 0x00001338 in void ACKernel<int=8, WARP_COM=0>(bool, unsigned int*, int*, int, int, short const *, int, int, uint2 const *, int, int, int)
========= by thread (8,5,0) in block (39,0,0)
========= Address 0x7ff212402474 is out of bounds
========= Saved host backtrace up to driver entry point at kernel launch time
========= Host Frame:/lib64/libcuda.so.1 (cuLaunchKernel + 0x2cd) [0x24c3ad]
========= Host Frame:/usr/local/cuda/lib64/libnppicom.so.9.0 [0x48eeb]
========= Host Frame:/usr/local/cuda/lib64/libnppicom.so.9.0 [0x6664e]
========= Host Frame:/usr/local/cuda/lib64/libnppicom.so.9.0 [0x21c45]
========= Host Frame:/usr/local/cuda/lib64/libnppicom.so.9.0 (nppiEncodeOptimizeHuffmanScan_JPEG_8u16s_P3R + 0xb08) [0x27ec8]
========= Host Frame:./PymTest (jpegenc_encode + 0xe76) [0xacd2]
========= Host Frame:./PymTest (transcodeHevcToYuv + 0xc6) [0x3378]
========= Host Frame:./PymTest (hevcToYUV + 0xf9) [0x2d83]
========= Host Frame:./PymTest (test + 0x20a) [0xbc55]
========= Host Frame:/lib64/libpthread.so.0 [0x7e25]
========= Host Frame:/lib64/libc.so.6 (clone + 0x6d) [0xf834d]

[func:jpegenc_encode][line:640] > nppiEncodeOptimizeHuffmanScan_JPEG_8u16s_P3R(dstDCT, dstDCTStep, 0, nSs, nSe, nA >> 4, nA & 0x0f, pdScan, &nScanLength, hpCodesDC, hpTableDC, hpCodesAC, hpTableAC, apHuffmanDCTable, apHuffmanACTable, sizeOfROI, pJpegEncoderTemp) : 223678 microsec
[func:jpegenc_encode][line:640] > [nppiEncodeOptimizeHuffmanScan_JPEG_8u16s_P3R] Failed performing. result[-1000]
[func:jpegenc_encode][line:642] > nScanLength: [3]
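
To narrow down which buffer the kernel overran, one option is to log the base address and size of every device allocation and compare them with the address cuda-memcheck reports. The sketch below is a host-only helper for that comparison. DeviceAlloc and classifyAddress are hypothetical names, and the base addresses and sizes in the usage note are made up for illustration; only the buffer names pdScan and pJpegEncoderTemp come from the call shown in the log.

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical record of one device allocation, filled in by the caller
// right after each allocation succeeds.
struct DeviceAlloc {
    std::string name;    // label for the buffer, e.g. "pdScan"
    std::uint64_t base;  // device pointer value
    std::uint64_t size;  // allocation size in bytes
};

// Describes where addr lies relative to the recorded allocations:
// inside one of them, some bytes past the end of the closest one
// below it, or below all of them.
inline std::string classifyAddress(const std::vector<DeviceAlloc>& allocs,
                                   std::uint64_t addr) {
    const DeviceAlloc* nearest = nullptr;
    for (const DeviceAlloc& a : allocs) {
        const std::uint64_t end = a.base + a.size;
        if (addr >= a.base && addr < end) {
            return "inside " + a.name + " at offset " +
                   std::to_string(addr - a.base);
        }
        // Track the allocation whose end is closest below addr.
        if (end <= addr &&
            (nearest == nullptr || end > nearest->base + nearest->size)) {
            nearest = &a;
        }
    }
    if (nearest == nullptr) {
        return "below every recorded allocation";
    }
    return std::to_string(addr - (nearest->base + nearest->size)) +
           " bytes past the end of " + nearest->name;
}
```

With made-up allocations {"pdScan", 0x1000, 0x100} and {"pJpegEncoderTemp", 0x2000, 0x400}, classifyAddress(allocs, 0x2404) returns "4 bytes past the end of pJpegEncoderTemp". Feeding in the real allocation records and the faulting address from the report above would show which buffer, if any, the write landed just beyond.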