NVJPEG_STATUS_EXECUTION_FAILED on a simple JPEG compression example

I’m currently developing a program that JPEG-encodes live images from a camera. After a lot of fumbling around, errors, and testing, I reduced my program to a CLI that encodes a single interleaved RGB8 image (so no separate channel planes) and writes it to disk. However, nvjpegEncodeImage() consistently returns NVJPEG_STATUS_EXECUTION_FAILED. Unfortunately, I can’t find any documentation on what this error means, how to interpret it, or how to resolve it. Could I get some help? You can find the relevant code below:

JpegEncoderCuda::JpegEncoderCuda(const nvjpegInputFormat_t input_format, const uint32_t quality)
    : input_format_(input_format), quality_(static_cast<int>(quality)) {
  // initialize nvjpeg structures
  checkCudaErrors(nvjpegEncoderStateCreate(nv_handle_, &nv_enc_state_, nullptr));
  checkCudaErrors(nvjpegEncoderParamsCreate(nv_handle_, &nv_enc_params_, nullptr));
  checkCudaErrors(nvjpegEncoderParamsSetQuality(nv_enc_params_, quality_, nullptr));
  checkCudaErrors(nvjpegEncoderParamsSetSamplingFactors(nv_enc_params_, NVJPEG_CSS_444, nullptr));
  checkCudaErrors(nvjpegEncoderParamsSetOptimizedHuffman(nv_enc_params_, 0, nullptr));
}

void JpegEncoderCuda::encode(const ImageBase::SharedPtr& input_image, ImageData::SharedPtr jpeg_out) {
  nvjpegImage_t nv_image = nvFromImageBase(input_image);
  // Compress image

  // get compressed stream size
  size_t length;
  checkCudaErrors(nvjpegEncodeRetrieveBitstream(nv_handle_, nv_enc_state_, nullptr, &length, nullptr));
  // get stream itself
  // checkCudaErrors(cudaStreamSynchronize(stream_));
  std::unique_ptr<std::vector<uint8_t>> jpeg = std::make_unique<std::vector<uint8_t>>(length);
  checkCudaErrors(nvjpegEncodeRetrieveBitstream(nv_handle_, nv_enc_state_, jpeg->data(), &length, nullptr));
  // Return the image in jpeg_out
  // [...]
}

nvjpegImage_t JpegEncoderCuda::nvFromImageBase(const ImageBase::SharedPtr& input_image) {
  nvjpegImage_t nv_image;
  const uint8_t* img_data = input_image->getDataPtr();

  // Zero out all channels
  for (int i = 0; i < NVJPEG_MAX_COMPONENT; i++) nv_image.channel[i] = nullptr;
  for (int i = 0; i < NVJPEG_MAX_COMPONENT; i++) nv_image.pitch[i] = 0;

  // For interleaved images like this one, we put everything in channel 0. See the example here:
  // https://docs.nvidia.com/cuda/nvjpeg/index.html#using-nvjpegEncodeImage

  // NVIDIA didn't make the `channel` pointers const, likely because this struct is also used for outputting decoded
  // images. Since we only use it for encoding, this const_cast should be safe.
  nv_image.channel[0] = const_cast<unsigned char*>(img_data);
  nv_image.pitch[0] = input_image->getWidth() * 3;

  return nv_image;
}

No solution here, but I am encountering the exact same problem with virtually identical setup and function calls, which are all correct as far as I can tell.

I forgot to update this issue, but I’ve found the solution. You’re supposed to copy the image into a CUDA (device) buffer beforehand; I’m guessing the GPU otherwise has no access to the memory. I’ve modified the nvFromImageBase() function as follows:

struct nvImageData JpegEncoderCuda::nvFromImageBase(const ImageBase::SharedPtr& input_image) {
  struct nvImageData result;
  result.img_buffer = nullptr;
  result.img_buffer_size = input_image->getWidth() * input_image->getHeight() * NVJPEG_MAX_COMPONENT;

  // Copy image into CUDA buffer
  cudaError_t eCopy = cudaMalloc(reinterpret_cast<void**>(&result.img_buffer), result.img_buffer_size);
  if (cudaSuccess != eCopy) {
    throw RuntimeError(fmt::format("cudaMalloc failed when JPEG encoding: {}", cudaGetErrorString(eCopy)));
  }
  checkCudaErrors(
      cudaMemcpy(result.img_buffer, input_image->getDataPtr(), input_image->getSize(), cudaMemcpyHostToDevice));

  // Setup nv_image
  for (int i = 0; i < NVJPEG_MAX_COMPONENT; i++)
    result.nv_image.channel[i] = result.img_buffer + input_image->getWidth() * input_image->getHeight() * i;
  for (int i = 0; i < NVJPEG_MAX_COMPONENT; i++) result.nv_image.pitch[i] = input_image->getWidth();

  // For interleaved images like this one, we put everything in channel 0. See the example here:
  // https://docs.nvidia.com/cuda/nvjpeg/index.html#using-nvjpegEncodeImage
  result.nv_image.channel[0] = result.img_buffer;
  result.nv_image.pitch[0] = input_image->getWidth() * 3;

  return result;
}

This is of course not the only possible solution, and I’m likely doing more copies than I need, but you get the gist.
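
For anyone sizing the device buffer: the function above sets up planar pointers first and then overrides channel 0 for the interleaved case, so it may help to see the two layouts side by side. Below is a minimal host-only sketch, assuming 8-bit RGB with tightly packed rows (no padding). `ChannelLayout`, `interleavedRgb8Layout` and `planarRgb8Layout` are hypothetical names made up for illustration; they are not part of nvJPEG or of the code posted above.

```cpp
#include <cstddef>

// Hypothetical helper type (not an nvJPEG type): describes where each channel
// starts inside one contiguous buffer and how many bytes a row occupies.
struct ChannelLayout {
  std::size_t offset[4];    // byte offset of each channel from the buffer start
  std::size_t pitch[4];     // bytes per row of each channel (0 = channel unused)
  std::size_t total_bytes;  // bytes needed for the whole image
};

// Interleaved RGB8 (R0 G0 B0 R1 G1 B1 ...): only channel 0 is used and
// every row holds 3 bytes per pixel.
inline ChannelLayout interleavedRgb8Layout(std::size_t width, std::size_t height) {
  ChannelLayout layout{};  // value-initialised, so unused entries stay 0
  layout.pitch[0] = width * 3;
  layout.total_bytes = layout.pitch[0] * height;
  return layout;
}

// Planar RGB8 (all R, then all G, then all B): three channels, each a
// width * height plane with 1 byte per pixel.
inline ChannelLayout planarRgb8Layout(std::size_t width, std::size_t height) {
  ChannelLayout layout{};
  for (int c = 0; c < 3; ++c) {
    layout.offset[c] = width * height * c;
    layout.pitch[c] = width;
  }
  layout.total_bytes = width * height * 3;
  return layout;
}
```

Under these assumptions both layouts need width * height * 3 bytes, so allocating width * height * NVJPEG_MAX_COMPONENT as in the function above is harmless but larger than an RGB8 image requires.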

Thank you!

If you expect to do a lot of copies from CPU to GPU, I suggest you try using pinned memory for optimal performance.

See cudaMallocHost in the CUDA Runtime API section of the CUDA Toolkit Documentation.