NVJPEG -- a few questions abour decoupled decoding

levicki · November 26, 2019, 4:45pm

I used simple decode and it works great, however I wanted to avoid (assumed) setup/teardown cost for each decode and preserve as much context across decodes so I started looking into decoupled decoding. API documentation could definitely be improved.

Is there a companion destroy function for nvjpegDecoderStateCreate()? If not, when and how is this object supposed to be destroyed?
Is there a companion detach function for nvjpegStateAttachPinnedBuffer() and nvjpegStateAttachDeviceBuffer()? If not, is it safe to destroy them using nvjpegBufferPinnedDestroy() and nvjpegBufferDeviceDestroy() and when is it safe to do so?
Is it safe to use nvjpegGetImageInfo() on the same NVJPEG handle when using decoupled decoding or we must use stream API for that as well?
On initialization of my program I am creating a CUDA stream for decoder operations using cudaStreamCreateWithFlags(). I am calling cudaStreamDestroy() on program exit. When this happens under debugger without anything else happening with this stream there is an exception (access violation) being thrown by cudaStreamDestroy(). I double-checked my code and I am sure I am not trying to destroy twice. Could this be a bug in cudaStreamDestroy()?

levicki · December 7, 2019, 1:00pm

Is there anyone from NVIDIA who can clarify?

levicki · December 10, 2019, 10:48am

BUMP!

TomNVIDIA · December 10, 2019, 6:32pm

Hi Igor,

I am looking for a technical resource for your questions. Please stand by.

Thanks,
Tom

TomNVIDIA · December 11, 2019, 7:54pm

Hi Igor,

I received the following answers from one of our engineers.

Use nvjpegJpegStateDestroy for this.

There is no companion detach function for nvjpegStateAttachPinnedBuffer() and nvjpegStateAttachDeviceBuffer(). These functions only pass reference to the work buffers.
The buffers should be destroyed using nvjpegBufferPinnedDestroy() and nvjpegBufferDeviceDestroy(). This can be called at the time of destroying nvjpegJpegState_t(using nvjpegJpegStateDestroy).

It should be safe to use that.

It would be hard to comment without looking at the code. One thing to try would be to set the CUDA_LAUNCH_BLOCKING=1(environment variable) and see if the error goes away. Also what it is return code that for cudaStreamDestroy()?

If user can share a code snippet and images, the engineer can dig deeper.

Best,
Tom

levicki · December 13, 2019, 1:33pm

Hello Tom,

Thanks for the response from the engineer.

As for CUDA issue the code was literally something like this:

int main()
{
    cudaStream_t Stream;

    cudaStreamCreateWithFlags(&Stream, cudaStreamNonBlocking);

    // other code that does not even use the stream yet

    if (Stream != nullptr) {
        cudaStreamDestroy(Stream); // access violation here when run under debugger only
    }
}

TomNVIDIA · December 13, 2019, 6:08pm

Hi Igor,

The engineer stated that the code looks incomplete.
Line no: 10 m_Stream not created and trying to destroy.

The nvJPEG decoder sample will resolve his issue:

https://github.com/NVIDIA/CUDALibrarySamples/tree/master/nvJPEG/nvJPEG-Decoder

This will provide the ability test the single image or decoupled API usage.

Best regards,
Tom

levicki · December 13, 2019, 7:04pm

Sorry, that was a typo when I copied and edited the code – the actual code uses same variable name everywhere. I edited the code to reflect that.

levicki · December 15, 2019, 3:22pm

@TomK@NVIDIA

Tom, I checked the code sample and now the process of cleanup and order of destroy operations is clear to me.

However, I believe that belongs to API documentation, not the code sample. I hope NVIDIA will consider improving the documentation for the next CUDA release. Documentation also does not explain at which points during decode cudaStreamSynchronize() must be called to ensure correct results.

Your code demonstrates several decoding approaches in a single method, so it’s hard to separate what is needed for which decoding method and why.

Based on the API parameters, I am assuming that I need the following cudaStreamSynchronize() calls:

Before calling nvjpegDecodeJpegTransferToDevice() (rationale: that’s the first nvjpeg call in which I am passing a CUDA stream handle to it)
After calling nvjpegDecodeJpegTransferToDevice() and before calling nvjpegDecodeJpegDevice()
After calling nvjpegDecodeJpegDevice() and before passing the result to OpenGL interop

Are there any other places that I missed? Is perhaps the first sync superfluous?

Now about CMYK decoding – it seems broken to me (yes, I enabled CMYK decoding in params).

First, nvjpegJpegStreamGetChromaSubsampling() is returning NVJPEG_CSS_UNKNOWN for CMYK image.

The actual image has this information in it:

*** Marker: APP14 (xFFEE) ***
  OFFSET: 0x0008D719
  Length            = 14
  DCTEncodeVersion  = 100
  APP14Flags0       = 16384
  APP14Flags1       = 0
  ColorTransform    = 2 [YCCK]

*** Marker: SOF0 (Baseline DCT) (xFFC0) ***
  OFFSET: 0x0008D7AF
  Frame header length = 20
  Precision = 8
  Number of Lines = 4773
  Samples per Line = 6000
  Image Size = 6000 x 4773
  Raw Image Orientation = Landscape
  Number of Img components = 4
    Component[1]: ID=0x01, Samp Fac=0x11 (Subsamp 1 x 1), Quant Tbl Sel=0x00 (Y)
    Component[2]: ID=0x02, Samp Fac=0x11 (Subsamp 1 x 1), Quant Tbl Sel=0x01 (Cb)
    Component[3]: ID=0x03, Samp Fac=0x11 (Subsamp 1 x 1), Quant Tbl Sel=0x01 (Cr)
    Component[4]: ID=0x04, Samp Fac=0x11 (Subsamp 1 x 1), Quant Tbl Sel=0x00 (K)

To me that means subsampling is known (i.e. there is no subsampling on any of the components), not unknown.

Moreover colorspace information is not available and it seems NVJPEG assumes that colorspace is always YCbCr. For this particular image colorspace is recorded in APP14 marker as YCCK (Adobe specific CMYK).

Moreover, nvjpegJpegStreamGetFrameDimensions() only initializes Widths[0] and Heights[0] for CMYK image even though there are 4 components (of the same size in this case).

If I request decoding with NVJPEG_OUTPUT_BGRI the decoded result is wrong (and I don’t mean colors differ a little bit because ICC profile conversion was not used).

If I request decoding with NVJPEG_OUTPUT_UNCHANGED I must use four separate buffers. Why there is no support to return unchanged components as interleaved?

Finally, in the unchanged mode, the only channel that is passed correctly without transformation is K, the other channels are all different from the original.

TomNVIDIA · December 16, 2019, 5:35pm

Hi Igor,

I have asked the engineer to comment on your questions.

Best,
Tom

levicki · December 16, 2019, 6:49pm

@TomK@NVIDIA

I submitted a bug report #2788539 for broken CMYK decoding.

I sent an email with a minimal reproducible case attached to CUDAIssues[at]nvidia[dot]com.

Let me know if there is anything else I can do.

TomNVIDIA · December 16, 2019, 6:53pm

Thanks, will do.

levicki · December 17, 2019, 9:34am

@TomK@NVIDIA

I must admit I was wrong on NVJPEG_OUTPUT_UNCHANGED – it gives correct results. However, NVJPEG_OUTPUT_BGRI and NVJPEG_OUTPUT_RGBI produce wrong results. I provided additional details and steps to the engineer working on the bug case.

TomNVIDIA · December 17, 2019, 4:24pm

Hi Igor,

Here is the response from the engineer before you posted the additional details.

===============================================================================

[i]Below is the sequence of decoupled API if you are decoding a single image.

nvjpegDecodeJpegHost()
nvjpegDecodeJpegTransferToDevice()
nvjpegDecodeJpegDevice()
cudaStreamSynchronize()

This will work as long as the same cudaStream_t parameter is passed to all API

If you are decoding the a batch of images, then the sequence of calls should be as described in the new sample on github under pipelined mode:
https://github.com/NVIDIA/CUDALibrarySamples/blob/master/nvJPEG/nvJPEG-Decoder/nvjpegDecoder.cpp#L56-L82

By having cudaStreamSynchronize after nvjpegDecodeJpegHost, we are able to start the host stage if image ‘n’ when the gpu stage of image ‘n-1’ is in progress
[/i]

Now about CMYK decoding – it seems broken to me (yes, I enabled CMYK decoding in params).
Thanks for creating bugs - 2788539 and engineer working on this.

levicki · December 17, 2019, 5:39pm

Hi TomK@NVIDIA,

Thanks for the input.

So, I only need synchronize at the end provided everything in the application uses the same CUDA stream? Good to know.

But I need some additional clarification :-)

If I want to run a CUDA kernel to process decoded image on device before displaying it, do I need to do have cudaStreamSynchronize() after nvjpegDecodeJpegDevice() and before said kernel call or is this synchronized automatically?
I am using cudaGraphicsGLRegisterBuffer(), cudaGraphicsMapResources(), and cudaGraphicsResourceGetMappedPointer() to get access to an OpenGL 2D texture which I then pass to NVJPEG as a destination for decoding. After I call nvjpegDecodeJpegDevice(), do I need cudaStreamSynchronize() call if what follows is cudaGraphicsUnmapResources() assuming they use the same CUDA stream?
It is not clear from the NVJPEG documentation whether NVJPEG API can use default CUDA stream (i.e. is it ok to pass stream = 0 to NVJPEG functions)?

For the sake of clarity, the sequence in my code is:

01. cudaGraphicsGLRegisterBuffer()
02. cudaGraphicsMapResources()
03. cudaGraphicsResourceGetMappedPointer()
04. nvjpegDecodeJpegHost()
05. nvjpegDecodeJpegTransferToDevice()
06. nvjpegDecodeJpegDevice()
07. (optional)cuda_kernel()
08. cudaGraphicsUnmapResources()
09. glBindBuffer(GL_PIXEL_UNPACK_BUFFER, buffer)
10. glTexImage2D()
11. glBindBuffer(GL_PIXEL_UNPACK_BUFFER, 0)
12. glGenerateMipmap()

Am I right to assume that if everything above is using one stream I might not need a cudaStreamSynchronize() at all?

AlexBaraba · July 4, 2023, 3:57pm

Hi @levicki, did you find the answer to your question?

Topic		Replies	Views
NVJPEG issues and inconsistencies with transcoding GPU-Accelerated Libraries nvjpeg	15	2358	June 11, 2023
Error generated while running the code after connecting the camera Jetson Xavier NX gstreamer , nvbugs	45	1287	January 2, 2024
Nvjpeg_status_execution_failed Jetson AGX Xavier graphics	5	653	February 15, 2024
NVDEC hardware CUDA_ERROR_INVALID_VALUE cuvidDecodePicture call CUDA Programming and Performance	0	2644	March 28, 2019
Grayscale lossless JPEG encoding with nvJPEG GPU-Accelerated Libraries nvjpeg	2	59	June 24, 2025
EGLStream(CUDA) -> cv::cuda::GpuMat using Argus & nppi Computer Vision & Image Processing opencv , cuda	16	1837	August 31, 2023
Nvjpeg_status_execution_failed GPU-Accelerated Libraries nvjpeg	3	471	February 2, 2024
NVJPEG_STATUS_EXECUTION_FAILED on a simple JPEG compression example GPU-Accelerated Libraries nvjpeg	5	2040	October 19, 2021
gstreamer NVMM <-> opencv gpuMat Jetson TX2 opencv	58	21800	October 18, 2021
fill CUVIDPICPARAMS problem CUDA Programming and Performance	1	1926	March 28, 2019

NVJPEG -- a few questions abour decoupled decoding

Related topics