NVJPEG -- a few questions abour decoupled decoding

I used simple decode and it works great, however I wanted to avoid (assumed) setup/teardown cost for each decode and preserve as much context across decodes so I started looking into decoupled decoding. API documentation could definitely be improved.

  1. Is there a companion destroy function for nvjpegDecoderStateCreate()? If not, when and how is this object supposed to be destroyed?

  2. Is there a companion detach function for nvjpegStateAttachPinnedBuffer() and nvjpegStateAttachDeviceBuffer()? If not, is it safe to destroy them using nvjpegBufferPinnedDestroy() and nvjpegBufferDeviceDestroy() and when is it safe to do so?

  3. Is it safe to use nvjpegGetImageInfo() on the same NVJPEG handle when using decoupled decoding or we must use stream API for that as well?

  4. On initialization of my program I am creating a CUDA stream for decoder operations using cudaStreamCreateWithFlags(). I am calling cudaStreamDestroy() on program exit. When this happens under debugger without anything else happening with this stream there is an exception (access violation) being thrown by cudaStreamDestroy(). I double-checked my code and I am sure I am not trying to destroy twice. Could this be a bug in cudaStreamDestroy()?

Is there anyone from NVIDIA who can clarify?

BUMP!

Hi Igor,

I am looking for a technical resource for your questions. Please stand by.

Thanks,
Tom

Hi Igor,

I received the following answers from one of our engineers.

Use nvjpegJpegStateDestroy for this.

There is no companion detach function for nvjpegStateAttachPinnedBuffer() and nvjpegStateAttachDeviceBuffer(). These functions only pass reference to the work buffers.
The buffers should be destroyed using nvjpegBufferPinnedDestroy() and nvjpegBufferDeviceDestroy(). This can be called at the time of destroying nvjpegJpegState_t(using nvjpegJpegStateDestroy).

It should be safe to use that.

It would be hard to comment without looking at the code. One thing to try would be to set the CUDA_LAUNCH_BLOCKING=1(environment variable) and see if the error goes away. Also what it is return code that for cudaStreamDestroy()?

If user can share a code snippet and images, the engineer can dig deeper.

Best,
Tom

Hello Tom,

Thanks for the response from the engineer.

As for CUDA issue the code was literally something like this:

int main()
{
    cudaStream_t Stream;

    cudaStreamCreateWithFlags(&Stream, cudaStreamNonBlocking);

    // other code that does not even use the stream yet

    if (Stream != nullptr) {
        cudaStreamDestroy(Stream); // access violation here when run under debugger only
    }
}

Hi Igor,

The engineer stated that the code looks incomplete.
Line no: 10 m_Stream not created and trying to destroy.

The nvJPEG decoder sample will resolve his issue:

https://github.com/NVIDIA/CUDALibrarySamples/tree/master/nvJPEG/nvJPEG-Decoder

This will provide the ability test the single image or decoupled API usage.

Best regards,
Tom

Sorry, that was a typo when I copied and edited the code – the actual code uses same variable name everywhere. I edited the code to reflect that.

@TomK@NVIDIA

Tom, I checked the code sample and now the process of cleanup and order of destroy operations is clear to me.

However, I believe that belongs to API documentation, not the code sample. I hope NVIDIA will consider improving the documentation for the next CUDA release. Documentation also does not explain at which points during decode cudaStreamSynchronize() must be called to ensure correct results.

Your code demonstrates several decoding approaches in a single method, so it’s hard to separate what is needed for which decoding method and why.

Based on the API parameters, I am assuming that I need the following cudaStreamSynchronize() calls:

  1. Before calling nvjpegDecodeJpegTransferToDevice() (rationale: that’s the first nvjpeg call in which I am passing a CUDA stream handle to it)
  2. After calling nvjpegDecodeJpegTransferToDevice() and before calling nvjpegDecodeJpegDevice()
  3. After calling nvjpegDecodeJpegDevice() and before passing the result to OpenGL interop

Are there any other places that I missed? Is perhaps the first sync superfluous?

Now about CMYK decoding – it seems broken to me (yes, I enabled CMYK decoding in params).

First, nvjpegJpegStreamGetChromaSubsampling() is returning NVJPEG_CSS_UNKNOWN for CMYK image.

The actual image has this information in it:

*** Marker: APP14 (xFFEE) ***
  OFFSET: 0x0008D719
  Length            = 14
  DCTEncodeVersion  = 100
  APP14Flags0       = 16384
  APP14Flags1       = 0
  ColorTransform    = 2 [YCCK]

*** Marker: SOF0 (Baseline DCT) (xFFC0) ***
  OFFSET: 0x0008D7AF
  Frame header length = 20
  Precision = 8
  Number of Lines = 4773
  Samples per Line = 6000
  Image Size = 6000 x 4773
  Raw Image Orientation = Landscape
  Number of Img components = 4
    Component[1]: ID=0x01, Samp Fac=0x11 (Subsamp 1 x 1), Quant Tbl Sel=0x00 (Y)
    Component[2]: ID=0x02, Samp Fac=0x11 (Subsamp 1 x 1), Quant Tbl Sel=0x01 (Cb)
    Component[3]: ID=0x03, Samp Fac=0x11 (Subsamp 1 x 1), Quant Tbl Sel=0x01 (Cr)
    Component[4]: ID=0x04, Samp Fac=0x11 (Subsamp 1 x 1), Quant Tbl Sel=0x00 (K)

To me that means subsampling is known (i.e. there is no subsampling on any of the components), not unknown.

Moreover colorspace information is not available and it seems NVJPEG assumes that colorspace is always YCbCr. For this particular image colorspace is recorded in APP14 marker as YCCK (Adobe specific CMYK).

Moreover, nvjpegJpegStreamGetFrameDimensions() only initializes Widths[0] and Heights[0] for CMYK image even though there are 4 components (of the same size in this case).

If I request decoding with NVJPEG_OUTPUT_BGRI the decoded result is wrong (and I don’t mean colors differ a little bit because ICC profile conversion was not used).

If I request decoding with NVJPEG_OUTPUT_UNCHANGED I must use four separate buffers. Why there is no support to return unchanged components as interleaved?

Finally, in the unchanged mode, the only channel that is passed correctly without transformation is K, the other channels are all different from the original.

Hi Igor,

I have asked the engineer to comment on your questions.

Best,
Tom

@TomK@NVIDIA

I submitted a bug report #2788539 for broken CMYK decoding.

I sent an email with a minimal reproducible case attached to CUDAIssues[at]nvidia[dot]com.

Let me know if there is anything else I can do.

Thanks, will do.

@TomK@NVIDIA

I must admit I was wrong on NVJPEG_OUTPUT_UNCHANGED – it gives correct results. However, NVJPEG_OUTPUT_BGRI and NVJPEG_OUTPUT_RGBI produce wrong results. I provided additional details and steps to the engineer working on the bug case.

Hi Igor,

Here is the response from the engineer before you posted the additional details.

===============================================================================

[i]Below is the sequence of decoupled API if you are decoding a single image.

nvjpegDecodeJpegHost()
nvjpegDecodeJpegTransferToDevice()
nvjpegDecodeJpegDevice()
cudaStreamSynchronize()

This will work as long as the same cudaStream_t parameter is passed to all API

If you are decoding the a batch of images, then the sequence of calls should be as described in the new sample on github under pipelined mode:
https://github.com/NVIDIA/CUDALibrarySamples/blob/master/nvJPEG/nvJPEG-Decoder/nvjpegDecoder.cpp#L56-L82

By having cudaStreamSynchronize after nvjpegDecodeJpegHost, we are able to start the host stage if image ‘n’ when the gpu stage of image ‘n-1’ is in progress
[/i]

Now about CMYK decoding – it seems broken to me (yes, I enabled CMYK decoding in params).
Thanks for creating bugs - 2788539 and engineer working on this.

Hi TomK@NVIDIA,

Thanks for the input.

So, I only need synchronize at the end provided everything in the application uses the same CUDA stream? Good to know.

But I need some additional clarification :-)

  1. If I want to run a CUDA kernel to process decoded image on device before displaying it, do I need to do have cudaStreamSynchronize() after nvjpegDecodeJpegDevice() and before said kernel call or is this synchronized automatically?

  2. I am using cudaGraphicsGLRegisterBuffer(), cudaGraphicsMapResources(), and cudaGraphicsResourceGetMappedPointer() to get access to an OpenGL 2D texture which I then pass to NVJPEG as a destination for decoding. After I call nvjpegDecodeJpegDevice(), do I need cudaStreamSynchronize() call if what follows is cudaGraphicsUnmapResources() assuming they use the same CUDA stream?

  3. It is not clear from the NVJPEG documentation whether NVJPEG API can use default CUDA stream (i.e. is it ok to pass stream = 0 to NVJPEG functions)?

For the sake of clarity, the sequence in my code is:

01. cudaGraphicsGLRegisterBuffer()
02. cudaGraphicsMapResources()
03. cudaGraphicsResourceGetMappedPointer()
04. nvjpegDecodeJpegHost()
05. nvjpegDecodeJpegTransferToDevice()
06. nvjpegDecodeJpegDevice()
07. (optional)cuda_kernel()
08. cudaGraphicsUnmapResources()
09. glBindBuffer(GL_PIXEL_UNPACK_BUFFER, buffer)
10. glTexImage2D()
11. glBindBuffer(GL_PIXEL_UNPACK_BUFFER, 0)
12. glGenerateMipmap()

Am I right to assume that if everything above is using one stream I might not need a cudaStreamSynchronize() at all?