Profiling NvEnc?

nick · November 24, 2020, 11:23pm

I’m adding video encoding to an existing CUDA application and trying to profile it in Nsight as the encoding of a frame takes a bit longer than I need it to. But when I profile in Nsight, there is no sign of any of the NvEnc calls at all. All I see are “alert by thread id” waits while the encoding is happening. There aren’t even device to host memory copies for the encoded data moving from the gpu to the host - I could see NvEnc, since it happens on a specific part of the GPU, being a bit “special”, but seeing nothing at all in the profiler is pretty surprising.

Am I doing something wrong or does NvEnc just not play nice with profiling? All of the other CUDA things I have going on are showing up, including memory copies, npp resizing, and jpeg compression.

jasoncohen · December 2, 2020, 5:30pm

We’re still working on getting instrumentation for the Video Codec SDK into Nsight Systems. This should be available in Q1. We’ll show full API trace and also all the activity of the NVENC and NVDEC hardware units. I’ll reply to this thread when we ship support for NVENC/NVDEC trace.

However, I’m concerned about your comment that the device-to-host memory copies aren’t showing up – you’re doing that through the CUDA API, not the Video API, right? All CUDA activity should be showing up in Nsight Systems. Can you confirm that you’re using a CUDA API call to copy data from device to host, and reply to let us know if either the CUDA API call or the GPU-side work (which may be asynchronous) is not showing up in the timeline? It would be a bug if that event is getting lost.

nick · December 2, 2020, 6:06pm

It may be that NvEnc works differently than I expect. I’m starting with the NvEncoder example/class that ships in the video sdk samples (GitHub - NVIDIA/video-sdk-samples: Samples demonstrating how to use various APIs of NVIDIA Video Codec SDK).

In my setup, I do an HtoD memcpy and then use NPP to resize the image, and then a DtoD memcpy via NvEncoderCuda::CopyToDeviceFrame to move the resized image to the encoder’s next input frame. Both the HtoD and DtoD memcpy prior to encoding show up, along with an NVTX range marker I added around the call to NvEncoder::EncodeFrame. EncodeFrame hands back host memory with an encoded frame (not necessarily the one I’ve passed in of course).

But I would expect there to be a DtoH memcpy to get the encoded data from NVEnc to host memory when nvEncLockBitstream is called and the encoded frame is made available to the host in the bitstreamBufferPtr field. I don’t explicitly make a DtoH copy, but I expected that the API must do one to get the data out of NVEnc, if that makes sense?

I’m new to both CUDA and NVEnc, so it may just be a misunderstanding about how NVEnc gets the encoded frame back from the GPU to the host. Here’s a pic of one call to EncodeFrame:

abdo.babukr1 · May 11, 2022, 1:42am

Were features ever added to Nsight to allow us to profile video API calls?

hwilper · May 11, 2022, 3:57pm

@tcourtney can you respond to this?

tcourtney · May 11, 2022, 5:03pm

Yes, Nsight Systems can profile video API calls for encoding and decoding with the Video Codec SDK, and for encoding and decoding with the nvJPEG API.

To do this, download the current version of Nsight Systems from here and enable the “NvVideo” API tracing to profile the Video Codec SDK and nvJPEG APIs.
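On the command line, this tracing is selected through the `nvvideo` value of the `--trace` option. A minimal sketch of assembling such a run follows; the helper `build_nsys_command`, the binary name `./encoder_app`, and the report name `encode_report` are placeholders for illustration, not part of Nsight Systems:

```python
import shlex


def build_nsys_command(app_argv, report="encode_report",
                       traces=("cuda", "nvtx", "nvvideo")):
    """Return the argv list for an `nsys profile` run with video API tracing.

    `app_argv` is the command line of the application to profile
    (hypothetical here); `traces` is joined into one --trace value.
    """
    return [
        "nsys", "profile",
        "--trace=" + ",".join(traces),
        "--output=" + report,
        *app_argv,
    ]


# Hypothetical application binary; substitute your own.
cmd = build_nsys_command(["./encoder_app"])
print(shlex.join(cmd))
```

Keeping `cuda` and `nvtx` in the trace list alongside `nvvideo` means the memcpys and any NVTX ranges around the encode call land on the same timeline as the video API trace.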

abdo.babukr1 · May 11, 2022, 6:03pm

We see from the Nsight user guide that support for --trace=nvvideo starts with version 2020.4.1. Since we are using the nsys command line inside a Docker container, we will need to update our container, which currently uses Nsight Systems version 2020.2.1.
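A container build could guard against this kind of mismatch with a small version check. This is only a sketch: the helper names are made up, and it assumes the version string contains a `major.minor.patch` triple such as `2020.2.1`, for example taken from the output of `nsys --version`:

```python
import re

# First release with --trace=nvvideo, according to the user guide cited above.
MIN_NVVIDEO_VERSION = (2020, 4, 1)


def parse_nsys_version(text):
    """Extract the first major.minor.patch triple from a version string."""
    match = re.search(r"(\d+)\.(\d+)\.(\d+)", text)
    if match is None:
        raise ValueError("no version number found in: %r" % text)
    return tuple(int(part) for part in match.groups())


def supports_nvvideo(version_text):
    """True if the given Nsight Systems version is new enough for nvvideo tracing."""
    return parse_nsys_version(version_text) >= MIN_NVVIDEO_VERSION


print(supports_nvvideo("2020.2.1"))  # the version in our container
print(supports_nvvideo("2020.4.1"))  # the minimum version from the user guide
```

Comparing integer tuples rather than raw strings avoids ordering mistakes once a component reaches two digits.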

Thank you

hwilper · May 11, 2022, 8:22pm

You really should, that one is more than a year and a half out of date now.