Profiling NvEnc?

I’m adding video encoding to an existing CUDA application and trying to profile it in Nsight, as encoding a frame takes a bit longer than I need it to. But when I profile in Nsight, there is no sign of any of the NvEnc calls at all. All I see are “alert by thread id” waits while the encoding is happening. There aren’t even device-to-host memory copies for the encoded data moving from the GPU to the host. Since NvEnc runs on a dedicated part of the GPU, I could see it being a bit “special”, but seeing nothing at all in the profiler is pretty surprising.

Am I doing something wrong, or does NvEnc just not play nice with profiling? All of the other CUDA things I have going on are showing up, including memory copies, NPP resizing, and JPEG compression.

We’re still working on getting instrumentation for the Video Codec SDK into Nsight Systems. This should be available in Q1. We’ll show full API trace and also all the activity of the NVENC and NVDEC hardware units. I’ll reply to this thread when we ship support for NVENC/NVDEC trace.

However, I’m concerned about your comment that the device-to-host memory copies aren’t showing up – you’re doing that through the CUDA API, not the Video API, right? All CUDA activity should be showing up in Nsight Systems. Can you confirm that you’re using a CUDA API call to copy data from device to host, and reply to let us know if either the CUDA API call or the GPU-side work (which may be asynchronous) is not showing up in the timeline? It would be a bug if that event is getting lost.

It may be that NvEnc works differently than I expect. I’m starting with the NvEncoder example/class that ships in the Video SDK samples (https://github.com/NVIDIA/video-sdk-samples).

In my setup, I do an HtoD memcpy and then use NPP to resize the image, and then a DtoD memcpy via NvEncoderCuda::CopyToDeviceFrame to move the resized image to the encoder’s next input frame. Both the HtoD and DtoD memcpy prior to encoding show up, along with an NVTX range marker I added around the call to NvEncoder::EncodeFrame. EncodeFrame hands back host memory with an encoded frame (not necessarily the one I’ve passed in of course).

But I would expect there to be a DtoH memcpy to get the encoded data from NVEnc to host memory when nvEncLockBitstream is called and the encoded frame becomes available to the host through the bitstreamBufferPtr field. I don’t explicitly make a DtoH copy myself, but I expected that the API must make one internally to get the data out of NVEnc, if that makes sense?
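While the encoder itself is missing from the timeline, a rough per-frame number can still be taken on the host by timing the same call that the NVTX range wraps. The following is only a minimal sketch under stated assumptions: `time_ms`, `encode_frame_stub`, and `measure_one_frame_ms` are hypothetical names invented for illustration, and the stub just sleeps so the example runs without a GPU. In a real application the lambda would call NvEncoder::EncodeFrame instead.

```cpp
#include <cassert>
#include <chrono>
#include <thread>
#include <utility>

// Runs `fn` once and returns the elapsed host wall-clock time in milliseconds.
template <typename Fn>
double time_ms(Fn&& fn) {
    const auto start = std::chrono::steady_clock::now();
    std::forward<Fn>(fn)();
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(stop - start).count();
}

// Hypothetical stand-in for the real encode call (assumption: in the actual
// application this would be NvEncoder::EncodeFrame). It only sleeps.
void encode_frame_stub() {
    std::this_thread::sleep_for(std::chrono::milliseconds(5));
}

// Times a single call to the stand-in and returns the result in milliseconds.
double measure_one_frame_ms() {
    const double ms = time_ms([] { encode_frame_stub(); });
    assert(ms >= 0.0);  // steady_clock is monotonic, so this should never be negative
    return ms;
}
```

This measures only how long the host thread spends inside the call, including any time it is blocked waiting. It says nothing about what the NVENC hardware unit is doing during that time, so it is a stopgap rather than a replacement for a hardware trace.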

I’m new to both CUDA and NVEnc, so it may just be a misunderstanding about how NVEnc gets the encoded frame back from the GPU to the host. Here’s a pic of one call to EncodeFrame: