Large gaps between API calls and execution

Hey.

I am working on a Jetson TX2 (Linux). When profiling my code, I noticed big chunks of idle time on the GPU where nothing seems to be computed or transferred. I know that the profiler adds a small overhead, but the gaps seem huge, around 10-15 ms.

There is no CPU operation happening at this time, and I have moved most of the processing onto streams.

What do you think could be a possible reason for this behaviour?

I have attached a picture of the same.

What are the current power settings during this image?

Have you maxed out your clocks?

https://www.jetsonhacks.com/2017/03/25/nvpmodel-nvidia-jetson-tx2-development-kit/

https://devtalk.nvidia.com/default/topic/1031211/jetson-tx2/maximizing-performances-of-tx2/post/5246989/#5246989

Hey mnicely!

I did do that but the gaps still remain.

For example, the following snippet of code produces a profiler output like the picture attached. Is it normal for cudaMemsets to have an average of 60-100 microseconds between them?

// Reset device
cudaMemsetAsync(d_pointCloudPositions, 0, sizeof(float) * maxPoints * 3,
                s_process);
cudaMemsetAsync(d_pointCloudColours, 0, sizeof(byte) * maxPoints * 3,
                s_process);
cudaMemsetAsync(d_colourImage, 0, sizeof(byte) * maxPoints * 3, s_process);
cudaMemsetAsync(d_transformedPointCloud, 0, sizeof(float) * maxPoints * 3,
                s_process);
cudaMemsetAsync(d_roiDepthCount, 0, sizeof(int),
                s_process); // Reset ROI counter

// Process
CudaProcess(processParams, d_depthImage, d_yuyvImage, d_colourImage,
            d_pointCloudPositions, d_pointCloudColours,
            d_transformedPointCloud, d_roiDepthCount, d_roiDepths,
            d_trackerPointCloud, d_trackerMaskPoints, d_trackerColourImage,
            d_trackerGrayscaleImage, s_process);
cudaEventRecord(e_process, s_process);
cudaEventSynchronize(e_process);

Edit: byte refers to uint8_t, i.e. 1 byte of memory.

First, from the code snippet it looks like you’re calling all the cudaMemsetAsyncs from the same stream. There’s no reason you can’t call them from separate streams. That might help a little. Check out cudaMemsetOneStream.png and cudaMemsetMultipleStream.png.

Image cudaMemsetOneStream.png shows 5 cudaMemsetAsyncs in one stream.
Image cudaMemsetMultipleStream.png shows 5 cudaMemsetAsyncs in 5 streams.

I created a toy sample to test on my Titan V. I’m setting five 16 MB arrays.
The first memset in each case takes roughly 22 us, while the remaining ones take roughly 5 us.
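
In rough form, the sample does something like this (a simplified sketch: the names, allocation/teardown, and timing harness here are illustrative, not the exact test code):

// Sketch: clear five 16 MB buffers, each via cudaMemsetAsync in its own
// stream. For the single-stream comparison, pass streams[0] to every call.
#include <cuda_runtime.h>

int main() {
    const int numBuffers = 5;
    const size_t bytes = 16u << 20; // 16 MB per buffer

    float *d_buf[numBuffers];
    cudaStream_t streams[numBuffers];

    for (int i = 0; i < numBuffers; ++i) {
        cudaMalloc(&d_buf[i], bytes);
        cudaStreamCreate(&streams[i]);
    }

    // One cudaMemsetAsync per stream; the host-side API call still has to
    // complete before the corresponding device-side memset can run.
    for (int i = 0; i < numBuffers; ++i) {
        cudaMemsetAsync(d_buf[i], 0, bytes, streams[i]);
    }
    cudaDeviceSynchronize();

    for (int i = 0; i < numBuffers; ++i) {
        cudaStreamDestroy(streams[i]);
        cudaFree(d_buf[i]);
    }
    return 0;
}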

Looking at cudaMemsetOneStream.png, it’s important to note that the actual cudaMemset work can’t occur until the cudaMemsetAsync API call is complete. The amount of time it takes to complete this API call is related to CPU and GPU performance. I would expect the same calls on a TX2 to take longer than on my i7 @ 5 GHz + Titan V. How much longer, I can’t say, because I don’t have one to test.

Guesstimating from your image, your first call takes 100 us and your fifth takes 30 us. I can convince myself that seems reasonable.

I’m not sure what sizes you’re setting, but if they’re small, then in combination with the power of a TX2 those times don’t seem crazy.

Now, how do we improve this???

If the kernels are indeed launching as soon as each API call is finished, additional streams probably won’t help.

If the issue is the overhead of cudaMemsetAsync itself, well then, can it be removed?

Is it possible to clear the data in your arrays at the beginning of each kernel?

Let’s say the kernel runs a stochastic number of times before you can reset the data. In the host code, where you would have executed your cudaMemsetAsyncs, have a reset boolean that you pass to the kernel. If true, then reset your data.

Another advantage of this is that you’re only loading your data from global memory to registers once! With your current method you load from global, clear the data, store back to global, and then launch your kernel and reload.
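
A rough sketch of the idea (the kernel name, the single data array, and the processing step are placeholders, not your actual CudaProcess arguments):

// Hypothetical kernel: each thread optionally clears its element before
// processing, so the separate cudaMemsetAsync pass (and its extra
// global-memory round trip) goes away.
__global__ void processKernel(float *data, int n, bool reset)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx >= n) return;

    // Load once from global memory, or start from zero if a reset was requested.
    float value = reset ? 0.0f : data[idx];

    // ... do the actual processing on 'value' here ...

    data[idx] = value;
}

// Host side, in place of the block of cudaMemsetAsyncs:
// bool reset = ...; // decide per iteration
// processKernel<<<blocks, threads, 0, s_process>>>(d_data, n, reset);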

Thanks mnicely!

That did help me understand the limitations. My maxPoints is basically an image of size 640x480, i.e. 307200.

After changing all the memsets to a kernel and calling it, the gaps between the memsets disappeared. There seems to be a very big overhead before and after each memset call on the GPU, as shown in lagtimes.png.
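
Roughly, the reset kernel I’m describing looks something like this (a simplified sketch, not the exact code; the launch configuration is illustrative):

// Single kernel that zeroes everything the five cudaMemsetAsync calls used to clear.
__global__ void ResetBuffers(float *d_pointCloudPositions,
                             uint8_t *d_pointCloudColours,
                             uint8_t *d_colourImage,
                             float *d_transformedPointCloud,
                             int *d_roiDepthCount,
                             int maxPoints)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    int n = maxPoints * 3; // 3 values per point, as in the memsets above

    if (idx < n) {
        d_pointCloudPositions[idx]   = 0.0f;
        d_pointCloudColours[idx]     = 0;
        d_colourImage[idx]           = 0;
        d_transformedPointCloud[idx] = 0.0f;
    }
    if (idx == 0) {
        *d_roiDepthCount = 0; // reset ROI counter
    }
}

// Launched once on s_process in place of the five cudaMemsetAsync calls:
// ResetBuffers<<<(maxPoints * 3 + 255) / 256, 256, 0, s_process>>>(...);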

To test this theory again, I removed all the memsets and kept only one [a single memset of sizeof(int) = 4 bytes] as shown in the code, which resulted in the picture memsetonelag.png.

Is this normal for a Jetson TX2? If so, I would have to stop using the cudaMemset API call altogether.

memsetone.png

lagtimes.png