The Zero Copy Shared memory mode consumes more CPU resources (jetson Xavier NX)

leo.li3 · December 26, 2024, 1:20pm

The Zero Copy Shared memory does save CPU/GPU data transfer time, but the programs that use that memory become more cpu intensive, and a lot of cpu intensive.

Is this normal? If not, how can I fix or optimize it.

void preprocess2gpu(const cv::Mat& srcImg, float* dstData, const int dstHeight, const int dstWidth, const cudaStream_t& preprocess_s)
{
    int srcHeight = srcImg.rows;
    int srcWidth = srcImg.cols;
    int srcElements = srcHeight * srcWidth * 3;
    int dstElements = dstHeight * dstWidth * 3;


    cudaStreamAttachMemAsync(preprocess_s, srcDevData, 0, cudaMemAttachHost);
    memcpy(srcDevData,srcImg.data,sizeof(uchar) * srcElements);
    cudaStreamAttachMemAsync(preprocess_s, srcDevData, 0, cudaMemAttachGlobal);

    // cudaMemcpy(srcDevData, srcImg.data, sizeof(uchar) * srcElements, cudaMemcpyHostToDevice);

    // calculate width and height after resize
    int w, h, x, y;
    float r_w = dstWidth / (srcWidth * 1.0);
    float r_h = dstHeight / (srcHeight * 1.0);
    if (r_h > r_w) {
        w = dstWidth;
        h = r_w * srcHeight;
        x = 0;
        y = (dstHeight - h) / 2;
    }
    else {
        w = r_h * srcWidth;
        h = dstHeight;
        x = (dstWidth - w) / 2;
        y = 0;
    }

    dim3 blockSize(32, 32);
    dim3 gridSize((dstWidth + blockSize.x - 1) / blockSize.x, (dstHeight + blockSize.y - 1) / blockSize.y);

    // letterbox and resize
    letterboxNorm<<<gridSize, blockSize, 0, preprocess_s>>>(srcDevData, srcHeight, srcWidth, midDevData, dstHeight, dstWidth, h, w, y, x);
    process<<<gridSize, blockSize>>>(midDevData, dstData, dstHeight, dstWidth);

    cudaStreamSynchronize(preprocess_s);
}

environment info:
TensorRT 8.5.2.2
cudnn 8.6.0.166
CUDA 11.4
ubuntu 20.04
jetson Xavier NX
ai mode: yolov8n-pose.engine

AastaLLL · December 27, 2024, 8:30am

Hi,

It looks like you are using unified memory instead of zero-copy memory.

Could you share more info about your use case?
Do you mean the whole preprocess2gpu function consumes more CPU resources?

Could you share the ratio of CPU/GPU with/without using unified memory?
Thanks.

leo.li3 · January 1, 2025, 2:49am

Yes， the preprocess2gpu function consumes more cpu resources.

with unified memory : use cpu 42.5%

without unified memory : use cpu 13.6%

AastaLLL · January 2, 2025, 7:00am

Hi,

Does this issue duplicate to the topic 318083, which relates to the CPU resources?

If yes, recommend using that topic so the information won’t scatter between to tickets.

Thanks.

leo.li3 · January 2, 2025, 7:08am

This is not the same question as the topic 318083. the topic 318083 does not use unified memory. You can focus on the topic 318083 first. Thanks.

AastaLLL · January 6, 2025, 7:08am

Hi,

Unified memory is accomplished by two copies of memory: one on the CPU and the other on the GPU.

Although the underlying synchronization is done via the GPU driver, it can induce some overhead (copy) depending on the memory access pattern.

Thanks.

system · January 29, 2025, 5:16am

This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.

Topic		Replies	Views
Zero-copy still copy data? Jetson AGX Xavier	7	3729	October 18, 2021
Why does it take longer for a program to use Unified Memory than not to use Uuified Memoery? Jetson AGX Xavier cuda	2	283	October 18, 2021
Zero-Copy and Managed memory on Jetson Jetson TX1	9	11681	August 20, 2018
Performance of zero-copy on jetson TX1 Jetson TX1	9	2141	October 18, 2021
AGX Xavier -> Unified Memory questions Jetson AGX Xavier cuda	2	1039	June 25, 2021
How to disable zero-copy on TX1? Jetson TX1	4	762	October 18, 2021
Unified Memory on Jetson Platforms Jetson Xavier NX cuda	4	4629	October 18, 2021
The memory sharing between cpu and gpu in Jetson TX2 Jetson TX2	6	7143	October 18, 2021
Unified Memory has poor performance on Jetson AGX Xavier Jetson AGX Xavier cuda	6	1158	February 9, 2022
The TensorRT inference API consumes more CPU resources（Jetson Xavier NX） Jetson Xavier NX tensorrt , cudnn	9	60	February 26, 2025

The Zero Copy Shared memory mode consumes more CPU resources (jetson Xavier NX)

Related topics