Zero-copy shared memory mode consumes more CPU resources (Jetson Xavier NX)

Zero-copy shared memory does save CPU/GPU data transfer time, but the program that uses this memory becomes much more CPU intensive.

Is this normal? If not, how can I fix or optimize it?

void preprocess2gpu(const cv::Mat& srcImg, float* dstData, const int dstHeight, const int dstWidth, const cudaStream_t& preprocess_s)
{
    int srcHeight = srcImg.rows;
    int srcWidth = srcImg.cols;
    int srcElements = srcHeight * srcWidth * 3;
    int dstElements = dstHeight * dstWidth * 3;


    // Attach the managed buffer to the host side, wait for the attach to take effect,
    // fill it from the CPU, then attach it back to the GPU for the kernels below.
    cudaStreamAttachMemAsync(preprocess_s, srcDevData, 0, cudaMemAttachHost);
    cudaStreamSynchronize(preprocess_s);
    memcpy(srcDevData, srcImg.data, sizeof(uchar) * srcElements);
    cudaStreamAttachMemAsync(preprocess_s, srcDevData, 0, cudaMemAttachGlobal);

    // cudaMemcpy(srcDevData, srcImg.data, sizeof(uchar) * srcElements, cudaMemcpyHostToDevice);

    // calculate width and height after resize
    int w, h, x, y;
    float r_w = dstWidth / (srcWidth * 1.0);
    float r_h = dstHeight / (srcHeight * 1.0);
    if (r_h > r_w) {
        w = dstWidth;
        h = r_w * srcHeight;
        x = 0;
        y = (dstHeight - h) / 2;
    }
    else {
        w = r_h * srcWidth;
        h = dstHeight;
        x = (dstWidth - w) / 2;
        y = 0;
    }

    dim3 blockSize(32, 32);
    dim3 gridSize((dstWidth + blockSize.x - 1) / blockSize.x, (dstHeight + blockSize.y - 1) / blockSize.y);

    // letterbox and resize
    letterboxNorm<<<gridSize, blockSize, 0, preprocess_s>>>(srcDevData, srcHeight, srcWidth, midDevData, dstHeight, dstWidth, h, w, y, x);
    process<<<gridSize, blockSize, 0, preprocess_s>>>(midDevData, dstData, dstHeight, dstWidth);

    cudaStreamSynchronize(preprocess_s);
}
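
Note: srcDevData and midDevData are not shown above; they are global buffers that this function assumes were allocated earlier as managed (unified) memory, roughly like the sketch below (the function name and buffer sizes are placeholders, not the real values).

// Sketch of the assumed one-time allocation of the shared buffers (not per frame);
// sizes are placeholders for illustration only.
#include <cuda_runtime.h>

unsigned char* srcDevData = nullptr;   // source image, written by the CPU
float*         midDevData = nullptr;   // intermediate letterboxed image, used on the GPU

void allocSharedBuffers(int maxSrcHeight, int maxSrcWidth, int dstHeight, int dstWidth)
{
    cudaMallocManaged(&srcDevData, (size_t)maxSrcHeight * maxSrcWidth * 3);
    cudaMallocManaged(&midDevData, (size_t)dstHeight * dstWidth * 3 * sizeof(float));
}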

Environment info:
TensorRT 8.5.2.2
cuDNN 8.6.0.166
CUDA 11.4
Ubuntu 20.04
Jetson Xavier NX
AI model: yolov8n-pose.engine

Hi,

It looks like you are using unified memory instead of zero-copy memory.
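
For reference, a rough sketch of how the two are typically allocated (the buffer size below is just a placeholder):

#include <cuda_runtime.h>

void allocationStyles()
{
    const size_t bytes = 1920 * 1080 * 3;   // placeholder size

    // Unified (managed) memory: one pointer, the driver keeps the CPU and GPU views coherent.
    unsigned char* managedBuf = nullptr;
    cudaMallocManaged(&managedBuf, bytes);

    // Zero-copy (pinned, mapped host memory): the CPU writes through the host pointer and
    // the GPU reads through the mapped device pointer, with no migration in between.
    unsigned char* hostBuf = nullptr;
    unsigned char* devView = nullptr;
    cudaHostAlloc((void**)&hostBuf, bytes, cudaHostAllocMapped);
    cudaHostGetDevicePointer((void**)&devView, hostBuf, 0);

    cudaFree(managedBuf);
    cudaFreeHost(hostBuf);
}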

Could you share more info about your use case?
Do you mean the whole preprocess2gpu function consumes more CPU resources?

Could you share the CPU usage with and without unified memory?
Thanks.

Yes, the preprocess2gpu function consumes more CPU resources.

With unified memory: 42.5% CPU usage

Without unified memory: 13.6% CPU usage

Hi,

Is this issue a duplicate of topic 318083, which also relates to CPU resources?

If so, we recommend using that topic so the information won't be scattered between two tickets.

Thanks.

This is not the same question as topic 318083; that topic does not use unified memory. You can focus on topic 318083 first. Thanks.

Hi,

Unified memory is implemented with two copies of the data: one on the CPU side and one on the GPU side.

Although the underlying synchronization is handled by the GPU driver, it can introduce some overhead (copies) depending on the memory access pattern.
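
If you want true zero-copy on Jetson, where the CPU and GPU share the same physical DRAM, one option to try is pinned, mapped host memory instead of managed memory. Below is a rough, untested sketch; srcHostData and srcZeroCopyDev are hypothetical buffers (allocated once at init), not names from your code.

#include <cuda_runtime.h>
#include <opencv2/core.hpp>
#include <cstring>

// Hypothetical zero-copy buffers, allocated once at init:
//   cudaHostAlloc((void**)&srcHostData, maxSrcBytes, cudaHostAllocMapped);
//   cudaHostGetDevicePointer((void**)&srcZeroCopyDev, srcHostData, 0);
unsigned char* srcHostData    = nullptr;
unsigned char* srcZeroCopyDev = nullptr;

void copyInZeroCopy(const cv::Mat& srcImg)
{
    size_t srcBytes = (size_t)srcImg.rows * srcImg.cols * 3;

    // The CPU writes straight into pinned, GPU-visible memory, so no
    // cudaStreamAttachMemAsync calls and no explicit host-to-device copy are needed.
    std::memcpy(srcHostData, srcImg.data, srcBytes);

    // The kernels then read through srcZeroCopyDev instead of srcDevData.
}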

Thanks.
