Why is the access speed of memory allocated by cudaMallocHost so slow?

Hi:
We’ve developed a product with video. There are 4 input signals connected to the TX2; every signal has the video format YUV422 8-bit packed UYVY. I use CUDA to perform the colorspace conversion. But the access speed of the memory allocated by cudaMallocHost is so slow that I can’t fetch the result in time; in the end, the FPS can’t reach 60.

So, I tested the performance of those memory with codes below.

// Benchmark CPU copy throughput on plain pageable host memory (new[]):
// run `times` memcpy passes over a cx*cy*4-byte buffer and print the
// average throughput in MiB/s.
//
// cx, cy  - frame dimensions in pixels (4 bytes per pixel assumed).
// times   - number of memcpy iterations to average over.
void stdm_test(size_t cx, size_t cy, int times)
{
    const size_t sz = cx*cy*4;  // 4 bytes per pixel (RGBA-sized frame)
    char* src = new char[sz];
    char* dst = new char[sz];

    // Pre-fault both buffers before timing. Freshly new[]-ed memory is not
    // resident yet: the first pass would otherwise pay page-fault costs,
    // and copying from untouched (zero-page backed) memory skews results.
    memset(src, 0x5a, sz);
    memset(dst, 0, sz);

    auto t1 = std::chrono::high_resolution_clock::now();
    for (int i = 0; i < times; ++i) {
        memcpy(dst, src, sz);
    }
    auto t2 = std::chrono::high_resolution_clock::now();
    auto ts = std::chrono::duration<double>(t2 - t1).count();
    printf("stdm takes:%f seconds, Avg speed:%8.3fM/s\n", ts, (double)sz*times/ts/1024/1024.);

    delete[] src;
    delete[] dst;
}

// Benchmark CPU copy throughput on pinned (page-locked, mapped) host
// memory from cudaHostAlloc: run `times` memcpy passes over a
// cx*cy*4-byte buffer and print the average throughput in MiB/s.
//
// cx, cy  - frame dimensions in pixels (4 bytes per pixel assumed).
// times   - number of memcpy iterations to average over.
void cuda_test(size_t cx, size_t cy, int times)
{
    char* src = nullptr;
    char* dst = nullptr;
    const size_t sz = cx*cy*4;  // 4 bytes per pixel (RGBA-sized frame)

    // Check the allocation results -- if cudaHostAlloc fails, memcpy below
    // would dereference null/garbage pointers. cudaSuccess is 0, so any
    // non-zero return code means failure.
    if (cudaHostAlloc((void**)&src, sz, cudaHostAllocMapped) != 0) {
        fprintf(stderr, "cuda_test: cudaHostAlloc(src, %zu) failed\n", sz);
        return;
    }
    if (cudaHostAlloc((void**)&dst, sz, cudaHostAllocMapped) != 0) {
        fprintf(stderr, "cuda_test: cudaHostAlloc(dst, %zu) failed\n", sz);
        cudaFreeHost(src);
        return;
    }

    auto t1 = std::chrono::high_resolution_clock::now();
    for (int i = 0; i < times; ++i) {
        memcpy(dst, src, sz);
    }
    auto t2 = std::chrono::high_resolution_clock::now();
    auto ts = std::chrono::duration<double>(t2 - t1).count();
    printf("cuda takes:%f seconds, Avg speed:%8.3fM/s\n", ts, (double)sz*times/ts/1024/1024.);

    cudaFreeHost(src);
    cudaFreeHost(dst);
}

On TX2, the result is:
stdm: 3783 M/s
cuda: 582 M/s

I did the same testing on my PC with a Quadro P620, the result is:
stdm: 4556 M/s
cuda: 4601 M/s

Why? Could you give me some suggestions to improve my program please? Please give me a minute to introduce my program.

I create a thread to poll the video devices (/dev/video0~N); when a frame is ready, the thread copies the frame data to a buffer allocated by cudaMallocHost, then starts the CUDA kernel to perform the colorspace conversion (to RGBA); finally, the thread broadcasts the result to its subscribers.

I created a sample subscriber for our customers to demonstrate the usage of our product. When the subscriber received a frame, I clone it and share the clone to the renderer thread. The procedure of clone is just a cudaMemcpy invocation, but this operation will take more than 16 milliseconds, it’s really bad.

Hi,

First, have you maximized the device performance?

$ sudo nvpmodel -m 0
$ sudo jetson_clocks

If the issue persists, would you mind sharing a complete sample with the cx, cy info with us?
Thanks.
Thanks.

Hi,

sudo jetson_clocks

This command succeeded; jetson_clocks --show printed some text ending with “MAXP_CORE_ARM”.

cx,cy is the resolution of my video, we used 3 formats in our application:

  • 1920x1080
  • 1280x720
  • 720x576
//
// (c) 2020 chiv
//
#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>
#include <memory.h>
#include <chrono>
#include <algorithm>

// Benchmark CPU copy throughput on plain pageable host memory (new[]):
// run `times` memcpy passes over a cx*cy*4-byte buffer and print the
// average throughput in MiB/s.
//
// cx, cy  - frame dimensions in pixels (4 bytes per pixel assumed).
// times   - number of memcpy iterations to average over.
void stdm_test(size_t cx, size_t cy, int times)
{
    const size_t sz = cx*cy*4;  // 4 bytes per pixel (RGBA-sized frame)
    char* src = new char[sz];
    char* dst = new char[sz];

    // Pre-fault both buffers before timing. Freshly new[]-ed memory is not
    // resident yet: the first pass would otherwise pay page-fault costs,
    // and copying from untouched (zero-page backed) memory skews results.
    memset(src, 0x5a, sz);
    memset(dst, 0, sz);

    auto t1 = std::chrono::high_resolution_clock::now();
    for (int i = 0; i < times; ++i) {
        memcpy(dst, src, sz);
    }
    auto t2 = std::chrono::high_resolution_clock::now();
    auto ts = std::chrono::duration<double>(t2 - t1).count();
    printf("stdm takes:%f seconds, Avg speed:%8.3fM/s\n", ts, (double)sz*times/ts/1024/1024.);

    delete[] src;
    delete[] dst;
}

// Benchmark CPU copy throughput on pinned (page-locked, mapped) host
// memory from cudaHostAlloc: run `times` memcpy passes over a
// cx*cy*4-byte buffer and print the average throughput in MiB/s.
//
// cx, cy  - frame dimensions in pixels (4 bytes per pixel assumed).
// times   - number of memcpy iterations to average over.
void cuda_test(size_t cx, size_t cy, int times)
{
    char* src = nullptr;
    char* dst = nullptr;
    const size_t sz = cx*cy*4;  // 4 bytes per pixel (RGBA-sized frame)

    // Check the allocation results -- if cudaHostAlloc fails, memcpy below
    // would dereference null/garbage pointers. cudaSuccess is 0, so any
    // non-zero return code means failure.
    if (cudaHostAlloc((void**)&src, sz, cudaHostAllocMapped) != 0) {
        fprintf(stderr, "cuda_test: cudaHostAlloc(src, %zu) failed\n", sz);
        return;
    }
    if (cudaHostAlloc((void**)&dst, sz, cudaHostAllocMapped) != 0) {
        fprintf(stderr, "cuda_test: cudaHostAlloc(dst, %zu) failed\n", sz);
        cudaFreeHost(src);
        return;
    }

    auto t1 = std::chrono::high_resolution_clock::now();
    for (int i = 0; i < times; ++i) {
        memcpy(dst, src, sz);
    }
    auto t2 = std::chrono::high_resolution_clock::now();
    auto ts = std::chrono::duration<double>(t2 - t1).count();
    printf("cuda takes:%f seconds, Avg speed:%8.3fM/s\n", ts, (double)sz*times/ts/1024/1024.);

    cudaFreeHost(src);
    cudaFreeHost(dst);
}

// Entry point. Usage: ./test [cx [cy [times]]]
// Defaults to a 1280x720 frame and 10 iterations, then runs the pinned-
// memory benchmark followed by the pageable-memory benchmark.
int main(int argc, char** argv)
{
    size_t cx = 1280, cy = 720;
    int times = 10;
    // Deliberate fallthrough: each extra argument fills in one more value.
    switch (argc) {
    case 4:
        times = atoi(argv[3]);
        // fall through
    case 3:
        cy = strtoul(argv[2], nullptr, 10);
        // fall through
    case 2:
        cx = strtoul(argv[1], nullptr, 10);
        // fall through
    default:break;
    }
    // Clamp to sane minimums; this also covers unparsable arguments, which
    // atoi/strtoul turn into 0. Use the explicit template argument instead
    // of max(cx, 1ul): the 1ul literal fails to compile on platforms where
    // size_t is not unsigned long (e.g. 64-bit Windows).
    cx = std::max<size_t>(cx, 1);
    cy = std::max<size_t>(cy, 1);
    times = std::max(times, 1);
    // %zu is the portable printf conversion for size_t (%lu is wrong on LLP64).
    printf("cx = %zu, cy = %zu, times = %d\n", cx, cy, times);

    cuda_test(cx, cy, times);
    stdm_test(cx, cy, times);

    return 0;
}

I saved these codes to a file named “test.cpp”, then compile it with

g++ -O2 -o test test.cpp -I/usr/local/cuda/include -L/usr/local/cuda/lib64 -lcudart

Run it

./test 1920 1080 10
./test 1280 720
./test 720 576
./test 720 576 100

.etc

Thanks, we are going to reproduce this in our environment.
Will update here if any progress.

Thanks. Before this program, I moved the result into a GL context with the CUDA API, so I did not need to access the result from the CPU. But our customers want to process the result themselves, so we have to provide them with a block of memory; that is why I use cudaMallocHost to allocate it.

1280x720 reaches 60 FPS. The maximized size is 1920x1080, and our L4T is 32.4.

Hi,

Thanks for your patience.
We found this issue is not a bug but comes from a limitation of TX2.

Please note that the TX2 doesn’t support I/O coherency, so pinned memory is uncached.
Without cache, the performance tends to be much slower than the pageable host memory.
You can find some information in our document: https://docs.nvidia.com/cuda/cuda-for-tegra-appnote/index.html

To improve this, we recommend using unified memory instead.

...
// NVIDIA's suggested variant: allocate with unified (managed) memory
// instead of pinned memory, then time the same host-side memcpy loop.
// On TX2 pinned memory is CPU-uncached (no I/O coherency), while managed
// memory is CPU-cached, hence the much better throughput reported below.
void cuda_test(size_t cx, size_t cy, int times)
{
    char* src;
    char* dst;
    const size_t sz = cx*cy*4;  // 4 bytes per pixel (RGBA-sized frame)
    // Managed allocations: one pointer valid on both CPU and GPU.
    // NOTE(review): return codes are not checked here.
    cudaMallocManaged((void**)&src, sz);
    cudaMallocManaged((void**)&dst, sz);

    auto t1 = std::chrono::high_resolution_clock::now();
    for (int i = 0; i < times; ++i) {
        memcpy(dst, src, sz);
        // Per-iteration synchronize -- presumably to keep the managed
        // buffers coherent between CPU and GPU on Tegra; note its cost is
        // included in the measured time. TODO confirm it is required when
        // no kernel has touched these buffers.
        cudaDeviceSynchronize();
    }
    auto t2 = std::chrono::high_resolution_clock::now();
    auto ts = std::chrono::duration<double>(t2 - t1).count();
    printf("cuda takes:%f seconds, Avg speed:%8.3fM/s\n", ts, (double)sz*times/ts/1024/1024.);

    // Managed memory is released with cudaFree, not cudaFreeHost.
    cudaFree(src);
    cudaFree(dst);
}
...

We can get a much better performance with unified memory on the TX2.

$ ./test 1920 1080 10
cx = 1920, cy = 1080, times = 10
cuda takes:0.028232 seconds, Avg speed:2801.879M/s
stdm takes:0.023944 seconds, Avg speed:3303.550M/s

Thanks.

1 Like

Thanks very much. I got it, and I tested the unified memory, it’s okay to go.