How to disable zero-copy on TX1?

jiakai · October 15, 2017, 8:45am

Hi,

I have a program that utilizes pinned memory allocated by cudaMallocHost to implement simultaneous H2D copy and kernel computation via cudaMemcpyAsync. However, some CPU code in it runs very slowly on TX1. The problem seems to be caused by zero-copy that disables memory cache on both CPU and GPU. This can be verified by the following program, which says reading from pinned memory achieves about 20% the bandwidth of reading a normal CPU buffer.

I think the best solution is to disable zero-copy on TX1; I want to use the same code for both TX1 and non-TX1 platforms.

Thanks for your help!

output on TX1:

pinned: memset: 8.567GiB/s
cpu buf: memset: 9.646GiB/s
cpu buf -> pinned: 3.639GiB/s
pinned -> cpu buf: 0.717GiB/s
cpu buf -> cpu buf: 3.519GiB/s

output on a normal linux PC:

pinned: memset: 6.724GiB/s
cpu buf: memset: 6.581GiB/s
cpu buf -> pinned: 8.910GiB/s
pinned -> cpu buf: 9.106GiB/s
cpu buf -> cpu buf: 8.907GiB/s

#include <cuda_runtime.h>

#include <vector>
#include <cstdint>
#include <cstdio>
#include <cstdlib>
#include <cstring>
#include <ctime>

class Timer {
    clock_t m_start;

    public:

        Timer():
            m_start(clock())
        {
        }

        double get_secs_reset() {
            clock_t s = m_start;
            m_start = clock();
            return (m_start - s) / double(CLOCKS_PER_SEC);
        }
};

#define CUDA_CHECK(expr) do { \
    cudaError_t _err = (expr); \
    if (_err != cudaSuccess) { \
        fprintf(stderr, "%s failed: %s\n", #expr,  cudaGetErrorString(_err)); \
    } \
} while(0)

int main() {
    constexpr size_t SIZE = 1024ull * 1024 * 1024 / 2;
    constexpr double SIZE_G = SIZE / double(1024 * 1024 * 1024);
    void *ptr;
    CUDA_CHECK(cudaMallocHost(&ptr, SIZE));
    std::vector<uint8_t> buf_cpu(SIZE), buf_cpu1(SIZE);
    auto ptr1 = buf_cpu.data();
    auto ptr2 = buf_cpu1.data();
    Timer timer;
    for (int i = 0; i < 3; ++ i) {
        memset(ptr, i, SIZE);
        printf("pinned: memset: %.3fGiB/s\n",
                SIZE_G / timer.get_secs_reset());
        memset(ptr1, i, SIZE);
        printf("cpu buf: memset: %.3fGiB/s\n",
                SIZE_G / timer.get_secs_reset());
        memcpy(ptr, ptr1, SIZE);
        printf("cpu buf -> pinned: %.3fGiB/s\n",
                SIZE_G / timer.get_secs_reset());
        memcpy(ptr1, ptr, SIZE);
        printf("pinned -> cpu buf: %.3fGiB/s\n",
                SIZE_G / timer.get_secs_reset());
        memcpy(ptr2, ptr1, SIZE);
        printf("cpu buf -> cpu buf: %.3fGiB/s\n",
                SIZE_G / timer.get_secs_reset());
    }
}

AastaLLL · October 16, 2017, 3:29am

Hi,

Zero-copy memory is a mechinism to share memory pointer between CPU and GPU.
If you don’t want to use zero-copy memory, just allocate a CPU memory and GPU memory separately, and do the copy by your self.

Another alternative is the unified memory:
[url]https://devtalk.nvidia.com/default/topic/1014483/jetson-tx2/zero-copy-access-cuda-pipeline/post/5170222/#5170222[/url]
Depends on the physical memory location, a zero-copy memory may yield to slow access.
Unified Memory, on the other hand, decouples memory and execution spaces so that all data accesses are fast.

Thanks.

jiakai · October 19, 2017, 3:19am

Hi AastaLLL,

Thanks for your reply!

I understand that allocating on CPU could avoid zero-copy; however for these two reasons I am looking for a better solution:

I want to use the same code on all platforms; I have not found a macro to detect compiling on TX1
I want to use pinned memory to overlap IO and computing. Although I have not tested this on TX1 and do not know whether it is feasible.

Thanks!

Best,
Kai

AastaLLL · October 20, 2017, 6:16am

Hi,

Not sure which CUDA code do you use. Could you share more information about this?

In TX1, cudaHostRegister() is not supported due to ARM architecture.
More information can be found in this topic:
[url]https://devtalk.nvidia.com/default/topic/998962/jetson-tx1/cuda-zero-copy-on-tx1/post/5112922/#5112922[/url]

Will your buffer be accessed via both CPU and GPU?
If yes, it’s recommended to use unified memory to simplify the implementation.

Thanks.

Topic		Replies	Views
CUDA memory performance Jetson TK1	3	1119	October 18, 2021
CPU operation is very slow on memory allocated by cudaMallocHost Jetson TX2	13	1708	October 18, 2021
Asynchronous memory transfer on Jetson TX1 Jetson TX1	10	1617	October 18, 2021
CUDA Zero Copy On TX1 Jetson TX1	20	6815	October 18, 2021
Zero-Copy and Managed memory on Jetson Jetson TX1	9	11447	August 20, 2018
Zero Copy vs. CudaMemcpy on Jetson TK1 Jetson TK1	4	1512	May 18, 2016
Cuda Memory Usage TX1 Jetson TX1	8	4524	December 16, 2015
Performance of zero-copy on jetson TX1 Jetson TX1	9	2139	October 18, 2021
Pinned memory slows CPU computation Jetson TK1	5	1408	January 8, 2016
Unified Memory On TX1 Jetson TX1	4	855	October 18, 2021

How to disable zero-copy on TX1?

Related topics