CUDA Zero Copy On TX1

From sample NVIDIA_CUDA-8.0_Samples/0_Simple/simpleZeroCopy/

There are two ways to access memory between CPU and GPU space.

When I tried to cudaHostRegister to map an existed memory alloced from CPU to GPU space.

I was failed. seemed this api is not supported on TX1 with CUDA8.0.

In my case I need to map some existed memory achieve from other place like mmap or malloc.

I’d like to do some accelerated computing on GPU. So cudaMallocHost cannot meet my requirements.

Do I have any other possible way to do zero copy from cpu to gpu space like cudaHostRegister does?

The Io latency is my main delay in my calculations.


Thanks for your question.
This is a sample for zero copy on tx1 with cpu pointer.

Please help us check if it can solve your problem.


__global__ void kernel(float* output, float* input)


int main() {

    int size = 10;

    // Set flag to enable zero copy access
    // Host Arrays
    float* h_in  = NULL;
    float* h_out = NULL;
    h_in  = (float *) malloc(size*sizeof(float));
    h_out = (float *) malloc(size*sizeof(float));

    // Device arrays
    float *d_out, *d_in;
    // Get device pointer from host memory. No allocation or memcpy
    cudaHostGetDevicePointer((void **)&d_in,  (void *) h_in , 0);
    cudaHostGetDevicePointer((void **)&d_out, (void *) h_out, 0);
    // Launch the GPU kernel
    kernel<<<1, 1>>>(d_out, d_in);
    // No need to copy d_out back
    // Continue processing on host using h_out
    return 0;

Hi, AastaLLL

Thanks for your quick response.

I test as you list on TX1.

ubuntu@tegra-ubuntu:~/NVIDIA_CUDA-8.0_Samples/0_Simple/simpleZeroCopy$ ./simpleZeroCopy --use_generic_memory
  Device 0: < NVIDIA Tegra X1 >, Compute SM 5.3 detected
> Using Generic System Paged Memory (malloc)
CUDA error at code=11(cudaErrorInvalidValue) "cudaHostGetDevicePointer((void **)&d_a, (void *)a, 0)"

Seemed getting a device pointer using a cpu addr directly without cudaHostRegister is not allowed.


I tested this sample on tx1 and it should work.
Could you try it again with following command?

nvcc -o test

Hi, AastaLLL

I think your test is ok.

But your kernel will not start.

because the cudaHostGetDevicePointer will not success.

You can check the return status of func cudaHostGetDevicePointer.

Return code is ‘11’ mean some error.

Try replacing

h_in  = (float *) malloc(size*sizeof(float));
h_out = (float *) malloc(size*sizeof(float));


cudaHostAlloc((void **)&h_in, size*sizeof(float), cudaHostAllocMapped);
cudaHostAlloc((void **)&h_out, size*sizeof(float), cudaHostAllocMapped);

Hi, Honey_Patouceul

cudaHostAlloc will assign a new address to replace the origin addr of h_in and h_out.

In such case, I’ll lost the address and its memory which already existed in h_in and h_out.

In most case the data is already in memory, I need to get a according address in GPU to avoid copy it.

Because the memory are shared in TX1, So I think it possible to just get a mapped address without copy.

My suggestion was not to replace cudaHostGetDevicePointer(), but malloc().


Sorry for keeping you waiting.
I just tested this sample and you are correct. cudaHostAlloc returns error 11 and kernel doesn’t execute.
We will check this issue and update to you.
Sorry for the inconvenience and misunderstanding.

But I think you can try Honey_Patouceul’s suggestion that replacing the malloc() with cudaHostAlloc().


For my case. I need to run the object detection with CUDA on GPU. The image I got is from camera in runtime.

mdata->buffers[cam_idx][i].start = mmap (NULL /* start anywhere */,
                PROT_READ | PROT_WRITE /* required */,
                MAP_SHARED /* recommended */,
                fd, buf.m.offset);

while (1)
 cudaMemoryCopy(d_addr, start)


Every frame is stored in start witch locate at memory can only be read write by cpu.

I need to push it to GPU space for cuda algorithm. My kernel only need 5ms to excute, but I need more the 15ms to copy it to gpu. this is unacceptable. In most case IO latency can be no less than 50ms.

Are there any other way to do so with zerocpy?

BTW, What I did in of sample zero copy as below.

a = malloc();
status = cudaHostGetDevicePointer(&d_a, a)

In this case status will be return an error state.

As Honey_Patouceul suggested, I tried another case:

b = malloc();
cudaHostAlloc(&a, b);
cudaHostGetDevicePointer(&d_a, a)

In this case. a will be a new address generated by cudaHostAlloc(). there will be nothing in address a.
a and b are definitely different memory area. All my data is stored in address b. If we are using cudaHostAlloc(), no need to get address d_a, using address a directly in kernel<>> is okay.

In your case, it should be only :

float *h_a;   // address of buffer from CPU side
float *d_a;   // address of buffer from GPU side
cudaHostAlloc((void **)&h_a, size*sizeof(float), cudaHostAllocMapped);  // Allocate buffer and get its CPU side address
cudaHostGetDevicePointer((void **)&d_a,  (void *)h_a, 0);   // Get GPU side address of buffer

...//Fill your buffer from CPU using address h_a

kernel<<<blocks, threads>>>(d_a);   // Execute kernel on GPU using address d_a

...//Read processed buffer from CPU with address h_a

It is important to pass flag cudaHostAllocMapped to cudaHostAlloc so that the memory will be allocated in pinned memory mapped into CUDA address space, accessible from CPU (with address in h_a) or GPU (with address in d_a as returned by cudaHostGetDevicePointer). Memory allocated by malloc is useless.

Hi, Honey_Patouceul

Many Thanks for your suggestion.

...//Fill your buffer from CPU using address h_a

It’s too expensive for me to copy every frame from cpu to h_a.

My algorithm only cost 5ms, but fetch data cost me 50ms. It’s terrible. I need zero copy!

I would suggest to look how you can set the address where your device stores the acquired frame in user process memory space, and set it to h_a (CPU side address of CUDA mapped buffer), so that once a frame is received, it is immediately available for GPU processing at address d_a. It mainly depends on your device driver, not on CUDA, thus probably out of this topic.


Sorry for keeping you waiting.

cudaHostRegister() is not supported on ARM platforms.
This is because the caching attribute of an existing allocation can’t be changed on the fly.

If required, please use cudaHostAlloc() with the flag cudaHostAllocMapped to allocate device-mapped host-accessible memory.

Hi, Honey_Patouceul

Thanks for your suggestion. I think thats the only way on TX1. I have not find a possible way or api from NVIDIA to do so.
Most libs are not opened to us.

The ideal pipeline:

  1. cudaHostAlloc(h_a,)
  2. Pass h_a to camera driver by ioctl.
  3. get d_a by cudaHostGetDevicePointer(), or using h_a directly on TX1.

for step 2, mostly the camera buffer used by DMA for data transfer, so h_a maybe need to be a physical continuous area.
Also need to modify some code to support this requirement.

It is difficult.

Hi, AastaLLL

Thanks for your suggestion.

I found there is a way to pass a dmabuffer fd to the v4l2. so could I get a device(cuda) address from an alloced NvBuffer?

Like this.

  1. NvBufferCreate(&fd, w,h,NvBufferLayout_Pitch,get_nvbuff_color_fmt(ctx->cam_pixfmt)))
  2. cudaAddrFromFD(fd, &d_a)

Hi Allen_Z,

Yes, this is a good way to avoid memory copy.

Flow likes this:
V4L2_buffer -> EGLImageKHR -> CUDA-Array
(dmabuf_fd) (cuGraphicsEGLRegisterImage) (pDevPtr)

Please find MMAPI sample for details:

Hi Folks

You have described a good way, with zero memcpy to move camera images directly into CUDA mem. Could you also describe similar API calls to get output from a CUDA kernel back into CPU space ? How can we get the CUDA output to read in CPU and displayed on screen ?


Hi dumbogeorge,

Please check topic-1017337 for the reply: