CUDA Zero Copy On TX1

From sample NVIDIA_CUDA-8.0_Samples/0_Simple/simpleZeroCopy/simpleZeroCopy.cu.

There are two ways to access memory shared between the CPU and GPU address spaces.

I tried to use cudaHostRegister to map existing memory, allocated on the CPU side, into the GPU address space.

It failed. It seems this API is not supported on the TX1 with CUDA 8.0.

In my case I need to map existing memory obtained elsewhere, e.g. from mmap or malloc.

I'd like to do some accelerated computing on the GPU, so cudaMallocHost cannot meet my requirements.

Is there any other way to do zero copy from CPU to GPU space, like cudaHostRegister does?

The I/O latency is the main delay in my computation.

Hi,

Thanks for your question.
Here is a sample for zero copy on the TX1 using a CPU pointer.

Please check whether it solves your problem.
Thanks

#include <stdio.h>
#include <cuda.h>


__global__ void kernel(float* output, float* input)
{

}


int main() {

    int size = 10;

    // Set flag to enable zero copy access
    cudaSetDeviceFlags(cudaDeviceMapHost);
 
    // Host Arrays
    float* h_in  = NULL;
    float* h_out = NULL;
    
    h_in  = (float *) malloc(size*sizeof(float));
    h_out = (float *) malloc(size*sizeof(float));

    // Device arrays
    float *d_out, *d_in;
    // Get device pointer from host memory. No allocation or memcpy
    cudaHostGetDevicePointer((void **)&d_in,  (void *) h_in , 0);
    cudaHostGetDevicePointer((void **)&d_out, (void *) h_out, 0);
 
    // Launch the GPU kernel
    kernel<<<1, 1>>>(d_out, d_in);
 
    // No need to copy d_out back
    // Continue processing on host using h_out
    return 0;
}

Hi, AastaLLL

Thanks for your quick response.

I tested what you listed on the TX1.

ubuntu@tegra-ubuntu:~/NVIDIA_CUDA-8.0_Samples/0_Simple/simpleZeroCopy$ ./simpleZeroCopy --use_generic_memory
  Device 0: < NVIDIA Tegra X1 >, Compute SM 5.3 detected
> Using Generic System Paged Memory (malloc)
CUDA error at simpleZeroCopy.cu:190 code=11(cudaErrorInvalidValue) "cudaHostGetDevicePointer((void **)&d_a, (void *)a, 0)"

It seems that getting a device pointer directly from a plain CPU address, without cudaHostRegister, is not allowed.

Hi,

I tested this sample on the TX1 and it should work.
Could you try it again with the following commands?

nvcc bug_998962.cu -o test
./test

Hi, AastaLLL

I think your test compiles and runs okay.

But your kernel will not actually start, because cudaHostGetDevicePointer does not succeed.

You can check the return status of cudaHostGetDevicePointer.

A return code of 11 means cudaErrorInvalidValue.
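As a minimal sketch of the check described above (assuming the standard CUDA runtime API; the CHECK_CUDA macro name is my own), every call's status can be verified like this:

```cuda
#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>

// Minimal error-checking helper: print the error string and abort on failure.
#define CHECK_CUDA(call)                                                  \
    do {                                                                  \
        cudaError_t err = (call);                                         \
        if (err != cudaSuccess) {                                         \
            fprintf(stderr, "%s:%d: %s returned %d (%s)\n",               \
                    __FILE__, __LINE__, #call, (int)err,                  \
                    cudaGetErrorString(err));                             \
            exit(1);                                                      \
        }                                                                 \
    } while (0)

int main() {
    CHECK_CUDA(cudaSetDeviceFlags(cudaDeviceMapHost));

    float *h_in = (float *)malloc(10 * sizeof(float));
    float *d_in = NULL;

    // On the TX1, with plain malloc'd memory, this call is expected to fail
    // with error 11 (cudaErrorInvalidValue), so the macro makes the problem
    // visible instead of silently launching a kernel on a bad pointer.
    CHECK_CUDA(cudaHostGetDevicePointer((void **)&d_in, (void *)h_in, 0));
    return 0;
}
```

With this check in place, the sample above aborts with a clear message at the failing cudaHostGetDevicePointer call instead of appearing to run.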

Try replacing

h_in  = (float *) malloc(size*sizeof(float));
h_out = (float *) malloc(size*sizeof(float));

with

cudaHostAlloc((void **)&h_in, size*sizeof(float), cudaHostAllocMapped);
cudaHostAlloc((void **)&h_out, size*sizeof(float), cudaHostAllocMapped);

Hi, Honey_Patouceul

cudaHostAlloc will assign a new address, replacing the original addresses in h_in and h_out.

In that case, I lose the addresses and the data that already exist behind h_in and h_out.

In most cases the data is already in memory, and I need a corresponding GPU address so I can avoid copying it.

Because memory is physically shared on the TX1, I think it should be possible to just get a mapped address without a copy.

My suggestion was not to replace cudaHostGetDevicePointer(), but malloc().

Hi,

Sorry for keeping you waiting.
I just tested this sample and you are correct: cudaHostGetDevicePointer returns error 11 when used with malloc'd memory, and the kernel doesn't execute.
We will check this issue and update you.
Sorry for the inconvenience and misunderstanding.

In the meantime, you can try Honey_Patouceul's suggestion of replacing malloc() with cudaHostAlloc().

Hi,

In my case, I need to run object detection with CUDA on the GPU. The images come from a camera at runtime:

mdata->buffers[cam_idx][i].start = mmap (NULL /* start anywhere */,
                buf.length,
                PROT_READ | PROT_WRITE /* required */,
                MAP_SHARED /* recommended */,
                fd, buf.m.offset);

while (1)
{
    cudaMemcpy(d_addr, start, size, cudaMemcpyHostToDevice);
    ...
    kernel<<<blocks, threads>>>(d_addr);
    ...
}

Every frame is stored at start, which is located in memory that can only be read and written by the CPU.

I need to push it into GPU space for the CUDA algorithm. My kernel needs only 5 ms to execute, but I need more than 15 ms to copy a frame to the GPU, which is unacceptable. In most cases the I/O latency is no less than 50 ms.

Is there any other way to do this with zero copy?

BTW, what I did in test.cu of the zero-copy sample is:

a = malloc(...);
status = cudaHostGetDevicePointer(&d_a, a, 0);
kernel<<<...>>>(d_a);

In this case status returns an error.

As Honey_Patouceul suggested, I tried another case:

b = malloc(...);
cudaHostAlloc(&a, ..., cudaHostAllocMapped);
cudaHostGetDevicePointer(&d_a, a, 0);
kernel<<<...>>>(d_a);

In this case a is a new address allocated by cudaHostAlloc(), so there is nothing at address a yet.
a and b are definitely different memory areas, and all my data is stored at address b. (When using cudaHostAlloc(), there is actually no need to get d_a; using address a directly in the kernel launch is okay.)

In your case, it should only be:

cudaSetDeviceFlags(cudaDeviceMapHost);
float *h_a;   // address of buffer from CPU side
float *d_a;   // address of buffer from GPU side
cudaHostAlloc((void **)&h_a, size*sizeof(float), cudaHostAllocMapped);  // Allocate buffer and get its CPU side address
cudaHostGetDevicePointer((void **)&d_a,  (void *)h_a, 0);   // Get GPU side address of buffer

...//Fill your buffer from CPU using address h_a

kernel<<<blocks, threads>>>(d_a);   // Execute kernel on GPU using address d_a

...//Read processed buffer from CPU with address h_a

It is important to pass the flag cudaHostAllocMapped to cudaHostAlloc so that the buffer is allocated in pinned memory mapped into the CUDA address space, accessible from the CPU (through address h_a) or the GPU (through address d_a as returned by cudaHostGetDevicePointer). Memory allocated by plain malloc cannot be mapped this way.

Hi, Honey_Patouceul

Many Thanks for your suggestion.

...//Fill your buffer from CPU using address h_a

It's too expensive for me to copy every frame from the CPU into h_a.

My algorithm costs only 5 ms, but fetching the data costs me 50 ms. That's terrible. I need zero copy!

I would suggest looking at how you can set the address where your device driver stores the acquired frame in user process memory space, and setting it to h_a (the CPU-side address of the CUDA-mapped buffer), so that once a frame is received it is immediately available for GPU processing at address d_a. This mainly depends on your device driver, not on CUDA, so it is probably out of scope for this topic.

Hi,

Sorry for keeping you waiting.

cudaHostRegister() is not supported on ARM platforms.
This is because the caching attribute of an existing allocation can’t be changed on the fly.

If required, please use cudaHostAlloc() with the flag cudaHostAllocMapped to allocate device-mapped host-accessible memory.

Hi, Honey_Patouceul

Thanks for your suggestion. I think that's the only way on the TX1; I have not found another possible way or API from NVIDIA to do this.
Most of the libraries are not open to us.

The ideal pipeline:

  1. Allocate a buffer with cudaHostAlloc(&h_a, ...).
  2. Pass h_a to the camera driver by ioctl.
  3. Get d_a via cudaHostGetDevicePointer(), or use h_a directly on the TX1.

For step 2, the camera buffer is usually filled by DMA, so h_a may need to be a physically contiguous area.
Some driver code would also need to be modified to support this requirement.

It is difficult.
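A rough sketch of that three-step pipeline, assuming a V4L2 camera driver that supports user pointers: here V4L2_MEMORY_USERPTR with VIDIOC_QBUF stands in for "pass h_a to the driver by ioctl", and whether the driver accepts user-space pointers for DMA is exactly the uncertain part of step 2.

```cuda
#include <string.h>
#include <sys/ioctl.h>
#include <linux/videodev2.h>
#include <cuda_runtime.h>

// Sketch only: many camera drivers do not support V4L2_MEMORY_USERPTR,
// and DMA may require physically contiguous memory, as noted above.
int enqueue_cuda_buffer(int cam_fd, void *h_a, size_t len, unsigned index)
{
    struct v4l2_buffer buf;
    memset(&buf, 0, sizeof(buf));
    buf.type      = V4L2_BUF_TYPE_VIDEO_CAPTURE;
    buf.memory    = V4L2_MEMORY_USERPTR;   // driver DMAs into our pointer
    buf.index     = index;
    buf.m.userptr = (unsigned long)h_a;    // CUDA-mapped host buffer
    buf.length    = len;
    return ioctl(cam_fd, VIDIOC_QBUF, &buf);  // step 2: hand h_a to the driver
}

int pipeline_sketch(int cam_fd, size_t frame_len)
{
    float *h_a, *d_a;
    cudaSetDeviceFlags(cudaDeviceMapHost);
    cudaHostAlloc((void **)&h_a, frame_len, cudaHostAllocMapped);  // step 1
    enqueue_cuda_buffer(cam_fd, h_a, frame_len, 0);                // step 2
    cudaHostGetDevicePointer((void **)&d_a, (void *)h_a, 0);       // step 3
    // After VIDIOC_DQBUF signals a filled frame, launch the kernel on d_a
    // with no intermediate cudaMemcpy.
    return 0;
}
```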

Hi, AastaLLL

Thanks for your suggestion.

I found there is a way to pass a DMA buffer fd to V4L2. So could I get a device (CUDA) address from an allocated NvBuffer?

Like this.

  1. NvBufferCreate(&fd, w,h,NvBufferLayout_Pitch,get_nvbuff_color_fmt(ctx->cam_pixfmt)))
  2. cudaAddrFromFD(fd, &d_a)

Hi Allen_Z,

Yes, this is a good way to avoid memory copy.

The flow is like this:
V4L2_buffer (dmabuf_fd) -> EGLImageKHR (cuGraphicsEGLRegisterImage) -> CUDA array (pDevPtr)

Please find MMAPI sample for details:
‘/home/ubuntu/tegra_multimedia_api/samples/backend/v4l2_backend_main.cpp’
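A rough sketch of that flow, based on the Tegra Multimedia API helper nvbuf_utils and the CUDA driver's EGL interop; exact signatures and error handling should be checked against the MMAPI sample above.

```cuda
#include <cuda.h>
#include <cudaEGL.h>
#include "nvbuf_utils.h"   // Tegra MMAPI helper (NvEGLImageFromFd)

// Map an NvBuffer/V4L2 dmabuf fd to a CUDA-accessible pointer without a copy.
CUdeviceptr map_dmabuf_to_cuda(EGLDisplay disp, int dmabuf_fd)
{
    // dmabuf_fd -> EGLImageKHR
    EGLImageKHR egl_image = NvEGLImageFromFd(disp, dmabuf_fd);

    // EGLImageKHR -> CUDA graphics resource
    CUgraphicsResource resource = NULL;
    cuGraphicsEGLRegisterImage(&resource, egl_image,
                               CU_GRAPHICS_MAP_RESOURCE_FLAGS_NONE);

    // CUDA graphics resource -> mapped EGL frame holding device pointers
    CUeglFrame egl_frame;
    cuGraphicsResourceGetMappedEglFrame(&egl_frame, resource, 0, 0);

    // For a pitch-linear NvBuffer, plane 0 holds the pixel data.
    return (CUdeviceptr)egl_frame.frame.pPitch[0];
}
```

When the buffer is no longer needed, the resource should be released with cuGraphicsUnregisterResource() and the image with NvDestroyEGLImage().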

Hi Folks

You have described a good way, with zero memcpy, to move camera images directly into CUDA memory. Could you also describe similar API calls to get the output of a CUDA kernel back into CPU space? How can we read the CUDA output on the CPU and display it on screen?

Thanks,

Hi dumbogeorge,

Please check topic-1017337 for the reply:
https://devtalk.nvidia.com/default/topic/1017337/jetson-tx1/reading-cudaeglframe-from-host-code-cuda-programming-on-tx1/post/5183737/#5183737

Thanks.