Sorry for keeping you waiting.
I just tested this sample and you are correct: cudaHostAlloc returns error 11 and the kernel doesn't execute.
We will check this issue and update you.
Sorry for the inconvenience and the misunderstanding.
But I think you can try Honey_Patouceul's suggestion of replacing malloc() with cudaHostAlloc().
Every frame is stored in a buffer located in memory that can only be read and written by the CPU.
I need to push it to GPU space for my CUDA algorithm. My kernel only needs 5 ms to execute, but I need more than 15 ms to copy the frame to the GPU, which is unacceptable. In most cases the total I/O latency ends up being no less than 50 ms.
BTW, what I did in test.cu of the zero-copy sample is as below:
a = malloc(size);
status = cudaHostGetDevicePointer(&d_a, a, 0);
kernel<<<blocks, threads>>>(d_a);
In this case, status comes back as an error.
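For reference, here is a minimal, self-contained sketch of that failing pattern with explicit error checking (the launch configuration and buffer size are placeholders): cudaHostGetDevicePointer() only accepts host pointers that refer to mapped pinned memory, so a plain malloc() pointer is rejected before the kernel ever runs.

#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

__global__ void kernel(float *data) { /* ... */ }

int main()
{
    const size_t size = 1024;
    float *a = (float *)malloc(size * sizeof(float)); // plain pageable host memory
    float *d_a = NULL;

    // Fails because 'a' was not allocated as mapped pinned memory
    cudaError_t status = cudaHostGetDevicePointer((void **)&d_a, (void *)a, 0);
    if (status != cudaSuccess) {
        printf("cudaHostGetDevicePointer failed: %s\n", cudaGetErrorString(status));
        free(a);
        return 1;
    }

    kernel<<<1, 1>>>(d_a); // never reached in the failing case
    cudaDeviceSynchronize();
    free(a);
    return 0;
}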
As Honey_Patouceul suggested, I tried another case:
b = malloc(size); // the frame data is here
cudaHostAlloc((void **)&a, size, cudaHostAllocMapped);
cudaHostGetDevicePointer((void **)&d_a, (void *)a, 0);
kernel<<<blocks, threads>>>(d_a);
In this case, a is a new address returned by cudaHostAlloc(), so there is nothing at address a.
a and b are definitely different memory areas, and all my data is stored at address b. Also, if we are using cudaHostAlloc(), there is no need to get address d_a; using address a directly in kernel<<< >>> is okay.
cudaSetDeviceFlags(cudaDeviceMapHost); // Must be called before any other CUDA call creates the context
float *h_a; // address of buffer from CPU side
float *d_a; // address of buffer from GPU side
cudaHostAlloc((void **)&h_a, size*sizeof(float), cudaHostAllocMapped); // Allocate mapped pinned buffer and get its CPU side address
cudaHostGetDevicePointer((void **)&d_a, (void *)h_a, 0); // Get GPU side address of the same buffer
... // Fill your buffer from CPU using address h_a
kernel<<<blocks, threads>>>(d_a); // Execute kernel on GPU using address d_a
cudaDeviceSynchronize(); // Wait for the kernel to finish before touching the buffer from CPU again
... // Read processed buffer from CPU with address h_a
It is important to pass the flag cudaHostAllocMapped to cudaHostAlloc so that the buffer is allocated in pinned memory mapped into the CUDA address space, accessible from the CPU (through address h_a) and from the GPU (through address d_a as returned by cudaHostGetDevicePointer). Memory allocated with plain malloc cannot be used this way.
I would suggest looking into how you can set the address where your device stores the acquired frame in user-process memory space, and setting it to h_a (the CPU-side address of the CUDA-mapped buffer), so that once a frame is received it is immediately available for GPU processing at address d_a. That mainly depends on your device driver, not on CUDA, so it is probably out of the scope of this topic.
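For illustration only (this assumes the camera is exposed as a standard V4L2 capture node that supports user-pointer I/O, which depends entirely on your driver; the device path, buffer count and error handling are placeholders), queueing the mapped buffer from the snippet above to the driver could look roughly like this:

// Needs <linux/videodev2.h>, <sys/ioctl.h>, <fcntl.h>, <string.h>
int fd = open("/dev/video0", O_RDWR);

struct v4l2_requestbuffers req;
memset(&req, 0, sizeof(req));
req.count = 1;
req.type = V4L2_BUF_TYPE_VIDEO_CAPTURE;
req.memory = V4L2_MEMORY_USERPTR;
ioctl(fd, VIDIOC_REQBUFS, &req);

struct v4l2_buffer buf;
memset(&buf, 0, sizeof(buf));
buf.type = V4L2_BUF_TYPE_VIDEO_CAPTURE;
buf.memory = V4L2_MEMORY_USERPTR;
buf.index = 0;
buf.m.userptr = (unsigned long)h_a; // the mapped pinned buffer from cudaHostAlloc above
buf.length = size * sizeof(float);
ioctl(fd, VIDIOC_QBUF, &buf);

// ... then VIDIOC_STREAMON and VIDIOC_DQBUF: the dequeued frame already sits in h_a
// and is visible to the kernel through d_a, with no extra memcpy.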
Thanks for your suggestion. I think that's the only way on TX1; I have not found a possible way or API from NVIDIA to do so.
Most libs are not open to us.
The ideal pipeline:
1. cudaHostAlloc(&h_a, ...)
2. Pass h_a to the camera driver through an ioctl.
3. Get d_a with cudaHostGetDevicePointer(), or use h_a directly on TX1 (see the capability check sketched below).
For step 2, the camera buffer is usually filled by DMA, so h_a may need to be a physically contiguous area; some driver code would also need to be modified to support this.
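Regarding step 3, a small run-time check (standard CUDA runtime calls, nothing TX1-specific; device index 0 is assumed) can confirm that the GPU supports mapped host memory and shares physical memory with the CPU before relying on zero-copy:

cudaDeviceProp prop;
cudaGetDeviceProperties(&prop, 0);
printf("canMapHostMemory = %d, integrated = %d\n", prop.canMapHostMemory, prop.integrated);
// Both should be 1 on TX1: host allocations can be mapped into the GPU address space,
// and the GPU shares the same physical DRAM as the CPU.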
You have described a good way, with zero memcpy, to move camera images directly into CUDA memory. Could you also describe the similar API calls for getting the output of a CUDA kernel back into CPU space? How can we get the CUDA output so it can be read by the CPU and displayed on screen?