For TX1, are the device and host memories the same and identical?

For TX1, are the device and host memories the same? Does it still require a copy from host to device and vice versa?

You’ll have to specify what you mean by “device” and “host”. The GPU is not attached via PCIe; it is wired directly to the memory controller, so it (and anything else with a direct memory-controller connection) uses the same physical memory as the kernel and user space. On a desktop system you’d expect the video card/GPU to have its own memory and require copies back and forth.

Thanks for your reply. So, for example, if an image is loaded into main memory using OpenCV, one would not need to copy it into GPU memory in order to apply an NPP/CUDA function to it, whereas on a desktop system, as you mentioned, you would have to. Right?

I can’t answer OpenCV/CUDA questions, but it may be useful to know that CUDA code uses pinned memory, which is not swapped out and may also be accessed directly via the memory controller or DMA. Data still needs to reach that memory, but no PCIe bus is involved, and there is no transfer between physically separate memory devices. Someone else would need to answer in more detail.
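As a sketch of the zero-copy path described above, the CUDA runtime lets you allocate mapped pinned memory and obtain a device-side alias for it; on Tegra parts like the TX1, both pointers refer to the same physical DRAM. This is only an illustration under that assumption (buffer name and size are made up), not TX1-verified code:

```cuda
#include <cuda_runtime.h>

int main() {
    // Ask the runtime to allow mapping of pinned host allocations.
    cudaSetDeviceFlags(cudaDeviceMapHost);

    // Mapped ("zero-copy") pinned allocation: not swappable, GPU-accessible.
    float* hostBuf = nullptr;
    cudaHostAlloc((void**)&hostBuf, 1024 * sizeof(float), cudaHostAllocMapped);

    // Device-side alias of the very same allocation.
    float* devAlias = nullptr;
    cudaHostGetDevicePointer((void**)&devAlias, hostBuf, 0);

    // Fill from the CPU; a kernel launched with devAlias reads this data
    // directly -- no cudaMemcpy and, on TX1, no separate device memory.
    for (int i = 0; i < 1024; ++i) hostBuf[i] = 1.0f;

    // ... launch a kernel that takes devAlias ...

    cudaFreeHost(hostBuf);
    return 0;
}
```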

Hi hd_ali,

Would you please provide more details about your use case?
That would help us point the suggestions in the right direction.


Hi guys

For example, the following snippet shows how a row filter would be applied to an image on a desktop system. Here my filter kernel, “hostKernel”, has to be copied into CUDA device memory before it is used in “nppiFilterRow_32f_C1R”.
My question is: on an embedded system like the TX1, do I still need to copy “hostKernel” to CUDA memory as on a desktop system, or can I pass “hostKernel” directly to “nppiFilterRow_32f_C1R”?

Npp32f hostKernel[3] = {1, 1, 1};
Npp32s kernelSize = 3;
Npp32s kernelAnchor = 1;

// Desktop-style path: allocate device memory and copy the kernel over.
Npp32f* deviceKernel;
NPP_CHECK_CUDA(cudaMalloc((void**)&deviceKernel, kernelSize * sizeof(Npp32f)));
NPP_CHECK_CUDA(cudaMemcpy(deviceKernel, hostKernel, kernelSize * sizeof(Npp32f), cudaMemcpyHostToDevice));

int pixelSize = 4;  // bytes per Npp32f pixel
NppiSize ROI2 = {380, 620};
int xROI = 4;
int yROI = 4;

// pitch() is in bytes, so do the offset arithmetic on a byte pointer.
Npp32f* pSrcOffset = (Npp32f*)((Npp8u*)oDeviceSrc->data() + yROI * oDeviceSrc->pitch() + xROI * pixelSize);

nppiFilterRow_32f_C1R(pSrcOffset, oDeviceSrc->pitch(),
                      oDeviceDst->data(), oDeviceDst->pitch(),
                      ROI2, deviceKernel, kernelSize, kernelAnchor);
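For comparison, here is how the same kernel setup might look with mapped pinned memory, so the cudaMemcpy step disappears. This is a hedged sketch, not tested on a TX1: it assumes the standard cudaHostAlloc/cudaHostGetDevicePointer API and that NPP accepts the mapped device pointer; whether this actually outperforms the cudaMalloc path on your board is something to measure.

```cuda
// Allocate the 3-tap filter kernel in mapped pinned memory instead of
// cudaMalloc'ing a separate device copy.
Npp32f* hostKernel = nullptr;
NPP_CHECK_CUDA(cudaHostAlloc((void**)&hostKernel,
                             kernelSize * sizeof(Npp32f),
                             cudaHostAllocMapped));
hostKernel[0] = hostKernel[1] = hostKernel[2] = 1.0f;

// Device-side alias of the same physical memory (no copy issued on Tegra).
Npp32f* deviceKernel = nullptr;
NPP_CHECK_CUDA(cudaHostGetDevicePointer((void**)&deviceKernel, hostKernel, 0));

// Pass the alias to NPP exactly as before; the host-to-device copy is gone.
nppiFilterRow_32f_C1R(pSrcOffset, oDeviceSrc->pitch(),
                      oDeviceDst->data(), oDeviceDst->pitch(),
                      ROI2, deviceKernel, kernelSize, kernelAnchor);

cudaFreeHost(hostKernel);
```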

Sorry, “linuxdev”, I only just saw your reply. Thank you very much. I have also posted a new reply below.

Hi hd_ali,

We support the CUDA 7.0 Toolkit on TX1, so it should be no problem to run a CUDA program with the same design on both the TX1 and your desktop system.

Have you run into any specific issue while running your code?

For specific CUDA programming issues, you could post to the CUDA Programming and Performance forum to get more assistance:


Hi kaycc,

Thank you very much for your reply and the performance discussion link.