Jetson Nano Device Local Memory Specifications

Hi, I wrote a few basic image processing programs from scratch in CUDA and was comparing the speed of the GPU’s shared memory vs. global memory. In cases where shared memory did show a (small) improvement on a desktop workstation, it was actually slower on the Jetson. Here is a link to one of the projects (a simple blur filter that stages both the image chunk and the filter in shared memory for each block): [url]https://github.com/SkookumAsFrig/cs344/tree/master/Problem%20Sets/Problem%20Set%202%20Working[/url]
So I was wondering: are there any differences between the shared memory speeds of a desktop Maxwell card and the Jetson Nano’s Tegra SoC? The shared memory is basically an L2 cache on the same die as the GPU, correct? If the speeds on the Nano are too slow, I won’t even bother trying to use it in the future, as it complicates the parallel program significantly.
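For reference, the per-block staging described above uses CUDA `__shared__` memory (which is on-chip per SM; CUDA’s “local memory” is actually per-thread spill storage that lives in DRAM). A minimal sketch of that pattern, with illustrative names (`TILE`, `FWIDTH`, `blurKernel` are not from the linked repo):

```cuda
#include <cuda_runtime.h>

#define TILE   16         // block is TILE x TILE threads
#define FWIDTH 5          // filter is FWIDTH x FWIDTH
#define RADIUS (FWIDTH / 2)

__global__ void blurKernel(const float* in, float* out,
                           const float* filter, int w, int h)
{
    // Image tile with halo, plus the filter, staged in on-chip shared memory
    __shared__ float tile[TILE + 2 * RADIUS][TILE + 2 * RADIUS];
    __shared__ float filt[FWIDTH][FWIDTH];

    // Cooperatively load the filter once per block
    if (threadIdx.x < FWIDTH && threadIdx.y < FWIDTH)
        filt[threadIdx.y][threadIdx.x] =
            filter[threadIdx.y * FWIDTH + threadIdx.x];

    // Load the tile plus halo (edge pixels are clamped to the image border)
    for (int dy = threadIdx.y; dy < TILE + 2 * RADIUS; dy += TILE)
        for (int dx = threadIdx.x; dx < TILE + 2 * RADIUS; dx += TILE) {
            int sx = (int)(blockIdx.x * TILE) + dx - RADIUS;
            int sy = (int)(blockIdx.y * TILE) + dy - RADIUS;
            sx = min(max(sx, 0), w - 1);
            sy = min(max(sy, 0), h - 1);
            tile[dy][dx] = in[sy * w + sx];
        }
    __syncthreads();   // all loads must finish before any thread convolves

    int gx = blockIdx.x * TILE + threadIdx.x;
    int gy = blockIdx.y * TILE + threadIdx.y;
    if (gx >= w || gy >= h) return;

    float sum = 0.f;
    for (int fy = 0; fy < FWIDTH; ++fy)
        for (int fx = 0; fx < FWIDTH; ++fx)
            sum += filt[fy][fx] * tile[threadIdx.y + fy][threadIdx.x + fx];
    out[gy * w + gx] = sum;
}
```

Whether this wins over plain global loads depends on how much reuse the filter radius gives and on the cache behavior of the particular GPU, which is exactly the question here.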

Hi,

Here is a tutorial for the memory usage on Jetson:
[url]https://docs.nvidia.com/cuda/cuda-for-tegra-appnote/index.html#memory-management[/url]

Thanks.

Great, thank you for the link. From that it seems there is no benefit to using shared device memory, as it and global/unified memory share the same (I assume) off-chip LPDDR4 storage.

Hi,

The benefit is that you can use zero-copy memory.
Since the CPU and GPU share the same physical memory, the synchronization and copy overhead is much lower.
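A minimal sketch of zero-copy usage (the kernel name `scale` and sizes are illustrative): the host allocates mapped, page-locked memory with `cudaHostAlloc`, and the kernel accesses the same physical pages through a device pointer, so no `cudaMemcpy` is needed.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__global__ void scale(float* data, int n, float k)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= k;
}

int main()
{
    const int n = 1 << 20;
    float* hPtr = nullptr;

    // Page-locked, mapped allocation: visible to both CPU and GPU
    cudaHostAlloc(&hPtr, n * sizeof(float), cudaHostAllocMapped);
    for (int i = 0; i < n; ++i) hPtr[i] = 1.0f;

    // Device-side alias of the same physical memory
    float* dPtr = nullptr;
    cudaHostGetDevicePointer(&dPtr, hPtr, 0);

    scale<<<(n + 255) / 256, 256>>>(dPtr, n, 2.0f);
    cudaDeviceSynchronize();   // wait before the CPU reads the results

    printf("first element: %f\n", hPtr[0]);
    cudaFreeHost(hPtr);
    return 0;
}
```

On a discrete GPU the mapped accesses travel over PCIe; on Tegra they hit the shared SoC DRAM directly, which is where the savings come from.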

Thanks.

How does that work? Do I just allocate host memory and have the CUDA kernels read from the host pointer directly? Or do I allocate device memory and have the host read from that? I’m pretty sure the latter gives a segfault.

Googling shows me “cudaHostAlloc”, which allocates page-locked memory on the host. How would this be different on the Nano, which shares CPU and GPU memory, than on, say, a machine with a discrete GPU?

Hi,

Please check this document again:

https://docs.nvidia.com/cuda/cuda-for-tegra-appnote/index.html#memory-management
----------------------------------------------------------------------
On Tegra®, because device memory, host memory, and unified memory are allocated on the same physical SoC DRAM, duplicate memory allocations and data transfers can be avoided.
----------------------------------------------------------------------
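To make the quoted point concrete, here is a hedged sketch using unified (managed) memory: a single allocation in SoC DRAM that both CPU and GPU access, with no explicit copies. The kernel name `addOne` and the sizes are illustrative only.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__global__ void addOne(int* v, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) v[i] += 1;
}

int main()
{
    const int n = 256;
    int* v = nullptr;
    cudaMallocManaged(&v, n * sizeof(int));   // single allocation in DRAM

    for (int i = 0; i < n; ++i) v[i] = i;     // CPU writes directly

    addOne<<<1, n>>>(v, n);                   // GPU touches the same pages
    cudaDeviceSynchronize();                  // required before CPU access

    printf("v[0]=%d v[%d]=%d\n", v[0], n - 1, v[n - 1]);
    cudaFree(v);
    return 0;
}
```

On a discrete GPU the driver would migrate these pages over PCIe behind the scenes; on Tegra there is nothing to migrate, which is the duplicate-transfer saving the app note describes.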

Thanks.