I am using a Jetson Nano for a project where I want to use CUDA’s parallel processing abilities to optimise the execution time of some functions. However, I run into the problem where the memory transfer time between GPU and CPU significantly outweighs the benefit of migrating those particular functions to the GPU. If these functions were only called once this wouldn’t be a problem at all, but because they are executed hundreds of times until a condition is met, the memory transfer delay plays a big role in the runtime of the overall algorithm.
I’ve asked in another forum whether this problem can be solved through my CUDA implementation (post). However, I was wondering whether a different hardware system could optimise the memory transfer time between GPU and CPU, something that either significantly reduces this problem or removes it altogether (like a hybrid or unified GPU/CPU). I’ve done some research into this, but unfortunately my “hardware domain” limitations kick in and I haven’t progressed very far…
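To make the structure concrete, here is a stripped-down sketch of the pattern I’m describing; the kernel, sizes, and convergence test are placeholders, not my actual code:

```cpp
#include <cuda_runtime.h>

__global__ void step_kernel(float *d_data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d_data[i] *= 0.99f;   // stands in for the real per-element work
}

// Stand-in for the real convergence test, which runs on the CPU.
static bool check_condition_on_host(const float *h_data, int n) {
    return h_data[0] < 1.0f;
}

void iterate_until_converged(float *h_data, int n) {
    float *d_data;
    cudaMalloc(&d_data, n * sizeof(float));

    bool done = false;
    while (!done) {
        // These two copies run on every iteration; over hundreds of
        // iterations they outweigh whatever the kernel itself saves.
        cudaMemcpy(d_data, h_data, n * sizeof(float), cudaMemcpyHostToDevice);
        step_kernel<<<(n + 255) / 256, 256>>>(d_data, n);
        cudaMemcpy(h_data, d_data, n * sizeof(float), cudaMemcpyDeviceToHost);

        done = check_condition_on_host(h_data, n);
    }
    cudaFree(d_data);
}

int main() {
    const int n = 1 << 16;
    float *h_data = new float[n];
    for (int i = 0; i < n; ++i) h_data[i] = 2.0f;
    iterate_until_converged(h_data, n);
    delete[] h_data;
    return 0;
}
```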
I would really appreciate anyone’s thoughts on this matter!
Thank you so much for this link - I hadn’t come across this before so this is really helpful.
If I could follow up with a couple of questions:
In the example they give, they talk about pinned memory. Just so I’m 100% clear, this is the initial allocation of page-locked memory on the host side (cudaMallocHost), which enables faster memory transfers between GPU and CPU later with cudaMemcpy? (I’ve put a rough sketch of what I mean below these questions.)
Do you happen to know if I would have any problems implementing dynamic parallelism if I go with this memory copying method? I’ve run into many problems in the past (typically compatibility-wise) once I implemented dynamic parallelism… Will things like atomicAdd still work?
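For question 1, this is roughly the pinned-memory pattern I have in mind; the buffer size and kernel are placeholders rather than my real code, with cudaMallocHost being the page-locked host allocation the later cudaMemcpy calls benefit from:

```cpp
#include <cuda_runtime.h>

__global__ void dummy_kernel(float *d, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] += 1.0f;
}

int main() {
    const int n = 1 << 20;
    float *h_pinned, *d_buf;

    cudaMallocHost(&h_pinned, n * sizeof(float));  // pinned (page-locked) host allocation
    cudaMalloc(&d_buf, n * sizeof(float));         // ordinary device allocation

    // Transfers to/from the pinned buffer avoid the extra staging copy
    // that pageable host memory would need.
    cudaMemcpy(d_buf, h_pinned, n * sizeof(float), cudaMemcpyHostToDevice);
    dummy_kernel<<<(n + 255) / 256, 256>>>(d_buf, n);
    cudaMemcpy(h_pinned, d_buf, n * sizeof(float), cudaMemcpyDeviceToHost);

    cudaFreeHost(h_pinned);
    cudaFree(d_buf);
    return 0;
}
```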
I can’t speak to this directly, but the “Dynamic Parallelism” section of the Programming Guide has this to say:
"Zero-copy system memory has identical coherence and consistency guarantees to global memory, and follows the semantics detailed above. A kernel may not allocate or free zero-copy memory, but may use pointers to zero-copy passed in from the host program. " CUDA C++ Programming Guide
So, given the Nano does not meet the I/O coherency requirement for caching (its Compute Capability is 5.3), you may bump into the negative performance behaviour outlined in the Tegra appnote above.
To get cached I/O, you might consider the Jetson/Tegra/Drive devices listed in the last column of the “GPUs supported” table here: CUDA - Wikipedia, for Compute Capability 7.2 and higher.
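If it helps, a quick way to check what a given board reports (standard runtime API calls, nothing specific to this thread):

```cpp
#include <cuda_runtime.h>
#include <cstdio>

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    printf("%s: compute capability %d.%d\n", prop.name, prop.major, prop.minor);
    if (prop.major > 7 || (prop.major == 7 && prop.minor >= 2))
        printf("Meets the 7.2+ threshold discussed above.\n");
    else
        printf("Below 7.2 (the Nano reports 5.3).\n");
    return 0;
}
```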
Ahh that should be fine then (for my DP implementation).
So, given the Nano does not meet the I/O coherency requirement for caching (its Compute Capability is 5.3), you may bump into the negative performance behaviour outlined in the Tegra appnote above.
Could you explain what negative performance means in this context? Additionally, what is the difference between the caching on those newer Jetson/Tegra/Drive devices and on the Nano?
This post is very close to the holiday period so no worries if it doesn’t get answered for a little while! Happy Holidays :)
Longer runtimes than if the I/O were cached. How much longer will be determined by your particular code, but going by what is outlined here: CUDA for Tegra — CUDA for Tegra 12.3 documentation, repeated access to uncached buffers is not a desirable situation.
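Not from the documentation verbatim, but as an illustration of the kind of access pattern it warns about: if a kernel needs a value from an uncached zero-copy buffer many times, it is cheaper to read it once into a register (or shared memory), iterate on that copy, and write back once.

```cpp
#include <cuda_runtime.h>
#include <cstdio>

__global__ void iterate_staged(const float *zc_in, float *zc_out, int n, int iters) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    float v = zc_in[i];               // one read of the uncached buffer
    for (int k = 0; k < iters; ++k)   // repeated work happens on the register copy
        v = 0.5f * (v + 2.0f / v);    // placeholder computation (Newton step for sqrt(2))
    zc_out[i] = v;                    // one write back
}

int main() {
    const int n = 256;
    float *h_in, *h_out, *d_in, *d_out;

    cudaSetDeviceFlags(cudaDeviceMapHost);
    cudaHostAlloc(&h_in,  n * sizeof(float), cudaHostAllocMapped);
    cudaHostAlloc(&h_out, n * sizeof(float), cudaHostAllocMapped);
    cudaHostGetDevicePointer(&d_in,  h_in,  0);
    cudaHostGetDevicePointer(&d_out, h_out, 0);

    for (int i = 0; i < n; ++i) h_in[i] = 1.0f + i;

    iterate_staged<<<1, 256>>>(d_in, d_out, n, 20);
    cudaDeviceSynchronize();

    printf("out[0] = %f\n", h_out[0]);
    cudaFreeHost(h_in);
    cudaFreeHost(h_out);
    return 0;
}
```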
Just that they don’t have the aforementioned lack of I/O coherency (caching), and so won’t run into the issue outlined above.