I’ve decided to assess the zero-copy and managed memory modes on the Jetson board with an optimized matrix multiplication CUDA program that uses shared memory.
During my experiments, I noticed that both modes perform significantly slower than when the buffers are allocated with the standard cudaMalloc. Since the CPU and GPU share the same physical memory, I found this result extremely counterintuitive and decided to investigate. That led me to a GTC 2014 talk by Amit Rao, in which he said that in zero-copy mode the GPU and CPU caches are disabled. He didn’t explain why, so I can only assume that there is no hardware to ensure cache coherence between the CPU and GPU. Am I correct?
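In case it helps clarify what I mean by the two allocation paths, here is a minimal sketch (not my exact code; error checking omitted, and the buffer size is hypothetical):

```cuda
#include <cuda_runtime.h>

void allocate_buffers(size_t bytes) {
    // Standard device allocation: the kernel's accesses can use the GPU caches.
    float *d_a;
    cudaMalloc(&d_a, bytes);

    // Zero-copy: pinned host memory mapped into the GPU's address space.
    // On Tegra both pointers refer to the same physical DRAM, yet the
    // accesses are reportedly uncached, which would explain the slowdown.
    float *h_b, *d_b;
    cudaHostAlloc(&h_b, bytes, cudaHostAllocMapped);
    cudaHostGetDevicePointer(&d_b, h_b, 0);
}
```

The matrix multiplication kernel itself is identical in both cases; only the allocation of the input/output buffers differs.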
Secondly, I couldn’t find anything explaining how managed memory works on Tegra devices. I noticed that, in managed memory mode, CPU accesses to the allocated buffers are just as fast as when CPU caching is enabled; however, the matrix multiplication kernel is still just as slow as in zero-copy mode. Is there any material that explains in detail what’s going on here?
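For reference, the managed memory path I tested looks roughly like this (a sketch under my assumptions, with a hypothetical kernel launch commented out):

```cuda
#include <cuda_runtime.h>

void managed_example(size_t n) {
    // Unified/managed allocation: one pointer usable from both CPU and GPU.
    float *buf;
    cudaMallocManaged(&buf, n * sizeof(float));

    // CPU-side initialization here runs at cached-memory speed,
    // which matches the fast host access I observed.
    for (size_t i = 0; i < n; ++i) buf[i] = 1.0f;

    // matmul_kernel<<<grid, block>>>(buf, ...);  // hypothetical launch

    // On Tegra, concurrent CPU/GPU access to managed memory is not
    // supported, so the CPU must synchronize before touching the buffer.
    cudaDeviceSynchronize();
}
```

So the CPU side behaves as if caching is on, while the GPU side performs like zero-copy, and I’d like to understand why.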