The CPU and iGPU share the SoC DRAM, so what should I do if I want to copy one array to another?
cudaMallocManaged((void **)&a, 10 * sizeof(float));
cudaMallocManaged((void **)&b, 10 * sizeof(float));
cudaMemcpy(a, b, 10 * sizeof(float), cudaMemcpyDeviceToDevice);
I mean, what should the flag be? Is cudaMemcpyDeviceToDevice the correct way, or should I do something else?
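For reference, here is a minimal sketch of such a copy between two managed buffers. It uses cudaMemcpyDefault, which lets the runtime infer the direction from the pointer values (this requires unified virtual addressing); that is one reasonable choice for managed pointers, not the only possible one. The helper name copy_managed and the verification loop are hypothetical, added only for illustration. Note that cudaMemcpy takes the destination first, so the snippet above copies b into a.

```cuda
#include <cuda_runtime.h>

// Hypothetical helper: copies n floats from managed buffer b into managed
// buffer a and returns true if every element arrived intact.
bool copy_managed(int n) {
    float *a = nullptr, *b = nullptr;
    if (cudaMallocManaged(&a, n * sizeof(float)) != cudaSuccess) return false;
    if (cudaMallocManaged(&b, n * sizeof(float)) != cudaSuccess) {
        cudaFree(a);
        return false;
    }

    // Managed memory is directly accessible from the host.
    for (int i = 0; i < n; ++i) {
        a[i] = 0.0f;
        b[i] = static_cast<float>(i);
    }

    // Destination first, source second. cudaMemcpyDefault asks the runtime
    // to infer the transfer direction from the pointers themselves.
    cudaError_t err = cudaMemcpy(a, b, n * sizeof(float), cudaMemcpyDefault);

    // Make sure the device is idle before the host reads managed memory again.
    if (err == cudaSuccess) err = cudaDeviceSynchronize();

    bool ok = (err == cudaSuccess);
    for (int i = 0; ok && i < n; ++i) ok = (a[i] == static_cast<float>(i));

    cudaFree(a);
    cudaFree(b);
    return ok;
}
```

Calling copy_managed(10) from main mirrors the 10-element arrays in the question.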
Thanks. I also wonder which processor executes the copy, the CPU or the GPU, when the flag is H2D, D2H, or D2D?
Is it CPU, CPU, and GPU respectively?
I ask because I found a thread here saying that memcpy is not slower, and is even faster, than a naive kernel.
The behavior of Jetson memory is slightly different.
Please check the document below for details:
Thanks. Actually, I think my question is not specific to Jetson but applies to all GPU devices.
So maybe you could tell me which processor executes the copy in most scenarios.
That is, when the flag is H2D, D2H, or D2D, who performs the copy, the CPU or the GPU?
The GPU will copy the memory (for all flags).
For the naive kernel, the memory is accessed with the user-implemented pattern.
The performance might decrease if the access pattern causes too many page misses.
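To make the comparison concrete, here is a rough sketch that times a single device-to-device copy done either with cudaMemcpy or with a simple copy kernel. It is not the benchmark from the thread mentioned above. The names naive_copy and time_copy, the 256x256 launch configuration, and the use of cudaMalloc buffers are all assumptions chosen for illustration. In this kernel, consecutive threads touch consecutive elements; a less regular pattern would be expected to do worse.

```cuda
#include <cuda_runtime.h>

// Hypothetical naive copy kernel: a grid-stride loop in which consecutive
// threads read and write consecutive elements.
__global__ void naive_copy(float *dst, const float *src, size_t n) {
    size_t i = blockIdx.x * static_cast<size_t>(blockDim.x) + threadIdx.x;
    size_t stride = static_cast<size_t>(gridDim.x) * blockDim.x;
    for (; i < n; i += stride) dst[i] = src[i];
}

// Hypothetical helper: times one copy of n floats in milliseconds, using the
// kernel when use_kernel is true and cudaMemcpy otherwise.
// Returns a negative value if any CUDA call fails.
float time_copy(size_t n, bool use_kernel) {
    float *src = nullptr, *dst = nullptr;
    if (cudaMalloc(&src, n * sizeof(float)) != cudaSuccess) return -1.0f;
    if (cudaMalloc(&dst, n * sizeof(float)) != cudaSuccess) {
        cudaFree(src);
        return -1.0f;
    }
    cudaMemset(src, 0, n * sizeof(float));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    if (use_kernel) {
        naive_copy<<<256, 256>>>(dst, src, n);
    } else {
        cudaMemcpy(dst, src, n * sizeof(float), cudaMemcpyDeviceToDevice);
    }
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);  // wait until the timed work has finished

    float ms = 0.0f;
    cudaError_t err = cudaEventElapsedTime(&ms, start, stop);
    if (err == cudaSuccess) err = cudaGetLastError();

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(src);
    cudaFree(dst);
    return (err == cudaSuccess) ? ms : -1.0f;
}
```

A single timed run is noisy, so in practice one would warm up first and average several runs of time_copy(n, true) and time_copy(n, false) before drawing conclusions.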