Has anyone tested CuPy on a Jetson Nano or Xavier NX? I mean, is CuPy really worthwhile? I ran some tests a while ago on my laptop and CuPy was not faster than NumPy.
I did experiments with CuPy trying to speed up some audio processing where a lot of FFT is involved.
It worked, and the speed increase was noticeable. You need to check all your code in order to avoid frequently shifting data between CPU and GPU.
In my specific case a lot of changes were necessary: not only NumPy but other math libraries like SciPy and librosa were involved, and you have to rewrite that code and replace it with NumPy-style functionality that CuPy can mirror (a rough sketch follows below).
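A minimal sketch of the pattern I mean: move the data to the GPU once, keep it there through the FFT chain, and copy only the result back (the random buffer here is just a stand-in for real audio data):

import numpy as np
import cupy as cp

# Stand-in for a decoded audio buffer (4 seconds at 48 kHz)
signal = np.random.randn(4 * 48000).astype(np.float32)

# Move the data to the GPU once...
gpu_signal = cp.asarray(signal)

# ...do all the FFT-heavy work there (cp.fft mirrors np.fft)...
spectrum = cp.fft.rfft(gpu_signal)
magnitude = cp.abs(spectrum)

# ...and copy only the final result back to the CPU.
result = cp.asnumpy(magnitude)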
Jetson Nano/TX1/TX2/Xavier already support CUDA mapped memory and CUDA managed memory, where no CPU/GPU copy is required (because the CPU and GPU share the same physical memory). For example, in jetson-inference I always use the cudaAllocMapped() wrapper, which allocates shared CPU/GPU memory, and as a result I never need to call cudaMemcpy().
However, it seems to be a limitation of CuPy that it doesn't support these types of memory, so it does the copy anyway.
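For illustration, here is roughly what that zero-copy pattern looks like from Python, assuming the jetson.utils bindings that ship with jetson-inference (cudaAllocMapped()/cudaToNumpy() as named in its docs; exact arguments may vary by version):

import jetson.utils

# Allocate CUDA mapped (zero-copy) memory -- on Jetson the same
# physical buffer is visible to both the CPU and the GPU.
img = jetson.utils.cudaAllocMapped(width=640, height=480, format='rgb32f')

# Map it as a NumPy array. This is a view of the shared buffer,
# not a copy, so CPU-side writes are immediately visible to the GPU.
array = jetson.utils.cudaToNumpy(img)
array[:] = 0.5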
That is performing an explicit memory copy between CPU and GPU. Typically a GPU has its own discrete memory because it is hooked up via PCIe, so it would require this memory copy from system RAM. However, on Jetson all of the memory is shared between CPU and GPU. So if the memory is allocated as 'CUDA mapped memory' (aka zero-copy) or 'CUDA managed' memory, you don't need to do the memory copies. But unfortunately I can't see where CuPy supports allocation of CUDA mapped or CUDA managed memory.
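For what it's worth, depending on your CuPy version there may be a way to opt in to managed memory: recent releases document a cupy.cuda.malloc_managed allocator. A hedged sketch, assuming your build exposes it:

import cupy as cp

# Route all CuPy allocations through cudaMallocManaged, so arrays live
# in memory addressable by both CPU and GPU on Jetson.
cp.cuda.set_allocator(cp.cuda.MemoryPool(cp.cuda.malloc_managed).malloc)

a = cp.zeros((4, 4), dtype=cp.float32)  # now backed by managed memory

Whether this actually avoids the copies in practice on Jetson would need measuring.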
Could this be an illustration of what you are talking about, but this time using CuPy?
# Copyright 2008-2021 Andreas Kloeckner
# Copyright 2021 NVIDIA Corporation

import pycuda.autoinit  # noqa
from pycuda.compiler import SourceModule
import cupy as cp

# Create a CuPy array (and a copy for comparison later)
cupy_a = cp.random.randn(4, 4).astype(cp.float32)
original = cupy_a.copy()

# Create a kernel
mod = SourceModule("""
__global__ void doublify(float *a)
{
    int idx = threadIdx.x + threadIdx.y*4;
    a[idx] *= 2;
}
""")

func = mod.get_function("doublify")

# Invoke PyCUDA kernel on a CuPy array
func(cupy_a, block=(4, 4, 1), grid=(1, 1), shared=0)

# Demonstrate that our CuPy array was modified in place by the PyCUDA kernel
print("original array:")
print(original)
print("doubled with kernel:")
print(cupy_a)
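As I understand it, this works because CuPy arrays expose the CUDA Array Interface, so PyCUDA can launch the kernel directly on the existing device buffer without any CPU/GPU copy. Note, though, that this only avoids copies between the two libraries; it doesn't change how the array's memory was allocated in the first place.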