CuPy crashes on Jetson Nano


I installed CuPy v8.4 on a Jetson Nano with the command:

pip install cupy

CuPy was built from source and installed successfully after a long run.
Then I tested it by trying to import CuPy or NumPy:

import numpy as np


import cupy as cp

but I got the following error message:

Illegal instruction (core dumped)

Can anyone help me fix this error?

You may need to post the whole error message so that someone who has run into this kind of problem can check it.

The whole message is:
Illegal instruction (core dumped)

Hi @linhdh, if you upgraded pip, it may have also upgraded numpy, and there is a bug in numpy v1.19.5: "Illegal instruction (core dumped) on import for numpy 1.19.5 on ARM64" (Issue #18131, numpy/numpy on GitHub).

As a workaround, export OPENBLAS_CORETYPE=ARMV8 first before running your Python script.
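The same workaround can presumably be applied from inside Python instead of the shell. The sketch below assumes that nothing in the process has imported NumPy yet, since OpenBLAS is expected to read the variable only when its library is first loaded; the small matrix product is just an illustrative smoke test and is not part of the original advice.

```python
import os

# Assumption: this runs before anything has imported NumPy in this process.
# Equivalent in intent to `export OPENBLAS_CORETYPE=ARMV8` in the shell.
os.environ["OPENBLAS_CORETYPE"] = "ARMV8"

import numpy as np  # noqa: E402  (import deliberately placed after the env setup)

# Small BLAS-backed operation as a smoke test that the import survived.
a = np.arange(6, dtype=np.float32).reshape(2, 3)
product = a @ a.T
print(product)
```

Exporting the variable in the shell (or in ~/.bashrc) remains the simpler option when you do not control the order of imports.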

@dusty_nv Thank you.
Your guide helped me fix this bug.


Has anyone tested CuPy on a Jetson Nano or Xavier NX? I mean, is CuPy really worthwhile? I ran some tests a while ago on my laptop and CuPy was not faster than NumPy.


I did experiments with CuPy, trying to speed up some audio processing that involves a lot of FFTs.
It has worked so far and the speed increase was noticeable. You need to check all of your code in order to avoid frequently shifting data between CPU and GPU.

In my specific case a lot of changes were necessary, because not only NumPy but also other math libraries such as SciPy and LibRosa were involved; you have to rewrite that code and try to replace it with NumPy functionality.
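As a rough illustration of "avoid frequently shifting data between CPU and GPU", the sketch below copies a signal to the device once, does the framing, windowing and FFT there, and copies only the small result back. The function name band_energy and its parameters are hypothetical, and the NumPy fallback exists only so the sketch also runs on a machine without CuPy.

```python
import numpy as np

try:
    import cupy as xp          # GPU path, if CuPy is installed
    to_host = xp.asnumpy       # device -> host copy
except ImportError:
    xp = np                    # CPU fallback so the sketch still runs
    to_host = np.asarray


def band_energy(signal, n_fft=1024, hop=512):
    """Return the mean power per FFT bin of a 1-D signal (as a host array)."""
    x = xp.asarray(signal, dtype=xp.float32)           # single host -> device copy
    n_frames = 1 + (x.size - n_fft) // hop
    # Build the frame index matrix on the device instead of looping on the host.
    idx = xp.arange(n_fft)[None, :] + hop * xp.arange(n_frames)[:, None]
    frames = x[idx] * xp.hanning(n_fft).astype(xp.float32)
    spectrum = xp.fft.rfft(frames, axis=1)              # stays on the device
    power = (xp.abs(spectrum) ** 2).mean(axis=0)
    return to_host(power)                               # single device -> host copy


# Hypothetical test signal: a sine that falls exactly on FFT bin 64.
t = np.arange(4096)
tone = np.sin(2 * np.pi * 64 * t / 1024).astype(np.float32)
print(band_energy(tone).argmax())
```

The point of the design is that every intermediate array lives wherever xp lives; calling to_host inside the per-frame work would reintroduce the transfers the post warns about.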

I see. So it is not really easy to use the GPU efficiently.

I had the same problems.

Shifting data between CPU and GPU costs time. That is the weak point. The new Nvidia Grace will bring some real improvements, I guess.

I am not sure Orin will bring a really significant improvement, but I hope so.

I won’t have time to modify my software to try to get the CuPy improvements. Too bad.

Have a nice day.


Jetson Nano/TX1/TX2/Xavier already support CUDA mapped memory and CUDA managed memory, where no CPU/GPU copy is required (because it shares the same physical memory). For example, in jetson-inference I am always using this cudaAllocMapped() wrapper that allocates shared CPU/GPU memory (and as a result I never need to do cudaMemcpy())

However it seems a limitation of CuPy that it doesn’t support these types of memory, so it does the copy anyways.
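For reference, CuPy's memory management documentation does describe an allocator backed by CUDA managed (unified) memory. The sketch below shows how it is switched on; the function name enable_managed_memory is hypothetical, the code has not been verified on a Jetson here, and cp.asnumpy() will presumably still perform a copy, so this alone may not remove every transfer.

```python
def enable_managed_memory():
    """Route subsequent CuPy allocations through CUDA managed memory.

    Returns the memory pool so the caller can inspect or free it.
    """
    # Imported lazily: CuPy needs a CUDA-capable device to do anything useful.
    import cupy as cp

    # MemoryPool takes an allocator; malloc_managed wraps cudaMallocManaged.
    pool = cp.cuda.MemoryPool(cp.cuda.malloc_managed)
    cp.cuda.set_allocator(pool.malloc)
    return pool
```

Calling enable_managed_memory() once at start-up, before creating any arrays, is the intended usage; arrays created earlier keep whatever memory they were allocated with.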

Hello Dustin,

what you say looks interesting !

For now, I do this:

import pycuda.driver as drv

img_r_gpu1= drv.mem_alloc(res_r1.size * res_r1.dtype.itemsize)
drv.memcpy_htod(img_r_gpu1, res_r1)

(here I call my PyCUDA function)

drv.memcpy_dtoh(res_r1, r_gpu)

So is this the wrong way to do it?


That is performing an explicit memory copy between CPU and GPU. Typically a GPU has its own discrete memory because it is hooked up via PCIe, so it would require this memory copy from system RAM. However, on Jetson all of the memory is shared between CPU and GPU. So if the memory is allocated as ‘CUDA mapped memory’ (aka zero-copy) or ‘CUDA managed’ memory, you don’t need to do the memory copies. But unfortunately I can’t see where CuPy supports allocation of this CUDA mapped memory or CUDA managed memory.
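On the PyCUDA side, the copy-free pattern described above might look like the sketch below, which uses PyCUDA's managed_empty() instead of mem_alloc() plus memcpy_htod()/memcpy_dtoh(). The function name doublify_managed and the kernel are illustrative only, and the code assumes a device and CUDA version with managed memory support; it has not been run on a Jetson here.

```python
def doublify_managed(n=16):
    """Double n floats in place on the GPU using CUDA managed memory."""
    # Imported lazily: these need a working CUDA device and toolkit.
    import numpy as np
    import pycuda.autoinit
    import pycuda.driver as drv
    from pycuda.compiler import SourceModule

    mod = SourceModule("""
    __global__ void doublify(float *a, int n)
    {
        int idx = blockIdx.x * blockDim.x + threadIdx.x;
        if (idx < n) a[idx] *= 2.0f;
    }
    """)
    func = mod.get_function("doublify")

    # One allocation visible to both CPU and GPU; no explicit memcpy calls.
    a = drv.managed_empty(shape=n, dtype=np.float32,
                          mem_flags=drv.mem_attach_flags.GLOBAL)
    a[:] = np.arange(n, dtype=np.float32)   # written from the CPU side

    # n must not exceed the maximum block size (1024 on current devices).
    func(a, np.int32(n), block=(n, 1, 1), grid=(1, 1))

    # The kernel launch is asynchronous: wait before reading on the CPU.
    pycuda.autoinit.context.synchronize()
    return np.array(a)                      # plain NumPy copy for the caller
```

The synchronize() call is the part that is easy to forget: without it the CPU may read the buffer before the kernel has finished.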


I will try to find some information. If I can avoid all this wasted time (copying from host to device and vice versa), that would be great.

Thanks Dustin, the man who never sleeps, like the Big Apple!


Hello Dustin,

Could this be an illustration of what you are talking about, but this time using CuPy?

# Copyright 2008-2021 Andreas Kloeckner
# Copyright 2021 NVIDIA Corporation

import pycuda.autoinit  # noqa
from pycuda.compiler import SourceModule

import cupy as cp

# Create a CuPy array (and a copy for comparison later)
cupy_a = cp.random.randn(4, 4).astype(cp.float32)
original = cupy_a.copy()

# Create a kernel that doubles each element of a 4x4 float array
mod = SourceModule("""
__global__ void doublify(float *a)
{
    int idx = threadIdx.x + threadIdx.y * 4;
    a[idx] *= 2;
}
""")
func = mod.get_function("doublify")

# Invoke PyCUDA kernel on a CuPy array
func(cupy_a, block=(4, 4, 1), grid=(1, 1), shared=0)

# Demonstrate that our CuPy array was modified in place by the PyCUDA kernel
print("original array:")
print(original)
print("doubled with kernel:")
print(cupy_a)