Hey Lucas,

Thanks, that solved the problem for me as well.

I ran the mxnet_numpy_performance_test on my Jetson Nano and got some interesting results:

```
NumPy : Dotted two 512x512 matrices in 0.05 s.
mxnet.numpy : Dotted two 512x512 matrices in 0.08 s.
mxnet.numpy on GPU : Dotted two 512x512 matrices in 0.01 s.
NumPy : Dotted two 1024x1024 matrices in 0.38 s.
mxnet.numpy : Dotted two 1024x1024 matrices in 0.65 s.
mxnet.numpy on GPU : Dotted two 1024x1024 matrices in 0.03 s.
NumPy : Dotted two 2048x2048 matrices in 2.96 s.
mxnet.numpy : Dotted two 2048x2048 matrices in 5.22 s.
mxnet.numpy on GPU : Dotted two 2048x2048 matrices in 0.10 s.
NumPy : Dotted two 4096x4096 matrices in 23.62 s.
mxnet.numpy : Dotted two 4096x4096 matrices in 41.78 s.
mxnet.numpy on GPU : Dotted two 4096x4096 matrices in 0.65 s.
```

Looking at jtop’s CPU utilization meters, it looks like mxnet.numpy uses only one CPU core while NumPy uses all four. Am I missing something here, or is there room for optimizing the compile/build parameters?
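
In case it helps narrow this down, here is a minimal sketch of how I would check the threading setup from Python. The helper names `time_dot` and `threading_env` are my own and are not part of the benchmark script. I am assuming that `OMP_NUM_THREADS`, `OPENBLAS_NUM_THREADS` and `MXNET_CPU_WORKER_NTHREADS` are the relevant variables on this build, and that `mx.runtime.Features()` is available in the installed MXNet version.

```python
import os
import time

import numpy as np


def time_dot(n, dot=np.dot, make=np.random.rand, sync=None):
    """Return the seconds taken to multiply two n x n matrices.

    `dot` and `make` default to NumPy; other backends can be passed in.
    `sync` is an optional callable that blocks until the backend is done,
    which matters for libraries that execute asynchronously.
    """
    a, b = make(n, n), make(n, n)
    start = time.perf_counter()
    dot(a, b)
    if sync is not None:
        sync()
    return time.perf_counter() - start


def threading_env():
    """Report thread-related environment variables (assumed to be relevant)."""
    names = (
        "OMP_NUM_THREADS",
        "OPENBLAS_NUM_THREADS",
        "MXNET_CPU_WORKER_NTHREADS",
    )
    return {name: os.environ.get(name, "<unset>") for name in names}


if __name__ == "__main__":
    print("Logical cores:", os.cpu_count())
    print("Thread settings:", threading_env())
    print("NumPy 512x512: %.3f s" % time_dot(512))

    try:
        import mxnet as mx
    except ImportError:
        mx = None

    if mx is not None:
        # Lists the compile-time features of this MXNet build, which should
        # show whether OpenMP and a multithreaded BLAS were enabled.
        print(mx.runtime.Features())
```

If the feature list shows OpenMP or the BLAS backend as disabled, that would point to the build parameters rather than a runtime setting, but I have not confirmed this on the Nano.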