PyTorch and NumPy only run on one core


I’ve installed pytorch and numpy following the instructions here

They seem to work fine, but they only use one CPU core at all times instead of the 4 available.

If I run something like this, for example, the process tops out at 100% CPU usage.

import torch
a = torch.rand(100, 1000, 1000)
b = torch.rand(100, 1000, 1000)

while True:
    c = torch.bmm(a, b)

The same goes for a numpy computation that would otherwise spread across all cores.

Tensorflow, however, uses all available resources.
Any idea why?

Do I have to install some special library like OpenBLAS or MKL for pytorch and numpy to use all available resources? Or is this a problem with the wheel that was distributed?
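One way to narrow this down before rebuilding anything is to ask PyTorch how many threads it plans to use and what it was built against. This is only a sketch: the helper name describe_torch_threading is made up for illustration, and it assumes a PyTorch version that exposes torch.get_num_threads() and torch.__config__.show(). The environment variables listed are the usual thread caps for OpenMP, OpenBLAS and MKL.

```python
import os


def describe_torch_threading():
    """Collect threading hints for PyTorch (hypothetical helper, not part of torch)."""
    info = {
        # Environment variables that commonly cap BLAS/OpenMP thread counts.
        "env": {
            name: os.environ.get(name)
            for name in ("OMP_NUM_THREADS", "OPENBLAS_NUM_THREADS", "MKL_NUM_THREADS")
        },
        "cpu_count": os.cpu_count(),
    }
    try:
        import torch
    except ImportError:
        info["torch"] = None  # PyTorch is not installed in this environment.
        return info

    info["torch"] = torch.__version__
    info["num_threads"] = torch.get_num_threads()
    # On versions that provide it, torch.__config__.show() returns the build
    # configuration as a string, including the BLAS/LAPACK backend if any.
    show = getattr(getattr(torch, "__config__", None), "show", None)
    info["build_config"] = show() if callable(show) else None
    return info


if __name__ == "__main__":
    for key, value in describe_torch_threading().items():
        print(f"{key}: {value}")
```

If num_threads already equals the core count but usage stays at 100%, that would point towards the build (for example a missing multithreaded BLAS) rather than a thread limit.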

Thanks in advance,

The reason is Python itself.

You can try using the multiprocessing module:

The threading module will not bypass the GIL.

Also, C extensions can release the GIL and use multiple cores.
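For pure-Python, CPU-bound code, the multiprocessing approach could look roughly like the sketch below. It uses concurrent.futures.ProcessPoolExecutor from the standard library (built on multiprocessing); the function names count_primes and count_primes_parallel are invented for this example and have nothing to do with torch or numpy.

```python
import math
import os
from concurrent.futures import ProcessPoolExecutor


def count_primes(limit):
    """Pure-Python, CPU-bound work: count the primes below `limit`."""
    count = 0
    for n in range(2, limit):
        # n is prime if no d in [2, sqrt(n)] divides it.
        if all(n % d for d in range(2, math.isqrt(n) + 1)):
            count += 1
    return count


def count_primes_parallel(limits, workers=None):
    """Run count_primes for each limit in a separate worker process.

    Each worker is its own interpreter with its own GIL, so the tasks
    can run on different cores at the same time.
    """
    with ProcessPoolExecutor(max_workers=workers or os.cpu_count()) as pool:
        return list(pool.map(count_primes, limits))
```

Calling count_primes_parallel([50_000] * 4) from inside an `if __name__ == "__main__":` block of a script should keep up to four cores busy, whereas running the same four calls in threads would stay close to one core.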

Thanks for your answer.

But torch and numpy are calling C extensions which are highly parallelized, and use multiple cores. I’m able to get 1400% CPU usage with the same code snippet on a 32 core machine (x86_64 machine, pytorch installed with standard pip).
So the problem is with the build, not with Python.

In this case it should be working. Please stand by while I install Nvidia’s wheel and try to replicate.

I can confirm your snippet only uses one core on my Xavier with Nvidia’s wheel (same as for Nano), and the same snippet uses all my cores on x86. I don’t know enough about the torch package specifically to say what’s wrong, or whether this is normal behavior when a GPU is present (my x86 machine has no CUDA or OpenCL set up).

It seems possible/likely that it is related to the BLAS backend or lack thereof (see this recent post).

If you could rebuild PyTorch once you have OpenBLAS or the desired multithreaded backend installed, and confirm whether it fixes your issue, that would be helpful for when I go to build the wheels for the PyTorch v1.4.0 release. What remains to be determined is whether this requires all users of the wheel to install OpenBLAS too.

Hi dusty_nv,

That’s what I thought, I just wanted to be sure before doing it.
I’ll try to get into it tomorrow. I’ve never installed OpenBLAS on an arm64 CPU before. Is there anything I should be aware of to build it successfully?

Could you also give the commands you used to build the pytorch wheel? That would be really helpful



So I installed OpenBLAS using the following commands.

# First install gfortran
sudo apt install gfortran
# DL and compile OpenBLAS
git clone
cd OpenBLAS
make FC=gfortran
sudo make install

Then I installed numpy in three different ways:
This one
This one
And this one
They are pretty similar ways of installing it, with small variations.
In all three cases, I only get 100% CPU usage and I only get “NOT AVAILABLE” fields back.
I can’t manage to make it use all cores…
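Assuming the “NOT AVAILABLE” fields above come from NumPy's build configuration report, a small helper like the following (the name numpy_blas_report is made up here) can capture that report as text so it is easy to compare between installs. It relies only on numpy.show_config(), which prints the libraries NumPy was built against.

```python
import contextlib
import io


def numpy_blas_report():
    """Return NumPy's build configuration as text, or None if NumPy is missing."""
    try:
        import numpy as np
    except ImportError:
        return None
    buffer = io.StringIO()
    # show_config() prints the BLAS/LAPACK information to stdout,
    # so capture stdout while it runs.
    with contextlib.redirect_stdout(buffer):
        np.show_config()
    return buffer.getvalue()


if __name__ == "__main__":
    print(numpy_blas_report())
```

A build that actually linked against OpenBLAS would be expected to mention it somewhere in this report; if every BLAS/LAPACK section is still empty, the build most likely did not find the library.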

This also still doesn’t work for pytorch.

import torch
A = torch.randn(2, 3, 1, 4, 4)
B = torch.randn(2, 3, 1, 4, 6)
X, LU = torch.solve(B, A)

which raises:

Traceback (most recent call last):
  File "", line 4, in <module>
    X, LU = torch.solve(B, A)
RuntimeError: solve: LAPACK library not found in compilation
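To retest this quickly after each rebuild, the failing call can be wrapped in a small probe. This is a sketch under a few assumptions: the helper name cpu_solve_available is invented, it uses torch.linalg.solve when the installed release has it and otherwise falls back to the torch.solve call shown above, and it treats any RuntimeError as "not available".

```python
def cpu_solve_available():
    """Return True/False depending on whether a CPU linear solve works.

    Returns None if PyTorch is not installed. Hypothetical helper for
    checking a build, not part of torch.
    """
    try:
        import torch
    except ImportError:
        return None
    a = torch.eye(3)       # trivially invertible 3x3 system
    b = torch.ones(3, 1)
    try:
        if hasattr(torch, "linalg") and hasattr(torch.linalg, "solve"):
            torch.linalg.solve(a, b)   # newer releases
        else:
            torch.solve(b, a)          # the call used in this thread
    except RuntimeError as err:
        # A build without LAPACK reports it through a RuntimeError.
        print("solve failed:", err)
        return False
    return True


if __name__ == "__main__":
    print("CPU solve available:", cpu_solve_available())
```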

You might want to try installing libopenblas-dev from the Ubuntu repo with ‘sudo apt-get install libopenblas-dev’.
Perhaps when you installed from source, it put it under /usr/local or somewhere where PyTorch didn’t automatically find it.

From reading the PyTorch forums, it should automatically detect the OpenBLAS installation during the build. Near the beginning of the run, you should see something along the lines of ‘OpenBLAS…detected’ when it is configuring the build. If it doesn’t find it then, it’s probably not worth proceeding with the build until it’s able to detect it.

I’ve heard PyTorch v1.4.0 should be released rather soon (sometime this month I believe), at which time I can also take a crack at it.

It is detecting it now, but the build fails. I’ve been investigating for a while but haven’t managed to make it work yet.

It has just been released. Could you please try it and share your scripts for creating the wheel? That would really help.


By the way, numpy is one of the most used libraries in Python, so we should be able to use it with a multithreaded backend.
Could you also help with that, please? Either with build instructions or with a wheel. I tried 6 different ways and none of them uses more than one core for computing.
This is pretty frustrating.


OK, thanks for the heads-up, I am building it now (with libopenblas-dev installed and USE_DISTRIBUTED=1). BTW you can find my build procedure here:

What happened when you tried building numpy after libopenblas-dev had been installed? Did it detect/use OpenBLAS?

It would also seem that not all numpy operations have multithreaded implementations, see here for more info.

Yes sorry I saw that and forgot it was there, my bad.

The pytorch build finally succeeded using bdist_wheel, as indicated in several threads. Thanks again for the help with that.
Now it can find LAPACK, use solve, and run on multiple cores when possible.

I haven’t tried rebuilding numpy from source yet; I’m probably going to do it now. I’ll let you know.
This uses 4 cores, so I’m pretty confident reinstalling from source will work.

LD_PRELOAD=/usr/lib/ python
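To confirm the core count from inside Python rather than by watching a process monitor, one option is to compare CPU time with wall-clock time. The sketch below uses only the standard library; cores_used and busy_loop are names made up for this example, and busy_loop is a single-threaded stand-in for the real workload.

```python
import time


def cores_used(work):
    """Estimate the average number of cores kept busy while `work()` runs.

    time.process_time() sums CPU time over all threads of this process,
    so CPU time divided by wall time approximates the cores in use.
    """
    wall_start = time.perf_counter()
    cpu_start = time.process_time()
    work()
    cpu = time.process_time() - cpu_start
    wall = time.perf_counter() - wall_start
    return cpu / wall if wall > 0 else 0.0


def busy_loop():
    # Single-threaded pure-Python stand-in for a workload such as torch.bmm.
    total = 0
    for i in range(2_000_000):
        total += i * i
    return total


if __name__ == "__main__":
    print(f"approx. cores used: {cores_used(busy_loop):.2f}")
```

Passing a function that runs the torch.bmm call from the first post instead of busy_loop should give a value near 1 on a single-threaded build and a noticeably higher value once a multithreaded backend is in use.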

Thanks a bunch !

OK thanks, the PyTorch v1.4.0 wheels are now posted here:

These include support for OpenBLAS; from what I can tell, it is working.