PyTorch and NumPy only run on one core

Hi,

I’ve installed pytorch and numpy following the instructions here https://devtalk.nvidia.com/default/topic/1049071/jetson-nano/pytorch-for-jetson-nano-version-1-3-0-now-available/1.

They seem to work fine, but they only ever use one CPU core instead of the 4 available.

For example, if I run something like this, the process tops out at 100% CPU usage (i.e. a single core).

import torch
a = torch.rand(100, 1000, 1000)
b = torch.rand(100, 1000, 1000)

while True:
    c = torch.bmm(a, b)
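As a sanity check (assuming the wheel exposes PyTorch's standard threading API), you can ask PyTorch how many threads it thinks it can use for intra-op parallelism:

```python
import torch

# On a healthy multi-core build this defaults to the number of cores;
# a build without a working threading backend reports 1.
print(torch.get_num_threads())

# The count can also be set explicitly (or via the OMP_NUM_THREADS
# environment variable, depending on how the wheel was built):
torch.set_num_threads(4)
print(torch.get_num_threads())
```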

Same goes for a numpy computation that would otherwise spread across all cores.

TensorFlow, however, uses all available resources.
Any idea why?

Do I have to install some special library like OpenBLAS or MKL for pytorch and numpy to use all available resources? Or is this a problem with the wheel which was distributed?

Thanks in advance,
Manu

The reason is Python itself.
https://wiki.python.org/moin/GlobalInterpreterLock

You can try using the multiprocessing module:
https://docs.python.org/3/library/multiprocessing.html

The threading module will not bypass the GIL.

Also, C extensions can release the GIL and use multiple cores.
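For instance, here is a rough sketch of that last point: NumPy's BLAS-backed matrix multiply is a C extension that releases the GIL while it runs, so plain Python threads can overlap those calls (assuming a multithread-capable BLAS is linked):

```python
import threading
import numpy as np

a = np.random.rand(500, 500)
results = [None, None]

def work(i):
    # The GIL is released inside the BLAS call, so both threads
    # can execute their multiplications in parallel.
    results[i] = a @ a

threads = [threading.Thread(target=work, args=(i,)) for i in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()

assert all(r is not None for r in results)
```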

Thanks for your answer.

But torch and numpy call C extensions which are highly parallelized and use multiple cores. I’m able to get 1400% CPU usage with the same code snippet on a 32-core x86_64 machine (PyTorch installed with standard pip).
So the problem is with the build, not with Python.

In this case it should be working. Please stand by while I install Nvidia’s wheel and try to replicate.

I can confirm your snippet only uses one core on my Xavier with Nvidia’s wheel (same as for Nano), while the same snippet uses all my cores on x86. I don’t know enough about the torch package specifically to say what’s wrong, or whether this is normal behavior where a GPU is present (my x86 machine has no CUDA or OpenCL set up).

It seems possible/likely that it is related to the BLAS backend or lack thereof (see this recent post).

If you could re-build PyTorch after installing OpenBLAS or the desired multithreaded backend, and confirm whether it fixes your issue, that would be helpful for when I go to build the wheels for the PyTorch v1.4.0 release. What’s TBD is whether this requires all users of the wheel to install OpenBLAS too.

Hi dusty_nv,

That’s what I thought, I just wanted to be sure before doing it.
I’ll try and get into it tomorrow. I’ve never installed OpenBLAS on an arm64 CPU before; anything I should be aware of to build it successfully?

Could you also give the commands you used to build the pytorch wheel? That would be really helpful

Thanks,
Manu

Hi,

So I installed OpenBLAS using the following commands.

# First install gfortran
sudo apt install gfortran
# Download and compile OpenBLAS
git clone https://github.com/xianyi/OpenBLAS.git
cd OpenBLAS
make FC=gfortran
sudo make install

And installed numpy in three different ways:
This one https://roman-kh.github.io/numpy-multicore/
This one https://hunseblog.wordpress.com/2014/09/15/installing-numpy-and-openblas/
And this one https://stackoverflow.com/questions/11443302/compiling-numpy-with-openblas-integration?answertab=votes#tab-top
They are pretty similar ways of installing it, with small variations.
In all three cases, I only get 100% CPU usage, and np.show_config() only returns “NOT AVAILABLE” fields.
I can’t manage to make it use all cores…
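For reference, this is the check I mean: it just prints the build-time BLAS/LAPACK configuration NumPy was compiled against.

```python
import numpy as np

# If every section reads "NOT AVAILABLE", NumPy fell back to its
# slow, single-threaded reference routines instead of OpenBLAS.
np.show_config()
```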

This also still doesn’t work for pytorch.

import torch
A = torch.randn(2, 3, 1, 4, 4)
B = torch.randn(2, 3, 1, 4, 6)
X, LU = torch.solve(B, A)

which raises:

Traceback (most recent call last):
  File "pt_solve.py", line 4, in <module>
    X, LU = torch.solve(B, A)
RuntimeError: solve: LAPACK library not found in compilation

You might want to try ‘sudo apt-get install libopenblas-dev’ (from the Ubuntu repo).
Perhaps when you installed from source, it put it under /usr/local or somewhere where PyTorch didn’t automatically find it.
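For example, you could check where the library actually landed (paths below are the usual defaults and may differ on your setup):

```shell
# A source build's "make install" defaults to /opt/OpenBLAS unless PREFIX
# was overridden; the Ubuntu libopenblas-dev package instead puts its
# libraries on the standard linker path.
ls /opt/OpenBLAS/lib 2>/dev/null || true

# Check whether the dynamic linker can resolve OpenBLAS at all
# (run "sudo ldconfig" after installing to a new prefix to refresh the cache):
ldconfig -p | grep -i openblas || echo "OpenBLAS not found in linker cache"
```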

From reading the PyTorch forums, it should automatically detect an OpenBLAS installation during setup.py. Near the beginning of running setup.py, you should see something along the lines of ‘OpenBLAS…detected’ when it is configuring the build. If it doesn’t find it then, it’s probably not worth proceeding with the build until it’s able to detect it.

I’ve heard PyTorch v1.4.0 should be released rather soon (sometime this month I believe), at which time I can also take a crack at it.

It is detecting it now, but the build fails. I’ve been investigating for a while but haven’t managed to make it work yet.

It has just been released (https://github.com/pytorch/pytorch/releases/tag/v1.4.0). Could you please try it, and share your scripts for creating the wheel? That would really help.

Thanks,
Manu

By the way, NumPy is one of the most used libraries in Python; we should be able to use it with a multithreaded backend.
Could you also help with that please? Either with build instructions or with a wheel. I tried 6 different ways and none of them uses more than one core for computing.
This is pretty frustrating.


OK, thanks for the heads-up, I am building it now (with libopenblas-dev installed and USE_DISTRIBUTED=1). BTW you can find my build procedure here: https://devtalk.nvidia.com/default/topic/1049071/jetson-nano/pytorch-for-jetson-nano-version-1-3-0-now-available/

What happened when you tried building numpy after libopenblas-dev had been installed? Did it detect/use OpenBLAS?

It would also seem that not all numpy operations have multithreaded implementations, see here for more info.
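As a rough illustration of that last point (exactly which operations hit BLAS can vary by NumPy version):

```python
import numpy as np

a = np.random.rand(1000, 1000)
b = np.random.rand(1000, 1000)

# Dispatches to BLAS (dgemm): runs on multiple cores when a threaded
# BLAS such as OpenBLAS or MKL is linked.
c = a @ b

# Plain elementwise ufunc: single-threaded in stock NumPy,
# regardless of which BLAS is linked.
d = a + b
```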

Yes sorry I saw that and forgot it was there, my bad.

Finally, the PyTorch build was successful using bdist_wheel, as indicated in several threads. Thanks again for the help with that.
Now it can find LAPACK, torch.solve works, and it runs on multiple cores when possible.

I didn’t try to rebuild numpy from source yet, I’m probably going to do it now. I’ll let you know.
This uses 4 cores, so I’m pretty confident reinstalling from source will work.

LD_PRELOAD=/usr/lib/libopenblas.so.0 python numpy_code.py
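For reference, the thread count can also be pinned from inside the script via the standard OpenBLAS/OpenMP environment variables, as long as they are set before numpy is imported (a sketch; the variable names assume an OpenBLAS/OpenMP build):

```python
import os

# OpenBLAS reads its thread count from the environment when the library
# loads, so these must be set before the numpy import below.
os.environ["OPENBLAS_NUM_THREADS"] = "4"
os.environ["OMP_NUM_THREADS"] = "4"

import numpy as np  # picks up the settings above

a = np.random.rand(500, 500)
c = a @ a  # should use up to 4 threads with a threaded BLAS
```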

Thanks a bunch !
Manu

OK thanks, the PyTorch v1.4.0 wheels are now posted here: https://devtalk.nvidia.com/default/topic/1049071/jetson-nano/pytorch-for-jetson-nano-version-1-4-0-now-available/

These include support for OpenBLAS; from what I can tell it is working.