Accelerating Python on GPUs with nvc++ and Cython

Originally published at: https://developer.nvidia.com/blog/accelerating-python-on-gpus-with-nvc-and-cython/

The C++ standard library contains a rich collection of containers, iterators, and algorithms that can be composed to produce elegant solutions to complex problems. Most importantly, they are fast, making C++ an attractive choice for writing highly performant code. NVIDIA recently introduced stdpar: a way to automatically accelerate the execution of C++ standard library algorithms…

Nice post – it’s great to see access to the GPU from Python/Cython.

I’m trying to use the jacobi_solver example in the post as a starting point for writing a GPU-accelerated function that performs convolution using the for_each algorithm from the C++ standard library.

I don’t want to reinvent the wheel and was wondering if basic convolution of a kernel with a 2D array (e.g., as two 1D convolutions) might already be implemented on the GPU in a way similar to the jacobi_solver example.

Thanks!

Thanks for your message!

What you’re proposing sounds very similar to the Jacobi example. You would need to write a functor (similar to avg) that encapsulates your kernel (see the sketch after the stencil below).

In fact, if I’m understanding correctly, couldn’t the Jacobi solver be thought of as repeatedly applying the following kernel?

0    1/4  0
1/4  0    1/4
0    1/4  0
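
For concreteness, here is a rough sketch of what such a functor could look like, following the same std::for_each-over-indices pattern as the post’s avg functor. The names (conv3x3, apply_kernel), the flat-index layout, and the boundary handling are illustrative assumptions, not code from the post:

```cpp
// Illustrative sketch: a functor that applies a 3x3 kernel centered on each
// interior point of a 2D grid, launched with std::for_each over flat indices.
// Compiled with nvc++ -stdpar, the parallel algorithm is offloaded to the GPU.
#include <algorithm>
#include <array>
#include <execution>
#include <numeric>
#include <vector>

struct conv3x3 {
    const float* in;   // input grid (row-major, ny rows x nx columns)
    float*       out;  // output grid
    int nx, ny;
    const float* k;    // 3x3 kernel, row-major

    void operator()(int idx) const {
        const int i = idx / nx, j = idx % nx;
        if (i == 0 || j == 0 || i == ny - 1 || j == nx - 1) {
            out[idx] = in[idx];  // leave the boundary unchanged
            return;
        }
        float acc = 0.0f;
        for (int di = -1; di <= 1; ++di)
            for (int dj = -1; dj <= 1; ++dj)
                acc += k[(di + 1) * 3 + (dj + 1)] * in[(i + di) * nx + (j + dj)];
        out[idx] = acc;
    }
};

void apply_kernel(const std::vector<float>& in, std::vector<float>& out,
                  int nx, int ny, const std::array<float, 9>& kernel) {
    std::vector<int> idx(nx * ny);
    std::iota(idx.begin(), idx.end(), 0);  // 0, 1, ..., nx*ny - 1
    std::for_each(std::execution::par_unseq, idx.begin(), idx.end(),
                  conv3x3{in.data(), out.data(), nx, ny, kernel.data()});
}
```

With the stencil above, the kernel array would be {0, 0.25, 0, 0.25, 0, 0.25, 0, 0.25, 0}, and the Jacobi solver amounts to applying it repeatedly (swapping in and out between iterations) until convergence. A separable kernel could instead be applied as two 1D passes with one functor per direction.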

@ashwint - Thanks for the reply – that clarifies it a lot.

In trying to write the code, I’ve noticed the NVIDIA HPC SDK doesn’t appear to be available for Windows yet. Any idea when it will be available on Windows?

Thanks!

Hi! Thanks again for your interest. We plan to have Windows support for the HPC SDK later this year.

Hi @ashwint, thanks a lot for this great post! I have a quick question regarding the figure showing the speedup over NumPy sort: why does serial CPU processing do better at smaller sample sizes? I understand that the GPU’s parallelism can’t be exploited fully at small sample sizes, but what creates the overhead?

Thanks, @boehmvanessa. Likely, it’s the cost of transferring data from the host (CPU) to the device (GPU) and back.
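
If you want to see where that overhead stops dominating, a quick timing loop over a few array sizes makes the crossover visible. This is a minimal sketch, assuming the Cython-wrapped GPU sort from the post is importable as cppsort.cppsort; the exact module and function names may differ in your build.

```python
# Rough timing sketch; cppsort is assumed to be the Cython extension built in
# the post. Adjust the import (and the call) to match your own wrapper.
import numpy as np
from timeit import timeit
from cppsort import cppsort  # hypothetical module/function names

for n in (10**3, 10**5, 10**7):
    x = np.random.rand(n)
    t_cpu = timeit(lambda: np.sort(x), number=10) / 10
    t_gpu = timeit(lambda: cppsort(x), number=10) / 10
    print(f"n={n:>9}  numpy: {t_cpu:.6f}s  gpu: {t_gpu:.6f}s  ratio: {t_cpu / t_gpu:.2f}x")
```

At small n the fixed per-call cost (host-device copies, kernel launch) dominates, so NumPy’s in-memory sort wins; only at larger sizes does the parallel sort amortize that overhead.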