Accelerating Python on GPUs with nvc++ and Cython

Originally published at: https://developer.nvidia.com/blog/accelerating-python-on-gpus-with-nvc-and-cython/

The C++ standard library contains a rich collection of containers, iterators, and algorithms that can be composed to produce elegant solutions to complex problems. Most importantly, they are fast, making C++ an attractive choice for writing highly performant code. NVIDIA recently introduced stdpar: a way to automatically accelerate the execution of C++ standard library algorithms…

Nice post – it’s great to see GPU access from Python/Cython.

I’m trying to use the jacobi_solver example in the post as a starting point for writing a GPU-accelerated convolution function using the for_each algorithm from the C++ standard library.

I don’t want to reinvent the wheel and was wondering if basic convolution of a kernel with a 2D array (e.g., using two 1D convolutions) might already be implemented on GPU in a way similar to the example given in the jacobi_solver.

Thanks!

Thanks for your message!

What you’re proposing sounds very similar to the Jacobi example. You would need to write a functor (similar to avg) that encapsulates your kernel.

In fact, if I’m understanding correctly, couldn’t the Jacobi solver be thought of as repeatedly applying the following kernel?

0    1/4  0
1/4  0    1/4
0    1/4  0

@ashwint - Thanks for the reply – that clarifies it a lot.

In trying to write the code, I’ve noticed NVIDIA HPC SDK doesn’t appear to be available for Windows yet. Any idea when it will reach Windows?

Thanks!

Hi! Thanks again for your interest. We plan to have Windows support for the HPC SDK later this year.

Hi @ashwint, thanks a lot for this great post! I have a quick question regarding the Figure showing the speedup over numpy sort: why does serial CPU processing do better at smaller sample sizes? I understand that the GPU’s parallel processing capacity can’t be fully exploited at small sample sizes - but what creates the overhead?

Thanks, @boehmvanessa. Likely, it’s the cost of transferring data from the host (CPU) to the device (GPU) and back.

Hi, thanks for the article!
I’m facing the following task: I have CUDA code in a .cu file that relies on the cuBLAS library, a wrapper.pyx file, and a setup.py based on this repository. Unfortunately, I don’t have any knowledge of setuptools or creating modules in general.
I want to create a Python module, but I don’t know how to link my CUDA code with the cuBLAS library in this case.
Could you give me a hint on how to build a module from the .pyx and .cu files that includes the cuBLAS functionality? A manual build is also fine.
Thanks for your help!

Hi - the setup.py file you linked to should largely work. I think you would need to add cuBLAS to the library_dirs, libraries, runtime_library_dirs, and include_dirs arguments of the Extension constructor on line 112.